A few days ago we published a blog post called OpenStack Swift as backend for Git – part 1, where we explained the potential advantages of using Swift as a backend for Git and gave some details about what happens in Git (server side) when a client pushes or fetches objects.
In this second blog post, we first give a quick introduction to Dulwich, the project we use to tackle our challenge, and then describe how we handle Swift as a backend to store repositories. We finish by pointing you to the resources you need to try Dulwich with Swift as its backend.
Quick overview of Dulwich
To tackle our challenge we decided to use the amazing Python project Dulwich. It is a Python library developed by Jelmer Vernooij that provides an interface to local and remote Git repositories. This library handles a lot of things, such as:
- Create, read, manage loose objects (blob, tree, commit, tag)
- Create, read, manage pack files
- Create, read, manage references files
- Manage staging area
- Manage a local copy
- Implement the Git smart protocol through git-upload-pack and git-receive-pack
- Implement the Git, HTTP, SSH listeners to start Dulwich as a Git server
- Implement some client-side commands like pull, fetch, clone, … and some other porcelain commands
The Dulwich library has all the base elements we need to tackle our challenge, especially because it offers a full Python implementation of git-upload-pack and git-receive-pack. The really interesting parts for us are its server capabilities, its Git smart protocol implementation and its repository interface.
How we handle Swift as a backend with Dulwich
We added a new repository implementation, SwiftRepo, alongside the traditional Repo (file system backend) and the MemoryRepo. The full implementation is located in the dulwich/swift.py module. Below are some explanations about the Dulwich implementation of the Swift backend interface:
The repository layout in a Swift account
The SwiftRepo implementation authenticates against Swift and manages repositories at the container level of an account. The repository’s container includes the following objects:
- info/refs
- objects/pack/[pack-sha-1.pack, pack-sha-1.idx, pack-sha-1.info]*
These are the minimal requirements to have a working repository. The container also includes other objects such as description, config and info/exclude, which can be ignored for now.
- The info/refs object stores the reference names and the sha-1 of the object each reference points to
- The pack objects store the Git objects
The info/refs object
Instead of using the standard way of storing references, with one file per reference, we preferred to store all references in a single file. The standard way would produce a long list of Swift objects as the number of branches and tags grows, and discovering all the references would then require one Swift GET request per reference. This is why our Swift backend uses info/refs: it requires just one GET request to load all the references.
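To make the single-object idea concrete, here is a minimal sketch of how all references could be serialized into one payload and read back. It assumes one "<sha-1><TAB><ref name>" line per reference, and the helper names serialize_refs and parse_refs are ours for illustration; they are not taken from dulwich/swift.py.

```python
# Sketch only: assumes one "<sha-1>\t<ref name>\n" line per reference.
# serialize_refs and parse_refs are hypothetical helper names.


def serialize_refs(refs):
    """Render a {ref name: sha-1} mapping as a single info/refs payload."""
    lines = [f"{sha}\t{name}\n" for name, sha in sorted(refs.items())]
    return "".join(lines).encode("ascii")


def parse_refs(payload):
    """Parse an info/refs payload back into a {ref name: sha-1} mapping."""
    refs = {}
    for line in payload.decode("ascii").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        sha, name = line.split("\t", 1)
        refs[name] = sha
    return refs


refs = {
    "refs/heads/master": "a" * 40,
    "refs/tags/v1.0": "b" * 40,
}
payload = serialize_refs(refs)  # what a single PUT would store
loaded = parse_refs(payload)    # what a single GET gives back
```

However many branches and tags exist, the whole mapping travels in one object, so discovery costs one request instead of one per reference.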
The pack files
The C Git implementation of git-receive-pack can explode a pack file received from a client into a bunch of loose Git objects (tree, blob, commit, tag). Dulwich instead keeps the pack format to store the objects. We kept the Dulwich way of storing objects in order to reduce the number of Swift objects we need to store in a single container. In our experience, Swift does not deal efficiently with containers that hold a huge number of objects.
A pack file can contain a huge number of objects. The advantage of the pack format over storing each loose object in its own file is that an object can be stored as a delta against a base object. Storing deltas instead of full object contents can significantly reduce the size of a repository. For a better understanding of what a pack file is, have a look at the pack format documentation.
On a file system, retrieving an object from a pack file requires seeking into it (to a known offset) and loading a known number of bytes. The pack index “.idx” contains the offsets of all the objects included in the corresponding .pack file. In our Swift backend implementation for Dulwich, we use the Range header of the GET request to read only the parts of a pack that are required to retrieve the objects by their sha-1.
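The mapping from "seek to an offset and read some bytes" to an HTTP Range header can be sketched as follows. This is a simplified illustration, not the code from dulwich/swift.py: the function names are hypothetical, the store is faked with an in-memory byte string, and the toy index records a length next to each offset, whereas a real .idx file only records offsets.

```python
# Sketch only: fake_ranged_get stands in for a real Swift GET request.


def range_header(offset, length):
    """Build an HTTP Range header for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def fake_ranged_get(stored, headers):
    """Return only the requested byte range of `stored`."""
    spec = headers["Range"].split("=", 1)[1]
    first, last = (int(n) for n in spec.split("-"))
    return stored[first:last + 1]


pack = b"PACK" + bytes(range(32))
# Hypothetical index: object id -> (offset, length) inside the pack.
index = {"obj-1": (4, 8), "obj-2": (12, 16)}

offset, length = index["obj-2"]
chunk = fake_ranged_get(pack, range_header(offset, length))
```

Only 16 of the 36 stored bytes are transferred here; on a large pack the saving is what makes per-object retrieval over HTTP practical.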
To improve performance and reduce latency when seeking over stored packs, or when creating or verifying a pack, adding concurrency to object retrieval was the obvious option. Dulwich itself does not rely on any sort of concurrency when walking over a pack, as local disk I/O is generally pretty fast.
Our Swift backend implementation is able to perform HTTP requests to Swift concurrently. This is particularly efficient when, for instance, we need to build a custom pack for a client: once we know the sha-1 of every object that must go into the pack, we can perform the requests to Swift concurrently.
In addition, we use a controlled pool of HTTP connections that can be reused, thanks to the geventhttpclient library and the minimal Swift client integrated in dulwich/swift.py.
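The pattern of fetching a known list of objects with bounded concurrency can be sketched as below. Note that this sketch swaps in a thread pool from the Python standard library so that it stays self-contained; it does not use gevent or geventhttpclient, and FAKE_STORE, fetch_object and fetch_all are hypothetical names standing in for real requests to Swift.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the object store: sha-1 -> object content.
FAKE_STORE = {f"{i:040x}": f"object {i}".encode() for i in range(20)}


def fetch_object(sha):
    """Pretend to issue one GET request and return the object content."""
    return FAKE_STORE[sha]


def fetch_all(shas, concurrency=10):
    """Fetch every object, running at most `concurrency` requests at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() yields results in input order even though calls overlap.
        return dict(zip(shas, pool.map(fetch_object, shas)))


wanted = sorted(FAKE_STORE)[:5]
objects = fetch_all(wanted, concurrency=3)
```

The concurrency argument plays the role of the configurable limit: it caps how many requests are in flight so that a large pack does not open an unbounded number of connections.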
The pack.info objects
In a traditional Git repository a pack is always accompanied by an index file. We decided to add a third file. This .info object is like an index that carries more information than the .idx object. For instance, it lists the parent commits of each commit in the pack. With this content we can quickly build the chain of parent commits for a given reference simply by reading this file.
The .info file is created automatically when a pack is pushed by a client and stored in the Swift backend. Without it we would need to walk over all the commit objects one by one (synchronously, with GET requests against the pack file) in order to build the parent commit chain from a given reference. This can be slow for projects with a large number of commits. The .info object also contains other useful information that speeds up object discovery.
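Once the parent listing is available in memory, building the chain needs no further requests. Here is a minimal sketch of that walk, assuming the .info content has already been decoded into a plain commit-to-parents mapping; the PARENTS data and the ancestry function are made up for illustration and do not reflect the actual on-disk format of the .info object.

```python
from collections import deque

# Hypothetical content decoded from a .info object: commit -> parents.
PARENTS = {
    "c4": ["c3", "c2"],  # merge commit
    "c3": ["c1"],
    "c2": ["c1"],
    "c1": [],            # root commit
}


def ancestry(tip, parents):
    """Return every commit reachable from `tip`, breadth first, tip included."""
    seen, order, queue = {tip}, [], deque([tip])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents.get(commit, []):
            if parent not in seen:  # visit shared ancestors only once
                seen.add(parent)
                queue.append(parent)
    return order


chain = ancestry("c4", PARENTS)
```

The whole traversal is dictionary lookups, which is the point of the extra object: the cost is one read of the .info file rather than one request per commit.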
The Swift repo implementation needs a configuration file. It lets you specify the user credentials used to perform requests against Swift (tenant, user and password) together with the authentication method (v1 or v2). You can also configure the concurrency limit and the size of the HTTP connection pool. Please have a look at the configuration template.
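To give an idea of what reading such a file involves, here is a small sketch using Python’s configparser. The section name and every key name below are placeholders that we made up for this example; the real names are the ones in the configuration template, which remains the reference.

```python
import configparser

# Hypothetical file content: section and key names are placeholders,
# not the ones used by the actual configuration template.
EXAMPLE = """
[swift]
auth_version = 2
tenant = demo
user = alice
password = secret
concurrency = 10
http_pool_size = 10
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE)
section = parser["swift"]

settings = {
    "auth_version": section.getint("auth_version"),
    "credentials": (section["tenant"], section["user"], section["password"]),
    # Fall back to a default when a tuning option is left out.
    "concurrency": section.getint("concurrency", fallback=10),
    "http_pool_size": section.getint("http_pool_size", fallback=10),
}
```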
How to retrieve and use
The Swift repository implementation for Dulwich is currently available in the eNovance fork of Dulwich. The installation and usage instructions can be found in README.swift. There is currently a pull request for this feature on the official Dulwich repository.