Designing Storage for Darcs
Let’s set out some requirements first. We are storing two things: a pristine cache and a history of patches. The first is just a mirror of a working checkout; the second is a sequence of text diffs, each with some metadata.
We need the storage to be efficiently accessible through “dumb” access protocols: requesting a whole file is always available, and it may be possible to get part of a file of a given size from a given offset (an HTTP range request). Moreover, we know that the dumb transport has costly roundtrips, so we want to make as few requests as possible. Besides minimising the number of roundtrips, we of course want to minimise the amount of data that we repeatedly re-download (or, more generally, download wastefully).
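As a concrete illustration, fetching a slice of a remote file over dumb HTTP is one Range request per slice, i.e. one roundtrip. The sketch below uses only Python’s standard library; the function names are made up for illustration:

```python
import urllib.request

def range_header(start, length):
    # Range header covering bytes [start, start + length - 1], inclusive.
    return "bytes=%d-%d" % (start, start + length - 1)

def fetch_range(url, start, length):
    """One roundtrip: fetch `length` bytes of `url` starting at `start`.
    Assumes the server honours Range and answers 206 Partial Content."""
    req = urllib.request.Request(url,
                                 headers={"Range": range_header(start, length)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Not every dumb server supports ranges, which is why the text only says it “may be possible”; a robust client has to cope with a 200 response carrying the whole file.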
The local operations are (the first two need to be quite efficient):
- append a new patch to the sequence
- apply patch(es) to the pristine cache
- remove a patch from an arbitrary place in the sequence
- place a new patch in the middle of the sequence
And the remote operations:
- make a copy of the whole repository
- retrieve a list of patches we have not yet seen
- retrieve the actual patch content
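A minimal in-memory model of these operations might look like the following sketch (the class and method names are purely illustrative; real darcs patches carry metadata and live on disk):

```python
import hashlib

class PatchSequence:
    """Toy model of the local and remote operations listed above."""
    def __init__(self):
        self.patches = []                      # ordered list of (hash, payload)

    @staticmethod
    def _hash(payload):
        return hashlib.sha256(payload).hexdigest()

    def append(self, payload):                 # common case: must be fast
        self.patches.append((self._hash(payload), payload))

    def insert(self, index, payload):          # rare: middle insertion
        self.patches.insert(index, (self._hash(payload), payload))

    def remove(self, patch_hash):              # rare: arbitrary removal
        self.patches = [(h, p) for h, p in self.patches if h != patch_hash]

    def missing_from(self, known_hashes):
        """Remote op: the patches the other side has not yet seen."""
        return [h for h, _ in self.patches if h not in known_hashes]
```

The point of the model is the asymmetry: `append` and lookup must be cheap, while `insert` and `remove` may pay a linear cost, mirroring the “first two operations need to be quite efficient” constraint.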
We would also like local copies to use hardlinking, to reduce the space cost of branches. This can be done for any shared, immutable files (and under some circumstances it could be done in other cases too, but that would probably complicate matters more than it’d be worth).
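Sharing an immutable file between two local branches is then a single hardlink. A hedged sketch, falling back to a plain copy when linking is impossible (e.g. across filesystems):

```python
import os

def share(src, dst):
    """Hardlink an immutable file into another branch; both paths then
    refer to the same on-disk data, so the branch costs no extra space.
    Falls back to a real copy if linking fails (e.g. cross-device)."""
    try:
        os.link(src, dst)
    except OSError:
        with open(src, "rb") as f, open(dst, "wb") as g:
            g.write(f.read())
```

This is only safe because the files are immutable: a write through one link would be visible through all of them.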
There are also two approaches to getting a copy of the repository: the “full copy”, which includes all the history, and the “lazy” one, which copies the pristine cache first and avoids any patch application, and possibly also patch download.
Basically, there are two extreme approaches to the challenge (plus some middle ground in between that we will discuss later). First, we could keep everything in a single append-only file (or a few of them). The remove/insert operations would require an expensive copy to preserve the append-only property, but this is not much of a problem, as they are fairly rare. This is essentially the Camp approach to repository storage. Camp keeps the pristine cache in a plain filesystem hierarchy, but since the pristine cache is never involved in any network operation, this is reasonable (modulo the missing non-corruption guarantees). The basic downside is that virtually no sharing is possible (at least not trivially; a sharing scheme would require a nontrivial amount of extra logic).
Second, there is the current hashed darcs system: we keep every pristine file and every patch in a separate file, named by a hash of its content. We can share almost everything through hardlinking (and for remote operations we can implement caching easily). The cost is twofold: remote operations require a large number of roundtrips, and internal filesystem fragmentation inflates repository size (the stored files are often just a fraction of a filesystem block, and most filesystems never allocate more than one file in a given block). Moreover, the current darcs implementation creates extremely large directories, which tend to make individual file operations quite inefficient.
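The hashed scheme itself is simple. The sketch below also fans files out into two-character subdirectories, which would avoid the oversized directories mentioned above (the layout is illustrative, not darcs’s actual one):

```python
import hashlib
import os

def loose_path(root, content):
    """Content-addressed path with a 2-character fan-out directory, so no
    single directory grows huge (layout is illustrative)."""
    h = hashlib.sha256(content).hexdigest()
    return os.path.join(root, h[:2], h[2:])

def store_loose(root, content):
    """Write one object as one file; identical content is stored only once,
    which is what makes hardlink-based sharing and caching easy."""
    path = loose_path(root, content)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(content)
    return path
```

The fragmentation cost is visible here too: each object, however small, occupies at least one filesystem block.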
To look into a somewhat different world, there’s also git. Basically, git does the same thing darcs does: it creates a single file per object, with a cryptographic hash as its name. This allows for sharing and fast local access. It also suffers from the fragmentation problem and is unsuitable for retrieval over dumb transport. Git uses “packs” to circumvent those two problems: from time to time, you manually compress the repository into these packs. It also relies on smart transport whenever possible (which uses the same packing mechanism internally).
Anyway, the compromises between the two extremes seem to fall into the “generational” category: a hatchery of one kind and a “mature space” of another, with the mature space hopefully containing large immutable files (with many objects packed in each). The hatchery is more of an open question. As noted above, git uses “loose” object storage (many small files) as its hatchery. The generational gap is bridged mostly manually: users are called in for administrative action.
The question is, can we do any better? There are two fronts on which to improve: an automated “aging” of objects from hatchery to mature space, and a better format for the hatchery. Compared to current darcs, we can also improve the (currently non-existent) mature space.
Basically, the mature space can be considered read-only for most practical purposes. The natural solution for darcs would be to create a single packed file per tag, and this packed file could just as well be a (compressed) patch bundle. When broken at tags, the patch bundles would have a very small context (just the immediately preceding tag).
As for the hatchery, one obvious solution is the same as git’s: loose storage. This is also a straightforward extension of current darcs, achieved just by implementing a mature space. The costs in the form of roundtrips and filesystem fragmentation could still be quite high, though.
The other approach to the hatchery that comes to mind is to use a single, append-only file (with a simple index, probably in an external file). Basically, this would work like Camp currently does, appending all patches to one file. At a certain point, the append-only file would be packed into the mature space somehow, and we would start afresh with an empty hatchery (or at least a reduced one, depending on the packing scheme).
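Such an append-only hatchery with an offset index could be sketched as follows (the file layout is made up for illustration, and a real index would be persisted in a small external file rather than kept in memory):

```python
import hashlib
import os

class Hatchery:
    """Append-only patch file plus an offset index keyed by patch hash."""
    def __init__(self, path):
        self.path = path
        self.index = {}                      # patch hash -> (offset, length)
        open(self.path, "ab").close()        # ensure the file exists

    def append(self, payload):
        """The fast common case: write at the end, record where it went."""
        h = hashlib.sha256(payload).hexdigest()
        with open(self.path, "ab") as f:
            offset = f.tell()                # 'ab' positions at end of file
            f.write(payload)
        self.index[h] = (offset, len(payload))
        return h

    def read(self, h):
        offset, length = self.index[h]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def size(self):
        """Used by the migration heuristic: pack once this hits a bound."""
        return os.path.getsize(self.path)
```

Because lookups go through the index, the same offset/length pairs also translate directly into HTTP range requests for remote readers.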
Hatchery/Mature Space Migration
The last thing that needs to be dealt with is the heuristic that migrates bits from the hatchery to the mature space. As I hinted above, for darcs it would generally make sense to do this at tags. Nevertheless, if there are too many or too few tags, this won’t work very well in practice. This basically means we want the heuristic to coalesce “too small” tags and break up “too big” tagless sequences. The first is easier than the second: we can choose the chunk to pack at migration time, i.e. when the hatchery size hits some threshold. We can just set some constants for how big we want the chunks to be, and chop up the load at the most appropriate tags.
As for the latter, we can decide at the same time: when the hatchery grows too big and there is no tag to use for a breakup, we can auto-tag at an appropriate place. Such a heuristic is bound to make mistakes, but it will probably work well enough.
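The combined heuristic (accumulate until a size threshold, cut at the latest available tag, and cut anyway, i.e. auto-tag, where none is available) can be sketched as a greedy pass over patch sizes; the policy and the threshold handling are illustrative, not darcs’s actual algorithm:

```python
def choose_chunks(sizes, tag_positions, target):
    """Split a sequence of patch sizes into pack chunks.

    sizes: per-patch byte sizes, in application order.
    tag_positions: indices of patches that are tags.
    target: desired pack size; small tagged runs coalesce until they
            reach it, and tagless runs are cut anyway (auto-tag).
    Returns half-open (start, end) index ranges.
    """
    chunks, start, acc, last_tag = [], 0, 0, None
    for i, size in enumerate(sizes):
        acc += size
        if i in tag_positions:
            last_tag = i
        if acc >= target:
            # Prefer cutting just after the most recent tag; otherwise
            # cut right here (where a real system would auto-tag).
            cut = last_tag + 1 if last_tag is not None else i + 1
            chunks.append((start, cut))
            acc = sum(sizes[cut:i + 1])      # carry over uncut remainder
            start, last_tag = cut, None
    if start < len(sizes):
        chunks.append((start, len(sizes)))   # leftover stays in the hatchery
    return chunks
```

With a large `target`, closely spaced tags coalesce into one pack; with no tags at all, the cuts fall wherever the threshold is crossed.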
Let’s consider a format where the mature space consists of patch bundles (broken up at tags) and the hatchery is a small number of append-only files. We can put the hatchery bound at 2 MB, which becomes the maximum amount of unshareable data per branch. Let’s say we put the optimal pack size at 1.5 MB. For a repository like GHC’s, this means about 40 packs plus the hatchery and the pristine cache. If we can reconstruct the pristine cache “on the fly” (as data is received from the network), the time required should be quite close to that of the actual download, without much time wasted on unpacking onto the filesystem. (Trying with current darcs from a local disk drive, replaying patches in the GHC repository gives a processing speed of around 900 KB/s. This probably needs some improvement.)
Since the previous discussion side-stepped the issue of the pristine cache, the system as described will still have a problem with “lazy” repositories: the loose pristine cache needs to be fetched. This could easily take about as long as a full get with the improvements mentioned above, since we are facing some 1200 roundtrips (compared to 40 for the patches, still using the GHC repository as an example). On my current wireless connection, the average RTT to code.haskell.org is around 200 ms, and from a wired university connection it is about 140 ms. Even in the case of the fast university network, we are facing some 168 seconds of roundtrip wait time when using “dumb” HTTP.
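The 168-second figure is just the number of sequential requests multiplied by the RTT, which also shows how dramatically packing helps:

```python
def roundtrip_wait_ms(requests, rtt_ms):
    # Sequential requests over dumb HTTP: total wait is requests * RTT,
    # ignoring transfer time itself (latency-dominated worst case).
    return requests * rtt_ms

# ~1200 loose pristine files at the 140 ms university RTT:
assert roundtrip_wait_ms(1200, 140) == 168_000   # 168 seconds of pure waiting
# versus ~40 pack requests for the whole patch history:
assert roundtrip_wait_ms(40, 140) == 5_600       # 5.6 seconds
```

Pipelined or parallel requests would shrink this, but a dumb client cannot count on them, so the request count itself is what has to come down.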
When doing a full copy of the repository, the number of files in the pristine cache doesn’t really matter; there is a slight problem with fragmentation, but we can probably ignore that. However, the potential saving in download time for a lazy repository is non-negligible: for GHC, the pristine cache is around 1/8 the size of the full history. It is definitely possible to pack the pristine cache; however, the proportion of garbage is much higher there than in the patch storage, so the pristine cache would likely need to be periodically repacked and recombined to stay efficient. The generational approach should reduce this problem somewhat, since part of the garbage will never leave the hatchery. The collection for the mature space would probably need to keep track of a utilisation factor, and replace packs that reach a certain non-utilisation threshold (say, 50%).
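The repack decision then reduces to tracking the live fraction of each pack; a minimal sketch using the 50% threshold suggested above:

```python
def needs_repack(live_bytes, total_bytes, threshold=0.5):
    """True once the dead (garbage) fraction of a pack reaches `threshold`.
    The 0.5 default is the 50% non-utilisation figure from the text;
    tuning it trades repacking work against wasted download bytes."""
    if total_bytes == 0:
        return False
    return (total_bytes - live_bytes) / total_bytes >= threshold
```

A collector would walk the mature-space packs with this predicate, rewrite only the live objects of failing packs into fresh packs, and delete the old ones.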
(Just a small side note: when we apply a patch, all affected files and all their parent directories need to be replaced in the pristine cache, just like in git. However, in darcs the older revisions of those files immediately become garbage, since there is nothing like a commit object to reference them: for darcs, this is purely a cache, as we reconstruct trees by applying and unapplying patches.)
Now that we see that the pristine cache can use a storage concept very similar to the patches’, we could try to unify patch storage and pristine storage. Basically, patches behave like pristine files; it’s just that the proportion of objects that fall dead is much lower. However, this unified storage would have to forgo the idea of “pack as a bundle”, since the model is more general now. It also poses some issues with applying patches as they are downloaded, since we no longer receive them in application order: we probably want the generalised packs to be sorted by hash, so that we can have binary-search indices for them. The packs would also be independent of tags, which is a mixed blessing.
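Hash-sorted packs make lookup a binary search over the pack index. A small sketch using Python’s bisect module; the index format (parallel hash and offset columns) is hypothetical:

```python
import bisect
import hashlib

def build_index(objects):
    """Build a hash-sorted index for a pack: `objects` maps pack offsets
    to raw object contents. Returns parallel hash/offset columns."""
    entries = sorted((hashlib.sha256(data).hexdigest(), offset)
                     for offset, data in objects.items())
    hashes = [h for h, _ in entries]
    offsets = [o for _, o in entries]
    return hashes, offsets

def lookup(hashes, offsets, h):
    """Binary search over the sorted hash column; None if absent."""
    i = bisect.bisect_left(hashes, h)
    return offsets[i] if i < len(hashes) and hashes[i] == h else None
```

Since both patches and pristine files are addressed by hash, the same index serves either kind of object, which is exactly what makes the unified storage plausible.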
So far it seems that the unified storage code is worthwhile. For one, it makes lazy repositories possible and efficient. Moreover, at the cost of a slightly larger download, it can avoid replaying all patches upon repository copy, so the fact that they are not sorted in application order does not hurt too much. We get packing that is independent of manual tagging, without the need to add automatic tagging to the scheme. Overall, when using the general storage principle, things don’t seem overly complicated; it will probably take some trial and error to get the parameters (various thresholds and sizes) right. In the end, the unified storage should be a relatively simple and thin layer sitting between darcs and the filesystem, acting as a sort of object store: somewhat like a smarter git storage, with provisions for higher garbage rates (which are virtually zero in git).