soc final report
As you may have noticed, I skipped last progress report. That’s because I was busy with coding work. Anyway, let’s summarise what happened throughout the GSoC program. The basic idea was to improve darcs support for large repositories — large in terms of number of working files. The primary goal was to improve the code that diffs the working tree against the pristine cache: an operation, that is very common, and in darcs 2.2 and earlier was also painfully slow (depending on circumstances).
So what has been delivered? There is a number of outputs of the project:
- hashed-storage 0.3.7, a library for reading and writing filesystem trees in hash-based formats, with primary focus on darcs hashed pristine
- darcs whatsnew integration with hashed-storage 0.3.7, part of the current stable release of darcs, 2.3.0
- hashed-storage almost-0.4 — a work in progress, improving hashed-storage 0.3.7 in several directions (and breaking the API)
- darcs-hs: a branch integrating hashed-storage 0.4 into darcs, using hashed-storage fast diffing mechanism for all working/pristine diffing needs — naturally, also a work in progress
- darcs-benchmark: a standalone package for benchmarking darcs (pitting different versions of darcs against each other on a number of medium-sized repositories, on a number of different commands)
The first two items on the list are done for good. The diffing code that was a major goal of the project is stable and part of a stable darcs release. It is supported by a independent hackage library, hashed-storage.
For efficient retrieval over HTTP… there is support for creating and using hashed packs in hashed-storage 0.4. This still needs some work to be put in practice, both on darcs-hs and on hashed-storage. I will continue working on this in the fall. The current status of code in hashed-storage 0.4 is basically proof-of-concept (comes with unit tests though). There is also a concept document, designing storage for darcs. Either way, I am trying to not rush things on this, since once we put this into the stable darcs, we will have to live with it for a long time, and mistakes now would be expensive later.
As for darcs-benchmark — I have uploaded version 0.1 to hackage and it should
be ready for wider circulation. Please try it out and let me know if you have
any issues with it (
cabal install darcs-benchmark).
Work in Progress
Since pencils down is approaching quickly, there will stay a number of in-progress items. There are still some bugs to fix in darcs-hs, and hashed-storage needs a haddock cleanup and at least one pass over the API before I can upload 0.4. But let’s see what 0.4 brings over 0.3.7. During a review process that Ganesh started, we have identified a number of weak spots in hashed-storage. This has directly and indirectly led to hashed-storage 0.4.
On the API level, we have generalised the Tree out of IO into a generic Monad, to help with testing infrastructure. The Index API has been simplified and, more importantly, made much safer. With this done, it turned out that the index reading code is needlessly general, and I have simplified it, improving index performance along the way (in the process, I also changed the layout… twice… the magic word versioning mechanism really paid off).
I have also refactored the Hash type, and removed all the hacks that deal with the size prefix on darcs hashes. This means we won’t be able to write out new trees of this kind anymore (but no big loss, old darcs can read repositories without the size prefix as well). Along with the original Hash refactor, I have changed the internal hash representation, cutting hash length from 64 to 32 bytes (this is still SHA256, but is using full 8 bits of every byte, unlike ascii-hex which carries 4 bits of useful data per byte). This representation is also used in the index.
The overall result is, that the index size has shrunk considerably — it is now, on average, smaller than a git index for equivalent repository, even though we use stronger hashes (but yes, git index is more than a simple cache).
The even more interesting outcome is that, with darcs-hs, I can now run “whatsnew” on a 100k-file repository in 0.35 seconds — a mere .1 second (or 40%) slower than equivalent git diff. With 100% Haskell (well, the sha256 implementation is in C, but this code is not tripped in the optimal case). For most realistic repository sizes, the difference is going to be negligible. Oh, and to get an idea how big a 100k-file repository is, the Linux kernel tree has about 28k files in it.
Recently, it has been noted, that for old-fashioned repositories, darcs whatsnew has regressed in performance by a fair amount. This is true, but it’s not clear if it is worth addressing properly. For 2.3.1, the easy fix is to restore the old code path when the repository is not hashed. However, I have already removed unsafeDiff in darcs-hs and I want to keep it that way, so we either need to start treating old-fashioned as second-class, or either come up with a more systematic fix.
As for hashed-storage, I want to do further work on packs. Maybe even add code to read git repositories. Moreover, there is an ongoing review of hashed-storage which may bring further ideas for changes. We also have to make some decisions about what to do about darcs in the future — how to.
It has definitely been a productive summer. Lots of work has been done on hashed-storage, and it seems that with a little further work, it can provide solid underpinning to future darcs versions. Apart from major performance improvements, it exposes a new standalone component with its own set of tests, improving test coverage of what constitutes darcs.