Introduction

This is probably the most active bit of the site, i post random thoughts, events and articles to this blog. It is ridden with personal issues and technical items alike, while being constantly out of date. Read at your own risk. If you are interested in my person (odd enough as that may sound), probably read something about me. If you are only interested in technical articles, see tech. Poetry has a completely separate blog of its own right. Older articles (as well as postings on the previous blogs i had) are to be found in archive. That’s probably it.

Recent Entries

14.08.2009: soc final report

soc final report

As you may have noticed, I skipped last progress report. That’s because I was busy with coding work. Anyway, let’s summarise what happened throughout the GSoC program. The basic idea was to improve darcs support for large repositories — large in terms of number of working files. The primary goal was to improve the code that diffs the working tree against the pristine cache: an operation, that is very common, and in darcs 2.2 and earlier was also painfully slow (depending on circumstances).

Deliverables

So what has been delivered? There is a number of outputs of the project:

The first two items on the list are done for good. The diffing code that was a major goal of the project is stable and part of a stable darcs release. It is supported by a independent hackage library, hashed-storage.

For efficient retrieval over HTTP… there is support for creating and using hashed packs in hashed-storage 0.4. This still needs some work to be put in practice, both on darcs-hs and on hashed-storage. I will continue working on this in the fall. The current status of code in hashed-storage 0.4 is basically proof-of-concept (comes with unit tests though). There is also a concept document, designing storage for darcs. Either way, I am trying to not rush things on this, since once we put this into the stable darcs, we will have to live with it for a long time, and mistakes now would be expensive later.

As for darcs-benchmark — I have uploaded version 0.1 to hackage and it should be ready for wider circulation. Please try it out and let me know if you have any issues with it (cabal install darcs-benchmark).

Work in Progress

Since pencils down is approaching quickly, there will stay a number of in-progress items. There are still some bugs to fix in darcs-hs, and hashed-storage needs a haddock cleanup and at least one pass over the API before I can upload 0.4. But let’s see what 0.4 brings over 0.3.7. During a review process that Ganesh started, we have identified a number of weak spots in hashed-storage. This has directly and indirectly led to hashed-storage 0.4.

On the API level, we have generalised the Tree out of IO into a generic Monad, to help with testing infrastructure. The Index API has been simplified and, more importantly, made much safer. With this done, it turned out that the index reading code is needlessly general, and I have simplified it, improving index performance along the way (in the process, I also changed the layout… twice… the magic word versioning mechanism really paid off).

I have also refactored the Hash type, and removed all the hacks that deal with the size prefix on darcs hashes. This means we won’t be able to write out new trees of this kind anymore (but no big loss, old darcs can read repositories without the size prefix as well). Along with the original Hash refactor, I have changed the internal hash representation, cutting hash length from 64 to 32 bytes (this is still SHA256, but is using full 8 bits of every byte, unlike ascii-hex which carries 4 bits of useful data per byte). This representation is also used in the index.

The overall result is, that the index size has shrunk considerably — it is now, on average, smaller than a git index for equivalent repository, even though we use stronger hashes (but yes, git index is more than a simple cache).

The even more interesting outcome is that, with darcs-hs, I can now run “whatsnew” on a 100k-file repository in 0.35 seconds — a mere .1 second (or 40%) slower than equivalent git diff. With 100% Haskell (well, the sha256 implementation is in C, but this code is not tripped in the optimal case). For most realistic repository sizes, the difference is going to be negligible. Oh, and to get an idea how big a 100k-file repository is, the Linux kernel tree has about 28k files in it.

Future Work

Recently, it has been noted, that for old-fashioned repositories, darcs whatsnew has regressed in performance by a fair amount. This is true, but it’s not clear if it is worth addressing properly. For 2.3.1, the easy fix is to restore the old code path when the repository is not hashed. However, I have already removed unsafeDiff in darcs-hs and I want to keep it that way, so we either need to start treating old-fashioned as second-class, or either come up with a more systematic fix.

As for hashed-storage, I want to do further work on packs. Maybe even add code to read git repositories. Moreover, there is an ongoing review of hashed-storage which may bring further ideas for changes. We also have to make some decisions about what to do about darcs in the future — how to.

Conclusions

It has definitely been a productive summer. Lots of work has been done on hashed-storage, and it seems that with a little further work, it can provide solid underpinning to future darcs versions. Apart from major performance improvements, it exposes a new standalone component with its own set of tests, improving test coverage of what constitutes darcs.

posted 14.08.2009 11:35 pm tags:

29.07.2009: soc progress 10

soc progress 10

Wow, tenth progress report. That’s a long time now. Still a day late, but getting on track slowly (slower than expected I guess). Last week has seen the release of darcs 2.3. Nonetheless, this release has seen partial integration of hashed-storage work. Unfortunately, the release was also a major motivation drain — RMing darcs is a lot of (fairly hard) work.

On the unsafeDiff front, I’m at a standstill (there are still the same 2 instances that I reported about last week). I haven’t progressed on the HPC issue either, but that can wait and it’s not directly relevant to SoC.

I have worked on profiling my last-week’s check/repair port. Fortunately, the implementation does not leak memory, and is slightly slower (some 20%) than the old version. Most of the extra goes into filepath juggling (due to representation differences between darcs and hashed-storage). See my darcs-users post for more details about this.

In other news, I have forked an 0.3 branch of hashed-storage and flipped the switch to 0.4 in mainline. The first compatibility-breaker is the Hash type refactor, which improved things a fair bit. Also, we now represent hashes in-memory (and in the binary index) by 32-byte binary vectors (represented as bytestrings) instead of using the 64-byte ascii-hex representation. Also, this has removed a lot of Maybe wrapping, since the new Hash type comes with a “NoHash” constructor.

Other than that, I have taken a break after the release. I still have some motivational issues, but I should improve my output rate over the next week. The major todo on is file benchmarking (since Eric keeps nagging me about this, and I still haven’t figured how to do this properly) and actual implementation of the (internal) pack format.

Hashed-storage changes:

And darcs-hs changes:

posted 29.07.2009 10:01 pm tags:

25.07.2009: patch formats

Patch Formats

I have been thinking for a while, that a completely new “repository” format (an experimental one) would be in place for darcs 2.4. I have previously outlined a way I’d like to go about building up new things within the darcs 2.x series. Now a darcs repository has two basic “components”: the “file” part of the layout: truly a repository format, and a “patch format”: which determines not only how patches are written out to disk, but more importantly, their exact semantics. Once you set up a “patch format”, this is set in stone and repositories with different patch types cannot exchange patches between them (at least not without an in-between conversion). This is the case between darcs-1 and darcs-2 format repositories, as they use a different patch format. The case of darcs-1 vs “hashed” repositories, as darcs calls them, is only on the file level though: the patch formats are identical, and that’s why hashed and plain darcs-1 repositories can exchange patches just fine. (I will from now on refer to repository and patch format as two orthogonal things, as they mostly are.)

Now I have been working on a packed repository format… one that would allow to store the repository — regardless of patch format used — in a compact form suitable for HTTP-based retrieval. In this post, I’d like to address the other thing: a patch format. It seems worthwhile to improve our current system, since it has a number of weak spots. Currently, we have a number of “primitive” patch types, and some more complicated ones — conflictors in darcs-2 or mergers in darcs-1. I am not going to talk about these — we’ll focus on the primitive patches for now.

The primitives in darcs are addfile, rmfile, addir, rmdir, move, replace and most importantly hunk. (You should be able to look up what these roughly represent somewhere else.)

Let’s address these addfile and friend patches, that create and remove files or directories. Obviously, addfile foo.txt and a different addfile foo.txt are going to conflict. Also, all hunks for foo.txt obviously depend on addfile foo.txt — which means that if you pull two branches together with nontrivial files of the same name, you are going to end up with a massive conflict (at least in terms of darcs data structures) for virtually no reason.

So my proposal is to divorce the filename from file identity: this is something that has been pondered before, I believe. The result would look something like:

hunk fileid 1
- a
+ b

This means that hunks would exist without any dependence on addfile: the abstract file would pop into existence with first hunk touching its identity. Of course this would be no good, since you just lost the relation between a working copy and whatever darcs tracks. To put that relation back on track, we add two patch types:

manifest fileid ./file/path
demanifest fileid ./file/path

A manifest patch will tell darcs to associate the fileid with your working file at ./file/path. The inverse operation is demanifest, and that would remove the association: and your working copy file. The abstract identity continues to exist just fine, and can be manifested again (under the same or different filepath). Basically, this completely de-couples the “hunk-space” from the “filepath-space” — manifest/demanifest/move(/adddir/removedir) patches commute completely freely with hunk/replace patches. To make the de-coupling complete, you want a “manifest” of a non-existent fileid to pop that fileid into existence as well. No problem.

Basically, this means that as far as darcs is concerned, file content manipulation is orthogonal to the directory tree manipulation: and this is good and well, since it allows us to solve conflicts on both of those levels separately, without dragging in a lot of stuff from the other level. Moreover, the add-add conflict no longer exists.

As for the hunk format itself, there is also a number of issues: it uses a GNU-patch-like format with ‘+’ or ‘-’ sign in front of each line. It will usually look like a block of ‘-’ lines followed by a block of ‘+’ lines (either of these may be empty). Parsing this format is not quite simple, you have to look up all the newlines, chop off the ‘+’ and ‘-’ signs etc. Lots of work for darcs.

hunk ./foo.txt 1
- a
+ b

Now a friendly-to-parse alternative would be to use a chunk kind of patch:

chunk fileid - 0 2
a\n
chunk fileid + 0 2
b\n

the 0 and the 2 are an offset and length (in bytes), respectively. What follows the patch header is a monolithic block of text to be removed from (or pasted at) the given offset (and the block is of a given length). This would produce more primitives than the original hunks, but they would be vastly simpler to process in bulk. Each basically represents a string “splice” operation. No newline shuffling whatsoever. The commutation rules should be as simple as they were with hunks: you just wibble the offset (after checking for overlap).

But now that the data in those chunks is basically a blob of (anything), there is an extra thing that can be done. Instead of keeping this data inline in the patch, we could refer to it by a hash and store it elsewhere:

chunk fileid - 0 2 hash
chunk fileid + 0 2 hash

Of course, for the example patch, this is going to inflate the patch size quite a bit. But let’s not care for a while. We now have a chunk format that is O(1) in size (well O(logn) for purists, given the length needs to be represented, but we don’t care about that either). We can still commute and invert it just fine: inversion is flip-flopping the ‘-‘/’+’ sign, and commutation just wibbles the offset. Awesome. (Besides that, it also effectively obliterates any need for a “binary” kind of patch: chunks will do just fine for that, and we could even use binary diff algorithm if we wanted…)

We will need to dereference that hash sometimes of course: when we want to actually apply the patch, and when we want to show it to user. The latter is a non-issue, since our processing power vastly exceeds that of the user, we can play with the patch as much as we like for presentation issues. So when we want to apply that patch, we need to fetch its content… but wait, we already have a mechanism for buckets of bits by their hashes: the hashed pristine! So yes, we can just dump the data bits of the chunks into hashed pristine (plus some wibbling of garbage collection, see my previous post about that).

Now that no actual file content is ever part of any patch representation, we can consider some new options. One would be to store patches inline in the inventory files: this would probably inflate their sizes by a small factor. Looking at darcs repository itself, we have 750K of compressed inventories (1.5M uncompressed). There are 43225 hunks — this would add about 6M of uncompressed (but relatively compressible) text to the inventories (considering about 150 bytes per a chunk patch: 2x sha256 + a little). That is about factor 5. We would probably have to play around a little to find out how (un)reasonable this is. We could also cut a bit of the cost by using a less-stupid encoding of hashes than ascii-hex (aka base16)… say base64 (see RFC 3548/4).

I should probably also note that this would save some bits on conflictor representation (no copies of hunk data) and it should also solve the “big initial patch” problem — the patch itself would be O(n) in number of files (instead of O(n) in number of bytes of the initial tree). There are of course some drawbacks and some other advantages, but I don’t have the amount of time to go into more details just now. Instead, I’ll let people think about this for a while. Comments are definitely welcome (probably address them to our darcs-users@ mailing list).

posted 25.07.2009 2:03 pm tags:

23.07.2009: darcs 2.3.0

Darcs 2.3.0

I’d like to announce immediate availability of a new stable release of darcs, 2.3.0. There is a number of improvements and bugfixes over the previous stable release, 2.2. Moreover, work has been done to improve performance of “darcs whatsnew” for large repositories.

As in the past, there are two source tarballs available. As of this release, the cabal-based build is preferred, and the autoconf build is deprecated. You can obtain the source tarballs at these addresses:

The build instructions are available in the enclosed README file in those tarballs.

Moreover, if you have cabal-install available, you can install latest stable release of darcs by issuing the following commands (no tarballs needed):

    $ cabal update  
    $ cabal install darcs

This should give you a darcs binary in ~/.cabal/bin — you should probably add this to your PATH. More detailed instructions for installing on Windows are available near the end of this announcement.

What’s New

This is a summary of important changes since the last stable release (2.2):

A number of issues has been resolved since 2.2 as well. See http://bugs.darcs.net/issueN for details on bug number N.

The question of GHC 6.8

Using GHC 6.10.3 or newer is strongly recommended. You may compile darcs with GHC 6.8, but there are several caveats. If you are using 6.8.2 or older, please disable mmap support (pass -f-mmap to cabal install or runghc Setup configure). Note that the GHC 6.8.2 that ships with Debian Lenny is not affected and it should be safe to keep mmap enabled. It is also recommended to disable use of Hackage zlib when compiling with GHC 6.8.2 (including the Debian Lenny version): pass -f-zlib to cabal. When using zlib, we have seen occasional crashes with error messages like “openBinaryFile: file locked” — this is a known GHC 6.8.2 bug (and is fixed in GHC 6.8.3). Last, if you are using a 64-bit system, darcs may hang when you exit a pager when compiled with GHC older than 6.10.3. Although this is harmless, it is quite inconvenient.

Overall, the status of GHC 6.8 is “semi-supported”: for many cases, things will work just fine, especially if you take a little extra caution with compilation flags.

Installing on Windows

To install darcs on Windows systems from scratch, please download the Haskell Platform and MSYS:

After installing both, you should have an “MSYS” icon: run MSYS and in the terminal window type (the $ character denotes the prompt, do not repeat it):

$ cabal update  
$ cabal install darcs -f-curl

This should download, compile and install all required dependencies and also darcs itself. The resulting darcs executable will be placed into the Haskell Platform executables folder, and should be accessible from the MSYS shell (just type “darcs —version” to check).

posted 23.07.2009 3:27 pm tags:

23.07.2009: soc progress 9

soc progress 9

Uh, I have slipped by two days again — well, almost. The problem is that in the last two days, I have been somewhat busy with e-mail at darcs-users@… — a sudden peak of traffic hit the list. Basically, two threads created a lot of traffic, the David’s darcs one started by me, and the cheap in-repo branching one. And since this was lot of writing on my part, I didn’t quite feel like writing also the report. But I’m still a day closer to the original Tuesday (and it’s still not even noon here, so that’s more like a day and a half!). I’ll try again to get on track next week.

I have worked some more on the packed format bits in hashed-storage, and I have worked on getting rid of “unsafeDiff” in darcs-hs. The latter is down to 2 uses, one in “darcs replace” and the other in Darcs.Resolution. The replace one should be relatively easy, actually. The resolution one hopefully too, but we have no testsuite coverage of the external merge, which would be better added before I break it… For the changes already done, I have ported over check/repair (I still need to run this through a heap profiler, since I suspect I introduced some space leaks) and “darcs remove” which is now also neater and more readable (at least I think it is). And since we never use unsafeDiff to compare working against pristine anymore, I have nuked a bunch of code that deals with timestamps.

I have also done a little work on HPC reports for darcs itself: I put a current HPC report online, and there’s some automation in place, so it should get updated whenever I push into the repository (it is supposed to be a 1:1 mirror of http://darcs.net). There are still some issues, like the TIXDIR taking over a gigabyte of space (I know how to fix this, but haven’t had the time yet — it involves a relatively easy support in ShellHarness).

Also on the mainline darcs (and this is not strictly SoC work, but worth reporting anyway), we have removed autoconf support. Finally. There are still some ongoing issues, like we need the user manual to be built by Cabal (we kept some bits of the GNUmakefile to address this for now) and some other features are currently not available with cabal (see also this thread).

Hashed-storage changes:

And darcs-hs changes:

posted 23.07.2009 11:14 am tags:

17.07.2009: soc progress 8

soc progress 8

So I have managed to reduce the report lag by a few days, expect the next one on the (regular) Tuesday. Hopefully. Anyway, since the last report, a lot has happened.

I got a new laptop a few weeks back (a Lenovo X200 — I installed the base system on 30th of June), which enabled me to play around with more resource-hungry software than before (this beast has 2G of RAM and two 2.4GHz CPUs). The thing I tried was installing VirtualBox and a copy of Windows XP inside (I have an academic license, hopefully I am within the rules). This worked fairly well (the VM is pretty fast, actually) and enabled me to play with darcs on a win32 system. After installing the Haskell Platform (congrats to all the folks who worked on this: this piece of software is awesome) and MSYS, I got cabal install darcs-beta rolling. Turned out that darcs 2.3 didn’t work very well on windows (well, about 30 tests failed, and darcs whatsnew was completely unusable). It took about two days time to kick all the parts into obedience: I patched all of mmap, hashed-storage and darcs. The net result is that darcs now passes its testsuite on win32 (this never quite worked before). In the meantime, I have already released new beta, darcs 2.3 beta 4 which has all the win32 fixes in it. In addition, I made the hashed-storage testsuite pass on Windows, which should make it much easier in the future to track down problems.

I have also made some coding progress on hashed-storage: first, the packed format prototyping is advancing, and before next report, I hope to have a workable prototype. Some code already went into hashed-storage repository, and there’s already some testing coverage. Which brings me to the other thing, that is that I have improved test coverage of hashed-storage significantly (we now have 10 QC properties and 18 testcases that together cover of 77% toplevels, 66% alternatives and 83% expressions, as reported by HPC).

In darcs-hs, I have done some work on optimising darcs show contents --match, as this seems to be an important operation for repository browsers (or at least for trac, but I imagine others may benefit from this). I have also figured that large part of the tracdarcs inefficiency comes from calling darcs show contents on way old revisions: based on this, Lele has implemented an optimisation that instead looks at the latest revision that has a given version of the file, and for HEAD revisions omits —match altogether. I have deployed the optimised version on my own tracdarcs instance (for another project) and I can confirm that it is much more usable than before. I’ll set up a test instance for GHC repository just for the kicks of it. (Maybe we could then get Hackage folks to install tracdarcs into their trac instances? Now that would be cool.)

The summary of hashed-storage changes for the week:

and for darcs-hs:

posted 17.07.2009 2:55 pm tags:

15.07.2009: darcs 2.3 beta 4

Darcs 2.3 beta 4

We have decided to delay the release cycle slightly for 2.3 and release another beta. The primary reason for this is our Windows support — I would like to invite all Windows users to install darcs-beta and give it a ride. If you have a working cabal-install, all you need to do is run:

$ cabal update
$ cabal install darcs-beta

(this works on all platforms; you may need to use -f-curl, if you don’t have the cURL headers available — this is often the case on Windows). If you do not have cabal-install, please follow the instructions near the end of this message, “Installing on Windows”.

In addition to using cabal-install, you can also download a tarball from http://repos.mornfall.net/darcs/darcs-2.2.98.4.tar.gz and build manually (see the build instructions in README inside the tarball).

Feedback

To make the darcs 2.3 release a good one, we still need testing feedback: please drop a note to darcs-users@ if you have installed darcs-beta, or failed to install it. If you run into any bugs, we need to know about them.

Thanks!

The question of GHC 6.8

Using GHC 6.10.3 or newer is strongly recommended. You may compile darcs with GHC 6.8, but there are several caveats. If you are using 6.8.2 or older, please disable mmap support (pass -f-mmap to cabal install or runghc Setup configure below). Note that the GHC 6.8.2 that ships with Debian Lenny is not affected and it should be safe to keep mmap enabled. It is also recommended to disable use of Hackage zlib when compiling with GHC 6.8.2 (including the Debian Lenny version): pass -f-zlib to cabal. When using zlib, we have seen occasional crashes with error messages like “openBinaryFile: file locked” — this is a known GHC 6.8.2 bug (and is fixed in GHC 6.8.3). Last, if you are using a 64-bit system, darcs may hang when you exit a pager when compiled with GHC older than 6.10.3. Although this is harmless, it is quite inconvenient.

Overall, the status of GHC 6.8 is “semi-supported”: for many cases, things will work just fine, especially if you take a little extra caution with compilation flags.

Installing on Windows

To install darcs on Windows systems from scratch, please download the Haskell Platform and MSYS:

After installing both, you should have an “MSYS” icon: run MSYS and in the terminal window type:

$ cabal update  
$ cabal install darcs-beta -f-curl

This should download, compile and install all required dependencies and darcs-beta itself. The resulting darcs executable will be placed into the Haskell Platform executables folder, and should be accessible from the MSYS shell (just type “darcs —version” to check).

posted 15.07.2009 1:10 pm tags:

12.07.2009: soc progress 7

soc progress 7

This report is a few days late, so first of all, sorry for that. Anyway, the time since last report has been mostly split among two things: working on the theory for the to-be storage format of darcs (which I posted on this blog, under designing storage for darcs) and on the 2.3 release.

Other than those, I have set up HPC report for hashed-storage and a trac instance for the same.

On the code side of things, even less has happened than the last week, but I have started to sketch out the storage API in hashed-storage, and over the next week, I’ll be hopefully able to come up with first workable implementation.

Eric has also been asking about benchmarks for some time now, so I’ll try to get to revisiting Standard Darcs Benchmark in a bit, dust it and hopefully make it reasonably faster to work with than it is.

I guess the last two points are more than enough for now of a TODO. I’ll try to get out next progress report on Friday, to get back on track a little.

The summary of hashed-storage changes for the week:

and for darcs-hs:

posted 12.07.2009 6:54 pm tags:

11.07.2009: designing storage for darcs

Designing Storage for Darcs

Let’s set out some requirements first. We are storing two things: a pristine cache, and history of patches. The first is just a mirror of a working checkout, the second is patches: a sequence of text diffs, each with some metadata.

We need the storage to be efficiently accessible through “dumb” access protocols: requesting a file is always available, and it may be possible to get part of a file of a given size from a given offset (HTTP range request). Moreover, we know that the dumb transport has costly roundtrips: we want to do as few requests as possible. Other than minimising number of roundtrips, we of course want to minimise the amount of data that we need to repeatedly re-download (or more generally, download in waste).

The local operations are (the first 2 operations need to be quite efficient):

And the remote operations:

We would also like local copies to use hardlinking, to reduce space cost of branches. This can be done for any shared, immutable files (and under some circumstances, it could be done in other cases, but that probably complicates matters more than it’d be worth).

There are also two approaches to getting a copy of the repository: the “full copy” that includes all the history, and the “lazy” one, that copies the pristine cache first and avoids any patch application and possibly also patch download.

Extreme Approaches

Basically, there are two extreme approaches to the challenge (and also some middle ground in between that we will discuss later). First, we could keep everything in a single (a few) append-only files. The remove/insert operations would require an expensive copy operation to keep the append property, but this is not so much of a problem, as these are fairly rare. This is basically the Camp approach to repository storage. Camp keeps pristine cache in a plain filesystem hierarchy, but the pristine cache is never involved in any network operation, so this is reasonable (modulo the missing non-corruption guarantees). The basic downside is that virtually no sharing is admissible (at least not trivially — a sharing scheme would require a nontrivial amount of extra logic).

Second, there is the current hashed darcs system: we keep every pristine file and every patch in a separate file, named by the file’s content (i.e. hashed). We can share mostly everything through hardlinking (and for remote operations, we can implement caching easily). The cost is twofold: remote operations require a high number of roundtrips, and internal filesystem fragmentation inflates repository size (the stored files are often just a fraction of a filesystem block, and most filesystems never allocate more than one file in a given block). Moreover, current darcs implementation creates extremely large directories, which tend to make individual file operations quite inefficient.

Compromise Approaches

To look into a somewhat different world, there’s also git. Basically, what git does is same as what darcs does: it creates a single file per object, with cryptographic hash as its name. This allows for sharing and fast local access. It also suffers from the fragmentation problem and is unsuitable for retrieval over dumb transport. Git uses “packs” to circumvent those two problems: you need to manually compress the repository from time to time, into these packs. It also relies on smart transport whenever possible (which uses the same packing mechanism internally).

Anyway, the compromises between the two extremes seem to fall into the “generational” category: a hatchery of one kind and a “mature space” of another kind: with mature space hopefully containing large immutable files (with many objects packed in each). The hatchery is more of a question. As noted above, git uses a “loose” object storage (many small files) as the hatchery. The generational gap is mostly manual: users are called in for administrative action.

The question is, can we do any better? There are two fronts on which to improve: an automated “aging” of objects, from hatchery to mature space, and a better format for the hatchery. From the other side, compared to current darcs, we can also improve the (non-existent) mature space.

Basically, the mature space can be considered read-only for most practical purposes. The natural solution for darcs would be to create a single packed file per tag, and this packed file can as well be a (compressed) patch bundle. When broken at tags, the patch bundles would have very small context (just the immediately preceding tag).

As for hatchery, one obvious solution is the same as git: loose storage. This is also a straightforward extension of existing darcs, by just implementing a mature space. The costs in form of roundtrips and filesystem fragmentation could still be quite high though.

The other approach for hatchery that comes to mind is to use only a single, append-only file (with a simple index, probably in an external file). Basically, this would work like Camp currently does, appending all patches to a file. At certain point, this append-only file would be packed into the mature space somehow, and we would start afresh with empty hatchery (or at least with reduced one, depending on the packing scheme).

Hatchery/Mature Space Migration

The last thing that needs to be dealt with is the heuristic that migrates bits from the hatchery to the mature space. As I hinted above, for darcs it would generally make sense to do this at tags. Nevertheless, if there are too many or too few tags, this won’t work very well in practice. This basically means we want the heuristic to coalesce “too small” tags, and break up “too big” tagless sequences. The first is easier than the second: we can choose the chunk to pack at the migration time, i.e. when hatchery size hits some threshold. We can just set some constants as how big chunks we want, and chop up the load at the most appropriate tags.

As for the latter, we can decide at the same time: when hatchery grows too big and there’s no tag to use for breakup, we can auto-tag at appropriate place. Such heuristic is bound to make mistakes, but it is probably going to work well-enough.

Example

Let’s consider a format, where mature space consists of patch bundles (broken up at tags) and the hatchery is a small number of append-only files. We can put the hatchery bound at 2M, which becomes the maximum amount of unshareable data per branch. Let’s say we put the optimal pack size at 1.5M. For the GHC kind of repository, this is about 40 packs plus the hatchery and the pristine cache. If we can reconstruct the pristine cache “on the fly” (as data is received from network), the time required should be quite close to that of the actual download, without much time wasted for unpacking onto the filesystem. (Trying with current darcs from a local disk drive, replaying patches in the GHC repository gives processing speed of around 900KB/s. This probably needs some improvements.)

Pristine Cache

Since in the previous discussion, we side-stepped the issue of pristine cache, the whole system as described will still have an issue with “lazy” repositories — the loose pristine needs to be fetched. This could easily take about the same time as full get with the mentioned improvements, since we are facing some 1200 roundtrips (compared to 40 for the patches — still using the GHC repository as an example). On my current wireless connection, the average RTT to code.haskell.org is around 200ms, and from a wired university connection, it is about 140ms. Even in the case of the fast university network, we are facing some 168 seconds of roundtrip wait time when using “dumb” HTTP.

When doing a full copy of the repository, the number of files in pristine cache doesn’t really matter — it has a slight problem with fragmentation, but we could probably ignore that. However, the potential saving on download time for lazy repository is non-negligible: the size of the pristine cache is around 1/8 of the full history size for GHC. It is definitely possible to pack the pristine cache, however, the proportion of garbage is much higher in pristine cache than in the patch storage. The pristine cache would likely need to be periodically repacked and recombined to stay efficient. The generational approach will likely reduce this problem somewhat, since a part of the garbage will never leave the hatchery. The collection for mature space would probably need to keep track of utilisation factor, and replace packs that reach a certain non-utilisation threshold (say, 50%).

(Just a small sidenote: when we apply a patch, all affected files and all their parent directories need to be replaced in the pristine cache. This works just like git. However, in darcs, the older revisions of those files immediately become garbage, since there is nothing like a commit object to reference them: for darcs, this is purely a cache, since we reconstruct trees by applying and unapplying patches.)

Unified Storage

Now that we see that the pristine cache can use a very similar storage concept as the patches do, we could try to unify the patch storage and the pristine storage. Basically, patches behave like pristine, it’s just that the proportion of objects that fall dead is much lower. However, this unified storage would have to forgo the idea of “pack as a bundle”, since the model is more general now. It also poses some issues with applying patches as they are downloaded, since we no longer receive them in application order: we probably want the generalised packs to be sorted by hash, so we can have binary search indices for them. It would be also independent of tags, which is a mixed blessing.

Conclusion

It so far seems that the unified storage code is worthwhile. For one, it will make lazy repositories possible and efficient. Moreover, at cost of a slightly larger download, it can avoid replaying all patches upon repository copy, so the fact they are not sorted in application order does not hurt too much. We get packing that is independent of manual tagging and without need to add automatic tagging to the scheme. Overall, when using the general storage principle, things don’t seem to be overly complicated. It will probably take some trial and error to set the parameters (various thresholds and sizes) right. In the case of the unified storage, the net result should be a relatively simple and thin layer sitting between darcs and the filesystem, acting like sort of object storage: somewhat like a smarter git storage, with provisions for higher garbage rates (since these are virtually 0 with git).

posted 11.07.2009 7:11 pm tags:

09.07.2009: darcs 2.3 beta 2

Darcs 2.3 beta 2

Two weeks passed and it is about time to release darcs 2.3 beta 2 (it’s already a day late, sorry about that).

As with beta 1, there is only a single installation package for this release of darcs: cabalised source. (Please note that the final version with also come with the legacy autoconf-based buildsystem, for the last time.)

You can either download a tarball from http://repos.mornfall.net/darcs/darcs-2.2.98.2.tar.gz and build manually (see the build instructions in README inside the tarball), or, alternatively, you can use cabal-install to obtain a copy (the beta release is now available on Hackage):

    $ cabal update
    $ cabal install darcs-beta

This, if successful, will give you a darcs binary in ~/.cabal/bin — you should probably add that to your PATH.

What’s new since beta 1

There are basically two relatively intrusive changes since beta 1. First, I have added index upgrade functionality to hashed-storage and this is now used by darcs, so that any bad or incompatible indexes are recreated transparently by darcs. Moreover, the index format is now architecture-independent, meaning that repositories shared across multiple architectures should not suffer from excessive index rebuilding.

Second, Ganesh has done further work on his gzip CRC correction code. There is a very slight risk of regressions, so if you have around any repositories with broken CRCs in them, please test this functionality. Moreover, there is now an option to limit the repair to current repository, avoiding changes in related branches or caches.

Trent, Thorkil and Eric have done cleanup work and a number of minor fixes for this release, improving overall quality and polish. One bug in the new indexing code has been found and reported by Eric and Guillaume independently, and is fixed in hashed-storage 0.3.4 (which required by this beta).

You can track the release plan and progress at our wiki: http://wiki.darcs.net/development.

The question of GHC 6.8

Using GHC 6.10.3 or newer is strongly recommended. You may compile darcs with GHC 6.8, but there are several caveats. If you are using 6.8.2 or older, please disable mmap support (pass -f-mmap to cabal install or runghc Setup configure below). Note that the GHC 6.8.2 that ships with Debian Lenny is not affected and it should be safe to keep mmap enabled. It is also recommended to disable use of Hackage zlib when compiling with GHC 6.8.2 (including the Debian Lenny version): pass -f-zlib to cabal. When using zlib, we have seen occasional crashes with error messages like “openBinaryFile: file locked” — this is a known GHC 6.8.2 bug (and is fixed in GHC 6.8.3). Last, if you are using a 64-bit system, darcs may hang when you exit a pager when compiled with GHC older than 6.10.3. Although this is harmless, it is quite inconvenient.

Overall, the status of GHC 6.8 is “semi-supported”: for many cases, things will work just fine, especially if you take a little extra caution with compilation flags.

posted 09.07.2009 12:54 pm tags: