Introduction
This is probably the most active bit of the site, i post random thoughts, events and articles to this blog. It is ridden with personal issues and technical items alike, while being constantly out of date. Read at your own risk. If you are interested in my person (odd enough as that may sound), probably read something about me. If you are only interested in technical articles, see tech. Poetry has a completely separate blog of its own right. Older articles (as well as postings on the previous blogs i had) are to be found in archive. That’s probably it.
Recent Entries
14.08.2009: soc final report
soc final report
As you may have noticed, I skipped last progress report. That’s because I was busy with coding work. Anyway, let’s summarise what happened throughout the GSoC program. The basic idea was to improve darcs support for large repositories — large in terms of number of working files. The primary goal was to improve the code that diffs the working tree against the pristine cache: an operation, that is very common, and in darcs 2.2 and earlier was also painfully slow (depending on circumstances).
Deliverables
So what has been delivered? There is a number of outputs of the project:
- hashed-storage 0.3.7, a library for reading and writing filesystem trees in hash-based formats, with primary focus on darcs hashed pristine
- darcs whatsnew integration with hashed-storage 0.3.7, part of the current stable release of darcs, 2.3.0
- hashed-storage almost-0.4 — a work in progress, improving hashed-storage 0.3.7 in several directions (and breaking the API)
- darcs-hs: a branch integrating hashed-storage 0.4 into darcs, using hashed-storage fast diffing mechanism for all working/pristine diffing needs — naturally, also a work in progress
- darcs-benchmark: a standalone package for benchmarking darcs (pitting different versions of darcs against each other on a number of medium-sized repositories, on a number of different commands)
The first two items on the list are done for good. The diffing code that was a major goal of the project is stable and part of a stable darcs release. It is supported by a independent hackage library, hashed-storage.
For efficient retrieval over HTTP… there is support for creating and using hashed packs in hashed-storage 0.4. This still needs some work to be put in practice, both on darcs-hs and on hashed-storage. I will continue working on this in the fall. The current status of code in hashed-storage 0.4 is basically proof-of-concept (comes with unit tests though). There is also a concept document, designing storage for darcs. Either way, I am trying to not rush things on this, since once we put this into the stable darcs, we will have to live with it for a long time, and mistakes now would be expensive later.
As for darcs-benchmark — I have uploaded version 0.1 to hackage and it should
be ready for wider circulation. Please try it out and let me know if you have
any issues with it (cabal install darcs-benchmark).
Work in Progress
Since pencils down is approaching quickly, there will stay a number of in-progress items. There are still some bugs to fix in darcs-hs, and hashed-storage needs a haddock cleanup and at least one pass over the API before I can upload 0.4. But let’s see what 0.4 brings over 0.3.7. During a review process that Ganesh started, we have identified a number of weak spots in hashed-storage. This has directly and indirectly led to hashed-storage 0.4.
On the API level, we have generalised the Tree out of IO into a generic Monad, to help with testing infrastructure. The Index API has been simplified and, more importantly, made much safer. With this done, it turned out that the index reading code is needlessly general, and I have simplified it, improving index performance along the way (in the process, I also changed the layout… twice… the magic word versioning mechanism really paid off).
I have also refactored the Hash type, and removed all the hacks that deal with the size prefix on darcs hashes. This means we won’t be able to write out new trees of this kind anymore (but no big loss, old darcs can read repositories without the size prefix as well). Along with the original Hash refactor, I have changed the internal hash representation, cutting hash length from 64 to 32 bytes (this is still SHA256, but is using full 8 bits of every byte, unlike ascii-hex which carries 4 bits of useful data per byte). This representation is also used in the index.
The overall result is, that the index size has shrunk considerably — it is now, on average, smaller than a git index for equivalent repository, even though we use stronger hashes (but yes, git index is more than a simple cache).
The even more interesting outcome is that, with darcs-hs, I can now run “whatsnew” on a 100k-file repository in 0.35 seconds — a mere .1 second (or 40%) slower than equivalent git diff. With 100% Haskell (well, the sha256 implementation is in C, but this code is not tripped in the optimal case). For most realistic repository sizes, the difference is going to be negligible. Oh, and to get an idea how big a 100k-file repository is, the Linux kernel tree has about 28k files in it.
Future Work
Recently, it has been noted, that for old-fashioned repositories, darcs whatsnew has regressed in performance by a fair amount. This is true, but it’s not clear if it is worth addressing properly. For 2.3.1, the easy fix is to restore the old code path when the repository is not hashed. However, I have already removed unsafeDiff in darcs-hs and I want to keep it that way, so we either need to start treating old-fashioned as second-class, or either come up with a more systematic fix.
As for hashed-storage, I want to do further work on packs. Maybe even add code to read git repositories. Moreover, there is an ongoing review of hashed-storage which may bring further ideas for changes. We also have to make some decisions about what to do about darcs in the future — how to.
Conclusions
It has definitely been a productive summer. Lots of work has been done on hashed-storage, and it seems that with a little further work, it can provide solid underpinning to future darcs versions. Apart from major performance improvements, it exposes a new standalone component with its own set of tests, improving test coverage of what constitutes darcs.
posted 14.08.2009 11:35 pm29.07.2009: soc progress 10
soc progress 10
Wow, tenth progress report. That’s a long time now. Still a day late, but getting on track slowly (slower than expected I guess). Last week has seen the release of darcs 2.3. Nonetheless, this release has seen partial integration of hashed-storage work. Unfortunately, the release was also a major motivation drain — RMing darcs is a lot of (fairly hard) work.
On the unsafeDiff front, I’m at a standstill (there are still the same 2 instances that I reported about last week). I haven’t progressed on the HPC issue either, but that can wait and it’s not directly relevant to SoC.
I have worked on profiling my last-week’s check/repair port. Fortunately, the implementation does not leak memory, and is slightly slower (some 20%) than the old version. Most of the extra goes into filepath juggling (due to representation differences between darcs and hashed-storage). See my darcs-users post for more details about this.
In other news, I have forked an 0.3 branch of hashed-storage and flipped the switch to 0.4 in mainline. The first compatibility-breaker is the Hash type refactor, which improved things a fair bit. Also, we now represent hashes in-memory (and in the binary index) by 32-byte binary vectors (represented as bytestrings) instead of using the 64-byte ascii-hex representation. Also, this has removed a lot of Maybe wrapping, since the new Hash type comes with a “NoHash” constructor.
Other than that, I have taken a break after the release. I still have some motivational issues, but I should improve my output rate over the next week. The major todo on is file benchmarking (since Eric keeps nagging me about this, and I still haven’t figured how to do this properly) and actual implementation of the (internal) pack format.
Hashed-storage changes:
- Do not include _darcs/index in the testdata.
- Move floatPath into the AnchoredPath module.
- Do not forget to reset changesize in Monad, upon sync.
- Avoid calling expandPath when not needed, in Monad.
- Do not forget to update hashes before syncing a hashed tree in Monad.
- Refactor Hash datatypes extensively.
- Bump version to 0.4.0.
- We can now base64/url-encode (and decode) all Hash types.
- Add a generic sha256 function to the Hash module.
- Sprinkle a few haddocks around the Hash module.
- Add a NEWS file.
- Bundle NEWS in source distribution.
- Move the darcs-specific packing bits into Darcs.
- Use “plain” hashes in Packed (instead of the size-carrying darcs ones).
- The hatchery cleanup for compact is a removeFile (for now, anyway).
- Rename readOS to load.
- Haddock header for Packed.
- Fix collateral from using plain hashes in object storage. (Ow, ugly.)
And darcs-hs changes:
- Bump version to 2.3.0.
- Update autoconf release data to 2.3.0.
- Update NEWS.
- Ship the NEWS file in cabal tarballs as well.
- Port to hashed-storage 0.4 Hash changes.
25.07.2009: patch formats
Patch Formats
I have been thinking for a while, that a completely new “repository” format (an experimental one) would be in place for darcs 2.4. I have previously outlined a way I’d like to go about building up new things within the darcs 2.x series. Now a darcs repository has two basic “components”: the “file” part of the layout: truly a repository format, and a “patch format”: which determines not only how patches are written out to disk, but more importantly, their exact semantics. Once you set up a “patch format”, this is set in stone and repositories with different patch types cannot exchange patches between them (at least not without an in-between conversion). This is the case between darcs-1 and darcs-2 format repositories, as they use a different patch format. The case of darcs-1 vs “hashed” repositories, as darcs calls them, is only on the file level though: the patch formats are identical, and that’s why hashed and plain darcs-1 repositories can exchange patches just fine. (I will from now on refer to repository and patch format as two orthogonal things, as they mostly are.)
Now I have been working on a packed repository format… one that would allow to store the repository — regardless of patch format used — in a compact form suitable for HTTP-based retrieval. In this post, I’d like to address the other thing: a patch format. It seems worthwhile to improve our current system, since it has a number of weak spots. Currently, we have a number of “primitive” patch types, and some more complicated ones — conflictors in darcs-2 or mergers in darcs-1. I am not going to talk about these — we’ll focus on the primitive patches for now.
The primitives in darcs are addfile, rmfile, addir, rmdir, move,
replace and most importantly hunk. (You should be able to look up what
these roughly represent somewhere else.)
Let’s address these addfile and friend patches, that create and remove files
or directories. Obviously, addfile foo.txt and a different addfile
foo.txt are going to conflict. Also, all hunks for foo.txt obviously depend on
addfile foo.txt — which means that if you pull two branches together with
nontrivial files of the same name, you are going to end up with a massive
conflict (at least in terms of darcs data structures) for virtually no reason.
So my proposal is to divorce the filename from file identity: this is something that has been pondered before, I believe. The result would look something like:
hunk fileid 1
- a
+ b
This means that hunks would exist without any dependence on addfile: the
abstract file would pop into existence with first hunk touching its
identity. Of course this would be no good, since you just lost the relation
between a working copy and whatever darcs tracks. To put that relation back on
track, we add two patch types:
manifest fileid ./file/path
demanifest fileid ./file/path
A manifest patch will tell darcs to associate the fileid with your working
file at ./file/path. The inverse operation is demanifest, and that would remove
the association: and your working copy file. The abstract identity continues to
exist just fine, and can be manifested again (under the same or different
filepath). Basically, this completely de-couples the “hunk-space” from the
“filepath-space” — manifest/demanifest/move(/adddir/removedir) patches commute
completely freely with hunk/replace patches. To make the de-coupling complete,
you want a “manifest” of a non-existent fileid to pop that fileid into
existence as well. No problem.
Basically, this means that as far as darcs is concerned, file content manipulation is orthogonal to the directory tree manipulation: and this is good and well, since it allows us to solve conflicts on both of those levels separately, without dragging in a lot of stuff from the other level. Moreover, the add-add conflict no longer exists.
As for the hunk format itself, there is also a number of issues: it uses a
GNU-patch-like format with ‘+’ or ‘-’ sign in front of each line. It will
usually look like a block of ‘-’ lines followed by a block of ‘+’ lines (either
of these may be empty). Parsing this format is not quite simple, you have to
look up all the newlines, chop off the ‘+’ and ‘-’ signs etc. Lots of work for
darcs.
hunk ./foo.txt 1
- a
+ b
Now a friendly-to-parse alternative would be to use a chunk kind of patch:
chunk fileid - 0 2
a\n
chunk fileid + 0 2
b\n
the 0 and the 2 are an offset and length (in bytes), respectively. What follows the patch header is a monolithic block of text to be removed from (or pasted at) the given offset (and the block is of a given length). This would produce more primitives than the original hunks, but they would be vastly simpler to process in bulk. Each basically represents a string “splice” operation. No newline shuffling whatsoever. The commutation rules should be as simple as they were with hunks: you just wibble the offset (after checking for overlap).
But now that the data in those chunks is basically a blob of (anything), there is an extra thing that can be done. Instead of keeping this data inline in the patch, we could refer to it by a hash and store it elsewhere:
chunk fileid - 0 2 hash
chunk fileid + 0 2 hash
Of course, for the example patch, this is going to inflate the patch size quite a bit. But let’s not care for a while. We now have a chunk format that is O(1) in size (well O(logn) for purists, given the length needs to be represented, but we don’t care about that either). We can still commute and invert it just fine: inversion is flip-flopping the ‘-‘/’+’ sign, and commutation just wibbles the offset. Awesome. (Besides that, it also effectively obliterates any need for a “binary” kind of patch: chunks will do just fine for that, and we could even use binary diff algorithm if we wanted…)
We will need to dereference that hash sometimes of course: when we want to actually apply the patch, and when we want to show it to user. The latter is a non-issue, since our processing power vastly exceeds that of the user, we can play with the patch as much as we like for presentation issues. So when we want to apply that patch, we need to fetch its content… but wait, we already have a mechanism for buckets of bits by their hashes: the hashed pristine! So yes, we can just dump the data bits of the chunks into hashed pristine (plus some wibbling of garbage collection, see my previous post about that).
Now that no actual file content is ever part of any patch representation, we can consider some new options. One would be to store patches inline in the inventory files: this would probably inflate their sizes by a small factor. Looking at darcs repository itself, we have 750K of compressed inventories (1.5M uncompressed). There are 43225 hunks — this would add about 6M of uncompressed (but relatively compressible) text to the inventories (considering about 150 bytes per a chunk patch: 2x sha256 + a little). That is about factor 5. We would probably have to play around a little to find out how (un)reasonable this is. We could also cut a bit of the cost by using a less-stupid encoding of hashes than ascii-hex (aka base16)… say base64 (see RFC 3548/4).
I should probably also note that this would save some bits on conflictor representation (no copies of hunk data) and it should also solve the “big initial patch” problem — the patch itself would be O(n) in number of files (instead of O(n) in number of bytes of the initial tree). There are of course some drawbacks and some other advantages, but I don’t have the amount of time to go into more details just now. Instead, I’ll let people think about this for a while. Comments are definitely welcome (probably address them to our darcs-users@ mailing list).
posted 25.07.2009 2:03 pm23.07.2009: darcs 2.3.0
Darcs 2.3.0
I’d like to announce immediate availability of a new stable release of darcs, 2.3.0. There is a number of improvements and bugfixes over the previous stable release, 2.2. Moreover, work has been done to improve performance of “darcs whatsnew” for large repositories.
As in the past, there are two source tarballs available. As of this release, the cabal-based build is preferred, and the autoconf build is deprecated. You can obtain the source tarballs at these addresses:
- http://repos.mornfall.net/darcs/darcs-2.3.0.tar.gz
- http://repos.mornfall.net/darcs/darcs-2.3.0_autoconf.tar.gz
The build instructions are available in the enclosed README file in those tarballs.
Moreover, if you have cabal-install available, you can install latest stable release of darcs by issuing the following commands (no tarballs needed):
$ cabal update
$ cabal install darcs
This should give you a darcs binary in ~/.cabal/bin — you should probably
add this to your PATH. More detailed instructions for installing on Windows are
available near the end of this announcement.
What’s New
This is a summary of important changes since the last stable release (2.2):
- Lots and lots of documentation changes (Trent).
- Haskeline improvements (Judah).
- Cabal as default buildsystem (many contributors).
- Fixes in darcs check/repair memory usage (Bertram, David).
- Performance improvement in subtree record (Reinier).
- New option: —summary —xml (Florian Gilcher).
- New option: changes —max-count (Eric and Petr).
- Fix changes —only-to-files for renames (Dmitry).
- Performance fix in “darcs changes” (Benedikt).
- Hardlinks on NTFS (Salvatore).
- Coalesce more changes when creating rollbacks (David).
- New unit test runner (Reinier).
- Inclusion of darcs-shell in contrib (László, Trent).
- Author name/address canonisation: .authorspellings (Simon).
- Working directory index and substantial “darcs wh” optimisation (Petr).
- New command: “darcs show index” (Petr).
- Gzip CRC check and repair feature (Ganesh).
A number of issues has been resolved since 2.2 as well. See http://bugs.darcs.net/issueN for details on bug number N.
- 948 darcsman (Trent)
- 1206 countable nouns (Trent)
- 1285 cabal test v. cabal clean (Trent)
- 1302 use resolved, not resolved-in-unstable (Trent)
- 1235 obliterate —summary (Rob)
- 1270 no MOTD for —xml-output (Lele)
- 1311 cover more timezones (Dave)
- 1292 re-encoding haskeline input (Judah)
- 1313 clickable ToC and refs in PDF manual Trent)
- 1310 create merged \darcsCommand{add} (Trent)
- 1333 better “cannot push to current repository” warning (Petr)
- 1347 (autoconf) check for unsafeMMapFile if mmap use enabled (Dave)
- 1361 specify required includes for curl in cabal file (Reinier)
- 1379 remove libwww support (Trent)
- 1366 remove unreachable code for direct ncurses use (Trent)
- 1271 do not install two copies of darcs.pdf (Trent)
- 1358 encode non-ASCII characters in mail headers (Reinier)
- 1393 swap “darcs mv” and “darcs move” (Trent)
- 1405 improve discoverability of global author file (Trent)
- 1402 don’t “phone home” about bugs (Trent)
- 1301 remove obsolete zsh completion scripts (Trent)
- 1162 makeAbsolute is now a total function (Ben F)
- 1269 setpref predist - exitcode ignored bug (Ben M)
- 1415 —edit-long-comment, not —edit-description, in help (Trent)
- 1413 remove duplicate documentation (Trent)
- 1423 complain about empty add/remove (Trent)
- 1437 Implement darcs changes —max-count (Eric)
- 1430 lazy pattern matching in (-:-) from Changes command module (Dmitry)
- 1434 refactor example test (Trent)
- 1432 refer to %APPDATA%, not %USERPROFILE% (Trent)
- 1186 give a chance to abort if user did not edit description file (Dmitry)
- 1446 make amend-record -m foo replace only the patch name (Dmitry)
- 1435 default to get —hashed from a darcs-1.0 source (Trent)
- 1312 update and reduce build notes (Trent)
- 1351 fix repository path handling on Windows (Salvatore)
- 1173 support hard links on NTFS (Salvatore)
- 1248 support compressed inventories for darcs-1 repos (Ganesh)
- 1455 implement “darcs help environment” (Trent)
The question of GHC 6.8
Using GHC 6.10.3 or newer is strongly recommended. You may compile darcs with GHC 6.8, but there are several caveats. If you are using 6.8.2 or older, please disable mmap support (pass -f-mmap to cabal install or runghc Setup configure). Note that the GHC 6.8.2 that ships with Debian Lenny is not affected and it should be safe to keep mmap enabled. It is also recommended to disable use of Hackage zlib when compiling with GHC 6.8.2 (including the Debian Lenny version): pass -f-zlib to cabal. When using zlib, we have seen occasional crashes with error messages like “openBinaryFile: file locked” — this is a known GHC 6.8.2 bug (and is fixed in GHC 6.8.3). Last, if you are using a 64-bit system, darcs may hang when you exit a pager when compiled with GHC older than 6.10.3. Although this is harmless, it is quite inconvenient.
Overall, the status of GHC 6.8 is “semi-supported”: for many cases, things will work just fine, especially if you take a little extra caution with compilation flags.
Installing on Windows
To install darcs on Windows systems from scratch, please download the Haskell Platform and MSYS:
- http://hackage.haskell.org/platform/2009.2.0.1/HaskellPlatform-2009.2.0.1-setup.exe
- http://sourceforge.net/projects/mingw/files/MSYS+Base+System/MSYS-1.0.11-rc-1.exe/download
After installing both, you should have an “MSYS” icon: run MSYS and in the
terminal window type (the $ character denotes the prompt, do not repeat it):
$ cabal update
$ cabal install darcs -f-curl
This should download, compile and install all required dependencies and also darcs itself. The resulting darcs executable will be placed into the Haskell Platform executables folder, and should be accessible from the MSYS shell (just type “darcs —version” to check).
posted 23.07.2009 3:27 pm23.07.2009: soc progress 9
soc progress 9
Uh, I have slipped by two days again — well, almost. The problem is that in the last two days, I have been somewhat busy with e-mail at darcs-users@… — a sudden peak of traffic hit the list. Basically, two threads created a lot of traffic, the David’s darcs one started by me, and the cheap in-repo branching one. And since this was lot of writing on my part, I didn’t quite feel like writing also the report. But I’m still a day closer to the original Tuesday (and it’s still not even noon here, so that’s more like a day and a half!). I’ll try again to get on track next week.
I have worked some more on the packed format bits in hashed-storage, and I have worked on getting rid of “unsafeDiff” in darcs-hs. The latter is down to 2 uses, one in “darcs replace” and the other in Darcs.Resolution. The replace one should be relatively easy, actually. The resolution one hopefully too, but we have no testsuite coverage of the external merge, which would be better added before I break it… For the changes already done, I have ported over check/repair (I still need to run this through a heap profiler, since I suspect I introduced some space leaks) and “darcs remove” which is now also neater and more readable (at least I think it is). And since we never use unsafeDiff to compare working against pristine anymore, I have nuked a bunch of code that deals with timestamps.
I have also done a little work on HPC reports for darcs itself: I put a current HPC report online, and there’s some automation in place, so it should get updated whenever I push into the repository (it is supposed to be a 1:1 mirror of http://darcs.net). There are still some issues, like the TIXDIR taking over a gigabyte of space (I know how to fix this, but haven’t had the time yet — it involves a relatively easy support in ShellHarness).
Also on the mainline darcs (and this is not strictly SoC work, but worth reporting anyway), we have removed autoconf support. Finally. There are still some ongoing issues, like we need the user manual to be built by Cabal (we kept some bits of the GNUmakefile to address this for now) and some other features are currently not available with cabal (see also this thread).
Hashed-storage changes:
- Clean up the interface implementing darcs hashed trees in object storage.
- In hatch, discard objects that are already in the storage.
- Improve haddock.
- Add a testcase for readOS.
- Add a small testcase for diffTree of identical trees.
- Add a testcase for writePlainTree.
- Add a simple QC property for restrict.
- Add another QC property for restrict.
- Expand the expandPath testcase to cover more possibilities.
- Also cover the items part of the expandPath-produced Tree.
- Obliterate embarassingly useless iterateM and fix mfixFrom (IO is too strict…).
- Implement rudimentary darcsPristineRefs (for chasing references).
- Test live object collection for darcs pristine trees.
- Add a function to postorder-traverse and update a Tree.
- More utilities for darcs hashed trees embedded in an object storage.
- Implement post-compaction cleanup for loose hatcheries.
- More complete OS field initialisation (although dummy so far).
- Use readOS after create to populate the in-memory structure.
- A rudimentary implementation of createPack (which in turn enables compaction).
- Do not fail when a nonexistent file is marked as changed in Monad.
- Do not remove associated Tree hashes in filter.
- Minor formatting fix.
- Avoid removing an in-use file on win32.
And darcs-hs changes:
- Port the replay (check/repair) functionality to hashed-storage.
- Remove the support for writing out new checkpoints.
- Remove the —checkpoint option from the UI.
- Remove now-unused checkPristineAgainstSlurpy.
- Add support for skipping tests (exit 200).
- Avoid removing in-use files on win32.
- Re-implement makeremovepatch in remove command, replacing Slurps with Trees.
- Obliterate all instances of sync_repo and friends, since they are useless now.
- Obliterate timestamp manipulation in HashedIO.
17.07.2009: soc progress 8
soc progress 8
So I have managed to reduce the report lag by a few days, expect the next one on the (regular) Tuesday. Hopefully. Anyway, since the last report, a lot has happened.
I got a new laptop a few weeks back (a Lenovo X200 — I installed the base system on 30th of June), which enabled me to play around with more resource-hungry software than before (this beast has 2G of RAM and two 2.4GHz CPUs). The thing I tried was installing VirtualBox and a copy of Windows XP inside (I have an academic license, hopefully I am within the rules). This worked fairly well (the VM is pretty fast, actually) and enabled me to play with darcs on a win32 system. After installing the Haskell Platform (congrats to all the folks who worked on this: this piece of software is awesome) and MSYS, I got cabal install darcs-beta rolling. Turned out that darcs 2.3 didn’t work very well on windows (well, about 30 tests failed, and darcs whatsnew was completely unusable). It took about two days time to kick all the parts into obedience: I patched all of mmap, hashed-storage and darcs. The net result is that darcs now passes its testsuite on win32 (this never quite worked before). In the meantime, I have already released new beta, darcs 2.3 beta 4 which has all the win32 fixes in it. In addition, I made the hashed-storage testsuite pass on Windows, which should make it much easier in the future to track down problems.
I have also made some coding progress on hashed-storage: first, the packed format prototyping is advancing, and before next report, I hope to have a workable prototype. Some code already went into hashed-storage repository, and there’s already some testing coverage. Which brings me to the other thing, that is that I have improved test coverage of hashed-storage significantly (we now have 10 QC properties and 18 testcases that together cover of 77% toplevels, 66% alternatives and 83% expressions, as reported by HPC).
In darcs-hs, I have done some work on optimising darcs show contents --match,
as this seems to be an important operation for repository browsers (or at least
for trac, but I imagine others may benefit from this). I have also figured that
large part of the tracdarcs inefficiency comes from calling darcs show
contents on way old revisions: based on this, Lele has implemented an
optimisation that instead looks at the latest revision that has a given
version of the file, and for HEAD revisions omits —match altogether. I have
deployed the optimised version on my own tracdarcs instance (for another
project) and I can confirm that it is much more usable than before. I’ll set
up a test instance for GHC repository just for the kicks of it. (Maybe we could
then get Hackage folks to install tracdarcs into their trac instances? Now that
would be cool.)
The summary of hashed-storage changes for the week:
- Improve the modifyTree testcases slightly.
- A basic testcase for diffTrees.
- Minor tidying of the Test module.
- Add a testcase for modifyTree used for removal (with Nothing passed).
- Check plain tree contents in addition to pristine.
- Add some bad and ugly filenames to the test data.
- Add a check that mmaping empty file works.
- More prototyping in Packed.
- Also derive Ord for Hash.
- One more QC property for “reachable”.
- Haddock the new utility functions.
- Add “reachable” to Utils (comes with a pair of QC properties).
- Add iterateM and (m)fixFrom to Utils (comes with QC and unit test respectively).
- Fix QC tests to start with prop_, while unit tests start with check_.
- Reorder test groups.
- Haskell still can’t read my mind. Make Packed compile (and expose it).
- Haddock tweaks.
- Start out on Storage.Hashed.Packed (for a lack of better name).
- Avoid unused import warning in Utils.
- Import Bundled.Posix qualified in Test to avoid future name clashes.
- Refuse to overwrite files in TreeIO rename (darcs relies on this).
- Properly encode whitespace in darcsFormatDir.
- Improved fileExists check in testsuite.
- Make a separate test group for Bundled.Posix.
- Add a test for fileExists from Bundled.Posix.
- Avoid symbol name conflicts with darcs sha256 implementation.
- Make the testsuite pass on win32.
- A quick&dirty unit test for getFileStatus/getSymbolicLinkStatus.
- Do not try to use lstat on win32, it seems to be broken in GHC.
- Bump version to 0.3.6.
- Bump the index magic to HSI1 (incompatible format change).
- Require mmap >= 0.4.
- Get rid of last darcs use in the testsuite.
- Do not use darcs show manifest in test, we have complete file list.
- Avoid trailing slashes when stat-ing directories (fixes win32).
- Correctly store path to the toplevel directory as “.”, not “/”.
- Use readSegment to read hashed directory listings.
and for darcs-hs:
- Optimize darcs show contents —match (avoid slurping pristine).
- Proper implementation for mDoesFileExist/mDoesDirectoryExist in Gorsvet.
- Provide readPending that also provides the “pending conflicts” check natively.
- Bump version to 2.2.98.4.
- Bump the hashed-storage dependency to >= 0.3.6.
- Implement getFileStatus, and use it instead of getSymbolicLinkStatus on win32.
- Use mmapFilePS in gzReadFilePS to avoid lazy bytestring readFile.
- Use DARCS_TESTING_PREFS_DIR in ShellHarness, since APPDATA override does not work.
- Recognize a special DARCS_TESTING_PREFS_DIR envvar to override the global preference directory.
15.07.2009: darcs 2.3 beta 4
Darcs 2.3 beta 4
We have decided to delay the release cycle slightly for 2.3 and release another beta. The primary reason for this is our Windows support — I would like to invite all Windows users to install darcs-beta and give it a ride. If you have a working cabal-install, all you need to do is run:
$ cabal update
$ cabal install darcs-beta
(this works on all platforms; you may need to use -f-curl, if you don’t have the cURL headers available — this is often the case on Windows). If you do not have cabal-install, please follow the instructions near the end of this message, “Installing on Windows”.
In addition to using cabal-install, you can also download a tarball from http://repos.mornfall.net/darcs/darcs-2.2.98.4.tar.gz and build manually (see the build instructions in README inside the tarball).
Feedback
To make the darcs 2.3 release a good one, we still need testing feedback: please drop a note to darcs-users@ if you have installed darcs-beta, or failed to install it. If you run into any bugs, we need to know about them.
Thanks!
The question of GHC 6.8
Using GHC 6.10.3 or newer is strongly recommended. You may compile darcs with GHC 6.8, but there are several caveats. If you are using 6.8.2 or older, please disable mmap support (pass -f-mmap to cabal install or runghc Setup configure below). Note that the GHC 6.8.2 that ships with Debian Lenny is not affected and it should be safe to keep mmap enabled. It is also recommended to disable use of Hackage zlib when compiling with GHC 6.8.2 (including the Debian Lenny version): pass -f-zlib to cabal. When using zlib, we have seen occasional crashes with error messages like “openBinaryFile: file locked” — this is a known GHC 6.8.2 bug (and is fixed in GHC 6.8.3). Last, if you are using a 64-bit system, darcs may hang when you exit a pager when compiled with GHC older than 6.10.3. Although this is harmless, it is quite inconvenient.
Overall, the status of GHC 6.8 is “semi-supported”: for many cases, things will work just fine, especially if you take a little extra caution with compilation flags.
Installing on Windows
To install darcs on Windows systems from scratch, please download the Haskell Platform and MSYS:
- http://hackage.haskell.org/platform/2009.2.0.1/HaskellPlatform-2009.2.0.1-setup.exe
- http://sourceforge.net/projects/mingw/files/MSYS+Base+System/MSYS-1.0.11-rc-1.exe/download
After installing both, you should have an “MSYS” icon: run MSYS and in the terminal window type:
$ cabal update
$ cabal install darcs-beta -f-curl
This should download, compile and install all required dependencies and darcs-beta itself. The resulting darcs executable will be placed into the Haskell Platform executables folder, and should be accessible from the MSYS shell (just type “darcs —version” to check).
posted 15.07.2009 1:10 pm12.07.2009: soc progress 7
soc progress 7
This report is a few days late, so first of all, sorry for that. Anyway, the time since last report has been mostly split among two things: working on the theory for the to-be storage format of darcs (which I posted on this blog, under designing storage for darcs) and on the 2.3 release.
Other than those, I have set up HPC report for hashed-storage and a trac instance for the same.
On the code side of things, even less has happened than the last week, but I have started to sketch out the storage API in hashed-storage, and over the next week, I’ll be hopefully able to come up with first workable implementation.
Eric has also been asking about benchmarks for some time now, so I’ll try to get to revisiting Standard Darcs Benchmark in a bit, dust it and hopefully make it reasonably faster to work with than it is.
I guess the last two points are more than enough for now of a TODO. I’ll try to get out next progress report on Friday, to get back on track a little.
The summary of hashed-storage changes for the week:
- Add a flag to enable hpc support in profiled builds.
- Expand the tree more thoroughly (and less fatally) in Monad.
- Don’t forget to export indexFormatValid.
- Bump version to 0.3.5.
- Rework the API for index versioning and deprecate readOrUpgradeIndex.
and for darcs-hs:
- Resolve conflict.
- Bump version to 2.2.98.3.
- Fix Wall warnings about applyToTree in Darcs.Gorsvet.
- Fix segfaults when index is both invalid and in the wrong format.
- Bump hashed-storage dependency to 0.3.5.
- Include hspwd.hs in the distribution tarball.
- Bump version to 2.2.98.2.
- Update build documentation in README.
- Use the index upgrade functionality from hashed-storage >= 0.3.4.
11.07.2009: designing storage for darcs
Designing Storage for Darcs
Let’s set out some requirements first. We are storing two things: a pristine cache, and history of patches. The first is just a mirror of a working checkout, the second is patches: a sequence of text diffs, each with some metadata.
We need the storage to be efficiently accessible through “dumb” access protocols: requesting a file is always available, and it may be possible to get part of a file of a given size from a given offset (HTTP range request). Moreover, we know that the dumb transport has costly roundtrips: we want to do as few requests as possible. Other than minimising number of roundtrips, we of course want to minimise the amount of data that we need to repeatedly re-download (or more generally, download in waste).
The local operations are (the first 2 operations need to be quite efficient):
- append a new patch to the sequence
- apply patch(es) to the pristine cache
- remove a patch from arbitrary place in the sequence
- place a new patch in the middle of the sequence
And the remote operations:
- make a copy of the whole repository
- retrieve a list of patches we have not yet seen
- retrieve the actual patch content
We would also like local copies to use hardlinking, to reduce space cost of branches. This can be done for any shared, immutable files (and under some circumstances, it could be done in other cases, but that probably complicates matters more than it’d be worth).
There are also two approaches to getting a copy of the repository: the “full copy” that includes all the history, and the “lazy” one, that copies the pristine cache first and avoids any patch application and possibly also patch download.
Extreme Approaches
Basically, there are two extreme approaches to the challenge (and also some middle ground in between that we will discuss later). First, we could keep everything in a single (a few) append-only files. The remove/insert operations would require an expensive copy operation to keep the append property, but this is not so much of a problem, as these are fairly rare. This is basically the Camp approach to repository storage. Camp keeps pristine cache in a plain filesystem hierarchy, but the pristine cache is never involved in any network operation, so this is reasonable (modulo the missing non-corruption guarantees). The basic downside is that virtually no sharing is admissible (at least not trivially — a sharing scheme would require a nontrivial amount of extra logic).
Second, there is the current hashed darcs system: we keep every pristine file and every patch in a separate file, named by the file’s content (i.e. hashed). We can share mostly everything through hardlinking (and for remote operations, we can implement caching easily). The cost is twofold: remote operations require a high number of roundtrips, and internal filesystem fragmentation inflates repository size (the stored files are often just a fraction of a filesystem block, and most filesystems never allocate more than one file in a given block). Moreover, current darcs implementation creates extremely large directories, which tend to make individual file operations quite inefficient.
Compromise Approaches
To look into a somewhat different world, there’s also git. Basically, what git does is same as what darcs does: it creates a single file per object, with cryptographic hash as its name. This allows for sharing and fast local access. It also suffers from the fragmentation problem and is unsuitable for retrieval over dumb transport. Git uses “packs” to circumvent those two problems: you need to manually compress the repository from time to time, into these packs. It also relies on smart transport whenever possible (which uses the same packing mechanism internally).
Anyway, the compromises between the two extremes seem to fall into the “generational” category: a hatchery of one kind and a “mature space” of another kind: with mature space hopefully containing large immutable files (with many objects packed in each). The hatchery is more of a question. As noted above, git uses a “loose” object storage (many small files) as the hatchery. The generational gap is mostly manual: users are called in for administrative action.
The question is, can we do any better? There are two fronts on which to improve: an automated “aging” of objects, from hatchery to mature space, and a better format for the hatchery. From the other side, compared to current darcs, we can also improve the (non-existent) mature space.
Basically, the mature space can be considered read-only for most practical purposes. The natural solution for darcs would be to create a single packed file per tag, and this packed file can as well be a (compressed) patch bundle. When broken at tags, the patch bundles would have very small context (just the immediately preceding tag).
As for hatchery, one obvious solution is the same as git: loose storage. This is also a straightforward extension of existing darcs, by just implementing a mature space. The costs in form of roundtrips and filesystem fragmentation could still be quite high though.
The other approach for hatchery that comes to mind is to use only a single, append-only file (with a simple index, probably in an external file). Basically, this would work like Camp currently does, appending all patches to a file. At certain point, this append-only file would be packed into the mature space somehow, and we would start afresh with empty hatchery (or at least with reduced one, depending on the packing scheme).
Hatchery/Mature Space Migration
The last thing that needs to be dealt with is the heuristic that migrates bits from the hatchery to the mature space. As I hinted above, for darcs it would generally make sense to do this at tags. Nevertheless, if there are too many or too few tags, this won’t work very well in practice. This basically means we want the heuristic to coalesce “too small” tags, and break up “too big” tagless sequences. The first is easier than the second: we can choose the chunk to pack at the migration time, i.e. when hatchery size hits some threshold. We can just set some constants as how big chunks we want, and chop up the load at the most appropriate tags.
As for the latter, we can decide at the same time: when hatchery grows too big and there’s no tag to use for breakup, we can auto-tag at appropriate place. Such heuristic is bound to make mistakes, but it is probably going to work well-enough.
Example
Let’s consider a format, where mature space consists of patch bundles (broken up at tags) and the hatchery is a small number of append-only files. We can put the hatchery bound at 2M, which becomes the maximum amount of unshareable data per branch. Let’s say we put the optimal pack size at 1.5M. For the GHC kind of repository, this is about 40 packs plus the hatchery and the pristine cache. If we can reconstruct the pristine cache “on the fly” (as data is received from network), the time required should be quite close to that of the actual download, without much time wasted for unpacking onto the filesystem. (Trying with current darcs from a local disk drive, replaying patches in the GHC repository gives processing speed of around 900KB/s. This probably needs some improvements.)
Pristine Cache
Since in the previous discussion, we side-stepped the issue of pristine cache, the whole system as described will still have an issue with “lazy” repositories — the loose pristine needs to be fetched. This could easily take about the same time as full get with the mentioned improvements, since we are facing some 1200 roundtrips (compared to 40 for the patches — still using the GHC repository as an example). On my current wireless connection, the average RTT to code.haskell.org is around 200ms, and from a wired university connection, it is about 140ms. Even in the case of the fast university network, we are facing some 168 seconds of roundtrip wait time when using “dumb” HTTP.
When doing a full copy of the repository, the number of files in pristine cache doesn’t really matter — it has a slight problem with fragmentation, but we could probably ignore that. However, the potential saving on download time for lazy repository is non-negligible: the size of the pristine cache is around 1/8 of the full history size for GHC. It is definitely possible to pack the pristine cache, however, the proportion of garbage is much higher in pristine cache than in the patch storage. The pristine cache would likely need to be periodically repacked and recombined to stay efficient. The generational approach will likely reduce this problem somewhat, since a part of the garbage will never leave the hatchery. The collection for mature space would probably need to keep track of utilisation factor, and replace packs that reach a certain non-utilisation threshold (say, 50%).
(Just a small sidenote: when we apply a patch, all affected files and all their parent directories need to be replaced in the pristine cache. This works just like git. However, in darcs, the older revisions of those files immediately become garbage, since there is nothing like a commit object to reference them: for darcs, this is purely a cache, since we reconstruct trees by applying and unapplying patches.)
Unified Storage
Now that we see that the pristine cache can use a very similar storage concept as the patches do, we could try to unify the patch storage and the pristine storage. Basically, patches behave like pristine, it’s just that the proportion of objects that fall dead is much lower. However, this unified storage would have to forgo the idea of “pack as a bundle”, since the model is more general now. It also poses some issues with applying patches as they are downloaded, since we no longer receive them in application order: we probably want the generalised packs to be sorted by hash, so we can have binary search indices for them. It would be also independent of tags, which is a mixed blessing.
Conclusion
It so far seems that the unified storage code is worthwhile. For one, it will make lazy repositories possible and efficient. Moreover, at cost of a slightly larger download, it can avoid replaying all patches upon repository copy, so the fact they are not sorted in application order does not hurt too much. We get packing that is independent of manual tagging and without need to add automatic tagging to the scheme. Overall, when using the general storage principle, things don’t seem to be overly complicated. It will probably take some trial and error to set the parameters (various thresholds and sizes) right. In the case of the unified storage, the net result should be a relatively simple and thin layer sitting between darcs and the filesystem, acting like sort of object storage: somewhat like a smarter git storage, with provisions for higher garbage rates (since these are virtually 0 with git).
posted 11.07.2009 7:11 pm09.07.2009: darcs 2.3 beta 2
Darcs 2.3 beta 2
Two weeks passed and it is about time to release darcs 2.3 beta 2 (it’s already a day late, sorry about that).
As with beta 1, there is only a single installation package for this release of darcs: cabalised source. (Please note that the final version with also come with the legacy autoconf-based buildsystem, for the last time.)
You can either download a tarball from http://repos.mornfall.net/darcs/darcs-2.2.98.2.tar.gz and build manually (see the build instructions in README inside the tarball), or, alternatively, you can use cabal-install to obtain a copy (the beta release is now available on Hackage):
$ cabal update
$ cabal install darcs-beta
This, if successful, will give you a darcs binary in ~/.cabal/bin — you
should probably add that to your PATH.
What’s new since beta 1
There are basically two relatively intrusive changes since beta 1. First, I have added index upgrade functionality to hashed-storage and this is now used by darcs, so that any bad or incompatible indexes are recreated transparently by darcs. Moreover, the index format is now architecture-independent, meaning that repositories shared across multiple architectures should not suffer from excessive index rebuilding.
Second, Ganesh has done further work on his gzip CRC correction code. There is a very slight risk of regressions, so if you have around any repositories with broken CRCs in them, please test this functionality. Moreover, there is now an option to limit the repair to current repository, avoiding changes in related branches or caches.
Trent, Thorkil and Eric have done cleanup work and a number of minor fixes for this release, improving overall quality and polish. One bug in the new indexing code has been found and reported by Eric and Guillaume independently, and is fixed in hashed-storage 0.3.4 (which required by this beta).
You can track the release plan and progress at our wiki: http://wiki.darcs.net/development.
The question of GHC 6.8
Using GHC 6.10.3 or newer is strongly recommended. You may compile darcs with GHC 6.8, but there are several caveats. If you are using 6.8.2 or older, please disable mmap support (pass -f-mmap to cabal install or runghc Setup configure below). Note that the GHC 6.8.2 that ships with Debian Lenny is not affected and it should be safe to keep mmap enabled. It is also recommended to disable use of Hackage zlib when compiling with GHC 6.8.2 (including the Debian Lenny version): pass -f-zlib to cabal. When using zlib, we have seen occasional crashes with error messages like “openBinaryFile: file locked” — this is a known GHC 6.8.2 bug (and is fixed in GHC 6.8.3). Last, if you are using a 64-bit system, darcs may hang when you exit a pager when compiled with GHC older than 6.10.3. Although this is harmless, it is quite inconvenient.
Overall, the status of GHC 6.8 is “semi-supported”: for many cases, things will work just fine, especially if you take a little extra caution with compilation flags.
posted 09.07.2009 12:54 pm