Introduction
This is probably the most active bit of the site: I post random thoughts, events and articles to this blog. It is riddled with personal issues and technical items alike, while being constantly out of date. Read at your own risk. If you are interested in my person (odd as that may sound), you should probably read something about me. If you are only interested in technical articles, see tech. Poetry has a completely separate blog of its own. Older articles (as well as postings on the previous blogs I had) are to be found in archive. That’s probably it.
Recent Entries
30.06.2009: soc progress 6
soc progress 6
First, last week’s TODO: endianness conversion and the magic word in the index are now done. Other than that, I have been working on testing of hashed-storage, and I am happy to announce that I now have a working Arbitrary Tree instance and a bunch of QC properties, and also that the unit tests now run outside of a darcs checkout of hashed-storage. A few still need a “darcs” binary present on the system, but I’ll fix that in a bit (this just involves storing the relevant darcs outputs in the test data).
I have also fixed a stupid bug in hashed-storage that caused issues with darcs whatsnew in 2.3 beta 1 involving subtree queries of 2nd-level or deeper subdirectories. Together with the index upgrade functionality, this is now part of the 0.3.4 release of hashed-storage.
In other news, I have fixed a bug in mmap that caused current darcs-hs to fail tests. While at it, I also improved error messages in mmap. Thanks to Gracjan’s prompt response, mmap-0.3 including my changes is now available through Hackage.
Not much else: it was not a very exciting week, but a few things have solidified and darcs 2.3 is starting to look good.
The summary of hashed-storage changes for the week:
- Add a bunch of QC properties for Tree manipulation.
- Add an Arbitrary instance and a relaxed equality function for Tree.
- Rudimentary Show instance for Tree.
- Make the tests run outside of the hashed-storage darcs repo, too.
- Set up repository test.
- Use quickcheck2, since darcs uses that and cabal dislikes mixing QC 1 and 2.
- Bump version to 0.3.4.
- Give proper exit code in “cabal test”.
- Drop redundant expansion in test.
- Check that index upgrade copes with bad or unknown index.
- Remove _darcs/index before testing, if it exists.
- Make the index format host-endianity-independent.
- Fix some Wall noise.
- Implement xlate32/xlate64 endianness conversion functions.
- Add a function to read index and check its version, upgrading if needed.
- Keep a magic number at start of the binary index.
- Add a test for recent expandPath bug and combine the tree/generic groups.
- Refactor tests and port to test-framework.
- Fix typo in expandPath that broke the recursive case.
and for darcs-hs:
- Bump hashed-storage dependency to 0.3.4.
- Clean up unused imports in WhatsNew.
- Add a TODO comment to the inefficient look-for-adds implementation.
- Wall policy for Gorsvet’s applyToTree.
- Use the new hashed-storage index upgrade functionality.
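The host-independent index format mentioned in this entry comes down to always writing integers in one fixed byte order, whatever the host uses. The following is a minimal sketch of that idea in Python, not the hashed-storage code: the function names are hypothetical, and the choice of little-endian as the on-disk order is an assumption, since the post does not say which order was picked.

```python
import struct

# Assumption: little-endian on disk. The post only says the format became
# independent of the host byte order, not which order was chosen.
DISK_ORDER = "<"


def to_disk32(value: int) -> bytes:
    """Encode an unsigned 32-bit integer in the fixed on-disk byte order."""
    return struct.pack(DISK_ORDER + "I", value)


def from_disk32(raw: bytes) -> int:
    """Decode an unsigned 32-bit integer written by to_disk32."""
    return struct.unpack(DISK_ORDER + "I", raw)[0]


def to_disk64(value: int) -> bytes:
    """Encode an unsigned 64-bit integer in the fixed on-disk byte order."""
    return struct.pack(DISK_ORDER + "Q", value)


def from_disk64(raw: bytes) -> int:
    """Decode an unsigned 64-bit integer written by to_disk64."""
    return struct.unpack(DISK_ORDER + "Q", raw)[0]


def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value, which is what a converter
    has to do on a host whose native order differs from the disk order."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")
```

Because the byte order is spelled out in the format string, the same bytes come out on any host, so an index written on one machine can be read on another.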
24.06.2009: darcs 2.3 beta 1
Darcs 2.3 beta 1
I’d like to announce the immediate availability of a first beta release of darcs 2.3. There are a number of improvements and bugfixes over the last stable release (2.2). Moreover, work has been done on the performance of “darcs whatsnew” for large repositories. This has also introduced a slight risk of regressions, but please note that all of the disruptive changes are in read-only code paths: the new code will never touch your repository, so it is unable to cause permanent harm. The worst that could happen is that you get no diff or a bad diff from “darcs whatsnew”. (This is also a reason why we need your testing!)
There is only a single installation package for this release of darcs: cabalised source. (Please note that the final version will also come with the legacy autoconf-based buildsystem, for the last time.)
You can either download a tarball from http://repos.mornfall.net/darcs/darcs-2.2.98.1.tar.gz and build manually (see the build instructions in README inside the tarball), or, alternatively, you can use cabal-install to obtain a copy (the beta release is now available on Hackage):
$ cabal update
$ cabal install darcs-beta
This should give you a darcs binary in ~/.cabal/bin; you should probably add that to your PATH.
(Note: The package name on Hackage is different, since people installing darcs from Hackage are expecting a stable version. The name change means that you cannot use the Hackage version to build other packages that depend on the darcs library: you either need the tarball for this, or you can use the stable version from Hackage.)
Here is a quick (possibly incomplete) list of important changes:
- lots and lots of documentation changes (Trent)
- haskeline improvements (Judah)
- cabal as default buildsystem (many contributors)
- fixes in darcs check/repair memory usage (Bertram, David)
- performance improvement in subtree record (Reinier)
- --summary --xml (Florian Gilcher)
- changes --max-count (Eric and myself)
- fix changes --only-to-files for renames (Dmitry)
- performance fix in “darcs changes” (Benedikt)
- hardlinks on NTFS (Salvatore)
- coalesce more changes when creating rollbacks (David)
- new unit test runner (Reinier)
- darcs-shell in contrib (László, Trent)
- .authorspellings (Simon) — I find this to be controversial though
- working directory index and substantial “darcs wh” optimisation (myself)
- gzip CRC checking and repair feature (Ganesh)
and there are a number of issues that have been resolved since 2.2:
- 948: darcsman (Trent)
- 1206: countable nouns (Trent)
- 1285: cabal test v. cabal clean (Trent)
- 1302: use resolved, not resolved-in-unstable (Trent)
- 1235: obliterate --summary (Rob)
- 1270: no MOTD for --xml-output (Lele)
- 1311: cover more timezones (Dave)
- 1292: re-encoding haskeline input (Judah)
- 1313: clickable ToC and refs in PDF manual (Trent)
- 1310: create merged \darcsCommand{add} (Trent)
- 1333: better “cannot push to current repository” warning (Petr)
- 1347: (autoconf) check for unsafeMMapFile if mmap use enabled (Dave)
- 1361: specify required includes for curl in cabal file (Reinier)
- 1379: remove libwww support (Trent)
- 1366: remove unreachable code for direct ncurses use (Trent)
- 1271: do not install two copies of darcs.pdf (Trent)
- 1358: encode non-ASCII characters in mail headers (Reinier)
- 1393: swap “darcs mv” and “darcs move” (Trent)
- 1405: improve discoverability of global author file (Trent)
- 1402: don’t “phone home” about bugs (Trent)
- 1301: remove obsolete zsh completion scripts (Trent)
- 1162: makeAbsolute is now a total function (Ben F)
- 1269: setpref predist - exitcode ignored bug (Ben M)
- 1415: --edit-long-comment, not --edit-description, in help (Trent)
- 1413: remove duplicate documentation (Trent)
- 1423: complain about empty add/remove (Trent)
- 1437: Implement darcs changes --max-count (Eric)
- 1430: lazy pattern matching in (-:-) from Changes command module (Dmitry)
- 1434: refactor example test (Trent)
- 1432: refer to %APPDATA%, not %USERPROFILE% (Trent)
- 1186: give a chance to abort if user did not edit description file (Dmitry)
- 1446: make amend-record -m foo replace only the patch name (Dmitry)
- 1435: default to get --hashed from a darcs-1.0 source (Trent)
- 1312: update and reduce build notes (Trent)
- 1351: fix repository path handling on Windows (Salvatore)
- 1173: support hard links on NTFS (Salvatore)
- 1248: support compressed inventories for darcs-1 repos (Ganesh)
- 1455: implement “darcs help environment” (Trent)
23.06.2009: soc progress 5
soc progress 5
Last week has been more active again. I have mostly finished my 2.3 agenda, I have been thinking about the repository format and I have started doing some post-2.3 work as well.
In the pre-2.3 department, these have been mostly API cleanups and finishing bits. There are also a few pieces missing in the puzzle. I wanted to add a magic word to the start of the index, so we could quickly identify an old (or future) version and discard it immediately (triggering a full index rebuild). This is not particularly destructive, since we can easily use a magic word that won’t match any realistic index start (for the current format). The other thing that missed the 2.3 beta 1 is the endianness conversion for the on-disk format. This can be done together with the magic word.
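The magic-word check described above can be sketched in a few lines. This is only an illustration in Python under stated assumptions: the four-byte value `HSI1` and the function names are made up for the example and are not what hashed-storage uses.

```python
MAGIC = b"HSI1"  # hypothetical magic word, chosen only for this sketch


def write_index(payload: bytes) -> bytes:
    """Prefix the serialised index entries with the magic word."""
    return MAGIC + payload


def read_index(raw: bytes):
    """Return the payload if the magic word matches, otherwise None.

    None signals an old, future or otherwise unknown format."""
    if raw[:len(MAGIC)] != MAGIC:
        return None
    return raw[len(MAGIC):]


def load_or_rebuild(raw: bytes, rebuild):
    """Use the stored index when it is recognised, else rebuild from scratch."""
    payload = read_index(raw)
    return payload if payload is not None else rebuild()
```

Discarding an unrecognised index is safe in this scheme because the index is only a cache: the `rebuild` callback can always regenerate it from the working tree and the pristine state.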
For the repository format, so far I have been thinking about how to efficiently pack mostly static data (think git object database) on-the-fly (as opposed to a manual, git-gc style approach) while avoiding excessive re-downloading (and re-packing). So far, the best I could think of was that I can keep, say, at most 16 objects in completely unpacked form. When this threshold is reached, I take 8 of these files and pack them up into a single indexed object (with these 8 items as sub-objects). Now I can have a fixed-size header, keeping hashes and offsets (and sizes) of those 8 sub-objects. This would work recursively: composed objects could again be composed into bigger composed objects. This way, we would have around 16 files on average representing the whole repository. It would also be easy to only download the relevant parts of the newly appeared files from remote repositories: we can grab the header and then the unknown sub-objects with HTTP range requests.
We would also need to map from primitive object hashes to their current locations (basically all their parent patches). I’ll have to think more about a suitable data structure for this purpose. Finally, we would still need a gc-style command, since an object filesystem used by darcs would accumulate unreferenced garbage, especially if we also used the system for the pristine cache. Moreover, the purely academic N-ary tree approach would suffer from performance problems, so some real-world hacks will be needed to make things work out in practice. (But the tree structure should be useful to show some bounds on the complexity of particular operations.)
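To make the packing idea above concrete, here is a rough sketch in Python. The thresholds 16 and 8 come from the text; everything else is an assumption for illustration, including the header layout (SHA-256 hash, 64-bit offset, 64-bit size per entry), the use of SHA-256 at all, and every function name. A byte slice stands in for an HTTP range request.

```python
import hashlib
import struct

LOOSE_LIMIT = 16  # from the post: at most 16 unpacked objects
PACK_SIZE = 8     # from the post: pack 8 of them at a time
ENTRY = struct.Struct("<32sQQ")  # hash, offset, size (layout is an assumption)
HEADER_SIZE = PACK_SIZE * ENTRY.size


def pack(blobs):
    """Pack exactly PACK_SIZE blobs behind a fixed-size header."""
    if len(blobs) != PACK_SIZE:
        raise ValueError("a pack holds exactly %d objects" % PACK_SIZE)
    header, body, offset = b"", b"", HEADER_SIZE
    for blob in blobs:
        header += ENTRY.pack(hashlib.sha256(blob).digest(), offset, len(blob))
        body += blob
        offset += len(blob)
    return header + body


def read_header(packed):
    """Parse only the fixed-size header into (hash, offset, size) triples."""
    return [ENTRY.unpack_from(packed, i * ENTRY.size) for i in range(PACK_SIZE)]


def fetch(packed, offset, size):
    """Stand-in for an HTTP range request for bytes [offset, offset + size)."""
    return packed[offset:offset + size]


def add_object(loose, packs, blob):
    """Add a loose object; at the limit, pack the oldest PACK_SIZE of them."""
    loose.append(blob)
    if len(loose) >= LOOSE_LIMIT:
        packs.append(pack(loose[:PACK_SIZE]))
        del loose[:PACK_SIZE]
```

Since a pack is itself just a byte string, eight packs can be fed to `pack` again, which gives the recursive composition described above. A client that already has some of the sub-objects would read the header, compare hashes, and `fetch` only the ranges it is missing.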
Finally, for the post-2.3 bits. In darcs-hs, I have bitten the bullet and flipped all unrecorded-state (basically pristine -> working copy diffing) machinery over to Gorsvet’s unrecordedState (implemented using Index and hashed-storage). This might have introduced some performance regressions, sadly. However, the thing now completely passes the testsuite (after I fixed a bug in the mmap package… I have to submit a patch to the upstream author). Nevertheless, this also means obliteration of a chunk of old code, and a complete fix for the timestamp de-synchronisation issue of current darcs. There’s still a bunch of work to do, which would allow complete removal of unsafeDiff and a bunch of related functionality.
Finally, changes for this week… hashed-storage:
- Move darcs-specific utilities to separate module (Storage.Hashed.Darcs).
- Export the TreeIO alias from Monad.
- Also parametrise the Tree hashing function in readIndex.
- Replace all unfold terminology with expand (breaks API).
- Remove unused bit in Index.
- Fix a silly bug in AnchoredPath parents.
- Fix compilation of tests.
- Further simplify AnchoredPath parents.
- Do not forget to include Storage.Hashed.Test in distribution.
- Fix AnchoredPath parents again.
- Bump version to 0.3.3.1.
- Fix build with GHC 6.8.2 (needs extension field in cabal). Bump version.
… and darcs-hs:
- Basic “show index” implementation.
- Also curse haskell_policy in Czech.
- Clean up unused bits in Darcs.Gorsvet.
- Use TreeIO alias in instance declarations (do not spell out the type).
- Import darcsFormatHash from Storage.Hashed.Darcs.
- Update to reflect Index API change, provide darcs-specific readIndex in Gorsvet.
- Unfold has been renamed to ‘expand’ in Storage.Hashed.Tree.
- Also provide “darcs show pristine” to go with darcs show index.
- Put blank lines between command groups in “darcs help”.
- Cut down descriptions, so that darcs help does not wrap on an 80-column TTY.
- Make “darcs clone” a hidden alias for “darcs get”.
- Flip “darcs changes” to index-based diffing.
- Flip “darcs mark-conflicts” over to index-based diffing.
- Use index-based diffing in Remove.
- Flip AmendRecord to index-based diffing, too.
- Use index-based diffing in unrevert.
- Make revert use index-based diffing.
- Also use index-based diffing in unrecord/obliterate.
- Provide readRecorded in Gorsvet as well.
- Factor out applyToTree in Gorsvet.
- Use index-based diffing in “darcs wh -l”.
- Unexport get_unrecorded* from Repository, remove unused functions from Internal.
- Move tentativelyMergePatches and friends to a new module, Repository.Merge.
- Move add_to_pending to Repository, use unrecordedChanges.
- Clean up unused bits from Repository.Internal.
- Invalidate the index in add_to_pending, as it was getting rebuilt too soon.
- Remove unused import from Gorsvet.
And I need to sleep now. I’m in Berlin now, so I’ll probably be fairly unproductive till about Saturday. I’ll sort out the 2.3 beta 1 tomorrow, since I really really need to sleep now. Goodnight!
posted 23.06.2009 11:41 pm
17.06.2009: soc progress 4
soc progress 4
I have slipped the report a day, but yesterday I was mostly engaged in studying various proofs in Petri Net theory for today’s exam. Everything went well as far as I can tell, although I’ll only be able to tell for sure once the test is graded.
Anyway, I have done some work despite studying for exams being a massive timesink. Moreover, a lot of time went into chasing ghosts, unfortunately. The performance regression I mentioned last time turned out to be not really a performance regression, more an oddity in the behaviour of my CPU frequency scaling… it managed to get stuck at 800MHz from time to time, doubling the time needed to do a darcs whatsnew. It wouldn’t be so bad if it stayed stuck there, but it non-deterministically managed to get unstuck from time to time, so after some changes I made to the source, the performance suddenly jumped back to the original numbers, even though it didn’t make any sense. After a few hours of cursing Haskell in general and GHC in particular, I figured out that both are innocent. So much for the regression. At least I took the opportunity to clean up and refactor the Tree unfold function (which still needs to be renamed; noting that among the things that need to be done before the freeze…).
Other than that, I have looked into the darcs wh path slowness and managed to come up with a reasonable fix, involving creative use of the TreeIO monad from hashed-storage. The numbers from the ghc-testsuite hashed repository are these:
darcs-2.2 wh 0,92s user 0,14s system 98% cpu 1,082 total
darcs-2.2 wh mk 0,20s user 0,02s system 90% cpu 0,246 total
darcs wh 0,06s user 0,04s system 94% cpu 0,105 total
darcs wh mk 0,02s user 0,00s system 94% cpu 0,021 total
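For a quick read of those four lines, the wall-clock totals can be turned into speedup factors. The snippet below is just that arithmetic in Python; the parsing regular expression and the variable names are ad hoc, and the timings are copied verbatim from above (with a decimal comma).

```python
import re

TIMINGS = """\
darcs-2.2 wh 0,92s user 0,14s system 98% cpu 1,082 total
darcs-2.2 wh mk 0,20s user 0,02s system 90% cpu 0,246 total
darcs wh 0,06s user 0,04s system 94% cpu 0,105 total
darcs wh mk 0,02s user 0,00s system 94% cpu 0,021 total
"""

# command, then "<user>s user ... <total> total"
LINE = re.compile(r"^(?P<cmd>.+?) [\d,]+s user .* (?P<total>[\d,]+) total$")


def parse(text):
    """Map each command to its wall-clock total in seconds."""
    totals = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            totals[match.group("cmd")] = float(match.group("total").replace(",", "."))
    return totals


totals = parse(TIMINGS)
speedup_full = totals["darcs-2.2 wh"] / totals["darcs wh"]
speedup_path = totals["darcs-2.2 wh mk"] / totals["darcs wh mk"]
print(f"whole tree: {speedup_full:.1f}x, single path: {speedup_path:.1f}x")
```

Both ratios come out a little above ten (about 10.3 for the whole tree and about 11.7 for the `mk` path), so the path-restricted query now gets roughly the same benefit as the full one. These are single measurements, so the second decimal should not be taken seriously.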
Changes for this week… hashed-storage:
- Make lcs an optional dependency (use -fdiff to get the Diff module).
- Refactor and beautify Tree unfold.
- Use mmap instead of bytestring-mmap in readSegment (in Utils).
- Cut down build-depends (no longer need bytestring-mmap nor binary).
- Add fileExists and exists to Monad (unlike find*, these will unfold as needed).
… and darcs-hs:
- Use the mmap package instead of bytestring-mmap.
- The hashed-storage index functions take a filename now.
- Use index-based diffing in Record.
- Use local pendingChanges instead of read_pending, in readRecordedAndPending.
- Explicit import list for Storage.Hashed.Monad in Gorsvet.
- Optimise the file existence checking in whatsnew .
I am now off to play some bassoon and right after that I’ll finally have darcs hacking time. I guess some administrativia is in order (sending the outstanding patches to darcs-users@) and then I’ll probably focus on something lighter, like adding darcs show index to darcs-hs, so I get back into darcs context.
09.06.2009: soc progress 3
soc progress 3
This week has been a little weaker. A lot of non-SoC stuff interfered, not least exams. So far I have passed 2, with 2 more to go (but these are the harder ones, and I spend a fair amount of time preparing for those). Of course, bassoon takes its chunk as well.
I have worked on documenting the index format (and code), some minor cleanup and interoperability improvements. I have some unrecorded changes making the lcs dependency optional as well, and I need to sort out the bytestring-mmap versus mmap thing (i.e. drop bytestring-mmap and use mmap everywhere, consistently).
Anyway, I have also tracked down an important performance regression in current hashed-storage (it came with an innocent bug fix, and is about a factor of 2 slowdown … oops). The bad patch is “Do not miss Stubs that are hidden in a SubTree when unfolding.” Go figure. I have also found another performance problem, in “darcs wh filename”, where darcs reads the full pristine cache to check whether the given filename exists in the repository (this can be made much more efficient using hashed-storage). I’ll fix both in the following week, hopefully.
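The reason an existence check does not need the whole pristine cache is that a tree with unexpanded "stub" subtrees only has to be expanded along the path being asked about. Below is a toy model of that in Python. It is not the hashed-storage API: the `Stub` class, `expand_path` and `exists` are names invented for this sketch, and plain dicts stand in for directories.

```python
class Stub:
    """An unexpanded subtree; load() reads its contents from storage."""

    def __init__(self, load):
        self.load = load


def expand_path(tree, path):
    """Expand stubs only along `path`; return the item found there, or None."""
    node = tree
    for name in path:
        if not isinstance(node, dict) or name not in node:
            return None
        if isinstance(node[name], Stub):
            node[name] = node[name].load()  # replace the stub in place
        node = node[name]
    return node


def exists(tree, path):
    """True if `path` names a file or directory, expanding as little as possible."""
    return expand_path(tree, path) is not None


# Usage: record which subtrees actually get loaded.
loads = []


def stub(name, content):
    def load():
        loads.append(name)
        return content
    return Stub(load)


tree = {
    "src": stub("src", {"Main.hs": b"main = return ()"}),
    "docs": stub("docs", {"manual.tex": b"..."}),
}
found = exists(tree, ["src", "Main.hs"])
missing = exists(tree, ["src", "Nope.hs"])
```

After the two queries, `loads` contains only `"src"`: the `docs` subtree was never read, which is the saving the paragraph above is after.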
Last, Eric has helped me set up some wiki bits for the SoC project itself and for hashed storage.
Finally, the mandatory list of changes… The parenthesised changes are not on the public branches… the rollback is there mostly because it reverts to a more strict behaviour, although buggy darcs patches require a more benevolent one (there are supposedly move patches out there with non-existent “from” files). The Cabal 1.7.1 thing is really nice, but will have to wait till Cabal 1.8 is released… if you are interested, there’s a patch bundle for upstream darcs in darcs-users@ archive.
Hashed-storage:
- Do not build the unit tests by default (avoids dependency on new process lib).
- Bump version to 0.3.1.
- Make the createIndex Stub error more specific.
- Add a makeName utility function to AnchoredPath.
- Make rename of a non-existent file a non-fatal error (in TreeIO).
- Check that unfolded Tree does not have any Stubs in it.
- Really make the process dependency optional.
- Bump required cabal version to 1.6 (required for source-repository).
- Fix typo in description in the cabal file.
- Bump version to 0.3.2.
- Tighten the base dependency since hackage complains otherwise.
- Do not miss Stubs that are hidden in a SubTree when unfolding.
- (rollback: Make rename of a non-existent file a non-fatal error (in TreeIO).)
- Improve Index documentation.
- Make peekItem safe even when dirlen is Nothing.
- Un-hardcode _darcs/index from the Index module.
- Bump version to 0.3.3.
And darcs-hs:
- (Link darcs executable and unit tests against the darcs library (Cabal >= 1.7.1).)
- (Move non-darcslib modules out of src, so we can build the unit tests.)
- We need to unfold the pristine Tree before rebuilding the index.
- Fix index invalidation in the move command.
- Bump the hashed-storage dependency to >= 0.3.2.
02.06.2009: soc progress 2
soc progress 2
As for the last week… There have been some unexpected developments. The hashed-storage patches have started to trickle into mainline darcs instead of (as I had expected) sitting on my private branch for a while. This is partially due to my RM decision to try to push for indexed whatsnew in darcs 2.3 (and Eric dutifully started to push patches into mainline, conjuring review coverage out of thin air…)
Either way, this has probably caused some intermittent breakage, although everything should be back on track with the latest patches. Also, due to some cabal vs ghc 6.8 vs process-1.0.1 bug, installing hashed-storage has caused problems for people on 6.8. Since that library is only required for the unit tests of hashed-storage, I have made this optional and people should be able to install hashed-storage 0.3.1 on 6.8 without much trouble (however, Hackage does not let me upload it right now, and I’m not sure why… I’ll try to resolve that ASAP).
Nevertheless, I am locally using darcs-hs with Record flipped over to use indexed diffing. This goes a long way to improve testing coverage of the index code, since almost everything in the testsuite relies on the record code (whereas the whatsnew code is only mildly tested). With the current versions of hashed-storage and darcs-hs, everything passes just fine, and most bugs I could find with previous incarnations are fixed (mostly pertaining to pending renames and to subtree queries). Unfortunately, doing a subtree query is for some reason relatively slow (although still reasonably faster than say darcs 2.2). I’ll look into that in the following week.
Moreover, I should start looking into getting us a new pristine format that I have promised in my application (the indexed working directory is, obviously, just a part of the whole deal).
In the last week, hashed-storage library has seen these changes:
- Properly decode whitespace when reading darcs hashed pristine.
- Provide unfoldPath to partially unfold a (stubbed) Tree.
- Use partial unfolding in TreeIO monad.
- Bump version to 0.3.
- Update cabal (description and category).
- Resolve conflict in Monad.
- Add haddock to TreeState.
- Resolve conflicts in the cabal file.
- Improve AnchoredPath API and haddocks.
- Do not build the unit tests by default (avoids dependency on new process lib).
- Bump version to 0.3.1.
And darcs-hs:
- Extend the weird filenames part of the whatsnew test to cover indexed filenames.
- Run tests in groups even when only a part of the testsuite is being executed.
- Resolve issue1229: strictify checkPristineAgainstSlurpy.
- Resolve conflict with import list cleanup in Setup.lhs.
- Put back Setup into a Wall-clean state, robustify error conditions a little.
- Add comment explaining strictification of checkPristineAgainstSlurpy
- TreeIO is smart enough now to unfold as needed.
- Version the build dependency on hashed-storage.
- Fix the path restriction versus pending renames in unrecordedChanges.
- Factor out a separate boring_regexps in Darcs.Repository.Prefs.
- Provide a restrictBoring (like restrictSubpaths) in Darcs.Gorsvet.
- Fix Tree restriction in various cases of unrecordedChanges.
- Also invalidateIndex in Revert and Remove.
- Remove tentativelyMerge from Gorsvet, as it’s unused and confusing.
- Take a list of paths in unrecordedChanges instead of Tree transform.
- Drop extra parens.
- Fix witnesses in Darcs.Gorsvet.
24.05.2009: soc progress 1
soc progress 1
I started doing the real work a few days back. So what gives? I have branched darcs and started porting the relevant bits over to hashed-storage. Along the way, hashed-storage has received some improvements. For the most part, these were darcs compatibility improvements (in the darcs hashed pristine code) and improvements in the tree diffing department. The tree diff is now fully symmetrical, which is required for --look-for-adds. Efficiency has suffered a little, but I don’t quite expect this to show up on profiles.
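What "symmetrical" buys here can be shown with a toy diff over flat path-to-content maps. This is a sketch in Python, not the diffTree from hashed-storage: the function name and the flat-dictionary representation are assumptions made to keep the example short.

```python
def diff_trees(left, right):
    """Compare two {path: content} maps.

    Returns (only_left, only_right, changed), each a sorted list of paths.
    Both sides are walked in full, so files that exist on just one side
    are reported whichever side that is."""
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    changed = sorted(p for p in left.keys() & right.keys() if left[p] != right[p])
    return only_left, only_right, changed


pristine = {"README": b"hello", "src/Main.hs": b"main = return ()"}
working = {"src/Main.hs": b"main = print 1", "src/New.hs": b"module New where"}

removed, added, modified = diff_trees(pristine, working)
```

A diff driven only by the files already known to the pristine side would find `README` and `src/Main.hs` but never `src/New.hs`. Reporting `only_right` as well is what lets untracked files in the working copy show up, which is the case --look-for-adds cares about.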
In darcs, I have mostly implemented safe index manipulation. (I.e. not allowing index to get out of date with regards to tracked files… The nature of the index requires that each tracked file is present in the index, so that we don’t need to read the actual working or pristine directory contents.)
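The general shape of an index that avoids reading file contents looks like the sketch below. The post does not spell out what the darcs index stores, so the record used here (size, modification time and a SHA-256 hash per tracked file) is an assumption, as are all the function names; it is written in Python for brevity.

```python
import hashlib
import os
import tempfile


def stat_key(path):
    """Cheap fingerprint obtained from stat alone, without reading the file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def build_index(paths):
    """Record (stat fingerprint, content hash) for every tracked file."""
    index = {}
    for path in paths:
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        index[path] = (stat_key(path), digest)
    return index


def changed_files(index):
    """Tracked paths that are missing or whose content no longer matches.

    A file whose stat fingerprint is unchanged is skipped without being read."""
    changed = []
    for path, (key, digest) in index.items():
        if not os.path.exists(path):
            changed.append(path)
        elif stat_key(path) != key:
            with open(path, "rb") as handle:
                if hashlib.sha256(handle.read()).hexdigest() != digest:
                    changed.append(path)
    return changed


def demo():
    """Index two files, modify one, and report what changed before and after."""
    with tempfile.TemporaryDirectory() as root:
        a, b = os.path.join(root, "a.txt"), os.path.join(root, "b.txt")
        for path in (a, b):
            with open(path, "wb") as handle:
                handle.write(b"hello")
        index = build_index([a, b])
        before = changed_files(index)
        with open(a, "wb") as handle:
            handle.write(b"hello, world")  # size changes, so the fingerprint does
        after = [os.path.basename(p) for p in changed_files(index)]
        return before, after
```

This also shows why every tracked file has to be in the index: a file with no entry cannot be skipped on the strength of its stat data, so it would have to be read and compared the slow way.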
Unfortunately, the index still doesn’t work very well with paths that have spaces in them (which is weird, since the index doesn’t particularly care about what is stored in the path, but I’ll investigate that later). This also means that I can’t test on ghc-hashed, but I can test on ghc-testsuite, which seems to be the more interesting of the two, anyway. The numbers (with hot cache in both cases):
darcs wh 0,87s user 0,12s system 94% cpu 1,046 total
darcs-hs wh 0,06s user 0,03s system 84% cpu 0,100 total
That gives about a tenfold speedup for whatsnew on hashed repositories. This also fixes the infamous “timestamps get out of sync all the time” bug, which usually manifests as darcs taking an extraordinarily long time “reading pristine”. Branching the ghc-testsuite repo, I get (in the newly created branch, which has broken timestamps wherever hardlinks work; hot cache again):
darcs wh 5,91s user 0,56s system 91% cpu 7,033 total
To get back to darcs-hs, it seems that, at least on my machine, it manages to pass the darcs testsuite (although it took some tweaking to get there). Nevertheless, there are some further issues I have discovered that the suite does not cover. Still, at least for now, it should be safe to use darcs-hs, as the code is “read-only”: it is only used for whatsnew, never for creating patches.
Next week, I’ll work some more on getting record to use the new diffing code (index-based, that is). I have already started, but I’m still failing a bunch of tests and they are not trivial to fix yet. Also, I should look into getting back the optimised version of the filepath-restricted diff. I had to disable it since it’s not clear how to make it work with pending renames (the original darcs approach doesn’t apply to my version, sadly).
That’s it, I’m attaching a summary of changes on the individual repositories. The first one is hashed-storage (get from http://repos.mornfall.net/hashed-storage):
- Make the diffTree implementation symmetric.
- Implement unlink and rename in Monad.
- Omit missing files when reading an indexed tree.
- Handle empty hashed pristine directories (they may omit gzip header).
- Allow item/subtree removal in modifyTree.
- Implement zipAllFiles, zipAllDirs.
- Add emptyBlob, in addition to emptyTree.
- Concede to also darcs-formatting sha1 sums sans the size prefix.
- Fix cabal build-type to custom (we implement cabal test now).
- Allow checking for file existence as a result of stat.
The other one is darcs-hs, from http://repos.mornfall.net/darcs/darcs-hs:
- Factor out a common bit in WhatsNew.lhs.
- Import relevant bits of gorsvet, for now under Darcs.Gorsvet.
- Handle adds and removals in treeDiff.
- Kill a bunch of unused imports.
- Convenience wrapper for restrict_paths for use in Darcs.
- Make the trailing newline shuffling in treeDiff a little less fragile.
- Appease haskell_policy. (Sigh.)
- Disable restriction in unrecordedChanges for now (less efficient but correct).
- Implement basic index maintenance functionality.
- Bomb out from unrecordedChanges when pending is buggy.
- Invalidate index at key positions in relevant (pristine-modifying) commands.
- Use index for diffing in the basic whatsnew scenario.
19.05.2009: summer of code
Summer of Code 2009: Fast Darcs
It’s not really news anymore, but my SoC 2009 proposal got accepted. Among other things, it means that I’ll be reporting semi-regularly to this blog about it. This first post is sort of an introduction.
About the project
My project revolves around the idea of fast darcs for medium and large repositories. There are quite a few haskellers who use darcs in their day to day (haskell) work. A fair number of hackage packages are maintained in darcs. Even though many of these repositories are of a relatively modest size, there are a number of relatively large real-world darcs repositories out there. The primary target of the project is to improve scalability of darcs for large working trees. This should help those users with existing large darcs repositories, as well as encourage people to use darcs for larger projects, whenever the development model fits.
Unfortunately, this work alone is not sufficient to make managing a large project with darcs all smooth sailing. There are a number of improvements that need to be made, not least in the darcs core. Nevertheless, this project should bring darcs one step from “not recommended for large projects” to “reasonably well suited for large projects” where we would like to be a few years from now.
And excitement? Hard to tell. The project is sort of boring. In fact, it’s designed to make darcs less noticeable in day to day operation. What might get you, as a haskeller, excited is probably that I intend to make the darcs working tree handling comparably fast to git. And then, git is written in C, hand-tuned for a specific operating system. And unlike mercurial, I do not plan to introduce a C library for low level routines. So let’s prove that Haskell is up to the challenge.
Optimising darcs
Darcs is known to have performance issues when repositories extend beyond a certain size. There are several “directions” in which repositories grow; out of these, the size of the working tree and the length of history are my primary focus. Common commands involving the working tree should be comfortable to use and should not introduce delays in workflow. Moreover, big repositories should be stored efficiently and it should be possible to efficiently retrieve both full repositories and new patches remotely.
Goals
To formulate some tangible goals, we should establish some context. In darcs, operations that involve the working tree usually need to compare this working tree to its pristine cache, to produce a patch, to either show it to the user, or to record it, or to revert it, and so on.
Implementation of this pristine cache is crucial for performance. Producing an implementation of pristine cache that would allow high performance comparisons is the first and foremost goal. However, even though speedy execution is important, robustness is absolutely vital. Pristine corruption can directly lead to producing corrupt patches which might cause serious issues to repository integrity, possibly even leading to serious data loss. The “hashed” pristine format used by darcs 2.x is resilient against this kind of corruption, and the proposed pristine format is required to keep this property.
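The resilience of a hashed format comes from naming every object after the hash of its content, so damage is detectable on read. The following Python sketch shows only that principle; it is not the darcs on-disk layout, and the class name, the in-memory dictionary and the use of SHA-256 are assumptions for illustration.

```python
import hashlib


class HashedStore:
    """Toy content-addressed store: an object's name is the hash of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        """Store content under its own hash and return that name."""
        name = hashlib.sha256(content).hexdigest()
        self._objects[name] = content
        return name

    def get(self, name: str) -> bytes:
        """Return the content, refusing to hand out anything that was damaged."""
        content = self._objects[name]
        if hashlib.sha256(content).hexdigest() != name:
            raise ValueError("object %s is corrupt" % name)
        return content
```

A corrupted object can never be silently mistaken for a good one, because its bytes no longer hash to its name. In this model, any replacement format only has to keep the rule that names are derived from content to retain the property.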
In addition to using the pristine cache for local operations, the cache is also retrieved when a copy of a remote repository is being made (through “darcs get”). This means that it needs to be possible to efficiently obtain the pristine cache through HTTP requests, and this is another requirement for our implementation.
Apart from pristine cache, there are quite similar requirements for patch storage. This partially relates to the “long history” kind of repository, and it would be beneficial to share as much code between pristine cache and patch storage as possible. The second goal is to provide such patch storage.
The third goal is to make the code reusable in a way that would make it possible to integrate it with the Camp project, which is being worked on by Ian Lynagh. Camp makes improvements in core algorithms for patch manipulation, and should help address several other of the “size” problems darcs currently exhibits (number of active branches, conflicting branches, conflict resolutions and such).
Apart from the working tree and repository format in general, there is plenty of room for performance improvement in other areas of the code. If time’s left after addressing the primary goals, it will go into optimising neighbouring areas of code: patch application comes to mind (the most expensive operation in local “darcs pull”).
Design details
A proof-of-concept implementation is being actively worked on. The darcs repository can be found at http://repos.mornfall.net/hashed-storage (and http://repos.mornfall.net/gorsvet for a darcs-compatible experimental client). The current iteration of the code comes within a factor of two of the equivalent git operations (whereas the original darcs 2 code for hashed repositories is slower than git by a factor of 15 to 20). The first mentioned repository contains the implementation of the working tree indexing code. The code is partially haddocked; in the near future I’ll mainly be working on design documentation (inline in the code). Please don’t hesitate to ask if you have questions!
Some initial benchmarking and commentary can be found in following darcs-users@ posts:
http://lists.osuosl.org/pipermail/darcs-users/2009-January/017521.html
http://lists.osuosl.org/pipermail/darcs-users/2009-February/017822.html
http://lists.osuosl.org/pipermail/darcs-users/2009-February/017862.html
Note that gorsvet is primarily a benchmarking and testing platform.
The generated haddock documentation for hashed-storage can be found at http://repos.mornfall.net/hashed-storage/dist/doc/html/hashed-storage/. Please note that this is work in progress.
Deliverables
A debugged and documented hackage library for hashed storage manipulation. A repository format for camp based on such a library, with integrity guarantees of hashed pristine and patch storage (camp lacks those with its current format) and with performance within a factor of 2 of git for working tree operations.
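To make the “integrity guarantees” a bit more concrete: in a hashed store, an object’s name is derived from its content, so damage can be detected by hashing the content again and comparing. Below is a minimal sketch in Python, assuming SHA-256 as the hash and a plain dictionary standing in for the on-disk store. The names `store` and `verify` are made up for illustration and are not the hashed-storage API.

```python
import hashlib


def store(objects, content):
    """Store content under the hex SHA-256 of its bytes and return that name.

    `objects` is a plain dict standing in for the on-disk store (an assumption).
    """
    name = hashlib.sha256(content).hexdigest()
    objects[name] = content
    return name


def verify(objects, name):
    """Return True if the object stored under `name` still hashes to `name`."""
    content = objects.get(name)
    if content is None:
        # A missing object fails verification just like a corrupted one.
        return False
    return hashlib.sha256(content).hexdigest() == name
```

Storing the same content twice yields the same name, so identical files are kept only once, and any change to a stored object makes `verify` fail for it.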
I intend to patch darcs to use the hashed-storage library throughout — almost certainly after the 2.3 release — and the improvements thus become part of the current stable 2.x series of darcs.
Benefits to Haskell Community
Darcs is currently one of the most widespread version control tools in use by Haskell projects. There have been numerous complaints, not least from the GHC team, about performance issues. Alleviating these issues would certainly benefit a number of Haskell projects (and a number of non-Haskell projects).
Moreover, since we intend to compete directly against git’s performance, darcs can become a showcase of high-performance Haskell programming, rivalling C in IO and system-interaction intensive workloads, in very real-world conditions.
Last, but not least, we use bottom-up design. The foundational code for the project will become a standalone hackage library.
Why?
I am a long-time darcs user. I believe that darcs has lots to offer over more traditional history-oriented version control systems. Both of these motivate me to work on darcs. The first one is clear: I want my tools to work well and not get in my way. Even more importantly, in team projects, I hate when darcs gets in the way of other team members’ work. And second, I would hate the darcs-style tools to die out. Darcs really brought innovative features to the table, and while the easy ones were picked up by other tools (take the interactive record for example), the hard, and more important, core features never made their way there. I understand there are practical issues that prevent such features from finding a place in the rigid DAG-based models of the traditional DVCS.
Since I already use darcs, to me it makes more sense to improve darcs (and potentially camp) than to try to bolt the darcs core features onto a DAG-oriented DVCS, a task that those DVCSes themselves do not dare to tackle. Moreover, Haskell is sexy.
posted 19.05.2009 3:27 pm
07.05.2009: some bassoon progress
Some Bassoon progress
First of all — even though most of you probably know — I have succeeded in those conservatory admissions, which is awesome. More than that, in the audition part I received 23 points (out of 25 possible), which is much more than I would have expected. In the other parts, I have done somewhat worse, but in each of them it was quite enough to pass. Well, school starts next September.
In other news, I have been practising (a lot). There was a small interruption while I was at Utrecht two weeks ago, but other than that, I am getting between an hour and two every day. I have been mostly working on basic technique: breath control, embouchure, getting fingers to move, tonguing and staccato. I do long tones across three octaves every day (metronome on 40, 8 beats per tone, chromatic scale from C to c”, then 4 beats per tone from c” to C (breathing every other), then 2 up again and 1 down). A lot is going into breathing and embouchure in that exercise. I add decrescendos from time to time (it seems that the exercise is also more demanding that way).
Then come scales: for the last two weeks it was A major and F# minor, and I’ll be adding a sharp every two weeks I suppose (I’m at E major/C# minor right now), then switch to Bb major and start adding flats. That way, I should get through all of them before school starts in September, and maybe even find some extra time for some practice. I should definitely know all of them reliably by the first year’s midterms, and I’ll need another pass or two till then. Anyway, the scale exercises revolve around the standard drill: scale, triad, inversions, dominant seventh chord and its inversions. Basically just kicking the fingers into obedience.
After scales, the practice program starts to vary: there are some staccato and finger exercises I do sometimes, and there are etudes (currently in Weissenborn op. 8, start of lesson 18). As far as actual music goes, I have been mostly sticking to continuo parts: maybe two movements of the Händel oboe sonata (well, “oboe”… HWV 364), first two or three movements of Sonata Sesta by Veracini. I have teased the Sicilienne by Eugene Bozza a little, although it is really out of my current technical (not to mention musical) reach. A few weeks ago, prof gave me a copy of Inventions per Oboe and Bassoon, by J. Z. Bartoš. I did some work on the first one, and I played it in the class last week… We also tried it with Lucy, and it wasn’t that bad. There’s this one … surprising bit in there, but overall, it’s going pretty well.
More recently, I have also started to work on my right hand’s fourth finger, which has been misbehaving from day one. Finally, after a few months of frustration, it seems the spell is broken and things have started to move… somewhere. Hard to tell what gives just yet, but it seems that eventually, it’ll stop cramping. Hooray for progress.
And well, reeds. I have two playable reeds right now. One is pretty good; the other, hard to tell just yet. I have been using the latter for just long tones for the last few days, wondering what will come out of it. It’s harder to control and takes a lot of air in the lower register (C-E) and I’m not quite sure about its intonation either. Interestingly, it initially seemed to be the better of the two…
(Other than bassoon, there’s also the piano thing. I didn’t mention that before, but I have bought myself a Yamaha P-140, so now I can practice piano at home — which is good for the compulsory piano classes that are part of the conservatory curriculum… I don’t have much more than maybe 20 minutes every few days, although I am trying to make it more regular. Nevertheless, there’s also been some progress in the piano department.)
posted 07.05.2009 12:22 am
23.04.2009: darcs sprint 2009-04
Darcs Sprint 2009-04
As you might know, as part of the Haskell Hackathon 5 in Utrecht, the second darcs hacking sprint took place. Thanks to generous donations to the Darcs project, I have been able to attend. I took the overnight train from Prague on Thursday evening. Interestingly, we managed to accumulate about a 90-minute delay in Germany, but by skipping some detours (throwing passengers out at more convenient ones) we managed to get to Utrecht on time. (Wow.) After a while of confusion, we managed to get our strippencards and find the right bus and get to the UU campus (by that time, we had already met with Thorkil, whose train was combined with mine in the middle of the night somewhere in Germany).
The first thing there was to get our name tags, and a round of introductions followed. I was fairly surprised by the (large) number of darcs hackers among the general Haskell population present at the Hackathon.
Overall during the sprint, various topics have been discussed, ranging from theory issues to actual implementation bits in current darcs. Some roadmapping has taken place. It was great to see all those darcs people together. More interestingly, it sort of convinced me that actually something is going on with darcs. Although most people seem to be pretty short on time, at least I got the impression that people have actual intent to keep advancing darcs, in one form or another.
Also, there’s been some preliminary plotting on roughly what the life expectancy of darcs 2 is. My current plan is to use darcs 2 as an implementation vehicle for carrying out the actual work on hashed-storage, because people use it in real life. That way, we should be able to de-couple low-level repository and file access code from darcs itself, also meaning it should be reasonably easy to plug it into camp later (or maybe it turns out we instead want to remodel darcs so that we can plug camp-core in… time should tell).
We have done some bits of work on the darcsit (gitit) migration of wiki.darcs.net, although it hasn’t been smooth sailing. I have even tried to set up gitit on one of my machines, but the cabal-installed happstack eventually went crashing and burning instead of doing anything useful. I’m stuck on that front, hoping that Eric and Gwern can get test.darcs.net running within reasonable resource usage.
I have had chats with various people about hashed storage, and I have explained various problems of the current darcs hashed repositories and how hashed storage addresses those. Also, I have talked with Benedikt about his filecache and we entertained some initial thoughts on how those two play together.
We have also spent a while at a whiteboard with Eric discussing what needs to be fixed in “sources” handling: the file that lives under _darcs/prefs/sources and that keeps track of related branches of a given repository. The catch is that when inappropriate entries get into that file, bad things can happen: darcs repeatedly waiting for a nonexistent remote machine to answer, or triggering automount a few thousand times by probing just the wrong directory. These need to be remedied and we came up with a reasonable action plan. I hope to have this in 2.3, as it should be moderately hard to fix and it has the potential to sweep away a fair number of silly and obscure performance bugs.
All in all, it seems to me that the important part of the sprint this time was talking to people and thinking about things. Still, I did manage to get some hashed-storage coding done (but mostly cleaning up and documenting work) during the sprint. I expect the fruit of the sprint discussions to take shape in the form of code improvements in the following months.
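One way to picture the “repeatedly waiting” problem and a possible remedy: remember the outcome of probing each sources entry, so that a dead entry costs one probe per run instead of thousands. The sketch below is in Python and is purely illustrative; the `SourceProber` class, the `probe` callback and the in-memory result cache are assumptions of mine, not the action plan from the whiteboard and not darcs code.

```python
class SourceProber:
    """Probe each source at most once and remember the outcome."""

    def __init__(self, probe):
        # `probe` is a hypothetical callable: source -> bool (True if usable).
        self._probe = probe
        self._results = {}  # source -> cached reachability

    def reachable(self, source):
        """Return the cached result, probing only on first sight of `source`."""
        if source not in self._results:
            try:
                self._results[source] = bool(self._probe(source))
            except OSError:
                # Treat I/O failures (timeouts, missing hosts) as unreachable.
                self._results[source] = False
        return self._results[source]

    def usable(self, sources):
        """Return the sources that answered, preserving their order."""
        return [s for s in sources if self.reachable(s)]
```

With this shape, asking for the usable sources a second time does not touch the network or the filesystem again, which is the property that would matter for the automount case.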
posted 23.04.2009 7:15 pm