Introduction
This is probably the most active bit of the site, i post random thoughts, events and articles to this blog. It is ridden with personal issues and technical items alike, while being constantly out of date. Read at your own risk. If you are interested in my person (odd enough as that may sound), probably read something about me. If you are only interested in technical articles, see tech. Poetry has a completely separate blog of its own right. Older articles (as well as postings on the previous blogs i had) are to be found in archive. That’s probably it.
Recent Entries
26.08.2011: soc reloaded: outcomes
soc reloaded: Outcomes
This blog is kind of a final report for the summer. I had a progress report drafted for about two weeks, so let me paste that here just for the record. To read about the actual results, please skip to the “State of the Code” section below.
The Last Progress Report
This is the second (and last) progress update for this summer of code project. It was written something like a week before the pencil’s down, but I got disheartened and instead of finishing and posting, I went on to code some more. Here it goes…
Since my last report, I have decided to turn somewhat more radical again. The original plan was to stick with the darcs codebase and do most (all) of the work within that, based primarily on writing tests for the testsuite and not exposing anything of the new functionality in a user-visible fashion. I changed my mind about this. The main reason was that the test environment, as it is, makes certain properties hard to express: a typical test-suite works with assertions (HUnit) and invariants (QC). In this environment, expressing ideas like “the displayed patches are aesthetically pleasing” or “the files in the repository have reasonable shape” is impractical at best.
An alternative would have been to make myself a playground using the darcs library to expose the new code. But the fact is, our current codebase is entrenched in all kinds of legacy issues, like handling filenames and duplicated code. It makes the experimenter’s life harder than necessary, and it also involves rebuilding a whole lot of code that I never use, over and over.
All in all, I made a somewhat bold decision to cut everything that lived under
Darcs.Patch (plus a few dependencies, as few as possible) into a new library,
which I named patchlib, in the best tradition of cmdlib, pathlib and
fslib. At that point, I also removed custom file path handling from that
portion of code, removed the use of a custom Printer (a pretty-printer
implementation) module and a made few other incompatible changes.
Of course, the testing code went along. The net result, at least for me, was an ability to build and test a much smaller piece of self-contained code. It also allowed me to experiment with APIs a bit, where those were used all over darcs, which made it, within the big codebase, impossible to advance without expending disproportionate amount of work on every change. Of course, part of that will be paid back when we decide to port darcs over to use patchlib.
I originally planned this report for the start of this week, but then I got
caught in a big refactor of the ApplyMonad/Apply classes (again). The refactor
was triggered by the need to pretty-print patches, which is not a completely
easy task (made more complicated by the fact that UUIDs are meaningless for the
user, so formatting patches without context is essentially useless
now). Anyway, I am now much happier with how the ApplyMonad class looks (the
ApplyMonadBase thing was genuinely hideous… good riddance). As a net result
the ApplyMonad class and, even more importantly, the ApplyMonad transformer
(used in applyTo{Tree,State} among other things) is substantially easier to
use on the client side (while maybe very slightly harder on the provider side,
it is also much clearer, IMHO). Overall win.
As for formatting and summarising patches, I have created a new Display class
(I plan to nuke the existing viewing classes more or less, when I manage to
make meaningful Display instances for V1 Prim). The API lets you format patches
based on their ending or starting context. Since bits of the patches need to be
fetched from the hashed store, the display needs to run in a LoadMonad. Since
any reasonable patch formatter also needs to pass state around (and since the
actual type of state passed around will be different for different Prim
implementations, we hide the state in a type family of monad transformers; this
also opens the option to use something else than StateT when appropriate).
(This is the end of the “progress” post. The remainder more or less describes the end state of the project.)
State of the Code
You can look at the code in my darcs repositories. Specifically, to play
around with a “demo”, you need pathlib, fslib, patchlib, cmdlib and
gorsvet. A bit more about the individual libraries:
pathlibis small library for handling file paths; it uses strict bytestring representation, and adds a certain amount of type safety and a few form invariantsfslibis the successor of hashed-storage; it deals with accessing the filesystem in an efficient manner, with good support for hashing files and using the hashes for efficient comparisonspatchlib, as outlined above, came into existence as a “fork” of the Darcs.Patch hierarchy, add and take some; the code is self-contained, and has a set of QC/HUnit tests, although these definitely need extending to cover more of the functionality (and some of the functionality needs to be removed)cmdlibis my commandline parsing library, and is needed by the test clientgorsvet(see next point)gorsvetis an experimental client using version 3 primitive patches, and a very simple hashed repository format; its main purpose is to demo the implementation of V3 prims, in addition to existing QC coverage;
About patchlib
One of the main problems of the darcs codebase today is insufficient modularity, and patchlib is an experiment in an attempt to address that concern. Efforts to bring about significant improvement of the situation from “inside” (by restructuring existing code) have to date failed. Even though there have been local improvements, the overall problem stubbornly persists.
Hand in hand with modularity problems come issues with unclear and underspecified (both internal end external) APIs. Since the separation between different components of darcs is blurry at best, the pressure to introduce clean, testable interfaces is minimal. The external library, on the other hand, is forced to put up a presentable façade. Luckily, the Patch subhierarchy in darcs is, compared to remainder of darcs, in a fairly good shape in this respect.
Eventually, patchlib should provide leverage to work with darcs(-style) patches, including at least:
- Implementation of primitive patches, both version 1 (as used by darcs 1 and 2), compatible with existing repositories, and version 3 (with better semantics and more efficient representation; subject of this SoC project).
- Implementation of the “real” patches, in version 1 (as used by darcs 1) and version 2 (as used by darcs 2), and at some point, version 3 (of as yet unspecified properties).
- Implementation of PatchInfo and “named” patches, which implement changesets in the darcs sense, and allow tracking their metadata. With PatchInfo (and to a limited extent, patch implementations themselves) goes a set of matchers and support code for interactive dependency-aware patch selection.
With a homogenous set of APIs, mostly mediated by type classes and appropriate instances, to:
- Create primitive patches (mainly through diffing, but also by
version-specific direct construction). The most important type class is
Diff, defined inData.Patch.Diff. The interface allows multiple equivalent representations of a single change, intended to provide an interface for heuristic detection of high-level patch types like hunk move or token replace. - Create real and named patches by building them from constituent primitives.
- Store and load all types of patches. This role used to be filled by the
ShowPatchBasicandShowPatchclasses, but is being superseded byStoreandLoadinstances (the classes are provided byfslibfor generic hash-based stores). - Format (and summarise) patches for user-friendly display, using relevant
context to improve the legibility and usefulness of the output. The new
class to achieve this is
Display. Previously, this role was filled byShowPatch, with itsshowContextPatchmethod (which is, however, significantly under-used by darcs). Since V3 primitive patches are substantially harder to read without context, the interface for user-level rendering of patches mediated byDisplaymandates context use. - Commute, invert (all patch types) and merge (“real” and “named”
patches). The primary classes are
CommuteandMerge, currently both faithful copies of their darcs versions. They will however need to be changed to allow these operations access to aLoadMonad, for the benefit of delayed, on-demand loading of extensive text data (applies to hunks at the moment). Currently, theCommuteclass allows this, by substituting the Maybe monad for aCommuteMonadconstraint, when compared to the original darcs definition. - Load, store and convert (to/from legacy format)
PatchInfoobjects that are used to track metadata inNamedpatches.
About V3 Prims in patchlib
The version 3 primitives in their current incarnation in patchlib have the following traits (based on the list from the proposal):
- File content and file location/existence are tracked separately, tied together through universally-unique identifiers (of 256 bits). This avoids a number of conflict scenarios and may allow further novel features, like non-history-disruptive project splits and merges or subtree “checkouts”.
- Hunk content is detached from the hunk patch itself through a hash (unless shorter than twice the length of the hash representation). This makes the patch files themselves very compact and allows most commute rules to avoid loading the hunk content altogether (improving speed and reducing memory footprint). It also makes handling of binary files vastly more efficient.
- The hunks are byte-based, making it substantially more efficient to apply a patch to a file, since no newline scanning needs to happen.
- The same hunk format that is used for text files is reused for binary hunks, basically encoding a range substitution operation on the binary file. A binary delta algorithm can be plugged in to compute more efficient binary hunks, although even full content replace is much cheaper than the binary patch type available in V1 prims.
- A set of primitive patches implementing a hunk “move” operation is implemented, and is passing the generic commutation / application tests. Unfortunately, at this point there is no diffing algorithm to detect such moves, although one is planned for the future.
- Hunk and hunk move patches are the only content-editing patches available to date. Further patch types can be added to the library without restriction as long as the format is not frozen (at least token replace and indentation patch types are planned before any such freeze). New object types and accompanying patch types may be added in later backward-compatible revisions of the format.
About gorsvet
Gorsvet is a toy implementation of a repository layer that uses V3 primitive
patches. (Un)fortunately, the V3 prims violate fundamental assumptions made by
the repository and command layers of darcs, which means that an integration is
substantially more expensive than fits a single SoC project. However, as
outlined above, it is useful to be able to play around with the V3 prim patches
in a realistic environment. Therefore, gorsvet. I made it into a rather thin
user shell based on cmdlib and a prototype repository layer. The UI more or
less resembles darcs (without interactivity, since that’d be superfluous for a
tool of this scope). You are of course welcome to try the experimental tool
out: the online help should give you an idea how to use it.
One thing I would like to discuss a bit here is the repository format. Since the patch types are incompatible anyway, we are fully liberated from backward compatibility considerations. The next darcs repository format can be designed from scratch, keeping in mind the shortcomings of the previous two formats. The implementation in gorsvet is a peek at what the result might look like. Anyway, we still need a better “composite” patch layer, which represents conflicts (and sits one level above primitive patches), since the current (version 2) composite patches in darcs are quite unsatisfactory. That also means we have plenty of time to play around with the repository format, which is more or less independent of the composite patch format.
As for the format itself, I went for as simple as possible (but no simpler). So
far, I have 2 files and a hashed store: .gorsvet/hashed is a sha256-based,
uncompressed “dumping grounds” for stuff of all kinds. The files are (not
implemented as of this writing) .gorsvet/index (with the same purpose as
_darcs/index — fast, efficient working copy access) and .gorsvet/meta — a
small set of root pointers. This has two very beneficial side effects: very
atomic oupdates and transactional semantics (free transaction
rollback). Compression and garbage collection of the hashed store can then be
sorted out separately (and does not affect semnatics of the repository either
way). There are currently 4 root pointers in meta: shadow, pristine,
inventory and order. The pristine is the same thing as in darcs, shadow
is a similar thing, but at any time reflects the current state of the working
copy (it is automatically updated from working every time, before anything else
happens with the repository). The reason is mainly that the working copy has
entirely wrong semantics for diffing V3 primitive patches: most importantly,
UUID tracking is implemented through shadow.
The remaining two root pointers represent two new data structures (and
inventory is different from what darcs calls an inventory). The order
simply lists patchinfo hashes (a handle for each patch that does not change
through commutation) in the application order for the repository: in this
sense, order replaces hashed_inventory known from darcs. It is pretty
compact, but on its own also useless (since it give us no way to get the
patches themselves). The inventory then, on the other hand, is an efficient
map (currently a sorted pair list, written and read as a Data.Map) from
patchinfo hashes to the current patch storage hashes. Moreover, the patchinfo
objects are, unlike in darcs, stored in the hashed store as separate entities,
and the patchinfo hash can be used to efficiently fetch the patchinfo
itself. Therefore, the named patch can be assembled from inventory and
order puts the patch into correct context.
Apart from storing and showing things, I have also implemented a “pull” command for gorsvet, but it’s currently fairly unusable, since any conflict automatically means failure (there is no layer to handle conflicts; we could use version 2 darcs patches, but I think it would constitute a dangerous slippery slope: we definitely want a more solid implementation, and also want to avoid a double transition).
Future Work
The obvious future work lies in the conflict handling. There are two main options in this regard: either re-engineer a patch-level, commute-based representation of conflicts (in the spirit of mergers and conflictors), as V3 “composite” patches, or alternatively, use a non-patch based mechanism for tracking conflicts and resolutions. It’s still somewhat early to decide which is a better choice, and they come with different trade-offs. Nevertheless, the decision, and the implementation, constitute a major step towards darcs 3.
The other major piece of work that remains is the repository format: in this area, I have done some research in both the previous and this year’s project, but there are no definitive answers, even less an implementation. I think we now have a number of good ideas on how to approach this. We do need to sort out a few issues though, and the decision on the conflict layer also influences the shape of the repository.
Each of these two open problems is probably about the size of an ambitious SoC project. On top of that, a lot of integration work needs to happen to actually make real use of the advancements. We shall see how much time and resources can be found for advancing this cause, but I am relatively optimistic: the primitive level has turned out fairly well, and to me it seems that shedding the shackles of legacy code sprawl can boost the project as a whole significantly forward.
(PS: While I agree, on the theoretical level, that nuking significant amounts of legacy code carries non-trivial risks, for a small volunteer project like darcs it is imperative to be fun. And a trench war against legacy code is not fun. Writing new things and exploring possibilities, on the other hand, is fun. Which means we need a bit more of the latter and a bit less of the former, even though the project sits in the more conservative camp — afterall, we handle rather precious data… Even when taking natural conservativism into account, a well-motivated, honest subsystem rewrite is better than a half-hearted, someone-has-to-do-it maintenance of a piece of code that everyone hates…)
posted 26.08.2011 1:40 pm12.07.2011: soc reloaded: progress 1
soc reloaded: progress 1
Oops! There has been no update for a long while from me, although I have been busy with code/patches. So far, I have tackled two areas: generalising and cleaning up the existing Patch testsuite, so I could apply it to the in-progress V3 Prim code later. This has been quite successful, although it took a little longer than I would have liked. With the new structure, (QC) properties for single patches, commutes and merges can be applied to any concrete patch type that supports the respective operations. Therefore, I now have coverage of both V3 Prims as standalone patches (for single patch and commute properties) and also when used with RealPatch (the implementation of non-primitive darcs 2 patches).
The latter part was then to make all these tests pass. Since I finished the respective Read/Show instances yesterday, all tests pass. Commute, apply and friends have been done couple days before that. So the next step is to write more tests that can demonstrate where the code needs to be augmented.
Hunk storage
I have slightly changed the actual (on-disk) hunk format slightly. For now, “detached” hunk-text storage is not quite supported, I am keeping that post-midterms. But the format still counts on that being possible. We do need a new monad (class) for writing patches though, since the Show instance is somewhat inadequate: the detached storage needs to be handled somehow.
Anyway, the format now looks like this:
hunk NNN .whitespace_encoded_old .new_text
(we use the same method for encoding whitespace as we do in filenames here). We
might want to change to a format that’s faster to produce/parse, since hunks
typically do contain whitespace. On the other hand, only very short hunks will
be encoded in this form. Also, an empty string is encoded as “!”. So the hunk
text (old and new) can take following forms: .whitespace_encoded, ! (empty) or
“@”. The last form, @ should take 2 parameters, a hash and length, like
e.g. this: @123456789ABCD<...>:65000. That means that most of the time, we
can ignore the hash and only fool around with the numbers (offset and lengths)
when commuting patches. Of course, applying them is a different matter:
nevertheless, we still do have a substantial advantage over V1 prims there,
since each hunk is a simple splice/catenate operation on bytestrings. With V1
prims, we had to chop up the hunk at newlines and remove the +/- signs.
As for implementation, this means we need to abstract commute over a monad
class, which besides commutation failure can express an “fetch text for this
hash” operation. This might be simpler than it was in the Apply case, a lot
of code had to be modified to accomodate for V3 Prims, since the existing
commute runs in the Maybe monad and we can make Maybe an instance of
CommuteMonad. Nevertheless, to make actual use of this, toplevel code that
runs commutes will definitely need to be modified, and in effect, all the
intermediate code that relies on Maybe for commutation will need to be
modified as well. This part could become actually more hairy than was the case
with Apply.
The question of coalesce
Apart from apply and commute, there is currently one more “core” operation in
Darcs patches: coalesce. This operation takes two primitive patches and decides
whether they can be merged into a single primitive patch. This can only happen
if the patches do not commute. Unfortunately, coalesce does not preserve
commutation behaviour: move a b :> move b c gets coalesced into move a c,
which modifies its commutation behaviour with (add, remove, move) patches
mentioning b. On the other hand, coalesce is normally only used during
operations like amend-record, rebase and when handling the “pending” patch.
All these operations modify the identity of a patch, and therefore it shouldn’t
matter much that coalesce fails to commute with commute.
On the downside, coalesce is currently a first-class operation that cannot be derived from the remaining. Most importantly, it is “redundant” with the diff operation, that constructs patches from two states. The problem is that with current (V1) prims, there is no diff operation for some patches. If we had reliable diff for all patch types, we could implement coalesce in terms of diff, commute and apply (pseudocode):
coalesce context (a :> b) | isJust (commute (a :> b)) =
diff context (apply context $ a :> b)
| otherwise = a :> b
This would work as long as there was a deterministic diff operation, i.e. one
with the property (for a being a primitive patch) that diff ctx (apply ctx
a) == a for all a. This diff operation does not need (and indeed cannot be
made) universal over different patch types, but fortunately that doesn’t
matter. We can always call it with a specific patch type as one of its inputs:
diff TextHunk ctx1 ctx2
I believe this operation should be possible to have, and it would also allow significant improvements in the “record” user experience: darcs could try various differs on a pair of states and offer “better” patches than just hunks (like eg. replace, move, etc.).
The ability to implement coalesce in terms of other operations is important because even more than commute, the implementation of coalesce is O(n^2), with n being the number of different primitive patch types: it needs to take into account any pair of types. With the above approach, since coalesce always yields a single resulting patch, we can implement it as follows:
- try to commute a with b; if this works, there’s nothing to coalesce
- if it fails, take a context in which (a :> b) can be applied, ideally as small as possible (since diff is somewhat expensive)
- try to diff the original and post-a/b context, using both the “a” differ and “b” differ: if either succeeds (i.e. produces a single primitive patch), then we have a winner; otherwise, we cannot coalesce either
An UI digression
This is not GSoC related: you can skip this section if all you care about is GSoC…
Anyway, when I am talking about user interface. I think a substantial
improvement in the way “semantic” patches work would come with the ability to
infer those patches “automatically”. In the spirit of the above diff
operation that is available with every patch type. However, it is hard to do
a semantic “diff” on significantly divergent repository states. This is simply
because further changes obscure the relationships of entities in question. When
the only thing that changes is a “mv”, it is easy to detect. But when you
also edit the file, it is no longer possible to tell for sure if this is a move
or a new object.
What would help substantially, then, would be to be able to run diff much more often. This prompts a workflow change. Of course, we cannot ask people to commit every time they do a small change. However, we could ask them to “amend” an “in-progress” patch when they do. This would be especially useful if they can be coached into stashing their changes before and after things like applying “sed” to whole codebase, moving around files etc. This would be basically a supercharged version of “darcs mv”: it would say “I did something, you figure what is the right way to represent it”. It adds the burden of having to call this both before and after the “contentious” operation. But it also adds significant benefits (IMHO).
The other thing that would this kind of “open” patch (that you keep adding things to, until you are satisfied, and then you commit) allow is progressive (time-sensitive) revert. This is something that I would be completely sold on: if I kept telling the VCS to note down my changes reasonably often, I could get, in exchange, a whole-repository (but still granular) undo operation.
(It is not hard to imagine that you could also have more than one “open” patch at a time, sorting changes into different buckets for semi-related changes. The UI for that one would be more tricky though.)
The story of Apply
To get back to GSoC though. For what it’s worth, the test-suite part of the
work and a sketch of the V3 prim implementation is already in the
screened branch. The changes to the Apply class are almost ready for
getting into screened as well; they are currently available as patch635.
The basic challenge with Apply was that V1 prims and V3 prims apply to different kinds of things. While V1 applies to a filesystem tree (basically your working copy), this is not the case with V3 anymore. The V3 state is modelled as a map from UUIDs to Objects. However, it would be extremely uncool to have two different “Apply” type classes to apply different kinds of patches: this would also mean that all higher layers of patch code would need to duplicate their apply implementations. Not cool.
Therefore, associated types to the rescue: I have added an ApplyState
associated type to the Apply class. The Prim level patches then decide what
they can be applied to (currently a Tree, from hashed-storage, for V1 and
an ObjectMap, which still needs to be fleshed out in more detail, for V3).
Any higher levels inherit their apply state from the prims. Cool.
Of course, it’s not as simple, since we actually have to implement that cool
apply method. This was traditionally (well, since I merged my new annotate
code; not that traditional I guess) been done through the ApplyMonad
class. Now ApplyMonad used to have operations like “create a file”, “create a
directory” or “write this bytestring to this file”. That’s cool for V1 prims,
but not very useful for V3 prims. So ApplyMonad needed to be generalised over
multiple apply states. This forced a multi-parameter type class, since there
are no functional dependencies to save us (and therefore associated types do
not apply either). This is because we expect some monads (e.g. IO) to be
instances of both ObjectMap- and Tree-based ApplyMonad. In general, I
didn’t want to limit this to special monads, although we might go for that
option later if it turns out to be superior.
Anyway, the ApplyMonad class is a bit of a meta-class, since the actually
useful set operations is different for different apply state
representations. But since the methods can carry their own constraints, I have
added a couple of fully generic methods (get current state, set current state
and the like) and a set that only applies to ObjectMap and one that only
applies to Tree. This doesn’t seem to pose significant trouble. Haskell
doesn’t seem to have higher-order type classes that would solve this maybe
slightly more elegantly. (I don’t even know if they are possible. Don’t crucify
me if they aren’t.)
Anyway, long story short, we now have a single apply method in a single
Apply class, that works on both V1 and V3 prims, as witnessed by running the
same set of tests, which sometimes do invoke apply, on both implementations.
The story of Commute
There isn’t much of a story behind this one so far. As I outlined above, there
are things coming here as well, but they are not required to allow V3 prims per
se: only needed for the detached storage optimisation. The commute as it is
implemented works, at least as far as tests are concerned. This goes hand in
hand with some kind of StoreMonad and LoadMonad abstraction, that will
actually allow us to implement the detached storage. The CommuteMonad can
then be a
class (LoadMonad m) => CommuteMonad m where
commuteFail :: m a
kind of deal. A LoadMonad superclass constraint can (should) appear on ApplyMonad as well. For now, the instances don’t need to be too elaborate (they can simply fail to fetch anything at all, which will work just fine for V1 prims).
The story of unsafeInterleaveIO
I don’t like unsafeInterleaveIO. At all. My last summer of code was, after a
fashion, about removing a significant source of unsafe, ugly and outright
dangerous lazy IO from darcs. I believe it was a significant success. Now the
LoadMonad / StoreMonad abstraction has a potential to rid us of another
source of lazy IO in darcs: currently, the reading of patch content is
unsafeInterleaveIO`d and the rest of the code treats the repository as a
kind-of-pure data structure built of patches. Of course, the unsafeInterleaveIO
is unsafe because it breaks the type system. Since darcs uses it a lot, there
is no telling which value is actually pure. Arguing about runtime behaviour of
lazy code is hard enough when it is actually pure. Random IO thunks lurking
inside pure values (kind of like an alien in Sigourney’s stomach) turn it into
a nightmare.
This will involve a lot of code being lifted into a monad, fortunately a significantly restricted monad. In practice, it’ll be IO more often than not (although in the testsuite, it’ll probably be the like of Maybe, or StateT Maybe). What matters is not that the code will execute in IO, but it is statically well protected: it cannot access the IO monad and it has to be explicit about side effects (like loading things from disk). Therefore, we give things types that actually reflect impurity, without allowing them to spin out of control with side effects (like they could if they were simply in the IO monad). A static type system win.
Conclusion
Ok, I guess this is more than enough for today. I’ll try to keep you folks more informed about the progress in the second half of the endeavour. On the other hand, I am a bit worried that these posts are more useful as a note-to-self resource than for general reading by others. Well, let’s hope, dear reader, that you found at least something of this post useful and/or interesting.
posted 12.07.2011 12:27 pm24.05.2011: soc reloaded: darcs primitive patches version 3
soc reloaded: darcs primitive patches version 3
This year, I have accepted Eric’s invitation to submit a proposal to Google Summer of Code again. I am not going to repeat the proposal itself here, so please read that as well. This post is more about filling in the technical details and setting out a plan of work.
First a clarification: even though there are no V2 prims, I call those V3, because the V1 prims have slightly (and somewhat confusingly) different semantics in Darcs1 and Darcs2 repositories; if nothing else filename encoding has changed incompatibly. There have been some commute rule changes as well, although I am not sure this wasn’t retroactively changed even for Darcs1 repos. Not important anyway.
In the rest of the post, I will lay out how I anticipate the V3 prims to work and further what I intend to implement and roughly how.
Objects
Primitive patches shall operate on a collection of abstract objects (which define the “pristine” state), each object in the collection being uniquely identified. Objects come into existence the first time they are referenced and they are never destroyed. We assign a type to each object (and object patches get corresponding arrow types).
I imagine there would be a few object types: binary, text, directory. We can add “bugs” objects and stuff like that later. The patch types should be monomorphic to simplify things. We can share implementations between different patch types if they are identical apart from their type.
Directories
A directory object is a map from names (strings) to object ids. (I say map and not bimap since there seems no good reason to prevent multiple manifestations of a single object.) We should however take care to avoid loops in the structure and such. We could even tie hardlinking to this, although that’s probably pretty useless in practice. We definitely should take care to avoid loops and similar abominations in the directory structure.
Among other repository properties, we keep a “root” object — this is the UUID of a (directory) object that represents the root of the working copy of the repository. The directory can map names to things, like text or binary files, or other directories.
Akin to the “root” object, we may want to keep track of a “preferences” object as well. Again, this would be just an UUID of a directory object. The directory object could then list individual preference files.
Patches
Some examples:
- bhunk (binary hunk) :: binary -> binary
- hunk (text hunk) :: text -> text
- bin2text :: binary -> text
- text2bin :: text -> binary
- manifest :: directory -> directory
- demanifest :: directory -> directory
Patches of different types on the same object clearly don’t need to commute. If there is a binary -> binary patch and a text -> text patch affecting the same object, they can never change their order. In fact, a -> b patches for a != b can’t realistically commute with any a -> a patch. This should drastically reduce the exponential number of commute rules we’d otherwise need to deal with, and should make the primitive commute function much more modular. In fact, only a -> a patches for same a become involved in the exponential commute definition blowup. This should be manageable.
Moreover, if we impose a map from patches to the object they affect, we can also trivially commute patches that affect different objects. We will need to generalise this later, however, since some patch types may involve multiple objects (even though our type system can’t handle that yet, either), or even involve a list of objects variable under commute.
Patch application needs to obey the type restrictions of course.
We will possibly want to attach a UUID to each primitive patch as well, so it can be readily identified. Of course, this increases the space overhead, but presumably not exceedingly so.
Hunks
The basic patch type is the hunk: the representation may be identical for both binary and text objects. What is not the same is how binary and text hunks are obtained. For text objects, we should use a whitespace-sensitive diffing algorithm (line diff, most likely; either the one we already have in darcs, or alternatively patience diff). For binary objects, we can use one of the binary delta algorithms. It may be prudent to disallow commute for binary hunks, too.
The format is still not defined, although there is a “first shot” at http://web.mornfall.net/blog/patch_formats.html. But in the end, we probably want a somewhat different format anyway, or an additional hunk type, because we apparently want both removal and addition to happen in a single Prim, for commute to make more sense. So the format could instead look like:
hunk 123 "old text" "new text"
or
hunk 123 old_hash new_hash
with the quoted-string version being a text-escaped, “inline” variant of the
patch to be used when length of old\_text + new\_text is less than two
times the length of a hash. Even though patches of this textual form are not of
constant byte width, that doesn’t matter since other patch types cannot be made
to be that way either (like replace, or manifest). Any “substantial” text that
is the actual problematic part to parse is moved away by indirecting it through
hashes.
Multi-object patches
Until now, we restricted ourselves to patches that affect a single object. This may be genuinely impractical for patches that move around things, be it complete files (move) or pieces of content (hunk move). We want such patches to commute as a single unit, either commuting completely or not at all. This could be achieved differently, by adding a concept of atomic patch group. I am not entirely sure if that is right or not, but it currently seems like the more complicated option.
Therefore, we can go on adding multi-object patches. Presumably, the correct type would be (a, b) -> (c, d). Most commonly of course (i.e. in the two abovementioned cases), this would end up being (a, a) -> (a, a).
Generic commute rules
Let’s assume a function
touches :: Prim -> [UUID]
we can say that
commute (a :> b) | null (touches a `intersect` touches b) = (b :> a)
Now we can also add the type restrictions. We demand that for each touched object, all the types in both patches match for the commute to be allowed.
commute (a :> b) | not (a `typematch` b) = fail
Where
typematch a b = all match (touches a `union` touches b)
where match x | type a x /= type b x = False
| fst (type a x) /= snd (type a x) = False
| fst (type b x) /= snd (type b x) = False
| otherwise = True
With type :: Patch -> UUID -> (ObjectType, ObjectType).
A few more restrictions will need to be applied in case we need a patch type that may operate on object collections such as a “subtree”, whose composition can change in time. I have a rather vague idea that some kind of general mechanism could be used for designating such an object collection could be used, which could then be built into the framework, making it also possible to define that kind of patches without introducing exponentially big commute definitions.
Adding new patch/object types
With the generic commute rules, it becomes possible (and easy) to add new object types and corresponding patch types to the system, without ending up in an exponential tangle. One such object type could be “haskell” (holding a representation of a Haskell AST), or “bug” (for in-repo bug tracker, ala bugs everywhere). Another useful object type could be “set”, keeping a sorted set of lines, or a “changelog”, keeping a timestamped list of textual chunks.
Of course, there are other options to achieve a similar effect, but I like how this “falls out” more or less naturally from the design.
Optimisations
The suggested patch representation allows for some optimisations in the way patches are stored. This could include per-object buckets, detached hunk storage (which is more or less implied by the hunk format) or the like. Per-object buckets are slightly complicated by multi-object patches, but probably not ruled out. With per-object buckets, UUIDs based on hashing and minimal context for prims may be feasible, granting a number of desirable security and convenience properties to the system as a whole.
Issues
There’s a couple of issues that have been raised about the proposed design so far. I think the major one is that merge behaviour might be somewhat surprising in some respects, especially if root and preference object UUIDs are picked fresh, instead of being hardcoded: it is not clear how to manage these cases. On the repository level, any two repositories with disjoint sets of live UUIDs merge cleanly. However, they don’t necessarily merge cleanly on the working directory level.
The most counter-intuitive case seems to be when you initialise an empty
repository and then immediately pull from somewhere: you get a conflict here. A
possible solution: darcs init should create a repo with no objects. Any and
all objects come into existence through patches. And once you have two patches
that add a root object, you get a reasonable conflict that can be reasonably
resolved. (The preferred solution probably entailing recursive merge of the two
trees, with conflicting leaves being renamed out of way; this would constitute
“conflict markup” and would need to be explicitly recorded, of course, like
with other conflicts… another desirable solution may be renaming one of the
roots and making it a child of the other, confirming the other as the
repository root.)
About preferences, that’s a bit more tricky. The tradeoff is slightly different, since pivoting repository roots sanely is probably significantly more valuable than it is for preferences. More discussion is probably needed on how to exactly arrange this, but the exact details are not important for the project. Most of the available options can be implemented relatively painlessly on top of the proposed infrastructure.
Battle Plan
So in the above, I have outlined the design of what I intend to implement. I think I am now reasonably happy with the abstract level of design. There is of course a number of implementation challenges to be solved.
Nevertheless, since it is really getting late now, so I will keep the “battle plan” for another post. I intend to spell out roughly what and how I intend to do on the implementation level, and break this down into a couple of phases, time-wise. A very quick summary for now…
The core of the implementation work will happen directly in the darcs
repository, since each “version” of primitive patches comes with a separate
module hierarchy (even if there is only V1 so far). I can work in
Darcs.Patch.Prim.V3 more or less without disturbing other work or the release
process. I expect that coverage of the newly written code will come mostly from
QC and unit tests, which will likely live in the darcs source tree as well.
An external, experimental “client” may be an option, depending on how things go. It would probably implement a simple repository format without any of the higher level patch types (i.e. no conflict handling etc.), mostly for demonstration purposes.
A semi-independent vein of the implementation work would entail pushing out lazy IO out of the patch code in darcs per se. This is tied in with decoupling the hunk content (the text or binary blobs) from the patches themselves in memory. Even though it may well be possible to achieve using IO interleaving, an explicit approach will be more transparent, presumably much easier to reason about and, ultimately, debug, than the current lazy IO code.
posted 24.05.2011 12:47 am20.09.2010: italy
Italy
Yesterday, I have returned from a trip to Italy. The main event was the SEFM (Software Engineering & Formal Methods) 2010 conference, held in CNR in Pisa, the side event was a visit to Florence, Pisa, Lucca, Bologna and Venice. Although I have been initially a bit sceptical (as I usually am), partly due to travelling alone, the trip turned out to be one of the best thing that happened to me recently. I had really an awesome time, met lots of great people, had lots of fun, the weather was fabulous… The cities of northern Italy are beautiful, the conference was great as well — in particular, I really enjoyed both the tutorials I attended: first about Orc by Jayadev Misra and the second by David Harel about computational biology was even better. David really is a great speaker and the topic is quite intriguing as well.
Florence & Arrival
The choice of taking a train (instead of flying) turned out to be quite lucky one, since it enabled me to stop at various places along the way. I took an afternoon connection to Wien on Saturday and from there a night train to Florence, where I arrived around 6:30 am on Sunday. The city has been quite empty in the early morning, so I had a chance to walk around the deserted city centre during sunrise. I even managed to peek into the Basilica di Santa Maria del Fiore before leaving. Overall, the architecture of Italian cities is really nice to look at.
Anyway, in the afternoon I took a train to Pisa, checked in at the hotel and spent part of the afternoon trying to find an open grocery store. Traversed a good part of residential areas of Pisa, finally finding an open shop near the railway station. Thus refreshed, I looked around a bit more, going through the nicer historical parts of the town and crashing back at the hotel later in the evening.
Pisa, The Conference
Next day was the first conference day, mostly filled by tutorials and chatting with other attendees. I have made acquaintances with a few fellow PhD students and researchers from various parts of the world (including two colleagues from Prague). On Tuesday, we had the first set of conference sessions and in the evening, we had a dinner in a local pizzeria with Emilia (Technion, Israel but originally from St. Peterburg) and Abel (BME, Hungary) and Faraz (UCF, US, originally from India), and after the dinner I walked around Pisa with Faraz a bit. We went to see the Piazza dei Miracoli at night, with its Leaning Tower and the Duomo di Pisa. More sessions followed on Wednesday and in the afternoon, we attended the social event, which consisted of a trip to Lucca, where we were received by a guide and have been shown around the city, which has been really nice as well, and the guide did a really good job too. The trip was followed by a dinner in a villa outside of Lucca — the only complaint I could possibly raise about it, I guess, was that there was actually too much food to handle. Which is not much of a complaint really. Most of the afternoon, I spent with Natallia, who’s currently a post-doc researcher at CWI (Netherlands), has a PhD from Trento and a masters from her native Belarus. I much enjoyed both the event and the company.
On Thursday, the model checking session of the conference took place — this being the session where I gave my presentation as well. The fact that this was in the morning after the social event was somewhat inconvenient, but I think overall it went quite smoothly and I am fairly happy about it. In fact, a couple people later told me that they enjoyed the talk, which made me really quite happy. That is the sort of thing that can motivate you to continue the hard work (cough) and all. I had a bit of a conversation with Kirsten Winter (ITEE / University of Queensland, Australia) and later Dimitra Giannakopoulou (NASA, US) and both were very encouraging of my current and (hopefully) future work. I suppose that is as much an incentive to carry through with one’s PhD as one can hope for.
Moreover, a couple of people have asked me about the Red Hat funding for my formal methods research and whether and how the cooperation works. I have explained my plan for applying my research on the real code that I work on, which has met with a lot of approval and enthusiasm. I certainly hope that the goals are attainable and that it will be possible for me to continue to work on this as my PhD continues. There is definitely a lot of interest from the academia in seeing further real-world formal method applications.
Departure, Bologna & Venice
Later that day, I went to the town centre for a bit, and to take some daylight photos… Met later for another dinner with Emilia and Abel (same place, that’s how conservative we are), walk around the town, and a bit of Italian ice-cream. That was the last evening in Pisa, and I crashed to bed and in the morning packed up and checked out.
I had arranged previously with Enrico, a long time friend of mine and also a fellow Debian Developer, to meet in Bologna. I took a noon train to Florence, then Prato and finally arrived in Bologna in the afternoon, where Enrico and Yuwei picked me up. They walked me through the city centre of Bologna and shown me through some nice hidden places (which I would never think of visiting, were there not for my experienced guides). Among other things, we have seen a medic’s lecture room from 17th century, a city council art gallery and visited the San Petronio Basilica, one of the largest churches in the world, built throughout the 14th and 15th centuries. Later in the evening, we had a dinner with members of the local amateur astronomy club (which turned out to be more of an all-sorts-geek club) and a tour of their observatory-museum. My thanks to Enrico, Yuwei and also the members of the club for another great evening. We finally retreated to Enrico’s place (where he stays while in Bologna, and where his parents live).
The next day (that is Saturday) was split in two parts, first I spent with Enrico and Yuwei and stayed for lunch with Enrico’s parents (which was very nice!). In the afternoon, Enrico drove me to a nearby railway station and I took a train to Venice. I very much enjoyed the stay in Bologna, in a big part thanks to my hosts who really went out of their way to make my stay pleasant: thanks again, and looking forward to seeing you again.
After lunch, I departed for Venice, which is about 2 hours of a train ride away from Bologna — I arrived around 4 pm, to the Venezia Santa Lucia station. The weather (which was sunny and pleasantly warm until then) went on a strike though, so it was overcast and every now and when it went on to rain for a while. Overall, I guess that was sort of a mixed blessing: on one hand, it was a bit cold and wet (and I wasn’t very well equipped for that), on the other, there probably weren’t as many tourists in the city as there would be on a sunny day. Sadly, the San Marco was already closed when I arrived, so I had to omit that one. Nevertheless, I have seen most of the city, also quite a few remote bits of it that were virtually deserted on a Saturday without sun (like the seaside wall of the Arsenal, where I did not meet a single person).
Back Home
Late in the evening (around half past midnight), I took a train back to mainland (Mestre) where I boarded a night train to Wien. Finally landed in Brno a bit before noon.
Overall, as I already said a couple of times, it was a very nice trip. The late summer in Italy and meeting everyone was amazing. I am of course a bit sad and melancholic since it’s all over, but then, I am off to Netherlands in a week for another conference, which will be probably very different, but hopefully also nice. Sadly, due to all the travelling, I am going to have some issues at the conservatory (I still don’t have a viable timetable, eg.) but nevertheless, it was definitely worth it. I also hope to meet the folks that I have befriended again at some point — which doesn’t seem that unlikely, since Abel is actually attending the same conference that I am, in Netherlands next week.
The trip also sort of made me realize that my attachment to this place (which I call home, more or less) is not as tight as I have been thinking. It may, after all, be a good idea to move elsewhere, maybe for a couple of months or maybe a year or two. But that it is still a long way from here.
posted 20.09.2010 2:20 pm14.08.2009: soc final report
soc final report
As you may have noticed, I skipped last progress report. That’s because I was busy with coding work. Anyway, let’s summarise what happened throughout the GSoC program. The basic idea was to improve darcs support for large repositories — large in terms of number of working files. The primary goal was to improve the code that diffs the working tree against the pristine cache: an operation, that is very common, and in darcs 2.2 and earlier was also painfully slow (depending on circumstances).
Deliverables
So what has been delivered? There is a number of outputs of the project:
- hashed-storage 0.3.7, a library for reading and writing filesystem trees in hash-based formats, with primary focus on darcs hashed pristine
- darcs whatsnew integration with hashed-storage 0.3.7, part of the current stable release of darcs, 2.3.0
- hashed-storage almost-0.4 — a work in progress, improving hashed-storage 0.3.7 in several directions (and breaking the API)
- darcs-hs: a branch integrating hashed-storage 0.4 into darcs, using hashed-storage fast diffing mechanism for all working/pristine diffing needs — naturally, also a work in progress
- darcs-benchmark: a standalone package for benchmarking darcs (pitting different versions of darcs against each other on a number of medium-sized repositories, on a number of different commands)
The first two items on the list are done for good. The diffing code that was a major goal of the project is stable and part of a stable darcs release. It is supported by a independent hackage library, hashed-storage.
For efficient retrieval over HTTP… there is support for creating and using hashed packs in hashed-storage 0.4. This still needs some work to be put in practice, both on darcs-hs and on hashed-storage. I will continue working on this in the fall. The current status of code in hashed-storage 0.4 is basically proof-of-concept (comes with unit tests though). There is also a concept document, designing storage for darcs. Either way, I am trying to not rush things on this, since once we put this into the stable darcs, we will have to live with it for a long time, and mistakes now would be expensive later.
As for darcs-benchmark — I have uploaded version 0.1 to hackage and it should
be ready for wider circulation. Please try it out and let me know if you have
any issues with it (cabal install darcs-benchmark).
Work in Progress
Since pencils down is approaching quickly, there will stay a number of in-progress items. There are still some bugs to fix in darcs-hs, and hashed-storage needs a haddock cleanup and at least one pass over the API before I can upload 0.4. But let’s see what 0.4 brings over 0.3.7. During a review process that Ganesh started, we have identified a number of weak spots in hashed-storage. This has directly and indirectly led to hashed-storage 0.4.
On the API level, we have generalised the Tree out of IO into a generic Monad, to help with testing infrastructure. The Index API has been simplified and, more importantly, made much safer. With this done, it turned out that the index reading code is needlessly general, and I have simplified it, improving index performance along the way (in the process, I also changed the layout… twice… the magic word versioning mechanism really paid off).
I have also refactored the Hash type, and removed all the hacks that deal with the size prefix on darcs hashes. This means we won’t be able to write out new trees of this kind anymore (but no big loss, old darcs can read repositories without the size prefix as well). Along with the original Hash refactor, I have changed the internal hash representation, cutting hash length from 64 to 32 bytes (this is still SHA256, but is using full 8 bits of every byte, unlike ascii-hex which carries 4 bits of useful data per byte). This representation is also used in the index.
The overall result is, that the index size has shrunk considerably — it is now, on average, smaller than a git index for equivalent repository, even though we use stronger hashes (but yes, git index is more than a simple cache).
The even more interesting outcome is that, with darcs-hs, I can now run “whatsnew” on a 100k-file repository in 0.35 seconds — a mere .1 second (or 40%) slower than equivalent git diff. With 100% Haskell (well, the sha256 implementation is in C, but this code is not tripped in the optimal case). For most realistic repository sizes, the difference is going to be negligible. Oh, and to get an idea how big a 100k-file repository is, the Linux kernel tree has about 28k files in it.
Future Work
Recently, it has been noted, that for old-fashioned repositories, darcs whatsnew has regressed in performance by a fair amount. This is true, but it’s not clear if it is worth addressing properly. For 2.3.1, the easy fix is to restore the old code path when the repository is not hashed. However, I have already removed unsafeDiff in darcs-hs and I want to keep it that way, so we either need to start treating old-fashioned as second-class, or either come up with a more systematic fix.
As for hashed-storage, I want to do further work on packs. Maybe even add code to read git repositories. Moreover, there is an ongoing review of hashed-storage which may bring further ideas for changes. We also have to make some decisions about what to do about darcs in the future — how to.
Conclusions
It has definitely been a productive summer. Lots of work has been done on hashed-storage, and it seems that with a little further work, it can provide solid underpinning to future darcs versions. Apart from major performance improvements, it exposes a new standalone component with its own set of tests, improving test coverage of what constitutes darcs.
posted 14.08.2009 11:35 pm29.07.2009: soc progress 10
soc progress 10
Wow, tenth progress report. That’s a long time now. Still a day late, but getting on track slowly (slower than expected I guess). Last week has seen the release of darcs 2.3. Nonetheless, this release has seen partial integration of hashed-storage work. Unfortunately, the release was also a major motivation drain — RMing darcs is a lot of (fairly hard) work.
On the unsafeDiff front, I’m at a standstill (there are still the same 2 instances that I reported about last week). I haven’t progressed on the HPC issue either, but that can wait and it’s not directly relevant to SoC.
I have worked on profiling my last-week’s check/repair port. Fortunately, the implementation does not leak memory, and is slightly slower (some 20%) than the old version. Most of the extra goes into filepath juggling (due to representation differences between darcs and hashed-storage). See my darcs-users post for more details about this.
In other news, I have forked an 0.3 branch of hashed-storage and flipped the switch to 0.4 in mainline. The first compatibility-breaker is the Hash type refactor, which improved things a fair bit. Also, we now represent hashes in-memory (and in the binary index) by 32-byte binary vectors (represented as bytestrings) instead of using the 64-byte ascii-hex representation. Also, this has removed a lot of Maybe wrapping, since the new Hash type comes with a “NoHash” constructor.
Other than that, I have taken a break after the release. I still have some motivational issues, but I should improve my output rate over the next week. The major todo on is file benchmarking (since Eric keeps nagging me about this, and I still haven’t figured how to do this properly) and actual implementation of the (internal) pack format.
Hashed-storage changes:
- Do not include _darcs/index in the testdata.
- Move floatPath into the AnchoredPath module.
- Do not forget to reset changesize in Monad, upon sync.
- Avoid calling expandPath when not needed, in Monad.
- Do not forget to update hashes before syncing a hashed tree in Monad.
- Refactor Hash datatypes extensively.
- Bump version to 0.4.0.
- We can now base64/url-encode (and decode) all Hash types.
- Add a generic sha256 function to the Hash module.
- Sprinkle a few haddocks around the Hash module.
- Add a NEWS file.
- Bundle NEWS in source distribution.
- Move the darcs-specific packing bits into Darcs.
- Use “plain” hashes in Packed (instead of the size-carrying darcs ones).
- The hatchery cleanup for compact is a removeFile (for now, anyway).
- Rename readOS to load.
- Haddock header for Packed.
- Fix collateral from using plain hashes in object storage. (Ow, ugly.)
And darcs-hs changes:
- Bump version to 2.3.0.
- Update autoconf release data to 2.3.0.
- Update NEWS.
- Ship the NEWS file in cabal tarballs as well.
- Port to hashed-storage 0.4 Hash changes.
25.07.2009: patch formats
Patch Formats
I have been thinking for a while, that a completely new “repository” format (an experimental one) would be in place for darcs 2.4. I have previously outlined a way I’d like to go about building up new things within the darcs 2.x series. Now a darcs repository has two basic “components”: the “file” part of the layout: truly a repository format, and a “patch format”: which determines not only how patches are written out to disk, but more importantly, their exact semantics. Once you set up a “patch format”, this is set in stone and repositories with different patch types cannot exchange patches between them (at least not without an in-between conversion). This is the case between darcs-1 and darcs-2 format repositories, as they use a different patch format. The case of darcs-1 vs “hashed” repositories, as darcs calls them, is only on the file level though: the patch formats are identical, and that’s why hashed and plain darcs-1 repositories can exchange patches just fine. (I will from now on refer to repository and patch format as two orthogonal things, as they mostly are.)
Now I have been working on a packed repository format… one that would allow to store the repository — regardless of patch format used — in a compact form suitable for HTTP-based retrieval. In this post, I’d like to address the other thing: a patch format. It seems worthwhile to improve our current system, since it has a number of weak spots. Currently, we have a number of “primitive” patch types, and some more complicated ones — conflictors in darcs-2 or mergers in darcs-1. I am not going to talk about these — we’ll focus on the primitive patches for now.
The primitives in darcs are addfile, rmfile, addir, rmdir, move,
replace and most importantly hunk. (You should be able to look up what
these roughly represent somewhere else.)
Let’s address these addfile and friend patches, that create and remove files
or directories. Obviously, addfile foo.txt and a different addfile
foo.txt are going to conflict. Also, all hunks for foo.txt obviously depend on
addfile foo.txt — which means that if you pull two branches together with
nontrivial files of the same name, you are going to end up with a massive
conflict (at least in terms of darcs data structures) for virtually no reason.
So my proposal is to divorce the filename from file identity: this is something that has been pondered before, I believe. The result would look something like:
hunk fileid 1
- a
+ b
This means that hunks would exist without any dependence on addfile: the
abstract file would pop into existence with first hunk touching its
identity. Of course this would be no good, since you just lost the relation
between a working copy and whatever darcs tracks. To put that relation back on
track, we add two patch types:
manifest fileid ./file/path
demanifest fileid ./file/path
A manifest patch will tell darcs to associate the fileid with your working
file at ./file/path. The inverse operation is demanifest, and that would remove
the association: and your working copy file. The abstract identity continues to
exist just fine, and can be manifested again (under the same or different
filepath). Basically, this completely de-couples the “hunk-space” from the
“filepath-space” — manifest/demanifest/move(/adddir/removedir) patches commute
completely freely with hunk/replace patches. To make the de-coupling complete,
you want a “manifest” of a non-existent fileid to pop that fileid into
existence as well. No problem.
Basically, this means that as far as darcs is concerned, file content manipulation is orthogonal to the directory tree manipulation: and this is good and well, since it allows us to solve conflicts on both of those levels separately, without dragging in a lot of stuff from the other level. Moreover, the add-add conflict no longer exists.
As for the hunk format itself, there is also a number of issues: it uses a
GNU-patch-like format with ‘+’ or ‘-’ sign in front of each line. It will
usually look like a block of ‘-’ lines followed by a block of ‘+’ lines (either
of these may be empty). Parsing this format is not quite simple, you have to
look up all the newlines, chop off the ‘+’ and ‘-’ signs etc. Lots of work for
darcs.
hunk ./foo.txt 1
- a
+ b
Now a friendly-to-parse alternative would be to use a chunk kind of patch:
chunk fileid - 0 2
a\n
chunk fileid + 0 2
b\n
the 0 and the 2 are an offset and length (in bytes), respectively. What follows the patch header is a monolithic block of text to be removed from (or pasted at) the given offset (and the block is of a given length). This would produce more primitives than the original hunks, but they would be vastly simpler to process in bulk. Each basically represents a string “splice” operation. No newline shuffling whatsoever. The commutation rules should be as simple as they were with hunks: you just wibble the offset (after checking for overlap).
But now that the data in those chunks is basically a blob of (anything), there is an extra thing that can be done. Instead of keeping this data inline in the patch, we could refer to it by a hash and store it elsewhere:
chunk fileid - 0 2 hash
chunk fileid + 0 2 hash
Of course, for the example patch, this is going to inflate the patch size quite a bit. But let’s not care for a while. We now have a chunk format that is O(1) in size (well O(logn) for purists, given the length needs to be represented, but we don’t care about that either). We can still commute and invert it just fine: inversion is flip-flopping the ‘-‘/’+’ sign, and commutation just wibbles the offset. Awesome. (Besides that, it also effectively obliterates any need for a “binary” kind of patch: chunks will do just fine for that, and we could even use binary diff algorithm if we wanted…)
We will need to dereference that hash sometimes of course: when we want to actually apply the patch, and when we want to show it to user. The latter is a non-issue, since our processing power vastly exceeds that of the user, we can play with the patch as much as we like for presentation issues. So when we want to apply that patch, we need to fetch its content… but wait, we already have a mechanism for buckets of bits by their hashes: the hashed pristine! So yes, we can just dump the data bits of the chunks into hashed pristine (plus some wibbling of garbage collection, see my previous post about that).
Now that no actual file content is ever part of any patch representation, we can consider some new options. One would be to store patches inline in the inventory files: this would probably inflate their sizes by a small factor. Looking at darcs repository itself, we have 750K of compressed inventories (1.5M uncompressed). There are 43225 hunks — this would add about 6M of uncompressed (but relatively compressible) text to the inventories (considering about 150 bytes per a chunk patch: 2x sha256 + a little). That is about factor 5. We would probably have to play around a little to find out how (un)reasonable this is. We could also cut a bit of the cost by using a less-stupid encoding of hashes than ascii-hex (aka base16)… say base64 (see RFC 3548/4).
I should probably also note that this would save some bits on conflictor representation (no copies of hunk data) and it should also solve the “big initial patch” problem — the patch itself would be O(n) in number of files (instead of O(n) in number of bytes of the initial tree). There are of course some drawbacks and some other advantages, but I don’t have the amount of time to go into more details just now. Instead, I’ll let people think about this for a while. Comments are definitely welcome (probably address them to our darcs-users@ mailing list).
posted 25.07.2009 2:03 pm23.07.2009: darcs 2.3.0
Darcs 2.3.0
I’d like to announce immediate availability of a new stable release of darcs, 2.3.0. There is a number of improvements and bugfixes over the previous stable release, 2.2. Moreover, work has been done to improve performance of “darcs whatsnew” for large repositories.
As in the past, there are two source tarballs available. As of this release, the cabal-based build is preferred, and the autoconf build is deprecated. You can obtain the source tarballs at these addresses:
- http://repos.mornfall.net/darcs/darcs-2.3.0.tar.gz
- http://repos.mornfall.net/darcs/darcs-2.3.0_autoconf.tar.gz
The build instructions are available in the enclosed README file in those tarballs.
Moreover, if you have cabal-install available, you can install latest stable release of darcs by issuing the following commands (no tarballs needed):
$ cabal update
$ cabal install darcs
This should give you a darcs binary in ~/.cabal/bin — you should probably
add this to your PATH. More detailed instructions for installing on Windows are
available near the end of this announcement.
What’s New
This is a summary of important changes since the last stable release (2.2):
- Lots and lots of documentation changes (Trent).
- Haskeline improvements (Judah).
- Cabal as default buildsystem (many contributors).
- Fixes in darcs check/repair memory usage (Bertram, David).
- Performance improvement in subtree record (Reinier).
- New option: —summary —xml (Florian Gilcher).
- New option: changes —max-count (Eric and Petr).
- Fix changes —only-to-files for renames (Dmitry).
- Performance fix in “darcs changes” (Benedikt).
- Hardlinks on NTFS (Salvatore).
- Coalesce more changes when creating rollbacks (David).
- New unit test runner (Reinier).
- Inclusion of darcs-shell in contrib (László, Trent).
- Author name/address canonisation: .authorspellings (Simon).
- Working directory index and substantial “darcs wh” optimisation (Petr).
- New command: “darcs show index” (Petr).
- Gzip CRC check and repair feature (Ganesh).
A number of issues has been resolved since 2.2 as well. See http://bugs.darcs.net/issueN for details on bug number N.
- 948 darcsman (Trent)
- 1206 countable nouns (Trent)
- 1285 cabal test v. cabal clean (Trent)
- 1302 use resolved, not resolved-in-unstable (Trent)
- 1235 obliterate —summary (Rob)
- 1270 no MOTD for —xml-output (Lele)
- 1311 cover more timezones (Dave)
- 1292 re-encoding haskeline input (Judah)
- 1313 clickable ToC and refs in PDF manual Trent)
- 1310 create merged \darcsCommand{add} (Trent)
- 1333 better “cannot push to current repository” warning (Petr)
- 1347 (autoconf) check for unsafeMMapFile if mmap use enabled (Dave)
- 1361 specify required includes for curl in cabal file (Reinier)
- 1379 remove libwww support (Trent)
- 1366 remove unreachable code for direct ncurses use (Trent)
- 1271 do not install two copies of darcs.pdf (Trent)
- 1358 encode non-ASCII characters in mail headers (Reinier)
- 1393 swap “darcs mv” and “darcs move” (Trent)
- 1405 improve discoverability of global author file (Trent)
- 1402 don’t “phone home” about bugs (Trent)
- 1301 remove obsolete zsh completion scripts (Trent)
- 1162 makeAbsolute is now a total function (Ben F)
- 1269 setpref predist - exitcode ignored bug (Ben M)
- 1415 —edit-long-comment, not —edit-description, in help (Trent)
- 1413 remove duplicate documentation (Trent)
- 1423 complain about empty add/remove (Trent)
- 1437 Implement darcs changes —max-count (Eric)
- 1430 lazy pattern matching in (-:-) from Changes command module (Dmitry)
- 1434 refactor example test (Trent)
- 1432 refer to %APPDATA%, not %USERPROFILE% (Trent)
- 1186 give a chance to abort if user did not edit description file (Dmitry)
- 1446 make amend-record -m foo replace only the patch name (Dmitry)
- 1435 default to get —hashed from a darcs-1.0 source (Trent)
- 1312 update and reduce build notes (Trent)
- 1351 fix repository path handling on Windows (Salvatore)
- 1173 support hard links on NTFS (Salvatore)
- 1248 support compressed inventories for darcs-1 repos (Ganesh)
- 1455 implement “darcs help environment” (Trent)
The question of GHC 6.8
Using GHC 6.10.3 or newer is strongly recommended. You may compile darcs with GHC 6.8, but there are several caveats. If you are using 6.8.2 or older, please disable mmap support (pass -f-mmap to cabal install or runghc Setup configure). Note that the GHC 6.8.2 that ships with Debian Lenny is not affected and it should be safe to keep mmap enabled. It is also recommended to disable use of Hackage zlib when compiling with GHC 6.8.2 (including the Debian Lenny version): pass -f-zlib to cabal. When using zlib, we have seen occasional crashes with error messages like “openBinaryFile: file locked” — this is a known GHC 6.8.2 bug (and is fixed in GHC 6.8.3). Last, if you are using a 64-bit system, darcs may hang when you exit a pager when compiled with GHC older than 6.10.3. Although this is harmless, it is quite inconvenient.
Overall, the status of GHC 6.8 is “semi-supported”: for many cases, things will work just fine, especially if you take a little extra caution with compilation flags.
Installing on Windows
To install darcs on Windows systems from scratch, please download the Haskell Platform and MSYS:
- http://hackage.haskell.org/platform/2009.2.0.1/HaskellPlatform-2009.2.0.1-setup.exe
- http://sourceforge.net/projects/mingw/files/MSYS+Base+System/MSYS-1.0.11-rc-1.exe/download
After installing both, you should have an “MSYS” icon: run MSYS and in the
terminal window type (the $ character denotes the prompt, do not repeat it):
$ cabal update
$ cabal install darcs -f-curl
This should download, compile and install all required dependencies and also darcs itself. The resulting darcs executable will be placed into the Haskell Platform executables folder, and should be accessible from the MSYS shell (just type “darcs —version” to check).
posted 23.07.2009 3:27 pm23.07.2009: soc progress 9
soc progress 9
Uh, I have slipped by two days again — well, almost. The problem is that in the last two days, I have been somewhat busy with e-mail at darcs-users@… — a sudden peak of traffic hit the list. Basically, two threads created a lot of traffic, the David’s darcs one started by me, and the cheap in-repo branching one. And since this was lot of writing on my part, I didn’t quite feel like writing also the report. But I’m still a day closer to the original Tuesday (and it’s still not even noon here, so that’s more like a day and a half!). I’ll try again to get on track next week.
I have worked some more on the packed format bits in hashed-storage, and I have worked on getting rid of “unsafeDiff” in darcs-hs. The latter is down to 2 uses, one in “darcs replace” and the other in Darcs.Resolution. The replace one should be relatively easy, actually. The resolution one hopefully too, but we have no testsuite coverage of the external merge, which would be better added before I break it… For the changes already done, I have ported over check/repair (I still need to run this through a heap profiler, since I suspect I introduced some space leaks) and “darcs remove” which is now also neater and more readable (at least I think it is). And since we never use unsafeDiff to compare working against pristine anymore, I have nuked a bunch of code that deals with timestamps.
I have also done a little work on HPC reports for darcs itself: I put a current HPC report online, and there’s some automation in place, so it should get updated whenever I push into the repository (it is supposed to be a 1:1 mirror of http://darcs.net). There are still some issues, like the TIXDIR taking over a gigabyte of space (I know how to fix this, but haven’t had the time yet — it involves a relatively easy support in ShellHarness).
Also on the mainline darcs (and this is not strictly SoC work, but worth reporting anyway), we have removed autoconf support. Finally. There are still some ongoing issues, like we need the user manual to be built by Cabal (we kept some bits of the GNUmakefile to address this for now) and some other features are currently not available with cabal (see also this thread).
Hashed-storage changes:
- Clean up the interface implementing darcs hashed trees in object storage.
- In hatch, discard objects that are already in the storage.
- Improve haddock.
- Add a testcase for readOS.
- Add a small testcase for diffTree of identical trees.
- Add a testcase for writePlainTree.
- Add a simple QC property for restrict.
- Add another QC property for restrict.
- Expand the expandPath testcase to cover more possibilities.
- Also cover the items part of the expandPath-produced Tree.
- Obliterate embarassingly useless iterateM and fix mfixFrom (IO is too strict…).
- Implement rudimentary darcsPristineRefs (for chasing references).
- Test live object collection for darcs pristine trees.
- Add a function to postorder-traverse and update a Tree.
- More utilities for darcs hashed trees embedded in an object storage.
- Implement post-compaction cleanup for loose hatcheries.
- More complete OS field initialisation (although dummy so far).
- Use readOS after create to populate the in-memory structure.
- A rudimentary implementation of createPack (which in turn enables compaction).
- Do not fail when a nonexistent file is marked as changed in Monad.
- Do not remove associated Tree hashes in filter.
- Minor formatting fix.
- Avoid removing an in-use file on win32.
And darcs-hs changes:
- Port the replay (check/repair) functionality to hashed-storage.
- Remove the support for writing out new checkpoints.
- Remove the —checkpoint option from the UI.
- Remove now-unused checkPristineAgainstSlurpy.
- Add support for skipping tests (exit 200).
- Avoid removing in-use files on win32.
- Re-implement makeremovepatch in remove command, replacing Slurps with Trees.
- Obliterate all instances of sync_repo and friends, since they are useless now.
- Obliterate timestamp manipulation in HashedIO.
17.07.2009: soc progress 8
soc progress 8
So I have managed to reduce the report lag by a few days, expect the next one on the (regular) Tuesday. Hopefully. Anyway, since the last report, a lot has happened.
I got a new laptop a few weeks back (a Lenovo X200 — I installed the base system on 30th of June), which enabled me to play around with more resource-hungry software than before (this beast has 2G of RAM and two 2.4GHz CPUs). The thing I tried was installing VirtualBox and a copy of Windows XP inside (I have an academic license, hopefully I am within the rules). This worked fairly well (the VM is pretty fast, actually) and enabled me to play with darcs on a win32 system. After installing the Haskell Platform (congrats to all the folks who worked on this: this piece of software is awesome) and MSYS, I got cabal install darcs-beta rolling. Turned out that darcs 2.3 didn’t work very well on windows (well, about 30 tests failed, and darcs whatsnew was completely unusable). It took about two days time to kick all the parts into obedience: I patched all of mmap, hashed-storage and darcs. The net result is that darcs now passes its testsuite on win32 (this never quite worked before). In the meantime, I have already released new beta, darcs 2.3 beta 4 which has all the win32 fixes in it. In addition, I made the hashed-storage testsuite pass on Windows, which should make it much easier in the future to track down problems.
I have also made some coding progress on hashed-storage: first, the packed format prototyping is advancing, and before next report, I hope to have a workable prototype. Some code already went into hashed-storage repository, and there’s already some testing coverage. Which brings me to the other thing, that is that I have improved test coverage of hashed-storage significantly (we now have 10 QC properties and 18 testcases that together cover of 77% toplevels, 66% alternatives and 83% expressions, as reported by HPC).
In darcs-hs, I have done some work on optimising darcs show contents --match,
as this seems to be an important operation for repository browsers (or at least
for trac, but I imagine others may benefit from this). I have also figured that
large part of the tracdarcs inefficiency comes from calling darcs show
contents on way old revisions: based on this, Lele has implemented an
optimisation that instead looks at the latest revision that has a given
version of the file, and for HEAD revisions omits —match altogether. I have
deployed the optimised version on my own tracdarcs instance (for another
project) and I can confirm that it is much more usable than before. I’ll set
up a test instance for GHC repository just for the kicks of it. (Maybe we could
then get Hackage folks to install tracdarcs into their trac instances? Now that
would be cool.)
The summary of hashed-storage changes for the week:
- Improve the modifyTree testcases slightly.
- A basic testcase for diffTrees.
- Minor tidying of the Test module.
- Add a testcase for modifyTree used for removal (with Nothing passed).
- Check plain tree contents in addition to pristine.
- Add some bad and ugly filenames to the test data.
- Add a check that mmaping empty file works.
- More prototyping in Packed.
- Also derive Ord for Hash.
- One more QC property for “reachable”.
- Haddock the new utility functions.
- Add “reachable” to Utils (comes with a pair of QC properties).
- Add iterateM and (m)fixFrom to Utils (comes with QC and unit test respectively).
- Fix QC tests to start with prop_, while unit tests start with check_.
- Reorder test groups.
- Haskell still can’t read my mind. Make Packed compile (and expose it).
- Haddock tweaks.
- Start out on Storage.Hashed.Packed (for a lack of better name).
- Avoid unused import warning in Utils.
- Import Bundled.Posix qualified in Test to avoid future name clashes.
- Refuse to overwrite files in TreeIO rename (darcs relies on this).
- Properly encode whitespace in darcsFormatDir.
- Improved fileExists check in testsuite.
- Make a separate test group for Bundled.Posix.
- Add a test for fileExists from Bundled.Posix.
- Avoid symbol name conflicts with darcs sha256 implementation.
- Make the testsuite pass on win32.
- A quick&dirty unit test for getFileStatus/getSymbolicLinkStatus.
- Do not try to use lstat on win32, it seems to be broken in GHC.
- Bump version to 0.3.6.
- Bump the index magic to HSI1 (incompatible format change).
- Require mmap >= 0.4.
- Get rid of last darcs use in the testsuite.
- Do not use darcs show manifest in test, we have complete file list.
- Avoid trailing slashes when stat-ing directories (fixes win32).
- Correctly store path to the toplevel directory as “.”, not “/”.
- Use readSegment to read hashed directory listings.
and for darcs-hs:
- Optimize darcs show contents —match (avoid slurping pristine).
- Proper implementation for mDoesFileExist/mDoesDirectoryExist in Gorsvet.
- Provide readPending that also provides the “pending conflicts” check natively.
- Bump version to 2.2.98.4.
- Bump the hashed-storage dependency to >= 0.3.6.
- Implement getFileStatus, and use it instead of getSymbolicLinkStatus on win32.
- Use mmapFilePS in gzReadFilePS to avoid lazy bytestring readFile.
- Use DARCS_TESTING_PREFS_DIR in ShellHarness, since APPDATA override does not work.
- Recognize a special DARCS_TESTING_PREFS_DIR envvar to override the global preference directory.