soc reloaded: progress 1
Oops! There has been no update for a long while from me, although I have been busy with code/patches. So far, I have tackled two areas: generalising and cleaning up the existing Patch testsuite, so I could apply it to the in-progress V3 Prim code later. This has been quite successful, although it took a little longer than I would have liked. With the new structure, (QC) properties for single patches, commutes and merges can be applied to any concrete patch type that supports the respective operations. Therefore, I now have coverage of both V3 Prims as standalone patches (for single patch and commute properties) and also when used with RealPatch (the implementation of non-primitive darcs 2 patches).
The latter part was then to make all these tests pass. Since I finished the respective Read/Show instances yesterday, all tests pass. Commute, apply and friends have been done couple days before that. So the next step is to write more tests that can demonstrate where the code needs to be augmented.
I have slightly changed the actual (on-disk) hunk format slightly. For now, “detached” hunk-text storage is not quite supported, I am keeping that post-midterms. But the format still counts on that being possible. We do need a new monad (class) for writing patches though, since the Show instance is somewhat inadequate: the detached storage needs to be handled somehow.
Anyway, the format now looks like this:
hunk NNN .whitespace_encoded_old .new_text
(we use the same method for encoding whitespace as we do in filenames here). We
might want to change to a format that’s faster to produce/parse, since hunks
typically do contain whitespace. On the other hand, only very short hunks will
be encoded in this form. Also, an empty string is encoded as “!”. So the hunk
text (old and new) can take following forms: .whitespace_encoded, ! (empty) or
“@”. The last form, @ should take 2 parameters, a hash and length, like
@123456789ABCD<...>:65000. That means that most of the time, we
can ignore the hash and only fool around with the numbers (offset and lengths)
when commuting patches. Of course, applying them is a different matter:
nevertheless, we still do have a substantial advantage over V1 prims there,
since each hunk is a simple splice/catenate operation on bytestrings. With V1
prims, we had to chop up the hunk at newlines and remove the +/- signs.
As for implementation, this means we need to abstract commute over a monad
class, which besides commutation failure can express an “fetch text for this
hash” operation. This might be simpler than it was in the
Apply case, a lot
of code had to be modified to accomodate for V3 Prims, since the existing
commute runs in the
Maybe monad and we can make Maybe an instance of
CommuteMonad. Nevertheless, to make actual use of this, toplevel code that
runs commutes will definitely need to be modified, and in effect, all the
intermediate code that relies on
Maybe for commutation will need to be
modified as well. This part could become actually more hairy than was the case
The question of coalesce
Apart from apply and commute, there is currently one more “core” operation in
Darcs patches: coalesce. This operation takes two primitive patches and decides
whether they can be merged into a single primitive patch. This can only happen
if the patches do not commute. Unfortunately, coalesce does not preserve
move a b :> move b c gets coalesced into
move a c,
which modifies its commutation behaviour with (add, remove, move) patches
b. On the other hand, coalesce is normally only used during
operations like amend-record, rebase and when handling the “pending” patch.
All these operations modify the identity of a patch, and therefore it shouldn’t
matter much that coalesce fails to commute with commute.
On the downside, coalesce is currently a first-class operation that cannot be derived from the remaining. Most importantly, it is “redundant” with the diff operation, that constructs patches from two states. The problem is that with current (V1) prims, there is no diff operation for some patches. If we had reliable diff for all patch types, we could implement coalesce in terms of diff, commute and apply (pseudocode):
coalesce context (a :> b) | isJust (commute (a :> b)) = diff context (apply context $ a :> b) | otherwise = a :> b
This would work as long as there was a deterministic diff operation, i.e. one
with the property (for a being a primitive patch) that
diff ctx (apply ctx
a) == a for all a. This
diff operation does not need (and indeed cannot be
made) universal over different patch types, but fortunately that doesn’t
matter. We can always call it with a specific patch type as one of its inputs:
diff TextHunk ctx1 ctx2
I believe this operation should be possible to have, and it would also allow significant improvements in the “record” user experience: darcs could try various differs on a pair of states and offer “better” patches than just hunks (like eg. replace, move, etc.).
The ability to implement coalesce in terms of other operations is important because even more than commute, the implementation of coalesce is O(n^2), with n being the number of different primitive patch types: it needs to take into account any pair of types. With the above approach, since coalesce always yields a single resulting patch, we can implement it as follows:
- try to commute a with b; if this works, there’s nothing to coalesce
- if it fails, take a context in which (a :> b) can be applied, ideally as small as possible (since diff is somewhat expensive)
- try to diff the original and post-a/b context, using both the “a” differ and “b” differ: if either succeeds (i.e. produces a single primitive patch), then we have a winner; otherwise, we cannot coalesce either
An UI digression
This is not GSoC related: you can skip this section if all you care about is GSoC…
Anyway, when I am talking about user interface. I think a substantial
improvement in the way “semantic” patches work would come with the ability to
infer those patches “automatically”. In the spirit of the above
operation that is available with every patch type. However, it is hard to do
a semantic “diff” on significantly divergent repository states. This is simply
because further changes obscure the relationships of entities in question. When
the only thing that changes is a “mv”, it is easy to detect. But when you
also edit the file, it is no longer possible to tell for sure if this is a move
or a new object.
What would help substantially, then, would be to be able to run diff much more often. This prompts a workflow change. Of course, we cannot ask people to commit every time they do a small change. However, we could ask them to “amend” an “in-progress” patch when they do. This would be especially useful if they can be coached into stashing their changes before and after things like applying “sed” to whole codebase, moving around files etc. This would be basically a supercharged version of “darcs mv”: it would say “I did something, you figure what is the right way to represent it”. It adds the burden of having to call this both before and after the “contentious” operation. But it also adds significant benefits (IMHO).
The other thing that would this kind of “open” patch (that you keep adding things to, until you are satisfied, and then you commit) allow is progressive (time-sensitive) revert. This is something that I would be completely sold on: if I kept telling the VCS to note down my changes reasonably often, I could get, in exchange, a whole-repository (but still granular) undo operation.
(It is not hard to imagine that you could also have more than one “open” patch at a time, sorting changes into different buckets for semi-related changes. The UI for that one would be more tricky though.)
The story of Apply
To get back to GSoC though. For what it’s worth, the test-suite part of the
work and a sketch of the V3 prim implementation is already in the
screened branch. The changes to the
Apply class are almost ready for
getting into screened as well; they are currently available as patch635.
The basic challenge with Apply was that V1 prims and V3 prims apply to different kinds of things. While V1 applies to a filesystem tree (basically your working copy), this is not the case with V3 anymore. The V3 state is modelled as a map from UUIDs to Objects. However, it would be extremely uncool to have two different “Apply” type classes to apply different kinds of patches: this would also mean that all higher layers of patch code would need to duplicate their apply implementations. Not cool.
Therefore, associated types to the rescue: I have added an
associated type to the
Apply class. The
Prim level patches then decide what
they can be applied to (currently a
hashed-storage, for V1 and
ObjectMap, which still needs to be fleshed out in more detail, for V3).
Any higher levels inherit their apply state from the prims. Cool.
Of course, it’s not as simple, since we actually have to implement that cool
apply method. This was traditionally (well, since I merged my new annotate
code; not that traditional I guess) been done through the
ApplyMonad used to have operations like “create a file”, “create a
directory” or “write this bytestring to this file”. That’s cool for V1 prims,
but not very useful for V3 prims. So
ApplyMonad needed to be generalised over
multiple apply states. This forced a multi-parameter type class, since there
are no functional dependencies to save us (and therefore associated types do
not apply either). This is because we expect some monads (e.g. IO) to be
instances of both
ApplyMonad. In general, I
didn’t want to limit this to special monads, although we might go for that
option later if it turns out to be superior.
ApplyMonad class is a bit of a meta-class, since the actually
useful set operations is different for different apply state
representations. But since the methods can carry their own constraints, I have
added a couple of fully generic methods (get current state, set current state
and the like) and a set that only applies to
ObjectMap and one that only
Tree. This doesn’t seem to pose significant trouble. Haskell
doesn’t seem to have higher-order type classes that would solve this maybe
slightly more elegantly. (I don’t even know if they are possible. Don’t crucify
me if they aren’t.)
Anyway, long story short, we now have a single
apply method in a single
Apply class, that works on both V1 and V3 prims, as witnessed by running the
same set of tests, which sometimes do invoke
apply, on both implementations.
The story of Commute
There isn’t much of a story behind this one so far. As I outlined above, there
are things coming here as well, but they are not required to allow V3 prims per
se: only needed for the detached storage optimisation. The commute as it is
implemented works, at least as far as tests are concerned. This goes hand in
hand with some kind of
LoadMonad abstraction, that will
actually allow us to implement the detached storage. The
then be a
class (LoadMonad m) => CommuteMonad m where commuteFail :: m a
kind of deal. A LoadMonad superclass constraint can (should) appear on ApplyMonad as well. For now, the instances don’t need to be too elaborate (they can simply fail to fetch anything at all, which will work just fine for V1 prims).
The story of unsafeInterleaveIO
I don’t like unsafeInterleaveIO. At all. My last summer of code was, after a
fashion, about removing a significant source of unsafe, ugly and outright
dangerous lazy IO from darcs. I believe it was a significant success. Now the
StoreMonad abstraction has a potential to rid us of another
source of lazy IO in darcs: currently, the reading of patch content is
unsafeInterleaveIO`d and the rest of the code treats the repository as a
kind-of-pure data structure built of patches. Of course, the unsafeInterleaveIO
is unsafe because it breaks the type system. Since darcs uses it a lot, there
is no telling which value is actually pure. Arguing about runtime behaviour of
lazy code is hard enough when it is actually pure. Random IO thunks lurking
inside pure values (kind of like an alien in Sigourney’s stomach) turn it into
This will involve a lot of code being lifted into a monad, fortunately a significantly restricted monad. In practice, it’ll be IO more often than not (although in the testsuite, it’ll probably be the like of Maybe, or StateT Maybe). What matters is not that the code will execute in IO, but it is statically well protected: it cannot access the IO monad and it has to be explicit about side effects (like loading things from disk). Therefore, we give things types that actually reflect impurity, without allowing them to spin out of control with side effects (like they could if they were simply in the IO monad). A static type system win.
Ok, I guess this is more than enough for today. I’ll try to keep you folks more informed about the progress in the second half of the endeavour. On the other hand, I am a bit worried that these posts are more useful as a note-to-self resource than for general reading by others. Well, let’s hope, dear reader, that you found at least something of this post useful and/or interesting.