Wednesday, January 5, 2011

Merging binary and cereal

In the last post I mentioned a few changes I'd like to see in the binary package. A commenter pointed out that the cereal package might provide a better starting point for implementing the API I want, due to its continuation based implementation.

The commenter might well be right. I don't really know whether the current cereal code base or Lennart's binary branch would serve as a better starting point. My understanding is that both implementations are very similar (i.e. continuation based). I'd go with the faster one, whichever it is.

I think binary and cereal could benefit from a merger. Both libraries are very similar; they have 23 functions in common. The main differences are:

  • In binary, runGet takes a lazy ByteString; in cereal it takes a strict one. The cereal version also allows for parse error handling by returning the result wrapped in Either. Both the lazy and strict versions of runGet are easily implemented [1] in terms of runGetPartial, as presented in the last post.

  • cereal provides isolate, a function that lets you run a parser on a part of the input, in isolation. It would be a nice addition to the binary interface. In fact, Duncan, Lennart, and I discussed adding such a feature to binary during ZuriHac.

  • cereal provides 11 functions for working with containers. These are nice to have, but I'd prefer if they were split off into a separate package in order to remove the dependency on containers. Parsing different sized machine words in different byte orders is much more fundamental than parsing e.g. a serialized Map (we might want to use a different map type in a year, but machine words will most likely still be 32 or 64 bits).
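To make the first point above concrete, here is a minimal sketch of how strict and lazy runners could share one incremental parser. It is not the code of either library: the Result type only approximates the one from the last post, the Parser synonym stands in for runGetPartial applied to a Get, and the names runGetStrict, runGetLazy, and word16be are made up for this illustration. Both runners return Either, which is the cereal style and not binary's current type for runGet.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.Word (Word16)

-- Assumed shape of the incremental result type (illustrative only).
data Result a
  = Fail String                         -- parse error with a message
  | Partial (B.ByteString -> Result a)  -- needs more input; an empty chunk means end of input
  | Done a B.ByteString                 -- parsed value plus unconsumed input

instance Functor Result where
  fmap _ (Fail e)    = Fail e
  fmap f (Partial k) = Partial (fmap f . k)
  fmap f (Done a r)  = Done (f a) r

-- Hypothetical stand-in for "runGetPartial applied to some Get a".
type Parser a = B.ByteString -> Result a

-- Demand exactly n bytes, suspending with Partial while input is short.
getBytes :: Int -> Parser B.ByteString
getBytes n s
  | B.length s >= n = let (h, t) = B.splitAt n s in Done h t
  | otherwise       = Partial $ \more ->
      if B.null more
        then Fail "not enough bytes"
        else getBytes n (s `B.append` more)

-- A 16-bit big-endian word built on top of getBytes.
word16be :: Parser Word16
word16be = fmap toWord . getBytes 2
  where
    toWord bs = fromIntegral (B.index bs 0) * 256 + fromIntegral (B.index bs 1)

-- Strict runner: all input is available up front, so a request for
-- more input is answered with the empty chunk (end of input).
runGetStrict :: Parser a -> B.ByteString -> Either String a
runGetStrict p s = finish (p s)
  where
    finish (Done a _)  = Right a
    finish (Fail e)    = Left e
    finish (Partial k) = finish (k B.empty)

-- Lazy runner: feed one chunk at a time, so only as many chunks
-- as the parser asks for are ever forced.
runGetLazy :: Parser a -> L.ByteString -> Either String a
runGetLazy p lbs = go (p B.empty) (L.toChunks lbs)
  where
    go (Done a _)  _      = Right a
    go (Fail e)    _      = Left e
    go (Partial k) (c:cs) = go (k c) cs
    go (Partial k) []     = go (k B.empty) []
```

The same word16be parser runs over a strict ByteString via runGetStrict and over a chunked lazy one via runGetLazy, even when the two bytes straddle a chunk boundary. The sketch assumes a parser never answers the end-of-input chunk with another Partial; a real implementation would have to enforce that.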

Below is a complete list of the differences.

Functions that exist in binary but not in cereal:

  • bytesRead,
  • getLazyByteStringNul, and
  • getRemainingLazyByteString.

Functions that exist in cereal but not in binary:

  • isolate,
  • label,
  • lookAheadM, and
  • 11 container related functions.
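As a rough illustration of isolate and label, here is a small sketch against cereal's Data.Serialize.Get module. The record format (one length byte, then a tag byte and a body that together fill exactly that many bytes) and the names Record, getRecord, and getTagOnly are invented for this example; only runGet, isolate, label, getWord8, and getBytes come from cereal.

```haskell
import qualified Data.ByteString as B
import Data.Serialize.Get (Get, runGet, isolate, label, getWord8, getBytes)
import Data.Word (Word8)

-- Hypothetical record: a tag byte followed by an opaque body.
data Record = Record { recTag :: Word8, recBody :: B.ByteString }
  deriving (Eq, Show)

-- Read a length byte, then parse the record inside exactly that many bytes.
-- label attaches a name that shows up in the error message on failure.
getRecord :: Get Record
getRecord = label "record" $ do
  len <- fromIntegral <$> getWord8
  isolate len $ do
    tag  <- getWord8
    body <- getBytes (len - 1)
    return (Record tag body)

-- Reads only the tag inside the isolated region. With cereal's isolate
-- this fails whenever len is not 1, because input is left unconsumed.
getTagOnly :: Get Word8
getTagOnly = do
  len <- fromIntegral <$> getWord8
  isolate len getWord8
```

Running runGet getRecord on the bytes 3, 7, 1, 2 should yield a Record with tag 7 and a two-byte body, while runGet getTagOnly on 2, 7, 9 should return a Left, since the inner parser leaves one isolated byte unread.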

If the libraries merged, we could:

  • Parse both strict and lazy ByteStrings with the same parser.

  • Reduce implementation effort and user confusion.

According to this list of reverse dependencies, many more packages depend directly on binary than on cereal, which suggests that cereal should be absorbed into binary, and not the other way around, to reduce breakage.

  1. Not quite true. You can't have parse error handling, lazy parsing, and binary's current type for runGet all at once. Knowing whether a parse error occurred requires forcing enough of the input to check. Right now binary doesn't really have a story for error checking.

6 comments:

  1. isolate goes a little bit further than just isolating to that portion of the input. It also requires the Get operation passed to consume exactly the input that is provided.

    Cereal also provides a debugging interface that will dump a stack trace when a parse error happens. You can hook into this functionality via the label function.

    I would really like to see these functions make their way into a merger, as well as some method of error handling; they have been invaluable in day to day work for me.

  2. The "cereal package" link is broken, it points to "http://package.haskell.org/package/cereal".

  3. moltar,

    You're right that isolate does a little bit more than I described. I'm not sure if requiring the parser to consume exactly the specified amount of input is an essential feature of isolate. I'll have to think about it.

    Error handling would definitely make it into the merger as the Result data type I described in the last post includes a Fail constructor which is intended to be used for that purpose.

    Stack traces are nice if they don't hurt performance too much. I believe Lennart added them to his branch of binary but they hurt performance a lot.

  4. I think a merger is the best-case scenario, and cereal should be merged into binary. But I think a critical component is missing: who will make these modifications?

    Also, Get from cereal has an Alternative instance, whereas the one from binary doesn't.

    If we had a function that checks for end of input, a version of isolate that checks that all input has been consumed could be written as:

    isolate' n g = isolate n (g <* eof)

  5. Aleksey,

    I'm not sure who will make the modifications. If I had the time I would attempt them myself, but as I mentioned in the previous posts, these are projects I'd like to see done but don't have the time to work on. Hopefully I can convince the maintainers that this is a good thing and they will have some time to work on it together.

    Yes, there is the Alternative instance. I think the "standard" continuation based design can support that without too much trouble (e.g. attoparsec already does).
