Monday, January 3, 2011

Haskell library improvements I'd like to see

At hackathons I often end up chatting with people about changes I'd like to see in some of Haskell's core libraries. As always, there are many more changes I'd like to make than I have time to make. I'm posting some of my "to-dos" here in hope that someone with some spare time will pick them up.

Improvements to the binary package

In the binary package, add incremental input support to Data.Binary.Get. This would allow users to parse large inputs read from e.g. a file, without having to resort to lazy I/O. The API would be quite simple, add a new data type:

data Result r =
      Fail !ByteString !Int64
    | Partial (ByteString -> Result r)  
    | Done r !ByteString !Int64

Both the Fail and Done constructors contain the current parse state. This helps debugging and error reporting in the case of Fail and makes it possible to hand the remaining input to some other function (or parser) in the case of Done. In addition to this data type, we need a function to run a parser:

runGetPartial :: Get a -> Result a

That's it! The hard part is to implement this API while keeping the great performance of the current implementation. I believe Lennart Kolmodin had a working implementation of this design, but I can't find the code.

I'd also like to see the implementation techniques used in the blaze-builder package ported to Data.Binary.Builder to improve the performance of builders. Data.Binary.Builder has a nice, simple API and a lot of users (via Data.Binary.Put). Giving those users some free performance would be a good thing.

If I'd undertake this project myself I'd start by writing some Criterion benchmarks for the parser, inspired by the current set of benchmarks, and porting all of the blaze-builder benchmarks to the binary package.

Improvements to the text package

In the text package, improve the performance of the lazy text builder in Data.Text.Lazy.Builder, using the same blaze-binary implementation techniques mentioned above.

I'd also add a rewrite rule for fromText/unpackCString# that would transcode a GHC string literal directly from UTF-8 to UTF-16 (which is what the text package uses internally) instead of going via a String, which is what happens now.

Warning

All of the above changes will likely require you to read Core. If you're unfamiliar with Core you can take a peek at my slide deck from last year's CUFP, which has a few slides about reading Core.

14 comments:

  1. I think it would be a good idea if we had a mentor program, where mentors (such as yourself) could put up these kinds of project ideas and help newer Haskellers get it done. It would allow newer people to get experience on serious code while having guidance, and allow more experienced users to offload some of the work (and maintenance) they do.

    ReplyDelete
  2. My work was made public on http://haskell.org/~kolmodin , but I see now that my account is gone (along with many other homepages).
    I could possibly make it available through http://code.haskell.org but (surprisingly enough) they're not very darcs friendly.
    Suggestions?

    ReplyDelete
  3. darcsden and patch-tag offer public darcs hosting.

    ReplyDelete
  4. @Lennart Kolmodin
    I know darcs is "the haskell way", but github works great for tossing up some random code.

    ReplyDelete
  5. Michael, I think that's a great idea. To start with, I'm willing to mentor anyone who wants to work on the two projects mentioned in this post.

    ReplyDelete
  6. Lennart,

    I recommend GitHub. There seems to be a lot of Haskell code (mine included) there nowadays. This means you'll have to use Git of course.

    ReplyDelete
  7. With regards to binary. I think cereal provides much better starting point. Get monad from binary is just a state monad and adding error handling and incremental parsing will require complete rewrite.

    Cereal uses continuations and adding support for incremental parsing is more or less straightforward. It's very similar to attoparsec which has incremental parsing.

    Actually I have incomplete implementation. It's not tested, not benchmarked but seems to work. Ufortunately I don't have time now but I'm willing to share code if someone is interested in it.

    BTW this will allow to parse both strict and lazy bytestings with same parser.

    ReplyDelete
  8. Шимууар, I just wrote a new post on my thoughts on binary and cereal.

    ReplyDelete
  9. Maybe binary could just use blaze-builder and save implementation efforts? Than blaze builder should be split in two parts in order to remove text dependency

    BTW you can report you wishes for text in its bug tracker:
    https://bitbucket.org/bos/text/issues

    ReplyDelete
  10. Aleksey,

    In my mind blaze-builder is a higher level library than binary so I'd prefer if we port Simon's excellent implementation to binary instead of adding a dependency on blaze-builder. As I understand it blaze-builder is also very much in development still so it would perhaps be a bit early to add it as a dependency.

    Regarding text, I'm the author of Data.Text.Lazy.Builder, so any feature request ticket in that area will just come back to me. I might still end up doing it myself some day, but if someone else found the time I'd be happy.

    ReplyDelete
  11. blaze-builder is quite high level but it have low level core which could be moved in the separate package so other people could use it. Core is going to be much smaller and I presume it will be much easier to stabilize. Also blaze-builder reimplement API from Data.Binary.Put so it already have same functionality. I think it's pointless to spend efforts to duplicate existing functionality.

    There is advantage in having same types. It could be quite annoying to have several serializers with nearly identical API, similar implementation and performance. Yet it's not possible to compose them becasue they have different types.

    ReplyDelete
  12. I know this post is quite old, but I just came across some evidence (working on mustache2hs) that Text.Lazy.Builder is quite a bit slower than Blaze Builder. I looked at the implementations and they're very different, but I'm not very familiar with either of their internals.

    I wonde if the Blaze Builder approach could be generalised somehow so that ByteString and Text and even other sorts of things could be Builders with less code duplication?

    ReplyDelete
    Replies
    1. I think it's worth taking a look and see if we can make the text builder faster. We can't use the same implementation though as blaze-builder uses pointer arithmetic, which requires that the memory is pinned, which is the case for ByteString but not for Text.

      I suspect the largest gain will be had by making GHC better at compiling continuation-heavy code, but there might be possible to make some improvements even without doing that.

      Delete
  13. This comment has been removed by the author.

    ReplyDelete