Saturday, January 15, 2011

Setting up a Haskell development environment on Windows

This blog post describes how to set up a basic Haskell development environment on Windows. This is the setup I use to make sure that the network package continues to work on Windows.

The Haskell Platform has everything you need to get started, but it doesn't allow you to build certain packages that require Unix tools, like autotools. For that, you need MSYS (or Cygwin, but MSYS works better).

Installing MSYS

  1. Install the latest Haskell Platform. Use the default settings.

  2. Download version 1.0.11 of MSYS. You'll need the following files:

       • MSYS-1.0.11.exe
       • msysDTK-1.0.1.exe
       • msysCORE-1.0.11-bin.tar.gz

    The files are all hosted on haskell.org as they're quite hard to find in the official MinGW/MSYS repo.

  3. Run MSYS-1.0.11.exe followed by msysDTK-1.0.1.exe. The former asks you if you want to run a normalization step. You can skip that.

  4. Unpack msysCORE-1.0.11-bin.tar.gz into C:\msys\1.0. Note that you can't do that using an MSYS shell, because you can't overwrite the files in use, so make a copy of C:\msys\1.0, unpack it there, and then rename the copy back to C:\msys\1.0.

  5. Add C:\Program Files\Haskell Platform\VERSION\mingw\bin to your PATH. This is necessary if you ever want to build packages that use a configure script, like network, as configure scripts need access to a C compiler.

You now have a basic Haskell development setup and you should be able to install more packages using the cabal command line tool.
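To sanity-check the PATH change from step 5, a small Haskell script can look up the tools that configure-based packages rely on. This is only a sketch: the tool list (ghc, cabal, gcc, sh) is my assumption about what this setup should expose, and the names requiredTools, locateTools, describe, and report are made up for illustration.

```haskell
import System.Directory (findExecutable)

-- Tools this setup is expected to expose. The list itself is an assumption;
-- adjust it to whatever your packages need.
requiredTools :: [String]
requiredTools = ["ghc", "cabal", "gcc", "sh"]

-- Look up each tool on the PATH, pairing its name with the resolved location.
locateTools :: [String] -> IO [(String, Maybe FilePath)]
locateTools = mapM (\tool -> fmap ((,) tool) (findExecutable tool))

-- Render one lookup result as a human-readable line.
describe :: (String, Maybe FilePath) -> String
describe (tool, Just path) = tool ++ ": " ++ path
describe (tool, Nothing)   = tool ++ ": NOT FOUND on PATH"

-- Print a report for all required tools.
report :: IO ()
report = locateTools requiredTools >>= mapM_ (putStrLn . describe)
```

Load the file in GHCi, started from an MSYS shell, and run report. Any tool reported as missing points to a PATH entry that still needs fixing.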

Wednesday, January 5, 2011

Merging binary and cereal

In the last post I mentioned a few changes I'd like to see in the binary package. A commenter pointed out that the cereal package might provide a better starting point for implementing the API I want, due to its continuation based implementation.

The commenter might well be right. I don't really know whether the current cereal code base or Lennart's binary branch would serve as a better starting point. My understanding is that both implementations are very similar (i.e. continuation based.) I'd go with the faster one, whichever it is.

I think binary and cereal could benefit from a merger. Both libraries are very similar; they have 23 functions in common. The main differences are:

  • In binary, runGet takes a lazy ByteString; in cereal it takes a strict one. The cereal version also allows for parse error handling by returning the result wrapped in Either. Both the lazy and strict versions of runGet are easily implemented [1] in terms of runGetPartial, as presented in the last post.

  • cereal provides isolate, a function that lets you run a parser on a part of the input, in isolation. It would be a nice addition to the binary interface. In fact, Duncan, Lennart, and I discussed adding such a feature to binary during ZuriHac.

  • cereal provides 11 functions for working with containers. These are nice to have, but I'd prefer if they were split off into a separate package in order to remove the dependency on containers. Parsing different sized machine words in different byte orders is much more fundamental than parsing e.g. a serialized Map (we might want to use a different map type in a year, but machine words will most likely still be 32 or 64 bits).
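To make the first point concrete, here is a rough sketch of how strict and lazy runners could share one incremental parser. The Result type is copied from the previous post so the snippet compiles on its own. Everything else is assumed for illustration: the names runLazy, runStrict, and anyByte, the convention that an empty chunk signals end of input, and the choice to return Either from both runners (which is not binary's current runGet type).

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)
import Data.Word (Word8)

-- The Result type proposed in the previous post.
data Result r
    = Fail !S.ByteString !Int64
    | Partial (S.ByteString -> Result r)
    | Done r !S.ByteString !Int64

-- Lazy runner: feed the chunks of a lazy ByteString one at a time.
-- Assumption: feeding an empty chunk tells the parser the input has ended.
runLazy :: Result a -> L.ByteString -> Either String a
runLazy r0 = go r0 . L.toChunks
  where
    go (Done a _ _) _      = Right a
    go (Fail _ pos) _      = Left ("parse error after " ++ show pos ++ " bytes")
    go (Partial k)  (c:cs) = go (k c) cs
    go (Partial k)  []     = case k S.empty of
        Partial _ -> Left "not enough input"
        r         -> go r []

-- Strict runner: a strict ByteString is just a lazy one with a single chunk.
runStrict :: Result a -> S.ByteString -> Either String a
runStrict r0 input = runLazy r0 (L.fromChunks [input])

-- Toy stand-in for what runGetPartial getWord8 might return.
anyByte :: Result Word8
anyByte = Partial step
  where
    step chunk
        | S.null chunk = Fail chunk 0
        | otherwise    = Done (S.head chunk) (S.tail chunk) 1
```

With this, runStrict anyByte and runLazy anyByte run the same parser on either input type. Because both runners return Either, the caller has to force enough input to learn whether the parse succeeded.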

Below is a complete list of the differences.

Functions that exist in binary but not in cereal:

  • bytesRead,
  • getLazyByteStringNul, and
  • getRemainingLazyByteString.

Functions that exist in cereal but not in binary:

  • isolate,
  • label,
  • lookAheadM, and
  • 11 container related functions.

If the libraries merged, we could:

  • Parse both strict and lazy ByteStrings with the same parser.

  • Reduce implementation effort and user confusion.

According to this list of reverse dependencies, many more packages depend directly on binary than on cereal, which suggests that cereal should be absorbed into binary, and not the other way around, to reduce breakage.

  1. Not quite true. You can't have parse error handling and lazy parsing while also keeping binary's current type for runGet. Knowing whether a parse error occurred requires forcing enough of the input to check. Right now binary doesn't really have a story for error checking.

Monday, January 3, 2011

Haskell library improvements I'd like to see

At hackathons I often end up chatting with people about changes I'd like to see in some of Haskell's core libraries. As always, there are many more changes I'd like to make than I have time to make. I'm posting some of my "to-dos" here in hope that someone with some spare time will pick them up.

Improvements to the binary package

In the binary package, add incremental input support to Data.Binary.Get. This would allow users to parse large inputs read from e.g. a file, without having to resort to lazy I/O. The API would be quite simple. First, add a new data type:

data Result r =
      Fail !ByteString !Int64
    | Partial (ByteString -> Result r)  
    | Done r !ByteString !Int64

Both the Fail and Done constructors contain the current parse state. This helps debugging and error reporting in the case of Fail and makes it possible to hand the remaining input to some other function (or parser) in the case of Done. In addition to this data type, we need a function to run a parser:

runGetPartial :: Get a -> Result a

That's it! The hard part is to implement this API while keeping the great performance of the current implementation. I believe Lennart Kolmodin had a working implementation of this design, but I can't find the code.
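To illustrate how a caller might drive such a Result without lazy I/O, here is a minimal sketch. The Result type is repeated so the snippet is self-contained. The names parseWith and word16be are hypothetical, word16be is a hand-written stand-in for what runGetPartial getWord16be might return, and the convention that an empty chunk means end of input is an assumption, not part of the proposal above.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, (.|.))
import Data.Int (Int64)
import Data.Word (Word16)

-- The Result type proposed above.
data Result r
    = Fail !B.ByteString !Int64
    | Partial (B.ByteString -> Result r)
    | Done r !B.ByteString !Int64

-- Drive a parser by repeatedly asking `refill` for the next chunk.
-- Assumption: `refill` returns an empty chunk once the input is exhausted.
parseWith :: Monad m => m B.ByteString -> Result r -> m (Either String r)
parseWith refill = go
  where
    go (Done r _ _) = return (Right r)
    go (Fail _ pos) = return (Left ("parse error after " ++ show pos ++ " bytes"))
    go (Partial k)  = do
        chunk <- refill
        case k chunk of
            -- Still hungry after end of input: give up instead of looping.
            Partial _ | B.null chunk -> return (Left "unexpected end of input")
            next                     -> go next

-- Toy parser: a big-endian 16-bit word, buffering input across chunks.
word16be :: Result Word16
word16be = go B.empty
  where
    go acc = Partial $ \chunk ->
        let buf = B.append acc chunk
        in if B.null chunk
             then Fail buf 0            -- input ended before two bytes arrived
             else if B.length buf >= 2
               then Done (fromIntegral (B.index buf 0) `shiftL` 8
                            .|. fromIntegral (B.index buf 1))
                         (B.drop 2 buf) 2
               else go buf              -- keep the partial byte and ask for more
```

Reading from a file would then amount to passing something like B.hGetSome handle 32768 as the refill action, since hGetSome returns an empty ByteString at end of file.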

I'd also like to see the implementation techniques used in the blaze-builder package ported to Data.Binary.Builder to improve the performance of builders. Data.Binary.Builder has a nice, simple API and a lot of users (via Data.Binary.Put). Giving those users some free performance would be a good thing.
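For reference, this is the kind of code that would get faster for free. It only uses functions that Data.Binary.Builder already exports; the record layout (a one-byte tag followed by a big-endian 16-bit payload) and the names record and encodeRecords are invented for the example.

```haskell
import qualified Data.Binary.Builder as B
import qualified Data.ByteString.Lazy as L
import Data.Monoid (mappend, mconcat)
import Data.Word (Word16)

-- Hypothetical record layout: a tag byte, then a big-endian 16-bit payload.
record :: Word16 -> B.Builder
record w = B.singleton 0x01 `mappend` B.putWord16be w

-- Serialize a list of records by concatenating many small builders,
-- which is the kind of workload a builder benchmark would exercise.
encodeRecords :: [Word16] -> L.ByteString
encodeRecords = B.toLazyByteString . mconcat . map record
```

A faster builder implementation would leave user code like this unchanged.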

If I were to undertake this project myself, I'd start by writing some Criterion benchmarks for the parser, inspired by the current set of benchmarks, and porting all of the blaze-builder benchmarks to the binary package.

Improvements to the text package

In the text package, improve the performance of the lazy text builder in Data.Text.Lazy.Builder, using the same blaze-builder implementation techniques mentioned above.

I'd also add a rewrite rule for fromText/unpackCString# that would transcode a GHC string literal directly from UTF-8 to UTF-16 (which is what the text package uses internally) instead of going via a String, which is what happens now.

Warning

All of the above changes will likely require you to read Core. If you're unfamiliar with Core you can take a peek at my slide deck from last year's CUFP, which has a few slides about reading Core.