Wednesday, March 14, 2012

The Cabal of my dreams

This post outlines changes I think we should make to the Cabal build infrastructure, in order for it to stay relevant to Haskell developers. This post is intentionally high level and uses strong statements; I'm trying to establish a direction to go in, not discuss implementation details. I have a huge amount of respect and appreciation for the Cabal developers; this post is in no way intended as a criticism of them or their work.

Here's what I expect of a build system:

Hermetic builds

Builds should be completely independent. The build system must behave as if all artifacts were rebuilt every time. If any artifacts are shared, they must be both immutable and uniquely determined by the inputs used to create them (e.g. compiler flags, source files, dependencies used, etc.)
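
To make this concrete, here is a minimal Haskell sketch, not anything Cabal actually does, of keying a shared build artifact by everything that went into producing it; the BuildInputs type and artifactKey function are made up purely for illustration:

import Data.Hashable (hash)

-- Everything that can influence the contents of an artifact.
data BuildInputs = BuildInputs
  { compilerVersion :: String
  , compilerFlags   :: [String]
  , sourceHashes    :: [Int]     -- hashes of the source files
  , dependencyIds   :: [String]  -- unique ids of the dependencies used
  }

-- Two builds may share an artifact only if their keys are equal.
artifactKey :: BuildInputs -> Int
artifactKey (BuildInputs ver flags srcs deps) = hash (ver, flags, srcs, deps)

main :: IO ()
main = print (artifactKey (BuildInputs "ghc-7.4.1" ["-O2"] [123, 456] ["bytestring-0.9.2.1"]))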

If artifacts can't be safely shared, you're better off rebuilding every time than trying to share them anyway.

Parallel builds

Real world programs are too large to be built sequentially. Parallel builds are a necessary feature; soon distributed builds will be as well.

Changes needed to Cabal

cabal build means "build my package"

If I'm in a source directory and type cabal build, I want Cabal to build my package, not tell me about steps I could take to build my package. cabal build should

  • configure the package, even if I haven't configured the package before,
  • build any dependencies and store them locally in the source tree (optionally sharing them with other builds if that can be done safely), and
  • build the package.

Aside: configure very much feels like an unnecessary step. It would be better if build took all the flags configure takes today, so we never have to run configure again. This is after all what cabal install does today.

cabal test means "test my package"

If I'm in a source directory and type cabal test, I want Cabal to run my test suites. It is understood that to run the test suites they first need to be built, so go build and run them already!

I don't know what cabal install means

At the end of the day we build code to produce executables. When I type cabal install <library> I'm doing busy work that the build system should do for me, manually telling it about dependencies it might need to install later. cabal install should, except perhaps in the case of executables, be replaced by cabal build as described above.

The only reason I can see for running cabal install <library> manually is to save some compilation time in the future or to make sure the packages exist locally, e.g. before getting on a plane.

No package is an island

We frequently have to work on multiple dependent packages at once and thus we also need a way to build them together, without registering the by-products in some shared package database. For example, given this directory structure

src
  |-- foo
  |   |-- foo.cabal
  |-- bar
      |-- bar.cabal

where foo is a package that depends on bar, I should be able to do

cd src
cabal build foo

and that better traverse the directory structure, find and build the dependencies, and use them when compiling foo.

Aside: This could be implemented in a few different, more or less explicit, ways e.g. by specifying the directories used to search for dependencies explicitly, as in

cd src/foo
cabal build --root-dir=~/src

Parallel builds

We're almost there. Parallel builds were developed by a GSoC student last year. We need to get the patches merged into Cabal.

Building specific components within a package

Not as important as the above changes, but having cabal build build all components in a single package starts getting annoying when you have several of them. For example, if you want to run a single test suite, but the package defines several, you end up unnecessarily linking binaries you won't run.

Proposal: take the component to build as an extra, optional argument to build. For example,

cd src/foo
cabal build some-executable

would only build some-executable. Combine that with multi-package builds and we can use

cd src
cabal build foo:some-component

to build a specific component in a specific package in a source tree of multiple packages.

Friday, March 9, 2012

Improvements to HashMap and HashSet creation

I mentioned in my last post that I'd like to improve the performance of fromList for all types exported by the unordered-containers package. Twan van Laarhoven beat me to it and his changes, together with some additional optimizations by me, made for quite a nice speed-up:

(fromListWith should also see similar improvements.)

The improvements are in the unordered-containers-0.2.0.1 release.

I also plan to add additional functions, e.g. unfoldr and unfoldrM, that will allow efficient bulk creation of maps and sets. For example, unfoldrM will let you efficiently create a map (or set) from the content of a file.
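
To give a feel for what I have in mind, here is a naive stand-in for unfoldrM built on top of the existing insert; the signature is only my guess at what the final function might look like, and the real version would build the map in bulk rather than one insert at a time:

import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import System.IO (isEOF)

-- Monadically unfold key-value pairs into a map.
unfoldrM :: (Eq k, Hashable k, Monad m)
         => (b -> m (Maybe ((k, v), b))) -> b -> m (HM.HashMap k v)
unfoldrM step = go HM.empty
  where
    go acc s = do
      r <- step s
      case r of
        Nothing           -> return acc
        Just ((k, v), s') -> go (HM.insert k v acc) s'

-- Example: record which words appear on standard input, one per line.
main :: IO ()
main = do
  m <- unfoldrM readWord ()
  print (HM.keys m)
  where
    readWord () = do
      eof <- isEOF
      if eof
        then return Nothing
        else do w <- getLine
                return (Just ((w, ()), ()))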

Tuesday, March 6, 2012

Announcing unordered-containers 0.2

I'm proud to announce a new major release of the unordered-containers package. This release is the result of over a year's work on completely replacing the underlying data structure used in the implementation of HashMap and HashSet.

The old Patricia trie based implementation has been replaced by a new implementation based on hash array mapped tries (HAMT.) A HAMT is a trie implemented as an n-ary tree, where n is typically 16 or 32. In addition to having a higher branching factor, HAMTs reduce space overhead by not representing empty subtrees at all. You can read more about HAMTs in Phil Bagwell's paper.

The higher branching factor lets us trade a small insert/delete performance regression against a large lookup performance improvement. In addition, the new data structure uses less memory and its better data locality should make it possible to improve its performance in the future.
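
For readers who haven't seen a HAMT before, here is a toy sketch of the node structure and the bitmap indexing trick for 16-way branching; the real types in Data.HashMap look different (and use flat arrays rather than lists), so treat this purely as an illustration:

import Data.Bits (popCount, shiftR, (.&.), bit)
import Data.Word (Word16)

data HAMT k v
  = Empty
  | Leaf Int k v             -- full hash, key, value
  | Node Word16 [HAMT k v]   -- bitmap plus only the non-empty children

-- Each level of the trie consumes 4 bits of the hash.
slotAt :: Int -> Int -> Int
slotAt depth h = (h `shiftR` (4 * depth)) .&. 0xF

-- Position of a slot's child in the compressed child list: the number of
-- set bits in the bitmap below that slot.
childIndex :: Word16 -> Int -> Int
childIndex bitmap slot = popCount (bitmap .&. (bit slot - 1))

main :: IO ()
main = print (childIndex 0x0035 4)  -- bits 0, 2, 4 and 5 are set, so slot 4 is child number 2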

Benchmark results

I used my standard benchmark suite to compare the performance of the old and new implementations.

Don't pay too much attention to the x-axis in the graphs below; it displays the time taken to run a batch of operations, not a single operation. Instead look at the relative improvement (or regression.)

The benchmarks were run using a 32-bit GHC 7.4.1 on a 2.3 GHz Intel Core i7 MacBook Pro, using the default RTS settings (i.e. no +RTS flags.)

lookup

The benchmarks named <something>-miss measure the time taken when the key is not present in the map.

There's not much to say here, except that the better data locality of the HAMT makes for a big difference in lookup performance.

insert

The benchmarks named <something>-dup measure the time taken when the key and value were already present in the map.

There are two things to note here. First, even though we're copying 8 times as much data when rebuilding the node at each level of the tree, the insert performance isn't much worse than in the old implementation. This is partly due to some changes I made to GHC but mostly due to modern CPUs being good at doing things in bulk, like copying blocks of memory in cache.

The second thing to note is that inserting an already existing key-value pair is faster in the new implementation. This is due to a really clever optimization that uses pointer equality (in a safe way) to check if the key and value are already present in the map and, if so, avoids rebuilding the tree.
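
The idea, roughly, is the following; this is a sketch of the trick rather than the exact code in the library:

{-# LANGUAGE MagicHash #-}
import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

-- True only if the two arguments are the very same heap object. A False
-- answer tells us nothing, which is what makes this safe to use as a
-- shortcut: we only skip work when we are sure nothing changed.
ptrEq :: a -> a -> Bool
ptrEq x y = isTrue# (reallyUnsafePtrEquality# x y)

-- When insert reaches a leaf that already holds exactly this key and value,
-- returning the old leaf unchanged lets every level above notice (again via
-- ptrEq) that its child did not change, so no part of the tree is rebuilt.

main :: IO ()
main = do
  let x = "hello"
  print (ptrEq x x)  -- should print True: both arguments are the same object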

delete

The results here are similar to those for insert, except that the pointer equality optimization works even better here.

Bulk operations

Bulk operations ought to be faster on the HAMT, given its more chunky representation. However, at the moment these operations are slightly slower (but the absolute numbers are still low.) I hope to be able to optimize these more in the future, likely by improving the code GHC generates for loops over arrays.

Memory usage

The theoretical overhead per key-value pair in the old implementation is 8 words, while it's only ~4.5 words in the new implementation. Real world measurements might not show quite as big of an improvement, but the new data structure definitely uses less memory in the real world measurements I've done.
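
To put those numbers in perspective: on the 32-bit GHC used for the benchmarks above a word is 4 bytes, so the per-pair overhead drops from about 8 × 4 = 32 bytes to roughly 4.5 × 4 = 18 bytes, not counting the keys and values themselves; on a 64-bit GHC the same word counts work out to 64 versus about 36 bytes.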

Putting things in perspective

To see how far we have come since the days before unordered-containers, here's a comparison between the latest HashMap implementation and the current Map implementation:

(Note that I reran the benchmarks to create this graph, so the numbers might not match the other graphs exactly.)

Backwards compatibility

The API is fully backwards compatible. I bumped the major version number to

  • indicate the increased risk of stability issues due to the major internal overhaul, and
  • celebrate the occasion. :)

If your code worked with 0.1, it should work with 0.2.

The future

The new implementation leaves room for future performance improvement. I believe the old Patricia tree implementation had reached its potential; there's only so much you can do with a binary tree before cache misses, due to poor data locality, dictate performance.

I know of at least one use case that can be optimized a lot, but that I haven't tackled yet due to lack of time: linear use of the data structure. For example, when creating a new map from a list of key-value pairs, using fromList. This use case can be sped up by perhaps as much as 2x by repeatedly mutating the HAMT behind the scenes and then freezing it before returning it to the user.
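
The general shape is the usual mutate-then-freeze trick. Here is what it looks like for an ordinary vector, using the vector package rather than the HashMap internals, just to illustrate the pattern:

import Control.Monad.ST (runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as MV

-- Fill a vector with destructive writes inside runST, then freeze it into
-- an immutable value before handing it back to the caller.
squares :: Int -> V.Vector Int
squares n = runST $ do
  mv <- MV.new n
  mapM_ (\i -> MV.write mv i (i * i)) [0 .. n - 1]
  V.unsafeFreeze mv

main :: IO ()
main = print (squares 5)  -- prints [0,1,4,9,16]

fromList would do the same thing with the HAMT's internal arrays, keeping the mutation invisible to callers.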

Finally, GHC doesn't generate ideal code for loops over arrays or the array usage patterns in my implementation. Hopefully future improvements to GHC's treatment of array based code will show up as performance improvements in unordered-containers.