Tuesday, March 6, 2012

Announcing unordered-containers 0.2

I'm proud to announce a new major release of the unordered-containers package. This release is the result of over a year's work on completely replacing the underlying data structure used in the implementation of HashMap and HashSet.

The old Patricia trie based implementation has been replaced by a new implementation based on hash array mapped tries (HAMTs). A HAMT is a trie implemented as an n-ary tree, where n is typically 16 or 32. In addition to having a higher branching factor, HAMTs reduce space overhead by not representing empty subtrees at all. You can read more about HAMTs in Phil Bagwell's paper.
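
To make that concrete, here is a minimal sketch of what a HAMT can look like in Haskell, assuming a branching factor of 16. The names and layout are illustrative only, not the package's actual definitions:

    import Data.Primitive.Array (Array)

    type Hash   = Int
    type Bitmap = Int

    data HashMap k v
      = Empty
      | Leaf !Hash k v                                -- hash cached in the leaf
      | BitmapIndexed !Bitmap !(Array (HashMap k v))  -- sparse node: the bitmap
                                                      -- records which of the 16
                                                      -- slots are occupied, so
                                                      -- empty subtrees cost nothing
      | Full !(Array (HashMap k v))                   -- dense node: all 16 slots used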

The higher branching factor lets us trade a small insert/delete performance regression for a large lookup performance improvement. In addition, the new data structure uses less memory, and its better data locality should make it possible to improve its performance further in the future.

Benchmark results

I used my standard benchmark suite to compare the performance of the old and new implementations.

Don't pay too much attention to the x-axis in the graphs below; it displays the time taken to run a batch of operations, not a single operation. Instead, look at the relative improvement (or regression).

The benchmarks were run using a 32-bit GHC 7.4.1 on a 2.3 GHz Intel Core i7 MacBook Pro, using the default RTS settings (i.e. no +RTS flags).

lookup

The benchmarks named <something>-miss measure the time taken when the key is not present in the map.

There's not much to say here, except that the better data locality of the HAMT makes a big difference in lookup performance.
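
To illustrate why the depth matters, here is a rough sketch (the names are mine, not the library's): with a branching factor of 16, each level of the trie consumes 4 bits of the hash, so a map with a million keys is only about 5 levels deep, compared to about 20 levels for a binary trie.

    import Data.Bits ((.&.), shiftR)

    -- Pick one of 16 children by consuming 4 bits of the hash per level.
    childIndex :: Int -> Int -> Int
    childIndex hash depth = (hash `shiftR` (4 * depth)) .&. 0xF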

insert

The benchmarks named <something>-dup measure the time taken when the key and value were already present in the map.

There are two things to note here. First, even though we're copying 8 times as much data when rebuilding the node at each level of the tree, insert performance isn't much worse than in the old implementation. This is partly due to some changes I made to GHC, but mostly due to modern CPUs being good at doing things in bulk, like copying blocks of memory in cache.

The second thing to note is that inserting an already existing key-value pair is faster in the new implementation. This is due to a clever (if slightly evil) optimization that uses pointer equality (in a safe way) to check whether the key and value are already present in the map and, if so, avoids rebuilding the tree.
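
The gist of the trick, as a minimal sketch (the helper name is mine, and the exact spelling of the primop wrapper varies between GHC versions):

    {-# LANGUAGE MagicHash #-}
    import GHC.Exts (isTrue#, reallyUnsafePtrEquality#)

    -- True only when the two arguments are the very same heap object.
    -- A False result is inconclusive, which is what makes this safe:
    -- we only skip rebuilding the tree when we know for certain that
    -- nothing would change.
    ptrEq :: a -> a -> Bool
    ptrEq x y = isTrue# (reallyUnsafePtrEquality# x y)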

delete

The results here are similar to those for insert, except that the pointer equality optimization works even better here.

Bulk operations

Bulk operations ought to be faster on the HAMT, given its chunkier representation. However, at the moment these operations are slightly slower (though the absolute differences are small). I hope to optimize them further in the future, most likely by improving the code GHC generates for loops over arrays.

Memory usage

The theoretical overhead per key-value pair is 8 words in the old implementation but only ~4.5 words in the new one. On the 32-bit system used for the benchmarks, that works out to 32 bytes versus ~18 bytes per pair. Real-world measurements might not show quite as big an improvement, but in the measurements I've done the new data structure definitely uses less memory.

Putting things in perspective

To see how far we have come since the days before unordered-containers, here's a comparison between the latest HashMap implementation and the current Map implementation:

(Note that I reran the benchmarks to create this graph, so the numbers might not match the other graphs exactly.)

Backwards compatibility

The API is fully backwards compatible. I bumped the major version number to

  • indicate the increased risk of stability issues due to the major internal overhaul, and
  • celebrate the occasion. :)

If your code worked with 0.1, it should work with 0.2.

The future

The new implementation leaves room for future performance improvements. I believe the old Patricia trie implementation had reached its potential; there's only so much you can do with a binary tree before cache misses, caused by poor data locality, dominate performance.

I know of at least one use case that can be optimized a lot but that I haven't tackled yet due to lack of time: linear use of the data structure, for example creating a new map from a list of key-value pairs using fromList. This use case could be sped up, perhaps by as much as 2x, by repeatedly mutating the HAMT behind the scenes and then freezing it before returning it to the user.
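
The mutate-then-freeze idiom itself is easy to show with ordinary arrays; the sketch below is only an analogy, not the package's code. fromList for the HAMT would do the same thing to the trie's internal arrays: mutate them freely while the structure is still private, and freeze before handing it out.

    import Control.Monad (forM_)
    import Data.Array.ST (newArray, writeArray, runSTUArray)
    import Data.Array.Unboxed (UArray)

    -- Build an immutable array by mutating a private, freshly
    -- allocated one in place, then freezing it (copy-free) on the
    -- way out of runSTUArray.
    fromAssocs :: Int -> [(Int, Int)] -> UArray Int Int
    fromAssocs n kvs = runSTUArray $ do
      arr <- newArray (0, n - 1) 0              -- private mutable array
      forM_ kvs $ \(i, v) -> writeArray arr i v
      return arr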

Finally, GHC doesn't generate ideal code for loops over arrays or the array usage patterns in my implementation. Hopefully future improvements to GHC's treatment of array based code will show up as performance improvements in unordered-containers.

8 comments:

  1. I would be interested to see a comparison between your new HAMT with and without the pointer optimization when the benchmark is getting its data from an external file.

    Currently, your benchmark uses the same list of keys and values each time, so the optimization makes sense. But when you're working with external data that isn't reused inside your program, I wonder if the check is just extra overhead that gains no benefit. Perhaps it would make sense to provide two implementations of insert and delete, one with and one without the optimization?

  2. Christophe: I'm not sure I understand. The data that he is checking for pointer equality is the branch of the HAMT, not the user data.

  3. This comment has been removed by the author.

  4. I guess I misunderstood the following paragraph:

    "The second thing to note is that inserting an already existing key-value pair is faster in the new implementation. This is due to a really evil clever optimization that uses pointer equality (in a safe way) to check if the key and value are already present in the map and if so avoids rebuilding the tree."

  5. Edward & Chris,

    You're both right. Most of the time I compare branches of the tree; this is the case in delete, for example. There's exactly one case where I compare values: in insert. In that case the optimization is likely to trigger less often, i.e. only if you're trying to insert exactly the same value, which does happen. Also note that GHC reuses objects in some cases (e.g. it has a pool of Chars), which will make them compare pointer-equal even if they came from different sources.

  6. Is there any chance of having some way to use your implementation of the data structure without necessarily relying on the unordered-containers interface?

    In particular, a suggestion was made for updating one of my libraries to use something like unordered-containers as a backend, but with several optimizations based on using the hash in my library, as well as the backend. This becomes a much more attractive idea if I can pass in the hash, rather than have unordered-containers recalculate it.

    This could be provided as a lower-level internal interface, like bytestring provides, or just as a separate package - but either way, it'd allow for some really slick uses that the current api can't quite support.

  7. Anonymous,

    That's an interesting idea. The data definition itself is somewhat specialized for my current use case, in that it includes a field holding the hash in the leaf data constructor. Could you have a look at the data type definition and see if it fits your needs as is (e.g. if I were to expose it through Data.HashMap.Internal)?

    One way to avoid recomputing the hash, if you already have one, is to define a data type like so:

    data MyKey = MyKey MyHash RealKey

    -- HashMap keys need Eq (not Ord); compare on the real key only.
    instance Eq MyKey where
      MyKey _ k1 == MyKey _ k2 = k1 == k2

    instance Hashable MyKey where
      hash (MyKey h _) = h  -- reuse the stored hash instead of recomputing it
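
    For example (a hypothetical smart constructor; this assumes RealKey has a Hashable instance of its own), the hash is then computed once, when the key is built:

    mkKey :: RealKey -> MyKey
    mkKey k = MyKey (hash k) k  -- hash computed once, cached alongside the key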

    Replies
    1. Same anonymous here. I feel silly that I didn't catch that simple approach to avoiding re-hashing.

      Still, I feel like having internal access to the structure would be a bit easier to use. And the current data structure looks fine, after a quick look.
