I'm proud to announce a new major release of the unordered-containers package. This release is the result of over a year's work on completely replacing the underlying data structure used in the implementation of HashMap.
The old Patricia trie based implementation has been replaced by a new implementation based on hash array mapped tries (HAMTs). A HAMT is a trie implemented as an n-ary tree, where n is typically 16 or 32. In addition to having a higher branching factor, HAMTs reduce space overhead by not representing empty subtrees at all. You can read more about HAMTs in Phil Bagwell's paper.
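As a rough illustration, the shape of such a structure can be sketched in Haskell like this. This is a simplified model with made-up names: the actual type in the package has more constructors and stores children in arrays rather than lists.

    import Data.Word (Word32)

    type Hash   = Word32
    type Bitmap = Word32

    -- A simplified 32-way HAMT. Each internal node stores only its
    -- non-empty children, together with a bitmap recording which of
    -- the 32 possible slots those children occupy.
    data HAMT k v
        = Empty
        | Leaf !Hash k v                 -- a single key-value pair
        | Node !Bitmap [HAMT k v]        -- sparse 32-way branch
        | Collision !Hash [(k, v)]       -- keys whose hashes fully collide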
The higher branching factor lets us trade a small insert/delete performance regression against a large lookup performance improvement. In addition, the new data structure uses less memory and its better data locality should make it possible to improve its performance in the future.
I used my standard benchmark suite to compare the performance of the old and new implementations.
Don't pay too much attention to the x-axis in the graphs below: it shows the time taken to run a batch of operations, not a single operation. Instead, look at the relative improvement (or regression).
The benchmarks were run using a 32-bit GHC 7.4.1 on a 2.3 GHz Intel Core i7 MacBook Pro, using the default RTS settings (i.e. no extra RTS options).
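The actual suite is more extensive, but to give a flavour of the kind of measurements involved, a minimal criterion setup along the same lines might look something like this (the key counts and benchmark names are made up for illustration):

    import Criterion.Main
    import qualified Data.HashMap.Lazy as HM

    main :: IO ()
    main = do
      let n    = 10000 :: Int
          kvs  = [(show i, i) | i <- [1 .. n]]
          m    = HM.fromList kvs
          miss = [show i | i <- [n + 1 .. 2 * n]]
      defaultMain
        [ bench "lookup"      $ nf (map (\(k, _) -> HM.lookup k m)) kvs
        , bench "lookup-miss" $ nf (map (\k -> HM.lookup k m)) miss
        , bench "insert-dup"  $ nf (\(k, v) -> HM.insert k v m) (head kvs)
        ]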
The benchmarks named <something>-miss measure the time taken when the key is not present in the map.
There's not much to say here, except that the better data locality of the HAMT makes for a big difference in lookup performance.
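To see why lookups benefit, consider how a lookup descends the trie: it consumes a few bits of the hash per level and touches only one small node per level. Continuing the simplified sketch from above (illustrative code only; the real implementation uses arrays instead of lists and is more careful about strictness):

    import Data.Bits (shiftR, (.&.), bit, popCount)

    bitsPerLevel :: Int
    bitsPerLevel = 5            -- 2^5 = 32-way branching

    -- Position of a child inside the sparse child list: count how many
    -- lower-numbered slots are occupied.
    sparseIndex :: Bitmap -> Int -> Int
    sparseIndex bm slot = popCount (bm .&. (bit slot - 1))

    -- Walk down the trie, consuming 5 bits of the hash per level.
    lookupHAMT :: Eq k => Hash -> k -> HAMT k v -> Maybe v
    lookupHAMT h k = go 0
      where
        go shift t = case t of
          Empty -> Nothing
          Leaf h' k' v
            | h == h' && k == k' -> Just v
            | otherwise          -> Nothing
          Node bm children
            | bm .&. bit slot == 0 -> Nothing        -- slot empty: key absent
            | otherwise ->
                go (shift + bitsPerLevel) (children !! sparseIndex bm slot)
            where
              slot = fromIntegral ((h `shiftR` shift) .&. 31)
          Collision h' kvs
            | h == h'   -> lookup k kvs
            | otherwise -> Nothing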
The benchmarks named <something>-dup measure the time taken when the key and value were already present in the map.
There are two things to note here. First, even though we're copying 8 times as much data when rebuilding the node at each level of the tree, the insert performance isn't much worse than in the old implementation. This is partly due to some changes I made to GHC but mostly due to modern CPUs being good at doing things in bulk, like copying blocks of memory in cache.
The second thing to note is that inserting an already existing key-value pair is faster in the new implementation. This is due to a really clever (if slightly evil) optimization that uses pointer equality (in a safe way) to check whether the key and value are already present in the map and, if so, to avoid rebuilding the tree.
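In the library the check happens inline while the insert descends the tree, but the idea can be sketched with a hypothetical wrapper. Here insertNoCopy is an illustrative name, and ptrEq mirrors a helper commonly built on GHC's reallyUnsafePtrEquality# primitive:

    {-# LANGUAGE MagicHash #-}
    import GHC.Exts (reallyUnsafePtrEquality#, isTrue#)
    import Data.Hashable (Hashable)
    import qualified Data.HashMap.Lazy as HM

    -- Pointer equality is safe to use as a shortcut: if two values are
    -- the same heap object they are certainly equal, and a negative
    -- answer only means we fall back to doing the normal work.
    ptrEq :: a -> a -> Bool
    ptrEq x y = isTrue# (reallyUnsafePtrEquality# x y)

    -- If the exact same value is already stored under the key, return
    -- the original map instead of rebuilding a path of nodes.
    insertNoCopy :: (Eq k, Hashable k) => k -> v -> HM.HashMap k v -> HM.HashMap k v
    insertNoCopy k v m = case HM.lookup k m of
      Just v' | ptrEq v v' -> m
      _                    -> HM.insert k v m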
The results here are similar to those for insert, except that the pointer equality optimization works even better here.
Bulk operations ought to be faster on the HAMT, given its chunkier representation. However, at the moment these operations are slightly slower (though the absolute numbers are still low). I hope to optimize them further in the future, most likely by improving the code GHC generates for loops over arrays.
The theoretical overhead per key-value pair in the old implementation is 8 words, while it's only about 4.5 words in the new implementation. Real-world measurements might not show quite as big an improvement, but the new data structure definitely uses less memory in the real-world measurements I've done.
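As a back-of-the-envelope illustration of those numbers: on a 32-bit machine (4-byte words), a map with one million entries would carry roughly 32 MB of overhead under the old representation versus about 18 MB under the new one, not counting the keys and values themselves.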
Putting things in perspective
To see how far we have come since the days before unordered-containers, here's a comparison between the latest HashMap implementation and the current Data.Map implementation from the containers package.
(Note that I reran the benchmarks to create this graph, so the numbers might not match the other graphs exactly.)
The API is fully backwards compatible. I bumped the major version number to
- indicate the increased risk of stability issues due to the major internal overhaul, and
- celebrate the occasion. :)
If your code worked with 0.1, it should work with 0.2.
The new implementation leaves room for future performance improvement. I believe the old Patricia trie implementation had reached its potential; there's only so much you can do with a binary tree before cache misses, caused by poor data locality, come to dictate performance.
I know of at least one use case that can be optimized a lot, but that I haven't tackled yet due to lack of time: linear use of the data structure, for example when creating a new map from a list of key-value pairs using fromList. This use case could be sped up, perhaps by as much as 2x, by repeatedly mutating the HAMT behind the scenes and then freezing it before returning it to the user.
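To make the idea concrete, here is roughly what a purely functional fromList amounts to (a hedged sketch; fromListPure is an illustrative name, not the actual definition in the package):

    import Data.List (foldl')
    import Data.Hashable (Hashable)
    import qualified Data.HashMap.Lazy as HM

    -- Build the map with one persistent insert per element; every insert
    -- copies the path of nodes from the root down to the affected leaf.
    fromListPure :: (Eq k, Hashable k) => [(k, v)] -> HM.HashMap k v
    fromListPure = foldl' (\m (k, v) -> HM.insert k v m) HM.empty

A mutating version would instead fill the node arrays destructively while consuming the list, inside ST, and freeze the result once at the end, avoiding all of that intermediate copying.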
Finally, GHC doesn't generate ideal code for loops over arrays, or for the array usage patterns in my implementation. Hopefully, future improvements to GHC's treatment of array-based code will show up as performance improvements in unordered-containers.