Friday, February 18, 2011

A new, faster containers library

I am delighted to announce the release of preview versions of two new packages:

unordered-containers 0.1
Efficient hashing-based container types.

hashable 1.1
Efficient hashing of common types.

These packages address a need for faster container types in Haskell, especially for string types like ByteString and Text. The package achieves better performance by using hashing and a Patricia trie (also known as a radix tree), instead of a balanced tree, in its implementation.

The provided HashMap type is three times faster than Data.Map for lookups and twice as fast for inserts, for all key types I've tried, including ByteString, String, and Int.

I'm calling this a preview release, even though the code has been well tested both for correctness and performance, as the API is quite basic and doesn't provide a HashSet type yet. One thing you'll notice is that the API is split into a value-strict and a value-lazy version, making it easier to use in a strict setting.

If you want to contribute, please get copies of the source trees from here:

  • git clone https://github.com/tibbe/unordered-containers.git
  • git clone https://github.com/tibbe/hashable.git

Alternatively, just fork the projects on GitHub.

As I mentioned in my talk on hashing-based containers earlier this week, I'm working on an even faster version, based on hash-array mapped tries, but it won't be ready until GHC 7.2.

13 comments:

  1. It would be nice benchmarking it against clojure
    implementation. Have you tried it?

    ReplyDelete
  2. A small comment: shouldn't be Eq a superclass of Hashable? Especially since the docs say "If two values are equal according to the == method, then applying the hash method on each of the two values must produce the same integer result. ".

    ReplyDelete
  3. Paolo,

    It would be nice to benchmark it against the Clojure implementation. It's on my list of things to do. The implementation is the same as IntMap so the numbers Edward Z Yang presents in his blog should be representative: http://blog.ezyang.com/2010/03/the-case-of-the-hash-array-mapped-trie/

    ReplyDelete
  4. Twan,

    Perhaps. I have some doubt about whether type classes are good for sub-typing hierarchies. See Oleg email titled "In opposition of Functor as super-class of Monad" on the haskell-prime mailing list.

    ReplyDelete
  5. I very much like this idea of freeing hashing from hash tables!

    I also have a semantic concern. The documentation on the hash method says "This integer need not remain consistent from one execution of an application to another execution of the same application."

    The first remark violates a basic principle of purity in functional programming (apparently not universally shared), as described at Notions of purity in Haskell. I wonder if we can preserve the efficiency that I guess motivates the remark, while upholding the semantic purity I'd hope for.

    ReplyDelete
  6. I'm not sure that "efficiency" is the source of that remark, but even if it is, predictable hashing has a security implication for public-facing services. You can DoS with varying degrees of triviality a public-facing service that uses predictable hashing in a way that permits a remote attacker to tickle a worst-case of the some data structure, such as a hash-based implementation of query-string parameters.

    Perhaps a good place for implicit parameters, which ought to let you set up a random number once in IO (probably even in "main") but not require dealing with the pain of trying to pass the randomization parameter every which way and changing type signatures everywhere. It's technically referentially transparent.

    ReplyDelete
  7. Conal,

    It's not as bad as it sounds. The hash functions are all perfectly pure. The comment basically says that you shouldn't rely on the implementation staying the same over time (i.e. don't rely on these hash function for storing things on disk that you expect to be able to read back in 10 years). It also allows for implementations to compute different hash values for the same input on machines with different byte order.

    No I/O monads were hurt in the creation of these hash function.

    ReplyDelete
  8. > It's not as bad as it sounds. The hash functions are all perfectly pure. The comment basically says you shouldn't rely on the implementation staying the same over time (i.e. don't rely on these hash function for storing things on disk that you expect to be able to read back in 10 years).

    I'm fine with the idea that if I replace old code with new code, then the meaning of the new code needn't agree with the meaning of the old code.

    > It also allows for implementations to compute different hash values for the same input on machines with different byte order.

    This part worries me. If you read the "notions of purity" post I pointed to, I think you'll see that my concerns apply here. Just as with 'os :: String' in that post. Or if there were 'isLittleEndian :: Bool' that says whether the machine running the code has a little endian architecture. It's a blow to the lovely and useful property that our language (other than IO) has clear and tractably simple semantics, and in particular a machine-independent semantics. Which is to say that we're doing denotative programming (in the sense of Peter Landin), not just hacking.

    ReplyDelete
  9. Conal,

    I hear you. I wish code didn't change meaning when it moved between machines. However, you don't need to go further than Int to get that behavior (i.e. 32 vs 64-bit). Overall I think the trade-off of more efficient hash functions is worth it. In practice few people run on architectures with different endianness.

    ReplyDelete
  10. > I wish code didn't change meaning when it moved between machines. However, you don't need to go further than Int to get that behavior (i.e. 32 vs 64-bit).

    I'm hoping we'll reverse this trend, rather than continuing it and citing precedent. As Luke Palmer said: "You must be the change you wish to see in the world (-- Mahatma Gandhi). As applied to software: design software as if it were the beautiful paradise you want it to be, then build pieces of the scaffolding back to the status quo."

    Haskell is one of the few languages poised to extend robustly to execute a single program not only on multiple threads and multiple cores, but on multiple machines and multiple OSes. Or it would be except for a few choices of machine- or OS-dependent semantics, as in 'System.Info', 'Int' and now 'hash'. I worry that the more of these choices we make, the tougher it'll be to fix later when we want to tackle distributed computing.

    ReplyDelete
  11. I'm sorry about this "purity" conversation distracting from the main topic of your post.

    ReplyDelete
  12. Can I expect similiar performance when using `Text` instead of `Bytestring`s?

    ReplyDelete
  13. Yes. The reason I created the library to begin with was to store large numbers of Text values.

    ReplyDelete