Thursday, June 9, 2011

Computing the size of a HashMap

It's often useful to do some back-of-the-envelope calculations to figure how much memory your program will use. A rough estimate can help you figure out whether your current design will work for your expected input data or not.

In this post I'll show you how to compute the size of a HashMap of ByteString keys and Int values. The technique can be applied to any Haskell data type. Note that the following applies to GHC, other compilers may use different storage conventions.

First, lets start by looking at how much memory one ByteString uses. Here's the definition of ByteString:

data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8)
                     {-# UNPACK #-} !Int  -- offset
                     {-# UNPACK #-} !Int  -- length

(The two Int fields are used to support O(1) slicing.)

We also need the definitions of ForeignPtr and Int:

data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents

data ForeignPtrContents
    = PlainForeignPtr !(IORef (Finalizers, [IO ()]))
    | MallocPtr (MutableByteArray# RealWorld)
                !(IORef (Finalizers, [IO ()]))
    | PlainPtr (MutableByteArray# RealWorld)

data Int = I# Int#

Int# is an unboxed type. Most unboxed types take one word, the exceptions being Int64#, Word64#, and Double# which take 2 words on a 32-bit machine.

The UNPACK pragma indicates to the compiler that it should unpack the contents of a constructor field into the constructor itself, removing a level of indirection. After the compiler has unpacked the fields, we'll end up with a definition that looks like this:

data ByteString = PS Addr# ForeignPtrContents
                     Int#  -- offset
                     Int#  -- length

To compute the number of words a constructor will use, count the number of constructor fields and add one (for the constructor header). This rule-of-thumb works in most cases (but see the exceptions for 64-bit unboxed types above). Applying this rule we can see that the PS constructor will use 5 words. We then need to add the size of the ForeignPtrContents constructor, which is a PlainPtr if the ByteString was allocated on the GHC heap. Since PlainPtr has one field we add 2 words to the total memory usage.

Finally we need to look at the definition of MutableByteArray#, which is implemented by the GHC run-time system using a C struct named StgArrWords:

typedef struct {
    StgHeader  header;
    StgWord    bytes;
    StgWord    payload[FLEXIBLE_ARRAY];
} StgArrWords;

The StgHeader takes one word (when compiling with profiling turned off) so StgArrWords takes 2 words, plus the actual payload (rounded up to whole words).

If we add it all up we get 9 words, plus the size of the payload. We can visualize the memory layout of a ByteString using a box-and-arrows diagram (for lack of a better term):

Now lets look at the definition of HashMap:

data HashMap k v
    = Bin {-# UNPACK #-} !SuffixMask
          !(HashMap k v)
          !(HashMap k v)
    | Tip {-# UNPACK #-} !Hash
          {-# UNPACK #-} !(FL.FullList k v)
    | Nil

SuffixMask and Hash are just synonyms for Int. We also need the definition of FullList:

data FullList k v = FL !k v !(List k v)
data List k v = Nil | Cons !k v !(List k v)

Applying the UNPACK pragmas we get

data HashMap k v
    = Bin Int#
          !(HashMap k v)
          !(HashMap k v)
    | Tip Int#
          !k v !(List k v)
    | Nil

By counting the number of fields, we can see that each Bin constructor takes 4 words and each Tip constructor 5 words, plus the size of the List.

The size of the List depends on the number of hash collisions. These are quite rare so lets assume that the FullList only has one element, which means List is Nil. The Nil constructor has no fields so a single in-memory value can be shared by all instances in the program. This means that a Tip constructor takes 5 words, including the (zero) cost of the List field.

Below is a diagram of the memory layout of a HashMap with two entries and no hash collisions.

Note how the Nil constructor is shared.

Most likely, you're not interested in the size of individual Bin and Tip constructors, but in the size of each key-value entry in the map. To compute that we need to know the number of Bin constructors per Tip constructor.

For the sake of this discussion lets assume the tree is perfectly balanced. For a HashMap of size N this means that we have N leaves (Tip constructors) and N-1 interior nodes (Bin constructors). We can now compute the overhead per key-value pair (i.e. the size of the key-value pair, excluding the actual key and value) as: (5N + 4(N-1)) / N = 9 - 4/N or about 9 words.

Given that an Int (not an Int#) takes 2 words, we can compute the total memory cost of a key-value pair in a HashMap ByteString Int as

(9 + 9 + 2) * MachineWordSize + ByteStringLength

bytes, where ByteStringLength should be rounded up to a multiple of the machine word size (i.e. 4 or 8).

N.B. The program as a whole will use a bit more memory than the sum of all the values you allocate due to GC.

Aside: I'm working on switching HashMap to use another data structure, a Hash Array Mapped Trie, in its implementation. This will bring the overhead per key-value pair down from 9 words to about 4 words.

No comments:

Post a Comment