Thursday, June 9, 2011

Memory footprints of some common data types

As a follow-up to my last post on computing the size of a HashMap, I've compiled a list of some common data types and their memory footprints. The caveat from the previous post still applies: these numbers are specific to GHC.

N is the length of the value (for string types) or the number of elements (for container types).

Data type        Memory footprint                            Notes
Data.ByteString  9 words + N bytes                           1
Data.Text        6 words + 2N bytes                          1, 2
String           5N words                                    3
Data.Map         6N words + size of keys & values
Data.Set         5N words + size of elements
Data.IntMap      3N + 5(N-1) words + size of values          4
Data.IntSet      2N + 5(N-1) words                           4
Data.HashMap     5N + 4(N-1) words + size of keys & values   4
Data.HashSet     5N + 4(N-1) words + size of elements        4, 5

For example, on a 32-bit machine (4-byte words), a HashMap Text Int whose keys average 8 characters will use roughly

4 * (5N + 4(N-1)) + N * (4 * 6 + 2 * 8) + 4 * 2N = 84N - 16

bytes: the first term is the HashMap overhead, the second the Text keys (6 words plus 2 bytes per character), and the third the boxed Int values. That's about 84 bytes per key-value pair.
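
This kind of estimate is easy to capture in code. Here's a minimal sketch (the function and its name are mine, not part of any library) that encodes the formula above:

    -- Hypothetical helper: estimate the size in bytes of a
    -- HashMap Text Int, using the formulas from the table above.
    wordSize :: Int
    wordSize = 4  -- bytes per word on a 32-bit machine

    -- | @n@ key-value pairs, keys averaging @avgLen@ characters.
    hashMapTextIntBytes :: Int -> Int -> Int
    hashMapTextIntBytes n avgLen =
        wordSize * (5 * n + 4 * (n - 1))  -- HashMap overhead
      + n * (6 * wordSize + 2 * avgLen)   -- Text keys
      + wordSize * 2 * n                  -- boxed Int values

For example, hashMapTextIntBytes 1000 8 evaluates to 83984, i.e. about 84 bytes per pair.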

N.B. The given sizes assume that the data type isn't UNPACKed into some other data type, which would decrease the total memory usage.
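
For example (a sketch with made-up types), unpacking a strict Int field stores its payload directly in the enclosing constructor:

    -- In Boxed the Int lives in its own 2-word heap object, pointed to
    -- from the constructor (4 words in total). With UNPACK the payload
    -- is stored directly in the constructor (2 words in total).
    data Boxed   = Boxed   !Int
    data Unboxed = Unboxed {-# UNPACK #-} !Int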

  1. The size of the content is rounded up to a multiple of the machine word size.

  2. Data.Text stores the Unicode string encoded using UTF-16. The memory usage depends on the actual code points in the string, but can be approximated as 2N bytes.

  3. GHC allocates a small pool of shared Chars, so the actual memory usage might be smaller, depending on the content of the string. In addition, if the String isn't fully evaluated it might use more or less space, depending on the size of the thunk representing the rest of the computation. (The 5N figure counts 3 words per cons cell plus 2 words per non-shared Char.)

  4. Assuming a perfectly balanced tree.

  5. At the time of writing, HashSet is a simple wrapper around HashMap (see the sketch after these notes). The overhead could easily be decreased by N words, to 4N + 4(N-1) words, by not storing a dummy value.
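
Conceptually, the wrapper in note 5 amounts to something like this (a sketch, not the exact source; the module name follows the post):

    import Data.HashMap (HashMap)

    -- Each element maps to a dummy unit value; the pointer to that
    -- dummy value is the extra word per element mentioned in note 5.
    newtype HashSet a = HashSet (HashMap a ())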

6 comments:

  1. Data.Text uses UTF-16, which for some (rare) characters (those outside the Basic Multilingual Plane, like Byzantine music symbols) uses more than two bytes per character.

  2. Surely note (4) is unnecessary? None of the binary trie-based maps actually cares about balance; there are always N-1 internal nodes. That said, I'm not thinking carefully about the cost of hash collisions here.

  3. J. Maessen,

    Due to the presence of the `Nil` constructor I believe you can have more than N-1 internal nodes. The `Nil` constructor gets introduced when deleting nodes. You can end up with a tree that looks like (Bin ... (Tip ...) Nil).

  4. I'm having trouble deriving 5N for String. For a list I would expect 2 words per cell: one for the cons/nil tag, and one pointer to the contents. Then a Char is free (preallocated) if it's in the 0-255 range, and 2 words otherwise (one for the box, one for the data).

    As an aside, I made a Sized typeclass that tries to add up the storage taken by the (fully forced) data type. It should be an upper bound because it doesn't understand sharing, but it seems generally useful to be able to get a size estimate. It's also probably impossible to get an accurate measurement of collection types without access to internal data, but at least I can get a guess.

    Carrying this further, given something like reallyUnsafePtrEquality (or unsafeCoerce to a pointer type and compare?), would it be possible to get the number of bytes shared? Then I could get an actual measurement for the amount of space taken up by many copies of a persistent data structure.
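
    A minimal sketch of what such a Sized class might look like (all names hypothetical; sizes in words, ignoring sharing):

        -- Upper-bound size in words of a fully forced value.
        class Sized a where
          size :: a -> Int

        instance Sized Char where
          size _ = 2                          -- header + payload words

        instance Sized a => Sized [a] where
          size []     = 0                     -- [] is a shared constructor
          size (x:xs) = 3 + size x + size xs  -- 3-word cons cell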

    Replies
    1. It's 3 words per cell: one for the GC header (i.e. the info table pointer), one for the pointer to the value, and one for the pointer to the tail. As you say, the Char is either 0 or 2 words, depending on which Unicode code point it represents.

      If you want to see how to check sharing, look at how vacuum does it. I believe it looks inside the actual closures and follows pointers to see where they point.

    2. Ah, I was forgetting the GC overhead. Thanks for the vacuum pointer.
