Friday, April 9, 2010

Generational garbage collection and floating garbage

The new GHC I/O manager stresses the GHC run-time system in ways most Haskell code doesn't; heavy concurrency and shared data structures really test both the scheduler and the garbage collector. It therefore shouldn't come as a surprise that we run into performance issues once in a while. In this blog post I'll describe an issue related to generational garbage collection that I ran into recently, in the hope that it's of general interest.
In addition to monitoring file descriptors for events, the I/O manager lets programs register timeouts. The I/O manager stores the registered timeouts in a data structure known as a priority search queue (PSQ). The queue is stored as part of the I/O manager's state:
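The relevant part of the state might be sketched as follows. Note that the types here are simplified stand-ins for the real GHC.Event internals (an association list plays the role of the PSQ, and the unique-key supply is just a counter); only the emTimeouts field name comes from the actual code:

```haskell
import Data.IORef

-- Simplified stand-ins for the real GHC.Event types.
newtype Unique = Unique Int deriving (Eq, Ord, Show)
type TimeoutCallback = IO ()

-- The real I/O manager uses a priority search queue (PSQ); a list of
-- (key, expiration time, callback) entries stands in for it here.
newtype TimeoutQueue = TimeoutQueue [(Unique, Double, TimeoutCallback)]

data EventManager = EventManager
  { emTimeouts     :: IORef TimeoutQueue  -- the registered timeouts
  , emUniqueSource :: IORef Int           -- simplified unique-key supply
  }
```

The queue lives behind an IORef so that many threads can register timeouts concurrently while the I/O manager thread reads and prunes the queue.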
Programs can register timeouts using the registerTimeout function:
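In sketch form (again with simplified stand-in types, and a Double-valued POSIX clock in place of the real time source), registerTimeout looks something like this:

```haskell
import Data.IORef
import Data.Time.Clock.POSIX (getPOSIXTime)

-- Simplified stand-ins for the real GHC.Event types.
newtype Unique = Unique Int deriving (Eq, Ord, Show)
type TimeoutCallback = IO ()
newtype TimeoutQueue = TimeoutQueue [(Unique, Double, TimeoutCallback)]

data EventManager = EventManager
  { emTimeouts     :: IORef TimeoutQueue
  , emUniqueSource :: IORef Int
  }

-- Insert a timeout into the (stand-in) priority search queue.
insert :: Unique -> Double -> TimeoutCallback -> TimeoutQueue -> TimeoutQueue
insert key time cb (TimeoutQueue xs) = TimeoutQueue ((key, time, cb) : xs)

-- Register a timeout, given in microseconds, returning a key that can
-- later be used to unregister it.
registerTimeout :: EventManager -> Int -> TimeoutCallback -> IO Unique
registerTimeout mgr us cb = do
  key <- atomicModifyIORef (emUniqueSource mgr) $ \n -> (n + 1, Unique n)
  now <- realToFrac `fmap` getPOSIXTime
  let expTime = fromIntegral us / 1000000 + now
  -- Note: the new queue is NOT forced here; an (insert ...) thunk is
  -- stored in the IORef, to be evaluated later by the I/O manager thread.
  atomicModifyIORef (emTimeouts mgr) $ \q -> (insert key expTime cb q, ())
  return key
```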
The registerTimeout function computes the absolute timeout, creates a unique key that can be used to later unregister the timeout, and finally inserts the timeout in the queue.
The last step is the interesting one. Note that the value returned by atomicModifyIORef isn't forced at this point, so a thunk of the form (insert ...) is stored inside the IORef. This stored thunk will eventually get evaluated by the I/O manager thread when it grabs all expired timeouts.
We've created a micro benchmark that measures the time taken to register a large number of timeouts. The benchmark creates a large number of threads (e.g. 20,000) that all call registerTimeout and then wait for the timeout to expire.
While the queue implementation is plenty fast compared to the implementation in the old I/O manager, the benchmark spends a large fraction of its time (65%) doing garbage collection. Simon Marlow was kind enough to help me understand what's going on. The explanation that follows is just my elaboration on his explanation.
GHC's run-time system uses a generational, copying garbage collector that uses two generations by default. Objects are allocated in the young generation and promoted to the old generation after having survived two collections of the young generation, also known as minor collections.
As multiple threads call registerTimeout, a chain of insert thunks builds up in the young generation:
The grey circle represents the emTimeouts IORef, which points to the chain of thunks. As emTimeouts is alive for the duration of the benchmark, anything pointed to by it will not get garbage collected.
If the thunks survive two minor garbage collections before getting evaluated by the I/O manager thread, they are promoted to the old generation.
At some point the I/O manager thread runs and evaluates the chain of thunks. As each thunk gets evaluated it creates a new version of the queue. The nodes that are not shared by the old and new versions of the queue, about O(log n) nodes per insert call, become garbage.
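This path-copying behavior is common to all persistent tree structures, so it can be illustrated with Data.Map rather than the PSQ: inserting a key produces a new version that shares all untouched nodes with the old one, while the old version remains valid and unchanged.

```haskell
import qualified Data.Map as M

-- Build a map and then insert into it: the insert allocates roughly
-- O(log n) fresh nodes along the path to the new key; everything else
-- is shared between the two versions, and the old version is unchanged.
oldVersion :: M.Map Int String
oldVersion = M.fromList [(k, show k) | k <- [1 .. 1000]]

newVersion :: M.Map Int String
newVersion = M.insert 1001 "new" oldVersion
```

Once nothing refers to oldVersion, the handful of nodes unique to it become garbage; the shared nodes live on in newVersion.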
When a thunk is evaluated it gets overwritten with an indirection (colored red in the diagram below). The indirection is necessary so that any object that was pointing to the thunk prior to it getting evaluated can find its value. Each of these indirections refers to one intermediate version of the queue that was created while evaluating the chain of thunks.
This is where things get interesting.
Infant mortality, or the generational hypothesis, is the observation that, in most cases, young objects are much more likely to die than old objects. If the generational hypothesis holds true, the garbage collector can make a number of simplifying assumptions to improve performance. One such simplification is that all the values in the old generation are considered alive (i.e. reachable from somewhere) during minor garbage collections. This simplification lets the collector avoid traversing the entire reference tree on each minor collection.
Since the indirections in the old generation refer to all the intermediate versions of the queue (colored green above), these versions won't be collected during minor garbage collections. They can only be collected after the indirections themselves have been collected, which happens during a collection of the old generation, known as a major collection.
Unless a major collection occurs first, the intermediate versions will be copied unnecessarily during the two subsequent minor collections: during the first collection the values are copied within the young generation, and during the second they are copied into the old generation. This creates extra work for the garbage collector. Once inside the old generation, the intermediate versions won't be copied again; they will be collected during the next major garbage collection, together with the indirections.
This phenomenon, where objects that are garbage are retained and copied unnecessarily, is known as floating garbage.
It turns out there's no straightforward way to fix the problem. Forcing evaluation of the value returned by atomicModifyIORef causes another problem: each thread performs more work, causing its stack to grow, which also creates extra work for the garbage collector.
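For reference, the usual idiom for forcing the result of atomicModifyIORef (the approach rejected above) is to return the new value in both components of the pair and seq it afterwards. Shown here on a simple counter rather than the timeout queue; modern versions of base package this idiom up as atomicModifyIORef':

```haskell
import Data.IORef

-- Increment a counter and force the new value before returning, so no
-- thunk is left behind in the IORef. The evaluation cost is paid by the
-- calling thread, which is exactly what made this a poor fit for the
-- timeout benchmark.
incrForced :: IORef Int -> IO Int
incrForced ref = do
  n <- atomicModifyIORef ref $ \n -> let n' = n + 1 in (n', n')
  n `seq` return n
```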
The workaround1 I came up with is to have each thread record the modification it wishes to perform on the queue rather than actually performing that modification. The modifications are stored as closures in a list and are applied by the I/O manager thread before it grabs the expired timeouts:
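A sketch of the idea, again with a stand-in queue type in place of the real PSQ: each registering thread conses a closure onto an edit list, and the I/O manager atomically swaps the list out and applies the edits, oldest first, to produce the up-to-date queue.

```haskell
import Data.IORef

-- Stand-in types: the real queue is a priority search queue.
newtype TimeoutQueue = TimeoutQueue [(Int, Double)] deriving (Eq, Show)
type TimeoutEdit = TimeoutQueue -> TimeoutQueue

insert :: Int -> Double -> TimeoutQueue -> TimeoutQueue
insert key time (TimeoutQueue xs) = TimeoutQueue ((key, time) : xs)

-- A registering thread records the edit it wants instead of rebuilding
-- the queue itself; consing onto a list is cheap and leaves no chain of
-- queue-valued thunks behind in the IORef.
recordEdit :: IORef [TimeoutEdit] -> TimeoutEdit -> IO ()
recordEdit ref edit = atomicModifyIORef ref $ \edits -> (edit : edits, ())

-- The I/O manager thread swaps the edit list out atomically and applies
-- the edits, oldest first, before looking for expired timeouts.
applyEdits :: IORef [TimeoutEdit] -> TimeoutQueue -> IO TimeoutQueue
applyEdits ref q = do
  edits <- atomicModifyIORef ref $ \edits -> ([], edits)
  return $! foldl (flip ($)) q (reverse edits)
```

Since only the short-lived edit closures are allocated by the registering threads, and the queue itself is only rebuilt by the single I/O manager thread, no chain of intermediate queue versions ever reaches the old generation.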
The list-of-edits approach decreases GC time to 30%, increases performance by 34%, and creates no floating garbage.
1. I discovered the workaround by accident while trying to solve a different problem. I only later understood the problem it actually solved.


  1. Is there a way to allocate memory inside a specific block/memory region which has specific aging properties (i.e. by using something like a special "newIORefInRegion" primitive)? Wouldn't this be the right way to do it?

  2. I don't think that is possible. I don't know what the best solution would be. However, I'm still looking for something better than this workaround.

  3. I thought of it some more. Isn't the real issue here that the runtime system replaces thunks with "indirection objects" pointing to the resulting value? I.e. what usually happens, if the resulting value is produced sometime later during the evaluation process, is that the resulting value lives far apart from the thunk (and possibly in a younger generation).

    Wouldn't it be preferable to keep a thunk and its resulting value in the same generation or---even better---locally together? Would it be possible to calculate the (maximal) extent of the resulting value of a thunk in advance, reserve space for it in the thunk object while creating the thunk, and do the update "in place"?

  4. These are all good questions and unfortunately I don't have any good answers to them at the moment. I believe in-place update has been tried in some context, with worse performance as a result.

    Keeping the value in the old generation will give you references from the old generation to the young generation, and I think the current GC assumes there are no such references (but don't take my word for it).