After a weekend of hacking in Utrecht, Don Stewart and I have improved the performance of the containers package, especially when the key or element type can be unboxed (e.g. Int). The performance gains come from repeatedly applying a few code transformations. Today, I will describe one of these transformations: the static argument transformation.
The transformation works most of the time, but make sure you have some benchmarks ready so you can catch the cases where it makes things worse. I recommend using the excellent Criterion benchmarking package, as it takes care of many subtle issues in benchmarking.
The static argument transformation
The static argument transformation transforms a recursive function with one or more constant arguments into a non-recursive wrapper with a locally defined recursive worker. For example,
is transformed to
This transformation tends to improve performance for some functions, especially higher-order functions.
Higher-order functions
To understand why this transformation can improve the performance of higher-order functions, consider this definition of filter:
The actual predicate is unknown to GHC when it compiles the module where filter is defined. At run time, a pointer to the actual predicate is passed as an argument to filter. This hurts performance as calling a function indirectly through a pointer is less efficient than calling a known function.
If filter could be inlined at the call site, where the actual predicate is known, we could avoid the cost of this indirect call. However, GHC doesn’t inline recursive functions (as it doesn’t know when to stop inlining) and therefore cannot inline filter at call sites.
The static argument transformation can help here. By applying the transformation we get
Note how the constant argument p is no longer passed in each recursive call.
Since this definition is not recursive, GHC can inline it. Since GHC can inline it, GHC can also inline the predicate into the body. For example, given this call site
the code GHC generates looks something like this, after simplification:
The predicate is now known and the extra indirection can be avoided. In addition, GHC can inline the definition of even and avoid the overhead of that function application:
In order of decreasing importance, here are the reasons the transformation improves performance:
- We no longer have to make an indirect function call via a pointer.
- We don’t have to make a function call to
even, as its definition has been inlined. - The closure that’s created by the expression
go xs is slightly smalled than the closure created by filter p xs, which save a little bit of allocation.
Since inlining is so important for this transformation to work, you may sometimes need to add an explicit INLINE pragma:
In this particular case it’s not needed as GHC inlines filter without any hints from us.
When applying the transformation to a function, try both with and without the inline pragma and measure the difference using benchmarks.
Summary
The static argument transformation can remove performance penalties introduced by using abstractions such as higher-order functions.
GHC is getting better and better over time so expect some of these manual transformations to be applied automatically by the compiler some day. Until then you can use them to improve the performance of your code.
Next time we'll take a closer look at how to reason about evaluation order an how making function arguments strict can improve the performance of non-polymorphic functions as well.