Tuesday, September 7, 2010

The static argument transformation

After a weekend of hacking in Utrecht, Don Stewart and I have improved the performance of the containers package, especially when the key or element type can be unboxed (e.g. Int). The performance gains come from repeatedly applying a few code transformations. Today, I will describe one of these transformations: the static argument transformation.

The transformation works most of the time, but make sure you have some benchmarks ready so you can catch the cases where it makes things worse. I recommend using the excellent Criterion benchmarking package, as it takes care of many subtle issues in benchmarking.

The static argument transformation

The static argument transformation transforms a recursive function with one or more constant arguments into a non-recursive wrapper with a locally defined recursive worker. For example,
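a membership test on lists of Ints, written roughly like this,

    -- n never changes between recursive calls: it is the static argument.
    member :: Int -> [Int] -> Bool
    member _ []     = False
    member n (x:xs)
        | x == n    = True
        | otherwise = member n xs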

is transformed to
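this version, in which the static argument n is captured by a local worker go (again a sketch):

    member :: Int -> [Int] -> Bool
    member n xs0 = go xs0
      where
        -- The worker closes over n instead of passing it on every call.
        go []     = False
        go (x:xs)
            | x == n    = True
            | otherwise = go xs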

This transformation tends to improve performance for some functions, especially higher-order functions.

Higher-order functions

To understand why this transformation can improve the performance of higher-order functions, consider how filter is defined.
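
A sketch of the familiar recursive definition, with the predicate p as the static argument:

    filter :: (a -> Bool) -> [a] -> [a]
    filter _ []     = []
    filter p (x:xs)
        | p x       = x : filter p xs
        | otherwise = filter p xs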

The actual predicate is unknown to GHC when it compiles the module where filter is defined. At run time, a pointer to the actual predicate is passed as an argument to filter. This hurts performance as calling a function indirectly through a pointer is less efficient than calling a known function.

If filter could be inlined at the call site, where the actual predicate is known, we could avoid the cost of this indirect call. However, GHC doesn’t inline recursive functions (as it doesn’t know when to stop inlining) and therefore cannot inline filter at call sites.

The static argument transformation can help here. By applying the transformation we get
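a wrapper around a local worker go that closes over the predicate, roughly:

    filter :: (a -> Bool) -> [a] -> [a]
    filter p xs0 = go xs0
      where
        -- p is free in go: the worker picks it up from the wrapper instead
        -- of receiving it as an argument on every recursive call.
        go []     = []
        go (x:xs)
            | p x       = x : go xs
            | otherwise = go xs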

Note how the constant argument p is no longer passed in each recursive call.

Since this definition is not recursive, GHC can inline it. Since GHC can inline it, GHC can also inline the predicate into the body. For example, given this call site
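where we keep the even elements of some list xs,

    filter even xs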

the code GHC generates looks something like this, after simplification:
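(a sketch; GHC's actual output uses its own internal names)

    -- filter has been inlined at the call site, so the worker now tests
    -- the known predicate even directly instead of an unknown argument p.
    let go []     = []
        go (x:xs)
            | even x    = x : go xs
            | otherwise = go xs
    in go xs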

The predicate is now known and the extra indirection can be avoided. In addition, GHC can inline the definition of even and avoid the overhead of that function application:
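(again a sketch, using the Prelude definition of even as a remainder test)

    -- even has been inlined as well, so no function call remains in the loop.
    let go []     = []
        go (x:xs)
            | x `rem` 2 == 0 = x : go xs
            | otherwise      = go xs
    in go xs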

In order of decreasing importance, here are the reasons the transformation improves performance:

  • We no longer have to make an indirect function call via a pointer.
  • We don’t have to make a function call to even, as its definition has been inlined.
  • The closure that’s created by the expression go xs is slightly smaller than the closure created by filter p xs, which saves a little bit of allocation.

Since inlining is so important for this transformation to work, you may sometimes need to add an explicit INLINE pragma:
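(shown here on the transformed filter from above)

    filter :: (a -> Bool) -> [a] -> [a]
    filter p xs0 = go xs0
      where
        go []     = []
        go (x:xs)
            | p x       = x : go xs
            | otherwise = go xs
    {-# INLINE filter #-}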

In this particular case it’s not needed as GHC inlines filter without any hints from us.

When applying the transformation to a function, try both with and without the inline pragma and measure the difference using benchmarks.
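
Here's a sketch of such a benchmark using Criterion; filterRec and filterSAT are just placeholder names for the untransformed and transformed versions:

    import Criterion.Main (bench, defaultMain, nf)

    -- The original recursive definition.
    filterRec :: (a -> Bool) -> [a] -> [a]
    filterRec _ []     = []
    filterRec p (x:xs)
        | p x       = x : filterRec p xs
        | otherwise = filterRec p xs

    -- The version produced by the static argument transformation.
    filterSAT :: (a -> Bool) -> [a] -> [a]
    filterSAT p xs0 = go xs0
      where
        go []     = []
        go (x:xs)
            | p x       = x : go xs
            | otherwise = go xs
    {-# INLINE filterSAT #-}

    main :: IO ()
    main = defaultMain
        [ bench "filter/recursive"   (nf (filterRec even) input)
        , bench "filter/transformed" (nf (filterSAT even) input)
        ]
      where
        input = [1 .. 10000] :: [Int]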

Summary

The static argument transformation can remove performance penalties introduced by using abstractions such as higher-order functions.

GHC is getting better and better over time, so expect some of these manual transformations to be applied automatically by the compiler some day. Until then, you can use them to improve the performance of your code.

Next time we'll take a closer look at how to reason about evaluation order and how making function arguments strict can improve the performance of non-polymorphic functions as well.

16 comments:

  1. Thanks for the interesting hands-on post, tibbe.

    Could you also please present the worker-wrapper transformations that were applied to improve performance during the Hackathon?

    In general, I'd find it great to have a cookbook-like tutorial (in a similar style as above) on code transformations that improve performance (maybe even on the haskell.org wiki?)

  2. I'm glad you liked the post. I'm planning to write a few more performance oriented posts in the future so stay tuned.

    The static argument transformation is one of the worker/wrapper transforms we did use at the hackathon. I'll try to present the other flavors as well, when I find the time.

    I'd also like to see a performance cookbook. I've collected some ideas over the last few months, which I will try to turn into posts.

  3. I really don't understand why member is improved by this transformation. I'm sure you're right, but why? Both functions will be turned into loops. In the case of the member function the n argument doesn't have to be moved in the tail call so it should be the same cost as the go function.

  4. I first read "unboxable" as "something that cannot be boxed" as opposed to "something that can be unboxed".

    In any case, I was meaning to write this post myself, but I think you did a better job than I would have. Nice job!

  5. augustss, I didn't intend for the member function to be an example of a case where we get a performance improvement, but rather an example of what the transformation looks like.

    The transformation turns out to be an improvement even in the case of member (by ~8%), since the transformed version of member inlines, at which point more unboxing happens.

    You get an even greater effect (~17%) if you make member strict in the first argument.

  6. lpsmith,

    Thanks for the kind words. I rephrased the sentence containing the word "unboxable". Hopefully it's clearer now.

  7. Does the proposed new superoptimization stuff take care of that?

    http://research.microsoft.com/en-us/um/people/simonpj/papers/supercompilation/supercomp-by-eval.pdf

  8. Unfortunately, none of the code posted above is visible on http://planet.haskell.org, probably because the PlanetPlanet software (rightfully) filters out any script tags.

  9. Thomas,

    That's unfortunate. I originally used pre-tags for code, but Blogger didn't render them well. I'd like to use some other blog software to manage my blog but right now I have no time to look into that.

  10. This is really helpful and lucid. Especially the point that the main virtue arises with composition and other ways of hooking the function up with others -- and the reason for this:

    > The reason this improves performance is that GHC doesn’t inline
    > recursive functions (as it doesn’t know when to stop inlining) and
    > therefore cannot inline the first version of filter in call sites.

    -- the further discussion makes this much more understandable than other things I've read. I wish someone like you would write a more extensive, but still simple, series of examples like these, in order to fix in the readers' brains the capacity to recognize when the maneuver is necessary.

  11. Thanks for writing this up Johan, very clear.

  12. I've rewritten the explanation of why inlining filter helps performance. Hopefully the relative importance of the different gains is clearer now.

  13. Anonymous: Supercompilation will specialize the use sites if a function is known, but it will not normally perform SAT, although it is fairly easy to get your supercompiler to do so if you have one.

  14. Johan,

    Why does making the first argument strict increase performance? Surely the thunk for the Int is forced immediately in “go”, anyway, via “go (x:xs) | x == n = True”. Can you explain?

    Thanks.

    Replies
    1. Chris,

      Consider:

      member undefined []

      member is not strict in its first argument. This means that the argument cannot be unboxed*. This is an instance of the "lazy base case" problem:

      http://johantibell.com/files/haskell-performance-patterns.html#(11)

      * Caveat: constructor specialization might regain some, but not all, of the lost performance here.

  15. I understand why GHC doesn't inline recursive functions, but why can't it add them as local definitions to a function and then replace parameters with lexical bindings, thus achieving the desired optimization in the code here? I'd love to add some code here to give a more complete explanation, but I couldn't find a way to add properly formatted code in a comment. When I tried using the HTML pre tag I got the error message: Your HTML cannot be accepted: Tag is not allowed: PRE
