Tuesday, February 21, 2012

Designing a serialization API

Haskell has been around long enough now that some design patterns have emerged. Today I'd like to talk about how to design a serialization API. In this post I'll use the word serialization in a wider sense and let it include converting data to and from e.g. database rows, in addition to more common meanings, such as converting values to binary or textual formats (e.g. JSON.)

The pattern has been extracted from a few packages on Hackage, in particular the binary, aeson, and mysql-simple packages.

Type based serialization

This pattern addresses the problem of converting a Haskell value to and/or from some other representation. As an example we'll look at converting a Haskell value to and from JSON, in some detail, followed by a quick look at binary serialization and converting values to and from MySQL rows.

JSON serialization

First, we can represent all of JSON's constructs directly in Haskell, using a couple of data types. Here are the types used in aeson:

data Number = I !Integer
            | D !Double

type Object = HashMap Text Value

type Array = Vector Value

data Value = Object !Object
           | Array !Array
           | String !Text
           | Number !Number
           | Bool !Bool
           | Null

This representation can be useful on its own, for example when you want to write generic traversals over JSON documents, but it's incovenient and error prone to work with most of the time. What we'd like to do is to convert a JSON document we receive into a more domain specific representation as soon as possible.

Aside: even though we'll serialize Haskell values to and from this type, it would have been reasonable, although perhaps more cumbersome for the users of our API, to skip the Value type entirely and convert our Haskell values directly to and from JSON-encoded ByteStrings.

Lets say we would like to store values representing people as a JSON document. The document should look like this:

    "name": "Bob",
    "born": 1984
    "name": "Alice",
    "born": 1983

We'd like to represent this document as a Vector of Person values:

data Person = Person { name :: !Text, born :: !Int }

What we need is a pair of functions, one for serializing Person values to JSON and one for deserializing JSON to Person values. Since we want to do this based on the type we're serializing from/to, we need a type class or, in the case of aeson, two:

class ToJSON a where
    toJSON :: a -> Value

class FromJSON a where
    parseJSON :: Value -> Parser a

The type of toJSON is straightforward. The type of fromJSON is a bit more complicated in that it allows the API to express that

  • the conversion can fail and
  • the deserialization can work on incomplete ByteString chunks.

The former is fundamental to deserialization while the latter, while very useful, is not.

aeson provides instances of the two type classes for all basic data types (and then some.) This makes it easy to define higher level instances, in terms of these basic instances.

Writing the two required instances for our Person data type is easy:

instance ToJSON Person where
    toJSON p = object [
          "name" .= name p
        , "born" .= born p

instance FromJSON Person where
    parseJSON (Object v) = Person <$>
                           v .: "name" <*>
                           v .: "born"
    parseJSON _          = mzero

Finally, aeson provides two convienient functions for serializing and deserializing values to and from ByteStrings:

decode :: FromJSON a => ByteString -> Maybe a
encode :: ToJSON a => a -> ByteString

Binary serialization

The binary package is simpler than the aeson package in that the user only has to define a single instance and doesn't have to deal with any intermediate data types (i.e. Value.) Here's the Binary type class:

data Binary t where
    put :: t -> Put
    get :: Get t

Get is very much like the Parser monad used in aeson. Put has no direct correspondens in aeson; it's a writer monad that lets you write a sequence of bytes.

The simplest functions on these monads are putWord8 and getWord8:

putWord8 :: Word8 -> Put
getWord8 :: Get Word8

User-defined instances are defined as monadic (or applicative) actions, like in this example from the Haddock documentation:

data Exp = IntE Int
         | OpE  String Exp Exp
   deriving Show

instance Binary Exp where
       put (IntE i)          = do put (0 :: Word8)
                                  put i
       put (OpE s e1 e2)     = do put (1 :: Word8)
                                  put s
                                  put e1
                                  put e2

       get = do t <- get :: Get Word8
                case t of
                     0 -> do i <- get
                             return (IntE i)
                     1 -> do s  <- get
                             e1 <- get
                             e2 <- get
                             return (OpE s e1 e2)

Interacting with a database

You might not associate database APIs with serialization at first, but the two are actually related. Consider this example, which uses the mysql-simple package:

import qualified Data.Text as Text

xs <- query_ conn "select name, age from users"
forM_ xs $ \ (name, age) ->
    putStrLn $ Text.unpack name ++ " is " ++ show (age :: Int)

What's going on here?

The query_ function returns a list of values that are instances of the QueryResults class. The class is defined as follows:

class QueryResults a where
    convertResults :: [Field] -> [Maybe ByteString] -> a

If you squint the type of convertResults matches the shape we saw in fromJSON and get: the serialized representation is passed as arguments, here represented as database rows, and the result type is the type we want to convert to.

Note that in the example above the result type was resolved to (Text, Int) using type inference. This allows the code to be very concise, but also very flexible; we can convert the result to a different type:

data Person = Person { name :: !Text, age :: !Int }

forM_ xs $ \ (Person name age) ->
    -- do something with @name@ and @age@

Depending on how the result value is used, you might need to give an explicit type signature.

The pattern

The pattern should hopefully be clear at this point. If you want to create a serialization API, define a type class (or two) which encapsulates the serialization and deserialization functions. On top of this simple foundation there are a few design decisions you'll have to make, each outlined in its own section below.

Mapping the data to Haskell types

To ensure maximum type safety and make the API easy to work with, we'd like the Haskell types we convert our serialized representation into to not still show traces of the serialized representation. For example, it'd be unfortunate if we need to pass around a custom JsonString data type all over our program, just because we parsed JSON data at some point. We'd like to convert such specific representations into regular data types (e.g. Text) at the borders of our application.

For some serializations APIs there isn't much the API designer needs to do to help the user convert the serialized data to plain-old-Haskell-values, but look out for cases where you might make the user use serilization specific data types all over the program.

Deciding how many type classes to use

How many type classes do you need? It depends on if the user will always wants both serialization and deserialization. In the case of JSON it's often the case that you only want one of the two. For example, a web service API might define a JSON request and response type, but since the server nevers sends you a request, it doesn't make much sense in writing a FromJSON instance for your Haskell Request data type.

Typically, for well defined binary protocols such as protocol buffers or Thrift you'd want a single type class.

Using intermediate types

aeson uses an intermediate data type, Value, that sits in between the user's data types and the raw bytes sent over the wire (or written to file.) This makes sense when you

  • might want to operate directly on this data type, without converting it to a regular Haskell value, or
  • want to simplify the type class instances the user has to write.

The downside of using an intermediate type is that it can hurt performance.

Wanted: an efficient union algorithm for hash-array mapped tries

There's really only one thing that's holding me back from releasing the new hash-array mapped trie (HAMT) based implementation of unordered-containers, a fast implementation of union.

The current released version is implemented in terms of a patricia trie (also known as a radix tree) and it uses a linear time union algorithm. The new version uses a naive implementation that simply calls insert repeatedly. This makes union 378% slower in the new implementation, on my main benchmark. Since I use HAMTs to implement HashSet in addition to HashMap, this is less than optimal.

My plan is to try to port the algorithm used on patricia tries to the HAMT implementation. I think this should be possible, due to both data structures being bit tries, but I haven't had time to try this myself. Any Haskellers out there who would like to try? If someone could implement a time complexity efficient version, I could then do the last bit of low-level optimization.

You can find the source of the HAMT-based implementation in the unordered-containers hamt branch on GitHub.

You'll also want to look at the union implementation used in IntMap.

The HAMT data structure itself is described in Ideal Hash Trees by Phil Bagwell. Note that we're using a persistent implementation, but most of the ideas (like removing empty nodes by using a bitmask) are still used.

Phil also told me that section 4.5 in A Generic Parallel Collection Framework might be useful in implementing union.

Wednesday, February 15, 2012

Slides from Haskell Performance Patterns talk

I gave a talk at the Bay Area Haskell Users Group today. As per usual, it was on the topic of performance. However, inspired by my new series on reliable techniques for writing fast Haskell code, the talk was structured around a set of rules-of-thumb and looks quite different from previous talks on the subject.

The slides are quite terse, but hopefully you'll get something out of them without me talking to them.

Thursday, February 2, 2012

Forcing values returned from monadic computations

This is the start of my series on reliable techniques for writing fast Haskell code. In each installment we'll look at a specific situation and explore some simple guidelines you can use in your everyday programming.

Today's topic is forcing values returned from monadic computations. After reading this article you'll hopefully be able to tell when it's appropriate to force values passed to return and when it's not.

Consider this function:

f :: Int -> Int -> Int
f x y = x + y

f is strict in both arguments and when called with two arguments it will return a fully evaluated Int value. No laziness anywhere to be seen.

Now lets look at a function that looks very similar to f, but involves return:

g :: Monad m => Int -> Int -> m Int
g x y = return $ x + y

g looks very similar to f so it's easy to think that it, like f, is strict in both arguments and that it will return a fully evaluated Int value inside the given monad.

However, that's not the case. Lets look what happens if m is e.g. a state monad (such as IO) [1]. Here's a possible definition of a state monad:

newtype State s a = S (s -> (a, s))

instance Monad (State s) where
    return x = S $ \ s -> (x, s)
    m >>= f = ...

Lets expand >>=, return, and $ in g:

g :: Int -> Int -> State s Int
g x y = S $ \ s -> (x + y, s)

Since our monad wraps the return value in a (lazy) pair, the evaluation of x + y is delayed until the first component of that pair is evaluated, some time later in the program. If we don't want to delay the evaluation, we must first force the value passed to return (e.g. by using $!):

g :: Int -> Int -> State s Int
g x y = return $! x + y

Here's what we get if we expand the stricter version:

g :: Int -> Int -> State s Int
g x y = let z = x + y
        in z `seq` S $ \ s -> (z, s)

Now the expression x + y is evaluated before the pair is created and the result (rather than a thunk representing the expression) is stored in the pair.

We can also see the difference between the two different versions of g in the core generated by GHC. First the lazier version:

g = \ (x :: Int) (y :: Int) (s :: s) ->
    (case x of _
         I# x# -> case y of _
             I# y# -> I# (+# x# y#),

and then the stricter one:

g = \ (x :: Int) (y :: Int) (s :: s) ->
    case x of _
        I# x# -> case y of _
            I# y# -> (I# (+# x# y#), s)

If you look carefully you'll see that in the lazier version the evaluation of x and y has been pushed inside the first component of the pair.

The stricter version still allocates an I# box, because pairs are polymorphic in their components and unboxed values (like Int#) cannot be stored in polymorphic fields and thus need to be wrapped.

The above example used a simple arithmetic expression but the same reasoning applies to data types with strict fields. For example, given this strict-pair data type:

data StrictPair a b = SP !a !b

you need to use $! if you want its fields to be evaluated before a monadic computation returning a StrictPair returns, like so:

h :: Int -> Int -> State s (StrictPair Int Int)
h x y = return $! SP (x+1) (y+1)

Aside: Int (and Double, Float, etc.) arithmetic is really just a special case of a data type with strict fields as Int is defined as:

data Int = I# Int#

where Int# is an unboxed type. Fields of unboxed types are always strict.

Forcing values that are already in WHNF has no effect e.g.

-- Has no effect and thus potentially confuses readers:
h x y = return $! (x, y)

Guideline: force data types with strict fields (including e.g. Int) to WHNF using $! before passing them to return.

  1. This behavior is not specific to the state monad. Any monad where return is lazy in its argument would behave the same.