Tuesday, February 21, 2012

Designing a serialization API

Haskell has been around long enough now that some design patterns have emerged. Today I'd like to talk about how to design a serialization API. In this post I'll use the word serialization in a wider sense and let it include converting data to and from e.g. database rows, in addition to more common meanings, such as converting values to binary or textual formats (e.g. JSON.)

The pattern has been extracted from a few packages on Hackage, in particular the binary, aeson, and mysql-simple packages.

Type based serialization

This pattern addresses the problem of converting a Haskell value to and/or from some other representation. As an example we'll look at converting a Haskell value to and from JSON, in some detail, followed by a quick look at binary serialization and converting values to and from MySQL rows.

JSON serialization

First, we can represent all of JSON's constructs directly in Haskell, using a couple of data types. Here are the types used in aeson:

data Number = I !Integer
            | D !Double

type Object = HashMap Text Value

type Array = Vector Value

data Value = Object !Object
           | Array !Array
           | String !Text
           | Number !Number
           | Bool !Bool
           | Null

This representation can be useful on its own, for example when you want to write generic traversals over JSON documents, but it's inconvenient and error-prone to work with most of the time. What we'd like to do is convert a JSON document we receive into a more domain-specific representation as soon as possible.
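
As an illustration, here's a minimal sketch of such a generic traversal: a hypothetical collectStrings function (not part of aeson) that gathers every string appearing anywhere in a document.

import Data.Aeson (Value(..))
import Data.Text (Text)
import qualified Data.HashMap.Strict as HashMap
import qualified Data.Vector as Vector

-- Walk the whole document and collect every string it contains.
collectStrings :: Value -> [Text]
collectStrings (String t) = [t]
collectStrings (Object o) = concatMap collectStrings (HashMap.elems o)
collectStrings (Array a)  = concatMap collectStrings (Vector.toList a)
collectStrings _          = []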

Aside: even though we'll serialize Haskell values to and from this type, it would have been reasonable, although perhaps more cumbersome for the users of our API, to skip the Value type entirely and convert our Haskell values directly to and from JSON-encoded ByteStrings.

Let's say we'd like to store values representing people as a JSON document. The document should look like this:

[
  {
    "name": "Bob",
    "born": 1984
  },
  {
    "name": "Alice",
    "born": 1983
  }
]

We'd like to represent this document as a Vector of Person values:

data Person = Person { name :: !Text, born :: !Int }

What we need is a pair of functions, one for serializing Person values to JSON and one for deserializing JSON to Person values. Since we want to do this based on the type we're serializing from/to, we need a type class or, in the case of aeson, two:

class ToJSON a where
    toJSON :: a -> Value

class FromJSON a where
    parseJSON :: Value -> Parser a

The type of toJSON is straightforward. The type of parseJSON is a bit more complicated in that it allows the API to express that

  • the conversion can fail and
  • the deserialization can work on incomplete ByteString chunks.

The former is fundamental to deserialization; the latter, while very useful, is not.

aeson provides instances of the two type classes for all basic data types (and then some.) This makes it easy to define higher-level instances in terms of these basic ones.

Writing the two required instances for our Person data type is easy:

instance ToJSON Person where
    toJSON p = object [
          "name" .= name p
        , "born" .= born p
        ]

instance FromJSON Person where
    parseJSON (Object v) = Person <$>
                           v .: "name" <*>
                           v .: "born"
    parseJSON _          = mzero

Finally, aeson provides two convenient functions for serializing and deserializing values to and from ByteStrings:

decode :: FromJSON a => ByteString -> Maybe a
encode :: ToJSON a => a -> ByteString
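
As a quick sanity check, here's a sketch of a round trip using these two functions, assuming the Person instances above are in scope; decode returns Nothing if the input isn't valid JSON or doesn't have the shape parseJSON expects.

{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (decode, encode)
import qualified Data.Text.IO as Text

people :: [Person]
people = [Person "Bob" 1984, Person "Alice" 1983]

main :: IO ()
main = do
    let bytes = encode people
    case decode bytes :: Maybe [Person] of
        Just ps -> mapM_ (Text.putStrLn . name) ps   -- prints "Bob" and "Alice"
        Nothing -> putStrLn "decoding failed"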

Binary serialization

The binary package is simpler than the aeson package in that the user only has to define a single instance and doesn't have to deal with any intermediate data types (i.e. Value.) Here's the Binary type class:

class Binary t where
    put :: t -> Put
    get :: Get t

Get is very much like the Parser monad used in aeson. Put has no direct correspondence in aeson; it's a writer monad that lets you write a sequence of bytes.

The simplest functions on these monads are putWord8 and getWord8:

putWord8 :: Word8 -> Put
getWord8 :: Get Word8
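
To get a feel for the two monads in isolation, here's a tiny sketch (not from the binary documentation) that writes two bytes with runPut and reads them back with runGet:

import Data.Binary.Get (Get, getWord8, runGet)
import Data.Binary.Put (putWord8, runPut)
import Data.Word (Word8)

-- Write two bytes to a lazy ByteString, then read them back.
twoBytes :: (Word8, Word8)
twoBytes = runGet getPair (runPut (putWord8 1 >> putWord8 2))
  where
    getPair :: Get (Word8, Word8)
    getPair = do
        a <- getWord8
        b <- getWord8
        return (a, b)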

User-defined instances are defined as monadic (or applicative) actions, like in this example from the Haddock documentation:

data Exp = IntE Int
         | OpE  String Exp Exp
   deriving Show

instance Binary Exp where
       put (IntE i)          = do put (0 :: Word8)
                                  put i
       put (OpE s e1 e2)     = do put (1 :: Word8)
                                  put s
                                  put e1
                                  put e2

       get = do t <- get :: Get Word8
                case t of
                     0 -> do i <- get
                             return (IntE i)
                     1 -> do s  <- get
                             e1 <- get
                             e2 <- get
                             return (OpE s e1 e2)
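
With the instance in place, the encode and decode functions from Data.Binary (the binary counterparts of the aeson functions above) round-trip a value through a lazy ByteString. A rough sketch; note that binary's decode throws an error on malformed input rather than returning a Maybe:

import Data.Binary (decode, encode)

-- Serialize an expression and read it straight back.
roundTrip :: Exp
roundTrip = decode (encode (OpE "+" (IntE 1) (IntE 2)))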

Interacting with a database

You might not associate database APIs with serialization at first, but the two are actually related. Consider this example, which uses the mysql-simple package:

import Control.Monad (forM_)
import qualified Data.Text as Text

-- 'conn' is an open connection obtained elsewhere via mysql-simple's connect
xs <- query_ conn "select name, age from users"
forM_ xs $ \ (name, age) ->
    putStrLn $ Text.unpack name ++ " is " ++ show (age :: Int)

What's going on here?

The query_ function returns a list of values that are instances of the QueryResults class. The class is defined as follows:

class QueryResults a where
    convertResults :: [Field] -> [Maybe ByteString] -> a

If you squint, the type of convertResults matches the shape we saw in parseJSON and get: the serialized representation, here a database row, is passed as arguments, and the result type is the type we want to convert to.

Note that in the example above the result type was resolved to (Text, Int) using type inference. This makes the code very concise but also very flexible; we can convert the result to a different type:

data Person = Person { name :: !Text, age :: !Int }

forM_ xs $ \ (Person name age) ->
    -- do something with @name@ and @age@

Depending on how the result value is used, you might need to give an explicit type signature.
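
For the pattern match above to work, Person needs a QueryResults instance of its own. Here's a minimal sketch in the style of the mysql-simple documentation, where convert (from Database.MySQL.Simple.Result) converts a single column and convertError reports a column-count mismatch:

import Database.MySQL.Simple.QueryResults (QueryResults(..), convertError)
import Database.MySQL.Simple.Result (convert)

instance QueryResults Person where
    convertResults [fa, fb] [va, vb] = Person (convert fa va) (convert fb vb)
    convertResults fs vs             = convertError fs vs 2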

The pattern

The pattern should hopefully be clear at this point. If you want to create a serialization API, define a type class (or two) which encapsulates the serialization and deserialization functions. On top of this simple foundation there are a few design decisions you'll have to make, each outlined in its own section below.
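
Stripped down to its bare bones, that foundation might look something like this (the names are hypothetical, and the serialized form here is simply a ByteString):

import Data.ByteString (ByteString)

-- One class per direction, both indexed by the Haskell type being converted.
class ToWire a where
    toWire :: a -> ByteString

class FromWire a where
    fromWire :: ByteString -> Maybe a   -- deserialization can fail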

Mapping the data to Haskell types

To ensure maximum type safety and make the API easy to work with, we'd like the Haskell types we convert our serialized representation into not to retain traces of that representation. For example, it would be unfortunate if we had to pass around a custom JsonString data type all over our program just because we parsed JSON data at some point. We'd like to convert such specific representations into regular data types (e.g. Text) at the borders of our application.

For some serialization APIs there isn't much the API designer needs to do to help the user convert the serialized data to plain old Haskell values, but look out for cases where you might force the user to use serialization-specific data types all over the program.

Deciding how many type classes to use

How many type classes do you need? It depends on whether the user will always want both serialization and deserialization. With JSON you often want only one of the two. For example, a web service API might define a JSON request and response type, but since the server never sends you a request, it doesn't make much sense to write a FromJSON instance for your Haskell Request data type.

Typically, for well-defined binary protocols such as Protocol Buffers or Thrift, you'd want a single type class.

Using intermediate types

aeson uses an intermediate data type, Value, that sits between the user's data types and the raw bytes sent over the wire (or written to a file.) This makes sense when you

  • might want to operate directly on this data type, without converting it to a regular Haskell value, or
  • want to simplify the type class instances the user has to write.

The downside of using an intermediate type is that it can hurt performance.

5 comments:

  1. I'm experimenting with a graph serialisation library that uses an intermediate list-based format. The reason is that the plan is for multiple graph types to be serialisable to/from multiple binary encodings. To ensure canonicity (and because most formats require it), the graph -> intermediate construction requires a BFS-based traversal of the original graph.

    In practice, so far the usage of the intermediate type doesn't seem to hurt performance due to list fusion.

  2. I wrote some extended thoughts on an additional axis here: http://mtnviewmark.wordpress.com/2012/02/22/another-serialization-axis/

  3. Mark, I'm glad you covered that aspect, because I completely forgot to mention it. Sometimes we can use a type class-based approach to output multiple different encodings, by using a newtype to control which instance gets used, but sometimes it's better to not use a type-directed approach at all.

  4. I'm really loving your presentation of emerged design patterns.
    Please, continue it! :)

  5. While we're adding axes, here are some more (which would influence the design):

    * Generic programming (type-based derivation of the binary format).
    * Automatic versioning (save a FingerPrint of the Haskell type).
