Thursday, August 23, 2012

A new, fast, and easy-to-use CSV library

I'm proud to present the cassava library, an efficient, easy-to-use CSV library for Haskell. The library is designed in the style of aeson, Bryan O'Sullivan's excellent JSON library.

The library implements RFC 4180 with a few extensions, such as Unicode support. It is also fast. I compared it to the Python csv module, which is written in C, and cassava outperformed it in all my benchmarks. I've spent almost no time optimizing the library -- it owes its speed to attoparsec -- so there should still be room for speed improvements.

Here's the two-second crash course in using the library. Given a CSV file with this content:

John Doe,50000
Jane Doe,60000

here's how you'd process it record-by-record:

{-# LANGUAGE ScopedTypeVariables #-}

import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V

main :: IO ()
main = do
    csvData <- BL.readFile "salaries.csv"
    -- decode parses the CSV data and converts each record to the
    -- requested type, here (String, Int), reporting any parse or
    -- conversion failure as a Left value.
    case decode csvData of
        Left err -> putStrLn err
        Right v  -> V.forM_ v $ \ (name, salary :: Int) ->
            putStrLn $ name ++ " earns " ++ show salary ++ " dollars"

(In this example it's not strictly necessary to parse the salary field as an Int, as a String would do, but we do so for demonstration purposes.)

cassava is quite different from most CSV libraries, which let you parse a CSV file into something equivalent to [[ByteString]] but leave the rest to you. cassava instead lets you declare what you expect the type of each record to be (i.e. (String, Int) in the example above), and the library then both parses the CSV file and converts each column to the requested type, checking for errors as it goes.
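For example, here's a minimal sketch of decoding into a custom record type (the Person type is purely illustrative, and the sketch assumes cassava's (.!) operator for indexing into a record's fields):

import Control.Applicative ((<$>), (<*>))
import Data.Csv

data Person = Person { name :: String, salary :: Int }

-- Index-based conversion: field 0 is the name, field 1 the salary.
instance FromRecord Person where
    parseRecord v = Person <$> v .! 0 <*> v .! 1

With that instance in scope, decode csvData yields an Either String (V.Vector Person), with every field already converted and error-checked.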

Download the package from Hackage: http://hackage.haskell.org/package/cassava

Get the code from GitHub: https://github.com/tibbe/cassava

12 comments:

  1. Hooray! Last time I had to parse CSV in Haskell was a painful experience of various bad libraries, good to see something of quality.

  2. Nice.
    A few notes:
    1. You're using Either String a for error handling; what about (Failure String m) => m a instead? Or maybe even an additional type for error messages instead of String...
    2. What about support for CSV files with named columns (where column names are in the first row)? As far as I see, your NamedRecord stuff might help here...

    Replies
      > 1. You're using Either String a for error handling; what about (Failure String m) => m a instead? Or maybe even an additional type for error messages instead of String...

      I've considered returning e.g. a DecodeError data type on error with more detailed error information. I might still do that in the future if there's a need.

      I'm not going to use a monadic return type, as it's less general.

      > 2. What about support for CSV files with named columns (where column names are in the first row)? As far as I see, your NamedRecord stuff might help here...

      This is what NamedRecord and decodeByName are for. Check out the NamedRecord docs for an example of how to parse records by name.
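      For example, a minimal sketch (assuming a header row of name,salary; the Person type is illustrative, and the exact decodeByName signature may differ between versions):

      {-# LANGUAGE OverloadedStrings #-}

      import Control.Applicative ((<$>), (<*>))
      import qualified Data.ByteString.Lazy as BL
      import Data.Csv
      import qualified Data.Vector as V

      data Person = Person { name :: String, salary :: Int }

      -- Name-based conversion: fields are looked up by column name
      -- rather than by position.
      instance FromNamedRecord Person where
          parseNamedRecord r = Person <$> r .: "name" <*> r .: "salary"

      main :: IO ()
      main = do
          csvData <- BL.readFile "salaries.csv"
          case decodeByName csvData of
              Left err -> putStrLn err
              Right (_header, v) -> V.forM_ v $ \ p ->
                  putStrLn $ name p ++ " earns " ++ show (salary p) ++ " dollars"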

  3. I really wish that authors would put the required 'import' statements at the top of code snippets on their blogs. *hint hint* Saves a click through to the docs before I can try out the example.

    Awesome looking CSV library! I will definitely give this a go.

    Replies
    1. I've updated the example so you can copy-n-paste it into a file as-is.

  4. Is it possible to do something like parse a field into a "Maybe Int" to handle cases where the value may be missing?

    Replies
    1. You can define a newtype, e.g.

      import Control.Applicative
      import qualified Data.ByteString as B
      import Data.Csv

      newtype MaybeInt = MaybeInt (Maybe Int)

      instance FromField MaybeInt where
          parseField s
              | B.null s  = pure (MaybeInt Nothing)
              | otherwise = MaybeInt . Just <$> parseField s

      (N.B. I haven't tested or type checked this code.)

      I will look into adding a FromField instance for Maybe (and perhaps Either).

  5. Thanks a lot for creating this! It's MUCH faster and easier to use than other CSV-parsing options out there.

    I found that it was pretty much essential to use the Streaming variant of the library if I wanted to use CSV input to generate a data structure more complex than a Vector (e.g., creating a Map from a table with keys and values). If you write a complicated parseRecord instance and have a sizeable amount of data, the memory usage of the standard "decode" function blows up very fast.

    Replies
    1. I think I know how to make the standard decode use less memory. The issue is that we parse the whole CSV file into a Vector (Vector ByteString) and both Vector and ByteString have quite high memory overheads (e.g. 9 words per ByteString). Since we're creating loads of small Vectors and ByteStrings, memory balloons.

      The fix is to try to do the FromRecord conversion as we parse. I haven't had time to look into this yet, but it should bring the memory usage down from size(CSV) to size(Vector a), where "a" is the data type you use with FromRecord.
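      In the meantime, here's a rough sketch of the streaming approach described above (assuming the Data.Csv.Streaming module and its Foldable instance for Records; the exact decode signature varies between cassava versions):

      import qualified Data.ByteString.Lazy as BL
      import Data.Csv (HasHeader(NoHeader))
      import qualified Data.Csv.Streaming as S
      import qualified Data.Foldable as F
      import qualified Data.Map as M

      main :: IO ()
      main = do
          csvData <- BL.readFile "salaries.csv"
          -- Fold over records as they are parsed; the Foldable instance
          -- for Records skips rows that failed to convert.
          let salaries :: M.Map String Int
              salaries = F.foldr (\(name, salary) m -> M.insert name salary m)
                                 M.empty
                                 (S.decode NoHeader csvData)
          print (M.size salaries)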

  6. Thanks for this Johan, very useful.
