I'm proud to present the cassava library, an efficient, easy-to-use CSV library for Haskell. The library is designed in the style of aeson, Bryan O'Sullivan's excellent JSON library.
The library implements RFC 4180 with a few extensions, such as Unicode support. It is also fast. I compared it to the Python csv module, which is written in C, and cassava outperformed it in all my benchmarks. I've spent almost no time optimizing the library -- it owes its speed to attoparsec -- so there should still be room for speed improvements.
Here's the two-second crash course in using the library. Given a CSV file with this content:
John Doe,50000
Jane Doe,60000
here's how you'd process it record-by-record:
{-# LANGUAGE ScopedTypeVariables #-}

import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V

main :: IO ()
main = do
    csvData <- BL.readFile "salaries.csv"
    case decode csvData of
        Left err -> putStrLn err
        Right v -> V.forM_ v $ \ (name, salary :: Int) ->
            putStrLn $ name ++ " earns " ++ show salary ++ " dollars"
(In this example it's not strictly necessary to parse the salary field as an Int; a String would do. We do so for demonstration purposes.)
cassava is quite different from most CSV libraries. Most CSV libraries will let you parse CSV files into something equivalent to [[ByteString]], but after that you're on your own. cassava instead lets you declare what you expect the type of each record to be (i.e. (String, Int) in the example above), and the library will then both parse the CSV file and convert each column to the requested type, doing error checking as it goes.
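To make the typed-record idea concrete, here is a minimal sketch that decodes the same two-column file into a user-defined record rather than a tuple. The Salary type, its field names, and decodeSalaries are made up for illustration, and the sketch assumes a cassava release in which decode takes a HasHeader argument (unlike the one-argument call in the example above).

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import Data.Csv
import Data.Text (Text)
import qualified Data.Vector as V

-- Hypothetical record type for one row of the salaries file.
data Salary = Salary
    { salaryName   :: !Text
    , salaryAmount :: !Int
    } deriving (Show, Eq)

-- Convert a raw record (a vector of fields) into a Salary.
-- Each (.!) looks up a column by index and converts it via FromField.
instance FromRecord Salary where
    parseRecord v
        | V.length v == 2 = Salary <$> v .! 0 <*> v .! 1
        | otherwise       = mzero  -- wrong number of columns

-- Assumption: decode takes a HasHeader flag, as in later cassava releases.
decodeSalaries :: BL.ByteString -> Either String (V.Vector Salary)
decodeSalaries = decode NoHeader
```

With this instance, a row whose second column is not an integer makes the whole decode call return a Left with an error message instead of a partially converted result.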
Download the package from Hackage: http://hackage.haskell.org/package/cassava
Get the code from GitHub: https://github.com/tibbe/cassava
great! :)
Hooray! Last time I had to parse CSV in Haskell was a painful experience of various bad libraries, good to see something of quality.
Great! More like this everybody.
Nice.
A few notes:
1. You're using Either String a for error handling; what about (Failure String m) => m a instead? Or maybe even an additional type for error messages instead of String...
2. What about support for CSV files with named columns (where column names are in the first row)? As far as I see, your NamedRecord stuff might help here...
> 1. You're using Either String a for error handling; what about (Failure String m) => m a instead? Or maybe even an additional type for error messages instead of String...
I've considered returning e.g. a DecodeError data type on error with more detailed error information. I might still do that in the future if there's a need.
I'm not going to use a monadic return type, as it's less general.
> 2. What about support for CSV files with named columns (where column names are in the first row)? As far as I see, your NamedRecord stuff might help here...
This is what NamedRecord and decodeByName are for. Check out the NamedRecord docs for an example of how to parse records by name.
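As a rough sketch of what parsing by name can look like: the Person type and the column names "name" and "salary" below are assumptions made up for illustration, not something taken from the library docs. decodeByName reads the first row as the header and returns it alongside the decoded records.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V

-- Hypothetical record type; the column names below are assumed too.
data Person = Person
    { personName   :: !String
    , personSalary :: !Int
    } deriving (Show, Eq)

-- Look fields up by column name instead of by position.
instance FromNamedRecord Person where
    parseNamedRecord r = Person <$> r .: "name" <*> r .: "salary"

-- The first row of the input is treated as the header.
decodePeople :: BL.ByteString -> Either String (Header, V.Vector Person)
decodePeople = decodeByName
```

Because fields are looked up by name, the same instance keeps working if the columns in the file are reordered.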
I really wish that authors would put the required 'import' statements at the top of code-snippets on their blogs. *hint hint* Saves a click through to the docs before I can try out the example.
Awesome looking CSV library! I will definitely give this a go.
I've updated the example so you can copy-n-paste it into a file as-is.
Is it possible to do something like parse a field into a "Maybe Int" to handle cases where the value may be missing?
You can define a newtype, e.g.

import qualified Data.ByteString as B

newtype MaybeInt = MaybeInt (Maybe Int)

instance FromField MaybeInt where
    parseField s
        | B.null s  = return (MaybeInt Nothing)
        | otherwise = (MaybeInt . Just) <$> parseField s
(N.B. I haven't tested or type checked this code.)
I will look into adding an instance of FromField for Maybe (and perhaps Either).
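For comparison, later cassava releases do ship a FromField instance for Maybe, where an empty field converts to Nothing. Assuming such a release (and a decode that takes a HasHeader argument), no newtype is needed; decodeOptionalSalaries is just an illustrative name.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V

-- Assumption: the installed cassava provides
--   instance FromField a => FromField (Maybe a)
-- so an empty salary column becomes Nothing.
decodeOptionalSalaries
    :: BL.ByteString -> Either String (V.Vector (String, Maybe Int))
decodeOptionalSalaries = decode NoHeader
```

A row such as "Jane Doe," (empty second column) then decodes to a pair whose second component is Nothing.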
Thanks a lot for creating this! It's MUCH faster and easier to use than other CSV-parsing options out there.
I found that it was pretty much essential to use the Streaming variant of the library if I wanted to use CSV input to generate a data structure more complex than a Vector (e.g., creating a Map from a table with keys and values). If you write a complicated parseRecord instance and have a sizeable amount of data, the memory usage of the standard "decode" function blows up very fast.
I think I know how to make the standard decode use less memory. The issue is that we parse the whole CSV file into a Vector (Vector ByteString) and both Vector and ByteString have quite high memory overheads (e.g. 9 words per ByteString). Since we're creating loads of small Vectors and ByteStrings, memory balloons.
The fix is to try to do the FromRecord conversion as we parse. I haven't had time to look into this yet, but it should bring the memory usage down from size(CSV) to size(Vector a), where "a" is the data type you use with FromRecord.
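In the meantime, the streaming approach mentioned in the comment above can be sketched roughly as follows. This is an illustration under assumptions, not a tuned implementation: buildMap is a made-up name, the input is assumed to be a header-less two-column file of keys and integer values, and the Data.Csv.Streaming decode is assumed to take a HasHeader argument. Records are folded into a Map one at a time, so the intermediate Vector of records is never built.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import Data.Csv (HasHeader (NoHeader))
import qualified Data.Csv.Streaming as S
import qualified Data.Map.Strict as M

-- Fold the lazily produced records into a Map, stopping at the
-- first parse or conversion error.
buildMap :: BL.ByteString -> Either String (M.Map String Int)
buildMap = go M.empty . S.decode NoHeader
  where
    -- A successfully converted record: insert it and force the
    -- accumulator so thunks don't pile up.
    go acc (S.Cons (Right (k, v)) rest) =
        let acc' = M.insert k v acc
        in acc' `seq` go acc' rest
    -- A record that failed to convert.
    go _ (S.Cons (Left err) _) = Left err
    -- End of input without a parse error.
    go acc (S.Nil Nothing _) = Right acc
    -- Parsing stopped because of malformed input.
    go _ (S.Nil (Just err) _) = Left err
```

If the same key appears more than once, M.insert keeps the value from the later row; swap in M.insertWith if another policy is wanted.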