Saturday, November 17, 2012

Streaming and incremental CSV parsing using cassava

Today I released the next major version of cassava, my CSV parsing and encoding library.

New in this version are streaming and incremental parsing, exposed through Data.Csv.Streaming and Data.Csv.Incremental respectively. Both approaches allow for O(1)-space parsing and more flexible error handling. The latter also allows for interleaving parsing and I/O.

The API now exposes three ways to parse CSV files, ordered from most convenient and least flexible to least convenient and most flexible:

  • Data.Csv
  • Data.Csv.Streaming
  • Data.Csv.Incremental

For example, Data.Csv causes the whole parse to fail if there are any errors, either in parsing or type conversion. This is convenient if you want to parse a small to medium-sized CSV file that you know is correctly formatted.
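As a rough illustration of this all-or-nothing behaviour, here is a minimal sketch. It assumes the API of recent cassava releases, where the header argument is the `HasHeader` type; the exact signature in the version announced here may differ. The `decodeSalaries` helper and the sample data are made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Csv (HasHeader (NoHeader), decode)
import qualified Data.Vector as V

-- | Hypothetical helper: decode every row as a (name, salary) pair.
-- The result is either all rows or a single error for the whole input.
decodeSalaries :: BL.ByteString -> Either String (V.Vector (String, Int))
decodeSalaries = decode NoHeader

main :: IO ()
main = do
  -- Well-formed input: every row converts, so all rows come back.
  print (decodeSalaries "John,27\r\nJane,28\r\n")
  -- A single bad field ("n/a" is not an Int) turns the whole result into a Left.
  putStrLn (either (const "whole parse failed") show
                   (decodeSalaries "John,27\r\nJane,n/a\r\n"))
```

Note that the rows that did convert are not available in the failure case, which is why this module suits input you already trust.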

At the other extreme, if you're parsing a 1 GB CSV file that's being uploaded by some user of your webapp, you probably want to use the Data.Csv.Incremental module, to avoid high memory usage and to be able to deal more gracefully with formatting errors in the user's CSV file.
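The sketch below shows what feeding input chunk by chunk might look like. It assumes the `Parser` constructors found in recent cassava releases (`Fail`, `Many`, `Done`), which may not match this release exactly. The `feedChunks` helper is hypothetical, and a list of chunks stands in for data arriving from a socket or file handle.

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Csv (HasHeader (NoHeader))
import Data.Csv.Incremental (Parser (..), decode)
import Data.Either (partitionEithers)

-- | Hypothetical helper: feed chunks to the parser one at a time and
-- return (error messages, successfully converted rows).
feedChunks :: [B.ByteString] -> ([String], [(String, Int)])
feedChunks = go [] (decode NoHeader) . filter (not . B.null)
  where
    -- 'acc' holds the results seen so far, most recent first.
    -- Malformed CSV: keep what was parsed so far and report the error.
    go acc (Fail _ err) _ = finish acc [Left err]
    -- The parser has consumed all input.
    go acc (Done rs) _ = finish acc rs
    -- Some records are ready; hand the next chunk to the continuation.
    go acc (Many rs k) (c : cs) = go (reverse rs ++ acc) (k c) cs
    -- No chunks left: an empty chunk signals end of input.
    go acc (Many rs k) [] = go (reverse rs ++ acc) (k B.empty) []
    finish acc rs = partitionEithers (reverse acc ++ rs)

main :: IO ()
main = do
  -- The second record is split across two chunks; the third has a bad salary.
  let chunks = map B8.pack ["John,27\r\nJa", "ne,28\r\nBob,", "n/a\r\n"]
      (errors, rows) = feedChunks chunks
  mapM_ print rows
  mapM_ (putStrLn . ("skipped: " ++)) errors
```

Because each record arrives as its own `Either`, a bad row can be logged and skipped while the rest of the upload continues to be processed. This sketch collects everything in a list for brevity; a real application would handle each batch as it arrives to keep memory use constant.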

Other notable changes:

  • The various index-based decode functions now take an extra argument that allows you to skip the header line, if the file has one. Previously you had to use the name-based decode functions to work with files that contained headers.

  • Space usage in Data.Csv.decode and friends has been reduced significantly. However, these decode functions still have somewhat high space usage, so if you're parsing 100 MB or more of CSV data, you should use the Streaming or Incremental modules. I plan to reduce space usage substantially in the future.
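To make both points concrete, here is a small sketch that skips a header line and walks the records lazily with Data.Csv.Streaming. It assumes the `HasHeader` argument and the `Records` constructors (`Cons`, `Nil`) of recent cassava releases; the header argument in this release may have a different type. The `sumSalaries` function and its input are invented for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Csv (HasHeader (HasHeader))
import Data.Csv.Streaming (Records (..), decode)

-- | Hypothetical example: sum the salary column, counting rows that fail
-- to convert instead of aborting. Returns (total, number of bad rows).
sumSalaries :: BL.ByteString -> (Int, Int)
sumSalaries = go 0 0 . decode HasHeader  -- HasHeader skips the first line
  where
    go :: Int -> Int -> Records (String, Int) -> (Int, Int)
    -- Force both counters on each step so no chain of thunks builds up.
    go total bad records = total `seq` bad `seq` case records of
      Cons (Right (_, salary)) rest -> go (total + salary) bad rest
      Cons (Left _) rest            -> go total (bad + 1) rest
      -- End of stream. A 'Just' in the first field would mean the parser
      -- stopped on malformed input; this sketch returns what it has so far.
      Nil _ _                       -> (total, bad)

main :: IO ()
main = print (sumSalaries "name,salary\r\nJohn,27\r\nJane,n/a\r\nBob,30\r\n")
```

Since the input is a lazy ByteString and each record is consumed as soon as it is produced, the same function applied to the result of a lazy file read should not need to hold the whole file in memory.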

2 comments:

  1. Is there any chance of Unicode/UTF-8 support?

    1. Already supported. Use the ToField/FromField instances for Text.
