Saturday, March 1, 2014

Google Summer of Code Projects

Every year I put together a list of Google Summer of Code projects I'd like to see students work on. Here's my list for this year.

As usual, the focus is on existing infrastructure. I believe, and I think our past experience bears out, that such projects are more successful.

Improved Hackage login

The Hackage login/user system could use several improvements.

From a security perspective, we need to switch to a current best practice implementation of password storage, such as bcrypt, scrypt, or PBKDF2. MD5, which is what HTTP Digest Auth uses, has known attacks.
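
To make the requirement concrete, here's a toy sketch of the salted, iterated scheme that bcrypt, scrypt, and PBKDF2 all build on. Everything here is made up for illustration -- `toyDigest` is NOT a cryptographic hash, and a real deployment must use an audited library implementation -- but it shows the two ingredients an MD5-based scheme lacks: a per-user salt and deliberate slowness.

```haskell
import Data.Char (ord)

-- Toy stand-in for a real hash function (NOT cryptographic!).
toyDigest :: String -> Int
toyDigest = foldl (\h c -> (h * 33 + ord c) `mod` 1000003) 5381

-- Mix in a per-user salt, then re-hash many times so that each
-- brute-force guess costs the attacker proportionally more work.
hashPassword :: Int -> String -> String -> Int
hashPassword iters salt password = go iters (toyDigest (salt ++ password))
  where
    go 0 h = h
    go n h = go (n - 1) (toyDigest (show h))

main :: IO ()
main = print (hashPassword 10000 "per-user-salt" "hunter2")
```

The salt defeats precomputed (rainbow-table) attacks; the iteration count is a tunable work factor that can be raised as hardware gets faster.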

From a usability perspective, we need to move to a cookie-based login system. While using HTTP auth is convenient from an implementation perspective, it doesn't work well from a usability perspective (that's why sites that otherwise try to follow the REST approach don't use HTTP auth). A cookie-based approach allows us to, among other things,

  • display the current login status of the user,
  • allow users to conveniently access a user preference page,
  • allow users to log out, and
  • adapt the UI to the current user.

An example of the latter would be showing the link to the maintainer section only for packages you maintain, or showing additional actions to site admins. HTTP auth also introduces an extra page transition when you want to move from a list of items to editing that list (e.g. you can view uploaders on /packages/uploaders/, but you need to click a link that takes you to /packages/uploaders/edit to edit them). This is because HTTP auth does authentication on a per-request basis.

Other Hackage usability improvements

There are several other Hackage usability improvements I'd like to see.

The homepage is currently a write-up about the new Hackage server. While that made sense when the new Hackage server was brand new, a more useful homepage would include a list of recently updated packages, most popular packages, packages you maintain, and a link to "getting started" material and other documentation. Looking at other languages' package repo homepages for inspiration wouldn't be a bad start.

The search result page should include download counts and a more easily scannable result list. The current list is hard to read because the package descriptions don't line up. For example, compare the search result pages for "xml" on Hackage and RubyGems.

Faster Cabal/GHC parallel builds

Mikhail Glushenkov and others have done a great job making our compiles faster. Cabal already builds packages in parallel, and with GHC 7.8 it will build modules in parallel as well.

There are still more opportunities for parallelism. Cabal doesn't build individual components or different versions of the same component (e.g. vanilla and profiling) in parallel.

Building all the test suites in parallel would save time if you have many test suites and building vanilla and profiling versions at the same time would allow users to turn on profiling by default (in ~/.cabal/config) without paying (much of) a compile time penalty.
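
Enabling profiling by default is a one-line change in the cabal user configuration. A sketch of the relevant field (the exact set of profiling fields depends on your cabal version):

```
-- in ~/.cabal/config
library-profiling: True
```

Today that line makes every install roughly twice as slow, since each library is compiled twice in sequence; building both ways in parallel would remove most of that penalty.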

There's already some work underway here, so there might not be enough Cabal work to last a student through the summer. The remaining time could be spent increasing the amount of parallelism offered by ghc -j.

Today the parallel speed-up offered by ghc -j is quite modest, and I believe we ought to be able to increase it. If you exclude link time, N independent modules of the same size should see close to an N-times parallel speed-up, which I don't think we get today. While real packages don't have this much available parallelism, improvements in the embarrassingly parallel case should help the average case.
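
The ceiling on that speed-up can be estimated with Amdahl's law, treating link time (and other serial work) as the serial fraction. A small sketch (the function name is mine, not GHC's):

```haskell
-- Amdahl's law: if a fraction s of the build is inherently serial
-- (e.g. linking), the best speed-up on n cores is 1 / (s + (1 - s) / n).
speedUp :: Double -> Int -> Double
speedUp s n = 1 / (s + (1 - s) / fromIntegral n)

main :: IO ()
main = do
  print (speedUp 0 8)    -- embarrassingly parallel: the full 8x
  print (speedUp 0.1 8)  -- 10% serial work caps the win well below 8x
```

Even a small serial fraction dominates at high core counts, which is why shrinking the non-parallel parts of ghc -j matters as much as adding threads.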

Cabal file pretty-printer

If we had a Cabal file pretty-printer, in the spirit of gofmt for Go, we could more easily apply automatic rewrites to Cabal files. Having a formatter that applies a standard (i.e. normalizing) format to all files would make rewrite tools much simpler, as they wouldn't have to worry about preserving user formatting. Some tools that would benefit:

  • cabal freeze, which will be included in Cabal-1.20
  • cabal init
  • A cabal version number bumper/PVP helper

I don't think such a pretty-printer should be terribly clever. Unlike Haskell, Cabal files don't support pattern matching, so aligning things doesn't really help readability much. Something simple, like a 2-space (or 4-space) indent and starting each list of items on a new line below the field "header", ought to be enough. Here's an example:

name: Cabal
version: 1.19.2
copyright: 2003-2006, Isaac Jones
           2005-2011, Duncan Coutts
license: BSD3
license-file: LICENSE
author:
  Isaac Jones <>
  Duncan Coutts <>
synopsis: A framework for packaging Haskell software
description:
  The Haskell Common Architecture for Building Applications and
  Libraries: a framework defining a common interface for authors to
  more easily build their Haskell applications in a portable way.
  The Haskell Cabal is part of a larger infrastructure for
  distributing, organizing, and cataloging Haskell libraries and
  tools.
category: Distribution
cabal-version: >=1.10
build-type: Custom

extra-source-files:
  README tests/README changelog

source-repository head
  type:     git
  subdir:   Cabal

library
  build-depends:
    base       >= 4 && < 5,
    deepseq    >= 1.3 && < 1.4,
    filepath   >= 1 && < 1.4,
    directory  >= 1 && < 1.3,
    process    >= 1.1.0.1 && < 1.3,
    time       >= 1.1 && < 1.5,
    containers >= 0.1 && < 0.6,
    array      >= 0.1 && < 0.6,
    pretty     >= 1 && < 1.2,
    bytestring >= 0.9

  if !os(windows)
    build-depends:
      unix >= 2.0 && < 2.8

  ghc-options: -Wall -fno-ignore-asserts -fwarn-tabs
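
A normalizing formatter of this sort boils down to a couple of layout rules per field. A minimal sketch (a hypothetical helper, not part of the Cabal library):

```haskell
-- Lay out one field using the simple rules above: a single value stays
-- on the header line; a list of values starts on fresh lines indented
-- two spaces under the field name.
renderField :: String -> [String] -> String
renderField name [v]    = name ++ ": " ++ v ++ "\n"
renderField name values = unlines ((name ++ ":") : map ("  " ++) values)

main :: IO ()
main = putStr (renderField "license" ["BSD3"]
            ++ renderField "extra-source-files" ["README", "changelog"])
```

Because the output depends only on the parsed field contents, any tool that rewrites a Cabal file can round-trip through the parser and emit a canonical layout, with no need to preserve the user's original whitespace.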

Thursday, April 25, 2013

More GSoC ideas

Here are another two GSoC ideas.

Faster ghc -c building

The main obstacle to implementing completely parallel builds in Cabal is that calling ghc --make on a bunch of modules is much faster than calling ghc -c on each module.

Today, Cabal builds modules by passing them to ghc --make, but we'd like to create the module dependency graph ourselves and then call ghc -c on each module, in parallel. However, since ghc -c is so much slower than ghc --make, the user needs 2 (and sometimes even more) cores for parallel builds to pay off. If we could make ghc -c faster, we could write a parallel build system that actually gives good speed-ups on typical developer hardware.

The project would involve profiling ghc -c to figure out where the time is spent and then trying to improve performance. One possible source of inefficiency is reparsing all .hi files every time ghc -c is run.

Preferably the student should be at least vaguely familiar with the GHC source code.

Cabal dependency solver improvements

There's one shortcoming in the Cabal package dependency solver that is starting to bite more and more often, namely not treating the components in a .cabal file (i.e. libraries, executables, tests, and benchmarks) as separate entities for the purpose of dependency resolution. In practice this means that for many core libraries this fails:

cabal install --only-dependencies --enable-tests

but this succeeds:

cabal install <manually compiled list of test dependencies>
cabal install --only-dependencies

The reason is that the library defined in the package is a dependency of the test framework (i.e. the test-framework package), creating a dependency cycle involving the library itself and the test framework. However, if the test dependency is expressed as:

library

test-suite my-tests
  -- No direct dependency on the library:
  hs-source-dirs: . tests

the dependency solver could find a solution, as the test suite no longer depends on the library, but it doesn't today.

The project would involve fixing the solver to treat each component separately (i.e. as if it was a separate package) for the purpose of dependency resolution.

For an example of this problem see the hashable package's .cabal file. In this case the dependency cycle involves hashable and Criterion.

Monday, April 15, 2013

GSoC ideas

This year's Google Summer of Code is upon us. Every year I try to come up with a list of projects that I'd like to see done. Here's this year's list, in no particular order:

Better formatting support for Haddock

While adequate for basic API docs, Haddock leaves something to be desired when you want to write more than a paragraph or two.

For example, you can't use bold text for emphasis or have links with anchor text. Support for images or inline HTML (e.g. for creating tables) is similarly missing. All headings need to be defined in the export list, which is both inconvenient and mixes up API organization with the use of headings to structure longer segments of text.

This project would try to improve the state of Haskell documentation by improving the Haddock markup language by either

  • adding features from Markdown to the Haddock markup language, or
  • adding a new markup language that is a superset of Markdown to Haddock.

Why Markdown? Markdown is what most of the programming-related part of the web (e.g. GitHub, StackOverflow) has standardized on as a human-friendly markup language. The reason Markdown works so well is that it codifies the current practice, already used and improved over time in e.g. mailing list discussions, instead of inventing a brand new language.

Option 1: Add Markdown as an alternative markup language

This option would let users opt in to using (a superset of) Markdown instead of the current Haddock markup by putting a

{-# HADDOCK Markdown #-}

pragma on top of the source file. The language would be a superset as we'd still want to support single-quoted strings to hyperlink identifiers, etc.

This option might run into difficulties with the C preprocessor, which also uses # for its markup. Part of the project would involve thinking about that problem and more generally the implications of using Markdown in Haddock.

Option 2: Add features from Markdown to Haddock

This option is slightly less ambitious in that we'd be adding a few select features from Markdown to Haddock, for example support for bold text and anchor texts. Since we're not trying to support all of Markdown the issue with the C preprocessor could be solved by not supporting #-style headings while still supporting *bold* for bold text.

More parallelism in Cabal builds

Builds could always be faster. There are a few more things that Cabal could build in parallel:

  • Each component (e.g. tests) could be built in parallel, while taking dependencies into account (i.e. from executables to the library).
  • Profiling and non-profiling versions could be built in parallel, making it much cheaper to always enable profiling by default in ~/.cabal/config.
  • Individual modules could be built in parallel.

The last option gives the most parallelism but is also the hardest to implement. It requires that we have correct dependency information (which we could get from ghc -M) and even then compiling individual modules using ghc -c is up to 2x slower compared to compiling the same modules with ghc --make. Still, it could be a win for anyone with >2 CPU cores and it would support building e.g. profiling libraries in parallel without much (or any) extra work.
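
The scheduling part of the last option is straightforward once the dependency information is correct: modules fall into "waves" that can each be compiled in parallel. A toy sketch, assuming an acyclic graph (names are mine, not Cabal's):

```haskell
import qualified Data.Map.Strict as M
import Data.List ((\\))

-- Group modules into waves that can be compiled in parallel. 'deps'
-- maps each module to the modules it imports (the information ghc -M
-- provides). Assumes an acyclic dependency graph.
buildWaves :: M.Map String [String] -> [[String]]
buildWaves deps = go (M.keys deps) []
  where
    go [] _ = []
    go remaining done =
      -- a module is ready once everything it imports has been built
      let ready = [m | m <- remaining, all (`elem` done) (deps M.! m)]
      in ready : go (remaining \\ ready) (done ++ ready)

main :: IO ()
main = print (buildWaves (M.fromList
  [("A", []), ("B", ["A"]), ("C", ["A"]), ("D", ["B", "C"])]))
-- prints [["A"],["B","C"],["D"]]: B and C can be compiled in parallel
```

The hard part isn't this scheduling but making each per-module ghc -c invocation cheap enough that running the waves in parallel beats a single ghc --make.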

Saturday, November 17, 2012

Streaming and incremental CSV parsing using cassava

Today I released the next major version of cassava, my CSV parsing and encoding library.

New in this version is streaming and incremental parsing, exposed through Data.Csv.Streaming and Data.Csv.Incremental respectively. Both approaches allow for O(1)-space parsing and more flexible error handling. The latter also allows for interleaving parsing and I/O.

The API now exposes three ways to parse CSV files, ordered from most convenient and least flexible to least convenient and most flexible:

  • Data.Csv
  • Data.Csv.Streaming
  • Data.Csv.Incremental

For example, Data.Csv causes the whole parse to fail if there are any errors, either in parsing or type conversion. This is convenient if you want to parse a small to medium-sized CSV file that you know is correctly formatted.

On the other extreme, if you're parsing a 1GB CSV file that's being uploaded by some user of your webapp, you probably want to use the Data.Csv.Incremental module, to avoid high memory usage and to deal more gracefully with formatting errors in the user's CSV file.
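
To make the interleaving of parsing and I/O concrete, here is a toy, pure-String sketch of what an incremental interface's shape looks like. This is not cassava's actual API (the real one lives in Data.Csv.Incremental, works on ByteString, and does typed conversion); it only shows how feeding chunks yields records as they become complete.

```haskell
-- Feeding a chunk yields the records completed so far plus, unless the
-- empty string signalled end of input, a continuation to feed next.
newtype Parser = Parser (String -> ([[String]], Maybe Parser))

splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitOn c rest

parseCsv :: Parser
parseCsv = go ""
  where
    go pending = Parser $ \chunk ->
      if null chunk  -- end of input: flush any unterminated last record
        then ([splitOn ',' pending | not (null pending)], Nothing)
        else
          let pieces   = splitOn '\n' (pending ++ chunk)
              complete = filter (not . null) (init pieces)
          in (map (splitOn ',') complete, Just (go (last pieces)))

-- Drive the parser with chunks as they "arrive", e.g. from a socket.
runChunks :: [String] -> [[String]]
runChunks chunks = loop parseCsv (chunks ++ [""])
  where
    loop _ [] = []
    loop (Parser f) (c : cs) =
      let (recs, next) = f c
      in recs ++ maybe [] (\p -> loop p cs) next

main :: IO ()
main = print (runChunks ["John Doe,5", "0000\nJane Doe,60000\n"])
```

Because each feed step returns control to the caller, the caller can read the next network chunk, report errors, or abort early -- none of which is possible when the whole input must be parsed in one go.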

Other notable changes:

  • The various index-based decode functions now take an extra argument that allows you to skip the header line, if the file has one. Previously you had to use the name-based decode functions to work with files that contained headers.

  • Space usage in Data.Csv.decode and friends has been reduced significantly. However, these decode functions still have somewhat high space usage, so if you're parsing 100MB or more of CSV data, you'll want to use the Streaming or Incremental modules. I plan to improve space usage by a large amount in the future.

Friday, August 24, 2012

You can soon play in the cabal sandbox

I just merged the last large set of patches, written by Mikhail Glushenkov, that are needed to implement the new cabal sandbox feature (a generalization of cabal-dev, cab, etc) into the cabal master branch.

This work will alleviate the dependency problems that crop up too often when working with cabal. It doesn't solve all problems, but it should prevent package breakages due to reinstalls (just like cabal-dev does) by avoiding the use of a shared package database.

There's still some work left to be done, mainly coming up with a UI that supports all the use cases we have and is easy to use at the same time. I hope we can get that done by the end of ICFP.

The mechanism used to implement this feature, package environments, will also allow us to do other useful things, like letting you specify a list of exact package versions to build your package against. This is useful e.g. if you're building and shipping an executable, as you don't want each release to be built against a different set of library versions, depending on what the dependency solver picked on a given day.
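
For illustration, such a pinned-down build might be expressed as a constraints list in a package environment file (the package names and versions below are made up):

```
constraints: base ==4.6.0.1,
             bytestring ==0.10.0.2,
             text ==0.11.3.1
```

With a file like this checked into the repository, every build of the executable resolves to the same set of library versions, regardless of what has been uploaded to Hackage since.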

Thursday, August 23, 2012

A new fast and easy to use CSV library

I'm proud to present the cassava library, an efficient, easy to use CSV library for Haskell. The library is designed in the style of aeson, Bryan O'Sullivan's excellent JSON library.

The library implements RFC 4180 with a few extensions, such as Unicode support. It is also fast. I compared it to the Python csv module, which is written in C, and cassava outperformed it in all my benchmarks. I've spent almost no time optimizing the library -- it owes its speed to attoparsec -- so there should still be room for speed improvements.

Here's the two second crash course in using the library. Given a CSV file with this content:

John Doe,50000
Jane Doe,60000

here's how you'd process it record-by-record:

{-# LANGUAGE ScopedTypeVariables #-}

import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V

main :: IO ()
main = do
    csvData <- BL.readFile "salaries.csv"
    case decode csvData of
        Left err -> putStrLn err
        Right v -> V.forM_ v $ \ (name, salary :: Int) ->
            putStrLn $ name ++ " earns " ++ show salary ++ " dollars"

(In this example it's not strictly necessary to parse the salary field as an Int, a String would do, but we do so for demonstration purposes.)

cassava is quite different from most CSV libraries. Most CSV libraries will let you parse CSV files into something equivalent to [[ByteString]], but after that you're on your own. cassava instead lets you declare what you expect the type of each record to be (i.e. (Text, Int) in the example above) and the library will then both parse the CSV file and convert each column to the requested type, doing error checking as it goes.
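
The type-directed conversion idea can be sketched without cassava: convert each raw row into a typed value, producing an error message when a column doesn't parse. This toy code is not the library's API, just the shape of what it does for you per record.

```haskell
import Text.Read (readMaybe)

-- Convert one raw row into a typed (name, salary) pair, with error
-- checking on both the column count and the numeric conversion.
convertRow :: [String] -> Either String (String, Int)
convertRow [name, rawSalary] =
  case readMaybe rawSalary of
    Just salary -> Right (name, salary)
    Nothing     -> Left ("could not parse salary: " ++ show rawSalary)
convertRow row = Left ("expected 2 columns, got " ++ show (length row))

main :: IO ()
main = mapM_ print (map convertRow [["John Doe", "50000"], ["Jane Doe", "oops"]])
```

cassava generalizes this via the FromField/FromRecord type classes, so the target type you annotate (e.g. (Text, Int)) drives the conversion and error reporting automatically.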

Download the package from Hackage:

Get the code from GitHub:

Tuesday, June 19, 2012

Haskell Implementors' Workshop, Call for Talks

Gregory Collins and I are organizing this year's Haskell Implementors' Workshop in Copenhagen. If you'd like to give a talk, see the instructions in the Call-for-Talks email. The acceptable range of topics is quite wide. If you've been hacking on some cool project lately, please do apply.