Sunday, March 20, 2011

Summer of Code project suggestions

Here’s a list of projects I’d like to see implemented during this year’s Google Summer of Code. I’d be willing to mentor or co-mentor any of these projects.

Project: Build multiple Cabal packages in parallel

Many developers have multi-core machines, but Cabal runs the build process in a single thread and so uses only one core. If the build process were parallelized, build times could be cut by perhaps a factor of 2-8, depending on the number of cores and on how much of the build can actually run in parallel.

Task: Add support for building packages in parallel, via a set of command line flags (e.g. -j).

Sub-tasks:

  • Keep a queue of available build tasks and schedule the tasks on multiple threads, while respecting task dependencies. Cabal already has an idea of what needs to be built in which order so you don't need to start from scratch.
  • Adapt the program output (including logging) system so that parallel builds don't generate garbled output.
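As a rough illustration of the two sub-tasks above, here is a minimal sketch that uses only base and containers. It is not how Cabal is implemented: the names `Pkg` and `buildAll` and the example plan are made up for illustration. Instead of an explicit task queue it uses a simpler variant: one lightweight thread per package blocks until its dependencies have finished, and a semaphore bounds the number of concurrent builds (the would-be -j value). A single lock around output keeps lines from interleaving.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Concurrent.QSem
import Control.Exception (bracket_)
import Control.Monad (forM_)
import qualified Data.Map as Map

-- Hypothetical package identifier; Cabal has its own, richer types.
type Pkg = String

-- | Run 'build' for every package in the plan, at most 'jobs' at a time,
-- starting a package only after all of its dependencies have finished.
-- Assumes the plan is acyclic and every dependency is a key of the map.
-- Error handling is omitted: if a build throws, its dependents never start.
buildAll :: Int -> Map.Map Pkg [Pkg] -> (Pkg -> IO ()) -> IO ()
buildAll jobs plan build = do
  slots <- newQSem jobs
  -- One completion signal per package, filled when its build is done.
  done <- traverse (const newEmptyMVar) plan
  forM_ (Map.toList plan) $ \(pkg, deps) -> forkIO $ do
    -- Block until every dependency has signalled completion.
    mapM_ (\d -> readMVar (done Map.! d)) deps
    -- Hold one of the 'jobs' slots only while actually building.
    bracket_ (waitQSem slots) (signalQSem slots) (build pkg)
    putMVar (done Map.! pkg) ()
  -- Wait for all packages to finish.
  mapM_ readMVar (Map.elems done)

main :: IO ()
main = do
  -- A single lock serializes output so lines from parallel builds
  -- do not interleave.
  outputLock <- newMVar ()
  let plan = Map.fromList
        [ ("base-lib", [])
        , ("parser",   ["base-lib"])
        , ("printer",  ["base-lib"])
        , ("app",      ["parser", "printer"])
        ]
      build pkg = withMVar outputLock $ \_ -> putStrLn ("built " ++ pkg)
  buildAll 2 plan build
```

In this example "parser" and "printer" may be built in either order (or at the same time), but "base-lib" always comes first and "app" last. A real implementation would also need per-package log files and failure propagation.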

See the Cabal ticket for more information.

Project: Simpler support for isolated/sandboxed Cabal builds

cabal-dev and capri allow developers to build packages in their own sandboxes, using a separate package database for each. This allows for isolated builds and prevents breakage due to, for example, package upgrades. Merging cabal-dev into Cabal would let us share lots of code and would make the feature more accessible to developers.

For more information see the Cabal ticket.

Also have a look at how isolated builds are supported in other languages. You can, for example, look at Python's virtualenv package (and clones) for inspiration.

Project: Convert the text package to use UTF-8 internally

When the text package was created, early benchmarks showed that using UTF-16 as the internal representation for Unicode code points was the fastest. The package still uses UTF-16 internally.

The benchmarks might not have given a complete picture of the performance implications of using different internal encodings: all benchmarks were run on input data that used the same encoding as the internal representation, but most real-world data uses UTF-8. Had the benchmarks also accounted for the cost of decoding from and encoding to UTF-8, the results might have been different. For example, encoding a Text value to a ByteString containing UTF-8 data can be an O(1) operation if the Text value is also represented internally as UTF-8.

UTF-8 also uses less space for ASCII data, which is very common in documents as it's used heavily in markup. A smaller footprint means less memory usage in programs that hold on to many small text fragments (e.g. text analysis applications, such as machine learning).
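To make the space argument concrete, here is a small sketch that compares the encoded size of the same markup-heavy ASCII string in both encodings. It uses `encodeUtf8` and `encodeUtf16LE` from `Data.Text.Encoding`; the helper name `encodedSizes` is made up for illustration. Note that it measures the size of the encoded bytes, not the in-memory footprint of a Text value, which carries some additional overhead.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- | Size in bytes of a Text value when encoded as (UTF-8, UTF-16).
-- 'encodedSizes' is a hypothetical helper, not part of the text package.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t =
  ( B.length (E.encodeUtf8 t)
  , B.length (E.encodeUtf16LE t)
  )

main :: IO ()
main =
  -- 12 ASCII characters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
  print (encodedSizes (T.pack "<p>hello</p>"))  -- prints (12,24)
```

For ASCII-only data UTF-16 is exactly twice the size of UTF-8. For text outside the ASCII range the ratio shrinks, and for some scripts UTF-8 is the larger of the two, which is one reason the benchmarks should use real-world data.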

Tasks:

  • Create a set of realistic benchmarks that will show the performance implications of using different internal encodings (i.e. UTF-16 and UTF-8), on real world data.
  • Convert a small part of the text package to use UTF-8 internally and validate that it's now faster on the above benchmarks.
  • Convert the whole package to use UTF-8 internally.

7 comments:

  1. There are a lot of ways to improve cabal:
    http://www.reddit.com/r/haskell_proposals/comments/fqey1/improve_cabal/

    Personally, I would put parallel builds lower on the totem pole, and sandboxing at the highest.
    I hope we can have at least 2 students working on the cabal infrastructure.

  2. The referenced ticket for parallel cabal builds only seems to talk about making cabal-install parallel. Does the project concern parallelization of Cabal itself as well?

  3. Amsay, the project is focused on building packages in parallel as that's where the biggest potential gain is at the moment. Once that's implemented we could look for other opportunities for parallelism.

  4. The ticket also says that "downloads seem to be serialized, again because there is probably little benefit to making multiple connections to the same server." Why is this? Are bandwidth restrictions really so severe?

  5. Amsay,

    Presumably a single connection to a server is enough to use all available bandwidth between the client and the server. If we had multiple servers things would be different.

  6. > Build multiple Cabal packages in parallel

    What is, or would be, the difference from passing ghc-options="+RTS -N2 -RTS" to cabal?
    see http://lambdor.net/?p=306

  7. Lambdor,

    -N2 doesn't help unless Cabal uses more than one thread, e.g. by calling forkIO, which it doesn't.
