Monday, March 28, 2011

Google Summer of Code application period starts today

Starting at 19:00 UTC today, you can apply to Google Summer of Code as a student. If you'd like to get paid to write Haskell for a summer, it's time to find an interesting task and send in your application. Have a look at the ideas list and find something you like. I'd again like to recommend that you pick one of the three projects I've written about previously.

Sunday, March 20, 2011

Summer of Code project suggestions

Here’s a list of projects I’d like to see implemented during this year’s Google Summer of Code. I’d be willing to mentor or co-mentor any of these projects.

Project: Build multiple Cabal packages in parallel

Many developers have multi-core machines, but Cabal runs the build process in a single thread and so makes use of only one core. If the build process were parallelized, build times could be cut by perhaps a factor of 2-8, depending on the number of cores and on how much of the build can actually run in parallel.

Task: Add support for building packages in parallel, via a set of command line flags (e.g. -j).

Sub-tasks:

  • Keep a queue of available build tasks and schedule the tasks on multiple threads, while respecting task dependencies. Cabal already has an idea of what needs to be built in which order so you don't need to start from scratch.
  • Adapt the program output (including logging) system so that parallel builds don't generate garbled output.
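
The two sub-tasks above can be sketched in a few lines using only base and containers. This is a minimal illustration, not Cabal's actual design: the names buildInParallel and newLogger, the String package type, and the dependency map are all hypothetical. It starts each package only after its dependencies have finished, caps the number of concurrent builds the way a -j flag would, and sends log lines through a lock so they don't interleave.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Concurrent.QSem
import Control.Exception (bracket_)
import Control.Monad (forM_)
import qualified Data.Map as Map

-- Hypothetical stand-in for a real package identifier.
type Package = String

-- | Run one build action per package, at most 'jobs' at a time, starting a
-- package only after all of its dependencies have finished.
--
-- Assumptions: the graph is acyclic and every dependency is also a key of
-- the map. A build action that throws is not handled here; its dependents
-- would never be started.
buildInParallel :: Int -> Map.Map Package [Package] -> (Package -> IO ()) -> IO ()
buildInParallel jobs deps build = do
  slots <- newQSem jobs
  -- One "finished" signal per package.
  done <- traverse (const newEmptyMVar) deps
  forM_ (Map.toList deps) $ \(pkg, pkgDeps) -> forkIO $ do
    -- Block until every dependency has signalled completion.
    mapM_ (\d -> readMVar (done Map.! d)) pkgDeps
    -- Hold one of the 'jobs' slots only while actually building.
    bracket_ (waitQSem slots) (signalQSem slots) (build pkg)
    putMVar (done Map.! pkg) ()
  -- Wait for every package to finish.
  mapM_ readMVar (Map.elems done)

-- | Create a logging function that writes whole lines under a lock, so
-- output from concurrent builds is not garbled.
newLogger :: IO (String -> IO ())
newLogger = do
  lock <- newMVar ()
  return $ \msg -> withMVar lock (\_ -> putStrLn msg)
```

Waiting for dependencies before taking a slot means a blocked package never occupies one of the -j slots. A real implementation would also need to propagate build failures and would probably keep an explicit work queue rather than one thread per package.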

See the Cabal ticket for more information.

Project: Simpler support for isolated/sandboxed Cabal builds

cabal-dev and capri allow developers to build packages in their own sandboxes, using a separate package database for each. This allows for isolated builds and prevents breakage due to, for example, package upgrades. Merging cabal-dev into Cabal would let us share lots of code and make the feature more accessible to developers.

For more information see the Cabal ticket.

Also have a look at how isolated builds are supported in other languages. You can for example look at Python's virtualenv package (and clones) for inspiration.

Project: Convert the text package to use UTF-8 internally

When the text package was created, early benchmarks showed that using UTF-16 as the internal representation for Unicode code points was the fastest. The package still uses UTF-16 internally.

The benchmarks might not have given a complete picture of the performance implications of using different internal encodings: all benchmarks were run on input data that used the same encoding as used internally, but most real world data uses UTF-8. If the benchmarks had also taken into account the cost of decoding from and encoding to UTF-8, the results might have been different. For example, encoding a Text value to a ByteString containing UTF-8 data can be an O(1) operation if the encoding used to represent the Text value is also UTF-8.

UTF-8 also uses less space for ASCII data, which is very common in documents as it's used heavily in markup. A smaller footprint means less memory usage in programs that hold on to many small text fragments (e.g. text analysis applications, such as machine learning).
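
To make the footprint argument concrete, here is a small sketch that compares the number of bytes a value needs in each encoding. It uses the text package's public encodeUtf8 and encodeUtf16LE functions as a proxy for the internal representation; the helper name encodedSizes is made up for this example, and the real in-memory cost also includes per-value overhead that this ignores.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- | Payload size in bytes of a Text value when encoded as
-- (UTF-8, UTF-16). Hypothetical helper, for illustration only.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t =
  ( B.length (E.encodeUtf8 t)
  , B.length (E.encodeUtf16LE t)
  )
```

For ASCII markup such as encodedSizes (T.pack "<p>hello</p>") this gives (12, 24), i.e. UTF-16 takes twice the space. The advantage is not universal: text made up of characters that need three bytes in UTF-8 but two in UTF-16 (much CJK text, for instance) comes out larger in UTF-8, which is one more reason to benchmark on real world data.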

Tasks:

  • Create a set of realistic benchmarks that will show the performance implications of using different internal encodings (i.e. UTF-16 and UTF-8), on real world data.
  • Convert a small part of the text package to use UTF-8 internally and validate that it's now faster on the above benchmarks.
  • Convert the whole package to use UTF-8 internally.
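
As a starting point for the first task, a benchmark could time the whole path a real program takes: decode UTF-8 input, do some work on the Text, and encode the result back to UTF-8. The sketch below is only an illustration of that shape. The names pipeline and timePipeline are made up, T.toUpper stands in for whatever operation is being measured, and a proper benchmark would use a harness such as criterion rather than raw CPU time.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E
import System.CPUTime (getCPUTime)

-- | Decode UTF-8 input, transform it, and re-encode it, so that the cost
-- of converting to and from the internal representation is included.
-- Note that decodeUtf8 throws an exception on invalid UTF-8 input.
pipeline :: B.ByteString -> B.ByteString
pipeline = E.encodeUtf8 . T.toUpper . E.decodeUtf8

-- | CPU time, in picoseconds, spent producing the pipeline's output.
-- Forcing the length of the strict ByteString forces the whole result.
timePipeline :: B.ByteString -> IO Integer
timePipeline input = do
  start <- getCPUTime
  _ <- evaluate (B.length (pipeline input))
  end <- getCPUTime
  return (end - start)
```

Running timePipeline on the contents of real documents (read with B.readFile) before and after the conversion would show whether the UTF-8 representation pays off end to end, not just on data that already matches the internal encoding.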

Writing a good Google Summer of Code application

If you are a student and want to work on a Haskell project during this year's Google Summer of Code, it’s time to start thinking about which project you’d like to work on. Whatever project you choose, it’s important to think about two things: scope and impact.

The scope of the project should be such that you can actually finish it in one summer. This might seem obvious, but we’ve had several projects in the past fail because they were too ambitious. Avoid projects that have too many unknowns and thus a high risk of failing, e.g. trying to implement a new GC algorithm that performs competitively with GHC’s current GC.

I would also like to discourage work on brand new libraries. While we’ve seen a few successful libraries come out of GSoC, many others never saw much (or any) use. Creating a good API and a high performance implementation takes a lot of Haskell and software engineering experience, something most students lack, almost by definition. We don’t need yet another library of so-so quality on Hackage; there are already plenty of those (some of mine included).

The impact of the project is also important: the more people who can benefit from your work, the better. The best way to have a large impact is to work on an existing piece of infrastructure (like Cabal or Hackage) or library (like text) that already has lots of users. Games and niche libraries don’t make good GSoC projects.

Writing a good application

Assuming that you’ve picked a project with a sensible scope and enough impact, you need to convince the Haskell GSoC mentors, a group of experienced Haskell hackers, that you’re the right person for the job. You do this by

  • appealing to previous Haskell experience (e.g. coursework and/or open source contributions), and
  • showing that you understand the problem you’re trying to solve.

The best way to show that you understand the problem you’re trying to solve is by drafting a design. The design should be as concrete as possible. A design is not a requirements specification: don't list the features you're going to implement; explain how they are to be implemented. I’d suggest spending a day or two tinkering with the code you intend to work on. Perhaps even try to solve some small part of the real task. This will teach you a lot about the problem and help you create a much better design.

In the next few days I’ll try to post some projects I think make good GSoC projects.

Tuesday, March 8, 2011

Video of my hashing-based containers talk at Galois

The video of my talk on hashing-based containers, given at Galois, is now up:

Faster persistent data structures through hashing from Galois Video on Vimeo.

You can also watch it directly on Vimeo. The slides are also available.