Here’s a list of projects I’d like to see implemented during this year’s Google Summer of Code. I’d be willing to mentor or co-mentor any of these projects.
Project: Build multiple Cabal packages in parallel
Many developers have multi-core machines, but Cabal runs the build process in a single thread and so uses only one core. If the build process were parallelized, build times could be cut by perhaps a factor of 2-8, depending on the number of cores and the opportunities for parallel execution.
Task: Add support for building packages in parallel, via a set of command line flags (e.g.
- Keep a queue of available build tasks and schedule the tasks on multiple threads, while respecting task dependencies. Cabal already has an idea of what needs to be built in which order so you don't need to start from scratch.
- Adapt the program output (including logging) system so that parallel builds don't generate garbled output.
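The two bullets above can be sketched in a few lines. This is only an illustration of the scheduling idea, written in Python for brevity (Cabal itself is written in Haskell); the names `build_in_parallel`, `fake_build` and the tiny dependency graph are made up for the example and are not Cabal APIs. Each build task returns its log as a string and a single thread prints it, which is one simple way to keep output from different builds from interleaving.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def build_in_parallel(deps, build, jobs=4):
    """Run build(pkg) for every package, never before its dependencies.

    deps maps each package name to the set of packages it depends on;
    every dependency must itself be a key of deps.
    Returns the packages in the order their builds finished.
    """
    remaining = {pkg: set(d) for pkg, d in deps.items()}  # unmet dependencies
    finished = []
    running = {}  # future -> package name

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        while remaining or running:
            # Start every package whose dependencies have all been built.
            ready = sorted(p for p, d in remaining.items() if not d)
            for pkg in ready:
                del remaining[pkg]
                running[pool.submit(build, pkg)] = pkg

            if not running:
                # Nothing is running and nothing can start.
                raise ValueError("unsatisfiable dependencies: %s" % sorted(remaining))

            # Block until at least one build finishes, then unlock dependants.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for fut in done:
                pkg = running.pop(fut)
                log = fut.result()  # re-raises here if the build failed
                print(log, end="")  # only this thread prints, so logs stay whole
                finished.append(pkg)
                for unmet in remaining.values():
                    unmet.discard(pkg)
    return finished


def fake_build(pkg):
    # Stand-in for a real build step; returns the log instead of printing it.
    return "built %s\n" % pkg


# Hypothetical dependency graph, for illustration only.
deps = {
    "base": set(),
    "text": {"base"},
    "bytestring": {"base"},
    "app": {"text", "bytestring"},
}
order = build_in_parallel(deps, fake_build, jobs=2)
```

Here `text` and `bytestring` can be built at the same time once `base` is done, while `app` has to wait for both. A real implementation would also need to decide what to do with in-flight builds when one of them fails.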
See the Cabal ticket for more information.
Project: Simpler support for isolated/sandboxed Cabal builds
cabal-dev and capri allow developers to build packages in their own sandboxes, using a separate package database for each. This allows for isolated builds and prevents breakage due to, for example, package upgrades. Merging cabal-dev into Cabal would let us share a lot of code and make the feature more accessible to developers.
For more information see the Cabal ticket.
Also have a look at how isolated builds are supported in other languages. You can, for example, look at Python's virtualenv package (and clones) for inspiration.
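As a small taste of what that looks like on the Python side, the sketch below uses `venv`, the standard-library module that offers virtualenv-style environments (note that this is not the virtualenv package itself). It creates a throwaway environment and reads back the setting that keeps globally installed packages out of it, which plays roughly the role a per-sandbox package database would play for Cabal. The directory name `sandbox` is arbitrary.

```python
import os
import tempfile
import venv

# Create a throwaway environment with its own package directory.
env_dir = os.path.join(tempfile.mkdtemp(), "sandbox")
venv.create(env_dir, system_site_packages=False, with_pip=False)

# pyvenv.cfg records how the environment was set up, one "key = value" per line.
cfg_path = os.path.join(env_dir, "pyvenv.cfg")
with open(cfg_path) as cfg:
    settings = dict(
        (key.strip(), value.strip())
        for key, _, value in (line.partition("=") for line in cfg)
    )

# Shows whether globally installed packages are visible inside the sandbox.
print(settings["include-system-site-packages"])
```

The point of interest is that isolation is a property of the environment directory, so a project can be built against its own set of packages without touching the global ones.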
Project: Convert the text package to use UTF-8 internally
When the text package was created, early benchmarks showed that using UTF-16 as the internal representation for Unicode code points was the fastest. The package still uses UTF-16 internally.
The benchmarks might not have given a complete picture of the performance implications of using different internal encodings: all benchmarks were run on input data that used the same encoding as the internal representation, but most real-world data uses UTF-8. If the benchmarks had also taken into account the cost of decoding from and encoding to UTF-8, the results might have been different. For example, encoding a Text value to a ByteString containing UTF-8 data can be an O(1) operation if the encoding used to represent the Text value is also UTF-8.
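The O(1) claim is easier to see with a toy model. The two classes below are hypothetical stand-ins written in Python, not the text package's actual types: one keeps its buffer in UTF-8, the other in UTF-16. With UTF-8 input and UTF-8 output, the first can hand back its buffer untouched, while the second has to transcode on the way in and again on the way out.

```python
class Utf8Text:
    """Toy text value whose internal buffer is already UTF-8."""

    def __init__(self, utf8_bytes):
        utf8_bytes.decode("utf-8")  # validate the input once; result discarded
        self._buf = utf8_bytes

    def encode_utf8(self):
        # No transcoding: hand back the buffer itself, O(1).
        return self._buf


class Utf16Text:
    """Toy text value whose internal buffer is UTF-16 (little-endian)."""

    def __init__(self, utf8_bytes):
        # Decoding UTF-8 input means transcoding every code point, O(n).
        self._buf = utf8_bytes.decode("utf-8").encode("utf-16-le")

    def encode_utf8(self):
        # Encoding back to UTF-8 transcodes every code unit again, O(n).
        return self._buf.decode("utf-16-le").encode("utf-8")


data = "<p>naïve café</p>".encode("utf-8")
same_buffer = Utf8Text(data).encode_utf8() is data  # the very same object
round_trip = Utf16Text(data).encode_utf8()          # equal bytes, freshly built
```

Note that the UTF-8 variant still has to validate its input, which is linear in the input size; what it avoids is building a second copy in another encoding.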
UTF-8 also uses less space for ASCII data, which is very common in documents as it's used heavily in markup. A smaller footprint means less memory usage in programs that hold on to many small text fragments (e.g. text analysis applications, such as machine learning).
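The space difference is easy to check for yourself. The snippet below, in Python and with made-up sample strings, counts how many bytes the same text needs in each encoding.

```python
def encoded_sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) needed to store text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Sample strings chosen for illustration only.
samples = {
    "markup": '<a href="index.html">home</a>',
    "french": "café",
    "japanese": "日本語",
}
for name, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print("%-8s utf-8: %2d bytes, utf-16: %2d bytes" % (name, utf8_size, utf16_size))
```

Pure ASCII text takes exactly half the space in UTF-8. The advantage is not universal: most CJK characters take three bytes in UTF-8 but two in UTF-16, which is one more reason to benchmark on realistic data.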
- Create a set of realistic benchmarks that will show the performance implications of using different internal encodings (i.e. UTF-16 and UTF-8), on real world data.
- Convert a small part of the text package to use UTF-8 internally and validate that it's now faster on the above benchmarks.
- Convert the whole package to use UTF-8 internally.