ELKI (also on GitHub) is an open-source data mining and data science project. It is unique with respect to its modular architecture: you can combine algorithms, distance functions, and indexes for acceleration with very few limitations (of course, algorithms that do not use distances cannot be combined with distance functions). The code is not the easiest to read, because it is written for efficiency. For data mining, you need to be careful about memory: using ArrayList<Integer> is a no-go if you want scalability.
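To make the memory point concrete, here is a minimal sketch of a growable list backed by a primitive int[] instead of boxed Integer objects. The class name IntList and its methods are made up for this illustration and are not ELKI classes. On a typical 64-bit JVM, every boxed Integer is a separate heap object plus a reference stored in the list, while an int[] needs 4 bytes per element; the exact numbers depend on the JVM.

```java
import java.util.Arrays;

/** Hypothetical growable list of primitive ints (illustration only, not an ELKI class). */
final class IntList {
  private int[] data = new int[16]; // contiguous storage, 4 bytes per element
  private int size = 0;

  /** Append a value, doubling the backing array when it is full. */
  void add(int value) {
    if (size == data.length) {
      data = Arrays.copyOf(data, size * 2);
    }
    data[size++] = value;
  }

  /** Read the value at the given position, with a bounds check. */
  int get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    return data[index];
  }

  int size() {
    return size;
  }

  /** Sum all stored values without any boxing or unboxing. */
  long sum() {
    long total = 0;
    for (int i = 0; i < size; i++) {
      total += data[i];
    }
    return total;
  }
}
```

The same idea applies to object IDs, neighbor lists, and distance arrays: keep them in primitive arrays wherever possible.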
Because of the modular architecture, it is easy to contribute small modules, such as a single distance function or algorithm.
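As a rough sketch of what "modular" means here: the algorithm is written against a distance interface, so any distance can be plugged in without touching the algorithm. The names Distance, Euclidean, Manhattan, and NearestNeighbor are hypothetical and much simpler than ELKI's real interfaces; check the ELKI documentation for the actual API.

```java
/** Hypothetical distance interface (illustration only; ELKI's real interfaces differ). */
interface Distance {
  double distance(double[] a, double[] b);
}

/** Euclidean (L2) distance. */
final class Euclidean implements Distance {
  @Override
  public double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  }
}

/** Manhattan (L1) distance. */
final class Manhattan implements Distance {
  @Override
  public double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += Math.abs(a[i] - b[i]);
    }
    return sum;
  }
}

/** A tiny "algorithm" that works with any Distance implementation. */
final class NearestNeighbor {
  /** Linear scan: return the index of the data point closest to the query, or -1 if data is empty. */
  static int nearest(double[][] data, double[] query, Distance dist) {
    int best = -1;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int i = 0; i < data.length; i++) {
      double d = dist.distance(data[i], query);
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    return best;
  }
}
```

With the points (3, 3) and (5, 0) and the query (0, 0), the Euclidean distance picks the first point and the Manhattan distance picks the second, without any change to the search code. A contributed module would be one more small class like Manhattan.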
We keep a list of data mining project ideas, roughly grouped by difficulty. Most projects involve implementing some variant of an algorithm. ELKI aims at allowing comparative studies of algorithms, so we try to allow any combination, and also cover variants of algorithms. For example, with k-means, we not only have Lloyd's algorithm, but 10 variants of the general k-means theme. Over 220 articles have been (at least partially) reimplemented in ELKI.
By implementing everything in the same tool, we get much more comparable results. If you use R for benchmarking, you are usually comparing apples and oranges. k-means in R itself is actually an old Fortran program, and very fast. k-means in the "flexclust" package for R is 100x slower, because it is written in real R code. So don't trust a benchmark in R. Also, R modules tend to be incompatible, so you often can't use distance A from module A with algorithm B from module B. In ELKI, we try to share as much code as possible across implementations to reduce such artifacts (it will, of course, never be possible to have a 100% fair benchmark, as there is always room for optimization), but also to allow combining modules easily.
You could start with something small, such as the Hartigan & Wong k-means variant, then move on to spherical k-means (which is meant for sparse data, where different performance optimizations may become necessary), and from there to adding better support for categorical data or adding indexing functionality.
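To give an idea of why sparse data calls for different optimizations: spherical k-means compares vectors by cosine similarity, and on sparse vectors the dot product only needs to visit the nonzero entries. Below is a minimal sketch under the assumption that each vector stores its nonzero dimensions in strictly increasing order. The SparseVector class is hypothetical and not ELKI's sparse vector type.

```java
/** Hypothetical sparse vector: sorted dimension indexes plus their nonzero values. */
final class SparseVector {
  final int[] indexes;   // strictly increasing dimension numbers
  final double[] values; // values[i] belongs to dimension indexes[i]

  SparseVector(int[] indexes, double[] values) {
    if (indexes.length != values.length) {
      throw new IllegalArgumentException("indexes and values must have the same length");
    }
    this.indexes = indexes;
    this.values = values;
  }

  /** Dot product by merging the two sorted index lists; cost depends on the nonzeros only. */
  static double dot(SparseVector a, SparseVector b) {
    double sum = 0;
    int i = 0, j = 0;
    while (i < a.indexes.length && j < b.indexes.length) {
      int da = a.indexes[i], db = b.indexes[j];
      if (da == db) {
        sum += a.values[i++] * b.values[j++];
      } else if (da < db) {
        i++; // dimension present only in a, contributes 0
      } else {
        j++; // dimension present only in b, contributes 0
      }
    }
    return sum;
  }

  /** Cosine similarity; defined as 0 here if either vector has zero length. */
  static double cosine(SparseVector a, SparseVector b) {
    double na = Math.sqrt(dot(a, a)), nb = Math.sqrt(dot(b, b));
    return (na > 0 && nb > 0) ? dot(a, b) / (na * nb) : 0.0;
  }
}
```

If the vectors are normalized to unit length once up front, the assignment step only needs the dot product, which is one of the optimizations that makes sense for sparse data but not for dense data.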
I'd also love to see a better UI for ELKI, but that is a major effort.