Open source data science projects to contribute to

Contributing to open source projects is typically a good way for newcomers to get some practice, and for experienced data scientists and analysts to try a new area.

Which projects do you contribute to? Please provide a short intro and a link to the project on GitHub.

Topic beginner open-source

Category Data Science


Check this project on GitHub. It contains a comprehensive list of open source projects grouped by language, with short descriptions. I think you can find some there that meet your needs.


If you like cross-platform visual programming tools, Orange is an option. It recently moved to Python 3, so not all of the widgets have been ported yet. It combines the PyData stack (NumPy, SciPy, scikit-learn, ...) with Python 3, PyQt, and PyQtGraph, and it is GPL-licensed on GitHub.

[Image: Orange screenshot]


ELKI (also on GitHub) is an open-source data mining and data science project. It is unique with respect to its modular architecture: you can combine algorithms, distance functions, and indexes for acceleration with very few limitations (of course, algorithms that do not use distances cannot be combined with distances). The code is not the easiest to read, because it is written for efficiency. For data mining, you need to be careful about memory: using ArrayList<Integer> is a no-go if you want scalability.
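To illustrate the memory point, here is a minimal sketch of a growable list backed by a primitive `int[]`. This is not ELKI code; the class name `IntArrayList` and its methods are made up for this example. Each element takes 4 bytes in the array, whereas `ArrayList<Integer>` keeps a reference to a separate `Integer` object per element, which on common JVMs typically costs several times more.

```java
import java.util.Arrays;

/** Hypothetical growable list of primitive ints (illustration only, not ELKI's API). */
public class IntArrayList {
    private int[] data = new int[16];
    private int size = 0;

    /** Appends a value, doubling the backing array when it is full. */
    public void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, size * 2);
        }
        data[size++] = value;
    }

    /** Returns the value at the given position, without any boxing. */
    public int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        IntArrayList ids = new IntArrayList();
        for (int i = 0; i < 1_000_000; i++) {
            ids.add(i);
        }
        System.out.println(ids.size() + " ids stored, last = " + ids.get(ids.size() - 1));
    }
}
```

The trade-off is that such a class does not implement `java.util.List`, so it cannot be passed to code that expects the standard collections.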

Because of the modular architecture, it is easy to contribute just small modules, like a single distance function or algorithm.
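As a rough idea of what "a single distance function" as a module can look like, here is a small sketch. The interface and class names below are hypothetical and are not ELKI's actual API; the point is only that an algorithm written against an interface can be combined with any distance that implements it.

```java
/** Illustration of pluggable distances; all names here are hypothetical, not ELKI's API. */
public class ModularDistances {
    /** A distance module: anything that can compare two vectors. */
    interface DistanceFunction {
        double distance(double[] a, double[] b);
    }

    static class EuclideanDistance implements DistanceFunction {
        public double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }
    }

    static class ManhattanDistance implements DistanceFunction {
        public double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                sum += Math.abs(a[i] - b[i]);
            }
            return sum;
        }
    }

    /** A tiny "algorithm" that works with any distance: linear-scan nearest neighbor. */
    static int nearest(double[][] data, double[] query, DistanceFunction dist) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < data.length; i++) {
            double d = dist.distance(data[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] data = { { 3, 0 }, { 2, 2 } };
        double[] query = { 0, 0 };
        // The same algorithm gives different answers depending on the plugged-in distance.
        System.out.println("Euclidean: " + nearest(data, query, new EuclideanDistance()));
        System.out.println("Manhattan: " + nearest(data, query, new ManhattanDistance()));
    }
}
```

Adding a new distance in this sketch means adding one small class; the nearest-neighbor code does not change.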

We keep a list of data mining project ideas, roughly grouped by difficulty. Most projects are the implementation of some variant of an algorithm. ELKI aims at allowing comparative studies of algorithms, so we try to allow any combination, and also cover variants of algorithms. For example, with k-means, we not only have Lloyd's algorithm, but 10 variants of the general k-means theme. Over 220 articles have been (at least partially) reimplemented in ELKI.
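For readers who have not seen it, the baseline that these variants build on can be sketched as follows. This is a plain textbook version of Lloyd's algorithm written for this answer, not ELKI's implementation; the class name, the method signature, and the choice to pass the initial centers in explicitly are assumptions made to keep the example deterministic.

```java
import java.util.Arrays;

/** Textbook sketch of Lloyd's k-means (illustration only, not ELKI's implementation). */
public class LloydKMeans {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return sum;
    }

    /** Runs Lloyd's algorithm from the given initial centers and returns the final centers. */
    static double[][] fit(double[][] data, double[][] initialCenters, int maxIter) {
        int k = initialCenters.length;
        int dim = data[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        int[] assignment = new int[data.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: each point goes to its closest center.
            boolean changed = false;
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(data[i], centers[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }
            // Update step: each center becomes the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < data.length; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) {
                    sums[assignment[i]][j] += data[i][j];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int j = 0; j < dim; j++) {
                    centers[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] data = { { 0, 0 }, { 0, 2 }, { 10, 10 }, { 12, 10 } };
        double[][] init = { { 0, 0 }, { 12, 10 } };
        System.out.println(Arrays.deepToString(fit(data, init, 100)));
    }
}
```

Variants mostly change one of the two steps, for example how points are reassigned or how the centers are initialized and updated.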

By implementing everything in the same tool, we get much more comparable results. If you use R for benchmarking, you are usually comparing apples and oranges. k-means in R itself is actually an old Fortran program, and very fast. k-means in R's "flexclust" package is 100x slower, because it is written in real R code. So don't trust a benchmark in R. Also, R modules tend to be incompatible, so you often can't use distance A from module A with algorithm B from module B. In ELKI we try to share as much code as possible across implementations to reduce such artifacts (it will, of course, never be possible to have a 100% fair benchmark, since there is always room for optimization), but also to allow combining modules easily.

You could start with something small such as the Hartigan & Wong k-means variant, then move on to spherical k-means (which is meant for sparse data, where different performance optimizations may become necessary), and continue with adding better support for categorical data or adding indexing functionality.

I'd also love to see a better UI for ELKI, but that is a major effort.


The Julia project is one which I actively contribute to, including the advanced computing and XGBoost libraries, so I can definitely vouch for its maintenance and the quality of the community.

Some really good open source data science projects where even beginners can contribute are:

  • Sklearn: Developing at a rapid pace, the sklearn community is always open to new developers and contributors.
  • H2O: Another fast-growing data science project, working on scalable machine learning and deep learning solutions.
  • Go: Open source data science road map and resources. Not really a technical project, but very helpful for absolute beginners and aspiring analysts.
  • Pylearn2: Another fast-growing machine learning and deep learning project.
  • Vowpal Wabbit: The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research.

Here is a Quora discussion on such projects and some more which are not mentioned in this answer.

Here is another nice discussion about open source data science and ML projects in Python.


There are plenty of them available. I do not know if I am allowed to do this (please let me know if it is wrong), but I develop one myself, and it has been on GitHub for over two years (it actually started one year before that). The project is called rapaio, it is on GitHub here, and I recently started to write a manual for it (some of my friends asked me for one). The manual can be found here.

It fits your needs if you are willing to develop in Java 8, if you like to build your own tools, and if you like to experiment. There are only two principles which I enforce. The first one is: write something only when you need it. That is because I strongly believe that only when you need a tool do you know what you really want from it in terms of output, performance, and information. The second principle is: depend only on the JDK; if you need something, you write it. I admit that I am old-fashioned, but this way you can tailor any feature to your purpose.

If I am not allowed to post this as an answer, again, please let me know. That said, since it is an open source initiative, a non-profit project meant to give something back to people, I see no reason why I could not do it.
