Data Science in C (or C++)

I'm an R language programmer. I'm also in the group of people who are considered Data Scientists but who come from academic disciplines other than CS.

This works out well in my role as a Data Scientist. However, by starting my career in R and having only basic knowledge of other scripting/web languages, I've felt somewhat inadequate in two key areas:

  1. Lack of a solid knowledge of programming theory.
  2. Lack of a competitive level of skill in faster, more widely used languages like C, C++, and Java, which could be used to speed up pipeline and Big Data computations, and to build DS/data products that can be more readily developed into fast back-end scripts or standalone applications.

The solution is simple of course -- go learn about programming, which is what I've been doing by enrolling in some classes (currently C programming).

However, now that I'm starting to address problems #1 and #2 above, I'm left asking myself "Just how viable are languages like C and C++ for Data Science?".

For instance, I can move data around very quickly and interact with users just fine, but what about advanced regression, Machine Learning, text mining and other more advanced statistical operations?

So, can C do the job? What tools are available for advanced statistics, ML, AI, and other areas of Data Science? Or must I lose most of the efficiency gained by programming in C by calling on R scripts or other languages?

The best resource I've found thus far is a C/C++ library called Shark, which gives C/C++ the ability to use Support Vector Machines, linear regression (though not non-linear or other advanced regression such as multinomial probit), and a short list of other (great, but limited) statistical functions.

Tags: c programming, statistics, bigdata, machine-learning

Category: Data Science


There are certain benefits to writing the complete application in C++ rather than implementing only the small part of the algorithm where this is "really needed".

Most importantly, mixing two languages adds complexity to the project and eats up your time. You need to write and maintain bindings. Your build scripts become more complex. Your debugging tools may not be able to cross the language barrier. Your loggers on the two layers may have nothing in common. This is why plans like "we will spend a lot of time on profiling first, and implement only the critical sections in C++" often end with "I see it is slow, but it's still somewhat bearable and might be fast enough".

Programming performance in a language is a function of the time spent with that language. When you work a lot with a single language, you learn more libraries and more approaches, and you gain more general skill, so you do the same task faster and more easily than when you only use the language for a few performance-critical sections. So if you only need to read a single CSV file and then do the rest of the processing in C++, I would not add a Python layer just for reading that file. C++ can do that too, as the sketch below shows.
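For example, here is a minimal sketch of reading a CSV file in plain C++; the file name is hypothetical, and it assumes simple, unquoted comma-separated fields:

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream file("data.csv");  // hypothetical input file
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(file, line)) {
        std::vector<std::string> fields;
        std::stringstream ss(line);
        std::string field;
        while (std::getline(ss, field, ','))  // split each line on commas
            fields.push_back(field);
        rows.push_back(fields);
    }
    std::cout << "read " << rows.size() << " rows\n";
}
```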

Even the idea "write in Python first, then rewrite in C++" is not without drawbacks. The same data structure that could be an array of polymorphic objects in C++ may need to stay a plain NumPy matrix in Python. If Python comes first, it may force a less readable architecture.

Finally, there are many good libraries for C++ as well. OpenCV, for instance, provides matrix algebra and many other useful algorithms. Dlib provides many statistical algorithms. MXNet has a first-class C++ API, Darknet is written in plain C, and for TensorRT, C++ is the recommended language to use.
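To illustrate the matrix algebra point, here is a small sketch with OpenCV's cv::Mat; the toy system of equations is my own example:

```cpp
#include <iostream>
#include <opencv2/core.hpp>

int main() {
    // Solve the 2x2 linear system A * x = b as a toy example.
    cv::Mat A = (cv::Mat_<double>(2, 2) << 1, 2,
                                           3, 4);
    cv::Mat b = (cv::Mat_<double>(2, 1) << 5, 6);
    cv::Mat x;
    cv::solve(A, b, x);  // linear solver provided by OpenCV
    std::cout << "x = " << x << std::endl;
}
```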

Also, many C++ opponents still bring up malloc/free, manual array resizing, null-terminated strings, maps that are awkward to iterate over, and similar problems that were resolved many years ago.
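To make that concrete, a short sketch of what those points look like in modern C++ (C++17 for the structured bindings):

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    std::vector<double> values;  // grows and frees itself: no malloc/free, no manual resizing
    values.push_back(3.14);
    values.push_back(2.72);

    std::string name = "data";   // length-aware string: no null-terminator bookkeeping
    name += " science";

    std::map<std::string, int> counts{{"rows", 2}, {"cols", 3}};
    for (const auto& [key, value] : counts)  // iterating a map is one line now
        std::cout << key << " = " << value << '\n';
}
```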


Scalable Machine Learning Solutions for Big Data:

I will add my $.02 because there is a key area that none of the previous posts seem to have addressed - machine learning on big data!

For big data, scalability is key, and R is insufficient on its own. Further, languages like Python and R are mostly useful for interfacing with scalable solutions that are usually written in other languages. I make this distinction not because I want to disparage those who use them, but because it's so important for members of the data science community to understand what truly scalable machine learning solutions look like.

I do most of my big data work on distributed-memory clusters. That is, I don't just use one 16-core machine (four quad-core processors on a single motherboard, sharing that motherboard's memory); I use a small cluster of 64 16-core machines. The requirements for these distributed-memory clusters are very different from those for shared-memory environments, and machine learning on big data often requires scalable solutions in distributed-memory environments.

We also use C and C++ everywhere within a proprietary database product. All of our high-level code is written in C++ and MPI, but the low-level code that touches the data is all longs and C-style character arrays, to keep the product very, very fast. The convenience of std::string is simply not worth the computational cost.
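For a flavor of what "C++ and MPI" means here, a minimal distributed-memory sketch; the per-rank workload is just a stand-in:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // number of processes in the job

    long local = rank + 1;  // stand-in for work done on this node's share of the data
    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum over %d ranks: %ld\n", size, total);

    MPI_Finalize();
}
```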

There are not many C++ libraries available that offer distributed, scalable machine learning capabilities; one example is MLPACK.

However, there are other scalable solutions with APIs:

Apache Spark has a scalable machine learning library called MLlib that you can interface with.

Also, TensorFlow now supports distributed execution and has a C++ API.

Hope this helps!


Take a look at Intel DAAL, which is under active development. It's highly optimized for Intel CPU architectures and supports distributed computation.


Not sure whether it's been mentioned yet, but there's also Vowpal Wabbit, though it may be suited to certain kinds of problems only.


As a data scientist, the other languages (C++/Java) come in handy when you need to incorporate machine learning into an existing production engine.

Waffles is both a well-maintained C++ class library and a command-line analysis package. It has supervised and unsupervised learning, tons of data-manipulation tools, sparse-data tools, and other things such as audio processing. Since it's also a class library, you can extend it as needed. Even if you are not the one developing the C++ engine (chances are you won't be), this will let you prototype, test, and hand something over to the developers.

Most importantly, I believe my knowledge of C++ and Java really helps me understand how Python and R work. Any language is only used properly when you understand a little about what is going on underneath. By learning the differences between languages, you can learn to exploit the strengths of your main language.

Update

For commercial applications with large data sets, Apache Spark's MLlib is important. There you can use Scala, Java, or Python.


I agree that the current trend is to use Python/R and to bind it to C/C++ extensions for computationally expensive tasks.

However, if you want to stay in C/C++, you might want to have a look at Dlib:

Dlib is a general purpose cross-platform C++ library designed using contract programming and modern C++ techniques. It is open source software and licensed under the Boost Software License.
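For a taste of the API, here is a hedged sketch along the lines of Dlib's SVM examples; the toy data and parameter values are my own:

```cpp
#include <dlib/svm.h>
#include <iostream>
#include <vector>

int main() {
    typedef dlib::matrix<double, 2, 1> sample_type;
    typedef dlib::radial_basis_kernel<sample_type> kernel_type;

    // Toy data: alternate points between two classes.
    std::vector<sample_type> samples;
    std::vector<double> labels;
    for (int i = 0; i < 20; ++i) {
        sample_type s;
        s = i, (i % 2 == 0 ? i - 1.0 : i + 1.0);  // dlib comma initialization
        samples.push_back(s);
        labels.push_back(i % 2 == 0 ? -1.0 : +1.0);
    }

    dlib::svm_c_trainer<kernel_type> trainer;
    trainer.set_kernel(kernel_type(0.1));  // RBF kernel width
    trainer.set_c(10);                     // soft-margin parameter
    dlib::decision_function<kernel_type> df = trainer.train(samples, labels);

    sample_type test;
    test = 5, 6;
    std::cout << "decision value: " << df(test) << std::endl;
}
```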



There are some C++ tools for statistics and data science, like ROOT (https://root.cern.ch/drupal/), BAT (https://www.mppmu.mpg.de/bat/), Boost, or OpenCV.


As Andre Holzner said, extending R with C/C++ is a very good way to take advantage of the best of both sides. You can also try the inverse: working in C++ and occasionally calling R functions with the RInside package. Here you can find out how:

http://cran.r-project.org/web/packages/RInside/index.html http://dirk.eddelbuettel.com/code/rinside.html
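For a flavor of RInside, here is essentially the hello-world from its examples, with an added line pulling a value back into C++ (build flags come from RInside's example Makefile):

```cpp
#include <RInside.h>
#include <iostream>

int main(int argc, char *argv[]) {
    RInside R(argc, argv);           // create an embedded R instance
    R["txt"] = "Hello from C++!\n";  // push a C++ value into R's global environment
    R.parseEvalQ("cat(txt)");        // run R code; no result needed
    double pi_r = Rcpp::as<double>(R.parseEval("pi"));  // pull a value back into C++
    std::cout << "pi from R: " << pi_r << std::endl;
    return 0;
}
```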

Once you're working in C++, you have many libraries available, many of them built for specific problems, others more general:

http://www.shogun-toolbox.org/page/features/ http://image.diku.dk/shark/sphinx_pages/build/html/index.html

http://mlpack.org/


"Or must I lose most of the efficiency gained by programming in C by calling on R scripts or other languages?"

Do the opposite: learn C/C++ to write R extensions. Use C/C++ only for the performance-critical sections of your new algorithms, and use R to build your analysis, import data, make plots, etc.
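A minimal sketch of that workflow with Rcpp (the function is illustrative, not from any package): the hot loop lives in C++, everything else stays in R.

```cpp
#include <Rcpp.h>

// [[Rcpp::export]]
double sum_of_squares(Rcpp::NumericVector x) {
    // A tight numeric loop: the kind of code worth moving out of R.
    double total = 0.0;
    for (R_xlen_t i = 0; i < x.size(); ++i)
        total += x[i] * x[i];
    return total;
}
```

From R, `Rcpp::sourceCpp("sum_of_squares.cpp")` compiles the file and makes `sum_of_squares()` callable like any other R function.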

If you want to go beyond R, I'd recommend learning Python. There are many libraries available, such as scikit-learn for machine learning algorithms or PyBrain for building neural networks (and use pylab/matplotlib for plotting and IPython notebooks to develop your analyses). Again, C/C++ is useful for implementing time-critical algorithms as Python extensions.


In my opinion, ideally, to be a more well-rounded professional, it would be nice to know at least one programming language for each of the most popular programming paradigms (procedural, object-oriented, functional). Certainly, I consider R and Python the two most popular programming languages and environments for data science and, therefore, the primary data science tools.

Julia is impressive in certain aspects, but it is still trying to catch up with those two and establish itself as a major data science tool. I don't see this happening any time soon, simply due to R's and Python's popularity, very large communities, and enormous ecosystems of existing and newly developed packages/libraries covering a very wide range of domains and fields of study.

Having said that, many packages and libraries focused on data science, ML, and AI are implemented in and/or provide APIs for languages other than R or Python (for proof, see this curated list and this curated list, both of which are excellent and give a solid perspective on the variety in the field). This is especially true for performance-oriented or specialized software. For that software, I've seen projects with implementations and/or APIs mostly in Java, C, and C++ (Java is especially popular in the big data segment of data science, due to its closeness to Hadoop and its ecosystem, and in the NLP segment), but other options are available, albeit to a much more limited, domain-specific extent. None of these languages is a waste of time; however, you have to prioritize mastering any or all of them against your current work situation, projects, and interests. So, to answer your question about the viability of C/C++ (and Java), I would say that they are all viable, not as primary data science tools, but as secondary ones.

Answering your questions on 1) C as a potential data science tool and 2) its efficiency, I would say that: 1) while it's possible to use C for data science, I would recommend against it, because you'd have a very hard time finding corresponding libraries or, even more so, trying to implement the corresponding algorithms yourself; 2) you shouldn't worry about efficiency, as many performance-critical segments of code are already implemented in low-level languages like C, and there are options for interfacing popular data science languages with, say, C (for example, the Rcpp package for integrating R with C/C++: http://dirk.eddelbuettel.com/code/rcpp.html). This is in addition to simpler, but often rather effective, approaches to performance, such as the consistent use of vectorization in R and the use of various parallel-programming frameworks, packages, and libraries. For R ecosystem examples, see the CRAN Task View "High-Performance and Parallel Computing with R".

Speaking about data science, I think it makes a lot of sense to mention the importance of the reproducible research approach, as well as the availability of various tools supporting this concept (for more details, please see my relevant answer). I hope my answer is helpful.


I would be keen to understand why you would need another language (apart from Python) if your goal is "advanced regression, Machine Learning, text mining and other more advanced statistical operations".
For that kind of thing, C is a waste of time. It's a good tool to have, but in the ~20 years since Java came out, I've rarely coded C.
If you prefer the more functional-programming side of R, learn Scala before you get into too many procedural bad habits coding with C.
Lastly learn to use Hadley Wickham's libraries - they'll save you a lot of time doing data manipulation.


R is one of the key tools for data scientists; whatever you do, don't stop using it.

Now, talking about C, C++, or even Java: they are good, popular languages. Whether you need them, or will need them, depends on the type of job or projects you have. From personal experience, there are so many tools out there for data scientists that you will always feel like you constantly need to be learning.

You can add Python or Matlab to the list of things to learn, and keep adding. The best way to learn is to take on a work project using tools you are not comfortable with. If I were you, I would learn Python before C. It is more widely used in the community than C. But learning C is not a waste of your time.
