How prevalent is `C/C++` in machine learning development?

I am currently a data scientist mostly doing NLP, and I do most of my work in Python. Since I didn't get a CS degree in undergrad, I've been limited to very high-level languages: Java, Python, and R. I somehow even got through Data Structures and Algorithms while avoiding C and C++.

I'm intending to go to graduate school to study Natural Language Processing further, and I'm wondering how much C/C++ I need to know. The cores of deep-learning frameworks like PyTorch and TensorFlow are written in C++, and CUDA is a C/C++-based platform. I'm not going to be writing Cython libraries, but I would like to do research and build new models (i.e. "inventing" things like CNNs, seq2seq models, and transformers).

I don't know how much C/C++ is actually used, and I'm unsure whether it's worth learning the language-specific complexities, or whether that effort would be better channeled into learning something else. Could somebody let me know how prevalent the use of C/C++ is?

Topic: cnn, programming, deep-learning, nlp, machine-learning

Category: Data Science


Machine learning is inherently data intensive, and typical ML algorithms are massively data-parallel. Therefore, even when developing new algorithms, high-level mathy languages (like Python, R, Octave) can be reasonably fast if you are willing to describe your algorithm in terms of standard operations on matrices and vectors.
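As a minimal sketch of what "describing the algorithm in terms of standard matrix operations" looks like in practice (Python/NumPy here, purely as an illustration; the shapes and function names are made up for the example):

```python
# Hypothetical illustration: expressing a computation as one matrix operation
# (NumPy) instead of an explicit Python loop. The vectorized form dispatches
# to optimized compiled code, which is why high-level languages stay fast.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 300))   # e.g. 10k examples, 300 features
W = rng.standard_normal((300, 50))       # e.g. a dense layer's weights

def forward_loop(X, W):
    # One dot product per example -- slow, because the loop runs in Python
    out = np.empty((X.shape[0], W.shape[1]))
    for i in range(X.shape[0]):
        out[i] = X[i] @ W
    return out

def forward_vectorized(X, W):
    # The same computation as a single matrix multiply
    return X @ W

assert np.allclose(forward_loop(X, W), forward_vectorized(X, W))
```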

On the other hand, for deeper exploration of fundamental concepts it can be more interesting to treat individual components as objects whose internal state and interactions you want to conceptualize and visualize. This is a case where C++ may shine. Using C++, of course, means that a compiler will attempt to optimize your execution speed. It also opens the door to straightforward multi-core execution with OpenMP (or other available threading approaches).

C++ is a high-level language -- not inherently more verbose or tedious than Python for algorithm development. The biggest challenges of working with C++ are:

  • A more anarchic library ecosystem means a bigger effort for choosing and integrating existing components.
  • Less stable language rules (or interpretations thereof) mean that something you create today might not compile a few years down the road (due to compiler upgrades).

Consider, also, that TensorFlow documentation identifies some benefits of using C++ over Python for certain low-level cases. See TensorFlow: Create an op.


Low-level coding for GPU acceleration is an entirely different can of worms with very limited language options. This is not something to be concerned about until after you have a well-defined custom algorithm that you want to super-optimize. More likely, you would be better off using a framework (like TensorFlow) to handle GPU interactions for you.
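For instance, here is a small sketch, assuming PyTorch as the framework, of letting the library handle the CPU/GPU dispatch for you; no hand-written CUDA is involved:

```python
# Sketch (assuming PyTorch is installed): the framework decides how the
# matrix multiply is executed; the same code runs on CPU or GPU without
# any CUDA written by hand.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)

c = a @ b  # dispatched to a GPU kernel if available, otherwise a CPU routine
print(c.device, c.shape)
```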


For exploratory visualization purposes, don't discount the interactive power of JavaScript, which is also comparatively fast.


As you already understand, the vast majority of data science work is done with rather high-level languages such as Python and R. So it's not a matter of prevalence; it's a matter of which part of the big world of data science you want to (and can) tackle with your skills and your tools.

IMHO, inventing new models requires:

  • a strong theoretical background in maths and statistics, and in-depth knowledge of existing estimation/inference methods;
  • a good understanding of computational complexity and (preferably) of algorithmic optimization methods.

If, in addition, you implement your models yourself (which is not necessarily the case), that is where you will probably need low-level languages such as C/C++, because computational efficiency is crucial once people start using the model on massive datasets that require a lot of computation.
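To give a rough sense of the efficiency gap being described, here is a small, hypothetical timing comparison (Python, with NumPy standing in for compiled C routines; exact numbers will vary by machine):

```python
# Rough illustration: timing a pure-Python loop against the same computation
# done by NumPy's compiled routines, to show why heavy numerical code ends
# up in C/C++ under the hood.
import timeit
import numpy as np

x = np.random.rand(1_000_000)
x_list = x.tolist()

python_time = timeit.timeit(lambda: sum(v * v for v in x_list), number=10)
numpy_time = timeit.timeit(lambda: float(x @ x), number=10)

print(f"pure Python loop: {python_time:.3f}s")
print(f"NumPy (compiled code under the hood): {numpy_time:.3f}s")
```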
