Why aren't languages like C, C++ used for data analytics instead of R, Python?

I have started learning data science using R; however, I have C++ as a subject this semester, and my project is to predict the outcome of a game using C++. I have come across very few instances (close to none, though I did find libraries like Shark) of data science work implemented in C++.

Is it to do with the fact that C++ isn't as simple to use when it comes to manipulating large amounts of data?

Topic: data, c, predictive-modeling

Category Data Science


Python, R, and other "scripting languages" (in quotes because they're used for far more than their scripting origins) prevail because data scientists typically have a mixed background of programming and mathematics. Sure, the overhead can be enormous if done in a naive way, but the ecosystem is improving for many of these analytics engines and libraries.

  • Many analytics libraries use numpy, either directly or indirectly, or some technique similar to numpy, i.e. holding data in an efficient runtime representation and using Python's flexible API to operate directly on data, instead of boxed values.
  • Weld is a project that aims to unify all data representations, which is especially useful for cross-framework data transfer (e.g. copying data from one framework's in-memory representation to another's).
  • TensorFlow is actually written in C++, but most people interact with it via the Python API. This constructs a graph in-memory as an application description, but when you create a tf.Session, it instantiates C++ objects in its runtime to correspond to the graph. Because each operation is coarse-grained (the user says "give me the result" repeatedly with each input), most of the heavy-lifting is done by the runtime in C++.
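The boxed-versus-unboxed distinction behind the numpy point can be sketched with nothing but the standard-library array module, which stores elements the same way numpy does (one contiguous C buffer instead of a list of Python objects):

```python
import sys
from array import array

n = 100_000
boxed = list(range(n))          # a list of pointers to boxed int objects
packed = array("d", range(n))   # one contiguous buffer of C doubles

# The packed buffer holds raw 8-byte machine doubles with no per-element
# Python object; this is the representation numpy generalizes.
print(packed.itemsize)   # 8
print(sys.getsizeof(42)) # a single boxed int already costs ~28 bytes
```

numpy adds vectorized operations on top of this layout, so the loop over the data runs in compiled C rather than in the interpreter.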

Advantages of any modern interpreted language over C++. Like any tradeoff, these are advantages in some situations and disadvantages in others. Situations where you don't want these conveniences are becoming rarer, though, as hardware gets even faster and high-level language implementations get even more efficient.

  • No compile step. Write your code in my_program.py, then run it with python my_program.py.
  • No memory management. You don't have to explicitly allocate memory for new variables, and you don't have to explicitly free memory you're done with. The interpreter allocates memory for you and frees it when it's safe to do so.
  • High-level native data types. Strings, tuples, lists, sets, dictionaries, file objects and more are built-in. As an example, {"x": "y"} defines a dictionary (hash table) with the string "x" as a key and the string "y" as its value.
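A short sketch of those built-in types in action (the names and values here are made up for the demo):

```python
# Built-in high-level types: no declarations, no manual memory management.
scores = {"ada": 10, "alan": 9}   # dict (hash table)
names = sorted(scores)            # list of the dict's keys, sorted
unique = set("mississippi")       # set of distinct characters
pair = ("ada", 10)                # immutable tuple

print(names)        # ['ada', 'alan']
print(len(unique))  # 4  -- 'm', 'i', 's', 'p'
```

The equivalent in C++ would need explicit template instantiations (std::map, std::set, std::pair) and far more ceremony for the same five lines of logic.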

Specific advantages of Python:

  • Especially clean, straightforward syntax. This is a major goal of the Python language. Programmers familiar with C and C++ will find the syntax familiar yet much simpler without all the braces and semicolons.
  • Duck typing. If an object supports .quack, go ahead and call .quack on it without worrying about that object's specific type.
  • Iterators, generators and comprehensions. To get the first character of every line in a file, you'd write:

file = open("file.txt")
list_of_first_characters = [line[0] for line in file]
file.close()

This iterates over the file only once. (These particular features are just the tip of the iceberg of simple built-in syntax for high-level language features. Check out decorators next if you're intrigued.)

  • Huge standard library. Just to pick some random examples, Python ships with several XML parsers, csv & zip file readers & writers, libraries for using pretty much every internet protocol and data type, etc.
  • Great support for building web apps. Along with Ruby and JavaScript, Python is very popular in the web development community. There are several mature frameworks and a supportive community to get you started.
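To make the standard-library point concrete, here is a minimal sketch using the built-in csv module (the data is invented for the demo, held in an in-memory buffer so the example is self-contained):

```python
import csv
import io

# A tiny in-memory "file" standing in for a real CSV on disk.
raw = io.StringIO("name,score\nada,10\nalan,9\n")

# DictReader turns each row into a dict keyed by the header line.
rows = list(csv.DictReader(raw))
print(rows[0]["name"])                     # ada
print(sum(int(r["score"]) for r in rows))  # 19
```

Parsing, quoting rules, and header handling all come for free; nothing here required an external dependency.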


Yes, you're correct -- it's that C and C++ are harder to use and are more burdened with boilerplate code that obfuscates your model building logic. When you build models, you have to iterate rapidly and frequently, often throwing away a lot of your code. Having to write boilerplate code each time substantially slows you down over the long run. Using R's caret package or Python's scikit-learn library, I can train a model in just 5-10 lines of code.
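As an illustration of that claim (assuming scikit-learn is installed; the dataset and model here are arbitrary choices, not a recommendation), a complete train-and-evaluate loop really does fit in a handful of lines:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a toy dataset, hold out a test split, fit, and score.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # accuracy on the held-out data
```

Swapping the model for another classifier is a one-line change, which is exactly the rapid iteration the paragraph above describes.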

Ecosystem also plays a big role. For example, Ruby is easy to use, but its community has never seen a need for machine learning libraries to the extent that Python's community has. For stats and machine learning specifically, R is more widely used than Python because of the strength of its ecosystem and its long history of catering to that need.

It's worth pointing out that most of these R and Python libraries are written in low-level languages like C, C++ or Fortran for speed. For example, Google's TensorFlow is built in C++, but to make things easier for end users, its main API is in Python.


It's because you lose too much time configuring and building the code itself rather than solving the actual problem. For example, to load data in C you have to allocate memory first and remember to free it later. In Python you just call a method to load it, and the garbage collector reclaims the memory when you no longer need it. This is a very simple scenario.
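A minimal sketch of the "just call a method" point, using only the standard library (the file name and contents are invented so the demo is self-contained):

```python
import csv
import os
import tempfile

# Write a throwaway CSV so the example can run anywhere.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# Loading is a couple of calls; no malloc, no free -- the garbage
# collector reclaims every buffer once it's no longer referenced.
with open(path) as f:
    rows = list(csv.reader(f))
print(rows)  # [['a', 'b'], ['1', '2'], ['3', '4']]
os.remove(path)
```

The C equivalent would need explicit buffer sizing, a read loop, tokenizing, and matching free() calls before you ever got to the modeling problem.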
