R in production

Many of us are very familiar with using R for reproducible, but very much targeted, ad-hoc analysis. Given that R is currently the best collection of cutting-edge scientific methods from world-class experts in each particular field, and given that plenty of libraries exist for data I/O in R, it seems very natural to extend its applications into production environments for live decision making.

Therefore my questions are:

  • have any of you gone into production with pure R (I know of Shiny, yhat, etc., but it would be very interesting to hear about pure R);
  • is there a good book/guide/article on the topic of building R into serious live decision-making pipelines (e.g. credit scoring);
  • I would also like to hear if you think it's not a good idea at all.

Tags: scoring, predictive-modeling, r



When dealing with performance, memory, and similar issues in R, data scientists can use a free tool called Saturn Cloud that has RStudio built in. You can spin up a machine running R with over a hundred cores, 4 TB of RAM, or multiple V100 GPUs, and use the future R package to make use of that RAM and those processors. Additionally, you can boot up the saturn-rstudio-tensorflow image to use a GPU; that image is a modified version of rocker/ml. With it, you can use the TensorFlow and Keras R packages on a GPU.

When creating a resource, you choose which IDE your workspace will use and what hardware it runs on. Selecting an RStudio Server resource defaults to an image with the latest version of R and common libraries like tidyverse and data.table (and you can install more libraries too).
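As a minimal sketch of the kind of parallelism the future package enables (the data split and model formula below are placeholders, not anything specific to Saturn Cloud):

    # Sketch: run independent model fits in parallel R processes.
    library(future)

    plan(multisession, workers = 4)  # four local R workers; scale up on larger machines

    folds <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

    # Each future() begins evaluating in its own worker immediately.
    fits <- lapply(folds, function(fold) future(lm(mpg ~ wt + hp, data = fold)))

    # value() blocks until the corresponding worker finishes.
    results <- lapply(fits, value)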



Speed of code execution is rarely an issue. The important speed in business is almost always the speed of designing, deploying, and maintaining the application. An experienced programmer can optimize where necessary to get code execution fast enough. In these cases, R can make a lot of sense in production.

In cases where speed of execution IS an issue, you are already going to find an optimized C++ (or some such) real-time decision engine. So your choices are to integrate an R process or to add the bits you need to the engine. The latter is probably the only real option, not because of the speed of R, but because you don't have the time budget to call out to any external process. If the company has nothing to start with, I can't imagine anyone saying "let's build our time-critical real-time engine in R because of the great statistical libraries".

I'll give a few examples from my corporate experiences, where I use R in production:

  • Delivering Shiny applications dealing with data that is not / not yet institutionalized. I will generally load already-processed data frames and use Shiny to display different graphs and charts; computation is minimal (see the sketch after this list).
  • Decision-making analysis that requires heavy use of advanced libraries (mcclust, machine learning) but is done on a daily or longer time scale. In this case there is no reason to use any other language: I've already done the prototyping in R, so my fastest and best option is to keep things there.
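To illustrate the first pattern, a minimal sketch of a display-only Shiny app; the RDS path and column choices are placeholders for whatever your upstream batch job produces:

    # Sketch: load a precomputed data frame and let Shiny only display it.
    library(shiny)

    df <- readRDS("processed/monthly_summary.rds")  # produced upstream, not in the app

    ui <- fluidPage(
      selectInput("metric", "Metric", choices = names(df)[-1]),
      plotOutput("trend")
    )

    server <- function(input, output, session) {
      output$trend <- renderPlot({
        plot(df[[1]], df[[input$metric]], type = "l",
             xlab = names(df)[1], ylab = input$metric)
      })
    }

    shinyApp(ui, server)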

I did not use R for production when integrating with a real-time C++ decision engine. Issues:

  • An additional layer of complication to spawn R processes and integrate the results (see the sketch below for what such a batch step looks like)
  • A suitable machine-learning library (Waffles) was available in C++

The caveat in the latter case: I still use R to generate the training files.
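For illustration, a hypothetical batch script that an external engine could spawn; the file names and the saved model.rds are assumptions, not part of any specific engine:

    #!/usr/bin/env Rscript
    # Hypothetical batch step, invoked as: Rscript score.R input.csv output.csv
    # model.rds is assumed to hold a model trained offline in R.
    args <- commandArgs(trailingOnly = TRUE)
    stopifnot(length(args) == 2)

    model    <- readRDS("model.rds")
    features <- read.csv(args[1])

    features$score <- predict(model, newdata = features)
    write.csv(features, args[2], row.names = FALSE)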


R and most of its CRAN modules are licensed under the GPL.

In many companies, legal departments go crazy if you propose to use anything GPL-licensed in production... It's not reasonable, but you'll see they love Apache and hate GPL. Before going into production, make sure it's okay with the legal department. (IMHO you are safe using your modified code for internal products; integrating R into your commercial product and handing it out to others is very different. But unfortunately, many legal departments try to ban all use of the GPL whatsoever.)

Other than that, R is often really slow unless it is calling Fortran code hidden inside. It's nice when you are still trying to figure out what to do, but for production you may want maximum performance and full integration with your services. Benchmark for yourself whether R is the best choice for your use case; a sketch of one way to do that follows.
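One way to run such a benchmark is with the microbenchmark package; score_batch() below is a placeholder for whatever your pipeline would actually execute:

    # Sketch: time a candidate scoring function before committing to R.
    library(microbenchmark)

    model <- lm(mpg ~ wt + hp, data = mtcars)  # toy stand-in model
    score_batch <- function(df) predict(model, newdata = df)

    microbenchmark(
      score_batch(mtcars),
      times = 100L                             # repeat for stable timings
    )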

On the performance issues with R (I know R advocates are going to downvote me for saying so ...):

Morandat, F., Hill, B., Osvald, L., & Vitek, J. (2012). Evaluating the design of the R language. In ECOOP 2012 - Object-Oriented Programming (pp. 104-131). Springer Berlin Heidelberg.

(by the TraceR/ProfileR/ReactoR people from Purdue, who, if I recall correctly, are now working on FastR, which tries to execute R code on the JVM) states:

On those benchmarks, R is on average 501 times slower than C and 43 times slower than Python.

and:

Observations. R is clearly slow and memory inefficient. Much more so than other dynamic languages. This is largely due to the combination of language features (call-by-value, extreme dynamism, lazy evaluation) and the lack of efficient built-in types. We believe that with some effort it should be possible to improve both time and space usage, but this would likely require a full rewrite of the implementation.

Sorry to break the news. It's not my research, but it aligns with my observations.
