Running huge datasets with R

I'm trying to run some analyses on big datasets (e.g. 400k rows × 400 columns) with R, for example neural networks and recommendation systems. But it's taking too long to process the data, especially once the matrices get huge (e.g. 400k rows × 400k columns). What are some free/cheap ways to improve R's performance?

I'm open to package or web-service suggestions (other options are welcome too).

Topic: optimization, r, processing, bigdata

Category: Data Science


Your question is not very specific, so I'll try to give you some generic solutions. There are a couple of things you can do here:

  • Check sparseMatrix from the Matrix package, as mentioned by @Sidhha.
  • Try running your model in parallel using packages like snowfall or parallel (see the sketch after this list). Check this list of packages on CRAN, which can help you run your model in multicore parallel mode.
  • You can also try the data.table package. It is remarkably fast; a short example follows below.
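Here is a minimal sketch of multicore execution with the base parallel package. The data frame, column names, and the lm() model are placeholders for illustration, not from the question; the idea is simply to split the rows into chunks and fit one model per chunk on separate cores.

    library(parallel)

    # Synthetic stand-in for the real data; the real dataset would be far larger.
    set.seed(1)
    my_data <- data.frame(target = rnorm(1000),
                          x1 = rnorm(1000),
                          x2 = rnorm(1000))

    n_cores <- max(1, detectCores() - 1)   # leave one core free for the OS
    cl <- makeCluster(n_cores)             # PSOCK cluster: works on Windows, Linux, macOS
    clusterExport(cl, "my_data")           # ship the data to the workers

    # Split row indices into one chunk per core
    chunks <- split(seq_len(nrow(my_data)),
                    cut(seq_len(nrow(my_data)), n_cores, labels = FALSE))

    # Fit a (placeholder) model on each chunk in parallel
    fits <- parLapply(cl, chunks, function(idx) {
      lm(target ~ x1 + x2, data = my_data[idx, ])
    })

    stopCluster(cl)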
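And a small data.table example, using hypothetical user/item/rating columns, to show the kind of grouped aggregation that stays fast at this scale:

    library(data.table)

    # Synthetic ratings table; user_id / item_id / rating are assumed column names.
    set.seed(1)
    dt <- data.table(user_id = sample(1e4, 1e6, replace = TRUE),
                     item_id = sample(5e3, 1e6, replace = TRUE),
                     rating  = sample(1:5, 1e6, replace = TRUE))

    # Grouped aggregation runs in optimized C code and scales to very large tables
    avg_by_user <- dt[, .(mean_rating = mean(rating)), by = user_id]

    # For files on disk, fread() is typically much faster than read.csv()
    # dt <- fread("ratings.csv")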

Good reads:

  1. 11 Tips on How to Handle Big Data in R (and 1 Bad Pun)
  2. Why R is slow & how to improve its Performance?

Since you mention you are building a recommendation system, I assume you are working with a sparse matrix. Check sparseMatrix from the Matrix package; it should let you keep your large matrix in memory and train your model on it.
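A minimal sketch of building such a matrix, with hypothetical user/item/rating triplets; only the non-zero entries are stored, which is what makes a 400k × 400k ratings matrix feasible in memory when only a small fraction of entries are observed:

    library(Matrix)

    # Hypothetical (user, item, rating) triplets; in practice these come from your data.
    users   <- c(1, 1, 2, 3, 3)
    items   <- c(2, 5, 1, 3, 4)
    ratings <- c(4, 5, 3, 2, 5)

    # Sparse storage: only the 5 observed ratings are kept, not the full 3 x 5 grid
    R <- sparseMatrix(i = users, j = items, x = ratings, dims = c(3, 5))

    object.size(R)   # compare with object.size(as.matrix(R)) for the dense equivalent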
