Which open-source DBMS for fairly large data?

I have a 7 GB confidential dataset that I want to use for a machine learning application.

I tried:

Every package recommended for efficient dataset management in R, such as:

  • data.table,
  • ff,
  • and sqldf,

all with no success.

data.table needs to load all the data into memory from what I have read, so it obviously won't work since my computer has only 4 GB of RAM. ff leads to a memory error too.

So I decided to turn to a DBMS and tried:

  • MySQL, which managed to load my dataset in 2 hours and 21 minutes. Then I started my queries (I have a few queries to run to prepare my data before exporting a smaller set to R for the machine learning application), and after waiting for hours I got the message "The total number of locks exceeds the lock table size" (my query was just an UPDATE extracting the month from a date for each row; a batched version is sketched after this list).
  • I read that PostgreSQL was similar to MySQL in performance, so I didn't try it.
  • I read that Redis was very fast but not at all suited to the kind of bulk import I need to do here, so I didn't try it.
  • I tried MongoDB, the up-and-coming NoSQL solution I hear about everywhere. Not only do I find it rather off-putting that mongoimport is so limited in options (I had to change all semicolons to commas with sed before I could import the data), but it also seems to be slower than MySQL, since I launched the import yesterday and it is still running.

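For context on the MySQL failure above: this error usually means InnoDB's buffer pool is too small for the number of row locks a single huge UPDATE takes, and the usual workarounds are to raise innodb_buffer_pool_size or to run the UPDATE in batches. Below is a minimal sketch of the batched approach driven from R through DBI; the table and column names (events, event_date, event_month) and the connection details are assumptions to adapt to the real schema.

    library(DBI)
    library(RMySQL)  # RMariaDB works the same way through DBI

    con <- dbConnect(RMySQL::MySQL(), dbname = "mydb",
                     user = "user", password = "password")

    # assumes the target column already exists, e.g.:
    # ALTER TABLE events ADD COLUMN event_month TINYINT NULL;
    batch_size <- 1000000L
    repeat {
      # MySQL allows LIMIT on UPDATE, so each pass locks at most batch_size rows
      affected <- dbExecute(con, sprintf(
        "UPDATE events
            SET event_month = MONTH(event_date)
          WHERE event_month IS NULL
          LIMIT %d", batch_size))
      if (affected == 0) break  # nothing left to update
    }

    dbDisconnect(con)
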
What I can't try: the data are confidential, so I don't really want to rent space on an Azure or Amazon cloud solution. I am also not sure the dataset is big enough to justify a Hadoop solution, but maybe I am wrong about that.

Is there a performant open-source solution that I haven't tried and that you would recommend for running SQL-like queries on a biggish dataset?


Edit: Some more details about what I want to do with these data, so you can picture it. They are events with a timestamp and a geolocation. I have 8 billion rows. Examples of what I want to do:

  • standardize the series identified by geolocation (for example, I need to compute means grouped by geolocation),
  • compute the average count of events by type, season, day, etc. (the usual GROUP BY SQL queries); a sketch of these queries follows the list.
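Here is a sketch of what those two queries could look like in SQL, run here through RSQLite (see the second edit below). The table and column names (events, geoloc, value, event_type, ts) are placeholders, and the timestamp is assumed to be stored as ISO-8601 text so that SQLite's DATE() works on it.

    library(DBI)
    library(RSQLite)

    con <- dbConnect(RSQLite::SQLite(), "events.sqlite")

    # mean (and count) per geolocation, used to standardize each series
    means <- dbGetQuery(con, "
      SELECT geoloc, AVG(value) AS mean_value, COUNT(*) AS n
        FROM events
       GROUP BY geoloc")

    # average number of events per day for each event type:
    # count events per type and day, then average those daily counts
    daily <- dbGetQuery(con, "
      SELECT event_type, AVG(n) AS avg_daily_count
        FROM (SELECT event_type, DATE(ts) AS day, COUNT(*) AS n
                FROM events
               GROUP BY event_type, DATE(ts))
       GROUP BY event_type")

    dbDisconnect(con)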

Edit

As a first element of an answer for those who, like me, have limited hardware: RSQLite seems to be a possibility. I am still interested in other people's experiences.
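For anyone in the same situation, a minimal sketch of what that looks like: the CSV is streamed into an on-disk SQLite file chunk by chunk, so only one chunk is ever held in memory. The file name, table name, chunk size and the semicolon separator are assumptions taken from the question; adapt as needed.

    library(DBI)
    library(RSQLite)

    db  <- dbConnect(RSQLite::SQLite(), "events.sqlite")
    csv <- file("events.csv", open = "r")

    header     <- strsplit(readLines(csv, n = 1), ";")[[1]]
    chunk_size <- 1000000L

    repeat {
      lines <- readLines(csv, n = chunk_size)
      if (length(lines) == 0) break
      chunk <- read.csv(text = lines, header = FALSE, sep = ";",
                        col.names = header, stringsAsFactors = FALSE)
      dbWriteTable(db, "events", chunk, append = TRUE)  # creates the table on the first pass
    }

    close(csv)
    dbDisconnect(db)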

Topic: mongodb, r, performance

Category: Data Science


Analyzing 8 billion rows on a 4 GB computer is pretty silly, but you can try:

http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html
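To give an idea of the column-store approach in that link: the article describes using the MonetDB column store from R, and the embedded MonetDBLite package (a later, serverless descendant of the same project) exposes it through the usual DBI interface. This is only a rough, hedged sketch; package availability and the database directory are assumptions.

    library(DBI)
    library(MonetDBLite)

    # embedded, on-disk MonetDB instance, no server to install or administer
    con <- dbConnect(MonetDBLite::MonetDBLite(), "./monetdb_dir")

    # the same kind of GROUP BY aggregation as above, but on a column store,
    # which is usually much faster for analytical queries
    res <- dbGetQuery(con, "SELECT geoloc, AVG(value) FROM events GROUP BY geoloc")

    dbDisconnect(con, shutdown = TRUE)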
