Which open-source DBMS for fairly large data?

I have a 7 GB confidential dataset that I want to use for a machine learning application.

I tried:

Every package recommended for efficient dataset management in R, such as:

  • data.table,
  • ff,
  • and sqldf,

all with no success.

data.table needs to load all the data into memory from what I have read, so it obviously won't work since my computer has only 4 GB of RAM. ff leads to a memory error too.

So I decided to turn to a DBMS and tried:

  • MySQL, which managed to load my dataset in 2 hours and 21 minutes. Then I started my queries (I have a few queries to run to prepare my data before exporting a smaller set to R for the machine learning application), and after waiting for hours I got the message "The total number of locks exceeds the lock table size" (my query was just an UPDATE extracting the month from a date for each row; a batched version is sketched after this list).
  • I read that PostgreSQL was similar to MySQL in performance, so I didn't try it.
  • I read that Redis was very fast but not at all suited to the kind of bulk import I need to do here, so I didn't try it.
  • I tried MongoDB, the up-and-coming NoSQL solution I hear about everywhere. Not only do I find it rather off-putting that mongoimport is so limited in options (I had to change all semicolons to commas with sed before I could import the data), but it also seems to be slower than MySQL, since I launched the import yesterday and it is still running.

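For context on the MySQL failure above: this error usually means InnoDB's buffer pool is too small for the number of row locks a single huge UPDATE takes, and the usual workarounds are to raise innodb_buffer_pool_size or to run the UPDATE in batches. Below is a minimal sketch of the batched approach driven from R through DBI; the table and column names (events, event_date, event_month) and the connection details are assumptions to adapt to the real schema.

    library(DBI)
    library(RMySQL)  # RMariaDB works the same way through DBI

    con <- dbConnect(RMySQL::MySQL(), dbname = "mydb",
                     user = "user", password = "password")

    # assumes the target column already exists, e.g.:
    # ALTER TABLE events ADD COLUMN event_month TINYINT NULL;
    batch_size <- 1000000L
    repeat {
      # MySQL allows LIMIT on UPDATE, so each pass locks at most batch_size rows
      affected <- dbExecute(con, sprintf(
        "UPDATE events
            SET event_month = MONTH(event_date)
          WHERE event_month IS NULL
          LIMIT %d", batch_size))
      if (affected == 0) break  # nothing left to update
    }

    dbDisconnect(con)
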
What I can't try: the data are confidential, so I don't really want to rent space on an Azure or Amazon cloud solution. I am also not sure the dataset is big enough to justify a Hadoop solution, but maybe I am wrong about that.

Is there a performant open-source solution that I haven't tried and that you would recommend for running SQL-like queries on a biggish dataset?


Edit: Some more details about what I want to do with these data, so you can picture it. They are events with a timestamp and a geolocation. I have 8 billion rows. Examples of what I want to do:

  • standardize the series identified by geolocation (for example, I need to compute means grouped by geolocation),
  • compute the average count of events by type, season, day, etc. (the usual GROUP BY SQL queries); a sketch of these queries follows the list.
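Here is a sketch of what those two queries could look like in SQL, run here through RSQLite (see the second edit below). The table and column names (events, geoloc, value, event_type, ts) are placeholders, and the timestamp is assumed to be stored as ISO-8601 text so that SQLite's DATE() works on it.

    library(DBI)
    library(RSQLite)

    con <- dbConnect(RSQLite::SQLite(), "events.sqlite")

    # mean (and count) per geolocation, used to standardize each series
    means <- dbGetQuery(con, "
      SELECT geoloc, AVG(value) AS mean_value, COUNT(*) AS n
        FROM events
       GROUP BY geoloc")

    # average number of events per day for each event type:
    # count events per type and day, then average those daily counts
    daily <- dbGetQuery(con, "
      SELECT event_type, AVG(n) AS avg_daily_count
        FROM (SELECT event_type, DATE(ts) AS day, COUNT(*) AS n
                FROM events
               GROUP BY event_type, DATE(ts))
       GROUP BY event_type")

    dbDisconnect(con)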

Edit

As a first element of an answer for those who, like me, have limited hardware: RSQLite seems to be a possibility. I am still interested in other people's experiences.
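For anyone in the same situation, a minimal sketch of what that looks like: the CSV is streamed into an on-disk SQLite file chunk by chunk, so only one chunk is ever held in memory. The file name, table name, chunk size and the semicolon separator are assumptions taken from the question; adapt as needed.

    library(DBI)
    library(RSQLite)

    db  <- dbConnect(RSQLite::SQLite(), "events.sqlite")
    csv <- file("events.csv", open = "r")

    header     <- strsplit(readLines(csv, n = 1), ";")[[1]]
    chunk_size <- 1000000L

    repeat {
      lines <- readLines(csv, n = chunk_size)
      if (length(lines) == 0) break
      chunk <- read.csv(text = lines, header = FALSE, sep = ";",
                        col.names = header, stringsAsFactors = FALSE)
      dbWriteTable(db, "events", chunk, append = TRUE)  # creates the table on the first pass
    }

    close(csv)
    dbDisconnect(db)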

Topic: mongodb, r, performance

Category: Data Science


Analyzing 8 billion rows on a 4 GB computer is pretty silly, but you can try:

http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html
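To give an idea of the column-store approach in that link: the article describes using the MonetDB column store from R, and the embedded MonetDBLite package (a later, serverless descendant of the same project) exposes it through the usual DBI interface. This is only a rough, hedged sketch; package availability and the database directory are assumptions.

    library(DBI)
    library(MonetDBLite)

    # embedded, on-disk MonetDB instance, no server to install or administer
    con <- dbConnect(MonetDBLite::MonetDBLite(), "./monetdb_dir")

    # the same kind of GROUP BY aggregation as above, but on a column store,
    # which is usually much faster for analytical queries
    res <- dbGetQuery(con, "SELECT geoloc, AVG(value) FROM events GROUP BY geoloc")

    dbDisconnect(con, shutdown = TRUE)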
