Which open-source DBMS for moderately large data?
I have a confidential 7 GB dataset that I want to use for a machine learning application.
I tried every package recommended for efficient dataset management in R:
- data.table,
- ff,
- and sqldf,
all with no success.
From what I read, data.table needs to load all the data into memory, so it obviously will not work since my computer has only 4 GB of RAM. ff leads to a memory error too.
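For completeness, here is a minimal sketch of the only data.table mitigation I am aware of: reading just a subset of columns with fread's select argument, which can sometimes make a large file fit in limited RAM (the file and column names below are made up):

```r
library(data.table)

# Read only the columns actually needed for the analysis; skipping the rest
# can shrink the memory footprint considerably. Names are hypothetical.
dt <- fread("events.csv", sep = ";",
            select = c("timestamp", "geolocation", "event_type"))
```

Even with column selection, though, 7 GB of raw data may simply not fit in 4 GB of RAM.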
So I decided to turn to a DBMS and tried:
- MySQL, which managed to load my dataset in 2 hours and 21 minutes. Then I began my queries (I have a few queries to run to prepare the data before exporting a smaller set to R for the machine learning application), and after waiting for hours I got the message "The total number of locks exceeds the lock table size" (the query was just an UPDATE extracting the month from a date for each row; see the sketch after this list for an alternative I am considering).
- I read that PostgreSQL is similar to MySQL in performance, so I didn't try it.
- I read that Redis is very performant but not at all suited to the kind of massive import I want to do here, so I didn't try it.
- I tried MongoDB, the rising NoSQL solution I hear about everywhere. Not only do I find it rather disturbing that mongoimport is so limited in options (I had to change every semicolon to a comma with sed before I could import the data), but it also seems to be less performant than MySQL: I launched the load yesterday and it is still running.
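To make the MySQL problem concrete, here is a hedged sketch of the alternative, run from R through DBI (the connection details and the table and column names are invented): derive the month at query time with MySQL's MONTH() instead of materialising it with a full-table UPDATE, which is what exhausted the InnoDB lock table:

```r
library(DBI)
library(RMySQL)

# Hypothetical connection details
con <- dbConnect(RMySQL::MySQL(), dbname = "events_db",
                 user = "me", password = "secret")

# Compute the month on the fly in the SELECT instead of storing it with an
# UPDATE on every row, so no row locks accumulate.
monthly <- dbGetQuery(con, "
  SELECT MONTH(event_date) AS event_month, COUNT(*) AS n_events
  FROM events
  GROUP BY MONTH(event_date)")

dbDisconnect(con)
```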
What I can't try: the data are confidential, so I don't really want to rent space on an Azure or Amazon cloud solution. I am also not sure the dataset is big enough to justify a Hadoop solution, but maybe I am wrong about that.
Is there a performant open-source solution that I haven't tried and that you would recommend for running SQL-like queries on a biggish dataset?
Edit: Some more details about what I want to do with these data, to help you visualize the problem. They are events with a timestamp and a geolocation, and I have 8 billion rows. Two examples of what I want to do (both sketched in the code below):
- standardize the series identified by geolocation (for example, I need to compute the mean grouped by geolocation),
- compute the average count of events by season, day, etc. (the usual GROUP BY SQL queries).
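Here is a hedged illustration of those two queries, written against a generic DBI connection `con` (the column names value, event_type and event_date are invented for the example):

```r
library(DBI)

# Per-geolocation mean, the first ingredient for standardising each series
means <- dbGetQuery(con, "
  SELECT geolocation, AVG(value) AS mean_value
  FROM events
  GROUP BY geolocation")

# Average number of events per day for each event type
daily <- dbGetQuery(con, "
  SELECT event_type,
         COUNT(*) * 1.0 / COUNT(DISTINCT event_date) AS avg_events_per_day
  FROM events
  GROUP BY event_type")
```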
Edit
As a first element of an answer for those who have limited hardware like me, RSQLite seems to be a possibility (see the sketch below). I am still interested in other people's experiences.
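A minimal sketch of what that route could look like, assuming a semicolon-separated file and invented names: stream the CSV into an on-disk SQLite database chunk by chunk, so R never holds more than one chunk in memory, then do the heavy aggregation in SQL and pull only small results back into R.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "events.sqlite")

chunk_size <- 1e6
input <- file("events.csv", open = "r")
header <- strsplit(readLines(input, n = 1), ";")[[1]]

repeat {
  # read.table on an open connection resumes where the last read stopped
  chunk <- tryCatch(
    read.table(input, sep = ";", nrows = chunk_size,
               col.names = header, stringsAsFactors = FALSE),
    error = function(e) NULL)  # no lines left to read
  if (is.null(chunk)) break
  dbWriteTable(con, "events", chunk, append = TRUE)
  if (nrow(chunk) < chunk_size) break
}
close(input)

# The aggregation runs inside SQLite on disk, not in R's memory
res <- dbGetQuery(con, "
  SELECT geolocation, AVG(value) AS mean_value
  FROM events
  GROUP BY geolocation")

dbDisconnect(con)
```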