Seeking advice on database architecture -- given my problem, what tools should I learn?

I'm a fairly experienced R user, but until now I haven't had a good reason to learn to use databases. Now I have a problem where I am dealing with model output that I need to save to disk, and then query for another process. If the data were smaller, I'd store everything in a list, with hierarchical elements. For example, if my object is called output.OLS:

> summary(output.OLS)
        Length Class  Mode
SEP0307 3      -none- list
SEP0308 3      -none- list
SEP0306 3      -none- list
SEN0308 3      -none- list
snip -- hundreds of sub-lists
SEN0307 3      -none- list

The second level of the hierarchy:

> summary(output.OLS[[1]])
   Length Class  Mode
c1 11     -none- list
c2 11     -none- list
c3 11     -none- list

The third:

> summary(output.OLS[[1]][[1]])
         Length Class      Mode
Baseline 1008   data.frame list
CanSEM45 1008   data.frame list
CanSEM85 1008   data.frame list
CCSM45   1008   data.frame list
CCSM85   1008   data.frame list
GISS45   1008   data.frame list
GISS85   1008   data.frame list
HadGEM45 1008   data.frame list
HadGEM85 1008   data.frame list
MIROC45  1008   data.frame list
MIROC85  1008   data.frame list

Now we can see that we're looking at data.frames:

> dim(output.OLS[[1]][[1]][[1]])
[1]   33 1008
> colnames(output.OLS[[1]][[1]][[1]])
   [1] "scen"     "year"     "region"   "size"     "amount"   "mean"
   [7] "semean"   "varres"   "result.1" "result.2"
snip
[1006] "result.998"  "result.999"  "result.1000"

There are four such output objects, for a total of about 6 GB. They don't all have the columns mean, semean, and varres.

This list is about 1.5 GB, which is unwieldy given what I want to do with it. It would be much better to read subsets of it from a database. I can think of two potential architectures:

Architecture 1: a single, huge table. The columns would be region (the outermost level of the hierarchy), cx (i.e., c1, c2, c3), model (Baseline, CanSEM45, etc.), and then all of the results: result.1, result.2, ..., result.1000.

This would be the most convenient. The problem is that the table would be both deep and wide -- too wide for SQLite, which is the simplest database as far as I can tell.
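For concreteness, here is a rough sketch of how the nested list could be flattened into that single wide table (flatten_wide and region_id are made-up names, and every leaf is assumed to share the same 1,008 columns):

    # Sketch: flatten output.OLS (region > cx > model) into one wide
    # data.frame. The leaves already contain a "region" column, so the
    # list-level key is stored as region_id to avoid a name clash.
    # Assumes every leaf data.frame has identical columns.
    flatten_wide <- function(output) {
      rows <- list()
      for (region in names(output)) {
        for (cx in names(output[[region]])) {
          for (model in names(output[[region]][[cx]])) {
            df <- output[[region]][[cx]][[model]]
            df$region_id <- region
            df$cx        <- cx
            df$model     <- model
            rows[[length(rows) + 1]] <- df
          }
        }
      }
      do.call(rbind, rows)  # one deep, wide table
    }

    wide <- flatten_wide(output.OLS)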

Architecture 2: one key-value table and many data tables. The key-value table would have the following columns: cx, scenario, year, place, and keyvalue = coerce_to_integer(unique(scenario, year, place, cx)).

The data tables would look like this: keyvalue, result_number, result.

This option uses the hierarchy and is actually a relational database (relational via the keyvalues, right?).

But it has the drawback that it nearly triples the size of my data: for each datum, I would need to store a keyvalue and an index (result_number) as well. That seems like it wouldn't make sense.
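A rough sketch of building those two tables from the wide table above (key_tbl and long are made-up names; I'm matching the keyvalue definition with the scen, year, region, and cx columns):

    # Sketch of Architecture 2, starting from `wide` above.
    id_cols <- c("scen", "year", "region", "cx")  # the (scenario, year, place, cx) key
    key_tbl <- unique(wide[id_cols])
    key_tbl$keyvalue <- seq_len(nrow(key_tbl))    # the coerce_to_integer step

    wide2       <- merge(wide, key_tbl, by = id_cols)
    result_cols <- grep("^result\\.", names(wide2), value = TRUE)

    # Stack the 1000 result columns into one long table of
    # (keyvalue, result_number, result); unlist() stacks column by column.
    long <- data.frame(
      keyvalue      = rep(wide2$keyvalue, times = length(result_cols)),
      result_number = rep(seq_along(result_cols), each = nrow(wide2)),
      result        = unlist(wide2[result_cols], use.names = FALSE)
    )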

Are there other options? For example, are there databases similar to R's lists that can be stored in memory and read from memory? Is that sort of what NoSQL databases are?

So, my questions:

  1. What database architectures would make sense for this problem?
  2. Where should I start learning about these sorts of databases?
  3. How does R interface with them? Can I create the database from R? If so, how?



Like the other answerer, I am more on the Python side, so I am not aware of what the best solution for R would be, but I suggest using HDF5 files.

You can store different objects in different tables inside the file, and each table can have its own hierarchy. Volume and memory should not be a limitation either, as you don't need to load the whole file into memory.
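In R specifically, a minimal sketch with the Bioconductor rhdf5 package could look like the following (hdf5r is an alternative; the file and group names are placeholders that mirror the list hierarchy):

    # Sketch using rhdf5 (install with BiocManager::install("rhdf5")).
    library(rhdf5)

    h5createFile("output_OLS.h5")
    h5createGroup("output_OLS.h5", "SEP0307")
    h5createGroup("output_OLS.h5", "SEP0307/c1")

    # Write one leaf data.frame; the groups mirror the R list hierarchy.
    h5write(output.OLS[["SEP0307"]][["c1"]][["Baseline"]],
            "output_OLS.h5", "SEP0307/c1/Baseline")

    # Later: read back just that subset without loading the whole file.
    baseline <- h5read("output_OLS.h5", "SEP0307/c1/Baseline")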


I used R for several years but have since moved to Python, so I have a hard time following output.OLS's data hierarchy. Nonetheless, here are my thoughts.

In-Memory Databases: If you're struggling to fit your R object in memory, then I'm guessing the memory requirement is too great for your laptop (i.e., the problem isn't due to a 32-bit installation of R). If that's the case, putting an in-memory DB like MongoDB on your laptop will only transfer the memory problem to a new technology. In other words, 6 GB in R will be 6 GB in Mongo, so if your laptop simply can't handle 6 GB, it doesn't matter how you put it there. You could set up MongoDB on AWS, which could hold hundreds of GB, but that can be expensive and also difficult if you don't have AWS experience.

On the other hand, MongoDB is good at storing hierarchical data like lists and hash tables / dictionaries (named lists, in R terms) and is fast if you have enough memory.
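For reference, a rough sketch of that from R with the mongolite package (a local MongoDB server is assumed, and the collection and field names are placeholders):

    # Sketch using mongolite against a local MongoDB server.
    library(mongolite)

    m <- mongo(collection = "ols", db = "models",
               url = "mongodb://localhost")

    # Store one leaf with its hierarchy keys as document fields
    # (region_id avoids clashing with the leaf's own "region" column).
    m$insert(data.frame(region_id = "SEP0307", cx = "c1", model = "Baseline",
                        output.OLS[["SEP0307"]][["c1"]][["Baseline"]]))

    # Query a subset back by its hierarchy keys.
    subset <- m$find('{"region_id": "SEP0307", "model": "Baseline"}')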

Relational Databases (RDBMS): I would recommend going this route and creating a wide, deep table. You can still get value out of an RDBMS even if your machine has limited memory, because everything is efficiently stored on disk. If you create table indexes on the columns you expect to filter or select on most, you can query the data you want very quickly even when it spans multiple GB. If you don't create table indexes, your queries will be horribly slow, and you'll probably abandon the RDBMS effort entirely (so make sure to index).
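As a sketch, writing and indexing the wide table through DBI could look like this. I'm using RSQLite for illustration (despite the concern in the question, SQLite's default column limit is 2000, so roughly 1,011 columns would fit), and the same DBI calls work with MySQL or Postgres drivers:

    # Sketch with DBI + RSQLite; `wide` is the flattened table sketched
    # in the question. Table and index names are placeholders.
    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "output_OLS.sqlite")
    dbWriteTable(con, "results", wide)
    dbExecute(con,
      "CREATE INDEX idx_region_model ON results (region_id, model)")

    # With the index in place, filtered queries stay fast at multi-GB scale.
    hits <- dbGetQuery(con, "
      SELECT * FROM results
      WHERE region_id = 'SEP0307' AND model = 'Baseline'")
    dbDisconnect(con)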

A long time ago my go-to package for accessing MySQL through R was RMySQL. MySQL has been around for many years, so you'll be able to find answers to most of your problems on Google, but MySQL user interfaces like Sequel Pro (Mac only) have somewhat fallen behind on maintenance, so setting up tables and databases can be a little more challenging than it used to be. You'll now have to go the command-line route.

Postgres has now replaced MySQL as the open-source DB of choice, but it lacks good user-interface support, so you'll probably have to set everything up via the command line. I started using Postgres after I stopped using R, so I never used a Postgres R package, but it looks like there is one (e.g., RPostgreSQL).

Whether you go with MySQL or Postgres, you'll probably have to use the command line outside of R to set up the database, but after that, creating, dropping, joining, and querying tables can all be done through R. RDBMSs are also great because you can guarantee data quality (no duplicates) by creating primary keys -- I found this feature very helpful for my R modeling. In any case, SQL is a basic language that every data-oriented person should know, so setting up an RDBMS and learning SQL are skills worth acquiring.
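As a final sketch, creating a table with a primary key from R could look like this (connection details are placeholders; RPostgres is one current driver, and `long` is the long-format table from the question's Architecture 2):

    # Sketch: create a Postgres table with a composite primary key via DBI.
    # Credentials and names are placeholders.
    library(DBI)

    con <- dbConnect(RPostgres::Postgres(), dbname = "models",
                     host = "localhost", user = "me", password = "secret")
    dbExecute(con, "
      CREATE TABLE results (
        keyvalue      integer,
        result_number integer,
        result        double precision,
        PRIMARY KEY (keyvalue, result_number)  -- rules out duplicate rows
      )")
    dbWriteTable(con, "results", long, append = TRUE)
    dbDisconnect(con)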
