Seeking advice on database architecture -- given my problem, what tools should I learn?
I'm a fairly experienced R user, but until now I haven't had a good reason to learn to use databases. Now I have a problem where I am dealing with model output that I need to save to disk, and then query for another process. If the data were smaller, I'd store everything in a list
, with hierarchical elements. For example, if my object is called output.OLS
:
1 summary(output.OLS)
Length Class Mode
SEP0307 3 -none- list
SEP0308 3 -none- list
SEP0306 3 -none- list
SEN0308 3 -none- list
snip -- hundreds of sub-lists
SEN0307 3 -none- list
the second level of hierarchy:
1 summary(output.OLS[[1]])
Length Class Mode
c1 11 -none- list
c2 11 -none- list
c2 11 -none- list
the third:
1 summary(output.OLS[[1]][[1]])
Length Class Mode
Baseline 1008 data.frame list
CanSEM45 1008 data.frame list
CanSEM85 1008 data.frame list
CCSM45 1008 data.frame list
CCSM85 1008 data.frame list
GISS45 1008 data.frame list
GISS85 1008 data.frame list
HadGEM45 1008 data.frame list
HadGEM85 1008 data.frame list
MIROC45 1008 data.frame list
MIROC85 1008 data.frame list
and we can see that now we're looking at data.frame
s
1 dim(output.OLS[[1]][[1]][[1]])
[1] 33 1008
1 colnames(output.OLS[[1]][[1]][[1]])[1:10]
[1] "scen" "year" "region" "size" "amount" "mean"
[7] "semean" "varres" "result.1" "result.2"
snip
[1006] "result.998" "result.999" "result.1000"
There are four such output objects, for a total of about 6GB. They don't all have the columns mean
, semean
and varres
This list is about 1.5 gigs, which is unwieldy given what I want to do with it. It'd be much better to read subsets of it from a database. I can think of two potential architectures:
Architecture 1: Single, huge table
The variable names would be region
(the outermost hierarchy), cx
(i.e.: c1, c2, c3), model
(Baseline, CanSEM, etc), and then all of the results: result.1, result.2, ..., result.1000
This would be most convenient. The problem is that the table would be both deep and wide. Too wide for SQLLite, which is the simplest database, as far as I can tell.
Architecture 2:: one keyvalue table and many data tables. The keyvalue table would have the following columns:
cx, scenario, year, place, keyvalue = coerce_to_integer(unique(scenario, year, place, cx))
The data tables would look like this
keyvalue, result_number, result
This option uses hierarchy and is actually a relational database (relational via the keyvalues, right)?
But it has the drawback that it nearly triples the size of my data: for each datum, I need to store a keyvalue and an index (result_number). This seems like it wouldn't make sense.
Are there other options
For example, are there databases similar to R's lists
that can be stored in memory, and read from memory? Is that sort of like what NoSQL databases are?
So, my question
- What database architectures would make sense for this problem?
- Where should I start learning about these sorts of databases?
- How does R interface with them? Can I create the database from R? If so, how?
Category Data Science