Data representation (NoSQL database?) for a medical study
Problem description
I have a data set of about 10,000 patients in a study. For each patient, I have a list of various measurements. Some of the information is scalar data (e.g. age), some is time series of measurements, and some can even be a bitmap. An individual record can be quite large (10 kB to 10 MB). The data is to be processed in two steps:
Preprocessing at the level of individual records (patients), i.e. extracting features from the raw data and storing them, calculating slopes in time series, etc. All of this can be done per patient, so it is very easy to distribute.
On top of the preprocessed data (extracted features), I will need to calculate aggregates, e.g. the average age, and also run some machine learning tasks.
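To make the two steps concrete, here is a minimal sketch in plain Python. The field names (`patient_id`, `age`, `heart_rate`) and the sample values are made up for illustration, and the least-squares slope is just one possible feature. Step 1 touches a single record at a time, which is why it distributes easily; step 2 only needs the extracted features.

```python
from statistics import mean


def slope(times, values):
    """Least-squares slope of values against times."""
    t_bar = mean(times)
    v_bar = mean(values)
    num = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den


def extract_features(record):
    """Step 1: works on one patient record, independent of all others."""
    series = record["heart_rate"]  # hypothetical time series field
    return {
        "patient_id": record["patient_id"],
        "age": record["age"],
        "heart_rate_slope": slope(series["t"], series["value"]),
    }


def aggregate(features):
    """Step 2: works on the extracted features of all patients."""
    return {"mean_age": mean(f["age"] for f in features)}


# Two made-up patient records, only to show the shape of the data.
patients = [
    {"patient_id": 1, "age": 40,
     "heart_rate": {"t": [0, 1, 2], "value": [60, 62, 64]}},
    {"patient_id": 2, "age": 60,
     "heart_rate": {"t": [0, 1, 2], "value": [80, 79, 78]}},
]

features = [extract_features(p) for p in patients]  # a "map" over records
summary = aggregate(features)                       # a "reduce" over features
```

In a map-reduce setting, `extract_features` would presumably be the function mapped over the records, and `aggregate` would run on the collected results.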
The question
This workload looks well suited to Apache Spark (or any map-reduce architecture). At the most general level, my question is: what is the most appropriate NoSQL database for this situation?
So far, I have considered two basic options:
- MongoDB - to take advantage of the document-oriented storage, where everything about a patient is in one place. However, I am not sure about the performance with larger binary data (pictures, time series).
- Cassandra - this may store binary data better, but joins will be necessary (even if optimized by indexing all data by "patient id").
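The difference between the two layouts I have in mind can be sketched in plain Python. This does not use the MongoDB or Cassandra APIs; the structures and field names are hypothetical and only illustrate "everything in one document" versus "rows keyed by patient id that have to be reassembled".

```python
from collections import defaultdict

# Layout A: one self-contained document per patient.
document = {
    "patient_id": 1,
    "age": 40,
    "heart_rate": [(0, 60), (1, 62)],  # (time, value) pairs
}

# Layout B: flat rows in separate collections, all keyed by patient_id.
scalar_rows = [(1, "age", 40)]
series_rows = [(1, "heart_rate", 0, 60), (1, "heart_rate", 1, 62)]


def assemble(scalar_rows, series_rows):
    """Rebuild per-patient records from flat rows (the 'join' step)."""
    patients = defaultdict(dict)
    for pid, name, value in scalar_rows:
        patients[pid]["patient_id"] = pid
        patients[pid][name] = value
    for pid, name, t, value in series_rows:
        patients[pid].setdefault("patient_id", pid)
        patients[pid].setdefault(name, []).append((t, value))
    return dict(patients)


rebuilt = assemble(scalar_rows, series_rows)
```

With layout A, the per-patient preprocessing can read one record and start working; with layout B, something like `assemble` has to run first, which is the extra cost I am worried about.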
Topic mongodb nosql machine-learning
Category Data Science