Data representation (NoSQL database?) for a medical study

Problem description

I have a data set of about 10,000 patients in a study. For each patient, I have a list of various measurements. Some of the information is scalar (e.g. age), some is time series of measurements, and some can even be a bitmap image. An individual record can be quite large (10 kB to 10 MB). The data is processed in essentially two steps:

  1. Preprocessing at the level of individual records (patients): extracting features from the raw data and storing them, calculating slopes in time series, etc. All of this can be done at the individual level, so it is very easy to distribute.

  2. On top of the preprocessed data (the extracted features), I will need to calculate some aggregates (e.g. average age), but also run some machine learning tasks.
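Independent of the storage choice, the two steps can be sketched in plain Python. The field names and the slope feature below are hypothetical, purely for illustration; step 1 runs per patient and is therefore trivially parallelizable, while step 2 aggregates over the results.

```python
# Step 1: per-patient preprocessing - extract features from a raw record.
# Runs independently per patient, so it maps directly onto the "map"
# side of a Spark/MapReduce job.
def extract_features(record):
    series = record["blood_pressure"]          # hypothetical time series
    n = len(series)
    xbar = (n - 1) / 2                         # mean of time indices 0..n-1
    ybar = sum(series) / n
    # least-squares slope of the series against its time index
    slope = sum((i - xbar) * (y - ybar) for i, y in enumerate(series)) / \
            sum((i - xbar) ** 2 for i in range(n))
    return {"patient_id": record["patient_id"],
            "age": record["age"],
            "bp_slope": slope}

# Step 2: aggregation over the extracted features.
def average_age(features):
    return sum(f["age"] for f in features) / len(features)

raw = [
    {"patient_id": 1, "age": 64, "blood_pressure": [120, 124, 130]},
    {"patient_id": 2, "age": 51, "blood_pressure": [118, 117, 119]},
]
features = [extract_features(r) for r in raw]
print(average_age(features))  # 57.5
```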

The question

Obviously, this is well suited to Apache Spark (or any map-reduce architecture). At the most general level, my question is: which NoSQL database is most appropriate for this situation?

So far, I have considered two basic options:

  1. MongoDB - to take advantage of document-oriented storage, where everything about a patient is in one place. However, I am not sure about its performance on the larger binary data (images, time series).
  2. Cassandra - this may handle binary data better, but joins will be necessary (even if optimized by indexing all the data by patient id).
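For reference, a MongoDB-style patient document might look like the sketch below. All field names are made up for illustration. Note that MongoDB caps a single document at 16 MB, so binaries near the upper end of the stated range would typically go into GridFS and be referenced by id rather than embedded.

```python
# Hypothetical patient document for a document store such as MongoDB.
# Scalars, time series, and extracted features live together in one
# document; large binaries (e.g. bitmaps) would be stored separately
# via GridFS and referenced here by id.
patient_doc = {
    "_id": "patient-0001",
    "age": 64,
    "measurements": {
        "blood_pressure": {"timestamps": [0, 1, 2],
                           "values": [120, 124, 130]},
    },
    "scan_gridfs_id": "...",        # reference to a GridFS file, not embedded
    "features": {"bp_slope": 5.0},  # filled in by the preprocessing step
}
```

Keeping the extracted features in the same document as the raw data means the aggregation step never needs a join, which is the main appeal of the document model here.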



I think you need to define the process and desired outcome a little more. It sounds like you need to:

  1. Define what features you want.
  2. Figure out how to extract those features.
  3. Figure out how to store those features.
  4. Pass that dataset to a ML model for training.

I would pin down exactly what you want for #3, in terms of the number and types of data elements, and only then choose a storage method.

Unless you are going to pass the unstructured documents directly to your model, you don't need MongoDB's capabilities. 10,000 records is small, and since you mention wanting to calculate aggregate statistics on your patient-level data, you could likely get by with something as simple as MySQL or SQLite.
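To illustrate that last point: once the features are extracted into a flat table, the aggregation step is a one-line SQL query. The table and column names below are made up for the example, using Python's built-in sqlite3 module with an in-memory database standing in for a file-backed one.

```python
import sqlite3

# In-memory SQLite database; a real study would use a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE features (
                    patient_id INTEGER PRIMARY KEY,
                    age        INTEGER,
                    bp_slope   REAL)""")

# Rows produced by the per-patient preprocessing step.
conn.executemany("INSERT INTO features VALUES (?, ?, ?)",
                 [(1, 64, 5.0), (2, 51, 0.5)])

# The aggregation step is plain SQL - no distributed framework needed.
(avg_age,) = conn.execute("SELECT AVG(age) FROM features").fetchone()
print(avg_age)  # 57.5
```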

Spark and MapReduce are actually competitors, with Spark stealing MapReduce's spotlight lately. You might need one or the other for feature extraction, but they are probably overkill for the rest of what you described.
