To data warehouse or not to data warehouse?

I was wondering if you would be so kind as to assist me with a quick question (I will be happy to explain more if you are willing to...). I am researching and setting up a system to run a machine learning job (training) to find correlations between a user's social media information (or other digital trails, from wearables etc.) and their scores on personality tests.

The scores are in my PostgreSQL database (on AWS), and I need to decide how to store the social media and wearable digital-trail information (unstructured and structured). I was thinking of DynamoDB.

I was also thinking of integrating both databases under Amazon Redshift and doing the analytics (using RapidMiner) from there. Does it all make sense? Do I really need a data warehouse for this? Would it be more sensible to use just a single DB (PostgreSQL or DynamoDB) for all this, without data warehousing? I am talking about up to 100K records, more or less (for the training). Future data will be in the millions.

I get so many conflicting answers, and I would really appreciate your advice. Thank you so much in advance!

Topic redshift

Category Data Science


The main purpose of a data warehouse is to aggregate different types of data and columns quickly (near real time). Storage capacity isn't the problem data warehousing is trying to solve. I can't really answer your question, since I don't know the volume of analytics you want to perform, but if it's for training a model over and over (online learning), just set up a whole pipeline to apply the transformations you want (the ETL part), model, train, and run your predictive method as many times as you want.
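
To make the pipeline idea a bit more concrete, here is a minimal Python sketch of the extract, transform, and "train" steps. Everything in it is an assumption for illustration: the in-memory lists stand in for the PostgreSQL scores table and the digital-trail store, the field names (`user_id`, `extraversion`, `kind`) are hypothetical, and a plain Pearson correlation stands in for whatever model you actually train.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical sample data. In practice SCORES would be read from the
# PostgreSQL table and EVENTS from the digital-trail store.
SCORES = [
    {"user_id": 1, "extraversion": 2.0},
    {"user_id": 2, "extraversion": 3.0},
    {"user_id": 3, "extraversion": 4.0},
    {"user_id": 4, "extraversion": 5.0},
]
EVENTS = (
    [{"user_id": 1, "kind": "post"}]
    + [{"user_id": 2, "kind": "post"}] * 2
    + [{"user_id": 3, "kind": "post"}] * 3
    + [{"user_id": 4, "kind": "post"}] * 4
    + [{"user_id": 1, "kind": "step_count"}]  # ignored by transform()
)


def extract():
    """Extract step: return the raw records from both sources."""
    return SCORES, EVENTS


def transform(scores, events):
    """Transform step: count posts per user and join them to the scores."""
    posts = defaultdict(int)
    for event in events:
        if event["kind"] == "post":
            posts[event["user_id"]] += 1
    return [(posts[s["user_id"]], s["extraversion"]) for s in scores]


def pearson(pairs):
    """Pearson correlation of (x, y) pairs; a stand-in for real training."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


def run_pipeline():
    """Run extract -> transform -> analyse; safe to re-run as data grows."""
    scores, events = extract()
    return pearson(transform(scores, events))


if __name__ == "__main__":
    print(run_pipeline())
```

The point of the sketch is only that the join and aggregation happen inside the pipeline itself, so at around 100K records a separate warehouse is not required just to bring the two sources together.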
