Big Data - Data Warehouse Solutions?

I have a dozen databases that store different data, and each of them is 100 TB in size. All of the data is stored in AWS services such as RDS, Aurora, and DynamoDB.

I often need to perform "joins" across databases, for example when a student ID appears in multiple databases and I want to gather the data associated with it. Because the data is not located in the same database, the joins are usually done after the data is streamed out of each database, and this sometimes takes hours for just thousands of records.

Can services such as AWS Redshift or Google BigQuery "import" data from many data sources so that you can then run SQL queries to join them?

How about Hadoop and Hive, where we stream data out of the databases, place it as files in Hadoop, and let Hive query the data?

Topic: redshift, bigdata, databases

Category: Data Science


Can services such as AWS Redshift or Google BigQuery "import" data from many data sources so that you can then run SQL queries to join them?

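It depends on your data and the type of joins you are performing. That said, a warehouse such as Redshift will likely perform better for your use case: once the data is loaded into one system, the join runs inside a single engine instead of in your own code, and because Redshift is a column-oriented database, a query reads only the columns it references. Read this post and the associated answers to understand how columnar data stores handle data.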
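
To make the "import, then join" idea concrete, here is a minimal sketch in Python. It uses SQLite in-memory databases as stand-ins for two separate sources and for the warehouse, so it only illustrates the workflow, not Redshift or BigQuery themselves. The table names, columns, and sample rows are made up for illustration, and a real warehouse load would go through the service's own bulk-load path rather than row inserts.

```python
import sqlite3


def make_source(create_sql, insert_sql, rows):
    """Build an in-memory database standing in for one independent source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Hypothetical source 1: enrollment records keyed by student_id.
enrollment_db = make_source(
    "CREATE TABLE enrollment (student_id INTEGER, course TEXT)",
    "INSERT INTO enrollment VALUES (?, ?)",
    [(1, "math"), (2, "physics"), (3, "history")],
)

# Hypothetical source 2: billing records keyed by the same student_id.
billing_db = make_source(
    "CREATE TABLE billing (student_id INTEGER, balance REAL)",
    "INSERT INTO billing VALUES (?, ?)",
    [(1, 120.0), (3, 0.0), (4, 75.5)],
)

# The "warehouse": a single engine that holds a copy of both tables.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE enrollment (student_id INTEGER, course TEXT)")
warehouse.execute("CREATE TABLE billing (student_id INTEGER, balance REAL)")


def load(source, table, n_cols):
    """Copy every row of one table from a source into the warehouse."""
    placeholders = ", ".join("?" * n_cols)
    rows = source.execute(f"SELECT * FROM {table}")
    warehouse.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)


load(enrollment_db, "enrollment", 2)
load(billing_db, "billing", 2)


def joined_students():
    """Join the two imported tables on student_id inside the warehouse."""
    return warehouse.execute(
        """
        SELECT e.student_id, e.course, b.balance
        FROM enrollment AS e
        JOIN billing AS b ON b.student_id = e.student_id
        ORDER BY e.student_id
        """
    ).fetchall()


print(joined_students())
```

The point of the sketch is that the join is a single SQL statement executed by one engine after the data has been copied in, rather than a lookup loop in application code over rows streamed from each database.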

How about Hadoop and Hive?

Hadoop + Hive is mostly a DIY version, self-hosted or in the cloud, of what Redshift gives you as a managed cloud service.
