Exploratory data analysis (EDA) on large dataset

I am working with a lot of data (we have a table that grows by about 30 million rows daily). What is the best way to explore it (to do EDA on it)? Should I take a random fractional sample of the data (say 100,000 rows), select the first 100,000 rows, or use the entire dataset?

Thanks!

Topic: pyspark, deep-learning, scikit-learn, pandas, machine-learning

Category: Data Science


You mentioned data is added daily, so a lot depends on how your data is structured and whether recent data matters more than older data. If it does, it may be easiest to take a random sample from the most recent data; if you need to look across all dates, you could instead sample from several different time periods. The statistical answer also depends on how many variables you are looking at. Practically, you might start with a 'reasonable' number of rows that is easy to pull, do basic EDA (missing values, rules of thumb like ensuring you have a minimum row count for things like a regression), and then increase the sample size until every variable you care about has a recognizable distribution. What you often miss when taking random samples are outliers, so it is always useful to ask the business what they expect the upper and lower ranges to be; a rough sketch of this workflow follows below.
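A minimal PySpark sketch of this approach, assuming a hypothetical table named "events" with an "event_date" column; the table name, column names, window, and sampling fraction are placeholders to adapt to your own schema and target sample size.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # hypothetical table name

# Random sample from recent data: keep the last 30 days, then draw ~0.3% of rows
# at random (tune the fraction to the sample size you actually want).
recent = df.filter(F.col("event_date") >= F.date_sub(F.current_date(), 30))
sample = recent.sample(fraction=0.003, seed=42).cache()

# Basic EDA on the sample: missing-value counts per column
null_counts = sample.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in sample.columns]
)
null_counts.show()

# Min/max/mean give a first look at value ranges; compare them against
# the upper and lower bounds the business expects, to spot missing outliers.
sample.summary("min", "max", "mean").show()
```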


Generally it's easier to manipulate a subset of the data, but it's important to draw that subset as a random sample so that it is representative of the whole; see the sketch below.
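A short sketch contrasting "first N rows" with a random sample in PySpark, again assuming a hypothetical "events" table; the names and fraction are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # hypothetical table name

# Taking the head of the table depends on storage/insertion order and can be biased
first_100k = df.limit(100_000)

# A uniform random fraction is representative; ~0.3% of 30M rows is roughly 100k
random_sample = df.sample(fraction=0.003, seed=1)

# Once the sample is small enough, bring it into pandas for interactive EDA
pdf = random_sample.toPandas()
```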
