Exploratory data analysis (EDA) on large dataset

I am working with a lot of data (we have a table that grows by about 30 million rows daily). What is the best way to explore it (to do EDA on it)? Should I take a random fractional sample of the data (say 100,000 rows), select the first 100,000 rows, or use the entire dataset?

Thanks!

Topic: pyspark, deep-learning, scikit-learn, pandas, machine-learning

Category: Data Science


You mentioned data is added daily, so a lot depends on how your data is structured and whether recent data matters more than older data. If it does, it may be easiest to take a random sample from the most recent data; if you need to look across all dates, you could instead sample from several different time periods. The statistical answer also depends on how many variables you are looking at. Practically, you might start with a 'reasonable' number of rows that is easy to pull, do basic EDA (missing values, rules of thumb like ensuring you have a minimum row count for things like a regression), and then increase the sample size until every variable you care about has a recognizable distribution. What you often miss when taking random samples are outliers, so it is always useful to ask the business what they expect the upper and lower ranges to be; a rough sketch of this workflow follows below.
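A minimal PySpark sketch of this approach, assuming a hypothetical table named "events" with an "event_date" column; the table name, column names, window, and sampling fraction are placeholders to adapt to your own schema and target sample size.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # hypothetical table name

# Random sample from recent data: keep the last 30 days, then draw ~0.3% of rows
# at random (tune the fraction to the sample size you actually want).
recent = df.filter(F.col("event_date") >= F.date_sub(F.current_date(), 30))
sample = recent.sample(fraction=0.003, seed=42).cache()

# Basic EDA on the sample: missing-value counts per column
null_counts = sample.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in sample.columns]
)
null_counts.show()

# Min/max/mean give a first look at value ranges; compare them against
# the upper and lower bounds the business expects, to spot missing outliers.
sample.summary("min", "max", "mean").show()
```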


Generally it's easier to manipulate a subset of the data, but it's important to draw that subset as a random sample so that it is representative of the whole; see the sketch below.
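A short sketch contrasting "first N rows" with a random sample in PySpark, again assuming a hypothetical "events" table; the names and fraction are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("events")  # hypothetical table name

# Taking the head of the table depends on storage/insertion order and can be biased
first_100k = df.limit(100_000)

# A uniform random fraction is representative; ~0.3% of 30M rows is roughly 100k
random_sample = df.sample(fraction=0.003, seed=1)

# Once the sample is small enough, bring it into pandas for interactive EDA
pdf = random_sample.toPandas()
```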
