How to get a (descriptive) overview of a large database?

I'm working with a data frame with

  1. ~20k observations and
  2. 151 variables across
  3. 2078 subjects

At first I am primarily interested in what the data look like with respect to a single parameter. But I cannot put 2078 subjects on the x-axis of a bar plot or anything similar.

What methods would be useful in such a situation? I would prefer visualizations, but I suspect they won't be applicable here. I'm afraid even non-visual methods won't be very helpful either.

Topic aggregation descriptive-statistics ggplot2 visualization r

Category Data Science


There's no way to produce a complete summary of a dataset this large. You have to decide what is likely to be relevant, break it down into more specific pieces of information, and then find the best way to visualize each part on its own.

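A good first step is to plot the distribution of the parameter of interest across subjects and/or observations, for example as a histogram or density plot. That shows the overall shape of the data without needing one bar per subject.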
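The idea of looking at the distribution rather than at each subject can be sketched as follows. This is a minimal illustration in plain Python, not the asker's actual data: the records, the per-subject baseline, and the `histogram` helper are all made up for the example. It collapses the observations to one mean per subject and then bins those means.

```python
import random
import statistics
from collections import defaultdict

random.seed(0)

# Hypothetical stand-in data: (subject_id, value) pairs for one parameter.
N_SUBJECTS = 2078
records = []
for subject in range(N_SUBJECTS):
    baseline = random.gauss(50, 10)          # made-up subject-specific level
    for _ in range(random.randint(5, 15)):   # roughly 10 observations each
        records.append((subject, baseline + random.gauss(0, 5)))

# Collapse the observations to one number per subject (here: the mean).
by_subject = defaultdict(list)
for subject, value in records:
    by_subject[subject].append(value)
subject_means = [statistics.mean(v) for v in by_subject.values()]

def histogram(values, n_bins=10):
    """Return (bin_lower_edge, count) pairs for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0        # avoid zero width if all equal
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, c) for i, c in enumerate(counts)]

# Crude text histogram: one '#' per 10 subjects in the bin.
bins = histogram(subject_means)
for edge, count in bins:
    print(f"{edge:7.1f} | {'#' * (count // 10)}")

# Quartiles give a compact numeric summary alongside the picture.
q1, median, q3 = statistics.quantiles(subject_means, n=4)
print(f"quartiles: {q1:.1f}, {median:.1f}, {q3:.1f}")
```

The same two steps (aggregate per subject, then plot the distribution of the aggregates) carry over directly to a histogram in a plotting library; the text output here only keeps the sketch dependency-free.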

If you want to look at the individual level and there are too many subjects to show at once, you can simply pick a random subset (say 100 subjects) and plot those. Then repeat with a different random subset: patterns that show up in every subset are probably real, while those that appear only once are more likely due to chance.
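A minimal sketch of the repeated-subset idea, again in plain Python with made-up data. The `data` dictionary and the `sample_subjects` helper are hypothetical names introduced for this example, and the subset size of 100 is just the figure suggested above.

```python
import random
import statistics

random.seed(1)

# Hypothetical stand-in data: one list of values per subject.
data = {
    subject: [random.gauss(50, 10) for _ in range(10)]
    for subject in range(2078)
}

def sample_subjects(data, k, rng):
    """Pick k distinct subjects at random and return their data."""
    chosen = rng.sample(sorted(data), k)
    return {s: data[s] for s in chosen}

# A dedicated generator makes the two draws reproducible.
rng = random.Random(42)
subset_a = sample_subjects(data, 100, rng)
subset_b = sample_subjects(data, 100, rng)

# Summarize each subset the same way and compare the results side by side.
for name, subset in (("A", subset_a), ("B", subset_b)):
    means = [statistics.mean(v) for v in subset.values()]
    print(
        f"subset {name}: median of subject means = "
        f"{statistics.median(means):.1f}, "
        f"range = {min(means):.1f} to {max(means):.1f}"
    )
```

Each subset is small enough to plot one subject at a time. If the two summaries (or the two plots) disagree noticeably, that difference is a rough indication of how much of what you see is sampling noise.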
