Histograms in Machine Learning

I have a large data set with over 100k samples, and I want to predict a continuous target from 4 other continuous features using scikit-learn. For this project, I would like to visualize and analyze the data using both one-dimensional and two-dimensional histograms. I know how to plot histograms and I know what a histogram displays mathematically, but how can I make good use of them to analyze my data?

One thing that comes to mind is that I could spot regions with outliers, but this doesn't seem so useful/efficient (correct me if I'm wrong).

So what are useful ways to use histograms for analyzing Machine Learning data?

Thanks

Topic histogram scikit-learn pandas python machine-learning

Category Data Science


Beyond simple histograms, I would suggest visualizing how the variables are associated with each other using a pair plot from seaborn.pairplot(). This lets you check how correlated your explanatory variables are with each other. If they are strongly correlated, multicollinearity can become a problem, which you can address with dimensionality reduction, for example.
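As a rough sketch of that suggestion, the snippet below builds a synthetic data frame with four features and a target (the column names x1 to x4 and target, the helper names, and the data itself are all made up for illustration), draws a random subsample so the plot stays fast with 100k rows, and prints the correlation matrix. The plot_pairs helper then calls seaborn.pairplot() with histograms on the diagonal.

```python
import numpy as np
import pandas as pd


def make_demo_frame(n=100_000, seed=0):
    """Synthetic stand-in for the real data (hypothetical columns)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = x1 + 0.3 * rng.normal(size=n)   # deliberately correlated with x1
    x3 = rng.exponential(size=n)         # skewed feature
    x4 = rng.uniform(-1.0, 1.0, size=n)
    target = 2.0 * x1 - x3 + 0.5 * rng.normal(size=n)
    return pd.DataFrame(
        {"x1": x1, "x2": x2, "x3": x3, "x4": x4, "target": target}
    )


def subsample(df, n=2_000, seed=0):
    """Random subset so the pair plot does not draw every row."""
    return df.sample(n=min(n, len(df)), random_state=seed)


def plot_pairs(df):
    """Scatter plots for every pair of columns, histograms on the diagonal."""
    import seaborn as sns  # imported here so the rest runs without seaborn

    return sns.pairplot(df, diag_kind="hist", plot_kws={"s": 5, "alpha": 0.3})


df = make_demo_frame()
sample = subsample(df)

# Pairwise Pearson correlations; large off-diagonal values between
# features hint at multicollinearity.
print(df.corr().round(2))
```

Calling plot_pairs(sample) followed by matplotlib.pyplot.show() displays the grid. Subsampling is only a convenience here; the correlation matrix is computed on the full frame.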

Outliers might not be a problem, but you can't tell before running a model. I suggest running the same model more than once, with and without the outliers, and comparing the results. Also, always normalize your data, since this can affect the "outlierness" of an observation.
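A minimal sketch of that comparison with scikit-learn follows. Everything specific in it is an assumption made for illustration: the synthetic data, the injected extreme rows, the rule that flags a row as an outlier when any feature lies more than 3 standard deviations from its mean, and the choice of a linear model scored by cross-validated R². The scaler sits inside the pipeline so it is refit on each training fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 4 features, linear target plus noise (illustration only).
X = rng.normal(size=(5000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=5000)

# Inject a few extreme rows so there is something to remove.
X[:25] *= 10.0
y[:25] += 50.0


def inlier_mask(X, threshold=3.0):
    """True for rows where every feature is within `threshold` std of its mean."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return (z < threshold).all(axis=1)


def cv_r2(X, y):
    """Mean 5-fold cross-validated R^2 of a scaled linear regression."""
    model = make_pipeline(StandardScaler(), LinearRegression())
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=folds, scoring="r2").mean()


mask = inlier_mask(X)
score_all = cv_r2(X, y)
score_clean = cv_r2(X[mask], y[mask])

print(f"rows kept: {mask.sum()} of {len(X)}")
print(f"R^2 with outliers:    {score_all:.3f}")
print(f"R^2 without outliers: {score_clean:.3f}")
```

If the two scores are close, the outliers are probably not worth special treatment for that model; a large gap suggests looking at the flagged rows more closely before deciding to drop them.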
