Normal vs Uniform Distribution for machine learning

Question

Michael Pulis

March 31, 2022, 02:04

I have a dataset that follows Zipf's law: the majority of the values are concentrated at one end, and the remaining values account for only a very small percentage. Training on the dataset as is would introduce a bias, so I was thinking of restructuring the data to fall into buckets. My model would then be a multi-class classification model rather than a regression model (I am training a NN).

My question is whether I should draw up the buckets such that the distribution of the items is uniform or normal. A uniform distribution would ensure that the NN sees the same number of examples for each bucket; however, some say that normal distributions work better for machine learning. Which one should I use?

Thanks

Topic distribution data machine-learning

Category Data Science

Brian Spiering · Accepted Answer · October 28, 2021, 15:40

From my understanding, you are describing the distribution of target (outcome) values.

There are many options. You can directly transform the target values to follow a normal or uniform distribution while keeping the task a regression problem. One example is scikit-learn's TransformedTargetRegressor.
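As a rough illustration of the transformed-target approach, here is a minimal sketch. The synthetic data, the choice of LinearRegression as the regressor, and the choice of QuantileTransformer as the target transformer are all assumptions made for the example and are not taken from the question.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Hypothetical data: three features and a skewed, heavy-tailed target.
X = rng.normal(size=(500, 3))
y = np.exp(X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.1, size=500))

# The transformer maps the target to an approximately normal distribution
# before fitting; use output_distribution="uniform" for a uniform target.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=0
    ),
)
model.fit(X, y)

# Predictions are mapped back to the original target scale automatically.
pred = model.predict(X)
```

The model still outputs a continuous value, so the task stays a regression problem; only the scale the regressor is trained on changes.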

You can also discretize (bin) the data.
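To make the binning option concrete, here is a small standard-library sketch comparing equal-frequency (quantile) buckets, which give a uniform class distribution, with equal-width buckets. The Pareto-distributed sample and the choice of five buckets are assumptions standing in for the Zipf-like targets in the question.

```python
import random
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

random.seed(0)

# Hypothetical heavy-tailed targets standing in for the real data.
targets = [random.paretovariate(1.5) for _ in range(1000)]
n_bins = 5

# Equal-frequency buckets: quantiles() returns the n_bins - 1 interior cut points.
edges = quantiles(targets, n=n_bins)
labels = [bisect_right(edges, t) for t in targets]
counts = Counter(labels)

# Equal-width buckets for comparison: most values land in the first bucket.
lo, hi = min(targets), max(targets)
width = (hi - lo) / n_bins
width_labels = [min(int((t - lo) / width), n_bins - 1) for t in targets]
width_counts = Counter(width_labels)

print("equal-frequency:", sorted(counts.items()))
print("equal-width:    ", sorted(width_counts.items()))
```

With quantile edges each class gets roughly the same number of examples, whereas equal-width edges reproduce the original imbalance as class imbalance.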

The claim that "normal distributions work better for machine learning" typically refers to feature values, not target values.
