Normal vs Uniform Distribution for machine learning

I have a dataset that follows Zipf's law: most of the values are concentrated at one end, and the remaining values account for only a very small percentage of the items. Training on the dataset as is would introduce a bias, so I was thinking of restructuring the data into buckets. My model would then be a multi-class classification model rather than a regression model (I am training a NN).

My question is whether I should draw the buckets so that the distribution of items across them is uniform or normal. A uniform distribution would ensure that the NN sees the same number of examples for each bucket; however, some say that normal distributions work better for machine learning. Which one should I use?

Thanks



From my understanding, you are describing the distribution of target (outcome) values.

There are many options. You can directly transform the target values to follow a normal or uniform distribution while keeping this a regression problem. One example is scikit-learn's TransformedTargetRegressor, which applies a transformer to the target before fitting and inverts it at prediction time.
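As a rough sketch of the TransformedTargetRegressor route, the snippet below maps a heavy-tailed target to an approximately normal one with QuantileTransformer. The synthetic data, the choice of Ridge as the regressor (a stand-in for your NN), and the n_quantiles value are all assumptions made for illustration, not part of the original question.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): a positive, heavy-tailed target.
X = rng.normal(size=(500, 3))
y = np.exp(1.5 * X[:, 0] + 0.3 * rng.normal(size=500))

# The transformer reshapes the target toward a normal distribution before
# fitting; predictions are mapped back to the original scale automatically.
model = TransformedTargetRegressor(
    regressor=Ridge(),  # stand-in for a neural network regressor
    transformer=QuantileTransformer(
        n_quantiles=100,
        output_distribution="normal",  # or "uniform"
        random_state=0,
    ),
)
model.fit(X, y)
pred = model.predict(X)  # predictions on the original target scale
```

Switching output_distribution to "uniform" gives the other variant you asked about, so both can be compared on a validation set.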

You can also discretize (bin) the data.
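To illustrate the binning option, here is a minimal NumPy sketch that compares equal-frequency buckets (roughly the same number of items per bucket) with equal-width buckets on a skewed target. The Pareto-distributed sample and the choice of five buckets are assumptions used only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in target (assumption): continuous and heavy-tailed.
y = rng.pareto(a=1.5, size=1000) + 1.0
n_bins = 5

# Equal-frequency buckets: edges are quantiles of the target, so each
# bucket receives roughly the same number of items.
freq_edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))
freq_labels = np.digitize(y, freq_edges[1:-1])  # class ids 0 .. n_bins-1
freq_counts = np.bincount(freq_labels, minlength=n_bins)

# Equal-width buckets: edges are evenly spaced over the value range, so a
# skewed target piles almost everything into the first bucket.
width_edges = np.linspace(y.min(), y.max(), n_bins + 1)
width_labels = np.digitize(y, width_edges[1:-1])
width_counts = np.bincount(width_labels, minlength=n_bins)

print("equal-frequency counts:", freq_counts)
print("equal-width counts:    ", width_counts)
```

The equal-frequency labels give a balanced multi-class target, at the cost of very wide buckets in the tail of the distribution.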

The claim that "normal distributions work better for machine learning" typically refers to feature values, not target values.
