PSI: where not to use it

From what I understand, PSI (Population Stability Index) is used for continuous data. Generally, equal-sized bins are created to compare two data sets, and the number of buckets is usually 10. Is there a reason for that? Why 10 buckets? Also, can PSI be used for categorical data with fewer than 10 values? For categorical variables, what would be the best approach to estimate the shift in the population?

Topic descriptive-statistics

Category Data Science


In my experience, 10 or 20 is often used since these correspond to deciles or twentiles. People tend to have an intuitive understanding of deciles and twentiles. Often that understanding is wrong, but we think we know it, so using 10 or 20 is comfortable.

With too many buckets, noise creeps in: very small changes in the data can cause large changes in PSI. With too few buckets, the signal is hidden. Is 10 or 20 optimal? Probably not, from a statistical point of view, for every variable, but it is consistent and comfortable. There are other ways to bucket, but make sure your audience knows the method, or teach it to them.
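To make the noise point concrete, here is a minimal sketch that computes PSI for two samples drawn from the same distribution, so there is no real shift, using different bucket counts. The function names, the quantile-based cut points, and the 0.5 pseudo-count that keeps empty buckets from breaking the logarithm are all my own assumptions for illustration, not a standard.

```python
import bisect
import math
import random

def quantile_edges(baseline, n_buckets):
    """Interior cut points splitting the baseline into n_buckets roughly equal groups."""
    ordered = sorted(baseline)
    n = len(ordered)
    return [ordered[(i * n) // n_buckets] for i in range(1, n_buckets)]

def bucket_proportions(values, edges, pseudo_count=0.5):
    """Share of values per bucket. The first and last buckets are open-ended,
    so values outside the baseline range are still counted."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    # Assumption: a small pseudo-count avoids log(0) for empty buckets.
    total = len(values) + pseudo_count * len(counts)
    return [(c + pseudo_count) / total for c in counts]

def psi_continuous(baseline, current, n_buckets=10):
    """PSI of current against baseline, with buckets defined on the baseline."""
    edges = quantile_edges(baseline, n_buckets)
    expected = bucket_proportions(baseline, edges)
    actual = bucket_proportions(current, edges)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

rng = random.Random(0)
baseline = [rng.gauss(0, 1) for _ in range(1000)]
current = [rng.gauss(0, 1) for _ in range(1000)]  # same distribution, no real shift

for n_buckets in (5, 10, 20, 100):
    print(n_buckets, round(psi_continuous(baseline, current, n_buckets), 4))
```

With samples of this size, the PSI reported for 100 buckets should come out well above the one for 5 buckets even though nothing has changed, which is the noise described above.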

With categoricals you do not need to bucket. Each category is its own "bucket", if you will. With many categories you may want to combine some, but that is up to you, since you know the data.
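For a categorical variable, a rough sketch could look like the following, treating each category as its own bucket. The name `categorical_psi` and the `pseudo_count` parameter are hypothetical; the pseudo-count is one possible way (an assumption on my part) to handle a category that appears in only one of the two samples, where the plain formula is undefined.

```python
from collections import Counter
import math

def categorical_psi(baseline, current, pseudo_count=0.5):
    """PSI between two lists of category labels, one bucket per category."""
    base_counts = Counter(baseline)
    curr_counts = Counter(current)
    # Use every category seen in either sample so nothing is dropped.
    categories = set(base_counts) | set(curr_counts)
    k = len(categories)
    base_total = len(baseline) + pseudo_count * k
    curr_total = len(current) + pseudo_count * k
    total = 0.0
    for c in categories:
        # Assumption: the pseudo-count keeps both proportions strictly positive.
        e = (base_counts[c] + pseudo_count) / base_total
        a = (curr_counts[c] + pseudo_count) / curr_total
        total += (a - e) * math.log(a / e)
    return total

old = ["a"] * 50 + ["b"] * 50
new = ["a"] * 40 + ["b"] * 60
print(categorical_psi(old, new, pseudo_count=0))  # about 0.0405
```

If there are many rare categories, combining them into an "other" group before calling the function keeps the small counts from dominating the result.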

The PSI formula needs discrete probability distributions passed in. If all categories or n-tiles are accounted for and the quantities are normalized, then you have a discrete probability distribution. I have a check in my PSI functions for this case. I have seen bucketing go wrong, especially at the edges.
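A check of that kind might look like the sketch below. This is not my exact function, just an illustration using the usual PSI definition, the sum over buckets of (actual - expected) * ln(actual / expected). The tolerance `tol` and the choice to raise an error on empty buckets rather than smooth them are assumptions.

```python
import math

def psi(expected, actual, tol=1e-9):
    """PSI between two discrete probability distributions over the same buckets."""
    if len(expected) != len(actual):
        raise ValueError("distributions must have the same number of buckets")
    for name, dist in (("expected", expected), ("actual", actual)):
        if any(p < 0 for p in dist):
            raise ValueError(f"{name} contains a negative proportion")
        # Proportions that do not sum to 1 usually mean a bucket was lost,
        # for example values falling outside the outermost edges.
        if abs(sum(dist) - 1.0) > tol:
            raise ValueError(f"{name} does not sum to 1")
    if any(p == 0 for p in expected) or any(p == 0 for p in actual):
        raise ValueError("empty bucket: PSI is undefined, merge buckets or smooth")
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

print(psi([0.5, 0.5], [0.4, 0.6]))  # about 0.0405
```

Failing loudly when the proportions do not sum to 1 is a cheap way to catch edge problems before they show up as a misleading PSI value.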
