Problem with binning

Question

Ricky

March 8, 2021, 14:16

I am trying to convert continuous data to categorical data by binning. I know two techniques: (i) equal-width bins and (ii) bins with an equal number of elements. My questions are:

Which type of binning is appropriate for which kind of problem?
I use pandas for my data analysis, and it has pd.cut for arbitrary binning, which I use for equal-width bins, and pd.qcut for bins with an equal number of elements. The second function always produces very complicated bin boundaries (like [(-28.004, 795.8976], (795.8976, 900.342]]). Is there any way to control the bin boundaries so that they look more meaningful to non-technical people?

Thanks in advance.

Topic feature-engineering numerical data categorical-data

Category Data Science

antounes · Accepted Answer · March 8, 2021, 14:16

The two methods you mention, equal-width and equal-frequency binning, are both forms of unsupervised binning. Supervised binning, by contrast, uses the class labels and broadly tries to form bins in which most instances share the same class label.

For both types of unsupervised binning, the best approach is still to try each one and choose based on the histogram you get. If equal-frequency bins do not divide your data well, equal-width bins may help, and vice versa.
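To make that comparison concrete, here is a minimal sketch that bins the same data both ways and prints the per-bin counts. The sample is hypothetical (a right-skewed exponential draw), and the choice of four bins is just an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample; substitute your own column here.
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=100.0, size=1000))

# Equal-width bins: each bin spans the same range of values.
equal_width = pd.cut(values, bins=4)

# Equal-frequency bins: each bin holds roughly the same number of rows.
equal_freq = pd.qcut(values, q=4)

# Count rows per bin, keeping the bins in their natural order.
width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)

print(width_counts)
print(freq_counts)
```

On skewed data like this, the equal-width counts tend to pile up in the first bin while the equal-frequency counts stay balanced, which is usually the kind of difference you would look for in the histogram.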

As for pandas, you can pass a precision argument to qcut, which should return more "comprehensible" bin limits, as shown below:

>>> import numpy as np
>>> import pandas as pd

>>> array = np.random.randn(10)  # random data, so the exact limits will differ on each run

>>> pd.qcut(array, q=4, precision=3)
Categories (4, interval[float64]): [(-1.8889999999999998, -0.732] < (-0.732, -0.136] < (-0.136, 0.973] < (0.973, 1.543]]

>>> pd.qcut(array, q=4, precision=0)
Categories (4, interval[float64]): [(-3.0, -1.0] < (-1.0, -0.0] < (-0.0, 1.0] < (1.0, 2.0]]
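If rounding via precision is still not readable enough, another option is to compute the quantile edges yourself, snap them to "friendly" numbers, and pass them to pd.cut with explicit labels. The following is only a sketch under assumptions: the data is made up, and step (the granularity the edges are snapped to) and the label wording are hypothetical choices, not anything pandas prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical data; substitute your own column here.
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=800.0, scale=150.0, size=500))

step = 50  # assumed "friendly" granularity for the bin edges

# Start from the quartile edges that qcut would use.
raw_edges = values.quantile([0.0, 0.25, 0.5, 0.75, 1.0]).to_numpy()

# Snap every edge to the nearest multiple of `step`.
edges = np.round(raw_edges / step) * step

# Widen the outer edges so every value still falls inside a bin.
edges[0] = np.floor(values.min() / step) * step
edges[-1] = np.ceil(values.max() / step) * step

# Rounding can produce duplicate edges; np.unique drops them and sorts.
edges = np.unique(edges)

# Plain-language labels instead of interval notation.
labels = [f"{int(lo)} to {int(hi)}" for lo, hi in zip(edges[:-1], edges[1:])]

binned = pd.cut(values, bins=edges, labels=labels, include_lowest=True)
print(binned.value_counts(sort=False))
```

The trade-off is that the bins no longer hold exactly the same number of elements, only approximately, so it is worth checking the counts after snapping.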
