Purpose of converting continuous data to categorical data

I was reading through a notebook tutorial working with the Titanic dataset, linked here, and noticed that they strongly favored ordinal data over continuous data.

For example, they converted both the Age and Fare features into ordinal data bins.

I understand that categorizing data like this is helpful when doing data analytics manually, as fewer categories make the data easier to understand from a human perspective. But intuitively, I would think that doing this would cause our data to lose precision, thus leading to our model losing precision or accuracy.

Can someone explain when converting numerical data to ordinal data is appropriate, and the underlying statistics of why it is effective?



Converting numerical data into categorical data requires familiarity with the dataset. For example, in the Titanic dataset you mention, the age or class of a passenger carries predictive power, but how?

Ticket fare is based on class, and different classes were probably on different decks. So in essence, fare is a categorical feature.

For age, you would not expect a different survival probability for a 9-year-old and a 10-year-old, given every other feature (class, gender, etc.) is the same. It is important to visualize the data and look for natural inflection points.
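As a minimal sketch of that kind of check, assuming the standard Titanic `train.csv` with `Age` and `Survived` columns (the bin edges here are illustrative, not prescribed):

```python
import pandas as pd

# Assumed: the standard Titanic training file with "Age" and "Survived" columns.
df = pd.read_csv("train.csv")

# Bin ages at candidate cutpoints and inspect the survival rate per bin;
# jumps between adjacent bins hint at natural inflection points.
age_bins = pd.cut(df["Age"], bins=[0, 5, 12, 18, 35, 60, 100])
print(df.groupby(age_bins, observed=True)["Survived"].agg(["mean", "count"]))
```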


Your intuition is generally correct: in many cases, premature discretization of continuous variables is undesirable. Doing so throws away potentially meaningful data, and the result can be highly dependent on exactly how you bucket the continuous variables, which is usually done rather arbitrarily. Bucketing people by age decade, for example, implies that there is more similarity between a 50-year-old and a 59-year-old than there is between a 59-year-old and a 60-year-old. There can be some advantages in statistical power to doing this, but if your binning doesn't reflect natural cutpoints in the data, you may just be throwing away valuable information.
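To make the decade example concrete, here is a tiny pandas illustration (the ages are arbitrary):

```python
import pandas as pd

# Decade bins put 50 and 59 in the same bucket, but split 59 and 60,
# even though 59 and 60 are the closer pair.
ages = pd.Series([50, 59, 60])
print(pd.cut(ages, bins=range(0, 101, 10), right=False))
# -> [50, 60), [50, 60), [60, 70)
```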

You can find a very similar question here:

https://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable?noredirect=1&lq=1


Your question is very broad and I can only answer part of it. Most importantly, there is no way to say that one or the other is always better. It depends on the method you work with and (often) also on the data. Let's consider two examples.

1) Think about neural nets. They often work better when the input features do not have too much variance, which is one reason data is commonly scaled and/or normalized. Transforming continuous features into categorical ones can serve a similar purpose here.
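A minimal sketch of both options with scikit-learn, using a few made-up high-variance fare values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Illustrative high-variance fares (values are made up for the example).
fares = np.array([[7.25], [13.00], [71.28], [512.33]])

# Option 1: rescale to zero mean, unit variance.
scaled = StandardScaler().fit_transform(fares)

# Option 2: discretize into a small number of ordinal bins.
binned = KBinsDiscretizer(n_bins=3, encode="ordinal",
                          strategy="quantile").fit_transform(fares)
print(scaled.ravel(), binned.ravel())
```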

2) Think about linear regression. You need to specify the functional form in your regression equation to capture the data-generating process well. Let's say you have "age" as a feature and it does not have a linear relation to your y. You may try a quadratic form or other transformations. However, if you generate age classes (say, 10-year intervals), you can add these classes as "dummies" to your model and you don't need to worry as much about the parameterization of "age" as a feature (dummies work similarly to regression splines in this case).
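Here is a minimal sketch of that dummy approach, on made-up data with a deliberately non-linear age effect:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data: the age effect is a step function, not linear.
rng = np.random.default_rng(0)
age = rng.uniform(0, 80, size=500)
y = np.where(age < 16, 1.0, 0.3) + rng.normal(0, 0.1, size=500)

# Bin age into 10-year classes and add the classes as dummy variables,
# instead of guessing a functional form for age.
age_bins = pd.cut(age, bins=range(0, 81, 10))
X = pd.get_dummies(age_bins, drop_first=True)
model = LinearRegression().fit(X, y)
print(dict(zip(X.columns.astype(str), model.coef_.round(2))))
```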

So it really depends on the problem. In practice you need to try different representations, contingent on your model and the data. Also, don't take Kaggle kernels too seriously. They often provide good examples, but most of them are really just hands-on demonstrations.
