Transforming Categorical to Numerical variable

I have a categorical variable with 4 levels ('8 c', '6 c', 'NAN', 'Others') and I want to convert it to numerical form. An obvious way is to simply remove the 'c' part from the first two categories and replace NAN with 0. However, I am not sure what to do with the 'Others' level.

What could be the best way to transform this level? Please note that the variable represents the number of cylinders for a given car.

Topic transformation feature-engineering numerical categorical-data

Category Data Science


I spent some time exploring this dataset, and there are some findings I want to share with you:

  • The dataset contains 426880 samples.
  • The categories in the cylinder column are: 3, 4, 5, 6, 8, 10, 12, others, and blank cells. You can take a look at the cylinder list at the beginning.
  • There are no cars with 7, 9, or 11 cylinders, so 'others' most likely covers 1 or 2 cylinders.
  • 1298 samples have the 'others' value in the cylinder column, which is about 0.3% of the total number of samples.
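The counts above can be reproduced with a couple of pandas calls. This is only a sketch on a tiny made-up frame: the column name `cylinders` and the label spellings are assumptions, and the real table would be loaded from the dataset file instead.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real table (values are made up).
df = pd.DataFrame(
    {"cylinders": ["8 c", "6 c", None, "Others", "6 c", "4 c"]}
)

# Frequency of every level, including blank cells (NaN).
counts = df["cylinders"].value_counts(dropna=False)

# Fraction of rows labelled 'Others'; blank cells compare as False.
share_others = (df["cylinders"] == "Others").mean()

print(counts)
print(f"share of 'Others': {share_others:.1%}")
```

On the full dataset the same two lines would show how rare 'others' is compared with blank cells, which helps decide between dropping and imputing.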

You can handle missing values and 'others' as follows:

  1. Others: because so few samples contain 'others', you could simply remove them. Alternatively, you could replace them with 1 or 2 cylinders, where 2 is the more common of the two.

  2. Blank cells: from my limited knowledge of cars, cars from the same manufacturer, with the same model and the same type of fuel, probably have the same number of cylinders. You can therefore fill a blank value with the number of cylinders of matching cars whose value is known. Please see the examples below:

[Images: three example screenshots of matching rows from the dataset]

  • If the car model itself is missing, I recommend removing those samples (there are 2673 of them).
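A minimal sketch of the blank-cell approach described above, assuming the columns are called `manufacturer`, `model`, `fuel`, and `cylinders` (these names and all values are illustrative, not taken from the dataset). Rows without a model are dropped first, then each blank is filled with the most frequent cylinder count among matching cars.

```python
import pandas as pd

# Hypothetical rows; cylinders already parsed to numbers, None = blank cell.
df = pd.DataFrame(
    {
        "manufacturer": ["ford", "ford", "ford", "bmw", "bmw", "kia"],
        "model": ["f-150", "f-150", "f-150", "x5", "x5", None],
        "fuel": ["gas", "gas", "gas", "gas", "gas", "gas"],
        "cylinders": [8, 8, None, 6, None, None],
    }
)

# Rows with no model cannot be matched to other cars, so drop them.
df = df.dropna(subset=["model"]).copy()


def group_mode(values: pd.Series) -> float:
    """Most frequent known value in the group, or NaN if none is known."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else float("nan")


# Per-row fill value: the mode of the row's (manufacturer, model, fuel) group.
fill = df.groupby(["manufacturer", "model", "fuel"])["cylinders"].transform(group_mode)
df["cylinders"] = df["cylinders"].fillna(fill)

print(df)
```

Groups in which no car has a known value stay blank, so a fallback (dropping them or treating them as unknown) is still needed.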

In this case I would suggest two steps as part of your data preparation:

  • Replace 'NAN' with 'Others', since neither label gives you any information and both can be treated as unknown values.
  • Once you are down to 3 labels ('8 c', '6 c', 'Others'), apply one-hot encoding. With only 3 possible categories your dataset will not become too sparse, and you avoid assuming that the unknown values mean 0 cylinders.
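The two steps could look roughly like this in pandas. It is a sketch that assumes the missing level is stored as the literal string 'NAN'; if it is a real missing value, `fillna("Others")` would take the place of `replace`. The `cyl` prefix is an arbitrary choice.

```python
import pandas as pd

# Hypothetical sample of the column, using the labels from the question.
cylinders = pd.Series(["8 c", "6 c", "NAN", "Others", "6 c"], name="cylinders")

# Step 1: fold 'NAN' into 'Others' so both count as "unknown".
merged = cylinders.replace("NAN", "Others")

# Step 2: one-hot encode the three remaining labels as 0/1 columns.
encoded = pd.get_dummies(merged, prefix="cyl", dtype=int)

print(encoded)
```

Each row ends up with exactly one column set to 1, so the unknown group is kept as its own indicator instead of being coded as a cylinder count.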

It really depends on what your variable refers to and on which kind of model you want to use.

A few things you can do:

  • One-hot encoding: creates a binary variable for each possible value of your variable. In your case it will create 4 variables, '8 c', '6 c', 'NAN', and 'Others', each taking the value 1 or 0. This way every possible value becomes its own binary variable, independent of the others. Example: 'Var' = '8 c' becomes '8 c' = 1, '6 c' = 0, 'NAN' = 0, 'Others' = 0.

  • Manual coding with order: you can map the values yourself, for example '8 c' to 1, '6 c' to 2, 'NAN' to 3, and 'Others' to 4. However, this tells your model that '8 c' and '6 c' are closer to each other than '8 c' and 'Others' are. This method is meant for ordinal values (when V1 > V2 > ...).

  • Target encoding: often used when the variable has too many possible values and you cannot afford one-hot encoding (which creates as many variables as there are possible values). Target encoding replaces each value with a number derived from its relationship to the target variable (y in your model). I don't recommend it in your case, since you only have 4 possible values.
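To make the three options concrete, here is a small pandas sketch. The `price` target and all values are made up for illustration, and the target encoding shown is the simplest possible variant (the mean of the target per level).

```python
import pandas as pd

# Hypothetical data: the four levels from the question plus a made-up target.
df = pd.DataFrame(
    {
        "cyl": ["8 c", "6 c", "NAN", "Others", "8 c", "6 c"],
        "price": [30000, 20000, 15000, 10000, 34000, 22000],
    }
)

# One-hot encoding: one 0/1 column per level.
one_hot = pd.get_dummies(df["cyl"], dtype=int)

# Manual coding with order: an explicit mapping, which implies an ordering.
order = {"8 c": 1, "6 c": 2, "NAN": 3, "Others": 4}
df["cyl_ordinal"] = df["cyl"].map(order)

# Target encoding: replace each level with the mean target of that level.
df["cyl_target"] = df.groupby("cyl")["price"].transform("mean")

print(one_hot)
print(df)
```

Computing the target means on the same rows the model is trained on leaks the target into the feature, so in practice they are usually estimated on the training split only.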
