Handling encoding of a dataset with more than 2000 columns

Whenever we have a dataset to pre-process before feeding it to a model, we convert the categorical values to numerical values, generally using techniques such as label encoding or one-hot encoding. But all of these are applied manually, going through each column.

But what if our dataset is huge in terms of columns (e.g. 2000 columns)? Here it won't be possible to go through each column manually. In such cases, how do we handle encoding?

Are there any specific libraries available that deal with automatic encoding of variables? I know of category_encoders, which provides different encoding techniques, but how do we use it under the condition mentioned above?

Topic categorical-encoding encoding

Category Data Science


There are different types of categorical data, like ordinal and nominal, and even within those there are subcategories like high-cardinality and low-cardinality variables. So you have to know what kinds of categorical variables you have in your data, because different types of variables need different encoding techniques. You cannot apply one-hot encoding (or any other single encoding, for that matter) to all your variables.
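To illustrate the distinction, here is a minimal sketch in pandas (column names and category orders are made up for the example): an ordinal column gets integer codes that preserve its order, while a nominal column gets one-hot columns, because integer codes would impose a fake order on it.

```python
import pandas as pd

# Toy frame with one ordinal and one nominal column (names are illustrative).
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],  # ordinal: has a natural order
    "color": ["red", "blue", "red", "green"],       # nominal: no inherent order
})

# Ordinal variable: map categories to integers that respect the order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# Nominal variable: one-hot encode instead.
df = pd.get_dummies(df, columns=["color"])
```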

Now, once you know what kinds of variables you have, you can apply the relevant technique directly to those columns only, using the library you mentioned, category_encoders. Suppose you have 5 columns that need one-hot encoding: you do not apply one-hot encoding to each of the 5 separately. Just mention the column names when creating the encoder and it will be applied automatically.


To encode your variables correctly you have to know what those variables represent. An algorithm would need to somehow understand the type of each variable to encode it automatically. So you should have a data dictionary describing the variable types (sometimes dataset guides, readme files, or accompanying text files contain this). Alternatively, you may know that all the variables are of the same type, so you can apply the same encoding to all of them. If you don't have these or similar sources of information about the data, automatic encoding is not possible (unless you have a perfect model that categorizes the variables according to their type :)) ).

Although you cannot categorize the types of categorical variables automatically, it is possible to distinguish continuous (and also discrete) variables from categorical ones. When I face a similar situation with lots of variables, one of the very first things I do is compute the count and percentage of distinct values for every variable. For example, if a variable with 200,000 samples has ~154,000 distinct values, then it is a continuous (or discrete) variable (unless it genuinely has 154,000 categories, which is hardly plausible). If a variable with 200,000 samples has 13 distinct values, then it is a categorical variable for sure. Using similar tricks you can identify the categorical variables. After that, however, analyzing the categorical variables one by one is inevitable. Once you categorize them within themselves, e.g. rank variables, nominal variables, etc., you can encode each variable type all together.
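The distinct-value heuristic above can be sketched in a few lines of pandas. The function name and the 5% cut-off are assumptions for illustration; in practice you would tune the threshold to your data and eyeball the borderline columns.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame, cat_threshold: float = 0.05) -> pd.DataFrame:
    """Count distinct values per column and guess a rough type.

    cat_threshold is an assumed cut-off: columns whose distinct-value
    ratio falls below it are flagged as likely categorical.
    """
    n = len(df)
    stats = pd.DataFrame({
        "n_distinct": df.nunique(),
        "pct_distinct": df.nunique() / n,
    })
    stats["likely_type"] = stats["pct_distinct"].apply(
        lambda p: "categorical" if p < cat_threshold else "continuous/discrete"
    )
    return stats
```

Running this on a frame with 200,000 rows would flag the ~154,000-distinct-value column as continuous/discrete and the 13-distinct-value column as categorical, matching the reasoning above.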
