How should I OneHotEncode a column (8128 rows) with 2058 nuniques?
The title, pretty much.
I just want to know the best and most efficient way to OneHotEncode a column with 2058 nuniques. I know that doing a fit_transform on that column will give me an array with 2058 columns (2057 if you drop the first). Is that the right approach? Apart from that, I have another column with about 441 nuniques, so that's another headache I need to take care of.
I know for a fact that the first column (the one with 2058 nuniques) is very important to the dataset: it holds the brand names of cars, which in the real world are a deciding factor in whether someone buys a car. So I know it matters, but given the sheer number of unique values and the fact that I'd have to OneHotEncode it, I'm tempted to just exclude it.
So it just boils down to this: Is there another way to deal with this many unique values, or something else I can do?
For the sake of this question:
- the column with 2058 nuniques = df['A']
- the column with 441 nuniques = df['B']
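For reference, here is a minimal sketch of what I'm doing now. The toy DataFrame is just a stand-in for my real data (same row and unique counts), and the `sparse_output` parameter name is from scikit-learn >= 1.2 (older versions call it `sparse`):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real data: 8128 rows, 2058 unique values.
df = pd.DataFrame({"A": ["brand_" + str(i % 2058) for i in range(8128)]})

# drop='first' removes one level, so 2058 categories become 2057 columns.
# sparse_output=True keeps memory manageable with this many columns.
enc = OneHotEncoder(drop="first", sparse_output=True)
X = enc.fit_transform(df[["A"]])
print(X.shape)  # (8128, 2057)
```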
Topic one-hot-encoding scikit-learn pandas
Category Data Science