How should I OneHotEncode a column with 2058 unique values (8128 rows)?

The title, pretty much.

I want to know the best and most efficient way to OneHotEncode a column with 2058 unique values. I know that calling fit_transform on that column gives me an array with 2058 columns (2057 if I drop the first). Is that the right approach? I also have another column with about 441 unique values, so that's another headache I need to take care of.
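To make the numbers concrete, here is a minimal sketch of what that fit_transform call produces. The data is synthetic (the `brand_…` labels and the random fill are assumptions standing in for the real column), and it relies on scikit-learn's OneHotEncoder returning a SciPy sparse matrix by default, so the 2057 columns are not stored densely.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real data (assumption): 8128 rows, 2058 labels.
rng = np.random.default_rng(0)
n_rows, n_unique = 8128, 2058
labels = np.array([f"brand_{i}" for i in range(n_unique)])
# Make sure every label occurs at least once, then fill the rest at random.
values = np.concatenate([labels, rng.choice(labels, size=n_rows - n_unique)])
df = pd.DataFrame({"A": values})

# drop="first" removes one indicator column per encoded feature.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(df[["A"]])  # sparse matrix by default

print(encoded.shape)             # (rows, unique values - 1)
print(sparse.issparse(encoded))  # only non-zero entries are stored
print(encoded.nnz)               # at most one non-zero entry per row
```

Because each row has at most one non-zero entry, the memory cost grows with the number of rows rather than with rows times columns, as long as the downstream estimator accepts sparse input.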

I know for a fact that the first column (the one with 2058 unique values) is very important for the dataset. It holds the brand names of cars, which in the real world is a deciding factor in whether someone buys a car. Even so, I'm tempted to exclude it because of the sheer number of unique values and the fact that I'd have to OneHotEncode it.

So it boils down to this: is there another way to deal with this many unique values, or is there something else I can do?

For the sake of this question:

  1. the column with 2058 unique values = df['A']
  2. the column with 441 unique values = df['B']
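For reference, one commonly used way to shrink columns like these before encoding is to collapse infrequent categories into a single bucket. The sketch below is only an illustration under assumptions: the data is synthetic with a skewed frequency distribution, and `group_rare`, `skewed_column`, the `"other"` label, and the threshold of 20 are hypothetical choices, not something taken from the actual dataset.

```python
import numpy as np
import pandas as pd

def group_rare(series: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with one label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    return series.where(series.isin(frequent), other)

def skewed_column(prefix: str, n_unique: int, n_rows: int, rng) -> pd.Series:
    """Synthetic categorical column where a few labels dominate (assumption)."""
    weights = 1.0 / np.arange(1, n_unique + 1)
    codes = rng.choice(n_unique, size=n_rows, p=weights / weights.sum())
    return pd.Series([f"{prefix}_{c}" for c in codes])

rng = np.random.default_rng(0)
n_rows = 8128
df = pd.DataFrame({
    "A": skewed_column("a", 2058, n_rows, rng),
    "B": skewed_column("b", 441, n_rows, rng),
})

for col in ["A", "B"]:
    df[col + "_grouped"] = group_rare(df[col], min_count=20)
    print(col, df[col].nunique(), "->", df[col + "_grouped"].nunique())
```

The grouped columns can then be one-hot encoded as usual. How much information this loses depends on how the real brand names are distributed, so the threshold would need to be tuned against the actual data.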

