Dealing with categorical variables that have many distinct values
So, I've got a dataset in which almost all of the columns are categorical variables. The problem is that most of these variables have a very large number of distinct values.
For instance, one column has more than one million unique values; it's an IP address column, in case anyone is interested. Someone suggested splitting it into several other columns using domain knowledge, for example network class, host type, and so on. However, wouldn't that make my dataset lose some information? What if I wanted to use the IP addresses as they are?
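To make the suggestion concrete, here is a minimal sketch of how I understand the split, using only Python's standard `ipaddress` module. The function name `split_ipv4` and the choice of derived columns (classful network class, a private-range flag, the four octets) are my own assumptions for illustration, not something the suggestion specified, and it only handles IPv4.

```python
import ipaddress


def split_ipv4(address: str) -> dict:
    """Derive a few low-cardinality columns from one IPv4 address string.

    Hypothetical helper: the column names below are illustrative only.
    """
    ip = ipaddress.IPv4Address(address)  # raises ValueError on bad input
    octets = list(ip.packed)             # four integers in the range 0..255

    # Historical classful scheme, decided by the first octet.
    first = octets[0]
    if first < 128:
        net_class = "A"
    elif first < 192:
        net_class = "B"
    elif first < 224:
        net_class = "C"
    elif first < 240:
        net_class = "D"
    else:
        net_class = "E"

    return {
        "net_class": net_class,          # 5 possible values
        "is_private": ip.is_private,     # boolean flag
        "octet_1": octets[0],            # each octet has at most 256 values
        "octet_2": octets[1],
        "octet_3": octets[2],
        "octet_4": octets[3],
    }


print(split_ipv4("192.168.1.10"))
```

In this sketch the four octet columns together still identify the exact address, while the class and private-range columns on their own do not.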
Even if the domain-knowledge solution works for the IP column, I've got other columns with more than 100K distinct values, where each value is a constant-length random string.
I have worked with embedding layers before, but only with a few thousand features at most. I have never worked with 10K or more, so I'm not sure whether that approach would work with millions.
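To put the scale into numbers, here is a back-of-the-envelope sketch of how large a single embedding table would be. The embedding dimension of 32 and the 4-byte (float32) weights are assumptions I picked for illustration, and `embedding_table_size` is just a hypothetical helper; an embedding table has one row per distinct value and one column per embedding dimension.

```python
def embedding_table_size(num_categories: int, embedding_dim: int,
                         bytes_per_weight: int = 4) -> tuple:
    """Return (parameter count, size in bytes) of one embedding table.

    Assumes a dense table with one row per distinct value and
    float32 weights by default; both are illustrative assumptions.
    """
    params = num_categories * embedding_dim
    return params, params * bytes_per_weight


# The IP column (about one million values) and a 100K-value column.
for name, n in [("ip_column", 1_000_000), ("random_string_column", 100_000)]:
    params, size_bytes = embedding_table_size(n, embedding_dim=32)
    print(f"{name}: {params:,} parameters, {size_bytes / 1e6:.1f} MB")
```

Under those assumptions the table for the IP column alone comes to 32 million parameters, or 128 MB, before counting any optimizer state.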
Best regards
Topic word-embeddings neural-network categorical-data machine-learning
Category Data Science