Dealing with multiple distinct-value categorical variables

So, I've got a dataset in which almost all of the columns are categorical variables. The problem is that most of them have a very large number of distinct values.

For instance, one column has more than one million unique values; it's an IP address column, in case anyone is interested. Someone suggested splitting it into multiple columns using domain knowledge, e.g. network class type, host type and so on. However, wouldn't that make my dataset lose some information? What if I wanted to deal with the IP addresses as they are?
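To make the suggested split concrete, here is a minimal sketch using Python's standard `ipaddress` module. The function name `ip_features` and the chosen output columns are assumptions for illustration, not something from the original suggestion. It assumes IPv4 addresses. Note that the four octets together still encode the full address, so that part of the split loses nothing, while each octet column has at most 256 distinct values.

```python
import ipaddress

def ip_features(ip_string):
    """Split an IPv4 address into lower-cardinality features (illustrative only)."""
    ip = ipaddress.IPv4Address(ip_string)
    octets = ip.packed  # 4 bytes, one per octet
    first = octets[0]
    # Historical classful network type, derived from the first octet.
    if first < 128:
        net_class = "A"
    elif first < 192:
        net_class = "B"
    elif first < 224:
        net_class = "C"
    elif first < 240:
        net_class = "D"
    else:
        net_class = "E"
    return {
        "octet_1": octets[0],
        "octet_2": octets[1],
        "octet_3": octets[2],
        "octet_4": octets[3],
        "net_class": net_class,
        "is_private": ip.is_private,
    }

features = ip_features("192.168.1.10")
# The original address can be rebuilt from the octet columns.
rebuilt = ".".join(str(features[f"octet_{i}"]) for i in range(1, 5))
```

Derived columns such as `net_class` or `is_private` are coarser summaries; whether they help depends on the task.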

The domain knowledge solution might work for the IP column, but I've got other columns with more than 100K distinct values, where each value is a constant-length random string.

I have worked with embedding layers before, but only with a few thousand features at most. I've never worked with more than 10K features, so I'm not sure whether that would work with millions.

Much Regards

Topic word-embeddings neural-network categorical-data machine-learning

Category Data Science


There are multiple options; you can try them and decide which suits your data best:

  • Word embeddings:
    • You can use pre-trained models.
    • You can train your own Word2Vec model on your domain-specific data.
  • Try to group different values:
    • Rare values with very low statistical significance can be marked as "other".
    • Try different clustering algorithms, depending on your data.
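
As a rough illustration of the "mark rare values as other" idea, here is a minimal sketch in plain Python. The helper name `group_rare`, the `min_count` threshold and the sample column are all hypothetical; a sensible threshold depends on your data.

```python
from collections import Counter

def group_rare(values, min_count=2, other_label="other"):
    """Replace values seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical column of category codes.
column = ["a1", "a1", "b7", "a1", "c3", "b7", "z9"]
grouped = group_rare(column, min_count=2)
# "c3" and "z9" each appear once, so both become "other".
```

In practice the counts should be computed on the training data only and then reused for new data, so that unseen values also fall into the "other" bucket.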

There might be other, more efficient methods; I will add them here if I find any.

Thanks


Have you heard of CatBoostClassifier?

https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier-docpage/

It is a type of gradient boosting classifier developed specifically to deal with categorical features. It has achieved state-of-the-art results, and the package developed by the authors has excellent support and even GPU support. Take a look; this could be your solution.
