Word2vec to encode medical procedures when using isolation forests

I am planning to use Isolation Forests in R (solitude package) to identify outlier medical claims in my data.

Each row of my data represents the set of drugs that a provider has administered in the last 12 months.

There are more than 700 unique drugs in my dataset, so one-hot encoding them alongside a variety of numerical features would blow out the number of columns in my data.

As an alternative to one-hot encoding, I've been reading about using word2vec to convert words (or, in my case, the collection of drugs per provider) to numerical vectors.

My question is: can these numerical vectors per provider be used as input features in my isolation forest model?

Topic isolation-forest unsupervised-learning anomaly-detection outlier r

Category Data Science


Word2vec often works better than one-hot encoding and needs far fewer dimensions, so you can try using word2vec embeddings as input features. The only problem I see is that pretrained word2vec embeddings are generic, while drug names are very specific to the medical field. Because of this, you may face two problems:

  1. Many of the drug names in your vocabulary may not be present in the word2vec vocabulary.

  2. Because the embeddings are generic, they may not perform well in a medical context.
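To gauge how serious the first problem is, and to see how per-drug vectors could become one fixed-length row per provider, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `embeddings` and `providers` dictionaries are toy data, `provider_vector` is a hypothetical helper, and averaging the drug vectors is just one simple pooling choice, not something word2vec or the solitude package prescribes.

```python
from statistics import mean

# Hypothetical toy embeddings: drug name -> vector. In practice these would
# come from a word2vec model (pretrained, or trained on your own claims data).
embeddings = {
    "metformin": [0.25, 0.0, 0.5],
    "insulin": [0.75, 0.5, 0.0],
    "atorvastatin": [1.0, 0.5, 0.25],
}

# Hypothetical input: provider -> drugs administered in the last 12 months.
providers = {
    "provider_a": ["metformin", "insulin"],
    "provider_b": ["atorvastatin", "drug_not_in_vocab"],
}


def provider_vector(drugs, embeddings):
    """Average the vectors of the known drugs.

    Returns (vector, missing), where vector is None if no drug is in the
    embedding vocabulary and missing lists the out-of-vocabulary drugs.
    """
    known = [embeddings[d] for d in drugs if d in embeddings]
    missing = [d for d in drugs if d not in embeddings]
    if not known:
        return None, missing
    # zip(*known) groups the values dimension by dimension.
    vector = [mean(dim) for dim in zip(*known)]
    return vector, missing


features = {}
for name, drugs in providers.items():
    vector, missing = provider_vector(drugs, embeddings)
    features[name] = vector
    print(name, vector, "missing:", missing)
```

Each provider ends up with a vector of the same length regardless of how many drugs they administered, so it can sit next to the other numerical columns passed to the isolation forest. If the `missing` lists are long for many providers, that suggests generic embeddings cover the drug vocabulary poorly, and training word2vec on your own claims data may be worth trying instead.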
