Imbalanced Classification: BOW vs doc2Vec in XGBoost with sample weights
I am new to machine learning. I have an imbalanced dataset of pages of reports with the following class distribution:
- class 1: 97%
- class 2: 2.2%
- class 3: 0.25%

The classes correspond to the different types of pages.
I am mostly concerned with correctly predicting classes 2 and 3. I tried:
- doc2Vec with XGBoost (with sample weights to correct for the class imbalance; the weighting is sketched after this list)
- BOW with XGBoost (with the same sample weights)
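In both cases the weighting looks roughly like this (a minimal sketch; `X_train` and `y_train` stand in for my actual feature matrix and labels):

```
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# One weight per training row, inversely proportional to class frequency,
# so the rare classes 2 and 3 count for much more during boosting.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

clf = XGBClassifier(objective="multi:softprob", random_state=0)
clf.fit(X_train, y_train, sample_weight=weights)
```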
Oddly, approach 2 (BOW) outperformed approach 1 (doc2Vec). I thought doc2Vec should do better, since it creates feature embeddings that capture the relationships between the words in a document/page. So why is Doc2Vec faring worse than BOW? Thank you.

For reference, here is my Doc2Vec setup:
```
import multiprocessing
from gensim.models.doc2vec import Doc2Vec
from gensim.test.test_doc2vec import ConcatenatedDoc2Vec

cores = multiprocessing.cpu_count()

# PV-DBOW model
model_dbow = Doc2Vec(dm=0, min_count=2, workers=cores, seed=0)
model_dbow.build_vocab(train_tagged.values)
model_dbow.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=40)
# PV-DM model (dm_mean=1 averages the word vectors)
model_dmm = Doc2Vec(dm=1, dm_mean=1, min_count=1, workers=cores, seed=0)
model_dmm.build_vocab(train_tagged.values)
model_dmm.train(train_tagged.values, total_examples=len(train_tagged.values), epochs=40)
# Concatenated document vectors from both models
new_model = ConcatenatedDoc2Vec([model_dbow, model_dmm])
```
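The document vectors then become the features for XGBoost, roughly like this (again a sketch; `vectors_for_xgboost` is just an illustrative helper, and `train_tagged` is a pandas Series of gensim `TaggedDocument` objects where the tag is the page's class label):

```
import numpy as np

def vectors_for_xgboost(model, tagged_docs):
    # Infer a concatenated (DBOW + DM) vector for every page and pair it with its label.
    labels, vectors = zip(*[
        (doc.tags[0], model.infer_vector(doc.words)) for doc in tagged_docs.values
    ])
    return np.array(vectors), np.array(labels)

X_train, y_train = vectors_for_xgboost(new_model, train_tagged)
# X_train / y_train then go into the sample-weighted XGBClassifier shown above.
```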
Topic doc2vec xgboost class-imbalance
Category Data Science