What to do if your adversarial validation shows different distributions for an NLP problem?
I was trying to figure out whether the test set from a competition is similar to the train set. This was for an NLP competition in which I had two columns, tweet and type, and I needed to predict the type of crime each tweet was reporting. So I decided to check whether the train set is too different from the test set. This is what I've done so far:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures, MaxAbsScaler
from sklearn.metrics import accuracy_score, f1_score
from lightgbm import LGBMClassifier

# drop the target column from the training data
train_adv = train_set.drop(["type"], axis=1)
test_adv = submission_set.copy()

# add the train/test labels (0 = train, 1 = test)
train_adv["AV_label"] = 0
test_adv["AV_label"] = 1

# make one big dataset
all_data = pd.concat([train_adv, test_adv], axis=0, ignore_index=True)
X_adv = all_data["tweet"]
y_adv = all_data["AV_label"]

X_train_adv, X_test_adv, y_train_adv, y_test_adv = train_test_split(
    X_adv, y_adv, test_size=0.5, random_state=42, stratify=y_adv
)

lgbm_pipe = Pipeline(
    [
        (
            "vect",
            CountVectorizer(
                lowercase=True,
                analyzer="word",
                # ngram_range=(1, 2),
                strip_accents="ascii",
                token_pattern=r"\w+",
                stop_words="english",
            ),
        ),
        ("feature_selection", SelectKBest(chi2, k=500)),
        ("poly", PolynomialFeatures(2, interaction_only=True)),
        ("scaler", MaxAbsScaler()),
        ("model", LGBMClassifier(boosting_type="gbdt", n_jobs=-1, num_leaves=45)),
    ]
)

lgbm_pipe.fit(X_train_adv, y_train_adv)
y_pred_adv = lgbm_pipe.predict(X_test_adv)

print(f"Accuracy: {accuracy_score(y_test_adv, y_pred_adv)}")
print(f"F1-Score Weighted: {f1_score(y_test_adv, y_pred_adv, average='weighted')}")
All metrics are very high, indicating (from what I've learnt) that the adversarial classifier can easily tell the two sets apart, i.e. the test set is very different from the train set. Is there anything I could actually do with this information? Since there are only two columns, I don't know how I can fix this and avoid overfitting. It's not like I have some features that could explain what's happening. Any idea?
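In case it's relevant, this is roughly the kind of follow-up inspection I had in mind, reusing the fitted pipeline from the code above (a rough, untested sketch: train_set, the tweet column and the pipeline step names all refer to my snippet, and the probabilities are a bit optimistic because the adversarial model has already seen part of the train tweets):

import numpy as np

# probability that each original training tweet "looks like" the test set
# (class 1 is the test label in the setup above)
test_likeness = lgbm_pipe.predict_proba(train_set["tweet"])[:, 1]
most_test_like = train_set.assign(test_likeness=test_likeness).sort_values(
    "test_likeness", ascending=False
)
print(most_test_like.head(10))

# tokens with the highest chi2 scores, i.e. the vocabulary that most
# separates train from test in the adversarial setup
vocab = lgbm_pipe.named_steps["vect"].get_feature_names_out()
scores = np.nan_to_num(lgbm_pipe.named_steps["feature_selection"].scores_, nan=0.0)
print(vocab[np.argsort(scores)[::-1][:20]])

Would something along these lines (weighting or selecting train tweets by test-likeness, or looking at the separating tokens) be the right way to use the adversarial validation result?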
Topic validation cross-validation nlp machine-learning
Category Data Science