What to do if your adversarial validation shows different distributions for an NLP problem?
I was trying to figure out whether the test set from a competition is similar to the train set. This was for an NLP competition in which I had two columns, tweet and type, and I needed to predict the type of crime each tweet was reporting. So I decided to check whether the train set is too different from the test set. This is what I've done so far:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures, MaxAbsScaler
from sklearn.metrics import accuracy_score, f1_score
from lightgbm import LGBMClassifier

# drop the target column from the training data
train_adv = train_set.drop(["type"], axis=1)
test_adv = submission_set.copy()

# add the train/test labels (0 = train, 1 = test)
train_adv["AV_label"] = 0
test_adv["AV_label"] = 1

# make one big dataset
all_data = pd.concat([train_adv, test_adv], axis=0, ignore_index=True)
X_adv = all_data["tweet"]
y_adv = all_data["AV_label"]

X_train_adv, X_test_adv, y_train_adv, y_test_adv = train_test_split(
    X_adv, y_adv, test_size=0.5, random_state=42, stratify=y_adv
)

lgbm_pipe = Pipeline(
    [
        (
            "vect",
            CountVectorizer(
                lowercase=True,
                analyzer="word",
                # ngram_range=(1, 2),
                strip_accents="ascii",
                token_pattern=r"\w+",
                stop_words="english",
            ),
        ),
        ("feature_selection", SelectKBest(chi2, k=500)),
        ("poly", PolynomialFeatures(2, interaction_only=True)),
        ("scaler", MaxAbsScaler()),
        ("model", LGBMClassifier(boosting_type="gbdt", n_jobs=-1, num_leaves=45)),
    ]
)

lgbm_pipe.fit(X_train_adv, y_train_adv)
y_pred_adv = lgbm_pipe.predict(X_test_adv)

print(f"Accuracy: {accuracy_score(y_test_adv, y_pred_adv)}")
print(f"F1-Score Weighted: {f1_score(y_test_adv, y_pred_adv, average='weighted')}")
All metrics are very high, indicating (from what I've learnt) that the adversarial classifier can easily tell the two sets apart, i.e. the test set is very different from the train set. Is there anything I could actually do with this information? Since there are only two columns, I don't know how I can fix this and avoid overfitting. It's not like I have some features that could explain what's happening. Any idea?
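In case it's relevant, this is roughly the kind of follow-up inspection I had in mind, reusing the fitted pipeline from the code above (a rough, untested sketch: train_set, the tweet column and the pipeline step names all refer to my snippet, and the probabilities are a bit optimistic because the adversarial model has already seen part of the train tweets):

import numpy as np

# probability that each original training tweet "looks like" the test set
# (class 1 is the test label in the setup above)
test_likeness = lgbm_pipe.predict_proba(train_set["tweet"])[:, 1]
most_test_like = train_set.assign(test_likeness=test_likeness).sort_values(
    "test_likeness", ascending=False
)
print(most_test_like.head(10))

# tokens with the highest chi2 scores, i.e. the vocabulary that most
# separates train from test in the adversarial setup
vocab = lgbm_pipe.named_steps["vect"].get_feature_names_out()
scores = np.nan_to_num(lgbm_pipe.named_steps["feature_selection"].scores_, nan=0.0)
print(vocab[np.argsort(scores)[::-1][:20]])

Would something along these lines (weighting or selecting train tweets by test-likeness, or looking at the separating tokens) be the right way to use the adversarial validation result?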
Topic validation cross-validation nlp machine-learning
Category Data Science