What can I do when my test and validation scores are good, but the submission is terrible?

I understand this is a very broad question, and I'm fine if someone feels it isn't appropriate here, but it's killing me not to understand this...

Here's the thing: I'm building a machine learning model to predict the topic of a tweet, as part of a competition. This is what I've done to make sure I'm not overfitting: I set aside 10% of my training data and called it the validation set, and used the remaining 90% to build the model. That 90% was then split into a train set and a test set, so I effectively had two held-out datasets to evaluate the model on: the test set and the validation set. Both gave me great results. I also ran a Stratified K-Fold (sketched after the code snippet below), which also showed great results. However, my submission only scores 73% accuracy. What could be happening? Why do I get good results on the test and validation sets, but not on the submission? Is there some data leakage here? I find it hard to see where leakage could come from, since the validation set isn't used at all during training, but I don't know what else could explain it...

This is part of what I've done and where leakage might be coming from (I've simplified it a little):

# imports needed to run the snippet (omitted in the simplified original)
import string

import pandas as pd
from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# load training data
train_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Train.csv')

# leave roughly 10% for validation
# (.loc slicing is inclusive, so the validation slice starts one row later
# to avoid overlapping with the training rows)
train = train_set.loc[:35685, ['Tweet_ID', 'tweet', 'type']]
validation = train_set.loc[35686:, ['Tweet_ID', 'tweet']]

# load the test set
submission_set = pd.read_csv('gender-based-violence-tweet-classification-challenge/Test.csv')

# load submission file
submission_file = pd.read_csv('gender-based-violence-tweet-classification-challenge/SampleSubmission.csv')

def preprocess_text(text):
    STOPWORDS = stopwords.words('english')

    # Check characters to see if they are in punctuation
    nopunc = [char for char in text if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)

    # Now remove stopwords. A callable passed as CountVectorizer's analyzer
    # must return the sequence of tokens, so return the list of words rather
    # than a re-joined string (a string would be counted character by character).
    return [word for word in nopunc.split() if word.lower() not in STOPWORDS]


X = train['tweet']
y = train['type']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

pipe = Pipeline([
    ('vect', CountVectorizer(analyzer=preprocess_text)),
    ('clf', RandomForestClassifier(class_weight='balanced'))
])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
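
The Stratified K-Fold check mentioned above isn't shown in the simplified snippet; a minimal sketch of how it could be run on the same pipeline (the fold count of 5 and accuracy scoring are assumptions, since the post doesn't say what was used):

from sklearn.model_selection import StratifiedKFold, cross_val_score

# stratified 5-fold cross-validation on the same pipeline
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=skf, scoring='accuracy')
print('fold accuracies:', scores)
print('mean accuracy:', scores.mean())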

Tags: pipelines, data-leakage, overfitting, nlp

Category: Data Science

Imho the most likely explanation is that the submission test set doesn't follow the same distribution as the training/validation/test data you used to train and evaluate the model. In other words, the test data they use for the leaderboard is not a random sample from the same pool as your data; it's a separate dataset collected independently, for example over a different period of time. Under this hypothesis your model learns a particular distribution of topics from a particular period, and it doesn't generalize as well to a different distribution of topics from a different time.
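
A rough way to probe this hypothesis, reusing the variable names from the question's snippet and assuming Test.csv also has a 'tweet' column, is to look at how much of the submission vocabulary the training data has never seen, and to compare label distributions:

# rough check for distribution shift: how much of the submission vocabulary
# is unseen in the training tweets?
train_vocab = set(
    token for tweet in train['tweet'] for token in preprocess_text(tweet)
)

def oov_rate(text):
    tokens = preprocess_text(text)
    if not tokens:
        return 0.0
    return sum(tok not in train_vocab for tok in tokens) / len(tokens)

print('mean out-of-vocabulary rate on the submission set:',
      round(submission_set['tweet'].apply(oov_rate).mean(), 3))

# a second quick sanity check: does the predicted label distribution on the
# submission set look anything like the label distribution in the training data?
print(y.value_counts(normalize=True))
print(pd.Series(pipe.predict(submission_set['tweet'])).value_counts(normalize=True))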

Another possibility is that the dataset you used contains many duplicates, which causes data leakage: if duplicates end up in both the training and test data, performance is artificially inflated. If the final test set they use doesn't contain any of your training data, the real performance is lower.
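
Using the variable names from the question's snippet, a quick way to check for this could be:

# how many tweets appear more than once in the training file?
print('duplicate tweets in Train.csv:', train_set['tweet'].duplicated().sum())

# how many tweets in the held-out test split also appear verbatim in the
# training split? Any overlap here inflates the test score.
print('test tweets also present in the training split:',
      X_test.isin(set(X_train)).sum())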
