Semi-supervised classification with SelfTrainingClassifier: no training after calling fit()

I am practicing semi-supervised learning and am currently experimenting with sklearn.semi_supervised.SelfTrainingClassifier. I found a dataset for multiclass classification (tweet sentiment classification into 5 sentiment categories) and randomly removed 90% of the targets.

Since it is textual data, preprocessing is needed: I applied CountVectorizer() and created a sklearn.pipeline.Pipeline with the vectorizer and the self-training classifier instance.

For the base estimator of the self-training classifier I used RandomForestClassifier.

My problem is that when running the script below, no training happens. The verbose argument is set to True, so if any iteration had happened I would see its output. Also, when inspecting the predicted labels, they are identical to the initial ones, confirming that although no errors are raised, something is not in order.

The full code:

import pandas as pd 
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Coronavirus dataset from Kaggle: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification
# For this semi-supervised demonstration, only train file is used.
df = pd.read_csv('./datasets/Corona_NLP_Train.csv', encoding='latin-1')

# subsample the dataset (purely for efficiency, i.e. running the examples quicker)
df = df.sample(frac=0.1)
print('Original data shape:', df.shape)

# Unlabeled data must be denoted by -1 in the target column. Since the original data is fully labeled, we remove the labels for 90% of the targets
rand_indices = df.sample(frac=0.90, random_state=0).index

# create new 'Sentiment_masked' column
df['Sentiment_masked'] = df['Sentiment']
df.loc[rand_indices, 'Sentiment_masked'] = -1

# check original 'Sentiment' distribution
print('Original (unaltered) sentiment distribution:\n', df['Sentiment'].value_counts())

# check masked sentiment distribution
print('Masked sentiment distribution:\n', df['Sentiment_masked'].value_counts())


X = df['OriginalTweet']
y = df['Sentiment_masked']

stclf = SelfTrainingClassifier(
    base_estimator = RandomForestClassifier(n_estimators = 100),
    threshold = 0.9,
    verbose = True)

pipe = Pipeline([('vectorize', CountVectorizer()),  ('model', stclf)])

pipe.fit(X, y)

I then retrieved the updated/modified labels using:

pd.Series(pipe['model'].transduction_).value_counts()

which yielded:

-1                    3704
Positive               117
Negative                93
Neutral                 79
Extremely Positive      72
Extremely Negative      51

i.e. exactly the same as what df['Sentiment_masked'].value_counts() yielded earlier.

What am I missing here?

Topic multiclass-classification semi-supervised-learning classification machine-learning

Category Data Science


The reason you are not seeing any verbose output from the model fitting and no change in the model's labels is that the threshold you are currently using is too high: none of the base estimator's predicted class probabilities reach 0.9, so the model never adds any new pseudo-labels to the dataset. Decreasing the threshold (e.g. to 0.7) does show output with the number of labels added in each iteration:

End of iteration 1, added 54 new labels.
End of iteration 2, added 163 new labels.
End of iteration 3, added 310 new labels.
End of iteration 4, added 576 new labels.
End of iteration 5, added 982 new labels.
End of iteration 6, added 1350 new labels.
End of iteration 7, added 249 new labels.
End of iteration 8, added 12 new labels.
End of iteration 9, added 3 new labels.
End of iteration 10, added 1 new labels.
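
For reference, the only change compared to your script is the threshold argument passed to the classifier (a minimal sketch; everything else stays the same):

# lower the confidence threshold so that the random forest's predictions can qualify as pseudo-labels
stclf = SelfTrainingClassifier(
    base_estimator=RandomForestClassifier(n_estimators=100),
    threshold=0.7,
    verbose=True)

pipe = Pipeline([('vectorize', CountVectorizer()), ('model', stclf)])
pipe.fit(X, y)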

The reason you are not seeing any change when getting the value counts for the different labels is that the model doesn't actually add the newly generated pseudo-labels to the original dataset. It only adds the labels internally (see the source code) and, after fitting, returns the class itself, which now contains the classifier trained on the original dataset plus the pseudo-labels. This fitted classifier is stored in the base_estimator_ attribute, which is then used when predicting on new data (e.g. see the predict method of sklearn.semi_supervised.SelfTrainingClassifier).
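
With the lower threshold, the pseudo-labels assigned during fitting are exposed through the transduction_ attribute (your DataFrame itself is never modified), and predictions on unseen tweets are delegated to the internally fitted base_estimator_. A quick check, assuming the pipeline was refitted with threshold=0.7 (the example tweets below are made up):

# labels used for the final fit, including the pseudo-labels added during self-training
print(pd.Series(pipe['model'].transduction_).value_counts())

# predict() vectorizes the raw text and forwards it to the trained base_estimator_
print(pipe.predict(['Supermarket shelves are empty again', 'Great service at the local pharmacy']))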
