Semi-supervised classification with SelfTrainingClassifier: no training after calling fit()
I am practicing semi-supervised learning and am currently experimenting with sklearn.semi_supervised.SelfTrainingClassifier. I found a dataset for multiclass classification (tweet sentiment classification into 5 sentiment categories) and randomly removed 90% of the targets.
Since it is textual data, preprocessing is needed: I applied CountVectorizer() and created a sklearn.pipeline.Pipeline with the vectorizer and the self-training classifier instance.
For the base estimator of the self-training classifier I used RandomForestClassifier.
My problem is that when I run the script below, no training happens. The verbose argument is set to True, so if any iteration happened, I would see its output. Also, when I inspect the predicted labels, they are identical to the initial ones, confirming that something is wrong even though no errors are shown.
The full code:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
# Coronavirus dataset from Kaggle: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification
# For this semi-supervised demonstration, only train file is used.
df = pd.read_csv('./datasets/Corona_NLP_Train.csv', encoding='latin-1')
# subsample the dataset (purely for efficiency, i.e. running the examples quicker)
df = df.sample(frac=0.1)
print('Original data shape: ', df.shape)
# Unlabeled data must be denoted by -1 in the target column. Since original data is labeled, we remove labels for 90% of target
rand_indices = df.sample(frac=0.90, random_state=0).index
# create new 'Sentiment_masked' column
df['Sentiment_masked'] = df['Sentiment']
df.loc[rand_indices, 'Sentiment_masked'] = -1
# check original 'Sentiment' distribution
print('Original (unaltered) sentiment distribution:\n', df['Sentiment'].value_counts())
# check masked sentiment distribution
print('Masked sentiment distribution:\n', df['Sentiment_masked'].value_counts())
X = df['OriginalTweet']
y = df['Sentiment_masked']
stclf = SelfTrainingClassifier(
    base_estimator = RandomForestClassifier(n_estimators = 100),
    threshold = 0.9,
    verbose = True)
pipe = Pipeline([('vectorize', CountVectorizer()), ('model', stclf)])
pipe.fit(X, y)
I then retrieved the updated labels using:
pd.Series(pipe['model'].transduction_).value_counts()
which yielded:
-1 3704
Positive 117
Negative 93
Neutral 79
Extremely Positive 72
Extremely Negative 51
i.e. exactly the same as what df['Sentiment_masked'].value_counts() yielded earlier.
What am I missing here?
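One way to narrow this down is to check why the self-training loop stopped and how confident the base estimator actually is on the unlabeled rows. The sketch below is only a diagnostic under stated assumptions: it uses synthetic data from make_classification as a stand-in for the vectorized tweets (not the Kaggle dataset), passes the estimator positionally because the keyword name differs between scikit-learn versions, and the variable names are illustrative. If termination_condition_ turns out to be 'no_change', that would suggest that no prediction exceeded the threshold in the first round, which would be consistent with seeing no verbose output and an unchanged transduction_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

# Assumption: synthetic 5-class data stands in for the vectorized tweets.
X, y_true = make_classification(
    n_samples=500, n_features=20, n_informative=10,
    n_classes=5, random_state=0,
)

# Mask roughly 90% of the labels with -1, mirroring the setup in the question.
rng = np.random.RandomState(0)
y = y_true.copy()
y[rng.rand(len(y)) < 0.9] = -1

# Estimator passed positionally (the keyword is `base_estimator` in older
# scikit-learn releases and `estimator` in newer ones).
stclf = SelfTrainingClassifier(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold=0.9,
)
stclf.fit(X, y)

# Why did the loop stop, and after how many rounds?
print("termination_condition_:", stclf.termination_condition_)
print("n_iter_:", stclf.n_iter_)

# Fit a plain forest on the labeled rows only and look at how confident
# it is on the unlabeled rows, compared with the 0.9 threshold.
labeled = y != -1
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X[labeled], y[labeled])
max_proba = rf.predict_proba(X[~labeled]).max(axis=1)
print("highest confidence on unlabeled rows:", max_proba.max())
print("unlabeled rows above 0.9:", int((max_proba > 0.9).sum()))
```

On the real pipeline the same attributes can be read from pipe['model'] after fitting. If very few or no rows clear the threshold, trying a lower threshold would be one thing to experiment with.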