Semi-supervised classification with SelfTrainingClassifier: no training after calling fit()
I am practicing semi-supervised learning and am currently experimenting with sklearn.semi_supervised.SelfTrainingClassifier. I found a dataset for multiclass classification (tweet sentiment classification into 5 sentiment categories) and randomly removed 90% of the targets.
Since it is textual data, preprocessing is needed: I applied CountVectorizer() and created a sklearn.pipeline.Pipeline with the vectorizer and the self-training classifier instance.
For the base estimator of the self-training classifier I used RandomForestClassifier.
My problem is that when I run the script below, no training seems to happen. The verbose argument is set to True, so if any self-training iteration completed I would expect to see its output. Also, when I inspect the predicted labels, they are identical to the initial ones, which confirms that something is wrong even though no error is raised.
The full code:
import pandas as pd 
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
# Coronavirus dataset from Kaggle: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification
# For this semi-supervised demonstration, only train file is used.
df = pd.read_csv("./datasets/Corona_NLP_Train.csv", encoding='latin-1')
# subsample the dataset (purely for efficiency, i.e. running the examples quicker)
df = df.sample(frac=0.1)
print("Original data shape: ", df.shape)
# Unlabeled samples must be denoted by -1 in the target column. Since the original data is fully labeled, we remove the labels for 90% of the targets
rand_indices = df.sample(frac=0.90, random_state=0).index
# create new 'Sentiment_masked' column
df['Sentiment_masked'] = df['Sentiment']
df.loc[rand_indices, 'Sentiment_masked'] = -1
# check original 'Sentiment' distribution
print("Original (unaltered) sentiment distribution:\n", df['Sentiment'].value_counts())
# check masked sentiment distribution
print("Masked sentiment distribution:\n", df['Sentiment_masked'].value_counts())
X = df['OriginalTweet']
y = df['Sentiment_masked']
stclf = SelfTrainingClassifier(
    base_estimator=RandomForestClassifier(n_estimators=100),
    threshold=0.9,
    verbose=True)
pipe = Pipeline([('vectorize', CountVectorizer()), ('model', stclf)])
pipe.fit(X, y)
I then retrieved the updated labels using:
pd.Series(pipe['model'].transduction_).value_counts()
which yielded:
-1                    3704
Positive               117
Negative                93
Neutral                 79
Extremely Positive      72
Extremely Negative      51
i.e. exactly the same counts that df['Sentiment_masked'].value_counts() yielded earlier.
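For reference, here is a minimal sketch of the checks that could narrow this down. It assumes synthetic data from make_classification as a stand-in for the vectorized tweets, and the names X_demo, y_demo, demo_clf and base are hypothetical, not part of the script above. It reads the fitted attributes termination_condition_, n_iter_ and labeled_iter_, and then compares the base estimator's highest predict_proba value on the unlabeled rows against the 0.9 threshold. Whether the threshold is what stops the loop on the real data is only one possibility worth checking, not something this sketch proves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic 5-class data standing in for the vectorized tweets (an assumption).
X_demo, y_true = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_classes=5, random_state=0
)

# Mask roughly 90% of the targets with -1, mirroring the script above.
rng = np.random.default_rng(0)
unlabeled = rng.random(y_true.shape[0]) < 0.9
y_demo = y_true.copy()
y_demo[unlabeled] = -1

# The estimator is passed positionally because the keyword name of this
# parameter differs between scikit-learn versions.
demo_clf = SelfTrainingClassifier(
    RandomForestClassifier(n_estimators=100, random_state=0), threshold=0.9
)
demo_clf.fit(X_demo, y_demo)

# Why did the self-training loop stop, and did it pseudo-label anything?
print("termination_condition_:", demo_clf.termination_condition_)
print("n_iter_:", demo_clf.n_iter_)
print("pseudo-labeled samples:", int((demo_clf.labeled_iter_ > 0).sum()))

# Fit the base estimator on the labeled rows only and look at how confident
# it is on the unlabeled rows, compared with the 0.9 threshold.
base = RandomForestClassifier(n_estimators=100, random_state=0)
base.fit(X_demo[~unlabeled], y_demo[~unlabeled])
max_proba = base.predict_proba(X_demo[unlabeled]).max(axis=1)
print("highest confidence on unlabeled rows:", max_proba.max())
print("unlabeled rows above the threshold:", int((max_proba > 0.9).sum()))
```

On the real pipeline, the same attributes would be available as pipe['model'].termination_condition_ and pipe['model'].n_iter_ after pipe.fit(X, y).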
What am I missing here?