Custom vectorizer transformer in sklearn with cross validation
I created a custom transformer class called Vectorizer() that inherits from sklearn's BaseEstimator and TransformerMixin classes. The purpose of this class is to expose vectorizer-specific hyperparameters (e.g. ngram_range, or the vectorizer type: CountVectorizer or TfidfVectorizer) to GridSearchCV or RandomizedSearchCV, so that I don't have to manually rewrite the pipeline every time I believe a vectorizer of a different type or with different settings could work better.
The custom transformer class looks like this:
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer: Callable = CountVectorizer(), ngram_range: tuple = (1, 1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        print(f"Vectorizer.fit() called with vectorizer={self.vectorizer} and ngram_range={self.ngram_range}.")
        return self

    def transform(self, X, y=None):
        print(f"Vectorizer.transform() called with vectorizer={self.vectorizer} and ngram_range={self.ngram_range}.")
        X_ = X.copy()
        X_vect_ = self.vectorizer.fit_transform(X_)  # problem is in this line!
        X_vect_ = X_vect_.toarray()
        # print(X_vect_.shape)
        # print(self.vectorizer.vocabulary_)
        # time.sleep(5)
        return X_vect_
(Side note: time.sleep(5) was merely added to make debugging easier by preventing the debug output from piling up on top of one another.)
I intend to use the custom vectorizer in the following way, with a pipeline and a hyperparameter tuning step:
pipe = Pipeline([
    ('column_transformer', ColumnTransformer(
        [('ltype_encode', OneHotEncoder(handle_unknown='ignore'), ['Type']),
         ('text_vectorizer', Vectorizer(), 'Text')],
        remainder='drop')),
    ('model', LogisticRegression())
])

param_dict = {
    'column_transformer__text_vectorizer__vectorizer': [CountVectorizer(), TfidfVectorizer()]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)
Now, while debugging, I formed a guess about the problem in my code: the above GridSearchCV uses 2-fold cross-validation. First, it takes half of the data to train the model and reserves the other half for evaluation. However, the Vectorizer() class's transform() method calls fit_transform() again on the evaluation dataset, even though at evaluation time we would want to reuse the previously fitted vectorizer without refitting it.
Question is: how could I rectify this problem?
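For what it's worth, the direction I am considering is splitting the work between fit() and transform(): fit a copy of the vectorizer on the training fold only, then merely transform in transform(). This is an untested sketch; the clone() call and the trailing-underscore attribute name (vectorizer_) are my guesses at the usual sklearn convention for fitted state:

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer


class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer=CountVectorizer(), ngram_range=(1, 1)):
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        # Clone so the object passed as a hyperparameter stays untouched,
        # then fit the clone on the training fold only.
        self.vectorizer_ = clone(self.vectorizer)
        self.vectorizer_.set_params(ngram_range=self.ngram_range)
        self.vectorizer_.fit(X)
        return self

    def transform(self, X, y=None):
        # Reuse the vectorizer fitted in fit(); no refit on evaluation folds.
        return self.vectorizer_.transform(X).toarray()
```

With this split, GridSearchCV would call fit() on the training half only, and the evaluation half would be vectorized with the vocabulary learned during training. I am not sure this is the canonical way, though.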
Imports:
import time
import pandas as pd
from typing import Callable
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
and sample data in semicolon-separated format:
Src;Text;Type;Target
A99;hi i love python very much;c;1
B07;which programming language should i learn;b;0
A12;what is the difference between python django flask;b;1
A21;i want to be a programmer one day;c;0
B11;should i learn java or python;b;1
C01;how much can i earn as a programmer with python;a;0
c01;hello FLAG FLAG I m from france i enjoyed this lecture thank u very much HEAVY BLACK HEART HEAVY BLACK HEART HEAVY BLACK HEART HEAVY BLACK HEART;b;1
ssa;hi hola salut FOREIGN FOREIGN FOREIGN FOREIGN SMILING FACE WITH HALO HEAVY BLACK HEART CLINKING GLASSES FLAG FLAG;a;1
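For reproducibility, the sample above can be read straight into pandas with sep=';'. A minimal sketch with the first two rows only (X_train and y_train would then come from e.g. train_test_split, which I have omitted here):

```python
import io

import pandas as pd

raw = """Src;Text;Type;Target
A99;hi i love python very much;c;1
B07;which programming language should i learn;b;0
"""

df = pd.read_csv(io.StringIO(raw), sep=";")
X = df[["Text", "Type"]]  # the columns the ColumnTransformer selects from
y = df["Target"]
```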
Topic pipelines cross-validation classification python machine-learning
Category Data Science