Custom vectorizer transformer in sklearn with cross validation

I created a custom transformer class called Vectorizer() that inherits from sklearn's BaseEstimator and TransformerMixin classes. The purpose of this class is to expose vectorizer-specific hyperparameters (e.g. ngram_range and the vectorizer type: CountVectorizer or TfidfVectorizer) to GridSearchCV or RandomizedSearchCV, so that I don't have to manually rewrite the pipeline every time a vectorizer of a different type or with different settings might work better.

The custom transformer class looks like this:

class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range
    def fit(self, X, y=None):
        print(f">>> Vectorizer.fit() called with vectorizer={self.vectorizer} and ngram_range={self.ngram_range}.")
        return self 
    def transform(self, X, y=None):
        print(f">>> Vectorizer.transform() called with vectorizer={self.vectorizer} and ngram_range={self.ngram_range}.")
        X_ = X.copy()
        X_vect_ = self.vectorizer.fit_transform(X_)  # problem is in this line! 
        X_vect_ = X_vect_.toarray()
        # print(X_vect_.shape)
        # print(self.vectorizer.vocabulary_)
        # time.sleep(5)
        return X_vect_

(Side note: time.sleep(5) was added merely to make debugging easier by keeping the debug output from scrolling by too quickly.)

I intend to use the custom vectorizer in the following way, with a pipeline and a hyperparameter tuning step:

pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('ltype_encode', OneHotEncoder(handle_unknown='ignore'), ['Type']),
        ('text_vectorizer', Vectorizer(), 'Text')],
        remainder='drop')
    ),
    ('model', LogisticRegression())
])

param_dict = {
    'column_transformer__text_vectorizer__vectorizer': [CountVectorizer(), TfidfVectorizer()]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)

While debugging, I formed a guess about the problem in my code: the GridSearchCV above uses 2-fold cross-validation, so it trains the model on one half of the data and reserves the other half for evaluation. However, the Vectorizer class's transform() method calls fit_transform() again on the evaluation data, even though at evaluation time we want to reuse the previously fitted vectorizer without refitting it.
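
To illustrate the contract I think I am violating, here is a minimal standalone sketch (with made-up texts, outside the pipeline): the vocabulary should be learned once on the training texts, and the held-out texts should only be transformed with it:

from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["hi i love python", "should i learn java or python"]
test_texts = ["python or java"]

cv = CountVectorizer()
cv.fit(train_texts)                        # vocabulary learned on the training fold only
X_train_vect = cv.transform(train_texts)   # columns correspond to the training vocabulary
X_test_vect = cv.transform(test_texts)     # same columns, so features line up with the model

# In contrast, cv.fit_transform(test_texts) would relearn a different (smaller) vocabulary,
# so the evaluation features would no longer match what the model was trained on.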

Question is: how could I rectify this problem?

Imports:

import time
import pandas as pd 
from typing import Callable
import sklearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV

and sample data in semicolon-separated format:

Src;Text;Type;Target
A99;hi i love python very much;c;1
B07;which programming language should i learn;b;0
A12;what is the difference between python django flask;b;1
A21;i want to be a programmer one day;c;0
B11;should i learn java or python;b;1
C01;how much can i earn as a programmer with python;a;0
c01;hello FLAG FLAG I m from france i enjoyed this lecture thank u very much HEAVY BLACK HEART HEAVY BLACK HEART HEAVY BLACK HEART HEAVY BLACK HEART;b;1
ssa;hi hola salut FOREIGN FOREIGN FOREIGN FOREIGN SMILING FACE WITH HALO HEAVY BLACK HEART CLINKING GLASSES FLAG FLAG;a;1
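
For completeness, the X_train and y_train used above are not shown; they can be built from this sample roughly as follows (assuming the sample is saved as data.csv; the exact split does not matter for the question):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv", sep=";")   # load the semicolon-separated sample above

X = df[["Text", "Type"]]                # the ColumnTransformer selects 'Type' and 'Text' from these
y = df["Target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)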

Topic: pipelines, cross-validation, classification, python, machine-learning

Category: Data Science


This can be solved by changing the call inside transform to the vectorizer's transform method. In addition, you also have to add a call to fit within the fit method to make sure the vectorizer is actually fitted before it is used to transform any data:

class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer:Callable=CountVectorizer(), ngram_range:tuple=(1,1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range
    def fit(self, X, y=None):
        print(f">>> Vectorizer.fit() called with vectorizer={self.vectorizer} and ngram_range={self.ngram_range}.")
        self.vectorizer.fit(X)  # fit the wrapped vectorizer on the training data only
        return self 
    def transform(self, X, y=None):
        print(f">>> Vectorizer.transform() called with vectorizer={self.vectorizer} and ngram_range={self.ngram_range}.")
        X_ = X.copy()
        X_vect_ = self.vectorizer.transform(X_)  # reuse the already-fitted vocabulary; no refit
        X_vect_ = X_vect_.toarray()
        return X_vect_
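
Note that in both versions ngram_range is stored but never actually passed on to the wrapped vectorizer. One possible extension (a sketch, not part of the code above) is to forward it inside fit via set_params, which both CountVectorizer and TfidfVectorizer accept; the parameter can then also be included in the search grid:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer=CountVectorizer(), ngram_range=(1, 1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range
    def fit(self, X, y=None):
        # forward ngram_range to the wrapped vectorizer before fitting it on the training folds
        self.vectorizer.set_params(ngram_range=self.ngram_range)
        self.vectorizer.fit(X)
        return self
    def transform(self, X, y=None):
        # reuse the vocabulary learned in fit; no refitting on evaluation folds
        return self.vectorizer.transform(X).toarray()

param_dict = {
    'column_transformer__text_vectorizer__vectorizer': [CountVectorizer(), TfidfVectorizer()],
    'column_transformer__text_vectorizer__ngram_range': [(1, 1), (1, 2)],
}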
