Why does using a StandardScaler on my TF-IDF matrix make it perform better?

I have a TF-IDF matrix computed from a list of tweets in a dataset I am using. I have a pipeline that first instantiates a StandardScaler and then applies an SVM with a linear kernel and gamma="auto" as the classifier.

Pretty much as done here in the examples section. With the pipeline, the classifier scores an F1 of 87; without it, a dismal 53.
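A minimal sketch of the setup described, assuming scikit-learn; the tweets and labels are illustrative placeholders, not the original data. Note that TF-IDF output is sparse, so the scaler is given with_mean=False (the default with_mean=True cannot center a sparse matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for the tweet dataset.
tweets = ["great game today", "awful service never again", "loving this new phone"]
labels = [1, 0, 1]

# Sparse TF-IDF matrix, as produced by TfidfVectorizer.
X = TfidfVectorizer().fit_transform(tweets)

# Scaler first, then the linear-kernel SVM, as in the question.
# with_mean=False keeps the matrix sparse (centering would densify it).
pipe = Pipeline([
    ("scale", StandardScaler(with_mean=False)),
    ("svm", SVC(kernel="linear", gamma="auto")),
])
pipe.fit(X, labels)
```

Dropping the `("scale", ...)` step from the list reproduces the unscaled variant being compared against.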

Why is this?

I thought TF-IDF values were already two-fold normalised, so shouldn't the StandardScaler have no effect, since it just performs normalisation again?

Topic: tfidf, classification, svm

Category: Data Science


I am not sure what you mean by two-fold normalised. TF-IDF values are not standardised per se. With scikit-learn's defaults, each document vector (row) is scaled to unit L2 norm, but the individual features (columns) can still have very different means and variances, depending on the dataset.

The StandardScaler operates on the columns: it removes each feature's mean and scales it to unit variance. Since that is a different operation from the row-wise normalisation TF-IDF applies, the scaler does change the values, which explains the difference in performance.
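This distinction can be checked directly; a small sketch assuming scikit-learn, with made-up example documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents, not from the original dataset.
docs = ["the cat sat", "the dog barked", "cats and dogs"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# Every row (document) has unit L2 norm: TfidfVectorizer's default is norm="l2".
row_norms = np.linalg.norm(X, axis=1)

# But the columns (features) are not standardised: their standard
# deviations vary, and this is the axis StandardScaler acts on.
col_std = X.std(axis=0)
```

Here `row_norms` comes out as all ones while `col_std` does not, which is why applying StandardScaler still changes the matrix.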
