Why does using a StandardScaler on my TF-IDF matrix make it perform better?

I have a TF-IDF matrix computed from a list of tweets in a dataset I am using. I have a pipeline that first instantiates a StandardScaler and then applies an SVM with a linear kernel and gamma="auto" as the classifier.

Pretty much as done here in the examples section. With the pipeline, the classifier scores an F1 of 87; without it, a dismal 53.
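A minimal sketch of the setup described, assuming scikit-learn; the tweets and labels are illustrative placeholders, not the original data. Note that TF-IDF output is sparse, so the scaler is given with_mean=False (the default with_mean=True cannot center a sparse matrix):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for the tweet dataset.
tweets = ["great game today", "awful service never again", "loving this new phone"]
labels = [1, 0, 1]

# Sparse TF-IDF matrix, as produced by TfidfVectorizer.
X = TfidfVectorizer().fit_transform(tweets)

# Scaler first, then the linear-kernel SVM, as in the question.
# with_mean=False keeps the matrix sparse (centering would densify it).
pipe = Pipeline([
    ("scale", StandardScaler(with_mean=False)),
    ("svm", SVC(kernel="linear", gamma="auto")),
])
pipe.fit(X, labels)
```

Dropping the `("scale", ...)` step from the list reproduces the unscaled variant being compared against.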

Why is this?

I thought TF-IDF values were already two-fold normalised, so shouldn't the StandardScaler have no effect, since it just performs normalisation again?

Topic: tfidf, classification, svm

Category: Data Science


I am not sure what you mean by two-fold normalised. TF-IDF values are not standardised per se. With scikit-learn's defaults, each document vector (row) is scaled to unit L2 norm, but the individual features (columns) can still have very different means and variances, depending on the dataset.

The StandardScaler operates on the columns: it removes each feature's mean and scales it to unit variance. Since that is a different operation from the row-wise normalisation TF-IDF applies, the scaler does change the values, which explains the difference in performance.
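This distinction can be checked directly; a small sketch assuming scikit-learn, with made-up example documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents, not from the original dataset.
docs = ["the cat sat", "the dog barked", "cats and dogs"]
X = TfidfVectorizer().fit_transform(docs).toarray()

# Every row (document) has unit L2 norm: TfidfVectorizer's default is norm="l2".
row_norms = np.linalg.norm(X, axis=1)

# But the columns (features) are not standardised: their standard
# deviations vary, and this is the axis StandardScaler acts on.
col_std = X.std(axis=0)
```

Here `row_norms` comes out as all ones while `col_std` does not, which is why applying StandardScaler still changes the matrix.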
