How fit_transform, transform and TfidfVectorizer work

Question


nananinanana

April 19, 2022, 21:04

I'm a machine learning beginner, and I'm trying to use cosine similarity for fuzzy matching. In the following example I want to compare 'data_dirty' with 'data_clean':

When I vectorize my data, I don't really understand the purpose of fit_transform, or WHY 'dirty_idf_matrix' is computed with ONLY transform, using the SAME vectorizer as 'clean_idf_matrix', which (if I understood correctly) stored what it learned during fit.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# NOTE: `ngrams` is assumed to be a custom analyzer function defined elsewhere (not shown here)

Col_clean = 'fruits_normalized'
Col_dirty = 'fruits'

# read table
data_dirty = {Col_dirty: ['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']}
data_clean = {Col_clean: ['apple', 'pear', 'banana', 'apricot', 'pineapple']}

df_clean = pd.DataFrame(data_clean)
df_dirty = pd.DataFrame(data_dirty)

# unique names in each column
Name_clean = df_clean[Col_clean].unique()
Name_dirty = df_dirty[Col_dirty].unique()

# vectorize both sets of names with the same vectorizer
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
clean_idf_matrix = vectorizer.fit_transform(Name_clean)
dirty_idf_matrix = vectorizer.transform(Name_dirty)

Thank you for your help!

Topic fuzzy-logic cosine-distance scikit-learn python machine-learning

Category Data Science

Sean Owen · Accepted Answer · March 12, 2020, 01:37

I'm not really sure what you're asking, but in general you need to fit an estimator to data so it can learn what it has to do; then you transform data with it. For TfidfVectorizer, fit learns the vocabulary and the IDF weights. fit_transform just does fit and then transform on the same data. Here you fit the vectorizer to Name_clean, and then apply it to both Name_clean and Name_dirty in turn, so both matrices share the same columns. That's pretty normal.
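
To make the fit / transform split concrete, here is a minimal sketch using the data from the question. The ngrams function in it is only a hypothetical stand-in (lowercased character trigrams), since the asker's own analyzer isn't shown, and the variable names are illustrative. It checks that fit_transform gives the same matrix as fit followed by transform, and that transform on the dirty names reuses the vocabulary learned from the clean names, which is what makes the cosine similarity between the two matrices meaningful.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ngrams(text, n=3):
    """Hypothetical stand-in for the asker's analyzer: lowercased character trigrams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


names_clean = ['apple', 'pear', 'banana', 'apricot', 'pineapple']
names_dirty = ['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']

# fit_transform learns the vocabulary and IDF weights from the clean names,
# then returns the TF-IDF matrix for those same names.
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
clean_matrix = vectorizer.fit_transform(names_clean)

# Calling fit and then transform separately produces the same matrix.
two_step = TfidfVectorizer(min_df=1, analyzer=ngrams).fit(names_clean)
same_as_two_step = np.allclose(
    clean_matrix.toarray(), two_step.transform(names_clean).toarray()
)

# transform only applies what was learned: the dirty names are projected onto
# the vocabulary from the clean names, so both matrices have the same columns.
# Trigrams never seen during fit (e.g. 'apl' from 'Aple') are simply ignored.
dirty_matrix = vectorizer.transform(names_dirty)

# Rows are dirty names, columns are clean names.
similarity = cosine_similarity(dirty_matrix, clean_matrix)
best_match = [names_clean[i] for i in similarity.argmax(axis=1)]

for dirty, clean in zip(names_dirty, best_match):
    print(f'{dirty!r} -> {clean!r}')
```

If you called fit_transform on the dirty names instead, the vectorizer would learn a different vocabulary, the columns of the two matrices would no longer line up, and comparing them would not make sense.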
