How fit_transform, transform and TfidfVectorizer work

I'm a machine learning beginner and I am trying to use cosine similarity for fuzzy matching. In the following example I want to compare 'data_dirty' with 'data_clean':

When I vectorize my data, I do not really understand the purpose of fit_transform, or WHY 'dirty_idf_matrix' is built with only transform, using the SAME vectorizer as 'clean_idf_matrix', which, if I understood correctly, stored what it learned during fit.

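import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# `ngrams` is not shown in the question; this character-trigram analyzer is an assumed stand-in.
def ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

Col_clean = 'fruits_normalized'
Col_dirty = 'fruits'

# Build the two tables
data_dirty = {Col_dirty: ['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']}
data_clean = {Col_clean: ['apple', 'pear', 'banana', 'apricot', 'pineapple']}

df_clean = pd.DataFrame(data_clean)
df_dirty = pd.DataFrame(data_dirty)

Name_clean = df_clean[Col_clean].unique()
Name_dirty = df_dirty[Col_dirty].unique()

# Learn the vocabulary and IDF weights from the clean names, then vectorize them
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
clean_idf_matrix = vectorizer.fit_transform(Name_clean)
# Reuse the fitted vocabulary and weights to vectorize the dirty names
dirty_idf_matrix = vectorizer.transform(Name_dirty)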

Thank you for your help!

Topic fuzzy-logic cosine-distance scikit-learn python machine-learning

Category Data Science


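I'm not really sure what you're asking, but in general you need to fit an estimator to data so it can learn what it has to do, and then you transform data with it. For TfidfVectorizer, fit learns the vocabulary and the IDF weights, and transform turns text into vectors using what was learned. fit_transform just does fit and then transform on the same data. Here you fit the vectorizer to Name_clean, then apply it to Name_clean and Name_dirty in turn, so both matrices share the same columns and can be compared with cosine similarity. That's pretty normal.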
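To make that concrete, here is a minimal sketch rather than a definitive implementation. The question does not show its `ngrams` analyzer, so `char_trigrams` below is a hypothetical stand-in that splits a lowercased string into character 3-grams. The variable names are illustrative too. It checks that both matrices end up with the same columns, that fit followed by transform gives the same result as fit_transform, and then picks the closest clean name for each dirty one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def char_trigrams(text):
    """Hypothetical analyzer (an assumption, not the asker's `ngrams`):
    lowercase the string and split it into character 3-grams."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


clean = ["apple", "pear", "banana", "apricot", "pineapple"]
dirty = ["I am an apple", "You are an apple", "Aple", "Appls", "Apples"]

# fit_transform: learn the vocabulary and IDF weights from the clean names,
# then return their TF-IDF vectors.
vectorizer = TfidfVectorizer(min_df=1, analyzer=char_trigrams)
clean_matrix = vectorizer.fit_transform(clean)

# transform: reuse the learned vocabulary and weights. Trigrams that never
# appeared in the clean names are simply ignored.
dirty_matrix = vectorizer.transform(dirty)

# Both matrices have one column per learned trigram, so rows are comparable.
same_columns = clean_matrix.shape[1] == dirty_matrix.shape[1]

# fit followed by transform gives the same result as fit_transform.
refit = TfidfVectorizer(min_df=1, analyzer=char_trigrams).fit(clean)
same_as_fit_transform = np.allclose(
    refit.transform(clean).toarray(), clean_matrix.toarray()
)

# Rows are dirty names, columns are clean names.
similarity = cosine_similarity(dirty_matrix, clean_matrix)
best_match = [clean[i] for i in similarity.argmax(axis=1)]

for name, match in zip(dirty, best_match):
    print(f"{name!r} -> {match!r}")
```

If you called fit_transform on the dirty names instead, the vectorizer would learn a different vocabulary, the two matrices would have different columns, and the cosine similarity between them would be meaningless or fail outright.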
