How do fit_transform, transform and TfidfVectorizer work?
I'm a machine learning beginner and I am trying to use cosine similarity for fuzzy matching. In the following example I want to compare 'data_dirty' with 'data_clean':
When I vectorize my data, I do not really understand the purpose of fit_transform, or WHY 'dirty_idf_matrix' is built with ONLY transform, using the SAME vectorizer as 'clean_idf_matrix', which (if I understood correctly) stored its values through fit.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Col_clean = 'fruits_normalized'
Col_dirty = 'fruits'
# read table
data_dirty = {Col_dirty: ['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']}
data_clean = {Col_clean: ['apple', 'pear', 'banana', 'apricot', 'pineapple']}
df_clean = pd.DataFrame(data_clean)
df_dirty = pd.DataFrame(data_dirty)
Name_clean = df_clean[Col_clean].unique()
Name_dirty = df_dirty[Col_dirty].unique()
# ngrams is a custom analyzer function defined elsewhere (not shown here)
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
clean_idf_matrix = vectorizer.fit_transform(Name_clean)
dirty_idf_matrix = vectorizer.transform(Name_dirty)
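For reference, here is a minimal sketch of what the two calls seem to do. The post does not show the ngrams analyzer, so a hypothetical character-trigram function, char_ngrams, is assumed in its place. The cosine_similarity step at the end is likewise an assumption about how the matrices are compared afterwards. fit_transform learns the vocabulary and IDF weights from the clean names and returns their matrix. transform then reuses that learned vocabulary, so both matrices share the same columns and their rows can be compared directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def char_ngrams(text, n=3):
    """Hypothetical stand-in for the post's `ngrams`: lowercase character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


clean = ['apple', 'pear', 'banana', 'apricot', 'pineapple']
dirty = ['I am an apple', 'You are an apple', 'Aple', 'Appls', 'Apples']

vectorizer = TfidfVectorizer(min_df=1, analyzer=char_ngrams)

# fit_transform = fit (learn vocabulary + IDF weights from `clean`)
# followed by transform (turn `clean` into a TF-IDF matrix).
clean_matrix = vectorizer.fit_transform(clean)

# transform only: reuse the vocabulary learned above. N-grams that never
# appeared in `clean` are ignored, so the column layout stays identical.
dirty_matrix = vectorizer.transform(dirty)

# Same number of columns, so each dirty row can be compared to each clean row.
n_features = len(vectorizer.vocabulary_)

# One row per dirty name, one column per clean name.
similarity = cosine_similarity(dirty_matrix, clean_matrix)
best_match = [clean[i] for i in similarity.argmax(axis=1)]
```

Calling fit_transform on the dirty names instead would build a second, different vocabulary, and the columns of the two matrices would no longer refer to the same n-grams.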
Thank you for your help!
Topic fuzzy-logic cosine-distance scikit-learn python machine-learning
Category Data Science