Document clustering to merge common labels
I am building a recommendation system and need to clean up some of the labels in my data. For example, running
df['resolution_modified'].value_counts()
Gives
105829
It is recommended to replace scanner 1732
It is recommended to reboot station 1483
It is recommended to replace printer 881
It is recommended to replace keyboard 700
...
It is recommended to update both computers in erc to ensure y be compliant with acme 1
It is recommended to configure and i have verify alignement printer be work now corrado 1
It is recommended to create rma for break devices please see tt for more information resolve this in favor of rma ticket create 1
It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update 1
It is recommended to switch out dpi head from break printers 1
Notice that 'It is recommended to replace keyboard' and 'It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update' are very similar. Ideally, I would like to converge to the string that occurs more frequently, so the second string should be converted to the first.
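To make the goal concrete, this is the kind of mapping I want to end up with and then apply (a hand-written sketch using just the one pair above):

mapping = {
    'It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update':
        'It is recommended to replace keyboard',
}
df['resolution_modified'] = df['resolution_modified'].replace(mapping)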
I am thinking of using document clustering for this. I have tried fuzzywuzzy, but since I have a lot of strings the process below is too slow (it compares every unique string against every other one, so it is quadratic in the number of unique strings):
from fuzzywuzzy import fuzz

def replace_similars(input_list):
    # Replace strings that are at least 90% similar with the earlier
    # occurrence, so later near-duplicates collapse onto earlier ones.
    for i in range(len(input_list)):
        for j in range(len(input_list)):
            if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list so the input is not mutated
    replace_similars(new_list)
    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]
    return mapping

res = df['resolution_modified'].unique().tolist()
res.sort()
mapping = generate_mapping(res)
for k, v in mapping.items():
    if k != v:
        df.loc[df['resolution_modified'] == k, 'resolution_modified'] = v
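On a small toy list the helpers behave as intended (the 'keyboards' string here is made up for illustration):

toy = sorted([
    'It is recommended to replace keyboard',
    'It is recommended to replace keyboards',  # made-up near-duplicate
    'It is recommended to reboot station',
])
print(generate_mapping(toy))
# {'It is recommended to reboot station': 'It is recommended to reboot station',
#  'It is recommended to replace keyboard': 'It is recommended to replace keyboard',
#  'It is recommended to replace keyboards': 'It is recommended to replace keyboard'}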
I wanted to know if there is a document clustering method that weights the strings that occur more than once, so that the less frequent strings related to them converge to the more frequently occurring string. Does anyone have a recommendation on which method to use?
What I have Tried Thus Far:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Vectorize the labels with TF-IDF, then cluster the vectors.
v = TfidfVectorizer()
x = v.fit_transform(df['resolution_modified'])
kmeans = KMeans(n_clusters=2).fit(x)

test_strings = ['It is recommended to replace keyboard',
                'It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update']
kmeans.predict(v.transform(test_strings))
Which gives
array([1, 0], dtype=int32)
Obviously this is not working so far; I will try increasing the number of clusters.
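If I do eventually get clusters that group the right strings together, my plan (a sketch, reusing kmeans and x from above) is to collapse each cluster onto its most frequent label:

# Sketch: map every label to the most frequent label in its cluster.
df['cluster'] = kmeans.labels_  # cluster id for every row
canonical = df.groupby('cluster')['resolution_modified'].agg(
    lambda s: s.value_counts().idxmax()  # most frequent label per cluster
)
df['resolution_modified'] = df['cluster'].map(canonical)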