Document clustering to merge common labels

I am building a recommendation system and need to clean up some of the labels I have. For example, running

df['resolution_modified'].value_counts()

gives

105829
It is recommended to replace scanner                                                                                                 1732
It is recommended to reboot station                                                                                                  1483
It is recommended to replace printer                                                                                                  881
It is recommended to replace keyboard                                                                                                 700
                                                                                                                                    ...  
It is recommended to update both computers in erc to ensure y be compliant with acme                                                    1
It is recommended to configure and i have verify alignement printer be work now corrado                                                 1
It is recommended to create rma for break devices please see tt for more information resolve this in favor of rma ticket create         1
It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update                        1
It is recommended to switch out dpi head from break printers                                                                            1

Notice that "It is recommended to replace keyboard" and "It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update" are very similar. Ideally, I would like to converge on the string that occurs more frequently, so the second string should be mapped to the first.

I am thinking of using document clustering to handle this. I have tried fuzzywuzzy, but since I have a lot of strings the process below is too slow:

from fuzzywuzzy import fuzz

def replace_similars(input_list):
    # Replace strings that are 90% or more similar with the earlier string
    for i in range(len(input_list)):
        for j in range(len(input_list)):
            if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

res = df['resolution_modified'].unique()
res.sort()
mapping = generate_mapping(res)
for k, v in mapping.items():
    if k != v:
        df.loc[df['resolution_modified'] == k, 'resolution_modified'] = v

I wanted to know if there is some document clustering method I could apply that takes the frequency of each string into account, so that the rarer strings are converged to the most similar frequently occurring string. Does anyone have a recommendation on which method to use?

What I Have Tried Thus Far:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Vectorize the raw strings with TF-IDF and cluster them
v = TfidfVectorizer()
x = v.fit_transform(df['resolution_modified'])
kmeans = KMeans(n_clusters=2).fit(x)

# These two strings should ideally land in the same cluster
test_strings = ['It is recommended to replace keyboard', 'It is recommended to replace keyboard manually clear hd space add to stale profile manager instal windows update']
kmeans.predict(v.transform(test_strings))

Which gives

array([1, 0], dtype=int32)

Obviously this is not working so far; I will try increasing the number of clusters.
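
For illustration, here is a sketch of that next step: cluster the unique strings with a larger cluster count and collapse each cluster onto its most frequent member. The cluster count of 50 is an arbitrary guess, not a tuned value, and the sketch assumes the column has no missing values.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Unique strings, most frequent first (value_counts sorts descending)
counts = df['resolution_modified'].value_counts()
unique_strings = counts.index.tolist()

v = TfidfVectorizer()
x = v.fit_transform(unique_strings)
kmeans = KMeans(n_clusters=50, random_state=0).fit(x)  # 50 is arbitrary

# Map every string in a cluster to that cluster's most frequent member
mapping = {}
for cluster_id in set(kmeans.labels_):
    members = [s for s, c in zip(unique_strings, kmeans.labels_) if c == cluster_id]
    for s in members:
        mapping[s] = members[0]  # first member = most frequent in this cluster

df['resolution_modified'] = df['resolution_modified'].map(mapping)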

Answer:

I might be misunderstanding something, but it looks to me like you are trying to find a complex method for a simple problem: if there are many strings which occur multiple times in the list, you should deduplicate the list before comparing all the pairs. You could use a set, but since you will need to count how frequent each string is, you should directly build a map (dictionary) which stores the frequency of every string: iterate over the list of strings and increment the count for each string (key) in the map. A minimal sketch of this counting step is shown below.
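
This sketch assumes the column name from your example and uses collections.Counter, which is just one convenient way to build the frequency map:

from collections import Counter

# Count how often each string occurs (a single pass over the column)
freq = Counter(df['resolution_modified'])

# Distinct strings, most frequent first
unique_strings = [s for s, _ in freq.most_common()]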

Depending on how many distinct strings you have, this simple step alone might be enough to make comparing all the pairs of strings feasible.

Then you could, for example, decide on a frequency threshold: say, keep only the strings which appear at least 10 times. Any string which appears fewer than 10 times is replaced with the frequent string (at least 10 occurrences) that has the highest similarity to it. A sketch of this replacement step follows.
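
Continuing from the freq counter above, here is one possible version of that step. I use rapidfuzz (a faster reimplementation of fuzzywuzzy) for the similarity lookup, and the threshold of 10 is just the example value:

from rapidfuzz import process, fuzz

THRESHOLD = 10  # example frequency cutoff

frequent = [s for s, c in freq.items() if c >= THRESHOLD]
rare = [s for s, c in freq.items() if c < THRESHOLD]

# Map each rare string to the most similar frequent string
mapping = {s: s for s in frequent}
for s in rare:
    match = process.extractOne(s, frequent, scorer=fuzz.ratio)
    mapping[s] = match[0] if match else s  # keep as-is if nothing matches

df['resolution_modified'] = df['resolution_modified'].map(mapping)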
