How can I use all possible spelling corrections of documents before clustering those documents?

I have a data set with many documents of 50 to 100 words each.

I need to clean the data by correcting the misspelled words in those documents.

I have an algorithm which predicts possible correct words for a misspelled word.

The problem is that I need to choose among, or verify, the predictions made by that algorithm in order to clean the spelling errors in the documents.

Can I keep all the possible correct words predicted for each misspelling in the word vectors and still perform clustering on the data?

Topic: word2vec, nlp

Category: Data Science


This is a hard problem. I don't think spell correction is the best method to use here, since you already know its main issue, which is exactly what you mentioned. So here are some suggestions from me:

  1. Study the texts, determine where the misspellings occur, and provide the corrections through a simple mapping (see the first sketch after this list). However, this can be problematic, especially with named entities, e.g. you might wrongly correct maersk (the company) to makers. A second problem is that it is difficult to cover new cases unless you update the mapping once in a while. A third problem is that it involves a lot of manual work. Still, I believe this is the most reliable method for correction.

  2. Your NLP model is usually quite robust to misspellings unless they hit the most important words. There are also some NLP approaches that were introduced specifically to tackle this problem, e.g. character-level CNNs and fastText embeddings (see the second sketch after this list). I suggest reading through these.

  3. Use an advanced pretrained model and fine-tune it, e.g. BERT or GPT (see the third sketch after this list). Those models are already trained on millions of documents, possibly "dirtier" than your preprocessed corpus, and I believe they are robust enough for your use case.
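
As a rough illustration of suggestion 1, here is a minimal sketch of a hand-curated correction map applied to the raw text; the entries in `CORRECTIONS` and the `correct_document` helper are purely illustrative and would have to be built from your own corpus:

```python
import re

# Hand-curated corrections collected by inspecting the corpus
# (entries are illustrative only).
CORRECTIONS = {
    "recieve": "receive",
    "adress": "address",
    "enviroment": "environment",
}

def correct_document(text):
    """Replace known misspellings, leaving everything else (e.g. named entities) untouched."""
    def fix(match):
        word = match.group(0)
        return CORRECTIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", fix, text)

print(correct_document("Please recieve this at my adress"))
# -> "Please receive this at my address"
```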
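
To illustrate suggestion 2, below is a minimal sketch using gensim's FastText; its character n-gram subwords give a misspelled token a vector close to that of the correctly spelled word, so clustering can tolerate the errors. The toy corpus and all hyperparameter values are assumptions, not recommendations:

```python
from gensim.models import FastText

# Toy corpus of tokenised documents (illustrative only).
docs = [
    ["the", "shipment", "was", "delayed", "again"],
    ["the", "shipmnet", "arrived", "late"],  # "shipment" misspelled
    ["delivery", "was", "on", "time"],
]

# min_n/max_n control the character n-grams FastText uses to build
# vectors for out-of-vocabulary and misspelled words.
model = FastText(sentences=docs, vector_size=50, window=3,
                 min_count=1, min_n=3, max_n=5, epochs=50)

# The misspelling ends up close to the correct word in vector space,
# so no explicit spell correction is needed before clustering.
print(model.wv.similarity("shipment", "shipmnet"))
```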
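
For suggestion 3, one common shortcut to a pretrained BERT-style encoder is the sentence-transformers library; the sketch below embeds the raw, uncorrected documents and clusters them directly. The model name, the documents, and the cluster count are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "The shipmnet was delayed by two weeks.",   # contains a misspelling
    "Delivery arrived on time and intact.",
    "Customer complained about the late shipment.",
]

# Pretrained sentence encoder; its subword tokenizer copes with many
# misspellings without any explicit correction step.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

# Cluster the document embeddings.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)
```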

Those are my suggestions. Hope it helps.


The problem is that you cannot be 100% sure that your algorithm's corrections will be right. Such an algorithm does not exist yet.

So you have two options:

  1. Either you correct the algorithm's errors yourself, but that would take very long, and if your aim is to study the performance of a method it would bias the results. I would not recommend that.
  2. Or you just apply your algorithm and either set a probability threshold, i.e. you let the process correct a word only if the correction probability is higher than an arbitrary value, say 75%, or you always accept the correction with the highest probability (see the sketch below). You can then apply your clustering (or classification) model, and even if some of the corrections were wrong, it should improve the overall result, so it should be better than not correcting at all.
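
Here is a minimal sketch of option 2, assuming your algorithm can be wrapped as a function that returns (candidate, probability) pairs; the `predict_corrections` stub and the 0.75 threshold are placeholders for your own model and chosen cut-off:

```python
THRESHOLD = 0.75  # arbitrary cut-off, as discussed above

def predict_corrections(word):
    """Stand-in for your algorithm: returns (candidate, probability) pairs."""
    toy_predictions = {"shipmnet": [("shipment", 0.92), ("shipments", 0.05)]}
    return toy_predictions.get(word, [])

def correct_tokens(tokens):
    corrected = []
    for token in tokens:
        candidates = predict_corrections(token)
        if candidates:
            best, prob = max(candidates, key=lambda c: c[1])
            # Accept the top candidate only if it clears the threshold.
            if prob >= THRESHOLD:
                token = best
        corrected.append(token)
    return corrected

print(correct_tokens(["the", "shipmnet", "was", "late"]))
# -> ['the', 'shipment', 'was', 'late']
```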
