Text similarity for badly written text
Consider the following scenario:
Suppose two lists of words $L_{1}$ and $L_{2}$ are given. $L_{1}$ contains only badly written phrases (like '4ge' instead of 'age' or 'blwe' instead of 'blue', etc.). On the other hand, each element of $L_{2}$ is the well-written version of some element of $L_{1}$.
Here is an example:
$$L_{1}=[\ldots, \text{dqta 5ciencc}, \ldots, \text{s7ack exch9nge}, \ldots],$$ $$L_{2}=[\ldots, \text{stack exchange}, \ldots, \text{data science}, \ldots].$$
Problem: Is there any strategy to predict which element $w^{\prime}$ in $L_{2}$ is the syntactically correct counterpart of a given badly written element $w$ of $L_{1}$?
By 'strategy' I mean some sort of syntactic word embedding (that would allow us to compare texts using cosine similarity), some syntactic Word2Vec, or a probabilistic model that could compute $P(w^{\prime} \mid w)$ (how likely it is that $w^{\prime}$ is the well-written version of $w$), etc.
Note: To be concrete, I'm asking for a measure of syntactic similarity between two pieces of text.
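For example, something along the following lines is the kind of baseline I have in mind (a rough sketch, not a proposed solution: it just represents each phrase as a bag of character n-grams and picks the candidate in $L_{2}$ with the highest cosine similarity; the n-gram size is an arbitrary choice):

```python
# Rough sketch: character-n-gram "vectors" + cosine similarity.
from collections import Counter
from math import sqrt

def char_ngrams(text, n=2):
    """Bag of character n-grams, ignoring case and surrounding spaces."""
    text = text.lower().strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two Counter 'vectors'."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(w, candidates, n=2):
    """Return the candidate most similar to the badly written phrase w."""
    return max(candidates, key=lambda c: cosine_similarity(char_ngrams(w, n), char_ngrams(c, n)))

L1 = ["dqta 5ciencc", "s7ack exch9nge"]
L2 = ["stack exchange", "data science"]

for w in L1:
    print(w, "->", best_match(w, L2))
# dqta 5ciencc -> data science
# s7ack exch9nge -> stack exchange
```

I'm asking whether there is a more principled strategy (embeddings, a learned $P(w^{\prime} \mid w)$, etc.) than this kind of ad hoc matching.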
Thanks in advance.
Topic bert probability multilabel-classification multiclass-classification nlp
Category Data Science