Text similarity for badly written text
Consider the following scenario:
Suppose two lists of words $L_{1}$ and $L_{2}$ are given. $L_{1}$ contains only badly written phrases (like '4ge' instead of 'age' or 'blwe' instead of 'blue', etc.). On the other hand, each element of $L_{2}$ is the well-written version of some element of $L_{1}$.
Here is an example:
$$L_{1}=[\ldots, \text{dqta 5ciencc}, \ldots, \text{s7ack exch9nge}, \ldots],$$ $$L_{2}=[\ldots, \text{stack exchange}, \ldots, \text{data science}, \ldots].$$
Problem: Is there any strategy to predict which element $w^{\prime}$ in $L_{2}$ is the syntactically correct counterpart of a given badly written element $w$ of $L_{1}$?
By 'strategy' I mean some sort of syntactic word embedding (that would allow us to compare texts using cosine similarity), some syntactic Word2Vec, or a probabilistic model that could compute $P(w^{\prime} \mid w)$ (how likely it is that $w^{\prime}$ is the well-written version of $w$), etc.
Note: To be concrete, I'm asking for a measure of syntactic similarity between two pieces of text.
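For example, something along the following lines is the kind of baseline I have in mind (a rough sketch, not a proposed solution: it just represents each phrase as a bag of character n-grams and picks the candidate in $L_{2}$ with the highest cosine similarity; the n-gram size is an arbitrary choice):

```python
# Rough sketch: character-n-gram "vectors" + cosine similarity.
from collections import Counter
from math import sqrt

def char_ngrams(text, n=2):
    """Bag of character n-grams, ignoring case and surrounding spaces."""
    text = text.lower().strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two Counter 'vectors'."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(w, candidates, n=2):
    """Return the candidate most similar to the badly written phrase w."""
    return max(candidates, key=lambda c: cosine_similarity(char_ngrams(w, n), char_ngrams(c, n)))

L1 = ["dqta 5ciencc", "s7ack exch9nge"]
L2 = ["stack exchange", "data science"]

for w in L1:
    print(w, "->", best_match(w, L2))
# dqta 5ciencc -> data science
# s7ack exch9nge -> stack exchange
```

I'm asking whether there is a more principled strategy (embeddings, a learned $P(w^{\prime} \mid w)$, etc.) than this kind of ad hoc matching.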
Thanks in advance.
Topic bert probability multilabel-classification multiclass-classification nlp
Category Data Science