how to programmatically introduce grammatical errors in sentences

Question

Van Peer

January 14, 2021 10:42

I have a set of sentences in English. I'm exploring ways to programmatically create a dataset of sentences with grammatical errors. The following options have been tried out randomly:

identify verbs, prepositions etc. by POS tagging and change the tense or remove them
change the order of 2 or more words
remove commas, colons, semi-colons etc.

These are not always fool-proof. Are there any proven ways to approach this problem?
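For illustration, the three options above could be sketched roughly as follows. This is a minimal sketch, not a proven method: the function names and the hand-written `PREPOSITIONS` list are assumptions made for the example (a real pipeline would get prepositions from a POS tagger), and tense changes are left out because they need a morphological tool.

```python
import random
import re

# Tiny hand-written preposition list: an assumption for this sketch.
# A real pipeline would identify prepositions with a POS tagger instead.
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from"}
PUNCTUATION = re.compile(r"[,;:]")


def drop_preposition(tokens, rng):
    """Remove one randomly chosen preposition, if any is present."""
    positions = [i for i, t in enumerate(tokens) if t.lower() in PREPOSITIONS]
    if not positions:
        return list(tokens)
    i = rng.choice(positions)
    return tokens[:i] + tokens[i + 1:]


def swap_adjacent(tokens, rng):
    """Swap one randomly chosen pair of neighbouring words."""
    out = list(tokens)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def strip_punctuation(tokens, rng):
    """Delete commas, colons and semicolons (rng is unused, kept for a uniform signature)."""
    stripped = [PUNCTUATION.sub("", t) for t in tokens]
    return [t for t in stripped if t]


def corrupt(sentence, rng=None):
    """Apply one random perturbation; return (corrupted sentence, name of the operation)."""
    rng = rng or random.Random()
    op = rng.choice([drop_preposition, swap_adjacent, strip_punctuation])
    tokens = sentence.split()
    return " ".join(op(tokens, rng)), op.__name__
```

Calling `corrupt("She sat on the chair, quietly.")` returns the perturbed sentence together with the name of the operation applied, so each (original, corrupted) pair can be labelled. Note that a perturbation can leave the sentence unchanged or still grammatical, which is one reason this kind of approach is not fool-proof.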

Topic grammar-inference language-model nlp python

Category Data Science

Erwan · Accepted Answer · January 14, 2021 10:42

Generating artificial errors is generally risky in NLP, because it's difficult to make sure that the type and distribution of errors correspond exactly to real human errors. If the artificial errors diverge from real errors and a model is trained on this data, the model will appear to perform very well, since it will rely on the patterns used to generate the errors. However, it might not perform well on real data, and this would be difficult to detect.
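To make the distribution concern concrete, here is a minimal sketch of sampling error types according to target weights instead of uniformly, and of measuring how far a set of generated errors is from that target. The weights in `ERROR_WEIGHTS` are placeholders, not real statistics; the assumption is that they would be replaced with relative frequencies counted in a corpus of genuine human errors. The function names are made up for this example.

```python
import random
from collections import Counter

# Placeholder weights, NOT real measurements: replace them with relative
# frequencies counted in a corpus of genuine human errors.
ERROR_WEIGHTS = {"preposition": 0.5, "word_order": 0.2, "punctuation": 0.3}


def sample_error_types(n, weights=ERROR_WEIGHTS, seed=None):
    """Draw n error types with probability proportional to their weight."""
    rng = random.Random(seed)
    kinds = list(weights)
    return rng.choices(kinds, weights=[weights[k] for k in kinds], k=n)


def total_variation(counts, weights):
    """Half the L1 distance between observed proportions and target weights.

    0.0 means the two distributions are identical, 1.0 means they are disjoint.
    """
    n_obs = sum(counts.values())
    w_sum = sum(weights.values())
    keys = set(counts) | set(weights)
    return 0.5 * sum(
        abs(counts.get(k, 0) / n_obs - weights.get(k, 0) / w_sum) for k in keys
    )


if __name__ == "__main__":
    sampled = sample_error_types(1000, seed=0)
    print(Counter(sampled))
    print(total_variation(Counter(sampled), ERROR_WEIGHTS))
```

This only controls the mix of error types. It does not make the individual errors realistic, so checking the trained model against real erroneous sentences is still needed.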

That being said, the problem has been studied for quite a while, so the state of the art should help: Google Scholar gives a lot of references, and some of these papers probably provide existing implementations as well. Note that the concerns I mentioned above are a recurring question, with some recent papers analyzing how much artificial errors actually help.