How to programmatically introduce grammatical errors in sentences

I have a set of English sentences, and I'm exploring ways to programmatically create a dataset of sentences with grammatical errors. The following options have been tried out at random:

  • identify verbs, prepositions etc. by POS tagging and change the tense or remove them
  • change the order of 2 or more words
  • remove commas, colons, semi-colons etc.
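
For illustration, the options above could be sketched roughly as follows. This is a minimal sketch, not a proven method: the small `PREPOSITIONS` set is an assumption that stands in for a real POS tagger, the function names are hypothetical, and tense changes are left out because they need a proper tagger or morphological tool.

```python
import random
import re

# Assumption: a tiny hand-written list stands in for POS tagging.
PREPOSITIONS = {"in", "on", "at", "to", "of", "for", "with", "by", "from"}
PUNCTUATION_TO_DROP = {",", ";", ":"}


def tokenize(sentence):
    # Split into words (apostrophes kept inside words) and single punctuation marks.
    return re.findall(r"[\w']+|[^\w\s]", sentence)


def detokenize(tokens):
    # Join with spaces, then remove the space left before punctuation.
    return re.sub(r"\s+([,.;:!?])", r"\1", " ".join(tokens))


def is_word(token):
    return token[0].isalnum()


def drop_preposition(tokens, rng):
    # Remove one randomly chosen preposition, if the sentence has any.
    candidates = [i for i, t in enumerate(tokens) if t.lower() in PREPOSITIONS]
    if not candidates:
        return list(tokens)
    i = rng.choice(candidates)
    return tokens[:i] + tokens[i + 1:]


def swap_adjacent_words(tokens, rng):
    # Swap one randomly chosen pair of neighbouring words.
    candidates = [i for i in range(len(tokens) - 1)
                  if is_word(tokens[i]) and is_word(tokens[i + 1])]
    if not candidates:
        return list(tokens)
    i = rng.choice(candidates)
    result = list(tokens)
    result[i], result[i + 1] = result[i + 1], result[i]
    return result


def remove_punctuation(tokens, rng=None):
    # Drop commas, semicolons and colons; rng is unused but keeps one signature.
    return [t for t in tokens if t not in PUNCTUATION_TO_DROP]


OPERATIONS = {
    "drop_preposition": drop_preposition,
    "swap_adjacent_words": swap_adjacent_words,
    "remove_punctuation": remove_punctuation,
}


def corrupt(sentence, rng):
    # Apply one randomly chosen operation and report which one was used.
    name = rng.choice(sorted(OPERATIONS))
    return detokenize(OPERATIONS[name](tokenize(sentence), rng)), name


rng = random.Random(0)  # fixed seed so runs are repeatable
print(corrupt("She sat on the chair, and then she left.", rng))
```

Returning the name of the applied operation alongside the corrupted sentence makes it possible to label each example with its error type, and to skip cases where an operation left the sentence unchanged.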

These are not always foolproof. Are there any proven ways to approach this problem?

Topic grammar-inference language-model nlp python

Category Data Science


Generating artificial errors is generally risky in NLP, because it's difficult to make sure that the type and distribution of the errors match those of real human errors. If the artificial errors diverge from real errors and a model is trained on this data, the model will appear to perform very well when evaluated on similarly generated data, since it can rely on the patterns used to generate the errors. However, it might not perform well on real data, and this failure would be difficult to detect.
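
One way to act on the distribution concern, sketched here under loud assumptions, is to sample which error type to inject in proportion to how often each type occurs in real data. The counts below are made-up placeholders, not measurements; in practice they would have to come from an annotated corpus of real errors, and the function name is hypothetical.

```python
import random
from collections import Counter

# Placeholder counts only (assumption): replace with error-type frequencies
# tallied from a corpus of real human errors.
REAL_ERROR_COUNTS = {"preposition": 50, "word_order": 20, "punctuation": 30}


def sample_error_types(counts, n, rng):
    # Draw n error types with probability proportional to their counts.
    types = list(counts)
    weights = [counts[t] for t in types]
    return rng.choices(types, weights=weights, k=n)


rng = random.Random(0)  # fixed seed so runs are repeatable
plan = sample_error_types(REAL_ERROR_COUNTS, 1000, rng)
print(Counter(plan))  # sampled mix, to compare against the target counts
```

Matching the mix of error types does not guarantee that each individual error looks like a real one, so this addresses only part of the risk described above.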

That being said, this problem has been studied for quite a while, so the state of the art should help: Google Scholar gives a lot of references, and some of these papers probably provide existing implementations as well. Note that the concern mentioned above is a recurring question, with some recent papers analyzing how much artificial errors actually help.
