How to generate syntactically correct text for a CRNN-CTC text model?
Disregarding the image creation and labeling details, is there a way to generate syntactically correct text examples? As I currently understand the CTC model, it takes into account how likely a given letter is to precede or follow another in a sequence. For example:
Colorless green ideas sleep furiously
The sentence doesn't make sense; however, it has proper syntax: each word has a few vowels, verbs are where they should be, ... I want the word generator to take into account what is more or less valid and generate examples accordingly. I think generating completely random words, phrases, and so on introduces bias into the model. Here's another form which is still okay:
clabe lonkey sining slace 225
This is less valid than the previous example, but the words still have a proper syntax. Here's what I think is not good for model generalization:
jhsgdvj c3DDsdc csdce5445dchjv3 cdsIBcsc
This is usually the result of the random generation that I'm trying to avoid. A common practice followed by some text generators I found is to keep some sort of word files and use them as examples, but this limits the examples to the predetermined words and introduces character imbalance for the less frequent characters, e.g. z, x, ...
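To make the kind of generator described above more concrete, here is a minimal sketch of one possible approach: a character-level Markov chain fitted on a small seed word list, so that new words follow letter transitions seen in real words without being limited to the words themselves. The seed list, the `order` parameter, and the function names `build_model` and `generate_word` are all hypothetical and chosen only for illustration; this has not been evaluated for CRNN-CTC training.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # padding symbols; assumed not to occur in the seed words


def build_model(words, order=2):
    """Map each `order`-character context to the characters that followed it."""
    model = defaultdict(list)
    for word in words:
        padded = START * order + word.lower() + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate_word(model, order=2, max_len=12, rng=random):
    """Sample one pseudo-word by walking the transitions until END or max_len."""
    context = START * order
    chars = []
    while len(chars) < max_len:
        nxt = rng.choice(model[context])  # frequent transitions are drawn more often
        if nxt == END:
            break
        chars.append(nxt)
        context = context[1:] + nxt
    return "".join(chars)


# Hypothetical seed words; in practice this would be a much larger word file.
seed_words = ["colorless", "green", "ideas", "sleep", "furiously",
              "clever", "monkey", "singing", "place", "zebra", "xenon"]
model = build_model(seed_words, order=2)
print(" ".join(generate_word(model, order=2) for _ in range(5)))
```

With a smaller `order` the output drifts toward random strings, and with a larger one it tends to reproduce the seed words, so the parameter roughly controls how "valid" the generated words look. The character imbalance mentioned above is not addressed by this sketch.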
Topic text-classification text-generation text
Category Data Science