Create an RNN for text sources with different lengths
I want to create an RNN to generate new text based on many examples of existing texts of a certain format in the training data. Each text in the training data consists of 3 segments, like so:
Example text 1:
[Segment 0, ~20 characters]
[Segment 1, ~200 characters]
[Segment 2, ~400 characters]
It is worth mentioning that the segments are all alphanumeric, but of varying structure: Segment 1 contains more numbers, and Segment 2 contains more letters, with references to occurrences in Segment 1.
I have training data with many instances of such example texts, but the length of each segment varies between examples.
In most RNN text-generation examples I have seen, the training set consists of a single long text which is split into sequences of a fixed length for training. I guess that could be done in my case as well by simply concatenating all of the different examples into one long string, then dividing that long string into input/target sequence pairs of a fixed length, say 100 characters, and using them for training.
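For concreteness, this is roughly the concatenation approach I mean (the example texts, the 100-character window and the character-level encoding are just placeholders, not my actual data):

```python
# Minimal sketch of the concatenation approach; texts, seq_len and the
# character encoding are illustrative placeholders only.
import numpy as np

texts = ["example text 1 ...", "example text 2 ..."]  # all training examples
corpus = "\n".join(texts)                              # one long string

chars = sorted(set(corpus))
char_to_idx = {c: i for i, c in enumerate(chars)}

seq_len = 100
inputs, targets = [], []
# Slide a fixed-length window over the corpus; the target is the input shifted by one.
for i in range(0, len(corpus) - seq_len):
    inputs.append([char_to_idx[c] for c in corpus[i:i + seq_len]])
    targets.append([char_to_idx[c] for c in corpus[i + 1:i + seq_len + 1]])

X = np.array(inputs)   # shape: (num_sequences, seq_len)
Y = np.array(targets)  # shape: (num_sequences, seq_len)
```

With this setup the windows cut across example boundaries, which is exactly what I am unsure about.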
My question is whether I can instead take advantage of the fact that the text is already split up into segments which resemble what I want to generate, and use each example as a batch for training.
It feels to me like it would be easier for the RNN to learn the relations between the different text segments if it were fed data (input + target string) belonging to one example at a time, as opposed to data (input + target) spanning the entire training set. A sketch of what I have in mind follows below.
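Here is a rough sketch of the per-example alternative, padding each example so that variable lengths fit in a batch (PyTorch, the vocabulary mapping and the padding index are my own illustrative assumptions, not a fixed choice):

```python
# Rough sketch of per-example input/target pairs with padding; the texts,
# vocabulary and PAD index are illustrative assumptions.
import torch
from torch.nn.utils.rnn import pad_sequence

texts = ["segment0\nsegment1\nsegment2", "another full example ..."]  # placeholders

chars = sorted(set("".join(texts)))
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}  # reserve index 0 for padding

# One (input, target) pair per example: the target is the input shifted by one character.
encoded = [torch.tensor([char_to_idx[c] for c in t]) for t in texts]
inputs  = [seq[:-1] for seq in encoded]
targets = [seq[1:]  for seq in encoded]

# Pad to the longest example so variable-length examples fit in one tensor;
# the loss would then ignore the padding index, e.g. CrossEntropyLoss(ignore_index=0).
X = pad_sequence(inputs,  batch_first=True, padding_value=0)
Y = pad_sequence(targets, batch_first=True, padding_value=0)
```

This way every training sequence starts at the beginning of an example and ends at its end, instead of spanning across examples.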
Is there any aspect that I am overlooking which would favour simply concatenating all training data into one large string?
Many thanks in advance!
Tags: generative-models, text-generation, rnn, neural-network
Category: Data Science