Create an RNN for text sources with different lengths
I want to create an RNN to generate new text based on many examples of existing texts of a certain format in the training data. Each text in the training data consists of 3 segments, like so:
Example text 1:
[Segment 0, ~20 characters]
[Segment 1, ~200 characters]
[Segment 2, ~400 characters]
It is worth mentioning that the segments are all alphanumeric, but of varying structure: Segment 1 contains more numbers, and Segment 2 contains more letters, with references to occurrences in Segment 1.
I have training data with many instances of such example texts, but the length of each segment varies between examples.
In most RNN text-generation examples I have seen, the training set consists of a single long text which is split into sequences of a fixed length for training. I guess that could be done in my case as well by simply concatenating all of the different examples into one long string, then dividing that long string into input/target sequence pairs of a fixed length, say 100 characters, and using them for training.
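For concreteness, this is roughly the concatenation approach I mean (the example texts, the 100-character window and the character-level encoding are just placeholders, not my actual data):

```python
# Minimal sketch of the concatenation approach; texts, seq_len and the
# character encoding are illustrative placeholders only.
import numpy as np

texts = ["example text 1 ...", "example text 2 ..."]  # all training examples
corpus = "\n".join(texts)                              # one long string

chars = sorted(set(corpus))
char_to_idx = {c: i for i, c in enumerate(chars)}

seq_len = 100
inputs, targets = [], []
# Slide a fixed-length window over the corpus; the target is the input shifted by one.
for i in range(0, len(corpus) - seq_len):
    inputs.append([char_to_idx[c] for c in corpus[i:i + seq_len]])
    targets.append([char_to_idx[c] for c in corpus[i + 1:i + seq_len + 1]])

X = np.array(inputs)   # shape: (num_sequences, seq_len)
Y = np.array(targets)  # shape: (num_sequences, seq_len)
```

With this setup the windows cut across example boundaries, which is exactly what I am unsure about.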
My question is whether I can instead take advantage of the fact that the text is already split up into segments which resemble what I want to generate, and use each example as a batch for training.
It feels to me like it would be easier for the RNN to learn the relations between the different text segments if it were fed data (input + target string) belonging to one example at a time, as opposed to data (input + target) spanning the entire training set. A sketch of what I have in mind follows below.
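Here is a rough sketch of the per-example alternative, padding each example so that variable lengths fit in a batch (PyTorch, the vocabulary mapping and the padding index are my own illustrative assumptions, not a fixed choice):

```python
# Rough sketch of per-example input/target pairs with padding; the texts,
# vocabulary and PAD index are illustrative assumptions.
import torch
from torch.nn.utils.rnn import pad_sequence

texts = ["segment0\nsegment1\nsegment2", "another full example ..."]  # placeholders

chars = sorted(set("".join(texts)))
char_to_idx = {c: i + 1 for i, c in enumerate(chars)}  # reserve index 0 for padding

# One (input, target) pair per example: the target is the input shifted by one character.
encoded = [torch.tensor([char_to_idx[c] for c in t]) for t in texts]
inputs  = [seq[:-1] for seq in encoded]
targets = [seq[1:]  for seq in encoded]

# Pad to the longest example so variable-length examples fit in one tensor;
# the loss would then ignore the padding index, e.g. CrossEntropyLoss(ignore_index=0).
X = pad_sequence(inputs,  batch_first=True, padding_value=0)
Y = pad_sequence(targets, batch_first=True, padding_value=0)
```

This way every training sequence starts at the beginning of an example and ends at its end, instead of spanning across examples.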
Is there any aspect that I am overlooking which would favour simply concatenating all training data into one large string?
Many thanks in advance!
Tags: generative-models, text-generation, rnn, neural-network
Category: Data Science