Is it possible to target a specific output length range with BART seq2seq?

I'm currently working on an extractive summarization model based on Facebook's BART model. A consistent absolute output length would be highly desirable, but the problem is that the input length may vary wildly. That is to say, when creating the training data, the annotation instructions look like this:

  1. Take the input text (a news article) and start (recursively) deleting examples, excess details, unnecessary background information, quotes, etc.
  2. Once your summary has fewer than 90 words, stop deleting.
  3. Fix up the text format to match the style guide.

I fine-tuned the large BART model available on Hugging Face on 200 samples. All 200 samples had output sequences of 60-88 words. However, the fine-tuned model produces outputs ranging from 50 to 105 words, with some outliers as long as 120 words.

Now I'm questioning whether simply throwing more samples at the problem will actually fix it. Since the model follows the style guide very well, I don't want to give up on this approach. Outputs that are too long can be suppressed by cranking up the length penalty, but that would make the too-short case even more prevalent.
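For context, the length penalty I'm referring to is the `length_penalty` argument of `generate()` in Hugging Face transformers. A minimal sketch of the current setup (the checkpoint name, beam count, and penalty value here are illustrative, not my exact configuration):

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Illustrative values only; the checkpoint and numbers are placeholders.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article_text = "Some long news article text ..."
inputs = tokenizer(article_text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(
    **inputs,
    num_beams=4,
    length_penalty=2.0,  # rescales beam scores by sequence length, so it only
                         # biases the output length rather than bounding it
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```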

Can fine-tuning achieve a tighter range of output lengths simply by providing more examples? Or is there perhaps a hackier solution that penalizes output lengths outside the target range?

Topic: transformer, sequence-to-sequence, nlp

Category: Data Science


The answer was to go a level lower, into the actual transformer config, and force the model to generate sequences of 64-128 tokens. Setting this before training forces the model to adapt to the constraint, and these hard bounds guarantee that outputs fall only within the specified range.
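A minimal sketch of what that config change might look like, assuming the `min_length` and `max_length` fields of the Hugging Face `BartConfig` (newer transformers versions expose the same settings on `model.generation_config`); the 64/128 values come from the answer above, and the checkpoint name is a placeholder:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Hard length bounds set before fine-tuning.
model.config.min_length = 64    # EOS is suppressed until 64 tokens have been generated
model.config.max_length = 128   # generation is cut off at 128 tokens

# ... fine-tune as usual (e.g. with Seq2SeqTrainer); at inference time,
# generate() picks up these bounds unless they are overridden per call.
inputs = tokenizer("Some news article text ...", return_tensors="pt")
summary_ids = model.generate(**inputs, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Note that these bounds are in tokens, not words, so they only approximate a 60-90 word target; the model still has to learn from the training data where within that window to stop.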
