Predict indices of text using deep learning

I want to predict the start and end indices of the spans of text in which a particular propaganda technique is used, such as smears, name-calling or loaded language. Some examples from the dataset are:

['THERE ARE ONLY TWO GENDERS\n\nFEMALE \n\nMALE\n', 'This is not an accident!', "SO BERNIE BROS HAVEN'T COMMITTED VIOLENCE EH?\n\nPOWER COMES FROM THE BARREL OF A GUN, COMRADES.\n\nWHAT ABOUT THE ONE WHO SHOT CONGRESSMAN SCALISE OR THE DAYTON OHIO MASS SHOOTER?\n"]

[[[0, 41]], [], [[47, 83], [3, 14], [33, 41], [163, 175], [85, 93], [0, 176]]]

So [0, 41] means that the whole text of the first example, from index 0 to 41, falls under a certain category.

The second example contains no propaganda technique, so its list of spans is empty.

In the third example, the span from 47 to 83, 'POWER COMES FROM THE BARREL OF A GUN', is labelled 'Slogan', and the span from 3 to 14, 'BERNIE BROS', is labelled 'name calling'.
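
To make the indexing convention concrete, the short check below slices the third example with the spans listed above. It assumes, as the examples suggest, that the indices are character offsets with an exclusive end, so plain Python slicing applies; the variable names are only illustrative.

```python
# Third example from the dataset, with its annotated spans.
text = (
    "SO BERNIE BROS HAVEN'T COMMITTED VIOLENCE EH?\n\n"
    "POWER COMES FROM THE BARREL OF A GUN, COMRADES.\n\n"
    "WHAT ABOUT THE ONE WHO SHOT CONGRESSMAN SCALISE "
    "OR THE DAYTON OHIO MASS SHOOTER?\n"
)
spans = [[47, 83], [3, 14], [33, 41], [163, 175], [85, 93], [0, 176]]

# Assumption: [start, end] are character offsets and end is exclusive.
fragments = [text[start:end] for start, end in spans]
for (start, end), fragment in zip(spans, fragments):
    print(start, end, repr(fragment))
```

Under that assumption the slices come out as 'POWER COMES FROM THE BARREL OF A GUN', 'BERNIE BROS', 'VIOLENCE', 'MASS SHOOTER', 'COMRADES' and, for [0, 176], the whole text without its trailing newline. Note that the spans overlap, which matters for how the labels are encoded.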

I have tried treating this as a regression problem with an LSTM model, but the results are very poor, as I expected. I am looking for the right approach to solve this problem. Any help will be highly appreciated. Thanks!

Topic lstm multilabel-classification sequence

Category Data Science


You could use a model pretrained as a "masked language model" (MLM), with a classification layer on top that predicts whether a (short) piece of text or sentence belongs to some class (the labels can be derived from the indices, I guess). A plain LSTM reads the text in one direction only (start to end), while a bidirectional encoder (BERT-like models) uses context from both directions, which is a great improvement.
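
One way the labels could be derived from the indices is sketched below: every token that overlaps an annotated span gets label 1, all others get 0. This is only an illustration under some assumptions: the whitespace split stands in for a real model tokenizer, the labels are binary because the question does not list the technique for every span, and `token_labels` is a hypothetical helper name.

```python
import re

def token_labels(text, spans):
    """Label each whitespace-separated token 1 if it overlaps any span, else 0."""
    labelled = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.start(), match.end()
        # Two half-open intervals overlap if each starts before the other ends.
        inside = any(tok_start < end and start < tok_end for start, end in spans)
        labelled.append((match.group(), tok_start, tok_end, int(inside)))
    return labelled

text = "SO BERNIE BROS HAVEN'T COMMITTED VIOLENCE EH?"
spans = [[3, 14], [33, 41]]  # 'BERNIE BROS' and 'VIOLENCE' from the question
labels = token_labels(text, spans)
print([(tok, lab) for tok, _, _, lab in labels])
```

If the technique of each span is known, the same idea extends to one label sequence per technique, which also copes with overlapping spans.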

Original BERT uses MLM as well as "next sentence prediction" (NSP) during pretraining. However, MLM followed by a classification step may be sufficient. MLM works by first learning the nature of the text: random words are "masked" and the model tries to predict them. This is very helpful for the final (downstream) classification of which category a text belongs to.
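
The masking step can be pictured with the simplified sketch below. It is not BERT's exact recipe (BERT selects about 15% of the tokens and replaces only most of those with the mask token); the function name `mask_tokens` and its parameters are hypothetical and chosen for illustration.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with mask_token.

    Returns the masked input and, per position, the word to predict
    (None where the position was left untouched).
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)   # the model has to recover this word
        else:
            masked.append(tok)
            targets.append(None)  # not part of the prediction loss
    return masked, targets

tokens = "POWER COMES FROM THE BARREL OF A GUN".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```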

You may also use a pretrained BERT model and fine-tune it. Finding relevant spans of text (as in token classification or extractive question answering) is one of the downstream tasks BERT can be fine-tuned for.
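
Whatever model is fine-tuned, its per-token predictions still have to be turned back into start and end indices. A minimal sketch of that last step follows; the offsets and predictions are hard-coded, hypothetical values standing in for what a tokenizer and a fine-tuned token classifier would produce, and `labels_to_spans` is an illustrative name.

```python
def labels_to_spans(token_offsets, predictions):
    """Merge runs of consecutive positive tokens into [start, end] character spans."""
    spans = []
    current = None
    for (start, end), pred in zip(token_offsets, predictions):
        if pred == 1:
            if current is None:
                current = [start, end]  # open a new span
            else:
                current[1] = end        # extend the open span
        elif current is not None:
            spans.append(current)       # a negative token closes the span
            current = None
    if current is not None:
        spans.append(current)
    return spans

# Hypothetical character offsets of the tokens in
# "SO BERNIE BROS HAVEN'T COMMITTED VIOLENCE EH?" and made-up predictions.
offsets = [(0, 2), (3, 9), (10, 14), (15, 22), (23, 32), (33, 41), (42, 45)]
predictions = [0, 1, 1, 0, 0, 1, 0]
print(labels_to_spans(offsets, predictions))  # [[3, 14], [33, 41]]
```

A single binary sequence like this cannot represent overlapping spans, so for this dataset one prediction sequence per technique would probably be needed.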
