How do I split contents in a text that would include two or more different themes (context) in NLP?

Question

Ahmad Aburoman

February 26, 2022, 01:04

For example, given the text: The airlines have affected by Corona since march 2020 a crime has been detected in Noia village this morning

The output should be:

The airlines have affected by Corona since march 2020
a crime has been detected in Noia village this morning

The text has no breaks between the two parts. I know there is no one-click solution, but if anyone knows a methodology or technique for this kind of problem, please point me to some resources.

Topic text-classification nltk text-mining nlp

Category Data Science

Fatemeh Rahimi · Accepted Answer · April 27, 2021, 01:35

When I first read the question I thought it would be very easy, so I started searching and trying out some libraries (e.g. nltk and spaCy). My attempts showed clearly that none of them works on this input. I had assumed these libraries would build a parse tree or use a context-free grammar to detect sentences, but in my tests they split on certain characters such as periods or commas.
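To illustrate why punctuation-driven splitting cannot help here, below is a minimal sketch. The function `split_on_punctuation` is a hypothetical, deliberately simplified stand-in (a single regular expression), not the actual algorithm used by nltk or spaCy. It works when sentence-final punctuation is present and returns the example from the question as one unsplit piece.

```python
import re

def split_on_punctuation(text):
    # Hypothetical stand-in for a punctuation-driven splitter:
    # break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Example text from the question: no punctuation between the two themes.
text = ("The airlines have affected by Corona since march 2020 "
        "a crime has been detected in Noia village this morning")

# With punctuation present, the text is split into two sentences.
print(split_on_punctuation("The airlines were affected. A crime was detected."))

# Without punctuation, the whole text comes back as a single element.
print(split_on_punctuation(text))
```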

So the other option is to train a classifier that learns where sentence boundaries are. A simple approach is to treat it as a binary classification task for each token (does this token start a new sentence or not?) and use a random forest classifier or an SVM. The missing pieces are the data to train the model on and what to use as labels (a BIO tagging scheme is a good idea here). I include two papers here that can be a good starting point: read their methodology first, then use the datasets they mention for the task.

Du, Jinhua, Yan Huang, and Karo Moilanen. "AIG Investments. AI at the FinSBD task: Sentence boundary detection through sequence labelling and BERT fine-tuning." Proceedings of the First Workshop on Financial Technology and Natural Language Processing. 2019.
Rudrapal, Dwijen, et al. "Sentence boundary detection for social media text." (2015).
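To make the labelling idea above more concrete, here is a minimal sketch of how per-token labels could be derived from text that is already split into sentences, and how predicted labels could be turned back into sentences. The helper names `sentences_to_bio` and `bio_to_sentences` are hypothetical, whitespace tokenization is an assumption, and only the B (begins a sentence) and I (inside a sentence) tags are used, on the assumption that every token belongs to some sentence. The classifier itself is not shown; it would be trained to predict these labels from token features.

```python
def sentences_to_bio(sentences):
    """Build training data: one B/I label per whitespace token."""
    tokens, labels = [], []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # skip empty sentences
        tokens.extend(words)
        # First token of each sentence is "B", the rest are "I".
        labels.extend(["B"] + ["I"] * (len(words) - 1))
    return tokens, labels

def bio_to_sentences(tokens, labels):
    """Rebuild sentences from tokens and (predicted) B/I labels."""
    sentences = []
    for token, label in zip(tokens, labels):
        if label == "B" or not sentences:
            sentences.append([token])    # start a new sentence
        else:
            sentences[-1].append(token)  # continue the current one
    return [" ".join(words) for words in sentences]

# The two sentences from the question, used as a labelled example.
gold = [
    "The airlines have affected by Corona since march 2020",
    "a crime has been detected in Noia village this morning",
]
tokens, labels = sentences_to_bio(gold)
print(list(zip(tokens, labels)))
print(bio_to_sentences(tokens, labels))
```

At prediction time the unsplit text would be tokenized the same way, the trained model would assign a label to each token, and `bio_to_sentences` would produce the split.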
