How do I split a text that contains two or more different themes (contexts) in NLP?

For example, a text: The airlines have affected by Corona since march 2020 a crime has been detected in Noia village this morning

the output should be:

  • The airlines have affected by Corona since march 2020
  • a crime has been detected in Noia village this morning

The text has no breaks: no punctuation or line breaks separate the two parts. I know there is no one-click solution, but if anyone knows a methodology or technique for this kind of problem, please point me to some resources.

Topic text-classification nltk text-mining nlp

Category Data Science


On first reading the question I thought this would be very easy, so I started searching and trying out some libraries (e.g. nltk and spaCy). None of my attempts worked. I expected these libraries to build a parse tree or use a context-free grammar to detect sentences, but in my attempts they split based on certain characters such as periods or commas.
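To illustrate why punctuation-driven splitting fails here, the following is a minimal sketch. The `split_on_punctuation` helper is a hypothetical, simplified stand-in written for this example; it is not the nltk or spaCy implementation, only an approximation of the behaviour described above.

```python
import re


def split_on_punctuation(text):
    """Hypothetical splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# The same two statements, once with periods and once without any breaks.
punctuated = (
    "The airlines have affected by Corona since march 2020. "
    "A crime has been detected in Noia village this morning."
)
unpunctuated = (
    "The airlines have affected by Corona since march 2020 "
    "a crime has been detected in Noia village this morning"
)

print(len(split_on_punctuation(punctuated)))    # 2
print(len(split_on_punctuation(unpunctuated)))  # 1
```

With periods the text splits into two parts; without them the splitter has nothing to anchor on and returns the whole text as a single segment.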

So the other solution is to train a classifier that learns where sentence boundaries are. A simple approach is to use a random forest or an SVM as a binary classifier over tokens, predicting for each token whether it starts a new sentence. The missing pieces are the data to train the model on and the choice of labels (a BIO tagging scheme is a good idea here). I include two papers here that can be a good starting point: first read their methodology, then use the datasets they mention for the task.
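As a rough sketch of how the labels could be set up, training data can be generated by joining known sentences without punctuation and tagging the first token of each sentence as `B` and every other token as `I` (a reduced form of BIO, since no token here falls outside a sentence). The function names and the feature choices below are assumptions made for illustration, not something taken from the papers.

```python
def make_example(sentences):
    """Join sentences without punctuation and label each token.

    'B' marks the first token of a sentence, 'I' every other token.
    """
    tokens, labels = [], []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # skip empty sentences
        tokens.extend(words)
        labels.extend(["B"] + ["I"] * (len(words) - 1))
    return tokens, labels


def token_features(tokens, i):
    """Hypothetical context features for token i (illustrative choices only)."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "is_digit": word.isdigit(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def segments_from_labels(tokens, labels):
    """Rebuild text segments from per-token B/I labels."""
    segments = []
    for token, label in zip(tokens, labels):
        if label == "B" or not segments:
            segments.append([token])  # start a new segment
        else:
            segments[-1].append(token)
    return [" ".join(segment) for segment in segments]


tokens, labels = make_example([
    "The airlines have affected by Corona since march 2020",
    "a crime has been detected in Noia village this morning",
])
features = [token_features(tokens, i) for i in range(len(tokens))]
print(segments_from_labels(tokens, labels))
```

The feature dictionaries could then be vectorized and passed to a random forest or SVM, which predicts `B` or `I` for each token of an unseen text; `segments_from_labels` turns those predictions back into segments. In practice a much larger corpus and richer features would be needed than this two-sentence example suggests.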
