Converting paragraphs into sentences

I'm looking for ways to extract sentences from paragraphs of text that contain various kinds of punctuation. I started with spaCy's Sentencizer.

Sample input (a Python list of abstracts):

[A total of 2337 articles were found, and, according to the inclusion and exclusion criteria used, 22 articles were included in the study. Inhibitory activity against 96% (200/208) and 95% (312/328) of the pathogenic fungi tested was described for Eb and [(PhSe)2], respectively. Including in these 536 fungal isolates tested, organoselenium activity was highlighted against Candida spp., Cryptococcus ssp., Trichosporon spp., Aspergillus spp., Fusarium spp., Pythium spp., and Sporothrix spp., with MIC values lower than 64 mug/mL. In conclusion, Eb and [(PhSe)2] have a broad spectrum of in vitro inhibitory antifungal activity.]

Code:

from spacy.lang.en import English

nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

# print the sentences of the first five abstracts
for abstract in abstracts[:5]:
    doc = nlp(abstract)
    for sent in doc.sents:
        print(sent)

Output:

A total of 2337 articles were found, and, according to the inclusion and exclusion criteria used, 22 articles were included in the study.
Inhibitory activity against 96% (200/208) and 95% (312/328) of the pathogenic fungi tested was described for Eb and [(PhSe)2], respectively.
Including in these 536 fungal isolates tested, organoselenium activity was highlighted against Candida spp.,
Cryptococcus ssp.,
Trichosporon spp.,
Aspergillus spp.,
Fusarium spp.,
Pythium spp.,
and Sporothrix spp.,
with MIC values lower than 64 mug/mL. In conclusion, Eb and [(PhSe)2] have a broad spectrum of in vitro inhibitory antifungal activity.

It works fine for normal text but fails when dots (.) appear somewhere other than at the end of a sentence, which breaks the sentence apart as shown in the output above. How can this be addressed? Are there any other proven methods or libraries for this task?

Topic information-extraction spacy tokenization nlp

Category Data Science


spaCy's Sentencizer is very simple. However, spaCy 3.0 includes SentenceRecognizer, which is a trainable sentence tagger and should behave better. Here is the issue with the details of its inception. You can train it if you have sentence-segmented data.
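For reference, a minimal sketch of the relevant part of a spaCy 3 training config — the SentenceRecognizer is registered under the factory name "senter"; a complete config also needs the usual [paths], [corpora], and [training] sections, so treat this as an outline rather than a ready-to-run file:

```ini
[nlp]
lang = "en"
pipeline = ["senter"]

[components.senter]
factory = "senter"
```

With sentence-segmented training data converted to spaCy's binary format, you would then train it with the `spacy train` CLI.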

Another option is NLTK's sent_tokenize, which should give better results than spaCy's Sentencizer. I have tested it with your example and it works well.

from nltk.tokenize import sent_tokenize
sent_tokenize("A total....")

Finally, if sent_tokenize does not handle some abbreviations well and you have a list of abbreviations to support (like "spp." in your example), you could use NLTK's PunktSentenceTokenizer:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
# Punkt stores abbreviations lowercase and without the trailing period
abbreviation = ['spp', 'ssp']
punkt_param.abbrev_types = set(abbreviation)
tokenizer = PunktSentenceTokenizer(punkt_param)
tokenizer.tokenize("A total ....")
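Applied to a sentence built from your abstract, this keeps "spp." from triggering a sentence break while the real boundary after "mug/mL." is still found (a sketch; note again that Punkt expects the abbreviations lowercase and without the trailing period):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# register the abbreviations so their periods are not sentence boundaries
punkt_param = PunktParameters()
punkt_param.abbrev_types = {'spp', 'ssp'}
tokenizer = PunktSentenceTokenizer(punkt_param)

text = ("Inhibitory activity was described against Candida spp. with MIC "
        "values lower than 64 mug/mL. In conclusion, Eb has a broad "
        "spectrum of antifungal activity.")
for sentence in tokenizer.tokenize(text):
    print(sentence)
```

This should print exactly two sentences, split only after "mug/mL.".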

There is nothing in spaCy that you can use out of the box. However, spaCy allows you to add custom pipeline components.

To solve your problem, I see at least three ways to do it.

  • NLTK

NLTK allows you to add known abbreviations as exceptions. See this StackOverflow post.

  • Use a regular expression

Since your problem is that some dots should not mark the start of a new sentence, you could customize a basic regular expression to encode that behavior. Here is a StackOverflow answer that could get you started.

  • Post-processing

You could also use spaCy's default segmentation and then merge back together any sentence fragments whose predecessor ends with a known abbreviation. It's not incredibly elegant, but it will work.
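The post-processing idea can be sketched without any spaCy-specific code: take the fragments the Sentencizer produced and fold each one into the previous fragment whenever that fragment ends with a known abbreviation. The helper name and abbreviation list below are mine, for illustration:

```python
# abbreviations after which a split should be undone
# (with and without the trailing comma, as seen in the Sentencizer output)
ABBREVIATIONS = ("spp.,", "ssp.,", "spp.", "ssp.")

def merge_fragments(fragments, abbreviations=ABBREVIATIONS):
    """Merge sentence fragments that were split after a known abbreviation."""
    merged = []
    for fragment in fragments:
        if merged and merged[-1].endswith(abbreviations):
            # previous fragment ended mid-sentence: glue this one onto it
            merged[-1] = merged[-1] + " " + fragment
        else:
            merged.append(fragment)
    return merged

# the broken fragments from the Sentencizer output above
fragments = [
    "Organoselenium activity was highlighted against Candida spp.,",
    "Cryptococcus ssp.,",
    "Trichosporon spp.,",
    "and Sporothrix spp.,",
    "with MIC values lower than 64 mug/mL.",
]
print(merge_fragments(fragments))
```

This recombines the five fragments into the single original sentence; the obvious limitation is that the abbreviation list has to be maintained by hand.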
