Converting paragraphs into sentences
I'm looking for ways to extract sentences from paragraphs of text that contain many different kinds of punctuation. I used spaCy's Sentencizer to begin with.
Sample input (a Python list named abstracts):
[A total of 2337 articles were found, and, according to the inclusion and exclusion criteria used, 22 articles were included in the study. Inhibitory activity against 96% (200/208) and 95% (312/328) of the pathogenic fungi tested was described for Eb and [(PhSe)2], respectively. Including in these 536 fungal isolates tested, organoselenium activity was highlighted against Candida spp., Cryptococcus ssp., Trichosporon spp., Aspergillus spp., Fusarium spp., Pythium spp., and Sporothrix spp., with MIC values lower than 64 mug/mL. In conclusion, Eb and [(PhSe)2] have a broad spectrum of in vitro inhibitory antifungal activity.]
Code:
from spacy.lang.en import English

nlp = English()
# spaCy v2 API; in spaCy v3 use nlp.add_pipe("sentencizer") instead
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

# print the sentences of the first five abstracts
for abstract in abstracts[:5]:
    doc = nlp(abstract)
    for sent in doc.sents:
        print(sent)
Output:
A total of 2337 articles were found, and, according to the inclusion and exclusion criteria used, 22 articles were included in the study.
Inhibitory activity against 96% (200/208) and 95% (312/328) of the pathogenic fungi tested was described for Eb and [(PhSe)2], respectively.
Including in these 536 fungal isolates tested, organoselenium activity was highlighted against Candida spp.,
Cryptococcus ssp.,
Trichosporon spp.,
Aspergillus spp.,
Fusarium spp.,
Pythium spp.,
and Sporothrix spp.,
with MIC values lower than 64 mug/mL. In conclusion, Eb and [(PhSe)2] have a broad spectrum of in vitro inhibitory antifungal activity.
It works fine for normal text but fails when dots (.) appear anywhere other than at the end of a sentence, for example in abbreviations such as spp.; the sentence is then broken into pieces, as the output above shows. How can I address this? Are there any other proven methods or libraries for this task?
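To make the problem concrete, here is a minimal sketch of one rule-based alternative, using only the standard library. It is not spaCy's algorithm: it assumes a sentence ends at ., ! or ? followed by whitespace and then an uppercase letter, [ or (, and it skips any candidate boundary whose preceding token is in a small abbreviation list. The names split_sentences and ABBREVIATIONS, and the contents of the list, are hypothetical and would need tuning for a real corpus.

```python
import re

# Hypothetical abbreviation list for illustration; extend it for your own data.
ABBREVIATIONS = {"spp.", "ssp.", "sp.", "e.g.", "i.e.", "al.", "vs."}

# Candidate boundary: end punctuation, whitespace, then an uppercase letter, '[' or '('.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\[(])")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, ignoring dots that end a known abbreviation."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        tokens = candidate.split()
        last_token = tokens[-1].lower() if tokens else ""
        if last_token in abbreviations:
            # The dot belongs to an abbreviation, so keep scanning.
            continue
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever remains after the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = (
        "Several genera were tested, e.g. Candida and Fusarium. "
        "MIC values were below 64 ug/mL. No resistance was found."
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

Because a dot followed by a comma (as in "spp.,") is never a candidate boundary under this rule, a list of genus names like the one in the abstract above would stay in a single sentence. The trade-off is that a sentence that really does end in a listed abbreviation will be merged with the one after it.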