How to stem plural words properly?

Question

Mahdi Ghajary

March 7, 2021, 14:06

I'm looking for a way to avoid removing a trailing s when it isn't a suffix. To do that, I first check whether a word exists in my index. If it does, I don't remove the trailing s; if it doesn't, I remove the trailing s and add the result to the index. The problem is what to do when I'm just starting to build the index.

Imagine we encounter books: I remove the s and add book to my index. On the other hand, I may encounter dangerous for the first time. Since it doesn't exist in my index yet, I remove the s and add dangerou, which is obviously wrong. What should I do?

Specifically, I'm looking for ways to properly detect whether an apparent suffix or prefix really is an affix or is part of the original word. One way that comes to mind is to use a formal dictionary and check words against it instead of against my own index.

P.S.: I'm not working on English docs. It's a college/prototype project, so I'm looking for general, good ideas with good accuracy, not advanced techniques with superb accuracy and considerable complexity.
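The dictionary check mentioned in the question could be sketched roughly as follows. This is only an illustration: the function name strip_suffix, the tiny English lexicon, and the single suffix "s" are assumptions made for the example, and a real lexicon would come from a word list for the target language.

```python
def strip_suffix(word, lexicon, suffixes=("s",)):
    """Strip a suffix only when the remaining form is a known dictionary word.

    `lexicon` is a hypothetical set of dictionary headwords; `suffixes` is a
    hypothetical list of candidate suffixes for the target language.
    """
    if word in lexicon:
        # The surface form is itself a dictionary entry: leave it untouched.
        return word
    for suffix in suffixes:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in lexicon:
                return candidate
    # Unknown either way: keeping the surface form avoids stems like "dangerou".
    return word


lexicon = {"book", "dangerous", "bus"}
for w in ["books", "dangerous", "buses"]:
    print(w, "->", strip_suffix(w, lexicon))
```

Because the lookup goes against a fixed external lexicon rather than the index being built, the result no longer depends on the order in which words are first encountered.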

Topic indexing nlp information-retrieval

Category Data Science

noe · Accepted Answer · February 5, 2021, 09:14

It seems that you are collecting the lemmas in your docs. For that, you need a lemmatizer.

If one is available for your language, you should use an external lemmatizer. Some packages supporting lemmatization for different languages are StanfordNLP (or its equivalent for Python, Stanza), spaCy, and NLTK.
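As a rough sketch of what using an external lemmatizer could look like, the snippet below wraps Stanza and collects lemma counts into an index. The helper names lemmatize and build_index are made up for this example, and it assumes Stanza ships a model for the language code you pass in, which you would need to check for your language.

```python
from collections import Counter


def lemmatize(text, lang):
    """Return (token, lemma) pairs for `text` using a Stanza pipeline.

    Assumes Stanza is installed and provides a model for `lang`.
    """
    import stanza  # imported lazily so build_index works without Stanza

    stanza.download(lang)  # downloads the model on first use
    nlp = stanza.Pipeline(lang=lang)  # loads the default processors for lang
    doc = nlp(text)
    return [(w.text, w.lemma) for s in doc.sentences for w in s.words]


def build_index(pairs):
    """Count how often each lemma occurs in a list of (token, lemma) pairs."""
    return Counter(lemma for _, lemma in pairs)


# Hand-written pairs standing in for lemmatizer output:
pairs = [("books", "book"), ("book", "book"), ("dangerous", "dangerous")]
print(build_index(pairs))
```

With a lemmatizer in place, the index stores lemmas directly and no suffix-stripping decision is needed while the index is being built.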

Depending on the language, the approach to good lemmatization varies, but it often involves expressing the language's morphological knowledge as rules.

If no lemmatizer or stemmer is available for the language you are working with, another option is unsupervised morphological segmentation: a model is trained, without any labels, to split words into morphemes, and you then use your linguistic knowledge of Georgian to devise heuristic rules that identify the stem among them. The most relevant Python package for this is Morfessor. There is also a Python package called Polyglot that offers pre-trained Morfessor models for different languages, so you should check whether yours is included.
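A minimal sketch of that two-step idea follows, assuming the Morfessor 2.0 Python package is installed for the training part. The function names train_segmenter and pick_stem, the affix set, and the "drop known affixes, then take the longest morph" rule are illustrative assumptions; the actual heuristic should come from your knowledge of the language.

```python
def train_segmenter(words):
    """Train a Morfessor Baseline model on a plain list of words (no labels)."""
    import morfessor  # imported lazily; only needed for this training step

    model = morfessor.BaselineModel()
    # load_data expects (count, compound) tuples; every word is counted once.
    model.load_data([(1, w) for w in words])
    model.train_batch()
    return model


def pick_stem(morphs, known_affixes):
    """Hypothetical heuristic: trim known affixes at the edges, keep the longest morph."""
    core = list(morphs)
    while len(core) > 1 and core[-1] in known_affixes:
        core.pop()  # drop a trailing suffix
    while len(core) > 1 and core[0] in known_affixes:
        core.pop(0)  # drop a leading prefix
    return max(core, key=len)


# With a trained model, the morphs would come from:
#   morphs, _cost = model.viterbi_segment(word)
# Hand-written segmentations are used here so the heuristic can be tried alone.
print(pick_stem(["book", "s"], {"s"}))
print(pick_stem(["un", "happi", "ness"], {"un", "ness"}))
```

Keeping the stem-picking rule separate from the segmenter makes it easy to swap in a pre-trained model, such as one provided through Polyglot, without touching the heuristic.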
