How to stem plural words properly?
I'm looking for a way to avoid removing an ending `s` when the `s` isn't a suffix. To do that, I first check whether the word exists in my index. If it does, I don't remove the ending `s`; if it doesn't, I remove the ending `s` and add the result to the index. The problem is what to do when I'm just starting to build the index.
Imagine we encounter `books`: I remove the `s` and add `book` to my index. On the other hand, I may encounter `dangerous` for the first time. Since it doesn't exist in my index yet, I remove the `s` and add `dangerou`, which is obviously wrong. What should I do?
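To make the problem concrete, here is a minimal sketch of the procedure I described. The function name `stem_with_index` and the use of a plain `set` as the index are only for illustration, and I'm assuming that words not ending in `s` are added unchanged.

```python
def stem_with_index(word, index):
    """Strip a trailing 's' unless the word is already in the index."""
    if word in index:
        # Known word: keep it as-is, including any ending 's'.
        return word
    # Unknown word: assume an ending 's' is a plural suffix.
    stem = word[:-1] if word.endswith("s") else word
    index.add(stem)
    return stem


index = set()  # the index starts out empty
print(stem_with_index("books", index))      # book
print(stem_with_index("dangerous", index))  # dangerou  (wrong)
```

With an empty index, every unseen word ending in `s` gets truncated, whether or not the `s` is really a suffix.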
Specifically, I'm looking for ways to detect whether an apparent suffix or prefix really is an affix or is part of the original word. One way that comes to mind is to use a formal dictionary and check words against it instead of my own index.
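A rough sketch of that dictionary idea, as I imagine it: only strip the suffix when what remains is itself a dictionary word. The name `stem_with_dictionary` and the tiny word list are hypothetical; a real word list would have to be loaded from a dictionary resource for the target language.

```python
def stem_with_dictionary(word, dictionary, suffix="s"):
    """Strip `suffix` only when the remaining stem is a known dictionary word."""
    if word.endswith(suffix) and len(word) > len(suffix):
        candidate = word[: -len(suffix)]
        if candidate in dictionary:
            return candidate
    # Either there is no suffix, or the stripped form is not a real word.
    return word


# Hypothetical toy dictionary, for illustration only.
dictionary = {"book", "dangerous"}
print(stem_with_dictionary("books", dictionary))      # book
print(stem_with_dictionary("dangerous", dictionary))  # dangerous
```

This does not depend on the order in which words are seen, but it is only as good as the dictionary's coverage.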
P.S.: I'm not working on English documents. This is a college/prototype project, so I'm looking for general, good ideas with reasonable accuracy, not advanced techniques with superb accuracy and considerable complexity.
Topic indexing nlp information-retrieval
Category Data Science