Stemming/lemmatization for German words

Question

johnnydoe

April 7, 2022 00:02

I have a huge dataset of German words and their frequency in a text corpus (so words like der, die, das have a very high frequency, whereas terminology-like words have a very low frequency). Different forms of the same word, such as plural or 3rd person forms do appear, but there is no guarantee that this happens for every word.

I tried using spacy.load('de_core_news_sm'), but it says it can't find the model. Older posts don't point to anything reliable for this either.

Maybe a second question: what could I do to determine a reliable popularity of a word from these frequencies when it comes to related words? For example, the singular form, Katze, has a frequency of 1000, but its plural, Katzen, has a frequency of 500. One idea is to add them; another idea is to give the plural the same score as its singular, because the definition of the word is basically the same. How does one deal with this whe

Topic nltk scipy nlp python

Category Data Science

technik · Accepted Answer · March 3, 2022 19:18

When spaCy can't find the model, download it first with "python -m spacy download de_core_news_sm" and then call spacy.load('de_core_news_sm') again.

For your second problem, use stemming or lemmatization, depending on your use case. If you have a lot of words and limited processing power, stemming is the better choice, since it only strips suffixes by rule. It should also solve your Katze/Katzen problem: once both forms reduce to the same stem, you can combine their frequencies.
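As a rough sketch of the stemming route, the snippet below groups surface forms by stem and then derives one score per word. It assumes NLTK's Snowball stemmer for German, which the answer does not name; `group_by_stem`, `score_forms`, and `make_snowball_stemmer` are hypothetical helper names. Passing `sum` gives every form the total of its group. Passing `max` gives every form the count of the most frequent form, which matches "the plural gets the singular's score" only under the assumption that the base form is the most frequent one.

```python
from collections import defaultdict


def make_snowball_stemmer(language="german"):
    """Return a stem function backed by NLTK (assumption: `pip install nltk`)."""
    # Imported here so the grouping helpers below work without NLTK installed.
    from nltk.stem.snowball import SnowballStemmer

    return SnowballStemmer(language).stem


def group_by_stem(freqs, stem):
    """Collect {word: count} entries under the key returned by `stem`."""
    groups = defaultdict(dict)
    for word, count in freqs.items():
        groups[stem(word)][word] = count
    return dict(groups)


def score_forms(groups, combine=sum):
    """Give every word the combined score of its stem group."""
    scores = {}
    for forms in groups.values():
        group_score = combine(forms.values())
        for word in forms:
            scores[word] = group_score
    return scores
```

With the numbers from the question, `score_forms(group_by_stem({"Katze": 1000, "Katzen": 500}, make_snowball_stemmer()), combine=sum)` should assign 1500 to both forms, provided the stemmer reduces them to the same stem. Any other callable that maps a word to a key can be passed in place of the stemmer.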

With lemmatization you can also group words that belong to the same dictionary form but do not share a stem. For example, "besser" would be mapped to "gut". Which approach to pick depends on how much detail you want to keep.
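A minimal sketch of the lemmatization route with spaCy, assuming the de_core_news_sm model from the question has already been downloaded. `build_lemma_map` and `merge_by_lemma` are hypothetical helper names. Each word is lemmatized in isolation, so without sentence context the lemma may be less reliable, and whether a particular model version maps "besser" to "gut" is not guaranteed.

```python
from collections import Counter


def build_lemma_map(words, model_name="de_core_news_sm"):
    """Map each word to the lemma spaCy assigns it when seen on its own."""
    # Imported here so merge_by_lemma below works without spaCy installed.
    import spacy

    # Raises OSError if the model has not been downloaded yet.
    nlp = spacy.load(model_name)
    words = list(words)
    lemma_map = {}
    # nlp.pipe processes the inputs in batches, which helps with long lists.
    for word, doc in zip(words, nlp.pipe(words)):
        # Keep the word unchanged if spaCy split it into several tokens.
        lemma_map[word] = doc[0].lemma_ if len(doc) == 1 else word
    return lemma_map


def merge_by_lemma(freqs, lemma_map):
    """Sum the frequencies of all words that share a lemma."""
    merged = Counter()
    for word, count in freqs.items():
        # Words missing from the map count as their own lemma.
        merged[lemma_map.get(word, word)] += count
    return merged
```

A call such as `merge_by_lemma(freqs, build_lemma_map(freqs))` would then yield one frequency per lemma. It is worth inspecting the map before trusting the merged counts, since rare terminology may not be lemmatized as expected.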
