Stemming/lemmatization for German words
I have a huge dataset of German words and their frequencies in a text corpus (so words like der, die, das have a very high frequency, whereas specialized terms have a very low frequency). Different forms of the same word, such as plural or 3rd person forms, do appear, but there is no guarantee that this happens for every word.
I tried using spacy.load('de_core_news_sm'), but it says it can't find the model. Older posts on this topic don't mention anything reliable.
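For reference, spaCy raises an OSError from spacy.load when the named model package is not installed, and models are normally installed separately with "python -m spacy download de_core_news_sm". The following is a minimal sketch of a defensive loader under that assumption; the helper name load_german_model is made up for illustration, and the lemmas shown depend on the installed model version.

```python
import importlib


def load_german_model(name="de_core_news_sm"):
    """Return a loaded spaCy pipeline, or None if spaCy or the model is missing.

    Hypothetical helper for illustration; not part of spaCy itself.
    """
    try:
        spacy = importlib.import_module("spacy")
    except ImportError:
        print("spaCy is not installed; try: pip install spacy")
        return None
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the model package cannot be found.
        print(f"Model {name!r} not found; try: python -m spacy download {name}")
        return None


nlp = load_german_model()
if nlp is not None:
    # Lemmatize a short phrase; token.lemma_ holds the predicted lemma.
    doc = nlp("Die Katzen schlafen")
    print([(token.text, token.lemma_) for token in doc])
```

Note that a word list has no sentence context, so lemmas predicted for isolated words may be less reliable than lemmas predicted inside running text.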
Maybe a second question: what could I do to determine a reliable popularity of a word using these frequencies when it comes to related words? For example, the singular form, Katze, has a frequency of 1000, but its plural, Katzen, has a frequency of 500. One idea is to add them; another idea is to make the plural have the same score as its singular, because the definition of the word is basically the same. How does one deal with this whe
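The two ideas above can be sketched in a few lines once each word form is mapped to a lemma. This is only an illustration: the LEMMAS table is a hand-written stand-in for a real lemmatizer, the Hund entries are invented sample data, and share_lemma_score is one possible reading of "same score" in which every form receives the summed count of its lemma.

```python
from collections import defaultdict

# Hypothetical form -> lemma table; a real lemmatizer would replace this.
LEMMAS = {"Katzen": "Katze", "Hunde": "Hund"}


def lemma_of(word):
    """Return the lemma for a word, or the word itself if it is unknown."""
    return LEMMAS.get(word, word)


def sum_by_lemma(freqs):
    """Idea 1: add the counts of all forms that share a lemma."""
    totals = defaultdict(int)
    for word, count in freqs.items():
        totals[lemma_of(word)] += count
    return dict(totals)


def share_lemma_score(freqs):
    """Idea 2 (one variant): give every form the total of its lemma."""
    totals = sum_by_lemma(freqs)
    return {word: totals[lemma_of(word)] for word in freqs}


# Katze/Katzen counts come from the question; Hund is made-up sample data.
freqs = {"Katze": 1000, "Katzen": 500, "Hund": 300}
lemma_totals = sum_by_lemma(freqs)
word_scores = share_lemma_score(freqs)
print(lemma_totals)
print(word_scores)
```

With this sample input, Katze and Katzen both end up with a score of 1500, while Hund keeps 300 because only one of its forms occurs in the data.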