Jargon extraction in a text

I have a large text corpus (a company's documentation) and I want to extract the terms that are specific to that domain or business. I can do this with TF or TF-IDF, using word frequency as a guide, but that isn't always reliable.

I also want to do this for single, shorter sentences, which I expect to be harder. I was also thinking of training a model on Wikipedia articles and then applying it to my documentation texts.

Is there any way of identifying words that are related to a specific field?

Topic corpus nlp python

Category Data Science


You can use TF-IDF, TextRank, TopicRank, YAKE!, and KeyBERT for keyword extraction.

Check this article: https://towardsdatascience.com/keyword-extraction-python-tf-idf-textrank-topicrank-yake-bert-7405d51cd839
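To make the TF-IDF option concrete, here is a minimal sketch in plain Python. It is not taken from the linked article: the tokenizer, the smoothed IDF formula, and the names `tokenize` and `tfidf_top_terms` are assumptions chosen for illustration, and a library implementation would normally be preferable on a real corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_top_terms(docs, top_n=3):
    """Return, for each document, its top_n (term, score) pairs by TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        scores = {
            # Relative term frequency times a smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        results.append(ranked[:top_n])
    return results


# Made-up example documents, not from any real corpus.
docs = [
    "The ledger reconciles accruals before the ledger closes.",
    "The invoice is sent before the quarter closes.",
    "The team meets before the release.",
]

for doc, terms in zip(docs, tfidf_top_terms(docs)):
    print(doc, "->", [term for term, _ in terms])
```

Words that appear in every document ("the", "before") are pushed down by the IDF factor, while words concentrated in one document ("ledger") rise to the top. On very short texts the counts are too sparse for this to be stable, which is one reason frequency alone can mislead.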


I built a similar application some time ago. I extracted the features (the important, defining terms) from the corpus using TF-IDF, then computed the word similarity between those terms and my input words and aggregated the results.

You could use word embeddings like GloVe if you want to compare these words semantically.
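As a rough sketch of that comparison, the snippet below scores a candidate word by its mean cosine similarity to a set of known domain terms. The three-dimensional vectors in `toy_vectors` are invented for illustration and stand in for real pre-trained embeddings such as GloVe, which typically have 50 to 300 dimensions and are loaded from a file. Using the mean as the aggregation is also an assumption; the answer above does not say how the results were aggregated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; real vectors would come from a trained model.
toy_vectors = {
    "ledger": [0.9, 0.1, 0.0],
    "invoice": [0.8, 0.2, 0.1],
    "accrual": [0.85, 0.15, 0.05],
    "holiday": [0.0, 0.2, 0.9],
}


def domain_score(word, domain_terms, vectors):
    """Mean cosine similarity between `word` and the known domain terms.

    Words without a vector score 0.0, as does an empty comparison set.
    """
    if word not in vectors:
        return 0.0
    sims = [
        cosine(vectors[word], vectors[term])
        for term in domain_terms
        if term in vectors and term != word
    ]
    return sum(sims) / len(sims) if sims else 0.0


domain_terms = ["ledger", "invoice"]  # e.g. top TF-IDF terms from the corpus
for candidate in ["accrual", "holiday"]:
    print(candidate, round(domain_score(candidate, domain_terms, toy_vectors), 3))
```

With these toy vectors, "accrual" scores far higher than "holiday" against the domain terms. Because it only needs a vector per word, this kind of scoring also works on single short sentences, where frequency counts are too sparse to be useful.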
