nltk.corpus for data science related words?

From the job descriptions I scraped from the internet, I've gone through all the NLP preprocessing steps and I've got to the point where I have:

import nltk

freq = nltk.FreqDist(lemmatized_list)   # frequency distribution over the lemmatized tokens
most_freq_words = freq.most_common(100)

which outputs:

[('data', 179),
 ('experience', 86),
 ('work', 78),
 ('business', 71),
 ('team', 59),
 ('learn', 56),
 ('model', 49),
 ('skills', 47),
 ('science', 41),
 ('use', 41),
 ('build', 39),
 ('machine', 37),
 ('ability', 36),.....

and so on. My problem is that I do not want to consider words like "experience" and "work", and only want to keep keywords related to data science. I'm guessing there is a corpus of data science terms which I can use to filter with, much like how I use the stop word corpus to drop stop words. Let me know if there is a way, thanks!

Overall, I agree with Andy M's suggestion.

To address the issue you point out and get rid of words like "work" and "experience", you can take the n most frequent words from a general corpus, ignore them wherever they appear in the data science corpus, and keep the rest as the data-science-related terms.

So, in a more Pythonic way:

general_texts = [
    ['a', 'sentence', 'about', 'experience'],
    ['another', 'sentence', 'typed', 'at', 'work'],
    ['work', 'experience'],
    # ... more tokenized general texts
]


data_science_texts = [
    ['data', 'science', 'experience'],
    ['work', 'on', 'machine', 'learning'],
    # ... more tokenized data science texts
]

from collections import Counter

freqdist_gnrl = Counter()
freqdist_ds = Counter()

for text in general_texts:
    freqdist_gnrl.update(text)

for text in data_science_texts:
    freqdist_ds.update(text)

# Keep only the words themselves, not the (word, count) tuples,
# so the membership test below compares strings with strings.
mostfreq_words_gnrl = {w for w, _ in freqdist_gnrl.most_common(2)}   # {'work', 'experience'}

words_ds = [
    w for w, _ in freqdist_ds.most_common()
    if w not in mostfreq_words_gnrl        # every word other than 'work' or 'experience'
]

In this example, I have used n = 2 for the n most frequent terms to make it work on the toy data but, over a larger corpus, you would probably take a few hundred words.

After applying this filter, the first k words in the variable words_ds should all be related to data science to a reasonable extent.
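If you don't have a general corpus at hand, here is a minimal sketch of the same idea using NLTK's Brown corpus as a stand-in for general English (freq is the FreqDist from your question, and the 300-word cutoff is just an assumption to tune):

import nltk
from collections import Counter
from nltk.corpus import brown

nltk.download('brown')

# Word frequencies over a general-English corpus
freqdist_gnrl = Counter(w.lower() for w in brown.words() if w.isalpha())

# Take a few hundred of the most frequent general-English words
mostfreq_words_gnrl = {w for w, _ in freqdist_gnrl.most_common(300)}

# Keep only the job-description words that are not generic English
words_ds = [w for w, _ in freq.most_common() if w not in mostfreq_words_gnrl]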

Hope this helps!


I have a way through which you can solve your problem. For it, you will require:

  • A pretrained word embedding, such as Word2Vec or GloVe. Either of them would work.

Next, we have the set of words with the highest frequencies. Suppose we have a set of 100 such words where the 1st word has the highest frequency.

Now, we convert every word in this set to a vector using our pretrained word embedding. Hence you will have a set of vectors for the words from the corpus. Let's call them $z_i$.

We have the phrase "data science". Get a vector for this too (since it is two tokens, one simple option is to average the vectors for "data" and "science"). Let's call it $x$.

  1. Measure the Euclidean distance between the vector $x$ and each $z_i$.
  2. Or, you can measure the cosine similarity between $x$ and each $z_i$.
  3. Both of the above methods will produce a set of values showing the proximity of each $z_i$ to $x$.
  4. From these 100 values, take the 10 nearest ones (the smallest Euclidean distances, or the highest cosine similarities) and convert them back to words.

These 10 words would have the highest similarity with the phrase "data science".
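As a minimal sketch of this approach, assuming gensim's downloader for a pretrained GloVe model and the most_freq_words list from the question:

import numpy as np
import gensim.downloader as api

# Pretrained GloVe vectors (any pretrained Word2Vec/GloVe model would do)
model = api.load('glove-wiki-gigaword-50')

# "data science" is two tokens, so approximate x by averaging the two vectors
x = (model['data'] + model['science']) / 2

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# z_i: vectors for the most frequent words, skipping out-of-vocabulary ones
scores = {w: cosine(x, model[w]) for w, _ in most_freq_words if w in model}

# The 10 words with the highest cosine similarity to "data science"
top_10 = sorted(scores, key=scores.get, reverse=True)[:10]
print(top_10)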
