How to keep only the top k-frequent ngrams in a text field with pandas?

Question


Hing

April 12, 2022 15:10

How can I keep only the top-k most frequent n-grams in a text field with pandas? For example, I have a text column. For every row in it, I want to keep only those substrings that belong to the top-k most frequent n-grams, where the list of n-grams is built from the same column across all rows. How should I implement this on a pandas DataFrame?

Topic ngrams

Category Data Science

Kasra Manshaei · Accepted Answer · April 12, 2022 15:10

For k=3, using single words (unigrams):

import pandas as pd
from collections import Counter

corpus = ['is this text very frequent',
          'is this text very',
          'is this text',
          'is this',
          'is']

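# Count every word across the whole corpus, sorted from most to least frequent.
word_frequency = Counter(' '.join(corpus).split()).most_common()
# Keep just the words of the 3 most frequent entries.
top_3_frequents = [word for word, _ in word_frequency[:3]]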

print(word_frequency)
# [('is', 5), ('this', 4), ('text', 3), ('very', 2), ('frequent', 1)]

print(top_3_frequents)
# ['is', 'this', 'text']

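df = pd.DataFrame(corpus, columns=['text'])
# Keep, per row, only the words that are among the top 3.
# The result is a set, so word order and duplicates are not preserved.
df['frequents'] = df['text'].apply(lambda x: set(x.split()) & set(top_3_frequents))
print(df)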
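
The code above counts single words. Since the question asks about n-grams, here is a minimal sketch of one way to extend the same idea to word n-grams. The helper names `ngrams` and `top_k_ngrams`, whitespace tokenization, and the choice of n=2 and k=2 are assumptions made for illustration; they are not part of the original answer. Unlike the set intersection, this version returns a list, so the order of the kept n-grams within each row is preserved.

```python
from collections import Counter

import pandas as pd


def ngrams(text, n):
    """Return the word n-grams of `text` as space-joined strings, in order.

    Hypothetical helper: tokenizes on whitespace only.
    """
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_k_ngrams(texts, n, k):
    """Return the k most frequent n-grams across all texts (hypothetical helper)."""
    counts = Counter(gram for text in texts for gram in ngrams(text, n))
    return [gram for gram, _ in counts.most_common(k)]


corpus = ['is this text very frequent',
          'is this text very',
          'is this text',
          'is this',
          'is']

df = pd.DataFrame(corpus, columns=['text'])

# Illustrative choice: bigrams (n=2), keep the 2 most frequent (k=2).
n, k = 2, 2
top = set(top_k_ngrams(df['text'], n, k))

# Per row, keep only the n-grams that are in the corpus-wide top k.
df['frequents'] = df['text'].apply(
    lambda x: [gram for gram in ngrams(x, n) if gram in top]
)
print(df)
```

With this corpus the two most frequent bigrams are 'is this' (4 occurrences) and 'this text' (3 occurrences), so the last row, which is too short to contain any bigram, ends up with an empty list. If a string is needed instead of a list, the kept n-grams could be joined back together, though overlapping n-grams would then repeat words.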
