How to choose threshold for gensim Phrases when generating bigrams?

I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA.

from gensim.models.phrases import Phrases, Phraser

# 7k documents, ~500-1k tokens each. Already ran cleanup, stop-word
# removal, lemmatization, etc.
docs = get_docs()

# Learn bigram statistics over the corpus (defaults: min_count=5, threshold=10),
# then freeze into a lightweight Phraser for fast transformation.
phrases = Phrases(docs)
bigram = Phraser(phrases)
docs = [bigram[d] for d in docs]
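For what it's worth, one way I've been eyeballing the effect of threshold is to dump the detected phrases with their scores and look where the junk starts. A sketch of that, assuming gensim 4.x (where export_phrases() takes no arguments and returns a dict; older versions had a different signature):

# Inspect what the model actually detected; in gensim 4.x
# export_phrases() returns a {phrase: score} dict.
top = sorted(phrases.export_phrases().items(), key=lambda kv: -kv[1])
for phrase, score in top[:25]:
    print(f"{score:10.2f}  {phrase}")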

Phrases defaults to min_count=5 and threshold=10. I don't quite understand how these two interact; they seem related. Tutorials set threshold to values anywhere from 1 to 1000 and describe it as important in determining how many bigrams get generated, but I can't find an explanation of how to arrive at a decent value for one's own corpus beyond "fiddle and see what works best for you". Is there any intuition or formula for choosing it? Something like: if you want x% more tokens added to your dictionary, use y; or if your corpus size is x, try y?

I also see that scoring='default' can be set to 'npmi' instead. The paper linked from the docs says "t is a chosen threshold, typically around 10e−5". Might that be a decent approach if I just want this to work well enough without much fiddling? That is: phrases = Phrases(docs, scoring='npmi', threshold=10e-5).
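For reference, here is my rough sketch of the two scoring functions as the gensim docs describe them (variable names are mine, not the library's). A candidate bigram (a, b) is accepted when its score exceeds threshold, and you can see where min_count feeds into the default score:

import math

def default_score(count_a, count_b, count_ab, vocab_size, min_count):
    # scoring='default' (from Mikolov et al.): unbounded, grows with
    # vocabulary size, and min_count is subtracted here -- this is how
    # min_count and threshold interact.
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

def npmi_score(count_a, count_b, count_ab, corpus_word_count):
    # scoring='npmi': normalized pointwise mutual information,
    # bounded in [-1, 1], so sensible thresholds sit near 0.
    p_a = count_a / corpus_word_count
    p_b = count_b / corpus_word_count
    p_ab = count_ab / corpus_word_count
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

The default score being unbounded and scaled by vocabulary size would explain thresholds in the 1-1000 range, while NPMI's normalization explains thresholds near zero. (Note, incidentally, that 10e-5 in Python evaluates to 1e-4, not 10^-5.)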

TL;DR: is there a simple or intuitive way to choose a decent threshold (e.g., based on corpus size)? Alternatively, would scoring='npmi', threshold=10e-5 be the simpler option?

Tags: gensim, text-mining, lda, nlp



Since min_count and threshold are hyperparameters, better values can be found through cross-validation: evaluate a grid of candidate values and keep the combination that performs best on a validation set. Given that your downstream task is TF-IDF/LDA, a natural validation metric is topic coherence.
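A minimal sketch of that idea, assuming LDA is the downstream model and using gensim's CoherenceModel as the selection metric. The num_topics, passes, and threshold grid below are placeholder choices, get_docs() is the question's own placeholder, and a proper version would score on held-out documents rather than the training corpus:

from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel
from gensim.models.phrases import Phrases, Phraser

def coherence_for_threshold(docs, threshold):
    # Rebuild bigrams at this threshold and re-tokenize the corpus.
    bigram = Phraser(Phrases(docs, min_count=5, threshold=threshold))
    bigram_docs = [bigram[d] for d in docs]

    # Train a small LDA model on the re-tokenized corpus.
    dictionary = Dictionary(bigram_docs)
    corpus = [dictionary.doc2bow(d) for d in bigram_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=20, passes=5, random_state=0)

    # Score topic quality; higher c_v coherence is generally better.
    cm = CoherenceModel(model=lda, texts=bigram_docs,
                        dictionary=dictionary, coherence='c_v')
    return cm.get_coherence()

docs = get_docs()  # same preprocessed corpus as in the question
for t in (1, 10, 100, 1000):
    print(t, coherence_for_threshold(docs, t))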
