How to choose threshold for gensim Phrases when generating bigrams?
I'm generating bigrams with gensim.models.phrases, which I'll use downstream with TF-IDF and/or gensim's LDA.
from gensim.models.phrases import Phrases, Phraser

# 7k documents, ~500-1k tokens each. Already ran cleanup, stop_words, lemmatization, etc.
docs = get_docs()
phrases = Phrases(docs)            # trains the bigram detector on the corpus
bigram = Phraser(phrases)          # frozen, lighter-weight version for applying the phrases
docs = [bigram[d] for d in docs]   # merge detected pairs into single tokens
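As a quick sanity check after the transform, I count how many merged tokens were actually produced (this assumes gensim's default "_" delimiter and that none of my original tokens contain underscores):

# What fraction of tokens ended up merged into bigrams?
n_tokens = sum(len(d) for d in docs)
n_bigrams = sum('_' in tok for d in docs for tok in d)
print(f"{n_bigrams} bigram tokens out of {n_tokens} ({n_bigrams / n_tokens:.1%})")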
Phrases has defaults min_count=5 and threshold=10. I don't quite understand how they interact; they seem related. In any case, I see threshold given values ranging from 1 to 1000 across tutorials, and it's described as important in determining the number of bigrams generated, yet I can't find an explanation of how to arrive at a decent value for one's purposes beyond "fiddle and see what works best for you". Is there any intuition or formula for choosing this value, something like "if you want x% more tokens added to your dictionary, use y", or "if your corpus size is x, try y"?
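For what it's worth, my reading of gensim's source is that the default scorer ties the two parameters together: min_count is subtracted from the co-occurrence count before the result is compared against threshold. A small sketch with made-up counts (the numbers are hypothetical; double-check the scorer's signature against your installed gensim version):

from gensim.models.phrases import original_scorer

# Default scoring (Mikolov et al.), as implemented by gensim's original_scorer:
#   score = (bigram_count - min_count) / worda_count / wordb_count * len_vocab
# A pair becomes a bigram only if score > threshold, so min_count both
# filters rare pairs and discounts every pair's score.
score = original_scorer(
    worda_count=100,              # hypothetical corpus count of "new"
    wordb_count=80,               # hypothetical corpus count of "york"
    bigram_count=60,              # hypothetical co-occurrence count of "new york"
    len_vocab=20000,              # vocabulary size
    min_count=5,
    corpus_word_count=5_000_000,  # ignored by this scorer, but part of the signature
)
print(score)  # (60 - 5) / 100 / 80 * 20000 = 137.5 -> kept at threshold=10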
I also see that scoring='default' can be set to 'npmi' instead. From the linked paper, they say "and t is a chosen threshold, typically around 10e−5". Might that be a decent approach if I just want this to work well enough without needing to fiddle much? That is, phrases = Phrases(docs, scoring='npmi', threshold=10e-5).
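Since NPMI scores are normalized to [-1, 1], one low-effort alternative I've considered is sweeping a few thresholds and eyeballing the top-scoring bigrams, roughly like this (export_phrases() returning a {phrase: score} dict is gensim 4.x behavior, so adjust for older versions):

from gensim.models.phrases import Phrases

# NPMI scores live in [-1, 1], so a coarse sweep is cheap to interpret.
for t in (0.1, 0.3, 0.5, 0.7):
    p = Phrases(docs, scoring='npmi', threshold=t)
    found = p.export_phrases()  # {phrase: score} in gensim 4.x
    top = sorted(found, key=found.get, reverse=True)[:5]
    print(f"threshold={t}: {len(found)} bigrams, top 5: {top}")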
TL;DR: Is there a simple or intuitive way to choose a decent threshold (e.g., based on corpus size)? Alternatively, would scoring='npmi', threshold=10e-5 be simpler?
Topic: gensim, text-mining, lda, nlp
Category: Data Science