Understanding the output of gensim's LDA topic modeling API
I was trying to understand gensim's Mallet wrapper for topic modeling, as explained in this notebook.
In point 11, it prepares the corpus, which is in term-document-frequency (bag-of-words) format:
print(corpus[:1]) # for 1st document
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]]
Above, (0, 1) means the token with id 0 appears once in this document.
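To make sure I am reading this format right, here is a minimal pure-Python sketch of how such (token_id, count) pairs arise. It mirrors what gensim's Dictionary plus doc2bow do, but uses plain Python for illustration (the sample sentence and ids are invented):

```python
from collections import Counter

# A toy document, tokenized into words.
doc = "human interface computer human human".split()

# Assign an integer id to each distinct token, in order of first appearance
# (this is roughly the role of gensim's Dictionary / id2word).
id2word = {}
for token in doc:
    id2word.setdefault(token, len(id2word))

# Bag-of-words: (token_id, frequency-in-this-document) pairs.
bow = sorted(Counter(id2word[t] for t in doc).items())
print(bow)  # [(0, 3), (1, 1), (2, 1)] -> token 0 ("human") appears 3 times
```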
In point 16, it trains the LDA topic model and prints the topics:
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
pprint(ldamallet.show_topics(formatted=False))
[(14,
[('drive', 0.034160504468538765),
('card', 0.026970013156103978),
('problem', 0.026130744454021686),
('system', 0.024928548745633536),
('driver', 0.018804155514222202),
('run', 0.017579276867939937),
('work', 0.015242934264845983),
('disk', 0.015220251326951867),
('bit', 0.014040738556457832),
('memory', 0.013723177425940208)]),
(18,
[('line', 0.02928489135385687),
('good', 0.028347131795407658),
('buy', 0.026632371459957668),
('price', 0.02585537068295689),
('sell', 0.023953058435817055),
('cost', 0.01596870562387804),
('pay', 0.015459636149291321),
('sale', 0.015191704846877261),
('offer', 0.012753529994909306),
('call', 0.012565978083219463)]),
(6,
[('window', 0.020890399084636455),
('image', 0.017891196560452134),
('file', 0.015464096251863667),
('program', 0.014597274713082071),
('display', 0.013799798897403003),
('software', 0.012187510835269234),
('application', 0.012170174404493602),
('version', 0.011996810096737283),
('run', 0.01133802572726327),
('color', 0.011112652127180055)]),
(4,
[('game', 0.031754338224520125),
('team', 0.030393737048123416),
('year', 0.029535511690703953),
('play', 0.025432775835723107),
('player', 0.018964687166391058),
('win', 0.016264417139388358),
('good', 0.014485169447177275),
('season', 0.01230820756494254),
('run', 0.009838193121637745),
('hit', 0.009419546605823373)]),
(9,
[('car', 0.041619492985478714),
('bike', 0.014250553777996553),
('ride', 0.010952498154073344),
('drive', 0.010066453359586513),
('engine', 0.008663549101649027),
('turn', 0.008048240216588728),
('speed', 0.007605217819345311),
('front', 0.0075806054639428995),
('road', 0.007432931331528427),
('mile', 0.007112970711297071)]),
(19,
[('gun', 0.022815035249938),
('people', 0.018227229248591773),
('state', 0.015765047649413683),
('government', 0.01417082934778758),
('law', 0.009529882736387147),
('job', 0.008927622489106175),
('make', 0.00823679455840153),
('crime', 0.008183653948347327),
('year', 0.00811279980160839),
('weapon', 0.007652247847805293)]),
(15,
[('make', 0.05305789275398719),
('thing', 0.04772070827577546),
('time', 0.03884633094729792),
('good', 0.037234710536230065),
('work', 0.028381263342961194),
('point', 0.026267319686885178),
('problem', 0.024132445895600485),
('give', 0.02331617062246222),
('bad', 0.021369668048055592),
('put', 0.016974339654234165)]),
(0,
[('people', 0.02214538150568254),
('drug', 0.013627927080420025),
('man', 0.013144847575703644),
('article', 0.0130431466273423),
('make', 0.011644758587373827),
('show', 0.01027179578449569),
('write', 0.010195520073224683),
('number', 0.010017543413592333),
('find', 0.009636164857237294),
('food', 0.009585314383056622)]),
(11,
[('power', 0.01920617807085726),
('ground', 0.012536669759045207),
('line', 0.011283002783140686),
('current', 0.010706315974224606),
('high', 0.009277135621693453),
('wire', 0.009151768924103002),
('water', 0.00912669558458491),
('work', 0.008123762003861295),
('low', 0.008123762003861295),
('light', 0.007547075194945215)]),
(3,
[('space', 0.017264917771103318),
('system', 0.011131881287937661),
('year', 0.010438764151141543),
('project', 0.010291739303942365),
('launch', 0.009220558274348365),
('program', 0.008275398542353658),
('design', 0.006910167818361303),
('technology', 0.006742139421562244),
('cost', 0.006637121673562832),
('base', 0.00653210392556342)])]
The doc describes the show_topics() method as follows:
Get the num_words most probable words for num_topics number of topics. Returns
- list of (float, str) – Topics as list of (weight, word) pairs (if formatted=False)
The doc does not say anything about document-specific topics. So I am guessing the above output is for the whole corpus, i.e. all documents in the corpus? If yes, how do I get document-specific topic information? Should I train against a single document instead of the whole corpus? It seems I am missing some simple understanding of the API.
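To make the question concrete, here is a toy pure-Python sketch (invented numbers, no gensim involved) of the two kinds of output I mean, the topic-word pairs that show_topics() prints versus the per-document topic mixture I am looking for:

```python
# Corpus-level: each topic is a distribution over WORDS. This is what
# show_topics() returns; it describes the trained model, not any one document.
topics = {
    0: [("drive", 0.034), ("card", 0.027), ("disk", 0.015)],
    1: [("game", 0.032), ("team", 0.030), ("play", 0.025)],
}

# Document-level: what I want instead -- for one document, a distribution
# over TOPICS, e.g. "document 0 is 80% topic 1 and 20% topic 0".
doc_topics = [(1, 0.8), (0, 0.2)]

# A per-document topic mixture should sum to (approximately) 1.
assert abs(sum(p for _, p in doc_topics) - 1.0) < 1e-9
```

Is there an API call on the trained model that returns something shaped like doc_topics for a given document's bag-of-words?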
Tags: gensim, lda, topic-model, machine-learning
Category Data Science