Understanding the output of gensim's LDA topic modeling API
I was trying to understand gensim's Mallet wrapper for topic modeling, as explained in this notebook.
In point 11, it prepares the corpus, which is in term-document-frequency (bag-of-words) format:
print(corpus[:1]) # for 1st document
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]]
Above, (0, 1) means the token with id 0 appears once in this document.
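To make sure I am reading this format right, here is a minimal pure-Python sketch of how such (token_id, count) pairs arise. It mirrors what gensim's Dictionary plus doc2bow do, but uses plain Python for illustration (the sample sentence and ids are invented):

```python
from collections import Counter

# A toy document, tokenized into words.
doc = "human interface computer human human".split()

# Assign an integer id to each distinct token, in order of first appearance
# (this is roughly the role of gensim's Dictionary / id2word).
id2word = {}
for token in doc:
    id2word.setdefault(token, len(id2word))

# Bag-of-words: (token_id, frequency-in-this-document) pairs.
bow = sorted(Counter(id2word[t] for t in doc).items())
print(bow)  # [(0, 3), (1, 1), (2, 1)] -> token 0 ("human") appears 3 times
```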
In point 16, it trains the LDA topic model and prints the topics:
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
pprint(ldamallet.show_topics(formatted=False))
[(14,
[('drive', 0.034160504468538765),
('card', 0.026970013156103978),
('problem', 0.026130744454021686),
('system', 0.024928548745633536),
('driver', 0.018804155514222202),
('run', 0.017579276867939937),
('work', 0.015242934264845983),
('disk', 0.015220251326951867),
('bit', 0.014040738556457832),
('memory', 0.013723177425940208)]),
(18,
[('line', 0.02928489135385687),
('good', 0.028347131795407658),
('buy', 0.026632371459957668),
('price', 0.02585537068295689),
('sell', 0.023953058435817055),
('cost', 0.01596870562387804),
('pay', 0.015459636149291321),
('sale', 0.015191704846877261),
('offer', 0.012753529994909306),
('call', 0.012565978083219463)]),
(6,
[('window', 0.020890399084636455),
('image', 0.017891196560452134),
('file', 0.015464096251863667),
('program', 0.014597274713082071),
('display', 0.013799798897403003),
('software', 0.012187510835269234),
('application', 0.012170174404493602),
('version', 0.011996810096737283),
('run', 0.01133802572726327),
('color', 0.011112652127180055)]),
(4,
[('game', 0.031754338224520125),
('team', 0.030393737048123416),
('year', 0.029535511690703953),
('play', 0.025432775835723107),
('player', 0.018964687166391058),
('win', 0.016264417139388358),
('good', 0.014485169447177275),
('season', 0.01230820756494254),
('run', 0.009838193121637745),
('hit', 0.009419546605823373)]),
(9,
[('car', 0.041619492985478714),
('bike', 0.014250553777996553),
('ride', 0.010952498154073344),
('drive', 0.010066453359586513),
('engine', 0.008663549101649027),
('turn', 0.008048240216588728),
('speed', 0.007605217819345311),
('front', 0.0075806054639428995),
('road', 0.007432931331528427),
('mile', 0.007112970711297071)]),
(19,
[('gun', 0.022815035249938),
('people', 0.018227229248591773),
('state', 0.015765047649413683),
('government', 0.01417082934778758),
('law', 0.009529882736387147),
('job', 0.008927622489106175),
('make', 0.00823679455840153),
('crime', 0.008183653948347327),
('year', 0.00811279980160839),
('weapon', 0.007652247847805293)]),
(15,
[('make', 0.05305789275398719),
('thing', 0.04772070827577546),
('time', 0.03884633094729792),
('good', 0.037234710536230065),
('work', 0.028381263342961194),
('point', 0.026267319686885178),
('problem', 0.024132445895600485),
('give', 0.02331617062246222),
('bad', 0.021369668048055592),
('put', 0.016974339654234165)]),
(0,
[('people', 0.02214538150568254),
('drug', 0.013627927080420025),
('man', 0.013144847575703644),
('article', 0.0130431466273423),
('make', 0.011644758587373827),
('show', 0.01027179578449569),
('write', 0.010195520073224683),
('number', 0.010017543413592333),
('find', 0.009636164857237294),
('food', 0.009585314383056622)]),
(11,
[('power', 0.01920617807085726),
('ground', 0.012536669759045207),
('line', 0.011283002783140686),
('current', 0.010706315974224606),
('high', 0.009277135621693453),
('wire', 0.009151768924103002),
('water', 0.00912669558458491),
('work', 0.008123762003861295),
('low', 0.008123762003861295),
('light', 0.007547075194945215)]),
(3,
[('space', 0.017264917771103318),
('system', 0.011131881287937661),
('year', 0.010438764151141543),
('project', 0.010291739303942365),
('launch', 0.009220558274348365),
('program', 0.008275398542353658),
('design', 0.006910167818361303),
('technology', 0.006742139421562244),
('cost', 0.006637121673562832),
('base', 0.00653210392556342)])]
The doc describes the show_topics() method as follows:
Get the num_words most probable words for num_topics number of topics. Returns
- list of (float, str) – Topics as list of (weight, word) pairs (if formatted=False)
The doc does not say anything about document-specific topics. So I am guessing the above output is for the whole corpus, i.e. all documents in the corpus? If yes, how do I get document-specific topic information? Should I train against a single document instead of the whole corpus? It seems I am missing some simple understanding of the API.
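To make the question concrete, here is a toy pure-Python sketch (invented numbers, no gensim involved) of the two kinds of output I mean, the topic-word pairs that show_topics() prints versus the per-document topic mixture I am looking for:

```python
# Corpus-level: each topic is a distribution over WORDS. This is what
# show_topics() returns; it describes the trained model, not any one document.
topics = {
    0: [("drive", 0.034), ("card", 0.027), ("disk", 0.015)],
    1: [("game", 0.032), ("team", 0.030), ("play", 0.025)],
}

# Document-level: what I want instead -- for one document, a distribution
# over TOPICS, e.g. "document 0 is 80% topic 1 and 20% topic 0".
doc_topics = [(1, 0.8), (0, 0.2)]

# A per-document topic mixture should sum to (approximately) 1.
assert abs(sum(p for _, p in doc_topics) - 1.0) < 1e-9
```

Is there an API call on the trained model that returns something shaped like doc_topics for a given document's bag-of-words?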
Tags: gensim, lda, topic-model, machine-learning
Category Data Science