What is "energy spectrum" in Latent Semantic Indexing (LSI)?
What is meant by "energy spectrum" in LSI (Latent Semantic Indexing)?
I am doing topic modeling with gensim's LsiModel, and part of the output per chunk is the following:
INFO : preparing a new chunk of documents
INFO : using 100 extra samples and 2 power iterations
INFO : 1st phase: constructing (100000, 600) action matrix
INFO : orthonormalizing (100000, 600) action matrix
INFO : 2nd phase: running dense svd on (600, 20000) matrix
INFO : computing the final decomposition
INFO : keeping 500 factors (discarding 6.560% of energy spectrum)
INFO : merging projections: (100000, 500) + (100000, 500)
INFO : keeping 500 factors (discarding 0.843% of energy spectrum)
INFO : processed documents up to #1400000
The output above was logged about 1,400,000 documents into the run (out of approx. 3,500,000), and the discarded percentage appears to shrink with each chunk. For the 2nd chunk it was higher:
INFO : keeping 500 factors (discarding 6.556% of energy spectrum)
INFO : merging projections: (100000, 500) + (100000, 500)
INFO : keeping 500 factors (discarding 13.469% of energy spectrum)
INFO : processed documents up to #40000
I am not sure whether a high or a low number is better in "discarding X% of energy spectrum". Is the "energy" analogous to entropy? Does discarding mean that information is lost, or does it mean the Singular Value Decomposition is getting better and better as it sees more data? (Or none of the above?)
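To make the question concrete, here is a minimal sketch of what I assume the percentage might measure: the share of the summed squared singular values that belongs to the factors being dropped. This is my guess at the meaning, not something I have confirmed from gensim's source; the function name `discarded_energy_fraction` and the toy singular values are hypothetical.

```python
def discarded_energy_fraction(singular_values, k):
    """Fraction of total 'energy' lost when only the top-k factors are kept.

    Assumption (not confirmed against gensim): energy = squared singular value,
    and singular_values is sorted in descending order.
    """
    energies = [float(s) ** 2 for s in singular_values]
    total = sum(energies)
    # Everything after the first k factors is what gets thrown away.
    return sum(energies[k:]) / total


# Toy example: four singular values, keep the two largest.
toy_singular_values = [4.0, 2.0, 1.0, 1.0]
frac = discarded_energy_fraction(toy_singular_values, k=2)
print(f"keeping 2 factors (discarding {100 * frac:.3f}% of energy spectrum)")
```

If this reading is right, a lower percentage would mean the 500 kept factors account for more of the data's variance, but I would like confirmation.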
Topic lsi gensim topic-model
Category Data Science