I am building a document classifier using the vector representation of each document in the training set (i.e., a row of the Document-Term Matrix). Now I need to test the model on the test data. But how can I represent a new document in the Document-Term Matrix, given that some of its terms may not appear in the training data?
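A common way to handle this, sketched below with scikit-learn's CountVectorizer (the document strings are invented placeholders), is to fit the vectorizer on the training set only, which freezes the vocabulary, and then transform test documents with it: terms unseen during training are simply dropped, so test vectors keep the same columns as the training matrix.

```python
# Minimal sketch: fitting on the training set fixes the vocabulary,
# so transforming a test document keeps only the columns of the
# training Document-Term Matrix; unseen terms are silently dropped.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the economy is growing", "taxes will be lowered"]  # placeholder data
test_docs = ["the economy and unemployment"]  # "unemployment" never seen in training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses it; same number of columns

print(X_train.shape[1] == X_test.shape[1])      # True
```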
Description: I have 24 documents, each of around 2.5K tokens. They are public speeches. My text preprocessing pipeline is a generic one, including punctuation removal, expansion of English contractions, stopword removal and tokenization. I have implemented and analyzed both Latent Dirichlet Allocation and Latent Semantic Analysis in Python with gensim. I am selecting the optimal number of topics by topic coherence. Issue: For any number of topics K (I have tried many, e.g. 10, 50, 100, 200) …
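For reference, a minimal sketch of scoring a range of K values by coherence with gensim; the tiny `texts` list is a stand-in for the real tokenized speeches.

```python
# Sketch: train one LDA model per candidate K and compare c_v coherence.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["public", "speech", "economy"], ["policy", "reform", "vote"]]  # placeholder
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (10, 50, 100, 200):
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(k, cm.get_coherence())
```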
I followed gensim's Core Tutorial and built an LSA classification, topic-modeling and document-similarity model for the newsgroups dataset. My code is available here. I need help with the three concepts below. Topic classification: I get only 50% accuracy with the KNN algorithm. Topic modeling: the words highlighted for each of the 20 topics don't stand out. Document similarity: I wrote a small test and found that document similarity doesn't produce great results either. I am going to follow up on it …
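For context, the core-tutorial flow being followed looks roughly like this; a sketch with made-up `texts`, so the linked code will differ in its details.

```python
# Rough sketch of the gensim LSI + similarity pipeline from the core tutorial.
from gensim import corpora, models, similarities

texts = [["graphics", "image", "rendering"], ["hockey", "team", "season"]]  # placeholder
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

tfidf = models.TfidfModel(corpus)                      # reweight raw counts
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

index = similarities.MatrixSimilarity(lsi[tfidf[corpus]])
query = lsi[tfidf[dictionary.doc2bow(["image", "rendering"])]]
print(list(index[query]))                              # cosine similarity to each document
```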
I was solving a problem comparing 3 million 2018 documents against the 2019 documents, with three text attributes compared between one item and the other. I used Latent Semantic Indexing (LSI) for one variable, containing about 5 word tokens, with reasonable performance. What is the minimum document size for LSI/LDA to compute document similarity on a multivariate problem [Item Description (text, ~10 tokens), Item Specification (text, ~5 tokens)]? I had used cosine similarity, string distances …
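As an illustration of the cosine-similarity step on such short attributes, here is a sketch with invented item strings, using scikit-learn's TF-IDF rather than whatever vectorization the original setup used.

```python
# Sketch: TF-IDF cosine similarity between two short item descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

desc_2018 = ["stainless steel hex bolt m8"]   # placeholder item description
desc_2019 = ["hex bolt m8 stainless"]

vec = TfidfVectorizer().fit(desc_2018 + desc_2019)
sim = cosine_similarity(vec.transform(desc_2018), vec.transform(desc_2019))
print(sim[0, 0])   # 1.0 means identical direction in TF-IDF space
```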
What is meant by the energy spectrum in LSI (Latent Semantic Indexing)? I am doing topic modeling with gensim's LsiModel, and part of the output per chunk is the following:
INFO : preparing a new chunk of documents
INFO : using 100 extra samples and 2 power iterations
INFO : 1st phase: constructing (100000, 600) action matrix
INFO : orthonormalizing (100000, 600) action matrix
INFO : 2nd phase: running dense svd on (600, 20000) matrix
INFO : computing the final decomposition
INFO …
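In SVD terms, "energy" usually refers to the squared singular values, and gensim's LsiModel keeps the retained singular values in its `projection.s` attribute, so the retained energy spectrum can be inspected directly. A sketch, with placeholder corpus and dictionary:

```python
# Sketch: the energy spectrum of the truncated SVD kept by LsiModel.
from gensim import corpora, models

texts = [["budget", "tax", "spending"], ["war", "peace", "treaty"]]  # placeholder
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
s = lsi.projection.s          # singular values retained by the decomposition
energy = s ** 2               # "energy" carried by each latent factor
print(energy / energy.sum())  # fraction of retained energy per factor
```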
Suppose my dataset contains some very small documents (about 20 words each), and each of them may contain words in at least two languages (a mix of Malay and English, for instance). There are also some numbers inside each of them. Just out of curiosity: while this is usually customizable, why do some tokenizers choose by default to ignore tokens that are just numbers, or anything that doesn't meet a certain length? For example, the CountVectorizer in scikit-learn ignores words that do not have …
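Concretely, CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of two or more word characters, so one-character tokens (including single-digit numbers) vanish; widening the pattern keeps them. A small sketch with a made-up mixed-language sentence:

```python
# Sketch: the default token_pattern drops one-character tokens such as "2".
from sklearn.feature_extraction.text import CountVectorizer

doc = ["saya ada 2 cats dan 10 dogs"]  # placeholder mixed Malay/English text

default_vec = CountVectorizer()        # token_pattern=r"(?u)\b\w\w+\b"
print(sorted(default_vec.fit(doc).vocabulary_))   # '2' is gone, '10' survives

loose_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
print(sorted(loose_vec.fit(doc).vocabulary_))     # single characters kept as well
```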
I am using the Gensim LsiModel. I have a set of documents and a fixed set of topics. Some of the documents are already categorized; others are not. The goal is to assign each uncategorized document to the most relevant category. I am using a similarity search as described here: http://radimrehurek.com/gensim/tut3.html. So I am comparing each uncategorized document to the categorized corpus to find the most relevant category. I am seeing very good performance on documents which have an appropriate category. …
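The lookup being described follows tut3 roughly like this; a sketch in which the categorized texts, labels and query are all invented.

```python
# Sketch: assign an uncategorized document to the label of its most
# similar categorized document via an LSI similarity index.
from gensim import corpora, models, similarities

categorized_texts = [["football", "match", "goal"], ["election", "vote", "party"]]
labels = ["sports", "politics"]  # placeholder categories

dictionary = corpora.Dictionary(categorized_texts)
corpus = [dictionary.doc2bow(t) for t in categorized_texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

query = lsi[dictionary.doc2bow(["vote", "party", "leader"])]
sims = index[query]                  # cosine similarity to every categorized doc
print(labels[int(sims.argmax())])    # -> "politics"
```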