I am building a document classifier using the vector representation of each document in the training set (i.e., a row of the Document-Term Matrix). Now I need to test the model on the test data. But how can I represent a new document in the Document-Term Matrix, given that some of its terms may not appear in the training data?
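A common way to handle this, sketched below with scikit-learn's CountVectorizer (the document strings are invented placeholders), is to fit the vectorizer on the training set only, which freezes the vocabulary, and then transform test documents with it: terms unseen during training are simply dropped, so test vectors keep the same columns as the training matrix.

```python
# Minimal sketch: fitting on the training set fixes the vocabulary,
# so transforming a test document keeps only the columns of the
# training Document-Term Matrix; unseen terms are silently dropped.
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the economy is growing", "taxes will be lowered"]  # placeholder data
test_docs = ["the economy and unemployment"]  # "unemployment" never seen in training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses it; same number of columns

print(X_train.shape[1] == X_test.shape[1])      # True
```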
Description: I have 24 documents, each of around 2.5K tokens. They are public speeches. My text preprocessing pipeline is a generic one, including punctuation removal, expansion of English contractions, stopword removal and tokenization. I have implemented and analyzed both Latent Dirichlet Allocation and Latent Semantic Analysis in Python with gensim. I am selecting the optimal number of topics by topic coherence. Issue: For any number of topics K (I have tried many, e.g. 10, 50, 100, 200) …
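For reference, a minimal sketch of scoring a range of K values by coherence with gensim; the tiny `texts` list is a stand-in for the real tokenized speeches.

```python
# Sketch: train one LDA model per candidate K and compare c_v coherence.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["public", "speech", "economy"], ["policy", "reform", "vote"]]  # placeholder
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (10, 50, 100, 200):
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(k, cm.get_coherence())
```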
I followed gensim's Core Tutorial and built an LSA classification, topic-modeling and document-similarity model for the newsgroups dataset. My code is available here. I need help with the three concepts below. Topic classification: I get only 50% accuracy with the KNN algorithm. Topic modeling: the words highlighted for each of the 20 topics don't stand out. Document similarity: I wrote a small test and found that document similarity doesn't produce great results either. I am going to follow up on it …
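For context, the core-tutorial flow being followed looks roughly like this; a sketch with made-up `texts`, so the linked code will differ in its details.

```python
# Rough sketch of the gensim LSI + similarity pipeline from the core tutorial.
from gensim import corpora, models, similarities

texts = [["graphics", "image", "rendering"], ["hockey", "team", "season"]]  # placeholder
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

tfidf = models.TfidfModel(corpus)                      # reweight raw counts
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

index = similarities.MatrixSimilarity(lsi[tfidf[corpus]])
query = lsi[tfidf[dictionary.doc2bow(["image", "rendering"])]]
print(list(index[query]))                              # cosine similarity to each document
```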
I was solving a problem comparing 3 million 2018 documents against the 2019 documents, with three text attributes compared between one item and the other. I used Latent Semantic Indexing (LSI) for one variable, containing about 5 word tokens, with reasonable performance. What is the minimum document size for LSI/LDA to compute document similarity on a multivariate problem [Item Description (text, ~10 tokens), Item Specification (text, ~5 tokens)]? I had used cosine similarity, string distances …
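As an illustration of the cosine-similarity step on such short attributes, here is a sketch with invented item strings, using scikit-learn's TF-IDF rather than whatever vectorization the original setup used.

```python
# Sketch: TF-IDF cosine similarity between two short item descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

desc_2018 = ["stainless steel hex bolt m8"]   # placeholder item description
desc_2019 = ["hex bolt m8 stainless"]

vec = TfidfVectorizer().fit(desc_2018 + desc_2019)
sim = cosine_similarity(vec.transform(desc_2018), vec.transform(desc_2019))
print(sim[0, 0])   # 1.0 means identical direction in TF-IDF space
```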
What is meant by the energy spectrum in LSI (Latent Semantic Indexing)? I am doing topic modeling with gensim's LsiModel, and part of the output per chunk is the following:
INFO : preparing a new chunk of documents
INFO : using 100 extra samples and 2 power iterations
INFO : 1st phase: constructing (100000, 600) action matrix
INFO : orthonormalizing (100000, 600) action matrix
INFO : 2nd phase: running dense svd on (600, 20000) matrix
INFO : computing the final decomposition
INFO …
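In SVD terms, "energy" usually refers to the squared singular values, and gensim's LsiModel keeps the retained singular values in its `projection.s` attribute, so the retained energy spectrum can be inspected directly. A sketch, with placeholder corpus and dictionary:

```python
# Sketch: the energy spectrum of the truncated SVD kept by LsiModel.
from gensim import corpora, models

texts = [["budget", "tax", "spending"], ["war", "peace", "treaty"]]  # placeholder
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
s = lsi.projection.s          # singular values retained by the decomposition
energy = s ** 2               # "energy" carried by each latent factor
print(energy / energy.sum())  # fraction of retained energy per factor
```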
Suppose my dataset contains some very small documents (about 20 words each), and each of them may contain words in at least two languages (a mix of Malay and English, for instance). There are also some numbers inside each of them. Just out of curiosity: while this is usually customizable, why do some tokenizers choose by default to ignore tokens that are just numbers, or anything that doesn't meet a certain length? For example, the CountVectorizer in scikit-learn ignores words that do not have …
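Concretely, CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", only matches tokens of two or more word characters, so one-character tokens (including single-digit numbers) vanish; widening the pattern keeps them. A small sketch with a made-up mixed-language sentence:

```python
# Sketch: the default token_pattern drops one-character tokens such as "2".
from sklearn.feature_extraction.text import CountVectorizer

doc = ["saya ada 2 cats dan 10 dogs"]  # placeholder mixed Malay/English text

default_vec = CountVectorizer()        # token_pattern=r"(?u)\b\w\w+\b"
print(sorted(default_vec.fit(doc).vocabulary_))   # '2' is gone, '10' survives

loose_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
print(sorted(loose_vec.fit(doc).vocabulary_))     # single characters kept as well
```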
I am using the Gensim LsiModel. I have a set of documents and a fixed set of topics. Some of the documents are already categorized; others are not. The goal is to assign each uncategorized document to the most relevant category. I am using a similarity search as described here: http://radimrehurek.com/gensim/tut3.html. So I am comparing each uncategorized document to the categorized corpus to find the most relevant category. I am seeing very good performance on documents which have an appropriate category. …
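The lookup being described follows tut3 roughly like this; a sketch in which the categorized texts, labels and query are all invented.

```python
# Sketch: assign an uncategorized document to the label of its most
# similar categorized document via an LSI similarity index.
from gensim import corpora, models, similarities

categorized_texts = [["football", "match", "goal"], ["election", "vote", "party"]]
labels = ["sports", "politics"]  # placeholder categories

dictionary = corpora.Dictionary(categorized_texts)
corpus = [dictionary.doc2bow(t) for t in categorized_texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[corpus])

query = lsi[dictionary.doc2bow(["vote", "party", "leader"])]
sims = index[query]                  # cosine similarity to every categorized doc
print(labels[int(sims.argmax())])    # -> "politics"
```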