If I tokenize words of length 1, what happens when I do topic modeling?
Suppose my dataset contains some very small documents (about 20 words each). Each of them may contain words in at least two languages (a mix of Malay and English, for instance), and each also contains some numbers.
Just out of curiosity: while this is usually customizable, why do some tokenizers choose by default to ignore tokens that are just numbers, or anything that doesn't meet a certain minimum length? For example, CountVectorizer in scikit-learn ignores tokens with fewer than two alphanumeric characters, and gensim's tokenizer utility ignores words containing digits.
In the end I used CountVectorizer and configured it to accept words containing digits as well as words of length 1. This is because I need exact matches: a slight difference in one of those length-1 words may point to a different document.
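Roughly, this is what my vectorizer setup looks like (a minimal sketch; the example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The default token_pattern r"(?u)\b\w\w+\b" silently drops single-character tokens;
# the looser pattern r"(?u)\b\w+\b" also keeps length-1 tokens (letters or digits).
docs = [
    "saya nak beli item A 123",   # made-up mixed Malay/English documents
    "saya nak beli item B 456",
]

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", lowercase=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# 'a' and 'b' now survive tokenization, so the exact match I need is preserved
```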
I am currently trying topic modeling (gensim's LSI) to perform topic analysis, but my main intention is to reduce the dimensionality (from 58k features down to 500 topics) so I can feed the vectors to Spotify's Annoy library for fast approximate search. I also expect it to reduce the time and memory needed to train classifier models.
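This is a sketch of the pipeline I have in mind, assuming `X` and `vectorizer` come from the snippet above but were fitted on the full dataset (Sparse2Corpus bridges the scikit-learn matrix into gensim; the query text is made up):

```python
from gensim import matutils, models
from annoy import AnnoyIndex

# Convert the sparse document-term matrix into a gensim corpus
# (documents_columns=False because rows of X are documents).
corpus = matutils.Sparse2Corpus(X, documents_columns=False)
id2word = {i: term for term, i in vectorizer.vocabulary_.items()}

num_topics = 500  # target dimensionality
lsi = models.LsiModel(corpus, id2word=id2word, num_topics=num_topics)

# Index the dense LSI vectors with Annoy; "angular" distance corresponds to cosine.
index = AnnoyIndex(num_topics, "angular")
for doc_id, doc in enumerate(lsi[corpus]):
    index.add_item(doc_id, matutils.sparse2full(doc, num_topics))
index.build(10)  # 10 trees; more trees = better recall, slower build

# Query: run new text through the same vectorizer + LSI, then look up neighbours.
query = "saya nak beli item A 123"
query_bow = next(iter(matutils.Sparse2Corpus(vectorizer.transform([query]),
                                             documents_columns=False)))
query_vec = matutils.sparse2full(lsi[query_bow], num_topics)
print(index.get_nns_by_vector(query_vec, 10))  # ids of the 10 nearest documents
```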
So the question really is: if I tokenize words of length 1, does it still make sense to perform topic modeling? Even if I do a brute-force search by just comparing cosine similarity (no fancy ANN), would it affect precision or accuracy (i.e., would it still be able to recognize the slight change in a length-1 word in the query and retrieve the right document)?
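For the brute-force baseline I would just compute exact cosine similarity over the same LSI vectors, for example with gensim's MatrixSimilarity (again only a sketch, reusing the names from the snippets above):

```python
from gensim import similarities

# Exact (brute-force) cosine similarity over the LSI vectors,
# as a baseline to compare against Annoy's approximate neighbours.
sim_index = similarities.MatrixSimilarity(lsi[corpus], num_features=num_topics)

sims = sim_index[lsi[query_bow]]                      # similarity to every document
top = sorted(enumerate(sims), key=lambda item: -item[1])[:10]
print(top)                                            # (doc_id, similarity), best first
```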
Topic: lsi, information-retrieval
Category: Data Science