If I tokenize words of length 1, what would happen when I do topic modeling?

Suppose my dataset contains some very small documents (about 20 words each), and each of them may contain words in at least two languages (a mix of Malay and English, for instance). There are also some numbers inside each of them.

Just out of curiosity: while this behavior is usually customizable, why do some tokenizers choose by default to ignore tokens that are just numbers, or tokens below a certain length? For example, scikit-learn's CountVectorizer ignores words with fewer than two alphanumeric characters, and the tokenizer utility in gensim ignores words containing digits.

I used CountVectorizer in the end and configured it to accept words containing digits and words of length 1 as well. This is because I need exact matches: a slight difference in one of those length-1 words may point to a different document.
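For reference, one way to do this is to override CountVectorizer's token_pattern, whose default regex is what drops single-character tokens in the first place. A minimal sketch (the example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical short mixed-language documents.
docs = ["barang gred a 1", "barang gred b 1"]

# The default pattern r"(?u)\b\w\w+\b" keeps only tokens of 2+ word
# characters, so "a", "b", and "1" are all dropped.
default_vec = CountVectorizer()
print(default_vec.fit(docs).get_feature_names_out())  # ['barang' 'gred']

# Dropping one \w keeps single-character tokens, including lone digits.
vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
print(vec.fit(docs).get_feature_names_out())  # ['1' 'a' 'b' 'barang' 'gred']
```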

I am currently trying topic modeling (gensim's LSI) to perform topic analysis, but my main intention is to reduce the dimensionality so I can feed the vectors to Spotify's Annoy library for fast approximate search (from 58k features down to 500 topics). I also expect it to reduce the time and memory needed to compute classifier models.
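As a rough sketch of that pipeline, assuming toy data and a small num_topics so it runs as-is (the real setup would use 500), here is gensim's LsiModel feeding an Annoy index:

```python
from gensim import corpora, models
from annoy import AnnoyIndex

# Hypothetical tokenized corpus; real documents would be ~20 words each.
texts = [["barang", "gred", "a", "1"],
         ["barang", "gred", "b", "1"],
         ["item", "grade", "a", "2"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

num_topics = 2  # 500 in the question's setting
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=num_topics)

# Index the dense LSI vectors with Annoy; 'angular' distance is
# essentially cosine distance.
index = AnnoyIndex(num_topics, "angular")
for i, doc in enumerate(lsi[bow_corpus]):
    vec = [0.0] * num_topics
    for topic_id, weight in doc:
        vec[topic_id] = weight
    index.add_item(i, vec)
index.build(10)  # 10 trees

# Query: project a new document into LSI space, then search approximately.
query = lsi[dictionary.doc2bow(["barang", "gred", "b", "1"])]
qvec = [0.0] * num_topics
for topic_id, weight in query:
    qvec[topic_id] = weight
print(index.get_nns_by_vector(qvec, 3))
```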

So the question really is: if I tokenize words of length 1, would it make sense to perform topic modeling? Even if I do a brute-force search by just comparing cosine similarity (no fancy ANN), would it affect the precision or accuracy? That is, would it be able to recognize the slight change in a length-1 word in the query and retrieve the right document?

Topic: lsi, information-retrieval

Category: Data Science


The libraries usually exclude length-1 tokens and tokens with no alphanumeric characters because they are typically noise and have no descriptive power. That is, these tokens are usually not helpful in, say, distinguishing relevant from non-relevant documents.

However, if in your domain you feel that length-1 tokens can be helpful, feel free to keep them. For example, if all the documents that contain "1" belong to the same topic, it may be a good idea to preserve this token: "1" has descriptive power in this case, since it can help distinguish one particular topic from the rest.

Now, your next question is about LSI. For LSI there is no difference whether a column in the document-term matrix corresponds to a 1-character token or to a 5-character token, so you can use LSI in your analysis.
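A quick way to check this on your own data is to look at retrieval in the reduced space. Here is a small sketch using scikit-learn's TruncatedSVD as the LSI step, with made-up documents that differ only in a single-character token:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["barang gred a 1", "barang gred b 1", "item grade a 2"]

# Keep single-character tokens so "a" vs "b" become separate columns.
X = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

# LSI is a truncated SVD of the document-term matrix. Each kept token,
# whether 1 or 5 characters long, is just another column here.
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Cosine similarity of doc 0 against all docs in the reduced space.
print(cosine_similarity(Z[:1], Z))
```

One caveat worth testing: whether the "a" vs "b" distinction survives depends on how many components you keep, since aggressive truncation can wash out rare tokens, so check precision on a few held-out queries before committing to 500 topics.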
