Document similarity when the document size is less than 30 tokens?

I was solving a problem of comparing about 3 million documents from 2018 against documents from 2019. There are three text attributes to be compared between one item and the other. I used Latent Semantic Indexing (LSI) on one variable, containing about 5 word tokens, with reasonable performance.
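As a minimal sketch of the LSI-plus-cosine-similarity setup described above, using scikit-learn's `TfidfVectorizer` and `TruncatedSVD` (the documents, the number of components, and the library choice are illustrative assumptions, not details from the question):

```python
# Sketch: LSI (TruncatedSVD over TF-IDF) followed by cosine similarity,
# for very short item-description documents. All inputs are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs_2018 = ["stainless steel hex bolt m8", "copper wire 2mm insulated"]
docs_2019 = ["hex bolt stainless steel m8", "insulated copper wire 2 mm"]

# Fit TF-IDF on both years together so the vocabularies are shared.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs_2018 + docs_2019)

# LSI: project the TF-IDF matrix onto a low-rank latent space.
# With documents this short, n_components must stay well below the
# vocabulary size or the projection is essentially lossless.
lsi = TruncatedSVD(n_components=2, random_state=0)
vecs = lsi.fit_transform(tfidf)

# Compare every 2018 document against every 2019 document.
sim = cosine_similarity(vecs[: len(docs_2018)], vecs[len(docs_2018):])
print(sim.shape)  # one row per 2018 doc, one column per 2019 doc
```

For 3 million documents the pairwise similarity matrix would not fit in memory, so in practice one would use a nearest-neighbour index rather than the full `cosine_similarity` call shown here.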

  • What is the minimum document size for LSI/LDA to compute document similarity in a multivariate problem [Item Description (text, ~10 tokens), Item Specification (text, ~5 tokens)]?
  • I used cosine similarity and string distances to measure how closely the 2018 descriptions match the 2019 descriptions. Are there any other statistical methods to evaluate the model?
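The string-distance side of the comparison above can be sketched with the standard library alone; `difflib.SequenceMatcher` is used here as a stand-in, since the question does not say which string-distance metric was actually applied:

```python
# Sketch: a character-level similarity ratio between a 2018 and a 2019
# description. SequenceMatcher.ratio() returns a value in [0, 1],
# where 1.0 means the strings are identical.
from difflib import SequenceMatcher

desc_2018 = "stainless steel hex bolt m8"   # illustrative example
desc_2019 = "hex bolt stainless steel m8"   # same tokens, reordered

ratio = SequenceMatcher(None, desc_2018, desc_2019).ratio()
print(ratio)
```

Note that character-level ratios penalise token reordering heavily, which is one reason to combine them with a bag-of-words measure such as cosine similarity.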

Topic: lsi, similarity

Category Data Science
