Unsupervised document similarity state of the art

I have a set of N documents with lengths ranging from 0 to more than 20,000 characters. I want to calculate a similarity score between 0 and 1 for every pair of documents, where a higher number indicates higher similarity. Assume below that deploying a supervised model is infeasible due to resource constraints that are not necessarily data-science related (gathering labels is expensive, infrastructure for supervised models cannot be approved for whatever reason, etc.).

Approaches I have considered:

  1. tf-idf (see the baseline sketch after this list)
  2. Smooth Inverse Frequency (SIF) embeddings and their developments (uSIF, p-SIF). https://openreview.net/pdf?id=SyK00v5xx https://www.aclweb.org/anthology/W18-3012/ https://arxiv.org/abs/2005.09069
  3. BERT or BERT-like embeddings, e.g., https://arxiv.org/abs/2010.06467
  4. Hierarchical Optimal Transport for Document Representation (HOTT): https://papers.nips.cc/paper/2019/hash/8b5040a8a5baf3e0e67386c2e3a9b903-Abstract.html
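For reference, here is a minimal sketch of the tf-idf baseline (option 1), assuming scikit-learn and a placeholder `docs` list; settings like `sublinear_tf` and stop-word removal are tuning choices of mine, not something prescribed by the papers above.

```python
# Minimal tf-idf baseline: vectorize documents and compute all pairwise cosine similarities.
# Assumes scikit-learn is installed; `docs` is a placeholder for the real N documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "first document text ...",
    "second document text ...",
    "a much longer third document ...",
]

# Sublinear tf and English stop-word removal are common defaults worth tuning.
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
X = vectorizer.fit_transform(docs)   # sparse N x V matrix

# Cosine similarity of non-negative tf-idf vectors already lies in [0, 1].
S = cosine_similarity(X)             # dense N x N similarity matrix
print(S.round(2))
```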

Question: Is there an unsupervised technique that has been shown in a peer-reviewed setting to achieve higher accuracy (or F1 or similar) on long texts (more than, say, 10,000 characters) than HOTT?

Background: The HOTT paper benchmarks various approaches with a k-NN classifier and shows that HOTT performs best, but not dramatically better than tf-idf (HOTT has 0.52 vs tf-idf's 0.66 normalized error). Note that while the HOTT algorithm is unsupervised, the datasets in the paper are labeled; otherwise a benchmark would not be possible. The SIF papers mostly deal with the STS datasets, which are not long texts. p-SIF has a benchmark on the Reuters dataset, but uses a supervised SVM approach. Interestingly, the HOTT paper finds that SIF does not perform well with the k-NN approach, at 0.79 normalized error. In many cases BERT requires pre-training, and when it does not, its max- or average-pooled performance appears to be worse than GloVe embeddings (https://arxiv.org/abs/2010.06467, page 114). I have also not been able to find unsupervised benchmarks for Doc2Vec or the Universal Sentence Encoder (USE).
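For context on that protocol: as I understand it from the HOTT paper (this is not their exact code), the representation is built without labels, and the labels are used only to score it via k-NN classification error. A rough sketch, assuming scikit-learn and stand-in data:

```python
# Sketch of a k-NN style evaluation of an unsupervised document representation.
# `X` stands in for an N x d embedding matrix from any unsupervised method
# (tf-idf, SIF, pooled BERT, ...); `y` stands in for the dataset's class labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # stand-in for real document embeddings
y = rng.integers(0, 4, size=200)      # stand-in for real labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The distance metric is part of the protocol; cosine vs. Euclidean can change the ranking.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train, y_train)
error = 1.0 - knn.score(X_test, y_test)
print(f"k-NN test error: {error:.2f}")
```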

There is additionally the question of how to calculate the similarity once the embedding is obtained (e.g., https://www.aclweb.org/anthology/N19-1100.pdf), but that is out of scope for this question unless it affects the comparison between unsupervised benchmarks (e.g., the k-NN approach can use various distance metrics, which may affect accuracy).
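One small practical aside on that out-of-scope point, since I want scores in [0, 1]: cosine similarity of non-negative tf-idf vectors is already in that range, but for dense embeddings (SIF, pooled BERT) it can be negative, so some rescaling is needed. A minimal sketch of one such mapping, assuming a generic embedding matrix `E`:

```python
# Minimal sketch: turning cosine similarity of arbitrary dense embeddings into a [0, 1] score.
# `E` is a hypothetical N x d matrix of document embeddings from any of the methods above.
import numpy as np

def pairwise_similarity(E: np.ndarray) -> np.ndarray:
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    E_unit = E / np.clip(norms, 1e-12, None)   # guard against zero-length documents
    cos = E_unit @ E_unit.T                    # N x N cosine similarities in [-1, 1]
    return (cos + 1.0) / 2.0                   # shift/scale into [0, 1]

E = np.random.default_rng(1).normal(size=(5, 300))  # stand-in embeddings
print(pairwise_similarity(E).round(2))
```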

If the benchmarks in HOTT are representative and no other methods exist that perform substantially better, it is tempting to conclude that tf-idf is still a strong approach, since it is so simple to implement and understand (it is certainly simpler than HOTT). If that is the case, I think it is a remarkable conclusion given the deep learning developments of the last 5-10 years.

Related posts that do not specifically address this question:

  * What is considered short and long text in NLP (document similarity)
  * Alternatives to TF-IDF and Cosine Similarity when comparing documents of differing formats
  * How to measure the similarity between two text documents?
  * Cluster documents based on topic similarity
  * Document similarity: Vector embedding versus BoW performance?
  * Word2Vec - document similarity
  * Weighted sum of word vectors for document similarity
  * Document similarity
  * Cosine similarity between query and document confusion
  * Evaluate document similarity / content-based recommender system
  * Use embeddings to find similarity between documents

Tags: similar-documents, unsupervised-learning
