How best to embed large and noisy documents

I have a large corpus of documents (web pages) collected from various sites, each around 10k-30k characters. I am processing them to extract as much relevant text as possible, but they are never perfect.

Right now I am creating a document for each page, processing it with TF-IDF, and then creating a dense feature vector using UMAP, roughly as sketched below.
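This is a rough sketch of that pipeline, assuming scikit-learn's `TfidfVectorizer` and the umap-learn package; the parameter values and placeholder documents are just for illustration:

```python
# Sketch of the current pipeline: TF-IDF followed by UMAP.
# Assumes scikit-learn and umap-learn; parameter values are placeholders, not tuned.
from sklearn.feature_extraction.text import TfidfVectorizer
import umap

# Placeholder corpus; in practice this is the list of cleaned page texts.
docs = [f"placeholder cleaned page text number {i}" for i in range(200)]

# Sparse TF-IDF representation of each page.
tfidf = TfidfVectorizer(max_features=20000, stop_words="english")
X = tfidf.fit_transform(docs)

# UMAP projects the sparse TF-IDF vectors into a dense, lower-dimensional space.
reducer = umap.UMAP(n_components=50, metric="cosine", random_state=42)
dense_embeddings = reducer.fit_transform(X)
```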

My final goal is to pick out the differences between the articles for similarity analysis, clustering, and classification; at this stage, however, my goal is to generate the best possible embeddings.

For this type of data, what is the best method to create document embeddings?

Also, is it possible (and if so, how) to embed various parts of the page (title, description, tags) separately and then combine them into a final vector?

Topics: tfidf, word-embeddings, nlp, python, machine-learning

Category: Data Science


Transformer models such as BERT and DistilBERT can be used to produce document embeddings.
Because they use self-attention, transformer models capture context more accurately than bag-of-words approaches such as TF-IDF.
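A minimal sketch, assuming the sentence-transformers package and the `all-MiniLM-L6-v2` model (both assumptions, not something from the question); note that inputs longer than the model's token limit are truncated:

```python
# Sketch using sentence-transformers (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

# Hypothetical cleaned page texts.
docs = [
    "First cleaned web page text ...",
    "Second cleaned web page text ...",
]

# all-MiniLM-L6-v2 is a small, general-purpose sentence-embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# encode() returns one fixed-size vector per document (384 dims for this model);
# text beyond the model's token limit is truncated.
embeddings = model.encode(docs, show_progress_bar=True)
print(embeddings.shape)  # (2, 384)
```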


Doc2Vec and similar paragraph-vector algorithms are another useful way to create document embeddings.
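A minimal sketch assuming gensim; the `vector_size`, `epochs`, and tokenization choices below are illustrative, not tuned values:

```python
# Sketch using gensim's Doc2Vec (pip install gensim).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Hypothetical cleaned page texts.
docs = [
    "First cleaned web page text ...",
    "Second cleaned web page text ...",
]

# Each document gets a unique tag so its vector can be looked up later.
tagged = [TaggedDocument(words=simple_preprocess(d), tags=[i]) for i, d in enumerate(docs)]

# vector_size and epochs are illustrative; tune them on your corpus.
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=40, workers=4)

doc_vector = model.dv[0]  # embedding of the first training document
new_vector = model.infer_vector(simple_preprocess("Unseen page text"))  # embed a new page
```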

You should try to include as much data and metadata as possible; fields such as the title, description, and tags can be embedded separately and then combined, as sketched below.
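One simple way to combine per-field embeddings, sketched here with sentence-transformers; the field names and weights are hypothetical, and both weighted averaging and concatenation are common choices:

```python
# Sketch: embed title, description, and body separately, then combine.
# The field weights below are hypothetical and should be tuned for your task.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

page = {
    "title": "Example article title",
    "description": "Short description extracted from the page metadata.",
    "body": "Main cleaned article text ...",
}

# One embedding per field.
field_vectors = {name: model.encode(text) for name, text in page.items()}

# Option 1: weighted average (keeps the original dimensionality).
weights = {"title": 2.0, "description": 1.0, "body": 1.0}  # hypothetical weights
avg = sum(weights[k] * v for k, v in field_vectors.items()) / sum(weights.values())

# Option 2: concatenation (lets a downstream model learn the field weighting).
concat = np.concatenate([field_vectors[k] for k in ("title", "description", "body")])
```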

The size of the documents does not matter much for methods like TF-IDF or Doc2Vec, because every document is projected into a fixed-dimensional embedding space. Transformer models, however, have an input-length limit (typically around 512 tokens), so long pages need to be truncated or split into chunks whose embeddings are then pooled.
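A minimal sketch of chunking and mean-pooling long documents, again assuming sentence-transformers; the 200-word chunk size is an arbitrary illustrative choice:

```python
# Sketch: split a long document into chunks, embed each, and average.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_long_document(text, chunk_words=200):
    """Embed a long document by mean-pooling chunk embeddings.

    chunk_words=200 is an illustrative value, not a tuned one.
    """
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    chunk_embeddings = model.encode(chunks)
    return chunk_embeddings.mean(axis=0)

vec = embed_long_document("Very long cleaned page text ... " * 500)
```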

As for noise, its effect will be task-specific. If you are doing similarity analysis and clustering, nearest-neighbour search under cosine distance and density-based clustering are reasonably robust to noisy embeddings.
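For example, a sketch of cosine-distance nearest-neighbour lookup and DBSCAN clustering on precomputed embeddings, using scikit-learn; the random embeddings and the `eps` value are placeholders:

```python
# Sketch: similarity lookup and clustering on precomputed document embeddings.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))  # placeholder for real document embeddings

# Nearest neighbours under cosine distance.
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings[:1])  # 5 most similar docs to document 0

# Density-based clustering; eps is a hypothetical value to tune per corpus.
labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(embeddings)
```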
