How best to embed large and noisy documents

I have a large corpus of documents (web pages) collected from various sites, each around 10k-30k characters. I am processing them to extract as much relevant text as possible, but they are never perfect.

Right now I am creating a document for each page, processing it with TF-IDF, and then creating a dense feature vector using UMAP, roughly as sketched below.
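This is a rough sketch of that pipeline, assuming scikit-learn's `TfidfVectorizer` and the umap-learn package; the parameter values and placeholder documents are just for illustration:

```python
# Sketch of the current pipeline: TF-IDF followed by UMAP.
# Assumes scikit-learn and umap-learn; parameter values are placeholders, not tuned.
from sklearn.feature_extraction.text import TfidfVectorizer
import umap

# Placeholder corpus; in practice this is the list of cleaned page texts.
docs = [f"placeholder cleaned page text number {i}" for i in range(200)]

# Sparse TF-IDF representation of each page.
tfidf = TfidfVectorizer(max_features=20000, stop_words="english")
X = tfidf.fit_transform(docs)

# UMAP projects the sparse TF-IDF vectors into a dense, lower-dimensional space.
reducer = umap.UMAP(n_components=50, metric="cosine", random_state=42)
dense_embeddings = reducer.fit_transform(X)
```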

My final goal is to pick out the differences between the articles for similarity analysis, clustering, and classification; at this stage, however, my goal is to generate the best possible embeddings.

For this type of data, what is the best method to create document embeddings?

Also, is it possible (and if so, how) to embed various parts of the page (title, description, tags) separately and then combine them into a final vector?

Topics: tfidf, word-embeddings, nlp, python, machine-learning

Category: Data Science


Transformer models such as BERT and DistilBERT can be used to produce document embeddings.
Because they use self-attention, transformer models capture context more accurately than bag-of-words approaches such as TF-IDF.
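A minimal sketch, assuming the sentence-transformers package and the `all-MiniLM-L6-v2` model (both assumptions, not something from the question); note that inputs longer than the model's token limit are truncated:

```python
# Sketch using sentence-transformers (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer

# Hypothetical cleaned page texts.
docs = [
    "First cleaned web page text ...",
    "Second cleaned web page text ...",
]

# all-MiniLM-L6-v2 is a small, general-purpose sentence-embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

# encode() returns one fixed-size vector per document (384 dims for this model);
# text beyond the model's token limit is truncated.
embeddings = model.encode(docs, show_progress_bar=True)
print(embeddings.shape)  # (2, 384)
```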


Doc2Vec and similar paragraph-vector algorithms are another useful way to create document embeddings.
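A minimal sketch assuming gensim; the `vector_size`, `epochs`, and tokenization choices below are illustrative, not tuned values:

```python
# Sketch using gensim's Doc2Vec (pip install gensim).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Hypothetical cleaned page texts.
docs = [
    "First cleaned web page text ...",
    "Second cleaned web page text ...",
]

# Each document gets a unique tag so its vector can be looked up later.
tagged = [TaggedDocument(words=simple_preprocess(d), tags=[i]) for i, d in enumerate(docs)]

# vector_size and epochs are illustrative; tune them on your corpus.
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=40, workers=4)

doc_vector = model.dv[0]  # embedding of the first training document
new_vector = model.infer_vector(simple_preprocess("Unseen page text"))  # embed a new page
```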

You should try to include as much data and metadata as possible; fields such as the title, description, and tags can be embedded separately and then combined, as sketched below.
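One simple way to combine per-field embeddings, sketched here with sentence-transformers; the field names and weights are hypothetical, and both weighted averaging and concatenation are common choices:

```python
# Sketch: embed title, description, and body separately, then combine.
# The field weights below are hypothetical and should be tuned for your task.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

page = {
    "title": "Example article title",
    "description": "Short description extracted from the page metadata.",
    "body": "Main cleaned article text ...",
}

# One embedding per field.
field_vectors = {name: model.encode(text) for name, text in page.items()}

# Option 1: weighted average (keeps the original dimensionality).
weights = {"title": 2.0, "description": 1.0, "body": 1.0}  # hypothetical weights
avg = sum(weights[k] * v for k, v in field_vectors.items()) / sum(weights.values())

# Option 2: concatenation (lets a downstream model learn the field weighting).
concat = np.concatenate([field_vectors[k] for k in ("title", "description", "body")])
```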

The size of the documents does not matter much for methods like TF-IDF or Doc2Vec, because every document is projected into a fixed-dimensional embedding space. Transformer models, however, have an input-length limit (typically around 512 tokens), so long pages need to be truncated or split into chunks whose embeddings are then pooled.
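A minimal sketch of chunking and mean-pooling long documents, again assuming sentence-transformers; the 200-word chunk size is an arbitrary illustrative choice:

```python
# Sketch: split a long document into chunks, embed each, and average.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_long_document(text, chunk_words=200):
    """Embed a long document by mean-pooling chunk embeddings.

    chunk_words=200 is an illustrative value, not a tuned one.
    """
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    chunk_embeddings = model.encode(chunks)
    return chunk_embeddings.mean(axis=0)

vec = embed_long_document("Very long cleaned page text ... " * 500)
```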

As for noise, its effect will be task-specific. If you are doing similarity analysis and clustering, nearest-neighbour search under cosine distance and density-based clustering are reasonably robust to noisy embeddings.
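For example, a sketch of cosine-distance nearest-neighbour lookup and DBSCAN clustering on precomputed embeddings, using scikit-learn; the random embeddings and the `eps` value are placeholders:

```python
# Sketch: similarity lookup and clustering on precomputed document embeddings.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 384))  # placeholder for real document embeddings

# Nearest neighbours under cosine distance.
nn = NearestNeighbors(n_neighbors=5, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings[:1])  # 5 most similar docs to document 0

# Density-based clustering; eps is a hypothetical value to tune per corpus.
labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(embeddings)
```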
