How best to embed large and noisy documents
I have a large corpus of documents (web pages) collected from various sites, each around 10k-30k characters. I am processing them to extract as much relevant text as possible, but the results are never perfect.
Right now I am creating a document for each page, processing it with TF-IDF, and then creating a dense feature vector using UMAP.
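Roughly what I am doing at the moment (a minimal sketch; `docs` is my list of cleaned page texts and the parameter values are just placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import umap

# TF-IDF over the cleaned page texts
tfidf = TfidfVectorizer(max_features=20000, stop_words="english")
X_sparse = tfidf.fit_transform(docs)          # (n_docs, vocab_size) sparse matrix

# UMAP to reduce the sparse TF-IDF vectors to dense document vectors
reducer = umap.UMAP(n_components=50, metric="cosine", random_state=42)
X_dense = reducer.fit_transform(X_sparse)     # (n_docs, 50) dense embeddings
```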
My end goal is to pick out the differences between the articles for similarity analysis, clustering, and classification; at this stage, however, my goal is simply to generate the best possible embeddings.
For this type of data, what is the best method for creating document embeddings?
Also, is it possible (and if so, how) to embed various parts of the page (title, description, tags) separately and then perhaps combine these into a final vector? A rough sketch of what I have in mind is below.
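Something like this is what I am imagining (purely a sketch: the `embed` helper, the field lists, and the weights are all hypothetical stand-ins, not a working setup):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def embed(texts, dim=100):
    """Stand-in per-field embedder: TF-IDF followed by SVD."""
    vecs = TfidfVectorizer(stop_words="english").fit_transform(texts)
    return TruncatedSVD(n_components=dim, random_state=42).fit_transform(vecs)

# titles, descriptions, tags are parallel lists of strings, one entry per page
title_vecs = embed(titles)
desc_vecs = embed(descriptions)
tag_vecs = embed(tags)

# Option A: concatenate the per-field vectors into one wider vector
combined = np.hstack([title_vecs, desc_vecs, tag_vecs])

# Option B: weighted average (works here because all fields share the same dimensionality)
combined = np.average(
    np.stack([title_vecs, desc_vecs, tag_vecs]),
    axis=0,
    weights=[0.5, 0.3, 0.2],
)
```

Would either of these approaches make sense, or is there a better way to combine per-field embeddings?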
Topic tfidf word-embeddings nlp python machine-learning
Category Data Science