Preprocessing for Document Similarity Using Doc2Vec
I'm trying to determine document similarity using Doc2Vec on a large corpus of legal opinions, which can contain highly jargonistic language and multi-word phrases (e.g. en banc, de novo). I'm wondering if anyone has thoughts about the criteria I should consider, if any, for how to treat compound words/phrases in Doc2Vec for the purposes of calculating similarity.
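For context, the rough pipeline I have in mind looks something like the sketch below (a minimal gensim example on a toy placeholder corpus, not my actual code; the parameters are just illustrative):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy placeholder corpus: each opinion already tokenized into words.
tokenized_opinions = [
    ["the", "court", "reviewed", "the", "ruling", "de", "novo"],
    ["the", "petition", "for", "rehearing", "en", "banc", "was", "denied"],
]

# Doc2Vec expects each document wrapped in a TaggedDocument with a unique tag.
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(tokenized_opinions)]

# Tiny illustrative parameters; a real corpus would use a larger
# vector_size and min_count and more training epochs.
model = Doc2Vec(tagged, vector_size=50, window=5, min_count=1, epochs=40)

# Cosine similarity between the trained vectors of two documents (gensim 4.x API).
print(model.dv.similarity(0, 1))
```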
Were I just using tf-idf or something more straightforward, I'd consider going through each phrase and combining the words manually during preprocessing (e.g. en-banc), but I don't know if that's necessary here, given that embeddings consider the context surrounding a word by definition.
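One alternative to doing it by hand, as far as I can tell, would be statistical collocation detection, e.g. gensim's Phrases model, which joins frequently co-occurring word pairs into single tokens before training. Roughly something like this (the corpus and thresholds below are just illustrative):

```python
from gensim.models.phrases import Phrases

# Toy tokenized corpus; in practice this would be the full set of opinions.
tokenized_opinions = [
    ["the", "petition", "for", "rehearing", "en", "banc", "was", "denied"],
    ["the", "court", "granted", "rehearing", "en", "banc"],
    ["sitting", "en", "banc", "the", "court", "reviewed", "the", "case", "de", "novo"],
]

# min_count and threshold control how aggressively co-occurring pairs are
# merged; they are set low here only so the toy example actually fires.
phrases = Phrases(tokenized_opinions, min_count=2, threshold=0.5)
bigram = phrases.freeze()  # frozen model is smaller and faster to apply

# Pairs that co-occur often enough (here "en banc") come out as single
# tokens joined with "_" (e.g. "en_banc"), which would then go to Doc2Vec.
opinions_with_phrases = [bigram[tokens] for tokens in tokenized_opinions]
print(opinions_with_phrases)
```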
Also, combining phrases manually would add significantly to the time it takes to derive document similarity scores, so if it's unnecessary or unlikely to meaningfully change the resulting scores, I'd like to avoid it. There's so much potential variation in phrases that going through them by hand would be slow; on the other hand, in highly jargonistic texts such as these, combining them could also significantly cut the number of tokens created later on.
I'd appreciate anyone's opinions on the matter. Thanks!
Topic: doc2vec, similar-documents
Category: Data Science