What is the meaning of, or explanation for, having multiple tags in a Doc2Vec model's TaggedDocuments?
I've tried reading the other answers on this topic but I'm unsure if I understand completely.
For my dataset, I have a series of documents, each labeled good or bad. Each document belongs to an entity, and each entity has a different number of documents.
Eventually, I'd like to create a classifier to detect whether or not an entity's document is good or bad and to also see what sentences are most similar to the good/bad tag.
All that being said, does it make sense to label my data as follows:
train_corpus = []
i = 0
for entity in entities:
    for doc_name in entity:
        for sentence in get_doc(doc_name):
            train_corpus.append(
                TaggedDocument(sentence, tags=[i, doc_name, entity, doc_name.good_or_bad])
            )
            i += 1
From what I understand, this means that each entity is contextualized by all TaggedDocuments that have that entity's name, whereas each document is contextualized by each sentence that composes it. And the overall good/bad idea is composed of all the sentences that make up either the good or bad documents. Is this a correct interpretation? And if that's the case, could I then do something like:
unlabeled_data = [...]
model.infer_tag(unlabeled_data[0])
# return predicted good/bad tag
model.cosine_distance(unlabeled_data[0], bad)
# get a numerical measure of how far some unlabeled data is from the bad tag
Topic document-understanding doc2vec word2vec nlp python
Category Data Science