What is the meaning of, or explanation for, having multiple tags in a Doc2Vec model's TaggedDocuments?
I've tried reading the other answers on this topic but I'm unsure if I understand completely.
For my dataset, I have a series of documents, each labeled good or bad. Each document belongs to an entity, and each entity has a different number of documents.
Eventually, I'd like to create a classifier to detect whether or not an entity's document is good or bad and to also see what sentences are most similar to the good/bad tag.
All that being said, does it make sense to label my data as follows:
train_corpus = []
i = 0
for entity in entities:
    for doc_name in entity:
        for sentence in get_doc(doc_name):
            train_corpus.append(
                TaggedDocument(sentence, tags=[i, doc_name, entity, doc_name.good_or_bad])
            )
            i += 1
From what I understand, this means that each entity is contextualized by all TaggedDocuments that have that entity's name, whereas each document is contextualized by each sentence that composes it. And the overall good/bad idea is composed of all the sentences that make up either the good or bad documents. Is this a correct interpretation? And if that's the case, could I then do something like:
unlabeled_data = [...]
model.infer_tag(unlabeled_data[0])
# return predicted good/bad tag
model.cosine_distance(unlabeled_data[0], bad)
# get a numerical measure of how far some unlabeled data is from the bad tag
Topic document-understanding doc2vec word2vec nlp python
Category Data Science