Text classification with Word2Vec on a larger corpus
I am working on a small project and would like to use word2vec as the text representation method. I need to classify patents, but only a few of them are labelled. To improve the performance of my ML model, I would like to enlarge the corpus/vocabulary behind the embeddings by using a large number of unlabelled patents. The question is: once I have trained my word embeddings on this larger corpus, how do I use them with my training data, i.e. my labelled data?
My labelled data set consists of 2,000 patents.
The corpus used to train the word embeddings contains 3 million patents (some of my 2,000 labelled patents are already included in it). I trained the embeddings using Gensim.
Do you have any suggestions on how to do it?
Thank you very much in advance.
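For concreteness, one common way to connect embeddings trained on a large corpus to a small labelled set is to average the word vectors of each labelled document and use the result as a fixed-length feature vector. The sketch below is only an illustration under assumptions, not a confirmed solution to this question: `document_vector` is a hypothetical helper, and `toy_vectors` is a made-up stand-in for the Gensim `model.wv` object, which supports the same `word in wv` and `wv[word]` lookups.

```python
def document_vector(tokens, vectors, dim):
    """Average the vectors of all in-vocabulary tokens.

    `vectors` is any mapping from word to a sequence of floats
    (assumption: a Gensim KeyedVectors object such as model.wv,
    or a plain dict as used here).
    Returns a zero vector when no token is in the vocabulary.
    """
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        if tok in vectors:  # skip out-of-vocabulary words
            for i, value in enumerate(vectors[tok]):
                total[i] += float(value)
            count += 1
    if count == 0:
        return total
    return [t / count for t in total]


# Toy stand-in for embeddings trained on the large corpus (hypothetical values).
toy_vectors = {"battery": [1.0, 0.0], "cell": [0.0, 1.0]}

# Tokenized labelled documents (hypothetical examples).
labelled_docs = [["battery", "cell"], ["battery", "unknownword"], []]

# One fixed-length feature vector per labelled document.
features = [document_vector(doc, toy_vectors, dim=2) for doc in labelled_docs]
```

The resulting `features` list, together with the 2,000 labels, could then be passed to any standard classifier. Plain averaging ignores word order and weights every word equally, so it is a baseline rather than a recommendation.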
Topic corpus text-classification word2vec nlp machine-learning
Category Data Science