How to use scikit-learn to extract features from text when I only have positive and unlabeled data?

I'm looking for something similar to this example:

https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py

But instead of positive and negative examples, I have positive examples and a bunch of unlabeled data that will contain some positive examples but is mostly negative.

I'm planning to use this in a pipeline that transforms text data into vectors and then feeds them into a classifier from

https://pulearn.github.io/pulearn/doc/pulearn/

The issue is that I'm not sure of the best way to build the preprocessing stage, where the raw text is transformed into vectors that are then fed into the classification model.

If anyone has other ideas on how to transform positive and unlabeled raw text into vectors to feed into the pulearn module, I would like to hear them as well. Thanks!

Topic bag-of-words text-classification scikit-learn feature-selection clustering

Category Data Science


You should be clearer about what you are asking.

  1. What are the classes of your classifier? Positive and unlabelled?

  2. To create numeric features from text you can use:

    a) tf-idf, which works well with small datasets and short texts such as sentences.

    b) Handmade features, such as sentence length or the occurrence of certain words.

    c) Word embeddings, like word2vec or sentence2vec. This is a strong method when (a) does not work. It converts every word, sentence, or document to a numeric vector of length n, usually between 100 and 500. See the gensim library for that.

    d) A combination of the above.
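For option (a), here is a minimal sketch of how the tf-idf step could look with scikit-learn. The useful point is that `TfidfVectorizer` does not use labels at all, so it can be fitted on the positive and unlabeled documents together. The toy documents are made up, and the `LogisticRegression` step is only a naive stand-in that treats unlabeled as negative; it is not the pulearn API, and you would put your PU classifier in its place.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: a few known positives plus unlabeled documents.
positive_docs = [
    "great battery life and fast charging",
    "battery lasts all day, very happy",
]
unlabeled_docs = [
    "the battery is excellent",
    "shipping took three weeks",
    "the box arrived damaged",
    "customer support never replied",
]

docs = positive_docs + unlabeled_docs
# 1 = known positive, 0 = unlabeled (NOT a confirmed negative).
s = np.array([1] * len(positive_docs) + [0] * len(unlabeled_docs))

# The vectorizer is unsupervised, so it is fitted on all text at once.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Naive baseline (assumption): treat unlabeled documents as negatives.
# Swap the final step for a PU-aware classifier in a real setup.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(class_weight="balanced"),
)
pipeline.fit(docs, s)

# Score the unlabeled documents; higher means "looks more like a positive".
scores = pipeline.predict_proba(unlabeled_docs)[:, 1]
```

Keeping the vectorizer inside the pipeline means the same vocabulary and idf weights are reused when you transform new text later.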

The main criterion for choosing the feature transformer should be that it captures your text well, and that is something you can decide only by looking at the data yourself and trying out some experiments.
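One experiment of this kind is the combination in (d): tf-idf next to a couple of handmade features. The sketch below uses scikit-learn's `FeatureUnion` and `FunctionTransformer`. The `handmade_features` function, the keyword "battery", and the example documents are all hypothetical; pick features that make sense for your own text.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def handmade_features(texts):
    """Return one row per document: [word count, occurrences of 'battery']."""
    return np.array(
        [[len(t.split()), t.lower().count("battery")] for t in texts],
        dtype=float,
    )


# Stack the tf-idf columns and the handmade columns side by side.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("handmade", FunctionTransformer(handmade_features)),
])

# Hypothetical documents, positive and unlabeled mixed together.
docs = [
    "great battery life and fast charging",
    "the battery is excellent",
    "shipping took three weeks",
    "the box arrived damaged",
]

X = features.fit_transform(docs)
print(X.shape)  # (number of documents, tf-idf vocabulary size + 2)
```

The handmade columns are on a different scale than the tf-idf values, so scaling them may help, depending on the classifier you feed them into.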

Good luck!
