How should I construct a binary classifier from a small set of positive data and millions of unlabeled examples?

Does anyone have suggestions for a specific algorithm or implementation for the case where the labeled data comes from only one class and the unlabeled data can come from either class? I don't know the proportion of Class A to Class B in the unlabeled data, and my labeled data was not chosen at random.

Topic labelling classification machine-learning

Category Data Science


This is called PU (positive-unlabeled) learning. It works with a probabilistic classifier, provided certain assumptions about how the data was labeled are met.

If the assumptions are met, you:

  1. Label the already labeled positive instances as positive.
  2. Label the unlabeled instances as negative.
  3. Train a probabilistic classifier on these labels.

This produces the same ranking of instances as a classifier trained on a dataset with true positive/negative labels would, although the predicted probabilities themselves are not the true class probabilities.
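The three steps can be sketched on toy data. This is only an illustration under stated assumptions, not a reference implementation: the 1-D Gaussian data, the 30% labeling rate (`LABEL_FREQ`), and the small hand-rolled logistic regression (`fit_logistic`) are all made up for the example, and the positives here are labeled completely at random, which may not hold for your data. It uses only the standard library; in practice any probabilistic classifier could take the place of `fit_logistic`.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ss, lr=0.1, epochs=300):
    """Fit p(s=1 | x) = sigmoid(w*x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, s in zip(xs, ss):
            err = sigmoid(w * x + b) - s
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [sc for sc, y in zip(scores, labels) if y == 1]
    neg = [sc for sc, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
# Hypothetical toy data: true class 1 centred at +2, true class 0 at -2.
y_true = [1] * 200 + [0] * 800
xs = [rng.gauss(2.0 if y == 1 else -2.0, 1.0) for y in y_true]

# Steps 1 and 2: only some positives carry a label (s=1); everything else,
# including the remaining positives, is treated as negative (s=0).
LABEL_FREQ = 0.3  # assumed fraction of positives that got labeled
s = [1 if y == 1 and rng.random() < LABEL_FREQ else 0 for y in y_true]

# Step 3: train a probabilistic classifier on the PU labels.
w, b = fit_logistic(xs, s)
scores = [sigmoid(w * x + b) for x in xs]

# Ranking quality measured against the true labels, which the model never saw.
pu_auc = auc(scores, y_true)
print(f"AUC against true labels: {pu_auc:.3f}")
```

The classifier is trained only on the labeled-versus-unlabeled indicator `s`, yet its scores should still order the instances well against the hidden true labels. The scores estimate the probability of being labeled, so they are useful for ranking rather than as calibrated class probabilities.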

This video covers the assumptions pretty well and the Elkan paper is pretty accessible.
