How should I construct a binary classifier from a small set of positive data and millions of unlabeled data?

Question

Deli

May 29, 2021, 14:37

Does anyone have suggestions for a specific algorithm or implementation for the case where the labeled data comes from only one class and the unlabeled data can come from either class? I don't know the proportion of Class A to Class B in the unlabeled data, and my labeled data was not randomly chosen.

Topic labelling classification machine-learning

Category Data Science

Bert Kellerman · Accepted Answer · May 29, 2021, 14:37

This is called PU (positive-unlabeled) learning. It works with a probabilistic classifier, provided certain assumptions about how the data was labeled are met.

If the assumptions are met, you:

1. Label the already-labeled positive instances as positive.
2. Label the unlabeled instances as negative.
3. Train a probabilistic classifier.
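
The three steps above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the synthetic data, the 20% labeling rate, and the choice of scikit-learn's `LogisticRegression` are all assumptions made for the example, and the true labels `y` are available only because the data is simulated. The sketch also assumes positives are labeled at random, which may not hold for the data described in the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data, for illustration only: y is the true class (hidden in practice).
n = 20_000
y = rng.integers(0, 2, size=n)
X = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(n, 2))

# Assumption: every true positive is labeled with the same probability (20% here),
# regardless of its features. Negatives are never labeled.
labeled = (y == 1) & (rng.random(n) < 0.2)

# Steps 1 and 2: labeled instances -> 1, all unlabeled instances -> 0.
s = labeled.astype(int)

X_train, X_test, s_train, s_test, y_train, y_test = train_test_split(
    X, s, y, test_size=0.25, random_state=0, stratify=s
)

# Step 3: train a probabilistic classifier on the labeled-vs-unlabeled target.
pu_clf = LogisticRegression().fit(X_train, s_train)
pu_scores = pu_clf.predict_proba(X_test)[:, 1]  # estimates P(labeled | x)

# Check the ranking against the hidden true labels (possible only on simulated data).
pu_auc = roc_auc_score(y_test, pu_scores)

# Reference: the same model trained on the true labels.
oracle = LogisticRegression().fit(X_train, y_train)
oracle_auc = roc_auc_score(y_test, oracle.predict_proba(X_test)[:, 1])

print(f"PU-trained AUC: {pu_auc:.3f}, fully supervised AUC: {oracle_auc:.3f}")
```

The raw scores estimate the probability of being labeled, not of being positive, so they are useful for ranking but should not be thresholded at 0.5 as if they were class probabilities.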

The resulting classifier ranks instances by predicted probability in the same order as a classifier trained on a dataset with true positive/negative labels would.
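
A short derivation shows why the ranking carries over. The notation here is introduced for illustration: $y$ is the true class, $s = 1$ means the instance is labeled, and $c$ is the assumed constant probability that a positive instance gets labeled. The key assumption is that labeling does not depend on the features once the class is known; if the labeled positives were chosen in a feature-dependent way, as the question suggests may be the case, the identity below need not hold.

```latex
% Assumptions: only positives are labeled, and they are labeled at random.
%   p(s = 1 \mid x, y = 0) = 0, \qquad p(s = 1 \mid x, y = 1) = c > 0
\begin{align*}
g(x) &:= p(s = 1 \mid x) \\
     &= p(s = 1 \mid x, y = 1)\, p(y = 1 \mid x) + p(s = 1 \mid x, y = 0)\, p(y = 0 \mid x) \\
     &= c \, p(y = 1 \mid x) =: c \, f(x).
\end{align*}
% Because c is a positive constant,
%   g(x_1) > g(x_2) \iff f(x_1) > f(x_2),
% so ranking by g is the same as ranking by f, and f(x) = g(x) / c.
```

In other words, the classifier trained on labeled-versus-unlabeled data estimates $g$, which is the true class probability $f$ scaled down by $c$.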

This video covers the assumptions pretty well, and the Elkan paper is pretty accessible.
