Is active learning able to detect challenging cases?

Let's say we have a set of data points that need to be labelled for a classification task. In pool-based active learning, if we use an uncertainty measure, is the AL approach able to detect challenging cases? By challenging cases I mean samples that receive a high prediction score for $\hat{y}$ (e.g. 90%) but where, most probably, $\neg\hat{y}$ is the correct prediction.

The rationale behind my question is: does adding more samples to the training set always improve the performance of a classifier?

Topic: active-learning, classification

Category: Data Science


In general it depends on the exact method used to select instances and, of course, on the data. Assuming the selection is based solely on the uncertainty measure of a single classifier, then by definition the method will prioritize instances predicted with a probability around 50%, i.e. instances where the classifier is "unsure". As a consequence, an instance predicted with a high probability is unlikely to be selected for annotation. However, the iterative training process makes the classifier re-estimate the probability of all the instances, so it's possible that an instance wrongly classified with 90% probability at one iteration will later be assigned a lower probability, or even the true class. But overall there's no guarantee: as with any statistical system, there can be instances misclassified with high confidence.
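To make this concrete, here is a minimal sketch of uncertainty-based selection for a binary classifier. The `uncertainty` and `select_for_annotation` functions and the example probabilities are hypothetical stand-ins for a real model's `predict_proba` output over the unlabelled pool:

```python
# Minimal sketch of pool-based uncertainty sampling (binary case).
# The probabilities below are hypothetical outputs of some classifier
# over the unlabelled pool, not a real model.

def uncertainty(p):
    """Uncertainty of a binary prediction P(y=1)=p:
    maximal at p = 0.5, zero at p = 0 or p = 1."""
    return 1.0 - max(p, 1.0 - p)

def select_for_annotation(pool_probs, k):
    """Return the indices of the k most uncertain instances."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: uncertainty(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Predicted P(y=1) for five unlabelled instances.
probs = [0.90, 0.48, 0.10, 0.55, 0.97]
print(select_for_annotation(probs, 2))  # -> [1, 3]
```

Note that the instance predicted at 0.90 is never among those selected, even if that confident prediction happens to be wrong: this is exactly the blind spot described above.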

The rationale behind my question is: does adding more samples to the training set always improve the performance of a classifier?

Not necessarily. In active learning, performance depends more on how many instances end up being manually annotated, and on how informative those instances are, than on the size of the unlabelled pool. But as usual, performance strongly depends on the data itself.
