Python package for machine-learning aided data labelling

Unlabelled data often needs to be turned into labelled data. The best solution is to use (multiple) human classifiers. However, going through all the data by hand (e.g. in text mining or image processing) is often a daunting task. Is there software that can combine human classifiers and machine-learning techniques in real time? I am especially interested in Python packages.

To illustrate, classifying images from video streams is very repetitive. After 100 images (from different streams), a machine-learning algorithm could be used to predict the labels given by the human classifier. The machine classifier might be very confident about some (un)seen samples and very uncertain about others. The human classifier can then focus on the uncertain samples, helping the machine classifier learn what it does not yet know.



It sounds like you are looking for active learning. In active learning, the classifier learns which samples would be most useful to have labelled by a human.

There are many techniques for active learning, and many ways to adapt an existing (standard) learning algorithm to the active learning setting. The particular approach you mentioned is called "uncertainty sampling", and can be applied to any standard classifier that outputs confidence/certainty scores. There are other selection methods as well, which may perform better in some settings.
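As a rough illustration of uncertainty sampling, here is a minimal sketch using the "least confident" rule. It assumes you already have per-class probabilities for each unlabelled sample (for example from a scikit-learn classifier's `predict_proba` method). The function name `least_confident_indices` and the numbers in `pool_probabilities` are made up for this example.

```python
def least_confident_indices(probabilities, k):
    """Return indices of the k samples whose top predicted probability is lowest."""
    # Confidence of a sample = probability of its most likely class.
    confidences = [max(row) for row in probabilities]
    # Rank sample indices from least to most confident (ties keep their order).
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:k]


# Hypothetical class probabilities for five unlabelled samples (two classes).
pool_probabilities = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.51, 0.49],
    [0.80, 0.20],
]

# Ask the human to label the two samples the model is least sure about.
to_label = least_confident_indices(pool_probabilities, k=2)
print(to_label)
```

In a real loop you would have the human label the returned samples, add them to the training set, retrain, and repeat. Margin-based or entropy-based scores can be swapped in for `max(row)` without changing the rest.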

You can also apply unsupervised methods to cluster the samples, then label one or a few samples from each cluster.
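The clustering idea could look roughly like the sketch below: run k-means on the feature vectors, then send the sample closest to each cluster centre to the human. This is only an illustration under simplifying assumptions. The k-means here is a bare-bones version that uses the first `k` points as initial centres, the helper names `kmeans` and `representatives` are invented for the example, and the points are toy data. In practice a library implementation of k-means would be the more sensible choice.

```python
import math


def kmeans(points, k, iterations=20):
    """Bare-bones k-means on a list of equal-length tuples."""
    # Assumption: the first k points are the initial centroids (simple, not robust).
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # Converged: nothing changed since the last pass.
        assignments = new_assignments
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


def representatives(points, k):
    """Return one sample index per non-empty cluster: the point nearest its centroid."""
    centroids, assignments = kmeans(points, k)
    chosen = []
    for c in range(k):
        members = [i for i, a in enumerate(assignments) if a == c]
        if members:
            chosen.append(min(members, key=lambda i: math.dist(points[i], centroids[c])))
    return chosen


# Toy 2-D feature vectors forming two well-separated groups.
samples = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
picked = representatives(samples, k=2)
print(picked)  # Indices of the samples to hand to the human labeller.
```

The labels given to the picked samples can then be used as a starting training set, or tentatively propagated to the other members of each cluster and checked.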
