How to construct a test set for an active learning project?

With active learning I hope to keep the annotation effort to a minimum while still building a good classifier.

My starting point is about 20k images that can belong to ten different classes, with no labeled images at the moment. After each active learning iteration, I hope to obtain labels for, e.g., 100 images. If it matters, the data is unfortunately very likely imbalanced: five of the classes are probably very rare.

So how do I construct my test set for active learning?

  1. Draw a random sample of a certain percentage right at the beginning, annotate it, and keep the test set static throughout the whole project?

  2. Grow the test set with each active learning iteration? (Example: 10 of the 100 newly labeled images are randomly added to the growing test set.)

  3. Any other idea?
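
To make options 1 and 2 concrete, here is a minimal sketch using only the Python standard library. The function names, the 10% test fraction, and the 10-of-100 split are illustrative assumptions. In the growing variant, the test images are assumed to be drawn at random from the unlabeled pool rather than from the model-selected batch, and the query strategy is replaced by a placeholder.

```python
import random

def static_test_set(pool_ids, test_fraction=0.1, seed=0):
    """Option 1: draw one random test set up front and keep it fixed."""
    rng = random.Random(seed)
    ids = list(pool_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[:n_test], ids[n_test:]  # (test ids, remaining pool ids)

def growing_test_set(pool_ids, n_rounds, batch_size=100, test_per_round=10, seed=0):
    """Option 2: every round, move a few random pool images into the test set."""
    rng = random.Random(seed)
    pool = list(pool_ids)
    if n_rounds * batch_size > len(pool):
        raise ValueError("pool is too small for the requested number of rounds")
    rng.shuffle(pool)  # after shuffling, popping from the end is a random draw
    test_ids, train_ids = [], []
    for _ in range(n_rounds):
        # Random draw, independent of whatever the model would query.
        test_ids.extend(pool.pop() for _ in range(test_per_round))
        # Placeholder for the query strategy: a real project would pick
        # these images with an acquisition function instead.
        train_ids.extend(pool.pop() for _ in range(batch_size - test_per_round))
    return test_ids, train_ids, pool

test_ids, remaining = static_test_set(range(20000))
grown_test, grown_train, rest = growing_test_set(range(20000), n_rounds=5)
```

With 20k images, the first call sets aside 2,000 test images, while five rounds of the second call yield 50 test and 450 training images.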

I searched Google and Google Scholar for this topic, but found no good papers that elaborate on test set construction for active learning projects.

Any ideas, experiences or further readings welcome! Thank you!

Topic active-learning evaluation

Category Data Science


I am working on how to apply active learning to testing. As I understand it, you already have a training dataset but no test dataset, and you would like to use active learning to label more samples for testing. You have two options:

Option 1: If the training dataset is large enough, you can treat it as your entire dataset and split it 70%/30% into training and test sets. There is no need to use active learning to select test samples: split the dataset and train the model from scratch.
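
A 70%/30% split could be sketched as follows. Splitting per class (a stratified split) is an assumption added here so that rare classes end up in both parts. The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting every class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy imbalanced labels: 90 images of a common class, 10 of a rare one.
labels = ["common"] * 90 + ["rare"] * 10
train_idx, test_idx = stratified_split(labels)
```

With very rare classes, a class holding only one or two labeled images may get no test samples at all, so it is worth checking the per-class counts after splitting.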

Option 2: If the training dataset is small, you could apply active learning to label more samples for testing. Which samples should you select for testing? The simple solution is to apply the same technique you used before to sample the training dataset. Alternatively, if you would like to apply more specific sampling criteria, e.g., representative, diverse, or hard samples, you could read this paper and implement the algorithms described there:

An Active Learning Approach with Uncertainty, Representativeness, and Diversity
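
As a rough illustration of selecting "hard" samples, here is a minimal sketch of plain least-confidence uncertainty sampling. It is not the algorithm from the paper above. The function name is hypothetical, and the class probabilities are assumed to come from whatever classifier is currently trained.

```python
def least_confident(probabilities, k):
    """Return the indices of the k samples with the lowest top-class probability."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]

# Toy predicted class probabilities for four unlabeled images, two classes.
probs = [[0.90, 0.10], [0.55, 0.45], [0.60, 0.40], [0.99, 0.01]]
hard_indices = least_confident(probs, k=2)
```

Samples picked this way are deliberately the difficult ones, so a test set built from them is not a random sample of the pool, and accuracy measured on it would likely understate accuracy on the full dataset.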
