Too much data to label

I'm working on a data science project to flag bots on Instagram. I collected a lot of data (80k+ users) and now I have to label them as bot/legit users. I already flagged 20k users with different techniques, but now I feel like I'll have to flag the rest one by one, which will likely take months.

Can I just stop and say I'm fine with what I have, or is this bad practice? Stopping now would also mean that the distribution of my labeled data is not representative, since my labeling techniques were designed to find bots rather than legit users.

What are my options?

Tags: labelling data, machine-learning

Category: Data Science


You could look into semi-supervised learning, which is useful for training models when you have both labeled and unlabeled data. Semi-supervised methods consider the distribution of unlabeled data to improve the performance of your model. The following picture should give you some intuition regarding how unlabeled data can be useful.

https://upload.wikimedia.org/wikipedia/commons/d/d0/Example_of_unlabeled_data_in_semisupervised_learning.png
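A common starting point is self-training, which is available in scikit-learn. Below is a minimal sketch under the assumption that your account features are already in a numeric matrix `X` and that `y` contains 0/1 labels for the users you have flagged and -1 for the unlabeled ones (the toy arrays here are just placeholders for your real data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)

# Toy stand-in data: 1000 accounts, 10 features, only the first 200 labeled.
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)
y[200:] = -1  # scikit-learn's convention: -1 marks unlabeled samples

# The wrapper repeatedly pseudo-labels unlabeled points whose predicted
# probability exceeds the threshold and retrains on the enlarged set.
base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base, threshold=0.9)
model.fit(X, y)

# How many unlabeled accounts received a pseudo-label during training.
print("pseudo-labels assigned:", (model.transduction_[200:] != -1).sum())
```

Other options in the same module, such as `LabelSpreading`, propagate labels through the data's neighborhood structure instead of relying on a base classifier's confidence.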

In another direction, you could train a classifier with the labels you have so far. Then use it to predict the probability of each label for your unlabeled data. Sort the unlabeled examples by their predicted probability and manually label a small sample from the low (p < 0.25), medium (0.25 < p < 0.75), and high (p > 0.75) probability ranges. Then try to estimate in which probability range your model is struggling most. In theory, it is a better investment of your time to manually label the cases that fall in the medium probability range, as these are the ones your current model is most uncertain about; a sketch of this bucketing follows below. This and similar approaches belong to the category of active learning.
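Here is a rough sketch of that idea. The array names, the random-forest classifier, and the toy data are assumptions for illustration; the 0.25/0.75 cut-offs are the ones mentioned above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy stand-ins for your real feature matrices:
# the ~20k accounts labeled so far and the remaining unlabeled ones.
X_labeled = rng.normal(size=(500, 10))
y_labeled = rng.integers(0, 2, size=500)
X_unlabeled = rng.normal(size=(2000, 10))

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_labeled, y_labeled)

# Predicted probability that each unlabeled account is a bot (class 1).
p_bot = clf.predict_proba(X_unlabeled)[:, 1]

low = p_bot < 0.25
medium = (p_bot >= 0.25) & (p_bot <= 0.75)
high = p_bot > 0.75
print("bucket sizes (low/medium/high):", low.sum(), medium.sum(), high.sum())

# Spend manual-labeling effort on the uncertain (medium) bucket first.
medium_idx = np.where(medium)[0]
if len(medium_idx) > 0:
    sample = rng.choice(medium_idx, size=min(20, len(medium_idx)), replace=False)
    print("accounts to review manually:", sample)
```

After labeling the reviewed accounts, you would add them to the training set, retrain, and repeat, which is the usual active-learning loop.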

In short, look into semi-supervised or active learning.
