How to label legit users when developing a bot-flagging classification model?

Question

Marc

June 2, 2022, 14:07

I’m working on a project where I try to tell bots apart from legit users on social media. The data I collected is not labeled, but I have labeled about 17% of it (22k users) through different techniques. Finding bots was easy because they all share similarities with each other, but it's a different story for legit users.

In my labeled data, I have most if not all of the bots labeled, but I still have a ton of legit users left to label, which is really hard without doing it manually (and even manually, it sucks).

From labeling users randomly and manually at the beginning, I found that this is a very imbalanced dataset (86/14 legit/bot). Because bots were easier to spot than legit users during the labeling process, my labeled data is now much closer to balanced (60/40).

One of the steps of the labeling process was to build a model to help me label the data, and it performs really well now: I get 99% accuracy, 97% precision, and 98% recall.

For the rest of the data, I thought about running the model over my whole dataset and looking at the users whose predicted probability (predict_proba) for the dominant class is below some threshold, such as 70%, 80%, or 90%. I can then review those users and label them manually, but this might take quite some time depending on which probability threshold I choose.
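To make the idea concrete, here is a minimal sketch of the selection step I have in mind. It assumes the class probabilities come as one row per user (for example the output of a scikit-learn classifier's `predict_proba`), and the names `select_uncertain` and `toy_proba` are made up for illustration; the numbers are toy values, not my real data.

```python
def select_uncertain(proba, threshold=0.8):
    """Return indices of rows whose highest class probability is below
    `threshold`, ordered from least to most confident.

    `proba` is a sequence of per-user class probabilities, e.g.
    [[p_legit, p_bot], ...]. This helper is a hypothetical illustration.
    """
    # Confidence of a prediction = probability of its dominant class.
    confidence = [max(row) for row in proba]
    # Keep only the users the model is unsure about.
    candidates = [i for i, c in enumerate(confidence) if c < threshold]
    # Review the most uncertain users first.
    candidates.sort(key=lambda i: confidence[i])
    return candidates


# Toy probabilities (assumed values, for illustration only).
toy_proba = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.28, 0.72],
    [0.02, 0.98],
    [0.85, 0.15],
]

# Compare how many users each threshold would send to manual review.
for t in (0.7, 0.8, 0.9):
    picked = select_uncertain(toy_proba, threshold=t)
    print(f"threshold={t}: {len(picked)} users to review -> {picked}")
```

Counting the candidates per threshold before labeling anything would at least tell me how much manual work each choice implies, and sorting by confidence means I could stop early once the remaining users look easy.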

Any advice/help?

Topic labelling labels python machine-learning

Category Data Science
