Solutions for Labelling Training Data for Binary Classification Problems

Question

ha9u63ar

April 8, 2021, 10:03

I have a huge dataset (6 million rows) and I am trying to use an 80-20 holdout split to train and test my model. The objective is to train, test, and validate the model before using it on live data traffic for real-time predictions.

The expected result here is something like "It's not corrupted" with 97% accuracy, which comes down to implementation details and the output of some Jupyter notebook, etc.

My question is: are there any alternatives to manually labelling such a big dataset?

By manual labelling, I mean a human (or a group) going through all 6 million rows(!). Also, not all input strings have identical contents, so it's hard to just push them through some script/CSV and automate it. But I am trying to understand whether this is the ONLY way.

Topics: labelling, semi-supervised-learning, classification

Category: Data Science

Noah Weber · Accepted Answer · November 8, 2020, 13:34

Of course not. Here is one simple possible solution.

Use unsupervised learning (clustering). If it is done well and efficiently, you will see only two groups in your data (matching the binary classification), and your silhouette score will be high. You can then label these groups/clusters automatically.
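As a rough sketch of the idea, not a definitive implementation: the snippet below assumes the rows have already been converted to numeric feature vectors (how to do that for your input strings is a separate step not covered here). The toy data, the function names `two_means` and `silhouette`, and the `SILHOUETTE_THRESHOLD` value are all made up for illustration. It uses only the standard library so that it runs anywhere; on 6 million rows you would more likely reach for a library such as scikit-learn, which provides `KMeans` and `silhouette_score`.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points, iters=100):
    """Plain Lloyd's k-means with k=2 and a deterministic start:
    the first point and the point farthest from it."""
    c0 = points[0]
    c1 = max(points, key=lambda p: dist(p, c0))
    centers = [list(c0), list(c1)]
    labels = None
    for _ in range(iters):
        # Assign every point to its nearest center (ties go to cluster 0).
        new_labels = [
            0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its current members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient for a two-cluster labelling."""
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        other = [dist(p, q) for q, lab in zip(points, labels)
                 if lab != labels[i]]
        if not same or not other:
            scores.append(0.0)  # convention used here for degenerate cases
            continue
        a = sum(same) / len(same)    # mean distance within own cluster
        b = sum(other) / len(other)  # mean distance to the other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy feature vectors (hypothetical); real rows need feature extraction first.
rows = [
    [0.1, 0.2], [0.0, 0.4], [0.3, 0.1], [0.2, 0.3],
    [5.0, 5.1], [5.2, 4.9], [4.8, 5.3], [5.1, 5.0],
]

labels = two_means(rows)
score = silhouette(rows, labels)

SILHOUETTE_THRESHOLD = 0.7  # hypothetical cut-off, tune for your data
if score >= SILHOUETTE_THRESHOLD:
    print("Clusters look well separated; using cluster ids as labels:", labels)
else:
    print("Clusters overlap; automatic labels are probably unreliable.")
```

Note that the cluster ids 0 and 1 are arbitrary. You still need to look at a small sample from each cluster to decide which one corresponds to which class, but that is far less work than reading every row. If the silhouette score comes out low, the two-cluster assumption probably does not hold for your features, and the automatic labels should not be trusted.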
