Solutions for Labelling Training Data for Binary Classification Problems

I have a huge dataset on which I am trying to use an 80-20 split (holdout method) to train and test my model. However, the dataset I have been given has 6m rows. The objective is to train, test, and validate the model before using live data traffic for real-time predictions.

The expected result here is a prediction such as "It's not corrupted" with 97% accuracy; the rest is implementation detail (the output of some Jupyter notebook, etc.).

My question is: are there any alternatives to manually labelling such a big dataset?

By manual labelling, I mean a human (or a group) going through all 6m rows (!). Also, not all input strings have identical contents, so it's hard to just push them through some script/CSV and automate it. But I am trying to understand if this is the ONLY way.

Topic labelling semi-supervised-learning classification

Category Data Science


Of course not. Here is one simple possible solution.

Do unsupervised learning (clustering). If you do it well and efficiently, only two groups should emerge in your data (matching the binary classification), and the silhouette score will be high. You can then label these groups/clusters automatically.
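To make the idea concrete, here is a minimal sketch in plain Python, not a definitive implementation. It assumes the rows have already been turned into numeric feature vectors; the toy data, the function names and the parameters are all hypothetical and only illustrate the flow: cluster into two groups, then check the silhouette score. In practice a library such as scikit-learn offers the same steps (KMeans and silhouette_score).

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k=2, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = mean(members)
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient. O(n^2), so run it on a sample."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to the other members of each cluster.
        mean_dist = {}
        for c in clusters:
            others = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(dist(p, q) for q in others) / len(others)
        own = labels[i]
        if own not in mean_dist or len(mean_dist) < 2:
            scores.append(0.0)  # singleton cluster, or only one cluster found
            continue
        a = mean_dist[own]
        b = min(d for c, d in mean_dist.items() if c != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy stand-in for featurized rows: two well-separated 2-D blobs (hypothetical data).
rng = random.Random(42)
group_a = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(30)]
group_b = [[rng.gauss(5, 0.5), rng.gauss(5, 0.5)] for _ in range(30)]
sample = group_a + group_b

labels = kmeans(sample, k=2)
score = silhouette_score(sample, labels)
print(f"silhouette = {score:.2f}")

# Cluster ids are arbitrary: a person still inspects a few rows per cluster
# to decide which id means which class before labelling everything.
for c in sorted(set(labels)):
    examples = [p for p, l in zip(sample, labels) if l == c][:3]
    print(c, examples)
```

Two caveats with this sketch. The silhouette computation is quadratic in the number of points, so with 6m rows it would be computed on a random sample rather than on everything. And if the score comes out low, the clusters probably do not line up with the two classes, so the automatic labels should not be trusted without a manual spot check.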
