How to plan data acquisition focused on improving accuracy on a hold-out test set?

I have been tasked with building a model that reaches 95% accuracy on a classification problem. I have training data and a hold-out data set. I also have the opportunity to request additional data of a particular class with desired characteristics to achieve this objective.

What method should I use to plan the data acquisition, which will be carried out by another team? I am currently at 86% accuracy. I use LightGBM for model development, and I would also consider hyperparameter tuning and ensembling with XGBoost and TabNet. Feature engineering is also in play. Still, I think I need better data to achieve higher accuracy.

Also note that it is a multi-class classification problem.

Topic active-learning classification dataset

Category Data Science


As mentioned in the other answer, try to get data for the classes that were misclassified.

Apart from that, you could also request data for the minority class. This would balance your dataset and hence might improve results.
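To turn the balancing idea into a concrete request, one option is to count the labels and compute how many rows each class is short of the largest class. This is only a rough sketch: the function name `shortfall_per_class` and the toy labels are made up for illustration, and matching the largest class exactly is an assumption, not a requirement.

```python
from collections import Counter

def shortfall_per_class(labels):
    """Rows each class would need to match the most frequent class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

# Toy labels, purely illustrative.
labels = ["a"] * 5 + ["b"] * 2 + ["c"] * 1
request = shortfall_per_class(labels)
print(request)
```

In practice you would pass your training labels and probably cap the request at whatever the other team can realistically deliver.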


Since it is a multi-class classification problem, look at the confusion matrix to find the specific categories that are being misclassified. Then acquire more data for categories where the most mistakes happen.
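As a rough sketch of that analysis, the snippet below counts the off-diagonal cells of the confusion matrix (the (true, predicted) pairs that disagree) and computes per-class recall on the hold-out predictions. The helper names and the toy labels are hypothetical; with real data, `y_true` and `y_pred` would be the hold-out labels and your model's predictions.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=3):
    """Most frequent (true, predicted) error pairs, i.e. off-diagonal cells."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}

# Toy hold-out labels and predictions, purely illustrative.
y_true = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "c"]
y_pred = ["a", "b", "b", "b", "b", "c", "c", "c", "c", "a"]

print(top_confusions(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

Classes with low recall, and the pairs that are most often confused with each other, are the natural first candidates for a data request.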

Another approach would be to examine the decision boundary and acquire more data near it.
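With a tree ensemble the boundary is not easy to inspect directly, so a common proxy is the margin between the two highest predicted class probabilities: a small margin suggests the example sits close to a boundary. The sketch below assumes you already have per-class probabilities for each row (for example from a classifier's `predict_proba`); the function names and the numbers are illustrative only.

```python
def margin(probs):
    """Gap between the two largest class probabilities for one example."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def closest_to_boundary(prob_rows, k=2):
    """Indices of the k examples with the smallest margin."""
    ranked = sorted(range(len(prob_rows)), key=lambda i: margin(prob_rows[i]))
    return ranked[:k]

# Toy probabilities for four examples over three classes, purely illustrative.
prob_rows = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.34, 0.33, 0.33],
    [0.10, 0.70, 0.20],
]
uncertain = closest_to_boundary(prob_rows, k=2)
print(uncertain)
```

The feature values of the selected rows can then be used to describe the kind of examples worth requesting.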

These techniques can be combined: request data from the commonly misclassified categories with feature values in the regions where the mistakes occur.
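One simple way to describe such a request is to summarize the feature ranges of the misclassified rows of a given class. The sketch below assumes numeric features stored as dictionaries; the helper `error_feature_ranges` and the feature names `x1` and `x2` are hypothetical, and a min/max range is only a crude summary of where the errors fall.

```python
def error_feature_ranges(rows, y_true, y_pred, target_class):
    """(min, max) of each feature over misclassified rows of target_class."""
    wrong = [
        row for row, t, p in zip(rows, y_true, y_pred)
        if t == target_class and p != t
    ]
    if not wrong:
        return {}
    return {
        name: (min(r[name] for r in wrong), max(r[name] for r in wrong))
        for name in wrong[0]
    }

# Toy rows with two made-up numeric features, purely illustrative.
rows = [
    {"x1": 1.0, "x2": 10.0},
    {"x1": 2.0, "x2": 20.0},
    {"x1": 3.0, "x2": 30.0},
    {"x1": 4.0, "x2": 40.0},
]
row_true = ["a", "a", "a", "b"]
row_pred = ["a", "b", "b", "b"]

ranges = error_feature_ranges(rows, row_true, row_pred, "a")
print(ranges)
```

The resulting ranges give the other team a concrete specification, such as "class a with x1 between 2 and 3".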
