How to do data acquisition focused on improving accuracy on a hold-out test set?

Question

kosmos

January 28, 2022, 10:03

I have been tasked with building a model that reaches 95% accuracy on a classification problem. I have training data and a hold-out data set, and I can request additional data of a particular class with desired characteristics to reach this target.

What method should I use to plan the data acquisition, which is carried out by another team? I am currently at 86% accuracy using LightGBM. I would consider parameter tuning and an ensemble with XGBoost and TabNet, and feature engineering is also in play, but I think I need better data to achieve higher accuracy.

Also note that it is a multi-class classification problem.

Topic active-learning classification dataset

Category Data Science

spectre · Accepted Answer · January 28, 2022, 10:03

As mentioned in the other answer, try to get more data for the classes that are being misclassified.

Apart from that, you could also request data for the minority class. This would balance your dataset and hence might improve results.
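To size such a request, one rough option is to count how many rows each class would need to match the largest class. This is only a sketch: the function name `rows_needed_to_balance` and the toy labels are made up for illustration, and full balance is an assumption, not a requirement.

```python
from collections import Counter

def rows_needed_to_balance(labels):
    """Rows each class would need to reach the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    # Only report classes that are actually short of the target.
    return {c: target - n for c, n in counts.items() if n < target}

# Toy training labels (made up for illustration).
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 5
shortfall = rows_needed_to_balance(labels)
print(shortfall)
```

The resulting per-class numbers can serve as an upper bound when negotiating volumes with the team that collects the data.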

Brian Spiering · Accepted Answer · September 16, 2021, 22:30

Since it is a multi-class classification problem, look at the confusion matrix to find the specific categories that are being misclassified. Then acquire more data for categories where the most mistakes happen.
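As a minimal sketch of that first step, the snippet below counts the off-diagonal cells of the confusion matrix and ranks classes by recall, using only the standard library. The helper name `rank_confusions` and the toy labels are assumptions for illustration; scikit-learn's `confusion_matrix` would give the same counts in matrix form.

```python
from collections import Counter

def rank_confusions(y_true, y_pred):
    """Return per-class recall and the (true, predicted) error pairs by frequency."""
    totals = Counter(y_true)
    # Off-diagonal cells of the confusion matrix: true label != predicted label.
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    missed = Counter()
    for (t, _), n in errors.items():
        missed[t] += n
    recall = {c: 1 - missed[c] / totals[c] for c in totals}
    return recall, errors.most_common()

# Toy labels and predictions (made up for illustration).
y_true = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "b", "b", "b", "b", "b", "c", "a", "c"]

recall, error_pairs = rank_confusions(y_true, y_pred)
# Classes with the lowest recall are the first candidates for new data.
worst_first = sorted(recall, key=recall.get)
print(worst_first, error_pairs)
```

Here class `a` comes out first because half of its rows are predicted as `b`, so a request for more `a` rows that resemble `b` would be the natural starting point.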

Another approach would be to examine the decision boundary and acquire more data near it.
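One way to make "near the decision boundary" concrete is the margin between the two highest predicted class probabilities, as in margin sampling from active learning. That reading is an assumption, not the only one. The sketch below takes rows of class probabilities (for example from a classifier's `predict_proba`); the helper name `smallest_margin_rows` and the numbers are made up.

```python
def smallest_margin_rows(probabilities, k):
    """Indices of the k rows whose top two class probabilities are closest."""
    margins = []
    for i, row in enumerate(probabilities):
        top, runner_up = sorted(row, reverse=True)[:2]
        margins.append((top - runner_up, i))
    margins.sort()  # smallest margin first, i.e. closest to the boundary
    return [i for _, i in margins[:k]]

# Made-up class probabilities for five rows and three classes.
probabilities = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.48, 0.02],
    [0.10, 0.80, 0.10],
    [0.34, 0.33, 0.33],
]
near_boundary = smallest_margin_rows(probabilities, k=2)
print(near_boundary)
```

The feature values of the selected rows describe the kind of examples the model is least sure about, which is the region worth asking for.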

These techniques can be combined: request data that has the relevant feature values from the commonly misclassified categories.
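A rough sketch of turning this into a request: for one frequently misclassified class, summarize the feature ranges of the rows the model got wrong. The feature names `age` and `income`, the helper name, and the data are all hypothetical, and a plain min/max range is only one possible way to describe the region.

```python
def feature_ranges_of_errors(rows, y_true, y_pred, target_class):
    """Per-feature (min, max) over misclassified rows of target_class."""
    wrong = [r for r, t, p in zip(rows, y_true, y_pred)
             if t == target_class and p != t]
    if not wrong:
        return {}
    return {name: (min(r[name] for r in wrong), max(r[name] for r in wrong))
            for name in wrong[0]}

# Hypothetical rows with hypothetical feature names.
rows = [
    {"age": 25, "income": 30.0},
    {"age": 41, "income": 52.5},
    {"age": 38, "income": 48.0},
    {"age": 60, "income": 90.0},
]
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "b", "b", "b"]

request_spec = feature_ranges_of_errors(rows, y_true, y_pred, "a")
print(request_spec)
```

The output reads as a specification for the other team: more rows of class `a` with features inside the reported ranges.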