Is removing poorly predicted data points a valid approach?
I'm getting my feet wet with data science and machine learning. Please bear with me as I try to explain my problem. I haven't been able to find anything about this method, but I suspect there's a simple name for what I'm trying to do. To give an idea about my level of knowledge, I've had some basic statistics training in my university studies (social sciences), and I work as a programmer at the moment. So when it comes to statistics I'm not a complete layperson, but my understanding leaves a lot to be desired.
I need to make predictions on a certain dataset. All features and the target are continuous variables (interval/ratio level) and I'm using a regression multi-layer perceptron with a single hidden layer. The $R^2$ is relatively low when fitting on the entire dataset, so I was hoping to improve the predictive power by doing cluster analysis and fitting a separate regressor on each cluster. This didn't work: the highest per-cluster $R^2$ was no better than the one for the entire set, so I dropped that approach. I've tried multiple clustering algorithms and haven't noticed much of a difference among them, but I haven't done an exhaustive search, so I may have missed something.
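For concreteness, the cluster-then-regress attempt looked roughly like this (a minimal sketch using scikit-learn's `KMeans` and `MLPRegressor`; the number of clusters, network size and other settings here are placeholders, not my actual configuration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def cluster_and_fit(X, y, n_clusters=5):
    """Cluster the feature space, then fit one regressor per cluster
    and report each cluster's test R^2."""
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
    scores = {}
    for c in range(n_clusters):
        mask = labels == c
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=0.2, random_state=0)
        model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000,
                             random_state=0).fit(X_tr, y_tr)
        scores[c] = r2_score(y_te, model.predict(X_te))
    return scores  # per-cluster test R^2
```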
What I ended up doing is fitting the model on the training subset, identifying the data point with the largest prediction error in the test subset, and removing that one data point. The train-test split is done randomly each time, so the probability of bad data points surviving purely by chance is pretty low. I repeated this process until I got an $R^2$ I'm satisfied with, after which I designated the remaining data points as belonging to, say, Group 1. The whole process was then repeated on the entire dataset minus Group 1, and ultimately all of the data should be divided into groups inside which reasonably reliable predictions can be made. To give an idea about the data: $R^2$ on the entire set of about 11,000 data points hovers around 0.7. In Group 1 I kept about 7,000 data points, for which I can get up to 0.9. The remaining groups also have acceptable (by my standards) $R^2$ values.
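In code, the removal loop is roughly the following (again only a sketch with hypothetical names; the target $R^2$, split size and model settings are placeholders, and in practice I also stop before the group gets too small):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def prune_worst_points(X, y, target_r2=0.9, min_points=100, seed=None):
    """Repeatedly drop the single worst-predicted test point until the
    test R^2 reaches target_r2. Returns the indices of the kept points
    (these would form e.g. Group 1)."""
    rng = np.random.RandomState(seed)
    keep = np.arange(len(y))
    while len(keep) > min_points:
        # fresh random train/test split of the currently kept points
        idx_tr, idx_te = train_test_split(
            keep, test_size=0.2, random_state=rng.randint(1_000_000))
        model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000,
                             random_state=0).fit(X[idx_tr], y[idx_tr])
        pred = model.predict(X[idx_te])
        if r2_score(y[idx_te], pred) >= target_r2:
            break
        # remove the test point with the largest absolute error
        worst = idx_te[np.argmax(np.abs(pred - y[idx_te]))]
        keep = keep[keep != worst]
    return keep
```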
After all of the data is split into groups in this way, I expect to be able to train a classifier on the features to predict which group a given data point belongs to, and then use that group's regression model to predict the final target variable (a sketch of this pipeline is below). My question is: is there some methodological flaw in this approach, specifically the removal of certain data points based on prediction errors? Am I introducing some artifacts into the data or something like that? As far as I can tell there's no information leakage about the target. What this looks like to me is a roundabout way of doing cluster analysis, since it seems to exclude outliers in some sense, but I can't be more precise than that and it may not be the case at all.
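The planned two-stage prediction would look something like this (the classifier choice and all names here are hypothetical; any multi-class classifier could stand in):

```python
import numpy as np

def two_stage_predict(X_new, group_classifier, group_regressors):
    """group_classifier: fitted classifier mapping features -> group id.
    group_regressors: dict {group id: regressor fitted on that group}."""
    groups = group_classifier.predict(X_new)
    y_pred = np.empty(len(X_new))
    for g in np.unique(groups):
        mask = groups == g
        y_pred[mask] = group_regressors[g].predict(X_new[mask])
    return y_pred

# e.g. from sklearn.ensemble import RandomForestClassifier
#      group_classifier = RandomForestClassifier().fit(X_all, group_labels)
```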
Note: all mentions of $R^2$ refer to predictions, i.e. the score I get on the test set, not on the training set.
EDIT: I removed data points only from the test subset (which changes randomly in each iteration), but now that I think about it, there's no reason to limit exclusions to the test subset, since an equally bad prediction may occur in the training subset. I'll update when I've tried that. Also, I haven't yet fit the classifier, so the final model may not actually give better results. Regardless of the results I get, I'm interested in the validity of this approach. And if someone knows of a theoretical limitation here, a reason why this approach couldn't give better results in principle, I'd like to know.
Tags: methodology, regression, predictive-modeling, machine-learning
Category: Data Science