Is removing poorly predicted data points a valid approach?
I'm getting my feet wet with data science and machine learning. Please bear with me as I try to explain my problem. I haven't been able to find anything about this method, but I suspect there's a simple name for what I'm trying to do. To give an idea about my level of knowledge, I've had some basic statistics training in my university studies (social sciences), and I work as a programmer at the moment. So when it comes to statistics I'm not a complete layperson, but my understanding leaves a lot to be desired.
I need to make predictions on a certain dataset. All features and the target are continuous variables (interval/ratio level) and I'm using a regression multi-layer perceptron with a single hidden layer. The $R^2$ is relatively low when fitting on the entire dataset, so I was hoping to improve the predictive power by doing cluster analysis and fitting a separate regressor on each cluster. This didn't work: the highest per-cluster $R^2$ was no better than the one for the entire set, so I dropped that approach. I've tried multiple clustering algorithms and haven't noticed much of a difference among them, but I haven't done an exhaustive search, so I may have missed something.
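For concreteness, the cluster-then-regress attempt looked roughly like this (a minimal sketch using scikit-learn's `KMeans` and `MLPRegressor`; the number of clusters, network size and other settings here are placeholders, not my actual configuration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def cluster_and_fit(X, y, n_clusters=5):
    """Cluster the feature space, then fit one regressor per cluster
    and report each cluster's test R^2."""
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
    scores = {}
    for c in range(n_clusters):
        mask = labels == c
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=0.2, random_state=0)
        model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000,
                             random_state=0).fit(X_tr, y_tr)
        scores[c] = r2_score(y_te, model.predict(X_te))
    return scores  # per-cluster test R^2
```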
What I ended up doing is fitting the model on the training subset, identifying the data point with the largest prediction error in the test subset, and removing that one data point. The train-test split is done randomly each time, so the probability of bad data points surviving purely by chance is pretty low. I repeated this process until I got an $R^2$ I'm satisfied with, after which I designated the remaining data points as belonging to, say, Group 1. The whole process was then repeated on the entire dataset minus Group 1, and ultimately all of the data should be divided into groups inside which reasonably reliable predictions can be made. To give an idea about the data: $R^2$ on the entire set of about 11,000 data points hovers around 0.7. In Group 1 I kept about 7,000 data points, for which I can get up to 0.9. The remaining groups also have acceptable (by my standards) $R^2$ values.
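In code, the removal loop is roughly the following (again only a sketch with hypothetical names; the target $R^2$, split size and model settings are placeholders, and in practice I also stop before the group gets too small):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def prune_worst_points(X, y, target_r2=0.9, min_points=100, seed=None):
    """Repeatedly drop the single worst-predicted test point until the
    test R^2 reaches target_r2. Returns the indices of the kept points
    (these would form e.g. Group 1)."""
    rng = np.random.RandomState(seed)
    keep = np.arange(len(y))
    while len(keep) > min_points:
        # fresh random train/test split of the currently kept points
        idx_tr, idx_te = train_test_split(
            keep, test_size=0.2, random_state=rng.randint(1_000_000))
        model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000,
                             random_state=0).fit(X[idx_tr], y[idx_tr])
        pred = model.predict(X[idx_te])
        if r2_score(y[idx_te], pred) >= target_r2:
            break
        # remove the test point with the largest absolute error
        worst = idx_te[np.argmax(np.abs(pred - y[idx_te]))]
        keep = keep[keep != worst]
    return keep
```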
After all of the data is split into groups in this way, I expect to be able to train a classifier on the features to predict which group a given data point belongs to, and then use that group's regression model to predict the final target variable (a sketch of this pipeline is below). My question is: is there some methodological flaw in this approach, specifically the removal of certain data points based on prediction errors? Am I introducing some artifacts into the data or something like that? As far as I can tell there's no information leakage about the target. What this looks like to me is a roundabout way of doing cluster analysis, since it seems to exclude outliers in some sense, but I can't be more precise than that and it may not be the case at all.
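The planned two-stage prediction would look something like this (the classifier choice and all names here are hypothetical; any multi-class classifier could stand in):

```python
import numpy as np

def two_stage_predict(X_new, group_classifier, group_regressors):
    """group_classifier: fitted classifier mapping features -> group id.
    group_regressors: dict {group id: regressor fitted on that group}."""
    groups = group_classifier.predict(X_new)
    y_pred = np.empty(len(X_new))
    for g in np.unique(groups):
        mask = groups == g
        y_pred[mask] = group_regressors[g].predict(X_new[mask])
    return y_pred

# e.g. from sklearn.ensemble import RandomForestClassifier
#      group_classifier = RandomForestClassifier().fit(X_all, group_labels)
```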
Note: all mentions of $R^2$ refer to predictions, i.e. the score I get on the test set, not on the training set.
EDIT: I removed data points only from the test subset (which changes randomly in each iteration), but now that I think about it, there's no reason to limit exclusions to the test subset, since an equally bad prediction may occur in the training subset. I'll update when I've tried that. Also, I haven't yet fit the classifier, so the final model may not actually give better results. Regardless of the results I get, I'm interested in the validity of this approach. And if someone knows of a theoretical limitation here, a reason why this approach couldn't give better results in principle, I'd like to know.
Tags: methodology, regression, predictive-modeling, machine-learning
Category: Data Science