Dealing with an apparently inseparable dataset

I'm attempting to build a model/suite of models to predict a binary target. The exact details of the models aren't important, but suffice it to say that I've tried half a dozen different types of models, with comparable results from all of them.

On looking at the predictions on various subsets of the training data/holdout set, it appears that a certain subset of features is important for around 30% of the data, while a different subset is important for the remaining 70%. This separation is fairly easy to detect when the target is known (run a model with subset1, another with subset2, and see which rows each model does better on). Obviously, this is not possible with the test data, since the target is not known there.
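For concreteness, a minimal sketch of that detection step, using synthetic placeholder data (the two-regime setup, the column names and the 30/70 split are all invented purely for illustration): fit one model per feature subset and compare per-row log-loss on a holdout split to see which rows each model serves better.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for the situation described above: two latent regimes, each
# driven by a different feature subset. Replace X, y and the column lists
# with your own data; everything here is a placeholder.
rng = np.random.default_rng(0)
n = 4000
regime = rng.random(n) < 0.3                      # ~30% of rows follow regime 1
A = rng.normal(size=(n, 3))                       # feature subset 1
B = rng.normal(size=(n, 3))                       # feature subset 2
y = np.where(regime, A[:, 0] > 0, B[:, 0] > 0).astype(int)
X = pd.DataFrame(np.hstack([A, B]), columns=["a1", "a2", "a3", "b1", "b2", "b3"])
subset1_cols, subset2_cols = ["a1", "a2", "a3"], ["b1", "b2", "b3"]

X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

m1 = GradientBoostingClassifier().fit(X_tr[subset1_cols], y_tr)
m2 = GradientBoostingClassifier().fit(X_tr[subset2_cols], y_tr)

# Per-row log-loss of each subset model on the holdout rows
p1 = np.clip(m1.predict_proba(X_ho[subset1_cols])[:, 1], 1e-9, 1 - 1e-9)
p2 = np.clip(m2.predict_proba(X_ho[subset2_cols])[:, 1], 1e-9, 1 - 1e-9)
loss1 = -(y_ho * np.log(p1) + (1 - y_ho) * np.log(1 - p1))
loss2 = -(y_ho * np.log(p2) + (1 - y_ho) * np.log(1 - p2))

# Rows where model 1 beats model 2 approximate "region 1", and vice versa
print("share of holdout rows better served by model 1: %.2f" % (loss1 < loss2).mean())
```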

There are clearly (at least) two regions in the data that are substantially different from each other, since a model trained on the whole dataset does worse than the combination of separate models trained on each of these regions.

However, the dataset seems spectacularly resistant to separation: there is no visible separation along the principal components (linear PCA and kernel PCA with an RBF kernel), and no clear/stable clusters emerge after applying several different clustering algorithms (k-means, agglomerative, mean shift).

Are there any popular methods that allow separation/clustering of data that is resistant to "normal" methods? The end goal here is to find out which model to use on which rows when the target is unknown.

Topic: domain-adaptation, machine-learning

Category: Data Science


Ensembles!

This consists of building several classifiers, each of which performs particularly well on a subset of the data, and then combining their predictions into a final prediction.

From your description, boosting algorithms seem to be good candidates, e.g. AdaBoost.
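For example, a minimal sketch with scikit-learn's AdaBoostClassifier; the synthetic data below is just a placeholder for your own X and y:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data; substitute your own feature matrix and binary target
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

ada = AdaBoostClassifier(n_estimators=300, learning_rate=0.5, random_state=0)
scores = cross_val_score(ada, X, y, cv=5, scoring="roc_auc")
print("AdaBoost CV AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```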


You might have a domain adaptation problem: the samples may actually be drawn from two different sources, with different rules governing their behavior.

I suggest you try the following method.

  • Split the data into two datasets using the target concept.
  • Build a model for each dataset. If performance doesn't improve, it is likely that the domain adaptation assumption is wrong.
  • Otherwise, you have an improvement and now need to find out which model to apply to each sample. Take the full dataset and use the target to create a new concept indicating which dataset each sample belongs to. Since you said you can deduce the split when the target is known, I assume the two are distinguishable this way.
  • Build a model for this new concept. If it performs well, the problem is solved: run this model first and then apply the proper model to each sample (a sketch follows this list). Otherwise, it means that the two datasets are indistinguishable (at least with respect to the methods you applied), weakening the assumption that the cause of the problem is domain adaptation.
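A minimal sketch of this routing scheme with scikit-learn, on synthetic placeholder data. Here the region label is deliberately constructed to be learnable from the features, purely to show the mechanics; in your case it would come from the split you made in the first step (e.g. the per-row comparison sketched in the question).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholders: `region` plays the role of the new concept, i.e. the
# 0/1 label saying which of the two datasets each row belongs to.
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 6))
region = (X[:, 5] > 0.52).astype(int)                 # roughly a 30/70 split, learnable from X here
y = np.where(region == 1, X[:, 0] > 0, X[:, 1] > 0).astype(int)
train = np.arange(n) < 3000
test = ~train

# One model per region, trained only on that region's training rows
m0 = RandomForestClassifier(random_state=0).fit(X[train & (region == 0)], y[train & (region == 0)])
m1 = RandomForestClassifier(random_state=0).fit(X[train & (region == 1)], y[train & (region == 1)])

# The "router": a model for the new concept, predicting the region from features alone
router = RandomForestClassifier(random_state=0).fit(X[train], region[train])

# At prediction time, send each test row to the model for its predicted region
pred_region = router.predict(X[test])
pred = np.where(pred_region == 0, m0.predict(X[test]), m1.predict(X[test]))
print("test accuracy with routing: %.3f" % (pred == y[test]).mean())
```

If the router performs no better than chance, that is the "otherwise" branch above: the regions cannot be told apart from the features, and the domain adaptation explanation becomes less plausible.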
