Feature selection for two separate datasets

Currently, I'm doing research with experimental data. The data come from two experiments with slightly different tasks but the same setup in a VR environment. The experiments were run on different populations, but each with the same two groups of participants: healthy controls and patients of a certain kind.

From the experimental data, the same set of features (over 200) was constructed and extracted for both datasets. The goal of this research is to apply machine learning to distinguish the patients from the controls based on those features.

Because of the slightly different tasks, the two datasets cannot be merged. Therefore, I select the most important features for each dataset separately with feature selection methods and then run two separate models. Both models perform reasonably well on the classification task; however, they rely on very different features.

Ultimately, I'd like to find the features that have common discriminative properties in both datasets, and then build two models for the two datasets, but with the same set of features.

I have been able to do this quite well by considering only those features that correlate with the label in the same direction in both datasets, and then selecting the common features from the top 30 most contributing features of each dataset. The performance of the models is not as good as with the separate feature sets, but it is still quite acceptable and, surprisingly, even seems to be more consistent.
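For concreteness, the selection procedure I described can be sketched roughly like this, assuming each dataset is a NumPy feature matrix with a binary label vector. The function names (`correlations`, `common_consistent_features`) and the use of Pearson correlation are illustrative choices on my part, not a fixed recipe:

```python
import numpy as np

def correlations(X, y):
    """Pearson correlation of every feature column with the binary label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

def common_consistent_features(X1, y1, X2, y2, top_k=30):
    """Features in both datasets' top-k that also agree in correlation sign."""
    r1, r2 = correlations(X1, y1), correlations(X2, y2)
    # Top-k most strongly correlated features, per dataset.
    top1 = set(np.argsort(-np.abs(r1))[:top_k])
    top2 = set(np.argsort(-np.abs(r2))[:top_k])
    # Keep features that appear in both top-k lists AND whose correlation
    # with the label points in the same direction in both datasets.
    return sorted(i for i in top1 & top2 if np.sign(r1[i]) == np.sign(r2[i]))
```

The final models for the two datasets are then trained on the columns returned by `common_consistent_features` only.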

However, this approach is not based on anything I could find in the literature; it just seemed a logical choice, but I'm doubting whether it is completely valid to do it this way. Oddly enough, I couldn't find anything in the literature that discusses the consistency of features across separate datasets. Or maybe I just don't know where to look...

If I skip the correlation-direction check (the part I'm most unsure about), I end up with some features that are distributed in opposite ways in the two datasets. This is not really what I want, since I'm looking for features that contribute in the same manner to the classification task.

So basically the conclusion of this whole story comes down to one question: does anyone know of a valid way to select features that have common discriminative properties in two datasets? Or, alternatively, does anyone have suggestions for dealing with this problem in a different way?

Topic experiments classification dataset feature-selection python

Category Data Science


I'm not aware of anything similar in the literature; this might be too specific, but I don't know everything. Anyway, I think your approach makes sense. I'm not sure if it would help, but conditional entropy is also an option for measuring the discriminative power of individual features.
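One way to compute such an information-theoretic score in practice: the mutual information I(feature; label) equals H(label) minus the conditional entropy H(label | feature), so ranking features by mutual information is equivalent to ranking them by conditional entropy. A small sketch using scikit-learn (the synthetic data here is just for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Toy data: feature 0 is shifted by the label, features 1-2 are pure noise.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 3))
X[:, 0] += 2 * y  # make feature 0 informative about the label

# Estimated mutual information between each feature and the label;
# higher MI means lower conditional entropy H(y | feature).
mi = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(-mi)  # most informative feature first
```

Applied to both of your datasets, this gives two rankings from which you could again take the intersection of the top-k features.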

Assuming training a model is not too long, you could consider a more advanced design in order to find the optimal common subset for both tasks. I think a genetic algorithm would be a good option for that:

  • the features are the "genes" to be selected;
  • for every "individual" (a subset of features), train a model for each task and evaluate it on a validation set;
  • define the reward/cost function by evaluating both tasks, for instance as the mean performance.

This way, the genetic algorithm should converge to a subset of features that maximizes the mean performance across the two tasks. Don't forget to keep a separate, fresh test set for the final evaluation.
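The steps above can be sketched as a small from-scratch GA over boolean feature masks. Everything here is illustrative: the choice of logistic regression, the cross-validation instead of a fixed validation set, and the GA parameters (population size, mutation rate, number of generations) are untuned defaults, not recommendations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fitness(mask, X1, y1, X2, y2):
    """Mean cross-validated accuracy over both tasks for one feature subset."""
    if not mask.any():
        return 0.0  # an empty subset cannot be trained on
    clf = LogisticRegression(max_iter=1000)
    s1 = cross_val_score(clf, X1[:, mask], y1, cv=3).mean()
    s2 = cross_val_score(clf, X2[:, mask], y2, cv=3).mean()
    return (s1 + s2) / 2

def genetic_select(X1, y1, X2, y2, pop_size=20, generations=10,
                   mutation_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = X1.shape[1]
    pop = rng.random((pop_size, n)) < 0.5            # random feature masks
    for _ in range(generations):
        scores = np.array([fitness(m, X1, y1, X2, y2) for m in pop])
        parents = pop[np.argsort(-scores)[:pop_size // 2]]  # keep better half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                 # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < mutation_rate     # random bit-flip mutation
            children.append(child ^ flip)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(m, X1, y1, X2, y2) for m in pop])
    return pop[np.argmax(scores)]                    # best mask found
```

Libraries like DEAP offer the same machinery with more operators, but for a few hundred features a simple loop like this is often enough to see whether the idea works.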
