Feature selection for two separate datasets
Currently, I'm doing research with experimental data. The data come from two experiments with two slightly different tasks, but with the same setup in a VR environment. The two experiments were done with different populations, but with the same two groups of participants: healthy controls and patients of a certain kind.
From the experimental data, the same set of features (over 200) was constructed and extracted for both datasets. The goal of this research is to apply machine learning in order to distinguish the patients from the controls based on those features.
Because of the slightly different tasks, the two datasets cannot be merged. Therefore, I select the most important features for each dataset separately with feature selection methods, and then train two separate models. Both models perform reasonably well on the classification task; however, they rely on very different features.
Ultimately, I'd like to find the features that have common discriminative properties in both datasets, and then build two models for the two datasets using the same set of features.
I have been able to do this quite well by considering only those features that have the same direction of correlation with the label in both datasets, and then selecting the features that appear among the top 30 most contributing features of both datasets. The performance of the models is not as good as with the separately selected features, but it is still quite acceptable and, surprisingly, it even seems to be more consistent.
However, this approach is not based on anything I could find in the literature. It just seemed a logical choice, but I doubt whether it is completely valid to do it this way... Oddly enough, I couldn't find anything in the literature that discusses the consistency of features across separate datasets. Or I just don't know where to look...
If I don't do the correlation direction check (which is the part I'm most unsure about), I end up with some features that are distributed in opposite ways in the two datasets. This is not really what I want, as I want to find features that contribute in the same manner to the classification task.
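For concreteness, the selection procedure described above can be sketched roughly as follows. This is a minimal sketch and not my actual pipeline: it assumes binary labels coded as 0/1, and it ranks features by the absolute Pearson correlation with the label as a stand-in for whatever "contribution" score the feature selection method produces. The function names `label_correlations` and `common_consistent_features` and the synthetic data are made up for illustration.

```python
import numpy as np

def label_correlations(X, y):
    """Pearson correlation of each column of X with the binary label y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features get a correlation of 0 instead of NaN.
    return np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)

def common_consistent_features(X_a, y_a, X_b, y_b, top_k=30):
    """Indices of features that are in the top_k of both datasets
    (ranked by |correlation|, an assumed proxy for importance) and whose
    correlation with the label has the same sign in both datasets."""
    r_a = label_correlations(X_a, y_a)
    r_b = label_correlations(X_b, y_b)
    same_sign = np.sign(r_a) * np.sign(r_b) > 0
    top_a = set(np.argsort(-np.abs(r_a))[:top_k])
    top_b = set(np.argsort(-np.abs(r_b))[:top_k])
    return sorted(int(j) for j in top_a & top_b if same_sign[j])

# Synthetic illustration: feature 0 shifts in the same direction in both
# datasets, feature 1 shifts in opposite directions.
rng = np.random.default_rng(0)
y_a = rng.integers(0, 2, size=200)
y_b = rng.integers(0, 2, size=200)
X_a = rng.normal(size=(200, 50))
X_b = rng.normal(size=(200, 50))
X_a[:, 0] += 2 * y_a
X_b[:, 0] += 2 * y_b
X_a[:, 1] += 2 * y_a
X_b[:, 1] -= 2 * y_b

selected = common_consistent_features(X_a, y_a, X_b, y_b, top_k=5)
print(selected)
```

In this toy example feature 1 is strongly discriminative in both datasets but is dropped by the sign check, which is exactly the behaviour I am unsure about. If the ranking and the sign check are computed on the same samples that are later used to evaluate the models, the reported performance is likely to be optimistic, so in practice this step would have to sit inside the cross-validation loop.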
So basically, this whole story comes down to one question: does anyone know of a valid way to select features that have common discriminative properties in two datasets? Or else, does anyone have any suggestions on how to deal with this problem in a different way?
Topic experiments classification dataset feature-selection python
Category Data Science