Does it make sense to randomly select features as a baseline?

In my paper, I am saying that the accuracy of classification is $x\%$ when using the top N features.

My supervisor thinks that we should capture the classification accuracy when using N randomly selected features to show that the initial feature selection technique makes an actual difference.

Does this make sense?

I've argued that no one cares about randomly selected features, so this addition doesn't make sense. It seems obvious that randomly selected features will give worse classification accuracy, so there is no need to show that using any sort of feature ranking metric is superior.

Tags: feature-reduction, feature-selection

Category: Data Science


Your supervisor is right. Perhaps not about the specific way to demonstrate your method's advantage, but certainly about the main idea:

  • You need a benchmark to show that your feature selection performs better than doing nothing at all.

  • Consider several ways to demonstrate why your selection is better:

    • A ranking of five possible selections: yours and four others.
    • The alternative your supervisor suggested: random features vs. yours.
    • An improved version of your supervisor's suggestion: draw N random features M times, keep the best random subset, and show that your solution still beats it (see the sketch after this list).
    • Your solution vs. all the features (the problem here is that accuracy often improves as more variables are added, so you would need a measure that penalizes the number of features, such as AIC).
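A minimal sketch of that "M random draws" comparison, assuming a scikit-learn workflow. The synthetic dataset, the logistic regression classifier, and the `top_n_features` index array are all placeholders; in practice you would substitute your own data and the indices produced by your ranking method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                           random_state=0)

N, M = 10, 100                  # subset size and number of random draws
top_n_features = np.arange(N)   # placeholder: indices from YOUR ranking method

def cv_accuracy(cols):
    """Mean cross-validated accuracy using only the given feature columns."""
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, cols], y, cv=5).mean()

selected_acc = cv_accuracy(top_n_features)
random_accs = [cv_accuracy(rng.choice(X.shape[1], size=N, replace=False))
               for _ in range(M)]

print(f"your top-{N} features: {selected_acc:.3f}")
print(f"best of {M} random draws: {max(random_accs):.3f}")
print(f"random mean +/- std: {np.mean(random_accs):.3f} +/- {np.std(random_accs):.3f}")
```

Reporting the mean and spread of the random draws, not just the single best one, makes the comparison more convincing: it shows where your selection sits relative to the whole distribution of random subsets.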

Your supervisor is asking you to use random feature selection as a baseline. Random performance is a common baseline in machine learning: for example, in binary classification with an equal number of samples from each class, a trained classifier should perform better than 50% accuracy (random guessing).
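As a quick illustration of that chance-level baseline, scikit-learn's `DummyClassifier` can stand in for random guessing; the balanced synthetic dataset and logistic regression model below are again assumptions, not anything from the question:

```python
# Sketch: compare a trained classifier against a random-guessing baseline.
# On a balanced binary problem the dummy should sit near 50% accuracy.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, weights=[0.5, 0.5],
                           random_state=0)

chance = cross_val_score(DummyClassifier(strategy="uniform", random_state=0),
                         X, y, cv=5).mean()
trained = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

print(f"random guessing: {chance:.3f}")   # expected to be close to 0.50
print(f"trained model:   {trained:.3f}")  # should be clearly higher
```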
