The Merits of Feature Reduction Routines

I am interested in learning what routine others use (if any) for Feature Reduction/Selection.

For example, if my data has several thousand features, I typically try two to four of the following right away, depending on circumstances.

  1. Zero variance/Near zero variance

    • Using the R package caret's nearZeroVar function (see the first sketch after this list).
    • I find a very small percentage of features are zero variance and a few more are near zero variance.
    • Then, using the percentUnique column of the saved metrics, I may remove the bottom quartile of features, depending on the range of percentUnique values.
  2. Correlation to find multicollinearity

    • I compute the correlation matrix and remove features so that no remaining pairwise correlation exceeds 0.75 (second sketch below).
    • I have seen others use cutoffs of 0.5 or 0.6, but I don't have any references for those choices.
  3. Boruta / Random Forest

    • I love the Boruta package, but it takes a while to run (third sketch below).
    • Then, here again, I use forward feature selection.
  4. PCA

    • Depending on the nature of the data I will try PCA last.
    • If the model must be explainable then I skip this.
    • I may use several cutoffs: 80, 90, or 95% of variance explained.
    • Forward feature selection: look at the first ~3 to 10 orthogonal components (fourth sketch below).
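
For step 1, a minimal sketch of the caret workflow I mean, assuming the features sit in a data frame I'll call `X` (the name is mine, for illustration only):

```r
library(caret)

# Compute freqRatio, percentUnique, zeroVar and nzv for every column of X
metrics <- nearZeroVar(X, saveMetrics = TRUE)

# The nzv flag also covers the zero-variance columns
keep <- !metrics$nzv
X_filtered <- X[, keep, drop = FALSE]

# Optionally also drop the bottom quartile of percentUnique among the
# surviving columns, depending on how percentUnique is distributed
pu <- metrics$percentUnique[keep]
X_filtered <- X_filtered[, pu > quantile(pu, 0.25), drop = FALSE]
```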
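For step 2, caret's findCorrelation implements exactly this kind of cutoff-based pruning. A sketch, continuing from the hypothetical `X_filtered` above:

```r
library(caret)

cor_mat <- cor(X_filtered, use = "pairwise.complete.obs")

# Indices of columns to drop so no remaining pair correlates above 0.75
high_cor <- findCorrelation(cor_mat, cutoff = 0.75)

# Guard the empty case: X[, -integer(0)] would drop every column
X_uncor <- if (length(high_cor) > 0) {
  X_filtered[, -high_cor, drop = FALSE]
} else {
  X_filtered
}
```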
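For step 3, a sketch of a typical Boruta call, assuming an outcome vector `y` aligned with the rows of `X_uncor` (both names are mine, not from the package):

```r
library(Boruta)

set.seed(42)  # Boruta wraps random forests, so fix the seed
bor <- Boruta(x = X_uncor, y = y, maxRuns = 100)

# Resolve any attributes still marked Tentative with a simpler test
bor <- TentativeRoughFix(bor)

selected <- getSelectedAttributes(bor, withTentative = FALSE)
X_boruta <- X_uncor[, selected, drop = FALSE]
```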
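For step 4, a sketch of the PCA step with a cumulative-variance cutoff, plus one way to run forward selection over the leading components. I'm using leaps::regsubsets here and assuming a numeric outcome `y`; substitute whatever forward-selection wrapper you prefer:

```r
library(leaps)

# Center and scale before PCA so no feature dominates on raw units
pca <- prcomp(X_boruta, center = TRUE, scale. = TRUE)

# Keep enough components for, say, 90% of the variance explained
cum_var <- summary(pca)$importance["Cumulative Proportion", ]
n_comp  <- which(cum_var >= 0.90)[1]
scores  <- pca$x[, 1:n_comp, drop = FALSE]

# Forward selection over the first ~10 orthogonal components
fwd <- regsubsets(scores, y, method = "forward",
                  nvmax = min(10, ncol(scores)))
summary(fwd)
```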

NOTE: I am not suggesting this is the best or worst routine; I'm opening the floor to civil debate. If you need a definition of civil debate, see Wikipedia.

Tags: boruta, pca, correlation, feature-selection, r
