The Merits of Feature Reduction Routines
I am interested in learning what routine others use (if any) for Feature Reduction/Selection.
For example, if my data has several thousand features, I typically try two to four of the following steps right away, depending on circumstances.
Zero variance/Near zero variance
- Using R package caret, nearZeroVar
- I find a very small percentage of features have zero variance and a few more have near zero variance.
- Then, using the percentUnique metric it returns, I may remove the bottom quartile of features, depending on the range of percentUnique values (a sketch follows this list).
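A minimal sketch of this step, assuming nzv is the output of caret's nearZeroVar with saveMetrics = TRUE; the toy data frame and the bottom-quartile cutoff are illustrative choices, not fixed rules:

```r
library(caret)

# Toy data: one zero-variance, one near-zero-variance, and three useful columns
set.seed(1)
df <- data.frame(
  constant = rep(1, 500),                    # zero variance
  rare     = c(rep(0, 495), rep(1, 5)),      # near zero variance
  binary   = sample(0:1, 500, replace = TRUE),
  cont1    = rnorm(500),
  cont2    = rnorm(500)
)

# One row per column: freqRatio, percentUnique, zeroVar, nzv
nzv <- nearZeroVar(df, saveMetrics = TRUE)

# Drop zero-variance columns plus the bottom quartile by percentUnique
cutoff <- quantile(nzv$percentUnique, 0.25)
keep   <- !nzv$zeroVar & nzv$percentUnique > cutoff
df     <- df[, keep, drop = FALSE]
```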
Correlation to find multicollinearity
- Using R package caret
- I compute the correlation matrix and remove one feature from each highly correlated pair, using a cutoff of 0.75 (sketch below).
- I have seen others use cutoffs of 0.5 or 0.6, but I don't have any references for those choices.
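One way this filter might look; the post names caret but not a specific function, so the use of findCorrelation here is my assumption. It returns the indices of columns to drop so that no remaining pair exceeds the cutoff:

```r
library(caret)

# Toy numeric features where x2 is nearly a copy of x1
set.seed(1)
x1 <- rnorm(200)
df <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.1), x3 = rnorm(200))

cor_mat <- cor(df)

# Indices of columns to drop so no surviving pair has |correlation| > 0.75
too_high <- findCorrelation(cor_mat, cutoff = 0.75)
if (length(too_high) > 0) {
  df <- df[, -too_high, drop = FALSE]
}
```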
Boruta / Random Forest
- I love the Boruta package, but it takes a while to run.
- Then, here again, I use forward feature selection (a minimal Boruta sketch follows below).
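A minimal Boruta sketch on the built-in iris data, a stand-in for your own predictors and outcome; maxRuns caps the random-forest iterations, which is where the runtime goes:

```r
library(Boruta)

set.seed(1)
# Formula interface: outcome ~ all other columns
bor <- Boruta(Species ~ ., data = iris, maxRuns = 100)

# Keep only features Boruta confirmed as important;
# withTentative = TRUE would also retain undecided ones
keep <- getSelectedAttributes(bor, withTentative = FALSE)
iris_reduced <- iris[, c(keep, "Species")]
```

Lowering maxRuns trades decisiveness for speed: features Boruta cannot resolve in time stay tentative rather than confirmed.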
PCA
- Depending on the nature of the data, I will try PCA last.
- If the model must be explainable, then I skip this.
- I may use several criteria: 80, 90, or 95% of variance explained.
- Forward feature selection: look at the first ~3 to 10 orthogonal components (sketch below).
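A variance-explained sketch using base R's prcomp, keeping the fewest components that reach a chosen threshold (90% here; 80% or 95% work the same way). The iris columns are a stand-in for your numeric features:

```r
# Standardize before PCA so scale differences don't dominate
X   <- as.matrix(iris[, 1:4])
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching the 90% threshold
n_comp <- which(var_explained >= 0.90)[1]

# Orthogonal component scores to feed into downstream selection
scores <- pca$x[, 1:n_comp, drop = FALSE]
```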
NOTE: I am not suggesting this is the best or worst routine; I'm opening the floor to civil debate. If you need a definition of civil debate, see Wikipedia.