Feature reduction by removing certain columns from a dataframe

I am working on an emotion recognition model with the IEMOCAP dataset. For feature extraction, I compute the mel-spectrogram, convert it into a NumPy array, and then turn the array into a dataframe of spectrogram features.
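
A minimal sketch of that pipeline, assuming librosa is used for the mel-spectrogram (the file names and the padding length of 11761 are hypothetical placeholders):

    import numpy as np
    import pandas as pd
    import librosa

    max_len = 11761  # hypothetical: length the flattened spectrograms are padded/truncated to

    rows = []
    for path in ["utt_001.wav", "utt_002.wav"]:  # placeholder file names
        y, sr = librosa.load(path, sr=None)
        mel = librosa.feature.melspectrogram(y=y, sr=sr)  # shape: (n_mels, n_frames)
        flat = mel.flatten()
        n = min(len(flat), max_len)
        padded = np.zeros(max_len)
        padded[:n] = flat[:n]  # zero-pad short clips, truncate long ones
        rows.append(padded)

    df = pd.DataFrame(rows)
    print(df.shape)  # (n_files, 11761)

Zero-padding to a fixed length is what produces the long runs of trailing zeros visible in the sample below.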

The generated dataframe has a shape of 2380 rows × 11761 columns, like this:

            0         1         2         3         4         5         6         7  ...  11754  11755  11756  11757  11758  11759  11760  11761
262  0.036491  0.037793  0.041035  0.044644  0.047210  0.048467  0.049556  0.052137  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
323  0.004577  0.004684  0.004951  0.005228  0.005357  0.005255  0.004969  0.004632  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
680  0.003169  0.003221  0.003349  0.003490  0.003600  0.003682  0.003766  0.003860  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
568  0.001942  0.001935  0.001934  0.001969  0.002071  0.002247  0.002456  0.002622  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
769  0.002546  0.002483  0.002299  0.002050  0.001813  0.001661  0.001652  0.001793  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0

When I checked thoroughly, many of the last columns contain only 0.0, with just a few rows holding any information.

My question is: can I remove columns that have fewer than a certain number of nonzero elements? Is dimensionality reduction possible this way? Please guide me through this.

Topic feature-reduction feature-engineering

Category Data Science


I've used such a method to remove tokenized words from a TF-IDF NLP transformer that were very infrequent. In my case these words were mostly spelling mistakes or random characters that I didn't want to track as features. Basically, you can run numpy.count_nonzero on every column, and if it returns a number that's less than some threshold (say 1 or 2), you can run pandas.DataFrame.drop on that column.
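
A short sketch of that approach, assuming `df` is the spectrogram dataframe from the question; the threshold of 2 is an arbitrary example value:

    import numpy as np

    threshold = 2  # arbitrary example threshold
    sparse_cols = [col for col in df.columns
                   if np.count_nonzero(df[col].to_numpy()) < threshold]
    df_reduced = df.drop(columns=sparse_cols)
    print(df.shape, "->", df_reduced.shape)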


It depends. Generally speaking it might be okay, as those columns seem to hold very little information. However, there are some cases, notably when the outcome is imbalanced, where skewed features are useful. Say you are trying to predict something that happens 1% of the time; a feature with only 2% non-zero values that causally leads to the outcome might be very predictive of your target.

So when removing those columns it is important to check (before) that there is no clear dependency on the outcome, and (after) that performance doesn't drop too much once they are removed.
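
A sketch of both checks, assuming `df` and `sparse_cols` from the earlier snippet, an integer-encoded target `y` (e.g. emotion classes), and scikit-learn; the classifier and the 0.1 correlation cut-off are arbitrary example choices:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # (before) keep sparse columns that still clearly track the outcome,
    # here via absolute correlation with the integer-encoded target
    target = pd.Series(y, index=df.index)
    keep_anyway = [col for col in sparse_cols
                   if abs(df[col].corr(target)) > 0.1]
    cols_to_drop = [col for col in sparse_cols if col not in keep_anyway]

    # (after) compare cross-validated performance with and without those columns
    clf = RandomForestClassifier(random_state=0)
    score_full = cross_val_score(clf, df, y, cv=5).mean()
    score_reduced = cross_val_score(clf, df.drop(columns=cols_to_drop), y, cv=5).mean()
    print(score_full, "->", score_reduced)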
