Feature reduction by removing certain columns from a dataframe

I am working on an emotion recognition model with the IEMOCAP dataset. For feature extraction, I compute the mel-spectrogram, convert it into a NumPy array, and then turn the array into a dataframe of spectrogram features.
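
A minimal sketch of that pipeline, assuming librosa is used for the mel-spectrogram (the file names and the padding length of 11761 are hypothetical placeholders):

    import numpy as np
    import pandas as pd
    import librosa

    max_len = 11761  # hypothetical: length the flattened spectrograms are padded/truncated to

    rows = []
    for path in ["utt_001.wav", "utt_002.wav"]:  # placeholder file names
        y, sr = librosa.load(path, sr=None)
        mel = librosa.feature.melspectrogram(y=y, sr=sr)  # shape: (n_mels, n_frames)
        flat = mel.flatten()
        n = min(len(flat), max_len)
        padded = np.zeros(max_len)
        padded[:n] = flat[:n]  # zero-pad short clips, truncate long ones
        rows.append(padded)

    df = pd.DataFrame(rows)
    print(df.shape)  # (n_files, 11761)

Zero-padding to a fixed length is what produces the long runs of trailing zeros visible in the sample below.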

The generated dataframe has a shape of 2380 rows × 11761 columns, like this:

            0         1         2         3         4         5         6         7  ...  11754  11755  11756  11757  11758  11759  11760  11761
262  0.036491  0.037793  0.041035  0.044644  0.047210  0.048467  0.049556  0.052137  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
323  0.004577  0.004684  0.004951  0.005228  0.005357  0.005255  0.004969  0.004632  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
680  0.003169  0.003221  0.003349  0.003490  0.003600  0.003682  0.003766  0.003860  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
568  0.001942  0.001935  0.001934  0.001969  0.002071  0.002247  0.002456  0.002622  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
769  0.002546  0.002483  0.002299  0.002050  0.001813  0.001661  0.001652  0.001793  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0

When I checked thoroughly, many of the last columns contain only 0.0, with just a few rows holding any information.

My question is: can I remove columns that have fewer than a certain number of nonzero elements? Is dimensionality reduction possible this way? Please guide me through this.

Topic feature-reduction feature-engineering

Category Data Science


I've used such a method to remove tokenized words from a TF-IDF NLP transformer that were very infrequent. In my case these words were mostly spelling mistakes or random characters that I didn't want to track as features. Basically, you can run numpy.count_nonzero on every column, and if it returns a number that's less than some threshold (say 1 or 2), you can run pandas.DataFrame.drop on that column.
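
A short sketch of that approach, assuming `df` is the spectrogram dataframe from the question; the threshold of 2 is an arbitrary example value:

    import numpy as np

    threshold = 2  # arbitrary example threshold
    sparse_cols = [col for col in df.columns
                   if np.count_nonzero(df[col].to_numpy()) < threshold]
    df_reduced = df.drop(columns=sparse_cols)
    print(df.shape, "->", df_reduced.shape)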


It depends. Generally speaking it might be okay, as those columns seem to hold very little information. However, there are some cases, notably when the outcome is imbalanced, where skewed features are useful. Say you are trying to predict something that happens 1% of the time; a feature with only 2% non-zero values that causally leads to the outcome might be very predictive of your target.

So when removing those columns it is important to check (before) that there is no clear dependency on the outcome, and (after) that performance doesn't drop too much once they are removed.
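
A sketch of both checks, assuming `df` and `sparse_cols` from the earlier snippet, an integer-encoded target `y` (e.g. emotion classes), and scikit-learn; the classifier and the 0.1 correlation cut-off are arbitrary example choices:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # (before) keep sparse columns that still clearly track the outcome,
    # here via absolute correlation with the integer-encoded target
    target = pd.Series(y, index=df.index)
    keep_anyway = [col for col in sparse_cols
                   if abs(df[col].corr(target)) > 0.1]
    cols_to_drop = [col for col in sparse_cols if col not in keep_anyway]

    # (after) compare cross-validated performance with and without those columns
    clf = RandomForestClassifier(random_state=0)
    score_full = cross_val_score(clf, df, y, cv=5).mean()
    score_reduced = cross_val_score(clf, df.drop(columns=cols_to_drop), y, cv=5).mean()
    print(score_full, "->", score_reduced)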
