Correlation Matrix for non-numeric features

Currently, I have a dataset with numeric as well as non-numeric attributes. I am trying to remove the redundant features in the dataset using the R programming language. Note: the non-numeric attributes cannot be turned into binary.

The caret R package provides the findCorrelation function, which analyzes a correlation matrix of your data's attributes and reports the attributes that can be removed. However, it only works with numeric values of 'x'. I have been unable to find a package that does this for non-numeric attributes.
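For reference, this is roughly how findCorrelation is used on purely numeric data (the data frame, column names, and cutoff below are only illustrative):

# Illustrative only: findCorrelation on purely numeric columns
library(caret)

set.seed(1)
num_df   <- data.frame(a = rnorm(100), b = rnorm(100))
num_df$c <- num_df$a * 0.9 + rnorm(100, sd = 0.1)    # 'c' is highly correlated with 'a'

cor_mat  <- cor(num_df)
drop_idx <- findCorrelation(cor_mat, cutoff = 0.75)   # indices of columns that could be dropped
reduced  <- if (length(drop_idx) > 0) num_df[, -drop_idx] else num_df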

Is there a function in the caret package that does this for non-numeric attributes as well? If not, is there any method/package that would help me achieve the same?

Topic feature-reduction r machine-learning

Category Data Science


Let's say your data is stored in a data frame df, and that you are interested in analyzing the correlation between features V1, V2, V3, and V4. You could try the following:

# One indicator (dummy) column per level of each factor; "- 1" drops the intercept
m_V1 <- model.matrix(~ V1 - 1, df)
m_V2 <- model.matrix(~ V2 - 1, df)
m_V3 <- model.matrix(~ V3 - 1, df)
m_V4 <- model.matrix(~ V4 - 1, df)

Then, for V1 you could do the following:

cor(m_V1, m_V2)
cor(m_V1, m_V3)
cor(m_V1, m_V4) 

You would have to do this manually for each feature/vector. This solution would work for numeric as well as non-numeric variables.

Also, by default the correlation is Pearson. However, you can choose which correlation measure you want with something like:

cor(m_V2, m_V3, method = 'pearson')
cor(m_V2, m_V3, method = 'kendall')


There are measures of association for categorical variables. If you are looking at two ordinal variables, you may use Spearman's correlation coefficient. There are also many measures of association for purely categorical (nominal) variables, such as gender and race; Yule's Q and Cramér's V are popular choices.
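As a sketch in base R, assuming hypothetical columns ord1/ord2 (ordered factors), gender/race (nominal factors), and bin1/bin2 (binary factors) in a data frame df:

# Spearman's rank correlation for two ordinal variables
spearman_rho <- cor(as.numeric(df$ord1), as.numeric(df$ord2), method = "spearman")

# Cramér's V for two nominal variables, from the chi-squared statistic:
# V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
tab  <- table(df$gender, df$race)
chi2 <- as.numeric(chisq.test(tab, correct = FALSE)$statistic)
V    <- sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))

# Yule's Q for two binary variables (2x2 table): Q = (ad - bc) / (ad + bc)
t2 <- table(df$bin1, df$bin2)
Q  <- (t2[1, 1] * t2[2, 2] - t2[1, 2] * t2[2, 1]) /
      (t2[1, 1] * t2[2, 2] + t2[1, 2] * t2[2, 1])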


Two options come to mind looking at your question:

1) One-Hot-Encoding + Correlation: One-hot encoding takes each categorical feature and replaces it with one binary column per unique value of that feature. This gives you a set of new numerical/binary features that you can then use to calculate the correlation between each binary feature and your numerical ones (a sketch follows after this list).

2) Logistic Regression: If you want to measure the correlation between a categorical feature and a numeric one, you could try fitting a logistic regression between the two (with the categorical feature as the target and the numeric one as the only input). The hypothesis behind this is that, if there is a high correlation between the categorical feature and the numeric one, it should be possible to build a fairly well-performing model that predicts the categorical feature from the numeric one.
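A rough sketch of both options in R, with hypothetical column names (dummyVars and findCorrelation come from caret; the 0.75 cutoff and the 0.5 threshold are arbitrary illustrative choices):

# Option 1: one-hot encode every categorical column, then inspect correlations
library(caret)

dv     <- dummyVars(~ ., data = df)                  # one binary column per factor level
df_bin <- as.data.frame(predict(dv, newdata = df))

cor_mat  <- cor(df_bin)
drop_idx <- findCorrelation(cor_mat, cutoff = 0.75)
df_red   <- if (length(drop_idx) > 0) df_bin[, -drop_idx] else df_bin

# Option 2: logistic regression of a binary categorical feature on a numeric one
fit  <- glm(cat_feature ~ num_feature, data = df, family = binomial)
pred <- ifelse(predict(fit, type = "response") > 0.5,
               levels(df$cat_feature)[2], levels(df$cat_feature)[1])
mean(pred == df$cat_feature)   # in-sample accuracy as a crude association signal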

Of course there are many more ways to measure the correlation between categorical and numerical values. All of them, however, are based on some sort of distance measure that you need to define for your categorical features.
