Calculating correlation for categorical variables

Question

Ricky

March 8, 2021 06:33

I am struggling to find a suitable way to calculate a correlation coefficient for categorical variables. Pearson's coefficient is not defined for categorical features. I want to find the features with the greatest influence on the target variable. My objectives are:

Correlation between two categorical variables. For example, for a binary target (as in the Titanic dataset), I want to measure the influence of a category on the target, such as the influence of gender on survival (0/1).
Capturing some non-linear dependencies. For example, in supermarket sales data, sales are usually higher on weekends because people tend to visit such stores more on their days off, so we expect to see spikes at intervals of roughly 7 days. Is there any way to capture this non-linearity/seasonality with a correlation coefficient?

Topic pearsons-correlation-coefficient descriptive-statistics categorical-data

Category Data Science

Nikos M. · Accepted Answer · March 7, 2021 20:05

According to the post "The Search for Categorical Correlation" on Towards Data Science, one can use a measure of association for categorical variables called Cramér's V.

Going categorical

What we need is something that will look like correlation, but will work with categorical values — or more formally, we’re looking for a measure of association between two categorical features. Introducing: Cramér’s V. It is based on a nominal variation of Pearson’s Chi-Square Test, and comes built-in with some great benefits:

Similarly to correlation, the output is in the range of [0,1], where 0 means no association and 1 is full association. (Unlike correlation, there are no negative values, as there’s no such thing as a negative association. Either there is, or there isn’t.)

Like correlation, Cramér’s V is symmetrical: it is insensitive to swapping x and y.
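
In symbols, the statistic described above is commonly written as follows. Here \chi^2 is Pearson's chi-square statistic for the r-by-k contingency table of the two variables and n is the total number of observations; this notation is chosen to match the variable names in the code in this answer and is not taken from the quoted post. The first line is the plain definition, and the remaining lines are the bias-corrected variant that the function in this answer computes.

```latex
V = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\; r - 1)}}

\tilde{\varphi}^2 = \max\!\left(0,\; \frac{\chi^2}{n} - \frac{(k - 1)(r - 1)}{n - 1}\right), \qquad
\tilde{r} = r - \frac{(r - 1)^2}{n - 1}, \qquad
\tilde{k} = k - \frac{(k - 1)^2}{n - 1}

\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k} - 1,\; \tilde{r} - 1)}}
```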

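import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v(x, y):
    # Bias-corrected Cramér's V between two categorical series
    contingency = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(contingency)[0]  # chi-square statistic
    n = contingency.sum().sum()  # total number of observations
    phi2 = chi2 / n
    r, k = contingency.shape
    # Bias correction for phi^2 and for the table dimensions
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    denom = min(kcorr - 1, rcorr - 1)
    # A variable with a single category has nothing to associate with
    return np.sqrt(phi2corr / denom) if denom > 0 else 0.0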
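
As a rough illustration of how this could be applied to the Titanic-style case from the question, here is a minimal sketch on made-up data. The DataFrame, its column names (sex, deck, survived) and the counts are all invented for the example and are not the real dataset; the function is repeated in compact form only so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_v(x, y):
    """Bias-corrected Cramér's V (same computation as in the answer)."""
    table = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(table)[0]
    n = table.sum().sum()
    r, k = table.shape
    phi2corr = max(0, chi2 / n - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


# Hypothetical toy data: survival depends strongly on sex,
# while deck alternates A/B and is unrelated to survival by construction.
df = pd.DataFrame({
    "sex": ["female"] * 50 + ["male"] * 50,
    "survived": [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
    "deck": ["A", "B"] * 50,
})

# Score every candidate feature against the target and rank them.
scores = {col: cramers_v(df[col], df["survived"]) for col in ["sex", "deck"]}
ranked = sorted(scores, key=scores.get, reverse=True)

for col in ranked:
    print(f"{col} vs survived: {scores[col]:.3f}")

# Swapping the arguments should give the same value (symmetry).
swapped = cramers_v(df["survived"], df["sex"])
```

Ranking features this way only says how strongly each one is associated with the target, not in which direction. Note also that scipy's chi2_contingency applies Yates' continuity correction to 2x2 tables by default, which lowers the value slightly for binary-vs-binary pairs.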
