Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. In terms of strength, the value of the correlation coefficient ranges from -1 to +1. A value of ±1 indicates a perfect degree of association between the two variables. As the coefficient approaches 0, the relationship between the two variables becomes weaker. The direction of the relationship is indicated by the sign of the coefficient: a + sign indicates a positive relationship and a - sign indicates a negative relationship.

Three correlation coefficients are commonly used: Pearson's correlation coefficient, which is a parametric method, and two non-parametric methods, Spearman's rank correlation coefficient and Kendall's tau coefficient.

Pearson's Correlation Coefficient

$$ r = \frac{\sum(X - \overline{X})(Y - \overline{Y})} {\sqrt{\sum(X-\overline{X})^{2}\cdot\sum(Y-\overline{Y})^{2}}}\\ ~ \\ \begin{align} \text{where} ~ \overline{X} &= \text{mean of the } X \text{ variable}\\ \overline{Y} &= \text{mean of the } Y \text{ variable} \end{align} $$

Assumptions:

  • Each observation should have a pair of values.

  • Each variable should be continuous.

  • The data should be free of outliers.

  • It assumes linearity and homoscedasticity.
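The formula above can be turned into code almost term by term. The following is a minimal sketch in plain Python, not a reference implementation; the function name `pearson_r` and the sample data are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers (illustrative helper)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of the products of the deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical sample data, chosen only to keep the arithmetic easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

For these values the numerator is 6 and the sums of squared deviations are 10 and 6, so r = 6 / sqrt(60).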

Spearman's Rank Correlation Coefficient

$$\rho = \frac{\sum_{i=1}^{n}(R(x_i) - \overline{R(x)})(R(y_i) - \overline{R(y)})} {\sqrt{\sum_{i=1}^{n}(R(x_i) - \overline{R(x)})^{2}\cdot\sum_{i=1}^{n}(R(y_i)-\overline{R(y)})^{2}}} = 1 - \frac{6\sum_{i=1}^{n}(R(x_i) - R(y_i))^{2}}{n(n^{2} - 1)}\\ ~ \\ \begin{align} \text{where} ~ R(x_i) &= \text{rank of } x_i\\ R(y_i) &= \text{rank of } y_i\\ \overline{R(x)} &= \text{mean rank of } x\\ \overline{R(y)} &= \text{mean rank of } y\\ n &= \text{number of paired observations} \end{align} $$

The second equality (the shortcut formula) holds only when there are no tied ranks.

Assumptions:

  • Pairs of observations are independent.

  • Two variables should be measured on an ordinal, interval or ratio scale.

  • It assumes that there is a monotonic relationship between the two variables.
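As a rough illustration of the two forms of the Spearman formula, the sketch below ranks the data (giving tied values the mean of their ranks, a common convention that is assumed here) and then applies the Pearson formula to the ranks. The helper names `average_ranks`, `spearman_rho`, and `spearman_rho_shortcut`, along with the sample data, are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson formula applied to the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_rx, mean_ry = sum(rx) / n, sum(ry) / n
    num = sum((a - mean_rx) * (b - mean_ry) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mean_rx) ** 2 for a in rx)
               * sum((b - mean_ry) ** 2 for b in ry))
    return num / den


def spearman_rho_shortcut(x, y):
    """Shortcut formula; exact only when neither variable has tied values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical sample data without ties, so both forms should agree.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rho = spearman_rho(x, y)
rho_short = spearman_rho_shortcut(x, y)
print(round(rho, 4), round(rho_short, 4))
```

Here the rank differences are 0, -1, 1, -1, 1, so the sum of squared differences is 4 and both forms give 1 - 24/120 = 0.8. With tied values, only the first form should be relied on.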

Kendall's Tau Coefficient

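$$ \tau = \frac{n_c - n_d}{n_c + n_d} = \frac{n_c - n_d}{n(n-1)/2}\\ ~ \\ \begin{align} \text{where} ~ n_c &= \text{number of concordant pairs}\\ n_d &= \text{number of discordant pairs}\\ n &= \text{number of observations} \end{align} $$

The two expressions are equal when there are no ties, because every one of the n(n-1)/2 pairs of observations is then either concordant or discordant.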

Assumptions:

  • The assumptions are the same as those of Spearman's rank correlation coefficient.
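To make the concordant and discordant counts concrete, here is a minimal sketch that visits every pair of observations directly. It computes the version of tau with n(n-1)/2 in the denominator and applies no correction for ties; the function name `kendall_tau` and the sample data are illustrative only.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau by direct pair counting (no tie correction)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = 0
    # Visit each of the n(n-1)/2 unordered pairs of observations once.
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # x and y are ordered the same way
        elif sign < 0:
            discordant += 1   # x and y are ordered in opposite ways
        # sign == 0 is a tie in x or y and counts toward neither total
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical sample data without ties.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
tau = kendall_tau(x, y)
print(tau)  # 0.6
```

Of the 10 pairs in this sample, 8 are concordant and 2 are discordant, so tau = (8 - 2) / 10. The double loop hidden in `combinations` is what makes this direct approach quadratic in the sample size.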

Comparison of the Correlation Coefficients

Pearson correlation vs Spearman and Kendall correlation

  • Non-parametric correlations are generally less powerful when Pearson's assumptions hold, because they use less information in their calculations. Pearson's correlation uses the means and the deviations from the means, while non-parametric correlations use only the ordinal information (ranks) and the concordance of pairs.

  • With non-parametric correlations, the X and Y values can be continuous or ordinal, and approximately normal distributions for X and Y are not required. Pearson's correlation, in contrast, assumes that X and Y are continuous and normally distributed.

  • Correlation coefficients only measure linear (Pearson) or monotonic (Spearman and Kendall) relationships.

Spearman correlation vs Kendall correlation

  • In the usual case, Kendall correlation is considered more robust and efficient than Spearman correlation. This means that Kendall correlation is generally preferred for small samples or data with some outliers.

  • Kendall correlation has O(n^2) computational complexity when computed by direct pair counting, compared with O(n log n) for Spearman correlation, where n is the sample size.

  • Spearman’s rho is usually larger in magnitude than Kendall’s tau.

  • Kendall’s tau has a very direct interpretation: it is the difference between the probability of observing a concordant (agreeing) pair and the probability of observing a discordant (disagreeing) pair.

Example Python Implementation
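One possible way to compute all three coefficients in Python is with SciPy's `scipy.stats` module, assuming SciPy is installed. The sketch below uses made-up data in which y is a monotonic but non-linear function of x, so it also illustrates the difference between linear and monotonic association.

```python
from scipy import stats

# Hypothetical data: y increases with x, but not along a straight line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]

# Each function returns a result that unpacks as (statistic, p-value).
pearson_stat, pearson_p = stats.pearsonr(x, y)
spearman_stat, spearman_p = stats.spearmanr(x, y)
kendall_stat, kendall_p = stats.kendalltau(x, y)

results = [
    ("Pearson", pearson_stat, pearson_p),
    ("Spearman", spearman_stat, spearman_p),
    ("Kendall", kendall_stat, kendall_p),
]
for name, statistic, p_value in results:
    print(f"{name:8s} coefficient = {statistic:.4f}, p-value = {p_value:.4g}")
```

Because the relationship is perfectly monotonic, Spearman's rho and Kendall's tau both equal 1, while Pearson's r stays below 1 because the relationship is not linear.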
