Pearson vs Spearman vs Kendall
What are the characteristics of the three correlation coefficients and what are the comparisons of each of them/assumptions?
Can somebody kindly take me through the concepts?
Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. In terms of the strength of the relationship, the value of the correlation coefficient varies between +1 and -1. A value of ± 1 indicates a perfect degree of association between the two variables. As the correlation coefficient value goes towards 0, the relationship between the two variables will be weaker. The direction of the relationship is indicated by the sign of the coefficient; a + sign indicates a positive relationship and a – sign indicates a negative relationship.
Three coefficients are commonly used: Pearson's correlation coefficient, which is a parametric method, and two non-parametric methods, Spearman's rank correlation coefficient and Kendall's tau coefficient.
Pearson's Correlation Coefficient
$$ \begin{aligned} r &= \frac{\sum(X - \overline{X})(Y - \overline{Y})} {\sqrt{\sum(X-\overline{X})^{2}\cdot\sum(Y-\overline{Y})^{2}}}\\ \text{where } \overline{X} &= \text{mean of the } X \text{ variable}\\ \overline{Y} &= \text{mean of the } Y \text{ variable} \end{aligned} $$
Assumptions:
Each observation should have a pair of values.
Each variable should be continuous.
There should be no outliers.
It assumes linearity and homoscedasticity.
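To make the formula above concrete, here is a minimal sketch that computes Pearson's r directly from the deviations from the means. The function name `pearson_r` and the sample data are made up for illustration; in practice a statistics library would normally be used instead.

```python
import math

def pearson_r(x, y):
    """Pearson's r from the deviation-from-mean formula (illustrative helper)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]  # deviations of X from its mean
    dy = [yi - mean_y for yi in y]  # deviations of Y from its mean
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator  # ZeroDivisionError if x or y is constant

# Hypothetical data: y is an exact increasing linear function of x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
print(pearson_r(x, y))  # 1.0
```

Because y = 2x + 1 here, the relationship is perfectly linear and the coefficient is exactly 1. Reversing the order of y gives -1.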
Spearman's Rank Correlation Coefficient
$$ \begin{aligned} \rho &= \frac{\sum_{i=1}^{n}(R(x_i) - \overline{R(x)})(R(y_i) - \overline{R(y)})} {\sqrt{\sum_{i=1}^{n}(R(x_i) - \overline{R(x)})^{2}\cdot\sum_{i=1}^{n}(R(y_i)-\overline{R(y)})^{2}}} = 1 - \frac{6\sum_{i=1}^{n}(R(x_i) - R(y_i))^{2}}{n(n^{2} - 1)}\\ \text{where } R(x_i) &= \text{rank of } x_i\\ R(y_i) &= \text{rank of } y_i\\ \overline{R(x)} &= \text{mean rank of } x\\ \overline{R(y)} &= \text{mean rank of } y\\ n &= \text{number of pairs} \end{aligned} $$
The second equality holds only when there are no tied ranks.
Assumptions:
Pairs of observations are independent.
Two variables should be measured on an ordinal, interval or ratio scale.
It assumes that there is a monotonic relationship between the two variables.
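The Spearman formula is just Pearson's formula applied to ranks, which the following sketch tries to show. The helper names (`average_ranks`, `spearman_rho`, `spearman_rho_no_ties`) and the data are hypothetical; tied values are given the mean of their positions, which is a common convention but an assumption here.

```python
import math

def average_ranks(values):
    """Ranks 1..n; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    return num / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def spearman_rho(x, y):
    """General form: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def spearman_rho_no_ties(x, y):
    """Shortcut form 1 - 6*sum(d^2) / (n(n^2 - 1)); valid only without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical data without ties: two adjacent pairs of y are swapped.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(spearman_rho(x, y), spearman_rho_no_ties(x, y))  # both about 0.8
```

With no ties the two forms agree. When ties are present only the rank-based Pearson form should be trusted.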
Kendall's Tau Coefficient
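$$ \begin{aligned} \tau &= \frac{n_c - n_d}{n_c + n_d} = \frac{n_c - n_d}{n(n-1)/2}\\ \text{where } n_c &= \text{number of concordant pairs}\\ n_d &= \text{number of discordant pairs}\\ n &= \text{number of observations } (x_i, y_i) \end{aligned} $$
The second equality holds only when there are no ties, so that every one of the n(n-1)/2 pairs of observations is either concordant or discordant.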
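As a rough illustration of the counting behind tau, the sketch below compares every pair of observations and classifies it as concordant or discordant. The function name `kendall_tau` and the data are made up, and the code assumes there are no ties (it divides by n(n-1)/2).

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau by direct pair counting; assumes no tied values."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:    # x and y move in the same direction
            concordant += 1
        elif product < 0:  # x and y move in opposite directions
            discordant += 1
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs

# Hypothetical data: 8 of the 10 pairs are concordant, 2 are discordant.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(kendall_tau(x, y))  # 0.6
```

The double loop over all pairs is what makes this direct approach quadratic in the sample size.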
Assumptions:
The same as for Spearman's rank correlation coefficient.
Pearson correlation vs Spearman and Kendall correlation
Non-parametric correlations are less powerful because they use less information in their calculations. Pearson's correlation uses the means and the deviations from the means, while non-parametric correlations use only the ordinal information and scores of pairs.
With non-parametric correlations, the X and Y values can be continuous or ordinal, and approximately normal distributions for X and Y are not required. Pearson's correlation, by contrast, assumes that X and Y are continuous and normally distributed.
Correlation coefficients only measure linear (Pearson) or monotonic (Spearman and Kendall) relationships.
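A small experiment can show the difference between linear and monotonic association. In the sketch below y = x^3 is strictly increasing but not linear, so the rank-based coefficients reach 1 while Pearson's r stays below 1. All function names and the data are illustrative, and the helpers assume there are no ties.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    return num / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def ranks_no_ties(values):
    """Ranks 1..n; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def spearman_rho(x, y):
    return pearson_r(ranks_no_ties(x), ranks_no_ties(y))

def kendall_tau(x, y):
    signs = [(x[i] - x[j]) * (y[i] - y[j])
             for i, j in combinations(range(len(x)), 2)]
    n_c = sum(s > 0 for s in signs)  # concordant pairs
    n_d = sum(s < 0 for s in signs)  # discordant pairs
    return (n_c - n_d) / len(signs)

# Hypothetical data: monotonic (strictly increasing) but clearly non-linear.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print("Pearson :", pearson_r(x, y))     # below 1
print("Spearman:", spearman_rho(x, y))  # 1.0
print("Kendall :", kendall_tau(x, y))   # 1.0
```

A relationship that is not even monotonic, such as y = x^2 over a range symmetric around 0, can leave all three coefficients near 0 despite perfect dependence.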
Spearman correlation vs Kendall correlation
In the normal case, Kendall correlation is more robust and efficient than Spearman correlation, which means that Kendall correlation is preferred when the sample is small or contains some outliers.
Computed directly by comparing all pairs, Kendall correlation has O(n^2) computational complexity, compared with O(n log n) for Spearman correlation, where n is the sample size.
Spearman’s rho is usually larger in magnitude than Kendall’s tau.
The interpretation of Kendall’s tau in terms of the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs is very direct.
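The probability reading of tau can be checked numerically. In this sketch the share of concordant pairs estimates the probability that a randomly chosen pair of observations agrees in order, and tau is that share minus the share of discordant pairs. The function name `pair_proportions` and the data are hypothetical, and ties are assumed absent.

```python
from itertools import combinations

def pair_proportions(x, y):
    """Return (share of concordant pairs, share of discordant pairs)."""
    pairs = list(combinations(range(len(x)), 2))
    concordant = sum((x[i] - x[j]) * (y[i] - y[j]) > 0 for i, j in pairs)
    discordant = sum((x[i] - x[j]) * (y[i] - y[j]) < 0 for i, j in pairs)
    return concordant / len(pairs), discordant / len(pairs)

# Hypothetical data: 12 of the 15 pairs are concordant, 3 are discordant.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
p_concordant, p_discordant = pair_proportions(x, y)
tau = p_concordant - p_discordant
print(p_concordant, p_discordant, tau)
```

Here a random pair is concordant 80% of the time and discordant 20% of the time, so tau is about 0.6.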