Hive: How to calculate the Kendall correlation coefficient for a pair of numeric columns within a group?

Question

Marcin Kosiński

April 28, 2015, 21:46

This wiki page lists a function corr() that calculates the Pearson correlation coefficient. Is there any function in Hive that calculates the Kendall correlation coefficient for a pair of numeric columns within a group?

Topic hive correlation apache-hadoop

Category Data Science

Hack-R · Accepted Answer · February 27, 2015, 20:00

In Hive itself? Unfortunately, no: as the language manual shows, that statistic is not built in. Beyond the language manual, you can find more information on statistics under development in Hive here and here.

Having said that, there are plenty of ways to calculate Kendall's tau on data that is stored in Hive.

You could write the data out to a file, or query it into R or a statistical package such as SAS, Stata, MATLAB, or Excel, then run your calculation and, if necessary, write the results back to Hive.

In R, for instance, you could do something like this:

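install.packages("RODBC")  # only needed once per R installation
library(RODBC)
db   <- odbcConnect("Hive_DB")  # "Hive_DB" is the ODBC data source name
# my_table is a placeholder; a and b are the two numeric columns
hql  <- "select a, b from my_table"
data <- sqlQuery(db, hql)
# Kendall's tau between the two columns
ktau <- cor(x = data$a, y = data$b, method = "kendall")
# sqlSave() expects a data frame, so wrap the single value in one
sqlSave(db, data.frame(kendall_tau = ktau), tablename = "new_table_of_kendall_coef", rownames = FALSE)
odbcClose(db)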

Alternatively, if you are on Linux or Unix, you could use RHive, which does not require an ODBC data source name.
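Since the question asks for the coefficient within each group, here is a minimal sketch of the per-group step in R. It assumes the queried data frame carries a grouping column; the column names grp, a, and b and the toy data are hypothetical stand-ins for whatever sqlQuery() returns.

```r
# Toy stand-in for the data frame returned by sqlQuery();
# 'grp', 'a' and 'b' are hypothetical column names.
data <- data.frame(
  grp = c("x", "x", "x", "x", "y", "y", "y", "y"),
  a   = c(1, 2, 3, 4, 1, 2, 3, 4),
  b   = c(1, 3, 2, 4, 4, 3, 2, 1)
)

# Split into one data frame per group, then compute Kendall's tau in each.
tau_by_group <- sapply(
  split(data, data$grp),
  function(d) cor(d$a, d$b, method = "kendall")
)

# Reshape into a data frame with one row per group.
result <- data.frame(
  grp         = names(tau_by_group),
  kendall_tau = unname(tau_by_group)
)
print(result)
```

A data frame shaped like result, with one row per group, is the form that sqlSave() accepts if the values need to go back into Hive.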

Another approach would be to take the functions that do exist in Hive (which you linked to) and calculate Kendall's coefficient yourself with a custom function. As for how specifically to implement that, you would probably want to post on Cross Validated (stats.stackexchange.com).
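To give an idea of what such a custom function would have to compute, here is a rough sketch in R of Kendall's tau-a by direct pair counting. The function name kendall_tau_a is made up for illustration, and the sketch applies no correction for ties, so it should match cor(..., method = "kendall") only on data without tied values.

```r
# Kendall's tau-a by direct pair counting (no correction for ties).
# 'kendall_tau_a' is an illustrative name, not an existing R or Hive function.
kendall_tau_a <- function(x, y) {
  n <- length(x)
  stopifnot(n == length(y), n >= 2)
  s <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
      s <- s + sign(x[i] - x[j]) * sign(y[i] - y[j])
    }
  }
  # Divide by the total number of pairs, n * (n - 1) / 2
  s / (n * (n - 1) / 2)
}

a <- c(1, 2, 3, 4, 5)
b <- c(2, 1, 4, 3, 5)
print(kendall_tau_a(a, b))
print(cor(a, b, method = "kendall"))  # comparison on tie-free data
```

The pair counting is quadratic in the number of rows per group, which is worth keeping in mind before porting the same logic to a Hive UDF or a self-join on large groups.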
