Hive: How to calculate the Kendall coefficient of correlation of a pair of numeric columns within a group?

This wiki page lists a function, corr(), that calculates the Pearson coefficient of correlation. Is there any function in Hive that calculates the Kendall coefficient of correlation of a pair of numeric columns within a group?

Topic hive correlation apache-hadoop

Category Data Science


In Hive itself? Unfortunately, the answer is no: as the language definition manual shows, that statistic is not built in. Beyond the language manual, you can find more information on statistics in development in Hive here and here.

Having said that, there are plenty of ways to calculate Kendall's tau (the rank correlation coefficient) on data that's in Hive.

You could write the data out to a file, or query it into R or a statistical package such as SAS, Stata, MATLAB, or Excel, then run your calculation and, if necessary, write your results back to Hive.

In R, for instance, you could do something like this:

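# install.packages("RODBC")   # run once if the package is not installed
library(RODBC)
db   <- odbcConnect("Hive_DB")   # ODBC data source name pointing at Hive
hql  <- "select a, b from A"     # assumes table A has numeric columns a and b
data <- sqlQuery(db, hql)
# Kendall's rank correlation coefficient between the two columns
tau  <- cor(x = data$a, y = data$b, method = "kendall")
# sqlSave() expects a data frame, so wrap the single value before writing it
sqlSave(db, data.frame(kendall_tau = tau), tablename = "new_table_of_kendall_coef", rownames = FALSE)
odbcClose(db)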

Alternatively, if you are on Linux or Unix, you could use RHive, which does not need an ODBC name.

Another way to go about it would be to take the functions that do exist in Hive (which you linked to) and calculate Kendall's coefficient yourself with a custom function. As for how exactly to implement that, you'd probably want to post on Cross Validated (stats.stackexchange.com).
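To give a rough idea of what "calculating it yourself" involves, here is a minimal sketch of the pair-counting behind Kendall's coefficient, applied per group. It is plain Python over rows already pulled out of Hive, not a Hive UDF, and it assumes the tau-b variant (which adjusts for ties). The function names and the (group_key, a, b) row layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b for two equal-length numeric sequences (O(n^2) pair scan)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both columns: contributes to no term
        elif dx == 0:
            ties_x += 1  # tied only in the first column
        elif dy == 0:
            ties_y += 1  # tied only in the second column
        elif dx * dy > 0:
            concordant += 1  # both columns order the pair the same way
        else:
            discordant += 1  # the columns order the pair in opposite ways
    pairs = concordant + discordant
    denom = sqrt((pairs + ties_x) * (pairs + ties_y))
    # Undefined (NaN) when one column is constant or fewer than two rows exist
    return (concordant - discordant) / denom if denom else float("nan")


def kendall_by_group(rows):
    """rows: iterable of (group_key, a, b) tuples; returns {group_key: tau_b}."""
    groups = defaultdict(lambda: ([], []))
    for key, a, b in rows:
        groups[key][0].append(a)
        groups[key][1].append(b)
    return {key: kendall_tau_b(xs, ys) for key, (xs, ys) in groups.items()}
```

The pair scan is quadratic in the size of each group, so this is only reasonable for modest group sizes; for large groups a sort-based O(n log n) method or an established statistics library would be a better fit.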
