Hadoop/Pig Aggregate Data

I am working on a project with two data sets: a time vs. speed data set (let's call it traffic) and a time vs. weather data set (called weather).

I would like to find a correlation between the two using Pig. However, the time field in the traffic data set has the form D/M/Y hr:min:sec, while the time field in the weather data set has the form D/M/Y.

Because of this, I would like to average the speed per day so that each day in the traffic file is reduced to a single D/M/Y row.

I then plan to use:

-- merge join: both inputs must already be sorted by day
data = JOIN speed BY day, weather BY day USING 'merge';

I will then compute the correlation with the following (I am borrowing this code from elsewhere):

-- 'set' is a reserved word in Pig, so the relation is named 'pairs' here
pairs = LOAD 'data.txt' AS (speed:double, weather:double);
rel = GROUP pairs ALL;
cor = FOREACH rel GENERATE COR(pairs.speed, pairs.weather);
DUMP cor;

This is my first experience with Pig (I've never even used SQL), so I would like to know a few things:

1. How can I merge the rows of my traffic file (i.e., average all D/M/Y hr:min:sec readings for a day into a single D/M/Y row)?
2. Is there a better way to find a correlation between the fields of different datasets?
3. Are the JOIN BY and the COR() functions used appropriately in my above code?  
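
To make the intended pipeline concrete, here is a minimal sketch of the three steps (daily average, join, correlation). It rests on assumptions that are not part of my actual data: the file names traffic.txt and weather.txt, tab-delimited input, the field names ts and weather_value, and zero-padded timestamps such as 05/03/2014 13:45:10, so that the first 10 characters are always the date. It uses the default join rather than 'merge', since a merge join requires both inputs to be sorted on the join key beforehand.

```pig
-- Hypothetical file names and schemas; adjust to the real data.
traffic = LOAD 'traffic.txt' AS (ts:chararray, speed:double);
weather = LOAD 'weather.txt' AS (day:chararray, weather_value:double);

-- Keep only the date part of each timestamp
-- (assumes zero-padded 'dd/MM/yyyy HH:mm:ss', so the date is characters 0..9).
traffic_day = FOREACH traffic GENERATE SUBSTRING(ts, 0, 10) AS day, speed;

-- Collapse all readings of a day into one row holding the mean speed.
by_day = GROUP traffic_day BY day;
daily_speed = FOREACH by_day GENERATE group AS day, AVG(traffic_day.speed) AS speed;

-- Default join on the day key, then keep just the two numeric columns.
joined = JOIN daily_speed BY day, weather BY day;
pairs = FOREACH joined GENERATE daily_speed::speed AS speed,
                                weather::weather_value AS weather_value;

-- Correlation over all joined days.
grouped = GROUP pairs ALL;
cor = FOREACH grouped GENERATE COR(pairs.speed, pairs.weather_value);
DUMP cor;
```

If the timestamps are not zero-padded, the SUBSTRING step would have to be replaced by something that splits on the space between the date and the time.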

Topic apache-pig correlation beginner apache-hadoop

Category Data Science