Hadoop/Pig Aggregate Data
I am working on a project with two data sets: a time vs. speed data set (let's call it traffic) and a time vs. weather data set (called weather).
I want to find a correlation between these two sets using Pig. However, the time field in the traffic data set has the format D/M/Y hr:min:sec, while the time field in the weather data set is just D/M/Y.
Because of this, I would like to average the speed per day, so that the traffic file has a single speed value for each D/M/Y.
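To make the goal concrete, here is a rough sketch of the daily averaging I have in mind. It assumes a tab-separated file with two fields; the file name 'traffic.txt', the field name ts, and the aliases by_day, grouped and daily are placeholders I made up, not my real schema. I have not run this, so please treat it as an outline only:

```pig
-- Hypothetical input: one row per reading, e.g. "1/2/2014 08:15:00<TAB>54.3"
traffic = LOAD 'traffic.txt' AS (ts:chararray, speed:double);

-- Keep only the D/M/Y part, i.e. everything before the first space
by_day = FOREACH traffic GENERATE
             SUBSTRING(ts, 0, INDEXOF(ts, ' ', 0)) AS day,
             speed;

-- One group per day, then the mean speed within each group
grouped = GROUP by_day BY day;
daily   = FOREACH grouped GENERATE group AS day, AVG(by_day.speed) AS avg_speed;
```

The idea is that daily would then have one row per D/M/Y, matching the granularity of the weather file.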
I then plan to use:
data = JOIN speed BY day, weather BY day USING 'merge';
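As far as I understand, the 'merge' option expects both inputs to be already sorted on the join key, so a plain JOIN might be the safer starting point. Below is a sketch of how I imagine the join producing the two columns needed for the correlation. The file names, the field names avg_speed and reading, and the aliases are all assumptions for illustration:

```pig
-- Hypothetical inputs: one row per day in each file
daily   = LOAD 'daily_speed.txt' AS (day:chararray, avg_speed:double);
weather = LOAD 'weather.txt'     AS (day:chararray, reading:double);

-- Default (hash) join on the shared day key; no sort order required
joined = JOIN daily BY day, weather BY day;

-- After a join, fields are disambiguated with the relation::field syntax
pairs = FOREACH joined GENERATE
            daily::avg_speed AS speed,
            weather::reading AS weather_reading;
```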
I will then compute the correlation with the following code, which I am borrowing from elsewhere:
-- "set" is a reserved word in Pig, so the relation is named pairs here
pairs = LOAD 'data.txt' AS (speed:double, weather:double);
rel = GROUP pairs ALL;
cor = FOREACH rel GENERATE COR(pairs.speed, pairs.weather);
DUMP cor;
This is my first experience with Pig (I've never even used SQL), so I would like to know a few things:
1. How can I merge the rows of my traffic file (i.e. average the D/M/Y hr:min:sec rows into one row per D/M/Y)?
2. Is there a better way to find a correlation between the fields of different datasets?
3. Are the JOIN BY and the COR() functions used appropriately in my above code?
Topic apache-pig correlation beginner apache-hadoop
Category Data Science