Suggestions on what patterns/analysis to derive from Airlines Big Data

I recently started learning Hadoop and found this data set: http://stat-computing.org/dataexpo/2009/the-data.html (2009 data).

I would like some suggestions on what kinds of patterns or analyses I can do with Hadoop MapReduce. I just need something to get started with. If anyone has a link to a better data set for learning, please share it.

The attributes are as follows:

1   Year    1987-2008
2   Month   1-12
3   DayofMonth  1-31
4   DayOfWeek   1 (Monday) - 7 (Sunday)
5   DepTime actual departure time (local, hhmm)
6   CRSDepTime  scheduled departure time (local, hhmm)
7   ArrTime actual arrival time (local, hhmm)
8   CRSArrTime  scheduled arrival time (local, hhmm)
9   UniqueCarrier   unique carrier code
10  FlightNum   flight number
11  TailNum plane tail number
12  ActualElapsedTime   in minutes
13  CRSElapsedTime  in minutes
14  AirTime in minutes
15  ArrDelay    arrival delay, in minutes
16  DepDelay    departure delay, in minutes
17  Origin  origin IATA airport code
18  Dest    destination IATA airport code
19  Distance    in miles
20  TaxiIn  taxi in time, in minutes
21  TaxiOut taxi out time in minutes
22  Cancelled   was the flight cancelled?
23  CancellationCode    reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24  Diverted    1 = yes, 0 = no
25  CarrierDelay    in minutes
26  WeatherDelay    in minutes
27  NASDelay    in minutes
28  SecurityDelay   in minutes
29  LateAircraftDelay   in minutes

Thanks

Topic map-reduce apache-hadoop

Category Data Science


There really is no wrong answer here, but I recommend predicting flight cancellations (#22) and/or delays (#25-29), since that is how I most often see this data set used. It could also be of practical use to you if you ever find yourself flying to or departing from one of the worst-offending airports, or flying with one of the worst-offending airlines.
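Before attempting prediction, a simple aggregation is a reasonable first exercise: cancellation rate and average arrival delay per carrier. The sketch below is plain Python that only imitates the map, shuffle, and reduce steps in memory; it is not Hadoop code. It assumes the 29-column order listed in the question and that missing values are written as "NA". The helper names (`map_record`, `reduce_values`, `run_job`, `make_row`) and the sample rows are made up for illustration.

```python
import csv
from collections import defaultdict

# 0-based column positions, assuming the 29-column order from the question.
UNIQUE_CARRIER = 8   # attribute 9
ARR_DELAY = 14       # attribute 15
CANCELLED = 21       # attribute 22


def map_record(fields):
    """Mapper: emit (carrier, (flights, cancelled, delay_sum, delay_count))."""
    carrier = fields[UNIQUE_CARRIER]
    cancelled = 1 if fields[CANCELLED] == "1" else 0
    try:
        delay = float(fields[ARR_DELAY])
    except ValueError:
        # Assumption: a missing delay is encoded as a non-numeric token such as "NA".
        return carrier, (1, cancelled, 0.0, 0)
    return carrier, (1, cancelled, delay, 1)


def reduce_values(values):
    """Reducer: sum the partial tuples for one carrier element-wise."""
    totals = [0, 0, 0.0, 0]
    for value in values:
        for i, part in enumerate(value):
            totals[i] += part
    return tuple(totals)


def run_job(lines):
    """Simulate map, shuffle (group by key), and reduce over CSV lines."""
    grouped = defaultdict(list)
    for fields in csv.reader(lines):
        # Skip a header row and any row that is too short to index safely.
        if len(fields) < 29 or fields[0] == "Year":
            continue
        key, value = map_record(fields)
        grouped[key].append(value)

    result = {}
    for carrier, values in grouped.items():
        flights, cancelled, delay_sum, delay_count = reduce_values(values)
        result[carrier] = {
            "flights": flights,
            "cancel_rate": cancelled / flights,
            "avg_arr_delay": delay_sum / delay_count if delay_count else None,
        }
    return result


def make_row(carrier, arr_delay, cancelled):
    """Build one made-up CSV line; every other column is left as "NA"."""
    row = ["NA"] * 29
    row[0] = "2008"
    row[UNIQUE_CARRIER] = carrier
    row[ARR_DELAY] = arr_delay
    row[CANCELLED] = cancelled
    return ",".join(row)


# Invented sample rows, not taken from the real data set.
SAMPLE = [
    make_row("WN", "10", "0"),
    make_row("WN", "NA", "1"),
    make_row("AA", "-4", "0"),
    make_row("AA", "20", "0"),
]

summary = run_job(SAMPLE)
for carrier in sorted(summary):
    print(carrier, summary[carrier])
```

The mapper emits partial sums and counts rather than averages so that the reducer only has to add tuples together. That keeps the reduce step associative, which is what would let the same logic be reused as a combiner in a real MapReduce job.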

I'm not sure whether you have a choice (perhaps your employer requires it), but I would not use MapReduce: in my experience it is difficult to learn and maintain, it is slow, and it has largely been superseded by newer frameworks. Use something like Spark's MLlib (http://spark.apache.org/docs/latest/mllib-guide.html) instead. It is much easier to use and much more current.
