Predicting Flight Delays
Washington DC Area Airports
BIT 5534 - Applied Business Intelligence and Analytics - Spring 2017
Group 3: Alexandra Robleto, Caitlin Fernandez, Lucas Cameron, Kevin Sherman
Project Summary
  Establish the business need
  Collect data
  Understand the flight data
  Prepare data for modeling
  Create predictive models
  Measure the models' ability to predict delays
  Evaluate findings
  Make recommendations
Data
  Data source
    https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
  Data content
    Calendar year 2016
    All flights from Washington DC area airports: BWI, IAD, and DCA
  Data understanding
    Available variables
    Relationships
  Data preparation
    Missing values
    Outliers
    Redundant variables
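The three data preparation steps listed above can be sketched in a few lines. This is only an illustration, not the procedure the group used: the column names `ARR_DELAY` and `FLIGHTS`, the z-score rule, and the cutoff of 3 are all assumptions made for the example.

```python
from statistics import mean, stdev


def prepare(rows, target="ARR_DELAY", redundant=("FLIGHTS",), z_cutoff=3.0):
    """Drop rows missing the target, drop redundant columns, flag outliers.

    `rows` is a list of dicts, one per flight. The column names and the
    z-score cutoff are illustrative assumptions, not taken from the slides.
    """
    # 1. Missing values: keep only rows where the target is present.
    kept = [r for r in rows if r.get(target) not in (None, "")]

    # 2. Redundant variables: copy each row without the unwanted columns.
    cleaned = [{k: v for k, v in r.items() if k not in redundant} for r in kept]

    # 3. Outliers: flag (rather than delete) values far from the mean.
    values = [float(r[target]) for r in cleaned]
    mu = mean(values) if values else 0.0
    sigma = stdev(values) if len(values) > 1 else 0.0
    for row, value in zip(cleaned, values):
        row["is_outlier"] = sigma > 0 and abs(value - mu) / sigma > z_cutoff
    return cleaned
```

Flagging outliers instead of deleting them keeps the decision reversible; whether extreme delays should be removed before modeling depends on the business question.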
Preliminary findings – Average delays
  Washington Reagan (DCA) had higher average delays than the other airports
  Certain airlines experienced more delays, and which airlines those were differed by airport
  Delays were highest in the summer months of June and July, followed by December
Preliminary findings – Average delays by time of day
  Flights departing before noon tend to arrive early
  Delays tend to get worse after noon
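A comparison like the one above comes from grouping flights by departure time and averaging arrival delay within each group. The sketch below assumes each flight is a pair of (departure time as an `hhmm` integer, arrival delay in minutes, negative meaning early); that layout and the two period labels are assumptions for illustration.

```python
from collections import defaultdict


def average_delay_by_period(flights):
    """Average arrival delay for departures before noon vs. noon or later.

    `flights` is an iterable of (dep_time_hhmm, arr_delay_minutes) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # period -> [sum of delays, count]
    for dep_time, arr_delay in flights:
        period = "before noon" if dep_time < 1200 else "noon or later"
        totals[period][0] += arr_delay
        totals[period][1] += 1
    return {period: s / n for period, (s, n) in totals.items()}
```

A negative average for a period means flights in that period arrived early on average.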
Preliminary findings – Total delays by reason
  Late aircraft was the most common cause of flight delays in 2016, followed by carrier delays
  Security issues were the least likely cause of delays, and weather was not an important cause either
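Ranking delay causes amounts to summing the minutes attributed to each cause across all flights. The sketch below assumes one numeric column per cause; the column names follow the style of the BTS on-time table but should be checked against the actual download before use.

```python
# Assumed column names, one per delay cause (verify against the real file).
CAUSES = (
    "CARRIER_DELAY",
    "WEATHER_DELAY",
    "NAS_DELAY",
    "SECURITY_DELAY",
    "LATE_AIRCRAFT_DELAY",
)


def total_delay_by_cause(rows):
    """Return (cause, total minutes) pairs, largest total first.

    Blank or missing values are treated as zero minutes.
    """
    totals = {cause: 0.0 for cause in CAUSES}
    for row in rows:
        for cause in CAUSES:
            totals[cause] += float(row.get(cause) or 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```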
Preliminary findings – Canceled flights
  While weather was not an important cause of delay, it did contribute to most flight cancellations, especially in January 2016.
  Southwest Airlines canceled the most flights, but it also operated the most flights.
  Delta Air Lines, on the other hand, had few canceled flights relative to its number of flights.
Preliminary findings – Diverted flights
  Summer months had the highest delays and also the most diverted flights, regardless of the airport.
Predictive modeling process
  Training and Validation
  Logistic Regression
  Classification Tree
  Neural Network
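The training and validation step can be illustrated with a simple random split. The slides do not state the split ratio or the tool used, so the 30% validation fraction and the fixed seed below are assumptions for the sketch.

```python
import random


def train_validation_split(records, validation_fraction=0.3, seed=42):
    """Shuffle records reproducibly and split them into two lists.

    Returns (training, validation). The fraction and seed are illustrative
    defaults, not values taken from the project.
    """
    shuffled = list(records)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_valid = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]
```

Each of the three models would be fitted on the training list and compared on the validation list, so that the comparison uses flights the models have not seen.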
Evaluation of predictive models
  Receiver Operating Characteristic (ROC) curve
    The closer it gets to the top left corner, the better
  Area Under the Curve (AUC)
    The closer to one, the better
  LR: Logistic Regression model
  DT: Decision (or classification) Tree model
  Neural: Neural Network model
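One way to see why an AUC closer to one is better: AUC equals the probability that a randomly chosen delayed flight receives a higher predicted score than a randomly chosen on-time flight. The sketch below computes it directly from that definition; the labels (1 = delayed, 0 = on time) and scores are made-up inputs, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive (label 1) outranks a negative (label 0);
    ties count as half a win. Quadratic in the number of rows.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

A model that scores at random gives an AUC near 0.5, which corresponds to an ROC curve along the diagonal.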
Evaluation of predictive models
  Fit or Accuracy
    Rsquare: the higher (closer to one), the better
    Misclassification Rate: the lower, the better
  Lift curves
    Model performance as opposed to guessing
    The higher, the better
  LR: Logistic Regression model
  DT: Decision (or classification) Tree model
  Neural: Neural Network model
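The misclassification rate and lift can both be computed from predicted scores and actual outcomes. The sketch below is illustrative only: the 0.5 decision threshold and the choice to measure lift in the top 10% of scored flights are assumptions, and labels use 1 for delayed and 0 for on time.

```python
def misclassification_rate(labels, scores, threshold=0.5):
    """Share of flights whose predicted class differs from the actual class."""
    wrong = sum((s >= threshold) != (y == 1) for y, s in zip(labels, scores))
    return wrong / len(labels)


def lift_at(labels, scores, fraction=0.1):
    """Delay rate among the top-scored fraction divided by the overall rate.

    A lift of 1.0 means the model does no better than guessing.
    Assumes at least one delayed flight, otherwise the base rate is zero.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate
```

Plotting `lift_at` for fractions from 0.1 to 1.0 traces out a lift curve, which always ends at 1.0 when the whole data set is included.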
Conclusion and recommendations
  Best model based on evaluation techniques
    Classification Tree
  How the model and insights address the business need
    Possible delays identified based on flight booking information
    Alternative flights presented
  Ways to improve the model
    Include more inputs
    Increase the amount of data