Predicting the performance of US Airline carriers

Presentation transcript:

Predicting the Performance of US Airline Carriers
Applied Business Analytics & Business Intelligence (BIT 5534)
Submitted by: Group 4, Akash Yadav & Suresh Malhotra

Agenda
1) Problem Definition
2) Data Preparation
3) Data Exploration
4) Modeling & Analysis
5) Model Selection & Comparison
6) Recommendation
7) Customized Models & Future Work
Virginia Tech

Business Problem
Problem: Insufficient information is available on flight carriers and their flights.
Traveler's concern: Which flight carrier is better, and what are the chances of a flight delay?
Our goal:
- Predict flight carriers' future performance based on delay time.
- Determine the main causes of flight delay and suggest improvements.
What's in it for customers or travelers:
- Make flight reservations with confidence for time-critical business meetings.
- Choose connecting flights and airports more easily, ensuring better service.
- Educate themselves to make better decisions when making a flight reservation.
Data source: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?pn=1
Define Problem | Data Preparation | Data Exploration | Modeling | Model Comparison & Selection | Recommendation

Data Definition & Preparation
Preparation steps:
1) Check for missing values and remove the cases that have them.
2) Scrub off outliers.
3) Split the data into training and validation sets.
4) Data exploration (next step).
5) Multivariate analytical models.
Variable Dictionary
Attribute | Data Type | Description
year | Nominal | Time horizon of data.
month | Nominal | Period of year.
carrier | Nominal | Airline company code.
carrier Name | Nominal | Name of airline carrier.
airport | Nominal | Airport code.
airport Name | Nominal | Name of airport.
arr-flights | Continuous | No. of flights arriving.
carrier_ct | Continuous | Flights delayed due to the airline's own issues.
weather_ct | Continuous | Flights delayed due to weather issues.
nas_ct | Continuous | Flights delayed due to National Aviation System issues/problems.
security_ct | Continuous | Flights delayed due to security issues.
late_aircraft_ct | Continuous | Flights delayed due to a previous flight's delay.
arr_delay | Continuous | Delay time (target variable).
Independent variables: carrier_ct, weather_ct, nas_ct, security_ct, late_aircraft_ct.
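
The preparation steps can be sketched in a few lines of Python. This is only an illustration, not the procedure used for the slides: the 1.5 x IQR fence for outliers, the 25% validation share, and the function names are all assumptions, since the deck does not say how outliers were identified or how the data was split.

```python
import random
import statistics

def drop_missing(rows, fields):
    """Keep only the rows in which every listed field has a value."""
    return [r for r in rows if all(r.get(f) is not None for f in fields)]

def remove_outliers(rows, field, k=1.5):
    """Drop rows whose `field` lies outside the IQR fences (assumed rule)."""
    q1, _, q3 = statistics.quantiles([r[field] for r in rows], n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [r for r in rows if low <= r[field] <= high]

def train_validation_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle with a fixed seed, then cut into training and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - validation_fraction)))
    return shuffled[:cut], shuffled[cut:]
```

Removing missing cases before the outlier step matters here, because the quartiles cannot be computed on rows that lack the field.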

Data Exploration
1) carrier_ct, nas_ct & late_aircraft_ct: Better predictors of flight delay (red boxes).
Reason: Small variation from the mean (red trend lines) and a more compact spread of data between arr_delay and these variables.
2) weather_ct & security_ct: Weakest contributors to the prediction objective.
Reason: Large variation from the mean (red trend lines) and a more scattered pattern of data between arr_delay and these variables.
3) Little or no redundancy.
Reason: The plots among the independent variables are scattered and widespread, which indicates weak correlation. (This is a good sign, as it supports the independence of the variables.)
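
The visual reading of the scatter plots can be backed by a number. The sketch below computes the Pearson correlation of each predictor with the target and ranks the predictors by its absolute value. The slides rely on plots only, so this is an assumed complement, and the helper names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_predictors(columns, target):
    """Return (name, r) pairs sorted by |r| with the target, strongest first.

    `columns` maps a predictor name (e.g. "carrier_ct") to its list of values.
    """
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
```

The same `pearson` function applied to pairs of predictors gives a quick redundancy check: values near zero correspond to the scattered plots described in point 3.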

Models & Analysis
Linear Regression
- Used a linear combination of 5 independent variables to predict the target.
- Plot: predicted values (from the regression) of avg. flight delay against actual values of delay.
Principal Component Analysis (PCA)
- Created new variables from the given variables.
- 2 new variables are sufficient for analysis, as they cover 79% of the variation in the data.
Cluster Analysis
- Better for data mining. Clusters data with similar attributes/characteristics.
- Hierarchical: estimated no. of clusters = 20.
- K-means: optimal number of clusters = 21.
Decision Tree
- Creates groups of data with similar attributes.
- Data split is based on threshold values of the variables.
- No. of splits = 35.
Neural Network
- Created a hidden layer of new variables which receive data from the current variables.
- Uses a regression-like approach and gives prediction values as output.
- Plot: output vs. actual values of delay.
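
As a rough illustration of the linear regression step, the sketch below fits an intercept and one coefficient per predictor by ordinary least squares with NumPy. The use of NumPy and the function names are assumptions; the deck refers to a Profiler, which suggests a statistics package rather than hand-written code.

```python
import numpy as np

def fit_linear(X, y):
    """Fit y ~ intercept + X @ coefs by ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]

def predict(intercept, coefs, X):
    """Predicted target values for the rows of X."""
    return intercept + np.asarray(X, dtype=float) @ coefs
```

Here each row of `X` would hold the five independent variables (carrier_ct, weather_ct, nas_ct, security_ct, late_aircraft_ct) and `y` the corresponding arr_delay.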

Models & Analysis (continued)
R2: A higher value means the model accounts for, or explains, most of the variation in the data, which is important. (Refer to the plots on the slide.)
RMSE (root mean square error): A low RMSE means the predicted values are close to the actual values, i.e. the deviation from the actual value is small.
Note 1: All the models perform quite well, as is evident from the high R2 and low RMSE values.
Note 2: Performance is consistent on both the training and validation sets, which suggests that the data preparation was good and the models are acceptable.
Linear Regression & Neural Network models - best: Because R2 is already high, a 5-6% increase matters a lot. RMSE is also lower for these two models than for the others.
Modeling Technique | Modeling Highlight | Training RSquare (%) | Training RMSE | Validation RSquare (%) | Validation RMSE
Linear Regression | 5 Independent Variables | 95.5 | 1851 | 95.6 | 1805
Principal Component Analysis | 2 Principal Components | 90.5 | 2688 | 90.8 | 2617
Cluster Analysis | 21 Clusters | 87.8 | 3053 | 87.5 | 3042
Decision Trees | 35 Splits | 90.9 | 2639 | 90.3 | 2694
Neural Networks | 1 Hidden Layer, 6 Nodes, Learning Rate = 0.1, Transform Covariates | 96.2 | 1703 | 96.02 | 1728
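
For reference, the two metrics can be computed as below, using the standard definitions RMSE = sqrt(mean((actual - predicted)^2)) and R2 = 1 - SS_res / SS_tot. This is a minimal sketch with hypothetical function names; `r_squared` returns a fraction, whereas the slides appear to report R2 as a percentage.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by `predicted` (0 to 1 for a good fit)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

Computing both metrics separately on the training and validation sets gives the kind of side-by-side comparison the slide uses to judge consistency.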

Neural Network Model - Selected
[Figure: profiler plots for the linear regression and neural network models]
Why the neural network?
- Compared with the linear regression model, the neural network model is slightly better on R2 and RMSE.
- The profiler shows that the variables' variation profiles match the desired profiles much more closely for the neural network model.

Recommendation & Key Points
Good to know:
- Although we have focused on delay time, the model has the potential to predict flight cancellations and other performance metrics.
How to ensure an efficient model:
- It's critical to explore and prepare the data carefully.
- Treating missing values and removing outliers is of utmost importance for model stability.
- The model should be built on the training data and then tested on the validation data set. (Modify the model if needed.)
What else could be done or added to the scope:
- Include additional predictor variables, such as the distance between arrival and departure airports, air traffic, etc.
- Use data from different sources and different time horizons.
- Focus on specific airports or air carriers.
- Use forecasting models for prediction.

Customized Models and Future Work
Code | Name | R2 | RMSE | Logworth | Comment & LogWorth
Air Carrier
AA | American Airlines | 97.38 | 1554 | A - 125, B - 152, C - 181 | All significant. Security - 4
DL | Delta Airways | 95.79 | 1568 | A - 555, B - 334, C - 135 | All significant. Security - 3.4
WN | Southwest Airlines | 95.68 | 1485 | | All significant. Security - 4.2
AS | United Airlines | 98.07 | 2554 | A - 737, B - 409, C - 114 | Security_ct insignificant.
B6 | JetBlue | 96.82 | 1784 | A - 318, B - 165, C - 71 | All significant. Security - 20
Busiest Airport
ATL | Atlanta, GA | 96.72 | 1378 | A - 190, B - 179, C - 55 |
LAX | Los Angeles, CA | 97.19 | 2002 | A - 344, B - 246, D - 123 | All significant. Security - 31
ORD | Chicago, IL | 96.07 | 2819 | A - 177, B - 145, D - 31 | Weather and security - Low
DFW | Dallas, TX | 97.12 | 776 | A - 95, B - 480, D - 126 |
JFK | New York, NY | 96.10 | 2287 | A - 156, B - 125, D - 55 | All significant. Security - 4.9
Best Airport (US)
SLC | Salt Lake City | 97.6 | 1157 | A - 201, B - 292, D - 135 | All significant. Security - 17
DCA | Washington | 95.4 | 1179 | A - 118, B - 235, C - 117 | All significant. Security - 2.2
SEA | Seattle-Tacoma | 97.5 | 1026 | A - 228, B - 290, D - 128 | All significant. Security - 1.4
PDX | Portland | 95.3 | 810 | A - 157, B - 283, D - 123 | All significant. Security - 4.5
MSP | Minneapolis | 97.1 | 1297 | A - 180, B - 221, D - 64 | All significant. Security - 4.7
Delta & Airports
| | 96.28 | 1120 | A - 18, B - 23, D - 10 | Security insignificant
| | 88.29 | 1065 | A - 25, B - 3.7, D - 4.4 | Security - Zeroed
| | 94.5 | 1868 | A - 7.5, B - 17, D - 8.4 | All significant.
| | 88.28 | 1140 | A - 11, B - 5.6, D - 10 |
| | 98.4 | 2183 | B - 4.8, C - 8.1, D - 7.4 | Nas_ct insignificant.
A = nas_ct; B = late_aircraft_ct; C = weather_ct; D = carrier_ct
The Dashboard Story (Future Scope of Work)
- Individual models are to be prepared for each flight carrier, each airport, and each combination of flight carrier + airport, for monthly predictions.
- Predictions from the models will be used to provide information to travelers.
- The models will report the current performance of flight carriers and airports based on historical data.
- Interactive visual information delivery is the goal!

Thank You Virginia Tech