STAT 689 Class Project


STAT 689 Class Project: To Tip or not to Tip – that's the question! Predicting Chicago taxi fares and tips. Presented by Abhilash Tangadpalliwar & Debapriyo Paul

Our Agenda: 1. Introduction  2. Data Information  3. Data Cleaning  4. Data Exploration  5. Model Building  6. Model Validation  7. Strengths & Challenges  8. Key Learnings

Introduction: In November 2016, the City of Chicago released a public dataset containing information on over 100 million taxi rides since 2013 (https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew/data). The dataset does not include any trips from rideshare services like Uber and Lyft, but in 2015 the Chicago taxi-owners association claimed that Uber and Lyft had cost them 30-40% of their business. Uber and Lyft started operating in Chicago in 2011 and 2013, respectively.

Data Information – Fields: Taxi ID, Trip ID, Trip Start and End Time, Trip Duration, Trip Distance, Fare, Payment Type, Taxi Company, Pickup & Dropoff Location, etc. Limitations: trips are not reported in real time; Taxi IDs are masked; exact pickup and dropoff locations are unknown (locations are available only at the census-tract and community-area level); census tracts are missing for about a quarter of the trips.

Data Preprocessing: cleaning issues come both from the source and from our side. Steps: handling erroneous values (e.g., trip duration), removing duplicates and redundant fields, parsing pickup and drop-off timestamps, and replacing the Trip ID with an index. The result is the dataset used for prediction. If you don't pre-process, you re-process.
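A rough sketch of these cleaning steps in pandas (the column names trip_seconds, trip_start_timestamp, trip_end_timestamp, and trip_id are assumptions about the schema, and the duration cut-offs are purely illustrative):

```python
import pandas as pd

def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from this slide to one chunk of trip records."""
    # Remove exact duplicate rows.
    df = df.drop_duplicates()

    # Handle erroneous values, e.g. non-positive or implausibly long trip durations.
    df = df[(df["trip_seconds"] > 0) & (df["trip_seconds"] < 6 * 3600)]

    # Parse pickup and drop-off timestamps into proper datetime columns.
    for col in ["trip_start_timestamp", "trip_end_timestamp"]:
        df[col] = pd.to_datetime(df[col], errors="coerce")

    # Replace the long Trip ID string with a compact integer index.
    df = df.drop(columns=["trip_id"]).reset_index(drop=True)
    return df
```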

Data Loading: data exploration and modeling were performed in Google Colaboratory, which provides GPU acceleration and avoids holding the ~40 GB dataset in the local PC's memory. Pipeline: read the data into Python in chunks, convert it to a SQLite3 database, query the data, and split it into smaller CSVs for analysis in Google Colab.
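A minimal sketch of this chunked-load pipeline, assuming a single large CSV export from the data portal (the file names, chunk size, and timestamp string format are assumptions, not details given on the slide):

```python
import sqlite3
import pandas as pd

CSV_PATH = "Taxi_Trips.csv"        # ~40 GB export from the Chicago data portal (assumed name)
DB_PATH = "chicago_taxi.sqlite3"

con = sqlite3.connect(DB_PATH)

# Read the CSV in chunks so the full ~40 GB file never has to fit in memory,
# appending each chunk to a SQLite table.
for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
    chunk.to_sql("trips", con, if_exists="append", index=False)

# Query only what is needed and write smaller per-year CSVs for analysis.
# (Assumes timestamps are stored as strings that start with the year; the real
# export uses a different format, so the WHERE clause would need adjusting.)
for year in (2013, 2014, 2015, 2016):
    query = f"SELECT * FROM trips WHERE trip_start_timestamp LIKE '{year}-%'"
    pd.read_sql_query(query, con).to_csv(f"trips_{year}.csv", index=False)

con.close()
```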

Data Exploration – Figure: Chicago taxi trip counts by year (2013-2016); the number of trips decreases through the years.

Data Exploration – Figure: Average taxi fares by year (2013-2016); fares increase through the years.

Data Exploration – Figure: Typical distribution of payment types for taxi fares: Cash 56%, Credit Cards 42%, Prepay Cards & Vouchers 2%.

Data Exploration – Figure: Histogram of trip counts by trip distance; most trips are under 5 miles.

Data Exploration – Figures: Hour-wise trips on a typical day (peaks at 8-10 am and 5-8 pm) and day-wise trips on a typical week (TGIF: Friday is busiest).

Data Exploration – Figure: Heatmap of trips by day and hour; hotspots on Friday evening and around weekend midnights.

Data Exploration – Figure: Community-area-wise pickups (2013 vs 2016); pickups concentrate downtown and at the airports.

Data Exploration – Figure: Market share of taxi companies over the years (2013-2016).

Data Exploration – Figure: Community-area-wise pickups for the top 5 taxi companies (2013 vs 2016).

Data Exploration – Figure: Community-area-wise pickups for KOAM Taxi Association (2013 to 2016).

Model Building – Part 1: the goal is to predict "Fare". Models: linear regression and random forest.

Regression Results: R² = 81%. Regression equation (normalized): y = 0.4678·(trip_seconds) + 0.2904·(trip_miles) + 0.0214·(pickup_community) + 0.0152·(dropoff_community) + 0.0015·(day_name) - 0.0026·(hour). Records where Tip% > 100% were removed. Figure: Scatterplot of predicted vs. actual responses (normalized).
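A sketch of how such a normalized regression could be fit with scikit-learn, continuing from the cleaned DataFrame df in the earlier sketch (the feature list follows the equation above; day_name and hour are assumed to be numerically encoded already, and the exact preprocessing the authors used is not shown on the slide):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

features = ["trip_seconds", "trip_miles", "pickup_community",
            "dropoff_community", "day_name", "hour"]

X = df[features].to_numpy(dtype=float)   # df: cleaned trips table from the earlier sketch
y = df["fare"].to_numpy(dtype=float)

# Normalize predictors and response so coefficients are comparable,
# matching the normalized equation on the slide.
X = MinMaxScaler().fit_transform(X)
y = MinMaxScaler().fit_transform(y.reshape(-1, 1)).ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", lr.score(X_test, y_test))
print(dict(zip(features, lr.coef_.round(4))))
```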

Tuned Random Forest Regressor: R² = 92%. Feature importance (tuned random forest): Trip Seconds, Trip Miles, Dropoff Area, Pickup Area, Hour, Day. Records where Tip% > 100% were removed. Figures: Scatterplot of predicted vs. actual responses (normalized); feature importance chart.
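A corresponding sketch for the random forest regressor and its feature importances, reusing X_train, y_train, and features from the regression sketch above (the hyperparameter values here are placeholders, not the actual tuned settings):

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,        # placeholder values; the tuned settings are not given on the slide
    min_samples_leaf=5,
    n_jobs=-1,
    random_state=0,
)
rf.fit(X_train, y_train)
print("R^2 on held-out data:", rf.score(X_test, y_test))

# Rank features by importance, as in the feature-importance chart.
for name, imp in sorted(zip(features, rf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f"{name:20s} {imp:.3f}")
```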

Model Building – Part 2: the attribute of interest is Tips. Tips contribute a major share of taxi drivers' take-home income. Which factors affect tips?

Tips – how do they look? Average tip ≈ 25%. Records where Tip% > 100% were removed. Figure: Histogram of % tip value.
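A short sketch of how the tip percentage could be derived and plotted from the cleaned DataFrame df used above (the column names tips and fare are assumptions about the schema):

```python
import matplotlib.pyplot as plt

# Tip as a percentage of the fare; drop zero fares and tips above 100% as described.
df = df[df["fare"] > 0].copy()
df["tip_pct"] = 100 * df["tips"] / df["fare"]
df = df[df["tip_pct"] <= 100]

df["tip_pct"].plot.hist(bins=50)
plt.xlabel("Tip (% of fare)")
plt.ylabel("Number of trips")
plt.title("Histogram of % tip value")
plt.show()
```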

Tips – how do they look? The tip-percentage distributions for O'Hare pickups and for the rest of Chicago are significantly different (p-value ≈ 0). Figure: Histogram of % tip value for O'Hare vs. rest-of-Chicago pickups.
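The slide does not name the test used to compare the two distributions; a two-sample Kolmogorov-Smirnov test is one way to do it (the community-area number 76 for O'Hare and the column names are assumptions):

```python
from scipy.stats import ks_2samp

OHARE_AREA = 76  # O'Hare's community-area number in the Chicago data (assumption)

ohare_tips = df.loc[df["pickup_community"] == OHARE_AREA, "tip_pct"]
other_tips = df.loc[df["pickup_community"] != OHARE_AREA, "tip_pct"]

stat, p_value = ks_2samp(ohare_tips, other_tips)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.2e}")
```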

Tips or Tip %? Hypothesis: tip percentage has almost no linear relationship with the other continuous variables. We therefore hypothesize that a random forest will give better predictions than logistic regression. Figure: Relationship heatmap between variables.
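A sketch of how such a relationship heatmap could be produced from pairwise correlations (the column list is an assumption; seaborn is available in Colab by default):

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["fare", "tips", "tip_pct", "trip_seconds", "trip_miles", "hour"]
corr = df[num_cols].corr()   # df: cleaned trips table from the earlier sketches

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between continuous variables")
plt.show()
```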

Figure: Pair plot between variables. Since there is almost no linear relationship, we again expect the random forest to outperform logistic regression.

Data Imbalance: only 4% of records have zero tips (a 4%-96% imbalance), which poses a challenge for classification. Undersampling: we separated the 4% zero-tip records from the rest of the data and sampled observations from the remaining 96% in a 3:1 ratio to the 4% subset; one year of representative data was used for training (a rough sketch of this sampling follows below). The initial training data is 80% of the actual data and the test data is the remaining 20%.
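A rough sketch of the undersampling scheme under the assumptions above: keep all zero-tip trips and draw three tipped trips for each one (the exact sampling details and feature list in the project may differ):

```python
import pandas as pd

zero_tip = df[df["tips"] == 0]       # the ~4% minority class
tipped = df[df["tips"] > 0]          # the ~96% majority class

# Sample the majority class at a 3:1 ratio to the minority class, then shuffle.
majority_sample = tipped.sample(n=3 * len(zero_tip), random_state=0)
train = pd.concat([zero_tip, majority_sample]).sample(frac=1, random_state=0)

y_train_cls = (train["tips"] > 0).astype(int)   # 1 = tipped, 0 = no tip
X_train_cls = train[["fare", "trip_seconds", "trip_miles",
                     "pickup_community", "dropoff_community", "hour"]]
```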

Classification Models and their Accuracy: tuned RF 68%, LDA 60%, untuned logistic regression 57%, untuned RF 54%. Random forest parameters varied during tuning: n_estimators (maximum number of trees), the number of features to consider at every split, the maximum number of levels in a tree, the minimum number of samples required to split a node, the minimum number of samples required at each leaf node, and the method of selecting samples for training each tree. Tuning was performed over more than 4000 settings (see the sketch below).
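One way to search over such a grid is scikit-learn's RandomizedSearchCV; the grid values below are illustrative rather than the 4000+ settings actually tried, and X_train_cls / y_train_cls come from the undersampling sketch above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "n_estimators": [100, 200, 400, 800],     # maximum number of trees
    "max_features": ["sqrt", "log2", None],   # features considered at every split
    "max_depth": [10, 20, 40, None],          # maximum number of levels in a tree
    "min_samples_split": [2, 5, 10],          # samples required to split a node
    "min_samples_leaf": [1, 2, 4],            # samples required at each leaf node
    "bootstrap": [True, False],               # how samples are drawn for each tree
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_distributions=param_grid,
    n_iter=50,            # a subset of the full grid; increase for a wider search
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train_cls, y_train_cls)
print(search.best_params_, search.best_score_)
```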

Figure: Feature importance (tuned random forest classifier): Fare, Hour, Miles, Duration, Dropoff area, Pickup area, Day of the week.

Strengths & Limitations. Strengths: YES to Big Data – our methodology can analyse huge datasets, even ~40 GB ones; Collinearity is not a curse – the RF model's performance is unhindered by non-linear relationships; Highly performant – easy implementation, validation, and testing. Limitations: Computation time – significantly high even on a GPU-accelerated system; Data inadequacies – route prediction is not possible due to the lack of relevant information, etc.

Key Learnings: The 80/20 rule – do not underestimate data preprocessing. Predictive modeling is like a sandbox – you can (attempt to) predict anything, it requires creativity, so go for it. It can change your point of view – perhaps disproving something still carries value. Taxi drivers will thank you!

Thank You! #BTHOFinals P.S. Don't forget to tip the cabbie ;)