Competition II: Springleaf. Sha Li (Team leader), Xiaoyan Chong, Minglu Ma, Yue Wang. CAMCOS Fall 2015, San Jose State University.

Agenda
- Kaggle Competition: Springleaf dataset introduction
- Data preprocessing
- Classification methodologies and results: Logistic Regression, Random Forest, XGBoost, Stacking
- Summary and conclusion

Kaggle Competition: Springleaf
Objective: predict whether customers will respond to a direct mail loan offer.
- Customers: 145,231
- Independent variables: 1,932 "anonymous" features
- Dependent variable: target = 0 (did not respond), target = 1 (responded)
- Training set: 96,820 obs.; testing set: 48,411 obs.

Dataset facts
- R package used to read the file: data.table::fread
- Target = 0: 111,458 obs. (76.7%); Target = 1: 33,773 obs. (23.3%)
- Numerical variables: 1,876; character variables: 51; constant variables: 5
- Variable level counts: 67.0% of columns have <= 100 levels
[Charts: count of levels for each column; class 0 and 1 counts]
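
A minimal sketch of loading the data with data.table::fread, the package named above. The file name "train.csv" and the choice to map the placeholder codes to NA at read time are assumptions, not stated on the slides.

```r
# Sketch: read the Springleaf file with data.table::fread.
# File name and na.strings choice are assumptions.
library(data.table)

train <- fread("train.csv", na.strings = c("", "NA", "[]", "-1"))
dim(train)            # roughly 145,231 rows x 1,934 columns (1,932 features plus ID and target)
table(train$target)   # class counts for target = 0 / 1
```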

Missing values
- "" and "NA": 0.6%
- "[]" and -1: 2.0%
- Sentinel codes (96, ..., 999, ...): 24.9%
- 25.3% of columns have missing values
[Charts: count of NAs in each column; count of NAs in each row]
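
A sketch of the missing-value audit behind these percentages, assuming the `train` table from the fread sketch. The set of placeholder codes comes from the slide, but treating every occurrence of them as missing is an assumption.

```r
# Count how many columns contain missing or placeholder values.
na_like <- function(v) is.na(v) | v %in% c("", "NA", "[]", -1)
col_na  <- sapply(train, function(v) mean(na_like(v)))
mean(col_na > 0)                        # share of columns with missing values (~25% per the slide)
sort(col_na, decreasing = TRUE)[1:10]   # columns with the highest missing rates
```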

Challenges for classification
- Huge dataset (145,231 x 1,932)
- "Anonymous" features
- Uneven distribution of the response variable
- 27.6% missing values
- Both numerical and categorical variables, with an undetermined share of categorical variables
- Data preprocessing complexity

Data preprocessing
- Remove ID and target columns
- Replace "[]" and -1 with NA
- Handle NA: replace by the median, replace randomly, or regard NA as a new group
- Remove duplicate columns
- Replace (encode) character columns
- Remove low-variance columns
- Normalize: log(1 + |x|)
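
A rough sketch of the steps listed above. The ID and target column names are from the slides; the sentinel codes, variance threshold, random seed, and the two-thirds split (matching the slides' 96,820 / 48,411) are illustrative assumptions.

```r
# Rough preprocessing sketch; thresholds and codes are assumptions.
X <- as.data.frame(train)
X$ID <- NULL; X$target <- NULL                    # remove ID and target

for (col in names(X)) {
  v <- X[[col]]
  if (is.character(v)) {
    v[v %in% c("", "NA", "[]", "-1")] <- NA       # treat placeholders as NA
    v <- as.integer(factor(v, exclude = NULL))    # encode characters; NA becomes its own level
  } else {
    v[v == -1] <- NA
    v[is.na(v)] <- median(v, na.rm = TRUE)        # replace NA by the column median
    v <- log1p(abs(v))                            # normalize: log(1 + |x|)
  }
  X[[col]] <- v
}

X <- X[, !duplicated(as.list(X))]                 # remove duplicate columns
X <- X[, sapply(X, function(z) var(z) > 1e-8)]    # remove constant / low-variance columns

set.seed(2015)
idx <- sample(nrow(X), 96820)                     # two-thirds training split, as on the slides
```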

Principal Component Analysis
Around 400 principal components explain about 90% of the variance.
[Chart: cumulative variance explained vs. number of principal components]
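
A short PCA sketch reproducing that calculation, assuming the preprocessed numeric matrix X from the preprocessing sketch.

```r
# PCA on the preprocessed data; count components needed for ~90% variance.
pca     <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
which(cum_var >= 0.90)[1]   # number of PCs for ~90% of the variance (about 400 on this data)
```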

LDA: Linear discriminant analysis
We are interested in the most discriminatory direction, not the maximum variance: find the direction that best separates the two classes.
[Figure: two overlapping classes whose within-class variances are large and whose means µ1 and µ2 are close, so the maximum-variance direction separates them poorly]
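
For reference (not shown on the slide), the two-class Fisher criterion behind LDA chooses the direction that maximizes between-class separation relative to within-class scatter:

$$ w^{*} = \arg\max_{w} \frac{\bigl(w^{\top}(\mu_1 - \mu_2)\bigr)^{2}}{w^{\top}(\Sigma_1 + \Sigma_2)\,w}, \qquad w^{*} \propto (\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2), $$

where $\mu_1, \mu_2$ are the class means and $\Sigma_1, \Sigma_2$ the within-class covariance matrices.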

Methodology
- K Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Logistic Regression
- Random Forest
- XGBoost (eXtreme Gradient Boosting)
- Stacking

K Nearest Neighbors (KNN)
[Chart: overall accuracy and target = 1 accuracy for KNN, shown for target = 0 and target = 1]
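
An illustrative KNN run with the class package, assuming X, train$target, and the idx split from the preprocessing sketch. The value k = 5 is an assumption; the slides do not state which k was used, and distance-based methods proved poorly suited to this data (see the Summary slide).

```r
# KNN sketch (slow on the full feature set; shown only to illustrate the method).
library(class)

pred <- knn(train = X[idx, ], test = X[-idx, ],
            cl = factor(train$target[idx]), k = 5)
table(Prediction = pred, Truth = train$target[-idx])  # confusion matrix
mean(pred == train$target[-idx])                      # overall accuracy
```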

Support Vector Machine (SVM)
- Expensive; each run takes a long time
- Good results for numerical data
Accuracy: Overall 78.1%; Target = 1: 13.3%; Target = 0: 97.6%
[Confusion matrix: prediction vs. truth]
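
A minimal SVM sketch with the e1071 package, under the same assumptions as the KNN sketch. The kernel and cost are illustrative; the slides only note that SVM was expensive to run.

```r
# SVM sketch with e1071 (kernel and cost are assumptions).
library(e1071)

fit  <- svm(x = X[idx, ], y = factor(train$target[idx]),
            kernel = "radial", cost = 1)
pred <- predict(fit, X[-idx, ])
table(Prediction = pred, Truth = train$target[-idx])  # confusion matrix
```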

Logistic Regression
Logistic regression is a regression model for a categorical dependent variable. It measures the relationship between the dependent variable and the independent variables by estimating class probabilities.
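
A logistic regression sketch with base R's glm, assuming X, train$target, and idx from the preprocessing sketch. The 0.5 probability cutoff is an assumption.

```r
# Logistic regression sketch with glm.
dat  <- data.frame(X, target = train$target)
fit  <- glm(target ~ ., data = dat[idx, ], family = binomial())
prob <- predict(fit, dat[-idx, ], type = "response")
pred <- as.integer(prob > 0.5)                        # 0/1 prediction at a 0.5 cutoff
table(Prediction = pred, Truth = dat$target[-idx])    # confusion matrix
mean(pred == dat$target[-idx])                        # overall accuracy
```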

Logistic Regression results
Accuracy: Overall 79.2%; Target = 1: %; Target = 0: %
[Confusion matrix: prediction vs. truth]

Random Forest
- Machine learning ensemble algorithm: combines multiple predictors
- Based on tree models
- Works for both regression and classification
- Automatic variable selection
- Handles missing values
- Robust, improving model stability and accuracy

Random Forest (algorithm)
1. Draw bootstrap samples from the training dataset
2. Build a random tree on each bootstrap sample
3. Predict with each tree
4. Combine the predictions by majority vote
[Diagram: a random tree]
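
A random forest sketch with the randomForest package, under the same assumptions as the earlier sketches. ntree = 500 matches the tree count on the error plot; the other settings are package defaults.

```r
# Random forest sketch.
library(randomForest)

rf   <- randomForest(x = X[idx, ], y = factor(train$target[idx]), ntree = 500)
pred <- predict(rf, X[-idx, ])
table(Prediction = pred, Truth = train$target[-idx])  # confusion matrix
plot(rf)                                              # misclassification error vs. number of trees
```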

Random Forest results
Accuracy: Overall 79.3%; Target = 1: 20.1%; Target = 0: 96.8%
[Confusion matrix: prediction vs. truth. Chart: misclassification error vs. number of trees (500), for target = 1, overall, and target = 0]

XGBoost
- Additive tree model: new trees are added greedily to complement the trees already built
- The response is the optimal linear combination of all decision trees
- Popular in Kaggle competitions for its efficiency and accuracy
[Diagram: additive tree model. Chart: error vs. number of trees]
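
An XGBoost sketch with the xgboost R package, under the same assumptions as before. The learning rate, depth, and number of rounds are illustrative; the slides do not list the tuned values.

```r
# XGBoost sketch (parameters are assumptions, not the tuned values).
library(xgboost)

dtrain <- xgb.DMatrix(as.matrix(X[idx, ]),  label = train$target[idx])
dtest  <- xgb.DMatrix(as.matrix(X[-idx, ]), label = train$target[-idx])

bst  <- xgb.train(params = list(objective = "binary:logistic", eta = 0.1, max_depth = 6),
                  data = dtrain, nrounds = 300)
pred <- as.integer(predict(bst, dtest) > 0.5)
table(Prediction = pred, Truth = train$target[-idx])  # confusion matrix
```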

XGBoost results
Accuracy: Overall 80.0%; Target = 1: 26.8%; Target = 0: 96.1%
[Confusion matrix: prediction vs. truth. Chart: train error and test error]

Methods Comparison

Winner or Combination?

Stacking
Main idea: learn and combine multiple classifiers.
[Diagram: labeled data trains base learners C1, C2, ..., Cn; their predictions become meta features on which a meta learner is trained, producing the final prediction at test time]

Generating Base and Meta Learners
Base models: chosen for efficiency, accuracy, and diversity
- Sampling training examples
- Sampling features
- Using different learning models
Meta learner
- Unsupervised: majority voting, weighted averaging, k-means
- Supervised: a higher-level classifier (XGBoost)

Stacking model
[Diagram: (1) the total data is transformed into several views (total, sparse, condensed, low-level, PCA, ...); (2) base learners (XGBoost, logistic regression, random forest) are trained on these views and their predictions become meta features; (3) the meta features are combined with the data and an XGBoost meta learner produces the final prediction]
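
A compressed sketch of the stacking mechanism, assuming X, train$target, and idx from the earlier sketches. It uses only three base learners on one data view (the slide combines several views) and, for brevity, skips the out-of-fold prediction scheme a real stack needs to avoid leaking training labels.

```r
# Stacking sketch: base learner probabilities become meta features for an XGBoost meta learner.
library(xgboost); library(randomForest)

y      <- train$target
dtrain <- xgb.DMatrix(as.matrix(X[idx, ]), label = y[idx])

base_xgb <- xgb.train(list(objective = "binary:logistic"), dtrain, nrounds = 200)
base_rf  <- randomForest(X[idx, ], factor(y[idx]), ntree = 200)
base_lr  <- glm(y ~ ., data = data.frame(X, y = y)[idx, ], family = binomial())

meta_features <- function(Z) data.frame(                     # one column per base learner
  p_xgb = predict(base_xgb, xgb.DMatrix(as.matrix(Z))),
  p_rf  = predict(base_rf, Z, type = "prob")[, "1"],
  p_lr  = predict(base_lr, Z, type = "response"))

M_train <- xgb.DMatrix(as.matrix(meta_features(X[idx, ])), label = y[idx])
meta    <- xgb.train(list(objective = "binary:logistic"), M_train, nrounds = 100)
pred    <- as.integer(predict(meta, xgb.DMatrix(as.matrix(meta_features(X[-idx, ])))) > 0.5)
mean(pred == y[-idx])                                        # overall stacked accuracy
```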

Stacking Results

Base model                              Accuracy   Accuracy (target = 1)
XGB + total data                        80.0%      28.5%
XGB + condensed data                    79.5%      27.9%
XGB + low-level data                    79.5%      27.7%
Logistic regression + sparse data       78.2%      26.8%
Logistic regression + condensed data    79.1%      28.1%
Random forest + PCA                     77.6%      20.9%

Meta model    Accuracy   Accuracy (target = 1)
XGB           81.11%     29.21%
Averaging     79.44%     27.31%
K-means       77.45%     23.91%

[Chart: accuracy of XGB]

Summary and Conclusion
Data mining project in the real world
- Huge and noisy data
Data preprocessing
- Feature encoding
- Different ways to handle missing values: new level, median / mean, or random assignment
Classification techniques
- Distance-based classifiers are not suitable
- Classifiers that handle mixed variable types are preferred, since categorical variables are dominant
- Stacking gives a further improvement
The biggest improvements came from model selection, parameter tuning, and stacking.
Result comparison: winner's result 80.4%; our result 79.5%

Acknowledgements
We would like to express our deep gratitude to the following people and organizations:
- Profs. Bremer and Simic, for the proposal that made this project possible
- The Woodward Foundation, for funding
- Prof. Simic and CAMCOS, for all the support
- Prof. Chen, for his guidance, valuable comments, and suggestions

QUESTIONS?