Introduction to Data Mining and Classification


Introduction to Data Mining and Classification
F. Michael Speed, Ph.D., Analytical Consultant, SAS Global Academic Program

Objectives
- State one of the major principles underlying data mining
- Give a high-level overview of three classification procedures

A Basic Principle of Data Mining
Splitting the data:
- Training data set – this is a must
- Validation data set – this is a must
- Testing data set – this is optional
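The split described above can be sketched in a few lines of plain Python. The 60/30/10 proportions here are illustrative assumptions, not prescribed by the slides:

```python
import random

def split_data(records, train_frac=0.6, valid_frac=0.3, seed=42):
    """Randomly partition records into training, validation,
    and (optional) testing sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is the optional test set
    return train, valid, test

train, valid, test = split_data(list(range(100)))
print(len(train), len(valid), len(test))  # 60 30 10
```

Shuffling before slicing matters: it keeps each partition representative of the whole data set.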

Training Set
For a given procedure (logistic regression, neural network, or decision tree), we use the training set to generate a sequence of models. For example, with logistic regression we get Model 1, Model 2, …, Model q.

How Do We Decide Which of the q Models Is Best?
- We want the model with the fewest terms (most parsimonious).
- We want the model with the largest (or smallest) value of our criterion index (adjusted R-square, misclassification rate, AIC, BIC, SBC, etc.).
- We use the validation set to compute the criterion (fit index) for each model and then choose the "best."

Compute the Fit Index for Each Model
Score Model 1, Model 2, …, Model q on the validation set to obtain Fit Index 1, Fit Index 2, …, Fit Index q; then find the "best" model using a fixed fit index.
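The selection step amounts to picking the model with the best validation score. A minimal sketch, where the model names and fit-index values are hypothetical:

```python
# Hypothetical validation misclassification rates for q = 3 candidate models
fit_index = {"Model 1": 0.138, "Model 2": 0.124, "Model 3": 0.129}

def choose_best(fit_index, smaller_is_better=True):
    """Return the model whose validation fit index is best.

    For misclassification rate or AIC, smaller is better;
    for ROC area or lift, pass smaller_is_better=False."""
    pick = min if smaller_is_better else max
    return pick(fit_index, key=fit_index.get)

print(choose_best(fit_index))  # Model 2
```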

Fit Indices (Statistics)
- Default — selects a statistic based on the type of the target variable and whether a profit/loss matrix has been defined: if a profit/loss matrix is defined for a categorical target, the average profit or average loss is used; if no profit/loss matrix is defined for a categorical target, the misclassification rate is used; if the target variable is interval, the average squared error is used.
- Akaike's Information Criterion — chooses the model with the smallest Akaike's Information Criterion value.
- Average Squared Error — chooses the model with the smallest average squared error value.
- Mean Squared Error — chooses the model with the smallest mean squared error value.
- ROC — chooses the model with the greatest area under the ROC curve.
- Captured Response — chooses the model with the greatest captured response values using the decile range that is specified in the Selection Depth property.

Continued
- Gain — chooses the model with the greatest gain using the decile range that is specified in the Selection Depth property.
- Gini Coefficient — chooses the model with the highest Gini coefficient value.
- Kolmogorov-Smirnov Statistic — chooses the model with the highest Kolmogorov-Smirnov statistic value.
- Lift — chooses the model with the greatest lift using the decile range that is specified in the Selection Depth property.
- Misclassification Rate — chooses the model with the lowest misclassification rate.
- Average Profit/Loss — chooses the model with the greatest average profit/loss.
- Percent Response — chooses the model with the greatest % response.
- Cumulative Captured Response — chooses the model with the greatest cumulative % captured response.
- Cumulative Lift — chooses the model with the greatest cumulative lift.
- Cumulative Percent Response — chooses the model with the greatest cumulative % response.

Misclassification Rate (MR)

                Prediction = 0    Prediction = 1
Actual = 0      True Negative     False Positive
Actual = 1      False Negative    True Positive

MR = (FN + FP) / (TN + FP + FN + TP)
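The MR formula above translates directly to code. The example counts are the logistic regression confusion matrix that appears later in these slides:

```python
def misclassification_rate(tn, fp, fn, tp):
    """MR = (FN + FP) / (TN + FP + FN + TP)."""
    return (fn + fp) / (tn + fp + fn + tp)

# Counts from the logistic regression confusion matrix shown later
print(round(misclassification_rate(tn=2306, fp=80, fn=332, tp=263), 4))  # 0.1382
```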

The Equity Data Set
The variable BAD = 1 if the borrower is a bad credit risk and 0 if not. We want to build a model to predict whether a person is a bad credit risk.
Other variables: Job, YOJ, Loan, DebtInc, plus:
- Mortdue – how much they still need to pay on their mortgage
- Value – assessed valuation
- Derog – number of derogatory reports
- Delinq – number of delinquent trade lines
- Clage – age of oldest trade line
- Ninq – number of recent credit inquiries
- Clno – number of trade lines

Three Procedures
- Decision tree
- Regression (logistic)
- Neural network

Decision Tree
- Very simple to understand
- Easy to use
- Can be explained to the boss/supervisor

Example

Maximal Tree – Ignoring Validation Data

Optimal Tree

Continued

Fit Statistics

                Prediction = 0    Prediction = 1
Actual = 0      2266              146
Actual = 1      225               370

MC = (225 + 146) / 2981 = .124455

Logistic Regression
Since we observe a 0 or a 1, ordinary least squares is not an option; we need a different approach. The probability of getting a 1 depends upon X; we write that as p(X).
Log odds = log(p(X) / (1 − p(X))) = a + bX

Logistic Graph – Solve for p(X)
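Solving the log-odds equation for p(X) gives the logistic (sigmoid) curve p(X) = 1 / (1 + exp(−(a + bX))). A small sketch; the coefficients a and b here are illustrative, not fitted to the equity data:

```python
import math

def p_of_x(x, a, b):
    """Invert the log-odds: p(X) = 1 / (1 + exp(-(a + bX)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# At a + bX = 0 the predicted probability is exactly 0.5
print(round(p_of_x(0.0, a=0.0, b=1.0), 2))  # 0.5
```

Because the output always lies strictly between 0 and 1, it can be read as a probability, which is exactly what ordinary least squares on a 0/1 response cannot guarantee.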

Fit Statistics

MCR

                Prediction = 0    Prediction = 1
Actual = 0      2306              80
Actual = 1      332               263

Neural Net
- Very complex mathematical equations
- Interpretation of the meaning of the input variables is not possible with the final model
- Often gives a good prediction of the response

Neural Net Diagram

Fit Statistics

MCR

                Prediction = 0    Prediction = 1
Actual = 0      2291              95
Actual = 1      288               307
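Before the comparison slide, it may help to compute the misclassification rates from the logistic regression and neural net confusion matrices above (both validation sets total 2,981 observations):

```python
def mcr(tn, fp, fn, tp):
    """Misclassification rate: (FN + FP) / total."""
    return (fn + fp) / (tn + fp + fn + tp)

results = {
    "Logistic regression": mcr(tn=2306, fp=80, fn=332, tp=263),
    "Neural net":          mcr(tn=2291, fp=95, fn=288, tp=307),
}
for name, rate in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {rate:.4f}")
# Neural net: 0.1285
# Logistic regression: 0.1382
```

On these counts the neural net misclassifies a slightly smaller fraction of the validation set than the logistic regression.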

Comparison

Enterprise Miner Interface

Enterprise Guide Interface

RPM (Rapid Predictive Modeler)

Continued

Continued

Fit Statistics

Summary
- Divide your data into training and validation sets
- We looked at decision trees, logistic regression, and neural nets
- We also looked at RPM

Q&A