MIS2502: Data Analytics Classification Using Decision Trees

MIS2502: Data Analytics Classification Using Decision Trees
Zhe (Joe) Deng
deng@temple.edu
http://community.mis.temple.edu/zdeng

Case: Apple says that it’ll use machine learning and Apple Maps to label stores that you use in the app, and use that data to track purchases across categories like “food and drink” or “shopping.”

Everything you buy gets a category and a color.

Case: Banks' Dilemma While Working to Protect You From Fraud. To satisfy consumer demands, most banks today maintain a 99% transaction approval rate on credit cards. Yet with no proactive monitoring and fraud prevention mechanisms in place, financial institutions become vulnerable to all sorts of credit card scams. "Machine learning can start modeling out what are the appropriate behaviors" - Kurt Long, CEO of FairWarning

What is classification?
Determining to what group a data element belongs, based on the "attributes" of that "entity".
Examples:
- Determining whether a customer should be given a loan
- Flagging a credit card transaction as a fraudulent charge
- Categorizing a news story as finance, entertainment, or sports
"Everything you buy gets a category and a color"

What is Classification?
A statistical method used to determine to what category (or "class") a new observation belongs, on the basis of a training data set containing observations whose categories are known.
Examples:

Task                                                      | Observations             | Categories (Classes)
Flagging a credit card transaction as a fraudulent charge | Credit card transactions | Fraudulent vs. legitimate
Filtering spam emails                                     | Emails                   | Spam vs. non-spam
Determining whether a customer should be given a loan     | Customers                | Default vs. no default

How Classification Works
1. Choose a categorical outcome variable.
2. Split the data set into training and validation subsets.
3. Use the training set to find a model that predicts the outcome as a function of the other attributes.
4. Apply the model to the validation set to check accuracy.
5. Apply the final model to future cases (i.e., prediction).
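
A minimal sketch of these steps in R using the rpart package (the slides name R as the classification software); the data frame `credit` and its columns Outcome, Income, Debt, and OwnsRents are hypothetical stand-ins for the course data set:

```r
# Hedged sketch of the five steps with R's rpart package.
library(rpart)

set.seed(42)                                           # reproducible split
train_rows <- sample(nrow(credit), floor(0.7 * nrow(credit)))
training   <- credit[train_rows, ]                     # step 2: training subset
validation <- credit[-train_rows, ]                    #         validation subset

# Step 3: fit a classification tree on the training set
model <- rpart(Outcome ~ Income + Debt + OwnsRents,
               data = training, method = "class")

# Step 4: apply the model to the validation set and check accuracy
predicted <- predict(model, validation, type = "class")
mean(predicted == validation$Outcome)                  # correct classification rate
```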

How Classification Works
Training Set (fed to classification software, such as R, to derive the model):

TID | Income | Debt | Owns/Rents | Outcome
 1  | 25k    | 35%  | Owns       | Default
 2  | 35k    | 40%  | Rent       | …
 3  | 33k    | 15%  | …          | No default
 4  | 28k    | 19%  | Rents      | …
 5  | 55k    | 30%  | …          | …
 6  | 48k    | …    | …          | …
 7  | 65k    | 17%  | …          | …
 8  | 85k    | 10%  | …          | …
 …  | …      | …    | …          | …

Validation Set (the derived model is then applied to these records; "?" marks the outcome to be predicted):

TID | Income | Debt | Owns/Rents | Outcome
 1  | 80k    | 35%  | Rent       | ?
 2  | 20k    | 40%  | Owns       | ?
 3  | 15k    | 15%  | …          | ?
 4  | 50k    | 19%  | Rents      | ?
 5  | 35k    | 30%  | …          | ?

Goals
- Accuracy: the trained model should assign new observations to the right category. It won't be 100% accurate, but it should be as close as possible.
- Prediction: the model's rules can be applied to new records as they come along, giving an automated, reliable way to predict the outcome.

Classification Method: Decision Tree
A model to determine the membership of cases, or the values of an outcome variable, based on one or more predictor variables (Tan, Steinbach, and Kumar 2004).

Classification Method: Decision Tree
[Figure: customers plotted by Age and Income, marked "Did not default" vs. "Default", next to the corresponding classification tree: split first on Income (<50K vs. >=50K), then on Age (<45 vs. >=45), ending in YES/NO leaves; a new customer "X" is classified by following the branches.]

How the Decision Tree Algorithm Works
1. Start with a single node containing all the training data.
2. Are the samples all of the same classification? If yes, this node is done.
3. If not, is there a predictor that will split the data? If so, partition the node into child nodes according to that predictor.
4. Are there more nodes (i.e., new child nodes)? If yes, go to the next node and repeat from step 2; if no, DONE!
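
For illustration, a toy version of this loop in R; it assumes categorical predictors and consumes them in a fixed order, whereas real algorithms choose the statistically best split at each node (see the split-statistics slide below):

```r
# Toy recursive partitioning: `data` is a data frame, `outcome` names the
# class column, `predictors` are the columns still available for splitting.
build_tree <- function(data, predictors, outcome) {
  # Leaf: all samples share one class, or no predictors remain
  if (length(unique(data[[outcome]])) == 1 || length(predictors) == 0) {
    majority <- names(which.max(table(data[[outcome]])))
    return(list(leaf = TRUE, class = majority))
  }
  split_var <- predictors[1]                            # naive choice of splitter
  children  <- lapply(split(data, data[[split_var]]),   # partition into child nodes
                      build_tree,
                      predictors = predictors[-1],
                      outcome    = outcome)
  list(leaf = FALSE, split = split_var, children = children)
}
```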

Example: Credit Card Default
We create the tree from a set of training data (the training set shown earlier).
Each unique combination of predictors is associated with an outcome.
This set was "rigged" so that every combination is accounted for and has an outcome.

Example: Credit Card Default
The predictors form the internal nodes of the tree; the classifications sit at the leaf nodes. Built from the training set above:

Credit Approval (root node)
├─ Income <40k (child node)
│   ├─ Debt >20% → Default (leaf node)
│   └─ Debt <20%
│       ├─ Owns house → No Default
│       └─ Rents → Default
└─ Income >40k
    ├─ Debt >20%
    │   ├─ Owns house → No Default
    │   └─ Rents → Default
    └─ Debt <20% → No Default

Same Data, Different Tree
We just changed the order of the predictors: this tree splits first on Owns/Rents, then on Income (<40k vs. >40k), then on Debt (>20% vs. <20%), and reaches the same Default/No Default outcomes for the same training set.

Start with the root node
The tree begins as a single root node ("Credit Approval"). There are both "defaults" and "no defaults" in the set, so we need to look for predictors to split the data.

First split: on income
Income (with 40k as the cutoff point) is the factor that produces the greatest "separation": more income, less default.
But there is still a mix of defaults and no defaults within each income group, so we look for another split.

Second split: on debt
Debt is a factor: (Income <40k, Debt <20%, Owns) leads to No default, but (Income <40k, Debt <20%, Rents) leads to Default.
There is still a mix of defaults and no defaults within some debt groups, so we look for another split.

Third split: on owns/rents
Owns/Rents is a factor: for some cases it doesn't matter, but for some it does, so similar branches are grouped together.
…And we stop because we're out of predictors!

How does it know when and how to split?
There are statistics that show:
- When a predictor variable maximizes distinct outcomes (if age is a predictor, we should see that older people buy and younger people don't)
- When a predictor variable separates outcomes (if age is a predictor, we should not see older people who buy mixed up with older people who don't)
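
The slides don't name a specific statistic; one common choice is the Gini index (rpart's default). A sketch with hypothetical numbers:

```r
# Gini impurity: 0 for a pure node, larger when classes are mixed
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions in the node
  1 - sum(p^2)
}

# Hypothetical node: 8 customers with known outcomes and incomes (in $k)
outcome <- c("Default", "Default", "No default", "Default",
             "No default", "No default", "No default", "No default")
income  <- c(25, 35, 33, 28, 55, 48, 65, 85)

left  <- outcome[income < 40]    # candidate split: Income < 40k
right <- outcome[income >= 40]

# Weighted impurity after the split, compared with the unsplit node;
# the candidate split producing the biggest drop is chosen
after <- (length(left) * gini(left) + length(right) * gini(right)) / length(outcome)
c(before = gini(outcome), after = after)
```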

Decision Tree Algorithm
The decision tree algorithm takes a large set of training data to compute the tree.
In the data, similar cases may have different outcomes, so the probability of an outcome is computed.
For instance, you may find that when income >40k, debt >20%, and the customer rents, default occurs 60% of the time (a probability of 0.6 at that leaf).

Reading the Decision Tree Result
Outcome cases are not 100% certain; there are probabilities attached to each outcome in a node. Let's code "Default" as 1 and "No Default" as 0. Each leaf then shows the probability that the outcome is 1:

Credit Approval
├─ Income <40k
│   ├─ Debt >20% → 0.85
│   └─ Debt <20%
│       ├─ Owns house → 0.20
│       └─ Rents → 0.30
└─ Income >40k
    ├─ Debt >20%
    │   ├─ Owns house → 0.22
    │   └─ Rents → 0.60
    └─ Debt <20% → 0.26

How many leaf nodes are there?

The numbers next to the leaf nodes represent the probability of the predicted outcome being 1 (1 = "Default").

For example, for people with Income <40k and Debt >20%, the probability of "default" is 0.85 (or 85%).

A leaf value of 1 means we predict those people are likely to "Default"; a value of 0 means we predict "No Default".
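
In R's rpart, and reusing the hypothetical `model` and `validation` from the earlier sketch, these leaf probabilities are what `predict` returns when asked for probabilities rather than labels:

```r
predict(model, validation, type = "prob")   # probability of each class, per record
predict(model, validation, type = "class")  # label of the most probable class
```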

So what is the chance that:
1. A renter making more than $40,000, with debt more than 20% of income, will default?
2. A home owner making less than $40,000, with debt less than 20% of income, will default?
(Trace each case down the probability tree above.)

Describe the characteristics of the groups most likely and least likely to default.

Apply to new (validation) data
Run each validation record down the tree and compare the predicted decision to the actual one:

TID | Income | Debt | Owns/Rents | Decision (Predicted) | Decision (Actual)
 1  | 80k    | 35%  | Rent       | Default              | No Default
 2  | 20k    | 40%  | Owns       | …                    | …
 3  | 15k    | 15%  | …          | …                    | …
 4  | 50k    | 19%  | Rents      | …                    | …
 5  | 35k    | 30%  | …          | …                    | …

How well did the decision tree do in predicting the outcome? When it's "good enough," we've got our model for future decisions.
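
Continuing the hypothetical R sketch, predicted and actual decisions can be compared side by side:

```r
validation$Predicted <- predict(model, validation, type = "class")

# Cross-tabulate predicted vs. actual outcomes; this is the confusion
# matrix discussed on the next slide
table(Predicted = validation$Predicted, Actual = validation$Outcome)
```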

Classification Accuracy
How often does the tree make a correct prediction?
- Error rate: percent of misclassified records out of the total records
- Correct classification rate: percent of correctly classified records out of the total records
- Error rate + correct classification rate = 100%
A good decision tree model should have high classification accuracy, meaning a low error rate and a high correct classification rate.

Classification Accuracy: A Numeric Example
A confusion matrix compares the predicted outcomes to the observed (actual) outcomes:

                     | Predicted: Default | Predicted: No default
Observed: Default    |        600         |         100
Observed: No default |        150         |         650

Total: 1,500 records
Error rate = (150 + 100) / 1500 = 16.7%
Correct classification rate = 100% - 16.7% = 83.3%
(Another way to compute it: correct classification rate = (600 + 650) / 1500 = 83.3%)
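
The same arithmetic written out in R (matrix entries taken from the example above):

```r
conf <- matrix(c(600, 100,
                 150, 650),
               nrow = 2, byrow = TRUE,
               dimnames = list(Observed  = c("Default", "No default"),
                               Predicted = c("Default", "No default")))

error_rate   <- (conf["Default", "No default"] + conf["No default", "Default"]) / sum(conf)
correct_rate <- sum(diag(conf)) / sum(conf)
c(error = error_rate, correct = correct_rate)   # 0.167 and 0.833
```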

Can we keep splitting as long as we can?
You may get better classification accuracy on the training set, but the tree will become too complex and difficult to interpret.
Another problem: "overfitting."

How overfitting affects prediction
If the tree is too complex, it may have poor predictive performance on new data, because it can exaggerate minor fluctuations (noise) in the training data.
[Figure: error rate plotted against tree size for the training set and for the validation set.]
"A good model must not only fit the training data well but also accurately classify records it has never seen."

Avoid Overfitting: Control Tree Size
In R, you can control tree size using (see the sketch below):
- Minimum split: the minimum number of observations a node must contain for an additional split to be attempted. Smaller minimum split → more complex tree.
- Complexity factor: the minimum reduction in error needed to add an additional split. Smaller complexity factor → more complex tree.
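
In the rpart package these two settings are the `minsplit` and `cp` arguments of `rpart.control`; a sketch reusing the hypothetical training data (20 and 0.01 happen to be rpart's defaults):

```r
library(rpart)

tree <- rpart(Outcome ~ Income + Debt + OwnsRents,
              data = training, method = "class",
              control = rpart.control(minsplit = 20,  # minimum split
                                      cp = 0.01))     # complexity factor
```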

Avoid Overfitting: Prune the Tree
The idea behind pruning: a very large tree is likely to overfit the training set, so the weakest branches, which hardly reduce the error rate, should be removed.
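
A common pruning recipe with rpart, sketched against the hypothetical `tree` above: grow a large tree, then cut it back to the complexity value with the lowest cross-validated error:

```r
printcp(tree)   # cp table: tree size vs. cross-validated error (xerror)

best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)   # remove the weakest branches
```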

Summary
- What classification is
- Structure of a decision tree: outcome variables (categorical values) and predictor variables
- Classification accuracy: error rate and correct classification rate
- Interpreting a decision tree: determining the probability of an event happening based on predictor variable values
- Pros and cons of a complex tree: overfitting

Time for our 11th ICA!