
1 COMP3503 Inductive Decision Trees with Daniel L. Silver

2 Agenda  Explanatory/Descriptive Modeling  Inductive Decision Tree Theory  The Weka IDT System  Weka IDT Tutorial

3 Explanatory/Descriptive Modeling

4 Overview of Data Mining Methods  Automated Exploration/Discovery: e.g. discovering new market segments; distance and probabilistic clustering algorithms  Prediction/Classification: e.g. forecasting gross sales given current factors; statistics (regression, K-nearest neighbour); artificial neural networks, genetic algorithms  Explanation/Description: e.g. characterizing customers by demographics; inductive decision trees/rules; rough sets, Bayesian belief nets [figure: a cluster plot over x1/x2, a fitted curve f(x) vs. x, and an example rule: if age > 35 and income < $35k then...]

5 Inductive Modeling = Learning Objective: Develop a general model or hypothesis from specific examples  Function approximation (curve fitting)  Classification (concept learning, pattern recognition) [figure: a fitted curve f(x) vs. x and a two-class (A/B) scatter plot over x1/x2]

6 Inductive Modeling with IDT Basic Framework for Inductive Learning: the environment supplies training examples (x, f(x)) to the inductive learning system, which produces an induced model or classifier h(x); testing examples are then used to compare the output classification (x, h(x)) against (x, f(x)), i.e., does h(x) ≈ f(x)? The focus is on developing a model h(x) that can be understood (is transparent).

7 Inductive Decision Tree Theory

8 Inductive Decision Trees Decision Tree  A representational structure  An acyclic, directed graph  Nodes are either a: Leaf - indicates class or value (distribution) Decision node - a test on a single attribute - will have one branch and subtree for each possible outcome of the test  Classification made by traversing from root to a leaf in accord with tests [figure: example tree with a root decision node A?, internal decision nodes B?, C?, D?, and leaves]
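
To make this structure concrete, here is a minimal sketch (not from the course materials) of a leaf/decision-node representation and classification by traversal from root to leaf; the attributes and classes are borrowed from the classic sunburn toy example and are purely illustrative:

```java
// Minimal sketch of the tree structure described above (illustrative only).
import java.util.Map;

interface Node {                       // a node is either a Leaf or a DecisionNode
    String classify(Map<String, String> example);
}

class Leaf implements Node {           // leaf: indicates the predicted class
    private final String label;
    Leaf(String label) { this.label = label; }
    public String classify(Map<String, String> example) { return label; }
}

class DecisionNode implements Node {   // decision node: a test on a single attribute,
    private final String attribute;    // one branch/subtree per possible outcome
    private final Map<String, Node> branches;
    DecisionNode(String attribute, Map<String, Node> branches) {
        this.attribute = attribute;
        this.branches = branches;
    }
    public String classify(Map<String, String> example) {
        // traverse toward a leaf in accord with the test outcome
        return branches.get(example.get(attribute)).classify(example);
    }
}

public class TinyTreeDemo {
    public static void main(String[] args) {
        Node tree = new DecisionNode("lotion", Map.of(
                "yes", new Leaf("none"),
                "no",  new DecisionNode("hair", Map.of(
                        "blonde", new Leaf("sunburn"),
                        "brown",  new Leaf("none")))));
        System.out.println(tree.classify(Map.of("lotion", "no", "hair", "blonde")));  // sunburn
    }
}
```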

9 Inductive Decision Trees (IDTs) A Long and Diverse History  Independently developed in the 60s and 70s by researchers in... Statistics: L. Breiman & J. Friedman - CART (Classification and Regression Trees) Pattern Recognition: U. of Michigan - AID, G.V. Kass - CHAID (Chi-squared Automated Interaction Detection) AI and Info. Theory: R. Quinlan - ID3, C4.5 (Iterative Dichotomizer) - closest to our scenario

10 Inducing a Decision Tree Given: Set of examples with Pos. & Neg. classes Problem: Generate a Decision Tree model to classify a separate (validation) set of examples with minimal error Approach: Occam's Razor - produce the simplest model that is consistent with the training examples -> a narrow, short tree; every traversal from root to leaf should be as short as possible Formally: finding the absolute simplest tree is intractable, but we can at least try our best

11 Inducing a Decision Tree How do we produce an optimal tree? Heuristic (strategy) 1: Grow the tree from the top down. Place the most important variable test at the root of each successive subtree. The most important variable: the variable (predictor) that gains the most ground in classifying the set of training examples; the variable that has the most significant relationship to the response variable; the one on which the response is most dependent (or least independent)

12 Inducing a Decision Tree Importance of a predictor variable  CHAID/CART: the Chi-squared [or F (Fisher)] statistic is used to test the independence between the categorical [or continuous] response variable and each predictor variable; the lowest probability (p-value) from the test determines the most important predictor (p-values are first corrected by the Bonferroni adjustment)  C4.5 (section 4.3 of WFH, and PDF slides): information-theoretic Gain is computed for each predictor and the one with the highest Gain is chosen
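
As an illustration of the C4.5-style criterion, the following sketch uses Weka's InfoGainAttributeEval to rank the predictors of a dataset by information gain with respect to the class; the ARFF file name is a placeholder (weather.nominal.arff ships with the Weka distribution):

```java
// Sketch: rank predictors by information gain with Weka (file name is a placeholder).
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GainRanking {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // last attribute = response/class

        InfoGainAttributeEval ig = new InfoGainAttributeEval();
        ig.buildEvaluator(data);                        // computes gain of each predictor w.r.t. the class
        for (int i = 0; i < data.numAttributes() - 1; i++) {
            System.out.printf("%-12s gain = %.4f%n",
                    data.attribute(i).name(), ig.evaluateAttribute(i));
        }
    }
}
```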

13 Inducing a Decision Tree How do we produce an optimal tree? Heuristic (strategy) 2: To be fair to predictor variables that have only 2 values, divide variables with multiple values into similar groups or segments which are then treated as separate variables (CART/CHAID only)  The p-values from the Chi-squared or F statistic are used to determine variable/value combinations which are most similar in terms of their relationship to the response variable

14 Inducing a Decision Tree How do we produce an optimal tree? Heuristic (strategy) 3: Prevent overfitting the tree to the training data so that it generalizes well to a validation set by: Stopping: Prevent the split on a predictor variable if its test does not reach the chosen level of statistical significance - simply make it a leaf (CHAID) Pruning: After a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than that of the more complex tree (CART, C4.5)

15 Inducing a Decision Tree Stopping (pre-pruning) means a choice of level of significance (CART)....  If the probability (p-value) of the statistic is less than the chosen level of significance then a split is allowed  Typically the significance level is set to: 0.05, which provides 95% confidence, or 0.01, which provides 99% confidence

16 Inducing a Decision Tree Stopping means a minimum number of examples at a leaf node (C4.5 = J48)....  M factor = minimum number of examples allowed at a leaf node  M = 2 is the default

17 Inducing a Decision Tree Pruning means reducing the complexity of a tree (C4.5 = J48)....  C factor = confidence in the data used to train the tree  C = 25% is the default  If there is 25% confidence that a pruned branch will generate no more errors on a test set than the unpruned branch, then prune it  p.196 WFH, PDF slides
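
A minimal sketch of setting these two J48 parameters (the command-line equivalents are -M and -C) programmatically through the Weka API; the ARFF file name is a placeholder:

```java
// Sketch: building a J48 (C4.5) tree in Weka with the stopping (M) and pruning (C)
// parameters described above.  The ARFF file name is a placeholder.
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Tuning {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("customers.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // categorical response = last attribute

        J48 tree = new J48();
        tree.setMinNumObj(2);            // M factor: minimum examples at a leaf (default 2)
        tree.setConfidenceFactor(0.25f); // C factor: pruning confidence (default 25%)
        tree.buildClassifier(data);

        System.out.println(tree);        // prints the induced tree (the transparent model)
    }
}
```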

18 The Weka IDT System  Weka SimpleCART creates a tree-based classification model  The target or response variable must be categorical (multiple classes allowed)  Uses the Chi-Squared test for significance  Prunes the tree by using a test/tuning set

19 The Weka IDT System  Weka J48 creates a tree-based classification model = Ross Quinlan's original C4.5 algorithm  The target or response variable must be categorical  Uses the information gain test for significance  Prunes the tree by using a test/tuning set

20 The Weka IDT System  Weka M5P creates a tree-based regression model (a model tree) = also by Ross Quinlan  The target or response variable must be continuous  Chooses splits by reducing the variance (standard deviation) of the response  Prunes the tree by using a test/tuning set
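
For comparison with J48, a brief sketch (the ARFF file name is a placeholder and its class attribute is assumed to be numeric) of building an M5P model tree for a continuous response:

```java
// Sketch: M5P for a continuous response.  File name is a placeholder;
// the class attribute must be numeric.
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class M5PDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("sales.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);  // numeric response

        M5P modelTree = new M5P();
        modelTree.buildClassifier(data);
        System.out.println(modelTree);                 // tree with linear models at the leaves

        double predicted = modelTree.classifyInstance(data.instance(0));
        System.out.println("Prediction for first example: " + predicted);
    }
}
```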

21 IDT Training

22 IDT Training How do you ensure that a decision tree has been well trained?  Objective: To achieve good generalization accuracy on new examples/cases  Establish a maximum acceptable error rate  Train the tree using a method to prevent over-fitting – stopping / pruning  Validate the trained tree against a separate test set

23 IDT Training Approach #1: Large Sample - when the amount of available data is large... Divide the available examples randomly into a training set (70%) and a hold-out/test set (30%). The training set is used to develop one IDT model; goodness of fit is computed on the hold-out set. Generalization = goodness of fit on the held-out examples.
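
A sketch of this large-sample approach with the Weka API; the 70/30 proportions follow the slide and the ARFF file name is a placeholder:

```java
// Sketch of the large-sample (hold-out) approach: random 70/30 split,
// train on 70%, measure goodness of fit on the held-out 30%.
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HoldOutDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("customers.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));                     // divide randomly

        int trainSize = (int) Math.round(data.numInstances() * 0.70);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);                       // develop one IDT model

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);                    // goodness of fit on the hold-out set
        System.out.println(eval.toSummaryString("Hold-out results", false));
    }
}
```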

24 IDT Training Approach #2: Cross-validation - when the amount of available data is small... Repeat 10 times: hold out 10% of the available examples as a test set and train on the remaining 90%, developing 10 different IDT models; tabulate the goodness-of-fit statistics. Generalization = mean and standard deviation of goodness of fit across the 10 folds.
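
A corresponding sketch of 10-fold cross-validation with Weka's Evaluation class (the ARFF file name is again a placeholder):

```java
// Sketch of 10-fold cross-validation with Weka's Evaluation class.
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("customers.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));  // 10 folds, 10 models
        System.out.printf("Accuracy pooled across the 10 folds: %.2f%%%n", eval.pctCorrect());
        System.out.println(eval.toSummaryString());
    }
}
```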

25 IDT Training How do you select between two induced decision trees?  A statistical test of hypothesis is required to ensure that a significant difference exists between the fit of two IDT models  If the Large Sample method has been used then apply McNemar's test* or a difference-of-proportions test  If Cross-validation, then use a paired t test for the difference of two proportions *We assume a classification problem; if this is function approximation then use a paired t test for the difference of means
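
To illustrate McNemar's test for the large-sample case, a small sketch with made-up discordant counts (b and c are illustrative values, not results from any real experiment):

```java
// Sketch of McNemar's test for comparing two classifiers on the same hold-out examples.
// b = examples model A classifies correctly and model B gets wrong,
// c = examples model B classifies correctly and model A gets wrong (counts are illustrative).
public class McNemarDemo {
    public static void main(String[] args) {
        int b = 18;   // correct by A only
        int c = 7;    // correct by B only

        // Chi-squared statistic with continuity correction, 1 degree of freedom.
        double chi2 = Math.pow(Math.abs(b - c) - 1, 2) / (double) (b + c);
        System.out.printf("McNemar chi-squared = %.3f%n", chi2);

        // Critical value for 1 d.f. at the 0.05 significance level is 3.841.
        System.out.println(chi2 > 3.841
                ? "Significant difference between the two IDT models (95% confidence)"
                : "No significant difference at the 95% confidence level");
    }
}
```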

26 Pros and Cons of IDTs Cons:  Only one response variable at a time  Different significance tests required for nominal and continuous responses  Can have difficulties with noisy data  Discriminant functions are often suboptimal because decision hyperplanes are orthogonal to the axes (axis-parallel splits)

27 Pros and Cons of IDTs Pros:  Proven modeling method for 20 years  Provides explanation and prediction  Ability to learn arbitrary functions  Handles unknown values well  Rapid training and recognition speed  Has inspired many inductive learning algorithms using statistical regression

28 The IDT Application Development Process Guidelines for inducing decision trees 1. IDTs are a good method to start with 2. Get a suitable training set 3. Use a sensible coding for input variables 4. Develop the simplest tree by adjusting tuning parameters (significance level) 5. Use a method to prevent over-fitting 6. Determine confidence in generalization through cross-validation

29 THE END