Ch9: Decision Trees

9.1 Introduction

A decision tree:
(a) is a hierarchical data structure
(b) implements the divide-and-conquer strategy

A decision tree is composed of internal (decision) nodes and terminal (leaf) nodes. Each decision node is associated with a test function whose discrete outcomes label the branches. Each leaf node carries a class label (for classification) or a numeric value (for regression).

A leaf node defines a localized region in which examples belong to the same class or have similar values. The boundaries of the region are determined by the discriminants coded in the decision nodes on the path from the root to the leaf.

Advantages of decision trees:
i) Fast localization of the region covering an input
ii) Interpretability

9.2 Univariate Trees

Decision nodes
– Univariate: uses a single attribute, x_i
    Numeric x_i: binary split, e.g., x_i > w_m0
    Discrete x_i: n-way split over its n possible values
– Multivariate: uses all attributes, x

Leaf nodes
– Classification: class labels, or class proportions
– Regression: a numeric value (the average of the r values reaching the leaf), or a local fit

Example: numeric inputs, binary splits. With univariate binary splits on numeric attributes, the leaf nodes define axis-aligned hyperrectangles in the input space.
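
Below is a minimal sketch of this idea: a hard-coded two-level univariate tree with hypothetical thresholds (0.5 on x1, 0.3 on x2) that partitions the plane into three rectangular regions, one per leaf.

    # Hypothetical two-level univariate tree: each leaf corresponds to an
    # axis-aligned rectangle of the (x1, x2) plane.
    def predict(x1, x2):
        if x1 > 0.5:            # root decision node: test on x1
            return "C1"         # leaf: region x1 > 0.5
        if x2 > 0.3:            # second decision node: test on x2
            return "C2"         # leaf: region x1 <= 0.5, x2 > 0.3
        return "C1"             # leaf: region x1 <= 0.5, x2 <= 0.3

    print(predict(0.7, 0.1))    # -> C1
    print(predict(0.2, 0.6))    # -> C2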

Given a training set, many trees can code it. Objective: find the smallest one. Tree size is measured by i) the number of nodes and ii) the complexity of the decision nodes.

9.2.1 Classification Trees

The goodness of a split is quantified by an impurity measure. A split with minimal impurity is desirable because we want the smallest tree. A split is pure if, for every branch, all the examples taking that branch belong to the same class.

For node m, let N_m be the number of training examples reaching m, of which N_m^i belong to class C_i, with \sum_i N_m^i = N_m. The probability that an example x reaching node m belongs to class C_i is estimated as

  \hat{P}(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}

Node m is pure if the p_m^i are either 0 or 1. It can then be a leaf node labeled with the class for which p_m^i = 1. One possible measure of impurity is the entropy

  I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i

The smaller the entropy, the purer the node.
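
A minimal sketch of this computation: estimate p_m^i from the per-class counts at a node and return the entropy impurity I_m (with 0 log 0 taken as 0).

    import math

    def node_entropy(counts):
        # counts[i] = number of examples of class C_i reaching the node
        n = sum(counts)
        entropy = 0.0
        for n_i in counts:
            if n_i > 0:
                p = n_i / n                   # p_m^i
                entropy -= p * math.log2(p)
        return entropy

    print(node_entropy([25, 25]))   # maximally impure 2-class node -> 1.0 bit
    print(node_entropy([50, 0]))    # pure node -> 0.0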

In information theory, let P(e) denote the probability of occurrence of an event e. If e occurs, we receive

  I(e) = \log_2 \frac{1}{P(e)}

bits of information. Bit: the amount of information received when one of two equally probable alternatives is observed, i.e., I(e) = \log_2 \frac{1}{1/2} = 1. If P(e) = 1 then I(e) = 0; this means that if we know for sure that an event will occur, its occurrence provides no information at all.

Consider a sequence of symbols s_1, s_2, \ldots, s_q output from a source with occurrence probabilities P(s_1), P(s_2), \ldots, P(s_q). Zero-memory source: the probability of sending each symbol is independent of the symbols previously sent. The amount of information received from symbol s_i is

  I(s_i) = \log_2 \frac{1}{P(s_i)}

Entropy of the information source: the average amount of information received per symbol,

  H(S) = \sum_{i=1}^{q} P(s_i) \, I(s_i) = -\sum_{i=1}^{q} P(s_i) \log_2 P(s_i)

Entropy measures the degree of disorder of a system. The most disorderly source is the one whose symbols occur with equal probability.

Proof: Let S be a source with symbol probabilities P(s_i) and S' a source over the same symbols with probabilities Q(s_i). The difference in entropy between the two sources is

  H(S') - H(S) = \sum_i P(s_i) \log_2 \frac{P(s_i)}{Q(s_i)} + \sum_i \left( P(s_i) - Q(s_i) \right) \log_2 Q(s_i)   (1)

If S' is a source with equiprobable symbols, then Q(s_i) = 1/q for all i and

  H(S') = \log_2 q

The second term in (1) is then zero, because \log_2 Q(s_i) = -\log_2 q is constant and both sets of probabilities sum to one, so

  H(S') - H(S) = \sum_i P(s_i) \log_2 \frac{P(s_i)}{Q(s_i)} \ge 0

where the right-hand side is the information gain (the relative entropy, or Kullback-Leibler divergence, between P and Q). Hence H(S) \le \log_2 q, with equality only when the symbols of S are equiprobable.

In statistical mechanics:
– We deal with ensembles, each of which contains a large number of identical small systems (e.g., in thermodynamics, quantum mechanics).
– The properties of an ensemble are determined by the average behavior of its constituent small systems.

Example: a thermodynamic system that is initially at some temperature T_0 is changed to temperature T.

At temperature T, sufficient time should be allowed for the system to reach equilibrium, in which the probability that a particle i has energy E_i follows the Boltzmann distribution

  P(E_i) = \frac{e^{-E_i / k_B T}}{Z}

where T is the temperature in kelvins and k_B is the Boltzmann constant. The system energy at the equilibrium state is then, in an average sense,

  \langle E \rangle = \sum_i P(E_i) \, E_i

The coarseness of particle i with energy E_i:

  I(E_i) = \log \frac{1}{P(E_i)}, \qquad P(E_i) = \frac{e^{-E_i / k_B T}}{Z}

where Z = \sum_i e^{-E_i / k_B T} is the partition function. The entropy of the system is the average coarseness of its constituent particles:

  S = \sum_i P(E_i) \log \frac{1}{P(E_i)} = -\sum_i P(E_i) \log P(E_i)
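
A minimal sketch of this distribution, using hypothetical energy levels and units in which the Boltzmann constant k_B = 1; it shows that raising the temperature spreads probability more evenly over the levels and so increases the entropy (disorder).

    import math

    def boltzmann(energies, T):
        weights = [math.exp(-e / T) for e in energies]
        Z = sum(weights)                      # partition function
        return [w / Z for w in weights]       # P(E_i)

    energies = [0.0, 1.0, 2.0]                # hypothetical energy levels
    for T in (0.5, 5.0):
        probs = boltzmann(energies, T)
        entropy = -sum(p * math.log(p) for p in probs)
        print(T, [round(p, 3) for p in probs], round(entropy, 3))
    # Higher T -> more uniform probabilities -> larger entropy.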

Consider the 2-class case and let p \equiv p_m^1 (so p_m^2 = 1 - p). A function \phi(p, 1-p) is an impurity measure if it is increasing in p on [0, 1/2] and decreasing in p on [1/2, 1].

Examples:
1. Entropy: \phi(p, 1-p) = -p \log_2 p - (1-p) \log_2 (1-p)
2. Gini index: \phi(p, 1-p) = 2p(1-p)
3. Misclassification error: \phi(p, 1-p) = 1 - \max(p, 1-p)
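
A minimal sketch of the three impurity functions for the 2-class case; all are largest at p = 1/2 (most impure) and vanish at p = 0 and p = 1 (pure node).

    import math

    def entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def gini(p):
        return 2 * p * (1 - p)

    def misclassification_error(p):
        return 1 - max(p, 1 - p)

    for p in (0.0, 0.25, 0.5, 1.0):
        print(p, entropy(p), gini(p), misclassification_error(p))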

CART algorithm: if node m is pure, generate a leaf and stop; otherwise, split and continue recursively.

Impurity after a split: N_mj of the N_m examples take branch j, and N_mj^i of them belong to class C_i, so p_mj^i = N_mj^i / N_mj. The total impurity after the split is

  I'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i

For all attributes (and, for numeric attributes, all split positions), calculate the split impurity and choose the split with minimum impurity.
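
A minimal sketch (not the full CART procedure) of this split selection for a single numeric attribute: candidate thresholds are midpoints between consecutive distinct values, and the one minimizing the weighted entropy of the two branches is chosen. The data and class labels are hypothetical.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        probs = [c / n for c in Counter(labels).values()]
        return sum(-p * math.log2(p) for p in probs)

    def best_split(values, labels):
        best_w, best_impurity = None, float("inf")
        points = sorted(set(values))
        for lo, hi in zip(points, points[1:]):
            w = (lo + hi) / 2                 # candidate threshold: x <= w vs. x > w
            left = [y for x, y in zip(values, labels) if x <= w]
            right = [y for x, y in zip(values, labels) if x > w]
            impurity = (len(left) * entropy(left) +
                        len(right) * entropy(right)) / len(labels)
            if impurity < best_impurity:
                best_w, best_impurity = w, impurity
        return best_w, best_impurity

    x = [1.0, 2.0, 3.0, 8.0, 9.0]
    y = ["C1", "C1", "C1", "C2", "C2"]
    print(best_split(x, y))                   # -> (5.5, 0.0): a pure split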

Difficulties with the CART algorithm:
– Splitting favors attributes with many values: many values → many branches → less impurity.
– Noise may lead to a very large tree if the purest tree is desired.

9.2.2 Regression Trees

The goodness of a split is measured by the mean square error from the estimated value. For node m, X_m is the subset of X reaching node m; let g_m be the estimated value in node m.

The mean square error at node m is

  E_m = \frac{1}{N_m} \sum_t \left( r^t - g_m \right)^2 b_m(x^t)

where b_m(x^t) = 1 if x^t \in X_m and 0 otherwise, and N_m = \sum_t b_m(x^t). E_m is minimized by taking g_m to be the average of the outputs reaching node m:

  g_m = \frac{\sum_t b_m(x^t) \, r^t}{\sum_t b_m(x^t)}

After splitting into branches j = 1, \ldots, n:

  E'_m = \frac{1}{N_m} \sum_{j} \sum_t \left( r^t - g_{mj} \right)^2 b_{mj}(x^t), \qquad g_{mj} = \frac{\sum_t b_{mj}(x^t) \, r^t}{\sum_t b_{mj}(x^t)}

If the error is acceptable, i.e., E_m < \theta_r, a leaf node is created and the value g_m is stored. If the error is not acceptable, the examples reaching node m are split further such that the sum of the errors in the branches is minimal.

Example: different error thresholds \theta_r produce trees of different sizes (the smaller \theta_r, the finer the partition and the larger the tree).
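
A minimal sketch of this recursive procedure for a single numeric attribute, with a hypothetical threshold theta_r: a node becomes a leaf storing the average when its mean square error is small enough, and is otherwise split at the position giving the smallest total branch error.

    def node_error(rs):
        g = sum(rs) / len(rs)                 # leaf estimate g_m: the average
        return sum((r - g) ** 2 for r in rs) / len(rs), g

    def grow(xs, rs, theta_r=0.05):
        err, g = node_error(rs)
        if err <= theta_r or len(set(xs)) == 1:      # acceptable error -> leaf
            return {"leaf": g}
        best = None
        for w in sorted(set(xs))[:-1]:               # candidate split positions
            left = [(x, r) for x, r in zip(xs, rs) if x <= w]
            right = [(x, r) for x, r in zip(xs, rs) if x > w]
            e = (len(left) * node_error([r for _, r in left])[0] +
                 len(right) * node_error([r for _, r in right])[0]) / len(rs)
            if best is None or e < best[0]:
                best = (e, w, left, right)
        _, w, left, right = best
        return {"split": w,
                "le": grow([x for x, _ in left], [r for _, r in left], theta_r),
                "gt": grow([x for x, _ in right], [r for _, r in right], theta_r)}

    print(grow([1, 2, 3, 4], [0.0, 0.1, 1.0, 1.1]))
    # -> {'split': 2, 'le': {'leaf': 0.05}, 'gt': {'leaf': 1.05}}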

The CART algorithm can be modified to train a regression tree by replacing (i) entropy with the mean square error and (ii) class labels with averages.

Another possible error function is the worst possible error

  E_m = \max_j \max_t \left| r^t - g_{mj} \right| b_{mj}(x^t)

which guarantees that the error of any instance is never larger than a given threshold.

9.3 Pruning

Two types of pruning:
– Prepruning: early stopping, e.g., stop splitting when only a small number of examples reach a node.
– Postpruning: grow the whole tree, then prune unnecessary subtrees, e.g., by checking them against a separate pruning set.

Prepruning is faster; postpruning is more accurate.
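
A minimal sketch of one postpruning strategy (reduced-error pruning on a held-out pruning set); the tree is represented as a hypothetical nested dict and the data are toy examples.

    from collections import Counter

    def predict(node, x):
        if "leaf" in node:
            return node["leaf"]
        branch = "le" if x[node["attr"]] <= node["w"] else "gt"
        return predict(node[branch], x)

    def errors(node, data):
        return sum(predict(node, x) != y for x, y in data)

    def prune(node, data):
        if "leaf" in node or not data:
            return node
        le = [(x, y) for x, y in data if x[node["attr"]] <= node["w"]]
        gt = [(x, y) for x, y in data if x[node["attr"]] > node["w"]]
        node = {"attr": node["attr"], "w": node["w"],
                "le": prune(node["le"], le), "gt": prune(node["gt"], gt)}
        majority = Counter(y for _, y in data).most_common(1)[0][0]
        leaf = {"leaf": majority}
        # keep the subtree only if it beats a single leaf on the pruning set
        return leaf if errors(leaf, data) <= errors(node, data) else node

    tree = {"attr": "x1", "w": 0.5,
            "le": {"attr": "x2", "w": 0.3,
                   "le": {"leaf": "C1"}, "gt": {"leaf": "C2"}},
            "gt": {"leaf": "C1"}}
    pruning_set = [({"x1": 0.7, "x2": 0.1}, "C1"), ({"x1": 0.2, "x2": 0.6}, "C2")]
    print(prune(tree, pruning_set))   # the x2 subtree collapses to a leaf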

9.4 Rule Extraction from Trees

Every path from the root to a leaf can be written as an IF-THEN rule whose conditions are the tests on the path and whose conclusion is the leaf's label, e.g.

  IF (x_1 \le w_10) AND (x_2 > w_20) THEN class = C_2

Rule support: the percentage of training data covered by the rule.
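
A minimal sketch that walks a univariate tree (the same kind of hypothetical nested dict as in the pruning sketch) and prints one IF-THEN rule per leaf.

    tree = {"attr": "x1", "w": 0.5,
            "le": {"attr": "x2", "w": 0.3,
                   "le": {"leaf": "C1"}, "gt": {"leaf": "C2"}},
            "gt": {"leaf": "C1"}}

    def extract_rules(node, conditions=()):
        if "leaf" in node:
            lhs = " AND ".join(conditions) if conditions else "TRUE"
            print(f"IF {lhs} THEN class = {node['leaf']}")
            return
        attr, w = node["attr"], node["w"]
        extract_rules(node["le"], conditions + (f"({attr} <= {w})",))
        extract_rules(node["gt"], conditions + (f"({attr} > {w})",))

    extract_rules(tree)
    # The support of each rule would be the fraction of training examples
    # satisfying all the conditions on its left-hand side.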

9.6 Multivariate Trees

At a decision node m, all input dimensions can be used together to split the node. When all inputs are numeric, a linear multivariate node uses the test

  f_m(x): \mathbf{w}_m^T \mathbf{x} + w_{m0} > 0
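
A minimal sketch of such a node: the test is a hyperplane over all inputs rather than a threshold on a single attribute (the weights below are hypothetical, not learned).

    def linear_node(x, w, w0):
        # take the "true" branch when w^T x + w0 > 0
        return sum(wi * xi for wi, xi in zip(w, x)) + w0 > 0

    print(linear_node([1.0, 2.0], w=[0.5, -0.25], w0=-0.1))   # -> False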

A quadratic multivariate node:

  f_m(x): \mathbf{x}^T \mathbf{W}_m \mathbf{x} + \mathbf{w}_m^T \mathbf{x} + w_{m0} > 0

A sphere node:

  f_m(x): \| \mathbf{x} - \mathbf{c}_m \| > \alpha_m

where c_m is the center and \alpha_m is the radius.