Decision tree LING 572 Fei Xia 1/10/06

Outline: Basic concepts, Main issues, Advanced topics

Basic concepts

A classification problem

District  | House type    | Income | Previous customer | Outcome
Suburban  | Detached      | High   | No                | Nothing
Suburban  | Semi-detached | High   | Yes               | Respond
Rural     | Semi-detached | Low    | No                | Respond
Urban     | Detached      | Low    | Yes               | Nothing
…
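For later illustration, the table could be encoded as a list of attribute-value dictionaries. This is a minimal sketch in Python; the attribute and value names are taken from the table, and the remaining rows are elided just as in the slide:

training_data = [
    {"District": "Suburban", "House type": "Detached",      "Income": "High", "Previous customer": "No",  "Outcome": "Nothing"},
    {"District": "Suburban", "House type": "Semi-detached", "Income": "High", "Previous customer": "Yes", "Outcome": "Respond"},
    {"District": "Rural",    "House type": "Semi-detached", "Income": "Low",  "Previous customer": "No",  "Outcome": "Respond"},
    {"District": "Urban",    "House type": "Detached",      "Income": "Low",  "Previous customer": "Yes", "Outcome": "Nothing"},
    # ... remaining rows elided, as in the table above
]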

Classification and estimation problems
Given:
– a finite set of (input) attributes: features. Ex: District, House type, Income, Previous customer
– a target attribute: the goal. Ex: Outcome: {Nothing, Respond}
– training data: a set of classified examples in attribute-value representation. Ex: the previous table
Predict the value of the goal given the values of the input attributes:
– The goal is a discrete variable → classification problem
– The goal is a continuous variable → estimation problem

Decision tree

District?
– Suburban (3/5) → House type?
  – Detached (2/2) → Nothing
  – Semi-detached (3/3) → Respond
– Rural (4/4) → Respond
– Urban (3/5) → Previous customer?
  – Yes (3/3) → Nothing
  – No (2/2) → Respond

Decision tree representation
Each internal node is a test:
– Theoretically, a node can test multiple attributes
– In most systems, a node tests exactly one attribute
Each branch corresponds to test results:
– A branch corresponds to an attribute value or a range of attribute values
Each leaf node assigns:
– a class: decision tree
– a real value: regression tree

What’s the best decision tree?
“Best”: you need a bias (e.g., prefer the “smallest” tree): least depth? Fewest nodes? Which trees are the best predictors of unseen data?
Occam's Razor: we prefer the simplest hypothesis that fits the data.
→ Find a decision tree that is as small as possible and fits the data.

Finding the smallest decision tree
A decision tree can represent any discrete function of the inputs: y = f(x1, x2, …, xn).
The space of decision trees is too big for a systematic search for the smallest decision tree.
Solution: a greedy algorithm.

Basic algorithm: top-down induction
– Find the “best” decision attribute, A, and assign A as the decision attribute for the node.
– For each value of A, create a new branch and divide up the training examples.
– Repeat the process until the gain is small enough.
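A minimal Python sketch of this greedy loop, assuming discrete attributes and examples stored as dicts (as in the earlier table); information gain, introduced under Q1 below, serves as the quality measure. This is an illustrative sketch, not the exact ID3/C4.5 implementation:

import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr, target):
    total = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def build_tree(examples, attributes, target, min_gain=1e-6):
    """Greedy top-down induction: pick the best attribute, split, recurse."""
    majority = Counter(ex[target] for ex in examples).most_common(1)[0][0]
    if not attributes:
        return majority                      # leaf: majority class
    gains = {a: info_gain(examples, a, target) for a in attributes}
    best = max(gains, key=gains.get)
    if gains[best] < min_gain:               # stop when the gain is small enough
        return majority
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target, min_gain)
    return tree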

Major issues

Q1: Choosing the best attribute: what quality measure to use?
Q2: Determining when to stop splitting: avoiding overfitting
Q3: Handling continuous attributes
Q4: Handling training data with missing attribute values
Q5: Handling attributes with different costs
Q6: Dealing with a continuous goal attribute

Q1: What quality measure? Information gain, Gain Ratio, …

Entropy of a training set
S is a sample of training examples. Entropy is one way of measuring the impurity of S:
Entropy(S) = - Σ_c P_c log2 P_c
where P_c is the proportion of examples in S whose target attribute has value c.

Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
Choose the A with the maximum information gain (equivalently, the A with the minimum average entropy of the subsets).

An example
Humidity split: S = [9+,5-], E = 0.940
– High: [3+,4-], E = 0.985
– Normal: [6+,1-], E = 0.592
– InfoGain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Wind split: S = [9+,5-], E = 0.940
– Weak: [6+,2-], E = 0.811
– Strong: [3+,3-], E = 1.00
– InfoGain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
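A short sketch that reproduces these numbers directly from the class counts (pure Python; the counts are the [9+,5-] figures above):

import math

def entropy_from_counts(pos, neg):
    total = pos + neg
    e = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

def info_gain(parent, branches):
    """parent and each branch are (pos, neg) count pairs."""
    total = sum(parent)
    return entropy_from_counts(*parent) - sum(
        (p + n) / total * entropy_from_counts(p, n) for p, n in branches)

print(info_gain((9, 5), [(3, 4), (6, 1)]))   # Humidity: ~0.151
print(info_gain((9, 5), [(6, 2), (3, 3)]))   # Wind:     ~0.048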

Other quality measures
Problem with information gain: it prefers attributes with many values.
An alternative: Gain Ratio
GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)
SplitInfo(S, A) = - Σ_i (|S_i| / |S|) log2 (|S_i| / |S|)
where S_i is the subset of S for which A has value v_i.
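A sketch of the ratio, reusing the Humidity split from the previous slide (numbers for illustration only):

import math

def split_info(subset_sizes):
    """subset_sizes: the sizes |S_i| of the subsets induced by attribute A."""
    total = sum(subset_sizes)
    return -sum(c / total * math.log2(c / total) for c in subset_sizes if c)

def gain_ratio(gain, subset_sizes):
    return gain / split_info(subset_sizes)

# Humidity: gain 0.151, subsets of size 7 and 7 -> SplitInfo = 1.0
print(gain_ratio(0.151, [7, 7]))   # 0.151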

Q2: Avoiding overfitting
Overfitting occurs when the decision tree captures too much detail or noise in the training data.
Consider the error of hypothesis h over:
– the training data: ErrorTrain(h)
– the entire distribution D of the data: ErrorD(h)
A hypothesis h overfits the training data if there is an alternative hypothesis h' such that:
– ErrorTrain(h) < ErrorTrain(h'), and
– ErrorD(h) > ErrorD(h')

How to avoid overfitting
Stop growing the tree earlier:
– Ex: InfoGain < threshold
– Ex: number of examples in a node < threshold
– …
Grow the full tree, then post-prune.
→ In practice, the latter works better than the former.

Post-pruning
Split the data into a training set and a validation set.
Do until further pruning is harmful:
– Evaluate the impact on the validation set of pruning each possible node (plus those below it)
– Greedily remove the ones that don't improve performance on the validation set
→ Produces a smaller tree with the best performance measure.
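A rough sketch of this procedure (reduced-error pruning) over the dict-based tree from the earlier induction sketch. The single bottom-up pass here is a simplification of the repeated greedy passes described on the slide:

from collections import Counter

def classify(tree, example):
    # tree is either a class label (leaf) or {attribute: {value: subtree}}
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(example[attr])
    return tree

def accuracy(tree, examples, target):
    return sum(classify(tree, ex) == ex[target] for ex in examples) / len(examples)

def prune(tree, train, validation, target):
    """Replace a subtree with the majority class of its training examples
    whenever that does not hurt accuracy on the validation examples."""
    if not isinstance(tree, dict) or not train:
        return tree
    attr, branches = next(iter(tree.items()))
    for value in list(branches):
        sub_train = [ex for ex in train if ex[attr] == value]
        sub_val = [ex for ex in validation if ex[attr] == value]
        branches[value] = prune(branches[value], sub_train, sub_val, target)
    majority = Counter(ex[target] for ex in train).most_common(1)[0][0]
    if validation and accuracy(majority, validation, target) >= accuracy(tree, validation, target):
        return majority                      # the leaf does at least as well
    return tree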

Performance measures
Accuracy:
– on validation data
– K-fold cross validation
Misclassification cost: sometimes more accuracy is desired for some classes than for others.
MDL: size(tree) + errors(tree)
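A sketch of k-fold cross-validation for the accuracy option (build and classify are placeholders for the induction and prediction routines sketched earlier):

def k_fold_accuracy(examples, k, build, classify, target):
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        tree = build(train)
        correct = sum(classify(tree, ex) == ex[target] for ex in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k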

Rule post-pruning
– Convert the tree to an equivalent set of rules.
– Prune each rule independently of the others.
– Sort the final rules into the desired sequence for use.
Perhaps the most frequently used method (e.g., in C4.5).
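A sketch of the first step over the dict-based tree used earlier: enumerate each root-to-leaf path as a (conditions, class) rule, after which each rule can be pruned on its own:

def tree_to_rules(tree, conditions=()):
    """Return a list of (list of (attribute, value) tests, class label) rules."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    attr, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules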

Q3: Handling numeric attributes
Continuous attribute → discrete attribute
Example:
– Original attribute: Temperature = 82.5
– New attribute: (Temperature > 72.3) = t, f
→ Question: how to choose the thresholds?

Choosing thresholds for a continuous attribute
– Sort the examples according to the continuous attribute.
– Identify adjacent examples that differ in their target classification → a set of candidate thresholds.
– Choose the candidate with the highest information gain.
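A sketch of the first two steps (the Temperature values and labels below are invented for illustration; the 72.3 threshold on the previous slide comes from a different dataset):

def candidate_thresholds(examples, attr, target):
    """Midpoints between adjacent sorted values whose class labels differ."""
    ordered = sorted(examples, key=lambda ex: ex[attr])
    thresholds = []
    for a, b in zip(ordered, ordered[1:]):
        if a[target] != b[target] and a[attr] != b[attr]:
            thresholds.append((a[attr] + b[attr]) / 2)
    return thresholds

data = [{"Temperature": t, "Play": c} for t, c in
        [(48, "No"), (60, "Yes"), (72, "Yes"), (80, "Yes"), (90, "No")]]
print(candidate_thresholds(data, "Temperature", "Play"))   # [54.0, 85.0]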

Q4: Unknown attribute values
Assume an attribute can take the value “blank”. Possible strategies:
– Assign the most common value of A among the training data at node n.
– Assign the most common value of A among the training data at node n that have the same target class.
– Assign a probability p_i to each possible value v_i of A:
  – Assign a fraction (p_i) of the example to each descendant in the tree.
  – This method is used in C4.5.
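A sketch of the last strategy (fractional counts): examples carry a weight, and an example whose value is missing is sent down every branch with a weight scaled by that branch's share of the known values. This captures the spirit of C4.5's treatment, not its exact implementation:

def fractional_split(weighted_examples, attr):
    """weighted_examples: (weight, example dict) pairs; a missing value is None."""
    known = [(w, ex) for w, ex in weighted_examples if ex[attr] is not None]
    missing = [(w, ex) for w, ex in weighted_examples if ex[attr] is None]
    total_known = sum(w for w, _ in known)
    if total_known == 0:
        return {}
    branches = {}
    for w, ex in known:
        branches.setdefault(ex[attr], []).append((w, ex))
    for value, subset in branches.items():
        share = sum(w for w, _ in subset) / total_known
        subset.extend((w * share, ex) for w, ex in missing)   # fractional count
    return branches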

Q5: Attributes with costs
Consider medical diagnosis: a test (e.g., a blood test) has a cost.
Question: how to learn a consistent tree with low expected cost?
One approach: replace the gain by Gain²(S, A) / Cost(A) (Tan and Schlimmer, 1990).
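A sketch of that measure; the squared-gain-over-cost form is the one usually attributed to Tan and Schlimmer, so treat the exact formula here as an assumption:

def cost_sensitive_gain(gain, cost):
    """Score an attribute by information gain squared, divided by its cost."""
    return gain ** 2 / cost

# A cheap test can beat a slightly more informative but expensive one:
print(cost_sensitive_gain(0.151, cost=1.0))    # ~0.0228
print(cost_sensitive_gain(0.20,  cost=10.0))   # 0.004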

Q6: Dealing with a continuous goal attribute → regression tree
A variant of decision trees.
Estimation problem: approximate real-valued functions, e.g., the crime rate.
A leaf node is marked with a real value or a linear function, e.g., the mean of the target values of the examples at the node.
Measure of impurity: e.g., variance, standard deviation, …
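A sketch of variance as the impurity measure for scoring a regression-tree split (the target values are invented for illustration):

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent_values, split_subsets):
    """Parent variance minus the size-weighted variance of the subsets."""
    total = len(parent_values)
    weighted = sum(len(s) / total * variance(s) for s in split_subsets)
    return variance(parent_values) - weighted

parent = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]
print(variance_reduction(parent, [[1.0, 1.2, 0.9], [3.8, 4.1, 4.0]]))   # large reduction -> good split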

Summary of major issues
Q1: Choosing the best attribute: different quality measures.
Q2: Determining when to stop splitting: stop earlier or post-prune.
Q3: Handling continuous attributes: find the breakpoints.

Summary of major issues (cont.)
Q4: Handling training data with missing attribute values: blank value, most common value, or fractional counts.
Q5: Handling attributes with different costs: use a quality measure that includes the cost factors.
Q6: Dealing with a continuous goal attribute: various ways of building regression trees.

Common algorithms: ID3, C4.5, CART

ID3
Proposed by Quinlan (as is C4.5).
Can handle basic cases: discrete attributes, no missing information, etc.
Uses information gain as the quality measure.

C4.5
An extension of ID3:
– Several quality measures
– Incomplete information (missing attribute values)
– Numerical (continuous) attributes
– Pruning of decision trees
– Rule derivation
– Random mode and batch mode

CART (Classification And Regression Trees)
Proposed by Breiman et al. (1984).
Constant numerical values in the leaves.
Variance as the measure of impurity.

Strengths of decision tree methods
– Ability to generate understandable rules
– Ease of calculation at classification time
– Ability to handle both continuous and categorical variables
– Ability to clearly indicate the best attributes

Weaknesses of decision tree methods
– Greedy algorithm: no global optimization.
– Error-prone with too many classes: the number of training examples per node shrinks quickly in a tree with many levels/branches.
– Expensive to train: sorting, combinations of attributes, calculating quality measures, etc.
– Trouble with non-rectangular regions: the rectangular classification boxes may not correspond well with the actual distribution of records in the decision space.

Advanced topics

Combining multiple models
Top-down decision tree induction is inherently unstable: different training datasets from a given problem domain will produce quite different trees.
Techniques:
– Bagging
– Boosting

Bagging
Introduced by Breiman.
It creates multiple decision trees, each trained on a different bootstrap sample of the training data, and combines their predictions by voting.
This addresses some of the instability inherent in regular ID3.
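A minimal sketch of the idea, assuming build_tree and classify routines like those sketched earlier (the names and one-argument signature are placeholders):

import random
from collections import Counter

def bagging_predict(train, test_example, build_tree, classify, n_trees=25, seed=0):
    """Train each tree on a bootstrap resample of the training data and
    combine the trees by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        sample = [rng.choice(train) for _ in range(len(train))]   # bootstrap sample
        tree = build_tree(sample)
        votes.append(classify(tree, test_example))
    return Counter(votes).most_common(1)[0][0]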

Boosting
Introduced by Freund and Schapire.
It maintains a weight for each training instance; instances that the current trees misclassify receive higher weights, so later trees focus on the examples that earlier trees got wrong, and the trees are combined by a weighted vote.
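The slide describes boosting loosely; below is a minimal AdaBoost-style sketch (one common boosting scheme) that uses scikit-learn decision stumps as the weak learner. scikit-learn and numpy are assumed to be available, and labels must be encoded as +1/-1:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    """Misclassified examples get larger weights, so later stumps focus on them."""
    X, y = np.asarray(X), np.asarray(y)          # y must be +1/-1
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(alpha * stump.predict(np.asarray(X)) for stump, alpha in zip(stumps, alphas))
    return np.sign(scores)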

Summary
Basic case:
– Discrete input attributes
– Discrete goal attribute
– No missing attribute values
– Same cost for all tests and all kinds of misclassification
Extended cases:
– Continuous attributes
– Real-valued goal attribute
– Some examples missing some attribute values
– Some tests more expensive than others
– Some misclassifications more serious than others

Summary (cont.)
Basic algorithm:
– Greedy algorithm
– Top-down induction
– Bias for small trees
Major issues: Q1-Q6 above.

Uncovered issues
– Incremental decision tree induction?
– How can a decision relate to other decisions? What's the order of making the decisions? (e.g., POS tagging)
– What's the difference between a decision tree and a decision list?