CS 760 – Machine Learning (UW-Madison), Lecture #12
© Jude Shavlik 2006, David Page 2007

Slide 1: 5-Slide Example: Gene Chip Data

Slide 2: Decision Trees in One Picture

Slide 3: Example: Gene Expression

Decision tree:
  AD_X57809_at <= : myeloma (74)
  AD_X57809_at >  : normal (31)

Leave-one-out cross-validation accuracy estimate: 97.1%
X57809: IGL (immunoglobulin lambda locus)

Slide 4: Problem with Result

It is easy to predict accurately with genes related to immune function, such as IGL, but this gives us no new insight. Fix: eliminate these genes prior to training. Noticing the problem at all is possible because of the comprehensibility of decision trees.

Slide 5: Ignoring Genes Associated with Immune Function

Decision tree:
  AD_X04898_rna1_at <= : normal (30)
  AD_X04898_rna1_at >  : myeloma (74/1)

X04898: APOA2 (Apolipoprotein AII)
Leave-one-out accuracy estimate: 98.1%

Slide 6: Another Tree

AD_M15881_at > 992: normal (28)
AD_M15881_at <= 992:
  AC_D82348_at = A: normal (3)
  AC_D82348_at = P: myeloma (74)

Slide 7: A Measure of Node Purity

Let f+ = fraction of positive examples and f- = fraction of negative examples, i.e., f+ = p / (p + n) and f- = n / (p + n), where p = #pos and n = #neg.

Under an optimal code, the information needed (expected number of bits) to label one example is

  Info(f+, f-) = -f+ lg(f+) - f- lg(f-)

This is also called the entropy of the set of examples (derived later).
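As a minimal runnable sketch of this two-class entropy (the function and variable names are mine, not from the lecture):

```python
import math

def info(f_pos, f_neg):
    """Two-class entropy in bits: -f+ lg(f+) - f- lg(f-).
    The 0 * lg(0) terms are taken to be 0, as usual."""
    return -sum(f * math.log2(f) for f in (f_pos, f_neg) if f > 0)

print(info(1.0, 0.0))  # 0.0  (pure node)
print(info(0.5, 0.5))  # 1.0  (50-50 mixture)
```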

Slide 8: Another Commonly Used Measure of Node Purity

Gini index: (f+)(f-)

Used in CART (Classification and Regression Trees, Breiman et al., 1984).
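A sketch of the two-class Gini measure exactly as written on the slide (the more common 1 - sum of squared class fractions is just twice this quantity for two classes); names are mine:

```python
def gini(f_pos, f_neg):
    """Two-class Gini impurity as on the slide: f+ * f-.
    0 for a pure node, maximal (0.25) for a 50-50 mixture."""
    return f_pos * f_neg

print(gini(1.0, 0.0))  # 0.0
print(gini(0.5, 0.5))  # 0.25
```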

Slide 9: Info(f+, f-): Consider the Extreme Cases

All same class (+, say): Info(1, 0) = -1 lg(1) - 0 lg(0) = 0   (taking 0 lg(0) = 0 by definition)
50-50 mixture:           Info(½, ½) = 2 [-½ lg(½)] = 1

(Plot on slide: I(f+, 1 - f+) as a function of f+.)

Slide 10: Evaluating a Feature

How much does it help to know the value of attribute/feature A?

Assume A divides the current set of examples into N groups. Let
  q_i  = fraction of data on branch i
  f_i+ = fraction of +'s on branch i
  f_i- = fraction of -'s on branch i

Slide 11: Evaluating a Feature (cont.)

  E(A) ≡ Σ_{i=1..N} q_i × I(f_i+, f_i-)

This is the info needed after determining the value of attribute A (another expected-value calculation).

(Diagram on slide: node A with branches v_1, ..., v_N; I(f+, f-) at the root and I(f_1+, f_1-), ..., I(f_N+, f_N-) at the children.)

Slide 12: Info Gain

  Gain(A) ≡ I(f+, f-) - E(A)

This is our scoring function in our hill-climbing (greedy) algorithm. Since I(f+, f-) is constant for all features, picking the A with the largest gain amounts to picking the A with the smallest E(A). That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution.
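A runnable sketch tying slides 10-12 together (it happens to handle any number of classes, not just two); the function names are mine, not from the lecture:

```python
import math
from collections import Counter, defaultdict

def info(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Gain(A) = I(f+, f-) - E(A), where E(A) = sum_i q_i * I(branch i)."""
    n = len(labels)
    branches = defaultdict(list)
    for value, label in zip(feature_values, labels):
        branches[value].append(label)
    expected = sum((len(b) / n) * info(b) for b in branches.values())
    return info(labels) - expected
```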

Slide 13: Example Info-Gain Calculation

Color    Shape    Size     Class
Red      (icon)   BIG      +
Red      (icon)   BIG      +
Yellow   (icon)   SMALL    -
Red      (icon)   SMALL    -
Blue     (icon)   BIG      +

(The Shape values were drawn as icons on the slide and are not captured in the transcript.)
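Working the gain calculation through by hand on this table (treating Shape as unavailable), the numbers come out roughly as follows, matching what the info_gain sketch above computes:

  I(3/5, 2/5) = -(3/5) lg(3/5) - (2/5) lg(2/5) ≈ 0.971 bits
  E(Size): the BIG branch is all + and the SMALL branch is all -, so E(Size) = 0 and Gain(Size) ≈ 0.971
  E(Color): Red = {+, +, -}, Yellow = {-}, Blue = {+}, so E(Color) = (3/5) × I(2/3, 1/3) ≈ 0.6 × 0.918 ≈ 0.551 and Gain(Color) ≈ 0.420

So Size scores highest.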

Slide 14: Info-Gain Calculation (cont.)

Note that "Size" provides complete classification, so we are done.

Slide 15: (no text captured; figure-only slide)

Slide 16: ID3 Info Gain Measure Justified
(Ref: C4.5, J. R. Quinlan, Morgan Kaufmann, 1993, pp. 21-22)

Definition of information (due to Shannon): the info conveyed by a message M depends on its probability,

  info(M) = -lg( Prob(M) )

Select an example from a set S and announce that it belongs to class C. The probability of this occurring is the fraction of C's in S, so the info in this announcement is, by definition,

  -lg( freq(C, S) / |S| )  bits

Slide 17: ID3 Info Gain Measure (cont.)

Let there be K different classes in set S: C_1, ..., C_K. What is the expected info from a message about the class of an example in S?

  Info(S) = - Σ_{j=1..K} ( freq(C_j, S) / |S| ) × lg( freq(C_j, S) / |S| )

Info(S) is the average number of bits of information (obtained by looking at feature values) needed to classify a member of set S.

Slide 18: Handling Hierarchical Features in ID3

Define a new feature for each level in the hierarchy, e.g.,

  Shape1 = {Circular, Polygonal}
  Shape2 = {the specific shapes under each, drawn as icons on the slide}

(Diagram on slide: Shape splits into Circular and Polygonal, each with its specific shapes beneath it.)

Let ID3 choose the appropriate level of abstraction!
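One way to realize this, as a small sketch; the leaf-level shape names are my own invented stand-ins for the icons on the slide:

```python
# Hypothetical leaf-level shapes mapped to their parent in the hierarchy.
PARENT = {
    "circle": "Circular", "ellipse": "Circular",
    "square": "Polygonal", "triangle": "Polygonal", "hexagon": "Polygonal",
}

def add_hierarchical_features(example):
    """Expand a raw shape into one feature per level of the hierarchy,
    so the tree learner can split at whichever level scores best."""
    shape2 = example["shape"]            # most specific level
    example["Shape1"] = PARENT[shape2]   # coarser level
    example["Shape2"] = shape2
    return example

print(add_hierarchical_features({"shape": "triangle"}))
# {'shape': 'triangle', 'Shape1': 'Polygonal', 'Shape2': 'triangle'}
```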

Slide 19: Handling Numeric Features in ID3

On the fly, create binary features and choose the best:
Step 1: Plot the current examples along the feature's value (green = pos, red = neg on the slide; the axis is "Value of Feature").
Step 2: Divide midway between every consecutive pair of points with different categories to create new binary features, e.g., feature_new1 = (F < 8) and feature_new2 = (F < 10).
Step 3: Choose the split with the best info gain.
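A sketch of Step 2's candidate-threshold generation (names are mine; Step 3 would score each candidate with an info-gain function like the one sketched earlier):

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive (sorted) values whose labels differ.
    Each midpoint t yields a candidate binary feature F < t."""
    pairs = sorted(zip(values, labels))
    thresholds = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            thresholds.append((v1 + v2) / 2.0)
    return thresholds

# Tiny example: positives cluster at low values, negatives at high ones.
print(candidate_thresholds([1, 3, 7, 9, 11], ["+", "+", "+", "-", "-"]))  # [8.0]
```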

Slide 20: Handling Numeric Features (cont.)

Note: a numeric feature cannot be discarded after use in one portion of the d-tree.

(Diagram on slide: a tree that tests F < 10 at the root and then tests F against another threshold again inside one of its subtrees.)

Slide 21: Characteristic Property of Using the Info-Gain Measure

It FAVORS FEATURES WITH HIGH BRANCHING FACTORS (i.e., many possible values).

Extreme case: splitting on Student ID puts at most one example per leaf, so every I(., .) score at the leaves equals zero and the feature gets a perfect score! But it generalizes very poorly (i.e., it memorizes the data).

Slide 22: Fix: Method 1

Convert all features to binary, e.g., Color = {Red, Blue, Green} becomes
  Color=Red?   {True, False}
  Color=Blue?  {True, False}
  Color=Green? {True, False}
That is, go from 1 N-valued feature to N binary features. This encoding is also used in neural nets and SVMs. D-tree readability is probably reduced, but not necessarily.
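A minimal sketch of this one-hot-style conversion (function name is mine):

```python
def to_binary_features(examples, feature, values):
    """Replace one N-valued feature with N boolean 'feature=value?' features."""
    for ex in examples:
        for v in values:
            ex[f"{feature}={v}?"] = (ex[feature] == v)
        del ex[feature]
    return examples

data = [{"Color": "Red"}, {"Color": "Green"}]
print(to_binary_features(data, "Color", ["Red", "Blue", "Green"]))
# [{'Color=Red?': True, 'Color=Blue?': False, 'Color=Green?': False}, ...]
```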

Slide 23: Fix: Method 2

Find the info content in the answer to "What is the value of feature A?", ignoring the output category:

  SplitInfo(A) = - Σ_i q_i × lg(q_i),   where q_i = fraction of all examples with A = i

Choose the A that maximizes the ratio

  Gain(A) / SplitInfo(A)

Read the text (Mitchell) for exact details!
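A sketch of that split-information penalty and the resulting ratio (names are mine; the Gain value would come from an info-gain function like the one sketched earlier):

```python
import math
from collections import Counter

def split_info(feature_values):
    """Info content of feature A's value itself, ignoring the class:
    -sum_i q_i lg(q_i), with q_i = fraction of examples having A = i."""
    n = len(feature_values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(feature_values).values())

def gain_ratio(gain, feature_values):
    """Penalize many-valued features: Gain(A) / SplitInfo(A)."""
    si = split_info(feature_values)
    return gain / si if si > 0 else 0.0

# A feature with many distinct values pays a larger SplitInfo penalty:
print(split_info(["a", "b", "c", "d"]))  # 2.0 bits
print(split_info(["a", "a", "b", "b"]))  # 1.0 bit
```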

Slide 24: Fix: Method 3

Group the values of nominal features, e.g., use a binary split that groups the values of Color ∈ {R, B, G, Y} rather than a 4-way branch (the slide shows the two trees side by side). Done in CART (Breiman et al., 1984). Breiman et al. proved that for the 2-category case, the optimal binary partition can be found by considering only O(N) possibilities instead of O(2^N).

Slide 25: Multiple Category Classification – Method 1 (used in SVMs)

Approach 1: Learn one tree (i.e., model) per category, then pass each test example through every tree. What happens if a test example is predicted to lie in multiple categories? In none?

Slide 26: Multiple Category Classification – Method 2

Approach 2: Learn one tree in total. It subdivides the full feature space such that every point belongs to one and only one category. (The drawing on the slide is slightly misleading.)

Slide 27: Noise - Major Issue in ML

Worst case of noise: a + and a - at the same point in feature space.

Causes of noise:
1. Too few features ("hidden variables") or too few possible values
2. Incorrectly reported/measured/judged feature values
3. Mis-classified instances

Slide 28: Noise - Major Issue in ML (cont.)

Overfitting: producing an "awkward" concept because of a few "noisy" points.

(The slide contrasts two decision boundaries, labeled "Bad performance on future ex's?" and "Better performance?")

Slide 29: Overfitting Viewed in Terms of Function-Fitting

(Plot on slide: f(x) vs. x, with "Data = Red Line + Noise", a fitted "Model", and the question "Underfitting?")

Slide 30: Definition of Overfitting

Assuming the test set is large enough to be representative: concept C overfits the training data if there exists a simpler concept S such that

  training-set accuracy of C > training-set accuracy of S
but
  test-set accuracy of C < test-set accuracy of S

Slide 31: Remember!

It is easy to learn/fit the training data. What's hard is generalizing well to future ("test set") data! Overfitting avoidance (reduction, really) is the key issue in ML.

Slide 32: Can One Underfit?

Sure, if not fully fitting the training set. E.g., just return the majority category (+ or -) in the train set as the learned model.
But also if there is not enough data to illustrate important distinctions. E.g., color may be important, but all examples seen are red, so there is no reason to include color and make the model more complex.

Slide 33: ID3 & Noisy Data

To avoid overfitting, one could allow splitting to stop before all examples at a node are of one class. Early stopping was Quinlan's original idea, but post-pruning is now seen as better: it is more robust to the weaknesses of greedy algorithms (e.g., it benefits from seeing the full tree, in case a node only temporarily looked bad).

Slide 34: ID3 & Noisy Data (cont.)

Build the complete tree, then use some "spare" (tuning) examples to decide which parts of the tree can be pruned. This is called "Reduced [tune-set] Error Pruning."

Slide 35: ID3 & Noisy Data (cont.)

See which subtree replacement gives the highest tune-set accuracy, then repeat (i.e., another greedy algorithm).

(The slide's figure marks a candidate subtree with "Discard (replace by leaf)?" and asks "Better tune-set accuracy?")

Slide 36: Greedily Pruning D-Trees - Sample (Hill-Climbing) Search Space

(Diagram on slide: the space of candidate pruned trees; keep the "best" candidate at each step and stop if the best is not an improvement.)

Slide 37: Greedily Pruning D-Trees - Pseudocode

1. Run ID3 to fully fit the TRAIN set; measure accuracy on TUNE.
2. Consider all subtrees where ONE interior node is removed and replaced by a leaf labeled with the majority category in the pruned subtree.
   Choose the best such subtree if it makes progress on TUNE; if no improvement, quit.
3. Go to 2.
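A runnable sketch of this reduced-error-pruning loop over a toy nested-dict tree representation. The representation, the field names (including the assumption that each interior node stores its training-set "majority" label), and all function names are mine, not from the lecture:

```python
import copy

# Toy representation: interior node = {"feature": name, "branches": {value: subtree},
#                                      "majority": label};  leaf = class-label string.

def predict(tree, example):
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["feature"]], tree["majority"])
    return tree

def tune_accuracy(tree, tune_set):
    return sum(predict(tree, ex) == ex["label"] for ex in tune_set) / len(tune_set)

def interior_nodes(tree, path=()):
    """Yield the branch-value path to every interior node."""
    if isinstance(tree, dict):
        yield path
        for value, sub in tree["branches"].items():
            yield from interior_nodes(sub, path + (value,))

def pruned_copy(tree, path):
    """Copy of `tree` with the interior node at `path` replaced by its majority-class leaf."""
    new = copy.deepcopy(tree)
    parent, node = None, new
    for value in path:
        parent, node = node, node["branches"][value]
    if parent is None:                       # pruning the root
        return node["majority"]
    parent["branches"][path[-1]] = node["majority"]
    return new

def reduced_error_prune(tree, tune_set):
    """Slide 37's loop: repeatedly replace the ONE interior node whose removal most
    improves tune-set accuracy; quit when the best replacement is not an improvement."""
    best_acc = tune_accuracy(tree, tune_set)
    while isinstance(tree, dict):
        candidates = [pruned_copy(tree, p) for p in interior_nodes(tree)]
        acc, best = max(((tune_accuracy(c, tune_set), c) for c in candidates),
                        key=lambda x: x[0])
        if acc <= best_acc:                  # no progress on TUNE: quit
            return tree
        best_acc, tree = acc, best
    return tree
```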

Slide 38: Train/Tune/Test Accuracies
(The same sort of curves arise for other tuned parameters in other algorithms.)

(Plot on slide: accuracy, up to 100%, vs. amount of pruning, with Train, Tune, and Test curves; the "Chosen Pruned Tree" and the "Ideal tree to choose" are marked.)

Slide 39: The Tradeoff in Greedy Algorithms: Efficiency vs. Optimality

(Diagram on slide: an initial tree with nodes R, A, B, C, D, E, F. The true best cuts would discard C's and F's subtrees, but the single best greedy cut discards B's subtrees, and that choice is irrevocable.)

Greedy search: a powerful, general-purpose trick of the trade.

Slide 40: Pruning in C4.5 (Successor to ID3)

- Works bottom-up in a single pass, so it is very fast
- Can replace the subtree rooted at a node with either a leaf or the best child of that node
- Does not use a tuning set, yet works surprisingly well

Slide 41: Decision "Stumps"

Holte (MLJ) compared decision trees with only one decision (decision stumps) vs. trees produced by C4.5 (with its pruning algorithm used). Decision stumps do remarkably well on the UC Irvine data sets. Is the archive too easy? Some datasets seem to be. Decision stumps are a "quick and dirty" control for comparisons against new algorithms, but C4.5 is easy to use and probably a better control.

Slide 42: C4.5 Compared to 1R ("Decision Stumps")

Test-set accuracy (see the Holte paper for the dataset key, e.g., HD = heart disease):

Dataset   C4.5     1R
BC        72.0%    68.7%
CH        99.2%    68.7%
GL        63.2%    67.6%
G2        74.3%    53.8%
HD        73.6%    72.9%
HE        81.2%    76.3%
HO        83.6%    81.0%
HY        99.1%    97.2%
IR        93.8%    93.5%
LA        77.2%    71.5%
LY        77.5%    70.7%
MU        100.0%   98.4%
SE        97.7%    95.0%
SO        97.5%    81.0%
VO        95.6%    95.2%
V1        89.4%    86.8%

Slide 43: Generating IF-THEN Rules from Trees

- Antecedent: conjunction of all decisions leading to the terminal node
- Consequent: label of the terminal node

Example (tree on slide): COLOR? at the root with branches Green, Blue, and Red; the Red branch leads to SIZE? with branches Big and Small. The resulting rules are listed on the next slide.

Slide 44: Generating Rules (cont.)

The previous example generates these rules:
  If Color = Green  ->  Output = -
  If Color = Blue   ->  Output = +
  If Color = Red and Size = Big    ->  +
  If Color = Red and Size = Small  ->  -

Note:
1. Can "clean up" the rule set (next slide).
2. Decision trees learn disjunctive concepts.
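A small sketch of this rule extraction by enumerating root-to-leaf paths, using the same toy nested-dict tree shape as the pruning sketch above (the representation and names are mine):

```python
# Toy representation: interior node = {"feature": name, "branches": {value: subtree}},
# leaf = class-label string.
TREE = {"feature": "Color",
        "branches": {"Green": "-",
                     "Blue": "+",
                     "Red": {"feature": "Size",
                             "branches": {"Big": "+", "Small": "-"}}}}

def tree_to_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one IF-THEN rule:
    antecedent = conjunction of decisions on the path, consequent = leaf label."""
    if not isinstance(tree, dict):            # reached a terminal node
        yield (conditions, tree)
        return
    for value, subtree in tree["branches"].items():
        yield from tree_to_rules(subtree, conditions + ((tree["feature"], value),))

for antecedent, label in tree_to_rules(TREE):
    test = " and ".join(f"{f}={v}" for f, v in antecedent)
    print(f"If {test} -> Output = {label}")
# If Color=Green -> Output = -
# If Color=Blue -> Output = +
# If Color=Red and Size=Big -> Output = +
# If Color=Red and Size=Small -> Output = -
```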

Slide 45: Rule Post-Pruning (Another Greedy Algorithm)

1. Induce a decision tree.
2. Convert it to rules (see the earlier slide).
3. Consider dropping any one rule antecedent; delete the one that improves tuning-set accuracy the most. Repeat as long as progress is being made.
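A sketch of this loop over rules represented as (antecedent, label) pairs, where an antecedent is a tuple of (feature, value) tests as produced by the tree_to_rules sketch above. The representation, the first-matching-rule classifier with a default class, and all names are my assumptions, not the lecture's:

```python
def matches(rule, example):
    conditions, _label = rule
    return all(example.get(f) == v for f, v in conditions)

def classify(rules, example, default="-"):
    """Apply the first matching rule; fall back to a default class."""
    for rule in rules:
        if matches(rule, example):
            return rule[1]
    return default

def rule_accuracy(rules, tune_set):
    return sum(classify(rules, ex) == ex["label"] for ex in tune_set) / len(tune_set)

def prune_rules(rules, tune_set):
    """Greedy rule post-pruning: repeatedly delete the single antecedent (from any
    rule) whose removal improves tune-set accuracy the most; stop when none helps."""
    best_acc = rule_accuracy(rules, tune_set)
    while True:
        candidates = []
        for i, (conds, label) in enumerate(rules):
            for j in range(len(conds)):
                new_rule = (conds[:j] + conds[j + 1:], label)
                candidates.append(rules[:i] + [new_rule] + rules[i + 1:])
        if not candidates:
            return rules
        acc, pruned = max(((rule_accuracy(c, tune_set), c) for c in candidates),
                          key=lambda x: x[0])
        if acc <= best_acc:                  # no progress being made: stop
            return rules
        best_acc, rules = acc, pruned
```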

Slide 46: Rule Post-Pruning (cont.)

Advantages:
- Allows an intermediate node to be pruned from some rules but retained in others
- Can correct poor early decisions in tree construction
- The final concept is more understandable