ECE 471/571 – Pattern Recognition Lecture 13 – Decision Tree

Presentation transcript:

ECE 471/571 – Pattern Recognition
Lecture 13 – Decision Tree
Hairong Qi, Gonzalez Family Professor
Electrical Engineering and Computer Science, University of Tennessee, Knoxville
http://www.eecs.utk.edu/faculty/qi
Email: hqi@utk.edu

Pattern Classification
- Statistical Approach
  - Supervised
    - Basic concepts: Bayesian decision rule (MPP, LR, discriminant function)
    - Parameter estimation (ML, BL)
    - Non-parametric learning (kNN)
    - LDF (Perceptron)
    - NN (BP)
    - Support Vector Machine
    - Deep Learning (DL)
  - Unsupervised
    - Basic concepts: distance
    - Agglomerative method
    - k-means
    - Winner-takes-all
    - Kohonen maps
    - Mean-shift
- Non-Statistical Approach
  - Decision tree
  - Syntactic approach
- Dimensionality Reduction: FLD, PCA
- Performance Evaluation: ROC curve (TP, TN, FN, FP), cross validation
- Stochastic Methods: local opt. (GD), global opt. (SA, GA)
- Classifier Fusion: majority voting, NB, BKS

Review – Bayes Decision Rule
- Maximum posterior probability (MPP): for a given x, if P(w1|x) > P(w2|x), then x belongs to class 1, otherwise class 2.
- Likelihood ratio: if p(x|w1)/p(x|w2) > P(w2)/P(w1), then x belongs to class 1, otherwise class 2.
- Discriminant function
  - Case 1: minimum Euclidean distance (linear machine), Si = s^2 I
  - Case 2: minimum Mahalanobis distance (linear machine), Si = S
  - Case 3: quadratic classifier, Si arbitrary
- Non-parametric: kNN – for a given x, if k1/k > k2/k, then x belongs to class 1, otherwise class 2.
- Dimensionality reduction
- Parameter estimation: estimate Gaussian (maximum likelihood estimation, MLE), two-modal Gaussian
- Performance evaluation: ROC curve
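A minimal MATLAB sketch of the two equivalent rules recapped above (maximum posterior probability and the likelihood ratio); the priors and likelihoods are assumed values for illustration only:

Pw   = [0.6 0.4];                       % priors P(w1), P(w2)            (assumed values)
pxw  = [0.2 0.5];                       % likelihoods p(x|w1), p(x|w2)   (assumed values)
post = pxw .* Pw / sum(pxw .* Pw);      % posteriors P(w1|x), P(w2|x)
[~, labelMPP] = max(post);              % MPP rule: pick the class with the larger posterior
labelLR = 1 + (pxw(1)/pxw(2) <= Pw(2)/Pw(1));   % LR rule: class 1 iff p(x|w1)/p(x|w2) > P(w2)/P(w1)
% both rules assign class 2 here, since 0.2*0.6 < 0.5*0.4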

Nominal Data
Descriptions that are discrete and without any natural notion of similarity or even ordering.

Some Terminology
- Decision tree
- Root
- Link (branch): directional
- Leaf
- Descendant node

CART
Classification and regression trees: a generic tree-growing methodology. Issues studied:
- How many splits from a node?
- Which property to test at each node?
- When to declare a leaf?
- How to prune a large, redundant tree?
- If the leaf is impure, how to classify?
- How to handle missing data?

Number of Splits
Binary trees are used: any multi-way split can be represented as a sequence of binary splits, so binary trees combine expressive power with comparative simplicity in training.

Node Impurity – Occam's Razor
The fundamental principle underlying tree creation is that of simplicity: we prefer simple, compact trees with few nodes.
http://math.ucr.edu/home/baez/physics/occam.html
Occam's (or Ockham's) razor is a principle attributed to the 14th-century logician and Franciscan friar William of Occam. Ockham was the village in the English county of Surrey where he was born. The principle states that "entities should not be multiplied unnecessarily": when you have two competing theories which make exactly the same predictions, the one that is simpler is the better.
Stephen Hawking explains in A Brief History of Time: "We could still imagine that there is a set of laws that determines events completely for some supernatural being, who could observe the present state of the universe without disturbing it. However, such models of the universe are not of much interest to us mortals. It seems better to employ the principle known as Occam's razor and cut out all the features of the theory which cannot be observed."
Everything should be made as simple as possible, but not simpler.

Property Query and Impurity Measurement
We seek a property query T at each node N that makes the data reaching the immediate descendant nodes as pure as possible. We want i(N) to be 0 if all the patterns that reach the node bear the same category label.
Entropy impurity (information impurity): i(N) = -sum_j P(wj) log2 P(wj), where P(wj) is the fraction of patterns at node N in category wj.

Other Impurity Measurements
- Variance impurity (2-category case): i(N) = P(w1) P(w2)
- Gini impurity: i(N) = sum_{i != j} P(wi) P(wj) = 1 - sum_j P(wj)^2
- Misclassification impurity: i(N) = 1 - max_j P(wj)
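As a quick illustration (not part of the original slides), the impurity measures above can be evaluated in MATLAB for a vector p of class fractions at a node:

p = [0.9 0.1];                                    % example class fractions at a node, sum(p) == 1
entropyImp  = -sum(p(p > 0) .* log2(p(p > 0)));   % entropy (information) impurity
giniImp     = 1 - sum(p.^2);                      % Gini impurity
misclassImp = 1 - max(p);                         % misclassification impurity
varianceImp = p(1) * p(2);                        % variance impurity (2-category case only)
fprintf('entropy=%.3f  gini=%.3f  misclass=%.3f  variance=%.3f\n', ...
        entropyImp, giniImp, misclassImp, varianceImp);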

Choosing the Property Test
Choose the query that decreases the impurity as much as possible, i.e., that maximizes the impurity drop
Δi(N) = i(N) - PL i(NL) - (1 - PL) i(NR),
where NL and NR are the left and right descendant nodes, i(NL) and i(NR) are their impurities, and PL is the fraction of patterns at node N that will go to NL. Solve for the extrema (local extrema).

Example
Node N contains 90 patterns in w1 and 10 patterns in w2.
Split candidate: 70 w1 patterns & 0 w2 patterns go to the right; 20 w1 patterns & 10 w2 patterns go to the left.
If using Gini impurity: i(N) = 1 - 0.9*0.9 - 0.1*0.1 = 1 - 0.81 - 0.01 = 0.18. The weighted impurity of the descendants is 0.3*(1 - (2/3)^2 - (1/3)^2) + 0.7*(1 - 1*1) = 0.3*4/9 ≈ 0.13, so Δi(N) = 0.18 - 0.13 ≈ 0.05.
If using misclassification impurity: i(N) = 1 - 0.9 = 0.1. The weighted impurity of the descendants is 0.3*(1 - 2/3) + 0.7*0 = 0.1, so Δi(N) = 0.1 - 0.1 = 0; this split brings no improvement under the misclassification measure.
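A minimal sketch (plain MATLAB, no toolbox required) that reproduces the Gini numbers in this example from the impurity-drop formula:

gini = @(p) 1 - sum(p.^2);           % Gini impurity of a vector of class fractions
iN   = gini([90 10] / 100);          % parent node N:  0.18
iNL  = gini([20 10] / 30);           % left child NL:  ~0.44
iNR  = gini([70 0] / 70);            % right child NR: 0
PL   = 30 / 100;                     % fraction of patterns going to NL
diN  = iN - PL*iNL - (1 - PL)*iNR;   % impurity drop:  ~0.05
fprintf('i(N)=%.3f, drop=%.3f\n', iN, diN);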

When to Stop Splitting?
Two extreme scenarios: overfitting (each leaf is one sample) and a high error rate (splitting stops too early).
Approaches:
- Validation and cross-validation: e.g., 90% of the data set as training data and 10% as validation data; stop when the validation error stops improving.
- Threshold on the impurity reduction: can give an unbalanced tree, and the threshold is hard to choose.
- Minimum description length (MDL): i(N) measures the uncertainty of the training data, while the size of the tree (number of nodes) measures the complexity of the classifier itself.
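A minimal sketch of the 90/10 holdout mentioned above; it assumes a feature matrix X (one row per pattern) and labels y are already in memory:

n        = size(X, 1);               % number of patterns
idx      = randperm(n);              % random permutation of pattern indices
nTrain   = round(0.9 * n);
trainIdx = idx(1:nTrain);            % 90% used to grow the tree
valIdx   = idx(nTrain+1:end);        % 10% held out; stop splitting (or choose the pruning
                                     % level) when the validation error stops decreasing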

When to Stop Splitting? (cont'd)
Use a stopping criterion based on the statistical significance of the reduction in impurity: a chi-square statistic tests whether the candidate split differs significantly from a random split.
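A hedged sketch of one common form of this chi-square comparison (candidate split vs. a random split that sends the same fraction of every class to the left), reusing the counts from the earlier example:

n    = [90 10];                      % class counts at node N
nL   = [20 10];                      % class counts sent left by the candidate split
PL   = sum(nL) / sum(n);             % fraction of patterns going left (0.3)
nE   = PL * n;                       % expected left counts under a random split
chi2 = sum((nL - nE).^2 ./ nE);      % ~18.1 for these counts
% compare chi2 against a chi-square critical value (one degree of freedom for two classes);
% a large value means the candidate split differs significantly from a random split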

Pruning
- Another way to stop splitting.
- Horizon effect: lack of sufficient look-ahead.
- Let the tree fully grow, i.e., beyond any putative horizon; then all pairs of neighboring leaf nodes are considered for elimination.

Instability
Small changes in the training data can produce a very different tree.

Other Methods
- Quinlan's ID3
- C4.5 (successor and refinement of ID3)
http://www.rulequest.com/Personal/

MATLAB Routine
classregtree builds either a classification tree or a regression tree, depending on whether the target variable is categorical or numeric (e.g., the predicted price of a consumer good).
http://www.mathworks.com/help/stats/classregtree.html
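A hedged usage sketch of the routine above, using MATLAB's built-in fisheriris sample data (classregtree is the older Statistics Toolbox interface; newer releases provide fitctree/fitrtree):

load fisheriris                                    % provides meas (150x4) and species labels
t = classregtree(meas, species, ...
                 'names', {'SL' 'SW' 'PL' 'PW'});  % grow a classification tree
view(t)                                            % display the tree structure
yhat = eval(t, meas(1, :));                        % predicted label for one sample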

http://www.simafore.com/blog/bid/62482/2-main-differences-between-classification-and-regression-trees

Random Forest
- Potential issue with decision trees: overfitting.
- Prof. Leo Breiman
- Ensemble learning methods:
  - Bagging (bootstrap aggregating): proposed by Breiman in 1994 to improve classification by combining the classifications of randomly generated training sets.
  - Random forest: bagging + random selection of features at each node to determine a split.

MATLAB Implementation
B = TreeBagger(nTrees, train_x, train_y);   % grow an ensemble of nTrees bagged decision trees
pred = B.predict(test_x);                   % predicted class labels for the test samples
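A hedged extension of the snippet above (assumes the Statistics and Machine Learning Toolbox): turning on out-of-bag prediction lets the ensemble estimate its generalization error without a separate test set.

B = TreeBagger(nTrees, train_x, train_y, 'OOBPrediction', 'on');
figure; plot(oobError(B));                  % cumulative out-of-bag error vs. number of trees
xlabel('Number of grown trees');
ylabel('Out-of-bag classification error');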

References
[CART] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
[Bagging] L. Breiman, "Bagging predictors," Machine Learning, 24(2):123-140, August 1996. (citations: 16,393)
[RF] L. Breiman, "Random forests," Machine Learning, 45(1):5-32, October 2001.