ECE 5984: Introduction to Machine Learning
Dhruv Batra, Virginia Tech
Topics: Decision/Classification Trees
Readings: Murphy 16.1-16.2; Hastie 9.2

Project Proposals: graded. Mean: 3.6/5 = 72%

Administrivia
Project Mid-Sem Spotlight Presentations
–Friday: 5-7pm, 3-5pm, Whittemore A
–5 slides (recommended)
–4 minute time (STRICT), plus Q&A
–Tell the class what you're working on
–Any results yet? Problems faced?
–Upload slides on Scholar

Recap of Last Time

Convolution Explained
(Figure-only slides illustrating the convolution operation. Slide Credit: Marc'Aurelio Ranzato)
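The slides above walk through convolution pictorially. As a rough stand-in, here is a minimal NumPy sketch of a 2D "valid" convolution (implemented as cross-correlation, the convention used in convolutional nets); the function name and the example edge filter are illustrative, not taken from the slides.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and
    return the map of patch-filter dot products (cross-correlation)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(patch * kernel)
    return out

# Example: a vertical-edge filter applied to a tiny image
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)
print(conv2d_valid(image, edge_filter))  # strong response across the 0->1 boundary
```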

Convolutional Nets
–Example:
Image Credit: Yann LeCun, Kevin Murphy

Visualizing Learned Filters
(Three figure-only slides. Figure Credit: [Zeiler & Fergus, ECCV 2014])

Addressing non-linearly separable data – Option 1: non-linear features
Choose non-linear features, e.g.:
–Typical linear features: w_0 + Σ_i w_i x_i
–Example of non-linear features: degree-2 polynomials, w_0 + Σ_i w_i x_i + Σ_ij w_ij x_i x_j
Classifier h_w(x) is still linear in the parameters w:
–As easy to learn
–Data is linearly separable in higher-dimensional spaces
–Can be expressed via kernels
Slide Credit: Carlos Guestrin
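As a concrete illustration of Option 1 (my own example, not from the slides): expand 2-D inputs into degree-2 polynomial features and fit an ordinary linear model on top. The feature map is the only non-linear piece, so the model stays linear in the parameters w; the XOR-style data below is not linearly separable in the original space but is in the expanded one.

```python
import numpy as np

def degree2_features(X):
    """Map (x1, x2) -> (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2], axis=1)

# XOR-like data: not linearly separable in the original 2-D space
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])

Phi = degree2_features(X)
# Least-squares fit of a model that is linear in w, in the expanded feature space
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.sign(Phi @ w))  # recovers the labels: [-1, 1, 1, -1]
```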

Addressing non-linearly separable data – Option 2: non-linear classifier
Choose a classifier h_w(x) that is non-linear in the parameters w, e.g.:
–Decision trees, neural networks, …
More general than linear classifiers, but can often be harder to learn (non-convex/concave optimization required).
Often very useful (outperforms linear classifiers).
In a way, both ideas are related.
Slide Credit: Carlos Guestrin

New Topic: Decision Trees

Synonyms
–Decision Trees
–Classification and Regression Trees (CART)
Algorithms for learning decision trees:
–ID3
–C4.5
Random Forests:
–Multiple decision trees

Decision Trees Demo

Pose Estimation
Random Forests!
–Multiple decision trees

Slide Credit: Pedro Domingos, Tom Mitchell, Tom Dietterich

A small dataset: Miles Per Gallon
From the UCI repository (thanks to Ross Quinlan)
40 records
Suppose we want to predict MPG.
Slide Credit: Carlos Guestrin

A Decision Stump
Slide Credit: Carlos Guestrin

The final tree
Slide Credit: Carlos Guestrin

Comments
–Not all features/attributes need to appear in the tree.
–A feature/attribute X_i may appear in multiple branches.
–On a path, no feature may appear more than once.
–Not true for continuous features; we'll see this later.
–Many trees can represent the same concept, but not all trees will have the same size!
–e.g., Y = (A ∧ B) ∨ (¬A ∧ C), i.e., (A and B) or (not A and C)

Learning decision trees is hard!!!
Learning the simplest (smallest) decision tree is an NP-complete problem [Hyafil & Rivest '76].
Resort to a greedy heuristic:
–Start from an empty decision tree
–Split on the next best attribute (feature)
–Recurse
"Iterative Dichotomizer" (ID3); C4.5 (ID3 + improvements)
Slide Credit: Carlos Guestrin
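To make the greedy recursion concrete, here is a minimal sketch of an ID3-style learner for discrete attributes, using information gain to choose each split. This is not the course's reference implementation; the function names, the dictionary-based tree representation, and the tiny cylinders/maker example are illustrative assumptions.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - cond

def build_tree(rows, labels, attrs):
    # Base case 1: all labels identical -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left -> majority-vote leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: split on the attribute with the highest information gain
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[best], []).append((row, y))
    for value, items in groups.items():
        sub_rows = [r for r, _ in items]
        sub_labels = [y for _, y in items]
        tree[best][value] = build_tree(sub_rows, sub_labels,
                                       [a for a in attrs if a != best])
    return tree

# Tiny hypothetical dataset in the spirit of the MPG example
rows = [{"cylinders": 4, "maker": "asia"}, {"cylinders": 8, "maker": "america"},
        {"cylinders": 4, "maker": "europe"}, {"cylinders": 8, "maker": "asia"}]
labels = ["good", "bad", "good", "bad"]
print(build_tree(rows, labels, ["cylinders", "maker"]))
# -> {'cylinders': {4: 'good', 8: 'bad'}}
```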

Recursion Step
Take the original dataset and partition it according to the value of the attribute we split on:
–Records in which cylinders = 4
–Records in which cylinders = 5
–Records in which cylinders = 6
–Records in which cylinders = 8
Slide Credit: Carlos Guestrin

Recursion Step
Build a separate subtree from each partition:
–Records in which cylinders = 4
–Records in which cylinders = 5
–Records in which cylinders = 6
–Records in which cylinders = 8
Slide Credit: Carlos Guestrin
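A minimal sketch of just this partitioning step (illustrative names, not code from the slides): group the records by the value of the chosen attribute, then build one subtree per group.

```python
from collections import defaultdict

def partition(records, attr):
    """Split a list of dict-records into one sub-list per value of attr."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attr]].append(rec)
    return dict(groups)

records = [{"cylinders": 4, "mpg": "good"}, {"cylinders": 8, "mpg": "bad"},
           {"cylinders": 4, "mpg": "good"}, {"cylinders": 6, "mpg": "bad"}]
print(partition(records, "cylinders"))
# {4: [...two records...], 8: [...], 6: [...]} -- one subtree is then built per group
```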

Second level of tree
Recursively build a tree from the seven records in which cylinders = 4 and the maker was based in Asia (similar recursion in the other cases).
Slide Credit: Carlos Guestrin

The final tree
Slide Credit: Carlos Guestrin

Choosing a good attribute
X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F
Slide Credit: Carlos Guestrin

Measuring uncertainty
A split is good if we are more certain about the classification after the split:
–Deterministic is good (all true or all false)
–A uniform distribution is bad
P(Y=F | X2=F) = 1/2,  P(Y=T | X2=F) = 1/2
P(Y=F | X1=T) = 0,  P(Y=T | X1=T) = 1

Entropy
Entropy H(Y) of a random variable Y:
H(Y) = -Σ_y P(Y=y) log2 P(Y=y)
More uncertainty means more entropy!
Information-theoretic interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).
Slide Credit: Carlos Guestrin
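A short, self-contained sketch of the entropy computation described above (illustrative code): a 50/50 split of labels gives 1 bit of entropy, while a pure set gives 0.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = sum_y -P(Y=y) log2 P(Y=y), estimated from a list of labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["T", "F", "T", "F"]))  # 1.0 bit: maximally uncertain
print(entropy(["T", "T", "T", "T"]))  # 0.0 bits: deterministic
```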

Information gain
Advantage of an attribute = decrease in uncertainty:
–Entropy of Y before the split
–Entropy after the split: weight the entropy of each branch by the probability of following that branch, i.e., the normalized number of records
Information gain is the difference, IG(X) = H(Y) - H(Y | X).
–(Technically it is the mutual information, but in this context it is also referred to as information gain.)
Slide Credit: Carlos Guestrin
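As a sketch (again illustrative code, not from the slides), the information gain of each attribute on the small X1/X2/Y table from a few slides back can be computed as follows; splitting on X1 reduces uncertainty about Y far more than splitting on X2.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG = H(Y) - sum_v P(X=v) * H(Y | X=v)."""
    n = len(ys)
    branches = defaultdict(list)
    for x, y in zip(xs, ys):
        branches[x].append(y)
    cond = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(ys) - cond

# The 8-record table: (X1, X2, Y)
data = [("T", "T", "T"), ("T", "F", "T"), ("T", "T", "T"), ("T", "F", "T"),
        ("F", "T", "T"), ("F", "F", "F"), ("F", "T", "F"), ("F", "F", "F")]
X1, X2, Y = zip(*data)
print(information_gain(X1, Y))  # ~0.55: X1 tells us a lot about Y
print(information_gain(X2, Y))  # ~0.05: X2 is nearly uninformative on its own
```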

Learning decision trees
–Start from an empty decision tree
–Split on the next best attribute (feature)
–Use, for example, information gain to select the attribute: split on arg max_i IG(X_i)
–Recurse
Slide Credit: Carlos Guestrin

Look at all the information gains…
Suppose we want to predict MPG.
Slide Credit: Carlos Guestrin

When do we stop?

Base Case One
Don't split a node if all matching records have the same output value.
Slide Credit: Carlos Guestrin

Don’t split a node if none of the attributes can create multiple non- empty children Base Case Two: No attributes can distinguish Slide Credit: Carlos Guestrin(C) Dhruv Batra62

Base Cases
–Base Case One: if all records in the current data subset have the same output, then don't recurse.
–Base Case Two: if all records have exactly the same set of input attributes, then don't recurse.
Slide Credit: Carlos Guestrin

Base Cases: An idea
–Base Case One: if all records in the current data subset have the same output, then don't recurse.
–Base Case Two: if all records have exactly the same set of input attributes, then don't recurse.
–Proposed Base Case 3: if all attributes have zero information gain, then don't recurse.
Is this a good idea?
Slide Credit: Carlos Guestrin

The problem with Base Case 3
y = a XOR b
The information gains: zero for every attribute, so Proposed Base Case 3 stops before making any split.
The resulting decision tree: a single root node.
Slide Credit: Carlos Guestrin
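A quick numeric check of this point, reusing the information-gain sketch from above (illustrative code): on the four XOR records, neither attribute has positive information gain on its own, even though the two attributes together determine y exactly.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    n = len(ys)
    branches = defaultdict(list)
    for x, y in zip(xs, ys):
        branches[x].append(y)
    return entropy(ys) - sum(len(b) / n * entropy(b) for b in branches.values())

# y = a XOR b on all four input combinations
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [0, 1, 1, 0]
print(information_gain(a, y))  # 0.0: splitting on a alone looks useless
print(information_gain(b, y))  # 0.0: same for b, so Base Case 3 would stop here
```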

If we omit Base Case 3:
y = a XOR b
The resulting decision tree: a two-level tree that splits on a and then on b, representing XOR exactly.
Slide Credit: Carlos Guestrin