Machine Learning II Decision Tree Induction CSE 573

© Daniel S. Weld 2 Logistics
- Reading: Ch 13; Ch 14 through 14.3
- Project writeups due Wednesday, November 10 … 9 days to go …

© Daniel S. Weld 3 Learning from Training Experience
Credit assignment problem:
- Direct training examples (e.g., individual checker boards plus the correct move for each) -> supervised learning
- Indirect training examples (e.g., a complete sequence of moves and the final result) -> reinforcement learning
- Which examples: random, teacher chooses, learner chooses
- No labeled examples at all -> unsupervised learning

© Daniel S. Weld 4 Machine Learning Outline
Machine learning:
  √ Function approximation
  √ Bias
Supervised learning:
  √ Classifiers & concept learning
  - Decision-tree induction (preference bias)
  - Overfitting
  - Ensembles of classifiers
  - Co-training

© Daniel S. Weld 5 Need for Bias
Example space: 4 Boolean attributes, so 2^4 = 16 distinct examples.
How many ML hypotheses (distinct Boolean functions over those examples)? 2^16 = 65,536.

© Daniel S. Weld 6 Two Strategies for ML
- Restriction bias: use prior knowledge to specify a restricted hypothesis space (e.g., the version space algorithm over conjunctions).
- Preference bias: use a broad hypothesis space, but impose an ordering on the hypotheses (e.g., decision trees).

© Daniel S. Weld 7 Decision Trees
Convenient representation:
- Developed with learning in mind
- Deterministic
Expressive:
- Equivalent to propositional DNF
- Handles discrete and continuous parameters
Simple learning algorithm:
- Handles noise well
- The algorithm is classified as follows: constructive (builds the DT by adding nodes), eager, batch (but incremental versions exist)

© Daniel S. Weld 8 Decision Tree Representation
Good day for tennis?
[Tree: the root tests Outlook (Sunny / Overcast / Rain); the Sunny branch tests Humidity (High / Normal); the Rain branch tests Wind (Strong / Weak); leaves are Yes / No.]
- Leaves = classification
- Arcs = choice of value for the parent attribute
A decision tree is equivalent to logic in disjunctive normal form:
  GoodDay ≡ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak)
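As an illustration (my addition, not part of the original slides), here is a minimal Python sketch of this particular tree written as nested conditionals; the attribute and value names follow the tennis example above.

```python
def good_day_for_tennis(outlook: str, humidity: str, wind: str) -> bool:
    """Classify one example with the decision tree shown above.

    Equivalent to the DNF formula:
    (Sunny and Normal) or Overcast or (Rain and Weak).
    """
    if outlook == "Sunny":
        return humidity == "Normal"   # High humidity -> No
    elif outlook == "Overcast":
        return True                   # Overcast days are always Yes
    else:                             # outlook == "Rain"
        return wind == "Weak"         # Strong wind -> No

# Example: a sunny day with normal humidity is a good day for tennis.
print(good_day_for_tennis("Sunny", "Normal", "Weak"))  # True
```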

© Daniel S. Weld 9 DT Learning as Search
- Nodes: decision trees
- Operators: tree refinement (sprouting the tree)
- Initial node: the smallest tree possible, a single leaf
- Heuristic: information gain
- Goal: the best tree possible (???)
- Type of search: hill climbing

© Daniel S. Weld 10 Successors
[Figure: the current tree (a single "Yes" leaf) and its successors, one per candidate split: Outlook, Temp, Humid, Wind.]
Which attribute should we use to split?

© Daniel S. Weld 11 Decision Tree Algorithm
BuildTree(TrainingData)
    Split(TrainingData)

Split(D)
    If all points in D are of the same class
        Then Return
    For each attribute A
        Evaluate splits on attribute A
    Use the best split to partition D into D1, D2
    Split(D1)
    Split(D2)
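A runnable Python sketch of the same recursive procedure (my addition). It assumes examples are (feature-dict, label) pairs and takes the split heuristic as a parameter, e.g. the information-gain function sketched after the Key Questions slide below.

```python
from collections import Counter

def build_tree(data, attributes, choose_attribute):
    """Recursive tree construction following the Split() pseudocode above.

    data             : list of (example, label) pairs; example is a dict
    attributes       : attributes still available for splitting
    choose_attribute : heuristic picking the best attribute (e.g. info gain)
    Returns either a class label (leaf) or (attribute, {value: subtree}).
    """
    labels = [label for _, label in data]
    if len(set(labels)) == 1:           # all points in D are of the same class
        return labels[0]
    if not attributes:                  # nothing left to split on: majority label
        return Counter(labels).most_common(1)[0][0]

    best = choose_attribute(data, attributes)
    branches = {}
    for value in {ex[best] for ex, _ in data}:
        subset = [(ex, y) for ex, y in data if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, remaining, choose_attribute)
    return (best, branches)
```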

© Daniel S. Weld 12 Movie Recommendation
Features? Rambo, Matrix, Rambo 2

© Daniel S. Weld 13 Key Questions
- How to choose the best attribute? Mutual information (information gain), entropy (disorder)
- When to stop growing the tree?
- Non-Boolean attributes
- Missing data
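The slides do not spell out the formulas, so here is a hedged sketch (my addition) of the standard ID3 definitions: entropy H(S) = -Σ p_i log2 p_i over class proportions, and Gain(S, A) = H(S) - Σ_v (|S_v| / |S|) H(S_v).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (disorder) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(data, attribute):
    """Expected reduction in entropy from splitting `data` on `attribute`.

    `data` is a list of (example, label) pairs; example is a dict.
    """
    before = entropy([label for _, label in data])
    remainder = 0.0
    for value in {ex[attribute] for ex, _ in data}:
        subset = [label for ex, label in data if ex[attribute] == value]
        remainder += (len(subset) / len(data)) * entropy(subset)
    return before - remainder

# Tiny example in the spirit of the tennis data:
data = [({"Wind": "weak"}, "yes"), ({"Wind": "weak"}, "no"),
        ({"Wind": "strong"}, "no"), ({"Wind": "strong"}, "no")]
print(information_gain(data, "Wind"))  # ~0.31 bits
```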

© Daniel S. Weld 14 Issues
- Content vs. social
- Non-Boolean attributes
- Missing data
- Scaling up

© Daniel S. Weld 15 Missing Data 1
  Day   Temp   Humid   Wind   Tennis?
  d1    h      h       weak   n
  d2    h      h       s      n
  d8    m      h       weak   n
  d9    c      ?       weak   yes
  d11   m      n       s      yes
How should we handle d9's missing Humid value?
- Don't use this instance for learning, or
- assign the attribute its most common value at the node, or
- assign its most common value given the classification.

© Daniel S. Weld 16 Fractional Values
(Same table as the previous slide; d9's Humid value is missing.)
Alternative: give d9 fractional values, 75% h and 25% n (the proportions of Humid = h and Humid = n among the other examples at this node).
- Use the fractional counts in the information-gain calculations: the Humid = h branch gets [0.75+, 3-], the Humid = n branch gets [1.25+, 0-].
- Further subdivide if other attributes are also missing.
- Use the same approach to classify a test example with a missing attribute: the classification is the most probable one, summing over the leaves where the example got divided.

© Daniel S. Weld 17 Non-Boolean Features
Features with multiple discrete values:
- Construct a multi-way split?
- Test for one value vs. all of the others?
- Group values into two disjoint subsets?
Real-valued features:
- Discretize?
- Consider a threshold split using observed values?
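A small illustrative sketch (my addition, not from the slides) of the usual way candidate thresholds are generated for a real-valued feature: sort the observed values and take the midpoints between consecutive distinct values, then score each candidate with information gain.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct observed values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# e.g. temperatures observed in the training data (hypothetical numbers):
print(candidate_thresholds([40, 48, 60, 72, 80, 90]))
# [44.0, 54.0, 66.0, 76.0, 85.0]
```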

© Daniel S. Weld 18 Attributes with Many Values
An attribute with so many values that it divides the examples into tiny sets: each set is likely uniform, so information gain is high, but the attribute is a poor predictor on new data. We need to penalize such attributes.

© Daniel S. Weld 19 One Approach: Gain Ratio
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = the entropy of S with respect to the values of A (contrast with the entropy of S with respect to the target value).
This penalizes attributes with many uniformly distributed values: if A splits S uniformly into n sets, SplitInformation = log2(n); it is 1 for a Boolean attribute.
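A hedged, self-contained sketch (my addition) of these two quantities in the standard C4.5 style:

```python
import math
from collections import Counter

def split_information(values):
    """Entropy of the attribute's value distribution (not the target's).

    Note: this is 0 when every example has the same value, so practical
    implementations guard against dividing by (near-)zero.
    """
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(gain, values):
    """GainRatio = Gain / SplitInformation for one candidate attribute."""
    return gain / split_information(values)

# An attribute that splits 8 examples uniformly into 4 sets:
print(split_information(["a", "a", "b", "b", "c", "c", "d", "d"]))  # log2(4) = 2.0
```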

© Daniel S. Weld 20 Cross Validation
- Partition the examples into k disjoint equivalence classes (folds).
- Now create k training sets: each set is the union of all folds except one, so each set has (k-1)/k of the original training data.
- Train on each training set and test on the fold that was left out (one way to generate the splits is sketched below).
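A minimal sketch of this k-fold scheme (my illustration; `train` and `evaluate` are hypothetical placeholders for whatever learner and metric you plug in):

```python
def k_fold_splits(examples, k):
    """Yield (training_set, test_fold) pairs for k-fold cross validation."""
    folds = [examples[i::k] for i in range(k)]           # k disjoint folds
    for i in range(k):
        test_fold = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield training, test_fold                        # (k-1)/k of the data to train

# Usage sketch (train/evaluate are hypothetical stand-ins):
# scores = [evaluate(train(tr), te) for tr, te in k_fold_splits(data, k=5)]
```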

© Daniel S. Weld 23 Machine Learning Outline
- Machine learning
- Supervised learning
- Overfitting
  - What is the problem?
  - Reduced-error pruning
- Ensembles of classifiers
- Co-training

© Daniel S. Weld 24 Overfitting
[Figure: accuracy vs. model complexity (e.g., number of nodes in the decision tree), with one curve for accuracy on the training data and one for accuracy on the test data; the training curve keeps rising while the test curve eventually falls off.]

© Daniel S. Weld 25 Overfitting…
A decision tree DT is overfit when there exists another tree DT′ such that DT has smaller error than DT′ on the training examples, but bigger error than DT′ on the test examples.
Causes of overfitting:
- Noisy data, or
- a training set that is too small

© Daniel S. Weld 28 Effect of Reduced-Error Pruning
[Figure: accuracy vs. tree size, annotated "cut tree back to here" at the point to which the tree should be pruned.]

© Daniel S. Weld 29 Machine Learning Outline
- Machine learning
- Supervised learning
- Overfitting
- Ensembles of classifiers
  - Bagging
  - Cross-validated committees
  - Boosting
  - Stacking
- Co-training

© Daniel S. Weld 30 Voting

© Daniel S. Weld 31 Ensembles of Classifiers
Assume the classifiers' errors are independent and each has a 30% error rate, and take a majority vote over an ensemble of 21 classifiers.
The probability that the majority is wrong = the area under the binomial distribution for 11 or more classifiers in error. With individual error 0.3, that area is about 0.026 (computed in the sketch below): an order of magnitude improvement!
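The number above can be checked directly; a small Python computation of the binomial tail (my addition):

```python
from math import comb

n, p = 21, 0.3   # 21 independent classifiers, each wrong 30% of the time

# Probability that 11 or more of the 21 classifiers are wrong,
# i.e. that a majority vote gives the wrong answer.
p_majority_wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                       for k in range(11, n + 1))
print(round(p_majority_wrong, 3))  # 0.026
```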

© Daniel S. Weld 32 Constructing Ensembles: Cross-Validated Committees
- Partition the examples into k disjoint equivalence classes (folds).
- Create k training sets: each is the union of all folds except one (a holdout), so each has (k-1)/k of the original training data.
- Now train a classifier on each set.

© Daniel S. Weld 33 Ensemble Construction II: Bagging
- Generate k sets of training examples.
- For each set, draw m examples randomly, with replacement, from the original set of m examples.
- Each training set therefore covers about 63.2% of the original examples (plus duplicates).
- Now train a classifier on each set.
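A hedged sketch of the bootstrap-sampling step (my addition; the `train` call in the usage note is a placeholder for any base learner):

```python
import random

def bootstrap_sample(examples, rng):
    """Draw len(examples) examples with replacement (one bagging replicate)."""
    m = len(examples)
    return [examples[rng.randrange(m)] for _ in range(m)]

def bagging_training_sets(examples, k, seed=0):
    """k bootstrap replicates; each covers ~63.2% of the original examples."""
    rng = random.Random(seed)
    return [bootstrap_sample(examples, rng) for _ in range(k)]

# Usage sketch: ensemble = [train(s) for s in bagging_training_sets(data, k=21)]
```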

© Daniel S. Weld 34 Ensemble Creation III: Boosting
- Maintain a probability distribution over the set of training examples.
- Create k sets of training data iteratively. On iteration i:
  - Draw m examples randomly (as in bagging), but use the probability distribution to bias the selection.
  - Train classifier number i on this training set.
  - Test the partial ensemble (of i classifiers) on all training examples.
  - Modify the distribution: increase the probability of each example the ensemble gets wrong.
- This creates harder and harder learning problems: "bagging with optimized choice of examples."
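A rough sketch of the reweighting loop described above (my addition; it follows the slide's "increase P of each error example" recipe rather than AdaBoost's exact weight formula, and `train` / `predict` are hypothetical placeholders):

```python
import random

def boost(examples, k, train, predict, seed=0):
    """Iteratively reweight (x, y) examples, boosting the ones the ensemble gets wrong."""
    rng = random.Random(seed)
    m = len(examples)
    weights = [1.0 / m] * m                    # uniform distribution to start
    ensemble = []
    for _ in range(k):
        # Draw m examples, biased by the current distribution (like bagging,
        # but weighted), and train the next classifier on them.
        sample = rng.choices(examples, weights=weights, k=m)
        ensemble.append(train(sample))
        # Test the partial ensemble on all training examples and increase the
        # weight of every example it currently misclassifies.
        for i, (x, y) in enumerate(examples):
            votes = [predict(c, x) for c in ensemble]
            majority = max(set(votes), key=votes.count)
            if majority != y:
                weights[i] *= 2.0              # arbitrary "make it harder" factor
        total = sum(weights)
        weights = [w / total for w in weights] # renormalize to a distribution
    return ensemble
```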

© Daniel S. Weld 35 Ensemble Creation IV: Stacking
- Train several base learners.
- Next train a meta-learner that learns when the base learners are right or wrong; the meta-learner then arbitrates among them.
- Train using cross-validated committees:
  - Meta-learner inputs = the base learners' predictions.
  - Meta-learner training examples = the "test set" from cross validation.
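A minimal sketch of the arbitration step (my addition; `predict`, `predict_meta`, and `train_meta` are hypothetical placeholders for the base and meta learners):

```python
def stack_predict(base_learners, meta_learner, predict, predict_meta, x):
    """Stacking at prediction time: the meta-learner arbitrates among the bases."""
    base_predictions = [predict(b, x) for b in base_learners]
    return predict_meta(meta_learner, base_predictions)

# Training sketch: for each held-out fold of a cross-validated committee,
# collect (base_predictions, true_label) pairs and fit the meta-learner on them:
# meta_examples = [([predict(b, x) for b in base_learners], y)
#                  for x, y in held_out_fold]
# meta_learner = train_meta(meta_examples)
```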

© Daniel S. Weld 36 Machine Learning Outline
- Machine learning
- Supervised learning
- Overfitting
- Ensembles of classifiers
- Co-training

© Daniel S. Weld 37 Co-Training Motivation
- Learning methods need labeled data: lots of ⟨x, f(x)⟩ pairs, which are hard to get (who wants to label data?).
- But unlabeled data is usually plentiful… could we use that instead?

© Daniel S. Weld 38 Co-training
- Suppose each instance has two parts, x = [x1, x2], with x1 and x2 conditionally independent given f(x).
- Suppose each half can be used to classify the instance: there exist f1, f2 such that f1(x1) = f2(x2) = f(x).
- Suppose f1 and f2 are learnable: f1 ∈ H1, f2 ∈ H2, with learning algorithms A1 and A2.
[Figure: a small set of labeled instances trains A1, whose hypothesis labels unlabeled instances [x1, x2]; those newly labeled instances feed A2, which learns ~f2. Only a small amount of labeled data is needed.]
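A hedged sketch of the resulting loop (my addition, in the Blum-and-Mitchell style; `A1`, `A2`, and `predict` are placeholders for the two view-specific learners, and a real implementation would have each hypothesis label only the unlabeled examples it is most confident about rather than a fixed slice):

```python
def co_train(labeled, unlabeled, A1, A2, predict, rounds=10):
    """labeled: list of ([x1, x2], y) pairs; unlabeled: list of [x1, x2] views."""
    for _ in range(rounds):
        h1 = A1([(x[0], y) for x, y in labeled])     # learn from view x1
        h2 = A2([(x[1], y) for x, y in labeled])     # learn from view x2
        if not unlabeled:
            break
        # Each hypothesis labels a batch of unlabeled data, which becomes
        # (noisy) training data for the other view on the next round.
        newly_labeled = [(x, predict(h1, x[0])) for x in unlabeled[:10]]
        newly_labeled += [(x, predict(h2, x[1])) for x in unlabeled[10:20]]
        labeled = labeled + newly_labeled
        unlabeled = unlabeled[20:]
    return h1, h2
```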

© Daniel S. Weld 39 Observations
- We can apply A1 to generate as much training data as we want.
- If x1 is conditionally independent of x2 given f(x), then the errors in the labels produced by A1 will look like random noise to A2!
- Thus there is no limit to the quality of the hypothesis A2 can produce.

© Daniel S. Weld 40 It Really Works!
Learning to classify web pages as course pages:
- x1 = bag of words on the page
- x2 = bag of words from all anchors pointing to the page
- Naïve Bayes classifiers
- 12 labeled pages, 1039 unlabeled