Today’s Topics
Learning Decision Trees (Chapter 18)
–We’ll use d-trees to introduce/motivate many general issues in ML (eg, overfitting reduction)
“Forests” of Decision Trees – a very successful ML approach, arguably the best on many tasks
Expected-Value Calculations (a topic we’ll revisit a few times)
Information Gain
Advanced Topic: Regression Trees
Coding Tips for HW1

9/15/15, CS Fall 2015 (© Jude Shavlik), Lecture 4, Week 2

Learning Decision Trees (Ch 18): The ID3 Algorithm (Quinlan 1979; Machine Learning 1:1, 1986)
Induction of Decision Trees (top-down)
–Based on Hunt’s CLS psychological model (1963)
–Handles noisy & missing feature values
–C4.5 and C5.0 are its successors; CART is very similar
[Figure: a small decision tree that splits on COLOR? (Red, Blue) and then on SIZE? (Big, Small)]

Main Hypothesis of ID3
The simplest tree that classifies the training examples will work best on future examples (Occam’s Razor)
[Figure: a larger tree that splits on COLOR? and then SIZE? vs. a smaller tree that splits only on SIZE? (Big → -, Small → +)]
It is NP-hard to find the smallest tree (Hyafil & Rivest, 1976)
[Photo: Ross Quinlan]

Why Occam’s Razor? (Occam lived 1285–1349)
There are fewer short hypotheses (small trees in ID3) than long ones
A short hypothesis that fits the training data is unlikely to be a coincidence
A long hypothesis that fits the training data might be (since there are many more possibilities)
The COLT community formally addresses these issues (ML theory)

Finding Small Decision Trees
ID3 – generate small trees with a greedy algorithm:
–Find a feature that “best” divides the data
–Recur on each subset of the data that the feature creates
What does “best” mean?
–We’ll briefly postpone answering this

Overview of ID3 (Recursion!)
[Figure: a small dataset over attributes A1–A4 (plus an ID column); ID3 repeatedly picks a splitting attribute (aka feature) and recurs on each partition of the data; when a partition is empty, use the majority class at the parent node (+) – why?; the resulting d-tree is shown in red]

ID3 Algorithm (Figure 18.5 of textbook)

Given:  E, a set of classified examples
        F, a set of features not yet in the decision tree

If |E| = 0 then return the majority class at the parent
Else if All_Examples_Same_Class, return that class
Else if |F| = 0 then return the majority class (we have +/- examples with the same feature values)
Else
   Let bestF = FeatureThatGainsMostInfo(E, F)
   Let leftF = F – bestF
   Add node bestF to the decision tree
   For each possible value, v, of bestF do
      Add an arc (labeled v) to the decision tree
      and connect it to the result of
         ID3({ex in E | ex has value v for feature bestF}, leftF)
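For concreteness, here is a minimal Python sketch of the pseudocode above. It assumes each example is a dict mapping feature names to values plus a 'label' key, represents leaves as bare labels and interior nodes as (feature, branches) tuples, and takes the feature-scoring function as a parameter (an info-gain sketch appears later in these notes). It is an illustration, not the HW’s required implementation.

```python
# Minimal sketch of the ID3 pseudocode, under the assumptions stated above.
from collections import Counter

def majority_class(examples, default=None):
    """Most common label among the examples, or `default` if there are none."""
    if not examples:
        return default
    return Counter(ex['label'] for ex in examples).most_common(1)[0][0]

def id3(examples, features, score_feature, parent_majority=None):
    """Return a leaf label, or a (feature, {value: subtree, ...}) interior node."""
    if not examples:                           # |E| = 0: fall back on parent's majority
        return parent_majority
    labels = {ex['label'] for ex in examples}
    if len(labels) == 1:                       # all examples have the same class
        return labels.pop()
    if not features:                           # |F| = 0: +/- examples share feature values
        return majority_class(examples)

    best = max(features, key=lambda f: score_feature(examples, f))
    remaining = [f for f in features if f != best]
    branches = {}
    # For simplicity we branch only on values observed in E; with the feature's
    # full value set, an empty subset would return parent_majority as in the pseudocode.
    for v in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == v]
        branches[v] = id3(subset, remaining, score_feature,
                          parent_majority=majority_class(examples))
    return (best, branches)
```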

Venn Diagram View of ID3
Question: How do decision trees divide feature space?
[Figure: examples plotted in the feature space defined by F1 and F2]

Venn Diagram View of ID3 (cont.)
Question: How do decision trees divide the feature space?
Answer: with ‘axis-parallel splits’
[Figure: the F1–F2 feature space carved into rectangular regions of + and - by axis-parallel splits]
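To make ‘axis-parallel splits’ concrete, here is a tiny hypothetical tree over two numeric features, written as nested ifs; the features, thresholds, and labels are invented for illustration. Each root-to-leaf path selects one axis-parallel rectangle of the F1–F2 plane.

```python
# Hypothetical depth-2 decision tree over numeric features F1 and F2;
# each branch corresponds to one axis-parallel rectangle of feature space.
def tiny_tree(f1, f2):
    if f1 <= 4.0:                               # vertical split at F1 = 4.0
        return '+' if f2 <= 2.5 else '-'        # horizontal split at F2 = 2.5
    else:
        return '-' if f2 <= 7.0 else '+'        # horizontal split at F2 = 7.0
```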

[Figure: an example decision tree printed in ASCII]
Use this as a guide on how to print d-trees in ASCII

Main Issue
How do we choose the next feature to place in the decision tree?
–Random choice? [works better than you’d expect]
–The feature with the largest number of values?
–The feature with the fewest?
–An information-theoretic measure (Quinlan’s approach)
Such a measure is a general-purpose tool, eg often used for “feature selection”

Expected-Value Calculations: Sample Task
Imagine you invest $1 in a lottery ticket
It says the odds are
–1 in 10 times you’ll win $5
–1 in 1,000,000 times you’ll win $100,000
How much do you expect to get back?
0.1 × $5 + 0.000001 × $100,000 = $0.50 + $0.10 = $0.60

More Generally
Assume event A has N discrete and disjoint random outcomes

Expected value(event A) = Σ_{i=1}^{N} prob(outcome_i occurs) × value(outcome_i)
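A one-line helper makes the formula concrete; the numbers in the usage example are the lottery odds from the previous slide.

```python
# Expected value of an event with discrete, disjoint outcomes.
def expected_value(outcomes):
    """outcomes: iterable of (probability, value) pairs, one per disjoint outcome."""
    return sum(p * v for p, v in outcomes)

# Usage: the lottery ticket from the sample task.
lottery = [(0.1, 5.0), (0.000001, 100_000.0)]
print(expected_value(lottery))   # 0.6, i.e., a $1 ticket returns $0.60 on average
```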

Scoring the Features (so we can pick the best one)
Let f+ = fraction of positive examples
Let f- = fraction of negative examples
   f+ = p / (p + n),  f- = n / (p + n)   where p = #pos, n = #neg
The expected information needed to determine the category of one of these examples is
   InfoNeeded(f+, f-) = - f+ lg(f+) - f- lg(f-)
This is also called the entropy of the set of examples (derived later)
(From where will we get this info?)
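A minimal Python sketch of InfoNeeded, using the usual convention that 0·lg(0) is treated as 0 so that pure sets score zero bits:

```python
import math

def info_needed(f_pos, f_neg):
    """Binary entropy, in bits, of a set with the given +/- fractions."""
    total = 0.0
    for f in (f_pos, f_neg):
        if f > 0:                      # treat 0 * lg(0) as 0
            total -= f * math.log2(f)
    return total

# info_needed(1.0, 0.0) == 0.0; info_needed(0.5, 0.5) == 1.0
```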

Consider the Extreme Cases of InfoNeeded(f+, f-)
All same class (+, say):  InfoNeeded(1, 0) = -1 lg(1) - 0 lg(0) = 0   (0 lg 0 ≡ 0 by def’n)
50/50 mixture:            InfoNeeded(½, ½) = 2 [ -½ lg(½) ] = 1
[Figure: plot of InfoNeeded(f+, 1 - f+) as a function of f+, equal to 0 at f+ = 0 or 1 and peaking at 1 bit when f+ = ½]

Evaluating a Feature
How much does it help to know the value of attribute/feature A?
Assume A divides the current set of examples into N groups
Let q_i  = fraction of the data on branch i
    f_i+ = fraction of +’s on branch i
    f_i- = fraction of –’s on branch i

Evaluating a Feature (cont.)
InfoRemaining(A) ≡ Σ_{i=1}^{N} q_i × InfoNeeded(f_i+, f_i-)
–The info still needed after determining the value of attribute A
–Another expected-value calculation
[Figure: pictorially, a node testing A with arcs v_1 … v_N; InfoNeeded(f+, f-) applies at the node itself and InfoNeeded(f_1+, f_1-) … InfoNeeded(f_N+, f_N-) at its children]

Info Gain
Gain(A) ≡ InfoNeeded(f+, f-) – InfoRemaining(A)
This is our scoring function in our hill-climbing (greedy) algorithm
Since InfoNeeded(f+, f-) is constant across features, maximizing Gain(A) is the same as picking the A with the smallest InfoRemaining(A)
That is, choose the feature that statistically tells us the most about the class of another example drawn from this distribution
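Here is a sketch of InfoRemaining and Gain in the same style as the earlier snippets; it reuses info_needed() from the entropy sketch above and the same dict-based example format, and its (examples, feature) signature matches the score_feature parameter of the id3() sketch.

```python
from collections import defaultdict

def fractions(examples):
    """Return (f+, f-) for a non-empty set of examples labeled '+' or '-'."""
    p = sum(1 for ex in examples if ex['label'] == '+')
    return p / len(examples), (len(examples) - p) / len(examples)

def info_remaining(examples, feature):
    """Expected info still needed after splitting on `feature` (uses info_needed above)."""
    branches = defaultdict(list)
    for ex in examples:
        branches[ex[feature]].append(ex)
    total = len(examples)
    return sum((len(subset) / total) * info_needed(*fractions(subset))
               for subset in branches.values())

def info_gain(examples, feature):
    """Gain(A) = InfoNeeded(f+, f-) - InfoRemaining(A)."""
    return info_needed(*fractions(examples)) - info_remaining(examples, feature)
```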

Sample Info-Gain Calculation
InfoNeeded(f+, f-) = - f+ lg(f+) - f- lg(f-)

   Color     Size     Class
   Red       BIG      +
   Red       BIG      +
   Yellow    SMALL    -
   Red       SMALL    -
   Blue      BIG      +

Info-Gain Calculation (cont.)
[Figure: the per-feature info-gain numbers worked out for the dataset above]
Note that “Size” provides a complete classification, so we are done
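As a numeric check, the small script below applies the info_gain() sketch from earlier to the five-example table above:

```python
# Applying the earlier info_gain() sketch to the example table.
examples = [
    {'Color': 'Red',    'Size': 'BIG',   'label': '+'},
    {'Color': 'Red',    'Size': 'BIG',   'label': '+'},
    {'Color': 'Yellow', 'Size': 'SMALL', 'label': '-'},
    {'Color': 'Red',    'Size': 'SMALL', 'label': '-'},
    {'Color': 'Blue',   'Size': 'BIG',   'label': '+'},
]
# InfoNeeded(3/5, 2/5) is about 0.971 bits for the whole set.
print(round(info_gain(examples, 'Size'), 3))    # 0.971: Size alone separates + from -
print(round(info_gain(examples, 'Color'), 3))   # about 0.420
```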

Recursive Methods You’ll Need to Write

The d-tree learning algorithm (pseudocode appeared above)

Classifying a ‘testset’ example
–Leaf nodes: return the leaf’s label (ie, the predicted category)
–Interior nodes: determine which feature value to look up in the example,
  then return the result of the recursive call on the ‘left’ or ‘right’ branch

Printing the d-tree in ‘plain ASCII’ (you need not follow this verbatim)
–Tip: pass in ‘currentDepthOfRecursion’ (initially 0)
–Leaf nodes: print LABEL (and maybe the # of training ex’s reaching here) + LINEFEED
–Interior nodes: for each outgoing arc
   print LINEFEED and 3 × currentDepthOfRecursion spaces
   print FEATURE NAME + “ = “ + the arc’s value + “: “
   make the recursive call on the arc, with currentDepthOfRecursion + 1

(A sketch of the classify and print routines appears below.)
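A minimal sketch of these two routines, assuming the (feature, {value: subtree}) / bare-label representation used in the earlier id3() sketch; the exact output format is a suggestion, not the HW’s required one.

```python
def classify(tree, example):
    """Predict a label by walking the tree, looking feature values up in the example dict."""
    if not isinstance(tree, tuple):              # leaf node: return its label
        return tree
    feature, branches = tree
    return classify(branches[example[feature]], example)

def print_tree(tree, depth=0):
    """Print the tree in plain ASCII, indenting 3 spaces per level of recursion."""
    if not isinstance(tree, tuple):              # a tree that is just a leaf
        print(tree)
        return
    feature, branches = tree
    for value, subtree in branches.items():      # one line (or block) per outgoing arc
        print('   ' * depth + f'{feature} = {value}: ', end='')
        if isinstance(subtree, tuple):           # interior child: recurse on the next line
            print()
            print_tree(subtree, depth + 1)
        else:                                    # leaf child: label on the same line
            print(subtree)
```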

Suggested Approach
First, randomly choose a feature
–Get tree building to work
–Get tree printing to work
–Get tree traversal (for test ex’s) to work
Then add in the code for infoGain
–Test on simple, handcrafted datasets
–Train and test on the SAME file (why?)
–You should get ALL of them correct (except under extreme noise)
Finally, produce what the HW requests