The methods discussed so far are Linear Discriminants. (Figure: two classes, Class1 and Class2, separated by a linear decision boundary.)

XOR Problem: Not Linearly Separable!

Decision Rules for the XOR Problem:
If x=0 then
    If y=0 then class=0
    Else class=1
Else if x=1 then
    If y=0 then class=1
    Else class=0
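These rules translate directly into code; a minimal Python sketch (the final branch, truncated in the original transcript, returns class 0 to complete the XOR truth table):

```python
def xor_rules(x, y):
    """The XOR decision rules above, written as nested if/else tests."""
    if x == 0:
        return 0 if y == 0 else 1
    else:  # x == 1
        return 1 if y == 0 else 0

# Truth-table check: (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0
for x in (0, 1):
    for y in (0, 1):
        print((x, y), "->", xor_rules(x, y))
```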

A Sample Decision Tree. (Figure: a small tree over features X and Y, with branches labeled f/t and leaves labeled C1 and C2.) By default, a false value goes to the left and true to the right. It is easy to generate a tree that perfectly classifies the training data; it is much harder to generate a tree that works well on the test data!

Decision Tree Induction
–Pick a feature to test, say X.
–Split the training cases into a set where X=True and a set where X=False.
–If a set contains cases of only one class, label it as a leaf.
–Alternatively, label a set as a leaf if it contains fewer than some threshold number of cases, e.g. 5.
–Repeat the process on the sets that are not leaves.
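A rough Python sketch of this induction loop, under assumptions not stated on the slide: cases are (feature_dict, label) pairs with boolean features, choose_feature is a caller-supplied heuristic (such as the entropy heuristic on the next slides), and 5 cases is used as the default leaf threshold:

```python
def majority(cases):
    """Most common label among (feature_dict, label) cases."""
    labels = [label for _, label in cases]
    return max(set(labels), key=labels.count)

def induce_tree(cases, features, choose_feature, min_cases=5):
    """Recursively split the cases, one attribute at a time."""
    labels = {label for _, label in cases}
    # Make a leaf when the set is pure, too small, or there is nothing left to test
    if len(labels) <= 1 or len(cases) < min_cases or not features:
        return majority(cases)
    best = choose_feature(cases, features)          # heuristic pick
    remaining = [f for f in features if f != best]
    branches = {}
    for value in (True, False):
        subset = [c for c in cases if bool(c[0][best]) == value]
        # An empty branch simply predicts the parent's majority label
        branches[value] = (induce_tree(subset, remaining, choose_feature, min_cases)
                           if subset else majority(cases))
    return {"test": best, True: branches[True], False: branches[False]}
```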

Decision Tree Induction
How do we pick which feature to test? Use heuristic search! The entropy heuristic tries to reduce the degree of randomness, or "impurity", of the resulting subsets (the number of bits needed to encode the class labels).
–High randomness: a feature that splits the cases into two sets, each of which is 50% class 1 and 50% class 2.
–Low randomness: a feature that splits the cases into two sets, with everything in set 1 in class 1 and everything in set 2 in class 2.

The entropy of a particular state is the negative sum, over all classes, of the probability of each class multiplied by the log of that probability:

    Entropy(node) = − Σ over classes c of P(c) · lg P(c)

Example 1: 2 classes, C1 and C2, and 100 cases. For this state, 50 cases are in C1 and 50 cases are in C2, so the probability of each class, P1 and P2, is 0.5. The entropy of this node = −[ (0.5)(lg 0.5) + (0.5)(lg 0.5) ] = 1.
Example 2: 75 cases in C1 and 25 cases in C2, so P(C1) = 0.75 and P(C2) = 0.25. Entropy = −[ (0.75)(lg 0.75) + (0.25)(lg 0.25) ] ≈ 0.81.
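A small Python helper reproducing the two calculations above (the 50/50 and 75/25 class counts are taken from the slide):

```python
from math import log2

def entropy(counts):
    """Entropy, in bits, of a node with the given per-class case counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([50, 50]))   # 1.0
print(entropy([75, 25]))   # ~0.811
```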

Our algorithm will pick the feature or test that reduces the entropy the most. In the case of only two classes, this can be achieved by selecting the feature test that maximizes the equation shown below; if there were more classes, we would have to include a p_class · entropy(node_class) term for each class.
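The equation itself appeared as a graphic on the original slide and is missing from this transcript; assuming the standard two-class information-gain criterion, the quantity being maximized is presumably

    gain(test) = entropy(parent) − [ p_true · entropy(node_true) + p_false · entropy(node_false) ]

where p_true and p_false are the fractions of the parent's cases sent down each branch.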

For example, let's say that we are at a node with an entropy of 1, as calculated previously, and we split the current set of 100 cases on whether Feature A is true or false:
–Feature A = False: 10 cases in C1, 20 cases in C2. Entropy = −[ (1/3)·lg(1/3) + (2/3)·lg(2/3) ] ≈ 0.92
–Feature A = True: 50 cases in C1, 20 cases in C2. Entropy = −[ (5/7)·lg(5/7) + (2/7)·lg(2/7) ] ≈ 0.86

If instead we split the current set of cases on whether Feature B is true or false:
–Feature B = False: 50 cases, all in C1. Entropy = −[ (1)·lg(1) + (0)·lg(0) ] = 0 (taking 0·lg 0 = 0, since lg 0 alone is undefined)
–Feature B = True: 50 cases, all in C2. Entropy = −[ (0)·lg(0) + (1)·lg(1) ] = 0
Feature B gives the larger change in entropy, so we will pick Feature B over Feature A!
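A quick check of both candidate splits in Python, using the counts from these two slides (Feature A: 10 C1 + 20 C2 on the false branch and 50 C1 + 20 C2 on the true branch; Feature B: a pure 50/0 and 0/50 split); it reproduces the 0.92, 0.86, and 0 child entropies and shows why Feature B wins:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_gain(parent_counts, child_counts_list):
    """Information gain: parent entropy minus the case-weighted child entropy."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

parent = [50, 50]                                    # 100 cases, entropy = 1.0
gain_a = split_gain(parent, [[10, 20], [50, 20]])    # child entropies ~0.918 and ~0.863
gain_b = split_gain(parent, [[50, 0], [0, 50]])      # both child entropies are 0
print(gain_a, gain_b)                                # ~0.12 vs 1.0 -> pick Feature B
```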

(Figure: number of terminal nodes vs. error rate, for the Iris data problem.)

Reduction in Tree Size: Prune Branches
–Induce the tree.
–From the bottom, move up to the subtree rooted at a non-terminal node.
–Prune this node.
–Test the new tree on the *test* cases.
–If it performs better than the original tree, keep the change and continue.
Subtle flaw: this trains on the test data. A large sample size is needed to get valid results.
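A hedged sketch of that bottom-up pruning loop, reusing the dict-based trees from the earlier induction sketch. It assumes two things not shown above: each internal node also records the majority training label under a 'majority' key, and a hypothetical accuracy(tree_or_label, cases) helper scores a (sub)tree on the held-out cases. As the slide notes, those cases should really be a separate validation set rather than the final test set:

```python
def prune(tree, holdout_cases, accuracy):
    """Bottom-up reduced-error pruning of a dict-based decision tree."""
    if not isinstance(tree, dict):                    # already a leaf label
        return tree
    # Prune the children first, so we work from the bottom of the tree upward
    tree[True] = prune(tree[True], holdout_cases, accuracy)
    tree[False] = prune(tree[False], holdout_cases, accuracy)
    leaf = tree["majority"]                           # candidate replacement leaf
    # Keep the pruned version if it does at least as well on the held-out cases
    if accuracy(leaf, holdout_cases) >= accuracy(tree, holdout_cases):
        return leaf
    return tree
```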

Web Demo: …g/DecisionTrees/Applet/DecisionTreeApplet.html

Rule Induction Overview
–Generic separate-and-conquer strategy
–CN2 rule induction algorithm
–Improvements to rule induction

Problem
Given:
–A target concept
–Positive and negative examples
–Examples composed of features
Find:
–A simple set of rules that discriminates between (unseen) positive and negative examples of the target concept

Sample Unordered Rules
–If X then C1
–If X and Y then C2
–If NOT X and Z and Y then C3
–If B then C2
What if two rules fire at once? Do we just OR them together?

Target Concept
We want the target concept in the form of rules. If we only have 3 features, X, Y, and Z, then we could generate the following possible rules:
–If X then…
–If X and Y then…
–If X and Y and Z then…
–If X and Z then…
–If Y then…
–If Y and Z then…
–If Z then…
This is an exponentially large space, and larger still if we allow NOTs.

Generic Separate-and-Conquer Strategy

TargetConcept = NULL
While NumPositive(Examples) > 0
    BestRule = TRUE
    Rule = BestRule
    Cover = ApplyRule(Rule)
    While NumNegative(Cover) > 0
        For each feature ∈ Features
            Refinement = Rule ∧ feature
            If Heuristic(Refinement, Examples) > Heuristic(BestRule, Examples)
                BestRule = Refinement
        Rule = BestRule
        Cover = ApplyRule(Rule)
    TargetConcept = TargetConcept ∪ Rule
    Examples = Examples - Cover
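A compact Python rendering of this strategy, under simplifying assumptions: examples are (frozenset_of_features, label) pairs with labels "+" or "-", a rule is a frozenset of required features (the empty set plays the role of TRUE), and the heuristic is simply the fraction of covered examples that are positive:

```python
def covers(rule, example_features):
    """A rule covers an example when every conjunct appears in the example."""
    return rule <= example_features

def purity(rule, examples):
    """Heuristic: fraction of the covered examples that are positive."""
    covered = [label for feats, label in examples if covers(rule, feats)]
    return sum(1 for label in covered if label == "+") / len(covered) if covered else 0.0

def separate_and_conquer(examples, features):
    """Learn rules until no positive examples remain uncovered."""
    target_concept = []
    examples = list(examples)
    while any(label == "+" for _, label in examples):
        rule = frozenset()                              # most general rule (TRUE)
        # Specialize the rule until it covers no negative examples
        while any(label == "-" for feats, label in examples if covers(rule, feats)):
            best = max((rule | {f} for f in features), key=lambda r: purity(r, examples))
            if best == rule:                            # no refinement left to add
                break
            rule = best
        covered = [ex for ex in examples if covers(rule, ex[0])]
        if not covered:                                 # degenerate rule; stop early
            break
        target_concept.append(rule)                     # add the rule to the concept
        examples = [ex for ex in examples if ex not in covered]   # remove its cover
    return target_concept
```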

Trivial Example
Examples (+): 1: a,b   2: b,c
Examples (−): 3: c,d   4: d,e
H(T)=2/4   H(a)=1/1   H(b)=2/2   H(c)=1/2   H(d)=0/1   H(e)=0/1
Say we pick a. Remove the covered examples:
Examples (+): 2: b,c
Examples (−): 3: c,d   4: d,e
H(a ∧ b)=1/1   H(a ∧ c)=1/2   H(a ∧ d)=0/2   H(a ∧ e)=0/1
Pick as our rule: a ∧ b.
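The toy data above can be checked with a few lines of self-contained Python; the counts below are (positives covered, total covered) for each single-feature rule, matching the H values on the slide (note that d actually appears in two of the examples, so the code reports 0/2 for it):

```python
examples = [(frozenset("ab"), "+"), (frozenset("bc"), "+"),
            (frozenset("cd"), "-"), (frozenset("de"), "-")]

def coverage(rule):
    """(positives covered, total covered) for a rule given as a set of features."""
    covered = [label for feats, label in examples if rule <= feats]
    return sum(1 for label in covered if label == "+"), len(covered)

for f in "abcde":
    print(f, coverage(frozenset(f)))    # a -> (1, 1), b -> (2, 2), c -> (1, 2), ...
```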

CN2 Rule Induction (Clark & Boswell, 1991)
A more specialized version of separate-and-conquer:

CN2Unordered(allexamples, allclasses)
    Ruleset ← {}
    For each class in allclasses
        Generate rules by CN2ForOneClass(allexamples, class)
        Add rules to Ruleset
    Return Ruleset

CN2
CN2ForOneClass(examples, class)
    Rules ← {}
    Repeat
        Bestcond ← FindBestCondition(examples, class)
        If Bestcond <> null then
            Add the rule "IF Bestcond THEN PREDICT class"
            Remove from examples all positive cases of class covered by Bestcond
    Until Bestcond = null
    Return Rules

Negative examples are kept around so that future rules won't impact the existing negatives (this is what allows unordered rules).

CN2
FindBestCondition(examples, class)
    MGC ← true                        ' most general condition
    Star ← {MGC}, Newstar ← {}, Bestcond ← null
    While Star is not empty (and loopcount < MAXCONJUNCTS)
        For each rule R in Star
            For each possible feature F
                R' ← the specialization of R formed by adding F as an extra
                     conjunct to R (i.e. R' = R AND F), removing null conditions
                     (i.e. A AND NOT A), removing redundancies (i.e. A AND A),
                     and removing previously generated rules
                If LaPlaceHeuristic(R', class) > LaPlaceHeuristic(Bestcond, class)
                    Bestcond ← R'
                Add R' to Newstar
        If size(Newstar) > MAXRULESIZE then
            Remove the worst rules in Newstar until size = MAXRULESIZE
        Star ← Newstar
    Return Bestcond
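A simplified Python sketch of this beam search, with rules again represented as frozensets of feature conjuncts; the Laplace heuristic itself is defined on the next slide, so a laplace(rule, examples, cls) callable is taken as a parameter here. Negated features (and hence the null-condition check) are left out of this sketch:

```python
def find_best_condition(examples, cls, features, laplace,
                        max_rule_size=5, max_conjuncts=3):
    """Beam search from the most general condition toward longer conjunctions."""
    star = [frozenset()]                         # MGC: the empty conjunction (always true)
    best_cond, best_score = None, float("-inf")
    for _ in range(max_conjuncts):
        new_star = []
        for rule in star:
            for f in features:
                if f in rule:                    # skip redundancies (A AND A)
                    continue
                refined = rule | {f}
                if refined in new_star:          # skip previously generated rules
                    continue
                score = laplace(refined, examples, cls)
                if score > best_score:
                    best_cond, best_score = refined, score
                new_star.append(refined)
        # Keep only the best rules for the next round of specialization
        new_star.sort(key=lambda r: laplace(r, examples, cls), reverse=True)
        star = new_star[:max_rule_size]
        if not star:
            break
    return best_cond
```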

LaPlace Heuristic
In our case, NumClasses = 2. A common problem is an overly specific rule that covers only 1 example; in that case, LaPlace = (1+1)/(1+2) ≈ 0.67. However, a rule that covers, say, 2 examples gets the higher value (2+1)/(2+2) = 0.75.
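In code, assuming the usual CN2 form of the heuristic, (covered-and-correct + 1) / (covered + NumClasses), which matches the two worked numbers above:

```python
def laplace_accuracy(num_correct, num_covered, num_classes=2):
    """Laplace-corrected accuracy estimate for a rule."""
    return (num_correct + 1) / (num_covered + num_classes)

print(laplace_accuracy(1, 1))   # (1 + 1) / (1 + 2) ~ 0.67
print(laplace_accuracy(2, 2))   # (2 + 1) / (2 + 2) = 0.75
```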

Trivial Example Revisited
Examples (+): 1: a,b   2: b,c
Examples (−): 3: c,d   4: d,e
L(T)=3/6   L(a)=2/3   L(b)=3/4   L(c)=1/4   L(d)=0/4   L(e)=0/3
Say we pick beam = 3. Keep T, a, b.
Specialize T: (all of its specializations are already listed above)
Specialize a: L(a ∧ b)=2/3   L(a ∧ c)=0   L(a ∧ d)=0   L(a ∧ e)=0
Specialize b: L(b ∧ a)=2/3   L(b ∧ c)=2/3   L(b ∧ d)=0   L(b ∧ e)=0
Our best rule out of all of these is just "b".
Continue until we run out of features, or the maximum number of conjuncts is reached.

Improvements to Rule Induction
–Better feature selection algorithm
–Add a rule pruning phase
    –Problem of overfitting the data
    –Split training examples into a GrowSet (2/3) and a PruneSet (1/3)
        Train on the GrowSet
        Test the pruned rules on the PruneSet, keep the rule with the best results
    –Needs more training examples!

Improvements to Rule Induction
–Ripper / Slipper
    –Rule induction with pruning, plus new heuristics for when to stop adding rules and when to prune rules
    –Slipper builds on Ripper, but uses boosting to reduce the weight of negative examples instead of removing them entirely
–Other search approaches
    –Instead of beam search: genetic search, pure hill climbing (which would be faster), etc.

In-Class VB Demo: Rule Induction for the Multiplexer