Classification Decision Trees


1 Classification Decision Trees

2 Definition: Classification is the task of learning a target function f that maps each attribute set X to one of the predefined class labels y. The target function is also informally known as a classification model. A classification model serves two purposes: 1. Descriptive Modeling: distinguishing between objects of different classes. 2. Predictive Modeling: predicting the class label of unknown records.

3 Examples of Classification Task
Predicting tumor cells as benign or malignant. Classifying credit card transactions as legitimate or fraudulent. Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil. Categorizing news stories as finance, weather, entertainment, sports, etc.

4 General Approach to Solving a classification Problem:
A key objective of the learning algorithm is to build models with good generalization capability, i.e., models that accurately predict the class labels of previously unseen records. The given data set is divided into training and test sets: the training set is used to build the model, and the test set is used to validate it.
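The split described above can be sketched in plain Python. This is a minimal illustration, not from the slides: the function name, the fixed seed, and the use of integer indices as stand-ins for labeled records are all assumptions.

```python
import random

def holdout_split(records, seed=0):
    """Holdout method: reserve 2/3 of the records for training, 1/3 for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    cut = len(shuffled) * 2 // 3            # integer arithmetic for an exact 2/3 point
    return shuffled[:cut], shuffled[cut:]

data = list(range(30))                      # toy stand-in for 30 labeled records
train, test = holdout_split(data)
print(len(train), len(test))                # 20 10
```

In practice a library routine (e.g., a shuffling split utility) would be used, but the idea is the same: the model never sees the held-out third during training.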

5 Classification Techniques
Decision Tree based Methods Rule-based Methods Memory based reasoning Neural Networks Naïve Bayes and Bayesian Belief Networks Support Vector Machines

6 Example of a Decision Tree
[Training data table: categorical attributes Refund and Marital Status, continuous attribute Taxable Income, class attribute Cheat]
Model: a decision tree with splitting attributes Refund, MarSt, TaxInc:
Refund = Yes → NO
Refund = No, MarSt = Married → NO
Refund = No, MarSt = Single or Divorced, TaxInc < 80K → NO
Refund = No, MarSt = Single or Divorced, TaxInc >= 80K → YES

7 Another Example of Decision Tree
[Same training data] An alternative tree:
MarSt = Married → NO
MarSt = Single or Divorced, Refund = Yes → NO
MarSt = Single or Divorced, Refund = No, TaxInc < 80K → NO
MarSt = Single or Divorced, Refund = No, TaxInc >= 80K → YES
There could be more than one tree that fits the same data!

8 Decision Tree Classification Task

9 Apply Model to Test Data
Start from the root of the tree and follow the branch that matches each attribute of the test record, Refund → MarSt → TaxInc, until a leaf (YES or NO) is reached.


14 Apply Model to Test Data
For a test record with Refund = No and MarSt = Married, the traversal ends at the Married leaf: assign Cheat to “No”.
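The traversal in slides 9–14 can be written out as nested conditionals. This hand-coded sketch mirrors the example tree (the attribute names and the 80K threshold come from the slides; the function name is illustrative):

```python
def classify_cheat(refund, marital_status, taxable_income):
    """Mirror of the example tree: Refund -> MarSt -> TaxInc."""
    if refund == "Yes":
        return "No"                      # Refund = Yes branch: leaf NO
    if marital_status == "Married":
        return "No"                      # Married branch: leaf NO
    if taxable_income < 80_000:          # remaining cases: Single or Divorced
        return "No"
    return "Yes"

# The test record from slide 14: Refund = No, Married
print(classify_cheat("No", "Married", 95_000))  # No
```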

15 Decision Tree Classification Task

16 Decision Tree Induction
Many algorithms: Hunt’s Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT

17 General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t. General procedure: If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt. If Dt is an empty set, then t is a leaf node labeled by the default class yd. If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
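The three cases above can be sketched recursively. This is a simplified illustration, not the algorithm as originally stated: it assumes records are (attribute-dict, label) pairs, tests attributes in a fixed given order, and falls back to the majority class when tests run out.

```python
def hunt(records, attributes, default="No"):
    """Hunt's algorithm sketch: grow a tree by recursively splitting impure nodes."""
    if not records:
        return default                          # case 2: empty set -> default-class leaf
    labels = {label for _, label in records}
    if len(labels) == 1:
        return labels.pop()                     # case 1: pure node -> leaf
    if not attributes:                          # no tests left -> majority-class leaf
        counts = {}
        for _, label in records:
            counts[label] = counts.get(label, 0) + 1
        return max(counts, key=counts.get)
    attr, rest = attributes[0], attributes[1:]  # case 3: split on an attribute test
    branches = {}
    for value in {feats[attr] for feats, _ in records}:
        subset = [r for r in records if r[0][attr] == value]
        branches[value] = hunt(subset, rest, default)
    return (attr, branches)

data = [({"Refund": "Yes"}, "No"),
        ({"Refund": "No"}, "Yes"),
        ({"Refund": "No"}, "Yes")]
tree = hunt(data, ["Refund"])
```

On this toy data the root splits on Refund, and both children are immediately pure.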

18 Hunt’s Algorithm
The tree grows in stages on the training data:
Step 1: a single leaf, Don’t Cheat.
Step 2: split on Refund (Yes → Don’t Cheat; No → Don’t Cheat).
Step 3: under Refund = No, split on Marital Status (Married → Don’t Cheat; Single, Divorced → split further).
Step 4: under Single, Divorced, split on Taxable Income (< 80K → Don’t Cheat; >= 80K → Cheat).

19 Design Issues of Decision Tree Induction:
1. How should the training records be split? The algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition. 2. When should the splitting procedure stop? Splitting continues until either all the records belong to the same class or all the records have identical attribute values. Methods for Expressing Attribute Test Conditions: Binary Attributes: have only two values.

20 Nominal Attributes: These can have many values; their test condition can be expressed in two ways, as a multiway split (one outcome per distinct value) or as a binary split (grouping the values into two subsets).

21 Ordinal Attributes: Their values have an ordered grouping; test conditions may group values, but only in ways that preserve that order.
Continuous Attributes: The test condition can be expressed as a comparison test (e.g., A < v) with binary outcomes, or as a range query with multiple outcomes.

22 How to determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred. We therefore need a measure of node impurity: a non-homogeneous distribution has a high degree of impurity; a homogeneous one has a low degree of impurity.

23 Measures of Node Impurity
There are many measures that can be used to determine the best way to split the records. Gini Index Entropy Misclassification error

24 How to Find the Best Split
Before splitting, the node impurity is M0. Split A? produces nodes N1 and N2 with impurities M1 and M2 (weighted combination M12); split B? produces nodes N3 and N4 with impurities M3 and M4 (weighted combination M34). Choose the split with the larger gain: M0 − M12 versus M0 − M34.
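The comparison above can be computed directly. In this sketch the M-values are Gini impurities and the class counts are made up for illustration:

```python
def gini(counts):
    """Gini impurity of a node, given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0

def split_gain(parent_counts, children):
    """Gain = impurity before splitting minus weighted impurity after."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini(child) for child in children)
    return gini(parent_counts) - weighted

# Hypothetical parent node with class counts [6, 4]; candidate splits A and B
gain_a = split_gain([6, 4], [[4, 1], [2, 3]])   # M0 - M12
gain_b = split_gain([6, 4], [[5, 0], [1, 4]])   # M0 - M34
print(gain_b > gain_a)  # True: B yields purer children, so it is preferred
```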

25 Measure of Impurity: GINI
Gini index for a given node t: GINI(t) = 1 − Σj [p(j | t)]², where p(j | t) is the relative frequency of class j at node t. Maximum (1 − 1/nc, with nc the number of classes) when records are equally distributed among all classes, implying the least interesting information. Minimum (0.0) when all records belong to one class, implying the most interesting information.
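The formula and its two extremes can be checked in a few lines (the class counts here are illustrative):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from the class counts at node t."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([5, 5]))    # 0.5 = 1 - 1/2: records spread equally over 2 classes (maximum)
print(gini([10, 0]))   # 0.0: all records in one class (minimum)
```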


46 Model Overfitting: The errors committed by a classification model are generally divided into two types: 1. Training errors (also called resubstitution or apparent errors): the number of misclassification errors committed on the training records. 2. Generalization errors: the expected error of the model on previously unseen records. A good classification model must have low training error as well as low generalization error. The situation where a decision tree’s test error rate increases even though its training error rate decreases is called model overfitting. Reasons model overfitting occurs: presence of noise; lack of representative samples.


48 How to Address Overfitting
Pre-Pruning (Early Stopping Rule): stop the algorithm before it grows a fully-grown tree. Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same. More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of instances is independent of the available features (e.g., using a χ² test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
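The simpler stopping conditions above can be sketched as a single predicate. This is only an illustration: the (attribute-dict, label) record encoding and the threshold of 5 are assumptions, and the χ² and impurity-gain checks are omitted.

```python
def should_stop(records, min_samples=5):
    """Pre-pruning sketch: stop expanding a node when a simple condition holds."""
    labels = {label for _, label in records}
    if len(labels) <= 1:
        return True                          # all instances belong to the same class
    distinct = {tuple(sorted(feats.items())) for feats, _ in records}
    if len(distinct) == 1:
        return True                          # all attribute values are identical
    return len(records) < min_samples        # user-specified size threshold

pure = [({"a": 1}, "Yes"), ({"a": 2}, "Yes")]
mixed = [({"a": i}, "Yes" if i % 2 else "No") for i in range(6)]
print(should_stop(pure), should_stop(mixed))  # True False
```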

49 How to Address Overfitting…
Post-pruning: grow the decision tree to its entirety, then trim the nodes of the tree in a bottom-up fashion. If generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is determined from the majority class of instances in the sub-tree. MDL (Minimum Description Length) can be used for post-pruning.


51 Metrics for Performance Evaluation:
Evaluation of the performance of a classification model is based on the counts of test records correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a confusion matrix (or error matrix), which provides the information needed to determine how well a classifier performs. Definition (Kohavi and Provost, 1998): a confusion matrix contains information about actual and predicted classifications done by a classification system; performance is commonly evaluated using the data in the matrix.

From a confusion matrix, accuracy and error rate evaluate the performance of a classification model: Accuracy = number of correct predictions / total number of predictions. Error rate = number of wrong predictions / total number of predictions. Most classification algorithms seek models that attain the highest accuracy or, equivalently, the lowest error rate when applied to the test set.
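The two formulas, applied to a small made-up 2×2 confusion matrix:

```python
def accuracy_and_error(confusion):
    """confusion[i][j] = number of class-i records predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal
    return correct / total, (total - correct) / total

cm = [[50, 10],    # actual class 0: 50 correct, 10 misclassified
      [5, 35]]     # actual class 1: 5 misclassified, 35 correct
accuracy, error_rate = accuracy_and_error(cm)
print(accuracy, error_rate)  # 0.85 0.15
```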

53 Methods of Estimation
Holdout: reserve 2/3 for training and 1/3 for testing.
Random subsampling: repeated holdout.
Cross-validation: partition the data into k disjoint subsets; k-fold: train on k − 1 partitions, test on the remaining one; leave-one-out: k = n.
Bootstrap: sampling with replacement.
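The k-fold scheme above, sketched over record indices (round-robin fold assignment is one arbitrary way to form the disjoint subsets):

```python
def kfold_splits(n, k):
    """Yield (train, test) index lists; each fold serves as the test set exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]   # k disjoint subsets of 0..n-1
    for i in range(k):
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        yield train, folds[i]

splits = list(kfold_splits(10, 5))
print(len(splits))  # 5 train/test pairs
```

With k = n this degenerates to leave-one-out: each test set holds a single record.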

54 Rule-Based Classifier
Classify records by using a collection of “if…then…” rules. Rule: (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label. LHS: rule antecedent or condition. RHS: rule consequent. Examples of classification rules: (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds; (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No.

55 Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

56 Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Rule R1 covers a hawk → Bird. Rule R3 covers the grizzly bear → Mammal.

57 Rule Coverage and Accuracy
Coverage of a rule: the fraction of records that satisfy the antecedent of the rule. Accuracy of a rule: the fraction of the covered records whose class label matches the rule’s consequent. Example: (Status = Single) → No, with Coverage = 40% and Accuracy = 50%.
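Both measures on a hypothetical 10-record set constructed to reproduce the slide’s numbers (40% coverage, 50% accuracy); the record encoding is an assumption:

```python
def rule_stats(records, antecedent, consequent):
    """Coverage over all records; accuracy over the covered records only."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    accuracy = (sum(r["class"] == consequent for r in covered) / len(covered)
                if covered else 0.0)
    return coverage, accuracy

# 4 Single records (2 labeled No), 6 Married records
records = ([{"status": "Single", "class": "No"}] * 2
           + [{"status": "Single", "class": "Yes"}] * 2
           + [{"status": "Married", "class": "No"}] * 6)
cov, acc = rule_stats(records, lambda r: r["status"] == "Single", "No")
print(cov, acc)  # 0.4 0.5
```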

58 How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal. A turtle triggers both R4 and R5. A dogfish shark triggers none of the rules.
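The rules R1–R5 can be encoded as predicates to reproduce the three cases above; the attribute names and record encodings are assumptions made for this sketch:

```python
# Rules as (name, antecedent predicate, consequent class) triples
rules = [
    ("R1", lambda r: r["give_birth"] == "no" and r["can_fly"] == "yes", "Birds"),
    ("R2", lambda r: r["give_birth"] == "no" and r["live_in_water"] == "yes", "Fishes"),
    ("R3", lambda r: r["give_birth"] == "yes" and r["blood_type"] == "warm", "Mammals"),
    ("R4", lambda r: r["give_birth"] == "no" and r["can_fly"] == "no", "Reptiles"),
    ("R5", lambda r: r["live_in_water"] == "sometimes", "Amphibians"),
]

def triggered(record):
    """Return the names of every rule whose antecedent the record satisfies."""
    return [name for name, cond, _ in rules if cond(record)]

lemur = {"give_birth": "yes", "blood_type": "warm",
         "can_fly": "no", "live_in_water": "no"}
turtle = {"give_birth": "no", "blood_type": "cold",
          "can_fly": "no", "live_in_water": "sometimes"}
dogfish = {"give_birth": "yes", "blood_type": "cold",
           "can_fly": "no", "live_in_water": "yes"}
print(triggered(lemur))    # ['R3']
print(triggered(turtle))   # ['R4', 'R5']
print(triggered(dogfish))  # []
```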

59 Characteristics of Rule-Based Classifier
Mutually exclusive rules Classifier contains mutually exclusive rules if the rules are independent of each other Every record is covered by at most one rule Exhaustive rules Classifier has exhaustive coverage if it accounts for every possible combination of attribute values Each record is covered by at least one rule

60 From Decision Trees To Rules
Rules are mutually exclusive and exhaustive Rule set contains as much information as the tree

61 Rules Can Be Simplified
Initial rule: (Refund = No) ∧ (Status = Married) → No. Simplified rule: (Status = Married) → No.

62 Effect of Rule Simplification
Rules are no longer mutuallyively exclusive: a record may trigger more than one rule. Solutions: an ordered rule set, or an unordered rule set with a voting scheme. Rules are no longer exhaustive: a record may not trigger any rule. Solution: use a default class. Default rule: a default rule has an empty antecedent and is triggered when all other rules have failed: rd: ( ) → yd, where yd is called the default class.

63 Ordered Rule Set Unordered Rule Set
The rules in a rule set are ordered in decreasing order of priority (which can be defined in many ways, e.g., based on accuracy, coverage, or total description length). An ordered rule set is also known as a decision list. When a test record is presented, it is classified by the highest-ranked rule that covers it. Unordered Rule Set: this approach allows a test record to trigger multiple classification rules and treats the consequent of each triggered rule as a vote for a particular class. The votes are tallied, and the class that receives the highest number of votes is assigned to the test record.
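The two schemes side by side, with three toy rules and a toy record (all made up for illustration):

```python
def classify_ordered(record, rules, default="?"):
    """Decision list: the first (highest-priority) covering rule wins."""
    for cond, label in rules:
        if cond(record):
            return label
    return default                       # no rule fires: default class

def classify_unordered(record, rules, default="?"):
    """Every triggered rule casts a vote; the majority class wins."""
    votes = {}
    for cond, label in rules:
        if cond(record):
            votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else default

rules = [(lambda r: r["x"] > 0, "A"),
         (lambda r: r["y"] > 0, "B"),
         (lambda r: r["y"] > 1, "B")]
record = {"x": 1, "y": 2}
print(classify_ordered(record, rules))    # A: first matching rule
print(classify_unordered(record, rules))  # B: 2 votes to 1
```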

64 Rule Ordering Schemes
Rule-based ordering: individual rules are ranked based on their quality.
Class-based ordering: rules that belong to the same class appear together.

65 Rule – Growing Strategy:
There are two common strategies for growing a classification rule: general-to-specific and specific-to-general. General-to-Specific: an initial rule r: { } → y is created, whose left-hand side is an empty set and whose right-hand side contains the target class. The initial rule is poor in quality because it covers all the examples in the training set. New conjuncts are subsequently added to improve the rule’s quality. This process continues until the stopping criterion is met (e.g., when an added conjunct no longer improves the quality of the rule).
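A greedy sketch of the general-to-specific loop. This is an illustration under stated assumptions: rule accuracy is used as the quality measure, conjuncts are predicates over attribute dicts, and the toy data is made up.

```python
def grow_rule(records, candidates, target):
    """Start from {} -> target; greedily add the conjunct that most improves
    accuracy, stopping when no remaining candidate improves the rule."""
    def accuracy(conjuncts):
        covered = [r for r in records if all(c(r) for c in conjuncts)]
        if not covered:
            return 0.0
        return sum(r["class"] == target for r in covered) / len(covered)

    chosen, remaining = [], list(candidates)
    while remaining:
        best = max(remaining, key=lambda c: accuracy(chosen + [c]))
        if accuracy(chosen + [best]) <= accuracy(chosen):
            break                        # stopping criterion: no improvement
        chosen.append(best)
        remaining.remove(best)
    return chosen

records = [{"a": 1, "class": "Yes"}] * 3 + [{"a": 0, "class": "No"}] * 3
rule = grow_rule(records, [lambda r: r["a"] == 1, lambda r: r["a"] == 0], "Yes")
print(len(rule))  # 1 conjunct kept: a == 1
```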

66 Specific – to – General:
One of the positive examples is randomly chosen as the initial seed for the rule-growing process. In the refinement step, the rule is generalized by removing one of its conjuncts so that it can cover more positive examples. The refinement step is repeated until the stopping criterion is met, e.g., when the rule starts covering negative examples.

