C4.5 algorithm Let the classes be denoted {C1, C2,…, Ck}. There are three possibilities for the content of the set of training samples T in the given node.

C4.5 algorithm
Let the classes be denoted {C1, C2, …, Ck}. There are three possibilities for the content of the set of training samples T at a given node of the decision tree:
1. T contains one or more samples, all belonging to a single class Cj. The decision tree for T is a leaf identifying class Cj.

C4.5 algorithm
2. T contains no samples. The decision tree is again a leaf, but the class associated with the leaf must be determined from information other than T, such as the overall majority class. As its criterion, C4.5 uses the most frequent class at the parent of the given node.

C4.5 algorithm
3. T contains samples that belong to a mixture of classes. In this situation, the idea is to refine T into subsets of samples that are heading towards single-class collections of samples. An appropriate test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2, …, On}. T is partitioned into subsets T1, T2, …, Tn, where Ti contains all the samples in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test and one branch for each possible outcome.
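The three cases map directly onto a recursive procedure. The following is only a minimal sketch: the sample layout, the `domains` argument, the toy data, and the placeholder rule "split on the first remaining attribute" are all assumptions made for illustration, and the fallback when no attribute is left is not part of the description above. The test-selection criterion that C4.5 actually uses is the gain criterion described on the following slides.

```python
from collections import Counter

def majority_class(samples):
    """Most frequent class label among (attributes, label) pairs."""
    return Counter(label for _, label in samples).most_common(1)[0][0]

def build_tree(samples, domains, parent_majority=None):
    """samples: list of (attribute_dict, class_label).
    domains: attribute name -> list of possible outcomes (hypothetical layout)."""
    # Case 2: T is empty, so the leaf takes the most frequent class at the parent.
    if not samples:
        return {"leaf": parent_majority}
    classes = {label for _, label in samples}
    # Case 1: every sample in T belongs to a single class.
    if len(classes) == 1:
        return {"leaf": classes.pop()}
    # Assumption of this sketch: with no attribute left, use the majority class.
    if not domains:
        return {"leaf": majority_class(samples)}
    # Case 3: mixed classes. Placeholder test choice: first remaining attribute.
    attribute = next(iter(domains))
    remaining = {a: v for a, v in domains.items() if a != attribute}
    here = majority_class(samples)
    branches = {}
    for outcome in domains[attribute]:
        subset = [(x, y) for x, y in samples if x[attribute] == outcome]
        branches[outcome] = build_tree(subset, remaining, here)
    return {"test": attribute, "branches": branches}

# Hypothetical toy data; no sample has colour "green", which triggers case 2.
toy = [
    ({"colour": "red", "size": "big"}, "C1"),
    ({"colour": "red", "size": "small"}, "C2"),
    ({"colour": "blue", "size": "big"}, "C1"),
]
domains = {"colour": ["red", "blue", "green"], "size": ["big", "small"]}
tree = build_tree(toy, domains)
print(tree)
```

The "green" branch receives no samples, so its leaf inherits the majority class of the node above it, which is the behaviour described in case 2.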

C4.5 algorithm
Test: entropy. If S is any set of samples, let freq(Ci, S) stand for the number of samples in S that belong to class Ci (out of k possible classes), and let |S| denote the number of samples in the set S. The entropy of the set S is then:

$$\mathrm{Info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$$
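As a quick illustration, the entropy formula can be written as a small function over per-class sample counts. This is a sketch: the function name `info` and the list-of-counts input are choices made here, and the counts 9 and 5 are the class totals used in the worked example later in the transcript.

```python
from math import log2

def info(class_counts):
    """Entropy in bits of a sample set, given the number of samples per class."""
    total = sum(class_counts)
    # Classes with zero samples contribute nothing (0 * log2(0) is taken as 0).
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

mixed = info([9, 5])   # 9 samples of one class, 5 of the other
pure = info([4, 0])    # all samples in one class
even = info([1, 1])    # two classes, equally frequent
print(round(mixed, 3), pure, even)
```

A single-class set has entropy 0 and an evenly split two-class set has entropy 1 bit, which is why the split criterion favours tests that produce purer subsets.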

C4.5 algorithm
After the set T has been partitioned in accordance with the n outcomes of one attribute test X:

$$\mathrm{Info}_x(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, \mathrm{Info}(T_i)$$

$$\mathrm{Gain}(X) = \mathrm{Info}(T) - \mathrm{Info}_x(T)$$

Criterion: select the attribute with the highest Gain value.

Example of C4.5 algorithm
TABLE 7.1 (p.145): A simple flat database of examples for training

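Example of C4.5 algorithm

$$\mathrm{Info}(T) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940 \text{ bits}$$

$$\mathrm{Info}_{x_1}(T) = \tfrac{5}{14}\left(-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5}\right) + \tfrac{4}{14}\left(-\tfrac{4}{4}\log_2\tfrac{4}{4} - \tfrac{0}{4}\log_2\tfrac{0}{4}\right) + \tfrac{5}{14}\left(-\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5}\right) = 0.694 \text{ bits}$$

$$\mathrm{Gain}(x_1) = 0.940 - 0.694 = 0.246 \text{ bits}$$

Here the term with 0/4 is taken as 0, by the usual convention that 0 · log2(0) = 0.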

Example of C4.5 algorithm
Test x1 on Attribute1, with outcomes A, B, and C:

T1 (Attribute1 = A):
Att.2   Att.3   Class
-------------------------------
70      True    CLASS1
90      True    CLASS2
85      False   CLASS2
95      False   CLASS2
70      False   CLASS1

T2 (Attribute1 = B):
Att.2   Att.3   Class
-------------------------------
90      True    CLASS1
78      False   CLASS1
65      True    CLASS1
75      False   CLASS1

T3 (Attribute1 = C):
Att.2   Att.3   Class
-------------------------------
80      True    CLASS2
70      True    CLASS2
80      False   CLASS1
80      False   CLASS1
96      False   CLASS1

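Example of C4.5 algorithm

$$\mathrm{Info}(T) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940 \text{ bits}$$

$$\mathrm{Info}_{A_3}(T) = \tfrac{6}{14}\left(-\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6}\right) + \tfrac{8}{14}\left(-\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8}\right) = 0.892 \text{ bits}$$

$$\mathrm{Gain}(A_3) = 0.940 - 0.892 = 0.048 \text{ bits}$$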

Example of C4.5 algorithm
Test on Attribute3, with outcomes True and False:

Attribute3 = True:
Att.1   Att.2   Class
-------------------------------
A       70      CLASS1
A       90      CLASS2
B       90      CLASS1
B       65      CLASS1
C       80      CLASS2
C       70      CLASS2

Attribute3 = False:
Att.1   Att.2   Class
-------------------------------
A       85      CLASS2
A       95      CLASS2
A       70      CLASS1
B       78      CLASS1
B       75      CLASS1
C       80      CLASS1
C       80      CLASS1
C       96      CLASS1
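The two gains worked out above can be checked with a short sketch. It works on class counts only: each split is given as a list of (CLASS1, CLASS2) counts per outcome, read off the fractions in the formulas. The function names and this input layout are assumptions of the sketch, not part of C4.5 itself.

```python
from math import log2

def info(class_counts):
    """Entropy in bits from per-class sample counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def info_x(partition):
    """Weighted entropy after a split; partition holds per-subset class counts."""
    total = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / total * info(counts) for counts in partition)

def gain(partition):
    # Class counts of the unsplit set T are the column-wise sums over the subsets.
    parent = [sum(column) for column in zip(*partition)]
    return info(parent) - info_x(partition)

attribute1 = [[2, 3], [4, 0], [3, 2]]   # (CLASS1, CLASS2) counts for A, B, C
attribute3 = [[3, 3], [6, 2]]           # (CLASS1, CLASS2) counts for True, False

gain_a1 = gain(attribute1)
gain_a3 = gain(attribute3)
print(f"Gain(Attribute1) = {gain_a1:.3f} bits")
print(f"Gain(Attribute3) = {gain_a3:.3f} bits")
```

Because the code carries full precision, the last digit can differ slightly from the slide values, which subtract already rounded intermediate results.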

C4.5 algorithm
C4.5 contains mechanisms for proposing three types of tests:
1. The "standard" test on a discrete attribute, with one outcome and branch for each possible value of that attribute.
2. If attribute Y has continuous numeric values, a binary test with outcomes Y ≤ Z and Y > Z, defined by comparing the value of the attribute against a threshold value Z.
3. A more complex test, also based on a discrete attribute, in which the possible values are allocated to a variable number of groups, with one outcome and branch for each group.

Handle numeric values
Threshold value Z: the training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v_1, v_2, …, v_m}. Any threshold value lying between v_i and v_{i+1} will have the same effect of dividing the cases into those whose value of the attribute Y lies in {v_1, v_2, …, v_i} and those whose value is in {v_{i+1}, v_{i+2}, …, v_m}. There are thus only m-1 possible splits on Y, all of which should be examined systematically to obtain an optimal split.

Handle numeric values
It is usual to choose the midpoint of each interval, (v_i + v_{i+1})/2, as the representative threshold. C4.5 instead chooses the smaller value v_i of every interval {v_i, v_{i+1}} as the threshold, rather than the midpoint itself. This ensures that every threshold value appearing in the tree actually occurs in the data.
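The difference between the two conventions can be sketched as follows. The numbers are the Attribute2 values from the example database; the function names are illustrative only.

```python
def c45_candidates(values):
    """Candidate thresholds in the C4.5 style: every distinct value except the largest."""
    distinct = sorted(set(values))
    return distinct[:-1]

def midpoint_candidates(values):
    """The more common convention: midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]

attribute2 = [70, 90, 85, 95, 70, 90, 78, 65, 75, 80, 70, 80, 96]

c45 = c45_candidates(attribute2)
mids = midpoint_candidates(attribute2)
print(c45)
print(mids)
```

With m distinct values both functions return m-1 candidates, and each pair of corresponding candidates induces the same partition of the samples.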

Example (1/2)
Attribute2: after sorting, the set of values is {65, 70, 75, 78, 80, 85, 90, 95, 96}, and the set of potential threshold values Z (as chosen by C4.5) is {65, 70, 75, 78, 80, 85, 90, 95}. The optimal value is Z = 80. The information gain for the corresponding test x3 (Attribute2 ≤ 80 or Attribute2 > 80) is computed as follows.

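Example (2/2)

$$\mathrm{Info}_{x_3}(T) = \tfrac{9}{14}\left(-\tfrac{7}{9}\log_2\tfrac{7}{9} - \tfrac{2}{9}\log_2\tfrac{2}{9}\right) + \tfrac{5}{14}\left(-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5}\right) = 0.838 \text{ bits}$$

$$\mathrm{Gain}(x_3) = 0.940 - 0.838 = 0.102 \text{ bits}$$

Attribute1 gives the highest gain of 0.246 bits, and therefore this attribute will be selected for the first split.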
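The search over candidate thresholds can be sketched as below. The (Attribute2, class) pairs are transcribed from the example tables, with class labels shortened to "C1" and "C2". The pair (80, "C1") is listed twice on the assumption that the database has 14 samples, as the fractions 9/14 and 5/14 in the formulas imply. Function names are illustrative.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def gain_for_threshold(pairs, z):
    """Gain of the binary test value <= z versus value > z."""
    left = [label for value, label in pairs if value <= z]
    right = [label for value, label in pairs if value > z]
    split = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
    return info([label for _, label in pairs]) - split

def best_threshold(pairs):
    # C4.5-style candidates: every distinct value except the largest.
    candidates = sorted({value for value, _ in pairs})[:-1]
    return max(candidates, key=lambda z: gain_for_threshold(pairs, z))

pairs = [
    (70, "C1"), (90, "C2"), (85, "C2"), (95, "C2"), (70, "C1"),   # Attribute1 = A
    (90, "C1"), (78, "C1"), (65, "C1"), (75, "C1"),               # Attribute1 = B
    (80, "C2"), (70, "C2"), (80, "C1"), (80, "C1"), (96, "C1"),   # Attribute1 = C
]

z = best_threshold(pairs)
print(z, round(gain_for_threshold(pairs, z), 3))
```

Under these assumptions the search selects Z = 80, in agreement with the example.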

Unknown attribute values
C4.5 adopts the principle that samples with unknown values are distributed probabilistically according to the relative frequency of the known values. The gain criterion then takes the form:

$$\mathrm{Gain}(x) = F \cdot \left(\mathrm{Info}(T) - \mathrm{Info}_x(T)\right)$$

where F = (number of samples in the database with a known value for the given attribute) / (total number of samples in the data set).

Example
Attribute1   Attribute2   Attribute3   Class
-------------------------------------------------
A            70           True         CLASS1
A            90           True         CLASS2
A            85           False        CLASS2
A            95           False        CLASS2
A            70           False        CLASS1
?            90           True         CLASS1
B            78           False        CLASS1
B            65           True         CLASS1
B            75           False        CLASS1
C            80           True         CLASS2
C            70           True         CLASS2
C            80           False        CLASS1
C            80           False        CLASS1
C            96           False        CLASS1
-------------------------------------------------

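Example

$$\mathrm{Info}(T) = -\tfrac{8}{13}\log_2\tfrac{8}{13} - \tfrac{5}{13}\log_2\tfrac{5}{13} = 0.961 \text{ bits}$$

$$\mathrm{Info}_{x_1}(T) = \tfrac{5}{13}\left(-\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5}\right) + \tfrac{3}{13}\left(-\tfrac{3}{3}\log_2\tfrac{3}{3} - \tfrac{0}{3}\log_2\tfrac{0}{3}\right) + \tfrac{5}{13}\left(-\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5}\right) = 0.747 \text{ bits}$$

$$\mathrm{Gain}(x_1) = \tfrac{13}{14}\,(0.961 - 0.747) = 0.199 \text{ bits}$$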
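The corrected gain can be reproduced with a small sketch that works only on the samples whose Attribute1 value is known and then scales by F. The function names and the count-based input are assumptions made here for illustration.

```python
from math import log2

def info(class_counts):
    """Entropy in bits from per-class sample counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def info_x(partition):
    """Weighted entropy after a split; partition holds per-subset class counts."""
    total = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / total * info(counts) for counts in partition)

def gain_with_unknowns(known_partition, total_samples):
    """Gain computed on samples with a known value, scaled by the known fraction F."""
    known = sum(sum(counts) for counts in known_partition)
    f = known / total_samples
    parent = [sum(column) for column in zip(*known_partition)]
    return f * (info(parent) - info_x(known_partition))

# (CLASS1, CLASS2) counts for Attribute1 = A, B, C among the 13 known samples.
known_attribute1 = [[2, 3], [3, 0], [3, 2]]

g = gain_with_unknowns(known_attribute1, total_samples=14)
print(round(g, 3))
```

The factor F = 13/14 lowers the gain of an attribute in proportion to how many of its values are missing.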

Unknown attribute values
When a case from T with a known value is assigned to subset Ti, the probability that it belongs to Ti is 1, and the probability that it belongs to any other subset is 0. For a sample with a missing value, C4.5 therefore associates a weight w with the sample in each subset Ti, representing the probability that the case belongs to that subset.

Unknown attribute values
Splitting set T using test x1 on Attribute1: the new weights wi of the sample with the missing value are equal to the probabilities 5/13, 3/13, and 5/13, because the initial (old) value of w is equal to one. The subset sizes become |T1| = 5 + 5/13, |T2| = 3 + 3/13, and |T3| = 5 + 5/13.
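A minimal sketch of this weighting step, using exact fractions. The function name and the dictionary layout are illustrative assumptions; the counts 5, 3, and 5 are the numbers of samples with known Attribute1 values A, B, and C.

```python
from fractions import Fraction

def distribute_unknown(known_sizes, weight=Fraction(1)):
    """Split the weight of a sample with a missing value across the outcomes,
    in proportion to the number of samples with known values per outcome."""
    total = sum(known_sizes.values())
    return {outcome: weight * Fraction(n, total) for outcome, n in known_sizes.items()}

known = {"A": 5, "B": 3, "C": 5}
weights = distribute_unknown(known)
sizes = {outcome: known[outcome] + weights[outcome] for outcome in known}

for outcome in known:
    print(outcome, weights[outcome], float(sizes[outcome]))
```

The `weight` parameter allows the same step to be repeated deeper in the tree, where a sample may arrive with a weight already below one.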

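Example: Fig 7.7

T1 (Attribute1 = A):
Att.2   Att.3   Class   w
-------------------------------
70      True    C1      1
90      True    C2      1
85      False   C2      1
95      False   C2      1
70      False   C1      1
90      True    C1      5/13

T2 (Attribute1 = B):
Att.2   Att.3   Class   w
-------------------------------
90      True    C1      3/13
78      False   C1      1
65      True    C1      1
75      False   C1      1

T3 (Attribute1 = C):
Att.2   Att.3   Class   w
-------------------------------
80      True    C2      1
70      True    C2      1
80      False   C1      1
80      False   C1      1
96      False   C1      1
90      True    C1      5/13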

Unknown attribute values
The leaves of the decision tree are annotated with two new parameters, written (|Ti| / E): |Ti| is the sum of the fractional samples that reach the leaf, and E is the number of those samples that belong to classes other than the nominated class.
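As an illustration, the two leaf parameters can be computed from the weighted samples that reach a leaf. The sketch below uses the leaf for Attribute1 = A and Attribute2 > 70: three CLASS2 samples with weight 1 and the fractional CLASS1 sample with weight 5/13. The function name, the input layout, and the choice of the heaviest class as the nominated class are assumptions of this sketch.

```python
def leaf_summary(weighted_samples):
    """weighted_samples: list of (class_label, weight) pairs reaching the leaf.
    Returns (nominated class, |Ti|, E), with the numbers rounded to one decimal."""
    totals = {}
    for label, weight in weighted_samples:
        totals[label] = totals.get(label, 0.0) + weight
    nominated = max(totals, key=totals.get)   # class with the largest total weight
    size = sum(totals.values())               # |Ti|: sum of fractional samples
    errors = size - totals[nominated]         # E: weight belonging to other classes
    return nominated, round(size, 1), round(errors, 1)

leaf = [("CLASS2", 1.0), ("CLASS2", 1.0), ("CLASS2", 1.0), ("CLASS1", 5 / 13)]
summary = leaf_summary(leaf)
print(summary)
```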

Unknown attribute values
If Attribute1 = A Then
    If Attribute2 <= 70 Then
        Classification = CLASS1 (2.0 / 0);
    Else
        Classification = CLASS2 (3.4 / 0.4);
Elseif Attribute1 = B Then
    Classification = CLASS1 (3.2 / 0);
Elseif Attribute1 = C Then
    If Attribute3 = True Then
        Classification = CLASS2 (2.4 / 0);
    Else
        Classification = CLASS1 (3.0 / 0).

Pruning decision trees
The main task in decision tree pruning is to simplify the tree by discarding one or more subtrees and replacing them with leaves. There are two approaches:
- Prepruning
- Postpruning
C4.5 follows a postpruning approach (pessimistic pruning).

Pruning decision trees
- Prepruning: deciding not to divide a set of samples any further under some conditions. The stopping criterion is usually based on some statistical test, such as the χ2-test.
- Postpruning: retrospectively removing some of the tree structure using selected accuracy criteria.

Pruning decision trees in C4.5

Generating decision rules
Large decision trees are difficult to understand because each node has a specific context established by the outcomes of tests at antecedent nodes. To make a decision-tree model more readable, the path to each leaf can be transformed into an IF-THEN production rule.

Generating decision rules
The IF part consists of all the tests on a path. The IF parts of the rules are mutually exclusive. The THEN part is the final classification.

Generating decision rules
Decision rules for the decision tree in Fig 7.5:
If Attribute1 = A and Attribute2 <= 70 Then Classification = CLASS1 (2.0 / 0);
If Attribute1 = A and Attribute2 > 70 Then Classification = CLASS2 (3.4 / 0.4);
If Attribute1 = B Then Classification = CLASS1 (3.2 / 0);
If Attribute1 = C and Attribute3 = True Then Classification = CLASS2 (2.4 / 0);
If Attribute1 = C and Attribute3 = False Then Classification = CLASS1 (3.0 / 0).
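One way to see that the rules are mutually exclusive is to encode them as (condition, class) pairs and count how many fire for a given sample. This is only a sketch: the dictionary-based sample format, the `RULES` list, and the function names are assumptions made for illustration.

```python
# Each rule is an (IF part, THEN part) pair taken from the list above.
RULES = [
    (lambda s: s["Attribute1"] == "A" and s["Attribute2"] <= 70, "CLASS1"),
    (lambda s: s["Attribute1"] == "A" and s["Attribute2"] > 70, "CLASS2"),
    (lambda s: s["Attribute1"] == "B", "CLASS1"),
    (lambda s: s["Attribute1"] == "C" and s["Attribute3"] is True, "CLASS2"),
    (lambda s: s["Attribute1"] == "C" and s["Attribute3"] is False, "CLASS1"),
]

def firing_rules(sample):
    """Classes proposed by every rule whose IF part matches the sample."""
    return [label for condition, label in RULES if condition(sample)]

def classify(sample):
    """Class of the first matching rule, or None if no rule matches."""
    matches = firing_rules(sample)
    return matches[0] if matches else None

samples = [
    {"Attribute1": "A", "Attribute2": 90, "Attribute3": True},
    {"Attribute1": "B", "Attribute2": 78, "Attribute3": False},
    {"Attribute1": "C", "Attribute2": 96, "Attribute3": False},
]
for sample in samples:
    print(sample["Attribute1"], classify(sample), len(firing_rules(sample)))
```

Because each rule corresponds to one root-to-leaf path, exactly one rule fires for any sample with known attribute values.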