Lecture 7
Outline
1. Overview of classification and decision trees
2. Algorithm to build a decision tree
3. Formulas to measure information
4. Weka, data preparation, and visualization
1. Illustration of the Classification Task
Courtesy of Professor David Mease for the next 10 slides.
(Figure: a training set is fed into a learning algorithm, which induces a model; the model is then applied to new records.)
Classification: Definition
- Given a collection of records (the training set), each record contains a set of attributes (x) plus one additional attribute, the class (y).
- Find a model that predicts the class as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it. (A minimal code sketch of this workflow follows.)
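As an illustration only (not part of the original slides), here is a minimal sketch of the train/test workflow using scikit-learn; the toy data, column meanings, and the choice of library are assumptions made for this example.

```python
# Minimal sketch: split data into training and test sets, build a decision
# tree from the training set, and measure accuracy on the held-out test set.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# X: attribute values for each record, y: class labels (0 = "No", 1 = "Yes").
# These values are made up purely for illustration.
X = [[0, 125], [1, 100], [0, 70], [1, 120], [0, 95], [1, 60], [0, 220], [1, 85]]
y = [0, 0, 0, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = DecisionTreeClassifier()      # the learning algorithm
model.fit(X_train, y_train)           # build the model from the training set
y_pred = model.predict(X_test)        # apply the model to previously unseen records
print("Test accuracy:", accuracy_score(y_test, y_pred))
```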
Classification Examples
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
- Predicting tumor cells as benign or malignant
Classification Techniques
- There are many techniques/algorithms for carrying out classification.
- In this chapter we will study only decision trees.
- In Chapter 5 we will study other techniques, including some very modern and effective ones.
An Example of a Decision Tree
Training data: categorical attributes (Refund, Marital Status), a continuous attribute (Taxable Income), and the class (Cheat).
Model (decision tree), with Refund, MarSt, and TaxInc as the splitting attributes:
  Refund = Yes → NO
  Refund = No  → MarSt
      MarSt = Married           → NO
      MarSt = Single, Divorced  → TaxInc
          TaxInc < 80K → NO
          TaxInc > 80K → YES
Applying the Tree Model to Predict the Class for a New Observation
- Start from the root of the tree.
- At each internal node (Refund, then MarSt, then TaxInc if needed), follow the branch that matches the test record's attribute value until a leaf is reached.
- For the test record shown on the slides, the leaf reached is NO, so assign Cheat to "No".
DECISION TREE CHARACTERISTICS
- Easy to understand: similar to the human decision process
- Deals with both discrete and continuous features
- Simple, nonparametric classifier: no assumptions regarding probability distribution types
- Finding an optimal tree is NP-complete, but the greedy algorithms used in practice are computationally inexpensive
- Can represent arbitrarily complex decision boundaries
- Overfitting can be a problem
2. Algorithm to Build Decision Trees
The tree is defined recursively:
- Select an attribute for the root node
- Use a greedy algorithm to create a branch for each possible value of the selected attribute
- Repeat recursively until either we run out of instances or attributes, or a predefined purity threshold is reached
- Use only branches that are reached by instances
(A minimal code sketch of this recursive procedure follows.)
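The sketch below (added here, not from the slides) shows the recursive build procedure for nominal attributes; the data layout (a list of dicts with a "class" key) and the helper select_best_attribute, which stands in for whichever purity measure is used (entropy, Gini, or classification error), are assumptions for illustration.

```python
# Sketch of recursive decision-tree construction for nominal attributes.
# Each record is assumed to be a dict of attribute -> value plus a "class" key.
from collections import Counter

def build_tree(records, attributes, select_best_attribute):
    classes = [r["class"] for r in records]
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]
    best = select_best_attribute(records, attributes)   # greedy choice at this node
    tree = {"attribute": best, "branches": {}}
    # One branch per value of the chosen attribute that actually occurs in the
    # data, so branches reached by no instances are omitted.
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = build_tree(subset, remaining, select_best_attribute)
    return tree
```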
WEATHER EXAMPLE, 9 yes/5 no
Splitting on Outlook (Sunny: 2 Yes / 3 No, Overcast: 4 Yes, Rainy: 3 Yes / 2 No):

Sunny branch:
  Temp  Hum     Wind   Play
  Hot   high    FALSE  No
  Hot   high    TRUE   No
  Mild  high    FALSE  No
  Cool  normal  FALSE  Yes
  Mild  normal  TRUE   Yes

Overcast branch:
  Temp  Hum     Wind   Play
  Hot   high    FALSE  Yes
  Cool  normal  TRUE   Yes
  Mild  high    TRUE   Yes
  Hot   normal  FALSE  Yes

Rainy branch:
  Temp  Hum     Wind   Play
  Mild  high    FALSE  Yes
  Cool  normal  FALSE  Yes
  Cool  normal  TRUE   No
  Mild  normal  FALSE  Yes
  Mild  high    TRUE   No

(A code representation of this data set is sketched below.)
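As an aside (not on the original slide), the full 14-record weather data set can be written down directly; the tuple layout below is an assumption for illustration, and the per-Outlook class counts it prints match the 2/3, 4/0, 3/2 figures above.

```python
# The classic 14-record weather data set: (Outlook, Temp, Humidity, Windy, Play).
weather = [
    ("Sunny", "Hot", "High", False, "No"),     ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"), ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"), ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"), ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),  ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]

# Class distribution within each Outlook branch
# (expected: Sunny 2 Yes / 3 No, Overcast 4 Yes / 0 No, Rainy 3 Yes / 2 No).
from collections import Counter
for value in ("Sunny", "Overcast", "Rainy"):
    plays = Counter(r[4] for r in weather if r[0] == value)
    print(value, plays["Yes"], "Yes /", plays["No"], "No")
```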
DECISION TREES: Weather Example
(Figure: splitting on Temperature. Hot: 2 Yes / 2 No, Mild: 4 Yes / 2 No, Cool: 3 Yes / 1 No; every branch still mixes Yes and No.)
DECISION TREES: Weather Example
(Figure: splitting on Humidity. High: 3 Yes / 4 No, Normal: 6 Yes / 1 No.)
DECISION TREES: Weather Example
(Figure: splitting on Windy. TRUE: 3 Yes / 3 No, FALSE: 6 Yes / 2 No.)
3. Information Measures
Selecting the attribute upon which to split requires a measure of "purity"/information. Candidates:
- Entropy
- Gini index
- Classification error
A Graphical Comparison
(Figure: the three impurity measures plotted against the fraction of records in one class for a two-class node.)
Entropy
- Measures purity, similar to the Gini index
- Used in C4.5
- After the entropy is computed in each node, the overall value of the entropy is computed as the weighted average of the entropy in each node, as with the Gini index
- The decrease in entropy is called the "information gain" (page 160)
(Written out as formulas below.)
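Written out explicitly (the notation below is added here and follows the usual textbook definitions):

$$\mathrm{Entropy}(t) = -\sum_i p(i \mid t)\,\log_2 p(i \mid t)$$

$$\mathrm{Entropy}_{\mathrm{split}} = \sum_{j=1}^{k} \frac{n_j}{n}\,\mathrm{Entropy}(j), \qquad \mathrm{Gain} = \mathrm{Entropy}(\mathrm{parent}) - \mathrm{Entropy}_{\mathrm{split}}$$

where p(i | t) is the fraction of records of class i at node t, and child node j receives n_j of the parent's n records.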
Entropy Examples for a Single Node
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = -0 log2(0) - 1 log2(1) = -0 - 0 = 0
- P(C1) = 1/6, P(C2) = 5/6
  Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
- P(C1) = 2/6, P(C2) = 4/6
  Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
(The short function below reproduces these values.)
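A small check, added here rather than taken from the slides; the function name entropy is my own choice.

```python
# Entropy of a node, given the class counts in that node.
from math import log2

def entropy(counts):
    n = sum(counts)
    # 0 * log2(0) is taken to be 0 by convention, hence the c > 0 filter.
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```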
3. Entropy: Calculating Information
- All three measures are consistent with each other; we will use entropy as the example.
- The less pure a node, the more bits its entropy amounts to and the less information the split has provided.
- With Outlook as the root:
  Info[2, 3] = 0.971 bits (Sunny)
  Info[4, 0] = 0.0 bits (Overcast)
  Info[3, 2] = 0.971 bits (Rainy)
  Total info = 5/14 × 0.971 + 4/14 × 0.0 + 5/14 × 0.971 = 0.693 bits
3. Selecting the Root Attribute
- Initial info = Info[9, 5] = 0.940 bits
- Gain(Outlook) = 0.940 - 0.693 = 0.247 bits
- Gain(Temperature) = 0.029 bits
- Gain(Humidity) = 0.152 bits
- Gain(Windy) = 0.048 bits
- Outlook gives the largest gain, so select Outlook as the root for splitting.
(A code sketch that reproduces these gains appears below.)
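The sketch below (added here) reproduces these gains; it reuses the weather list from the earlier sketch, and the tuple positions and helper names are the same assumptions made there.

```python
# Information gain for splitting the weather data on the attribute at tuple
# position idx: 0 = Outlook, 1 = Temp, 2 = Humidity, 3 = Windy; 4 holds Play.
from collections import Counter
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def class_counts(records):
    c = Counter(r[4] for r in records)
    return [c["Yes"], c["No"]]

def information_gain(records, idx):
    parent = entropy(class_counts(records))
    n = len(records)
    split_entropy = 0.0
    for value in {r[idx] for r in records}:
        subset = [r for r in records if r[idx] == value]
        split_entropy += len(subset) / n * entropy(class_counts(subset))
    return parent - split_entropy

for name, idx in [("Outlook", 0), ("Temp", 1), ("Humidity", 2), ("Windy", 3)]:
    print(name, round(information_gain(weather, idx), 3))
# Prints: Outlook 0.247, Temp 0.029, Humidity 0.152, Windy 0.048
```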
Resulting tree after choosing Outlook as the root and splitting the impure branches:
  Outlook = Sunny (2 Yes / 3 No)  → split on Humidity: High → No, Normal → Yes
  Outlook = Overcast (4 Yes)      → leaf: Yes
  Outlook = Rainy (3 Yes / 2 No)  → split on Wind: FALSE → Yes, TRUE → No
(The second slide repeats this figure and highlights contradicted training examples, i.e., records that agree on the tested attributes but disagree on the class, such as "Cool, normal → Yes" versus "Cool, normal → No" when Wind is ignored.)
3. Hunt's Algorithm
Many algorithms use a version of a "top-down" or "divide-and-conquer" approach known as Hunt's algorithm (page 152). Let D_t be the set of training records that reach a node t:
- If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled y_t.
- If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
An Example of Hunt's Algorithm
(Figure: the tree grows in stages on the cheat data.
 Step 1: a single leaf predicting Don't Cheat.
 Step 2: split on Refund: Yes → Don't Cheat, No → Don't Cheat.
 Step 3: the Refund = No branch is split on Marital Status: Married → Don't Cheat, Single/Divorced still impure.
 Step 4: the Single/Divorced branch is split on Taxable Income: < 80K → Don't Cheat, >= 80K → Cheat.)
How to Apply Hunt's Algorithm
- Usually it is done in a "greedy" fashion: the optimal split is chosen at each stage according to some criterion.
- This may not be optimal at the end, even for the same criterion.
- However, the greedy approach is computationally efficient, so it is popular.
Using the greedy approach we still have to decide three things:
1) What attribute test conditions to consider
2) What criterion to use to select the "best" split
3) When to stop splitting
- For #1 we will consider only binary splits for both numeric and categorical predictors, as discussed on the next slide.
- For #2 we will consider misclassification error, the Gini index, and entropy.
- #3 is a subtle business involving model selection. It is tricky because we don't want to overfit or underfit.
Misclassification Error
- Misclassification error is usually our final metric, which we want to minimize on the test set, so there is a logical argument for using it as the split criterion.
- It is simply the fraction of total cases misclassified.
- 1 - misclassification error = "accuracy" (page 149)
(Written as a formula below.)
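As a formula (notation added here, matching the entropy notation above), the error at a node t is:

$$\mathrm{Error}(t) = 1 - \max_i \, p(i \mid t)$$

where p(i | t) is the fraction of records of class i at node t; accuracy is 1 - Error.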
Gini Index
- Commonly used in many algorithms, such as CART and the rpart() function in R.
- After the Gini index is computed in each node, the overall value of the Gini index is computed as the weighted average of the Gini index in each node.
(Written as a formula below.)
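Written out (notation added here, following the standard definition):

$$\mathrm{Gini}(t) = 1 - \sum_i p(i \mid t)^2, \qquad \mathrm{Gini}_{\mathrm{split}} = \sum_{j=1}^{k} \frac{n_j}{n}\,\mathrm{Gini}(j)$$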
Gini Examples for a Single Node
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
Misclassification Error vs. Gini Index
(Figure: a parent node with 7 records of class C1 and 3 of class C2 is split by attribute test A? into Node N1 and Node N2.)
- Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
- Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.490
- Gini(children) = 3/10 × 0 + 7/10 × 0.490 = 0.343
The Gini index decreases from 0.42 to 0.343 while the misclassification error stays at 30%. This illustrates why we often want to use a surrogate loss function like the Gini index even if we really only care about misclassification.
(A short numerical check follows.)
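The comparison can be reproduced numerically; this check is added here, and the helper names gini and misclassification_error are my own.

```python
# Compare the Gini index and misclassification error for the split above.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def misclassification_error(counts):
    n = sum(counts)
    return 1 - max(counts) / n

parent = [7, 3]               # 7 records of class C1, 3 of class C2
children = [[3, 0], [4, 3]]   # node N1 and node N2 after the split on A
n = sum(parent)

def weighted(measure):
    # Weighted average of the measure over the child nodes.
    return sum(sum(c) / n * measure(c) for c in children)

print("Gini:  parent", gini(parent), "-> children", weighted(gini))
# approx. 0.42 -> 0.343
print("Error: parent", misclassification_error(parent), "-> children",
      weighted(misclassification_error))
# 0.3 -> 0.3
```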
5. Discretization of Numeric Data