COMP3740 CR32: Knowledge Management and Adaptive Systems Supervised ML to learn Classifiers: Decision Trees and Classification Rules Eric Atwell, School of Computing, University of Leeds (including re-use of teaching resources from other sources, esp. Knowledge Management by Stuart Roberts, School of Computing, University of Leeds)
Reminder: Objectives of data mining Data mining aims to find useful patterns in data. For this we need: Data mining techniques, algorithms, tools, eg WEKA A methodological framework to guide us, in collecting data and applying the best algorithms, CRISP-DM TODAY’S objective: learn how to learn classifiers Decision Trees and Classification Rules Supervised Machine Learning: training set has the “answer” (class) for each example (instance)
Reminder: Concepts that can be “learnt” The types of concepts we try to ‘learn’ include: Clusters or ‘Natural’ partitions; Eg we might cluster customers according to their shopping habits. Rules for classifying examples into pre-defined classes. Eg “Mature students studying information systems with high grade for General Studies A level are likely to get a 1st class degree” General Associations Eg “People who buy nappies are in general likely also to buy beer” Numerical prediction Eg Salary = a*A-level + b*Age + c*Gender + d*Prog + e*Degree (but are Gender, Programme really numbers???)
Output: decision tree Outlook sunny rainy Humidity Windy high normal true false Play = ‘no’ Play = ‘yes’ Play = ‘no’ Play = ‘yes’
Decision Tree Analysis Example instance set Can we predict, from the first 3 columns, the risk of getting a virus? For convenience later: F = ‘shares Files’, S = ‘uses Scanner’, I = ‘Infected before’ Decision trees attempt to predict the value in one column from the values in other columns. They use a ‘training set’ which from a data warehouse may be a randomly selected set of facts from a data cube with the various dimensions providing the other attributes.
Decision tree building method Forms a decision tree tries for a small tree covering all or most of the training set internal nodes represent a test on an attribute value branches represent outcome of the test Decides which attribute to test at each node this is based on a measure of ‘entropy’ Must avoid ‘over-fitting’ if the tree is complex enough it might describe the training set exactly, but be no good for prediction May leave some ‘exceptions’ The decision tree method uses more than one attribute to form the rule. An example tree for the virus risk data is on the next slide. We would like to use the smallest tree possible that describes the data, but the search space is in general too large to guarantee the smallest tree. So we use an entropy measure to decide which attribute to use first. Over-fitting is a problem that we haven't time to discuss in full here. The point is that, unless there are some inconsistencies in the data set, you may, with enough variables, be able to build a massively dividing tree that describes accurately the training set even though there are no real rules to abstract. When you try to apply these rules to the data at large they will be no use - they are peculiar to the training set. The method for avoiding overfitting uses the principle of the minimum description length. Essentially if the number of bits needed to describe the tree is close to the number of bits required to describe the whole training set, then the tree is not a useful abstraction.
Building a decision tree (DT) The algorithm is recursive, at any step: T = set of (remaining) training instances, {C1, …, Ck} = set of classes If all instances in T belong to a single class Ci, then DT is a leaf node identifying class Ci. (done!) …continued
Building a decision tree (DT) …continued If T contains instances belonging to mixed classes, then choose a test based on a single attribute that will partition T into subsets {T1, …, Tn} according to n outcomes of the test. The DT for T comprises a root node identifying the test and one branch for each outcome of the test. The branches are formed by applying the rules above recursively on each of the subsets {T1, …, Tn} .
Tree Building example Classes = {High, Medium, Low} T = Choose a test based on F, number of outcomes, n = 2 (Yes or No) F yes no T1 = T2 =
Tree Building example Classes = {High, Medium, Low} T1 = Choose a test based on I, number of outcomes, n = 2 (Yes or No) I yes no F T4 = T3 =
Tree Building example Classes = {High, Medium, Low} T1 = Choose a test based on I, number of outcomes, n = 2 (Yes or No) I yes no F T3 = Risk = ‘High’
Tree Building example T3 = Classes = {High, Medium, Low} Choose a test based on S, number of outcomes, n = 2 (Yes or No) F no yes I no yes Risk = ‘High’ S no yes
Tree Building example T3 = Classes = {High, Medium, Low} Choose a test based on S, number of outcomes, n = 2 (Yes or No) F no yes I no yes Risk = ‘High’ S no yes Risk = ‘High’ Risk = ‘Low’
Tree Building example T2 = Classes = {High, Medium, Low} Choose a test based on S, number of outcomes, n = 2 (Yes or No) F no yes I no yes S yes no Risk = ‘High’ S no yes Risk = ‘Low’ Risk = ‘High’
Tree Building example T2 = Classes = {High, Medium, Low} Choose a test based on S, number of outcomes, n = 2 (Yes or No) F no yes I no yes S yes no Risk = ‘High’ S no yes Risk = ‘Low’ Risk = ‘Low’ Risk = ‘High’ Risk = ‘Medium’
Example Decision Tree Shares files? no yes Infected before? Uses scanner? yes no no yes medium low Uses scanner? high no Yes Here is the tree corresponding to the table earlier. Why have we started with the ‘Shares Files?’ attribute at the top of the tree? - this is what the entropy measure will tell us. high low
Which attribute to test? The ROOT could be S or I instead of F – leading to a different Decision Tree Best DT is the “smallest”, most concise model The search space in general is too large to find the smallest tree by exhaustive searching (try them all). Instead we look for the attribute which splits the training set into the most homogeneous sets. The measure used for ‘homogeneity’ is based on entropy. Intuitively we can guess that an attribute that splits the training set into very heterogeneous groups (ie an even mix of high risks and low risks) is not much use in forming a rule that predicts high risk. The next few slides look at how the various attributes fare.
Tree Building example (modified) Classes = {Yes, No} T = Choose a test based on F, number of outcomes, n = 2 (Yes or No) F yes no
Tree Building example (modified) Classes = {Yes, No} T = Choose a test based on F, number of outcomes, n = 2 (Yes or No) F yes no High Risk = ‘yes’ 5, 1 High Risk = ‘no’ 2, 0
Tree Building example (modified) Classes = {Yes, No} T = Choose a test based on S, number of outcomes, n = 2 (Yes or No) S yes no
Tree Building example (modified) Classes = {Yes, No} T = Choose a test based on S, number of outcomes, n = 2 (Yes or No) S yes no High Risk = ‘no’ 4,2 High Risk = ‘yes’ 3,1
Tree Building example (modified) Classes = {Yes, No} T = Choose a test based on I, number of outcomes, n = 2 (Yes or No) I yes no
Tree Building example (modified) Classes = {Yes, No} T = Choose a test based on I, number of outcomes, n = 2 (Yes or No) I yes no High Risk = ‘no’ 3, 1 High Risk = ‘yes’ 4,1
Decision tree building algorithm For each decision point, If all remaining examples are all +ve or all -ve, stop. Else if there are some +ve and some -ve examples left and some attributes left, pick the remaining attribute with largest information gain Else if there are no examples left, no such example has been observed; return default Else if there are no attributes left, examples with the same description have different classifications: noise or insufficient attributes or nondeterministic domain This shows the steps in the decision tree algorithm. There are a few details I haven't discussed to do with what happens if we run out of examples before running out of attributes to test or vice versa. Also I haven't included here how to avoid over-fitting.
Evaluation of decision trees At the leaf nodes two numbers are given: N: the coverage for that node: how many instances E: the error rate: how many wrongly classified instances The whole tree can be evaluated in terms of its size (number of nodes) and overall error-rate expressed in terms of the number and percentage of cases wrongly classified. We seek small trees that have low error rates.
Evaluation of decision trees The error rate for the whole tree can also be displayed in terms of a confusion matrix: (A) (B) (C) Classified as 35 2 1 Class (A) = high 4 41 5 Class (B) = medium 68 Class (C) = low
Evaluation of decision trees The error rates mentioned on previous slides are normally computed using The training set of instances. A test set of instances – some different examples! If the decision tree algorithm has ‘over-fitted’ the data, then the error rate based on the training set will be far less than that based on the test set.
Evaluation of decision trees 10-fold cross-validation can be used when the training set is limited in size: Divide the test set randomly into 10 subsets. Build a tree from 9 of the subsets and test using the 10th. Repeat the experiment 9 more times, using a different test set each time. Overall error rate is average of 10 experiments 10-fold cross-validation will lead to up to 10 different decision trees being built. The method for selecting or constructing the best tree is not clear.
From decision trees to rules Decision trees may not be easy to interpret: tests associated with lower nodes have to be read in the context of tests further up the tree ‘sub-concepts’ may sometimes be split up and distributed to different parts of the tree (see next slide) Computer Scientists may prefer “if … then …” rules!
DT for “F = G = 1 or J = K = 1” J=K=1 is split across F= 0; two subtrees. F= 0; J = 0; no J = 1; K = 0; no K = 1; yes F = 1; G = 1; yes G = 0; F J K G 1 yes no
Converting DT to rules Step 1: Every path from root to leaf represents a rule: F= 0; J = 0; no J = 1; K = 0; no K = 1; yes F = 1; G = 1; yes G = 0; If F = 0 and J = 0 then class no; If F = 0 and J = 1 and K = 0 then class no If F = 0 and J = 1 and K = 1 then class yes …. If F = 1 and G = 0 and J = 1 and K = 1 then class yes
Generalising rules If F = 0 and J = 1 and K = 1 then class yes If F = 1 and G = 0 and J = 1 and K = 1 then class yes If G = 1 then class yes If J =1 and K = 1 then class yes
Tidying up rule sets Generalisation leads to 2 problems: Rules no longer mutually exclusive Order rules and use the first matching rule used as the operative rule. Ordering is based on how many false positive errors the rule makes Rule set no longer exhaustive Choose a default value for the class when no rule applies Default class is that which contains the most training cases not covered by any rule.
Decision Tree - Revision Decision tree builder algorithm discovers rules for classifying instances. At each step, it needs to decide which attribute to test at that point in the tree; a measure of ‘information gain’ can be used. The output is a decision tree based on the ‘training’ instances, evaluated with separate “test” instances. Leaf nodes which have a small coverage may be pruned if the error rate is small for the pruned tree.
Pruning example (from W & F) Health plan contribution none half full 4 bad 2 good 1 bad 1 good 4 bad 2 good number of errors We replace the subtree with: Bad 14, 5 Number of instances
Decision trees v classification rules Decision trees can be used for prediction or interpretation. Prediction: compare an unclassified instance against the tree and predict what class it is in (with error estimate) Interpretation: examine tree and try to understand why instances end up in the class they are in. Rule sets are often better for interpretation. ‘Small’, accurate rules can be examined, even if overall accuracy of the rule set is poor.
Self Check You should be able to: Describe how the decision-trees are built from a set of instances. Build a decision tree based on a given attribute Explain what the ‘training’ and ‘test’ sets are for. Explain what “Supervised” means, and why classification is an example of supervised ML