1 The Restaurant Domain
Will they wait, or not?
2 Decision Trees
[Figure: decision tree for the restaurant domain. Root test: Patrons? (none / some / full). Other tests: WaitEst?, Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, Raining?. Leaves: Yes / No.]
3 Inducing Decision Trees
• Start at the root with all examples.
• If there are both positive and negative examples, choose an attribute to split them.
• If all remaining examples are positive (or negative), label with Yes (or No).
• If no example exists, determine label according to majority in parent.
• If no attributes left, but you still have both positive and negative examples, you have a problem...
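The steps above can be sketched in Python. This is a minimal sketch, not the lecture's own code: the names `induce_tree`, `choose_attribute`, and `majority`, the tuple/dict tree format, and the toy examples are all assumptions made for illustration. `choose_attribute` simply takes the first remaining attribute as a placeholder for a real selection heuristic, and the "no attributes left" case falls back to a majority vote, which is one common choice and not something the slide prescribes.

```python
from collections import Counter

def majority(examples, default="No"):
    """Most common label among (features, label) pairs."""
    if not examples:
        return default
    return Counter(label for _, label in examples).most_common(1)[0][0]

def choose_attribute(attributes, examples):
    # Placeholder heuristic (assumption): take the first remaining attribute.
    return attributes[0]

def induce_tree(examples, attributes, domains, parent_majority="No"):
    """Return a leaf label or an (attribute, {value: subtree}) pair."""
    if not examples:                      # no example exists: use parent's majority
        return parent_majority
    labels = {label for _, label in examples}
    if len(labels) == 1:                  # all positive or all negative
        return labels.pop()
    if not attributes:                    # mixed labels, nothing left to split on
        return majority(examples)         # assumption: fall back to majority vote
    attr = choose_attribute(attributes, examples)
    rest = [a for a in attributes if a != attr]
    here = majority(examples)
    branches = {}
    for value in domains[attr]:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        branches[value] = induce_tree(subset, rest, domains, here)
    return (attr, branches)

# Hypothetical toy data, not the twelve restaurant examples.
domains = {"Patrons": ["none", "some", "full"], "Hungry": ["no", "yes"]}
examples = [
    ({"Patrons": "none", "Hungry": "no"}, "No"),
    ({"Patrons": "some", "Hungry": "yes"}, "Yes"),
    ({"Patrons": "full", "Hungry": "yes"}, "Yes"),
    ({"Patrons": "full", "Hungry": "no"}, "No"),
]
tree = induce_tree(examples, ["Patrons", "Hungry"], domains)
```

On the toy data the `none` and `some` branches become leaves immediately, while the mixed `full` branch is split again on `Hungry`.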
4 Inducing decision trees
All examples: + X1, X3, X4, X6, X8, X12; - X2, X5, X7, X9, X10, X11
Split on Patrons?
  none: + (none); - X7, X11
  some: + X1, X3, X6, X8; - (none)
  full: + X4, X12; - X2, X5, X9, X10
Split on Type?
  French: + X1; - X5
  Italian: + X6; - X10
  Thai: + X4, X8; - X2, X11
  Burger: + X3, X12; - X7, X9
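5 Continuing Induction
All examples: + X1, X3, X4, X6, X8, X12; - X2, X5, X7, X9, X10, X11
Split on Patrons?
  none: + (none); - X7, X11 (leaf: No)
  some: + X1, X3, X6, X8; - (none) (leaf: Yes)
  full: + X4, X12; - X2, X5, X9, X10 (still mixed, so split again)
Split the full branch on Hungry?
  yes: + X4, X12; - X2, X10
  no: + (none); - X5, X9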
6 Final Decision Tree
[Figure: the induced tree. Root test: Patrons? (none / some / full). Further tests: Hungry? (no / yes), Type? (French / Italian / Thai / Burger), Fri/Sat? (no / yes). Leaves: Yes / No.]
7 Decision Trees: summary
• Finding the optimal decision tree is computationally intractable.
• We use heuristics:
  - Choosing the right attribute is the key. The choice is based on the information content that the attribute provides.
• Decision trees represent DNF Boolean formulas.
• They work well in practice.
• What to do with noise? Continuous attributes? Attributes with large domains?
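The DNF point can be made concrete: each root-to-leaf path ending in Yes is a conjunction of attribute tests, and the tree as a whole is the disjunction of those paths. The sketch below assumes a hypothetical tree encoding (a leaf is the string "Yes" or "No", an internal node is an `(attribute, {value: subtree})` pair) and a small made-up tree; neither comes from the slides.

```python
def yes_paths(tree, path=()):
    """Yield every root-to-leaf path that ends in a 'Yes' leaf."""
    if isinstance(tree, str):             # leaf
        if tree == "Yes":
            yield list(path)
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from yes_paths(subtree, path + ((attribute, value),))

def to_dnf_string(tree):
    """Render the tree as an OR of ANDs, one AND per 'Yes' path."""
    terms = [" AND ".join(f"{a}={v}" for a, v in p) for p in yes_paths(tree)]
    return " OR ".join(f"({t})" for t in terms)

# Hypothetical two-level tree used only for illustration.
toy_tree = ("Patrons", {
    "none": "No",
    "some": "Yes",
    "full": ("Hungry", {"no": "No", "yes": "Yes"}),
})
formula = to_dnf_string(toy_tree)
# formula == "(Patrons=some) OR (Patrons=full AND Hungry=yes)"
```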
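8 Choosing an Attribute: Disorder vs. Homogeneity
[Figure: two candidate splits, one labeled Bad (disordered subsets) and one labeled Good (homogeneous subsets).]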
9 The Value of Information
• "If you control the mail, you control information." (Seinfeld)
• Information theory enables us to quantify the discriminating value of an attribute.
• "It will rain in Seattle tomorrow." (Boring)
• "We’ll have an earthquake tomorrow." (OK, I’m listening)
• The value of a piece of information is inversely related to its probability: the less likely the event, the more we learn from hearing that it happened.
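10 Information Theory
• We quantify the value of knowing E as $-\log_2 \mathrm{Prob}(E)$.
• If $E_1, \dots, E_n$ are the possible outcomes of an event, then the value of knowing the outcome is:
  $P(\mathrm{Prob}(E_1), \dots, \mathrm{Prob}(E_n)) = \sum_{i=1}^{n} -\mathrm{Prob}(E_i) \log_2 \mathrm{Prob}(E_i)$
• Examples:
  - $P(1/2, 1/2) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$
  - $P(0.99, 0.01) \approx 0.08$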
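The two examples on this slide can be checked numerically. A minimal sketch, assuming the standard expected-information (entropy) formula with base-2 logarithms; the function name `information` is chosen here for illustration.

```python
import math

def information(probabilities):
    """Expected information, in bits, of learning which outcome occurred."""
    # An outcome with probability 0 contributes nothing
    # (the limit of p * log p as p goes to 0 is 0).
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

fair = information([0.5, 0.5])      # a fair coin: exactly 1 bit
skewed = information([0.99, 0.01])  # a near-certain event: about 0.08 bits
```

A near-certain outcome carries very little information on average, which matches the "rain in Seattle" intuition.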
11 Why Should We Care?
• Suppose we have p positive examples and n negative ones.
• If I classify an example for you as positive or negative, then I’m giving you information:
  $P\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$
• Now let’s calculate the information you would still need after I gave you the value of the attribute A.
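12 The Value of an Attribute
• Suppose the attribute A can take on v values (v, since n already counts the negative examples).
• For $A = val_i$, there would still be $p_i$ positive examples and $n_i$ negative examples.
• The probability of $A = val_i$ is $(p_i + n_i)/(p + n)$.
• Hence, after I tell you the value of A, you need the following amount of information to classify an example:
  $\mathrm{Remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, P\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)$
  where $P(\cdot, \cdot)$ is the information measure from slide 10.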
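The weighted sum described on this slide can be sketched as follows. The function names and the representation of a split as a list of `(p_i, n_i)` count pairs are assumptions for illustration, and the example attribute is hypothetical.

```python
import math

def information(p, n):
    """Bits needed to classify an example drawn from p positives and n negatives."""
    total = p + n
    # A count of zero contributes nothing (limit of x * log x as x goes to 0).
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c > 0)

def remainder(splits):
    """Information still needed after learning the attribute's value.

    splits holds one (p_i, n_i) pair of counts per value of the attribute.
    """
    total = sum(p_i + n_i for p_i, n_i in splits)
    return sum(
        (p_i + n_i) / total * information(p_i, n_i)
        for p_i, n_i in splits
        if p_i + n_i > 0  # a value with no examples has weight zero
    )

# Hypothetical attribute with two values: one mixed branch, one pure branch.
example = remainder([(2, 2), (4, 0)])  # 0.5 * 1 bit + 0.5 * 0 bits = 0.5
```

Pure branches contribute nothing to the remainder, so an attribute that produces many pure branches leaves little left to learn.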
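13 The Value of an Attribute (cont.)
• The value of an attribute is the difference between the amount of information needed to classify before and after, i.e.,
  - Initial - Remainder.
• Patrons:
  none: + (none); - X7, X11
  some: + X1, X3, X6, X8; - (none)
  full: + X4, X12; - X2, X5, X9, X10
• $\mathrm{Remainder}(Patrons) = \frac{2}{12} P(0, 1) + \frac{4}{12} P(1, 0) + \frac{6}{12} P\left(\frac{2}{6}, \frac{4}{6}\right) \approx 0.459$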
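To compare attributes, the value (Initial - Remainder) can be computed from the per-branch counts. The sketch below uses the Patrons counts shown on this slide and the Type counts from the split on slide 4; the function names `bits` and `gain` are assumptions for illustration.

```python
import math

def bits(p, n):
    """Information needed to classify one example from p positives, n negatives."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c > 0)

def gain(splits):
    """Initial information minus the remainder after splitting.

    splits holds one (positives, negatives) count pair per attribute value.
    """
    p = sum(pi for pi, _ in splits)
    n = sum(ni for _, ni in splits)
    remainder = sum(
        (pi + ni) / (p + n) * bits(pi, ni) for pi, ni in splits if pi + ni > 0
    )
    return bits(p, n) - remainder

# (positives, negatives) per branch, read off the splits in the slides.
patrons = [(0, 2), (4, 0), (2, 4)]                 # none, some, full
restaurant_type = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

gain_patrons = gain(patrons)       # about 0.541 bits
gain_type = gain(restaurant_type)  # 0 bits: every branch is as mixed as the whole set
```

Under these counts Patrons is clearly the better first split, while Type tells us nothing.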