COMP 2208 Dr. Long Tran-Thanh University of Southampton Decision Trees
Classification [Diagram: agent loop linking Environment, Perception, Categorize inputs, Update belief model, Update decision making policy, Decision making, and Behaviour]
Recognizing the type of situation you are in right now is a basic agent task: classification. Robotics: mistaking a human body for a part of a car on the assembly line would be disastrous. Military: friend or foe? Electronic card usage: was it fraud or not?
Last lecture: neural networks. Why more classification methods? Neural networks are very powerful in theory, and deep learning is a promising direction, but it is still difficult to fully control the technology. In many cases, other techniques are more efficient. Occam’s razor: the simpler the model, the better the performance; go for something more complicated only if it’s really necessary. In many real-world problems, data cleaning is the most important step; after that, a simple classification method will do the job.
Classification algorithm. Bottom up: inspiration from biology, e.g., neural networks. Top down: inspiration from higher abstraction levels.
Prof or hobo 1?
Prof or hobo 2?
Prof or hobo 3?
Prof or hobo answers Hobo Professor
Back to classification. Classification algorithm. Different ways to go: Honey? Fired? Evil plan?
Back to classification. Some classification algorithms: logistic regression; support vector machines (SVMs); decision trees and their family; … Easy to understand. (Relatively) easy to implement. Very efficient in many cases.
Decision making process Did it go well? Yes No
Back to the “Prof or hobo” quiz. What are the clues that allow you to distinguish a prof from a hobo? The clothes people are wearing; their eyes; the beard; … Main idea: check some properties in some order.
Classification with decision trees A decision tree takes a series of inputs defining a situation, and outputs a binary decision/classification. A decision tree spells out an order for checking the properties (attributes) of the situation until we have enough information to decide what's going on. We use the observable attributes to predict the outcome (or some important hidden or unknown quantity). Question: what is the optimal (efficient) order of the attributes?
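To make "checking attributes in some order" concrete, here is a minimal sketch of a hand-written decision tree for the quiz. The attribute names `clothes` and `beard` come from the clues on the previous slide; their values (`tidy`, `scruffy`, `trimmed`, `wild`) and the shape of the tree are invented purely for illustration.

```python
# Internal node: {"attribute": name, "branches": {value: subtree}}; leaf: a class label.
# The attribute values and the tree shape are invented for illustration.
TREE = {
    "attribute": "clothes",
    "branches": {
        "tidy": "professor",
        "scruffy": {
            "attribute": "beard",
            "branches": {"trimmed": "professor", "wild": "hobo"},
        },
    },
}


def classify(tree, example):
    """Walk down from the root, checking one attribute per step, until a leaf."""
    while isinstance(tree, dict):
        value = example[tree["attribute"]]
        tree = tree["branches"][value]
    return tree


print(classify(TREE, {"clothes": "scruffy", "beard": "wild"}))  # hobo
print(classify(TREE, {"clothes": "tidy", "beard": "wild"}))     # professor
```

Note that the second example never looks at the beard: once the clothes settle the question, the remaining attributes are not checked.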
The importance of the ordering. Think about the “20 questions” game: inefficient questions will lead to low performance. Think about binary search: the optimal strategy is to always halve the interval. Decision trees are very simple to produce if we already know the underlying rules. But what if we don’t have the rules, just past examples (experience)?
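A small sketch of why the ordering matters, under the simplifying assumption that every candidate is equally likely and each question has a yes/no answer. The two function names are illustrative only.

```python
def questions_halving(n):
    """Worst-case number of yes/no questions when each question halves the candidates."""
    count = 0
    while n > 1:
        n = (n + 1) // 2  # worst case: the larger half remains
        count += 1
    return count


def questions_one_by_one(n):
    """Worst-case number of questions when each one rules out a single candidate."""
    return n - 1


print(questions_halving(2 ** 20))     # 20
print(questions_one_by_one(2 ** 20))  # 1048575
```

With halving questions, 20 questions are enough to single out one of about a million candidates; asking about candidates one at a time can take about a million questions.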
Our objective: often we don't know in advance how to classify things, and want our agent to learn from examples.
Which attribute to start with? The order of attributes is still very important Idea: choose the next attribute whose value can reduce the uncertainty about the outcome of the classification the most What does it mean when we say that something reduces the uncertainty in our knowledge? Reducing uncertainty (in knowledge) = increase (known) information So we should choose the attribute that provides the highest information gain
Entropy How to measure information gain (and how to define it)? Answer: borrow similar concepts from information & coding theory Entropy (Shannon, 1948): A measure of the amount of disorder or uncertainty in a system. A tidy room has low entropy: You can be reasonably certain your keys are on the hook you made for them. A messy room has high entropy: things are all over the place and your keys could be absolutely anywhere.
Entropy: uncertainty about the outcome. Classification: input X, output Y. Entropy (Shannon, 1948): H(Y) = -sum over y of P(Y = y) * log2 P(Y = y), where P(Y = y) is how often Y = y, and -log2 P(Y = y) is the measure of information (surprise) when Y = y (in bits).
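A minimal sketch of the entropy computation, assuming the standard Shannon definition H(Y) = sum over y of P(Y = y) * log2(1 / P(Y = y)), which is the same as -sum P log2 P. The function name `entropy` is illustrative; outcomes with probability 0 are skipped, following the usual convention that they contribute 0.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    # log2(1 / p) is the "surprise" of an outcome with probability p.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


print(round(entropy([1 / 3, 1 / 3, 1 / 3]), 2))  # 1.58
print(entropy([0.0, 0.0, 1.0]))                  # 0.0
```

Three equally likely outcomes give the maximum of log2(3), about 1.58 bits, while a certain outcome gives 0 bits.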
Entropy example
Weather:
              Good    OK      Terrible
Birmingham    0.33    0.33    0.33
Southampton
Glasgow       0       0       1
Entropy example: Birmingham. [Table: P(x), log P(x) and -P(x) log P(x) for Good, OK and Terrible] Sum = 1.58 (bits)
Entropy example: Southampton. [Table: P(x), log P(x) and -P(x) log P(x) for Good, OK and Terrible] Sum = 1.29 (bits)
Entropy example: Glasgow
            P(x)    log P(x)     -P(x) log P(x)
Good        0       -infinity    0
OK          0       -infinity    0
Terrible    1       0            0
Sum = 0 (bits)
When we are certain, the entropy is 0.
Conditional entropy. Classification: input X, output Y. Entropy measures the uncertainty of a given state of the system. How to measure the change? Conditional entropy: H(Y | X) = -sum over x, y of P(X = x, Y = y) * log2 P(Y = y | X = x), where P(X = x, Y = y) is the joint probability and P(Y = y | X = x) is the conditional probability. It tells us how much uncertainty would remain about the outcome Y if we knew (for instance) the outcome of attribute X.
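A minimal sketch of conditional entropy computed from a table of counts, assuming the standard definition H(Y | X) = sum over x, y of P(X = x, Y = y) * log2(1 / P(Y = y | X = x)). The function name, the table layout `{x: {y: count}}` and the toy counts are all hypothetical.

```python
import math


def conditional_entropy(joint_counts):
    """H(Y | X) in bits from a table {x: {y: count}}."""
    total = sum(sum(row.values()) for row in joint_counts.values())
    h = 0.0
    for row in joint_counts.values():
        n_x = sum(row.values())
        for count in row.values():
            if count > 0:
                p_xy = count / total       # joint probability P(X = x, Y = y)
                p_y_given_x = count / n_x  # conditional probability P(Y = y | X = x)
                h += p_xy * math.log2(1 / p_y_given_x)
    return h


# Toy counts: when X = "a" the outcome is a coin flip, when X = "b" it is certain.
toy = {"a": {"yes": 2, "no": 2}, "b": {"yes": 4, "no": 0}}
print(conditional_entropy(toy))  # 0.5
```

Half of the records fall under "a", where one bit of uncertainty remains, and half under "b", where none remains, so the average remaining uncertainty is 0.5 bits.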
Information gain: G(Y, X) = H(Y) - H(Y | X), where H(Y) is the current level of uncertainty (entropy) and H(Y | X) is the possible new level of uncertainty (conditional entropy). The difference represents how much uncertainty would decrease.
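A minimal sketch of information gain, G(Y, X) = H(Y) - H(Y | X), estimated from paired lists of attribute values and class labels. The helper names and the toy data are hypothetical; probabilities are estimated as relative frequencies.

```python
import math
from collections import Counter


def entropy_of(labels):
    """Entropy in bits of the label frequencies in a list."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """G(Y, X) = H(Y) - H(Y | X) for paired lists of X values and Y labels."""
    n = len(labels)
    remainder = 0.0  # H(Y | X): weighted entropy of each subset
    for x in set(attribute_values):
        subset = [y for a, y in zip(attribute_values, labels) if a == x]
        remainder += len(subset) / n * entropy_of(subset)
    return entropy_of(labels) - remainder


labels = ["yes", "yes", "no", "no"]
print(information_gain(["a", "a", "b", "b"], labels))  # 1.0
print(information_gain(["a", "b", "a", "b"], labels))  # 0.0
```

The first attribute predicts the label perfectly and removes the full bit of uncertainty; the second tells us nothing, so its gain is 0.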
Building a decision tree Split the tree on the attribute with the highest information gain. Then repeat. Stopping Conditions: Don't split if all matching records have same output value (no point, we know what happens!). Don't split if all matching records have same attribute values (no point, we can't distinguish them). Recursive algorithm:
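The recursive procedure can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes categorical attributes, represents records as dictionaries, and returns the majority label at a leaf (the majority rule is an assumption, since the slide does not say what a leaf should predict). The attribute `length` and the six toy records are hypothetical.

```python
import math
from collections import Counter


def entropy_of(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def gain(rows, labels, attr):
    """Information gain of splitting the records on one attribute."""
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy_of(subset)
    return entropy_of(labels) - remainder


def build_tree(rows, labels, attrs):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: all matching records have the same output value.
    if len(set(labels)) == 1:
        return majority
    # Stop: no remaining attribute can distinguish the records.
    usable = [a for a in attrs if len({r[a] for r in rows}) > 1]
    if not usable:
        return majority
    # Split on the attribute with the highest information gain, then recurse.
    best = max(usable, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in usable if a != best]
    branches = {}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        branches[value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return {"attribute": best, "branches": branches}


# Hypothetical toy records: (thread, length) -> action.
data = [
    ("new_thread", "short", "reads"), ("new_thread", "long", "reads"),
    ("new_thread", "long", "reads"), ("follow_up", "short", "reads"),
    ("follow_up", "long", "skips"), ("follow_up", "long", "skips"),
]
rows = [{"thread": t, "length": l} for t, l, _ in data]
labels = [y for _, _, y in data]
tree = build_tree(rows, labels, ["thread", "length"])
print(tree)
```

On this toy data the root splits on `thread`; the `new_thread` branch is already pure, and the `follow_up` branch is split once more on `length`.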
Example: Predicting the importance of emails. Objective: predict whether the user will read the email.
Example: Predicting the importance of emails
18 emails: 9 read, 9 skipped
“Thread” attribute:
              Reads      Skips      Row total
new_thread    7 (70%)    3 (30%)    10
follow_up     2 (25%)    6 (75%)    8
What is the information gain if we choose “Thread”? Calculation steps: (1) Calculate H(Read). (2) Calculate H(Read | Thread). (3) Calculate G(Read, Thread) = H(Read) – H(Read | Thread).
Example: Predicting the importance of emails. Calculating H(Read): 18 emails, 9 read and 9 skipped, so P(Read = True) = P(Read = False) = 0.5 and H(Read) = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 (bit).
Example: Predicting the importance of emails. Calculating H(Read | Thread) from the specific conditional entropies. Calculation steps: (1) Calculate H(Read | Thread = new). (2) Calculate H(Read | Thread = follow_up). (3) Calculate H(Read | Thread) = p(new)*H(Read | Thread = new) + p(follow_up)*H(Read | Thread = follow_up).
Example: Predicting the importance of emails
              Reads      Skips      Row total
new_thread    7 (70%)    3 (30%)    10
follow_up     2 (25%)    6 (75%)    8
P(Read = True | new) = 0.7; P(Read = False | new) = 0.3; H(Read | new) = 0.88
P(Read = True | follow_up) = 0.25; P(Read = False | follow_up) = 0.75; H(Read | follow_up) = 0.81
H(Read | Thread) = 10/18 * 0.88 + 8/18 * 0.81 = 0.85
Example: Predicting the importance of emails. Calculating G(Read, Thread): G(Read, Thread) = H(Read) – H(Read | Thread) = 1 – 0.85 = 0.15
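The worked example can be checked numerically. This sketch uses only the counts from the “Thread” table (7 reads and 3 skips for new_thread, 2 reads and 6 skips for follow_up); the variable and function names are illustrative.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no outcome with P(yes) = p."""
    return sum(q * math.log2(1 / q) for q in (p, 1 - p) if q > 0)


counts = {"new_thread": (7, 3), "follow_up": (2, 6)}  # (reads, skips)
total = sum(r + s for r, s in counts.values())
reads = sum(r for r, _ in counts.values())

h_read = binary_entropy(reads / total)
# Weighted average of the specific conditional entropies.
h_read_given_thread = sum(
    (r + s) / total * binary_entropy(r / (r + s)) for r, s in counts.values()
)
gain = h_read - h_read_given_thread

print(round(h_read, 2), round(h_read_given_thread, 2), round(gain, 2))
# 1.0 0.85 0.15
```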
Advantages of decision trees. Decision trees are able to generate understandable (i.e., human-readable) rules. Once learned, decision trees perform classification very efficiently. Decision trees are able to handle continuous as well as categorical variables: for a continuous variable, you choose the split threshold based on information gain.
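A minimal sketch of choosing a threshold for a continuous variable by information gain. It assumes a binary split of the form "value <= threshold" and tries the midpoints between consecutive sorted values as candidates; that candidate set is a common convention, not something stated on the slide. The function names and the toy data (a hypothetical email length in lines) are illustrative.

```python
import math
from collections import Counter


def entropy_of(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split 'value <= threshold' with the highest gain."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy_of(labels)
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        remainder = len(left) / n * entropy_of(left) + len(right) / n * entropy_of(right)
        g = base - remainder
        if g > best[1]:
            best = ((lo + hi) / 2, g)
    return best


lengths = [10, 20, 30, 40]  # hypothetical email lengths in lines
actions = ["reads", "reads", "skips", "skips"]
print(best_threshold(lengths, actions))  # (25.0, 1.0)
```

Once the threshold is chosen, the continuous variable behaves like a two-valued categorical attribute and can be compared with the other attributes by its gain.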