Machine Learning: Inductive Learning and Decision Trees (CSE 473)
Learning. Learning is essential for unknown environments, i.e., when the designer lacks omniscience. Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down. Learning modifies the agent's decision mechanisms to improve performance. © D. Weld and D. Fox
Feedback in Learning. Supervised learning: correct answers are given for each example. Unsupervised learning: correct answers are not given. Reinforcement learning: occasional rewards. © D. Weld and D. Fox
Reinforcement Learning. Same setting as an MDP, but the model is unknown: the agent has to learn how the world works (transition model, reward structure). The goal is to learn an optimal policy by interacting with the environment. © D. Weld and D. Fox
Inductive Learning. Simplest form: learn a function from examples, where f is the target function and an example is a pair (x, f(x)). Problem: find a hypothesis h such that h ≈ f, given a training set of examples. (This is a highly simplified model of real learning: it ignores prior knowledge and assumes the examples are given.) © D. Weld and D. Fox
Inductive Learning Method. Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting. © D. Weld and D. Fox
Inductive Learning Method. Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting. Ockham's razor: prefer the simplest hypothesis consistent with the data. © D. Weld and D. Fox
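As a minimal illustration of Ockham's razor in curve fitting (my sketch, not from the slides), the following Python snippet fits polynomials of increasing degree to a handful of (x, f(x)) pairs and keeps the lowest-degree hypothesis whose training error falls below a tolerance, i.e., the simplest h that is (approximately) consistent with the data. The data points and the tolerance are made up for illustration.

    import numpy as np

    # Toy training set of (x, f(x)) pairs -- purely illustrative values.
    xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    ys = np.array([0.1, 0.9, 4.2, 8.8, 16.1, 24.9])   # roughly quadratic

    def fit_simplest(xs, ys, max_degree=5, tol=0.5):
        """Ockham's razor: return the lowest-degree polynomial whose
        mean squared error on the training set is below tol."""
        for degree in range(max_degree + 1):
            coeffs = np.polyfit(xs, ys, degree)               # least-squares fit
            mse = np.mean((np.polyval(coeffs, xs) - ys) ** 2)
            if mse < tol:
                return degree, coeffs
        return max_degree, coeffs                             # fall back to the most complex fit

    degree, _ = fit_simplest(xs, ys)
    print("simplest (approximately) consistent hypothesis: degree", degree)   # degree 2 here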
Decision Trees. Input: a description of an object or situation through a set of attributes. Output: a decision, i.e., the predicted output value for the input. Both input and output can be discrete or continuous. Discrete-valued target functions lead to classification problems; learning a continuous function is called regression. © D. Weld and D. Fox
Experience: “Good day for tennis”

    Day  Outlook  Temp  Humid  Wind  PlayTennis?
    d1   s        h     h      w     n
    d2   s        h     h      s     n
    d3   o        h     h      w     y
    d4   r        m     h      w     y
    d5   r        c     n      w     y
    d6   r        c     n      s     y
    d7   o        c     n      s     y
    d8   s        m     h      w     n
    d9   s        c     n      w     y
    d10  r        m     n      w     y
    d11  s        m     n      s     y
    d12  o        m     h      s     y
    d13  o        h     n      w     y
    d14  r        m     h      s     n

(Outlook: s = sunny, o = overcast, r = rain. Temp: h = hot, m = mild, c = cool. Humid: h = high, n = normal. Wind: w = weak, s = strong.)
© D. Weld and D. Fox
Decision Tree Representation. “Good day for tennis?” Leaves = classification; arcs = choice of value for the parent attribute. [Tree: Outlook? Sunny → Humidity? (High → No, Normal → Yes); Overcast → Yes; Rain → Wind? (Strong → No, Weak → Yes).] A decision tree is equivalent to logic in disjunctive normal form: GoodDay ⇔ (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak). © D. Weld and D. Fox
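To make the tree-as-DNF correspondence concrete, here is a tiny sketch (mine, not the authors') that encodes the tree above directly as the disjunction (Sunny ∧ Normal) ∨ Overcast ∨ (Rain ∧ Weak); the function name and string values are my own choices.

    def good_day_for_tennis(outlook: str, humidity: str, wind: str) -> bool:
        """The decision tree above written as DNF:
        (Sunny AND Normal) OR Overcast OR (Rain AND Weak)."""
        return ((outlook == "sunny" and humidity == "normal")
                or outlook == "overcast"
                or (outlook == "rain" and wind == "weak"))

    # Example d10 from the table: rain, normal humidity, weak wind -> a good day.
    print(good_day_for_tennis("rain", "normal", "weak"))   # True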
DT Learning as Search. Nodes: decision trees. Operators: tree refinement (sprouting the tree). Initial node: the smallest tree possible, a single leaf. Heuristic: information gain. Goal: the best tree possible (???). © D. Weld and D. Fox
What is the Simplest Tree? (Data as in the table above.) The simplest tree is a single leaf that always predicts the majority class, “yes”. How good is it? [10+, 4-] means: correct on 10 examples, incorrect on 4 examples (accuracy 10/14 ≈ 0.71). © D. Weld and D. Fox
Which attribute should we use to split? The successors of this single-leaf tree (“Yes”) each split on one attribute: Humid, Wind, Outlook, or Temp. © D. Weld and D. Fox
To be decided: How to choose the best attribute? (Information gain, entropy/disorder.) When to stop growing the tree? © D. Weld and D. Fox
Disorder is bad, homogeneity is good. [Figure: three example splits of the same data, labeled “No”, “Better”, “Good”.] © D. Weld and D. Fox
[Figure: entropy (0 to 1.0) as a function of the proportion P of positive examples, from 0.00 to 1.00.] © D. Weld and D. Fox
Entropy (disorder) is bad; homogeneity is good. Let S be a set of examples. Entropy(S) = -P log2(P) - N log2(N), where P is the proportion of positive examples, N is the proportion of negative examples, and 0 log 0 is defined to be 0. Example: S has 10 positive and 4 negative examples. Entropy([10+, 4-]) = -(10/14) log2(10/14) - (4/14) log2(4/14) = 0.863. © D. Weld and D. Fox
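A minimal Python sketch of this definition (not part of the original slides), checking the [10+, 4-] example; the function name is mine.

    from math import log2

    def entropy(pos: int, neg: int) -> float:
        """Entropy(S) = -P log2(P) - N log2(N), with 0 log 0 taken as 0."""
        total = pos + neg
        result = 0.0
        for count in (pos, neg):
            if count > 0:                 # skip the 0 log 0 term
                p = count / total
                result -= p * log2(p)
        return result

    print(round(entropy(10, 4), 3))       # 0.863, as on the slide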
Information Gain: a measure of the expected reduction in entropy resulting from splitting along an attribute A. Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv), where Entropy(S) = -P log2(P) - N log2(N). © D. Weld and D. Fox
Gain of Splitting on Wind

    Day  Wind    Tennis?
    d1   weak    n
    d2   strong  n
    d3   weak    yes
    d4   weak    yes
    d5   weak    yes
    d6   strong  yes
    d7   strong  yes
    d8   weak    n
    d9   weak    yes
    d10  weak    yes
    d11  strong  yes
    d12  strong  yes
    d13  weak    yes
    d14  strong  n

Values(Wind) = {weak, strong}. S = [10+, 4-], Sweak = [6+, 2-], Sstrong = [4+, 2-].
Gain(S, Wind) = Entropy(S) - Σ_{v ∈ {weak, strong}} (|Sv| / |S|) Entropy(Sv)
             = Entropy(S) - (8/14) Entropy(Sweak) - (6/14) Entropy(Sstrong)
             = 0.863 - (8/14) 0.811 - (6/14) 0.918 ≈ 0.006
© D. Weld and D. Fox
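A self-contained check of this computation (my sketch, not from the slides): it recomputes Gain(S, Wind) from the Wind column of the table, with helper functions of my own naming.

    from math import log2

    def entropy(labels):
        """Entropy of a list of class labels."""
        total = len(labels)
        return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                    for c in set(labels))

    def information_gain(values, labels):
        """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) Entropy(Sv)."""
        gain = entropy(labels)
        for v in set(values):
            subset = [l for x, l in zip(values, labels) if x == v]
            gain -= len(subset) / len(labels) * entropy(subset)
        return gain

    # Wind column and PlayTennis labels for d1..d14, read off the table above.
    wind = ["weak", "strong", "weak", "weak", "weak", "strong", "strong",
            "weak", "weak", "weak", "strong", "strong", "weak", "strong"]
    play = ["n", "n", "yes", "yes", "yes", "yes", "yes",
            "n", "yes", "yes", "yes", "yes", "yes", "n"]
    print(round(information_gain(wind, play), 3))   # ~0.006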
Evaluating Attributes. Gain(S, Wind) ≈ 0.006; Gain(S, Humid) = 0.375; Gain(S, Temp) = 0.029; Gain(S, Outlook) = 0.401. (The values are actually somewhat different for this data set, but ignore this detail.) Outlook gives the largest gain, so we split on Outlook first. © D. Weld and D. Fox
Resulting Tree (so far). Good day for tennis? Split on Outlook: Sunny → No [2+, 3-]; Overcast → Yes [4+]; Rain → Yes [4+, 1-]. © D. Weld and D. Fox
Recurse! Consider the Outlook = Sunny branch:

    Day  Temp  Humid  Wind    Tennis?
    d1   h     h      weak    n
    d2   h     h      strong  n
    d8   m     h      weak    n
    d9   c     n      weak    yes
    d11  m     n      strong  yes

© D. Weld and D. Fox
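As an illustrative check (not on the slides), the entropy and information-gain helpers are repeated here so the block runs on its own; it shows that Humid perfectly separates these five examples and therefore wins the next split, as the following slide assumes.

    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                    for c in set(labels))

    def information_gain(values, labels):
        gain = entropy(labels)
        for v in set(values):
            subset = [l for x, l in zip(values, labels) if x == v]
            gain -= len(subset) / len(labels) * entropy(subset)
        return gain

    # Columns of the Sunny-branch table above (examples d1, d2, d8, d9, d11).
    temp  = ["h", "h", "m", "c", "m"]
    humid = ["h", "h", "h", "n", "n"]
    wind  = ["weak", "strong", "weak", "weak", "strong"]
    play  = ["n", "n", "n", "yes", "yes"]

    for name, column in [("Temp", temp), ("Humid", humid), ("Wind", wind)]:
        print(name, round(information_gain(column, play), 3))
    # Humid has the highest gain (0.971): high -> all "n", normal -> all "yes".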
One Step Later… Outlook: Sunny → Humidity? (High → No [3-], Normal → Yes [2+]); Overcast → Yes [4+]; Rain → Yes [4+, 1-]. © D. Weld and D. Fox
Decision Tree Algorithm

    BuildTree(TrainingData)
        Split(TrainingData)

    Split(D)
        If (all points in D are of the same class) Then Return
        For each attribute A
            Evaluate splits on attribute A
        Use best split to partition D into D1, D2
        Split(D1)
        Split(D2)

© D. Weld and D. Fox
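A runnable Python rendering of this recursion (my sketch, not the authors' code): it encodes the tennis table from earlier using the same single-letter abbreviations, picks the split attribute by information gain, and branches on every value of the chosen attribute rather than into exactly two parts D1, D2. Because the exact gain values for this data set differ somewhat from the rounded numbers on the slides, the root it chooses may differ from the tree shown above.

    from math import log2

    # "Good day for tennis" examples: outlook, temp, humid, wind -> play.
    RAW = """s h h w n   s h h s n   o h h w y   r m h w y   r c n w y
             r c n s y   o c n s y   s m h w n   s c n w y   r m n w y
             s m n s y   o m h s y   o h n w y   r m h s n"""
    COLS = ["outlook", "temp", "humid", "wind", "play"]
    tokens = RAW.split()
    DATA = [dict(zip(COLS, tokens[i:i + 5])) for i in range(0, len(tokens), 5)]

    def entropy(rows):
        labels = [r["play"] for r in rows]
        return -sum(labels.count(c) / len(labels) * log2(labels.count(c) / len(labels))
                    for c in set(labels))

    def gain(rows, attr):
        g = entropy(rows)
        for v in set(r[attr] for r in rows):
            subset = [r for r in rows if r[attr] == v]
            g -= len(subset) / len(rows) * entropy(subset)
        return g

    def split(rows, attrs):
        """Recursive tree growing: stop when the node is pure or no attributes remain."""
        labels = [r["play"] for r in rows]
        if len(set(labels)) == 1 or not attrs:
            return max(set(labels), key=labels.count)        # leaf: majority class
        best = max(attrs, key=lambda a: gain(rows, a))        # best split by information gain
        branches = {v: split([r for r in rows if r[best] == v],
                             [a for a in attrs if a != best])
                    for v in set(r[best] for r in rows)}
        return (best, branches)

    print(split(DATA, ["outlook", "temp", "humid", "wind"]))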
Overfitting. [Figure: accuracy (roughly 0.6 to 0.9) vs. the number of nodes in the decision tree, with one curve for accuracy on the training data and one for accuracy on the test data.] © D. Weld and D. Fox
Overfitting 2. [Figure from W. W. Cohen.] © D. Weld and D. Fox
Overfitting… A decision tree DT is overfit when there exists another tree DT' such that DT has smaller error than DT' on the training examples, but DT has larger error than DT' on the test examples. Causes of overfitting: noisy data, or a training set that is too small. © D. Weld and D. Fox
Effect of Reduced-Error Pruning. [Figure: tree size vs. accuracy, annotated “cut tree back to here”.] (Reduced-error pruning removes subtrees as long as doing so does not hurt accuracy on a held-out validation set.) © D. Weld and D. Fox
Other Decision Tree Features. Can handle continuous data: for inputs, use a threshold to split; for outputs, estimate a linear function at each leaf. Can handle missing values: use an expectation taken from the other samples. © D. Weld and D. Fox
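A small sketch (my illustration, not from the slides) of the continuous-input case: try thresholds halfway between adjacent values of a numeric attribute, here a made-up temperature, and keep the one that maximizes information gain for a binary below / at-or-above split.

    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                    for c in set(labels))

    def best_threshold(values, labels):
        """Return (threshold, gain) for the best binary split value < t vs. value >= t."""
        base = entropy(labels)
        best = (None, -1.0)
        points = sorted(set(values))
        for lo, hi in zip(points, points[1:]):
            t = (lo + hi) / 2                         # midpoint between adjacent values
            left = [l for v, l in zip(values, labels) if v < t]
            right = [l for v, l in zip(values, labels) if v >= t]
            g = base - (len(left) / len(labels)) * entropy(left) \
                     - (len(right) / len(labels)) * entropy(right)
            if g > best[1]:
                best = (t, g)
        return best

    # Made-up temperatures with play / don't-play labels, for illustration only.
    temps  = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
    labels = ["y", "y", "y", "y", "y", "y", "y", "y", "n", "n", "n", "n"]
    print(best_threshold(temps, labels))   # threshold 77.5 separates the classes perfectly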