1
The Hebrew University of Jerusalem School of Engineering and Computer Science Instructor: Jeff Rosenschein (Chapter 18, “Artificial Intelligence: A Modern Approach”)
2
Learning agents Inductive learning Decision tree learning 2
3
Learning modifies the agent’s decision mechanisms to improve performance Learning is essential for unknown environments ◦ i.e., when designer lacks omniscience Learning is useful as a system construction method ◦ i.e., expose the agent to reality rather than trying to write it down 3
4
4
5
Design of a learning element is dictated by ◦ what type of performance element is used ◦ which functional component is to be learned ◦ how that functional component is represented ◦ what kind of feedback is available Types of feedback: ◦ Supervised learning: correct answers for each example ◦ Unsupervised learning: correct answers not given (e.g., taxi agent learning concept of “good traffic days” and “bad traffic days”) ◦ Reinforcement learning: occasional rewards 5
6
6
7
http://www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk Aibo Learning Movies 7
8
Simplest form: learn a function from examples (tabula rasa) f is the target function An example is an input-output pair (x, f(x)) Problem: find a hypothesis h such that h ≈ f given a training set of examples (This is a highly simplified model of real learning: ◦ Ignores prior knowledge ◦ Assumes a deterministic, observable “environment” ◦ Assumes examples are given ◦ Assumes that the agent wants to learn f – why?) 8
9
Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: 9
10
Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: 10
11
Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: 11
12
Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: 12
13
Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: 13
14
Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: Ockham’s razor: prefer the simplest hypothesis consistent with data 14
15
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60) 15
16
Examples described by attribute values (Boolean, discrete, continuous) E.g., situations where I will/won’t wait for a table: Classification of examples is positive (T) or negative (F) 16
17
One possible representation for hypotheses E.g., here is the “true” tree for deciding whether to wait: 17
18
Assume all inputs are Boolean and all outputs are Boolean What is the class of Boolean functions that are possible to represent by decision trees? Answer: All Boolean functions Simple proof:
1. Take any Boolean function
2. Convert it into a truth table
3. Construct a decision tree in which each row of the truth table corresponds to one path through the decision tree 18
19
Decision trees can express any function of the input attributes E.g., for Boolean functions, truth table row → path to leaf: Trivially, there is a consistent decision tree for any training set with one path to leaf for each example (unless f nondeterministic in x) but it probably won’t generalize to new examples Prefer to find more compact decision trees 19
20
How many distinct decision trees with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n) E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees; with n = 3 attributes (x, y, z) the truth table has 2^3 = 8 rows, each of whose outcomes can be 0 or 1, giving 2^(2^3) = 256 trees 20
21
How many distinct decision trees with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n) E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)? Each attribute can be in (positive), in (negative), or out ⇒ 3^n distinct conjunctive hypotheses A more expressive hypothesis space: ◦ increases chance that target function can be expressed ◦ increases number of hypotheses consistent with training set ⇒ may get worse predictions 21
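As a quick sanity check on these counts, here is a small Python sketch (not from the slides) that computes both quantities:

```python
# Counting hypothesis spaces over n Boolean attributes (illustrative check).
def num_boolean_functions(n: int) -> int:
    # Each of the 2^n truth-table rows can be assigned 0 or 1 independently.
    return 2 ** (2 ** n)

def num_conjunctive_hypotheses(n: int) -> int:
    # Each attribute is in positively, in negatively, or left out.
    return 3 ** n

print(num_boolean_functions(3))       # 256
print(num_boolean_functions(6))       # 18446744073709551616
print(num_conjunctive_hypotheses(6))  # 729
```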
22
Aim: find a small tree consistent with the training examples Idea: (recursively) choose “most significant” attribute as root of (sub)tree (in the DTL algorithm, the default value for the goal predicate is the Majority Value of the parent’s examples) 22
23
23
24
A decision tree learned from the 12 examples: Substantially simpler than “true” tree – a more complex hypothesis isn’t justified by small amount of data 24
25
A decision tree (left) learned from the 12 examples: Substantially simpler than “true” tree – a more complex hypothesis isn’t justified by small amount of data 25
26
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative” Patrons? is a better choice 26
27
To implement Choose-Attribute in the DTL algorithm Discrete random variable V with possible values {v_1, ..., v_n} Information Content (Entropy): H(V) = H(P(v_1), …, P(v_n)) = Σ_{i=1..n} -P(v_i) log_2 P(v_i) For a training set containing p positive examples and n negative examples: H(p/(p+n), n/(p+n)) = -(p/(p+n)) log_2 (p/(p+n)) - (n/(p+n)) log_2 (n/(p+n)) 27
28
Information Content (Entropy): H(P(v_1), …, P(v_n)) = Σ_{i=1..n} -P(v_i) log_2 P(v_i) For a training set containing p positive examples and n negative examples: H(p/(p+n), n/(p+n)) 28 H(½, ½) = -½ log_2 ½ - ½ log_2 ½ = 1 bit: one bit of information is sufficient to convey the answer regarding the flip of a fair coin H(1, 0) = -1 log_2 1 - 0 log_2 0 = 0 bits: no bits of information required, the outcome is predictable (the term 0 log_2 0 is assumed to be equal to 0)
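The entropy formula above translates directly into code. The following is a minimal Python sketch (the function names are my own, not from the slides):

```python
from math import log2

def entropy(probs):
    """H(p1, ..., pn) = sum of -p_i * log2(p_i), taking 0*log2(0) to be 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

def entropy_pn(p, n):
    """Entropy of a training set with p positive and n negative examples."""
    return entropy([p / (p + n), n / (p + n)])

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin flip
print(entropy([1.0, 0.0]))  # 0.0 bits: the outcome is predictable
print(entropy_pn(6, 6))     # 1.0 bit: the 12-example restaurant training set
```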
29
A chosen attribute A divides the training set E into subsets E_1, …, E_v according to their values for A, where A has v distinct values Information Gain (IG) or reduction in entropy from the attribute test: remainder(A) = Σ_{i=1..v} (p_i + n_i)/(p + n) H(p_i/(p_i+n_i), n_i/(p_i+n_i)) (also called “conditional entropy”) IG(A) = H(p/(p+n), n/(p+n)) - remainder(A) Choose the attribute with the largest IG 29
30
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative” Patrons? is a better choice 30
31
For the training set, p = n = 6, I(6/12, 6/12) = 1 bit Consider the attributes Patrons and Type (and others too): IG(Patrons) = 1 - [ (2/12) H(0,1) + (4/12) H(1,0) + (6/12) H(2/6, 4/6) ] ≈ 0.541 bits IG(Type) = 1 - [ (2/12) H(1/2,1/2) + (2/12) H(1/2,1/2) + (4/12) H(2/4,2/4) + (4/12) H(2/4,2/4) ] = 0 bits Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root 31 The entropy of the original set was 1, i.e., 6 positive and 6 negative examples
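The information-gain numbers quoted above can be reproduced with a short script. The per-value (positive, negative) counts below follow the standard restaurant training set in Russell and Norvig; the helper names are my own:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def remainder(splits, total):
    """splits: list of (p_i, n_i) counts, one pair per attribute value."""
    return sum((p + n) / total * H([p / (p + n), n / (p + n)]) for p, n in splits)

def info_gain(splits, p, n):
    return H([p / (p + n), n / (p + n)]) - remainder(splits, p + n)

# (positive, negative) counts per attribute value, from the standard
# restaurant training set in Russell & Norvig:
patrons = [(0, 2), (4, 0), (2, 4)]            # None, Some, Full
rest_type = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(round(info_gain(patrons, 6, 6), 3))     # ~0.541 bits
print(round(info_gain(rest_type, 6, 6), 3))   # 0.0 bits
```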
32
Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599 Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
33
33 You are watching a set of independent random samples of X You see that X has four possible values, with P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4 So you might see: BAACBADCDADDDA… You transmit data over a binary serial link. You can encode each reading with two bits (e.g., A = 00, B = 01, C = 10, D = 11) 0100001001001110110011111100…
34
34 Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8 It’s possible… …to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?
35
35 Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8 It’s possible… …to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? (This is just one of several ways) A → 0, B → 10, C → 110, D → 111
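A quick way to check the 1.75-bit claim is to compute the expected code length of this prefix code directly (illustrative Python, not from the slides):

```python
# Expected bits per symbol for the prefix code A->0, B->10, C->110, D->111,
# under P(A)=1/2, P(B)=1/4, P(C)=P(D)=1/8.
code = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
probs = {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}

expected_length = sum(probs[s] * len(code[s]) for s in code)
print(expected_length)  # 1.75 bits, matching the entropy of this distribution
```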
36
36 Suppose there are three equally likely values: P(X=A) = 1/3, P(X=B) = 1/3, P(X=C) = 1/3 Here’s a naïve coding, costing 2 bits per symbol: A → 00, B → 01, C → 10 Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.
37
37 Suppose X can have one of m values… V_1, V_2, … V_m, with P(X=V_1) = p_1, P(X=V_2) = p_2, …, P(X=V_m) = p_m What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It’s H(X) = -p_1 log_2 p_1 - p_2 log_2 p_2 - … - p_m log_2 p_m = The entropy of X “High Entropy” means X is from a uniform (boring) distribution “Low Entropy” means X is from a varied (peaks and valleys) distribution
38
38 Suppose X can have one of m values… V_1, V_2, … V_m, with P(X=V_1) = p_1, …, P(X=V_m) = p_m What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It’s H(X) = The entropy of X “High Entropy” means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat “Low Entropy” means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs
39
39 Suppose X can have one of m values… V_1, V_2, … V_m, with P(X=V_1) = p_1, …, P(X=V_m) = p_m What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It’s H(X) = The entropy of X “High Entropy” means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place “Low Entropy” means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable
40
40 Low Entropy vs. High Entropy
41
41 High Entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room Low Entropy: the values (locations of soup) are sampled entirely from within the soup bowl
42
Suppose I’m trying to predict output Y and I have input X X = College Major, Y = Likes “Avatar” Data: (Math, Yes), (History, No), (CS, Yes), (Math, No), (Math, No), (CS, Yes), (History, No), (Math, Yes) Let’s assume this reflects the true probabilities E.g., from this data we estimate P(LikeA = Yes) = 0.5 P(Major = Math & LikeA = No) = 0.25 P(Major = Math) = 0.5 P(LikeA = Yes | Major = History) = 0 Note: H(X) = 1.5 = -½ log_2 ½ - ¼ log_2 ¼ - ¼ log_2 ¼ H(Y) = 1
43
Definition of Specific Conditional Entropy: H(Y|X=v) = The entropy of Y among only those records in which X has value v X = College Major, Y = Likes “Avatar” (data as above)
44
Definition of Specific Conditional Entropy: H(Y|X=v) = The entropy of Y among only those records in which X has value v Example: H(Y|X=Math) = 1, H(Y|X=History) = 0, H(Y|X=CS) = 0 X = College Major, Y = Likes “Avatar” (data as above)
45
Definition of Conditional Entropy: H(Y|X) = The average specific conditional entropy of Y = if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row’s value of X = Expected number of bits to transmit Y if both sides will know the value of X = Σ_j Prob(X=v_j) H(Y | X = v_j) X = College Major, Y = Likes “Avatar” (data as above) This is what was called remainder in the slides above…
46
Definition of Conditional Entropy: H(Y|X) = The average conditional entropy of Y = Σ_j Prob(X=v_j) H(Y | X = v_j) X = College Major, Y = Likes “Avatar” (data as above) Example: Math: Prob(X=v_j) = 0.5, H(Y|X=v_j) = 1; History: Prob = 0.25, H = 0; CS: Prob = 0.25, H = 0 H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5
47
Definition of Information Gain: IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X? IG(Y|X) = H(Y) - H(Y|X) X = College Major, Y = Likes “Avatar” (data as above) Example: H(Y) = 1, H(Y|X) = 0.5 Thus IG(Y|X) = 1 - 0.5 = 0.5
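The worked example above (H(Y) = 1, H(Y|X) = 0.5, IG = 0.5) can be reproduced from the eight records with a few lines of Python; the function names here are my own, not from the slides:

```python
from math import log2
from collections import Counter

# The eight (Major, LikesAvatar) records from the slides.
data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def H(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in counts.values())

ys = [y for _, y in data]

def specific_cond_entropy(v):
    # Entropy of Y among only those records where X = v.
    return H([y for x, y in data if x == v])

def cond_entropy():
    xs = [x for x, _ in data]
    return sum(xs.count(v) / len(xs) * specific_cond_entropy(v) for v in set(xs))

print(H(ys))                          # H(Y)        = 1.0
print(specific_cond_entropy("Math"))  # H(Y|X=Math) = 1.0
print(cond_entropy())                 # H(Y|X)      = 0.5
print(H(ys) - cond_entropy())         # IG(Y|X)     = 0.5
```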
48
48
49
49
50
Definition of Relative Information Gain: RIG(Y|X) = I must transmit Y; what fraction of the bits on average would it save me if both ends of the line knew X? RIG(Y|X) = [ H(Y) - H(Y|X) ] / H(Y) X = College Major, Y = Likes “Avatar” (data as above) Example: H(Y|X) = 0.5, H(Y) = 1 Thus RIG(Y|X) = (1 - 0.5)/1 = 0.5
51
Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find… IG(LongLife | HairColor) = 0.01 IG(LongLife | Smoker) = 0.2 IG(LongLife | Gender) = 0.25 IG(LongLife | LastDigitOfSSN) = 0.00001 IG tells you how interesting a 2-d contingency table is going to be (more about this soon…)
52
One possible representation for hypotheses E.g., here is the “true” tree for deciding whether to wait: 52
53
Aim: find a small tree consistent with the training examples Idea: (recursively) choose “most significant” attribute as root of (sub)tree 53
54
Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative” Patrons? is a better choice 54
55
Decision tree learned from the 12 examples: Substantially simpler than “true” tree – a more complex hypothesis isn’t justified by small amount of data 55
56
How do we know that h ≈ f ? Use theorems of computational/statistical learning theory (more on this, later) OR ◦ Randomly divide set of examples into training set and test set ◦ Learn h from training set ◦ Try h on test set of examples (measure percent of test set correctly classified) ◦ Repeat for: different sizes of training sets, and for each size of training set, different randomly selected sets 56
57
Learning curve = % correct on test set as a function of training set size 57 A “happy graph” that leads us to believe there is some pattern in the data and the learning algorithm is discovering it.
58
The learning algorithm cannot be allowed to “see” (or be influenced by) the test data before the hypothesis h is tested on it If we generate different h’s (for different parameters), and report back as our h the one that gave the best performance on the test set, then we’re allowing test set results to affect our learning algorithm This taints the results, but people do it anyway… 58
59
Learning needed for unknown environments, lazy designers Learning agent = performance element + learning element For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples Decision tree learning using information gain Learning performance = prediction accuracy measured on test set 59
60
Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599 Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
61
We’ll look at Information Gain, used both in Data Mining, and (again) in Decision Tree learning This gives us a new (reinforced) perspective on the topic 61
62
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 62
63
48,842 records, 16 attributes [Kohavi 1995] 63
64
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 64
65
A Major Data Mining Operation Given one attribute (e.g., wealth), try to predict the value of new people’s wealth by means of some of the other available attributes Applies to categorical outputs Categorical attribute: an attribute which takes on two or more discrete values. Also known as a symbolic attribute Real attribute: a column of real numbers 65
66
It is a tiny subset of the 1990 US Census It is publicly available online from the UCI Machine Learning Datasets repository 66
67
Well, you can look at histograms… (e.g., histograms of Gender and Marital Status) 67
68
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 68
69
A better name for a histogram: a one-dimensional Contingency Table Recipe for making a k-dimensional contingency table:
1. Pick k attributes from your dataset. Call them a_1, a_2, … a_k
2. For every possible combination of values a_1=x_1, a_2=x_2, … a_k=x_k, record how frequently that combination occurs
Fun fact: A database person would call this a “k-dimensional datacube” 69
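Here is a minimal sketch of that recipe in Python, using a Counter keyed by value combinations; the records shown are made up purely for illustration, not taken from the census dataset:

```python
from collections import Counter

# A tiny made-up dataset, just to show the mechanics.
records = [
    {"agegroup": "20s", "wealth": "poor"},
    {"agegroup": "20s", "wealth": "rich"},
    {"agegroup": "40s", "wealth": "rich"},
    {"agegroup": "40s", "wealth": "poor"},
    {"agegroup": "40s", "wealth": "poor"},
]

def contingency_table(records, attributes):
    """Count how often each combination of values of the chosen attributes occurs."""
    return Counter(tuple(r[a] for a in attributes) for r in records)

print(contingency_table(records, ["agegroup", "wealth"]))
# Counter({('40s', 'poor'): 2, ('20s', 'poor'): 1, ('20s', 'rich'): 1, ('40s', 'rich'): 1})
```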
70
For each pair of values for attributes (agegroup, wealth) we can see how many records match 70
71
Easier to appreciate graphically 71
72
Easier to see “interesting” things if we stretch out the histogram bars 72
73
73
74
These are harder to look at! (a 3-d contingency table over Gender: Male/Female, Wealth: Rich/Poor, and Agegroup: 20s/30s/40s/50s) 74
75
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Information Gain of a real valued input Building Decision Trees with real Valued Inputs Andrew’s homebrewed hack: Binary Categorical Splits Example Decision Trees Outline 75
76
Software packages and database add-ons to do this are known as OLAP tools They usually include point and click navigation to view slices and aggregates of contingency tables They usually include nice histogram visualization 76
77
Why would people want to look at contingency tables? 77
78
With 16 attributes, how many 1-d contingency tables are there? How many 2-d contingency tables? How many 3-d tables? With 100 attributes how many 3-d tables are there? 78
79
With 16 attributes, how many 1-d contingency tables are there? 16 How many 2-d contingency tables? 16-choose-2 = 16! / [2! * (16 – 2)!] = (16 * 15) / 2 = 120 How many 3-d tables? 560 With 100 attributes how many 3-d tables are there? 161,700 79
80
Looking at one contingency table: can be as much fun as reading an interesting book Looking at ten tables: as much fun as watching CNN Looking at 100 tables: as much fun as watching an infomercial Looking at 100,000 tables: as much fun as a three-week November vacation in Duluth with a dying weasel 80
81
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Information Gain of a real valued input Building Decision Trees with real Valued Inputs Andrew’s homebrewed hack: Binary Categorical Splits Example Decision Trees Outline 81
82
Data Mining is all about automating the process of searching for patterns in the data Which patterns are interesting? Which might be mere illusions? And how can they be exploited? 82
83
Data Mining is all about automating the process of searching for patterns in the data Which patterns are interesting? Which might be mere illusions? And how can they be exploited? 83 That’s what we’ll look at right now. And the answer (info gains) will turn out to be the engine that drives decision tree learning…(but you already know that) That’s what we’ll look at right now. And the answer (info gains) will turn out to be the engine that drives decision tree learning…(but you already know that)
84
We will use information theory A very large topic, originally used for compressing signals But more recently used for data mining… 84
85
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Information Gain of a real valued input Building Decision Trees with real Valued Inputs Andrew’s homebrewed hack: Binary Categorical Splits Example Decision Trees Outline 85
86
Given something (e.g., wealth) you are trying to predict, it is easy to ask the computer to find which attribute has highest information gain for it 86
87
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 87
88
A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output To decide which attribute should be tested first, simply find the one with the highest information gain Then recurse… 88
89
89 From the UCI (University of California at Irvine) repository (thanks to Ross Quinlan) 40 Records
90
90 Suppose we want to predict MPG
91
91
92
92 Take the Original Dataset.. And partition it according to the value of the attribute we split on Records in which cylinders = 4 Records in which cylinders = 5 Records in which cylinders = 6 Records in which cylinders = 8
93
93 Records in which cylinders = 4 / cylinders = 5 / cylinders = 6 / cylinders = 8: build a tree from each of these subsets of records
94
94 Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia (Similar recursion in the other cases)
95
The final tree 95
96
Base Case One Don’t split a node if all matching records have the same output value 96
97
Base Case Two Don’t split a node if none of the attributes can create multiple non- empty children 97
98
Base Case Two: No attributes can distinguish 98
99
Base Case One: If all records in current data subset have the same output then don’t recurse Base Case Two: If all records have exactly the same set of input attributes then don’t recurse 99
100
Base Case One: If all records in current data subset have the same output then don’t recurse Base Case Two: If all records have exactly the same set of input attributes then don’t recurse 100 Proposed Base Case 3: If all attributes have zero information gain then don’t recurse Is this a good idea?
101
101 y = a XOR b The information gains: IG(y|a) = 0 and IG(y|b) = 0 The resulting decision tree (if we adopt Proposed Base Case 3): a single leaf that never splits, and so gets half the records wrong
102
102 y = a XOR b The resulting decision tree (without Proposed Base Case 3): split on a, then on b, which classifies every record correctly; so Proposed Base Case 3 is a bad idea
103
BuildTree(DataSet, Output)
If all output values are the same in DataSet, return a leaf node that says “predict this unique output”
If all input values are the same, return a leaf node that says “predict the majority output”
Else find attribute X with highest Info Gain
Suppose X has n_X distinct values (i.e., X has arity n_X)
◦ Create and return a non-leaf node with n_X children
◦ The i’th child should be built by calling BuildTree(DS_i, Output), where DS_i consists of all those records in DataSet for which X = the i’th distinct value of X 103
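A runnable Python version of this pseudocode might look as follows. It is a sketch, not the slides’ exact implementation: rows are dicts of categorical attributes, trees are nested dicts, and for simplicity the chosen attribute is removed from consideration in each subtree. The tiny dataset at the end is made up just to exercise the function:

```python
from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(rows, labels, attr):
    # Entropy of the labels minus the remainder after splitting on attr.
    rem = 0.0
    for v in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == v]
        rem += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - rem

def build_tree(rows, labels, attributes):
    # Base case 1: all outputs identical -> leaf predicting that output.
    if len(set(labels)) == 1:
        return labels[0]
    # Base case 2: no attributes left, or all inputs identical -> majority leaf.
    if not attributes or all(r == rows[0] for r in rows):
        return Counter(labels).most_common(1)[0][0]
    # Otherwise split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {"attribute": best, "children": {}}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [y for r, y in zip(rows, labels) if r[best] == v]
        rest = [a for a in attributes if a != best]
        tree["children"][v] = build_tree(sub_rows, sub_labels, rest)
    return tree

rows = [{"cylinders": "4", "maker": "asia"},
        {"cylinders": "8", "maker": "america"},
        {"cylinders": "4", "maker": "europe"}]
labels = ["good", "bad", "good"]
print(build_tree(rows, labels, ["cylinders", "maker"]))
# e.g. {'attribute': 'cylinders', 'children': {'4': 'good', '8': 'bad'}}
```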
104
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 104
105
For each record, follow the decision tree to see what it would predict For what number of records does the decision tree’s prediction disagree with the true value in the database? This quantity is called the training set error. The smaller the better. 105
106
MPG Training error 106
107
107
108
108
109
It is not usually in order to predict the training data’s output on data we have already seen 109
110
It is more commonly in order to predict the output value for future data we have not yet seen 110
111
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 111
112
Suppose we are forward thinking We hide some data away when we learn the decision tree But once learned, we see how well the tree predicts that data This is a good simulation of what happens when we try to predict future data And it is called Test Set Error 112
113
MPG Test set error 113
114
The test set error is much worse than the training set error… …why? 114
115
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 115
116
We’ll create a training dataset 116 Five inputs, all bits, are generated in all 32 possible combinations (32 records) Output y = copy of e, except a random 25% of the records have y set to the opposite of e
a b c d e | y
0 0 0 0 0 | 0
0 0 0 0 1 | 0
0 0 0 1 0 | 0
0 0 0 1 1 | 1
0 0 1 0 0 | 1
: : : : : | :
1 1 1 1 1 | 1
117
Suppose someone generates a test set according to the same method The test set is identical, except that some of the y’s will be different Some y’s that were corrupted in the training set will be uncorrupted in the testing set Some y’s that were uncorrupted in the training set will be corrupted in the test set 117
118
Suppose we build a full tree (we always split until “base case 2”, i.e., don’t split a node if none of the attributes can create multiple non-empty children) 118 (The tree: Root splits on e; each branch e=0 / e=1 then splits on a=0 / a=1, and so on down to single-record leaves) 25% of these leaf node labels will be corrupted
119
All the leaf nodes contain exactly one record and so… We would have a training set error of zero 119
120
120 1/4 of the tree’s leaf nodes are corrupted and 3/4 are fine; 1/4 of the test set records are corrupted and 3/4 are fine ◦ leaf corrupted, test record corrupted (1/16 of the test set): correctly predicted for the wrong reasons ◦ leaf fine, test record corrupted (3/16): wrongly predicted because the test record is corrupted ◦ leaf corrupted, test record fine (3/16): wrongly predicted because the tree node is corrupted ◦ leaf fine, test record fine (9/16): predicted correctly In total, we expect to be wrong on 3/8 of the test set predictions
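The 3/8 figure is just the probability that exactly one of the two corruption events (leaf label vs. test label) occurs, which a two-line check confirms:

```python
# Expected test-set error for the fully grown tree, assuming the leaf labels
# and the test labels are each independently corrupted with probability 0.25.
p_corrupt = 0.25
wrong = (p_corrupt * (1 - p_corrupt)      # leaf corrupted, test record clean
         + (1 - p_corrupt) * p_corrupt)   # leaf clean, test record corrupted
print(wrong)  # 0.375 = 3/8
```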
121
This explains the discrepancy between training and test set error But more importantly… …it indicates there’s something we should do about it if we want to predict well on future data 121
122
Let’s not look at the irrelevant bits 122 Same 32 records, but now the bits a, b, c, d are hidden; only e and y are visible Output y = copy of e, except a random 25% of the records have y set to the opposite of e What decision tree would we learn now?
123
123 (The tree: Root splits on e into e=0 and e=1) These nodes will be unexpandable
124
124 (The tree: Root splits on e into e=0 and e=1; these nodes will be unexpandable) In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0 In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1
125
125 (The tree: Root splits on e into e=0 and e=1) Almost certainly none of the tree’s nodes are corrupted (almost certainly all are fine) ◦ 1/4 of the test set records are corrupted: these will be wrongly predicted because the test record is corrupted ◦ 3/4 of the test set records are fine: these predictions will be fine In total, we expect to be wrong on only 1/4 of the test set predictions
126
Definition: If your machine learning algorithm fits noise (i.e., pays attention to parts of the data that are irrelevant) it is overfitting Fact (theoretical and empirical): If your machine learning algorithm is overfitting then it may perform less well on test set data 126
127
Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 127
128
Usually we do not know in advance which are the irrelevant variables …and it may depend on the context For example, if y = a AND b, then b is an irrelevant variable only in the portion of the tree in which a=0 But we can use simple statistics to warn us that we might be overfitting 128
129
Consider this split 129
130
Suppose that mpg was completely uncorrelated with maker What is the chance we’d have seen data of at least this apparent level of association anyway? 130
131
Suppose that mpg was completely uncorrelated with maker What is the chance we’d have seen data of at least this apparent level of association anyway? By using a particular kind of chi-squared test, the answer is 13.5% (i.e., the probability that the attribute is really irrelevant can be calculated with the help of standard chi-squared tables) 131
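As a sketch of how such a test could be run in practice, scipy’s chi2_contingency returns the p-value for a contingency table; the counts below are made up purely to illustrate the mechanics, not the actual mpg/maker table behind the 13.5% figure in the slides:

```python
from scipy.stats import chi2_contingency

# Hypothetical (maker x mpg-class) counts -- made-up numbers for illustration.
observed = [
    [8, 4],   # america: (good mpg, bad mpg)
    [6, 2],   # asia
    [3, 5],   # europe
]

chi2, p_chance, dof, expected = chi2_contingency(observed)
print(p_chance)  # chance of seeing at least this much apparent association by luck

MaxPchance = 0.1
if p_chance > MaxPchance:
    print("prune this split")  # the apparent association may well be noise
```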
132
Build the full decision tree as before But when you can grow it no more, start to prune: ◦ Beginning at the bottom of the tree, delete splits in which p_chance > MaxPchance ◦ Continue working your way up until there are no more prunable nodes MaxPchance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise 132
133
Original MPG Test set error 133
134
With MaxPchance = 0.1, you will see the following MPG decision tree: 134 Note the improved test set accuracy compared with the unpruned tree
135
Good news: The decision tree can automatically adjust its pruning decisions according to the amount of apparent noise and data Bad news: The user must come up with a good value of MaxPchance (note: Andrew Moore usually uses 0.05, which is his favorite value for any magic parameter) Good news: But with extra work, the best MaxPchance value can be estimated automatically by a technique called cross-validation 135
136
Set aside some fraction of the known data and use it to test the prediction performance of a hypothesis induced from the remaining data K-fold cross-validation means that you run k experiments, each time setting aside a different 1/k of the data to test on, and average the results 136
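A minimal sketch of k-fold cross-validation, assuming hypothetical learn and accuracy callables (for example, a decision-tree learner run at several MaxPchance values, keeping the value with the best cross-validated accuracy):

```python
import random

def k_fold_cross_validation(data, k, learn, accuracy):
    """Average test accuracy over k train/test splits.
    `learn(train)` returns a hypothesis; `accuracy(h, test)` scores it."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal-sized folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(learn(train), test))
    return sum(scores) / k
```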
137
137 For nondeterministic functions (e.g., the true inputs are not fully observed), there is an inevitable tradeoff between the complexity of the hypothesis and the degree of fit to the data
138
Ensemble learning methods select a whole collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions For example, we might generate a hundred different decision trees from the same training set, and have them vote on the best classification for a new example 138
139
Suppose we assume that each hypothesis h i in the ensemble has an error of p; that is, the probability that a randomly chosen example is misclassified by h i is p Suppose we also assume that the errors made by each hypothesis are independent Then if p is small, the probability of a large number of misclassifications occurring is very small (The independence assumption above is unrealistic, but reduced correlation of errors among hypotheses still helps) 139
140
140
141
In a weighted training set, each example has an associated weight w_j > 0; the higher the weight of an example, the higher the importance attached to it during the learning of a hypothesis Boosting starts with w_j = 1 for all the examples (i.e., a normal training set) From this set, it generates the first hypothesis, h_1 This hypothesis will classify some of the training examples correctly and some incorrectly 141
142
We want the next hypothesis to do better on the misclassified examples, so we increase their weights while decreasing the weights of the correctly classified examples From this new weighted training set, we generate hypothesis h 2 The process continues in this way until we have generated M hypotheses, where M is an input to the boosting algorithm The final ensemble hypothesis is a weighted- majority combination of all the M hypotheses, each weighted according to how well it performed on the training set 142
143
143
144
There are many variants of the basic boosting idea with different ways of adjusting the weights and combining the hypotheses One specific algorithm, called AdaBoost, is given in Russell and Norvig AdaBoost has an important property: if the input learning algorithm L is a weak learning algorithm (that is, L always returns a hypothesis with weighted error on the training set that is slightly better than random guessing, i.e., 50% for Boolean classification) then AdaBoost will return a hypothesis that classifies the training data perfectly for large enough M 144
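For concreteness, here is one common formulation of AdaBoost for ±1 labels. It is a sketch: the exact weight-update presented in Russell and Norvig differs slightly, and weak_learn is a hypothetical weighted weak learner supplied by the caller:

```python
import math

def adaboost(examples, labels, weak_learn, M):
    """labels are +1/-1; weak_learn(examples, labels, weights) returns a
    classifier h with h(x) in {+1, -1}."""
    N = len(examples)
    w = [1.0 / N] * N
    hypotheses, alphas = [], []
    for _ in range(M):
        h = weak_learn(examples, labels, w)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this hypothesis
        # Increase weights of misclassified examples, decrease the rest, renormalize.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
        hypotheses.append(h)
        alphas.append(alpha)
    def ensemble(x):
        s = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        return 1 if s >= 0 else -1                 # weighted-majority vote
    return ensemble
```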
145
Thus, the algorithm boosts the accuracy of the original learning algorithm on the training data This result holds no matter how inexpressive the original hypothesis space and no matter how complex the function being learned 145
146
What can we say about the “correctness” of our learning procedure? Is there the possibility of exactly learning a concept? The answer is “yes”, but the technique is so restrictive that it’s unusable in practice A stroll down memory lane… 146
147
In all representation languages, there is a partial order according to the generality of each sentence: 147 A small rule space, for example: ∃c_1 Red(c_1); ∃c_1 ∃c_2 [Red(c_1) ∧ Red(c_2)]; ∃c_1 ∃c_2 [Red(c_1) ∧ Black(c_2)]; ∃c_1 ∃c_2 ∃c_3 [Red(c_1) ∧ Red(c_2) ∧ Black(c_3)]
148
The “candidate-elimination algorithm” moves S up and moves G down until they are equal and contain a single concept 148 “Boundary Sets” can be used to represent a subspace of the rule space: G (toward more general) above and S (toward more specific) below
149
Positive examples of a concept move S up (generalizing S); Negative examples of a concept move G down (specializing G). 149
150
The algorithm: 150
1. Make G be the null description (most general); make S be all the most specific concepts in the space.
2. Accept a “training example”:
A. If positive, i. remove from G all concepts that don’t cover the new example; ii. generalize the elements in S as little as possible so that they cover the new example.
B. If negative, i. remove from S all concepts that cover this counter-example; ii. specialize the elements in G as little as possible so that they will not cover this new negative example.
3. Repeat step 2 until G = S and is a singleton set. This is the concept to be learned.
151
Ex: Consider objects that have 2 features, size (S or L) and shape (C or R or T). The initial version space is: 151 (x y) at the top; below it (S y), (L y), (x R), (x C), (x T); at the bottom (S R), (L R), (S C), (L C), (S T), (L T) G = { (x y) } S = { (S R), (L R), (S C), (L C), (S T), (L T) }
152
First training instance is positive: (S C) So, G = { (x y) }, S = { (S C) }, and the version space is now the part of the lattice between them (we’ve changed the S-set) 152
153
Second training instance is negative: (L T) So, G = { (x C), (S y) }, S = { (S C) }, and the version space is now the part of the lattice between them (we’ve changed the G-set) 153
154
The third example is positive: (L C) First, (S y) is eliminated from the G-set (since it doesn’t cover the example). Then the S-set is generalized. 154 G = { (x C) } S = { (x C) } So the concept (x C) is the answer (“any circle”).
155
Doesn’t tolerate noise in the training set Doesn’t learn disjunctive concepts What is needed is a more general theory of learning that will approach the issue probabilistically, not deterministically Enter: PAC Learning 155
156
Any hypothesis that is seriously wrong will almost certainly be “found out” with high probability after a small number of examples, because it will make an incorrect prediction Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct 156
157
The key assumption, called the stationarity assumption, is that the training and test sets are drawn randomly and independently from the same population of examples with the same probability distribution Without the stationarity assumption, the theory can make no claims at all about the future, because there would be no necessary connection between future and past 157
158
Let X be the set of all possible examples Let D be the distribution from which examples are drawn Let H be the set of possible hypotheses Let N be the number of examples in the training set 158
159
Assume that the true function f is a member of H Define the error of a hypothesis h with respect to the true function f given a distribution D over the examples, as the probability that h is different from f on an example error(h) = P( h(x) ≠ f(x) | x drawn from D) 159
160
A hypothesis h is called approximately correct if error(h) < ε (ε, as usual, is a small constant) We’ll show that after seeing N examples, with high probability, all consistent hypotheses will be approximately correct, lying within the ε-ball around the true function f 160
161
161
162
What is the probability that hypothesis h_b in H_bad is consistent with the first N examples? We have error(h_b) > ε The probability that h_b agrees with a given example is at most 1 - ε The bound for N examples is: P(h_b agrees with N examples) ≤ (1 - ε)^N 162
163
The probability that H_bad contains at least one consistent hypothesis is bounded by the sum of the individual probabilities: P(H_bad contains a consistent hypothesis) ≤ |H_bad|(1 - ε)^N ≤ |H|(1 - ε)^N We want to reduce this probability below some small number δ: |H|(1 - ε)^N ≤ δ 163
164
Given that 1 - ε ≤ e^(-ε), we can achieve this if we allow the algorithm to see N ≥ (1/ε)(ln 1/δ + ln |H|) examples If a learning algorithm returns a hypothesis that is consistent with this many examples, then with probability at least 1 - δ, it has error at most ε 164
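Plugging numbers into this bound is straightforward; for instance, with ε = 0.1, δ = 0.05, and |H| = 2^10 (the and-positive-literal machine over m = 10 attributes discussed later), about 100 examples suffice. A small sketch:

```python
from math import ceil, log

def pac_sample_size(epsilon, delta, H_size):
    """N >= (1/epsilon) * (ln(1/delta) + ln|H|)."""
    return ceil((1 / epsilon) * (log(1 / delta) + log(H_size)))

print(pac_sample_size(epsilon=0.1, delta=0.05, H_size=2**10))  # 100
```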
165
The number of required examples, as a function of ε and δ, is called the sample complexity of the hypothesis space One of the key issues, then, is the size of the hypothesis space To make learning effective, we sometimes can restrict the space of functions the algorithm can consider (see Russell and Norvig on “learning decision lists”) 165
166
Andrew W. Moore Associate Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599 Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
167
Imagine we’re doing classification with categorical inputs All inputs and outputs are binary Data is noiseless There’s a machine f(x,h) which has H possible settings (a.k.a. hypotheses), called h 1, h 2.. h H 167
168
f(x,h) consists of all logical sentences about X1, X2, … Xm that contain only logical ands Example hypotheses: X1 ^ X3 ^ X19; X3 ^ X18; X7; X1 ^ X2 ^ X4 ^ … ^ Xm Question: if there are 3 attributes, what is the complete set of hypotheses in f? 168
169
f(x,h) consists of all logical sentences about X1, X2, … Xm that contain only logical ands. Example hypotheses: X1 ^ X3 ^ X19; X3 ^ X18; X7; X1 ^ X2 ^ X4 ^ … ^ Xm Question: if there are 3 attributes, what is the complete set of hypotheses in f? (H = 8) 169 True; X1; X2; X3; X1 ^ X2; X1 ^ X3; X2 ^ X3; X1 ^ X2 ^ X3
170
f(x,h) consists of all logical sentences about X1, X2, … Xm that contain only logical ands Example hypotheses: X1 ^ X3 ^ X19; X3 ^ X18; X7; X1 ^ X2 ^ X4 ^ … ^ Xm Question: if there are m attributes, how many hypotheses in f? 170
171
f(x,h) consists of all logical sentences about X1, X2, … Xm that contain only logical ands Example hypotheses: X1 ^ X3 ^ X19; X3 ^ X18; X7; X1 ^ X2 ^ X4 ^ … ^ Xm Question: if there are m attributes, how many hypotheses in f? (H = 2^m) 171
172
f(x,h) consists of all logical sentences about X1, X2, … Xm or their negations that contain only logical ands. Example hypotheses: X1 ^ ~X3 ^ X19; X3 ^ ~X18; ~X7; X1 ^ X2 ^ ~X3 ^ … ^ Xm Question: if there are 2 attributes, what is the complete set of hypotheses in f? 172
173
f(x,h) consists of all logical sentences about X1, X2, … Xm or their negations that contain only logical ands. Example hypotheses: X1 ^ ~X3 ^ X19; X3 ^ ~X18; ~X7; X1 ^ X2 ^ ~X3 ^ … ^ Xm Question: if there are 2 attributes, what is the complete set of hypotheses in f? (H = 9) 173 True; X2; ~X2; X1; X1 ^ X2; X1 ^ ~X2; ~X1; ~X1 ^ X2; ~X1 ^ ~X2
174
f(x,h) consists of all logical sentences about X1, X2, … Xm or their negations that contain only logical ands. Example hypotheses: X1 ^ ~X3 ^ X19; X3 ^ ~X18; ~X7; X1 ^ X2 ^ ~X3 ^ … ^ Xm Question: if there are m attributes, what is the size of the complete set of hypotheses in f? 174 (the nine hypotheses for two attributes are listed above)
175
f(x,h) consists of all logical sentences about X1, X2, … Xm or their negations that contain only logical ands. Example hypotheses: X1 ^ ~X3 ^ X19; X3 ^ ~X18; ~X7; X1 ^ X2 ^ ~X3 ^ … ^ Xm Question: if there are m attributes, what is the size of the complete set of hypotheses in f? (H = 3^m) 175 (the nine hypotheses for two attributes are listed above)
176
f(x,h) consists of all truth tables mapping combinations of input attributes to true and false Example hypothesis (a truth table over X1, X2, X3, X4): 0000→0, 0001→1, 0010→1, 0011→0, 0100→1, 0101→0, 0110→0, 0111→1, 1000→0, 1001→0, 1010→0, 1011→1, 1100→0, 1101→0, 1110→0, 1111→0 Question: if there are m attributes, what is the size of the complete set of hypotheses in f? 176
177
f(x,h) consists of all truth tables mapping combinations of input attributes to true and false Example hypothesis: the truth table over X1, X2, X3, X4 shown above Question: if there are m attributes, what is the size of the complete set of hypotheses in f? (H = 2^(2^m)) 177
178
We specify f, the machine Nature chooses a hidden random hypothesis h* Nature randomly generates R datapoints ◦ How is a datapoint generated? 1. Vector of inputs x_k = (x_k1, x_k2, … x_km) is drawn from a fixed unknown distribution D 2. The corresponding output y_k = f(x_k, h*) We learn an approximation of h* by choosing some h_est for which the training set error is 0 178
179
We specify f, the machine Nature chooses a hidden random hypothesis h* Nature randomly generates R datapoints ◦ How is a datapoint generated? 1. Vector of inputs x_k = (x_k1, x_k2, … x_km) is drawn from a fixed unknown distribution D 2. The corresponding output y_k = f(x_k, h*) We learn an approximation of h* by choosing some h_est for which the training set error is 0 For each hypothesis h, say h is Correctly Classified (CCd) if h has zero training set error Define TESTERR(h) = Fraction of test points that h will classify correctly = P(h classifies a random test point correctly) Say h is BAD if TESTERR(h) > ε 179
180
We specify f, the machine Nature chooses a hidden random hypothesis h* Nature randomly generates R datapoints ◦ How is a datapoint generated? 1. Vector of inputs x_k = (x_k1, x_k2, … x_km) is drawn from a fixed unknown distribution D 2. The corresponding output y_k = f(x_k, h*) We learn an approximation of h* by choosing some h_est for which the training set error is 0 For each hypothesis h, say h is Correctly Classified (CCd) if h has zero training set error Define TESTERR(h) = Fraction of test points that h will classify correctly = P(h classifies a random test point correctly) Say h is BAD if TESTERR(h) > ε 180
181
We specify f, the machine Nature chooses a hidden random hypothesis h* Nature randomly generates R datapoints ◦ How is a datapoint generated? 1. Vector of inputs x_k = (x_k1, x_k2, … x_km) is drawn from a fixed unknown distribution D 2. The corresponding output y_k = f(x_k, h*) We learn an approximation of h* by choosing some h_est for which the training set error is 0 For each hypothesis h, say h is Correctly Classified (CCd) if h has zero training set error Define TESTERR(h) = Fraction of test points that h will classify correctly = P(h classifies a random test point correctly) Say h is BAD if TESTERR(h) > ε 181
182
Choose R such that with probability less than δ we’ll select a bad h_est (i.e., an h_est which makes mistakes more than ε fraction of the time) Probably Approximately Correct As we just saw, this can be achieved by choosing R such that |H|(1 - ε)^R ≤ δ, i.e., R such that R ≥ (1/ε)(ln 1/δ + ln |H|) 182
183
183 Machine | Example Hypothesis | H | R required to PAC-learn (from R ≥ (1/ε)(ln |H| + ln 1/δ)):
And-positive-literals | X3 ^ X7 ^ X8 | 2^m | (1/ε)(m ln 2 + ln 1/δ)
And-literals | X3 ^ ~X7 | 3^m | (1/ε)(m ln 3 + ln 1/δ)
Lookup Table | the full truth table over X1…X4 shown earlier | 2^(2^m) | (1/ε)(2^m ln 2 + ln 1/δ)
And-lits or And-lits | (X1 ^ X5) v (X2 ^ ~X7 ^ X8) | (3^m)^2 | (1/ε)(2m ln 3 + ln 1/δ)
184
Assume m attributes H_k = Number of decision trees of depth k H_0 = 2 H_{k+1} = (# choices of root attribute) * (# possible left subtrees) * (# possible right subtrees) = m * H_k * H_k Write L_k = log_2 H_k L_0 = 1 L_{k+1} = log_2 m + 2 L_k So L_k = (2^k - 1)(1 + log_2 m) + 1 So to PAC-learn, we need R ≥ (1/ε)(L_k ln 2 + ln 1/δ) 184
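The recursion for L_k and the resulting PAC bound can be evaluated directly; a small sketch (the concrete values of k and m are illustrative choices, not from the slides):

```python
from math import ceil, log, log2

def log2_num_trees(k, m):
    """L_k = log2(number of depth-k binary decision trees over m attributes),
    from the recursion H_0 = 2, H_{k+1} = m * H_k^2."""
    return (2 ** k - 1) * (1 + log2(m)) + 1

def pac_sample_size(epsilon, delta, log2_H):
    # R >= (1/eps) * (ln|H| + ln(1/delta)), with ln|H| = log2|H| * ln 2.
    return ceil((1 / epsilon) * (log2_H * log(2) + log(1 / delta)))

Lk = log2_num_trees(k=3, m=10)        # depth-3 trees over 10 Boolean attributes
print(Lk)                             # ~31.25 bits, i.e. |H| = 2**31.25
print(pac_sample_size(0.1, 0.05, Lk)) # 247 examples suffice for eps=0.1, delta=0.05
```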