
1 The Hebrew University of Jerusalem School of Engineering and Computer Science Instructor: Jeff Rosenschein (Chapter 18, “Artificial Intelligence: A Modern Approach”)

2 • Learning agents • Inductive learning • Decision tree learning

3  Learning modifies the agent’s decision mechanisms to improve performance  Learning is essential for unknown environments ◦ i.e., when designer lacks omniscience  Learning is useful as a system construction method ◦ i.e., expose the agent to reality rather than trying to write it down 3

4 4

5 • Design of a learning element is dictated by: ◦ what type of performance element is used ◦ which functional component is to be learned ◦ how that functional component is represented ◦ what kind of feedback is available • Types of feedback: ◦ Supervised learning: correct answers for each example ◦ Unsupervised learning: correct answers not given (e.g., taxi agent learning the concepts of "good traffic days" and "bad traffic days") ◦ Reinforcement learning: occasional rewards

6 6

7 • Aibo learned-walk movies: http://www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk

8 • Simplest form: learn a function from examples (tabula rasa). f is the target function; an example is a pair (x, f(x)). Problem: find a hypothesis h such that h ≈ f, given a training set of examples. (This is a highly simplified model of real learning: it ignores prior knowledge, assumes a deterministic, observable "environment", assumes examples are given, and assumes that the agent wants to learn f. Why?)

9  Construct/adjust h to agree with f on training set  (h is consistent if it agrees with f on all examples)  E.g., curve fitting: 9

10-13 (Slides 10-13 repeat the text of slide 9 over successive curve-fitting figures, each showing a different hypothesis h fitted to the same training data.)

14  Construct/adjust h to agree with f on training set  (h is consistent if it agrees with f on all examples)  E.g., curve fitting:  Ockham’s razor: prefer the simplest hypothesis consistent with data 14
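To make the curve-fitting picture concrete, here is a minimal Python sketch (assuming NumPy is available; the data points are invented for illustration) that fits both a straight line and a degree-7 polynomial to a few noisy points. Both hypotheses are close to consistent with the training data, but Ockham's razor prefers the simpler line.

import numpy as np

# A small, invented training set: noisy observations of f(x) = 2x + 1.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2 * x + 1 + rng.normal(scale=0.05, size=x.size)

# Two candidate hypotheses: a degree-1 and a degree-7 polynomial.
h_simple = np.polyfit(x, y, deg=1)    # the hypothesis Ockham's razor prefers
h_complex = np.polyfit(x, y, deg=7)   # hugs the training points more closely

# Compare training error (sum of squared residuals) for each hypothesis.
for name, coeffs in [("degree 1", h_simple), ("degree 7", h_complex)]:
    residuals = y - np.polyval(coeffs, x)
    print(name, "training SSE:", float(np.sum(residuals ** 2)))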

15 Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

16  Examples described by attribute values (Boolean, discrete, continuous)  E.g., situations where I will/won’t wait for a table:  Classification of examples is positive (T) or negative (F) 16

17  One possible representation for hypotheses  E.g., here is the “true” tree for deciding whether to wait: 17

18 • Assume all inputs are Boolean and all outputs are Boolean • What is the class of Boolean functions that can be represented by decision trees? • Answer: all Boolean functions. Simple proof: (1) take any Boolean function; (2) convert it into a truth table; (3) construct a decision tree in which each row of the truth table corresponds to one path through the decision tree

19  Decision trees can express any function of the input attributes  E.g., for Boolean functions, truth table row → path to leaf:  Trivially, there is a consistent decision tree for any training set with one path to leaf for each example (unless f nondeterministic in x) but it probably won’t generalize to new examples  Prefer to find more compact decision trees 19

20 How many distinct decision trees are there with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n). E.g., with n = 3 there are 2^(2^3) = 256 trees, and with 6 Boolean attributes there are 2^64 = 18,446,744,073,709,551,616 trees. (Table omitted: the 2^3 = 8 truth-table rows over x, y, z, each of which can be assigned either outcome.)

21 How many distinct decision trees are there with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n). E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees. How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)? Each attribute can be in (positive), in (negative), or out, so there are 3^n distinct conjunctive hypotheses. A more expressive hypothesis space: ◦ increases the chance that the target function can be expressed ◦ increases the number of hypotheses consistent with the training set, so we may get worse predictions
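A quick sanity check of these counts, as a small Python sketch:

n = 6
num_decision_trees = 2 ** (2 ** n)   # one tree per Boolean function of n attributes
num_conjunctions = 3 ** n            # each attribute: positive, negated, or absent
print(num_decision_trees)            # 18446744073709551616
print(num_conjunctions)              # 729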

22 • Aim: find a small tree consistent with the training examples • Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree. (Slide annotation on the DTL pseudocode: the default value for the goal predicate is the majority value.)

23 23

24  A decision tree learned from the 12 examples:  Substantially simpler than “true” tree – a more complex hypothesis isn’t justified by small amount of data 24

25  A decision tree (left) learned from the 12 examples:  Substantially simpler than “true” tree – a more complex hypothesis isn’t justified by small amount of data 25

26  Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”  Patrons? is a better choice 26

27 • To implement Choose-Attribute in the DTL algorithm • Discrete random variable V with possible values {v_1, ..., v_n} • Information content (entropy): H(V) = H(P(v_1), …, P(v_n)) = Σ_{i=1..n} -P(v_i) log2 P(v_i) • For a training set containing p positive examples and n negative examples: H(p/(p+n), n/(p+n)) = -(p/(p+n)) log2 (p/(p+n)) - (n/(p+n)) log2 (n/(p+n))

28 • Information content (entropy): H(P(v_1), …, P(v_n)) = Σ_{i=1..n} -P(v_i) log2 P(v_i) • For a training set containing p positive examples and n negative examples: H(1/2, 1/2) = -1/2 log2 1/2 - 1/2 log2 1/2 = 1 bit (one bit of information is sufficient to convey the answer regarding the flip of a fair coin); H(1, 0) = -1 log2 1 - 0 log2 0 = 0 bits (no bits of information are required, the outcome is predictable; the term 0 log2 0 is taken to be 0)
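A minimal Python helper for these entropy calculations (a sketch; the function name is my own):

import math

def entropy(*probs):
    # Entropy H(p1, ..., pk) in bits; 0 * log2(0) is treated as 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy(0.5, 0.5))  # 1.0 bit  (fair coin)
print(entropy(1.0, 0.0))  # 0.0 bits (outcome is certain)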

29 • A chosen attribute A divides the training set E into subsets E_1, …, E_v according to their values for A, where A has v distinct values • Information gain (IG), or reduction in entropy, from the attribute test: remainder(A) = Σ_{i=1..v} ((p_i + n_i)/(p + n)) H(p_i/(p_i + n_i), n_i/(p_i + n_i)) (also called "conditional entropy"), and IG(A) = H(p/(p+n), n/(p+n)) - remainder(A) • Choose the attribute with the largest IG

30  Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”  Patrons? is a better choice 30

31 For the training set, p = n = 6, so H(6/12, 6/12) = 1 bit (the entropy of the original set is 1, i.e., 6 positive and 6 negative examples). Consider the attributes Patrons and Type (and others too): IG(Patrons) = 1 - [2/12 H(0,1) + 4/12 H(1,0) + 6/12 H(2/6,4/6)] ≈ 0.541 bits, while IG(Type) = 0 bits. Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root.
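These numbers can be checked with a short Python sketch using the positive/negative counts of each subset (the counts below are taken from the standard 12-example restaurant training set):

import math

def H(p, n):
    # Entropy in bits of a set with p positive and n negative examples.
    total = p + n
    return -sum(q * math.log2(q) for q in (p / total, n / total) if q > 0)

def info_gain(splits):
    # splits: list of (positives, negatives), one pair per attribute value.
    p = sum(pi for pi, ni in splits)
    n = sum(ni for pi, ni in splits)
    remainder = sum((pi + ni) / (p + n) * H(pi, ni) for pi, ni in splits)
    return H(p, n) - remainder

# Patrons: None -> (0, 2), Some -> (4, 0), Full -> (2, 4)
print(info_gain([(0, 2), (4, 0), (2, 4)]))          # ~0.541
# Type: French -> (1, 1), Italian -> (1, 1), Thai -> (2, 2), Burger -> (2, 2)
print(info_gain([(1, 1), (1, 1), (2, 2), (2, 2)]))  # 0.0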

32 Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University, www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

33 You are watching a set of independent random samples of X. You see that X has four possible values, with P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4. So you might see: BAACBADCDADDDA… You transmit data over a binary serial link. You can encode each reading with two bits (e.g., A = 00, B = 01, C = 10, D = 11): 0100001001001110110011111100…

34 Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8. It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?

35 Someone tells you that the probabilities are not equal: P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8. It's possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. Here is one of several ways: A → 0, B → 10, C → 110, D → 111.
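A quick check of the claimed 1.75 bits/symbol average, as a Python sketch:

# Prefix-code lengths for the skewed distribution A:1/2, B:1/4, C:1/8, D:1/8.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}   # A -> 0, B -> 10, C -> 110, D -> 111

expected_bits = sum(probs[s] * lengths[s] for s in probs)
print(expected_bits)  # 1.75, which for this distribution equals its entropy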

36 Suppose there are three equally likely values: P(X=A) = P(X=B) = P(X=C) = 1/3. Here's a naïve coding, costing 2 bits per symbol: A → 00, B → 01, C → 10. Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.

37 Suppose X can have one of m values V_1, V_2, … V_m, with P(X=V_1) = p_1, P(X=V_2) = p_2, …, P(X=V_m) = p_m. What is the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It is H(X) = -Σ_{j=1..m} p_j log2 p_j, the entropy of X. "High entropy" means X is from a roughly uniform (boring) distribution; "low entropy" means X is from a varied (peaks and valleys) distribution.

38 (Same slide, with annotations:) For high entropy, a histogram of the frequency distribution of values of X would be flat; for low entropy, it would have many lows and one or two highs.

39 (Continuing:) ...and so for high entropy the values sampled from X would be all over the place, while for low entropy the sampled values would be more predictable.

40 (Figure: an example of a low-entropy distribution next to an example of a high-entropy distribution.)

41 (Figure, continued:) High entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room. Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.

42 Suppose I'm trying to predict output Y and I have input X, where X = College Major and Y = Likes "Avatar". Training data (X, Y): (Math, Yes), (History, No), (CS, Yes), (Math, No), (Math, No), (CS, Yes), (History, No), (Math, Yes). Let's assume this reflects the true probabilities. E.g., from this data we estimate P(LikeA = Yes) = 0.5, P(Major = Math & LikeA = No) = 0.25, P(Major = Math) = 0.5, P(LikeA = Yes | Major = History) = 0. Note: H(X) = -1/2 log2 1/2 - 1/4 log2 1/4 - 1/4 log2 1/4 = 1.5 and H(Y) = 1.

43 Definition of specific conditional entropy: H(Y | X=v) = the entropy of Y among only those records in which X has value v. (X = College Major, Y = Likes "Avatar", with the data from slide 42.)

44 Definition of specific conditional entropy: H(Y | X=v) = the entropy of Y among only those records in which X has value v. Example: H(Y | X=Math) = 1, H(Y | X=History) = 0, H(Y | X=CS) = 0. (X = College Major, Y = Likes "Avatar".)

45 Definition of conditional entropy: H(Y | X) = the average specific conditional entropy of Y = if you choose a record at random, the expected conditional entropy of Y given that row's value of X = the expected number of bits to transmit Y if both sides will know the value of X = Σ_j Prob(X=v_j) H(Y | X=v_j). (X = College Major, Y = Likes "Avatar". This is what was called remainder in the slides above.)

46 Definition of conditional entropy: H(Y | X) = the average specific conditional entropy of Y = Σ_j Prob(X=v_j) H(Y | X=v_j). Example (X = College Major, Y = Likes "Avatar"): for v_j = Math, Prob(X=v_j) = 0.5 and H(Y | X=v_j) = 1; for History, 0.25 and 0; for CS, 0.25 and 0. So H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5.

47 Definition of information gain: IG(Y|X) = I must transmit Y; how many bits on average would it save me if both ends of the line knew X? IG(Y|X) = H(Y) - H(Y|X). Example (X = College Major, Y = Likes "Avatar"): H(Y) = 1 and H(Y|X) = 0.5, thus IG(Y|X) = 1 - 0.5 = 0.5.
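These quantities can be computed directly from the eight records above; here is a sketch (using the standard definitions; the function names are my own):

import math
from collections import Counter

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

H_Y = entropy([y for _, y in data])                 # H(Y) = 1.0

# Conditional entropy H(Y|X) = sum over values v of P(X=v) * H(Y | X=v)
H_Y_given_X = sum(
    sum(1 for x, _ in data if x == v) / len(data)
    * entropy([y for x, y in data if x == v])
    for v in set(x for x, _ in data)
)                                                   # 0.5

print(H_Y, H_Y_given_X, H_Y - H_Y_given_X)          # 1.0 0.5 0.5 (the IG)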

48 48

49 49

50 Definition of relative information gain: RIG(Y|X) = I must transmit Y; what fraction of the bits on average would it save me if both ends of the line knew X? RIG(Y|X) = [H(Y) - H(Y|X)] / H(Y). Example (X = College Major, Y = Likes "Avatar"): H(Y) = 1 and H(Y|X) = 0.5, thus RIG(Y|X) = (1 - 0.5)/1 = 0.5.

51 Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find… IG(LongLife | HairColor) = 0.01 IG(LongLife | Smoker) = 0.2 IG(LongLife | Gender) = 0.25 IG(LongLife | LastDigitOfSSN) = 0.00001 IG tells you how interesting a 2-d contingency table is going to be (more about this soon…)

52  One possible representation for hypotheses  E.g., here is the “true” tree for deciding whether to wait: 52

53  Aim: find a small tree consistent with the training examples  Idea: (recursively) choose “most significant” attribute as root of (sub)tree 53

54  Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”  Patrons? is a better choice 54

55  Decision tree learned from the 12 examples:  Substantially simpler than “true” tree – a more complex hypothesis isn’t justified by small amount of data 55

56  How do we know that h ≈ f ? Use theorems of computational/statistical learning theory (more on this, later) OR ◦ Randomly divide set of examples into training set and test set ◦ Learn h from training set ◦ Try h on test set of examples (measure percent of test set correctly classified) ◦ Repeat for:  different sizes of training sets, and  for each size of training set, different randomly selected sets 56

57  Learning curve = % correct on test set as a function of training set size 57 A “happy graph” that leads us to believe there is some pattern in the data and the learning algorithm is discovering it.

58  The learning algorithm cannot be allowed to “see” (or be influenced by) the test data before the hypothesis h is tested on it  If we generate different h’s (for different parameters), and report back as our h the one that gave the best performance on the test set, then we’re allowing test set results to affect our learning algorithm  This taints the results, but people do it anyway… 58

59  Learning needed for unknown environments, lazy designers  Learning agent = performance element + learning element  For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples  Decision tree learning using information gain  Learning performance = prediction accuracy measured on test set 59

60 Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University, www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

61  We’ll look at Information Gain, used both in Data Mining, and (again) in Decision Tree learning  This gives us a new (reinforced) perspective on the topic 61

62 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 62

63 48,842 records, 16 attributes [Kohavi 1995] 63

64 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 64

65 • A major data mining operation • Given one attribute (e.g., wealth), try to predict its value for new people by means of some of the other available attributes • Applies to categorical outputs • Categorical attribute: an attribute which takes on two or more discrete values, also known as a symbolic attribute • Real attribute: a column of real numbers

66  It is a tiny subset of the 1990 US Census  It is publicly available online from the UCI Machine Learning Datasets repository 66

67 • Well, you can look at histograms… (Figures: histograms of Gender and of Marital Status.)

68 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 68

69 • A better name for a histogram: a one-dimensional contingency table • Recipe for making a k-dimensional contingency table: (1) pick k attributes from your dataset, call them a_1, a_2, …, a_k; (2) for every possible combination of values a_1 = x_1, a_2 = x_2, …, a_k = x_k, record how frequently that combination occurs • Fun fact: a database person would call this a "k-dimensional datacube"

70  For each pair of values for attributes (agegroup, wealth) we can see how many records match 70

71  Easier to appreciate graphically 71

72  Easier to see “interesting” things if we stretch out the histogram bars 72

73 73

74 • These are harder to look at! (Figure: a 3-d contingency table over Gender (Male/Female), Wealth (Rich/Poor), and Agegroup (20s/30s/40s/50s).)

75 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Information Gain of a real valued input Building Decision Trees with real Valued Inputs Andrew’s homebrewed hack: Binary Categorical Splits Example Decision Trees Outline 75

76  Software packages and database add-ons to do this are known as OLAP tools  They usually include point and click navigation to view slices and aggregates of contingency tables  They usually include nice histogram visualization 76

77  Why would people want to look at contingency tables? 77

78  With 16 attributes, how many 1-d contingency tables are there?  How many 2-d contingency tables?  How many 3-d tables?  With 100 attributes how many 3-d tables are there? 78

79  With 16 attributes, how many 1-d contingency tables are there? 16  How many 2-d contingency tables? 16-choose-2 = 16! / [2! * (16 – 2)!] = (16 * 15) / 2 = 120  How many 3-d tables? 560  With 100 attributes how many 3-d tables are there? 161,700 79
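These counts are just binomial coefficients; a quick Python check:

from math import comb

print(comb(16, 2))    # 120: 2-d contingency tables from 16 attributes
print(comb(16, 3))    # 560: 3-d tables
print(comb(100, 3))   # 161700: 3-d tables from 100 attributes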

80  Looking at one contingency table: can be as much fun as reading an interesting book  Looking at ten tables: as much fun as watching CNN  Looking at 100 tables: as much fun as watching an infomercial  Looking at 100,000 tables: as much fun as a three-week November vacation in Duluth with a dying weasel 80

81 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Information Gain of a real valued input Building Decision Trees with real Valued Inputs Andrew’s homebrewed hack: Binary Categorical Splits Example Decision Trees Outline 81

82  Data Mining is all about automating the process of searching for patterns in the data Which patterns are interesting? Which might be mere illusions? And how can they be exploited? 82

83 • Data mining is all about automating the process of searching for patterns in the data. Which patterns are interesting? Which might be mere illusions? And how can they be exploited? That's what we'll look at right now, and the answer (info gains) will turn out to be the engine that drives decision tree learning… (but you already know that)

84  We will use information theory  A very large topic, originally used for compressing signals  But more recently used for data mining… 84

85 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Information Gain of a real valued input Building Decision Trees with real Valued Inputs Andrew’s homebrewed hack: Binary Categorical Splits Example Decision Trees Outline 85

86  Given something (e.g., wealth) you are trying to predict, it is easy to ask the computer to find which attribute has highest information gain for it 86

87 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 87

88  A Decision Tree is a tree-structured plan of a set of attributes to test in order to predict the output  To decide which attribute should be tested first, simply find the one with the highest information gain  Then recurse… 88

89 89 From the UCI (University of California at Irvine) repository (thanks to Ross Quinlan) 40 Records

90 90 Suppose we want to predict MPG

91 91

92 92 Take the Original Dataset.. And partition it according to the value of the attribute we split on Records in which cylinders = 4 Records in which cylinders = 5 Records in which cylinders = 6 Records in which cylinders = 8

93 For each partition (records in which cylinders = 4, 5, 6, or 8), build a (sub)tree from those records.

94 94 Recursively build a tree from the seven records in which there are four cylinders and the maker was based in Asia (Similar recursion in the other cases)

95 The final tree 95

96 Base Case One Don’t split a node if all matching records have the same output value 96

97 Base Case Two: Don't split a node if none of the attributes can create multiple non-empty children

98 Base Case Two: No attributes can distinguish 98

99  Base Case One: If all records in current data subset have the same output then don’t recurse  Base Case Two: If all records have exactly the same set of input attributes then don’t recurse 99

100  Base Case One: If all records in current data subset have the same output then don’t recurse  Base Case Two: If all records have exactly the same set of input attributes then don’t recurse 100 Proposed Base Case 3: If all attributes have zero information gain then don’t recurse Is this a good idea?

101 y = a XOR b. The information gains: IG(y|a) = 0 and IG(y|b) = 0, so under the proposed Base Case 3 we would not split at all, and the resulting decision tree is a single leaf that predicts no better than chance.

102 y = a XOR b. Without the proposed Base Case 3, the resulting decision tree splits on a and then on b and classifies the training data perfectly; so zero information gain at the root does not mean the attributes are useless.
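A small sketch confirming that both attributes have zero information gain at the root for XOR, even though the two-level tree classifies it perfectly:

import math
from collections import Counter

records = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)]  # y = a XOR b

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(records, attr_index):
    ys = [y for _, y in records]
    gain = entropy(ys)
    for v in (0, 1):
        subset = [y for x, y in records if x[attr_index] == v]
        gain -= len(subset) / len(records) * entropy(subset)
    return gain

print(info_gain(records, 0), info_gain(records, 1))  # 0.0 0.0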

103 BuildTree(DataSet, Output): • If all output values are the same in DataSet, return a leaf node that says "predict this unique output" • If all input values are the same, return a leaf node that says "predict the majority output" • Else find the attribute X with the highest information gain • Suppose X has n_X distinct values (i.e., X has arity n_X) ◦ Create and return a non-leaf node with n_X children ◦ The i-th child should be built by calling BuildTree(DS_i, Output), where DS_i consists of all those records in DataSet for which X = the i-th distinct value of X
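Here is a minimal runnable Python sketch of this procedure (naming and record format are my own; records are dicts of attribute values plus an output key):

import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(records, attr, output):
    before = entropy([r[output] for r in records])
    after = 0.0
    for v in set(r[attr] for r in records):
        subset = [r[output] for r in records if r[attr] == v]
        after += len(subset) / len(records) * entropy(subset)
    return before - after

def build_tree(records, attrs, output):
    outputs = [r[output] for r in records]
    if len(set(outputs)) == 1:                       # Base Case One: pure node
        return outputs[0]
    if not attrs or all(                             # Base Case Two: identical inputs
            len(set(r[a] for r in records)) == 1 for a in attrs):
        return Counter(outputs).most_common(1)[0][0]  # predict majority output
    best = max(attrs, key=lambda a: info_gain(records, a, output))
    children = {}
    for v in set(r[best] for r in records):          # one child per distinct value
        subset = [r for r in records if r[best] == v]
        children[v] = build_tree(subset, [a for a in attrs if a != best], output)
    return (best, children)

# Tiny usage example with invented records:
data = [{"cylinders": 4, "maker": "asia", "mpg": "good"},
        {"cylinders": 8, "maker": "america", "mpg": "bad"},
        {"cylinders": 4, "maker": "europe", "mpg": "good"},
        {"cylinders": 8, "maker": "asia", "mpg": "bad"}]
print(build_tree(data, ["cylinders", "maker"], "mpg"))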

104 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 104

105  For each record, follow the decision tree to see what it would predict For what number of records does the decision tree’s prediction disagree with the true value in the database?  This quantity is called the training set error. The smaller the better. 105

106 MPG Training error 106

107 107

108 108

109 • Why do we learn at all? It is not usually in order to predict the training data's output on data we have already seen

110  It is more commonly in order to predict the output value for future data we have not yet seen 110

111 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 111

112  Suppose we are forward thinking  We hide some data away when we learn the decision tree  But once learned, we see how well the tree predicts that data  This is a good simulation of what happens when we try to predict future data  And it is called Test Set Error 112

113 MPG Test set error 113

114 The test set error is much worse than the training set error… …why? 114

115 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 115

116 • We'll create a training dataset: five inputs a, b, c, d, e, all bits, generated in all 32 possible combinations (32 records). Output y = a copy of e, except that a random 25% of the records have y set to the opposite of e. (Table: columns a b c d e y, with rows 00000|0, 00001|0, 00010|0, 00011|1, 00100|1, …, 11111|1.)

117  Suppose someone generates a test set according to the same method  The test set is identical, except that some of the y’s will be different  Some y’s that were corrupted in the training set will be uncorrupted in the testing set  Some y’s that were uncorrupted in the training set will be corrupted in the test set 117

118 • Suppose we build a full tree (we always split until "base case 2", i.e., don't split a node if none of the attributes can create multiple non-empty children). (Figure: the full tree splits on e at the root, then on a, and so on down to one leaf per record; 25% of these leaf-node labels will be corrupted.)

119 All the leaf nodes contain exactly one record and so…  We would have a training set error of zero 119

120 1/4 of the tree's leaf nodes are corrupted and 3/4 are fine; independently, 1/4 of the test-set records are corrupted and 3/4 are fine. So: 1/16 of the test set will be correctly predicted for the wrong reasons (both corrupted); 3/16 will be wrongly predicted because the test record is corrupted; 3/16 will be wrongly predicted because the tree node is corrupted; and 9/16 of the test predictions will be fine. In total, we expect to be wrong on 3/8 of the test-set predictions.

121  This explains the discrepancy between training and test set error  But more importantly… …it indicates there’s something we should do about it if we want to predict well on future data 121

122 • Let's not look at the irrelevant bits: the bits a, b, c, d are hidden, and only e is visible. Output y = a copy of e, except that a random 25% of the records have y set to the opposite of e (32 records). What decision tree would we learn now?

123 (Figure: the tree now consists of a root split on e with two children, e=0 and e=1; these nodes will be unexpandable.)

124 (Continuing:) In about 12 of the 16 records in the e=0 node the output will be 0, so that node will almost certainly predict 0; in about 12 of the 16 records in the e=1 node the output will be 1, so that node will almost certainly predict 1.

125 Now almost certainly none of the tree's nodes are corrupted (almost all are fine). 1/4 of the test-set records are corrupted and will be wrongly predicted; the remaining 3/4 are fine and will be predicted correctly. In total, we expect to be wrong on only 1/4 of the test-set predictions.

126  Definition: If your machine learning algorithm fits noise (i.e., pays attention to parts of the data that are irrelevant) it is overfitting  Fact (theoretical and empirical): If your machine learning algorithm is overfitting then it may perform less well on test set data 126

127 Machine Learning Datasets What is Classification? Contingency Tables OLAP (Online Analytical Processing) What is Data Mining? Searching for High Information Gain Learning an unpruned decision tree recursively Training Set Error Test Set Error Overfitting Avoiding Overfitting Outline 127

128  Usually we do not know in advance which are the irrelevant variables  …and it may depend on the context For example, if y = a AND b, then b is an irrelevant variable only in the portion of the tree in which a=0 But we can use simple statistics to warn us that we might be overfitting 128

129 Consider this split 129

130  Suppose that mpg was completely uncorrelated with maker  What is the chance we’d have seen data of at least this apparent level of association anyway? 130

131 • Suppose that mpg was completely uncorrelated with maker • What is the chance we'd have seen data of at least this apparent level of association anyway? By using a particular kind of chi-squared test, the answer is 13.5% (i.e., the probability of seeing an association at least this strong if the attribute were really irrelevant can be calculated with the help of standard chi-squared tables)

132 • Build the full decision tree as before • But when you can grow it no more, start to prune: ◦ Beginning at the bottom of the tree, delete splits in which p_chance > MaxPchance ◦ Continue working your way up until there are no more prunable nodes. MaxPchance is a magic parameter you must specify to the decision tree, indicating your willingness to risk fitting noise
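Here is a hedged sketch of the pruning test, assuming SciPy is available: for a candidate split, compare the observed contingency table of (attribute value x class) against what we would expect if the attribute were irrelevant, and prune when the resulting p-value exceeds MaxPchance. (The exact test used on the slides may differ in detail; this is just one reasonable instantiation, and the split counts below are invented.)

from scipy.stats import chi2_contingency

def should_prune(contingency, max_p_chance=0.05):
    # contingency: rows = attribute values, columns = class counts at this split.
    # Prune the split if the apparent association could easily be chance.
    chi2, p_chance, dof, expected = chi2_contingency(contingency)
    return p_chance > max_p_chance

# Hypothetical split of 40 records by maker into (good-mpg, bad-mpg) counts:
split_counts = [[9, 5],    # america
                [5, 5],    # asia
                [10, 6]]   # europe
print(should_prune(split_counts, max_p_chance=0.1))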

133 Original MPG Test set error 133

134  With MaxPchance = 0.1, you will see the following MPG decision tree: 134 Note the improved test set accuracy compared with the unpruned tree

135  Good news: The decision tree can automatically adjust its pruning decisions according to the amount of apparent noise and data  Bad news: The user must come up with a good value of MaxPchance (note: Andrew Moore usually uses 0.05, which is his favorite value for any magic parameter)  Good news: But with extra work, the best MaxPchance value can be estimated automatically by a technique called cross-validation 135

136  Set aside some fraction of the known data and use it to test the prediction performance of a hypothesis induced from the remaining data  K-fold cross-validation means that you run k experiments, each time setting aside a different 1/k of the data to test on, and average the results 136
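A minimal sketch of k-fold cross-validation for choosing MaxPchance (the helpers train_pruned_tree and error_rate are placeholders for whatever learner and error metric you use):

import random

def k_fold_cv(records, k, candidate_p_chances, train_pruned_tree, error_rate):
    # Return the MaxPchance value with the lowest average held-out error.
    records = records[:]                 # don't disturb the caller's list
    random.shuffle(records)
    folds = [records[i::k] for i in range(k)]
    best_value, best_err = None, float("inf")
    for p in candidate_p_chances:
        errs = []
        for i in range(k):
            test = folds[i]
            train = [r for j, fold in enumerate(folds) if j != i for r in fold]
            tree = train_pruned_tree(train, max_p_chance=p)
            errs.append(error_rate(tree, test))
        avg = sum(errs) / k
        if avg < best_err:
            best_value, best_err = p, avg
    return best_value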

137 137 For nondeterministic functions (e.g., the true inputs are not fully observed), there is an inevitable tradeoff between the complexity of the hypothesis and the degree of fit to the data

138  Ensemble learning methods select a whole collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions  For example, we might generate a hundred different decision trees from the same training set, and have them vote on the best classification for a new example 138

139  Suppose we assume that each hypothesis h i in the ensemble has an error of p; that is, the probability that a randomly chosen example is misclassified by h i is p  Suppose we also assume that the errors made by each hypothesis are independent  Then if p is small, the probability of a large number of misclassifications occurring is very small  (The independence assumption above is unrealistic, but reduced correlation of errors among hypotheses still helps) 139

140 140

141 • In a weighted training set, each example has an associated weight w_j > 0; the higher the weight of an example, the higher the importance attached to it during the learning of a hypothesis • Boosting starts with w_j = 1 for all the examples (i.e., a normal training set) • From this set, it generates the first hypothesis, h_1 • This hypothesis will classify some of the training examples correctly and some incorrectly

142 • We want the next hypothesis to do better on the misclassified examples, so we increase their weights while decreasing the weights of the correctly classified examples • From this new weighted training set, we generate hypothesis h_2 • The process continues in this way until we have generated M hypotheses, where M is an input to the boosting algorithm • The final ensemble hypothesis is a weighted-majority combination of all the M hypotheses, each weighted according to how well it performed on the training set

143 143

144  There are many variants of the basic boosting idea with different ways of adjusting the weights and combining the hypotheses  One specific algorithm, called AdaBoost, is given in Russell and Norvig  AdaBoost has an important property: if the input learning algorithm L is a weak learning algorithm (that is, L always returns a hypothesis with weighted error on the training set that is slightly better than random guessing, i.e., 50% for Boolean classification) then AdaBoost will return a hypothesis that classifies the training data perfectly for large enough M 144
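The AdaBoost pseudocode itself is given in Russell and Norvig; what follows is only a rough sketch of the standard AdaBoost weight update for +1/-1 classification, not a transcription of their version. The argument weak_learn is a placeholder for any weak learning algorithm that accepts example weights.

import math

def adaboost(examples, labels, weak_learn, M):
    # examples: list of inputs; labels: +1 or -1 for each example.
    # weak_learn(examples, labels, w) returns a hypothesis h (a callable).
    N = len(examples)
    w = [1.0 / N] * N                                  # start with uniform weights
    hypotheses, alphas = [], []
    for _ in range(M):
        h = weak_learn(examples, labels, w)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard against 0 or 1
        alpha = 0.5 * math.log((1 - err) / err)        # hypothesis weight
        # Increase weights of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]                   # renormalize
        hypotheses.append(h)
        alphas.append(alpha)

    def ensemble(x):                                   # weighted-majority vote
        score = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        return 1 if score >= 0 else -1
    return ensemble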

145  Thus, the algorithm boosts the accuracy of the original learning algorithm on the training data  This result holds no matter how inexpressive the original hypothesis space and no matter how complex the function being learned 145

146  What can we say about the “correctness” of our learning procedure?  Is there the possibility of exactly learning a concept?  The answer is “yes”, but the technique is so restrictive, that it’s unusable in practice  A stroll down memory lane… 146

147 In all representation languages, there is a partial order according to the generality of each sentence, e.g. (a small rule space): ∃c1 Red(c1); ∃c1 ∃c2 [Red(c1) ∧ Red(c2)]; ∃c1 ∃c2 [Red(c1) ∧ Black(c2)]; ∃c1 ∃c2 ∃c3 [Red(c1) ∧ Red(c2) ∧ Black(c3)]

148 "Boundary sets" can be used to represent a subspace of the rule space: G (the more general boundary) and S (the more specific boundary). The "candidate-elimination algorithm" moves S up and moves G down until they are equal and contain a single concept.

149 Positive examples of a concept move S up (generalizing S); Negative examples of a concept move G down (specializing G). 149

150 The algorithm:
1. Make G be the null description (most general); make S be all the most specific concepts in the space.
2. Accept a "training example":
   A. If positive, (i) remove from G all concepts that don't cover the new example; (ii) generalize the elements in S as little as possible so that they cover the new example.
   B. If negative, (i) remove from S all concepts that cover this counter-example; (ii) specialize the elements in G as little as possible so that they will not cover this new negative example.
3. Repeat step 2 until G = S and is a singleton set. This is the concept to be learned.

151 Ex: Consider objects that have 2 features, size (S or L) and shape (C or R or T). The initial version space is the lattice from (x y) at the top down to the fully specific concepts, with G = { (x y) } and S = { (S R), (L R), (S C), (L C), (S T), (L T) }.

152 First training instance is positive: (S C). So G = { (x y) }, S = { (S C) }; we've changed the S-set. (Figure: the lattice with the remaining consistent hypotheses.)

153 Second training instance is negative: (L T). So G = { (x C), (S y) }, S = { (S C) }; we've changed the G-set.

154 The third example is positive: (L C). First, (S y) is eliminated from the G-set (since it doesn't cover the example); then the S-set is generalized. Now G = { (x C) } and S = { (x C) }, so the concept (x C) is the answer ("any circle").
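Because the hypothesis space here is tiny (12 conjunctions over size and shape, with x and y as wildcards), the same answer can be checked by brute force in Python. This sketch does not maintain the G and S boundary sets explicitly; it simply keeps every hypothesis consistent with the examples seen so far, which for this example converges to the same single concept.

from itertools import product

SIZES, SHAPES = ("S", "L"), ("C", "R", "T")
# Hypotheses are (size, shape) pairs; "x" and "y" act as wildcards.
HYPOTHESES = list(product(SIZES + ("x",), SHAPES + ("y",)))

def covers(h, example):
    return all(hv in ("x", "y") or hv == ev for hv, ev in zip(h, example))

def version_space(training):
    # Keep every hypothesis consistent with all examples seen so far.
    vs = HYPOTHESES
    for example, positive in training:
        vs = [h for h in vs if covers(h, example) == positive]
    return vs

training = [(("S", "C"), True), (("L", "T"), False), (("L", "C"), True)]
print(version_space(training))   # [('x', 'C')], i.e., "any circle"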

155 • Doesn't tolerate noise in the training set • Doesn't learn disjunctive concepts • What is needed is a more general theory of learning that approaches the issue probabilistically, not deterministically • Enter: PAC learning

156  Any hypothesis that is seriously wrong will almost certainly be “found out” with high probability after a small number of examples, because it will make an incorrect prediction  Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct 156

157  The key assumption, called the stationarity assumption, is that the training and test sets are drawn randomly and independently from the same population of examples with the same probability distribution  Without the stationarity assumption, the theory can make no claims at all about the future, because there would be no necessary connection between future and past 157

158  Let X be the set of all possible examples  Let D be the distribution from which examples are drawn  Let H be the set of possible hypotheses  Let N be the number of examples in the training set 158

159  Assume that the true function f is a member of H  Define the error of a hypothesis h with respect to the true function f given a distribution D over the examples, as the probability that h is different from f on an example error(h) = P( h(x) ≠ f(x) | x drawn from D) 159

160 • A hypothesis h is called approximately correct if error(h) < ε (ε, as usual, is a small constant) • We'll show that after seeing N examples, with high probability, all consistent hypotheses will be approximately correct, i.e., will lie within the ε-ball around the true function f

161 161

162 • What is the probability that a hypothesis h_b in H_bad is consistent with the first N examples? • We have error(h_b) > ε • The probability that h_b agrees with a given example is at most 1 - ε • The bound for N examples is: P(h_b agrees with N examples) ≤ (1 - ε)^N

163 • The probability that H_bad contains at least one consistent hypothesis is bounded by the sum of the individual probabilities: P(H_bad contains a consistent hypothesis) ≤ |H_bad| (1 - ε)^N ≤ |H| (1 - ε)^N • We want to reduce this probability below some small number δ: |H| (1 - ε)^N ≤ δ

164 • Given that 1 - ε ≤ e^(-ε), we can achieve this if we allow the algorithm to see N ≥ (1/ε)(ln 1/δ + ln |H|) examples • If a learning algorithm returns a hypothesis that is consistent with this many examples, then with probability at least 1 - δ, it has error at most ε
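For instance, here is a small sketch that evaluates this sample-complexity bound for two of the hypothesis spaces discussed later (conjunctions and arbitrary Boolean functions):

import math

def sample_complexity(epsilon, delta, hypothesis_space_size):
    # PAC bound: N >= (1/epsilon) * (ln(1/delta) + ln |H|).
    return math.ceil((1 / epsilon) * (math.log(1 / delta)
                                      + math.log(hypothesis_space_size)))

print(sample_complexity(0.1, 0.05, 3 ** 10))        # conjunctions over 10 attributes
print(sample_complexity(0.1, 0.05, 2 ** (2 ** 6)))  # all Boolean functions of 6 attributes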

165  The number of required examples, as a function of  and , is called the sample complexity of the hypothesis space  One of the key issues, then, is the size of the hypothesis space  To make learning effective, we sometimes can restrict the space of functions the algorithm can consider (see Russell and Norvig on “learning decision lists”) 165

166 Andrew W. Moore, Associate Professor, School of Computer Science, Carnegie Mellon University, www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

167  Imagine we’re doing classification with categorical inputs  All inputs and outputs are binary  Data is noiseless  There’s a machine f(x,h) which has H possible settings (a.k.a. hypotheses), called h 1, h 2.. h H 167

168 • f(x,h) consists of all logical sentences about X1, X2, …, Xm that contain only logical ands • Example hypotheses: X1 ^ X3 ^ X19; X3 ^ X18; X7; X1 ^ X2 ^ … ^ Xm • Question: if there are 3 attributes, what is the complete set of hypotheses in f?

169 (Answer: H = 8.) The complete set is {True, X1, X2, X3, X1 ^ X2, X1 ^ X3, X2 ^ X3, X1 ^ X2 ^ X3}.

170 • f(x,h) consists of all logical sentences about X1, X2, …, Xm that contain only logical ands • Question: if there are m attributes, how many hypotheses are in f?

171 (Answer: H = 2^m, since each attribute is either in or out of the conjunction.)

172 • f(x,h) consists of all logical sentences about X1, X2, …, Xm or their negations that contain only logical ands • Example hypotheses: X1 ^ ~X3 ^ X19; X3 ^ ~X18; ~X7; X1 ^ X2 ^ ~X3 ^ … ^ Xm • Question: if there are 2 attributes, what is the complete set of hypotheses in f?

173 (Answer: H = 9.) The complete set is {True, X1, ~X1, X2, ~X2, X1 ^ X2, X1 ^ ~X2, ~X1 ^ X2, ~X1 ^ ~X2}.

174 • Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

175 (Answer: H = 3^m, since each attribute is either in positively, in negated, or out.)

176 • f(x,h) consists of all truth tables mapping combinations of input attributes to true and false • Example hypothesis: a truth table over X1, X2, X3, X4 assigning an output Y to each of the 16 input combinations • Question: if there are m attributes, what is the size of the complete set of hypotheses in f?

177 (Answer: H = 2^(2^m), one hypothesis per possible truth table over the 2^m input combinations.)

178 • We specify f, the machine • Nature chooses a hidden random hypothesis h* • Nature randomly generates R datapoints ◦ How is a datapoint generated? (1) A vector of inputs x_k = (x_k1, x_k2, …, x_km) is drawn from a fixed unknown distribution D; (2) the corresponding output is y_k = f(x_k, h*) • We learn an approximation of h* by choosing some h_est for which the training set error is 0

179 (Continuing:) For each hypothesis h: say h is Correctly Classified (CCd) if h has zero training set error; define TESTERR(h) = the fraction of test points that h will classify correctly = P(h classifies a random test point correctly); say h is BAD if TESTERR(h) > ε

180-181 (Slides 180 and 181 repeat the same setup with different parts highlighted.)

182 • Choose R such that with probability less than δ we'll select a bad h_est (i.e., an h_est which makes mistakes more than a fraction ε of the time): Probably Approximately Correct • As we just saw, this can be achieved by choosing R such that |H| (1 - ε)^R ≤ δ, i.e., R ≥ (1/ε)(ln 1/δ + ln |H|)

183 Machine / example hypothesis / size of hypothesis space H / R required to PAC-learn (from R ≥ (1/ε)(ln 1/δ + ln |H|)):
• And-positive-literals: e.g., X3 ^ X7 ^ X8; H = 2^m
• And-literals: e.g., X3 ^ ~X7; H = 3^m
• Lookup table: e.g., a full truth table over X1..Xm; H = 2^(2^m)
• And-lits or And-lits: e.g., (X1 ^ X5) v (X2 ^ ~X7 ^ X8)

184 • Assume m attributes • H_k = number of decision trees of depth k • H_0 = 2 • H_{k+1} = (#choices of root attribute) * (#possible left subtrees) * (#possible right subtrees) = m * H_k * H_k • Write L_k = log2 H_k • L_0 = 1, L_{k+1} = log2 m + 2 L_k • So L_k = (2^k - 1)(1 + log2 m) + 1 • So to PAC-learn, we need R ≥ (1/ε)(ln 1/δ + ln H_k) = (1/ε)(ln 1/δ + L_k ln 2)

