18 LEARNING FROM OBSERVATIONS


1 18 LEARNING FROM OBSERVATIONS
IV LEARNING. All the "intelligence" in an agent has been built in by the agent's designer. Whenever the designer has incomplete knowledge of the environment that the agent will live in, learning is the only way that the agent can acquire what it needs to know. Learning thus provides autonomy. The four chapters in this part cover the field of machine learning - the subfield of AI concerned with programs that learn from experience. Chapter 18 introduces the basic design for learning agents and addresses the general problem of learning from examples.
A learning agent can be divided into four conceptual components. The most important distinction is between the learning element, which is responsible for making improvements, and the performance element, which is responsible for selecting external actions. The critic is designed to tell the learning element how well the agent is doing. The problem generator is responsible for suggesting actions that will lead to new and informative experiences. Russell Stuart, Norvig Peter, Artificial Intelligence: A Modern Approach, 1995

2 Figure 18.1 A general model of learning agents.

3 The design of the learning element is affected by four major issues:
Machine learning researchers have come up with a large variety of learning elements. To understand them, it helps to see how their design is affected by the context in which they will operate. The design of the learning element is affected by four major issues:
- Which components of the performance element are to be improved.
- What representation is used for those components.
- What feedback is available.
- What prior information is available.

4 Representation of the components
Any of these components can be represented using any of the representation schemes in this book. We have seen several examples: deterministic descriptions such as linear weighted polynomials for utility functions in game-playing programs and propositional and first-order logical sentences for all of the components in a logical agent; and probabilistic descriptions such as belief networks for the inferential components of a decision-theoretic agent. Effective learning algorithms have been devised for all of these. The details of the learning algorithm will be different for each representation, but the main idea remains the same.

5 Available feedback: supervised learning, reinforcement learning, unsupervised learning
For some components, such as the component for predicting the outcome of an action, the available feedback generally tells the agent what the correct outcome is. That is, the agent predicts that a certain action (braking) will have a certain outcome (stopping in 10 feet), and the environment immediately provides a percept that describes the actual correct outcome (stopping in 15 feet). Any situation in which both the inputs and outputs of a component can be perceived is called supervised learning. (Often, the outputs are provided by a friendly teacher.) On the other hand, in learning the condition-action component, the agent receives some evaluation of its action (such as a hefty bill for rear-ending the car in front) but is not told the correct action (to brake more gently and much earlier). This is called reinforcement learning; the hefty bill is called a reinforcement. The subject is covered in Chapter 20. Learning when there is no hint at all about the correct outputs is called unsupervised learning. An unsupervised learner can always learn relationships among its percepts using supervised learning methods - that is, it can learn to predict its future percepts given its previous percepts. It cannot learn what to do unless it already has a utility function.

6 Bringing it all together
Each of the seven components of the performance element can be described mathematically as a function: information about the way the world evolves can be described as a function from a world state (the current state) to a world state (the next state or states); a goal can be described as a function from a state to a Boolean value (0 or 1) indicating whether the state satisfies the goal. The key point is that all learning can be seen as learning the representation of a function. We can choose which component of the performance element to improve and how it is to be represented.
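As a hedged illustration of this functional view, here is a minimal Python sketch in which a world-model component and a goal are both ordinary functions. The state representation and the names next_state and goal are assumptions made for the example, not from the text.

```python
def next_state(state):
    """World model: a function from the current state to the next state."""
    x, y = state
    return (x + 1, y)  # illustrative effect of a single "move right" action

def goal(state):
    """Goal: a function from a state to a Boolean value."""
    return state == (3, 0)

s = (0, 0)
while not goal(s):
    s = next_state(s)
print(s)  # (3, 0)
```

Learning either component then amounts to learning a representation of the corresponding function.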

7 18.2 INDUCTIVE LEARNING
In supervised learning, the learning element is given the correct (or approximately correct) value of the function for particular inputs, and changes its representation of the function to try to match the information provided by the feedback. More formally, we say an example is a pair (x, f(x)), where x is the input and f(x) is the output of the function applied to x. The task of pure inductive inference (or induction) is this: given a collection of examples of f, return a function h that approximates f. The function h is called a hypothesis.
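The inductive-inference task can be sketched in a few lines of Python: given example pairs (x, f(x)) and a small set of candidate hypotheses (the candidates here are illustrative assumptions), return one that agrees with every example.

```python
# Examples are pairs (x, f(x)) for the unknown target function f.
examples = [(0, 1), (1, 3), (2, 5)]

# A tiny, illustrative hypothesis space.
candidates = [
    ("h1: 2x + 1", lambda x: 2 * x + 1),
    ("h2: x + 1", lambda x: x + 1),
    ("h3: x**2 + 1", lambda x: x ** 2 + 1),
]

def consistent(h, examples):
    """A hypothesis is consistent if it matches f(x) on every example."""
    return all(h(x) == y for x, y in examples)

chosen = [name for name, h in candidates if consistent(h, examples)]
print(chosen)  # ['h1: 2x + 1']
```

In general many hypotheses may be consistent with the same examples, which is exactly the situation Figure 18.2 illustrates.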

8 Figure 18.2 In (a) we have some example (input,output) pairs.
In (b), (c), and (d) we have three hypotheses for functions from which these examples could be drawn.

9 18.3 LEARNING DECISION TREES
Decision tree induction is one of the simplest and yet most successful forms of learning algorithm. It serves as a good introduction to the area of inductive learning, and is easy to implement. We first describe the performance element, and then show how to learn it. Along the way, we will introduce many of the ideas and terms that appear in all areas of inductive learning.

10 Decision trees as performance elements
A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no "decision." Decision trees therefore represent Boolean functions. Functions with a larger range of outputs can also be represented, but for simplicity we will usually stick to the Boolean case. Each internal node in the tree corresponds to a test of the value of one of the properties, and the branches from the node are labelled with the possible values of the test. Each leaf node in the tree specifies the Boolean value to be returned if that leaf is reached. As an example, consider the problem of whether to wait for a table at a restaurant. The aim here is to learn a definition for the goal predicate (goal concept) WillWait, where the definition is expressed as a decision tree. In setting this up as a learning problem, we first have to decide what properties or attributes are available to describe examples in the domain.
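A minimal sketch of a decision tree as a performance element, assuming a nested-tuple encoding: internal nodes test an attribute, branches are labelled with values, and leaves hold the Boolean decision. The tree below echoes the Patrons and Hungry tests discussed in the text, but it is an illustration, not the book's exact Figure 18.4.

```python
# (attribute, {value: subtree-or-leaf}); leaves are plain booleans.
tree = ("Patrons", {
    "None": False,
    "Some": True,
    "Full": ("Hungry", {
        "Yes": True,
        "No": False,
    }),
})

def decide(node, example):
    """Walk the tree, testing one attribute per internal node."""
    if isinstance(node, bool):       # leaf: return the yes/no decision
        return node
    attribute, branches = node       # internal node: test an attribute
    return decide(branches[example[attribute]], example)

print(decide(tree, {"Patrons": "Full", "Hungry": "No"}))  # False
print(decide(tree, {"Patrons": "Some"}))                  # True
```

Each path from root to leaf corresponds to a conjunction of attribute tests, which is why a tree encodes a set of implication sentences.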

11 Expressiveness of decision trees
If decision trees correspond to sets of implication sentences, a natural question is whether they can represent any set. The answer is no, because decision trees are implicitly limited to talking about a single object. That is, the decision tree language is essentially propositional, with each attribute test being a proposition. We cannot use decision trees to represent tests that refer to two or more different objects, for example, ∃r2 Nearby(r2, r) ∧ Price(r2, p2) ∧ Price(r, p) ∧ Cheaper(p2, p) (is there a cheaper restaurant nearby). Obviously, we could add another Boolean attribute with the name CheaperRestaurantNearby, but it is intractable to add all such attributes. Decision trees are fully expressive within the class of propositional languages; that is, any Boolean function can be written as a decision tree. This can be done trivially by having each row in the truth table for the function correspond to a path in the tree. This would not necessarily be a good way to represent the function, because the truth table is exponentially large in the number of attributes. Clearly, decision trees can represent many functions with much smaller trees. For some kinds of functions, however - the parity function and the majority function, for example - this is a real problem, because they have no small decision tree representation.

12 Figure 18.4 A decision tree for deciding whether to wait for a table.

13 Inducing decision trees from examples
An example is described by the values of the attributes and the value of the goal predicate. We call the value of the goal predicate the classification of the example. If the goal predicate is true for some example, we call it a positive example; otherwise we call it a negative example. A set of examples X1, …, X12 for the restaurant domain is shown in Figure 18.5. The positive examples are ones where the goal WillWait is true (X1, X3, ...) and the negative examples are ones where it is false (X2, X5, ...). The complete set of examples is called the training set.

14 Figure 18.5 Examples for the restaurant domain.
The positive examples are ones where the goal WillWait is true (X1, X3, ...) and the negative examples are ones where it is false (X2, X5, ...).

15 Figure 18.6 Splitting the examples by testing on attributes.
In (a), Patrons is a good attribute to test first; in (b), Type is a poor one; and in (c), Hungry is a fairly good second test, given that Patrons is the first test.

16 Figure 18.8 The decision tree induced from the 12-example training set.

17 Assessing the performance of the learning algorithm
A learning algorithm is good if it produces hypotheses that do a good job of predicting the classifications of unseen examples. In Section 18.6, we will see how prediction quality can be estimated in advance; for now, we will look at a methodology for assessing prediction quality after the fact. Obviously, a prediction is good if it turns out to be true, so we can assess the quality of a hypothesis by checking its predictions against the correct classification once we know it. We do this on a set of examples known as the test set.

18 1. Collect a large set of examples.
If we train on all our available examples, then we will have to go out and get some more to test on, so it is often more convenient to adopt the following methodology:
1. Collect a large set of examples.
2. Divide it into two disjoint sets: the training set and the test set.
3. Use the learning algorithm with the training set as examples to generate a hypothesis H.
4. Measure the percentage of examples in the test set that are correctly classified by H.
5. Repeat steps 1 to 4 for different sizes of training sets and different randomly selected training sets of each size.
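The five steps above can be sketched as follows. The majority-class "learner" and the synthetic data are stand-ins, chosen only to keep the example self-contained; any learning algorithm could be plugged in at step 3.

```python
import random

random.seed(0)
# Step 1: collect a large set of labelled examples (x, classification).
data = [(i, i % 3 != 0) for i in range(100)]

def majority_learner(training_set):
    """Stand-in learner: H always predicts the training set's majority class."""
    positive = sum(1 for _, c in training_set if c)
    majority = positive >= len(training_set) - positive
    return lambda x: majority

for train_size in (10, 50):
    random.shuffle(data)                                 # Step 5: vary the split
    train, test = data[:train_size], data[train_size:]   # Step 2: disjoint sets
    H = majority_learner(train)                           # Step 3: generate H
    correct = sum(1 for x, c in test if H(x) == c)        # Step 4: score on test set
    accuracy = correct / len(test)
    print(train_size, accuracy)
```

Plotting accuracy against training-set size produces the learning curve discussed later in the chapter.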

19 18.4 USING INFORMATION THEORY
Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits. One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin. The information gain from an attribute test A is defined as the difference between the original information requirement and the new requirement:
Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)
Remainder(A) = Σi (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni))
where p and n are the numbers of positive and negative examples, and pi and ni are the numbers in the i-th branch of the attribute test. The heuristic used in the CHOOSE-ATTRIBUTE function is just to choose the attribute with the largest gain.
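A sketch of these formulas in Python. The counts used (6 positive and 6 negative examples overall, with Patrons splitting them as 0+/2-, 4+/0-, and 2+/4-) follow the restaurant training set as given in the book.

```python
from math import log2

def I(*probabilities):
    """Information content (entropy), in bits, of a distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def remainder(branches, p, n):
    """Expected information still needed after testing the attribute."""
    return sum((pi + ni) / (p + n) * I(pi / (pi + ni), ni / (pi + ni))
               for pi, ni in branches)

p, n = 6, 6                          # positive and negative examples
patrons = [(0, 2), (4, 0), (2, 4)]   # (pi, ni) for the None, Some, Full branches
gain = I(p / (p + n), n / (p + n)) - remainder(patrons, p, n)
print(round(gain, 3))  # 0.541
```

Because Gain(Patrons) is about 0.541 bits while the other attributes score lower, CHOOSE-ATTRIBUTE picks Patrons as the root test.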

20 18.5 LEARNING GENERAL LOGICAL DESCRIPTIONS
Inductive learning can be viewed as a process of searching for a good hypothesis in a large space - the hypothesis space - defined by the representation language chosen for the task. Each hypothesis proposes a logical expression of this kind, which we call a candidate definition of the goal predicate; the set of examples that satisfy the candidate definition is called the extension of the predicate. An example can be a false negative for the hypothesis if the hypothesis says it should be negative but in fact it is positive, and a false positive if the hypothesis says it should be positive but in fact it is negative. In either case, the hypothesis and the example are logically inconsistent. Two approaches are covered below: current-best-hypothesis search (with generalization and specialization) and least-commitment search.

21 Current-best-hypothesis search
(generalization, specialization)
Figure: (a) A consistent hypothesis. (b) A false negative. (c) The hypothesis is generalized. (d) A false positive. (e) The hypothesis is specialized.

22 Least-commitment search
The set of hypotheses remaining is called the version space, and the learning algorithm is called the version space learning algorithm (also the candidate elimination algorithm). It is incremental, and it is a least-commitment algorithm because it makes no arbitrary choices.
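A minimal version-space sketch over a deliberately tiny hypothesis space of conjunctions (one required value per attribute, or "?" for don't-care); the attributes and examples are illustrative assumptions, not from the text. After the examples are seen, exactly the consistent hypotheses remain.

```python
from itertools import product

attributes = {"Patrons": ["Some", "Full"], "Hungry": ["Yes", "No"]}

def matches(hypothesis, example):
    """A conjunction matches if every non-'?' value agrees with the example."""
    return all(v == "?" or example[a] == v for a, v in hypothesis.items())

# Enumerate the full hypothesis space: every combination of value-or-'?'.
names = list(attributes)
space = [dict(zip(names, values))
         for values in product(*[attributes[a] + ["?"] for a in names])]

examples = [({"Patrons": "Some", "Hungry": "Yes"}, True),
            ({"Patrons": "Full", "Hungry": "Yes"}, False)]

# The version space: hypotheses consistent with every example seen so far.
version_space = [h for h in space
                 if all(matches(h, x) == label for x, label in examples)]
print(len(version_space))  # 2
```

The real candidate elimination algorithm avoids enumerating the whole space by maintaining only the G and S boundary sets, but the filtered set computed here is the same version space those boundaries delimit.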

23 Figure 18.13 The version space contains all hypotheses consistent with the examples.

24 Figure 18.14 The extensions of the members of G and S
No known examples lie in between.

25 18.6 WHY LEARNING WORKS: COMPUTATIONAL LEARNING THEORY
Learning means behaving better as a result of experience. Computational learning theory is a field at the intersection of AI and theoretical computer science. The error of a hypothesis h is the probability that it disagrees with the true function f on an example drawn from the underlying distribution D:
error(h) = P(h(x) ≠ f(x) | x drawn from D)
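The error definition can be estimated empirically by sampling: draw many x from D and count how often h disagrees with the true f. The particular f, h, and uniform distribution D below are illustrative assumptions.

```python
import random

random.seed(1)

def f(x):
    return x >= 0.5   # the true (normally unknown) function

def h(x):
    return x >= 0.6   # a hypothesis; it disagrees with f exactly on [0.5, 0.6)

# Draw x from D = Uniform[0, 1) and measure the disagreement rate.
samples = [random.random() for _ in range(100_000)]
error = sum(1 for x in samples if h(x) != f(x)) / len(samples)
print(round(error, 2))  # ≈ 0.10
```

A hypothesis whose error is below some small ε lies inside the "ε-ball" around f; PAC learning asks how many examples are needed before a consistent hypothesis is probably inside that ball.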

26 error(h) = P(h(x) ≠ f(x) | x drawn from D)
Figure: Schematic diagram of hypothesis space, showing the "ε-ball" around the true function f.

27 Learning decision lists
∀x WillWait(x) ⇔ Patrons(x, Some) ∨ (Patrons(x, Full) ∧ Fri/Sat(x))
A decision list is a logical expression of a restricted form. It consists of a series of tests, each of which is a conjunction of literals. If a test succeeds when applied to an example description, the decision list specifies the value to be returned. If the test fails, processing continues with the next test in the list.
Figure: A decision list for the restaurant problem.
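A sketch of this decision list as executable code, assuming a representation as an ordered list of (test, value) pairs: the first test that succeeds determines the output, and a final always-true test supplies the default.

```python
# Each test is a conjunction of literals; the list encodes the WillWait
# expression from this slide.
decision_list = [
    (lambda x: x["Patrons"] == "Some", True),
    (lambda x: x["Patrons"] == "Full" and x["Fri/Sat"], True),
    (lambda x: True, False),   # default: no test succeeded
]

def evaluate(dlist, example):
    """Return the value of the first test that succeeds on the example."""
    for test, value in dlist:
        if test(example):
            return value

print(evaluate(decision_list, {"Patrons": "Full", "Fri/Sat": True}))   # True
print(evaluate(decision_list, {"Patrons": "Full", "Fri/Sat": False}))  # False
```

Restricting each test to at most k literals (k-DL) is what makes decision lists learnable with a polynomial number of examples.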

28 Figure: Graph showing the predictive performance of the DECISION-LIST-LEARNING algorithm on the restaurant data, as a function of the number of examples seen. The curve for DECISION-TREE-LEARNING is shown for comparison.

29 18.7 SUMMARY
All learning can be seen as learning a function, and in this chapter we concentrate on induction: learning a function from example input/output pairs. The main points were as follows: Learning in intelligent agents is essential for dealing with unknown environments (i.e., compensating for the designer's lack of omniscience about the agent's environment). Learning is also essential for building agents with a reasonable amount of effort (i.e., compensating for the designer's laziness, or lack of time). Learning agents can be divided conceptually into a performance element, which is responsible for selecting actions, and a learning element, which is responsible for modifying the performance element.

30 Learning takes many forms, depending on the nature of the performance element, the available feedback, and the available knowledge. Learning any particular component of the performance element can be cast as a problem of learning an accurate representation of a function. Learning a function from examples of its inputs and outputs is called inductive learning. The difficulty of learning depends on the chosen representation. Functions can be represented by logical sentences, polynomials, belief networks, neural networks, and others. Decision trees are an efficient method for learning deterministic Boolean functions.

31 Ockham's razor suggests choosing the simplest hypothesis that matches the observed examples. The information gain heuristic allows us to find a simple decision tree. The performance of inductive learning algorithms is measured by their learning curve, which shows the prediction accuracy as a function of the number of observed examples. We presented two general approaches for learning logical theories. The current-best-hypothesis approach maintains and adjusts a single hypothesis, whereas the version space approach maintains a representation of all consistent hypotheses. Both are vulnerable to noise in the training set. Computational learning theory analyses the sample complexity and computational complexity of inductive learning. There is a trade-off between the expressiveness of the hypothesis language and the ease of learning.

