1
Reinforcement Learning: Generalization and Function Approximation. Brendan and Yifang, Feb 10, 2015
2
Problem: Our estimates of value functions have so far been represented as a table with one entry for each state or each state-action pair. The approaches we have discussed are therefore limited to tasks with small numbers of states and actions. If the state or action space includes continuous variables or complex sensations, the problem becomes even more severe.
3
Solution: We approximate the value function not as a table but as a parameterized functional form $V_t(s)$ with parameter vector $\vec\theta_t$. Our goal is to minimize the mean-squared error over a distribution $P$ of the states, $MSE(\vec\theta_t) = \sum_s P(s)\,[V^\pi(s) - V_t(s)]^2$. The distribution $P$ is usually the distribution from which the states in the training examples are drawn; here we assume $P$ is uniform.
4
Approach: Gradient Descent. We assume $V_t(s)$ is a smooth, differentiable function of the parameter vector $\vec\theta_t = (\theta_t(1), \ldots, \theta_t(n))^\top$ for every state $s$. The gradient-descent idea: if a function is differentiable in a neighborhood of a point, it decreases fastest if one moves from that point in the direction of the negative gradient. In this problem our goal is to minimize $MSE(\vec\theta_t)$, and the gradient of the value estimate with respect to the parameters is $\nabla_{\vec\theta_t} V_t(s) = \big(\partial V_t(s)/\partial\theta_t(1), \ldots, \partial V_t(s)/\partial\theta_t(n)\big)^\top$.
5
Because our goal is a local minimum, we update $\vec\theta_t$ in the direction of the negative gradient of the squared error on the observed state $s_t$. Using a constant $\tfrac{1}{2}\alpha$ to tune the size of each step, we have $\vec\theta_{t+1} = \vec\theta_t - \tfrac{1}{2}\alpha\,\nabla_{\vec\theta_t}[V^\pi(s_t) - V_t(s_t)]^2 = \vec\theta_t + \alpha\,[V^\pi(s_t) - V_t(s_t)]\,\nabla_{\vec\theta_t} V_t(s_t)$. If $V_t$ is a linear function of the parameters, i.e., $V_t(s) = \vec\theta_t^\top \vec\phi_s$, then $\nabla_{\vec\theta_t} V_t(s) = \vec\phi_s$.
6
Monte Carlo estimate. The true value $V^\pi(s_t)$ is actually unavailable, so we use some approximation $v_t$ of it instead, with the update $\vec\theta_{t+1} = \vec\theta_t + \alpha\,[v_t - V_t(s_t)]\,\nabla_{\vec\theta_t} V_t(s_t)$. If $v_t$ is an unbiased estimate, i.e., $E[v_t] = V^\pi(s_t)$ for each $t$, then $\vec\theta_t$ is guaranteed to converge to a local optimum. Because the true value of a state is the expected return following it, the Monte Carlo target $v_t = R_t$ is an unbiased estimate. If we use a TD estimate instead, will $\vec\theta_t$ still converge to a local optimum?
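To make the update concrete, here is a minimal sketch (not from the slides) of gradient-descent Monte Carlo prediction with a linear value function; the feature function, episode format, and step size below are assumptions made for this example.

import numpy as np

def gradient_mc_prediction(episodes, features, n_features, alpha=0.01, gamma=1.0):
    """Gradient-descent Monte Carlo prediction with a linear value function.

    episodes: list of trajectories, each a list of (state, reward) pairs, where
              reward is the reward received on leaving that state.
    features: maps a state to a length-n_features numpy vector (assumed helper).
    """
    theta = np.zeros(n_features)
    for episode in episodes:
        # Compute the Monte Carlo return R_t for every step, working backwards.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        # theta <- theta + alpha * (R_t - V_t(s_t)) * grad V_t(s_t);
        # for a linear V_t, the gradient with respect to theta is the feature vector.
        for (state, _), G in zip(episode, returns):
            phi = features(state)
            theta += alpha * (G - theta @ phi) * phi
    return theta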
7
N-step TD. TD(0) is one-step TD, while Monte Carlo waits for the complete return. How can we trade off between the two? The $n$-step return $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$ interpolates between them. Figure 7.1: Backup diagrams ranging from one-step TD to Monte Carlo estimation.
8
An intuitive idea: rather than choosing a single $n$, average the $n$-step returns with geometrically decaying weights. We define the $\lambda$-return $R_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$ and use it as the target. Figure 7.3: Backup diagram of TD(λ).
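As a sketch of how the λ-return combines the n-step returns, the function below computes $R_t^\lambda$ for one step of a finished episode; the trajectory format and the value-function helper V are assumptions of this example.

import numpy as np

def lambda_return(rewards, states, V, t, lam=0.9, gamma=1.0):
    """Forward-view lambda-return R_t^lambda for step t of one episode.

    rewards[k] is the reward on the transition from states[k] to states[k+1];
    V maps a state to its current estimated value (assumed helper).
    """
    T = len(rewards)                       # episode terminates after T transitions

    def n_step_return(n):
        # Discounted rewards for n steps, then bootstrap from V if not terminal.
        end = min(t + n, T)
        G = sum(gamma ** k * rewards[t + k] for k in range(end - t))
        if end < T:
            G += gamma ** (end - t) * V(states[end])
        return G

    # Weight the n-step returns by (1 - lambda) * lambda^(n-1); the final
    # (complete) return receives all of the remaining weight lambda^(T-t-1).
    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step_return(n) for n in range(1, T - t))
    G_lam += lam ** (T - t - 1) * n_step_return(T - t)
    return G_lam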
9
TD(λ) in gradient descent. Monte Carlo is the special case $\lambda = 1$ of TD(λ). The TD(λ)-based $\theta$ update is $\vec\theta_{t+1} = \vec\theta_t + \alpha\,[R_t^\lambda - V_t(s_t)]\,\nabla_{\vec\theta_t} V_t(s_t)$; the Monte Carlo update uses $R_t$ in place of $R_t^\lambda$. However, for $\lambda < 1$ the target $R_t^\lambda$ is not an unbiased estimate of $V^\pi(s_t)$, so this method is not guaranteed to converge to a local optimum. So what? TD(λ) has been proved to converge in the linear case, not to the optimal parameter vector but to a nearby parameter vector $\vec\theta_\infty$ with bounded error, $MSE(\vec\theta_\infty) \le \frac{1-\gamma\lambda}{1-\gamma}\,MSE(\vec\theta^*)$.
10
Backward view of TD(λ). An eligibility trace measures how recently a state has been visited: $e_t(s) = \gamma\lambda\,e_{t-1}(s) + 1$ if $s = s_t$, and $e_t(s) = \gamma\lambda\,e_{t-1}(s)$ otherwise.
11
Backward view of TD(λ), cont'd. The one-step TD error for state-value prediction is $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$. In the backward view, every state is updated in proportion to this error and to its eligibility trace, $\Delta V_t(s) = \alpha\,\delta_t\,e_t(s)$, so recently visited states receive most of the credit. Figure 7.7: Online tabular TD(λ) policy evaluation.
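A minimal sketch of the online tabular TD(λ) evaluation described by Figure 7.7, with accumulating traces; the environment interface (reset/step) and the policy function are assumptions of this example.

import numpy as np
from collections import defaultdict

def tabular_td_lambda(env, policy, n_episodes, alpha=0.1, gamma=0.99, lam=0.8):
    """Online tabular TD(lambda) policy evaluation with accumulating traces."""
    V = defaultdict(float)                     # state-value estimates
    for _ in range(n_episodes):
        e = defaultdict(float)                 # eligibility traces, reset each episode
        s = env.reset()                        # assumed environment interface
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)      # assumed to return (state, reward, done)
            # One-step TD error.
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e[s] += 1.0                        # accumulating trace for the visited state
            # Update every state in proportion to its trace, then decay the traces.
            for state in list(e.keys()):
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam
            s = s_next
    return V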
12
TD(0) or TD(λ)? Figure 6.1: Using TD(0) in policy evaluation. Figure 7.7: Using TD(λ) in policy evaluation.
13
Backward view of gradient descent. The backward view of the gradient-descent TD(λ) update is $\vec\theta_{t+1} = \vec\theta_t + \alpha\,\delta_t\,\vec e_t$, where $\vec e_t = \gamma\lambda\,\vec e_{t-1} + \nabla_{\vec\theta_t} V_t(s_t)$ is a column vector of eligibility traces, one for each component of $\vec\theta_t$. This is an extension of the original tabular definition of the trace.
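The same idea with a parameter vector instead of a table: a sketch of gradient-descent TD(λ) with linear function approximation, where the trace vector has one component per parameter; the feature function and environment interface are again assumed.

import numpy as np

def linear_td_lambda(env, policy, features, n_features,
                     n_episodes=100, alpha=0.01, gamma=0.99, lam=0.8):
    """Gradient-descent TD(lambda): theta <- theta + alpha * delta_t * e_t,
    with e_t = gamma*lambda*e_{t-1} + grad V_t(s_t) = gamma*lambda*e_{t-1} + phi(s_t)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        e = np.zeros(n_features)               # one trace per parameter component
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            phi = features(s)
            v = theta @ phi
            v_next = 0.0 if done else theta @ features(s_next)
            delta = r + gamma * v_next - v     # one-step TD error
            e = gamma * lam * e + phi          # trace update; grad of linear V is phi
            theta += alpha * delta * e
            s = s_next
    return theta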
14
Linear Methods. Corresponding to each state $s$ there is a column vector of features $\vec\phi_s = (\phi_s(1), \ldots, \phi_s(n))^\top$, and the state-value function is given by $V_t(s) = \vec\theta_t^\top \vec\phi_s = \sum_{i=1}^{n} \theta_t(i)\,\phi_s(i)$. Why do we prefer linear methods? In the linear case there is only one optimum, so any method guaranteed to converge to or near a local optimum is automatically guaranteed to converge to or near the global optimum.
15
Linear Method Coding. Goal: use a linear method to solve a nonlinear problem. Solution: encode the nonlinear problem as a linear one, transforming a nonlinear solution in state space into a linear solution in feature space. Three coding schemes: coarse coding, tile coding, and radial basis functions.
16
Coarse Coding. A feature corresponds to a circle in the state space. If the state is inside the circle, the corresponding feature has the value 1 and is said to be present; otherwise the feature is 0 and is said to be absent. Features of this kind are called binary features. An example: the state space is one-dimensional and each feature is an interval, so the dimensionality of the feature space is the number of intervals. Figure 8.2: One-dimensional coarse coding.
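A small sketch of the one-dimensional case: each feature is an interval on the state axis, and the binary feature is 1 exactly when the state falls inside that interval; the interval centers and width below are arbitrary choices for illustration.

import numpy as np

def coarse_code_1d(x, centers, width):
    """Binary coarse-coding features for a scalar state x.

    Each feature corresponds to an interval of the given width centered at
    centers[i]; the feature is 1 ("present") if x lies inside the interval.
    """
    centers = np.asarray(centers, float)
    return (np.abs(x - centers) <= width / 2.0).astype(float)

# Example: 10 overlapping intervals covering [0, 1].
phi = coarse_code_1d(0.37, centers=np.linspace(0.0, 1.0, 10), width=0.3)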
17
Coarse Coding. Another example: the state space is two-dimensional, and the dimensionality of the feature space is the number of circles. Figure 8.4: Two-dimensional coarse coding.
18
Tile Coding. Unlike coarse coding, where the circles may overlap, tile coding partitions the state space into an exhaustive set of tiles, and the tiles within a tiling do not overlap. Figure 8.6: Different tiling schemes.
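A sketch of a simple grid tiling over a two-dimensional state space: each of several slightly offset tilings contributes exactly one active binary feature, so the feature vector has one 1 per tiling. The number of tiles, number of tilings, and offset scheme are illustrative assumptions.

import numpy as np

def tile_code_2d(state, low, high, n_tiles=8, n_tilings=4):
    """Indices of the active binary features for a 2-D state.

    Each tiling partitions the state space into an n_tiles x n_tiles grid,
    shifted by a different fraction of a tile, so exactly one tile per
    tiling is active.
    """
    low, high = np.asarray(low, float), np.asarray(high, float)
    scaled = (np.asarray(state, float) - low) / (high - low)   # map state into [0, 1]^2
    active = []
    for k in range(n_tilings):
        offset = k / n_tilings / n_tiles                       # shift each tiling slightly
        ix = np.clip(((scaled + offset) * n_tiles).astype(int), 0, n_tiles - 1)
        # Flat index of the active tile within the full feature vector.
        active.append(k * n_tiles * n_tiles + ix[0] * n_tiles + ix[1])
    return active   # the feature vector is 1 at these indices, 0 elsewhere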
19
Radial Basis Functions. Rather than each feature being either 0 or 1, a feature can take any value in the interval [0, 1], reflecting the degree to which it is present. Figure 8.7: One-dimensional radial basis functions.
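A sketch of Gaussian radial basis features for a scalar state: each feature decays smoothly from 1 at its center toward 0 with distance; the centers and width are arbitrary choices for the example.

import numpy as np

def rbf_features(x, centers, sigma):
    """Gaussian radial basis features: phi_i(x) = exp(-(x - c_i)^2 / (2 sigma^2))."""
    centers = np.asarray(centers, float)
    return np.exp(-((x - centers) ** 2) / (2.0 * sigma ** 2))

# Example: 10 centers on [0, 1]; a state near a center has that feature close to 1.
phi = rbf_features(0.42, centers=np.linspace(0.0, 1.0, 10), sigma=0.1)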
20
Control with Linear Methods. Approximate the action-value function $Q_t(s, a)$ and generate the policy from it. Figure 8.8: Policy control with linear gradient-descent Sarsa(λ).
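A sketch loosely following the Figure 8.8 algorithm: linear gradient-descent Sarsa(λ) with binary features and an ε-greedy policy. The active_features helper (e.g., tile coding over state-action pairs) and the environment interface are assumptions of this example.

import numpy as np

def linear_sarsa_lambda(env, active_features, n_features, n_actions,
                        n_episodes=200, alpha=0.05, gamma=1.0,
                        lam=0.9, epsilon=0.1):
    """Linear gradient-descent Sarsa(lambda) with binary features.

    active_features(s, a) returns the indices of the active (value-1) features
    for a state-action pair, e.g. from tile coding (assumed helper).
    """
    theta = np.zeros(n_features)
    q = lambda s, a: theta[active_features(s, a)].sum()   # Q(s,a) = sum of active weights

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        e = np.zeros(n_features)               # eligibility traces, one per weight
        s = env.reset()                        # assumed environment interface
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s, a)                # start of the TD error
            e[active_features(s, a)] += 1.0    # accumulating traces on active features
            if not done:
                a_next = eps_greedy(s_next)
                delta += gamma * q(s_next, a_next)
            theta += alpha * delta * e
            e *= gamma * lam
            if not done:
                s, a = s_next, a_next
    return theta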
21
Pros and Cons of Bootstrapping. Theoretically, non-bootstrapping methods achieve a lower asymptotic error than bootstrapping methods. In practice, bootstrapping methods usually outperform non-bootstrapping methods. Figure 8.10: The mountain-car task. Figure 8.15: Performance of different solutions.
22
Pros and Cons of Bootstrapping, cont'd. Another example: a small Markov process for generating random walks. Figure 6.5: A random walk. Figure 8.15: Performance of different solutions.
23
Question 1: I am not sure what the theta-vector is. Is there a more basic, concrete example of a task to be solved, with a description of the states and of where the theta-vector comes into play? What is the theta-vector in the mountain-car task? (From Henry) Figure 8.10: The mountain-car task.
24
Question 2: One of the assumptions required is that the value function is a smooth, differentiable function. The authors do not address whether this is realistic, or whether the assumption holds in the majority of cases. (From Henry) Answer: (I think) the smoothness and differentiability requirement is needed for any gradient-descent method, so that the gradient at a point reflects the local trend of the function. If there is a jump at a point, a method based on local descent requires a denser grid of features around it. http://en.wikipedia.org/wiki/Monte_Carlo_method
25
Question 3: How does the hashing form of tiling make sense? (From Henry) Answer: Hashing allows us to produce noncontiguous tiles. High resolution is needed in only a small fraction of the state space, so hashing frees us from the curse of dimensionality in memory. Figure: Hashing scheme.
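A sketch of the hashing idea: the (tiling, tile-coordinates) pair is hashed into a fixed-size table of weights instead of allocating one weight per possible tile, so memory does not grow with the number of state dimensions; the hash function and table size here are arbitrary choices.

def hashed_tile_index(tiling, tile_coords, table_size=4096):
    """Map a (tiling, tile coordinates) pair to an index in a fixed-size weight table.

    Collisions are possible but rare enough in practice: most of the state space
    is visited rarely, so high resolution is only "spent" where it is needed.
    """
    return hash((tiling,) + tuple(tile_coords)) % table_size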
26
Question 4: Can you give an example to explain how gradient-descent Monte Carlo state-value prediction converges? (From Tavish) Answer: Use the update $\vec\theta_{t+1} = \vec\theta_t + \alpha\,[R_t - V_t(s_t)]\,\nabla_{\vec\theta_t} V_t(s_t)$, where $R_t$ is the complete return. Eligibility traces are no longer involved; the rest of the algorithm is unchanged.
27
Question 5: In Section 8.1 the book says, "the novelty in this chapter is that the approximate value function at time t is represented not as a table but as a parameterized functional form with parameter vector theta." Could you give some examples of the theta vector? (From Yuankai)
28
Question 6: The book says "there are generally far more states than there are components to theta." Why is this the case? (From Yuankai) Answer: The state-value function may be defined over a continuous (or very large) state space, but the number of features is limited. With this limited number of features we can still approximate the original function; a denser grid of features can approximate a more rapidly fluctuating function, at the expense of more resources.
29
Question 7: It looks like the theta vector can fit into other models as well. Can you illustrate the role of this theta vector in these different approaches? (From Sicong) Answer: The universal approximation theorem for artificial neural networks is a good example. Let $\varphi$ be a non-constant, bounded, monotonically increasing continuous function, and let $I_m$ be the m-dimensional unit hypercube. Then any continuous function $f$ defined on $I_m$ can be approximated arbitrarily well by a function of the form $F(x) = \sum_{i=1}^{N} v_i\,\varphi(w_i^\top x + b_i)$; there, the weights $v_i$, $w_i$, and $b_i$ play the role of the theta vector.
30
Question 8: In theory, non-bootstrapping is better, but in practice bootstrapping methods are better, and we do not know why. Is that correct? (From Brad) Answer: According to the textbook, "the available results indicate that non-bootstrapping methods are better than bootstrapping methods at reducing MSE from the true value function, but reducing MSE is not necessarily the most important goal." Question 9: If $v_t$ is a biased estimate, why does the method not converge to a local optimum? (From Brad) Answer: The method tries to match $V_t$ to the targets $v_t$; if $v_t$ is a biased estimate, the updates pull the parameters toward the biased values rather than toward a local minimum of the true MSE.
31
Question 10: The book mentions "backup" many times. What is it? (From Jiyun) Answer: It means updating the value of state $s_t$ using the value of $s_{t+1}$: standing at state $s_{t+1}$ at time $t+1$ and propagating its value backwards to update the value of $s_t$. The process of updating the value of a preceding state using the value of a successor state is called a backup.