Flexible and fast convergent learning agent
Miguel A. Soto Santibanez and Michael M. Marefat
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ
santiban@ece.arizona.edu, marefat@ece.arizona.edu
Background and Motivation
“A computer program is said to LEARN from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
A robot driving learning problem:
Task T: driving on public four-lane highways using vision sensors
Performance measure P: average distance traveled before an error (as judged by a human overseer)
Training experience E: a sequence of images and steering commands recorded while observing a human driver
Background and Motivation II
1) Artificial Neural Networks
Robust to errors in the training data
Dependent on the availability of good and extensive training examples
2) Instance-Based Learning
Able to model complex policies by making use of less complex local approximations
Dependent on the availability of good and extensive training examples
3) Reinforcement Learning
Independent of the availability of good and extensive training examples
Convergence to the optimal policy can be extremely slow
Background and Motivation III
Motivation: Is it possible to get the best of both worlds? Is it possible for a Learning Agent to be flexible and fast convergent at the same time?
The Problem
Formalization. Given:
a) a set of actions A = {a1, a2, a3, ...},
b) a set of situations S = {s1, s2, s3, ...},
c) and a function TR(a, s) → tr, where tr is the total reward associated with applying action a while in situation s,
the LA needs to construct a set of rules P = {rule(s1, a1), rule(s2, a2), ...} such that for every rule(s, a) ∈ P, a = a_max, where TR(a_max, s) = max(TR(a1, s), TR(a2, s), ...).
Also: 1) increase flexibility, 2) increase speed of convergence.
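To make the formalization concrete, here is a minimal sketch of the rule-set construction; the situations, actions, and TR values are made-up placeholders, not data from the slides.

```python
# A toy instance of the formalization: P picks, for each situation s,
# the action a_max that maximizes the total reward TR(a, s).
A = ["left", "straight", "right"]        # set of actions (placeholder)
S = ["s1", "s2"]                         # set of situations (placeholder)
TR = {                                   # total reward TR(a, s) (made-up values)
    ("left", "s1"): 0.1, ("straight", "s1"): 0.9, ("right", "s1"): 0.3,
    ("left", "s2"): 0.7, ("straight", "s2"): 0.2, ("right", "s2"): 0.4,
}

P = {s: max(A, key=lambda a: TR[(a, s)]) for s in S}
print(P)  # {'s1': 'straight', 's2': 'left'}
```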
The Solution
The Q-learning algorithm:
1: ∀ rule(s, a) ∈ P, TR(a, s) ← 0
2: find out what is the current situation s_i
3: do forever:
4:   select an action a_i ∈ A and execute it
5:   find out what is the immediate reward r
6:   find out what is the current situation s_i′
7:   TR(a_i, s_i) ← r + aFactor · max_a TR(a, s_i′)
8:   s_i ← s_i′
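The listing leaves the action-selection strategy open. The following is a minimal, runnable sketch of the same loop; observe_situation and execute are hypothetical stand-ins for the agent's sensors and actuators, random action selection and the bounded loop are assumptions, and aFactor plays the role of the discount factor parameter mentioned later in the results.

```python
import random

def q_learning(actions, observe_situation, execute, aFactor=0.9, steps=10_000):
    """Sketch of the Q-learning listing above. observe_situation() reads the
    current situation; execute(action) performs the action and returns the
    immediate reward r. Both are placeholders supplied by the caller."""
    TR = {}                                    # line 1: TR(a, s) is 0 until visited
    s = observe_situation()                    # line 2: current situation s_i
    for _ in range(steps):                     # line 3: "do forever", bounded here
        a = random.choice(actions)             # line 4: select an action and execute it
        r = execute(a)                         # line 5: immediate reward r
        s_next = observe_situation()           # line 6: new situation s_i'
        best_next = max(TR.get((a2, s_next), 0.0) for a2 in actions)
        TR[(a, s)] = r + aFactor * best_next   # line 7: r + aFactor * max_a TR(a, s_i')
        s = s_next                             # line 8: s_i <- s_i'
    return TR
```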
The Solution II
Advantages:
1) The LA does not depend on the availability of good and extensive training examples.
Reason: a) this method learns from experimentation instead of from given training examples.
Shortcomings:
1) Convergence to the optimal policy can be very slow.
Reasons: a) the Q-learning algorithm propagates “good findings” very slowly; b) the speed of convergence is tied to the number of situations that need to be handled.
2) The method may not be usable on high-dimensionality problems.
Reason: a) the memory requirements grow exponentially as we add more dimensions to the problem.
The Solution III
Speed of convergence is tied to the number of situations:
more situations ==> more rules in P that need to be found
more rules in P to be found ==> more experiments are needed
more experiments needed ==> slower convergence
The Solution IV Slow propagation of “good findings”:
The Solution V
First sub-problem: slow propagation of “good findings”.
Solution: develop a method that propagates “good findings” beyond the previous state.
The Solution VI
Solution to the first sub-problem:
a) Use a buffer, which we call “short term memory”, to keep track of the last n situations.
b) After each learning experience, apply the following algorithm:
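The algorithm itself appears in the deck only as a figure. The sketch below is one plausible reading, assuming the short term memory holds the last n (situation, action, reward, next situation) tuples and that the update from line 7 of the Q-learning listing is re-applied backwards over that buffer; the names are illustrative, not the authors'.

```python
def propagate_good_findings(TR, actions, memory, aFactor=0.9):
    """Sweep backwards over the short term memory and re-apply the update
    from line 7, so that a large reward reaches situations visited several
    steps earlier instead of only the immediately preceding one."""
    for s, a, r, s_next in reversed(memory):
        best_next = max(TR.get((a2, s_next), 0.0) for a2 in actions)
        TR[(a, s)] = r + aFactor * best_next

# Usage inside the learning loop (names are illustrative):
#   memory.append((s, a, r, s_next))
#   memory = memory[-n:]                      # keep only the last n entries
#   propagate_good_findings(TR, actions, memory)
```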
The Solution VII
The second and third sub-problems:
a) Memory requirements grow exponentially as we add more dimensions to the problem.
b) The speed of convergence is tied to the number of situations that need to be handled.
Solution:
1) We keep only a few examples of the policy (also called prototypes).
2) We generate the policy for situations not described explicitly by these prototypes by “generalizing” from “nearby” prototypes.
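As an illustration of point 2), here is a minimal sketch of generalizing from nearby prototypes. Representing a prototype as a (situation_vector, value) pair, the inverse-distance weighting, and the parameter k are assumptions made for illustration; they are not the deck's moving-prototypes scheme, which organizes its prototypes in a tree as the following slides show.

```python
def generalize(situation, prototypes, k=3):
    """Estimate the value of a situation that is not stored explicitly by
    inverse-distance weighting of its k nearest prototypes. Each prototype
    is a (situation_vector, value) pair (an illustrative representation)."""
    def dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

    nearest = sorted(prototypes, key=lambda pv: dist(pv[0], situation))[:k]
    weights = [1.0 / (dist(p, situation) + 1e-9) for p, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

# Example (made-up prototypes):
#   prototypes = [((0.0, 0.0), 1.0), ((1.0, 0.0), 0.2), ((0.0, 1.0), 0.5)]
#   generalize((0.2, 0.1), prototypes, k=2)
```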
The Solution VIII
Kanerva Coding and Tile Coding
Moving Prototypes
The Solution IX
The Solution X
The Solution XI
The Solution XII
A sound tree:
a) all the “areas” are mutually exclusive,
b) their merging is exhaustive,
c) the merging of any two sibling “areas” is equal to their parent’s “area”.
(Figure: parent and children areas.)
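A minimal sketch of this soundness check, assuming for illustration that each “area” is a one-dimensional interval [lo, hi); the deck's actual areas are regions of the situation space, which may have several dimensions.

```python
class AreaNode:
    """A node of the prototype tree over 1-D interval 'areas' [lo, hi).
    The interval representation is an assumption for illustration."""
    def __init__(self, lo, hi, children=None):
        self.lo, self.hi = lo, hi
        self.children = children or []

    def is_sound(self):
        """Check the three conditions: the children's areas are mutually
        exclusive, exhaustive, and merge exactly into the parent's area."""
        if not self.children:
            return True
        kids = sorted(self.children, key=lambda c: c.lo)
        if kids[0].lo != self.lo or kids[-1].hi != self.hi:
            return False          # merge of the children != parent's area
        for left, right in zip(kids, kids[1:]):
            if left.hi != right.lo:
                return False      # a gap (not exhaustive) or an overlap (not exclusive)
        return all(c.is_sound() for c in kids)
```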
The Solution XIII Impossible Merge
The Solution XIV “Smallest predecessor”
The Solution XV
The Solution XVI
Possible ways of breaking the existing nodes (figure: the node being inserted).
The Solution XVII
(Figure: List 1, List 1.1, and List 1.2.)
The Solution XVIII
The Solution XIX
The Solution XX
Results
The performance of the “Propagation of Good Findings” algorithm is especially good when the world is large.
The “Propagation of Good Findings” algorithm is more efficient when the size of its “Short Term Memory” is large.
Results II
The “Propagation of Good Findings” algorithm is more efficient when the value of the “discount factor” parameter is large.
Results do not depend on the sequence of random numbers.
Conclusions
Q Learning Algorithm ==> the LA becomes more flexible
Propagating concept ==> convergence is accelerated
Moving Prototypes concept ==> the LA becomes more flexible
Moving Prototypes concept ==> convergence is accelerated
Conclusions II
What is left to do:
Obtain results on the advantages of using regression trees and linear approximation over other similar methods (just as we have already done with the “Propagation of Good Findings” method).
Apply the proposed model to example applications, such as a self-optimizing middleman between a high-level planner and the actuators of a robot.
Develop more precisely the limits on the use of this model.