1 Reinforcement Learning: Approximate Planning
Csaba Szepesvári, University of Alberta
Kioloa, MLSS’08
Slides:

2 Planning Problem
- The MDP
  - .. is given (p, r can be queried)
  - .. can be sampled from
    - at any state
    - trajectories ("simulation optimization")
- Goal: find an optimal policy
- Constraints:
  - Computational efficiency
    - Polynomial complexity
    - O(1) ≈ real-time decisions
  - Sample efficiency ~ computational efficiency

3 Methods for planning
- Exact solutions (DP)
- Approximate solutions
  - Rollouts (≈ search)
    - Sparse lookahead trees, UCT
  - Approximate value functions
    - RDM, FVI, LP
  - Policy search
    - Policy gradient (likelihood ratio method), Pegasus [Ng & Jordan ’00]
  - Hybrid
    - Actor-critic

4 Bellman’s Curse of Dimensionality
- The state space in many problems is..
  - Continuous
  - High-dimensional
- "Curse of Dimensionality" (Bellman, ’57): the running time of algorithms scales exponentially with the dimension of the state space.
- Transition probabilities
  - Kernel: P(dy|x,a)
  - Density: p(y|x,a), e.g. p(y|x,a) ∝ exp(-||y - f(x,a)||² / (2σ²))

5 A Lower Bound
- Theorem (Chow, Tsitsiklis ’89): Consider Markovian decision problems with
  - d-dimensional state space,
  - bounded transition probabilities and rewards,
  - Lipschitz-continuous transition probabilities and rewards.
  Any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^(-d)) values of p and r.
- What's next then??

6 Monte-Carlo Search Methods
Problem:
- Can generate trajectories from an initial state
- Find a good action at the initial state

7 Sparse lookahead trees
- [Kearns et al., ’02]: sparse lookahead trees
- Effective horizon: H(ε) = K_r / (ε(1-γ))
- Size of the tree: S = c·|A|^H(ε) (unavoidable)
- Good news: S is independent of d!
  - ..but it is exponential in H(ε)
- Still attractive: generic, easy to implement
- Would you use it?
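As a concrete illustration, here is a minimal Python sketch of sparse-lookahead planning in the spirit of [Kearns et al. ’02]: it estimates Q-values recursively, drawing C sampled successors per action at each node down to a fixed depth. The generative-model interface simulate(state, action) -> (reward, next_state) is an assumption of this sketch, not part of the original slides.

# Minimal sparse-lookahead sketch; simulate(state, action) -> (reward, next_state)
# is the assumed generative model.
def sparse_q(simulate, actions, state, depth, C, gamma):
    """Estimate Q(state, a) for every action a with C samples per action per node."""
    if depth == 0:
        return {a: 0.0 for a in actions}
    q = {}
    for a in actions:
        total = 0.0
        for _ in range(C):                       # C sampled successors per action
            r, next_state = simulate(state, a)
            next_q = sparse_q(simulate, actions, next_state, depth - 1, C, gamma)
            total += r + gamma * max(next_q.values())
        q[a] = total / C
    return q

def sparse_lookahead_action(simulate, actions, state, depth, C, gamma):
    """Recommend the greedy root action; the tree has roughly (C * |A|)**depth nodes."""
    q = sparse_q(simulate, actions, state, depth, C, gamma)
    return max(q, key=q.get)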

8 Idea..
- Need to propagate values from good branches as early as possible
- Why sample suboptimal actions at all?
- Breadth-first → depth-first!
- Bandit algorithms → Upper Confidence Bounds → UCT [KoSze ’06]

9 UCB [Auer et al. ’02]
- Bandit with a finite number of actions (a), called arms here
- Q_t(a): estimated payoff of action a
- T_t(a): number of pulls of arm a
- Action choice by UCB: pick argmax_a [ Q_t(a) + sqrt(2 ln t / T_t(a)) ]
- Theorem: the expected loss is bounded by O(log n)
  - Optimal rate
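A minimal Python sketch of this UCB1-style arm selection follows; the bandit environment pull(a), returning a reward in [0,1], is an assumed stand-in.

import math, random

def ucb1(pull, n_arms, n_steps):
    """UCB1: after pulling each arm once, pick argmax_a Q_t(a) + sqrt(2 ln t / T_t(a))."""
    Q = [0.0] * n_arms            # estimated payoff of each arm
    T = [0] * n_arms              # number of pulls of each arm
    for a in range(n_arms):       # initialization: pull every arm once
        Q[a] = pull(a)
        T[a] = 1
    for t in range(n_arms, n_steps):
        a = max(range(n_arms),
                key=lambda i: Q[i] + math.sqrt(2.0 * math.log(t) / T[i]))
        r = pull(a)
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]     # incremental mean update
    return Q, T

# Toy usage with three Bernoulli arms (hypothetical environment):
means = [0.2, 0.5, 0.8]
Q, T = ucb1(lambda a: float(random.random() < means[a]), n_arms=3, n_steps=10000)
print(T)    # most pulls should concentrate on the best arm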

10 UCT Algorithm [KoSze ’06]
- To decide which way to go, play a bandit in each node of the tree
- Extend the tree one node at a time
- Similar ideas: [Peret and Garcia, ’04], [Chang et al., ’05]
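The sketch below is a compact, hedged rendering of the core UCT loop (not the exact [KoSze ’06] implementation): UCB-based selection inside the tree, expansion of one new node per simulation, a random rollout from the new leaf, and backup of the discounted return. The generative-model interface step(state, action) -> (reward, next_state, done), the rollout depth, and the exploration constant are assumptions; states are assumed hashable.

import math, random

class Node:
    def __init__(self):
        self.N = 0            # visits of this node
        self.Na = {}          # visits per action
        self.Q = {}           # mean return per action
        self.children = {}    # (action, next_state) -> Node

def uct_action(node, actions, c):
    """Play a UCB bandit at this node; untried actions are preferred."""
    untried = [a for a in actions if node.Na.get(a, 0) == 0]
    if untried:
        return random.choice(untried)
    logN = math.log(node.N)
    return max(actions, key=lambda a: node.Q[a] + c * math.sqrt(logN / node.Na[a]))

def rollout(step, actions, state, gamma, depth):
    """Random playout used to evaluate a freshly expanded leaf."""
    ret, disc = 0.0, 1.0
    for _ in range(depth):
        r, state, done = step(state, random.choice(actions))
        ret += disc * r
        disc *= gamma
        if done:
            break
    return ret

def simulate(node, step, actions, state, gamma, depth, c):
    """One UCT simulation: select, expand one node, roll out, back up the return."""
    if depth == 0:
        return 0.0
    a = uct_action(node, actions, c)
    r, next_state, done = step(state, a)
    key = (a, next_state)
    if key not in node.children:
        node.children[key] = Node()
        future = 0.0 if done else rollout(step, actions, next_state, gamma, depth - 1)
    else:
        future = 0.0 if done else simulate(node.children[key], step, actions,
                                           next_state, gamma, depth - 1, c)
    ret = r + gamma * future
    node.N += 1
    node.Na[a] = node.Na.get(a, 0) + 1
    old = node.Q.get(a, 0.0)
    node.Q[a] = old + (ret - old) / node.Na[a]
    return ret

def uct_plan(step, actions, root_state, n_sims=1000, gamma=1.0, depth=50, c=1.4):
    root = Node()
    for _ in range(n_sims):
        simulate(root, step, actions, root_state, gamma, depth, c)
    return max(root.Q, key=root.Q.get)    # recommend the highest-value root action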

11 Results: Sailing
- 'Sailing': stochastic shortest path
- State-space size = 24 × problem size
- Extension to two-player, full-information games
- Major advances in Go!

12 Results: 9x9 Go
- MoGo
  - Authors: Y. Wang, S. Gelly, R. Munos, O. Teytaud, P-A. Coquelin, D. Silver
  - K simulations/move
  - Around since August 2006
- CrazyStone
  - Author: Rémi Coulom
  - Switched to UCT in 2006
- Steenvreter
  - Author: Erik van der Werf
  - Introduced in 2007
- Computer Olympiad (December 2007)
  - 19x19: 1. MoGo, 2. CrazyStone, 3. GnuGo
  - 9x9: 1. Steenvreter, 2. MoGo, 3. CrazyStone
- Guo Juan (5 dan), 9x9 board
  - MoGo with black: 75% win
  - MoGo with white: 33% win
- CGOS: 1800 ELO → 2600 ELO

13 Random Discretization Method
Problem:
- Continuous state space
- Given p, r, find a good policy!
- Be efficient!

14 Value Iteration in Continuous Spaces
- Value iteration: V_{k+1}(x) = max_{a ∈ A} { r(x,a) + γ ∫_X p(y|x,a) V_k(y) dy }
- How to compute the integral?
- How to represent value functions?
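For reference, here is a minimal sketch of this update on a finite (discretized) state set, where the integral collapses to a sum; the reward array r[x,a] and transition array p[x,a,y] are assumed inputs of this illustration.

import numpy as np

# Minimal value-iteration sketch on a finite state set {0..n-1}:
# V_{k+1}(x) = max_a { r[x,a] + gamma * sum_y p[x,a,y] * V_k(y) }.
# r has shape (n, nA); p has shape (n, nA, n).
def value_iteration(r, p, gamma, n_iter=1000, tol=1e-8):
    n, nA = r.shape
    V = np.zeros(n)
    for _ in range(n_iter):
        Q = r + gamma * p @ V          # Q[x,a] = r[x,a] + gamma * E[V(next state)]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V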

15 Discretization

16 Discretization

17 Can this work?
- No!
- The result of [Chow and Tsitsiklis, 1989] says that methods like this cannot scale well with the dimensionality.

18 Random Discretization [Rust ’97]

19 Weighted Importance Sampling
- How to compute ∫ p(y|x,a) V(y) dy?
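One answer, sketched below under the assumption that the base points X_1, …, X_N are drawn uniformly from the state space, is the self-normalized (weighted) importance-sampling estimate Σ_i w_i V(X_i) with weights w_i ∝ p(X_i|x,a).

import numpy as np

# Self-normalized importance-sampling estimate of ∫ p(y|x,a) V(y) dy, given
# base points X_1..X_N drawn uniformly from the state space (an assumption here).
# p_vals[i] = p(X_i | x, a), v_vals[i] = V(X_i); both are 1-D numpy arrays.
def weighted_is_estimate(p_vals, v_vals):
    w = p_vals / p_vals.sum()      # normalized weights w_i
    return float(w @ v_vals)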

20 The Strength of Monte-Carlo
- Goal: compute I(f) = ∫ f(x) p(x) dx
- Draw X_1, …, X_N ~ p(·)
- Compute I_N(f) = (1/N) Σ_i f(X_i)
- Theorem:
  - E[I_N(f)] = I(f)
  - Var[I_N(f)] = Var[f(X_1)] / N
  - The rate of convergence is independent of the dimensionality of x!
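A toy illustration of the dimension-independent 1/√N rate: estimating E[f(X)] for a standard Gaussian in d dimensions (the particular f and d below are just an example, not from the slides).

import numpy as np

# Monte-Carlo estimate of I(f) = E[f(X)], X ~ N(0, I_d), with f(x) = ||x||^2,
# whose true value is d. The standard error shrinks like 1/sqrt(N) regardless of d.
rng = np.random.default_rng(0)
d, N = 50, 10000
X = rng.standard_normal((N, d))
f = (X ** 2).sum(axis=1)
I_N = f.mean()
stderr = f.std(ddof=1) / np.sqrt(N)
print(f"estimate = {I_N:.2f} ± {stderr:.2f}  (true value = {d})")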

21 The Random Discretization Method
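The slide's figure is not reproduced in this transcript; as a hedged reconstruction of the idea in [Rust ’97], the sketch below draws N base points uniformly from [0,1]^d and runs value iteration on them, replacing the integral with the self-normalized weights from the previous slide. The callables p(y, x, a) and r(x, a) are assumed interfaces of this sketch.

import numpy as np

# Hedged sketch of the random discretization method on [0,1]^d.
# p(y, x, a): transition density, r(x, a): reward; actions: list of actions.
def rdm_value_iteration(p, r, actions, d, N, gamma, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((N, d))                      # random base points X_1..X_N
    W, R = {}, {}
    for a in actions:
        # Self-normalized weights W[a][i, j] ≈ P(X_j | X_i, a) on the random grid.
        P = np.array([[p(X[j], X[i], a) for j in range(N)] for i in range(N)])
        W[a] = P / P.sum(axis=1, keepdims=True)
        R[a] = np.array([r(X[i], a) for i in range(N)])
    V = np.zeros(N)
    for _ in range(n_iter):                     # iterate the random Bellman operator
        V = np.max([R[a] + gamma * W[a] @ V for a in actions], axis=0)
    return X, V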

22 Guarantees
- State space: [0,1]^d
- Action space: finite
- p(y|x,a), r(x,a): Lipschitz continuous, bounded
- Theorem [Rust ’97]:
- No curse of dimensionality!
- Why??
- Can we have a result for planning??

23 Planning [Sze ’01]
- Replace max_a with argmax_a in procedure RDM-estimate:
- Reduce the effect of unlucky samples by using a fresh sample set:
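Concretely, and only as a hedged sketch continuing the RDM code above (not the exact plan0/plan1 procedures of [Sze ’01]), the greedy action at a query state x reuses the same self-normalized weights, optionally computed on a fresh set of base points.

import numpy as np

# Greedy action extraction from an RDM value function V defined on base points X
# (p, r, actions, gamma as in rdm_value_iteration above; all assumed interfaces).
def rdm_greedy_action(p, r, actions, X, V, x, gamma):
    def q(a):
        w = np.array([p(X[j], x, a) for j in range(len(X))])
        w = w / w.sum()                      # self-normalized weights at the query state
        return r(x, a) + gamma * float(w @ V)
    return max(actions, key=q)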

24 Results for Planning
- p(y|x,a): Lipschitz continuous (constant L_p) and bounded (by K_p)
- r(x,a): bounded (by K_r)
- H(ε) = K_r / (ε(1-γ))
- Theorem [Sze ’01]: If N = poly(d, log|A|, H(ε), K_p, log(L_p), log(1/δ)), then
  - with probability 1-δ, the policy implemented by plan0 is ε-optimal;
  - with probability 1, the policy implemented by plan1 is ε-optimal.
- Improvements: dependence on log(L_p) not L_p; log|A| not |A|; no dependence on L_r!

25 A multiple-choice test..
- Why is there no curse of dimensionality for RDM?
  A. Randomization is the cure to everything
  B. The class of MDPs is too small
  C. The expected error is small, but the variance is huge
  D. The result does not hold for control
  E. The hidden constants blow up anyway
  F. Something else

26 Why no curse of dimensionality??
- RDM uses a computational model different from that of Chow and Tsitsiklis!
  - One is allowed to use p, r at the time of answering "V*(x) = ?, π*(x) = ?"
- Why does this help?
  - V^π = r^π + γ P^π Σ_t γ^t (P^π)^t r^π = r^π + γ P^π V^π
- This also explains why smoothness of the reward function is not required

27 Possible Improvements
- Reduce distribution mismatch
  - Once a good policy is computed, follow it to generate new points
  - How to do weighted importance sampling then??
    - Fit a distribution & generate samples from the fitted distribution(?)
  - Repeat Z times
- Decide adaptively when to stop adding new points

28 Planning with a Generative Model
Problem:
- Can generate transitions from anywhere
- Find a good policy!
- Be efficient!

29 Sampling based fitted value iteration
- Generative model
  - Cannot query p(y|x,a)
  - Can generate Y ~ p(·|x,a)
- Can we generalize RDM?
  - Option A: build a model
  - Option B: use function approximation to propagate values
- [Samuel, 1959], [Bellman and Dreyfus, 1959], [Reetz, 1977], [Keane and Wolpin, 1994], ..

30 Single-sample version [SzeMu ’05]

31 Multi-sample version [SzeMu ’05]
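The slide's pseudocode is not reproduced in this transcript; below is a hedged sketch of a multi-sample fitted value-iteration step: at each base point and action, M next states are drawn from the generative model, empirical Bellman targets are formed, and a regressor from a chosen function class F is fit to them. The interface sample(x, a) -> (reward, next_state) and the use of scikit-learn's KernelRidge as the regressor are assumptions of this sketch, not the choices made in [SzeMu ’05].

import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Hedged multi-sample fitted value iteration sketch.
# sample(x, a) -> (reward, next_state) is the assumed generative model;
# X: (N, d) array of base points, actions: list of actions.
def fitted_value_iteration(sample, actions, X, gamma, M=10, n_iter=50):
    N = len(X)
    model = None                                 # V_0 ≡ 0
    V = lambda pts: np.zeros(len(pts)) if model is None else model.predict(pts)
    for _ in range(n_iter):
        targets = np.empty(N)
        for i, x in enumerate(X):
            q = []
            for a in actions:
                draws = [sample(x, a) for _ in range(M)]      # M next-state samples
                rewards = np.array([r for r, _ in draws])
                nexts = np.array([y for _, y in draws])
                q.append(rewards.mean() + gamma * V(nexts).mean())
            targets[i] = max(q)                  # empirical Bellman optimality backup
        model = KernelRidge(kernel="rbf", alpha=1e-3).fit(X, targets)  # fit V_{k+1} in F
    return model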

32 Assumptions
- C(μ) = ||dP(·|x,a)/dμ||_∞ < +∞
  - μ uniform: dP/dμ = p(·|x,a), the density kernel
  - This was used by the previous results
  - Rules out deterministic systems and systems with jumps

33 Loss bound [SzeMu ’05]

34 The Bellman error of function sets
- The bound is in terms of the "distance" of the function sets F and TF:
  d(TF, F) = sup_{V ∈ F} inf_{f ∈ F} ||TV - f||_{p,μ}
  - the "Bellman error on F"
- F should be large to make d(TF, F) small
- If the MDP is "smooth", TV is smooth for any bounded(!) V
- Smooth functions can be well-approximated
- ⇒ Assume the MDP is smooth

35 Metric Entropy
- The bound depends on the metric entropy E = E(F)
  - Metric entropy: a 'capacity measure', similar to the VC-dimension
- Metric entropy increases with F!
- Previously we concluded that F should be big
- ??? → Smoothness → RKHS

36 RKHS Bounds
- Linear models (~RKHS): F = { wᵀφ : ||w||_1 ≤ A }
- [Zhang, ’02]: E(F) = O(log N)
  - This is independent of dim(φ)!
- Corollary: the sample complexity of FVI is polynomial for "sparse" MDPs
  - Cf. [Chow and Tsitsiklis ’89]
- Extension to control? Yes

37 Improvements
- Model selection
  - How to choose F? Choose as large an F as needed!
  - Regularization
  - Model selection
  - Aggregation
  - ..
- Place base-points better
  - Follow policies
  - No need to fit densities to them!

38 References
- [Ng & Jordan ’00] A.Y. Ng and M. Jordan: PEGASUS: A policy search method for large MDPs and POMDPs. UAI, 2000.
- [R. Bellman ’57] R. Bellman: Dynamic Programming. Princeton University Press, 1957.
- [Chow & Tsitsiklis ’89] C.S. Chow and J.N. Tsitsiklis: The complexity of dynamic programming. Journal of Complexity, 5:466–488, 1989.
- [Kearns et al. ’02] M.J. Kearns, Y. Mansour, and A.Y. Ng: A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49:193–208, 2002.
- [KoSze ’06] L. Kocsis and Cs. Szepesvári: Bandit based Monte-Carlo planning. ECML, 2006.
- [Auer et al. ’02] P. Auer, N. Cesa-Bianchi, and P. Fischer: Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
- [Peret and Garcia ’04] L. Peret and F. Garcia: On-line search for solving Markov decision processes via heuristic sampling. ECAI, 2004.
- [Chang et al. ’05] H.S. Chang, M. Fu, J. Hu, and S.I. Marcus: An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53:126–139, 2005.
- [Rust ’97] J. Rust: Using randomization to break the curse of dimensionality. Econometrica, 65:487–516, 1997.
- [Sze ’01] Cs. Szepesvári: Efficient approximate planning in continuous space Markovian decision problems. AI Communications, 13, 2001.
- [SzeMu ’05] Cs. Szepesvári and R. Munos: Finite time bounds for sampling based fitted value iteration. ICML, 2005.
- [Zhang ’02] T. Zhang: Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.