1 Fitted/batch/model-based RL: A (sketchy, biased) overview(?) Csaba Szepesvári, University of Alberta
2 Contents
- What, why?
- Constraints
- How?
  - Model-based learning: model learning, planning
  - Model-free learning: averagers, fitted RL
3 Motto
"Nothing is more practical than a good theory." [Lewin]
"He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast." [Leonardo da Vinci]
4 What? Why?
What is batch RL?
- Input: samples (the algorithm cannot influence the samples)
- Output: a good policy
Why?
- Common problem
- Sample efficiency -- data is expensive
- Building block
Why not?
- Too much work (for nothing?) -- "Don't worry, be lazy!"
- Old samples are irrelevant
- Missed opportunities (evaluate a policy!?)
5 Constraints
- Large (infinite) state/action spaces
- Limits on computation
- Limits on memory use
6 How?
- Model learning + planning
- Model-free
  - Policy search
  - DP: policy iteration, value iteration
7 Model-based learning
8 Model learning
9 Model-based methods
Model learning: how?
- Model: what happens if..?
- Features vs. observations vs. states
- System identification? (Satinder! Carlos! Eric! ...)
Planning: how?
- Sample + learning! (batch RL?.. but you can influence the samples)
- What else? (Discretize? Nay..)
Pro: a model is good for multiple things.
Contra: the problem is doubled -- we need high-fidelity models and good planning.
Problem 1: Should planning take the uncertainties in the model into account? ("robustification")
Problem 2: How to learn relevant, compact models? For example: how to reject irrelevant features and keep the relevant ones?
Need: tight integration of planning and learning! (A minimal sketch of the learn-then-plan recipe follows below.)
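As a concrete reference point for the learn-then-plan recipe above, here is a minimal tabular sketch: estimate transition probabilities and mean rewards from a fixed batch of (s, a, r, s') transitions, then plan in the learned model with value iteration. Everything (function names, the zero-filled treatment of unvisited pairs) is illustrative, not the speaker's method; note how the handling of unvisited pairs is exactly where "Problem 1" (uncertainty in the model) shows up.

```python
import numpy as np

def learn_tabular_model(batch, n_states, n_actions):
    """Estimate P(s'|s,a) and mean rewards from (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s_next in batch:
        counts[s, a, s_next] += 1
        rewards[s, a] += r
        visits[s, a] += 1
    visits = np.maximum(visits, 1)     # unvisited (s, a) pairs stay at zero: an arbitrary choice
    P = counts / visits[:, :, None]
    R = rewards / visits
    return P, R

def plan_value_iteration(P, R, gamma=0.95, n_iters=500):
    """Plan in the learned model with standard value iteration; return the greedy policy."""
    V = np.zeros(P.shape[0])
    for _ in range(n_iters):
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)
```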
10 Planning
11 Bad news..
Theorem (Chow & Tsitsiklis '89): For Markovian decision problems with a d-dimensional state space, bounded transition probabilities and rewards, and Lipschitz-continuous transition probabilities and rewards, any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^{-d}) values of p and r.
What's next then??
Open: Policy approximation?
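In symbols, as I read the slide, the claim is that the number N(ε) of evaluations of p and r any algorithm must make to guarantee an ε-accurate value function grows at least like ε^{-d}:

```latex
% Lower bound of Chow & Tsitsiklis '89, restated from the slide:
% any algorithm whose output \hat V meets the accuracy guarantee on the left
% must query p and r at least N(\varepsilon) times.
\|\hat V - V^*\|_\infty \le \varepsilon
\quad\Longrightarrow\quad
N(\varepsilon) \;=\; \Omega\!\left(\varepsilon^{-d}\right)
```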
12 The joy of laziness
Don't worry, be lazy: "If something is too hard to do, then it's not worth doing."
Luckiness factor: "If you really want something in this life, you have to work for it -- now quiet, they're about to announce the lottery numbers!"
13 Sparse lookahead trees [Kearns et al., '02]
Idea: computing a good action = planning; build a lookahead tree.
Size of the tree: S = c·|A|^{H(ε)} (unavoidable), where H(ε) = K_r/(ε(1-γ)).
Good news: S is independent of d!
Bad news: S is exponential in H(ε).
Still attractive: generic, easy to implement (see the sketch below).
Problem: not really practical.
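A minimal sketch of the sparse-sampling lookahead, assuming access to a generative model `simulate(s, a) -> (reward, next_state)`. The constants C (sampled successors per action) and H (depth) are illustrative, and the running time is exactly the |A|^H-type blow-up the slide complains about.

```python
def sparse_sampling_value(s, depth, simulate, actions, C=20, gamma=0.95):
    """Estimate V(s) with a sparse lookahead tree of the given depth."""
    if depth == 0:
        return 0.0
    q_values = []
    for a in actions:
        total = 0.0
        for _ in range(C):                      # C sampled successors per action
            r, s_next = simulate(s, a)
            total += r + gamma * sparse_sampling_value(s_next, depth - 1,
                                                       simulate, actions, C, gamma)
        q_values.append(total / C)
    return max(q_values)

def sparse_sampling_action(s, simulate, actions, H=3, C=20, gamma=0.95):
    """Pick the root action with the highest sampled Q-value estimate."""
    def q(a):
        total = 0.0
        for _ in range(C):
            r, s_next = simulate(s, a)
            total += r + gamma * sparse_sampling_value(s_next, H - 1,
                                                       simulate, actions, C, gamma)
        return total / C
    return max(actions, key=q)
```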
14 Idea..
Be more lazy: need to propagate values from good leaves as early as possible.
Why sample suboptimal actions at all? Breadth-first -> depth-first!
Bandit algorithms, Upper Confidence Bounds: UCT [KoSze '06] (Remi!)
Similar ideas: [Peret and Garcia, '04], [Chang et al., '05]
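A minimal sketch of the bandit-based step at a UCT tree node: UCB1 scores trade off the current mean return against an exploration bonus, and sampled returns are backed up as incremental means. A full UCT planner also grows the tree node by node and runs rollouts below it; all of that is omitted here, and the class layout is an assumption.

```python
import math
import random

class UCTNode:
    def __init__(self, actions):
        self.n = 0                              # visits to this node
        self.n_a = {a: 0 for a in actions}      # visits per action
        self.q_a = {a: 0.0 for a in actions}    # mean return per action

    def select_action(self, c=1.4):
        """UCB1: favor actions with high mean return plus an exploration bonus."""
        untried = [a for a, n in self.n_a.items() if n == 0]
        if untried:
            return random.choice(untried)
        return max(self.n_a, key=lambda a: self.q_a[a]
                   + c * math.sqrt(math.log(self.n) / self.n_a[a]))

    def update(self, a, ret):
        """Back up a sampled return for action a (incremental mean)."""
        self.n += 1
        self.n_a[a] += 1
        self.q_a[a] += (ret - self.q_a[a]) / self.n_a[a]
```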
15 Results: Sailing
'Sailing': stochastic shortest path; state-space size = 24 * problem-size.
Extension to two-player, full-information games: good results in Go! (Remi, David!)
Open: Why (when) does UCT work so well?
Conjecture: when being (very) optimistic does not abuse search.
How to improve UCT?
16 Random Discretization Method [Rust '97]
Method:
- Random base points; the value function is computed at these points (weighted importance sampling).
- Values at other points are computed at run time ("half-lazy method").
- Why Monte Carlo? Avoid grids!
Result:
- State space: [0,1]^d; action space: finite; p(y|x,a), r(x,a) Lipschitz continuous, bounded.
- Theorem [Rust '97]. Theorem [Sze '01]: poly samples are enough to come up with ε-optimal actions (poly dependence on H); smoothness of the value function is not required.
Open: Can we improve the result by changing the distribution of samples? Idea: presample + follow the obtained policy.
Open: Can we get poly dependence on both d and H without representing a value function? (e.g., lookahead trees)
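A minimal sketch of the random-discretization backup under the assumptions listed above: the density p(y|x,a) and reward r(x,a) can be evaluated pointwise and the action set is a finite list. The self-normalized weights are one reading of the "weighted importance sampling" step; evaluating the value at a fresh point at run time would reuse the same one-step backup (the "half-lazy" part). All names and constants are illustrative.

```python
import numpy as np

def random_discretization_vi(p, r, actions, d, n_points=1000,
                             gamma=0.95, n_iters=200, rng=None):
    """Value iteration on random base points in [0,1]^d, Rust-'97 style.

    p(y, x, a): transition density, r(x, a): reward; both evaluable pointwise.
    """
    rng = np.random.default_rng(rng)
    X = rng.random((n_points, d))               # random base points
    # Precompute self-normalized weight matrices and rewards, one per action.
    W, R = [], []
    for a in actions:
        Wa = np.array([[p(y, x, a) for y in X] for x in X])   # row: from x, column: to y
        W.append(Wa / np.maximum(Wa.sum(axis=1, keepdims=True), 1e-12))
        R.append(np.array([r(x, a) for x in X]))
    V = np.zeros(n_points)
    for _ in range(n_iters):
        Q = np.column_stack([R_a + gamma * W_a @ V for R_a, W_a in zip(R, W)])
        V = Q.max(axis=1)                        # backed-up values at the base points
    return X, V
```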
17 Pegasus [Ng & Jordan '00]
Idea: policy search + the method of common random numbers ("scenarios").
Results (condition: a deterministic simulative model):
- Thm: finite action space, finite-complexity policy class => polynomial sample complexity.
- Thm: infinite action spaces, Lipschitz continuity of transition probabilities + rewards => polynomial sample complexity.
- Thm: finitely computable models + policies => polynomial sample complexity.
Pro: nice results. Contra: global search? What policy space?
Problem 1: How to avoid global search?
Problem 2: When can we find a good policy efficiently? How?
Problem 3: How to choose the policy class?
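A minimal sketch of the common-random-numbers trick: the simulator's randomness is exposed as pre-drawn noise, every candidate policy is evaluated on the same fixed scenarios, and the estimated value becomes a deterministic function of the policy. `simulate_deterministic`, the noise shape, and the horizon are assumptions, not the paper's interface.

```python
import numpy as np

def make_pegasus_objective(simulate_deterministic, init_state, n_scenarios=30,
                           horizon=100, gamma=0.95, seed=0):
    """Build a deterministic policy-value estimate from fixed 'scenarios'.

    simulate_deterministic(s, a, u) -> (reward, next_state): all randomness of the
    simulator is exposed through the pre-drawn noise variable u.
    """
    rng = np.random.default_rng(seed)
    scenarios = rng.random((n_scenarios, horizon))   # the common random numbers

    def value(policy):
        returns = []
        for noise in scenarios:                      # same scenarios for every policy
            s, ret, discount = init_state, 0.0, 1.0
            for u in noise:
                a = policy(s)
                r, s = simulate_deterministic(s, a, u)
                ret += discount * r
                discount *= gamma
            returns.append(ret)
        return float(np.mean(returns))

    return value
```

The returned objective can then be handed to any search procedure over the policy class, which is exactly where the slide's questions about global search and the choice of policy class enter.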
18 Other planning methods
Your favorite RL method! (+) Planning is easier than learning: you can reset the state!
Dyna-style planning with prioritized sweeping (Rich!) -- see the sketch below.
Conservative policy iteration. Problem: policy search, guaranteed improvement in every iteration.
- [K&L '00]: bound for finite MDPs, policy class = all policies.
- [K '03]: arbitrary policies, reduction-style result.
Policy search by DP [Bagnell, Kakade, Ng & Schneider '03]: similar to [K '03], finite-horizon problems.
Fitted value iteration..
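The slide only name-drops Dyna-style planning with prioritized sweeping, so here is a minimal tabular sketch of that planning loop, assuming a learned deterministic model stored as `model[(s, a)] = (r, s_next)`, a predecessor map, and a dict-of-dicts Q-table; all names and thresholds are illustrative.

```python
import heapq
import itertools

def prioritized_sweeping(model, predecessors, Q, gamma=0.95, theta=1e-4, n_updates=1000):
    """Dyna-style planning: repeatedly back up the (s, a) pair with the largest Bellman error."""
    tie = itertools.count()                     # tie-breaker so heap never compares states
    pq = []

    def bellman_error(s, a):
        r, s_next = model[(s, a)]
        return abs(r + gamma * max(Q[s_next].values(), default=0.0) - Q[s][a])

    for (s, a) in model:                        # seed the queue with all modeled pairs
        err = bellman_error(s, a)
        if err > theta:
            heapq.heappush(pq, (-err, next(tie), s, a))

    for _ in range(n_updates):
        if not pq:
            break
        _, _, s, a = heapq.heappop(pq)
        r, s_next = model[(s, a)]
        Q[s][a] = r + gamma * max(Q[s_next].values(), default=0.0)
        for s_prev, a_prev in predecessors.get(s, ()):   # re-prioritize predecessors of s
            err = bellman_error(s_prev, a_prev)
            if err > theta:
                heapq.heappush(pq, (-err, next(tie), s_prev, a_prev))
    return Q
```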
19 Model-free: Policy Search ???? Open: How to do it?? (I am serious) Open: How to evaluate a policy/policy gradient given some samples? (partial result: In the limit, under some conditions, policies can be evaluated [AnSzeMu’08])
20 Model-free: Dynamic Programming
Policy iteration:
- How to evaluate policies?
- Do good value functions give rise to good policies?
Value iteration:
- Use action-value functions.
- How to represent value functions? How to do the updates?
21 Value-function based methods
Questions: What representation to use? How are errors propagated?
Averagers [Gordon '95] ~ kernel methods: V_{t+1} = Π_F T V_t (see the sketch below).
L∞ theory. Can we have an L2 (Lp) theory?
Counterexamples [Boyan & Moore '95, Baird '95, BeTsi '96].
L2 error propagation [Munos '03, '05].
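A minimal sketch of one way to instantiate V_{t+1} = Π_F T V_t with an averager: the value function is represented by its values at a fixed set of base points, and evaluation anywhere else is a normalized (hence non-expansive) kernel average over those points. The generative-batch data layout, the Gaussian kernel, and all constants are assumptions.

```python
import numpy as np

def averager_fitted_vi(X, R, X_next, gamma=0.95, bandwidth=0.2, n_iters=100):
    """Fitted value iteration with a kernel averager: V_{t+1} = Pi_F T V_t.

    X: (n, d) sampled states; R: (n, k) rewards and X_next: (n, k, d) next states
    for each of the k actions at every sampled state (a 'generative' batch).
    """
    n, k, d = X_next.shape
    V = np.zeros(n)

    def smoother_weights(queries):
        """Normalized Gaussian-kernel weights of each query point w.r.t. the base points X."""
        d2 = ((queries[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * bandwidth ** 2))
        return W / W.sum(axis=1, keepdims=True)    # rows sum to 1: an 'averager'

    W_next = smoother_weights(X_next.reshape(-1, d)).reshape(n, k, n)
    for _ in range(n_iters):
        Q = R + gamma * W_next @ V                 # (n, k) backed-up action values
        V = Q.max(axis=1)                          # greedy backup at the base points
    return V
```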
22 Fitted methods
Idea: use regression/classification with value/policy iteration.
Notable examples:
- Fitted Q-iteration: use trees (averagers; Damien!) or neural nets (L2; Martin!) -- see the sketch below.
- Policy iteration: LSTD [Bradtke & Barto '96, Boyan '99], BRM [AnSzeMu '06, '08]; LSPI: use action-value functions + iterate [Lagoudakis & Parr '01, '03].
- RL as classification [La & Pa '03].
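A minimal sketch of tree-based fitted Q-iteration in the spirit of the slide, using scikit-learn's ExtraTreesRegressor as the fitter; the data layout (states as 1-D feature vectors, scalar actions appended as an extra input) and the hyperparameters are assumptions, not any particular paper's settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(batch, actions, gamma=0.95, n_iters=50):
    """Fitted Q-iteration on a fixed batch of (s, a, r, s') transitions."""
    S = np.array([t[0] for t in batch])
    A = np.array([t[1] for t in batch]).reshape(-1, 1)
    R = np.array([t[2] for t in batch])
    S_next = np.array([t[3] for t in batch])
    X = np.hstack([S, A])                       # regressor input: (state, action)

    q = None
    for _ in range(n_iters):
        if q is None:
            y = R                               # first iteration: Q_1 = r
        else:
            # max over actions of the previous iterate, evaluated at the next states
            q_next = np.column_stack([
                q.predict(np.hstack([S_next, np.full((len(S_next), 1), a)]))
                for a in actions])
            y = R + gamma * q_next.max(axis=1)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, y)   # refit on the same batch

    def policy(s):
        return max(actions, key=lambda a: q.predict(np.hstack([s, [a]])[None, :])[0])
    return policy
```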
23 Results for fitted algorithms
Results for LSPI/BRM-PI and FQI (finite action space, continuous state space):
- Smoothness conditions on the MDP; representative training set.
- Function class F is large (the Bellman error of F is small) but of controlled complexity.
- Polynomial rates (similar to supervised learning).
FQI with continuous action spaces: similar conditions + a restricted policy class; polynomial rates, but bad scaling with the dimension of the action space. [AnSzeMu '06-'08]
Open: How to choose the function space in an adaptive way? (~ model selection in supervised learning)
Supervised learning does not work without model selection -- why would RL? (It does not.)
Idea: regularize! (A sketch of a regularized variant follows below.)
Problem: How to evaluate policies?
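As a concrete handle on "regularize", here is a minimal sketch of L2-regularized LSTD for policy evaluation with linear features: a ridge penalty added to the LSTD fixed-point equations. The feature map, the data layout, and λ are assumptions; choosing λ is exactly the model-selection question the slide raises.

```python
import numpy as np

def regularized_lstd(Phi, R, Phi_next, gamma=0.95, lam=1e-2):
    """L2-regularized LSTD: solve (Phi^T (Phi - gamma*Phi') + lam*I) w = Phi^T r.

    Phi: (n, k) features of visited states, Phi_next: (n, k) features of their
    successors under the policy being evaluated, R: (n,) observed rewards.
    """
    k = Phi.shape[1]
    A = Phi.T @ (Phi - gamma * Phi_next) + lam * np.eye(k)
    b = Phi.T @ R
    return np.linalg.solve(A, b)                # value estimate: V(s) ~ Phi(s) @ w
```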
24 Regularization
25 Final thoughts
Batch RL: a flourishing area.
Many open questions; more results should (!) come soon.
Some good results in practice.
Take computation cost seriously?
Connect to on-line RL?
26 Batch RL
Let's switch to that policy -- after all, the paper says that learning converges at an optimal rate!