1
Monte-Carlo Methods
Learning methods that average complete episodic returns.
Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]
2
Differences with DP/TD
Differences with DP methods:
– Real RL: a complete transition model is not necessary. MC methods sample experience and can be used for direct learning.
– They do not bootstrap: no evaluation of successor states.
Differences with TD methods:
– Well, they do not bootstrap – they average complete episodic returns (see the sketch below contrasting the two updates).
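To make the "no bootstrapping" point concrete, here is an illustrative sketch (not from the original slides) contrasting a one-step TD(0) update, whose target uses the estimated value of the successor state, with a Monte-Carlo update, whose target is the complete episodic return:

```python
# Hypothetical tabular value estimates V (e.g. a dict mapping state -> float).

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    # TD bootstraps: the target r + gamma * V[s_next] relies on the current
    # estimate of the successor state's value.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def mc_update(V, s, G, alpha=0.1):
    # Monte-Carlo does not bootstrap: the target is the complete return G
    # observed from s to the end of the episode.
    V[s] += alpha * (G - V[s])
```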
3
Overview and Advantages
Learn from experience – sample episodes:
– Sample sequences of states, actions, and rewards.
– Either on-line, or from simulated (model-based) interaction with the environment. No complete model is required.
Advantages:
– Provably learn the optimal policy without a model.
– Can be used with sample / easy-to-produce models.
– Can easily focus on interesting regions of the state space.
– More robust with respect to violations of the Markov property.
4
Policy Evaluation
5
Action-value functions required
Without a model, we need Q-value estimates: MC methods now average the returns that follow visits to state-action pairs.
All such pairs "need" to be visited – sufficient exploration is required:
– Randomize episode starts ("exploring starts"),
– …or behave using a stochastic (e.g. ε-greedy) policy,
– …thus "Monte-Carlo". A minimal sketch of this averaging follows the list.
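A minimal sketch of first-visit Monte-Carlo estimation of Q from a batch of already-collected episodes; the episode format and names are assumptions for illustration, not taken from the slides:

```python
from collections import defaultdict

def mc_q_estimate(episodes, gamma=1.0):
    """First-visit Monte-Carlo estimation of Q(s, a).

    `episodes` is assumed to be a list of episodes, each a list of
    (state, action, reward) tuples generated by following some policy.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    Q = defaultdict(float)

    for episode in episodes:
        # Walk the episode backwards, recording the return from each time step.
        G, returns_at = 0.0, {}
        for t in reversed(range(len(episode))):
            _, _, reward = episode[t]
            G = gamma * G + reward
            returns_at[t] = G
        # Average the return that follows the *first* visit to each (s, a).
        seen = set()
        for t, (state, action, _) in enumerate(episode):
            if (state, action) in seen:
                continue
            seen.add((state, action))
            returns_sum[(state, action)] += returns_at[t]
            returns_count[(state, action)] += 1
            Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]
    return Q
```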
6
Monte-Carlo Control (to generate an optimal policy)
For now, assume "exploring starts". Does "policy iteration" work? – Yes!
Evaluation of each policy is carried out over multiple episodes, and improvement makes the policy greedy with respect to the current Q-value function.
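In symbols, this is the usual generalized policy iteration pattern, here written as a sketch of the alternation the slide describes, with E denoting Monte-Carlo evaluation and I greedy improvement:

```latex
% Monte-Carlo evaluation (E) alternating with greedy improvement (I):
\pi_0 \xrightarrow{E} Q^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} Q^{\pi_1}
      \xrightarrow{I} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} Q^*,
\qquad \pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s,a).
```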
7
Monte-Carlo Control (to generate an optimal policy)
Why? Suppose $\pi'$ is greedy with respect to $Q^{\pi}$. Then the policy-improvement theorem applies because, for all $s$:
$Q^{\pi}(s, \pi'(s)) = \max_a Q^{\pi}(s, a) \ge Q^{\pi}(s, \pi(s)) = V^{\pi}(s)$.
Thus $\pi'$ is uniformly better than $\pi$.
8
A Monte-Carlo control algorithm
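The algorithm on this slide is not reproduced in the transcript; the following is a minimal sketch of Monte-Carlo control with exploring starts (first-visit, sample averages), assuming a hypothetical episodic environment with `reset(state)` and `step(action)` methods:

```python
import random
from collections import defaultdict

def mc_control_exploring_starts(env, states, actions, num_episodes, gamma=1.0):
    """Monte-Carlo control with exploring starts (first-visit, sample averages).

    `env` is a hypothetical episodic environment: env.reset(s) puts it in
    state s and returns s; env.step(a) returns (next_state, reward, done).
    """
    Q = defaultdict(float)            # Q[(s, a)] estimates
    counts = defaultdict(int)         # number of returns averaged per (s, a)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: a random state-action pair begins the episode.
        state = env.reset(random.choice(states))
        action = random.choice(actions)
        episode, done = [], False
        while not done:
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            if not done:
                action = policy[state]      # then follow the current policy

        # Evaluation + improvement from this episode (first visits only).
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r               # return following time t
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]   # running mean
                # Improvement: make the policy greedy w.r.t. the current Q.
                policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return policy, Q
```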
9
ε-greedy Exploration
What about ε-greedy policies? If an action is not the greedy one, select it with probability $\varepsilon / |A(s)|$; otherwise (the greedy action), select it with probability $1 - \varepsilon + \varepsilon / |A(s)|$.
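A minimal sketch of sampling actions from such an ε-greedy policy (the names here are illustrative assumptions):

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Greedy action with probability 1 - eps + eps/|A(s)|,
    each non-greedy action with probability eps/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore: uniform over A(s)
    return max(actions, key=lambda a: Q[(state, a)])         # exploit: greedy action
```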
10
Yes, policy iteration works
See the details in the book. ε-soft on-policy algorithm:
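The algorithm itself is not included in the transcript; below is a minimal sketch of on-policy first-visit MC control with an ε-greedy (hence ε-soft) policy, reusing the hypothetical environment interface assumed in the earlier sketches:

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, actions, num_episodes, epsilon=0.1, gamma=1.0):
    """On-policy first-visit MC control with an epsilon-greedy (epsilon-soft) policy.

    `env` is a hypothetical episodic environment with reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)
    counts = defaultdict(int)

    def behave(state):
        # epsilon-soft behavior: every action keeps probability >= eps/|A(s)|.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        # Generate one episode following the current epsilon-greedy policy.
        state, done, episode = env.reset(), False, []
        while not done:
            action = behave(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit updates: average the return following each first visit.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                counts[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
    return Q
```

Policy improvement is implicit here: `behave` is always ε-greedy with respect to the latest Q estimates.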
11
…and you can have off-policy learning as well… Why?
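The slide does not reproduce the details; in the book, off-policy MC evaluation reweights returns generated by a behavior policy b so that they estimate values under a target policy π, via importance sampling. A minimal sketch of an ordinary importance-sampling estimate of the start-state value, with assumed names and interfaces:

```python
def off_policy_mc_value(episodes, target_prob, behavior_prob, gamma=1.0):
    """Ordinary importance-sampling estimate of V_pi(s0) for the start state.

    `episodes` are lists of (state, action, reward) generated by a behavior
    policy; target_prob(s, a) and behavior_prob(s, a) give pi(a|s) and b(a|s).
    """
    total, n = 0.0, 0
    for episode in episodes:
        G, rho = 0.0, 1.0
        for t, (s, a, r) in enumerate(episode):
            rho *= target_prob(s, a) / behavior_prob(s, a)   # importance ratio
            G += (gamma ** t) * r                            # discounted return
        total += rho * G
        n += 1
    return total / n if n else 0.0
```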