
1 Monte-Carlo Methods
Learning methods that average complete episodic returns
Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]
Slides prepared by Georgios Chalkiadakis

2 Differences with DP/TD
Differences with DP methods:
– Real RL: a complete transition model is not necessary; MC methods sample experience and can be used for direct learning
– They do not bootstrap: no evaluation of successor states
Differences with TD methods:
– They still do not bootstrap; instead of updating from estimated successor values, they average complete episodic returns (see the sketch below)
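
To make the bootstrapping distinction concrete, here is a minimal sketch (not taken from the slides; the function names and episode format are illustrative assumptions) contrasting the Monte-Carlo target, a complete episodic return, with the one-step bootstrapped TD(0) target.

# Sketch: Monte-Carlo vs. TD(0) update targets for a state-value estimate.

def mc_target(rewards, gamma=0.9):
    """Full return G_t = r_{t+1} + gamma*r_{t+2} + ... observed until the episode ends."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, next_state_value, gamma=0.9):
    """One-step bootstrapped target r_{t+1} + gamma * V(s_{t+1})."""
    return reward + gamma * next_state_value

# Example with made-up numbers: MC needs the whole tail of observed rewards,
# TD(0) only needs the next reward plus an estimate of the successor state's value.
print(mc_target([1.0, 0.0, 2.0]))
print(td_target(1.0, next_state_value=1.5))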

3 Overview and Advantages
Learn from experience, i.e., from sample episodes:
– Sample sequences of states, actions, and rewards
– Either on-line, or from simulated (model-based) interactions with the environment; no complete model is required
Advantages:
– Provably learn the optimal policy without a model
– Can be used with sample / easy-to-produce models
– Can easily focus on interesting regions of the state space
– More robust with respect to violations of the Markov property

4 Policy Evaluation
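
The body of this slide did not survive the transcript. As a reading aid, here is a minimal first-visit Monte-Carlo policy-evaluation sketch in the spirit of Sutton & Barto: estimate the value of a fixed policy by averaging the returns that follow the first visit to each state in sampled episodes. The generate_episode callable and the (state, reward) episode format are assumptions for illustration.

from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, num_episodes, gamma=0.9):
    """First-visit Monte-Carlo policy evaluation (sketch).

    generate_episode() is assumed to return a list of (state, reward) pairs,
    where the pair at step t holds s_t and the reward r_{t+1} received after
    acting from s_t under the policy being evaluated.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        g = 0.0
        first_visit_return = {}
        # Scan backwards; earlier visits overwrite later ones, leaving the
        # return that follows each state's FIRST visit.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, g_s in first_visit_return.items():
            returns_sum[state] += g_s
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V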

5 Action-value functions required
Without a model, we need Q-value estimates
MC methods now average returns following visits to state-action pairs
All such pairs “need” to be visited, so sufficient exploration is required:
– Randomize episode starts (“exploring starts”, sketched below)
– …or behave using a stochastic (e.g. ε-greedy) policy
– …hence “Monte-Carlo”
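
A minimal sketch of the “exploring starts” idea (the names and the uniform choice over hand-listed states and actions are illustrative assumptions): every state-action pair has a nonzero probability of starting an episode, so all pairs keep being visited.

import random

def exploring_start(states, actions):
    """Pick the episode's first state and first action uniformly at random,
    so every (state, action) pair starts an episode with nonzero probability."""
    return random.choice(states), random.choice(actions)

# Example with hypothetical labels.
print(exploring_start(states=["s1", "s2", "s3"], actions=["left", "right"]))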

6 Monte-Carlo Control (to generate an optimal policy)
For now, assume “exploring starts”
Does “policy iteration” work? – Yes!
– Evaluation of each policy is carried out over multiple episodes
– Improvement: make the policy greedy with respect to the current Q-value function (see the sketch below)
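
A minimal sketch of the improvement step (names are illustrative assumptions): the improved policy simply picks, in every state, the action that maximizes the current Q-value estimates.

def greedy_improvement(Q, states, actions):
    """Return a deterministic policy that is greedy w.r.t. the current estimates:
    pi(s) = argmax_a Q(s, a)."""
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}

# Example with made-up Q-values.
Q = {("s1", "left"): 0.2, ("s1", "right"): 0.7}
print(greedy_improvement(Q, states=["s1"], actions=["left", "right"]))  # {'s1': 'right'}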

7 Monte-Carlo Control (to generate an optimal policy)
Why does this work? Suppose $\pi'$ is greedy with respect to $Q^{\pi}$. Then the policy-improvement theorem applies because, for all $s$:
$Q^{\pi}(s, \pi'(s)) = \max_a Q^{\pi}(s, a) \ge Q^{\pi}(s, \pi(s)) = V^{\pi}(s)$
Thus $\pi'$ is uniformly better than (or as good as) $\pi$.
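
For completeness, the standard policy-improvement chain from Sutton & Barto, written out here as a reading aid (it is not reproduced from the slide itself):

\begin{align*}
V^{\pi}(s) &\le Q^{\pi}(s, \pi'(s)) \\
  &= \mathbb{E}\left[ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s,\, a_t = \pi'(s) \right] \\
  &\le \mathbb{E}\left[ r_{t+1} + \gamma Q^{\pi}(s_{t+1}, \pi'(s_{t+1})) \mid s_t = s,\, a_t = \pi'(s) \right] \\
  &\le \dots \le \mathbb{E}_{\pi'}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s_t = s \right] = V^{\pi'}(s).
\end{align*}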

8 A Monte-Carlo control algorithm
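
The algorithm on this slide did not survive the transcript. Given that the preceding slides assume exploring starts, here is a hedged sketch in the spirit of Monte-Carlo control with exploring starts (Monte Carlo ES) from Sutton & Barto; the environment interface (reset_to, step) and all names are assumptions for illustration.

import random
from collections import defaultdict

def monte_carlo_es(env, states, actions, num_episodes, gamma=0.9):
    """Monte-Carlo control with exploring starts (sketch).

    env is assumed to expose:
      env.reset_to(state, action) -> (next_state, reward, done) after forcing the
                                     initial state-action pair
      env.step(state, action)     -> (next_state, reward, done)
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: random first state-action pair.
        s, a = random.choice(states), random.choice(actions)
        next_s, r, done = env.reset_to(s, a)
        episode = [(s, a, r)]
        s = next_s
        while not done:
            a = policy[s]
            next_s, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = next_s

        # Returns following the first visit to each (state, action) pair.
        g = 0.0
        first_visit = {}
        for s_t, a_t, r_t in reversed(episode):
            g = r_t + gamma * g
            first_visit[(s_t, a_t)] = g  # earlier visits overwrite later ones

        # Evaluation: update Q by averaging; improvement: make the policy greedy.
        for (s_t, a_t), g_sa in first_visit.items():
            returns_sum[(s_t, a_t)] += g_sa
            returns_count[(s_t, a_t)] += 1
            Q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / returns_count[(s_t, a_t)]
            policy[s_t] = max(actions, key=lambda act: Q[(s_t, act)])

    return policy, Q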

9 ε-greedy Exploration
What about ε-greedy policies?
If an action is not the greedy one, it is selected with probability ε / |A(s)|
Otherwise (the greedy action): it is selected with probability 1 − ε + ε / |A(s)|
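
A minimal ε-greedy selection sketch matching the probabilities above (the names are illustrative assumptions):

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick uniformly among all actions, otherwise pick the
    greedy one; each non-greedy action is chosen with probability epsilon/|A(s)| and
    the greedy action with probability 1 - epsilon + epsilon/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))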

10 Yes, policy iteration works
See the details in the book
ε-soft on-policy algorithm (sketched below):
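
The algorithm body is not in the transcript; here is a hedged sketch in the spirit of the book's on-policy first-visit MC control algorithm for ε-soft policies. The generate_episode(policy) callable and the probability-table representation of the policy are assumptions for illustration.

from collections import defaultdict

def on_policy_mc_control(generate_episode, states, actions,
                         num_episodes, gamma=0.9, epsilon=0.1):
    """On-policy first-visit MC control for epsilon-soft policies (sketch).

    generate_episode(policy) is assumed to return a list of (state, action, reward)
    triples sampled by following the stochastic policy, where policy[s] maps each
    action to its selection probability.
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    # Start from the uniform (hence epsilon-soft) policy.
    policy = {s: {a: 1.0 / len(actions) for a in actions} for s in states}

    for _ in range(num_episodes):
        episode = generate_episode(policy)

        g = 0.0
        first_visit = {}
        for s, a, r in reversed(episode):
            g = r + gamma * g
            first_visit[(s, a)] = g  # earlier visits overwrite later ones

        for (s, a), g_sa in first_visit.items():
            returns_sum[(s, a)] += g_sa
            returns_count[(s, a)] += 1
            Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
            # Keep the policy epsilon-soft: the greedy action gets the extra mass.
            best = max(actions, key=lambda act: Q[(s, act)])
            for act in actions:
                policy[s][act] = epsilon / len(actions)
            policy[s][best] += 1.0 - epsilon

    return policy, Q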

11 …and you can have off-policy learning as well… Why?
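
The slide's answer is not in the transcript. In Sutton & Barto, off-policy Monte-Carlo methods evaluate a target policy from episodes generated by a different (soft) behaviour policy by weighting the observed returns with importance-sampling ratios; the sketch below follows that idea using weighted importance sampling. The episode format and the probability-table policies are assumptions for illustration.

from collections import defaultdict

def off_policy_mc_prediction(episodes, target_policy, behaviour_policy, gamma=0.9):
    """Off-policy MC prediction of Q for the target policy via weighted
    importance sampling over behaviour-policy episodes (sketch).

    Each episode is a list of (state, action, reward) triples; both policies map
    a state to a dict of action -> selection probability.
    """
    Q = defaultdict(float)
    C = defaultdict(float)  # cumulative importance-sampling weights

    for episode in episodes:
        g = 0.0
        w = 1.0  # importance-sampling ratio, accumulated backwards through the episode
        for s, a, r in reversed(episode):
            g = r + gamma * g
            C[(s, a)] += w
            # Incremental weighted-average update of Q towards the return g.
            Q[(s, a)] += (w / C[(s, a)]) * (g - Q[(s, a)])
            w *= target_policy[s][a] / behaviour_policy[s][a]
            if w == 0.0:
                break  # the rest of the episode has zero weight under the target policy
    return Q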


Download ppt "Monte-Carlo Methods Learning methods averaging complete episodic returns Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]"

Similar presentations


Ads by Google