
1 Monte-Carlo Methods
Learning methods that average complete episodic returns
Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]
Slides prepared by Georgios Chalkiadakis

2 Differences with DP/TD
Differences with DP methods:
– Real RL: a complete transition model is not necessary; MC methods sample experience and can be used for direct learning
– They do not bootstrap: no evaluation of successor states
Differences with TD methods:
– They still do not bootstrap; instead of updating from estimated successor values, they average complete episodic returns (see the sketch below)
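
To make the bootstrapping distinction concrete, here is a minimal sketch (not taken from the slides; the function names and episode format are illustrative assumptions) contrasting the Monte-Carlo target, a complete episodic return, with the one-step bootstrapped TD(0) target.

# Sketch: Monte-Carlo vs. TD(0) update targets for a state-value estimate.

def mc_target(rewards, gamma=0.9):
    """Full return G_t = r_{t+1} + gamma*r_{t+2} + ... observed until the episode ends."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, next_state_value, gamma=0.9):
    """One-step bootstrapped target r_{t+1} + gamma * V(s_{t+1})."""
    return reward + gamma * next_state_value

# Example with made-up numbers: MC needs the whole tail of observed rewards,
# TD(0) only needs the next reward plus an estimate of the successor state's value.
print(mc_target([1.0, 0.0, 2.0]))
print(td_target(1.0, next_state_value=1.5))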

3 Overview and Advantages
Learn from experience, i.e., from sample episodes:
– Sample sequences of states, actions, and rewards
– Either on-line, or from simulated (model-based) interactions with the environment; no complete model is required
Advantages:
– Provably learn the optimal policy without a model
– Can be used with sample / easy-to-produce models
– Can easily focus on interesting regions of the state space
– More robust with respect to violations of the Markov property

4 Policy Evaluation
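
The body of this slide did not survive the transcript. As a reading aid, here is a minimal first-visit Monte-Carlo policy-evaluation sketch in the spirit of Sutton & Barto: estimate the value of a fixed policy by averaging the returns that follow the first visit to each state in sampled episodes. The generate_episode callable and the (state, reward) episode format are assumptions for illustration.

from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, num_episodes, gamma=0.9):
    """First-visit Monte-Carlo policy evaluation (sketch).

    generate_episode() is assumed to return a list of (state, reward) pairs,
    where the pair at step t holds s_t and the reward r_{t+1} received after
    acting from s_t under the policy being evaluated.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        g = 0.0
        first_visit_return = {}
        # Scan backwards; earlier visits overwrite later ones, leaving the
        # return that follows each state's FIRST visit.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g
        for state, g_s in first_visit_return.items():
            returns_sum[state] += g_s
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V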

5 Action-value functions required
Without a model, we need Q-value estimates
MC methods now average returns following visits to state-action pairs
All such pairs “need” to be visited, so sufficient exploration is required:
– Randomize episode starts (“exploring starts”, sketched below)
– …or behave using a stochastic (e.g. ε-greedy) policy
– …hence “Monte-Carlo”
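
A minimal sketch of the “exploring starts” idea (the names and the uniform choice over hand-listed states and actions are illustrative assumptions): every state-action pair has a nonzero probability of starting an episode, so all pairs keep being visited.

import random

def exploring_start(states, actions):
    """Pick the episode's first state and first action uniformly at random,
    so every (state, action) pair starts an episode with nonzero probability."""
    return random.choice(states), random.choice(actions)

# Example with hypothetical labels.
print(exploring_start(states=["s1", "s2", "s3"], actions=["left", "right"]))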

6 Monte-Carlo Control (to generate an optimal policy)
For now, assume “exploring starts”
Does “policy iteration” work? – Yes!
– Evaluation of each policy is carried out over multiple episodes
– Improvement: make the policy greedy with respect to the current Q-value function (see the sketch below)
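
A minimal sketch of the improvement step (names are illustrative assumptions): the improved policy simply picks, in every state, the action that maximizes the current Q-value estimates.

def greedy_improvement(Q, states, actions):
    """Return a deterministic policy that is greedy w.r.t. the current estimates:
    pi(s) = argmax_a Q(s, a)."""
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}

# Example with made-up Q-values.
Q = {("s1", "left"): 0.2, ("s1", "right"): 0.7}
print(greedy_improvement(Q, states=["s1"], actions=["left", "right"]))  # {'s1': 'right'}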

7 Monte-Carlo Control (to generate an optimal policy)
Why does this work? Suppose $\pi'$ is greedy with respect to $Q^{\pi}$. Then the policy-improvement theorem applies because, for all $s$:
$Q^{\pi}(s, \pi'(s)) = \max_a Q^{\pi}(s, a) \ge Q^{\pi}(s, \pi(s)) = V^{\pi}(s)$
Thus $\pi'$ is uniformly better than (or as good as) $\pi$.
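
For completeness, the standard policy-improvement chain from Sutton & Barto, written out here as a reading aid (it is not reproduced from the slide itself):

\begin{align*}
V^{\pi}(s) &\le Q^{\pi}(s, \pi'(s)) \\
  &= \mathbb{E}\left[ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s,\, a_t = \pi'(s) \right] \\
  &\le \mathbb{E}\left[ r_{t+1} + \gamma Q^{\pi}(s_{t+1}, \pi'(s_{t+1})) \mid s_t = s,\, a_t = \pi'(s) \right] \\
  &\le \dots \le \mathbb{E}_{\pi'}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s_t = s \right] = V^{\pi'}(s).
\end{align*}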

8 A Monte-Carlo control algorithm
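
The algorithm on this slide did not survive the transcript. Given that the preceding slides assume exploring starts, here is a hedged sketch in the spirit of Monte-Carlo control with exploring starts (Monte Carlo ES) from Sutton & Barto; the environment interface (reset_to, step) and all names are assumptions for illustration.

import random
from collections import defaultdict

def monte_carlo_es(env, states, actions, num_episodes, gamma=0.9):
    """Monte-Carlo control with exploring starts (sketch).

    env is assumed to expose:
      env.reset_to(state, action) -> (next_state, reward, done) after forcing the
                                     initial state-action pair
      env.step(state, action)     -> (next_state, reward, done)
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(num_episodes):
        # Exploring start: random first state-action pair.
        s, a = random.choice(states), random.choice(actions)
        next_s, r, done = env.reset_to(s, a)
        episode = [(s, a, r)]
        s = next_s
        while not done:
            a = policy[s]
            next_s, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = next_s

        # Returns following the first visit to each (state, action) pair.
        g = 0.0
        first_visit = {}
        for s_t, a_t, r_t in reversed(episode):
            g = r_t + gamma * g
            first_visit[(s_t, a_t)] = g  # earlier visits overwrite later ones

        # Evaluation: update Q by averaging; improvement: make the policy greedy.
        for (s_t, a_t), g_sa in first_visit.items():
            returns_sum[(s_t, a_t)] += g_sa
            returns_count[(s_t, a_t)] += 1
            Q[(s_t, a_t)] = returns_sum[(s_t, a_t)] / returns_count[(s_t, a_t)]
            policy[s_t] = max(actions, key=lambda act: Q[(s_t, act)])

    return policy, Q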

9 ε-greedy Exploration
What about ε-greedy policies?
If an action is not the greedy one, it is selected with probability ε / |A(s)|
Otherwise (the greedy action): it is selected with probability 1 − ε + ε / |A(s)|
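
A minimal ε-greedy selection sketch matching the probabilities above (the names are illustrative assumptions):

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick uniformly among all actions, otherwise pick the
    greedy one; each non-greedy action is chosen with probability epsilon/|A(s)| and
    the greedy action with probability 1 - epsilon + epsilon/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))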

10 Yes, policy iteration works
See the details in the book
ε-soft on-policy algorithm (sketched below):
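
The algorithm body is not in the transcript; here is a hedged sketch in the spirit of the book's on-policy first-visit MC control algorithm for ε-soft policies. The generate_episode(policy) callable and the probability-table representation of the policy are assumptions for illustration.

from collections import defaultdict

def on_policy_mc_control(generate_episode, states, actions,
                         num_episodes, gamma=0.9, epsilon=0.1):
    """On-policy first-visit MC control for epsilon-soft policies (sketch).

    generate_episode(policy) is assumed to return a list of (state, action, reward)
    triples sampled by following the stochastic policy, where policy[s] maps each
    action to its selection probability.
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    # Start from the uniform (hence epsilon-soft) policy.
    policy = {s: {a: 1.0 / len(actions) for a in actions} for s in states}

    for _ in range(num_episodes):
        episode = generate_episode(policy)

        g = 0.0
        first_visit = {}
        for s, a, r in reversed(episode):
            g = r + gamma * g
            first_visit[(s, a)] = g  # earlier visits overwrite later ones

        for (s, a), g_sa in first_visit.items():
            returns_sum[(s, a)] += g_sa
            returns_count[(s, a)] += 1
            Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
            # Keep the policy epsilon-soft: the greedy action gets the extra mass.
            best = max(actions, key=lambda act: Q[(s, act)])
            for act in actions:
                policy[s][act] = epsilon / len(actions)
            policy[s][best] += 1.0 - epsilon

    return policy, Q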

11 …and you can have off-policy learning as well… Why?
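
The slide's answer is not in the transcript. In Sutton & Barto, off-policy Monte-Carlo methods evaluate a target policy from episodes generated by a different (soft) behaviour policy by weighting the observed returns with importance-sampling ratios; the sketch below follows that idea using weighted importance sampling. The episode format and the probability-table policies are assumptions for illustration.

from collections import defaultdict

def off_policy_mc_prediction(episodes, target_policy, behaviour_policy, gamma=0.9):
    """Off-policy MC prediction of Q for the target policy via weighted
    importance sampling over behaviour-policy episodes (sketch).

    Each episode is a list of (state, action, reward) triples; both policies map
    a state to a dict of action -> selection probability.
    """
    Q = defaultdict(float)
    C = defaultdict(float)  # cumulative importance-sampling weights

    for episode in episodes:
        g = 0.0
        w = 1.0  # importance-sampling ratio, accumulated backwards through the episode
        for s, a, r in reversed(episode):
            g = r + gamma * g
            C[(s, a)] += w
            # Incremental weighted-average update of Q towards the return g.
            Q[(s, a)] += (w / C[(s, a)]) * (g - Q[(s, a)])
            w *= target_policy[s][a] / behaviour_policy[s][a]
            if w == 0.0:
                break  # the rest of the episode has zero weight under the target policy
    return Q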


Download ppt "Monte-Carlo Methods Learning methods averaging complete episodic returns Slides based on [Sutton & Barto: Reinforcement Learning: An Introduction, 1998]"

Similar presentations


Ads by Google