1
Bayesian Reinforcement Learning. Machine Learning RCC, 16th June 2011
2
Outline – Introduction to Reinforcement Learning – Overview of the field – Model-based BRL – Model-free RL
3
References – P. Poupart, M. Ghavamzadeh and Y. Engel, ICML-07 Tutorial – R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction
4
Machine Learning: Unsupervised Learning, Reinforcement Learning, Supervised Learning
5
Definitions (agent-environment diagram): state, action, reward, policy, reward function
6
Markov Decision Process (diagram): states x0, x1, ..., actions a0, a1, ..., rewards r0, r1, ...; policy, transition probability, reward function
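The diagram presumably encodes the standard MDP components; a minimal sketch in the slide's notation (the discount factor gamma is an added assumption):

\[
\text{MDP} = (X, A, P, R, \gamma), \qquad
x_{t+1} \sim P(\cdot \mid x_t, a_t), \qquad
r_t = R(x_t, a_t), \qquad
a_t \sim \pi(\cdot \mid x_t).
\]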
7
Value Function
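The slide gives only the title; the standard discounted value functions it presumably defined are (a sketch, assuming a discount factor gamma and the notation above):

\[
V^{\pi}(x) = \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; x_0 = x\right],
\qquad
Q^{\pi}(x,a) = \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; x_0 = x,\, a_0 = a\right].
\]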
8
Optimal Policy – Assume one optimal action per state – The optimal value function is initially unknown; it can be computed by Value Iteration
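A sketch of the value-iteration update the slide refers to, in its standard form (notation as above):

\[
V_{k+1}(x) = \max_{a}\Big[\, R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V_k(x') \,\Big],
\qquad
\pi^{*}(x) = \arg\max_{a}\Big[\, R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^{*}(x') \,\Big].
\]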
9
Reinforcement Learning – RL problem: solve the MDP when the reward/transition models are unknown – Basic idea: use samples obtained from the agent's interaction with the environment
10
Model-Based vs Model-Free RL – Model-based: learn a model of the reward/transition dynamics and derive the value function/policy from it – Model-free: learn the value function/policy directly
11
RL Solutions
12
Value Function Algorithms – Define a form for the value function – Sample a state-action-reward sequence – Update the value function – Extract the optimal policy. Examples: SARSA, Q-learning
13
RL Solutions: Actor-Critic – Define a policy structure (actor) – Define a value function (critic) – Sample state-action-reward – Update both actor and critic
14
RL Solutions: Policy Search Algorithm – Define a form for the policy – Sample a state-action-reward sequence – Update the policy. Example: PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)
15
Online vs Offline – Offline: use a simulator; the policy is fixed for each 'episode'; updates are made at the end of the episode – Online: interact directly with the environment; learning happens step-by-step
16
Model-Free Solutions – 1. Prediction: estimate V(x) or Q(x,a) – 2. Control: extract the policy (either on-policy or off-policy)
17
Monte-Carlo Predictions (driving-to-Cambridge example)
State              Value   Reward   Updated value
Leave car park      -90     -13        -100
Get out of city     -83     -15         -87
Motorway            -55     -61         -72
Enter Cambridge     -11     -11         -11
MC updates each value towards the actual return observed from that state to the terminal state (e.g. -13 - 15 - 61 - 11 = -100 for "Leave car park").
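A minimal sketch of the Monte-Carlo prediction update behind this table (the episode and the step size alpha = 1.0 are read off the slide; variable names are illustrative):

```python
# Monte-Carlo prediction: move each value towards the observed return.
states = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
values = {"Leave car park": -90.0, "Get out of city": -83.0,
          "Motorway": -55.0, "Enter Cambridge": -11.0}
rewards = [-13.0, -15.0, -61.0, -11.0]   # reward received after leaving each state
alpha = 1.0                              # step size (full replacement, as on the slide)

ret = 0.0
for state, reward in zip(reversed(states), reversed(rewards)):
    ret += reward                        # return = sum of rewards until the terminal state
    values[state] += alpha * (ret - values[state])

print(values)  # Leave car park: -100, Get out of city: -87, Motorway: -72, Enter Cambridge: -11
```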
18
Temporal Difference Predictions (driving-to-Cambridge example)
State              Value   Reward   Updated value
Leave car park      -90     -13         -96
Get out of city     -83     -15         -70
Motorway            -55     -61         -72
Enter Cambridge     -11     -11         -11
TD(0) updates each value towards the one-step target: the immediate reward plus the current estimate of the next state's value (e.g. -13 + (-83) = -96 for "Leave car park").
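A matching sketch of the TD(0) update (same episode and alpha = 1.0; treating the terminal value as zero is an assumption):

```python
# TD(0) prediction: move each value towards reward + estimated value of the next state.
states = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
values = {"Leave car park": -90.0, "Get out of city": -83.0,
          "Motorway": -55.0, "Enter Cambridge": -11.0}
rewards = [-13.0, -15.0, -61.0, -11.0]
alpha = 1.0

old = dict(values)                        # one-step targets use the pre-update estimates
for i, state in enumerate(states):
    next_value = old[states[i + 1]] if i + 1 < len(states) else 0.0  # terminal state = 0
    values[state] += alpha * (rewards[i] + next_value - values[state])

print(values)  # Leave car park: -96, Get out of city: -70, Motorway: -72, Enter Cambridge: -11
```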
19
Advantages of TD – No model of the reward/transition dynamics is needed – Online and fully incremental – Proven to converge under conditions on the step size – "Usually" faster than MC methods
20
From TD to TD(λ) (backup diagram: states, rewards, and the terminal state along an n-step look-ahead)
21
From TD to TD(λ), continued (further backup diagram: states, rewards, terminal state)
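The two diagrams presumably build up the λ-return; a sketch of that standard construction (the discount gamma and step size alpha are assumptions):

\[
G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} V(x_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)},
\]
\[
V(x_t) \leftarrow V(x_t) + \alpha \big( G_t^{\lambda} - V(x_t) \big),
\]

so λ = 0 recovers one-step TD and λ = 1 recovers the Monte-Carlo return.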
22
SARSA & Q-learning: two forms of TD learning – SARSA: on-policy, estimates the value function of the current policy – Q-learning: off-policy, estimates the value function of the optimal policy
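A minimal sketch of the two update rules for a tabular Q function (array sizes, step size and discount are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99              # step size and discount factor (illustrative)

def sarsa_update(x, a, r, x_next, a_next):
    """On-policy: bootstrap from the action the current policy actually takes next."""
    target = r + gamma * Q[x_next, a_next]
    Q[x, a] += alpha * (target - Q[x, a])

def q_learning_update(x, a, r, x_next):
    """Off-policy: bootstrap from the greedy action in the next state."""
    target = r + gamma * np.max(Q[x_next])
    Q[x, a] += alpha * (target - Q[x, a])
```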
23
GP Temporal Difference (illustrative figure of sampled states)
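The GPTD slides presumably follow the Gaussian-process temporal-difference model of Engel et al. (see the references); a hedged sketch of that generative model, to be checked against the cited tutorial:

\[
r_t = V(x_t) - \gamma V(x_{t+1}) + N_t,
\qquad
V \sim \mathcal{GP}\big(0,\, k(x, x')\big),
\]

so conditioning the GP prior over V on the observed rewards gives a posterior mean and variance for the value function at any state.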