1
Bayesian Reinforcement Learning. Machine Learning RCC, 16th June 2011
2
Outline – Introduction to Reinforcement Learning – Overview of the field – Model-based BRL – Model-free RL
3
References – P. Poupart, M. Ghavamzadeh and Y. Engel, ICML-07 Tutorial – R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction
4
Machine Learning: Unsupervised Learning, Reinforcement Learning, Supervised Learning
5
Definitions (agent-environment diagram): state, action, reward, policy, reward function
6
Markov Decision Process (diagram): states x0, x1, ..., actions a0, a1, ..., rewards r0, r1, ...; policy, transition probability, reward function
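The diagram presumably encodes the standard MDP components; a minimal sketch in the slide's notation (the discount factor gamma is an added assumption):

\[
\text{MDP} = (X, A, P, R, \gamma), \qquad
x_{t+1} \sim P(\cdot \mid x_t, a_t), \qquad
r_t = R(x_t, a_t), \qquad
a_t \sim \pi(\cdot \mid x_t).
\]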
7
Value Function
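The slide gives only the title; the standard discounted value functions it presumably defined are (a sketch, assuming a discount factor gamma and the notation above):

\[
V^{\pi}(x) = \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; x_0 = x\right],
\qquad
Q^{\pi}(x,a) = \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; x_0 = x,\, a_0 = a\right].
\]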
8
Optimal Policy – Assume one optimal action per state – The optimal value function is initially unknown; it can be computed by Value Iteration
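A sketch of the value-iteration update the slide refers to, in its standard form (notation as above):

\[
V_{k+1}(x) = \max_{a}\Big[\, R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V_k(x') \,\Big],
\qquad
\pi^{*}(x) = \arg\max_{a}\Big[\, R(x,a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^{*}(x') \,\Big].
\]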
9
Reinforcement Learning – RL problem: solve the MDP when the reward/transition models are unknown – Basic idea: use samples obtained from the agent's interaction with the environment
10
Model-Based vs Model-Free RL – Model-based: learn a model of the reward/transition dynamics and derive the value function/policy from it – Model-free: learn the value function/policy directly
11
RL Solutions
12
Value Function Algorithms – Define a form for the value function – Sample a state-action-reward sequence – Update the value function – Extract the optimal policy. Examples: SARSA, Q-learning
13
RL Solutions: Actor-Critic – Define a policy structure (actor) – Define a value function (critic) – Sample state-action-reward – Update both actor and critic
14
RL Solutions: Policy Search Algorithm – Define a form for the policy – Sample a state-action-reward sequence – Update the policy. Example: PEGASUS (Policy Evaluation-of-Goodness And Search Using Scenarios)
15
Online vs Offline – Offline: use a simulator; the policy is fixed for each 'episode'; updates are made at the end of the episode – Online: interact directly with the environment; learning happens step-by-step
16
Model-Free Solutions – 1. Prediction: estimate V(x) or Q(x,a) – 2. Control: extract the policy (either on-policy or off-policy)
17
Monte-Carlo Predictions (driving-to-Cambridge example)
State              Value   Reward   Updated value
Leave car park      -90     -13        -100
Get out of city     -83     -15         -87
Motorway            -55     -61         -72
Enter Cambridge     -11     -11         -11
MC updates each value towards the actual return observed from that state to the terminal state (e.g. -13 - 15 - 61 - 11 = -100 for "Leave car park").
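A minimal sketch of the Monte-Carlo prediction update behind this table (the episode and the step size alpha = 1.0 are read off the slide; variable names are illustrative):

```python
# Monte-Carlo prediction: move each value towards the observed return.
states = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
values = {"Leave car park": -90.0, "Get out of city": -83.0,
          "Motorway": -55.0, "Enter Cambridge": -11.0}
rewards = [-13.0, -15.0, -61.0, -11.0]   # reward received after leaving each state
alpha = 1.0                              # step size (full replacement, as on the slide)

ret = 0.0
for state, reward in zip(reversed(states), reversed(rewards)):
    ret += reward                        # return = sum of rewards until the terminal state
    values[state] += alpha * (ret - values[state])

print(values)  # Leave car park: -100, Get out of city: -87, Motorway: -72, Enter Cambridge: -11
```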
18
Temporal Difference Predictions (driving-to-Cambridge example)
State              Value   Reward   Updated value
Leave car park      -90     -13         -96
Get out of city     -83     -15         -70
Motorway            -55     -61         -72
Enter Cambridge     -11     -11         -11
TD(0) updates each value towards the one-step target: the immediate reward plus the current estimate of the next state's value (e.g. -13 + (-83) = -96 for "Leave car park").
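A matching sketch of the TD(0) update (same episode and alpha = 1.0; treating the terminal value as zero is an assumption):

```python
# TD(0) prediction: move each value towards reward + estimated value of the next state.
states = ["Leave car park", "Get out of city", "Motorway", "Enter Cambridge"]
values = {"Leave car park": -90.0, "Get out of city": -83.0,
          "Motorway": -55.0, "Enter Cambridge": -11.0}
rewards = [-13.0, -15.0, -61.0, -11.0]
alpha = 1.0

old = dict(values)                        # one-step targets use the pre-update estimates
for i, state in enumerate(states):
    next_value = old[states[i + 1]] if i + 1 < len(states) else 0.0  # terminal state = 0
    values[state] += alpha * (rewards[i] + next_value - values[state])

print(values)  # Leave car park: -96, Get out of city: -70, Motorway: -72, Enter Cambridge: -11
```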
19
Advantages of TD – No model of the reward/transition dynamics is needed – Online and fully incremental – Proven to converge under conditions on the step size – "Usually" faster than MC methods
20
From TD to TD(λ) (backup diagram: states, rewards, and the terminal state along an n-step look-ahead)
21
From TD to TD(λ), continued (further backup diagram: states, rewards, terminal state)
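The two diagrams presumably build up the λ-return; a sketch of that standard construction (the discount gamma and step size alpha are assumptions):

\[
G_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{\,n-1} r_{t+n-1} + \gamma^{\,n} V(x_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)},
\]
\[
V(x_t) \leftarrow V(x_t) + \alpha \big( G_t^{\lambda} - V(x_t) \big),
\]

so λ = 0 recovers one-step TD and λ = 1 recovers the Monte-Carlo return.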
22
SARSA & Q-learning: two forms of TD learning – SARSA: on-policy, estimates the value function of the current policy – Q-learning: off-policy, estimates the value function of the optimal policy
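A minimal sketch of the two update rules for a tabular Q function (array sizes, step size and discount are illustrative):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99              # step size and discount factor (illustrative)

def sarsa_update(x, a, r, x_next, a_next):
    """On-policy: bootstrap from the action the current policy actually takes next."""
    target = r + gamma * Q[x_next, a_next]
    Q[x, a] += alpha * (target - Q[x, a])

def q_learning_update(x, a, r, x_next):
    """Off-policy: bootstrap from the greedy action in the next state."""
    target = r + gamma * np.max(Q[x_next])
    Q[x, a] += alpha * (target - Q[x, a])
```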
23
GP Temporal Difference (illustrative figure of sampled states)
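The GPTD slides presumably follow the Gaussian-process temporal-difference model of Engel et al. (see the references); a hedged sketch of that generative model, to be checked against the cited tutorial:

\[
r_t = V(x_t) - \gamma V(x_{t+1}) + N_t,
\qquad
V \sim \mathcal{GP}\big(0,\, k(x, x')\big),
\]

so conditioning the GP prior over V on the observed rewards gives a posterior mean and variance for the value function at any state.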