Reinforcement Learning (Apprentissage par Renforcement)
Kenji Doya, doya@atr.co.jp
ATR Human Information Science Laboratories
CREST, Japan Science and Technology Corporation
Outline
Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics
RL in Continuous Space and Time: model-free and model-based approaches
Learning to Stand Up: discrete plans and continuous control
Modular Decomposition: multiple model-based RL (MMRL)
Learning to Walk (Doya & Nakano, 1985)
Action: a cycle of 4 postures
Reward: speed sensor output
Multiple solutions: creeping, jumping, …
Markov Decision Process (MDP)
Environment: dynamics P(s'|s,a); reward P(r|s,a)
Agent: policy P(a|s)
Goal: maximize cumulative future reward E[ r(t+1) + γ r(t+2) + γ^2 r(t+3) + … ]
0 ≤ γ ≤ 1: discount factor
(Figure: agent-environment loop; the agent sends action a, the environment returns state s and reward r)
Value Function and TD Error
State value function: V(s) = E[ r(t+1) + γ r(t+2) + γ^2 r(t+3) + … | s(t)=s, P(a|s) ]
0 ≤ γ ≤ 1: discount factor
Consistency condition: δ(t) = r(t) + γ V(s(t)) - V(s(t-1)) = 0 (new estimate minus old estimate)
Dual role of the temporal difference (TD) error δ(t):
Reward prediction: δ(t) ≈ 0 on average
Action selection: δ(t) > 0 means the action was better than average
(A tabular sketch of this update follows.)
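A minimal tabular TD(0) sketch of the update above, assuming a small discrete state space; the chain size, learning rate alpha, and gamma value are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def td0_update(V, s_prev, r, s, alpha=0.1, gamma=0.9):
    """One TD(0) step: delta(t) = r(t) + gamma*V(s(t)) - V(s(t-1))."""
    delta = r + gamma * V[s] - V[s_prev]   # TD error: new estimate minus old estimate
    V[s_prev] += alpha * delta             # move the old estimate toward the new one
    return delta

# Hypothetical 5-state chain with the value table initialized to zero
V = np.zeros(5)
print(td0_update(V, s_prev=2, r=1.0, s=3))
```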
Example: Navigation
(Figure: a reward field and the corresponding value functions for γ = 0.9 and γ = 0.5)
Actor-Critic Architecture
Critic: future reward prediction; updates the value V(s(t-1)) using the TD error δ(t)
Actor: action reinforcement; increases P(a(t-1)|s(t-1)) if δ(t) > 0
(Figure: critic V(s) and actor P(a|s) coupled by the TD error; the environment returns state s and reward r for action a)
(A tabular actor-critic sketch follows.)
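A minimal sketch of one tabular actor-critic step, assuming a softmax actor over action preferences H(s,a); the table names, learning rates, and state/action sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(V, H, s_prev, a_prev, r, s, alpha_v=0.1, alpha_h=0.1, gamma=0.9):
    """Critic: move V(s_prev) toward r + gamma*V(s); Actor: reinforce a_prev when delta > 0."""
    delta = r + gamma * V[s] - V[s_prev]          # TD error from the critic
    V[s_prev] += alpha_v * delta                  # critic update
    pi = softmax(H[s_prev])                       # current policy P(a|s_prev)
    H[s_prev, a_prev] += alpha_h * delta * (1.0 - pi[a_prev])  # actor update
    return delta

# Hypothetical 5 states x 2 actions
V, H = np.zeros(5), np.zeros((5, 2))
```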
Q Learning
Action value function: Q(s,a) = E[ r(t+1) + γ r(t+2) + … | s(t)=s, a(t)=a, P(a|s) ] = E[ r(t+1) + γ V(s(t+1)) | s(t)=s, a(t)=a ]
Action selection: a(t) = argmax_a Q(s(t),a) with probability 1-ε
Update (Q learning): Q(s(t),a(t)) := r(t+1) + γ max_a Q(s(t+1),a)
Update (SARSA): Q(s(t),a(t)) := r(t+1) + γ Q(s(t+1),a(t+1))
(Both update rules are sketched below.)
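A minimal sketch of ε-greedy action selection and the two update rules above, assuming a tabular Q of shape (states, actions); the learning rate and parameter values are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, a random action otherwise."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy target: r + gamma * max_a' Q(s', a')."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target: r + gamma * Q(s', a')."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```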
Dynamic Programming and RL
Dynamic programming: given the models P(s'|s,a) and P(r|s,a), solve the Bellman equation off-line
V*(s) = max_a [ Σ_r r P(r|s,a) + γ Σ_s' V*(s') P(s'|s,a) ]
Reinforcement learning: on-line learning with the TD error
δ(t) = r(t) + γ V(s(t)) - V(s(t-1))
ΔV(s(t-1)) = α δ(t), ΔQ(s(t-1),a(t-1)) = α δ(t)
(A value-iteration sketch follows.)
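A minimal value-iteration sketch of the off-line Bellman solution above, assuming a tabular model with transition array P of shape (S, A, S) and expected-reward array R of shape (S, A); these array names and shapes are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ] to a fixed point."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)             # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new
```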
Model-free and Model-based RL
Model-free: e.g., learn action values; Q(s,a) := r(s,a) + γ Q(s',a'); a = argmax_a Q(s,a)
Model-based: learn a forward model P(s'|s,a)
action selection: a = argmax_a E[ R(s,a) + γ Σ_s' V(s') P(s'|s,a) ] (one-step lookahead, sketched below)
simulation: learn V(s) and/or Q(s,a) off-line
dynamic programming: solve the Bellman equation V(s) = max_a E[ R(s,a) + γ Σ_s' V(s') P(s'|s,a) ]
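A minimal sketch of model-based action selection by one-step lookahead with a learned model and value function; the array layout (P of shape (S, A, S), R of shape (S, A)) is an illustrative assumption.

```python
import numpy as np

def model_based_action(s, P, R, V, gamma=0.9):
    """a = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    returns = R[s] + gamma * (P[s] @ V)   # expected return for each action in state s
    return int(np.argmax(returns))
```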
Current Topics
Convergence proofs with function approximators
Learning with hidden states (POMDP): estimate belief states; reactive, stochastic policies; parameterized finite-state policies
Hierarchical architectures: learn to select fixed sub-modules; train the sub-modules; or both
Partially Observable Markov Decision Process (POMDP)
Observation model P(o|s): not the identity
Belief state b = (P(s_1), P(s_2), …): real valued
Belief update: P(s_k|o) ∝ P(o|s_k) Σ_i P(s_k|s_i,a) P(s_i)  (sketched below)
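A minimal sketch of this belief update as a Bayes filter, assuming a transition array P of shape (S, A, S) and an observation array O with O[s, o] = P(o|s); the array names are illustrative assumptions.

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """b'(s') ∝ P(o|s') * sum_s P(s'|s,a) * b(s), normalized to sum to one."""
    predicted = P[:, a, :].T @ b      # sum_s P(s'|s,a) b(s) for every s'
    b_new = O[:, o] * predicted       # weight by the observation likelihood
    return b_new / b_new.sum()
```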
Tiger Problem (Kaelbling et al., 1998)
State: a tiger is in {left, right}
Action: {left, right, listen}
Observation: with 15% error
(Figure: policy tree and finite-state policy)
Outline
Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics
RL in Continuous Space and Time: model-free and model-based approaches
Learning to Stand Up: discrete plans and continuous control
Modular Decomposition: multiple model-based RL (MMRL)
Why Continuous?
Analog control problems: discretization gives poor control performance, and how to discretize is unclear
Better theoretical properties: differential algorithms; use of local linear models
Continuous TD Learning
(The original slide listed the equations for the system dynamics, value function, TD error, discount factor, value gradient, and policy; a reconstruction follows below.)
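A hedged reconstruction of the continuous-time formulation these labels refer to, following Doya (2000), "Reinforcement Learning in Continuous Time and Space"; the exact notation on the original slide is not recoverable from the extraction.

```latex
\begin{align*}
\text{Dynamics:} \quad & \dot{x}(t) = f\big(x(t), u(t)\big) \\
\text{Value function:} \quad & V(x(t)) = E\!\left[ \int_t^{\infty} e^{-(s-t)/\tau}\, r\big(x(s), u(s)\big)\, ds \right] \\
\text{TD error:} \quad & \delta(t) = r(t) - \tfrac{1}{\tau}\, V(x(t)) + \dot{V}(x(t)) \\
\text{Discounting:} \quad & \tau > 0 \text{ is the time constant playing the role of } \gamma \\
\text{Policy:} \quad & u(t) \text{ chosen greedily using the value gradient } \partial V / \partial x
\end{align*}
```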
On-line Learning of the State Value
(Figure: state x = (angle, angular velocity) and the learned value function V(x))
Example: Cart-Pole Swing-Up
Reward: height of the tip of the pole
Punishment: crashing into the wall
Fast Learning by Internal Models
Pole balancing (Stefan Schaal, USC)
Forward model of the pole dynamics
Inverse model of the arm dynamics
Internal Models for Planning Devil sticking (Chris Atkeson, CMU)
Outline
Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics
RL in Continuous Space and Time: model-free and model-based approaches
Learning to Stand Up: discrete plans and continuous control
Modular Decomposition: multiple model-based RL (MMRL)
Need for a Hierarchical Architecture
Performance of control requires many high-precision sensors and actuators, which makes learning prohibitively slow
Speed of learning requires searching a low-dimensional, low-resolution space
Learning to Stand Up (Morimoto & Doya, 1998)
Reward: height of the head
Punishment: tumbling
State: pitch and joint angles and their derivatives
In simulation, many thousands of trials are needed to learn
Hierarchical Architecture
Upper level: discrete state/time, kinematics; action: subgoals; reward: the total task
Lower level: continuous state/time, dynamics; action: motor torque; reward: achieving subgoals
(Figure: upper level learns Q(S,A) over a sequence of subgoals; lower level learns V(s) and a controller a = g(s))
Learning in Simulation
(Figure: early learning vs. after about 700 trials; upper-level subgoals and lower-level control)
Learning with Real Hardware (Morimoto & Doya, 2001)
After learning in simulation, about 100 physical trials were needed
Adaptation by the lower control modules
Outline
Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics
RL in Continuous Space and Time: model-free and model-based approaches
Learning to Stand Up: discrete plans and continuous control
Modular Decomposition: multiple model-based RL (MMRL)
Modularity in Motor Learning
Fast de-adaptation and re-adaptation: switching rather than re-learning
Combination of learned modules: serial/parallel/sigmoidal mixtures
'Soft' Switching of Adaptive Modules
'Hard' switching based on prediction errors (Narendra et al., 1995): can result in sub-optimal task decomposition when prediction models are initially poor
'Soft' switching by a 'softmax' of prediction errors (Wolpert & Kawato, 1998): can use 'annealing' for optimal decomposition (Pawelzik et al., 1996)
Responsibility by Competition
Each module predicts the state change; the responsibility signal, computed by competition among the prediction errors, weights every module's output and learning (sketched below)
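A minimal sketch of the responsibility signal as a softmax over (negative, scaled) squared prediction errors; the Gaussian width sigma, which acts as an annealing temperature, is an illustrative assumption.

```python
import numpy as np

def responsibilities(x_next, predictions, sigma=1.0):
    """lambda_i ∝ exp(-||x_next - x_hat_i||^2 / (2 sigma^2)), normalized over modules."""
    errors = np.array([np.sum((x_next - x_hat) ** 2) for x_hat in predictions])
    logits = -errors / (2.0 * sigma ** 2)
    logits -= logits.max()                 # subtract the max for numerical stability
    lam = np.exp(logits)
    return lam / lam.sum()

# Each module's control output and learning is then weighted by lam[i].
```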
Multiple Linear Quadratic Controllers
Each module holds a linear dynamics model, a quadratic reward model, a quadratic value function, and a linear feedback action output (see the sketch below)
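A minimal sketch of one such module using a standard continuous-time LQR solve, and of blending module actions with the responsibility weights above; the use of scipy's Riccati solver and the cost-minimization sign convention are illustrative assumptions, not the slide's exact formulation.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_module(A, B, Q, R):
    """Locally linear module: dynamics dx/dt = A x + B u with quadratic cost x'Qx + u'Ru.
    Returns the Riccati solution P (quadratic value) and the feedback gain K (u = -K x)."""
    P = solve_continuous_are(A, B, Q, R)
    K = np.linalg.solve(R, B.T @ P)
    return P, K

def blended_action(x, gains, lam):
    """Combine the modules' linear feedback actions with responsibility weights lam."""
    return sum(l * (-K @ x) for l, K in zip(lam, gains))
```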
Swing-up control of a pendulum Red: module 1 Green: module 2
Non-linearity and Non-stationarity Specialization by predictability in space and time
Swing-up control of an ‘Acrobot’ Reward: height of the center of mass Linearized around four fixed points
Swing-Up Motions
(Figure: swing-up motions for R = 0.001 and R = 0.002)
Module Switching
(Figure: trajectories x(t) and responsibilities λ_i for R = 0.001 and R = 0.002)
Symbol-like representation of the module sequences: 1-2-1-2-1-3-4-1-3-4-3-4 and 1-2-1-2-1-2-1-3-4-1-3-4
Stand Up by Multiple Modules Seven locally linear models
Segmentation of an Observed Trajectory
(Figure: predicted motor output, predicted state change, and predicted responsibility)
Imitation of Acrobot Swing-Up
(Panels: θ_1(0) = π/12, θ_1(0) = π/6, and θ_1(0) = π/12 with imitation)
Outline
Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics
RL in Continuous Space and Time: model-free and model-based approaches
Learning to Stand Up: discrete plans and continuous control
Modular Decomposition: multiple model-based RL (MMRL)
Future Directions
Autonomous learning agents: tuning of meta-parameters; design of rewards; selection of necessary/sufficient state coding
Neural mechanisms of RL: dopamine neurons encode the TD error; the basal ganglia perform value-based action selection; the cerebellum provides internal models; the cerebral cortex supports modular decomposition
What Is Reward for a Robot?
Reward should be grounded in:
Self-preservation: self-recharging
Self-reproduction: copying the control program
(Figure: the Cyber Rodent)
The Cyber Rodent Project
Learning mechanisms under the realistic constraints of self-preservation and self-reproduction:
acquisition of task-oriented internal representations
metalearning algorithms under constraints of finite time and energy
mechanisms for collaborative behaviors
roles of communication: abstract/emotional vs. concrete/symbolic
gene-exchange rules for evolution
Input/Output
Sensory: CCD camera, range sensor, IR proximity sensors (x8), acceleration/gyro sensors, microphones (x2)
Motor: two wheels, jaw, R/G/B LED, speaker
Computation/Communication
CPU: Hitachi SH-4; FPGA image processor; I/O modules
Communication: IR port, wireless LAN
Software: learning/evolution, dynamic simulation