
2 Apprentissage par Renforcement (Reinforcement Learning). Kenji Doya, doya@atr.co.jp, ATR Human Information Science Laboratories; CREST, Japan Science and Technology Corporation.

3 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

4 Learning to Walk (Doya & Nakano, 1985) Action: cycle of 4 postures Reward: speed sensor output Multiple solutions: creeping, jumping,…

5 Markov Decision Process (MDP). Environment: dynamics P(s'|s,a) and reward P(r|s,a). Agent: policy P(a|s). Goal: maximize the cumulative future reward E[r(t+1) + γ r(t+2) + …], where 0 ≤ γ ≤ 1 is the discount factor. (Diagram: agent-environment loop; the environment sends state s and reward r to the agent, which returns action a.)

6 Value Function and TD Error. State value function V(s) = E[r(t+1) + γ r(t+2) + … | s(t)=s, P(a|s)], with discount factor 0 ≤ γ ≤ 1. Consistency condition: δ(t) = r(t) + γ V(s(t)) - V(s(t-1)) = 0 (new estimate minus old estimate). Dual role of the temporal difference (TD) error δ(t): reward prediction (δ(t) → 0 on average) and action selection (δ(t) > 0 means better than average).
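The tabular update implied by these definitions can be written as a short sketch; the dictionary-based value table and the learning rate alpha below are illustrative assumptions, not part of the slide.

```python
import collections

def td0_update(V, s_prev, r, s, gamma=0.9, alpha=0.1):
    """One tabular TD(0) step: delta(t) = r(t) + gamma*V(s(t)) - V(s(t-1))."""
    delta = r + gamma * V[s] - V[s_prev]   # TD error (new estimate - old estimate)
    V[s_prev] += alpha * delta             # move the old estimate toward the new one
    return delta

V = collections.defaultdict(float)         # state-value table, V(s) = 0 initially
```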

7 Example: Navigation. (Figure: the reward field and the resulting value functions for γ = 0.9 and γ = 0.5.)

8 Actor-Critic Architecture. Critic: future reward prediction; value update ΔV(s(t-1)) ∝ δ(t). Actor: action reinforcement; increase P(a(t-1)|s(t-1)) if δ(t) > 0. (Diagram: critic V(s) and actor P(a|s); the environment provides state s and reward r, the actor emits action a, and the critic's TD error δ drives both updates.)
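A minimal tabular actor-critic sketch of the scheme above, assuming a softmax policy over action preferences theta; the learning rates and array layout are illustrative, not from the slide.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def actor_critic_step(V, theta, s_prev, a_prev, r, s,
                      gamma=0.9, alpha_v=0.1, alpha_p=0.1):
    """Critic: the TD error updates V; actor: the same TD error reinforces a_prev."""
    delta = r + gamma * V[s] - V[s_prev]     # TD error from the critic
    V[s_prev] += alpha_v * delta             # critic update
    probs = softmax(theta[s_prev])           # current policy P(a|s_prev)
    grad = -probs
    grad[a_prev] += 1.0                      # gradient of log P(a_prev|s_prev)
    theta[s_prev] += alpha_p * delta * grad  # raise P(a_prev|s_prev) when delta > 0
    return delta

# Usage: V = np.zeros(n_states); theta = np.zeros((n_states, n_actions))
```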

9 Q Learning. Action value function Q(s,a) = E[r(t+1) + γ r(t+2) + … | s(t)=s, a(t)=a, P(a|s)] = E[r(t+1) + γ V(s(t+1)) | s(t)=s, a(t)=a]. Action selection: a(t) = argmax_a Q(s(t),a) with probability 1 - ε. Update: Q(s(t),a(t)) := r(t+1) + γ max_a Q(s(t+1),a) (off-policy Q-learning), or Q(s(t),a(t)) := r(t+1) + γ Q(s(t+1),a(t+1)) (on-policy, SARSA-style).
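A sketch of the two update rules above with epsilon-greedy action selection; the table layout Q[s, a] and the parameter defaults are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """a = argmax_a Q(s,a) with probability 1 - epsilon, a random action otherwise."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, gamma=0.9, alpha=0.1, a_next=None):
    """Off-policy target r + gamma*max_a Q(s',a) by default; passing a_next
    switches to the on-policy target r + gamma*Q(s',a')."""
    target = r + gamma * (Q[s_next, a_next] if a_next is not None
                          else Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
```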

10 Dynamic Programming and RL. Dynamic programming: given models P(s'|s,a) and P(r|s,a), solve the Bellman equation off-line: V*(s) = max_a [ Σ_r r P(r|s,a) + γ Σ_s' V*(s') P(s'|s,a) ]. Reinforcement learning: on-line learning with the TD error δ(t) = r(t) + γ V(s(t)) - V(s(t-1)), with updates ΔV(s(t-1)) = α δ(t) and ΔQ(s(t-1),a(t-1)) = α δ(t).
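For the dynamic-programming side, value iteration solves the Bellman equation off-line given the model; the array layout P[a, s, s'] and the use of the expected reward R[s, a] = Σ_r r P(r|s,a) are assumptions for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Off-line solution of V*(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V*(s') ].
    P[a, s, s_next] = P(s'|s,a); R[s, a] = expected immediate reward."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('asn,n->sa', P, V)   # one Bellman backup per (s,a)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)             # optimal values and greedy policy
        V = V_new
```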

11 Model-free and Model-based RL. Model-free: e.g., learn action values, Q(s,a) := r(s,a) + γ Q(s',a'), and act by a = argmax_a Q(s,a). Model-based: learn a forward model P(s'|s,a); action selection by a = argmax_a E[R(s,a) + γ Σ_s' V(s') P(s'|s,a)]; simulation to learn V(s) and/or Q(s,a) off-line; or dynamic programming to solve the Bellman equation V(s) = max_a E[R(s,a) + γ Σ_s' V(s') P(s'|s,a)].
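The model-based action selection above amounts to a one-step lookahead through the learned forward model; a sketch, with the same assumed array layout as before:

```python
import numpy as np

def model_based_action(s, V, P, R, gamma=0.9):
    """a = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    returns = R[s] + gamma * P[:, s, :] @ V   # expected return of each action
    return int(np.argmax(returns))
```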

12 Current Topics. Convergence proofs with function approximators. Learning with hidden states (POMDP): estimate belief states; reactive, stochastic policies; parameterized finite-state policies. Hierarchical architectures: learn to select fixed sub-modules, train the sub-modules, or both.

13 Partially Observable Markov Decision Process (POMDP). The observation model P(o|s) is not the identity, so the agent updates a real-valued belief state b = (P(s_1), P(s_2), …): P(s_k|o) ∝ P(o|s_k) Σ_i P(s_k|s_i,a) P(s_i).
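A sketch of this belief update, assuming array layouts T[a, s, s'] = P(s'|s,a) and O[s, o] = P(o|s); these names are illustrative.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """P(s_k|o) ∝ P(o|s_k) * sum_i P(s_k|s_i,a) * P(s_i), then normalize."""
    pred = b @ T[a]               # predicted next-state distribution
    new_b = O[:, o] * pred        # weight by the observation likelihood
    return new_b / new_b.sum()    # normalized belief state
```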

14 Tiger Problem (Kaelbling et al., 1998). State: the tiger is behind the {left, right} door; actions: {left, right, listen}; observations have 15% error. (Figure: the resulting policy tree and finite-state policy.)

15 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

16 Why Continuous? Analog control problems: discretization leads to poor control performance, and how to discretize is itself a problem. Better theoretical properties: differential algorithms and the use of local linear models.

17 Continuous TD Learning: the dynamics, value function, TD error, discount factor, value gradient, and policy (the slide shows their equations; a reconstruction follows below).
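The following is a sketch of the continuous-time TD formulation from Doya (2000), which matches the slide's headings; the time constant τ plays the role of the discount factor (γ ≈ e^(-Δt/τ)). This is a reconstruction, not the slide's exact notation.

```latex
% Reconstruction (after Doya, 2000), not the slide's own notation
\begin{align*}
\dot{x}(t) &= f\bigl(x(t),u(t)\bigr)
  && \text{dynamics}\\
V(x(t)) &= \int_t^{\infty} e^{-(s-t)/\tau}\, r\bigl(x(s),u(s)\bigr)\,ds
  && \text{value function, } \tau \text{: discount time constant}\\
\delta(t) &= r(t) - \tfrac{1}{\tau}\,V(t) + \dot{V}(t)
  && \text{TD error}\\
u(t) &= g\!\Bigl(\bigl(\tfrac{\partial f}{\partial u}\bigr)^{\!\top}
        \bigl(\tfrac{\partial V}{\partial x}\bigr)^{\!\top}\Bigr)
  && \text{policy via the value gradient (e.g. } g \text{ sigmoidal)}
\end{align*}
```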

18 On-line Learning of the State Value. (Figure: pendulum example; state x = (angle, angular velocity) and the learned value function V(x).)

19 Example: Cart-pole Swing-up. Reward: height of the tip of the pole. Punishment: crashing into the wall.

20 Fast Learning by Internal Models Pole balancing (Stefan Schaal, USC) Forward model of pole dynamics Inverse model of arm dynamics

21 Internal Models for Planning Devil sticking (Chris Atkeson, CMU)

22 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

23 Need for a Hierarchical Architecture. Performance of control requires many high-precision sensors and actuators, which leads to a prohibitively long time for learning. Speed of learning requires search in a low-dimensional, low-resolution space.

24 Learning to Stand Up (Morimoto & Doya, 1998). Reward: height of the head. Punishment: tumbling. State: pitch and joint angles and their derivatives. In simulation, learning takes many thousands of trials.

25 Hierarchical Architecture. Upper level: discrete state/time, kinematics; action: subgoals; reward: the total task. Lower level: continuous state/time, dynamics; action: motor torque; reward: achieving the subgoals. (Diagram: the upper-level Q(S,A) issues a sequence of subgoals; the lower level learns V(s) and a policy a = g(s) to reach them.) A sketch of this two-level scheme follows below.
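A minimal sketch of the two-level scheme described above, not the authors' actual implementation (which uses continuous-time RL at the lower level): the upper level does discrete Q-learning over subgoal postures and is rewarded by the overall task, while the lower level is rewarded for reaching the currently active subgoal. All class and variable names are illustrative.

```python
import numpy as np

class HierarchicalAgent:
    """Two-level sketch: discrete Q-learning over subgoals on top of an
    arbitrary continuous lower-level controller (names are illustrative)."""

    def __init__(self, n_upper_states, n_subgoals):
        self.Q = np.zeros((n_upper_states, n_subgoals))  # upper-level action values

    def choose_subgoal(self, S, epsilon=0.1):
        # epsilon-greedy selection of the next subgoal posture
        if np.random.rand() < epsilon:
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[S]))

    def upper_update(self, S, A, task_reward, S_next, gamma=0.99, alpha=0.1):
        # the upper level is rewarded by the total task (e.g. head height)
        target = task_reward + gamma * self.Q[S_next].max()
        self.Q[S, A] += alpha * (target - self.Q[S, A])

    @staticmethod
    def lower_reward(x, subgoal_x):
        # the lower level is rewarded for achieving the current subgoal posture
        return -float(np.linalg.norm(x - subgoal_x))
```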

26 Learning in Simulation. (Figure: behavior early in learning and after about 700 trials; upper-level subgoals and lower-level control.)

27 Learning with Real Hardware (Morimoto & Doya, 2001). After learning in simulation, the robot succeeds after about 100 physical trials, thanks to adaptation by the lower-level control modules.

28 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

29 Modularity in Motor Learning. Fast de-adaptation and re-adaptation: switching rather than re-learning. Combination of learned modules: serial, parallel, or sigmoidal mixtures.

30 ‘Soft’ Switching of Adaptive Modules ‘Hard’ switching based on prediction errors (Narendra et al., 1995) Can result in sub-optimal task decomposition with initially poor prediction models. ‘Soft’ switching by ‘softmax’ of prediction errors (Wolpert and Kawato, 1998) Can use ‘annealing’ for optimal decomposition. (Pawelzik et al., 1996)

31 Responsibility by Competition: each module predicts the state change, the prediction errors determine the responsibility weights, and the weights gate both the modules' outputs and their learning. A sketch of the weight computation follows below.
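A sketch of the responsibility computation as a softmax of squared prediction errors; the Gaussian scale sigma and the variable names are assumptions, not the slide's notation.

```python
import numpy as np

def responsibilities(x_change, predictions, sigma=1.0):
    """lambda_i: softmax over each module's squared prediction error for the state change."""
    errors = np.array([np.sum((x_change - p) ** 2) for p in predictions])
    logits = -errors / (2.0 * sigma ** 2)
    w = np.exp(logits - logits.max())
    return w / w.sum()   # responsibility weights, sum to 1

# Responsibility-weighted output and learning:
#   u = sum_i lambda_i * u_i, and each module's update is scaled by lambda_i.
```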

32 Multiple Linear Quadratic Controllers: each module has a linear dynamic model, a quadratic reward model, the corresponding quadratic value function, and a linear action output; see the sketch below.
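A sketch of one such module in standard (undiscounted) LQR form, a stand-in for, not necessarily identical to, the formulation on the slide; scipy's solve_continuous_are solves the Riccati equation.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_module(A, B, Q, R):
    """One linear-quadratic module: dynamics dx/dt = A x + B u and
    quadratic cost x'Qx + u'Ru (reward = negative cost).
    Returns the feedback gain K (so u = -K x) and the Riccati solution P,
    which gives the module's quadratic value function V(x) = -x'Px."""
    P = solve_continuous_are(A, B, Q, R)     # solves A'P + PA - P B R^{-1} B' P + Q = 0
    K = np.linalg.solve(R, B.T @ P)          # K = R^{-1} B' P
    return K, P

# The overall action is the responsibility-weighted sum of the modules'
# outputs: u = sum_i lambda_i * u_i, with lambda_i from the previous sketch.
```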

33 Swing-up control of a pendulum Red: module 1 Green: module 2

34 Non-linearity and Non-stationarity Specialization by predictability in space and time

35 Swing-up control of an ‘Acrobot’ Reward: height of the center of mass Linearized around four fixed points

36 Swing-up motions. (Figure: motions for R = 0.001 and R = 0.002.)

37 Module switching. (Figure: trajectories x(t) and responsibilities λ_i for R = 0.001 and R = 0.002.) The resulting module sequences give a symbol-like representation, e.g. 1-2-1-2-1-3-4-1-3-4-3-4 and 1-2-1-2-1-2-1-3-4-1-3-4.

38 Stand Up by Multiple Modules Seven locally linear models

39 Segmentation of an Observed Trajectory. (Figure: predicted motor output, predicted state change, and predicted responsibility.)

40 Imitation of Acrobot Swing-up. (Figure: swing-up from initial conditions θ1(0) = π/12, θ1(0) = π/6, and θ1(0) = π/12 with imitation.)

41 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

42 Future Directions. Autonomous learning agents: tuning of meta-parameters, design of rewards, and selection of necessary/sufficient state coding. Neural mechanisms of RL: dopamine neurons encode the TD error; the basal ganglia perform value-based action selection; the cerebellum provides internal models; the cerebral cortex supports modular decomposition.

43 What is Reward for a Robot? It should be grounded in self-preservation (self-recharging) and self-reproduction (copying the control program). (Photo: the Cyber Rodent.)

44 The Cyber Rodent Project: learning mechanisms under realistic constraints of self-preservation and self-reproduction. Topics: acquisition of task-oriented internal representations; metalearning algorithms; constraints of finite time and energy; mechanisms for collaborative behaviors; roles of communication (abstract/emotional and concrete/symbolic); gene-exchange rules for evolution.

45 Input/Output. Sensory: CCD camera, range sensor, 8 IR proximity sensors, acceleration/gyro sensors, 2 microphones. Motor: two wheels, jaw, R/G/B LED, speaker.

46 Computation/Communication. Computation: Hitachi SH-4 CPU, FPGA image processor, I/O modules. Communication: IR port, wireless LAN. Software: learning/evolution, dynamic simulation.

