
2 Apprentissage par Renforcement (Reinforcement Learning). Kenji Doya, doya@atr.co.jp, ATR Human Information Science Laboratories; CREST, Japan Science and Technology Corporation.

3 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

4 Learning to Walk (Doya & Nakano, 1985) Action: cycle of 4 postures Reward: speed sensor output Multiple solutions: creeping, jumping,…

5 Markov Decision Process (MDP). Environment: dynamics P(s'|s,a) and reward P(r|s,a). Agent: policy P(a|s). Goal: maximize the cumulative future reward E[r(t+1) + γ r(t+2) + …], where 0 ≤ γ ≤ 1 is the discount factor. (Diagram: agent-environment loop; the environment sends state s and reward r to the agent, which returns action a.)

6 Value Function and TD Error. State value function V(s) = E[r(t+1) + γ r(t+2) + … | s(t)=s, P(a|s)], with discount factor 0 ≤ γ ≤ 1. Consistency condition: δ(t) = r(t) + γ V(s(t)) - V(s(t-1)) = 0 (new estimate minus old estimate). Dual role of the temporal difference (TD) error δ(t): reward prediction (δ(t) → 0 on average) and action selection (δ(t) > 0 means better than average).
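The tabular update implied by these definitions can be written as a short sketch; the dictionary-based value table and the learning rate alpha below are illustrative assumptions, not part of the slide.

```python
import collections

def td0_update(V, s_prev, r, s, gamma=0.9, alpha=0.1):
    """One tabular TD(0) step: delta(t) = r(t) + gamma*V(s(t)) - V(s(t-1))."""
    delta = r + gamma * V[s] - V[s_prev]   # TD error (new estimate - old estimate)
    V[s_prev] += alpha * delta             # move the old estimate toward the new one
    return delta

V = collections.defaultdict(float)         # state-value table, V(s) = 0 initially
```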

7 Example: Navigation. (Figure: the reward field and the resulting value functions for γ = 0.9 and γ = 0.5.)

8 Actor-Critic Architecture. Critic: future reward prediction; value update ΔV(s(t-1)) ∝ δ(t). Actor: action reinforcement; increase P(a(t-1)|s(t-1)) if δ(t) > 0. (Diagram: critic V(s) and actor P(a|s); the environment provides state s and reward r, the actor emits action a, and the critic's TD error δ drives both updates.)
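A minimal tabular actor-critic sketch of the scheme above, assuming a softmax policy over action preferences theta; the learning rates and array layout are illustrative, not from the slide.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def actor_critic_step(V, theta, s_prev, a_prev, r, s,
                      gamma=0.9, alpha_v=0.1, alpha_p=0.1):
    """Critic: the TD error updates V; actor: the same TD error reinforces a_prev."""
    delta = r + gamma * V[s] - V[s_prev]     # TD error from the critic
    V[s_prev] += alpha_v * delta             # critic update
    probs = softmax(theta[s_prev])           # current policy P(a|s_prev)
    grad = -probs
    grad[a_prev] += 1.0                      # gradient of log P(a_prev|s_prev)
    theta[s_prev] += alpha_p * delta * grad  # raise P(a_prev|s_prev) when delta > 0
    return delta

# Usage: V = np.zeros(n_states); theta = np.zeros((n_states, n_actions))
```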

9 Q Learning. Action value function Q(s,a) = E[r(t+1) + γ r(t+2) + … | s(t)=s, a(t)=a, P(a|s)] = E[r(t+1) + γ V(s(t+1)) | s(t)=s, a(t)=a]. Action selection: a(t) = argmax_a Q(s(t),a) with probability 1 - ε. Update: Q(s(t),a(t)) := r(t+1) + γ max_a Q(s(t+1),a) (off-policy Q-learning), or Q(s(t),a(t)) := r(t+1) + γ Q(s(t+1),a(t+1)) (on-policy, SARSA-style).
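A sketch of the two update rules above with epsilon-greedy action selection; the table layout Q[s, a] and the parameter defaults are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """a = argmax_a Q(s,a) with probability 1 - epsilon, a random action otherwise."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, r, s_next, gamma=0.9, alpha=0.1, a_next=None):
    """Off-policy target r + gamma*max_a Q(s',a) by default; passing a_next
    switches to the on-policy target r + gamma*Q(s',a')."""
    target = r + gamma * (Q[s_next, a_next] if a_next is not None
                          else Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
```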

10 Dynamic Programming and RL. Dynamic programming: given models P(s'|s,a) and P(r|s,a), solve the Bellman equation off-line: V*(s) = max_a [ Σ_r r P(r|s,a) + γ Σ_s' V*(s') P(s'|s,a) ]. Reinforcement learning: on-line learning with the TD error δ(t) = r(t) + γ V(s(t)) - V(s(t-1)), with updates ΔV(s(t-1)) = α δ(t) and ΔQ(s(t-1),a(t-1)) = α δ(t).
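For the dynamic-programming side, value iteration solves the Bellman equation off-line given the model; the array layout P[a, s, s'] and the use of the expected reward R[s, a] = Σ_r r P(r|s,a) are assumptions for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Off-line solution of V*(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V*(s') ].
    P[a, s, s_next] = P(s'|s,a); R[s, a] = expected immediate reward."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('asn,n->sa', P, V)   # one Bellman backup per (s,a)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)             # optimal values and greedy policy
        V = V_new
```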

11 Model-free and Model-based RL. Model-free: e.g., learn action values, Q(s,a) := r(s,a) + γ Q(s',a'), and act by a = argmax_a Q(s,a). Model-based: learn a forward model P(s'|s,a); action selection by a = argmax_a E[R(s,a) + γ Σ_s' V(s') P(s'|s,a)]; simulation to learn V(s) and/or Q(s,a) off-line; or dynamic programming to solve the Bellman equation V(s) = max_a E[R(s,a) + γ Σ_s' V(s') P(s'|s,a)].
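The model-based action selection above amounts to a one-step lookahead through the learned forward model; a sketch, with the same assumed array layout as before:

```python
import numpy as np

def model_based_action(s, V, P, R, gamma=0.9):
    """a = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    returns = R[s] + gamma * P[:, s, :] @ V   # expected return of each action
    return int(np.argmax(returns))
```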

12 Current Topics. Convergence proofs with function approximators. Learning with hidden states (POMDP): estimate belief states; reactive, stochastic policies; parameterized finite-state policies. Hierarchical architectures: learn to select fixed sub-modules, train the sub-modules, or both.

13 Partially Observable Markov Decision Process (POMDP). The observation model P(o|s) is not the identity, so the agent updates a real-valued belief state b = (P(s_1), P(s_2), …): P(s_k|o) ∝ P(o|s_k) Σ_i P(s_k|s_i,a) P(s_i).
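A sketch of this belief update, assuming array layouts T[a, s, s'] = P(s'|s,a) and O[s, o] = P(o|s); these names are illustrative.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """P(s_k|o) ∝ P(o|s_k) * sum_i P(s_k|s_i,a) * P(s_i), then normalize."""
    pred = b @ T[a]               # predicted next-state distribution
    new_b = O[:, o] * pred        # weight by the observation likelihood
    return new_b / new_b.sum()    # normalized belief state
```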

14 Tiger Problem (Kaelbling et al., 1998). State: the tiger is behind the {left, right} door; actions: {left, right, listen}; observations have 15% error. (Figure: the resulting policy tree and finite-state policy.)

15 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

16 Why Continuous? Analog control problems: discretization leads to poor control performance, and how to discretize is itself a problem. Better theoretical properties: differential algorithms and the use of local linear models.

17 Continuous TD Learning: the dynamics, value function, TD error, discount factor, value gradient, and policy (the slide shows their equations; a reconstruction follows below).
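The following is a sketch of the continuous-time TD formulation from Doya (2000), which matches the slide's headings; the time constant τ plays the role of the discount factor (γ ≈ e^(-Δt/τ)). This is a reconstruction, not the slide's exact notation.

```latex
% Reconstruction (after Doya, 2000), not the slide's own notation
\begin{align*}
\dot{x}(t) &= f\bigl(x(t),u(t)\bigr)
  && \text{dynamics}\\
V(x(t)) &= \int_t^{\infty} e^{-(s-t)/\tau}\, r\bigl(x(s),u(s)\bigr)\,ds
  && \text{value function, } \tau \text{: discount time constant}\\
\delta(t) &= r(t) - \tfrac{1}{\tau}\,V(t) + \dot{V}(t)
  && \text{TD error}\\
u(t) &= g\!\Bigl(\bigl(\tfrac{\partial f}{\partial u}\bigr)^{\!\top}
        \bigl(\tfrac{\partial V}{\partial x}\bigr)^{\!\top}\Bigr)
  && \text{policy via the value gradient (e.g. } g \text{ sigmoidal)}
\end{align*}
```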

18 On-line Learning of the State Value. (Figure: pendulum example; state x = (angle, angular velocity) and the learned value function V(x).)

19 Example: Cart-pole Swing-up. Reward: height of the tip of the pole. Punishment: crashing into the wall.

20 Fast Learning by Internal Models Pole balancing (Stefan Schaal, USC) Forward model of pole dynamics Inverse model of arm dynamics

21 Internal Models for Planning Devil sticking (Chris Atkeson, CMU)

22 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

23 Need for a Hierarchical Architecture. Performance of control requires many high-precision sensors and actuators, which leads to a prohibitively long time for learning. Speed of learning requires search in a low-dimensional, low-resolution space.

24 Learning to Stand Up (Morimoto & Doya, 1998). Reward: height of the head. Punishment: tumbling. State: pitch and joint angles and their derivatives. In simulation, learning takes many thousands of trials.

25 Hierarchical Architecture. Upper level: discrete state/time, kinematics; action: subgoals; reward: the total task. Lower level: continuous state/time, dynamics; action: motor torque; reward: achieving the subgoals. (Diagram: the upper-level Q(S,A) issues a sequence of subgoals; the lower level learns V(s) and a policy a = g(s) to reach them.) A sketch of this two-level scheme follows below.
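A minimal sketch of the two-level scheme described above, not the authors' actual implementation (which uses continuous-time RL at the lower level): the upper level does discrete Q-learning over subgoal postures and is rewarded by the overall task, while the lower level is rewarded for reaching the currently active subgoal. All class and variable names are illustrative.

```python
import numpy as np

class HierarchicalAgent:
    """Two-level sketch: discrete Q-learning over subgoals on top of an
    arbitrary continuous lower-level controller (names are illustrative)."""

    def __init__(self, n_upper_states, n_subgoals):
        self.Q = np.zeros((n_upper_states, n_subgoals))  # upper-level action values

    def choose_subgoal(self, S, epsilon=0.1):
        # epsilon-greedy selection of the next subgoal posture
        if np.random.rand() < epsilon:
            return np.random.randint(self.Q.shape[1])
        return int(np.argmax(self.Q[S]))

    def upper_update(self, S, A, task_reward, S_next, gamma=0.99, alpha=0.1):
        # the upper level is rewarded by the total task (e.g. head height)
        target = task_reward + gamma * self.Q[S_next].max()
        self.Q[S, A] += alpha * (target - self.Q[S, A])

    @staticmethod
    def lower_reward(x, subgoal_x):
        # the lower level is rewarded for achieving the current subgoal posture
        return -float(np.linalg.norm(x - subgoal_x))
```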

26 Learning in Simulation. (Figure: behavior early in learning and after about 700 trials; upper-level subgoals and lower-level control.)

27 Learning with Real Hardware (Morimoto & Doya, 2001). After learning in simulation, the robot succeeds after about 100 physical trials, thanks to adaptation by the lower-level control modules.

28 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

29 Modularity in Motor Learning. Fast de-adaptation and re-adaptation: switching rather than re-learning. Combination of learned modules: serial, parallel, or sigmoidal mixtures.

30 ‘Soft’ Switching of Adaptive Modules ‘Hard’ switching based on prediction errors (Narendra et al., 1995) Can result in sub-optimal task decomposition with initially poor prediction models. ‘Soft’ switching by ‘softmax’ of prediction errors (Wolpert and Kawato, 1998) Can use ‘annealing’ for optimal decomposition. (Pawelzik et al., 1996)

31 Responsibility by Competition: each module predicts the state change, the prediction errors determine the responsibility weights, and the weights gate both the modules' outputs and their learning. A sketch of the weight computation follows below.
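A sketch of the responsibility computation as a softmax of squared prediction errors; the Gaussian scale sigma and the variable names are assumptions, not the slide's notation.

```python
import numpy as np

def responsibilities(x_change, predictions, sigma=1.0):
    """lambda_i: softmax over each module's squared prediction error for the state change."""
    errors = np.array([np.sum((x_change - p) ** 2) for p in predictions])
    logits = -errors / (2.0 * sigma ** 2)
    w = np.exp(logits - logits.max())
    return w / w.sum()   # responsibility weights, sum to 1

# Responsibility-weighted output and learning:
#   u = sum_i lambda_i * u_i, and each module's update is scaled by lambda_i.
```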

32 Multiple Linear Quadratic Controllers: each module has a linear dynamic model, a quadratic reward model, the corresponding quadratic value function, and a linear action output; see the sketch below.
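A sketch of one such module in standard (undiscounted) LQR form, a stand-in for, not necessarily identical to, the formulation on the slide; scipy's solve_continuous_are solves the Riccati equation.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_module(A, B, Q, R):
    """One linear-quadratic module: dynamics dx/dt = A x + B u and
    quadratic cost x'Qx + u'Ru (reward = negative cost).
    Returns the feedback gain K (so u = -K x) and the Riccati solution P,
    which gives the module's quadratic value function V(x) = -x'Px."""
    P = solve_continuous_are(A, B, Q, R)     # solves A'P + PA - P B R^{-1} B' P + Q = 0
    K = np.linalg.solve(R, B.T @ P)          # K = R^{-1} B' P
    return K, P

# The overall action is the responsibility-weighted sum of the modules'
# outputs: u = sum_i lambda_i * u_i, with lambda_i from the previous sketch.
```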

33 Swing-up control of a pendulum Red: module 1 Green: module 2

34 Non-linearity and Non-stationarity Specialization by predictability in space and time

35 Swing-up control of an ‘Acrobot’ Reward: height of the center of mass Linearized around four fixed points

36 Swing-up motions. (Figure: motions for R = 0.001 and R = 0.002.)

37 Module switching. (Figure: trajectories x(t) and responsibilities λ_i for R = 0.001 and R = 0.002.) The resulting module sequences give a symbol-like representation, e.g. 1-2-1-2-1-3-4-1-3-4-3-4 and 1-2-1-2-1-2-1-3-4-1-3-4.

38 Stand Up by Multiple Modules Seven locally linear models

39 Segmentation of an Observed Trajectory. (Figure: predicted motor output, predicted state change, and predicted responsibility.)

40 Imitation of Acrobot Swing-up. (Figure: swing-up from initial conditions θ1(0) = π/12, θ1(0) = π/6, and θ1(0) = π/12 with imitation.)

41 Outline. Introduction to Reinforcement Learning (RL): Markov decision process (MDP); current topics. RL in Continuous Space and Time: model-free and model-based approaches. Learning to Stand Up: discrete plans and continuous control. Modular Decomposition: multiple model-based RL (MMRL).

42 Future Directions. Autonomous learning agents: tuning of meta-parameters, design of rewards, and selection of necessary/sufficient state coding. Neural mechanisms of RL: dopamine neurons encode the TD error; the basal ganglia perform value-based action selection; the cerebellum provides internal models; the cerebral cortex supports modular decomposition.

43 What is Reward for a Robot? It should be grounded in self-preservation (self-recharging) and self-reproduction (copying the control program). (Photo: the Cyber Rodent.)

44 The Cyber Rodent Project: learning mechanisms under realistic constraints of self-preservation and self-reproduction. Topics: acquisition of task-oriented internal representations; metalearning algorithms; constraints of finite time and energy; mechanisms for collaborative behaviors; roles of communication (abstract/emotional and concrete/symbolic); gene-exchange rules for evolution.

45 Input/Output. Sensory: CCD camera, range sensor, 8 IR proximity sensors, acceleration/gyro sensors, 2 microphones. Motor: two wheels, jaw, R/G/B LED, speaker.

46 Computation/Communication. Computation: Hitachi SH-4 CPU, FPGA image processor, I/O modules. Communication: IR port, wireless LAN. Software: learning/evolution, dynamic simulation.

