Computational Investigations of the Regulative Role of Pleasure in Adaptive Behavior: Action-Selection Biased by Pleasure-Regulated Simulated Interaction


1 Computational Investigations of the Regulative Role of Pleasure in Adaptive Behavior: Action-Selection Biased by Pleasure-Regulated Simulated Interaction. Joost Broekens, Fons J. Verbeek, LIACS, Leiden University, The Netherlands.

2 Overview [Architecture diagram: the agent's distributed-state RL model performs interaction-selection and action-selection; it exchanges actions, reinforcement, and percepts with the ENVIRONMENT through Perception; an Emotion process maps stimuli to a pleasure signal that regulates the simulated interactions and simulated reinforcement fed back into the model (cognitive influence on top of reactive behavior).]

3 Learning The agent's memory structure is modelled as a directed graph. It constructs a distributed prediction of next states and learns through continuous interaction:
–The memory is adapted while the agent interacts with its environment.
–The agent selects an action a, executes it, and combines it with the resulting perception p into a situation s1 = ⟨a, p⟩. Memory adds s1 if s1 does not yet exist.
–Executing another action results in s2; add s2, and
–connect s1 to s2 by creating an interactron node I1.
–Recursively apply this process, and use interactrons to predict situations (see the sketch below).
–(!Encoding situations in this way is too symbolic for real-world applications!)
[Figure: panels a–e showing the memory graph growing from s1, to s1→s2 linked by I1, to s3 with further interactrons I2 and I3.]
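A minimal Python sketch of this memory graph follows. The class and method names (Situation, Interactron, Memory.observe) are illustrative, since the slides do not show the authors' actual data structures; the value fields on Interactron are used again on slide 5.

```python
class Situation:
    """A situation s = <a, p>: the executed action and the resulting percept."""
    def __init__(self, action, percept):
        self.key = (action, percept)

class Interactron:
    """Node connecting a situation to its successor; carries value estimates
    (see slide 5)."""
    def __init__(self, src, dst):
        self.src, self.dst = src, dst
        self.mu_direct = 0.0    # updated by reinforcement from the environment
        self.mu_indirect = 0.0  # lazily propagated estimate of future reinforcement

    @property
    def mu(self):
        return self.mu_direct + self.mu_indirect

class Memory:
    """Directed graph over situations, grown while the agent interacts."""
    def __init__(self):
        self.situations = {}    # (action, percept) -> Situation
        self.interactrons = {}  # (src_key, dst_key) -> Interactron
        self.previous = None

    def observe(self, action, percept):
        """Add situation <a, p> if new, and link it to the previous situation
        by creating an interactron the first time the transition occurs."""
        s = self.situations.setdefault((action, percept), Situation(action, percept))
        if self.previous is not None:
            edge = (self.previous.key, s.key)
            self.interactrons.setdefault(edge, Interactron(self.previous, s))
        self.previous = s
        return s
```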

4 Learning: example The agent's environment is a grid world consisting of:
–lava (red), r = -1; the agent can walk on lava but is discouraged from doing so,
–food (yellow), r = 1,
–the agent (black),
–path (white), r = 0.
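As a minimal encoding of this setup, the reward scheme could look like the following; the cell types and rewards come from the slide, while the grid layout itself is invented for illustration.

```python
# Rewards per cell type, as given on the slide.
REWARD = {'lava': -1.0, 'food': 1.0, 'path': 0.0}

# A made-up layout; the slide's actual grid is not reproduced here.
GRID = [
    ['path', 'lava', 'food'],
    ['path', 'path', 'path'],
    ['path', 'lava', 'path'],
]

def reward_at(row, col):
    """Reinforcement the agent receives for standing on cell (row, col)."""
    return REWARD[GRID[row][col]]
```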

5 Learning: reinforcement Learning to predict values follows standard RL techniques (Sutton and Barto, 1998),
–except that learning is per interactron (node) and there are two prediction values per node.
Every interactron (representing a sequence of occurred situations) maintains:
–a direct reinforcement value μd, the interactron's own predicted value, updated by the reinforcement r from the environment,
–a lazily propagated indirect reinforcement value μi, which estimates the future reinforcement based on the predicted next interactions, and
–a combined value μ = μd + μi.
A sketch of these updates follows below.
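The sketch below implements the two per-interactron values using the Interactron class from the earlier sketch. The TD-style update rule and the learning rate and discount constants are assumptions; the slide only names the two values, not the exact update equations.

```python
ALPHA, GAMMA = 0.1, 0.9  # assumed learning rate and discount factor

def update_direct(interactron, r):
    """Move the interactron's own predicted value (the direct value mu_d)
    toward the reinforcement r received from the environment."""
    interactron.mu_direct += ALPHA * (r - interactron.mu_direct)

def update_indirect_lazy(interactron, predicted_next):
    """Lazily re-estimate future reinforcement (the indirect value mu_i) from
    the combined values of the predicted next interactrons."""
    if predicted_next:
        best = max(n.mu for n in predicted_next)
        interactron.mu_indirect += ALPHA * (GAMMA * best - interactron.mu_indirect)
```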

6 Learning: reinforcement example [Figure: lazy reinforcement propagation through the memory graph.]

7 Action-Selection [Architecture diagram repeated; see slide 2.]

8 Action-Selection Integrate the distributed predictions into action values.
–Action values are the result of the parallel inhibition and excitation of actions in the agent's set of actions, A.
–Calculated by summing the weighted values of all predicted interactions per action:
l_t(a_h) = Σ_i Σ_j w_i^j · μ(x_i^j), over all predicted interactions x_i^j that predict a_h,
–with l_t(a_h) the resulting level of activation of an action a_h ∈ A at time t,
–y_i an active interactron, and
–x_i^j a predicted interaction of y_i that predicts action a_h, with weight w_i^j.
Action-selection is based on these action values (see the sketch below):
–If any l_t(a_h) > threshold ⇒ select a_select such that l_t(a_select) = max(l_t(a_1), …, l_t(a_|A|)).
–If all l_t(a_h) < threshold ⇒ select a_select stochastically from l(a_1), …, l(a_|A|).
–Other selection mechanisms are possible, e.g., Boltzmann.
(!With a static threshold our selection suffers from a lack of exploration!)
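A hedged sketch of this selection mechanism follows. The structure of the inputs is assumed: each active interactron is taken to expose a `predicted` list of (interaction, weight) pairs, and each predicted interaction an `action` and a combined value `mu`; the stochastic branch uses a uniform choice because the slide does not fix the distribution.

```python
import random

THRESHOLD = 0.3  # static threshold; the slide notes this harms exploration

def action_levels(active_interactrons, actions):
    """Compute l_t(a_h) for every action: the sum of the weighted values of
    all predicted interactions that predict that action."""
    levels = {a: 0.0 for a in actions}
    for y in active_interactrons:
        for x, weight in y.predicted:          # assumed (interaction, weight) pairs
            levels[x.action] += weight * x.mu  # parallel excitation/inhibition
    return levels

def select_action(levels):
    """Threshold-based selection as described on the slide."""
    if max(levels.values()) > THRESHOLD:
        return max(levels, key=levels.get)     # deterministic winner-take-all
    # Below threshold: stochastic choice; uniform here, one possible reading
    # of "select stochastically from l(a_1), ..., l(a_|A|)".
    return random.choice(list(levels))
```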

9 Action-Selection: example

10 Internal Simulation [Architecture diagram repeated; see slide 2.]

11 Thinking as Internal Simulation of Behavior Internal simulation of behavior:
–Covertly execute and evaluate potential interactions using sensory-motor substrates (Hesslow, 2002; Damasio; Cotterill, 2001), but see also
–"interaction potentialities" (Bickhard), and
–"state anticipation" (Butz, Sigaud, Gérard, 2003).
–Existing mechanisms are the basis for simulation ⇒ evolutionary continuity!

12 Simulation: action-selection bias At every step, instead of immediate action-selection, select a subset of predicted interactions from the reinforcement learning model and feed them back to the RL model:
1. Interaction-selection: select a subset of the predicted interactions.
2. Simulate-and-bias-predicted-benefit: feed each selected interaction back to the model as if it were a real interaction. (Note that the memory advances to time t+1, so we have to
3. reset-memory-state to time t to be able to select an appropriate action.)
4. Action-selection: select the next action using the action-selection mechanism explained earlier, based on the now-biased action values.
A sketch of this loop follows below.
[Architecture diagram repeated; see slide 2.]
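The sketch below strings the four steps together. Every method on `model` and `memory` is an assumed name, not the authors' API. The threshold is read as a value cutoff (slide 17's reading: a high threshold keeps only the best predictions); slide 14 phrases the same knob as the percentage of predictions simulated.

```python
def simulation_step(memory, model, threshold):
    """One decision step with internal simulation as an action-selection bias."""
    predicted = model.predicted_interactions()

    # 1. Interaction-selection: keep the predictions whose value passes the cutoff.
    selected = [x for x in predicted if x.mu >= threshold]

    snapshot = memory.save_state()      # simulation will advance memory to t+1
    for x in selected:
        # 2. Simulate-and-bias: feed back as if a real interaction occurred.
        model.process(x, simulated=True)
    memory.restore_state(snapshot)      # 3. Reset memory state to time t.

    # 4. Action-selection on the now-biased action values
    #    (action_levels / select_action from the slide 8 sketch).
    return select_action(action_levels(model.active_interactrons, model.actions))
```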

13 Simulation: example Action list before simulation (!hypothetical example!):
–{up=0.2, down=-0.5, right=-1, left=-1}
Action-selection would have selected "up",
–at least using our naive action-selection mechanism.
–With Boltzmann ⇒ high probability for "up".
Simulate all interactions:
–propagate back the predicted values by simulating interaction with the environment,
–the effect is a one-step "value look-ahead".
Action list after simulation:
–{up=0.1, down=0.5, right=-1, left=-1}
Action-selection now selects "down". In this example, simulating all predicted interactions helps.
[Figure: grid world with a roadblock, r = -0.5.]
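The slide's hypothetical numbers can be checked directly against greedy selection:

```python
# The slide's hypothetical action lists, before and after the one-step
# value look-ahead produced by simulating all predicted interactions.
before = {'up': 0.2, 'down': -0.5, 'right': -1.0, 'left': -1.0}
after  = {'up': 0.1, 'down': 0.5,  'right': -1.0, 'left': -1.0}

assert max(before, key=before.get) == 'up'    # naive selection before simulation
assert max(after, key=after.get) == 'down'    # the simulation bias flips the choice
```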

14 But: Simulating Everything is not Always Best Even apart from the fact that simulating everything costs mental effort, earlier experiments (Broekens, 2005) showed that:
–simulation has a benefit, especially when many interactions are simulated; this is not surprising (a better heuristic). However,
–in some cases less simulation resulted in better learning.
⇒ There is a dynamic relation between the environment and the simulation "strategy" (i.e., the simulation threshold: the percentage of all predicted interactions to be simulated).
⇒ Emotion as metalearning to adapt the amount of internal simulation? (Doya, 2002)
–Pleasure is an indication of the current performance of the agent (Clore and Gasper, 2000). Also,
–high pleasure ⇒ top-down thinking, and low pleasure ⇒ bottom-up thinking (Fiedler and Bless, 2000).

15 Pleasure Modulates Simulation [Architecture diagram repeated; see slide 2.]

16 Pleasure Modulates Simulation There are many theories of emotion; we use the core-affect (or activation-valence) theory of emotion as our basis.
–Two fundamental factors, pleasure and arousal (Russell, 2003).
–Pleasure relates to emotional valence, and
–arousal relates to action-readiness, or activity.
In this study we model pleasure as the simulation threshold.
–We use pleasure to dynamically adapt the number of interactions that are simulated; it thus acts as a dynamic simulation threshold.
–We study the indirect effect of emotion as a metalearning parameter that affects information processing, which in turn influences action-selection.
Many models study emotion as a direct influence on action-selection (or motivation(-al states)) (Avila-Garcia and Cañamero, 2004; Cañamero, 1997; Velasquez, 1998), or as information (e.g., Botelho and Coelho). An example of an exception is Belavkin (2004), on the relation between emotion, entropy, and information processing.

17 Pleasure Modulates Simulation Pleasure: an indication of current performance relative to what the agent is used to.
–We try to capture this with the normalized difference between the short-term average reinforcement signal and the long-term average reinforcement signal (a sketch of this signal, e_p, follows below).
Continuous pleasure feedback:
–High pleasure, going well? Continue the strategy, goal-directed thinking: high e_p ⇒ high threshold ⇒ simulate only the predicted best interactions.
–Low pleasure? Look broader, pay more attention to all predicted interactions: low e_p ⇒ low threshold ⇒ simulate many interactions.
[Architecture diagram repeated; see slide 2, here with the pleasure signal labelled e_p.]
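A minimal sketch of the pleasure signal and its use as a threshold. The window sizes, the normalization to [0, 1] with 0.5 meaning "performing as usual", and the identity mapping from pleasure to threshold are all assumptions; the slide gives the idea but not these constants.

```python
from collections import deque

class PleasureSignal:
    """Pleasure e_p as the normalized difference between the short-term and
    long-term average reinforcement (window sizes are assumed values)."""
    def __init__(self, short_window=16, long_window=128):
        self.short = deque(maxlen=short_window)
        self.long = deque(maxlen=long_window)

    def update(self, r):
        self.short.append(r)
        self.long.append(r)
        diff = sum(self.short) / len(self.short) - sum(self.long) / len(self.long)
        # Assumed normalization: clip to [0, 1], 0.5 = performing as usual.
        return min(1.0, max(0.0, 0.5 + 0.5 * diff))

def simulation_threshold(e_p):
    """High pleasure -> high threshold (simulate only the best predictions);
    low pleasure -> low threshold (simulate broadly). The identity mapping is
    the simplest choice; the actual mapping is not given on the slide."""
    return e_p
```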

18 Experiments To measure the adaptive effect of pleasure-modulated simulation, force the agent to adapt to a new task:
–first the agent has 128 trials to learn task 1, then
–the environment is switched to a new task, with 128 trials to learn task 2;
–repeat for many different parameter settings (e.g., the windows of the long- and short-term average reinforcement signals, the learning rate, etc.).
Pleasure predictions:
–pleasure increases to a value near 1 (the agent gets better at the task),
–then slowly converges down to 0.5 (the agent gets used to the task);
–at the switch, pleasure drops (new task, drop in performance),
–then increases to a value near 1 and converges down to 0.5 again (the agent gets used to the new task).
A sketch of this protocol follows below.
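The shape of the task-switch protocol, as a sketch; `agent.run_trial` and `agent.pleasure` are illustrative names for an agent wrapping the components sketched earlier.

```python
def run_experiment(agent, task1, task2, trials=128):
    """128 trials on task 1, then 128 trials on task 2, as on the slide."""
    pleasure_curve = []
    for task in (task1, task2):
        for _ in range(trials):
            r = agent.run_trial(task)  # total reinforcement of one trial (assumed API)
            pleasure_curve.append(agent.pleasure.update(r))
    # Predicted shape: rise toward 1, settle near 0.5, drop at the task
    # switch, then rise and settle near 0.5 again.
    return pleasure_curve
```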

19 Results The performance of pleasure-modulated simulation is comparable with simulating ALL predicted interactions, or the best 50% of them (static simulation thresholds), while using only 30% and 70% of the mental resources, respectively.

20 Results Some settings even achieve significantly better performance at lower mental cost. The predicted pleasure curve was confirmed.

21 Conclusions Simple pleasure feedback can be used to determine the broadness of internal simulation; when simulation is used as an action-selection bias, performance is comparable and mental effort decreases.
–Since we introduce few new mechanisms for simulation ⇒ relevant to understanding the evolutionary plausibility of the simulation hypothesis, as increased adaptation at lower cost is an evolutionarily advantageous feature.
Our results provide clues to a relation between the simulation hypothesis and emotion theory.

22 Action-selection discussion, and questions Use emotion to:
–vary the action-selection distribution (Doya, 2002), and/or
–vary the interaction-selection distribution (e.g., the temperature of Boltzmann selection, or the threshold of our action-selection mechanism); see the sketch below.
Interplay between covert interaction (simulation) and overt interaction (action-selection):
–Simulate the best interaction, but choose an action stochastically, see also (Gadanho, 2003) ⇒ gives an extra "drive" to certain actions.
–The inverse? Seems rational too ⇒ simulate bad actions for "mental (covert) exploration", choose the best actions for "overt exploitation" ⇒ early experiments do not (yet) show a clear benefit.
How to integrate internal simulation input into action-selection?
Questions?
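For reference, a standard Boltzmann (softmax) selector whose temperature could itself be modulated by emotion, as suggested above; a sketch, not the authors' code.

```python
import math
import random

def boltzmann_select(levels, temperature=1.0):
    """Pick an action with probability proportional to exp(l(a) / T).
    A low temperature approaches greedy selection; a high one, uniform choice."""
    actions = list(levels)
    weights = [math.exp(levels[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```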

