Decision making
How does the brain learn the values?
The computational problem
The goal is to maximize the sum of rewards.
The computational problem
The value of the state S1 depends on the policy. If the animal chooses 'right' at S1, the value of S1 is the expected sum of rewards along the 'right' branch.
How to find the optimal policy in a complicated world?
If the values of the different states are known, then this task is easy, as the sketch below illustrates.
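A minimal sketch of this point in Python (the maze layout and the numbers are hypothetical, not taken from the slides): once V(s) is known for every state, the optimal choice is simply the action whose successor state has the highest value.

```python
# Hypothetical two-choice world: at S1 the animal can go 'left' or 'right'.
# The state values below are made up for illustration.
V = {"S_left": 0.5, "S_right": 2.0}                   # known state values
successor = {"left": "S_left", "right": "S_right"}    # state reached by each action

def greedy_action(successor, V):
    """Pick the action whose successor state has the highest known value."""
    return max(successor, key=lambda a: V[successor[a]])

print(greedy_action(successor, V))  # -> 'right'
```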
How can the values of the different states be learned?
V(S_t) = the value of the state at time t
r_t = the (average) reward delivered at time t
V(S_{t+1}) = the value of the state at time t+1
The TD (temporal difference) learning algorithm:
V(S_t) ← V(S_t) + ε · δ_t
where δ_t = r_t + V(S_{t+1}) − V(S_t) is the TD error and ε is the learning rate.
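A short sketch of this update in Python. The learning-rate symbol ε and the absence of a discount factor are assumptions about the slide's notation; the rule itself is the standard TD(0) update built from the quantities defined above.

```python
def td_update(V, s_t, s_next, r_t, eps=0.1):
    """One TD(0) step on a value table V (dict: state -> value).

    delta = r_t + V(S_{t+1}) - V(S_t) is the TD error;
    V(S_t) is then nudged by eps * delta.
    """
    delta = r_t + V[s_next] - V[s_t]
    V[s_t] += eps * delta
    return delta
```

Applying this update every time the animal moves from S_t to S_{t+1} and receives r_t gradually turns V(S_t) into a prediction of the upcoming rewards.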
10
Dopamine
Dopamine is good
Dopamine is released by rewarding experiences, e.g., sex and food.
Cocaine, nicotine, and amphetamine directly or indirectly increase dopamine release.
Neutral stimuli that are associated with rewarding experiences result in a release of dopamine.
Drugs that reduce dopamine activity reduce motivation and cause anhedonia (inability to experience pleasure).
Long-term use of such drugs may result in dyskinesia (diminished voluntary movements and the presence of involuntary movements).
No dopamine is bad
No dopamine is bad (Parkinson's disease)
Bradykinesia – slowness in voluntary movement such as standing up, walking, and sitting down. This may lead to difficulty initiating walking and, when more severe, to "freezing episodes" once walking has begun.
Tremors – often occur in the hands, fingers, forearms, feet, mouth, or chin. Typically, tremors take place when the limbs are at rest rather than during movement.
Rigidity – stiff muscles, which often produce muscle pain that increases during movement.
Poor balance – results from the loss of the reflexes that maintain posture, causing unsteady balance that oftentimes leads to falls.
Schultz, Dayan and Montague, Science, 1997
(Figure: a trial timeline running from the CS to the reward.)
Before trial 1: V(S_t) = 0 for all states.
In trial 1: no reward in states 1-7, so δ = 0 there and nothing changes; the reward of size 1 in state 8 gives δ_8 = 1, so V(S_8) ← ε.
Before trial 2: V(S_8) = ε; all other values are still 0.
In trial 2, for states 1-6: δ = 0, so nothing changes.
For state 7: δ_7 = V(S_8) − V(S_7) = ε > 0, so V(S_7) ← ε².
For state 8: δ_8 = 1 − V(S_8) = 1 − ε, so V(S_8) moves closer to 1.
Before trial 3: V(S_7) = ε² and V(S_8) = ε(2 − ε); all other values are still 0.
In trial 3, for states 1-5: δ = 0.
For state 6: δ_6 = V(S_7) = ε² > 0, so V(S_6) becomes positive.
For state 7: δ_7 = V(S_8) − V(S_7) > 0, so V(S_7) keeps growing.
For state 8: δ_8 = 1 − V(S_8) = (1 − ε)², so V(S_8) keeps moving toward 1.
After many trials the values have converged and δ = 0 at every time step that can be predicted; the only remaining prediction error is at the CS, whose time is unknown (nothing earlier in the trial predicts it), as the short simulation below illustrates.
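A short simulation of the example above (a sketch, not the original code): eight states per trial, a reward of size 1 at state 8, and the TD rule from earlier with an assumed learning rate of 0.5. It reproduces the pattern on the slides: the TD error appears first at the reward and then creeps back, one state per trial, toward the CS.

```python
import numpy as np

n_states = 8                 # states 1..8 follow the CS; a reward of size 1 arrives at state 8
eps = 0.5                    # learning rate (assumed value, not given on the slide)
V = np.zeros(n_states + 2)   # V[1..8]; V[9] stays 0 (the trial has ended)
r = np.zeros(n_states + 1)
r[8] = 1.0

for trial in range(1, 31):
    # The CS itself arrives at an unpredictable time, so no earlier state acquires
    # value and the error at CS onset is simply V(S_1) - 0.
    delta_cs = V[1]
    deltas = [r[t] + V[t + 1] - V[t] for t in range(1, n_states + 1)]  # TD errors
    for t in range(1, n_states + 1):
        V[t] += eps * deltas[t - 1]                                    # TD updates
    if trial in (1, 2, 3, 30):
        print(f"trial {trial:2d}: delta at CS = {delta_cs:.2f}, "
              f"delta at states 1-8 = {np.round(deltas, 2)}")

# trial 1: the only error is at state 8 (the unexpected reward)
# trials 2-3: the error creeps back to states 7 and 6
# trial 30: errors within the trial are ~0; the error now sits at the CS, whose time is unknown
```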
22
Schultz, 1998
Bayer and Glimcher, 2005
"We found that these neurons encoded the difference between the current reward and a weighted average of previous rewards, a reward prediction error, but only for outcomes that were better than expected."
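A sketch of the quantity described in that sentence: the current reward minus a weighted average of the previous rewards, kept only when positive (better than expected). The exponential weighting and the parameter value below are illustrative assumptions, not the weights estimated in the paper.

```python
def rectified_rpe(rewards, alpha=0.3):
    """Reward prediction error rectified at zero:
    current reward minus a running (exponentially weighted) average of past rewards,
    reported only for outcomes that were better than expected."""
    expectation = 0.0
    errors = []
    for r in rewards:
        rpe = r - expectation                      # current reward vs. weighted history
        errors.append(max(rpe, 0.0))               # keep only better-than-expected errors
        expectation += alpha * (r - expectation)   # update the weighted average
    return errors

print(rectified_rpe([0.0, 1.0, 1.0, 0.0, 2.0]))
```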
Bayer and Glimcher, 2005