1
1 Structure of a Neuron: At the dendrites the incoming signals (incoming currents) arrive. At the soma the currents are finally integrated. At the axon hillock action potentials are generated if the potential crosses the membrane threshold. The axon transmits (transports) the action potential to distant sites. At the synapses the outgoing signals are transmitted onto the dendrites of the target neurons. (Levels of organization: molecules, synapses, neurons, local nets, areas, systems, CNS)
2
2 Chemical synapse: Learning = change of synaptic strength. (Figure: neurotransmitter release and postsynaptic receptors)
3
3 Overview of the different methods
4
4 Different Types/Classes of Learning: Unsupervised Learning (non-evaluative feedback): trial-and-error learning; no error signal, no influence from a teacher, correlation evaluation only. Reinforcement Learning (evaluative feedback): (classical & instrumental) conditioning, reward-based learning; "good/bad" error signals; the teacher defines what is good and what is bad. Supervised Learning (evaluative error-signal feedback): teaching, coaching, imitation learning, learning from examples, and more; rigorous error signals; direct influence from a teacher/teaching signal.
5
5 An unsupervised learning rule, the basic Hebb rule: dω_i/dt = μ u_i v, with μ << 1. For learning: one input, one output. A supervised learning rule (Delta rule): no input, no output, one error-function derivative, where the error function compares input with output examples. A reinforcement learning rule (TD-learning): one input, one output, one reward.
6
6 Self-organizing maps: unsupervised learning. Neighborhood relationships are usually preserved (+). The absolute structure depends on the initial conditions and cannot be predicted (-). (Figure: input layer projecting onto the map)
7
7 An unsupervised learning rule, the basic Hebb rule: dω_i/dt = μ u_i v, with μ << 1. For learning: one input, one output. A supervised learning rule (Delta rule): no input, no output, one error-function derivative, where the error function compares input with output examples. A reinforcement learning rule (TD-learning): one input, one output, one reward.
8
8 Classical Conditioning (I. Pavlov)
9
9 An unsupervised learning rule, the basic Hebb rule: dω_i/dt = μ u_i v, with μ << 1. For learning: one input, one output. A supervised learning rule (Delta rule): no input, no output, one error-function derivative, where the error function compares input with output examples. A reinforcement learning rule (TD-learning): one input, one output, one reward.
10
10 Supervised Learning: Example OCR
11
11 The influence of the type of learning on the speed and autonomy of the learner: correlation-based learning (no teacher); reinforcement learning, indirect influence; reinforcement learning, direct influence; supervised learning (teacher); programming. (Figure axes: learning speed vs. autonomy)
12
12 Hebbian learning: "When an axon of cell A excites cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells so that A's efficiency ... is increased." Donald Hebb (1949). (Figure: cells A and B and their spike times t)
13
13 Overview of the different methods. You are here!
14
14 Hebbian Learning ... correlates inputs with outputs by the basic Hebb rule: dω₁/dt = μ v u₁, with μ << 1. Vector notation for the cell activity: v = w · u. This is a dot product, where w is the weight vector and u the input vector. Strictly, we need to assume that weight changes are slow, otherwise this turns into a differential equation.
15
15 Single input: dω₁/dt = μ v u₁, μ << 1. Many inputs: dw/dt = μ v u, μ << 1. As v is a single output, it is a scalar. Averaging inputs: dw/dt = μ ⟨v u⟩, μ << 1. We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow. If we replace v with w · u we can write: dw/dt = μ Q · w, where Q = ⟨u u^T⟩ is the input correlation matrix. Note: Hebb yields an unstable (always growing) weight vector!
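A minimal numerical sketch of the averaged rule dw/dt = μ Q · w (the input statistics, learning rate, and step count are arbitrary choices): it illustrates the instability noted above, as the weight norm keeps growing while the weight direction aligns with the dominant eigenvector of Q.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-d inputs; Q is their correlation matrix <u u^T>.
u = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.8], [0.0, 0.6]])
Q = u.T @ u / len(u)

w = rng.normal(size=2) * 0.01          # small initial weight vector
mu = 0.01
for step in range(500):
    w += mu * Q @ w                    # discretized averaged Hebb rule

eigvals, eigvecs = np.linalg.eigh(Q)
print("||w|| =", np.linalg.norm(w))    # keeps growing with more steps
print("alignment with top eigenvector:",
      abs(w / np.linalg.norm(w) @ eigvecs[:, -1]))
```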
16
16 Synaptic plasticity evoked artificially: examples of long-term potentiation (LTP) and long-term depression (LTD). LTP was first demonstrated by Bliss and Lømo in 1973; since then it has been induced in many different ways, usually in slice preparations. LTD was robustly shown by Dudek and Bear in 1992, in hippocampal slice.
17
17
18
18
19
19
20
20 LTP will lead to new synaptic contacts
21
21 Conventional LTP = Hebbian learning: symmetrical weight-change curve; the temporal order of input and output does not play any role. (Figure: synaptic change in % as a function of the pre- and postsynaptic spike times t_Pre and t_Post)
22
22
23
23 Spike-timing-dependent plasticity (STDP), Markram et al. 1997
24
24 Spike-Timing-Dependent Plasticity: temporal Hebbian learning. Weight-change curve (Bi & Poo, 2001): if Pre precedes Post, long-term potentiation (causal, possibly); if Pre follows Post, long-term depression (acausal). (Figure: synaptic change in % as a function of the pre- and postsynaptic spike times t_Pre and t_Post)
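Such a weight-change curve is commonly summarized by two exponentials, one for potentiation (ΔT > 0) and one for depression (ΔT < 0). A minimal sketch with arbitrary amplitudes and time constants, using ΔT = t_Post - t_Pre as on the later slides:

```python
import numpy as np

def stdp_dw(delta_t_ms, a_plus=1.0, a_minus=0.5, tau_plus=20.0, tau_minus=20.0):
    """Illustrative STDP window: potentiation for pre-before-post (delta_t > 0),
    depression for post-before-pre (delta_t < 0). The amplitudes and time
    constants are arbitrary; only the asymmetric shape is the point."""
    delta_t = np.asarray(delta_t_ms, dtype=float)
    return np.where(delta_t > 0,
                    a_plus * np.exp(-delta_t / tau_plus),
                    -a_minus * np.exp(delta_t / tau_minus))

print(stdp_dw([-40, -10, 10, 40]))   # depression ... potentiation
```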
25
25 Back to the math. We had: Single input: dω₁/dt = μ v u₁, μ << 1. Many inputs: dw/dt = μ v u, μ << 1. As v is a single output, it is a scalar. Averaging inputs: dw/dt = μ ⟨v u⟩, μ << 1. We can just average over all input patterns and approximate the weight change by this. Remember, this assumes that weight changes are slow. If we replace v with w · u we can write: dw/dt = μ Q · w, where Q = ⟨u u^T⟩ is the input correlation matrix. Note: Hebb yields an unstable (always growing) weight vector!
26
26 Covariance rule(s): Normally firing rates are only positive and plain Hebb would yield only LTP. Hence we introduce a threshold to also get LTD. Output threshold: dw/dt = μ (v - θ) u, μ << 1. Input vector threshold: dw/dt = μ v (u - θ), μ << 1. Many times one sets the threshold as the average activity over some reference time period (training period): θ = ⟨v⟩ or θ = ⟨u⟩. Together with v = w · u we get: dw/dt = μ C · w, where C is the covariance matrix of the input (http://en.wikipedia.org/wiki/Covariance_matrix): C = ⟨(u - ⟨u⟩)(u - ⟨u⟩)^T⟩ = ⟨u u^T⟩ - ⟨u⟩⟨u⟩^T = ⟨(u - ⟨u⟩) u^T⟩.
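A minimal sketch of the output-threshold covariance rule (the input statistics and learning rate are arbitrary); the point is only that LTD now occurs whenever the output falls below the threshold ⟨v⟩:

```python
import numpy as np

rng = np.random.default_rng(6)
u_patterns = np.abs(rng.normal(1.0, 0.5, size=(1000, 3)))   # positive firing rates
w = np.full(3, 0.5)
mu = 1e-3

theta = np.mean(u_patterns @ w)        # output threshold <v> from a reference period
ltd_steps = 0
for x in u_patterns:
    v = w @ x
    dw = mu * (v - theta) * x          # covariance rule with output threshold
    ltd_steps += int(np.all(dw <= 0))  # depression whenever v < theta
    w += dw

print("fraction of LTD steps:", ltd_steps / len(u_patterns))
print("final weights:", np.round(w, 3))
```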
27
27 The covariance rule can produce LTP without (!) postsynaptic output. This is biologically unrealistic, and the BCM rule (Bienenstock, Cooper, Munro) takes care of this. BCM rule: dw/dt = μ v u (v - θ), μ << 1. As such this rule is again unstable, but BCM introduces a sliding threshold: dθ/dt = ν (v² - θ), ν < 1. Note: the rate ν of the threshold change should be faster than the weight changes (μ << ν), but slower than the presentation of the individual input patterns. This way the weight growth will be over-dampened relative to the (weight-induced) activity increase.
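A minimal sketch of the BCM rule with its sliding threshold (all rates and input statistics are arbitrary choices; note that the threshold is updated faster than the weights, as required above):

```python
import numpy as np

rng = np.random.default_rng(1)
u_patterns = np.abs(rng.normal(1.0, 0.3, size=(5000, 3)))   # positive "firing rates"
w = np.full(3, 0.1)
theta = 0.0
mu, nu = 1e-3, 1e-1            # threshold adapts much faster than the weights

for x in u_patterns:
    v = w @ x                          # postsynaptic rate
    w += mu * v * x * (v - theta)      # BCM weight update
    theta += nu * (v**2 - theta)       # sliding threshold tracks <v^2>

print("final weights:", np.round(w, 3))
print("final threshold:", round(theta, 3),
      " output for mean input:", round(w @ u_patterns.mean(axis=0), 3))
```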
28
28 Problem: Hebbian learning can lead to unlimited weight growth. Solution: weight normalization, a) subtractive (subtract the mean change of all weights from each individual weight), or b) multiplicative (multiply each weight by a gradually decreasing factor). Evidence for weight normalization: reduced weight increase as soon as weights are already big (Bi and Poo, 1998, J. Neurosci.).
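A minimal sketch contrasting the two normalization schemes on a plain Hebb update. Oja's rule is used here as a common multiplicative variant; that specific choice, like all parameter values, is an assumption for illustration and not something stated on the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
u_patterns = rng.normal(size=(200, 4))
mu = 1e-2

def hebb_subtractive(u_patterns, mu):
    """Plain Hebb step followed by subtractive normalization: the mean change
    of all weights is subtracted from each weight, so the summed weight stays constant."""
    w = rng.normal(size=u_patterns.shape[1]) * 0.1
    for x in u_patterns:
        v = w @ x
        dw = mu * v * x
        w += dw - dw.mean()
    return w

def hebb_multiplicative(u_patterns, mu):
    """Multiplicative normalization in the style of Oja's rule (an assumed
    example): a decay term ~ v^2 * w keeps the weight norm bounded."""
    w = rng.normal(size=u_patterns.shape[1]) * 0.1
    for x in u_patterns:
        v = w @ x
        w += mu * (v * x - v**2 * w)
    return w

w_sub = hebb_subtractive(u_patterns, mu)
w_mult = hebb_multiplicative(u_patterns, mu)
print("subtractive:", np.round(w_sub, 3), " sum of weights:", round(w_sub.sum(), 3))
print("multiplicative (Oja):", np.round(w_mult, 3), " norm:", round(np.linalg.norm(w_mult), 3))
```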
29
29 Examples of Applications: Kohonen (1984), speech recognition, a map of phonemes in the Finnish language. Goodhill (1993) proposed a model for the development of retinotopy and ocular dominance, based on Kohonen maps (SOM). Angéniol et al. (1988), travelling salesman problem (an optimization problem). Kohonen (1990), learning vector quantization (pattern classification problem). Ritter & Kohonen (1989), semantic maps. (Figure: OD = ocular dominance and ORI = orientation maps)
30
30 Differential Hebbian Learning of Sequences Learning to act in response to sequences of sensor events
31
31 Overview of the different methods. You are here!
32
32 History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning (I. Pavlov)
33
33
34
34 History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning (I. Pavlov). Correlating two stimuli which are shifted with respect to each other in time. Pavlov's dog: "the bell comes earlier than the food". This requires the system to remember the stimuli. Eligibility trace: a synapse remains "eligible" for modification for some time after it was active (Hull 1938, then a still abstract concept).
35
35 Classical Conditioning: Eligibility Traces. The first stimulus needs to be "remembered" in the system. (Figure: unconditioned stimulus (food) with fixed weight ω₀ = 1, conditioned stimulus (bell) with plastic weight ω₁, stimulus trace E, summation +, response X)
36
36 History of the Concept of Temporally Asymmetrical Learning: Classical Conditioning, Eligibility Traces (I. Pavlov). Note: there are vastly different time scales: (Pavlov's) behavioural experiments typically extend up to 4 seconds, as compared to STDP at neurons, typically 40-60 milliseconds (max.).
37
37 Defining the Trace: In general there are many ways to do this, but usually one chooses a trace that looks biologically realistic and allows for some analytical calculations, too. EPSP-like functions: the α-function; the double exponential (this one is easiest to handle analytically and thus often used); the damped sine wave (shows an oscillation).
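A small numerical illustration of the three kernel shapes; the time constants and the oscillation frequency below are arbitrary choices:

```python
import numpy as np

t = np.linspace(0, 200, 2001)       # time in ms

def alpha_function(t, tau=20.0):
    """Alpha function: (t/tau) * exp(1 - t/tau), peaking at t = tau."""
    return (t / tau) * np.exp(1 - t / tau)

def double_exp(t, tau_rise=5.0, tau_decay=40.0):
    """Difference of two exponentials; easy to handle analytically."""
    return np.exp(-t / tau_decay) - np.exp(-t / tau_rise)

def damped_sine(t, tau=40.0, freq_hz=20.0):
    """Damped sine wave; shows an oscillation."""
    return np.exp(-t / tau) * np.sin(2 * np.pi * freq_hz * t / 1000.0)

traces = {f.__name__: f(t) for f in (alpha_function, double_exp, damped_sine)}
for name, h in traces.items():
    print(f"{name}: peak at t = {t[np.argmax(h)]:.1f} ms")
```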
38
38 Overview of the different methods. The mathematical formulation of the learning rules is similar, but the time scales are very different.
39
39 Differential Hebb Learning Rule. Simpler notation: x = input, u = traced input, v = output and v'(t) its derivative. Early: "Bell" (x_i, u_i); late: "Food" (x₀, u₀).
40
40 Convolution is used to define the traced input u; correlation is used to calculate the weight growth of ω.
41
41 Differential Hebbian Learning: the weight change correlates the filtered input with the derivative of the output. This produces an asymmetric weight-change curve (if the filters h produce unimodal "humps"). (Figure: filtered input, output and its derivative, weight change plotted against the interval T)
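A minimal numerical sketch of this idea, assuming the rule dω/dt = μ u v' with u a low-pass-filtered ("traced") early input and v an output driven by a fixed late pathway; the filter, time constants, and pulse timings are arbitrary choices.

```python
import numpy as np

def lowpass_trace(x, tau, dt=1.0):
    """Simple first-order low-pass filter used as the eligibility trace u(t)."""
    u = np.zeros_like(x)
    for k in range(1, len(x)):
        u[k] = u[k - 1] + dt * (-u[k - 1] / tau + x[k])
    return u

dt, steps = 1.0, 400
t = np.arange(steps) * dt
x_early = (t == 100).astype(float)        # "bell"-like pulse at t = 100 ms
x_late  = (t == 140).astype(float)        # "food"-like pulse at t = 140 ms

u_early = lowpass_trace(x_early, tau=30.0)
u_late  = lowpass_trace(x_late,  tau=30.0)

w_early, w_late, mu = 0.0, 1.0, 0.05      # late (reflex) pathway has fixed weight 1
dw, v_prev = 0.0, 0.0
for k in range(steps):
    v = w_early * u_early[k] + w_late * u_late[k]
    v_dot = (v - v_prev) / dt
    dw += mu * u_early[k] * v_dot         # differential Hebb: trace times output derivative
    v_prev = v

# Positive because the early input precedes the output; swapping the two pulse
# times flips the sign, which gives the asymmetric weight-change curve.
print("weight change of the early input:", dw)
```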
42
42 Conventional LTP: symmetrical weight-change curve; the temporal order of input and output does not play any role. (Figure: synaptic change in % as a function of the pre- and postsynaptic spike times t_Pre and t_Post)
43
43 Differential Hebbian Learning: the weight change correlates the filtered input with the derivative of the output. This produces an asymmetric weight-change curve (if the filters h produce unimodal "humps"). (Figure: filtered input, output and its derivative, weight change plotted against the interval T)
44
44 Spike-timing-dependent plasticity (STDP): weight-change curve (Bi & Poo, 2001), plotted against T = t_Post - t_Pre in ms. Pre precedes Post: long-term potentiation; Pre follows Post: long-term depression. Note the vague shape similarity to the differential Hebbian weight-change curve.
45
45 Overview of the different methods. You are here!
46
46 The biophysical equivalent of Hebb's postulate: a pre-post correlation. But why is this needed? (Figure: plastic NMDA/AMPA synapse, presynaptic signal (Glu), postsynaptic source of depolarization)
47
47 Plasticity is mainly mediated by so-called N-methyl-D-aspartate (NMDA) channels. These channels respond to glutamate as their transmitter and they are voltage dependent:
48
48 Biophysical Model: Structure. Hence NMDA synapses (channels) do require a (Hebbian) correlation between pre- and postsynaptic activity! Sources of depolarization: 1) any other drive (AMPA or NMDA), 2) a back-propagating spike. (Figure: input x, NMDA synapse, output v)
49
49 Local Events at the Synapse. Local current sources "under" the synapse: the synaptic current I_synaptic. Global sources: I_BP, the influence of a back-propagating spike, and I_Dendritic, currents from all parts of the dendritic tree. (Figure: input x₁, traced input u₁, output v)
51
51 On "Eligibility Traces". (Figure: presynaptic spike, BP- or D-spike, g_NMDA, V*h, membrane potential, weight, synaptic input, depolarization source)
52
52 Model structure: a dendritic compartment with a plastic synapse containing NMDA channels, the source of Ca²⁺ influx and the coincidence detector. Sources of depolarization: 1. back-propagating spike, 2. local dendritic spike. (Figure: plastic NMDA/AMPA synapse with conductance g, BP spike, dendritic spike)
53
53 Plasticity Rule (Differential Hebb). The instantaneous weight change combines a presynaptic influence (the glutamate effect on the NMDA channels) with a postsynaptic influence. (Figure: plastic NMDA/AMPA synapse with conductance g, source of depolarization)
54
54 Presynaptic influence: the normalized NMDA conductance. NMDA channels are instrumental for LTP and LTD induction (Malenka and Nicoll, 1999; Dudek and Bear, 1992). (Figure: plastic NMDA/AMPA synapse with conductance g, source of depolarization)
55
55 Depolarizing potentials in the dendritic tree: dendritic spikes (Larkum et al., 2001; Golding et al., 2002; Häusser and Mel, 2003) and back-propagating spikes (Stuart et al., 1997).
56
56 Postsynaptic influence: for F we use a low-pass filtered ("slow") version of a back-propagating or a dendritic spike. (Figure: plastic NMDA/AMPA synapse with conductance g, source of depolarization)
57
57 BP and D-Spikes
58
58 Weight Change Curves. Source of depolarization: back-propagating spikes. (Figure: NMDAr activation paired with a back-propagating spike yields a weight-change curve plotted against T = t_Post - t_Pre)
59
59 Things to remember: the biophysical equivalent of Hebb's PRE-POST CORRELATION postulate. Presynaptic influence: the slow-acting NMDA signal (Glu). Postsynaptic influence: a source of depolarization; possible sources are a BP spike, a dendritic spike, or local depolarization. (Figure: plastic NMDA/AMPA synapse)
60
60 One word about Supervised Learning
61
61 Overview of the different methods – Supervised Learning. You are here! (And many more methods exist.)
62
62 Supervised learning methods are mostly non-neuronal and will therefore not be discussed here.
63
63 So Far: Open Loop Learning All slides so far !
64
64 CLOSED LOOP LEARNING Learning to Act (to produce appropriate behavior) Instrumental (Operant) Conditioning All slides to come now !
65
65 This is an open-loop system! (Figure: Pavlov, 1927; temporal sequence of bell (conditioned input, sensor 2) and food, leading to salivation)
66
66 Closed loop: an adaptable neuron interacting with the environment through sensing and behaving.
67
67 Instrumental/Operant Conditioning
68
68 Overview of the different methods – Closed Loop Learning
69
69 Behaviorism: "All we need to know in order to describe and explain behavior is this: actions followed by good outcomes are likely to recur, and actions followed by bad outcomes are less likely to recur." (Skinner, 1953). Skinner invented the type of experiment called operant conditioning. B. F. Skinner (1904-1990)
70
70 Operant behavior: occurs without an observable external stimulus; operates on the organism's environment; the behavior is instrumental in securing a stimulus, and is thus more representative of everyday learning. (Figure: Skinner box)
71
71 OPERANT CONDITIONING TECHNIQUES POSITIVE REINFORCEMENT = increasing a behavior by administering a reward NEGATIVE REINFORCEMENT = increasing a behavior by removing an aversive stimulus when a behavior occurs PUNISHMENT = decreasing a behavior by administering an aversive stimulus following a behavior OR by removing a positive stimulus EXTINCTION = decreasing a behavior by not rewarding it
72
72 Overview of the different methods. You are here!
73
73 How to assure behavioral & learning convergence? This is achieved by starting with a stable reflex-like action and learning to supersede it by an anticipatory action. Remove before being hit!
74
74 Reflex Only (Compare to an electronic closed loop controller!) This structure assures initial (behavioral) stability (“homeostasis”) Think of a Thermostat !
75
75 Robot Application. Early: "Vision"; late: "Bump". (Figure: input x)
76
76 Robot Application Initially built-in behavior: Retraction reaction whenever an obstacle is touched. Learning Goal: Correlate the vision signals with the touch signals and navigate without collisions.
77
77 Robot Example
78
78 What has happened to the system during learning? The primary reflex reaction has effectively been eliminated and replaced by an anticipatory action.
79
Reinforcement Learning (RL): learning from rewards (and punishments); learning to assess the value of states; learning goal-directed behavior. RL has been developed rather independently in two different fields: 1) dynamic programming and machine learning (Bellman equation), and 2) psychology (classical conditioning) and later neuroscience (the dopamine system in the brain).
80
Back to Classical Conditioning (I. Pavlov). U(C)S = unconditioned stimulus, U(C)R = unconditioned response, CS = conditioned stimulus, CR = conditioned response.
81
Less “classical” but also Conditioning ! (Example from a car advertisement) Learning the association CS → U(C)R Porsche → Good Feeling
82
Overview of the different methods – Reinforcement Learning. You are here!
83
Overview of the different methods – Reinforcement Learning. And later also here!
84
Notation: US = r, R = "Reward". CS = s, u = stimulus = "State"¹. CR = v, V = (strength of the) expected reward = "Value". UR = not required in mathematical formalisms of RL. Weight = ω = the weight used for calculating the value, e.g. v = ω u. Action = a = "Action". Policy = π = "Policy". ¹ Note: the notion of a "state" really only makes sense as soon as there is more than one state.
85
A note on “Value” and “Reward Expectation” If you are at a certain state then you would value this state according to how much reward you can expect when moving on from this state to the end-point of your trial. Hence: Value = Expected Reward ! More accurately: Value = Expected cumulative future discounted reward. (for this, see later!)
86
1)Rescorla-Wagner Rule: Allows for explaining several types of conditioning experiments. 2)TD-rule (TD-algorithm) allows measuring the value of states and allows accumulating rewards. Thereby it generalizes the Resc.-Wagner rule. 3)TD-algorithm can be extended to allow measuring the value of actions and thereby control behavior either by ways of a)Q or SARSA learning or with b)Actor-Critic Architectures Types of Rules
87
Overview of the different methods – Reinforcement Learning. You are here!
88
Rescorla-Wagner Rule.
Pavlovian:  Pre-Train: (none);  Train: u → r;  Result: u → v = max.
Extinction: Pre-Train: u → r;  Train: u → ●;  Result: u → v = 0.
Partial:    Pre-Train: (none);  Train: u → r and u → ● (alternating);  Result: u → v < max.
We define v = ω u, with u = 1 or u = 0 (binary), and ω → ω + μ δ u, with δ = r - v. This learning rule minimizes the average squared error between the actual reward r and the prediction v, hence min ⟨(r - v)²⟩. We realize that δ is the prediction error. The associability between stimulus u and reward r is represented by the learning rate μ.
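A minimal sketch of the scalar Rescorla-Wagner update, reproducing the three cases of the table above; the learning rate and trial counts are arbitrary choices:

```python
import numpy as np

def rescorla_wagner(rewards, mu=0.1):
    """Scalar Rescorla-Wagner rule: v = w * u with u = 1 on every trial,
    w -> w + mu * delta * u, delta = r - v. Returns the final weight."""
    w = 0.0
    for r in rewards:
        delta = r - w          # prediction error (u = 1)
        w += mu * delta
    return w

rng = np.random.default_rng(3)
print("Pavlovian (always rewarded):", rescorla_wagner(np.ones(100)))          # -> ~1 (v = max)
print("Extinction (train, then no reward):",
      rescorla_wagner(np.concatenate([np.ones(100), np.zeros(100)])))        # -> ~0
print("Partial (rewarded in 50% of trials):",
      rescorla_wagner(rng.integers(0, 2, size=200).astype(float)))           # -> ~0.5 (v < max)
```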
89
Pavlovian acquisition and extinction. (Figure: the reward r, the expected reward v, and the prediction error δ plotted over trials)
90
Extinction and partial reinforcement: the stimulus u is paired with r = 1 in 100% of the discrete "epochs" for the Pavlovian case and in 50% of the cases for partial reinforcement. (Figure: learning curves for extinction and partial reinforcement)
91
Partial (50% reward)
92
Rescorla-Wagner Rule, Vector Form for Multiple Stimuli. We define v = w · u, and w → w + μ δ u with δ = r - v, where we minimize ⟨(r - v)²⟩.
Blocking:  Pre-Train: u₁ → r;  Train: u₁ + u₂ → r;  Result: u₁ → v = max, u₂ → v = 0.
For blocking: the association formed during pre-training leads to δ = 0. As ω₂ starts at zero, the expected reward v = ω₁u₁ + ω₂u₂ remains at r. This keeps δ = 0 and the new association with u₂ cannot be learned.
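A minimal sketch of the vector-form rule applied to the blocking protocol (learning rate and trial counts are arbitrary); ω₂ stays near zero because the pre-trained ω₁ already cancels the prediction error:

```python
import numpy as np

def rw_vector(trials, mu=0.1, w=None):
    """Vector Rescorla-Wagner rule: v = w . u, w -> w + mu * delta * u, delta = r - v.
    'trials' is a list of (u, r) pairs; returns the final weight vector."""
    for u, r in trials:
        u = np.asarray(u, dtype=float)
        if w is None:
            w = np.zeros_like(u)
        delta = r - w @ u
        w = w + mu * delta * u
    return w

# Blocking: pre-train u1 alone with reward, then train u1 + u2 with the same reward.
pre_train = [((1.0, 0.0), 1.0)] * 100
train     = [((1.0, 1.0), 1.0)] * 100
w = rw_vector(pre_train)
w = rw_vector(train, w=w)
print("w1, w2 after the blocking protocol:", np.round(w, 3))   # w2 stays near 0: blocked
```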
93
Rescorla-Wagner Rule, Vector Form for Multiple Stimuli.
Inhibitory:  Pre-Train: (none);  Train: u₁ + u₂ → ●, u₁ → r;  Result: u₁ → v = max, u₂ → v < 0.
Inhibitory conditioning: presentation of one stimulus together with the reward, alternating with presentation of a pair of stimuli where the reward is missing. In this case the second stimulus actually predicts the ABSENCE of the reward (negative v). Trials in which the first stimulus is presented together with the reward lead to ω₁ > 0. In trials where both stimuli are present the net prediction will be v = ω₁u₁ + ω₂u₂ = 0. As u₁, u₂ = 1 (or zero) and ω₁ > 0, we get ω₂ < 0 and, consequently, v(u₂) < 0.
94
Rescorla-Wagner Rule, Vector Form for Multiple Stimuli.
Overshadowing:  Pre-Train: (none);  Train: u₁ + u₂ → r;  Result: u₁ → v < max, u₂ → v < max.
Overshadowing: always presenting two stimuli together with the reward will lead to a "sharing" of the reward prediction between them. We get v = ω₁u₁ + ω₂u₂ = r. Using different learning rates μ₁, μ₂ will lead to differently strong growth of ω₁, ω₂ and represents the often observed different saliency of the two stimuli.
95
Rescorla-Wagner Rule, Vector Form for Multiple Stimuli.
Secondary:  Pre-Train: u₁ → r;  Train: u₂ → u₁;  Result: u₂ → v = max.
Secondary conditioning reflects the "replacement" of one stimulus by a new one for the prediction of a reward. As we have seen, the Rescorla-Wagner rule is very simple but still able to represent many of the basic findings of diverse conditioning experiments. Secondary conditioning, however, CANNOT be captured.
96
Predicting Future Reward: The Rescorla-Wagner rule cannot deal with the sequentiality of stimuli (required to deal with secondary conditioning). As a consequence it treats this case similarly to inhibitory conditioning, leading to a negative ω₂. Animals, however, can predict such sequences to some degree and form the correct associations. For this we need algorithms that keep track of time. Here we do this by means of states that are subsequently visited and evaluated.
97
Prediction and Control. The goal of RL is two-fold: 1) to predict the value of states (exploring the state space following a policy), the prediction problem; 2) to change the policy towards finding the optimal policy, the control problem. Terminology (again): state, action, reward, value, policy.
98
Markov Decision Problems (MDPs): if the future of the system always depends only on the current state and action, then the system is said to be "Markovian". (Figure: states, actions, rewards)
99
What does an RL-agent do ? An RL-agent explores the state space trying to accumulate as much reward as possible. It follows a behavioral policy performing actions (which usually will lead the agent from one state to the next). For the Prediction Problem: It updates the value of each given state by assessing how much future (!) reward can be obtained when moving onwards from this state (State Space). It does not change the policy, rather it evaluates it. (Policy Evaluation).
100
For the Control Problem: It updates the value of each given action at a given state by assessing how much future reward can be obtained when performing this action at that state and all following actions at the following states moving onwards (the state-action space, which is larger than the state space). Guess: will we have to evaluate ALL states and actions onwards?
101
Exploration – Exploitation Dilemma: The agent wants to get as much cumulative reward (also often called return) as possible. For this it should always perform the most rewarding action “exploiting” its (learned) knowledge of the state space. This way it might however miss an action which leads (a bit further on) to a much more rewarding path. Hence the agent must also “explore” into unknown parts of the state space. The agent must, thus, balance its policy to include exploitation and exploration. What does an RL-agent do ? Policies 1)Greedy Policy: The agent always exploits and selects the most rewarding action. This is sub-optimal as the agent never finds better new paths.
102
Policies (continued). 2) ε-greedy policy: with a small probability ε the agent will choose a non-optimal action. *All non-optimal actions are chosen with equal probability.* This can take very long as it is not known how big ε should be. One can also "anneal" the system by gradually lowering ε to become more and more greedy. 3) Softmax policy: ε-greedy can be problematic because of (*). Softmax ranks the actions according to their values and chooses roughly following the ranking, using for example P(a) = exp(Q_a/T) / Σ_b exp(Q_b/T), where Q_a is the value of the currently evaluated action a and T is a temperature parameter. For large T all actions have approximately equal probability of being selected.
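A minimal sketch of both action-selection schemes; the Q-values, ε, and temperatures below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the currently best-valued action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_policy(q_values, temperature=1.0):
    """Gibbs/Boltzmann action selection: P(a) proportional to exp(Q_a / T)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                      # numerical stability
    p = np.exp(prefs)
    p /= p.sum()
    return int(rng.choice(len(q_values), p=p))

q = [1.0, 2.0, 0.5]
print("epsilon-greedy choice:", epsilon_greedy(q))
print("softmax choice (T=1):", softmax_policy(q))
print("softmax choice (T=100, nearly uniform):", softmax_policy(q, temperature=100.0))
```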
103
Overview of the different methods – Reinforcement Learning. You are here!
104
Back to the question: To get the value of a given state, will we have to evaluate ALL states and actions onwards? There is no unique answer to this! Different methods exist which assign the value of a state by using differently many (weighted) values of subsequent states. We will discuss a few but concentrate on the most commonly used TD-algorithm(s). Temporal Difference (TD) Learning Towards TD-learning – Pictorial View In the following slides we will treat “Policy evaluation”: We define some given policy and want to evaluate the state space. We are at the moment still not interested in evaluating actions or in improving policies.
105
Formalising RL: Policy Evaluation with the goal to find the optimal value function of the state space. We consider a sequence s_t, r_{t+1}, s_{t+1}, r_{t+2}, ..., r_T, s_T. Note, rewards occur downstream (in the future) from a visited state; thus, r_{t+1} is the next future reward which can be reached starting from state s_t. The complete return R_t to be expected in the future from state s_t is thus given by R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ..., where γ ≤ 1 is a discount factor. This accounts for the fact that rewards in the far future should be valued less. Reinforcement learning assumes that the value of a state V(s) is directly equivalent to the expected return E at this state, where π denotes the (here unspecified) action policy to be followed: V(s) = E_π[R_t | s_t = s]. Thus, the value of state s_t can be iteratively updated with: V(s_t) ← V(s_t) + α [R_t - V(s_t)].
106
We use α as a step-size parameter, which is not of great importance here, though, and can be held constant. Note, if V(s_t) correctly predicts the expected complete return R_t, the update will be zero and we have found the final value. This method is called the constant-α Monte Carlo update. It requires waiting until a sequence has reached its terminal state (see some slides before!) before the update can commence. For long sequences this may be problematic. Thus, one should try to use an incremental procedure instead. We define a different update rule with: V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) - V(s_t)]. The elegant trick is to assume that, if the process converges, the value of the next state V(s_{t+1}) should be an accurate estimate of the expected return downstream of this state (i.e., downstream of s_{t+1}). Thus, we would hope that the following holds: R_t ≈ r_{t+1} + γ V(s_{t+1}). Indeed, proofs exist that under certain boundary conditions this procedure, known as TD(0), converges to the optimal value function for all states. This is why it is called TD (temporal difference) learning.
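A minimal sketch of TD(0) policy evaluation on a toy 5-state chain with a single terminal reward (the chain itself, the step size, and the discount factor are arbitrary choices). The estimates converge to the analytic values γ^(remaining steps - 1):

```python
import numpy as np

n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states + 1)                 # last entry is the terminal state (value 0)

for episode in range(500):
    s = 0
    while s < n_states:                    # always move right until the terminal state
        s_next = s + 1
        r = 1.0 if s_next == n_states else 0.0
        # TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print("estimated values:", np.round(V[:n_states], 3))
print("analytic values: ", np.round([gamma**(n_states - 1 - s) for s in range(n_states)], 3))
```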
107
Reinforcement Learning – Relations to Brain Function I You are here !
108
How to implement TD in a Neuronal Way. We had defined (first lecture!) the stimulus trace u₁; now we have the TD rule, and the two can be related. (Figure: trace u₁ together with the corresponding equations)
109
How to implement TD in a Neuronal Way. Note: the term v(t+1) - v(t) is acausal (it refers to the future!). Make it "causal" by using delays: serial-compound representations x₁, ..., x_n for defining an eligibility trace.
110
Reinforcement Learning – Relations to Brain Function II You are here !
111
TD-learning & Brain Function: DA responses in the basal ganglia, in the pars compacta of the substantia nigra and the medially adjoining ventral tegmental area (VTA). This neuron is supposed to represent the δ-error of TD-learning, which has moved forward as expected. Omission of the reward leads to inhibition, as also predicted by the TD rule.
112
TD-learning & Brain Function: This neuron is supposed to represent the reward-expectation signal v. It has extended forward (almost) to the CS (here called Tr), as expected from the TD rule. Such neurons are found in the striatum, orbitofrontal cortex and amygdala. This is even better visible in the population response of 68 striatal neurons.
113
Reinforcement Learning – The Control Problem. So far we have concentrated on evaluating an unchanging policy. Now comes the question of how to actually improve a policy, trying to find the optimal policy. We will discuss: 1) Actor-Critic architectures; but not: 2) SARSA learning, 3) Q-learning. Abbreviation for the policy: π.
114
Reinforcement Learning – Control Problem I You are here !
115
Control Loops A basic feedback–loop controller (Reflex) as in the slide before.
116
Control Loops. An Actor-Critic architecture: the Critic produces evaluative, reinforcement feedback for the Actor by observing the consequences of its actions. The Critic takes the form of a TD-error, which gives an indication of whether things have gone better or worse than expected with the preceding action. Thus, this TD-error can be used to evaluate the preceding action: if the error is positive, the tendency to select this action should be strengthened; otherwise, lessened.
117
Example of an Actor-Critic Procedure. Action selection here follows the Gibbs softmax method: π(s, a) = exp(p(s, a)) / Σ_b exp(p(s, b)), where the p(s, a) are the values of the modifiable (by the Critic!) policy parameters of the Actor, indicating the tendency to select action a when being in state s. We can now modify p for a given state-action pair at time t with: p(s_t, a_t) → p(s_t, a_t) + β δ_t, where δ_t is the δ-error of the TD-Critic.
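A minimal sketch of such an Actor-Critic procedure on a toy 1-D chain task (not the robot task from the earlier slides; all parameters are arbitrary choices), with a TD(0) Critic and a Gibbs softmax Actor:

```python
import numpy as np

rng = np.random.default_rng(5)
n_states = 6                       # state n_states-1 is terminal, reward 1 on reaching it
alpha, beta, gamma = 0.1, 0.1, 0.95
V = np.zeros(n_states)             # Critic: state values
p = np.zeros((n_states, 2))        # Actor: preferences for actions 0=left, 1=right

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

for episode in range(2000):
    s = 0
    while s < n_states - 1:
        a = rng.choice(2, p=softmax(p[s]))           # Gibbs softmax action selection
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        v_next = 0.0 if s_next == n_states - 1 else V[s_next]
        delta = r + gamma * v_next - V[s]            # TD error from the Critic
        V[s] += alpha * delta                        # Critic update
        p[s, a] += beta * delta                      # Actor update: reinforce if delta > 0
        s = s_next

print("state values:", np.round(V[:-1], 2))
print("P(right) per state:", np.round([softmax(p[s])[1] for s in range(n_states - 1)], 2))
```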
118
Reinforcement Learning – Control I & Brain Function III You are here !
119
Actor-Critics and the Basal Ganglia VP=ventral pallidum, SNr=substantia nigra pars reticulata, SNc=substantia nigra pars compacta, GPi=globus pallidus pars interna, GPe=globus pallidus pars externa, VTA=ventral tegmental area, RRA=retrorubral area, STN=subthalamic nucleus. The basal ganglia are a brain structure involved in motor control. It has been suggested that they learn by ways of an Actor-Critic mechanism.
120
Actor-Critics and the Basal Ganglia: The Critic. So-called striosomal modules of the striatum S fulfill the functions of the adaptive Critic. The prediction-error (δ) characteristics of the DA neurons of the Critic are generated by: 1) equating the reward r with excitatory input from the lateral hypothalamus (LH); 2) equating the term v(t) with indirect excitation at the DA neurons, which is initiated from striatal striosomes and channelled through the subthalamic nucleus onto the DA neurons; 3) equating the term v(t-1) with direct, long-lasting inhibition from striatal striosomes onto the DA neurons. There are many problems with this simplistic view, though: timing, mismatch to anatomy, etc. (Figure abbreviations: cortex = C, striatum = S, STN = subthalamic nucleus, DA = dopamine system, r = reward, LH = lateral hypothalamus)
121
121 Literature (all of this is very mathematical!) General theoretical neuroscience: "Theoretical Neuroscience", P. Dayan and L. Abbott, MIT Press (there used to be a version of this on the internet); "Spiking Neuron Models", W. Gerstner & W. M. Kistler, Cambridge University Press (there is a version on the internet). Neural coding issues: "Spikes", F. Rieke, D. Warland, R. de Ruyter van Steveninck, W. Bialek, MIT Press. Artificial neural networks: "Konnektionismus", G. Dorffner, B. G. Teubner Verlag, Stuttgart; "Fundamentals of Artificial Neural Networks", M. H. Hassoun, MIT Press. Hodgkin-Huxley model: see above, "Spiking Neuron Models", W. Gerstner & W. M. Kistler, Cambridge University Press. Learning and plasticity: see above, "Spiking Neuron Models", W. Gerstner & W. M. Kistler, Cambridge University Press. Calculating with neurons: has been compiled from many different sources. Maps: has been compiled from many different sources.
122
122