Dynamics of Reward Bias Effects in Perceptual Decision Making
Jay McClelland & Juan Gao
Building on: Newsome and Rorie; Holmes and Feng; Usher and McClelland
Our Questions
- Can we trace the effect of reward bias on decision making over time?
- Can we determine what the optimal policy would be, and what constraints there are on this policy?
- Can we determine how well participants do at achieving optimality?
- Can we uncover the processing mechanisms that lead to the observed patterns of behavior?
Overview
- Experiment
- Results
- Optimality analysis
- Abstract dynamical model
- Mechanistic dynamical model
Human Experiment Examining Reward Bias Effect at Different Time Points after Target Onset
- Stimuli are rectangles shifted 1, 3, or 5 pixels left or right of fixation.
- The reward cue occurs 750 msec before the stimulus; a small arrowhead is visible for 250 msec.
- Only biased reward conditions (2 vs 1 and 1 vs 2) are considered.
- The response signal occurs at one of these times after stimulus onset: 0, 75, 150, 225, 300, 450, 600, 900, 1200, or 2000 msec.
- The participant receives the reward (one or two points) if the response occurs within 250 msec of the response signal and is correct.
- Participants were run for 15-25 sessions to provide stable data. Data shown are from later sessions, in which the biasing effect of reward appeared to be fairly stable.
A participant with very little reward bias
- Top panel shows the probability of the response giving the larger reward, as a function of actual response time, for combinations of:
  - Stimulus shift (1, 3, or 5 pixels)
  - Reward-stimulus compatibility
- Lower panel shows the data transformed to z scores, corresponding to the theoretical quantity
  z(t) = [mean(x1(t) - x2(t)) + bias(t)] / sd(x1(t) - x2(t)),
  where x1 represents the state of the accumulator associated with the larger reward, x2 the same for the smaller reward, and the subject is assumed to choose the larger reward if x1(t) - x2(t) + bias(t) > 0.
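The z transform in the lower panel can be sketched as follows (a minimal illustration using the inverse normal CDF; the probability values below are made up, not the participant's data):

```python
from statistics import NormalDist

def z_transform(p_large):
    """Convert P(choose larger-reward response) to z scores via the
    inverse normal CDF.  Under the equal-variance Gaussian assumption
    this estimates [mean(x1 - x2) + bias] / sd(x1 - x2)."""
    nd = NormalDist()
    eps = 1e-6  # clip to avoid infinities at p = 0 or 1
    return [nd.inv_cdf(min(max(p, eps), 1 - eps)) for p in p_large]

# hypothetical choice probabilities at increasing response-signal delays
probs = [0.90, 0.80, 0.72, 0.68]
print(z_transform(probs))
```

A probability of 0.5 maps to z = 0 (no net evidence plus bias), and probabilities above 0.5 map to positive z, so the transformed curves can be read directly as the signed, noise-scaled state of the decision variable.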
Participants Showing Reward Bias
Abstract optimality analysis
Assumptions
- At a given time, the decision variable follows one of two distributions, with means +mu and -mu and the same SD sigma.
- Choice rule: the response depends on whether x falls above or below a criterion X_c.
- For three difficulty levels: same SD sigma, means mu_i (i = 1, 2, 3), same X_c.
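Under these assumptions the reward-optimal criterion has a closed form (a sketch, not the authors' fitted model; the 2-vs-1 reward values come from the experiment, the mu and sigma values are illustrative):

```python
import math

def optimal_criterion(mu, sigma, r_large=2.0, r_small=1.0):
    """Optimal criterion X_c for two equal-variance Gaussians with
    means +mu (larger-reward side) and -mu, equal priors.
    Choose the larger-reward response when x > X_c.
    From the likelihood-ratio rule, choose + when
        exp(2 * mu * x / sigma**2) > r_small / r_large,
    so X_c = (sigma**2 / (2 * mu)) * log(r_small / r_large)."""
    return (sigma**2 / (2 * mu)) * math.log(r_small / r_large)

# example: mu = 0.5, sigma = 1 -> X_c is negative, i.e. the criterion
# shifts toward the smaller-reward side, biasing choices toward the
# larger reward
print(optimal_criterion(0.5, 1.0))
```

With equal rewards the optimal criterion is 0; as sensitivity (mu/sigma) grows, the optimal shift shrinks toward 0, which is why the optimal bias is largest when the stimulus is weak or processing has just begun.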
Subject's sensitivity (a definition from the theory of signal detectability)
[Panels: only one difficulty level; three difficulty levels; sensitivity when response-signal delay varies]
For each subject, sensitivity as a function of response-signal delay is fit with a function (expression not recovered).
Subject Sensitivity
Real “bias” vs. optimal “bias”
Dynamical analysis
- Based on a one-dimensional leaky integrator model.
- Initial condition: x = 0.
- Choose left if x > 0 when the response signal is detected; otherwise choose right.
- Accuracy approximates an exponential approach to asymptote because of leakage.
How is the reward bias implemented?
1. A time-varying offset that optimizes reward?
2. An offset in the initial conditions?
3. An additional term in the input to the decision variable?
4. A fixed offset in the value of the decision variable?
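The leaky integrator itself can be simulated directly (a minimal Euler–Maruyama sketch without any reward bias term; lambda, drift, and noise values are illustrative, not fitted parameters):

```python
import math
import random

def simulate_choice_prob(lam=-4.0, drift=1.0, noise_sd=1.0,
                         t_signal=0.5, dt=0.001, n_trials=1000, seed=0):
    """Fraction of trials on which x > 0 at the response signal, for
        dx = (lam * x + drift) dt + noise_sd * sqrt(dt) * N(0, 1)
    starting from x = 0.  With lam < 0 (leakage), the mean and variance
    of x, and hence accuracy, approach their asymptotes roughly
    exponentially with time constant 1/|lam|."""
    rng = random.Random(seed)
    n_steps = int(t_signal / dt)
    correct = 0
    for _ in range(n_trials):
        x = 0.0
        for _ in range(n_steps):
            x += (lam * x + drift) * dt + noise_sd * math.sqrt(dt) * rng.gauss(0, 1)
        if x > 0:
            correct += 1
    return correct / n_trials

print(simulate_choice_prob())            # stronger stimulus
print(simulate_choice_prob(drift=0.2))   # weaker stimulus, lower accuracy
```

Because leakage bounds the variance as well as the mean, accuracy saturates rather than growing indefinitely, which is the "exponential approach to asymptote" referred to above.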
1. Time-varying term that optimizes rewards (no free parameter for reward bias)
[Figure: P(choice toward larger reward) vs. time (s), for RSC 1 and RSC 0 at difficulty levels 5, 3, and 1]
Notes:
- Equivalent to a time-varying criterion = -b(t).
- There is a dip at a predictable time (expression not recovered).
- Prediction and test: higher C level, earlier dip.
- For multiple C levels, there are no analytical expressions.
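The shape of the optimal time-varying bias can be sketched under a common assumption (the exponential sensitivity growth d'(t) = d'_max * (1 - exp(-t/tau)) and its parameter values are assumptions for illustration, not the fitted values): with equal priors and a 2:1 reward ratio, the reward-optimal criterion in d' units is -ln(2)/d'(t), so the bias is effectively unbounded at stimulus onset (always choose the larger reward) and decays as sensitivity grows.

```python
import math

def optimal_bias(t, dprime_max=2.5, tau=0.3, reward_ratio=2.0):
    """Reward-optimal criterion shift (in d' units) at time t, assuming
    sensitivity grows as d'(t) = dprime_max * (1 - exp(-t / tau)).
    Negative values shift the criterion toward the smaller-reward side,
    favoring the larger-reward response."""
    dprime = dprime_max * (1.0 - math.exp(-t / tau))
    if dprime <= 0:
        return -math.inf  # before any evidence, just take the larger reward
    return -math.log(reward_ratio) / dprime

for t in (0.0, 0.15, 0.3, 0.6, 1.2):
    print(t, optimal_bias(t))
```

This captures the qualitative pattern in the data: complete reward bias at the earliest response-signal delays, shrinking toward a small residual bias as the stimulus is processed.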
2. Offset in initial conditions
Notes:
- The effect of the bias decays away for lambda < 0.
- For a single C level, there is a dip at a predictable time (expression not recovered).
- Prediction and test: higher C level, earlier dip.
3. Reward as a term in the input
- The reward signal arrives tau seconds before the stimulus.
- For t < 0: input = b; noise sd = s.
- For t > 0: input = b + aC; noise continues as before.
Notes:
- The effect of the bias persists, but the bias is sub-optimal initially, and there is no dip.
- Theoretically, the dip would happen at (1/lambda) * log((aC - bk) / (aC k^2 - b k^2)), where k = exp(lambda * tau); the t calculated is negative, so no dip occurs. (An annotation notes that a factor of 2 was omitted in the slide's formula.)
4. Reward as a constant offset in the decision variable
Notes:
- Equivalent to setting the criterion at -m0.
- The effect persists for lambda < 0.
- For a single C level, there is a dip at a predictable time (expression not recovered).
- Prediction and test: higher C level, earlier dip.
5. Reward as a term in the input, creating variability at stimulus onset
- The reward signal arrives tau seconds before the stimulus.
- For t < 0: input = b; noise sd = s_b.
- For t > 0: input = b + aC; noise sd = s_b + s.
Notes:
- The effect of the bias persists. If s_b = 0, there is no dip.
- Prediction and test: given small s_b, a longer reward period produces a later and shallower dip.
- Theoretically, the dip would happen at (1/lambda) * log((aC - bk) / (aC k^2 - b k^2)), where k = exp(lambda * tau); the t calculated is negative. (An annotation notes that a factor of 2 was omitted in the slide's formula.)
Leaky Competing Integrator Model
Inputs for:
- reward
- stimulus
- response signal
High threshold for