Studies on Goal-Directed Feature Learning
Cornelius Weber, FIAS
Presented at the workshop “Machine Learning Approaches to Representational Learning and Recognition in Vision”, Frankfurt Institute for Advanced Studies (FIAS), November 27-28, 2008
For taking action, we need only the relevant features. [Figure: schematic input with features x, y, z]
Models’ background & overview:
- unsupervised feature learning models are enslaved by bottom-up input
- reward-modulated activity leads to input selection: Nakahara, Neur Comp 14 (2002)
- reward-modulated STDP: Izhikevich, Cereb Cortex 17 (2007); Florian, Neur Comp 19/6 (2007); Farries & Fairhall, J Neurophysiol 98 (2007); ...
- RL models learn a partitioning of the input space, e.g. McCallum, PhD thesis, Rochester, NY, USA (1996)
- reward-modulated Hebb: Triesch, Neur Comp 19 (2007); Roelfsema & van Ooyen, Neur Comp 17 (2005); Franz & Triesch, ICDL (2007) (model 3 presented here extends this to delayed reward)
- feature-pruning models learn all features but forget the irrelevant ones (models 1 & 2 presented here)
Setup for model 1: purely sensory data, in which one feature type is linked to reward; the action is not controlled by the network. [Figure: sensory input, reward, action]
Model 1: obtaining the relevant features
1) build a feature-detecting model
2) learn associations between features
3) register each feature’s average reward
4) spread value along the associative connections
5) check whether actions increase/decrease value
6) remove features where the action doesn’t matter
(a toy sketch of steps 2-6 follows after the references below)
[Figure: irrelevant vs. relevant features]
Földiák, Biol Cybern 64 (1990) → homogeneous activity distribution. [Figure: features, thresholds, lateral weights (decorrelation), selected features, associative weights, action effect]
Weber & Triesch, Proc ICANN (2008); Witkowski, Adapt Behav 15(1) (2007); Toussaint, Proc NIPS (2003); Weber, Proc ICANN (2001) → relevant features identified.
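A toy numerical sketch of steps 2-6 (not the paper’s implementation): the feature activity, the “acted” flag, the reward rule, and the 0.5 pruning threshold are all synthetic assumptions chosen so that one feature is genuinely action-relevant.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_features = 5000, 20

# Synthetic stand-ins for what a trained feature layer would provide:
h = (rng.random((n_steps, n_features)) < 0.2).astype(float)  # feature activity
acted = rng.random(n_steps) < 0.5                            # was an action taken?
r = h[:, 0] * acted            # toy rule: reward follows feature 0, but only if acted

# 3) register each feature's average reward
v = (h * r[:, None]).sum(0) / (h.sum(0) + 1e-9)

# 2) + 4) associative weights from co-activation; spread value along them
A = (h.T @ h) / n_steps
A /= A.sum(1, keepdims=True)
for _ in range(10):
    v = 0.5 * v + 0.5 * (A @ v)          # value diffuses to associated features

# 5) compare each feature's value with and without the action
v_act = (h[acted] * r[acted][:, None]).sum(0) / (h[acted].sum(0) + 1e-9)
v_no = (h[~acted] * r[~acted][:, None]).sum(0) / (h[~acted].sum(0) + 1e-9)

# 6) keep only features whose value depends on the action
effect = np.abs(v_act - v_no)
relevant = effect > 0.5 * effect.max()   # assumed threshold
print("relevant features:", np.flatnonzero(relevant))
```

With these assumptions only feature 0 (and strongly associated features, if any) passes the threshold; the rest are pruned.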
Setup for model 2: motor-sensory data (again, one feature type is linked to reward); the network selects the action (to get reward). [Figure: sensory input, reward; irrelevant vs. relevant subspace]
Model 2: removing the irrelevant inputs
1) initialize a feature-detecting model (but continue learning)
2) perform actor-critic RL, taking the features’ outputs as the state representation
   - works despite irrelevant features
   - challenge: relevant features will occur at different frequencies
   - nevertheless, features may remain stable
3) observe the critic: after long training it puts negative value on irrelevant features
4) modulate (multiply) feature learning by the critic’s value (see the sketch after the reference below)
[Figure: feature frequency vs. value]
Lücke & Bouecke, Proc ICANN, 31-7 (2005). [Figure: features, critic value, action weights] → relevant subspace discovered.
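A minimal actor-critic sketch of this scheme, with the critic’s value multiplying the feature learning rate. The toy environment env_step, the winner-take-all feature code, and all constants are illustrative assumptions, not the model’s actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_feat, n_act = 16, 8, 4
W = rng.normal(0.0, 0.1, (n_feat, n_in))   # feature weights (learning continues)
v = np.zeros(n_feat)                        # critic weights
P = np.zeros((n_act, n_feat))               # actor weights
alpha, gamma = 0.05, 0.9

def env_step(a):
    """Toy environment: only input dimensions 0-3 are task-relevant;
    action 0 on a 'relevant' input yields reward."""
    x = rng.random(n_in) * 0.2
    relevant = rng.random() < 0.5
    if relevant:
        x[:4] += 1.0
    return x, (1.0 if (relevant and a == 0) else 0.0)

def features(x):
    y = np.zeros(n_feat)
    y[np.argmax(W @ x)] = 1.0               # winner-take-all feature code
    return y

x, _ = env_step(0)
y = features(x)
for t in range(20000):
    q = P @ y
    p = np.exp(q - q.max()); p /= p.sum()   # softmax actor on the feature code
    a = rng.choice(n_act, p=p)
    x2, r = env_step(a)
    y2 = features(x2)
    delta = r + gamma * (v @ y2) - v @ y    # TD error
    v += alpha * delta * y                  # critic update
    P[a] += alpha * delta * y               # actor update
    k = np.argmax(y)
    W[k] += alpha * v[k] * (x - W[k])       # 4) value-gated feature learning:
                                            # negative value un-learns irrelevant features
    x, y = x2, y2
```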
Model 3: learning only the relevant inputs
1) top level: reinforcement learning model (SARSA)
2) lower level: feature learning model (SOM / K-means)
3) modulate learning by δ, in both layers (δ defined below)
[Figure: RL weights, feature weights, input, action]
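For reference (notation assumed here; the slides only name SARSA and δ), the temporal-difference error that gates learning in both layers would be the standard SARSA one:

\[
\delta_t = r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t),
\qquad
Q(s,a) = \sum_k w^{\mathrm{RL}}_{ak}\, h_k(s),
\]

where \(h_k(s)\) is the SOM-like activation of feature unit \(k\) and \(w^{\mathrm{RL}}\) are the RL action weights.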
Model 3: SARSA with SOM-like activation and update (sketched below)
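A runnable toy sketch of the two-layer scheme: SARSA on top of a SOM-like feature layer, with δ multiplying the weight updates in both layers. The 1-D world (obs, the goal position) and all constants are assumptions for illustration; the slides use 12x12 bar images:

```python
import numpy as np

rng = np.random.default_rng(2)
n_pos, n_units, n_act = 12, 12, 2        # toy 1-D world, actions: left/right
Wf = rng.random((n_units, n_pos))         # lower level: SOM / K-means feature weights
Wa = np.zeros((n_act, n_units))           # top level: SARSA action weights
alpha, gamma, eps = 0.1, 0.9, 0.1

def obs(pos):
    x = np.zeros(n_pos); x[pos] = 1.0     # toy input in place of bar images
    return x

def activate(x):
    d = ((Wf - x) ** 2).sum(1)            # distance of each unit to the input
    h = np.exp(-(d - d.min()))            # SOM-like activation, peaked at best unit
    return h / h.sum()

def policy(h):
    if rng.random() < eps:                # epsilon-greedy on Q(s,a) = Wa @ h
        return int(rng.integers(n_act))
    return int(np.argmax(Wa @ h))

for episode in range(500):
    pos = int(rng.integers(n_pos))
    x = obs(pos); h = activate(x); a = policy(h)
    for t in range(50):
        pos2 = min(max(pos + (1 if a == 1 else -1), 0), n_pos - 1)
        r, done = (1.0, True) if pos2 == n_pos - 1 else (0.0, False)
        x2 = obs(pos2); h2 = activate(x2); a2 = policy(h2)
        delta = r + (0.0 if done else gamma * (Wa[a2] @ h2)) - Wa[a] @ h
        Wa[a] += alpha * delta * h                    # top layer: SARSA update
        Wf += alpha * delta * h[:, None] * (x - Wf)   # lower layer: delta-gated SOM step
        pos, x, h, a = pos2, x2, h2, a2
        if done:
            break
```

The feature update multiplies the SOM step by δ with its sign, as the slides state (“modulate (multiply) learning by δ”); gating by |δ| would be an alternative reading.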
[Figure: RL action weights and feature weights; the feature weights cover the relevant subspace]
[Figure: learning the ‘long bars’ data; RL action weights, feature weights, input, reward; 2 actions (not shown)]
[Figure: learning the ‘short bars’ data; bars controlled by the actions ‘up’, ‘down’, ‘left’, ‘right’; RL action weights, feature weights, input, reward, action]
Result: short bars in 12x12 input; average number of steps to goal: 11.
Biological interpretation [Figure: cortex → striatum → GPi (output of the basal ganglia); feature/subspace detection → action selection]
- no direct feedback from striatum to cortex
- convergent mapping → little receptive field overlap, consistent with subspace discovery
Discussion
- models 1 and 2 learn all features and identify the relevant ones:
  - model 1 requires a homogeneous feature distribution
  - model 2 can do only subspace detection (no real feature detection)
- model 3 is very simple: SARSA on a SOM with δ-feedback; it learns only the relevant subspace or features in the first place
- the models link unsupervised learning and reinforcement learning
Sponsors: Bernstein Focus Neurotechnology; EU project “IM-CLeVeR”, call FP7-ICT; Frankfurt Institute for Advanced Studies (FIAS)
Relevant features change during learning: in a T-maze decision task (rat), units in the basal ganglia are active at the junction during early task acquisition, but not at a later stage. Jog et al, Science 286 (1999). [Figure: unit activity, early vs. late learning]
Evidence for reward-/action-modulated learning in the visual system:
- Shuler & Bear, “Reward timing in the primary visual cortex”, Science 311 (2006)
- Schoups et al, “Practising orientation identification improves orientation coding in V1 neurons”, Nature 412 (2001)