Slide 1: Does RL Occur Naturally?
C. R. Gallistel, Rutgers Center for Cognitive Science
NIPS Workshops, 12/10/05
Slide 2: Turing’s Vision (‘47–’48)
- “It would be quite possible to have the machine try out behaviors and accept or reject them…”
- “What we want is a machine that can learn from experience. The possibility of letting the machine alter its own instructions provides the mechanism for this…”
- “It might be possible to carry through the organizing [of a learning machine] with only two interfering inputs, one for reward (R) or pleasure and the other for pain or punishment (P). It is intended that pain stimuli occur when the machine’s behavior is wrong, pleasure stimuli when it is particularly right.”
Slide 3: A Different Vision
- Policy (what to do given a state of the world) is pre-specified and immutable
- Learning consists in determining the state of the world; it’s all model estimation
- Appropriate sampling behavior is itself prespecified
Slide 4: The Deep Reasons
- Wolpert & Macready’s “No Free Lunch” theorems
- Chomsky’s “Poverty of the Stimulus” argument
- Bottom line: reinforcement learning takes too long
  - Because there is not enough information in the R & P signals
  - Because learning in the absence of a highly structured hypothesis space is a practical impossibility (we don’t live long enough)
Slide 5: Learning by Integrating
- The ant knows where it is
- This knowledge is acquired (learned)
- It is acquired by path integration
(Harkness & Maroudas, 1985)
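Path integration can be sketched as running vector summation: the forager accumulates each outbound leg into a net displacement, from which the homeward course falls out. This is a minimal illustration; the coordinate representation and function names are assumptions, not the ant’s actual mechanism.

```python
import math

def path_integrate(steps):
    """Sum (heading_deg, distance) legs into a net displacement, then
    return (home_bearing_deg, home_distance): the straight-line course
    back to the starting point. Headings are compass degrees (0 = north)."""
    east = north = 0.0
    for heading, dist in steps:
        rad = math.radians(heading)
        east += dist * math.sin(rad)
        north += dist * math.cos(rad)
    home_dist = math.hypot(east, north)
    # Bearing of the vector pointing back to the origin
    home_bearing = math.degrees(math.atan2(-east, -north)) % 360.0
    return home_bearing, home_dist

# An outbound path: 10 m due north, then 10 m due east
bearing, dist = path_integrate([(0.0, 10.0), (90.0, 10.0)])
# Home lies to the southwest (bearing 225 deg), ~14.1 m away
```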
Slide 6: Building a Map
- The ant remembers where the food was (records its coordinates)
- Bees & ants make a map by the GPS principle (record location coordinates—and views)
- They do not discover by trial and error that this is a good thing to do
- As in the GPS, the computational machinery to determine a course from an arbitrary location to an arbitrary location is built in
- No RL here
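The built-in course computation of the GPS principle amounts to simple geometry over stored coordinates: given any two recorded locations, derive the bearing and range between them. A minimal sketch, with (east, north) coordinates and names chosen for illustration:

```python
import math

def course(frm, to):
    """Course (bearing_deg, range) from one stored map location to
    another. Locations are (east, north) coordinates, as if recorded
    by a GPS receiver."""
    d_east = to[0] - frm[0]
    d_north = to[1] - frm[1]
    bearing = math.degrees(math.atan2(d_east, d_north)) % 360.0
    return bearing, math.hypot(d_east, d_north)

# From the nest at (0, 0) to a food site recorded at (30, 40):
bearing, rng = course((0.0, 0.0), (30.0, 40.0))
# Bearing ~36.9 deg (north-northeast), range 50 m
```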
Slide 7: Ranging Behavior
- When leaving a new food source or a new nest (hive), bees & wasps fly backwards in an ever-increasing zigzag
- Determining visual feature distances by parallax
- Innately specified sampling (model-building) behavior
(Wehner, 1981)
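The geometry behind parallax ranging: a sideways displacement of known length shifts a feature’s visual direction by an angle that shrinks with distance, so distance can be recovered from the shift. A minimal sketch (the small-angle formulation is an illustrative assumption, not a claim about the insects’ computation):

```python
import math

def parallax_distance(baseline, shift_deg):
    """Distance to a feature, given the lateral move `baseline` and the
    angular shift (degrees) of the feature's visual direction across
    that move. D = baseline / tan(shift)."""
    return baseline / math.tan(math.radians(shift_deg))

# A feature that shifts 2 degrees across a 0.5 m sideways leg:
d = parallax_distance(0.5, 2.0)  # ~14.3 m away
```

Nearer features shift more: halving the distance roughly doubles the angular shift, which is what makes the zigzag an informative sampling routine.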
Slide 8: Also in the Locust
- Locust scanning (Sobel, 1990)
- The target was moved during scanning, so as to make the parallax independent of the distance D
- This reproduced the function relating take-off velocity to D
Slide 9: Learning by Parameter Estimation
- Animals (including insects) use the sun as a compass reference
- To do this, they must learn the solar ephemeris: the sun’s compass bearing as a function of the time of day—where it is when
- The solar ephemeris varies with latitude and season
Slide 10: Learning from the Dance
- A returning forager does a dance to tell other foragers the location (range & bearing) of the source
- The compass bearing of the source, β, is specified by specifying its bearing relative to the current solar bearing, σ
- Range is specified by the number of waggles
- Hopeless as an RL problem?
Notation: σ = compass bearing of sun; β = compass bearing of source; β − σ = solar bearing of source
Slide 11: Ephemeris Framework
Slide 12: Deceived Dancing
(Dyer, 1987)
Slide 13: Poverty of the Stimulus (Dyer & Dickinson, 1994)
- Incubator-raised bees were allowed to forage at a station due west of the hive, but only in the late afternoon, when the sun was declining in the west
- On a heavily overcast day, the hive was moved to a new field line with a different compass orientation, and the bees were allowed to forage in the morning (with the feeder “west” of the hive location)
- The experimenter observes the dances of returning foragers to estimate where they believe the sun to be
Slide 14: Bees Believe the Earth Is Round
Slide 15: Implications
- The form of the solar ephemeris equation is built into the nervous system
- Only its parameters are estimated from observation
- This solves the poverty-of-the-stimulus problem: the information about universal properties of the ephemeris is in the priors
- A neural net without this prior information could not generalize as the bees do
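Learning with a built-in ephemeris form can be sketched as fitting the free parameter of a fixed function from sparse observations, then generalizing to times of day never observed. The logistic step form, the single free parameter, and all names here are illustrative assumptions, not the bees’ actual representation:

```python
import math

def ephemeris(t_hours, sunrise_azimuth, noon=12.0, steepness=1.5):
    """Innate template: solar azimuth sweeps ~180 deg from its sunrise
    value, changing fastest around solar noon (a logistic step)."""
    return sunrise_azimuth + 180.0 / (1.0 + math.exp(-(t_hours - noon) / steepness))

def fit_sunrise_azimuth(observations):
    """Estimate the one free parameter from (time, observed_azimuth)
    pairs; it enters linearly, so averaging residuals suffices."""
    return sum(az - ephemeris(t, 0.0) for t, az in observations) / len(observations)

# Afternoon sightings only (true sunrise azimuth: 90 deg, due east)
obs = [(t, ephemeris(t, 90.0)) for t in (15.0, 16.0, 17.0)]
est = fit_sunrise_azimuth(obs)
# The fitted template now predicts the morning azimuth, never observed
morning_prediction = ephemeris(8.0, est)
```

An unconstrained function approximator trained on the same three afternoon points could interpolate among them but would have no basis for the morning prediction; the fixed form is what carries the generalization.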
Slide 16: Language Learning—Same Story?
- An innate universal grammar specifies the structure common to all languages
- Distinctions between languages are due to differences in parameters (e.g., head-final versus head-first)
- Learning a language reduces to learning the (binary?) parameter values
(Mark Baker, 2001, The Atoms of Language)
Slide 17: Natural Learning Curves
- Gallistel et al. (PNAS, 2004) analyzed individual(!) learning curves from standard paradigms in pigeons, rats, rabbits, and mice:
  - Pavlovian (autoshaping in pigeon, rat & mouse)
  - Eyeblink in rabbit
  - Plus-maze in rat
  - Water maze in mouse
- Regardless of paradigm, the typical curve cannot be distinguished from a step function
- The latency and size of the step vary between subjects
- Averaging across these steps produces a gradual learning curve: its gradualness is an averaging artifact
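The averaging artifact is easy to demonstrate: simulate subjects whose individual curves are pure steps with different latencies, then average. Every parameter here (trial counts, latency range) is arbitrary, chosen only for illustration:

```python
import random

def step_curve(n_trials, latency):
    """An individual all-or-none learning curve: 0 until the step, 1 after."""
    return [0.0 if t < latency else 1.0 for t in range(n_trials)]

random.seed(0)
n_trials, n_subjects = 100, 50
# Step latencies vary across subjects
curves = [step_curve(n_trials, random.randint(10, 80)) for _ in range(n_subjects)]
# The group-average curve rises gradually even though no individual does
group_mean = [sum(c[t] for c in curves) / n_subjects for t in range(n_trials)]
```

Each individual curve jumps from 0 to 1 in a single trial, yet the group mean climbs smoothly from 0 toward 1: the familiar gradual curve is a property of the average, not of any learner.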
Slide 18: Matching
- Subjects forage back and forth between locations where food becomes available unpredictably (on random-rate schedules with unlimited holds)
- Subjects match the ratio of the time they invest in the locations (expected stay durations, T_1/T_2) to the ratio of the incomes they have derived from them (I_1/I_2)
- Matching equates returns: R_i = I_i/T_i; I_1/T_1 = I_2/T_2 iff T_1/T_2 = I_1/I_2
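A quick numerical check of the identity on this slide: allocating stay time in proportion to income equalizes the two returns, and any other split does not. The normalization to a unit time budget and the names are illustrative:

```python
def returns(incomes, stay_ratio):
    """Return rates R_i = I_i / T_i for two patches, with a unit time
    budget split so that T_1 / T_2 = stay_ratio."""
    t1 = stay_ratio / (1.0 + stay_ratio)
    t2 = 1.0 - t1
    return incomes[0] / t1, incomes[1] / t2

incomes = (3.0, 1.0)              # patch 1 yields three times the income
r_match = returns(incomes, 3.0)   # matching: T_1/T_2 = I_1/I_2 = 3
r_equal = returns(incomes, 1.0)   # an equal time split, for contrast
```

Under matching the two returns come out identical (here 4.0 and 4.0); with equal time the richer patch returns three times more, so only the matching ratio equates returns.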
Slide 19: RL Models
- Most assume hill-climbing discovery of the policy that equates returns
- The policy is one-dimensional (the ratio of expected stay durations)
- Try out a given policy (stay ratio)
- Determine the direction of the inequality in returns
- Adjust the investment ratio accordingly
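The hill-climbing account sketched on this slide can be written in a few lines: try a stay ratio, compare the two returns, nudge the ratio toward the richer patch. The multiplicative step size, iteration count, and names are illustrative assumptions, not any published model’s settings:

```python
def hill_climb(incomes_for, ratio=1.0, step=1.1, n_iters=200):
    """Adjust the stay ratio T_1/T_2 by comparing returns under the
    current policy. incomes_for(ratio) -> (I_1, I_2)."""
    for _ in range(n_iters):
        i1, i2 = incomes_for(ratio)
        t1 = ratio / (1.0 + ratio)       # share of time in patch 1
        r1, r2 = i1 / t1, i2 / (1.0 - t1)
        ratio = ratio * step if r1 > r2 else ratio / step
    return ratio

# Fixed incomes in a 3:1 ratio: the climber settles near the matching ratio
final = hill_climb(lambda r: (3.0, 1.0))
```

The point of the next slide is that this gradual, trial-by-trial search is exactly what the data do not look like: real adjustment is quick and step-like.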
Slide 20: But… (Gallistel et al., 2001)
- Adjustment of the investment ratio after a step change in the relative rates of reward is quick and step-like
Slide 21: Bayesian Ideal Detector Analysis
Slide 22: Second Example
Slide 23: Incomes, Not Returns
- Evidence of a change in behavior appears as soon as there is evidence of a change in incomes
- And (often) before there is evidence of a change in returns
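Evidence for a change in income can be quantified as the talk’s ideal-detector analysis suggests: compare how well a one-rate model and a two-rate model explain the reward counts. The Poisson model and log-likelihood-ratio score below are a stand-in sketch, not the analysis actually used:

```python
import math

def change_evidence(n_before, n_after, t_before, t_after):
    """Log-likelihood ratio that reward counts reflect two income rates
    (before vs after a putative change point) rather than one."""
    def loglik(n, t):
        lam = max(n / t, 1e-12)          # max-likelihood Poisson rate
        return n * math.log(lam) - lam * t
    one_rate = loglik(n_before + n_after, t_before + t_after)
    two_rates = loglik(n_before, t_before) + loglik(n_after, t_after)
    return two_rates - one_rate

# Income doubled after the change point: positive evidence
strong = change_evidence(10, 40, 100.0, 200.0)
# Income unchanged: the two-rate model gains nothing
none = change_evidence(10, 20, 100.0, 200.0)
```

With such a detector, evidence accumulates in the incomes as soon as rewards start arriving at the new rate, while the returns (income per unit stay time) can remain statistically unchanged, which is the asymmetry the next slide documents.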
Slide 24: Evidence of Absence of Evidence
- Upper panel: odds that the subject’s stay durations had changed, as a function of session time
- Lower panel: odds that the subject’s returns had changed—there was no evidence in the returns!
Slide 25: Implications
- Matching is an innate policy
- It depends only on estimates of incomes
- Anti-aliasing sampling behavior, to detect periodic structure in reward provision, is built into the policy
- Estimates of the incomes to be expected are based on small samples, taken only when a change in income is detected
- Here, too, learning is model updating, not policy-value updating
- Subjects perversely ignore returns (policy values)
Slide 26: Conclusions
- Most (all?) natural learning looks like model estimation
- Efficient model estimation is made possible by:
  - Informative priors (a highly structured, problem-specific hypothesis space)
  - Innately specified efficient sampling routines