Connectionist Time and Dynamic Systems Time in One Architecture? Modeling Word Learning at Two Timescales
Jessica S. Horst, Bob McMurray, Larissa K. Samuelson
Dept. of Psychology, University of Iowa
Two Time Scales in Neural Networks
Connectionist and dynamic systems accounts:
- both stress change over time
- complement each other in timescale
  - Dynamic Systems: online processes
  - Connectionist Networks: long-term learning
Many domains of development require both timescales.
Example: language development requires both sensitivity to the brief, sequential nature of the input and slower developmental processes.
Two Time Scales in Language Acquisition
Word learning is often attributed to fast mapping: a quick link between a novel name and a novel object (e.g., Carey, 1978).
But recent empirical data suggest that fast mapping and word learning may represent two distinct time scales (Horst & Samuelson, April, 2005):
- Fast Mapping: quick process emerging in the moment.
- Word Learning: gradual process over the course of development.
We capture both timescales in a recurrent network.
The Architecture (McMurray & Spivey, 2000)
- Activation feeds from input layers to decision layers.
- Decision units compete via inhibition.
- Activation feeds back to input layers.
- Cycle continues until the system settles.
- Unsupervised Hebbian learning occurs on every cycle.
[Figure: initial state (before learning), showing the auditory input layer, the visual input layer, and the decision (hidden) layer.]
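The processing cycle on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the McMurray & Spivey (2000) implementation: the layer sizes come from the slides, but the weight range, the square-and-normalize competition rule, the feedback rule, the learning rate, and the stopping tolerance are all assumptions made for the example, and the helper names are hypothetical.

```python
import random

N_AUD, N_VIS, N_DEC = 15, 15, 90  # layer sizes taken from the slides

def init_weights(n_in, n_out, rng):
    # Small random starting weights; the range is an assumption.
    return [[rng.uniform(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

def normalize(values):
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)

def feedback(inputs, dec, weights):
    # Each active input unit is boosted by the decision units it supports,
    # then the layer is rescaled to keep its original total activation.
    total = sum(inputs)
    boosted = [x * (1.0 + sum(d * w for d, w in zip(dec, row)))
               for x, row in zip(inputs, weights)]
    return [total * b for b in normalize(boosted)]

def settle(aud, vis, w_aud, w_vis, lr=0.01, max_cycles=100, tol=1e-4):
    """Cycle until the decision layer stops changing; learn on every cycle."""
    dec = [1.0 / N_DEC] * N_DEC
    for cycle in range(1, max_cycles + 1):
        # Feedforward: both input layers drive the decision units.
        net = [sum(a * row[j] for a, row in zip(aud, w_aud)) +
               sum(v * row[j] for v, row in zip(vis, w_vis))
               for j in range(N_DEC)]
        # Competition: square-and-normalize, an assumed stand-in for inhibition.
        new_dec = normalize([(d + n) ** 2 for d, n in zip(dec, net)])
        # Feedback from the decision units to both input layers.
        aud = feedback(aud, new_dec, w_aud)
        vis = feedback(vis, new_dec, w_vis)
        # Unsupervised Hebbian update: co-active units strengthen their link.
        for inputs, weights in ((aud, w_aud), (vis, w_vis)):
            for x, row in zip(inputs, weights):
                if x > 0:
                    for j in range(N_DEC):
                        row[j] += lr * x * new_dec[j]
        change = max(abs(a - b) for a, b in zip(new_dec, dec))
        dec = new_dec
        if change < tol:
            break
    return dec, cycle

rng = random.Random(0)
w_aud = init_weights(N_AUD, N_DEC, rng)
w_vis = init_weights(N_VIS, N_DEC, rng)
aud = [1.0 if i == 0 else 0.0 for i in range(N_AUD)]          # one name
vis = [1.0 if i in (0, 3, 7) else 0.0 for i in range(N_VIS)]  # three objects
w_before = w_aud[0][:]
dec, cycles = settle(aud, vis, w_aud, w_vis)
winner = max(range(N_DEC), key=dec.__getitem__)
print(f"stopped after {cycles} cycles; most active decision unit = {winner}")
```

Because the Hebbian step sits inside the settling loop, the weights move a little on every cycle, which is the property the later slides rely on.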
Online decision dynamics reflect auditory and visual competitors.
The Model
- 15 Auditory & 15 Visual units
- 90 Decision units
- Names presented singly with a variable number of objects
- Name-Decision & Object-Decision associations strengthened via learning
- After 4000 training trials, the network forms localist representations
- Learns name-object links and learns to ignore visual competitors
[Figure: intermediate state (during learning) and end state (post learning).]
[Figure: connection strength between auditory input units and decision units.]
Two Time Scales
Fast: Moment by Moment
- Online information integration and constraint satisfaction (e.g., McClelland & Elman, 1986; Dell, 1986)
- Reaches a stable pattern of activation based on auditory and visual inputs and stored knowledge (weights)
- Model makes correct name-object links based on the latest input
Slow: Over the Long-Term
- Unsupervised Hebbian learning
- Associates words with visual targets
- Learns to ignore visual competitors
Dependent Time Scales
The two time scales are not independent.
- Long-term learning depends critically on the dynamics of the fast time scale.
- Competition between decision units ensures pseudo-localist representations, which are critical for Hebbian learning (e.g., Rumelhart & Zipser, 1986).
- Learning occurs on each cycle, so it influences processing cycle-by-cycle and trial-by-trial.
- Accumulated learning across trials leads to learning on the long-term time scale (i.e., word learning).
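How small per-cycle changes add up across trials can be shown with a toy example. This sketch tracks a single name-object weight rather than the full network; the update rule, the learning rate, the baseline activation of 0.2, and the cycle and trial counts are assumptions chosen for illustration, not parameters of the model.

```python
def run_trial(w, lr=0.005, cycles=10):
    """Fast time scale: several processing cycles, each with a small Hebbian step."""
    for _ in range(cycles):
        # Assumed rule: decision activation grows with the stored weight.
        post = min(1.0, 0.2 + w)
        # Bounded Hebbian step with presynaptic activity fixed at 1.
        w += lr * 1.0 * post * (1.0 - w)
    return w

# Slow time scale: the same weight carried across many trials.
w = 0.0
history = []
for trial in range(500):
    w = run_trial(w)
    history.append(w)

print(f"weight after 1 trial:    {history[0]:.3f}")
print(f"weight after 500 trials: {history[-1]:.3f}")
```

A single trial leaves the weight close to its starting value, while repeated trials drive it toward its ceiling, which mirrors the contrast between a fast-mapped word and a known word.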
Empirical Results
Fast Time Scale
- 24-month-old children
- Saw 2 familiar objects & 1 novel object
- Asked to get familiar and novel objects (e.g., “get the cow!” or “get the yok!”)
- Example array: Cow (familiar), Block (familiar), Yok (novel)
Children were excellent at fast mapping (finding the referent of novel and familiar words in the moment).
[Figure: fast-mapping results, marked *** in the original slide.]
Slow Time Scale
- After a 5-minute delay, children were asked to pick a newly fast-mapped name (e.g., “get the yok!”)
- Example array: Yok (target), Fode (named foil), unnamed foil (previously seen)
Children were unable to retain mappings after a 5-minute delay.
[Figure: retention results, marked *** in the original slide.]
Replication
Initial findings replicated with simpler tasks: is there an effect of the number of names or trials?
Replication #1 (N = 12): 1 novel name, 8 familiar names, 7 preference trials. Fast mapping: 9/12 (**); retention: 4/9 (n.s.).
Replication #2 (N = 12): 1 novel name, 2 familiar names. Fast mapping: 7/12 (*); retention: 4/7 (n.s.).
* Binomial, p < .05; ** Binomial, p < .01
Children’s difficulty in retaining newly fast-mapped names is not related to the number of names or trials.
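For readers unfamiliar with the binomial test reported on this slide, here is a short worked example using the 9/12 fast-mapping and 4/9 retention counts. The chance level of 1/3 (three objects per trial) and the one-tailed form of the test are assumptions; the slide does not state which chance level or which variant of the test the authors used, and the replication tasks may have used a different number of objects.

```python
from math import comb

def binomial_tail(successes, n, chance):
    """P(X >= successes) for X ~ Binomial(n, chance): one-tailed test against chance."""
    return sum(comb(n, k) * chance**k * (1 - chance)**(n - k)
               for k in range(successes, n + 1))

CHANCE = 1 / 3  # assumption: three objects per trial

p_fast_mapping = binomial_tail(9, 12, CHANCE)  # 9 of 12 children correct
p_retention = binomial_tail(4, 9, CHANCE)      # 4 of 9 children correct

print(f"fast mapping: p = {p_fast_mapping:.4f}")
print(f"retention:    p = {p_retention:.4f}")
```

Under these assumptions the fast-mapping count falls below the .01 threshold and the retention count does not reach .05, in line with the markers on the slide.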
Simulations
- 20 networks initialized with random weights
- 15-word lexicon (names & objects): 5 familiar words, 5 novel words, 5 held out
- Trained on 5 familiar items for 5000 epochs
- Items presented in random order
- Run in the Fast Mapping Experiment: 10 fast mapping trials (5 familiar, 5 novel) and 5 retention trials
- Learning was not turned off during the experiment.
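The simulation design above can be laid out as a small schedule builder. The counts (20 networks, 15 items, 5/5/5 split, 5000 epochs, 10 fast-mapping and 5 retention trials) come from the slide; the helper name `make_design`, the use of one random seed per network, and the assumption that each item is the target of exactly one trial are illustrative choices, and the composition of the object array on each trial is deliberately left out because the slide does not specify it.

```python
import random

# Values taken from the slide.
N_NETWORKS, LEXICON_SIZE, TRAIN_EPOCHS = 20, 15, 5000

def make_design(seed):
    """Build one network's lexicon split and trial schedule (hypothetical helper)."""
    rng = random.Random(seed)
    items = list(range(LEXICON_SIZE))
    rng.shuffle(items)
    familiar, novel, held_out = items[:5], items[5:10], items[10:]
    # Training uses only the familiar items, presented in random order.
    # Only the first epoch's order is generated to keep the sketch small.
    first_epoch = rng.sample(familiar, len(familiar))
    # Assumption: each familiar and each novel item is the target of one trial.
    fast_mapping = [("familiar", t) for t in familiar] + [("novel", t) for t in novel]
    rng.shuffle(fast_mapping)
    # Assumption: each novel item is tested once for retention.
    retention = rng.sample(novel, len(novel))
    return {
        "familiar": familiar, "novel": novel, "held_out": held_out,
        "train_epochs": TRAIN_EPOCHS, "first_epoch": first_epoch,
        "fast_mapping": fast_mapping, "retention": retention,
    }

designs = [make_design(seed) for seed in range(N_NETWORKS)]
print(designs[0]["fast_mapping"])
```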
How The Model Behaves
Fast Time Scale:
- Model succeeded on both types of fast-mapping trials
- Model behavior patterned with empirical results
Slow Time Scale: The model fails to “retain” the newly learned words after a “delay”.
[Figure: retention performance relative to chance.]
How The Model “Thinks”
Analyses of weight matrices revealed that relatively little learning occurred during the test phase.
[Figure: change (RMS) in portions of the weight matrix for familiar, novel, and control words; panels show squared deviations after learning and after test.]
[Figure: temporal dynamics of processing.]
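The weight-matrix analysis can be illustrated with a small helper that computes the root-mean-square change over the rows belonging to one group of words. The function name `rms_change`, the row-per-word layout, and the toy numbers are assumptions for the example; they are not the model's weights or results.

```python
import math

def rms_change(before, after, rows):
    """Root-mean-square change over the selected rows of a weight matrix."""
    squared = [(a - b) ** 2
               for r in rows
               for b, a in zip(before[r], after[r])]
    return math.sqrt(sum(squared) / len(squared))

# Toy 3 x 2 weight matrix: one row per word, one column per decision unit.
# These numbers are invented for illustration only.
before = [[0.0, 0.0], [0.5, 0.5], [0.2, 0.1]]
after = [[3.0, 4.0], [0.5, 0.5], [0.2, 0.1]]

groups = {"familiar": [0], "novel": [1], "control": [2]}
changes = {name: rms_change(before, after, rows) for name, rows in groups.items()}
print(changes)
```

Comparing a snapshot taken after training with one taken after the test phase, group by group, gives the kind of per-portion change measure the slide describes.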
[Figure: connection strength prior to the experiment and after the experiment.]
Conclusions
Two time scales captured in a single architecture:
- Fast, online: fast mapping
- Slow, long-term: word learning
The model replicated the empirical findings:
- Excellent word learning and fast mapping
- Poor “retention”
The model has sufficient knowledge to select the referent at a given moment in time, given auditory and visual input and stored knowledge (weights), but not enough to subsequently “know” the word.
Conclusions
In-the-moment learning:
- Subtly biases behavior
- Combined with activation dynamics, yields the correct response
- Does not provide robust, context-independent word knowledge (in the short term)
Continued training on fast-mapped words (i.e., 5000 epochs) makes them familiar words.
Accumulation of this learning provides robust, context-independent word knowledge over development.
Take-Home Messages
1) A fast-mapped word is not a known word, but a known word is known because it has been fast-mapped many, many times.
2) Understanding development requires models that integrate both short-term dynamic processes and long-term learning.
References
Carey, S. (1978). The child as word learner. In M. Halle, J. Bresnan & A. Miller (Eds.), Linguistic Theory and Psychological Reality (pp ). Cambridge, MA: MIT Press.
Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93(3)
Horst, J. S., & Samuelson, L. K. (2005, April). Slow Down: Understanding the Time Course Behind Fast Mapping. Poster session presented at the 2005 Biennial Meeting of the Society for Research in Child Development, Atlanta, GA.
McClelland, J., & Elman, J. (1986). The TRACE Model of Speech Perception. Cognitive Psychology, 18(1),
McMurray, B., & Spivey, M. (2000). The Categorical Perception of Consonants: The Interaction of Learning and Processing. The Proceedings of the Chicago Linguistics Society, 34(2),
Rumelhart, D., & Zipser, D. (1986). Feature Discovery By Competitive Learning. In D. Rumelhart & J. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1, Cambridge, MA: MIT Press.
Acknowledgements
The authors would like to thank Joseph Toscano for programming assistance and support. This work was supported by NICHD Grant R01-HD to LKS.