Emergent Functions of Simple Systems J. L. McClelland Stanford University
Topics
- Emergent probabilistic optimization in neural networks
- Relationship between competence/rational approaches and mechanistic (including connectionist) approaches
- Some models that bring connectionist and probabilistic approaches into proximal contact
Connectionist Units Calculate Posteriors based on Priors and Evidence
Given:
- A unit representing hypothesis h_i, with binary inputs indexed by j representing the state of information about various elements of evidence e, where each e_j is conditionally independent given h_i
- A bias on the unit equal to log(prior_i / (1 − prior_i))
- Weights to the unit from each input equal to log(p(e_j|h_i) / p(e_j|¬h_i))
If:
- the output of the unit is computed from the logistic function
  a = 1 / [1 + exp(−(bias_i + Σ_j a_j w_ij))]
Then:
- a = p(h_i|e)
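A minimal numerical sketch of this equivalence, using hypothetical toy numbers (the prior and likelihoods below are assumptions for illustration): a logistic unit whose bias is the log prior odds and whose weights are log likelihood ratios produces the same value as a direct Bayesian computation of p(h|e).

```python
import math

# Hypothetical toy numbers: one hypothesis h with prior 0.3 and two
# conditionally independent binary evidence elements, both present.
prior = 0.3
p_e_h    = [0.8, 0.6]   # p(e_j = 1 | h)
p_e_noth = [0.2, 0.5]   # p(e_j = 1 | not h)
evidence = [1, 1]       # a_j = 1 means evidence element j is present
                        # (a_j = 0 would be treated as uninformative by the unit)

# Connectionist unit: bias = log prior odds, weights = log likelihood ratios
bias = math.log(prior / (1 - prior))
w = [math.log(p_e_h[j] / p_e_noth[j]) for j in range(2)]
net = bias + sum(a * wj for a, wj in zip(evidence, w))
a_unit = 1 / (1 + math.exp(-net))   # logistic activation

# Direct Bayesian computation of p(h | e) for comparison
like_h    = prior * math.prod(p_e_h[j] for j in range(2) if evidence[j])
like_noth = (1 - prior) * math.prod(p_e_noth[j] for j in range(2) if evidence[j])
posterior = like_h / (like_h + like_noth)

print(a_unit, posterior)   # the two values coincide
```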
Further Points
- A collection of connectionist units representing mutually exclusive alternative hypotheses can assign the posterior probability to each in a similar way, using the softmax activation function (where net_i = bias_i + Σ_j a_j w_ij):
  a_i = exp(g·net_i) / Σ_i′ exp(g·net_i′)
- If g = 1, this constitutes probability matching.
- As g increases, more and more of the activation goes to the most likely alternative(s).
- Selecting the h_i with the largest a_i corresponds to choosing the alternative with the largest posterior probability.
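A small sketch of the gain parameter's effect (the net inputs below are hypothetical): at g = 1 the softmax activations match the posterior probabilities, while a large g pushes nearly all of the activation onto the most likely alternative.

```python
import math

def softmax(nets, g=1.0):
    """Softmax with gain g: a_i = exp(g*net_i) / sum_i' exp(g*net_i')."""
    exps = [math.exp(g * n) for n in nets]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical net inputs for three mutually exclusive alternatives
nets = [1.0, 0.5, 0.0]

match = softmax(nets, g=1.0)    # g = 1: activations = posterior probabilities
sharp = softmax(nets, g=20.0)   # large g: activation concentrates on the best alternative

print(match)
print(sharp)
```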
Emergent Outcomes from Local Computations (Hopfield, '82; Hinton & Sejnowski, '83)
- If w_ij = w_ji and you update units in the network one at a time, setting a_i = 1 if net_i > 0 and a_i = 0 otherwise, the network will settle to a state s that is a local maximum of a measure Rumelhart et al. (1986) called G:
  G(s) = Σ_{i<j} w_ij a_i a_j + Σ_i a_i (bias_i + ext_i)
- If instead each unit sets its activation to 1 with probability logistic(g·net_i), then
  p(s) = exp(g·G(s)) / Σ_s′ exp(g·G(s′))
- This allows probability matching (g = 1) or maximization (g → ∞), and maximization can be achieved via simulated annealing (gradual increase in g).
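A sketch of the second claim, with assumed toy weights and biases: Gibbs sampling a tiny symmetric network (each unit turns on with probability logistic(g·net_i)) visits states with frequencies that approach the Boltzmann distribution exp(g·G(s))/Σ_s′ exp(g·G(s′)).

```python
import itertools
import math
import random

random.seed(0)

# Tiny symmetric network: 3 binary units, hypothetical weights and biases
w = {(0, 1): 1.0, (0, 2): -0.5, (1, 2): 0.5}   # w_ij = w_ji, stored once for i < j
bias = [0.2, -0.1, 0.0]
g = 1.0  # gain

def G(state):
    """Goodness: sum_{i<j} w_ij a_i a_j + sum_i a_i bias_i (no external input here)."""
    return (sum(wij * state[i] * state[j] for (i, j), wij in w.items())
            + sum(a * b for a, b in zip(state, bias)))

def net_input(state, i):
    tot = bias[i]
    for (a, b), wij in w.items():
        if a == i:
            tot += wij * state[b]
        elif b == i:
            tot += wij * state[a]
    return tot

# Gibbs sampling: pick a unit, turn it on with probability logistic(g * net_i)
state = [0, 0, 0]
counts = {}
for t in range(200_000):
    i = random.randrange(3)
    p_on = 1 / (1 + math.exp(-g * net_input(state, i)))
    state[i] = 1 if random.random() < p_on else 0
    if t > 10_000:  # discard burn-in
        counts[tuple(state)] = counts.get(tuple(state), 0) + 1

# Exact Boltzmann distribution for comparison
states = list(itertools.product([0, 1], repeat=3))
z = sum(math.exp(g * G(s)) for s in states)
exact = {s: math.exp(g * G(s)) / z for s in states}

n = sum(counts.values())
for s in states:
    print(s, counts.get(s, 0) / n, exact[s])
```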
A Tweaked Connectionist Model (McClelland & Rumelhart, 1981) that is Also a Graphical Model
- Each pool of units in the IA model is equivalent to a Dirichlet variable (cf. Dean, 2005).
- This is enforced if we use softmax and set one of the a_j in each pool to 1 with probability
  p_j = exp(net_j) / Σ_j′ exp(net_j′)
- Weight arrays linking the variables are the equivalent of the 'edges' encoding conditional relationships between the states of these different variables.
- Biases at the word level encode the prior p(w).
- Weights are bi-directional, but encode generative constraints (p(l|w), p(f|l)).
- At equilibrium with g = 1, the network's probability of being in state s equals p(s|I).
But that’s not the true PDP approach to Perception/Cognition/etc.
- We want to learn how to represent the world and the constraints among its constituents from experience, using (to the fullest extent possible) a domain-general approach.
- In this context, the prototypical connectionist learning rules correspond to probability maximization or matching.
- Back-propagation algorithm:
  - Treats output units (or n-way pools) as conditionally independent given the input I
  - Maximizes p(o_i|I)
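A minimal sketch of that last point, using assumed toy inputs, targets, and weights: for logistic output units treated as conditionally independent given the input, gradient ascent on log p(o|I) (cross-entropy) reduces to the simple delta (t − a), and the log likelihood of the targets rises over training.

```python
import math

# Hypothetical toy problem: 3 inputs, 2 independent logistic output units
inputs  = [1.0, 0.0, 1.0]
targets = [1.0, 0.0]                        # desired output states
w = [[0.1, -0.2, 0.0], [0.05, 0.1, -0.1]]   # w[i][j]: input j -> output i
bias = [0.0, 0.0]
lr = 0.5                                    # assumed learning rate

def forward():
    return [1 / (1 + math.exp(-(bias[i] + sum(w[i][j] * inputs[j] for j in range(3)))))
            for i in range(2)]

def log_likelihood(a):
    """log p(o|I) under the conditional-independence assumption."""
    return sum(t * math.log(ai) + (1 - t) * math.log(1 - ai)
               for t, ai in zip(targets, a))

before = log_likelihood(forward())
for _ in range(50):
    a = forward()
    for i in range(2):
        delta = targets[i] - a[i]           # gradient of log p(o_i|I) w.r.t. net_i
        bias[i] += lr * delta
        for j in range(3):
            w[i][j] += lr * delta * inputs[j]
after = log_likelihood(forward())

print(before, after)   # log likelihood increases with training
```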
Overcoming the independence assumption
- The Boltzmann machine algorithm learns to match the probabilities of entire output states given the current input.
- That is, it minimizes the Kullback-Leibler divergence
  Σ_o p(o|I) log[ p(o|I) / q(o|I) ]
  where p(o|I) is sampled from the environment, and q(o|I) is the network's estimate of p(o|I), obtained by Gibbs sampling.
- The algorithm is beautifully simple and local:
  Δw_ij = ε(⟨a_i a_j⟩⁺ − ⟨a_i a_j⟩⁻)
- But this is slow, and it generalizes poorly in completely unconstrained Boltzmann machines.
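A small sketch of the objective and the rule, with hypothetical numbers: the KL divergence being minimized is zero exactly when the network's estimate matches the environment, and one weight step uses only the difference between clamped (+) and free-running (−) co-occurrence statistics.

```python
import math

def kl(p, q):
    """KL divergence sum_o p(o) log(p(o)/q(o)) -- what Boltzmann learning minimizes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three output states
p = [0.5, 0.3, 0.2]   # environment's p(o|I)
q = [0.4, 0.4, 0.2]   # network's current estimate q(o|I)

print(kl(p, q))       # > 0 here; 0 only when q matches p

# One step of the learning rule for a single weight. eps is an assumed
# learning rate; the co-occurrence averages <a_i a_j> would come from
# clamped (+) and free-running (-) Gibbs sampling.
eps = 0.05
cooc_plus, cooc_minus = 0.6, 0.4
dw = eps * (cooc_plus - cooc_minus)
print(dw)
```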
But things have gotten much better recently…
- Hinton's deep belief networks are fully distributed, learned connectionist models that use a restricted form of the Boltzmann machine (no intra-layer connections) and learn state-of-the-art models very fast (e.g., handwritten digit recognition).
- Generic constraints (sparsity, locality) turn out to allow such networks to learn very efficiently and generalize very well in demanding task contexts (cf. Olshausen, Lewicki, LeCun, Bengio, Ng, and others).
- Hinton, Osindero, and Teh (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18.
Topics
- Emergent probabilistic optimization in neural networks
- Relationship between competence/rational approaches and mechanistic (including connectionist) approaches
- Some models that bring connectionist and probabilistic approaches into proximal contact
Relationship between rational approaches and mechanistic approaches
- Characterizing what's optimal is always a great thing to do.
- Optimization is of course relative to a set of constraints:
  - Time
  - Memory
  - Processing speed
- The lesson of Voltaire's Candide.
- The question of whether people do behave optimally in any particular situation is an empirical question.
- The question of why and how people can/do behave rationally in some situations and not so rationally in others is a matter of theory.
Two perspectives
One perspective:
- People are rational. They seek to derive explicit internal models of the structure of the world:
  - Optimal structure type
  - Optimal structure within each type
- Resource limits and implementational constraints are unknown, and should be ignored in determining what is rational.
- But inference is hard, and prior domain-specific constraints are therefore essential.
The other perspective:
- People emerged through an optimization process, so they are likely to approximate rationality within limits.
- Implicit internal models characterize natural/intuitive intelligence; human cultures seek explicit models of the structure of the world; science and scientists engage in this search.
- Culture/school teaches us to think explicitly, and we do so under some circumstances. Most connectionist models do not directly address this kind of thinking.
- Human behavior won't be understood without considering the constraints it operates under.
- Figuring out what is optimal sans constraints is always a good thing, but such an effort should not presuppose individual human intent to derive an explicit model of the structure of the world.
- Inference is hard, and explicit models help, but domain-general mechanisms (which may be partially pre-structured where evolution has had a long time to work its magic) shaped by generic constraints deserve the fullest possible exploration.
- In some cases such models may closely approximate what might be the optimal explicit model. But that model might only be an approximation, and the domain-specific constraints might not be necessary.
It is important to figure out when we rely on explicit vs. implicit cognition
- A box appears…
- Then one or two objects appear
- Then a dot may or may not appear
- RT condition: respond as fast as possible when the dot appears.
- Prediction condition: predict whether a dot will appear; feedback is given after the prediction.
- Outcomes follow the 'Causal Powers' model with 10% noise.
- Half of the participants are instructed in the Causal Powers model, half are not.
- All of the event types (AB+, A+; CD+, C−; EF+; GH−, G−; fillers) occur several times, interleaved.
- All participants learn the explicit relations.
- Only instructed Prediction subjects show Blocking and Screening.
Topics
- Emergent probabilistic optimization in neural networks
- Relationship between competence/rational approaches and mechanistic (including connectionist) approaches
- Some models that bring connectionist and probabilistic approaches into proximal contact
Some models that bring connectionist and probabilistic approaches into proximal contact
- Graphical IA model
- Leaky Competing Accumulator Model (LCAM; Usher & McClelland, 2001, and the large family of related decision-making models)
- Models of unsupervised category learning: competitive learning, OME, TOME
- Subjective Likelihood Model of recognition memory (SLiM; McClelland & Chappell, 1998; cf. REM, Shiffrin & Steyvers, 1997)
Some Phenomena in Cognitive Science – Are they all Emergents?
- Categories, prototypes, rules
- Lexical entries
- Grammatical and semantic structures
- Cognitive modules for words and faces
- Attention, working memory
- Choices and decisions
- Memories for specific episodes or events
- Deep dyslexia
- Category-specific deficits
- Deficits in the hierarchical organization of behavior
- Appearance/disappearance of behaviors in development
- Object permanence
- Stage transitions
- Sensitive periods
- Language structure and language change
Example: PDP models of reading can…
- Read regular words, exception words, and nonwords without rules or lexical entries.
- Match data showing graded sensitivity to consistency and frequency in response choices and reaction times.
- Account for detailed aspects of deficits, including:
  - Graded effects of damage
  - Co-occurrence of semantic and visual errors in deep dyslexia
  - Regularization errors in surface dyslexia
  - Correlation of semantic impairment and surface dyslexia
  - Patterns of individual differences in these correlations
Error examples – Semantic: APRICOT → "peach"; Visual: FLASK → "flash"; Regularization: CAFE → "caif"
Basis of Visual Errors in Deep Dyslexia
Object Permanence and the A-not-B Error (Thelen et al., BBS, 2001; Munakata et al., Psych Rev, 1997; Munakata, Devel Sci, 1998)
- Do young children lack 'The Principle of Object Permanence'? Or have they not yet acquired the ability to sustain a tendency to respond to an object that is no longer visible?
- What underlies the striking A-not-B error? A failure of knowledge, or competing response tendencies?
- Basic object permanence behaviors and the A-not-B error are both highly sensitive to task details – the ages at which these effects can occur are easily manipulated.
- In emergentist accounts, these effects emerge from gradually developing abilities that must be strong enough to withstand delays and other impediments and to compete with other forces favoring alternative response tendencies.
Why Does Emergence Matter?
- Because it explains phenomena in terms of their substrate without reducing them to it.
- Because it explains how phenomena arise without the need for a blueprint or plan.
- Because an emergent account allows us to see more clearly how the phenomenon is more graded, approximate, and context-sensitive than would otherwise be apparent.
- Because the phenomenon is contingent on the details of what it emerges from, explaining when it does and does not occur.
- Because the explanation may not require the postulation of something that itself remains to be explained:
  - Gravity
  - Preformation
  - Universal Grammar
How well do we understand emergence?
- Only to a very limited extent – more work is clearly necessary!
What can be done to increase our understanding?
- Increase awareness of emergent phenomena in other domains of science and foster an understanding of their mechanistic basis.
- Increase acceptance of and reliance on computational models as vehicles for explaining observed cognitive, developmental, and linguistic phenomena.
- Work harder on making the explanations for the emergent properties of models more clear.
- Increase emphasis on understanding underlying mechanisms and processes.
Credits and Bibliography
- Braitenberg. Vehicles.
- Rumelhart et al. Parallel Distributed Processing.
- Elman, Bates, Johnson, Karmiloff-Smith, Parisi, & Plunkett. Rethinking Innateness.
- Thelen & Smith. A Dynamic Systems Approach to the Development of Cognition and Action.
- MacWhinney. The Emergence of Language.