1
A Bayesian view of language evolution by iterated learning. Tom Griffiths, Brown University; Mike Kalish, University of Louisiana.
2
Linguistic universals Human languages possess universal properties –e.g. compositionality (Comrie, 1981; Greenberg, 1963; Hawkins, 1988) Two questions: –why do linguistic universals exist? –why are particular properties universal?
3
Possible answers Traditional answer: –linguistic universals reflect innate constraints specific to a system for acquiring language (e.g., Chomsky, 1965) Alternative answer: –linguistic universals emerge as the result of the fact that language is learned anew by each generation (using general-purpose learning mechanisms) (e.g., Briscoe, 1998; Kirby, 2001)
4
The iterated learning model (Kirby, 2001): each learner sees data, forms a hypothesis, and produces the data given to the next learner; cf. the playground game “telephone”.
5
The “information bottleneck” (Kirby, 2001): “survival of the most compressible”. [Figure: languages passing through the bottleneck; size indicates compressibility.]
6
Analyzing iterated learning: what are the consequences of iterated learning? [Figure: prior work arranged by simulations vs. analytic results and complex vs. simple learning algorithms: Kirby (2001), Brighton (2002), Smith, Kirby, & Brighton (2003), Komarova, Niyogi, & Nowak (2002); one cell is marked “?”.]
7
Bayesian inference [portrait: Reverend Thomas Bayes]: a rational procedure for updating beliefs, the foundation of many learning algorithms (e.g., MacKay, 2003), and widely used for language learning (e.g., Charniak, 1993).
8
Bayes’ theorem: with h a hypothesis and d data,
\[
p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h'} p(d \mid h')\, p(h')}
\]
where the left-hand side is the posterior probability, the numerator is the likelihood times the prior probability, and the denominator sums over the space of hypotheses.
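A worked instance with hypothetical numbers (two hypotheses with prior 0.3 and 0.7, and likelihoods 0.8 and 0.2 for the observed data):
\[
p(h_1 \mid d) = \frac{0.8 \times 0.3}{0.8 \times 0.3 + 0.2 \times 0.7} = \frac{0.24}{0.38} \approx 0.63 .
\]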
9
Iterated Bayesian learning: learners are Bayesian agents. The first learner sees data d_0, samples a hypothesis h_1 from p(h | d), and produces data d_1 from p(d | h); the next learner maps d_1 to h_2 and produces d_2, and so on: d_0 → h_1 → d_1 → h_2 → d_2 → h_3 → …
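A minimal Python sketch of this loop, under assumptions not from the talk: two hypotheses, binary data, and hypothetical prior and likelihood values.

```python
# Minimal sketch of iterated Bayesian learning (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

hypotheses = ["compositional", "holistic"]
prior = np.array([0.6, 0.4])            # hypothetical prior over hypotheses
likelihood = np.array([[0.9, 0.1],      # p(d | h): rows = hypotheses,
                       [0.2, 0.8]])     # columns = two possible data values

def learn(d):
    """Sample a hypothesis from the posterior p(h | d)."""
    posterior = likelihood[:, d] * prior
    posterior /= posterior.sum()
    return rng.choice(len(hypotheses), p=posterior)

def produce(h):
    """Sample data from p(d | h)."""
    return rng.choice(2, p=likelihood[h])

d = 0                                   # data seen by the first learner
for generation in range(10):
    h = learn(d)                        # learner forms a hypothesis
    d = produce(h)                      # and produces data for the next learner
    print(generation, hypotheses[h], d)
```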
10
Markov chains: the variable x_n is independent of the history given x_{n-1}. The chain is described by a transition matrix T with T_{ij} = p(x_n = i | x_{n-1} = j), and it converges to a stationary distribution under easily checked conditions (ergodicity).
11
Stationary distributions: a stationary distribution π satisfies π_i = Σ_j T_{ij} π_j, i.e. π = Tπ in matrix form. Using tools from linear algebra: π is the first eigenvector of the transition matrix T, and the second eigenvalue of T sets the rate of convergence.
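A small sketch of reading the stationary distribution off a transition matrix, using a toy three-state chain with made-up values:

```python
# Stationary distribution as the leading eigenvector of a transition matrix.
import numpy as np

# T[i, j] = p(x_n = i | x_{n-1} = j): columns sum to one.
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

eigvals, eigvecs = np.linalg.eig(T)
k = np.argmax(eigvals.real)          # eigenvalue 1 is the largest
pi = eigvecs[:, k].real
pi /= pi.sum()                       # normalize to a probability distribution
print("stationary distribution:", pi)

# The second-largest eigenvalue controls how fast the chain converges.
print("second eigenvalue:", sorted(eigvals.real, reverse=True)[1])
```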
12
Analyzing iterated learning: the process d_0 → h_1 → d_1 → h_2 → d_2 → h_3 → … can be read three ways: as a Markov chain on hypotheses (h_1 → h_2 → h_3, with transitions Σ_d p(h | d) p(d | h′)), as a Markov chain on data (d_0 → d_1 → d_2, with transitions Σ_h p(d | h) p(h | d′)), and as a Markov chain on hypothesis–data pairs ((h_1, d_1) → (h_2, d_2) → (h_3, d_3)).
13
Stationary distributions: the Markov chain on h converges to the prior, p(h); the Markov chain on d converges to the “prior predictive distribution” p(d) = Σ_h p(d | h) p(h); and the Markov chain on (h, d) is a Gibbs sampler for the joint distribution p(d, h) = p(d | h) p(h).
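A sketch of the calculation behind the first claim, using only the quantities defined above: the chain on hypotheses has transitions obtained by summing over the intervening data, and the prior is stationary under those transitions.
\[
T(h_{n+1} \mid h_n) = \sum_{d} p(h_{n+1} \mid d)\, p(d \mid h_n),
\]
\begin{align*}
\sum_{h_n} T(h_{n+1} \mid h_n)\, p(h_n)
  &= \sum_{d} p(h_{n+1} \mid d) \sum_{h_n} p(d \mid h_n)\, p(h_n) \\
  &= \sum_{d} p(h_{n+1} \mid d)\, p(d) = p(h_{n+1}).
\end{align*}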
14
Implications: the probability that the nth learner entertains the hypothesis h approaches p(h) as n → ∞. Convergence to the prior occurs regardless of: –the amount or structure of the data transmitted –the properties of the hypotheses themselves. The consequences of iterated learning are determined entirely by the biases of the learners.
15
A simple language model: a language maps events to utterances. [Figure: a compositional language, in which components of events (“agents”, “actions”) map to components of utterances (“nouns”, “verbs”), with utterances shown as binary strings.]
16
A simple language model: data are m event–utterance pairs; hypotheses are languages (compositional or holistic) produced with some error, with a prior p(h) over languages. [Figure: example compositional and holistic mappings from events to binary-string utterances.]
17
Analysis technique: 1. Compute the transition matrix on languages. 2. Sample Markov chains. 3. Compare language frequencies with the prior.
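A Python sketch of these three steps for a hypothetical two-language reduction of the model (not the authors' implementation; the parameter names alpha, eps, and m and the per-utterance error model are assumptions):

```python
# Sketch: transition matrix on languages, chain sampling, comparison with prior.
from itertools import product
import numpy as np

alpha, eps, m = 0.5, 0.05, 3            # prior weight, error rate, bottleneck size
prior = np.array([alpha, 1 - alpha])    # [compositional, holistic]
# p(utterance | language): each language produces its own characteristic form
# with probability 1 - eps and the other form with probability eps.
produce = np.array([[1 - eps, eps],
                    [eps, 1 - eps]])

# 1. Transition matrix on languages:
#    T[h_new, h_old] = sum over data d of p(h_new | d) * p(d | h_old),
#    enumerating all sequences of m utterances exhaustively.
T = np.zeros((2, 2))
for d in product(range(2), repeat=m):
    lik = np.array([np.prod([produce[h, u] for u in d]) for h in range(2)])
    post = lik * prior
    post /= post.sum()
    for h_old in range(2):
        T[:, h_old] += post * lik[h_old]

# 2. Sample a Markov chain of learners.
rng = np.random.default_rng(1)
h, counts = 0, np.zeros(2)
for _ in range(20000):
    h = rng.choice(2, p=T[:, h])
    counts[h] += 1

# 3. Compare language frequencies in the chain with the prior.
print("chain frequencies:", counts / counts.sum())
print("prior:            ", prior)
```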
18
Convergence to priors: [Figure: language frequencies in the chain compared with the prior, across iterations, for a prior weight on compositionality of 0.50 versus 0.01 (error 0.05, m = 3).] Compositionality emerges only when favored by the prior.
19
The information bottleneck: [Figure: chain versus prior language frequencies across iterations for bottleneck sizes m = 1, 3, and 10 (prior weight 0.50, error 0.05).] No effect of the bottleneck.
20
The information bottleneck Bottleneck affects relative stability of languages favored by prior
21
Explaining linguistic universals Two questions: –why do linguistic universals exist? –why are particular properties universal? Our analysis gives different answers: –existence explained through iterated learning –universal properties depend on the prior Focuses inquiry on the priors of the learners –languages reflect the biases of human learners
22
Extensions and future directions Results extend to: –unbounded populations –continuous time population dynamics Iterated learning applies to other knowledge –religious concepts, social norms, legends… Provides a method for evaluating priors –experiments in iterated learning with humans
24
Iterated function learning: each learner sees a set of (x, y) pairs (the data), forms a hypothesis about the underlying function, and makes predictions of y for new x values; those predictions are the data for the next learner.
25
Function learning in the lab: on each trial, participants see a stimulus, respond by setting a slider, and receive feedback. This lets us examine iterated learning with different initial data.
26
[Figure: responses over iterations 1–9 for chains seeded with different initial data (Kalish, 2004).]
28
Markov chain Monte Carlo A strategy for sampling from complex probability distributions Key idea: construct a Markov chain which converges to a particular distribution –e.g. Metropolis algorithm –e.g. Gibbs sampling
29
Gibbs sampling (Geman & Geman, 1984): for variables x = x_1, x_2, …, x_n, draw x_i^{(t+1)} from P(x_i | x_{-i}), where x_{-i} = x_1^{(t+1)}, x_2^{(t+1)}, …, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, …, x_n^{(t)}. The chain converges to P(x_1, x_2, …, x_n). (Also known as the heat bath algorithm in statistical physics.)
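A minimal Gibbs sampler sketch, on an assumed example not from the talk: sampling a standard bivariate Gaussian with correlation rho by alternating conditional draws.

```python
# Gibbs sampling for a standard bivariate Gaussian with correlation rho.
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8
n_samples = 5000

x1, x2 = 0.0, 0.0
samples = np.empty((n_samples, 2))
for t in range(n_samples):
    # For this target, p(x1 | x2) and p(x2 | x1) are Gaussian:
    # mean rho * other, variance 1 - rho**2.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = (x1, x2)

print("sample correlation:", np.corrcoef(samples.T)[0, 1])  # close to rho
```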
30
Gibbs sampling: [figure illustrating Gibbs sampling, after MacKay (2003)].
32
An example: Gaussians. If we assume data d is a single real number x, hypotheses h are means μ of a Gaussian with known variance σ_x², and the prior p(μ) is Gaussian(μ_0, σ_0²), then p(x_{n+1} | x_n) is Gaussian(μ_n, σ_x² + σ_n²), where μ_n and σ_n² are the posterior mean and variance given x_n.
33
An example: Gaussians (continued). Under the same assumptions, p(x_n | x_0) is Gaussian(μ_0 + cⁿ x_0, (σ_x² + σ_0²)(1 − c²ⁿ)), i.e. geometric convergence to the prior.
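A Python sketch of this Gaussian case, assuming a production variance σ_x² = 1 (not stated on the slide) and the prior parameters used on the next slide:

```python
# Iterated learning with Gaussian hypotheses: each learner sees one x,
# samples a mean mu from the posterior, then generates the next x.
import numpy as np

rng = np.random.default_rng(3)
mu0, var0 = 0.0, 1.0      # prior over the mean (values from the next slide)
var_x = 1.0               # production/observation variance (assumed)
x = 20.0                  # initial data, x_0 = 20

for n in range(20):
    # Posterior over mu given one observation x (standard Gaussian conjugacy).
    post_var = 1.0 / (1.0 / var0 + 1.0 / var_x)
    post_mean = post_var * (mu0 / var0 + x / var_x)
    mu = rng.normal(post_mean, np.sqrt(post_var))   # learner samples a hypothesis
    x = rng.normal(mu, np.sqrt(var_x))              # and produces data
    print(n, round(x, 2))
# x rapidly forgets x_0 = 20 and fluctuates around the prior mean mu_0 = 0.
```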
34
[Figure: chains of x values across iterations with μ_0 = 0, σ_0² = 1, x_0 = 20.] Iterated learning results in rapid convergence to the prior.
35
An example: linear regression. Assume data d are pairs of real numbers (x, y) and hypotheses h are functions. In this example, hypotheses are lines that pass through the origin, parameterized by their slope θ, with p(θ) Gaussian(θ_0, σ_0²). [Figure: a line through the origin; its height y at x = 1 equals the slope.]
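A Python sketch of iterated learning for this regression example, assuming a noise variance on y (not given here) and the parameter values shown on the following slide:

```python
# Iterated learning where hypotheses are slopes theta of a line through
# the origin, with a Gaussian prior on theta.
import numpy as np

rng = np.random.default_rng(4)
theta0, var_theta = 1.0, 0.1   # prior on the slope (values from the next slide)
var_y = 0.1                    # noise variance on y given x (assumed)
x = 1.0                        # each learner sees one (x, y) pair at x = 1
y = -1.0                       # initial data, y_0 = -1

for n in range(20):
    # Posterior over theta given (x, y): Gaussian by conjugacy.
    post_var = 1.0 / (1.0 / var_theta + x**2 / var_y)
    post_mean = post_var * (theta0 / var_theta + x * y / var_y)
    theta = rng.normal(post_mean, np.sqrt(post_var))   # sampled hypothesis
    y = rng.normal(theta * x, np.sqrt(var_y))          # data for the next learner
    print(n, round(y, 2))
# y drifts from the initial -1 toward values consistent with the prior on theta.
```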
36
[Figure: iterated learning chains for the regression example with θ_0 = 1, σ_0² = 0.1, y_0 = −1.]