1
A Bayesian view of language evolution by iterated learning. Tom Griffiths, Brown University; Mike Kalish, University of Louisiana.
2
Linguistic universals Human languages possess universal properties –e.g. compositionality (Comrie, 1981; Greenberg, 1963; Hawkins, 1988) Two questions: –why do linguistic universals exist? –why are particular properties universal?
3
Possible answers Traditional answer: –linguistic universals reflect innate constraints specific to a system for acquiring language (e.g., Chomsky, 1965) Alternative answer: –linguistic universals emerge as the result of the fact that language is learned anew by each generation (using general-purpose learning mechanisms) (e.g., Briscoe, 1998; Kirby, 2001)
4
The iterated learning model (Kirby, 2001): each learner sees data, forms a hypothesis, and produces the data given to the next learner; cf. the playground game “telephone”.
5
The “information bottleneck” (Kirby, 2001): “survival of the most compressible”. [Figure: languages passing through the bottleneck; size indicates compressibility.]
6
Analyzing iterated learning: what are the consequences of iterated learning? [Figure: prior work arranged by simulations vs. analytic results and complex vs. simple learning algorithms: Kirby (2001), Brighton (2002), Smith, Kirby, & Brighton (2003), Komarova, Niyogi, & Nowak (2002); one cell is marked “?”.]
7
Bayesian inference [portrait: Reverend Thomas Bayes]: a rational procedure for updating beliefs, the foundation of many learning algorithms (e.g., MacKay, 2003), and widely used for language learning (e.g., Charniak, 1993).
8
Bayes’ theorem: with h a hypothesis and d data,
\[
p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h'} p(d \mid h')\, p(h')}
\]
where the left-hand side is the posterior probability, the numerator is the likelihood times the prior probability, and the denominator sums over the space of hypotheses.
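A worked instance with hypothetical numbers (two hypotheses with prior 0.3 and 0.7, and likelihoods 0.8 and 0.2 for the observed data):
\[
p(h_1 \mid d) = \frac{0.8 \times 0.3}{0.8 \times 0.3 + 0.2 \times 0.7} = \frac{0.24}{0.38} \approx 0.63 .
\]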
9
Iterated Bayesian learning: learners are Bayesian agents. The first learner sees data d_0, samples a hypothesis h_1 from p(h | d), and produces data d_1 from p(d | h); the next learner maps d_1 to h_2 and produces d_2, and so on: d_0 → h_1 → d_1 → h_2 → d_2 → h_3 → …
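A minimal Python sketch of this loop, under assumptions not from the talk: two hypotheses, binary data, and hypothetical prior and likelihood values.

```python
# Minimal sketch of iterated Bayesian learning (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

hypotheses = ["compositional", "holistic"]
prior = np.array([0.6, 0.4])            # hypothetical prior over hypotheses
likelihood = np.array([[0.9, 0.1],      # p(d | h): rows = hypotheses,
                       [0.2, 0.8]])     # columns = two possible data values

def learn(d):
    """Sample a hypothesis from the posterior p(h | d)."""
    posterior = likelihood[:, d] * prior
    posterior /= posterior.sum()
    return rng.choice(len(hypotheses), p=posterior)

def produce(h):
    """Sample data from p(d | h)."""
    return rng.choice(2, p=likelihood[h])

d = 0                                   # data seen by the first learner
for generation in range(10):
    h = learn(d)                        # learner forms a hypothesis
    d = produce(h)                      # and produces data for the next learner
    print(generation, hypotheses[h], d)
```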
10
Markov chains: the variable x_n is independent of the history given x_{n-1}. The chain is described by a transition matrix T with T_{ij} = p(x_n = i | x_{n-1} = j), and it converges to a stationary distribution under easily checked conditions (ergodicity).
11
Stationary distributions: a stationary distribution π satisfies π_i = Σ_j T_{ij} π_j, i.e. π = Tπ in matrix form. Using tools from linear algebra: π is the first eigenvector of the transition matrix T, and the second eigenvalue of T sets the rate of convergence.
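A small sketch of reading the stationary distribution off a transition matrix, using a toy three-state chain with made-up values:

```python
# Stationary distribution as the leading eigenvector of a transition matrix.
import numpy as np

# T[i, j] = p(x_n = i | x_{n-1} = j): columns sum to one.
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

eigvals, eigvecs = np.linalg.eig(T)
k = np.argmax(eigvals.real)          # eigenvalue 1 is the largest
pi = eigvecs[:, k].real
pi /= pi.sum()                       # normalize to a probability distribution
print("stationary distribution:", pi)

# The second-largest eigenvalue controls how fast the chain converges.
print("second eigenvalue:", sorted(eigvals.real, reverse=True)[1])
```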
12
Analyzing iterated learning: the process d_0 → h_1 → d_1 → h_2 → d_2 → h_3 → … can be read three ways: as a Markov chain on hypotheses (h_1 → h_2 → h_3, with transitions Σ_d p(h | d) p(d | h′)), as a Markov chain on data (d_0 → d_1 → d_2, with transitions Σ_h p(d | h) p(h | d′)), and as a Markov chain on hypothesis–data pairs ((h_1, d_1) → (h_2, d_2) → (h_3, d_3)).
13
Stationary distributions: the Markov chain on h converges to the prior, p(h); the Markov chain on d converges to the “prior predictive distribution” p(d) = Σ_h p(d | h) p(h); and the Markov chain on (h, d) is a Gibbs sampler for the joint distribution p(d, h) = p(d | h) p(h).
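A sketch of the calculation behind the first claim, using only the quantities defined above: the chain on hypotheses has transitions obtained by summing over the intervening data, and the prior is stationary under those transitions.
\[
T(h_{n+1} \mid h_n) = \sum_{d} p(h_{n+1} \mid d)\, p(d \mid h_n),
\]
\begin{align*}
\sum_{h_n} T(h_{n+1} \mid h_n)\, p(h_n)
  &= \sum_{d} p(h_{n+1} \mid d) \sum_{h_n} p(d \mid h_n)\, p(h_n) \\
  &= \sum_{d} p(h_{n+1} \mid d)\, p(d) = p(h_{n+1}).
\end{align*}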
14
Implications: the probability that the nth learner entertains the hypothesis h approaches p(h) as n → ∞. Convergence to the prior occurs regardless of: –the amount or structure of the data transmitted –the properties of the hypotheses themselves. The consequences of iterated learning are determined entirely by the biases of the learners.
15
A simple language model: a language maps events to utterances. [Figure: a compositional language, in which components of events (“agents”, “actions”) map to components of utterances (“nouns”, “verbs”), with utterances shown as binary strings.]
16
A simple language model: data are m event–utterance pairs; hypotheses are languages (compositional or holistic) produced with some error, with a prior p(h) over languages. [Figure: example compositional and holistic mappings from events to binary-string utterances.]
17
Analysis technique: 1. Compute the transition matrix on languages. 2. Sample Markov chains. 3. Compare language frequencies with the prior.
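A Python sketch of these three steps for a hypothetical two-language reduction of the model (not the authors' implementation; the parameter names alpha, eps, and m and the per-utterance error model are assumptions):

```python
# Sketch: transition matrix on languages, chain sampling, comparison with prior.
from itertools import product
import numpy as np

alpha, eps, m = 0.5, 0.05, 3            # prior weight, error rate, bottleneck size
prior = np.array([alpha, 1 - alpha])    # [compositional, holistic]
# p(utterance | language): each language produces its own characteristic form
# with probability 1 - eps and the other form with probability eps.
produce = np.array([[1 - eps, eps],
                    [eps, 1 - eps]])

# 1. Transition matrix on languages:
#    T[h_new, h_old] = sum over data d of p(h_new | d) * p(d | h_old),
#    enumerating all sequences of m utterances exhaustively.
T = np.zeros((2, 2))
for d in product(range(2), repeat=m):
    lik = np.array([np.prod([produce[h, u] for u in d]) for h in range(2)])
    post = lik * prior
    post /= post.sum()
    for h_old in range(2):
        T[:, h_old] += post * lik[h_old]

# 2. Sample a Markov chain of learners.
rng = np.random.default_rng(1)
h, counts = 0, np.zeros(2)
for _ in range(20000):
    h = rng.choice(2, p=T[:, h])
    counts[h] += 1

# 3. Compare language frequencies in the chain with the prior.
print("chain frequencies:", counts / counts.sum())
print("prior:            ", prior)
```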
18
Convergence to priors: [Figure: language frequencies in the chain compared with the prior, across iterations, for a prior weight on compositionality of 0.50 versus 0.01 (error 0.05, m = 3).] Compositionality emerges only when favored by the prior.
19
The information bottleneck: [Figure: chain versus prior language frequencies across iterations for bottleneck sizes m = 1, 3, and 10 (prior weight 0.50, error 0.05).] No effect of the bottleneck.
20
The information bottleneck Bottleneck affects relative stability of languages favored by prior
21
Explaining linguistic universals Two questions: –why do linguistic universals exist? –why are particular properties universal? Our analysis gives different answers: –existence explained through iterated learning –universal properties depend on the prior Focuses inquiry on the priors of the learners –languages reflect the biases of human learners
22
Extensions and future directions Results extend to: –unbounded populations –continuous time population dynamics Iterated learning applies to other knowledge –religious concepts, social norms, legends… Provides a method for evaluating priors –experiments in iterated learning with humans
24
Iterated function learning: each learner sees a set of (x, y) pairs (the data), forms a hypothesis about the underlying function, and makes predictions of y for new x values; those predictions are the data for the next learner.
25
Function learning in the lab: on each trial, participants see a stimulus, respond by setting a slider, and receive feedback. This lets us examine iterated learning with different initial data.
26
[Figure: responses over iterations 1–9 for chains seeded with different initial data (Kalish, 2004).]
28
Markov chain Monte Carlo A strategy for sampling from complex probability distributions Key idea: construct a Markov chain which converges to a particular distribution –e.g. Metropolis algorithm –e.g. Gibbs sampling
29
Gibbs sampling (Geman & Geman, 1984): for variables x = x_1, x_2, …, x_n, draw x_i^{(t+1)} from P(x_i | x_{-i}), where x_{-i} = x_1^{(t+1)}, x_2^{(t+1)}, …, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, …, x_n^{(t)}. The chain converges to P(x_1, x_2, …, x_n). (Also known as the heat bath algorithm in statistical physics.)
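A minimal Gibbs sampler sketch, on an assumed example not from the talk: sampling a standard bivariate Gaussian with correlation rho by alternating conditional draws.

```python
# Gibbs sampling for a standard bivariate Gaussian with correlation rho.
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8
n_samples = 5000

x1, x2 = 0.0, 0.0
samples = np.empty((n_samples, 2))
for t in range(n_samples):
    # For this target, p(x1 | x2) and p(x2 | x1) are Gaussian:
    # mean rho * other, variance 1 - rho**2.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = (x1, x2)

print("sample correlation:", np.corrcoef(samples.T)[0, 1])  # close to rho
```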
30
Gibbs sampling: [figure illustrating Gibbs sampling, after MacKay (2003)].
32
An example: Gaussians. If we assume data d is a single real number x, hypotheses h are means μ of a Gaussian with known variance σ_x², and the prior p(μ) is Gaussian(μ_0, σ_0²), then p(x_{n+1} | x_n) is Gaussian(μ_n, σ_x² + σ_n²), where μ_n and σ_n² are the posterior mean and variance given x_n.
33
An example: Gaussians (continued). Under the same assumptions, p(x_n | x_0) is Gaussian(μ_0 + cⁿ x_0, (σ_x² + σ_0²)(1 − c²ⁿ)), i.e. geometric convergence to the prior.
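A Python sketch of this Gaussian case, assuming a production variance σ_x² = 1 (not stated on the slide) and the prior parameters used on the next slide:

```python
# Iterated learning with Gaussian hypotheses: each learner sees one x,
# samples a mean mu from the posterior, then generates the next x.
import numpy as np

rng = np.random.default_rng(3)
mu0, var0 = 0.0, 1.0      # prior over the mean (values from the next slide)
var_x = 1.0               # production/observation variance (assumed)
x = 20.0                  # initial data, x_0 = 20

for n in range(20):
    # Posterior over mu given one observation x (standard Gaussian conjugacy).
    post_var = 1.0 / (1.0 / var0 + 1.0 / var_x)
    post_mean = post_var * (mu0 / var0 + x / var_x)
    mu = rng.normal(post_mean, np.sqrt(post_var))   # learner samples a hypothesis
    x = rng.normal(mu, np.sqrt(var_x))              # and produces data
    print(n, round(x, 2))
# x rapidly forgets x_0 = 20 and fluctuates around the prior mean mu_0 = 0.
```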
34
[Figure: chains of x values across iterations with μ_0 = 0, σ_0² = 1, x_0 = 20.] Iterated learning results in rapid convergence to the prior.
35
An example: linear regression. Assume data d are pairs of real numbers (x, y) and hypotheses h are functions. In this example, hypotheses are lines that pass through the origin, parameterized by their slope θ, with p(θ) Gaussian(θ_0, σ_0²). [Figure: a line through the origin; its height y at x = 1 equals the slope.]
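A Python sketch of iterated learning for this regression example, assuming a noise variance on y (not given here) and the parameter values shown on the following slide:

```python
# Iterated learning where hypotheses are slopes theta of a line through
# the origin, with a Gaussian prior on theta.
import numpy as np

rng = np.random.default_rng(4)
theta0, var_theta = 1.0, 0.1   # prior on the slope (values from the next slide)
var_y = 0.1                    # noise variance on y given x (assumed)
x = 1.0                        # each learner sees one (x, y) pair at x = 1
y = -1.0                       # initial data, y_0 = -1

for n in range(20):
    # Posterior over theta given (x, y): Gaussian by conjugacy.
    post_var = 1.0 / (1.0 / var_theta + x**2 / var_y)
    post_mean = post_var * (theta0 / var_theta + x * y / var_y)
    theta = rng.normal(post_mean, np.sqrt(post_var))   # sampled hypothesis
    y = rng.normal(theta * x, np.sqrt(var_y))          # data for the next learner
    print(n, round(y, 2))
# y drifts from the initial -1 toward values consistent with the prior on theta.
```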
36
[Figure: iterated learning chains for the regression example with θ_0 = 1, σ_0² = 0.1, y_0 = −1.]