
1 A Bayesian view of language evolution by iterated learning
Tom Griffiths, Brown University
Mike Kalish, University of Louisiana

2 Linguistic universals
Human languages possess universal properties
–e.g. compositionality (Comrie, 1981; Greenberg, 1963; Hawkins, 1988)
Two questions:
–why do linguistic universals exist?
–why are particular properties universal?

3 Possible answers
Traditional answer:
–linguistic universals reflect innate constraints specific to a system for acquiring language (e.g., Chomsky, 1965)
Alternative answer:
–linguistic universals emerge because language is learned anew by each generation, using general-purpose learning mechanisms (e.g., Briscoe, 1998; Kirby, 2001)

4 The iterated learning model (Kirby, 2001)
Each learner sees data, forms a hypothesis, and produces the data given to the next learner
–cf. the playground game “telephone”

5 The “information bottleneck” (Kirby, 2001)
“survival of the most compressible”
[Figure: size indicates compressibility]

6 Analyzing iterated learning
What are the consequences of iterated learning?
[Figure: prior work arranged along two axes, simulations vs. analytic results and complex vs. simple algorithms: Kirby (2001), Brighton (2002), Smith, Kirby, & Brighton (2003), Komarova, Niyogi, & Nowak (2002), with a “?” marking the remaining combination]

7 Bayesian inference
Rational procedure for updating beliefs
Foundation of many learning algorithms (e.g., MacKay, 2003)
Widely used for language learning (e.g., Charniak, 1993)
[Image: Reverend Thomas Bayes]

8 Bayes’ theorem
p(h|d) = p(d|h) p(h) / Σ_h′ p(d|h′) p(h′)
–p(h|d): posterior probability
–p(d|h): likelihood
–p(h): prior probability
–the sum in the denominator runs over the space of hypotheses
h: hypothesis, d: data
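A minimal numerical sketch (not from the slides) of Bayes’ theorem over a discrete hypothesis space; the prior and likelihood values below are hypothetical:

```python
import numpy as np

def posterior(prior, likelihood, data):
    """Posterior over hypotheses via Bayes' theorem.

    prior: array of p(h), one entry per hypothesis
    likelihood: function likelihood(h, data) returning p(d|h)
    """
    unnorm = np.array([likelihood(h, data) * prior[h] for h in range(len(prior))])
    return unnorm / unnorm.sum()        # normalize by the sum over hypotheses

# Hypothetical example: two equally probable hypotheses; the observed datum
# is twice as likely under hypothesis 0 as under hypothesis 1.
prior = np.array([0.5, 0.5])
lik = lambda h, d: [0.8, 0.4][h]
print(posterior(prior, lik, None))      # -> [0.667, 0.333]
```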

9 Iterated Bayesian learning
Learners are Bayesian agents
Each learner infers a hypothesis from the data it sees, p(h|d), then generates the data passed to the next learner, p(d|h):
d_0 → h_1 → d_1 → h_2 → d_2 → h_3 → …
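A small simulation sketch of this chain (not from the slides), using a hypothetical model with two hypotheses and three data values; the hypothesis frequencies along the chain can then be compared with the prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 2 hypotheses, 3 possible data values.
prior = np.array([0.7, 0.3])                      # p(h)
lik = np.array([[0.6, 0.3, 0.1],                  # p(d | h = 0)
                [0.1, 0.3, 0.6]])                 # p(d | h = 1)

def iterate(d0, n_iter=10000):
    """One run of iterated learning: h ~ p(h|d), then d ~ p(d|h)."""
    d, hs = d0, []
    for _ in range(n_iter):
        post = lik[:, d] * prior
        h = rng.choice(2, p=post / post.sum())    # learner samples a hypothesis
        d = rng.choice(3, p=lik[h])               # ...and produces data from it
        hs.append(h)
    return np.array(hs)

hs = iterate(d0=2)
print(np.bincount(hs) / len(hs))   # hypothesis frequencies approach the prior [0.7, 0.3]
```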

10 Markov chains
Variable x_n is independent of history given x_{n-1}
Transition matrix T, with T_ij = p(x_n = i | x_{n-1} = j)
Converges to a stationary distribution under easily checked conditions (ergodicity)
[Figure: a chain of variables x → x → x → …]

11 Stationary distributions
Stationary distribution: π_i = Σ_j T_ij π_j, or in matrix form π = Tπ
Using tools from linear algebra:
–π is the first eigenvector of the transition matrix T
–second eigenvalue of T sets rate of convergence
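A quick sketch of this linear-algebra view (the 3-state transition matrix below is hypothetical):

```python
import numpy as np

# Hypothetical 3-state chain; columns are "from" states, rows are "to" states,
# matching T_ij = p(x_n = i | x_{n-1} = j), so each column sums to 1.
T = np.array([[0.90, 0.20, 0.10],
              [0.05, 0.70, 0.30],
              [0.05, 0.10, 0.60]])

eigvals, eigvecs = np.linalg.eig(T)
order = np.argsort(-eigvals.real)        # leading eigenvalue (1) first
pi = eigvecs[:, order[0]].real
pi = pi / pi.sum()                       # normalize the first eigenvector
print("stationary distribution:", pi)
print("second eigenvalue (sets convergence rate):", eigvals.real[order[1]])
```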

12 Analyzing iterated learning
The chain d_0 → h_1 → d_1 → h_2 → d_2 → h_3 → … can be read three ways:
–a Markov chain on hypotheses, h_1 → h_2 → h_3 → …, with transition probability Σ_d p(h′|d) p(d|h)
–a Markov chain on data, d_0 → d_1 → d_2 → …, with transition probability Σ_h p(d′|h) p(h|d)
–a Markov chain on hypothesis-data pairs, (h_1,d_1) → (h_2,d_2) → (h_3,d_3) → …

13 Stationary distributions
Markov chain on h converges to the prior, p(h)
Markov chain on d converges to the “prior predictive distribution”, p(d) = Σ_h p(d|h) p(h)
Markov chain on (h,d) is a Gibbs sampler for p(d,h) = p(d|h) p(h)
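A small numerical check of the first claim, reusing the hypothetical two-hypothesis model from the sketch above: the transition matrix on hypotheses leaves the prior unchanged.

```python
import numpy as np

# Hypothetical model, same shapes as in the earlier sketch:
prior = np.array([0.7, 0.3])                 # p(h)
lik = np.array([[0.6, 0.3, 0.1],
                [0.1, 0.3, 0.6]])            # lik[h, d] = p(d|h)

post = lik * prior[:, None]
post = post / post.sum(axis=0)               # post[h, d] = p(h|d)

# Transition matrix on hypotheses: T[h_new, h_old] = sum_d p(h_new|d) p(d|h_old)
T = post @ lik.T

print(T @ prior)     # -> [0.7 0.3]: the prior is the stationary distribution
```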

14 Implications
The probability that the nth learner entertains hypothesis h approaches p(h) as n → ∞
Convergence to the prior occurs regardless of:
–the amount or structure of the data transmitted
–the properties of the hypotheses themselves
The consequences of iterated learning are determined entirely by the biases of the learners

15 A simple language model
[Figure: a language maps events (“actions”, “agents”) to utterances (“verbs”, “nouns”); the binary coding shown is compositional]

16 A simple language model
Data: m event-utterance pairs
Hypotheses: languages, with error ε
[Figure: prior p(h) over languages, compositional vs. holistic]

17 Analysis technique
1. Compute transition matrix on languages
2. Sample Markov chains
3. Compare language frequencies with prior
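A sketch of steps 2 and 3 (the slides do not spell out the full language model, so the transition matrix T and prior passed in are assumed to come from a construction like the one sketched above):

```python
import numpy as np

rng = np.random.default_rng(1)

def language_frequencies(T, prior, n_chains=20, n_iter=2000, burn_in=500):
    """Sample Markov chains with transition matrix T[h_new, h_old]
    (e.g. built as posterior @ likelihood.T, as in the previous sketch)
    and return the empirical language frequencies for comparison with the prior."""
    n = len(prior)
    counts = np.zeros(n)
    for _ in range(n_chains):
        h = rng.integers(n)                  # arbitrary starting language
        for t in range(n_iter):
            h = rng.choice(n, p=T[:, h])     # one step of the chain
            if t >= burn_in:
                counts[h] += 1
    return counts / counts.sum()

# frequencies = language_frequencies(T, prior)
# print(frequencies, prior)   # frequencies should approximately match the prior
```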

18 Convergence to priors
[Figure: chain vs. prior language frequencies across iterations, for two prior settings (0.50 and 0.01), with ε = 0.05, m = 3]
Compositionality emerges only when favored by the prior

19 The information bottleneck
[Figure: chain vs. prior language frequencies across iterations, with prior setting 0.50, ε = 0.05, and m = 1, 3, 10]
No effect of bottleneck

20 The information bottleneck Bottleneck affects relative stability of languages favored by prior

21 Explaining linguistic universals
Two questions:
–why do linguistic universals exist?
–why are particular properties universal?
Our analysis gives different answers:
–existence explained through iterated learning
–universal properties depend on the prior
Focuses inquiry on the priors of the learners
–languages reflect the biases of human learners

22 Extensions and future directions
Results extend to:
–unbounded populations
–continuous-time population dynamics
Iterated learning applies to other knowledge
–religious concepts, social norms, legends…
Provides a method for evaluating priors
–experiments in iterated learning with humans


24 Iterated function learning
Each learner sees a set of (x, y) pairs
Makes predictions of y for new x values
Predictions are the data for the next learner
[Figure: data and hypotheses alternating across learners]

25 Function learning in the lab
[Figure: experimental display with stimulus, response slider, and feedback]
Examine iterated learning with different initial data

26 [Figure: responses at iterations 1–9, starting from different initial data (Kalish, 2004)]


28 Markov chain Monte Carlo
A strategy for sampling from complex probability distributions
Key idea: construct a Markov chain which converges to a particular distribution
–e.g. the Metropolis algorithm
–e.g. Gibbs sampling

29 Gibbs sampling (Geman & Geman, 1984)
For variables x = x_1, x_2, …, x_n
Draw x_i^(t+1) from P(x_i | x_-i), where x_-i = x_1^(t+1), x_2^(t+1), …, x_{i-1}^(t+1), x_{i+1}^(t), …, x_n^(t)
Converges to P(x_1, x_2, …, x_n)
(a.k.a. the heat bath algorithm in statistical physics)
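A minimal Gibbs-sampling sketch (not from the slides) for a toy target, a standard bivariate Gaussian with correlation ρ, whose conditionals are known in closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8                                   # correlation of the toy bivariate Gaussian target

def gibbs(n_samples=5000):
    """Gibbs sampler: each step redraws one coordinate from its conditional
    given the other, P(x1|x2) and P(x2|x1)."""
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))   # x1 ~ P(x1 | x2)
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))   # x2 ~ P(x2 | x1)
        samples.append((x1, x2))
    return np.array(samples)

s = gibbs()
print(np.corrcoef(s.T)[0, 1])   # close to rho = 0.8
```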

30 Gibbs sampling (MacKay, 2003)


32 An example: Gaussians
If we assume…
–data, d, is a single real number, x
–hypotheses, h, are means of a Gaussian, μ
–prior, p(μ), is Gaussian(μ_0, σ_0²)
…then p(x_{n+1} | x_n) is Gaussian(μ_n, σ_x² + σ_n²)

33 An example: Gaussians
If we assume…
–data, d, is a single real number, x
–hypotheses, h, are means of a Gaussian, μ
–prior, p(μ), is Gaussian(μ_0, σ_0²)
…then p(x_{n+1} | x_n) is Gaussian(μ_n, σ_x² + σ_n²)
p(x_n | x_0) is Gaussian(μ_0 + c^n (x_0 − μ_0), (σ_x² + σ_0²)(1 − c^{2n})), with c = σ_0² / (σ_0² + σ_x²)
i.e. geometric convergence to the prior
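A simulation sketch of this Gaussian chain, using the standard conjugate update for the mean; the data variance σ_x² is not given on the slides, so the value below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

mu0, var0 = 0.0, 1.0      # prior on the mean: Gaussian(mu_0, sigma_0^2), as on slide 34
var_x = 1.0               # data variance sigma_x^2 (assumed; not stated on the slides)
x = 20.0                  # x_0, the datum seen by the first learner (slide 34)

c = var0 / (var0 + var_x) # shrinkage factor; the posterior variance is c * var_x
means = []
for n in range(20):
    mu_n = mu0 + c * (x - mu0)                        # posterior mean given one datum
    x = rng.normal(mu_n, np.sqrt(var_x + c * var_x))  # x_{n+1} ~ Gaussian(mu_n, sigma_x^2 + sigma_n^2)
    means.append(mu_n)

print(np.round(means, 2))  # means decay (in expectation) geometrically toward the prior mean 0
```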

34 μ_0 = 0, σ_0² = 1, x_0 = 20
[Figure: sampled chains of hypothesis means across iterations]
Iterated learning results in rapid convergence to prior

35 An example: linear regression
Assume:
–data, d, are pairs of real numbers (x, y)
–hypotheses, h, are functions
An example: linear regression
–hypotheses have slope θ and pass through the origin
–p(θ) is Gaussian(θ_0, σ_0²)
[Figure: a line through the origin; its height y at x = 1 equals the slope θ]

36 θ_0 = 1, σ_0² = 0.1, y_0 = −1
[Figure: sampled regression slopes across iterations]
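A simulation sketch for this case (with y observed at x = 1, inferring the slope reduces to the Gaussian example), using the parameter values shown on slide 36; the observation noise variance is not given on the slides and is an assumption here:

```python
import numpy as np

rng = np.random.default_rng(4)

theta0, var0 = 1.0, 0.1   # prior on the slope: Gaussian(theta_0, sigma_0^2), as on slide 36
var_y = 0.1               # observation noise variance (assumed; not stated on the slides)
y = -1.0                  # y_0, observed at x = 1 by the first learner (slide 36)

c = var0 / (var0 + var_y) # shrinkage toward the prior slope
slopes = []
for n in range(20):
    theta_n = theta0 + c * (y - theta0)                   # posterior mean slope given (x = 1, y)
    y = rng.normal(theta_n, np.sqrt(var_y + c * var_y))   # next learner's datum at x = 1
    slopes.append(theta_n)

print(np.round(slopes, 3))  # slopes drift back toward the prior mean theta_0 = 1
```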


