Statistical NLP Spring 2011
Lecture 28: Diachronics. Dan Klein, UC Berkeley.
Evolution: Main Phenomena
Mutations of sequences; time; speciation.
Tree of Languages Challenge: identify the phylogeny
Much work in biology, e.g. by Warnow, Felsenstein, Steel, and others; also in linguistics, e.g. Warnow et al., Gray and Atkinson.
Statistical Inference Tasks
Inputs: modern text (e.g. fuego, feu; les faits sont très clairs). Outputs: ancestral word forms (e.g. focus), cognate groups / translations (e.g. fuego/feu 'fire' vs. oeuf/huevo 'egg'), the phylogeny (FR, IT, PT, ES), and grammatical inference.
Outline: ancestral word forms (focus, fuego, feu); cognate groups / translations (fuego/feu 'fire' vs. oeuf/huevo 'egg'); grammatical inference (les faits sont très clairs).
Language Evolution: Sound Change
Latin camera /kamera/ became French chambre /ʃambʁ/ through a series of changes: deletion of /e/, the change /k/ .. /tʃ/ .. /ʃ/, and insertion of /b/. English camera was borrowed directly from Latin ("camera obscura"); English chamber was borrowed from Old French before the initial /t/ of the affricate /tʃ/ dropped.
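A change series like this can be viewed as ordered rewrite rules. Below is a minimal sketch; the regex rules and the intermediate form they produce are illustrative simplifications (real French also lost the final vowel and nasalized the /a/), not the lecture's model.

```python
import re

def apply_sound_changes(form, rules):
    """Apply an ordered list of (pattern, replacement) rewrite rules."""
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form

# Illustrative, simplified rules for Latin /kamera/ toward French:
rules = [
    (r"e(?=ra$)", ""),  # deletion of the unstressed medial /e/
    (r"^k", "ʃ"),       # /k/ .. /tʃ/ .. /ʃ/, collapsed into one step
    (r"mr", "mbr"),     # epenthetic /b/ inserted between /m/ and /r/
]

print(apply_sound_changes("kamera", rules))  # -> ʃambra
```

Because the rules are ordered, the /b/-insertion rule only fires after the /e/-deletion has created the /mr/ cluster it needs.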
Diachronic Evidence
Yahoo! Answers [2009]: "tonight not tonite". Appendix Probi [ca. 300]: "tonitru non tonotru".
Synchronic (Comparative) Evidence
Simple Model: Single Characters
A Model for Word Forms [BGK 07]
Latin focus /fokus/; Iberian /fogo/; Italian fuoco /fwɔko/; Spanish fuego /fweɣo/; Portuguese fogo /fogo/ (LA at the root, IB an intermediate ancestor of ES and PT).
Contextual Changes
Sound changes depend on context, e.g. the word-initial contexts /# f o/ in /fokus/ and /# f w ɔ/ in /fwɔko/.
Changes are Systematic
The same changes recur across cognate sets, e.g. /fokus/ → /fwɔko/, /fweɣo/, /fogo/ and /kentrum/ → /tʃɛntro/, /sentro/.
Experimental Setup
Small: Romance (FR, IT, PT, ES). French, Italian, Portuguese, Spanish; 2344 words; complete cognate sets; target: (Vulgar) Latin.
Large: Oceanic. 661 languages; 140K words; incomplete cognate sets; target: Proto-Oceanic [Blust, 1993].
Data: Romance
Learning: Objective
(Figure: tree with latent ancestor /fokus/ and observed modern forms /fwɔko/, /fweɣo/, /fogo/.)
Learning: EM
M-Step: find parameters which fit the (expected) sound-change counts. Easy: gradient ascent on θ.
E-Step: find (expected) change counts given the parameters. Hard: the latent variables are string-valued.
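The M/E alternation can be seen on a toy problem where both steps are closed-form. The sketch below runs EM for a mixture of two biased coins; this is only an illustration of the E-step/M-step loop, not the string-valued model in the lecture (where the E-step is the hard part).

```python
def em_two_coins(flip_sets, theta_a=0.6, theta_b=0.5, iters=20):
    """EM for a two-coin mixture: each set of flips (1 = heads) was
    produced by coin A or coin B, but we never observe which."""
    for _ in range(iters):
        # E-step: expected heads/tails counts for each coin, weighted
        # by the posterior over which coin produced each flip set.
        stats = {"a": [0.0, 0.0], "b": [0.0, 0.0]}
        for flips in flip_sets:
            h = sum(flips)
            t = len(flips) - h
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            w_a = like_a / (like_a + like_b)
            stats["a"][0] += w_a * h
            stats["a"][1] += w_a * t
            stats["b"][0] += (1 - w_a) * h
            stats["b"][1] += (1 - w_a) * t
        # M-step: parameters that fit the expected counts.
        theta_a = stats["a"][0] / (stats["a"][0] + stats["a"][1])
        theta_b = stats["b"][0] / (stats["b"][0] + stats["b"][1])
    return theta_a, theta_b

data = [[1, 1, 1, 1, 0], [0, 0, 0, 1, 0], [1, 1, 1, 0, 1], [0, 1, 0, 0, 0]]
theta_a, theta_b = em_two_coins(data)  # converges near 0.8 and 0.2
```

In the lecture's model the expected counts are over sound changes on latent ancestral strings, so the E-step requires sampling rather than this closed-form posterior.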
Computing Expectations [Holmes 01, BGK 07]
Standard approach, e.g. [Holmes 2001]: Gibbs-sample each ancestral sequence in turn (example word: 'grass').
A Gibbs Sampler (example word: 'grass')
Getting Stuck
How can the sampler jump to a state where the liquids /r/ and /l/ have a common ancestor? It must wander through a valley of lower probability before it can reach the other peak.
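The trap can be reproduced on a tiny discrete example. The sketch below (my own illustration, not the lecture's model) runs single-site Gibbs sampling on a two-variable distribution with two strong modes; moving between the modes requires passing through a state of probability 0.001.

```python
import random

# Joint over (x1, x2): two strong modes at (0,0) and (1,1),
# with the "mixed" states in between nearly impossible.
P = {(0, 0): 0.499, (1, 1): 0.499, (0, 1): 0.001, (1, 0): 0.001}

def gibbs(sweeps, seed=0):
    rng = random.Random(seed)
    x = [0, 0]
    visits = {state: 0 for state in P}
    for _ in range(sweeps):
        for i in (0, 1):  # resample each variable given the other
            other = x[1 - i]
            on = (1, other) if i == 0 else (other, 1)
            off = (0, other) if i == 0 else (other, 0)
            p_on = P[on] / (P[on] + P[off])
            x[i] = 1 if rng.random() < p_on else 0
        visits[tuple(x)] += 1
    return visits

visits = gibbs(2000)
# Starting from (0,0), each single-site update crosses toward the
# other mode with probability only 0.001 / 0.5 = 0.002, so the
# sampler mixes between the two modes extremely slowly.
```

This is the discrete analogue of the /r/-/l/ problem above: single-sequence resampling must pass through low-probability intermediate states.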
Solution: Vertical Slices [BGK 08]
Replace single-sequence resampling with ancestry resampling: resample a thin vertical slice through all the sequences in the tree at once.
Details: Defining “Slices”
The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences. Correct construction of the section(G) is non-trivial, but it can be computed very efficiently.
Results: Alignment Efficiency
Question: having shown that ancestry resampling finds better alignments, is it also faster than basic Gibbs? Hypothesis: larger gains for deeper trees, because the random walk that Gibbs must make grows longer with depth. Setup: fixed wall-clock time budget; synthetic data generated from the same parameters over deeper and deeper phylogenetic trees; for each tree, compute the sum-of-pairs score.
Results: Romance
Learned Rules / Mutations
Comparison to Other Methods
Evaluation metric: mean Levenshtein edit distance from a reconstruction made by a linguist (lower is better). Comparison to the system from [Oakes, 2000], which uses exact inference and deterministic rules, on reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]. Our system outperforms this previous approach.
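Levenshtein edit distance, the metric used here, is a standard dynamic program; a minimal reference implementation (my own sketch, not the talk's evaluation code):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("fwɔko", "fweɣo"))  # -> 2 (two substitutions)
```

The evaluation averages this distance over all reconstructed words, comparing each proposed protoform to the linguist's.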
Data: Oceanic (target: Proto-Oceanic)
Result: More Languages Help
(Plot: mean edit distance from the [Blust, 1993] reconstructions, as a function of the number of modern languages used.)
Results: Large Scale
The tree used here is based on the full Austronesian language dataset.
Visualization: Learned universals
All of these links correspond to plausible sound changes, even though the model did not have features encoding natural classes.
Regularity and Functional Load
In a language, some pairs of sounds are more contrastive than others (higher functional load). Example: English "p"/"d" versus "t"/"th". The pair "p"/"d" distinguishes many minimal pairs (pot/dot, pin/din, press/dress, pew/dew, ...), while "t"/"th" distinguishes few (thin/tin). An open question in linguistics is whether less contrastive sounds are more vulnerable to sound change. Functional load quantifies contrastiveness: it is low for the t/th pair and high for the p/d pair.
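One simple way to operationalize functional load is to count the minimal pairs a contrast distinguishes. The sketch below does exactly that over a toy lexicon; the word list is a made-up illustration, and real functional-load measures (e.g. King's entropy-based one) are more involved.

```python
def minimal_pairs(lexicon, s1, s2):
    """Count minimal pairs distinguished only by the contrast s1/s2.
    Words are tuples of segments; a pair counts when replacing s1
    with s2 at a single position maps one word onto another."""
    words = set(lexicon)
    pairs = set()
    for w in words:
        for i, seg in enumerate(w):
            if seg == s1:
                other = w[:i] + (s2,) + w[i + 1:]
                if other in words:
                    pairs.add((w, other))
    return len(pairs)

# Toy lexicon (illustrative only):
lex = [("p", "i", "n"), ("d", "i", "n"), ("t", "i", "n"),
       ("th", "i", "n"), ("p", "o", "t"), ("d", "o", "t")]

print(minimal_pairs(lex, "p", "d"))   # -> 2 (pin/din, pot/dot)
print(minimal_pairs(lex, "t", "th"))  # -> 1 (tin/thin)
```

Under this count, merging "t" and "th" would collapse only one pair of words, while merging "p" and "d" would collapse two.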
Functional Load: Timeline
Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 1955]. Previous research within linguistics: "FLH does not seem to be supported by the data" [King, 1967]. Caveat: only four languages were used in King's study [Hockett, 1967; Surendran et al., 2006]. Our work: we reexamine the question with two orders of magnitude more data [BGK, under review].
Regularity and Functional Load
Data: only the 4 languages from the Austronesian data used by King. Each dot is a sound change identified by the system. x-axis: functional load as computed by [King, 1967]; y-axis: merger posterior probability (values near 1 indicate a systematic sound change; lower values indicate context-dependent changes).
Regularity and Functional Load
Data: all 637 languages from the Austronesian data. x-axis: functional load as computed by [King, 1967]; y-axis: merger posterior probability. Computing functional load over this tree gives a much clearer view of the relationship between functional load and sound change; the data appear to support the functional load hypothesis, i.e. sound changes are more frequent when they merge phonemes with low functional load.
Outline: ancestral word forms (focus, fuego, feu); cognate groups / translations (fuego/feu 'fire' vs. oeuf/huevo 'egg'); grammatical inference (les faits sont très clairs).
Cognate Groups
Task: group the modern forms of 'fire' (/fwɔko/, /fweɣo/, /fogo/) while separating non-cognates such as /vɛrbo/, /berβo/, /tʃɛntro/, /sentro/, /sɛntro/.
Model: Cognate Survival
Each word either survives (+) or dies (-) along every branch of the phylogeny (LA, IB, IT, ES, PT); e.g. Latin omnis survives as Italian ogni.
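Under a branchwise survival model like this, the probability that a word is attested at a leaf is the product of per-branch survival probabilities along the path from the root. A minimal sketch, with made-up branch probabilities (the actual model's parameters are learned, not these numbers):

```python
def survival_probability(path, survive):
    """P(word survives from root to leaf) along a path of branches."""
    p = 1.0
    for branch in path:
        p *= survive[branch]
    return p

# Hypothetical per-branch survival probabilities (illustrative numbers):
survive = {("LA", "IB"): 0.9, ("IB", "ES"): 0.8,
           ("IB", "PT"): 0.85, ("LA", "IT"): 0.7}

p_es = survival_probability([("LA", "IB"), ("IB", "ES")], survive)  # 0.9 * 0.8
```

A word can thus be missing from Spanish either because it died on the LA-IB branch (also killing it in Portuguese) or independently on the IB-ES branch, which is what ties cognate survival to the tree structure.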
Results: Grouping Accuracy
Metric: fraction of words correctly grouped. Method: [Hall and Klein, in submission].
Semantics: Matching Meanings
English day occurs with "night", "sun", "week"; English tag occurs with "name", "label", "along"; German Tag occurs with "Nacht", "Sonne", "Woche" ("night", "sun", "week"). The matching context distributions indicate that German Tag corresponds in meaning to English day, not English tag.
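The matching itself can be done by comparing context-count vectors, e.g. with cosine similarity. A minimal sketch with made-up counts (real systems use large corpora and stronger association measures):

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two context-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy context counts (illustrative, not real corpus statistics);
# German contexts are glossed into English for comparability.
en_day = Counter({"night": 5, "sun": 3, "week": 4})
en_tag = Counter({"name": 4, "label": 3, "along": 2})
de_Tag = Counter({"night": 4, "sun": 2, "week": 5})

# German Tag looks like English day, not English tag:
assert cosine(en_day, de_Tag) > cosine(en_tag, de_Tag)
```

Even though tag and Tag are identical in form, the context vectors pull the match toward day, which is the point of combining form and context signals.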
Outline: ancestral word forms (focus, fuego, feu); cognate groups / translations (fuego/feu 'fire' vs. oeuf/huevo 'egg'); grammatical inference (les faits sont très clairs).
Grammar Induction
Task: given sentences, infer a grammar (and parse tree structures). Examples: "congress narrowly passed the amended bill" (English); "les faits sont très clairs" (French: "the facts are very clear").
Shared Prior
Grammar parameters for related languages (e.g. English "congress narrowly passed the amended bill", French "les faits sont très clairs") are coupled through a shared prior.
Results: Phylogenetic Prior
Languages: Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English, connected through the tree nodes WG, NG, RM, G, IE, GL. Average relative gain: 29%.
Conclusion
Phylogeny-structured models can: accurately reconstruct ancestral words; give evidence in open linguistic debates; detect translations from form and context; improve language-learning algorithms. Lots of questions are still open: can we get better phylogenies using these high-res models? What do these models have to say about the very earliest languages? Proto-World?
Thank you! nlp.cs.berkeley.edu