Statistical NLP Spring 2011 Lecture 28: Diachronics Dan Klein – UC Berkeley
Evolution: Main Phenomena
- Mutations of sequences over time
- Speciation
Tree of Languages
Challenge: identify the phylogeny. Much work in biology, e.g. by Warnow, Felsenstein, Steel… Also in linguistics, e.g. Warnow et al., Gray and Atkinson… http://andromeda.rutgers.edu/~jlynch/language.html
Statistical Inference Tasks
Inputs: modern text (e.g. focus, fuego, feu, oeuf, huevo; les faits sont très clairs) and a phylogeny (FR, IT, PT, ES).
Outputs: ancestral word forms, cognate groups / translations, grammatical inference.
[Outline slide: Ancestral Word Forms (next), Cognate Groups / Translations, Grammatical Inference]
Language Evolution: Sound Change
Latin camera /kamera/ → French chambre /ʃambʁ/
- Deletion: /e/
- Change: /k/ → /tʃ/ → /ʃ/
- Insertion: /b/
Eng. camera comes from Latin (“camera obscura”); Eng. chamber comes from Old French, borrowed before the initial /t/ of /tʃ/ dropped.
Diachronic Evidence
Appendix Probi [ca. 300]: “tonitru non tonotru”
Yahoo! Answers [2009]: “tonight not tonite”
Synchronic (Comparative) Evidence
Simple Model: Single Characters
A Model for Word Forms [BGK 07]
Words evolve down a phylogeny: Latin (LA) focus /fokus/ at the root; Italian (IT) fuoco /fwɔko/ on one branch; a latent Ibero-Romance ancestor (IB) /fogo/ yielding Spanish (ES) fuego /fweɣo/ and Portuguese (PT) fogo /fogo/.
Contextual Changes
Sound changes depend on their context: e.g. aligning /fokus/ with /fwɔko/, the word-initial “# f o” corresponds to “# f w ɔ”.
Changes are Systematic
The same change recurs across the lexicon: /fokus/ → /fwɔko/, /fweɣo/, /fogo/; similarly /kentrum/ → /tʃɛntro/, /sentro/.
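To make the picture concrete, here is a hedged sketch of the generative story these slides describe (notation mine, not taken from the slides): the root word is drawn from a language model, and each branch applies a stochastic edit model whose substitution, insertion, and deletion probabilities are conditioned on neighboring sounds.

    \[
    P(\{w_\ell\}_{\ell \in \mathrm{leaves}})
      \;=\; \sum_{\{w_v\} :\, v\ \mathrm{internal}} P(w_{\mathrm{root}})
        \prod_{(u \to v) \in \mathrm{tree}} P_{\theta}(w_v \mid w_u)
    \]

The sum over latent ancestral strings is what makes inference in this model hard.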
Experimental Setup
Data sets:
- Small: Romance (FR, IT, PT, ES: French, Italian, Portuguese, Spanish); 2344 words; complete cognate sets; target: (Vulgar) Latin
- Large: Oceanic; 661 languages; 140K words; incomplete cognate sets; target: Proto-Oceanic [Blust, 1993]
Data: Romance
Learning: Objective
Fit the parameters to maximize the likelihood of the observed modern forms (/fwɔko/, /fweɣo/, /fogo/), marginalizing over the unobserved ancestral forms (/fokus/).
Learning: EM
- E-Step: find the (expected) sound-change counts given the current parameters. Hard: the latent variables are string-valued ancestral forms (e.g. /fokus/ above /fwɔko/, /fweɣo/, /fogo/).
- M-Step: find parameters which fit the (expected) sound-change counts. Easy: gradient ascent on θ.
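In symbols, a hedged sketch of the two updates (my notation):

    \[
    \textrm{E-step:}\quad \hat{c}(e) \;=\;
      \mathbb{E}_{P(\mathrm{ancestral} \,\mid\, \mathrm{modern};\, \theta^{(t)})}
      \!\left[\#\{\text{occurrences of edit } e\}\right]
    \]
    \[
    \textrm{M-step:}\quad \theta^{(t+1)} \;=\;
      \arg\max_{\theta} \sum_{e} \hat{c}(e)\, \log P_{\theta}(e)
    \]

The M-step is a fit to fixed expected counts, so gradient ascent suffices; the E-step expectation ranges over exponentially many string assignments, which is why it is approximated by sampling in the next slides.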
Computing Expectations [Holmes 01, BGK 07]
Standard approach, e.g. [Holmes 2001]: Gibbs sampling, resampling one ancestral sequence at a time given the rest of the tree (illustrated over several slides with the example word ‘grass’; a minimal single-character sketch follows).
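As a minimal, hedged sketch of the idea, here is Gibbs resampling for the earlier single-character model: one character per tree node, and a toy substitution model on each branch. Everything here (alphabet, probabilities, function names) is illustrative, not the paper's actual model.

    import random

    ALPHABET = "aeiou"

    def sub_prob(parent, child, stay=0.9):
        # Toy branch model: keep the sound with probability `stay`,
        # otherwise substitute uniformly among the alternatives.
        return stay if parent == child else (1 - stay) / (len(ALPHABET) - 1)

    def resample_node(parent, children):
        # One Gibbs step: P(x | Markov blanket) is proportional to
        # P(x | parent) * product over children c of P(c | x).
        weights = []
        for x in ALPHABET:
            w = sub_prob(parent, x)
            for c in children:
                w *= sub_prob(x, c)
            weights.append(w)
        return random.choices(ALPHABET, weights=weights)[0]

    # Resample a latent ancestor whose parent is 'a' and whose observed
    # children are 'a' and 'e'; sweeping such steps over all internal
    # nodes yields samples of the ancestral assignments.
    print(resample_node("a", ["a", "e"]))

With string-valued nodes the same blanket structure holds, but each local step is itself a dynamic program over alignments, and single-sequence moves can mix slowly, as the next slides show.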
Getting Stuck
How can the sampler jump to a state where the liquids /r/ and /l/ have a common ancestor? Resampling one sequence at a time, it must wander through a valley of lower probability before it can reach the other peak.
Solution: Vertical Slices [BGK 08]
Replace single-sequence resampling with ancestry resampling.
Details: Defining “Slices”
The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences. The correct construction of the slice section(G) is non-trivial but very efficient.
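A hedged formalization of why this is a valid sampler (notation mine): for each anchor a, the kernel resamples the slice of every ancestral sequence aligned to a, holding everything else fixed,

    \[
    K_a(s \to s') \;=\; P\!\left(s'_{\mathrm{section}(a)} \,\middle|\, s_{\setminus \mathrm{section}(a)}\right),
    \qquad
    s'_{\setminus \mathrm{section}(a)} \;=\; s_{\setminus \mathrm{section}(a)} .
    \]

Each K_a leaves the posterior invariant, so cycling (or mixing) over anchors is a correct MCMC scheme, and each move updates all the sequences jointly rather than one at a time.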
Results: Alignment Efficiency
Question: having shown that ancestry resampling gives better alignments, is it also faster than basic Gibbs?
Hypothesis: larger gains for deeper trees, because the random walk that Gibbs must take gets longer and longer as the tree deepens.
Setup: fixed wall-time budget; synthetic data generated with the same parameters along deeper and deeper phylogenetic trees; for each, compute the sum-of-pairs alignment score.
[Plot: sum of pairs vs. depth of the phylogenetic tree.]
Results: Romance
Learned Rules / Mutations
Comparison to Other Methods
Evaluation metric: average Levenshtein edit distance between the proposed reconstructions and those made by a linguist (lower is better).
Baseline: the system (“Prague”) of [Oakes, 2000], which uses exact inference and deterministic rules; task: reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975].
Our system outperforms this previous approach.
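For concreteness, the metric is ordinary Levenshtein distance; a minimal implementation for illustration (the example strings below are made up):

    def edit_distance(a, b):
        # Classic dynamic program: the distance between a[:i] and b[:j]
        # under unit-cost insertion, deletion, and substitution.
        d = list(range(len(b) + 1))
        for i in range(1, len(a) + 1):
            prev, d[0] = d[0], i
            for j in range(1, len(b) + 1):
                cur = d[j]
                d[j] = min(d[j] + 1,                          # delete a[i-1]
                           d[j - 1] + 1,                      # insert b[j-1]
                           prev + (a[i - 1] != b[j - 1]))     # substitute
                prev = cur
        return d[len(b)]

    # e.g. comparing a proposed reconstruction to a linguist's:
    print(edit_distance("fokus", "fokos"))  # -> 1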
Data: Oceanic
Target: Proto-Oceanic. Source: http://language.psy.auckland.ac.nz/austronesian/research.php
Result: More Languages Help
[Plot: mean edit distance from the [Blust, 1993] reconstructions (y-axis) vs. number of modern languages used (x-axis); the distance decreases as more languages are added.]
Results: Large Scale
The tree used here is based on the full Austronesian language dataset.
Visualization: Learned Universals
The links recovered are all very reasonable sound changes, even though the model did not have features encoding natural classes.
Regularity and Functional Load
In a language, some pairs of sounds are more contrastive than others (higher functional load). An open question in linguistics is whether sounds that are less contrastive in a language are more vulnerable to sound change.
Example (English): “p”/“d” distinguishes many word pairs: pot/dot, pin/din, dress/press, pew/dew, …; “t”/“th” distinguishes few: thin/tin.
Functional load quantifies this contrast: high for the p/d pair, low for t/th.
Functional Load: Timeline
- Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55].
- Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67]. Caveat: only four languages were used in King's study [Hockett 67; Surendran et al., 06].
- Our work: we reexamine the question with two orders of magnitude more data [BGK, under review].
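As an illustration, here is one simplified way to compute a functional load, in the spirit of (but not identical to) the entropy-based definitions discussed by [Hockett 67; King, 67]: the drop in entropy of the word distribution when two phonemes are merged. The corpus and counts below are invented.

    import math
    from collections import Counter

    def entropy(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def functional_load(word_counts, x, y):
        # Merge phoneme y into x everywhere and measure how much entropy
        # the word distribution loses (how many contrasts collapse).
        merged = Counter()
        for word, count in word_counts.items():
            merged[word.replace(y, x)] += count
        return entropy(word_counts) - entropy(merged)

    # Toy corpus: merging "p"/"d" collapses several minimal pairs,
    # so that pair carries more functional load than "t"/"th".
    corpus = Counter({"pot": 5, "dot": 5, "pin": 3, "din": 3, "thin": 2, "tin": 2})
    print(functional_load(corpus, "p", "d"))   # larger
    print(functional_load(corpus, "t", "th"))  # smaller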
Regularity and Functional Load
Data: only 4 languages from the Austronesian data.
[Plot: merger posterior probability (y-axis) vs. functional load as computed by [King, 67] (x-axis). Each dot is a sound change identified by the system; a posterior near 1 indicates a fully systematic change, lower values context-dependent changes.]
Regularity and Functional Load
Data: all 637 languages from the Austronesian data. Computing functional load at this scale gives a much clearer view of its relationship to sound change; in particular, the data now seem to support the functional load hypothesis: sound changes are more frequent when they merge phonemes with low functional load.
[Plot: merger posterior probability vs. functional load as computed by [King, 67].]
[Outline slide: Ancestral Word Forms, Cognate Groups / Translations (next), Grammatical Inference]
Cognate Groups
Group modern forms by common ancestry, e.g. ‘fire’: /fwɔko/, /fweɣo/, /fogo/; another group: /vɛrbo/, /vɛrbo/, /berβo/; another: /tʃɛntro/, /sentro/, /sɛntro/.
Model: Cognate Survival
On each branch of the phylogeny (LA → IT; LA → IB → ES, PT), a cognate either survives (+) or dies out (−); e.g. Latin omnis survives as Italian ogni but has died out on the other branches.
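A hedged sketch of the kind of likelihood such a survival model defines (tree encoding, probabilities, and function names are mine, for illustration): a word survives each branch independently with some probability, and once dead it stays dead below that point.

    def survival_likelihood(node, alive, observed, p=0.8):
        # node is (name, children); observed maps leaf name -> True/False.
        name, children = node
        if not children:                   # leaf: must match the data
            return 1.0 if alive == observed[name] else 0.0
        if not alive:                      # once dead, dead in all descendants
            result = 1.0
            for child in children:
                result *= survival_likelihood(child, False, observed, p)
            return result
        result = 1.0
        for child in children:             # survive each branch with prob p
            result *= (p * survival_likelihood(child, True, observed, p)
                       + (1 - p) * survival_likelihood(child, False, observed, p))
        return result

    # Tree: Latin -> Italian, and Latin -> Ibero-Romance -> Spanish, Portuguese.
    tree = ("LA", [("IT", []), ("IB", [("ES", []), ("PT", [])])])
    obs = {"IT": True, "ES": False, "PT": False}  # omnis: ogni in IT, lost in ES/PT
    print(survival_likelihood(tree, True, obs))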
Results: Grouping Accuracy [Hall and Klein, in submission]
[Plot: fraction of words correctly grouped, by method.]
Semantics: Matching Meanings
EN tag occurs with: “name”, “label”, “along”. EN day occurs with: “night”, “sun”, “week”. DE Tag occurs with: “nacht”, “sonne”, “woche”.
Context, not just form, indicates that DE Tag matches EN day rather than EN tag.
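A minimal sketch of that comparison using cosine similarity of co-occurrence vectors (counts invented; context words assumed already mapped into a shared, translated space):

    import math

    def cosine(u, v):
        # Cosine similarity between two sparse count vectors.
        dot = sum(u[k] * v.get(k, 0) for k in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    en_day = {"night": 8, "sun": 5, "week": 6}
    en_tag = {"name": 7, "label": 6, "along": 3}
    de_tag = {"night": 7, "sun": 4, "week": 5}  # nacht/sonne/woche, translated

    print(cosine(de_tag, en_day))  # high: "Tag" matches "day"
    print(cosine(de_tag, en_tag))  # low, despite the similar form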
[Outline slide: Ancestral Word Forms, Cognate Groups / Translations, Grammatical Inference (next)]
Grammar Induction
Task: given sentences, infer a grammar (and parse-tree structures). Examples: “congress narrowly passed the amended bill”; “les faits sont très clairs” (French: “the facts are very clear”).
Shared Prior
Tie the grammars of related languages (e.g. English and French) together with a phylogeny-structured prior, so that sentences in one language provide evidence about the grammars of the others.
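A hedged sketch of what such a phylogeny-structured prior can look like (my notation; one natural choice, not necessarily the exact model used here): grammar parameters drift along the tree with Gaussian steps,

    \[
    \theta_{\mathrm{root}} \sim \mathcal{N}(0, \sigma^2 I),
    \qquad
    \theta_{v} \sim \mathcal{N}\!\left(\theta_{\mathrm{parent}(v)}, \tau^2 I\right)
    \;\; \text{for each tree node } v,
    \]
    \[
    \mathrm{sentences}_{\ell} \sim \mathrm{Grammar}(\theta_{\ell})
    \;\; \text{for each modern language } \ell ,
    \]

so related languages are encouraged to have similar grammars, with the strength of agreement set by the phylogeny.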
Results: Phylogenetic Prior
[Chart: gains for Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, and English, organized under family nodes WG, NG, RM, G, IE, GL.]
Average relative gain: 29%.
Conclusion
Phylogeny-structured models can:
- Accurately reconstruct ancestral words
- Give evidence in open linguistic debates
- Detect translations from form and context
- Improve language learning algorithms
Lots of questions still open:
- Can we get better phylogenies using these high-res models?
- What do these models have to say about the very earliest languages? Proto-World?
Thank you! nlp.cs.berkeley.edu