Statistical NLP Spring 2011
Lecture 28: Diachronics. Dan Klein, UC Berkeley.
Evolution: Main Phenomena
Mutations of sequences; time; speciation.
Tree of Languages Challenge: identify the phylogeny
Much work in biology, e.g. by Warnow, Felsenstein, Steel, and others; also in linguistics, e.g. Warnow et al., Gray and Atkinson.
Statistical Inference Tasks
Inputs: modern text (e.g. fuego, feu; les faits sont très clairs). Outputs: ancestral word forms (e.g. focus), cognate groups / translations (e.g. fuego/feu 'fire' vs. oeuf/huevo 'egg'), the phylogeny (FR, IT, PT, ES), and grammatical inference.
Outline: ancestral word forms (focus, fuego, feu); cognate groups / translations (fuego/feu 'fire' vs. oeuf/huevo 'egg'); grammatical inference (les faits sont très clairs).
Language Evolution: Sound Change
Latin camera /kamera/ became French chambre /ʃambʁ/ through a series of changes: deletion of /e/, the change /k/ .. /tʃ/ .. /ʃ/, and insertion of /b/. English camera was borrowed directly from Latin ("camera obscura"); English chamber was borrowed from Old French before the initial /t/ of the affricate /tʃ/ dropped.
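A change series like this can be viewed as ordered rewrite rules. Below is a minimal sketch; the regex rules and the intermediate form they produce are illustrative simplifications (real French also lost the final vowel and nasalized the /a/), not the lecture's model.

```python
import re

def apply_sound_changes(form, rules):
    """Apply an ordered list of (pattern, replacement) rewrite rules."""
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form

# Illustrative, simplified rules for Latin /kamera/ toward French:
rules = [
    (r"e(?=ra$)", ""),  # deletion of the unstressed medial /e/
    (r"^k", "ʃ"),       # /k/ .. /tʃ/ .. /ʃ/, collapsed into one step
    (r"mr", "mbr"),     # epenthetic /b/ inserted between /m/ and /r/
]

print(apply_sound_changes("kamera", rules))  # -> ʃambra
```

Because the rules are ordered, the /b/-insertion rule only fires after the /e/-deletion has created the /mr/ cluster it needs.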
Diachronic Evidence
Yahoo! Answers [2009]: "tonight not tonite". Appendix Probi [ca. 300]: "tonitru non tonotru".
Synchronic (Comparative) Evidence
Simple Model: Single Characters
A Model for Word Forms [BGK 07]
Latin focus /fokus/; Iberian /fogo/; Italian fuoco /fwɔko/; Spanish fuego /fweɣo/; Portuguese fogo /fogo/ (LA at the root, IB an intermediate ancestor of ES and PT).
Contextual Changes
Sound changes depend on context, e.g. the word-initial contexts /# f o/ in /fokus/ and /# f w ɔ/ in /fwɔko/.
Changes are Systematic
The same changes recur across cognate sets, e.g. /fokus/ → /fwɔko/, /fweɣo/, /fogo/ and /kentrum/ → /tʃɛntro/, /sentro/.
Experimental Setup
Small: Romance (FR, IT, PT, ES). French, Italian, Portuguese, Spanish; 2344 words; complete cognate sets; target: (Vulgar) Latin.
Large: Oceanic. 661 languages; 140K words; incomplete cognate sets; target: Proto-Oceanic [Blust, 1993].
Data: Romance
Learning: Objective
(Figure: tree with latent ancestor /fokus/ and observed modern forms /fwɔko/, /fweɣo/, /fogo/.)
Learning: EM
M-Step: find parameters which fit the (expected) sound-change counts. Easy: gradient ascent on θ.
E-Step: find (expected) change counts given the parameters. Hard: the latent variables are string-valued.
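The M/E alternation can be seen on a toy problem where both steps are closed-form. The sketch below runs EM for a mixture of two biased coins; this is only an illustration of the E-step/M-step loop, not the string-valued model in the lecture (where the E-step is the hard part).

```python
def em_two_coins(flip_sets, theta_a=0.6, theta_b=0.5, iters=20):
    """EM for a two-coin mixture: each set of flips (1 = heads) was
    produced by coin A or coin B, but we never observe which."""
    for _ in range(iters):
        # E-step: expected heads/tails counts for each coin, weighted
        # by the posterior over which coin produced each flip set.
        stats = {"a": [0.0, 0.0], "b": [0.0, 0.0]}
        for flips in flip_sets:
            h = sum(flips)
            t = len(flips) - h
            like_a = theta_a ** h * (1 - theta_a) ** t
            like_b = theta_b ** h * (1 - theta_b) ** t
            w_a = like_a / (like_a + like_b)
            stats["a"][0] += w_a * h
            stats["a"][1] += w_a * t
            stats["b"][0] += (1 - w_a) * h
            stats["b"][1] += (1 - w_a) * t
        # M-step: parameters that fit the expected counts.
        theta_a = stats["a"][0] / (stats["a"][0] + stats["a"][1])
        theta_b = stats["b"][0] / (stats["b"][0] + stats["b"][1])
    return theta_a, theta_b

data = [[1, 1, 1, 1, 0], [0, 0, 0, 1, 0], [1, 1, 1, 0, 1], [0, 1, 0, 0, 0]]
theta_a, theta_b = em_two_coins(data)  # converges near 0.8 and 0.2
```

In the lecture's model the expected counts are over sound changes on latent ancestral strings, so the E-step requires sampling rather than this closed-form posterior.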
Computing Expectations [Holmes 01, BGK 07]
Standard approach, e.g. [Holmes 2001]: Gibbs-sample each ancestral sequence in turn (example word: 'grass').
A Gibbs Sampler (example word: 'grass')
Getting Stuck
How can the sampler jump to a state where the liquids /r/ and /l/ have a common ancestor? It must wander through a valley of lower probability before it can reach the other peak.
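The trap can be reproduced on a tiny discrete example. The sketch below (my own illustration, not the lecture's model) runs single-site Gibbs sampling on a two-variable distribution with two strong modes; moving between the modes requires passing through a state of probability 0.001.

```python
import random

# Joint over (x1, x2): two strong modes at (0,0) and (1,1),
# with the "mixed" states in between nearly impossible.
P = {(0, 0): 0.499, (1, 1): 0.499, (0, 1): 0.001, (1, 0): 0.001}

def gibbs(sweeps, seed=0):
    rng = random.Random(seed)
    x = [0, 0]
    visits = {state: 0 for state in P}
    for _ in range(sweeps):
        for i in (0, 1):  # resample each variable given the other
            other = x[1 - i]
            on = (1, other) if i == 0 else (other, 1)
            off = (0, other) if i == 0 else (other, 0)
            p_on = P[on] / (P[on] + P[off])
            x[i] = 1 if rng.random() < p_on else 0
        visits[tuple(x)] += 1
    return visits

visits = gibbs(2000)
# Starting from (0,0), each single-site update crosses toward the
# other mode with probability only 0.001 / 0.5 = 0.002, so the
# sampler mixes between the two modes extremely slowly.
```

This is the discrete analogue of the /r/-/l/ problem above: single-sequence resampling must pass through low-probability intermediate states.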
Solution: Vertical Slices [BGK 08]
Replace single-sequence resampling with ancestry resampling: resample a thin vertical slice through all the sequences in the tree at once.
Details: Defining “Slices”
The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences. Correct construction of the section(G) is non-trivial, but it can be computed very efficiently.
Results: Alignment Efficiency
Question: having shown that ancestry resampling finds better alignments, is it also faster than basic Gibbs? Hypothesis: larger gains for deeper trees, because the random walk that Gibbs must make grows longer with depth. Setup: fixed wall-clock time budget; synthetic data generated from the same parameters over deeper and deeper phylogenetic trees; for each tree, compute the sum-of-pairs score.
Results: Romance
Learned Rules / Mutations
Comparison to Other Methods
Evaluation metric: mean Levenshtein edit distance from a reconstruction made by a linguist (lower is better). Comparison to the system from [Oakes, 2000], which uses exact inference and deterministic rules, on reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]. Our system outperforms this previous approach.
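Levenshtein edit distance, the metric used here, is a standard dynamic program; a minimal reference implementation (my own sketch, not the talk's evaluation code):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(levenshtein("fwɔko", "fweɣo"))  # -> 2 (two substitutions)
```

The evaluation averages this distance over all reconstructed words, comparing each proposed protoform to the linguist's.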
Data: Oceanic (target: Proto-Oceanic)
Result: More Languages Help
(Plot: mean edit distance from the [Blust, 1993] reconstructions, as a function of the number of modern languages used.)
Results: Large Scale
The tree used here is based on the full Austronesian language dataset.
Visualization: Learned universals
All of these links correspond to plausible sound changes, even though the model did not have features encoding natural classes.
Regularity and Functional Load
In a language, some pairs of sounds are more contrastive than others (higher functional load). Example: English "p"/"d" versus "t"/"th". The pair "p"/"d" distinguishes many minimal pairs (pot/dot, pin/din, press/dress, pew/dew, ...), while "t"/"th" distinguishes few (thin/tin). An open question in linguistics is whether less contrastive sounds are more vulnerable to sound change. Functional load quantifies contrastiveness: it is low for the t/th pair and high for the p/d pair.
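One simple way to operationalize functional load is to count the minimal pairs a contrast distinguishes. The sketch below does exactly that over a toy lexicon; the word list is a made-up illustration, and real functional-load measures (e.g. King's entropy-based one) are more involved.

```python
def minimal_pairs(lexicon, s1, s2):
    """Count minimal pairs distinguished only by the contrast s1/s2.
    Words are tuples of segments; a pair counts when replacing s1
    with s2 at a single position maps one word onto another."""
    words = set(lexicon)
    pairs = set()
    for w in words:
        for i, seg in enumerate(w):
            if seg == s1:
                other = w[:i] + (s2,) + w[i + 1:]
                if other in words:
                    pairs.add((w, other))
    return len(pairs)

# Toy lexicon (illustrative only):
lex = [("p", "i", "n"), ("d", "i", "n"), ("t", "i", "n"),
       ("th", "i", "n"), ("p", "o", "t"), ("d", "o", "t")]

print(minimal_pairs(lex, "p", "d"))   # -> 2 (pin/din, pot/dot)
print(minimal_pairs(lex, "t", "th"))  # -> 1 (tin/thin)
```

Under this count, merging "t" and "th" would collapse only one pair of words, while merging "p" and "d" would collapse two.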
Functional Load: Timeline
Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 1955]. Previous research within linguistics: "FLH does not seem to be supported by the data" [King, 1967]. Caveat: only four languages were used in King's study [Hockett, 1967; Surendran et al., 2006]. Our work: we reexamine the question with two orders of magnitude more data [BGK, under review].
Regularity and Functional Load
Data: only the 4 languages from the Austronesian data used by King. Each dot is a sound change identified by the system. x-axis: functional load as computed by [King, 1967]; y-axis: merger posterior probability (values near 1 indicate a systematic sound change; lower values indicate context-dependent changes).
Regularity and Functional Load
Data: all 637 languages from the Austronesian data. x-axis: functional load as computed by [King, 1967]; y-axis: merger posterior probability. Computing functional load over this tree gives a much clearer view of the relationship between functional load and sound change; the data appear to support the functional load hypothesis, i.e. sound changes are more frequent when they merge phonemes with low functional load.
Outline: ancestral word forms (focus, fuego, feu); cognate groups / translations (fuego/feu 'fire' vs. oeuf/huevo 'egg'); grammatical inference (les faits sont très clairs).
Cognate Groups
Task: group the modern forms of 'fire' (/fwɔko/, /fweɣo/, /fogo/) while separating non-cognates such as /vɛrbo/, /berβo/, /tʃɛntro/, /sentro/, /sɛntro/.
Model: Cognate Survival
Each word either survives (+) or dies (-) along every branch of the phylogeny (LA, IB, IT, ES, PT); e.g. Latin omnis survives as Italian ogni.
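Under a branchwise survival model like this, the probability that a word is attested at a leaf is the product of per-branch survival probabilities along the path from the root. A minimal sketch, with made-up branch probabilities (the actual model's parameters are learned, not these numbers):

```python
def survival_probability(path, survive):
    """P(word survives from root to leaf) along a path of branches."""
    p = 1.0
    for branch in path:
        p *= survive[branch]
    return p

# Hypothetical per-branch survival probabilities (illustrative numbers):
survive = {("LA", "IB"): 0.9, ("IB", "ES"): 0.8,
           ("IB", "PT"): 0.85, ("LA", "IT"): 0.7}

p_es = survival_probability([("LA", "IB"), ("IB", "ES")], survive)  # 0.9 * 0.8
```

A word can thus be missing from Spanish either because it died on the LA-IB branch (also killing it in Portuguese) or independently on the IB-ES branch, which is what ties cognate survival to the tree structure.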
Results: Grouping Accuracy
Metric: fraction of words correctly grouped. Method: [Hall and Klein, in submission].
Semantics: Matching Meanings
English day occurs with "night", "sun", "week"; English tag occurs with "name", "label", "along"; German Tag occurs with "Nacht", "Sonne", "Woche" ("night", "sun", "week"). The matching context distributions indicate that German Tag corresponds in meaning to English day, not English tag.
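The matching itself can be done by comparing context-count vectors, e.g. with cosine similarity. A minimal sketch with made-up counts (real systems use large corpora and stronger association measures):

```python
import math
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two context-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy context counts (illustrative, not real corpus statistics);
# German contexts are glossed into English for comparability.
en_day = Counter({"night": 5, "sun": 3, "week": 4})
en_tag = Counter({"name": 4, "label": 3, "along": 2})
de_Tag = Counter({"night": 4, "sun": 2, "week": 5})

# German Tag looks like English day, not English tag:
assert cosine(en_day, de_Tag) > cosine(en_tag, de_Tag)
```

Even though tag and Tag are identical in form, the context vectors pull the match toward day, which is the point of combining form and context signals.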
Outline: ancestral word forms (focus, fuego, feu); cognate groups / translations (fuego/feu 'fire' vs. oeuf/huevo 'egg'); grammatical inference (les faits sont très clairs).
Grammar Induction
Task: given sentences, infer a grammar (and parse tree structures). Examples: "congress narrowly passed the amended bill" (English); "les faits sont très clairs" (French: "the facts are very clear").
Shared Prior
Grammar parameters for related languages (e.g. English "congress narrowly passed the amended bill", French "les faits sont très clairs") are coupled through a shared prior.
Results: Phylogenetic Prior
Languages: Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English, connected through the tree nodes WG, NG, RM, G, IE, GL. Average relative gain: 29%.
Conclusion
Phylogeny-structured models can: accurately reconstruct ancestral words; give evidence in open linguistic debates; detect translations from form and context; improve language-learning algorithms. Lots of questions are still open: can we get better phylogenies using these high-res models? What do these models have to say about the very earliest languages? Proto-World?
Thank you! nlp.cs.berkeley.edu