
1 Statistical NLP Spring 2011
Lecture 28: Diachronics. Dan Klein – UC Berkeley

2 Evolution: Main Phenomena
Mutations of sequences over time; speciation (illustrated as a tree diagram)

3 Tree of Languages
Challenge: identify the phylogeny.
Much work in biology, e.g. Warnow, Felsenstein, Steel… Also in linguistics, e.g. Warnow et al., Gray and Atkinson…

4 Statistical Inference Tasks
Inputs: modern text (e.g., fuego, feu, huevo, oeuf, in FR/IT/PT/ES)
Outputs: ancestral word forms (e.g., focus), cognate groups / translations, phylogeny, grammatical inference (example sentence: les faits sont très clairs)

5 Outline
Ancestral word forms; cognate groups / translations; grammatical inference

6 Language Evolution: Sound Change
Latin camera /kamera/ → French chambre /ʃambʁ/
Deletion: /e/. Change: /k/ → /tʃ/ → /ʃ/. Insertion: /b/.
Eng. camera is from Latin ("camera obscura"); Eng. chamber is from Old French, borrowed before the initial /t/ of /tʃ/ dropped.
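As an illustration of how ordered changes compose, a toy derivation (my own sketch, not from the lecture; the rules are drastically simplified, and real sound change is gradual and context-sensitive):

```python
import re

# Ordered rewrite rules deriving French /ʃambʁ/ from Latin /kamera/.
# (pattern, replacement, gloss) -- illustrative only.
RULES = [
    (r"^k",  "tʃ",  "palatalization of initial /k/"),
    (r"^tʃ", "ʃ",   "loss of initial /t/ (Eng. 'chamber' was borrowed before this)"),
    (r"e",   "",    "deletion of the medial vowel"),
    (r"mr",  "mbr", "epenthetic /b/ breaks up the /mr/ cluster"),
    (r"a$",  "",    "loss of the final vowel"),
    (r"r",   "ʁ",   "shift to the uvular rhotic"),
]

def derive(form, rules):
    """Apply each rewrite rule in order, printing each derivational stage."""
    for pat, rep, gloss in rules:
        new = re.sub(pat, rep, form)
        if new != form:
            print(f"/{form}/ -> /{new}/  ({gloss})")
            form = new
    return form

print(derive("kamera", RULES))  # ʃambʁ, matching French chambre
```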

7 Diachronic Evidence
Appendix Probi [ca. 300]: "tonitru non tonotru"
Yahoo! Answers [2009]: "tonight not tonite"

8 Synchronic (Comparative) Evidence

9 Simple Model: Single Characters

10 A Model for Word Forms [BGK 07]
Latin (LA) focus /fokus/ → Italian (IT) fuoco /fwɔko/ and Ibero-Romance (IB) /fogo/; Ibero-Romance → Spanish (ES) fuego /fweɣo/ and Portuguese (PT) fogo /fogo/

11 Contextual Changes
Aligned prefixes: # f o … (/fokus/) vs. # f w ɔ … (/fwɔko/); each change is conditioned on its context.
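To make "contextual" concrete, here is a minimal sketch of how edit probabilities conditioned on the preceding phone could be stored and queried; the parameter values and the back-off rule are invented for illustration:

```python
# Contextual edit parameters: P(rewrite phone x as string y | preceding
# phone). '#' marks the word boundary. All values are invented; the
# real model learns these with EM.
theta = {
    ("#", "f"): {"f": 0.95, "": 0.02, "h": 0.03},     # /f/ word-initially
    ("f", "o"): {"o": 0.60, "wɔ": 0.25, "we": 0.15},  # /o/ after /f/
}

def edit_prob(left, x, y):
    """P(x -> y | left context), with a crude back-off for unseen pairs."""
    dist = theta.get((left, x))
    if dist is None:
        return 0.9 if x == y else 0.01  # unseen context: mostly copy
    return dist.get(y, 1e-4)

# The /fokus/ -> /fwɔko/ correspondence starts with o -> wɔ after f:
print(edit_prob("f", "o", "wɔ"))  # 0.25
```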

12 Changes are Systematic
The same changes recur across the lexicon: /fokus/ → /fwɔko/, /fweɣo/, /fogo/ and /kentrum/ → /tʃɛntro/, /sentro/ show parallel developments in each daughter language.

13 Experimental Setup
Small: Romance (French, Italian, Portuguese, Spanish). 2344 words, complete cognate sets. Target: (Vulgar) Latin.
Large: Oceanic. 661 languages, 140K words, incomplete cognate sets. Target: Proto-Oceanic [Blust, 1993].

14 Data: Romance

15 Learning: Objective
Maximize the likelihood of the observed modern forms (e.g., /fweɣo/, /fogo/, /fwɔko/), marginalizing over the unobserved ancestral forms (e.g., /fokus/).
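In symbols (my notation, not from the slides): with phylogeny T, cognate sets 𝒞, observed modern forms w_c, and latent ancestral forms a_c, learning maximizes the marginal likelihood:

```latex
\theta^{\star} \;=\; \arg\max_{\theta} \sum_{c \in \mathcal{C}}
  \log \sum_{a_c} P_{\theta}\bigl(a_c,\, w_c \mid T\bigr)
```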

16 Learning: EM
M-Step: find parameters θ that fit the (expected) sound-change counts. Easy: gradient ascent on θ.
E-Step: find the (expected) change counts given the parameters. Hard: the latent variables are string-valued (ancestral forms like /fokus/ for observed /fweɣo/, /fogo/, /fwɔko/).
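A runnable toy version of the loop (my construction, not the paper's): to keep the string-valued E-step tractable, candidate ancestors are restricted here to the observed daughter forms; the real model has no such restriction, which is exactly why its E-step is hard:

```python
from collections import defaultdict

# Toy EM with string-valued latent ancestors, restricted (for
# tractability only) to same-length candidates drawn from the data.
DATA = [("grass", "glass"), ("rice", "lice"), ("road", "load")]

def p_daughter(anc, dau, sub):
    """P(daughter | ancestor): per-position substitution, same length."""
    p = 1.0
    for a, d in zip(anc, dau):
        p *= sub[a].get(d, 1e-6)
    return p

# Initialize: characters mostly copy themselves.
alphabet = sorted(set("".join(w for pair in DATA for w in pair)))
sub = {a: {d: (0.9 if a == d else 0.1 / (len(alphabet) - 1)) for d in alphabet}
       for a in alphabet}

for _ in range(20):
    counts = defaultdict(lambda: defaultdict(float))
    # E-step: posterior over candidate ancestors for each cognate pair.
    for x, y in DATA:
        post = {a: p_daughter(a, x, sub) * p_daughter(a, y, sub) for a in {x, y}}
        z = sum(post.values())
        for a, w in post.items():
            w /= z
            for dau in (x, y):
                for ac, dc in zip(a, dau):
                    counts[ac][dc] += w
    # M-step: renormalize expected substitution counts into probabilities.
    for a, row in counts.items():
        total = sum(row.values())
        sub[a] = {d: c / total for d, c in row.items()}

print({d: round(p, 2) for d, p in sub["r"].items()})  # learned row for /r/
```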

17 Computing Expectations [Holmes 01, BGK 07]
Standard approach, e.g. [Holmes 2001]: Gibbs sampling, resampling one sequence at a time (running example: 'grass').

18–20 A Gibbs Sampler
Animation over three slides: repeatedly resample each ancestral sequence for 'grass', one node at a time, conditioned on the rest of the tree.
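A toy sketch of a single resampling step, under simplifying assumptions of mine: the resampled node is the root (so there is no parent term), branch likelihoods are per-position copy-or-substitute, and candidates are limited to single-substitution neighbors of the current state (the real sampler handles insertions and deletions too):

```python
import math
import random

def branch_lik(parent, child, p_copy=0.9):
    """Per-position copy-or-substitute likelihood (same-length toy)."""
    p = 1.0
    for a, b in zip(parent, child):
        p *= p_copy if a == b else (1 - p_copy) / 25
    return p

def resample(current, children, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Resample the ancestral word proportional to its child likelihoods."""
    cands = {current}
    for i in range(len(current)):
        for c in alphabet:
            cands.add(current[:i] + c + current[i + 1:])
    cands = sorted(cands)
    weights = [math.prod(branch_lik(a, ch) for ch in children) for a in cands]
    return random.choices(cands, weights=weights, k=1)[0]

random.seed(0)
state = "blass"
for _ in range(20):
    state = resample(state, ["grass", "glass"])
print(state)  # the chain mixes between 'grass'-like and 'glass'-like states
```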

21 Getting Stuck
How can the sampler jump to a state where the liquids /r/ and /l/ have a common ancestor?

22 Getting Stuck
The sampler must wander through a valley of lower probability before it can reach the other peak.

23 Solution: Vertical Slices [BGK 08]
Single-sequence resampling vs. ancestry resampling (resample a vertical slice across all ancestral sequences at once).

24 Details: Defining "Slices"
The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences. Correctly constructing the corresponding section(G) is non-trivial, but it is very efficient.

25 Results: Alignment Efficiency
Question: is ancestry resampling faster than basic Gibbs, not just more accurate?
Hypothesis: larger gains for deeper trees, because the random walk Gibbs must perform grows longer with tree depth.
Setup: fixed wall-time budget; synthetic data generated with the same parameters over deeper and deeper phylogenetic trees; for each tree, compute the sum-of-pairs alignment score.

26 Results: Romance

27 Learned Rules / Mutations

28 Learned Rules / Mutations

29 Comparison to Other Methods
Baseline: the system from [Oakes, 2000], which uses exact inference and deterministic rules. Task: reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975].
Evaluation metric: average Levenshtein (edit) distance from a reconstruction made by a linguist (lower is better). Chart ("Us" vs. "Oakes"): our system outperforms this previous approach.
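The metric itself is the standard dynamic-programming edit distance, averaged over words; a minimal sketch (function and variable names are mine):

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def avg_distance(predicted, gold):
    """Mean edit distance between proposed and linguist reconstructions."""
    return sum(levenshtein(p, g) for p, g in zip(predicted, gold)) / len(gold)

print(avg_distance(["fokus", "kentru"], ["fokus", "kentrum"]))  # 0.5
```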

30 Data: Oceanic (target: Proto-Oceanic)

31 Data: Oceanic

32 Result: More Languages Help
Plot: mean edit distance from the [Blust, 1993] reconstructions (y-axis) vs. number of modern languages used (x-axis).

33 Results: Large Scale
The tree used here is based on the full Austronesian language dataset.

34 Visualization: Learned Universals
All of the recovered links correspond to very reasonable sound changes, even though the model had no features encoding natural classes.

35 Regularity and Functional Load
In a language, some pairs of sounds are more contrastive than others (they carry a higher functional load).
Example: English "p"/"d" distinguishes many minimal pairs (pot/dot, pin/din, press/dress, pew/dew, ...), while "t"/"th" distinguishes few (tin/thin).
Open question in linguistics: are sounds that are less contrastive in a language more vulnerable to sound change? Functional load quantifies contrastiveness: low for the t/th pair, high for the p/d pair.
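King's measure is information-theoretic (roughly, the entropy lost when the two phonemes merge); as a cruder illustrative proxy, one can count the minimal pairs a contrast distinguishes. A sketch over toy phoneme-tuple data (all data invented):

```python
from itertools import combinations

def minimal_pairs(lexicon, x, y):
    """Count word pairs distinguished solely by the phonemes x and y."""
    n = 0
    for w1, w2 in combinations(lexicon, 2):
        if len(w1) != len(w2):
            continue
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) == 1 and set(diffs[0]) == {x, y}:
            n += 1
    return n

# Words as phoneme tuples, so digraphs like 'th' are single symbols.
LEX = [("p", "ɒ", "t"), ("d", "ɒ", "t"), ("p", "ɪ", "n"), ("d", "ɪ", "n"),
       ("p", "j", "u"), ("d", "j", "u"), ("θ", "ɪ", "n"), ("t", "ɪ", "n")]
print(minimal_pairs(LEX, "p", "d"))  # 3: high functional load
print(minimal_pairs(LEX, "θ", "t"))  # 1: low functional load
```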

36 Functional Load: Timeline
Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55].
Previous research within linguistics: "FLH does not seem to be supported by the data" [King, 67]. Caveat: only four languages were used in King's study [Hockett 67; Surendran et al., 06].
Our work: we reexamined the question with two orders of magnitude more data [BGK, under review].

37 Regularity and Functional Load
Data: only 4 languages from the Austronesian data.
Plot: x-axis is functional load as computed by [King, 67]; y-axis is the merger posterior probability. Each dot is a sound change identified by the system; values near 1 indicate a systematic merger, lower values indicate context-dependent changes.

38 Regularity and Functional Load
Data: all 637 languages from the Austronesian data.
Computing functional load from the full tree gives a much clearer view of the relationship between functional load and sound change: the data now appear to support the functional load hypothesis; sound changes are more frequent when they merge phonemes with low functional load.
Plot: merger posterior probability vs. functional load as computed by [King, 67].

39 Outline
Ancestral word forms; cognate groups / translations; grammatical inference

40 Cognate Groups
Group modern forms by common ancestor, e.g. 'fire': /fwɔko/, /fweɣo/, /fogo/; 'verb': /vɛrbo/, /berβo/; 'center': /tʃɛntro/, /sentro/, /sɛntro/.

41 Model: Cognate Survival
Along each branch of the tree (LA → IB, IT; IB → ES, PT) a word either survives (+) or dies (−): e.g., Latin omnis survives as Italian ogni but has died out in Spanish and Portuguese.
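A sketch of the likelihood such a model assigns to a presence/absence pattern at the leaves, assuming a single per-branch survival probability s (a simplification of mine; once a word dies on a branch, it is absent from the whole subtree below):

```python
# Romance tree from the slides: LA -> {IB, IT}, IB -> {ES, PT}.
TREE = {"LA": ["IB", "IT"], "IB": ["ES", "PT"], "IT": [], "ES": [], "PT": []}

def p_pattern(node, alive, observed, s=0.8):
    """P(observed +/- pattern at the leaves | word alive/dead at node)."""
    kids = TREE[node]
    if not kids:  # leaf: the pattern must match the word's state
        return 1.0 if alive == observed[node] else 0.0
    p = 1.0
    for k in kids:
        if alive:  # the word may survive or die along the branch to k
            p *= (s * p_pattern(k, True, observed, s)
                  + (1 - s) * p_pattern(k, False, observed, s))
        else:      # a dead word stays dead in the whole subtree
            p *= p_pattern(k, False, observed, s)
    return p

# Latin 'omnis' survives only in Italian (as 'ogni'):
print(round(p_pattern("LA", True, {"IT": True, "ES": False, "PT": False}), 3))
```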

42 Results: Grouping Accuracy [Hall and Klein, in submission]
Chart: fraction of words correctly grouped, by method.

43 Semantics: Matching Meanings
EN tag occurs with: "name", "label", "along". EN day occurs with: "night", "sun", "week". DE Tag occurs with: "nacht", "sonne", "woche"; by context, German Tag matches English day, despite its surface match with English tag.
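A minimal sketch of the matching computation: compare co-occurrence vectors by cosine similarity, bridging the vocabularies with a small seed dictionary (all counts and the seed dictionary are invented toy data):

```python
import math

# Toy co-occurrence profiles; German context words are mapped into
# English through a small seed dictionary so the vectors are comparable.
SEED = {"nacht": "night", "sonne": "sun", "woche": "week"}
EN = {"tag": {"name": 5, "label": 4, "along": 2},
      "day": {"night": 7, "sun": 5, "week": 6}}
DE_TAG = {"nacht": 6, "sonne": 4, "woche": 5}

def cosine(u, v):
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

de_tag_en = {SEED[k]: v for k, v in DE_TAG.items()}  # translate contexts
for en_word, ctx in EN.items():
    print(en_word, round(cosine(de_tag_en, ctx), 2))
# 'day' scores near 1, 'tag' scores 0: German Tag means 'day'.
```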

44 Outline
Ancestral word forms; cognate groups / translations; grammatical inference

45 Grammar Induction
Task: given sentences, infer a grammar (and parse tree structures). Examples: "congress narrowly passed the amended bill" (EN); "les faits sont très clairs" (FR: "the facts are very clear").

46 Shared Prior
The grammars of related languages are learned jointly, coupled through a shared prior (examples: "congress narrowly passed the amended bill"; "les faits sont très clairs").
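One way to realize such coupling (a sketch under my own assumptions; the actual work uses a phylogeny-structured prior over log-linear grammar parameters, and the quadratic penalty below is a simplified stand-in): regularize each language's parameters toward those of its parent node in the tree.

```python
# Toy phylogeny: EN and NL under West Germanic (WG), FR under Romance
# (RM), both under Indo-European (IE). Node names are illustrative.
TREE = {"EN": "WG", "NL": "WG", "WG": "IE", "FR": "RM", "RM": "IE", "IE": None}

def prior_penalty(params, tree, strength=1.0):
    """Sum of squared differences between each node and its parent,
    so that related languages are pulled toward shared parameters."""
    total = 0.0
    for node, parent in tree.items():
        if parent is None:
            continue
        for k in params[node]:
            diff = params[node][k] - params[parent].get(k, 0.0)
            total += strength * diff * diff
    return total

# A single made-up grammar feature per node, just to exercise the code:
params = {n: {"stop_after_noun": 0.3 if n == "EN" else 0.5} for n in TREE}
print(prior_penalty(params, TREE))  # only EN deviates from its parent
```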

47 Results: Phylogenetic Prior
Languages: Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English; phylogeny nodes: WG, NG, RM, G, IE, GL. Average relative gain: 29%.

48 Conclusion
Phylogeny-structured models can:
Accurately reconstruct ancestral words
Give evidence in open linguistic debates
Detect translations from form and context
Improve language learning algorithms
Lots of questions still open:
Can we get better phylogenies using these high-res models?
What do these models have to say about the very earliest languages? Proto-World?

49 Thank you! nlp.cs.berkeley.edu

