Published by Hilda Hampton. Modified over 9 years ago.
1
Statistical NLP, Spring 2010. Lecture 25: Diachronics. Dan Klein, UC Berkeley
2
Evolution: Main Phenomena. Mutations of sequences (over time); speciation (over time).
3
Tree of Languages. Challenge: identify the phylogeny. Much work in biology, e.g. work by Warnow, Felsenstein, Steele… Also in linguistics, e.g. Warnow et al., Gray and Atkinson… http://andromeda.rutgers.edu/~jlynch/language.html
4
Statistical Inference Tasks: Modern Text; Ancestral Word Forms (fuego, feu → focus); Cognate Groups / Translations (fuego, feu; huevo, oeuf); Grammatical Inference (les faits sont très clairs). Diagram labels: Inputs, Outputs, Phylogeny (FR, IT, PT, ES).
5
Outline: Ancestral Word Forms (fuego, feu → focus); Cognate Groups / Translations (fuego, feu; huevo, oeuf); Grammatical Inference (les faits sont très clairs)
6
Language Evolution: Sound Change. Latin camera /kamera/ → French chambre /ʃambʁ/. Deletion: /e/. Change: /k/ .. /tʃ/ .. /ʃ/. Insertion: /b/. Eng. camera from Latin, “camera obscura”. Eng. chamber from Old Fr. before the initial /t/ dropped.
7
Diachronic Evidence: “tonitru non tonotru” (Appendix Probi [ca 300]); “tonight not tonite” (Yahoo! Answers [2009])
8
Synchronic (Comparative) Evidence
9
Simple Model: Single Characters [figure: nodes labeled with single characters, C or G]
10
A Model for Word Forms. LA: focus /fokus/; IB: /fogo/; ES: fuego /fweɣo/; PT: fogo /fogo/; IT: fuoco /fwɔko/ [BGK 07]
11
Contextual Changes /fokus/ /fwɔko/ f# o f# w ɔ …
12
Changes are Systematic: /fokus/, /fweɣo/, /fogo/, /fwɔko/; /kentrum/, /sentro/, /tʃɛntro/
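One way to read “systematic”: a change is a rule conditioned on context, and it applies to every word it matches. A minimal sketch, assuming a single hypothetical rule (/k/ becomes /s/ before a front vowel) and one character per sound; the function name and the rule are illustrative and are not taken from the slides:

```python
def apply_rule(word, target, replacement, following):
    """Rewrite every `target` sound that is immediately followed by a sound
    in `following`. '#' stands for the word boundary."""
    out = []
    for i, seg in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else "#"
        out.append(replacement if seg == target and nxt in following else seg)
    return "".join(out)


# Hypothetical rule: k -> s before e or i.
front_vowels = {"e", "i"}
changed = apply_rule("kentrum", "k", "s", front_vowels)   # k precedes e: rewritten
unchanged = apply_rule("fokus", "k", "s", front_vowels)   # k precedes u: kept
```

The same rule changes the first word and leaves the second alone, because only the first one supplies the conditioning context.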
13
Experimental Setup: Data sets
Small: Romance (French, Italian, Portuguese, Spanish); 2344 words; complete cognate sets; target: (Vulgar) Latin
Large: Oceanic; 661 languages; 140K words; incomplete cognate sets; target: Proto-Oceanic [Blust, 1993]
14
Data: Romance
15
Learning: Objective /fokus/ /fweɣo/ /fogo/ /fwɔko/
16
Learning: EM (example: /fokus/, /fweɣo/, /fogo/, /fwɔko/)
M-Step: Find parameters which fit (expected) sound change counts. Easy: gradient ascent on theta.
E-Step: Find (expected) change counts given parameters. Hard: variables are string-valued.
17
Computing Expectations (example: ‘grass’). Standard approach, e.g. [Holmes 01]: Gibbs sampling each sequence [Holmes 01, BGK 07]
18
A Gibbs Sampler ‘grass’
21
Getting Stuck: How to jump to a state where the liquids /r/ and /l/ have a common ancestor?
22
Getting Stuck
23
Solution: Vertical Slices. Single Sequence Resampling vs. Ancestry Resampling [BGK 08]
24
Details: Defining “Slices”. The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences. Correct construction section(G) is non-trivial but very efficient.
25
Results: Alignment Efficiency. Is ancestry resampling faster than basic Gibbs? Hypothesis: larger gains for deeper trees. Setup: fixed wall time; synthetic data, same parameters. [Plot axes: depth of the phylogenetic tree; Sum of Pairs]
26
Results: Romance
27
Learned Rules / Mutations
29
Comparison to Other Methods
Evaluation metric: edit distance from a reconstruction made by a linguist (lower is better)
Comparison to system from [Oakes, 2000], which uses exact inference and deterministic rules
Reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]
[Chart: Us vs. Oakes]
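The evaluation metric is edit distance to a reference reconstruction. A minimal sketch of the standard Levenshtein distance, assuming unit cost for insertion, deletion and substitution (the slides do not specify the cost scheme, and the function name is illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, with unit cost for
    insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


# Strings are compared one code point at a time; pass lists of phoneme
# symbols instead if a sound such as "tʃ" should count as a single unit.
d = edit_distance("fokus", "fogo")
```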
30
Data: Oceanic Proto-Oceanic
31
Data: Oceanic http://language.psy.auckland.ac.nz/austronesian/research.php
32
Results: Large Phylogenies
Centroid: a novel heuristic based on an approximation to the minimum Bayes risk
Reconstruction of Proto-Oceanic [Blust, 1993]
Both algorithms use 64 modern languages
[Chart: Us vs. Centroid]
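The slide does not say how the centroid is computed. As a generic illustration of a minimum-Bayes-risk style choice, and not as the authors' procedure, one can pick from a set of candidate forms the one with the smallest total edit distance to all candidates. The candidate strings below are made up for the example:

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def centroid(candidates):
    """Return the candidate minimizing the summed edit distance to all
    candidates (ties go to the earliest candidate)."""
    return min(candidates,
               key=lambda c: sum(edit_distance(c, other) for other in candidates))


# Hypothetical candidate reconstructions of one word.
best = centroid(["fokus", "fokus", "fogus", "pokus"])
```

Treating each candidate as equally likely makes the summed distance a stand-in for expected loss under edit distance.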
33
Result: More Languages Help. [Plot: mean edit distance from the [Blust, 1993] reconstructions vs. number of modern languages used]
34
Results: Large Scale
35
Visualization: Learned universals. *The model did not have features encoding natural classes.
36
Regularity and Functional Load
In a language, some pairs of sounds are more contrastive than others (higher functional load).
Example: English “p”/“d” versus “t”/“th”
“p”/“d”: pot/dot, pin/din, dress/press, pew/dew, ...
“t”/“th”: thin/tin
37
Functional Load: Timeline
Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]
Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67]
Caveat: only four languages were used in King’s study [Hockett 67; Surendran et al., 06]
Our work: we reexamined the question with two orders of magnitude more data [BGK, under review]
38
Regularity and Functional Load. Data: only 4 languages from the Austronesian data. [Plot: merger posterior probability vs. functional load as computed by [King, 67]; each dot is a sound change identified by the system]
39
Regularity and Functional Load. Data: all 637 languages from the Austronesian data. [Plot: merger posterior probability vs. functional load as computed by [King, 67]]
40
Outline: Ancestral Word Forms (fuego, feu → focus); Cognate Groups / Translations (fuego, feu; huevo, oeuf); Grammatical Inference (les faits sont très clairs)
41
Cognate Groups: {/fweɣo/, /fogo/, /fwɔko/} ‘fire’; {/berβo/, /vɛrbo/}; {/tʃɛntro/, /sentro/, /sɛntro/}
42
Model: Cognate Survival. Example: LA omnis survives as IT ogni, with no cognate shown for IB, ES or PT (survival indicators: LA +, IB -, ES -, PT -, IT +)
43
Results: Grouping Accuracy. [Chart: fraction of words correctly grouped, by method] [Hall and Klein, in submission]
44
Semantics: Matching Meanings. EN day occurs with: “night”, “sun”, “week”. DE tag occurs with: “nacht”, “sonne”, “woche”. EN tag occurs with: “name”, “label”, “along”.
45
Outline: Ancestral Word Forms (fuego, feu → focus); Cognate Groups / Translations (fuego, feu; huevo, oeuf); Grammatical Inference (les faits sont très clairs)
46
Grammar Induction. Task: Given sentences, infer grammar (and parse tree structures). Example sentences: “congress narrowly passed the amended bill”; “les faits sont très clairs”.
47
Shared Prior les faits sont très clairs congress narrowly passed the amended bill
48
Results: Phylogenetic Prior. Languages: Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English. Tree node labels: WG, NG, RM, G, IE, GL. Avg rel gain: 29%
49
Conclusion
Phylogeny-structured models can:
Accurately reconstruct ancestral words
Give evidence to open linguistic debates
Detect translations from form and context
Improve language learning algorithms
Lots of questions still open:
Can we get better phylogenies using these high-res models?
What do these models have to say about the very earliest languages? Proto-world?
50
Thank you! nlp.cs.berkeley.edu
51
Machine Translation Approach. Source Text: nous acceptons votre opinion. Target Text: we accept your view.
52
Translations from Monotexts (Source Text, Target Text). Translation without parallel text? Need (lots of) sentences. [Fung 95, Koehn and Knight 02, Haghighi and Klein 08]
53
Task: Lexicon Matching. Source Words s (from Source Text): state, world, name, nation. Target Words t (from Target Text): estado, política, mundo, nombre. Matching m. [Haghighi and Klein 08]
54
Data Representation: What are we generating? Source Text word: state. Orthographic Features: #st, tat, te# (value shown: 1.0). Context Features: world, politics, society (values shown: 20.0, 5.0, 10.0).
55
Data Representation: What are we generating? Source Text word: state. Orthographic Features: #st, tat, te# (value shown: 1.0). Context Features: world, politics, society (values shown: 5.0, 20.0, 10.0). Target Text word: estado. Orthographic Features: #es, sta, do# (value shown: 1.0). Context Features: mundo, politica, sociedad (values shown: 10.0, 17.0, 6.0).
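The two feature families can be sketched as follows. This is an illustration under stated assumptions: orthographic features are taken to be character trigrams of the word padded with the boundary symbol “#” (consistent with the slide's “#st”, “tat”, “te#”), and context features are counts of words within a fixed window, a window size the slides do not specify. Function names are hypothetical.

```python
from collections import Counter


def orthographic_features(word, n=3):
    """Character n-gram counts of the word padded with '#' on both sides."""
    padded = f"#{word}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def context_features(tokens, target, window=3):
    """Counts of the words appearing within `window` positions of each
    occurrence of `target` in a tokenized text."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


ortho = orthographic_features("state")   # keys include '#st', 'tat', 'te#'
text = "the state of the world and the state of politics".split()
ctx = context_features(text, "state")
```

Each word then gets one vector that concatenates both families, which is what the matching model on the following slides operates on.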
56
Generative Model (CCA): state (Source Space) and estado (Target Space) are related through a shared Canonical Space.
57
Generative Model (Matching). Source Words s: state, world, name, nation. Target Words t: estado, nombre, politica, mundo. Matching m.
58
Inference: Hard EM. E-Step: Find best matching. M-Step: Solve a CCA problem.
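The E-step can be pictured as an assignment problem: given a similarity score for every source/target word pair, pick the one-to-one matching with the highest total score. A minimal sketch, assuming a small square score matrix and brute-force search over permutations; the scores are made up, the M-step (CCA) is not shown, and a practical system would need an efficient assignment algorithm instead of enumeration:

```python
from itertools import permutations


def best_matching(score):
    """Return the one-to-one matching [(source_index, target_index), ...]
    with maximal total score. `score[i][j]` is the similarity of source
    word i and target word j. Brute force, so only for tiny inputs."""
    n = len(score)
    best = max(permutations(range(n)),
               key=lambda p: sum(score[i][p[i]] for i in range(n)))
    return list(enumerate(best))


source = ["state", "world", "name"]
target = ["estado", "mundo", "nombre"]
# Hypothetical similarity scores (rows: source words, columns: target words).
score = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
    [0.1, 0.4, 0.7],
]
matching = [(source[i], target[j]) for i, j in best_matching(score)]
```

In hard EM this matching is then treated as fixed while the model parameters are refit, and the two steps alternate.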
59
Experimental Setup
Data: 2K most frequent nouns, texts from Wikipedia
Seed: 100 translation pairs
Evaluation: Precision and Recall against lexicon obtained from Wiktionary
Report p_0.33, precision at recall 0.33
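Precision at a fixed recall can be computed by walking down the proposed translation pairs in order of confidence and stopping at the first point where the target recall is reached. A minimal sketch, assuming recall is measured as the fraction of gold lexicon entries recovered; the function name and the toy data are illustrative:

```python
def precision_at_recall(ranked_pairs, gold, target_recall):
    """Precision of the shortest prefix of `ranked_pairs` (most confident
    first) whose recall against `gold` reaches `target_recall`.
    Returns None if that recall is never reached."""
    correct = 0
    for k, pair in enumerate(ranked_pairs, start=1):
        if pair in gold:
            correct += 1
        if correct / len(gold) >= target_recall:
            return correct / k
    return None


gold = {("state", "estado"), ("world", "mundo"), ("name", "nombre")}
ranked = [("state", "estado"), ("nation", "mundo"),
          ("world", "mundo"), ("name", "nombre")]
p_033 = precision_at_recall(ranked, gold, 0.33)
```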
60
Lexicon Quality (EN-ES). [Plot: precision vs. recall]
61
Seed Lexicon Source Automatic Seed Edit distance seed 92 4k EN-ES Wikipedia Articles Precision
62
Analysis
63
Interesting Matches / Interesting Mistakes
64
Language Variation
65
Analysis Orthography Features Context Features
66
Language Variation
67
Markov Mutation Model /fokus/ /fwɔko/
68
Local Mutation along Tree f# o f# w ɔ. …
70
Model: Many Words /fokus/ /fweɣo/ /fogo/ /fwɔko/
71
Results
72
Results: More Languages