Statistical NLP Spring 2010 Lecture 25: Diachronics Dan Klein – UC Berkeley.


Evolution: Main Phenomena. Mutations of sequences (over time). Speciation (over time).

Tree of Languages  Challenge: identify the phylogeny  Much work in biology, e.g. work by Warnow, Felsenstein, Steele…  Also in linguistics, e.g. Warnow et al., Gray and Atkinson… http://andromeda.rutgers.edu/~jlynch/language.html

Statistical Inference Tasks (figure: inputs and outputs). Labels: Modern Text, Phylogeny (FR, IT, PT, ES), Ancestral Word Forms, Cognate Groups / Translations, Grammatical Inference. Examples: fuego, feu, focus; fuego / feu, huevo / oeuf; “les faits sont très clairs”.

Outline. Ancestral Word Forms (fuego, feu, focus). Cognate Groups / Translations (fuego / feu, huevo / oeuf). Grammatical Inference (“les faits sont très clairs”).

Language Evolution: Sound Change. Latin camera /kamera/ → French chambre /ʃambʁ/. Deletion: /e/. Change: /k/ → /tʃ/ → /ʃ/. Insertion: /b/. Eng. camera from Latin, “camera obscura”. Eng. chamber from Old Fr., before the initial /t/ dropped.

Diachronic Evidence. “tonitru non tonotru” (Appendix Probi [ca 300]). “tonight not tonite” (Yahoo! Answers [2009]).

Synchronic (Comparative) Evidence

Simple Model: Single Characters (figure: a tree whose nodes are labeled with single characters, C or G).
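A minimal sketch of how the likelihood of single characters at the leaves of a tree could be computed, using Felsenstein's pruning recursion. The two-letter alphabet matches the C/G labels in the figure; the symmetric mutation model, the uniform root prior, the single per-branch mutation probability `mu`, and the tuple encoding of trees are all assumptions made for illustration, not details taken from the lecture.

```python
ALPHABET = "CG"  # characters seen in the figure; any small alphabet works


def transition(parent, child, mu):
    """Assumed mutation model: keep the character with probability 1 - mu,
    otherwise switch uniformly to one of the other characters."""
    if parent == child:
        return 1.0 - mu
    return mu / (len(ALPHABET) - 1)


def partial_likelihood(node, mu):
    """Return {c: P(leaves below node | node has character c)}.

    A node is either a leaf (a one-character string) or a tuple of subtrees.
    """
    if isinstance(node, str):
        return {c: 1.0 if c == node else 0.0 for c in ALPHABET}
    result = {c: 1.0 for c in ALPHABET}
    for child in node:
        below = partial_likelihood(child, mu)
        for c in ALPHABET:
            # Sum out the child's character, weighted by the mutation model.
            result[c] *= sum(transition(c, d, mu) * below[d] for d in ALPHABET)
    return result


def tree_likelihood(tree, mu):
    """Likelihood of the observed leaves under a uniform prior at the root."""
    root = partial_likelihood(tree, mu)
    return sum(root[c] / len(ALPHABET) for c in ALPHABET)


if __name__ == "__main__":
    # Two leaves C and G under one internal node, with a third leaf G at the root.
    print(tree_likelihood((("C", "G"), "G"), mu=0.1))
```

The recursion visits each node once, so the cost is linear in the size of the tree. The word-level models later in the lecture replace the single character at each node with a whole string, which is what makes exact inference hard.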

A Model for Word Forms [BGK 07]. Tree: LA focus /fokus/; IB /fogo/; ES fuego /fweɣo/; PT fogo /fogo/; IT fuoco /fwɔko/.

Contextual Changes /fokus/ /fwɔko/ f# o f# w ɔ …

Changes are Systematic. /fokus/: /fweɣo/, /fogo/, /fwɔko/. /kentrum/: /sentro/, /tʃɛntro/.

Experimental Setup. Data sets. Small: Romance (French, Italian, Portuguese, Spanish; 2344 words; complete cognate sets; target: (Vulgar) Latin). Large: Oceanic (661 languages; 140K words; incomplete cognate sets; target: Proto-Oceanic [Blust, 1993]).

Data: Romance

Learning: Objective /fokus/ /fweɣo/ /fogo/ /fwɔko/

Learning: EM. /fokus/ /fweɣo/ /fogo/ /fwɔko/. M-Step: find parameters which fit (expected) sound change counts. Easy: gradient ascent on theta. E-Step: find (expected) change counts given parameters. Hard: variables are string-valued.

Computing Expectations. Standard approach, e.g. [Holmes 01, BGK 07]: Gibbs sampling each sequence. (Figure example: ‘grass’.)

A Gibbs Sampler ‘grass’

Getting Stuck. How to jump to a state where the liquids /r/ and /l/ have a common ancestor?

Getting Stuck

Solution: Vertical Slices [BGK 08]. Single sequence resampling vs. ancestry resampling.

Details: Defining “Slices”. The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences. Correct construction, section(G), is non-trivial but very efficient.

Results: Alignment Efficiency. Is ancestry resampling faster than basic Gibbs? Hypothesis: larger gains for deeper trees. Setup: fixed wall time; synthetic data, same parameters. (Plot axes: Sum of Pairs; depth of the phylogenetic tree.)

Results: Romance

Learned Rules / Mutations

Comparison to Other Methods. Evaluation metric: edit distance from a reconstruction made by a linguist (lower is better). Comparison to the system from [Oakes, 2000], which uses exact inference and deterministic rules. Reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]. (Chart: Us vs. Oakes.)
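The evaluation metric can be sketched as a standard Levenshtein distance averaged over word pairs. This is an illustration only: the unit costs for insertion, deletion and substitution, the character-level comparison, and the helper name `mean_edit_distance` are assumptions, since the slides do not say how the distance was parameterized.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences with unit costs."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def mean_edit_distance(reconstructions, references):
    """Average distance between system outputs and a linguist's forms."""
    pairs = list(zip(reconstructions, references))
    return sum(edit_distance(r, g) for r, g in pairs) / len(pairs)


if __name__ == "__main__":
    print(edit_distance("fokus", "fogo"))  # 3: two substitutions, one deletion
```

Passing lists of phoneme symbols instead of strings makes multi-character segments such as /tʃ/ count as a single unit.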

Data: Oceanic Proto-Oceanic

Data: Oceanic http://language.psy.auckland.ac.nz/austronesian/research.php

Results: Large Phylogenies. Centroid: a novel heuristic based on an approximation to the minimum Bayes risk. Reconstruction of Proto-Oceanic [Blust, 1993]. Both algorithms use 64 modern languages. (Chart: Us vs. Centroid.)

Result: More Languages Help. (Plot axes: number of modern languages used; mean edit distance from the [Blust, 1993] reconstructions.)

Results: Large Scale

Visualization: Learned universals *The model did not have features encoding natural classes

Regularity and Functional Load. In a language, some pairs of sounds are more contrastive than others (higher functional load). Example: English “p”/“d” versus “t”/“th”. “p”/“d”: pot/dot, pin/din, dress/press, pew/dew, ... “t”/“th”: thin/tin.
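One rough way to make "more contrastive" concrete is to count minimal pairs: words that differ in exactly one segment, where that segment is one of the two sounds in question. The sketch below does this on a toy lexicon built from the pot/dot and thin/tin examples. It is a crude proxy, not the functional load measure of [King, 67], and the function name and the tuple-of-segments encoding are assumptions made for illustration.

```python
from itertools import combinations


def minimal_pairs(lexicon, a, b):
    """Return word pairs that differ in exactly one position, a vs. b.

    Each word is a tuple of segments, so a digraph such as "th" can be
    treated as one sound.
    """
    pairs = []
    for w1, w2 in combinations(lexicon, 2):
        if len(w1) != len(w2):
            continue
        diffs = [(x, y) for x, y in zip(w1, w2) if x != y]
        if len(diffs) == 1 and set(diffs[0]) == {a, b}:
            pairs.append((w1, w2))
    return pairs


if __name__ == "__main__":
    toy_lexicon = [
        ("p", "o", "t"), ("d", "o", "t"),
        ("p", "i", "n"), ("d", "i", "n"),
        ("t", "i", "n"), ("th", "i", "n"),
    ]
    print(len(minimal_pairs(toy_lexicon, "p", "d")))   # 2
    print(len(minimal_pairs(toy_lexicon, "t", "th")))  # 1
```

On a real lexicon the counts would normally be weighted by word frequency, and the comparison is quadratic in the number of words, so a larger experiment would index words by their segment-deleted forms instead.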

Functional Load: Timeline. Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]. Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67]. Caveat: only four languages were used in King’s study [Hockett 67; Surendran et al., 06]. Our work: we reexamined the question with two orders of magnitude more data [BGK, under review].

Regularity and Functional Load. Data: only 4 languages from the Austronesian data. (Plot axes: functional load as computed by [King, 67]; merger posterior probability. Each dot is a sound change identified by the system.)

Regularity and Functional Load. Data: all 637 languages from the Austronesian data. (Plot axes: functional load as computed by [King, 67]; merger posterior probability.)

Cognate Groups. ‘fire’: /fweɣo/, /fogo/, /fwɔko/. Other groups: /berβo/, /vɛrbo/; /tʃɛntro/, /sentro/, /sɛntro/.

Model: Cognate Survival. Example: LA omnis survives only as IT ogni. Survival indicators: IT +, ES -, PT -, IB -, LA +.

Results: Grouping Accuracy [Hall and Klein, in submission]. (Chart: fraction of words correctly grouped, by method.)

Semantics: Matching Meanings. EN day occurs with: “night”, “sun”, “week”. DE tag occurs with: “nacht”, “sonne”, “woche”. EN tag occurs with: “name”, “label”, “along”.

Grammar Induction. Task: given sentences, infer grammar (and parse tree structures). Examples: “congress narrowly passed the amended bill”; “les faits sont très clairs”.

Shared Prior les faits sont très clairs congress narrowly passed the amended bill

Results: Phylogenetic Prior. Languages: Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English. Tree nodes: WG, NG, RM, G, IE, GL. Avg rel gain: 29%.

Conclusion. Phylogeny-structured models can: accurately reconstruct ancestral words; give evidence to open linguistic debates; detect translations from form and context; improve language learning algorithms. Lots of questions still open: Can we get better phylogenies using these high-res models? What do these models have to say about the very earliest languages? Proto-world?

Thank you! nlp.cs.berkeley.edu

Machine Translation Approach Source Text Target Text nous acceptons votre opinion. we accept your view.

Translations from Monotexts Source Text Target Text  Translation without parallel text?  Need (lots of) sentences [Fung 95, Koehn and Knight 02, Haghighi and Klein 08]

Task: Lexicon Matching [Haghighi and Klein 08]. Source words s (from source text): state, world, name, nation. Target words t (from target text): estado, política, mundo, nombre. Matching m between source and target words.

Data Representation: what are we generating? Source text, “state”: orthographic features #st, tat, te# (value 1.0); context features world, politics, society (values 5.0, 20.0, 10.0 in the figure). Target text, “estado”: orthographic features #es, sta, do# (value 1.0); context features mundo, politica, sociedad (values 10.0, 17.0, 6.0 in the figure).
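The two feature families on this slide can be sketched as follows. Orthographic features are taken to be character trigrams of the word padded with `#` boundary markers, which reproduces the `#st`, `tat`, `te#` examples; context features are taken to be counts of words seen within a small window around each occurrence. The window size, the function names, and the use of raw counts are assumptions for illustration, not details confirmed by the slides.

```python
from collections import Counter


def orthographic_features(word, n=3):
    """Character n-grams of the word, with '#' marking both boundaries."""
    padded = "#" + word + "#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def context_features(word, sentences, window=2):
    """Counts of tokens within `window` positions of each occurrence of word.

    `sentences` is a list of token lists from monolingual text.
    """
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token == word:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


if __name__ == "__main__":
    print(sorted(orthographic_features("state")))
    # ['#st', 'ate', 'sta', 'tat', 'te#']
    corpus = [
        ["the", "state", "of", "the", "world"],
        ["world", "politics", "and", "state", "society"],
    ]
    print(context_features("state", corpus))
```

Each word then becomes one sparse vector per language, and the source and target vectors are what the CCA-based model relates to each other.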

Generative Model (CCA) estado state Source Space Target Space Canonical Space

Generative Model (Matching) Source Words s Target Words t Matching m state world name nation estado nombre politica mundo

Inference: Hard EM. E-Step: find best matching. M-Step: solve a CCA problem.

Experimental Setup. Data: 2K most frequent nouns, texts from Wikipedia. Seed: 100 translation pairs. Evaluation: precision and recall against a lexicon obtained from Wiktionary. Report p_0.33, precision at recall 0.33.
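The E-step can be illustrated on a toy problem: given a score for every (source word, target word) pair, pick the one-to-one matching with the highest total score. The sketch below does this by exhaustive search, which is only feasible for a handful of words; a real system would use a polynomial-time assignment algorithm, and the lecture's model may also leave words unmatched. The score matrix is invented for illustration, and the M-step (refitting CCA on the matched pairs) is not shown.

```python
from itertools import permutations


def best_matching(scores):
    """Exhaustive maximum-weight one-to-one matching.

    scores[i][j] is the similarity of source word i and target word j.
    Assumes there are at least as many target words as source words.
    Returns (list of (source_index, target_index), total score).
    """
    n_source, n_target = len(scores), len(scores[0])
    best_total, best_assignment = float("-inf"), None
    for assignment in permutations(range(n_target), n_source):
        total = sum(scores[i][j] for i, j in enumerate(assignment))
        if total > best_total:
            best_total, best_assignment = total, assignment
    return list(enumerate(best_assignment)), best_total


if __name__ == "__main__":
    # Greedy matching would pair 0-0 first (0.9) and be left with 1-1 (0.1);
    # the optimal matching trades a little on word 0 for a lot on word 1.
    scores = [[0.9, 0.8],
              [0.85, 0.1]]
    print(best_matching(scores))
```

In hard EM the matching found here is treated as fixed when the parameters are refit, and the refit parameters produce new scores for the next E-step.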

Lexicon Quality (EN-ES) Precision Recall

Seed Lexicon Source. Automatic seed: edit distance seed. (Chart: precision on 4k EN-ES Wikipedia articles; value shown: 92.)

Analysis

Interesting Matches / Interesting Mistakes

Language Variation

Analysis Orthography Features Context Features

Language Variation

Markov Mutation Model /fokus/ /fwɔko/

Local Mutation along Tree f# o f# w ɔ. …

Model: Many Words /fokus/ /fweɣo/ /fogo/ /fwɔko/

Results

Results: More Languages
