Statistical NLP Spring 2010 Lecture 25: Diachronics Dan Klein – UC Berkeley.


1 Statistical NLP Spring 2010 Lecture 25: Diachronics Dan Klein – UC Berkeley

2 Evolution: Main Phenomena Mutations of sequences Time Speciation Time

3 Tree of Languages
- Challenge: identify the phylogeny
- Much work in biology, e.g. work by Warnow, Felsenstein, Steele…
- Also in linguistics, e.g. Warnow et al., Gray and Atkinson…
http://andromeda.rutgers.edu/~jlynch/language.html

4 Statistical Inference Tasks
Modern Text; Ancestral Word Forms; Cognate Groups / Translations; Grammatical Inference
Inputs / Outputs
Phylogeny: FR, IT, PT, ES
(Figure labels: fuego, feu, focus; fuego, feu, huevo, oeuf; les faits sont très clairs)

5 Outline
- Ancestral Word Forms
- Cognate Groups / Translations
- Grammatical Inference
(Figure labels: fuego, feu, focus; fuego, feu, huevo, oeuf; les faits sont très clairs)

6 Language Evolution: Sound Change
camera /kamera/ (Latin), chambre /ʃambʁ/ (French)
- Deletion: /e/
- Change: /k/ .. /tʃ/ .. /ʃ/
- Insertion: /b/
Eng. camera from Latin, “camera obscura”
Eng. chamber from Old Fr. before the initial /t/ dropped

7 Diachronic Evidence
Appendix Probi [ca. 300]: tonitru non tonotru
Yahoo! Answers [2009]: tonight not tonite

8 Synchronic (Comparative) Evidence

9 Simple Model: Single Characters CGCCCCGG CG C G G

10 A Model for Word Forms focus /fokus/ fuego /fweɣo/ /fogo/ fogo /fogo/ fuoco /fwɔko/ IT ES PT IB LA [BGK 07]

11 Contextual Changes /fokus/ /fwɔko/ f# o f# w ɔ …

12 Changes are Systematic /fokus/ /fweɣo/ /fogo/ /fwɔko/ /fokus/ /fweɣo/ /fogo/ /fwɔko/ /kentrum/ /sentro/ /tʃɛntro/

13 Experimental Setup
Data sets
- Small: Romance
  - French, Italian, Portuguese, Spanish
  - 2344 words
  - Complete cognate sets
  - Target: (Vulgar) Latin
- Large: Oceanic
  - 661 languages
  - 140K words
  - Incomplete cognate sets
  - Target: Proto-Oceanic [Blust, 1993]
(Figure labels: FR, IT, PT, ES)

14 Data: Romance

15 Learning: Objective /fokus/ /fweɣo/ /fogo/ /fwɔko/

16 Learning: EM
(Figure labels: /fokus/ /fweɣo/ /fogo/ /fwɔko/)
- M-Step
  - Find parameters which fit (expected) sound change counts
  - Easy: gradient ascent on theta
- E-Step
  - Find (expected) change counts given parameters
  - Hard: variables are string-valued

17 Computing Expectations ‘grass’ Standard approach, e.g. [Holmes 2001]: Gibbs sampling each sequence [Holmes 01, BGK 07]

18 A Gibbs Sampler ‘grass’

19 A Gibbs Sampler ‘grass’

20 A Gibbs Sampler ‘grass’

21 Getting Stuck How to jump to a state where the liquids /r/ and /l/ have a common ancestor? ?

22 Getting Stuck

23 Solution: Vertical Slices Single Sequence Resampling Ancestry Resampling [BGK 08]

24 Details: Defining “Slices”
The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences.
Correct construction of section(G) is non-trivial but very efficient.
(Figure label: anchor)

25 Results: Alignment Efficiency
Is ancestry resampling faster than basic Gibbs?
Hypothesis: Larger gains for deeper trees
Setup: Fixed wall time; synthetic data, same parameters
(Plot axes: Sum of Pairs; depth of the phylogenetic tree)

26 Results: Romance

27 Learned Rules / Mutations

28

29 Comparison to Other Methods
- Evaluation metric: edit distance from a reconstruction made by a linguist (lower is better)
- Comparison to system from [Oakes, 2000]
  - Uses exact inference and deterministic rules
- Reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]
(Chart labels: Us, Oakes)
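The evaluation metric on this slide can be illustrated with a minimal sketch of Levenshtein edit distance. The function name `edit_distance` and the unit costs for insertion, deletion, and substitution are assumptions made for illustration; the slides do not say which variant of edit distance was used.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences.

    Works on plain strings or on lists of phoneme symbols, so that a
    multi-character symbol such as "tʃ" can count as a single unit.
    Unit costs are an assumption, not taken from the slides.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i symbols of a
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


# Forms taken from the slides: Latin /fokus/ versus Portuguese /fogo/.
print(edit_distance("fokus", "fogo"))  # 3
```

Passing lists of symbols instead of strings, for example `["tʃ", "ɛ", "n", "t", "r", "o"]`, keeps affricates from being counted as two characters.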

30 Data: Oceanic Proto-Oceanic

31 Data: Oceanic http://language.psy.auckland.ac.nz/austronesian/research.php

32 Results: Large Phylogenies
- Centroid: a novel heuristic based on an approximation to the minimum Bayes risk
- Reconstruction of Proto-Oceanic [Blust, 1993]
- Both algorithms use 64 modern languages
(Chart labels: Us, Centroid)

33 Result: More Languages Help
(Plot axes: mean edit distance, i.e. distance from [Blust, 1993] reconstructions; number of modern languages used)

34 Results: Large Scale

35 Visualization: Learned universals *The model did not have features encoding natural classes

36 Regularity and Functional Load
In a language, some pairs of sounds are more contrastive than others (higher functional load)
Example: English “p”/“d” versus “t”/“th”
“p”/“d”: pot/dot, pin/din, dress/press, pew/dew, ...
“t”/“th”: thin/tin
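One rough way to make "more contrastive" concrete is to count minimal pairs: word pairs that differ only in the two sounds being compared. The sketch below does this over spellings for the example words listed on this slide (pot/dot, pin/din, thin/tin, and so on). It is only a crude proxy: the function name `minimal_pairs` is hypothetical, spelling stands in for phonemic transcription, and this is not the functional load measure of [King, 67] used later in the lecture.

```python
def minimal_pairs(lexicon, a, b):
    """Return word pairs (w, w') where w' is w with one occurrence of a replaced by b.

    a and b are spelling fragments standing in for sounds, which is an
    assumption; a real study would use phonemic transcriptions.
    """
    words = set(lexicon)
    pairs = set()
    for w in words:
        for i in range(len(w)):
            if w.startswith(a, i):
                other = w[:i] + b + w[i + len(a):]
                if other in words:
                    pairs.add((w, other))
    return sorted(pairs)


# Example words from the slide.
lexicon = ["pot", "dot", "pin", "din", "press", "dress", "pew", "dew", "thin", "tin"]
print(len(minimal_pairs(lexicon, "p", "d")))   # 4
print(len(minimal_pairs(lexicon, "th", "t")))  # 1
```

On this toy word list the first contrast separates four word pairs and the second only one, which is the intuition behind calling the first contrast higher in functional load.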

37 Functional Load: Timeline
Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]
Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67]
Caveat: only four languages were used in King’s study [Hockett 67; Surendran et al., 06]
Our work: we reexamined the question with two orders of magnitude more data [BGK, under review]

38 Regularity and Functional Load
Data: only 4 languages from the Austronesian data
Each dot is a sound change identified by the system
(Plot axes: functional load as computed by [King, 67]; merger posterior probability)

39 Regularity and Functional Load
Data: all 637 languages from the Austronesian data
(Plot axes: functional load as computed by [King, 67]; merger posterior probability)

40 Outline
- Ancestral Word Forms
- Cognate Groups / Translations
- Grammatical Inference
(Figure labels: fuego, feu, focus; fuego, feu, huevo, oeuf; les faits sont très clairs)

41 Cognate Groups /fweɣo/ /fogo/ /fwɔko/ /berβo/ /vɛrbo/ /tʃɛntro/ /sentro/ /sɛntro/ ‘fire’

42 Model: Cognate Survival omnis - - - ogni IT ES PT IB LA + - - - + IT ES PT IB LA

43 Results: Grouping Accuracy Fraction of Words Correctly Grouped Method [Hall and Klein, in submission]

44 Semantics: Matching Meanings
day (EN) occurs with: “night”, “sun”, “week”
tag (DE) occurs with: “nacht”, “sonne”, “woche”
tag (EN) occurs with: “name”, “label”, “along”

45 Outline
- Ancestral Word Forms
- Cognate Groups / Translations
- Grammatical Inference
(Figure labels: fuego, feu, focus; fuego, feu, huevo, oeuf; les faits sont très clairs)

46 Grammar Induction congress narrowly passed the amended bill les faits sont très clairs Task: Given sentences, infer grammar (and parse tree structures)

47 Shared Prior les faits sont très clairs congress narrowly passed the amended bill

48 Results: Phylogenetic Prior
Languages: Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English
Tree node labels: WG, NG, RM, G, IE, GL
Avg rel gain: 29%

49 Conclusion
- Phylogeny-structured models can:
  - Accurately reconstruct ancestral words
  - Give evidence to open linguistic debates
  - Detect translations from form and context
  - Improve language learning algorithms
- Lots of questions still open:
  - Can we get better phylogenies using these high-res models?
  - What do these models have to say about the very earliest languages? Proto-world?

50 Thank you! nlp.cs.berkeley.edu

51 Machine Translation Approach Source Text Target Text nous acceptons votre opinion. we accept your view.

52 Translations from Monotexts
Source Text, Target Text
- Translation without parallel text?
- Need (lots of) sentences
[Fung 95, Koehn and Knight 02, Haghighi and Klein 08]

53 Task: Lexicon Matching
Source Text, Target Text
Source Words s: state, world, name, nation
Target Words t: estado, política, mundo, nombre
Matching m
[Haghighi and Klein 08]

54 Data Representation state Source Text What are we generating? Orthographic Features 1.0 #st tat te# Context Features 20.0 5.0 10.0 world politics society

55 Data Representation state Orthographic Features 1.0 #st tat te# 5.0 20.0 10.0 Context Features world politics society Source Text estado Orthographic Features 1.0 #es sta do# 10.0 17.0 6.0 Context Features mundo politica sociedad Target Text What are we generating?
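The two feature types on these slides can be sketched as follows. The orthographic features appear to be character trigrams over the word padded with `#` boundary markers (the slide shows `#st`, `tat`, `te#` for "state"), and the context features appear to be counts of nearby words. The function names, the window size, and the use of raw counts are assumptions for illustration; the slides do not give these details.

```python
from collections import Counter


def orthographic_features(word, n=3):
    """Character n-gram counts over the word padded with '#' boundaries."""
    padded = "#" + word + "#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def context_features(target, sentences, window=3):
    """Counts of words within `window` tokens of each occurrence of target.

    `sentences` is a list of token lists. The window size is an assumption.
    """
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo = max(0, i - window)
                counts.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return counts


print(sorted(orthographic_features("state")))
# ['#st', 'ate', 'sta', 'tat', 'te#']
```

Each word is then represented by the concatenation of the two count vectors, one set computed from the source text and one from the target text.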

56 Generative Model (CCA) estado state Source Space Target Space Canonical Space

57 Generative Model (Matching)
Source Words s: state, world, name, nation
Target Words t: estado, nombre, politica, mundo
Matching m

58 Inference: Hard EM
E-Step: Find best matching
M-Step: Solve a CCA problem
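The "find best matching" step can be illustrated as a maximum-weight one-to-one matching between source and target words. The sketch below is a toy version under stated assumptions: the score matrix is made up, the brute-force search only works for a handful of words, and every source word is forced to match, which the slides do not specify. The CCA M-step that would produce the scores is not shown.

```python
from itertools import permutations


def best_matching(scores):
    """Brute-force maximum-weight matching.

    scores[i][j] is a (hypothetical) similarity between source word i and
    target word j. Assumes len(scores) <= len(scores[0]). Returns the list
    of (source, target) index pairs and the total score.
    """
    n_src, n_tgt = len(scores), len(scores[0])
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n_tgt), n_src):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm)), best_total


# Made-up scores: picking the best target for row 0 first would block row 1.
scores = [[0.9, 0.8],
          [0.9, 0.1]]
pairs, total = best_matching(scores)
print(pairs)  # [(0, 1), (1, 0)]
```

The example shows why the matching is solved jointly: choosing each source word's best target independently would assign both rows to target 0. At realistic vocabulary sizes a polynomial-time assignment algorithm would replace the brute-force loop.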

59 Experimental Setup
- Data: 2K most frequent nouns, texts from Wikipedia
- Seed: 100 translation pairs
- Evaluation: Precision and Recall against lexicon obtained from Wiktionary
- Report p 0.33, precision at recall 0.33
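A minimal sketch of "precision at recall 0.33", assuming the usual reading: rank the proposed translation pairs by confidence, walk down the list until the chosen recall level is reached, and report precision at that point. The function name, the toy lexicon, and the convention of dividing by the size of the gold lexicon are assumptions; the slides do not spell out the exact definition.

```python
def precision_at_recall(ranked_pairs, gold, target_recall):
    """Precision of the shortest ranked prefix whose recall reaches target_recall.

    ranked_pairs: proposed (source, target) pairs, most confident first.
    gold: set of correct pairs. Returns None if the recall is never reached.
    """
    correct = 0
    for k, pair in enumerate(ranked_pairs, start=1):
        if pair in gold:
            correct += 1
        if correct / len(gold) >= target_recall:
            return correct / k
    return None


# Toy example built from words on the slides; the ranking is made up.
gold = {("state", "estado"), ("world", "mundo"), ("name", "nombre")}
ranked = [("state", "estado"), ("world", "nombre"),
          ("name", "nombre"), ("nation", "mundo")]
print(precision_at_recall(ranked, gold, 0.33))  # 1.0
```

Reporting precision at a fixed recall makes systems comparable even when they propose different numbers of pairs.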

60 Lexicon Quality (EN-ES) Precision Recall

61 Seed Lexicon Source
- Automatic Seed
- Edit distance seed
(Plot labels: Precision; 92; 4k EN-ES Wikipedia Articles)

62 Analysis

63 Interesting Matches / Interesting Mistakes

64 Language Variation

65 Analysis Orthography Features Context Features

66 Language Variation

67 Markov Mutation Model /fokus/ /fwɔko/

68 Local Mutation along Tree f# o f# w ɔ. …

69

70 Model: Many Words /fokus/ /fweɣo/ /fogo/ /fwɔko/

71 Results

72 Results: More Languages

