1 ASRU, Dec. 2015. Graphical Models Over String-Valued Random Variables. Jason Eisner, with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, and Michael Paul.

2 ASRU, Dec. 2015. Probabilistic Inference of Strings, with Pronunciation Dictionaries. Jason Eisner, with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, and Michael Paul.

3 3

4 [Diagram: a lexicon of word types, connected to semantics, sentences, discourse context, and resources (N tokens) via relations such as entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, and annotation.] To recover variables, model and exploit their correlations.

5 5

6 Bayesian View of the World observed data hidden data probability distribution 6

7 Bayesian NLP Some good NLP questions:  Underlying facts or themes that help explain this document collection?  An underlying parse tree that helps explain this sentence? An underlying meaning that helps explain that parse tree?  An underlying grammar that helps explain why these sentences are structured as they are?  An underlying grammar or evolutionary history that helps explain why these words are spelled as they are? 7

8 Today’s Challenge Too many words in a language! 8

9 Natural Language is Built from Words 9

10 Can store info about each word in a table:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS
(other columns would include translations, topics, counts, embeddings, …)

11 Problem: Too Many Words! Google analyzed 1 trillion words of English text Found > 13M distinct words with count ≥ 200 The problem isn’t storing such a big table … it’s acquiring the info for each row separately  Need lots of evidence, or help from human speakers Hard to get for every word of the language Especially hard for complex or “low-resource” languages  Omit rare words? Maybe, but many sentences contain them (Zipf’s Law) 11

12 Technically speaking, # words = ∞. Really the set of (possible) words is Σ*: names, neologisms, typos, and productive processes: friend → friendless → friendlessness → friendlessnessless → …; hand+bag → handbag (sometimes can iterate).

13 Technically speaking, # words = ∞. Really the set of (possible) words is Σ*: names, neologisms, typos, and productive processes: friend → friendless → friendlessness → friendlessnessless → …; hand+bag → handbag (sometimes can iterate). Turkish word: uygarlaştiramadiklarimizdanmişsinizcasina = uygar+laş+tir+ama+dik+lar+imiz+dan+miş+siniz+casina, "(behaving) as if you are among those whom we could not cause to become civilized".

14 A typical Polish verb ("to contain"): imperfective zawierać, perfective zawrzeć. [Conjugation table: present, past, future, conditional, and imperative forms plus participles for both aspects, e.g. zawieram/zawieramy, zawierałem/zawierałam, będę zawierał, zawrę/zawrzemy, zawarłem/zawarłam, zawrzyjmy; present active participle zawierający; past passive participle zawarty; adverbial participle zawierając.] 100 inflected forms per verb. Sort of predictable from one another! (verbs are more or less regular)

15

16 Solution: Don’t model every cell separately Noble gases Positive ions 16

17 Can store info about each word in a table:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS
(other columns would include translations, topics, counts, embeddings, …)

18 What’s in the table? NLP strings are diverse … Use  Orthographic (spelling)  Phonological (pronunciation)  Latent (intermediate steps not observed directly) Size  Morphemes (meaningful subword units)  Words  Multi-word phrases, including “named entities”  URLs 18

19 Language  English, French, Russian, Hebrew, Chinese, …  Related languages (Romance langs, Arabic dialects, …)  Dead languages (common ancestors) – unobserved?  Transliterations into different writing systems Medium  Misspellings  Typos  Wordplay  Social media 19 What’s in the table? NLP strings are diverse …

20 Some relationships within the table: spelling ↔ pronunciation; word ↔ noisy word (e.g., with a typo); word ↔ related word in another language (loanwords, language evolution, cognates); singular ↔ plural (for example); (root, binyan) ↔ word; underlying form ↔ surface form.

21 Chains of relations can be useful. Misspelling or pun = spelling → pronunciation → spelling. Cognate = word → historical parent → historical child.

22 Reconstructing the (multilingual) lexicon:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS
(other columns would include translations, topics, distributional info such as counts, …)
Ultimate goal: Probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text. Needed: Exploit the relationships (arrows). May have to discover those relationships. Approach: Linguistics + generative modeling + statistical inference. Modeling ingredients: Finite-state machines, graphical models, CRP. Inference ingredients: MCMC, BP/EP, DD.

23 Today’s Focus: Phonology (but methods also apply to other relationships among strings) 23

24 What is Phonology? Orthography: cat. Phonology: [kæt]. Phonology explains regular sound patterns.

25 What is Phonology? Orthography: cat. Phonology: [kæt]. Phonetics: [figure]. Phonology explains regular sound patterns. Not phonetics, which deals with acoustics.

26 Q: What do phonologists do? A: They find patterns among the pronunciations of words. 26

27 A Phonological Exercise. [Table: verbs (TALK, THANK, HACK, CRACK, SLAP) × tenses (1P Pres. Sg., 3P Pres. Sg., Past Tense, Past Part.), partly filled with observed pronunciations: [tɔk] [tɔks] [tɔkt] [tɔkt]; [θeɪŋk] [θeɪŋks] [θeɪŋkt]; [hæk] [hæks] [hækt]; [kɹæks] [kɹækt]; [slæp] [slæpt]; the other cells are blank.]

28 Matrix Completion: Collaborative Filtering. [Figure: a Users × Movies ratings matrix with some observed entries (-37, 29, 19, 29, -36, 67, 77, 22, -24, 61, 74, 12, -79, -41, -52, -39) and many missing cells.]

29 Matrix Completion: Collaborative Filtering. [Figure: the same ratings matrix, now with a latent vector attached to each user and each movie, e.g. [4 1 -5], [7 -2 0], [6 -2 3], [-9 1 4], [3 8 -5], [-6 -3 2], [9 -2 1], [9 -7 2], [4 3 -2].]

30 Matrix Completion: Collaborative Filtering. Prediction! [Figure: missing cells filled in from the latent vectors, e.g. 59, -80, 6, 46.]

31 Matrix Completion: Collaborative Filtering. [Figure: two latent vectors, [1,-4,3] and [-5,2,1]; their dot product is -10; adding Gaussian noise gives an observed rating such as -11.]
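(A minimal sketch, not from the talk, of the factorization idea just illustrated: each user and each movie gets a latent vector, an observed rating is modeled as their dot product plus Gaussian noise, and a missing cell is predicted by the noiseless dot product. The vectors and names below are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-dimensional latent vectors, as in the slide's example.
user_vecs = {"u1": np.array([1.0, -4.0, 3.0])}
movie_vecs = {"m1": np.array([-5.0, 2.0, 1.0])}

def sample_rating(user, movie, noise_sd=1.0):
    """Generative story: observed rating = dot product + Gaussian noise."""
    mean = user_vecs[user] @ movie_vecs[movie]
    return rng.normal(mean, noise_sd)

def predict_rating(user, movie):
    """Prediction for an empty cell: just the dot product."""
    return float(user_vecs[user] @ movie_vecs[movie])

print(predict_rating("u1", "m1"))  # [1,-4,3] . [-5,2,1] = -10
print(sample_rating("u1", "m1"))   # e.g. roughly -11 once noise is added
```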

32 Matrix Completion: Collaborative Filtering. Prediction! [Figure: the completed matrix again, with the predicted entries 59, -80, 6, 46.]

33 A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Tenses Verbs [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] 33

34 A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ 34

35 A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ Suffixes Stems 35

36 A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæk] [kɹæks] [kɹækt] [slæp] [slæps] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ Prediction! THANK Suffixes Stems 36

37 Why “talks” sounds like that: tɔk + s, concatenate → tɔks (“talks”).

38 A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /bæt/ /koʊd/ /slæp/ /kɹæk/ THANK 38

39 A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /bæt/ /koʊd/ /slæp/ /kɹæk/ z instead of s ɪd instead of t THANK 39

40 A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT EAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] [it] [eɪt] [itən] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /it/ /bæt/ /koʊd/ /slæp/ /kɹæk/ eɪt instead of i t ɪd 40

41 Why “codes” sounds like that: koʊd + s, concatenate → koʊd#s, phonology (stochastic) → koʊdz (“codes”). Modeling word forms using latent underlying morphs and phonology. Cotterell et al., TACL 2015.

42 Why “resignation” sounds like that: rizaign + ation, concatenate → rizaign#ation, phonology (stochastic) → rεzɪgneɪʃn (“resignation”).

43 Fragment of Our Graph for English. 1) Morphemes: rizaign, dæmn, eɪʃən, s (the 3rd-person singular suffix: very common!). 2) Underlying words (by concatenation): rizaign#eɪʃən, rizaign#s, dæmn#eɪʃən, dæmn#s. 3) Surface words (by phonology): r,εzɪgn’eɪʃn (“resignation”), riz’ajnz (“resigns”), d,æmn’eɪʃn (“damnation”), d’æmz (“damns”).

44 Handling Multimorphemic Words: “geliebt” (German: loved) = gə + liːb + t → underlying gəliːbt → surface gəliːpt. Matrix completion: each word built from one stem (row) + one suffix (column). WRONG. Graphical model: a word can be built from any # of morphemes (parents). RIGHT.

45 Limited to concatenation? No, could extend to templatic morphology … 45

46 A (Simple) Model of Phonology 46

47 [Figure, repeated from slide 31: latent vectors [1,-4,3] and [-5,2,1]; dot product -10; Gaussian noise gives the observed -11.]

48 rizaign + s, concatenate → rizaign#s, phonology (stochastic, S_θ) → rizainz (“resigns”).

49 Upper Left ContextUpper Right Context Lower Left Context Phonology as an Edit Process r r i i z z a a i i g g n n s s 49

50 Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r COPY 50

51 Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r i i COPY 51

52 Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r i i COPY z z 52

53 Upper Left Context Lower Left Context i i i i Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z COPY a a 53

54 i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY 54

55 i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ 55

56 i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ COPY n n 56

57 i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ n n SUB z z 57

58 i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ n n SUB z z 58

59 i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ COPY n n 59

60 Phonology as an Edit Process. [Figure: editing the upper string r i z a i g n s, with upper left/right contexts and lower left context shown; the current action is DEL.] A feature function and weights define a probability distribution over the next edit action, e.g.: DEL .75, COPY .01, SUB(A) .05, SUB(B) .03, …, INS(A) .02, INS(B) .01, …
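(A minimal sketch of how such an action distribution could be computed from a feature function and weights: score each candidate edit action against the current contexts and normalize with a softmax. The features and weights below are invented for illustration; the talk's actual feature set is richer.)

```python
import math

# Invented feature weights (theta); the real model has many more features.
weights = {
    ("DEL", "upper=g"): 2.0,   # deleting a 'g' in this context is favored
    ("COPY", "default"): 0.5,
    ("SUB", "default"): -1.0,
    ("INS", "default"): -2.0,
}

def features(action, upper_left, upper_right, lower_left):
    """Toy feature function over (edit action, contexts)."""
    kind = action.split("(")[0]               # e.g. "SUB(z)" -> "SUB"
    feats = [(kind, "default")]
    if upper_right:
        feats.append((kind, "upper=" + upper_right[0]))
    return feats

def action_probs(actions, upper_left, upper_right, lower_left):
    """Softmax over the candidate edit actions, given the three contexts."""
    scores = {a: sum(weights.get(f, 0.0)
                     for f in features(a, upper_left, upper_right, lower_left))
              for a in actions}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}

# Editing "rizaigns": the upper-right context starts with 'g', so DEL wins.
print(action_probs(["COPY", "DEL", "SUB(z)", "INS(a)"],
                   upper_left="rizai", upper_right="gns", lower_left="rizai"))
```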

61 Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features 61

62 Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form 62

63 i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form Transduction 63

64 Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form Transduction Upper String 64

65 Phonological Attributes Binary Attributes (+ and -) 65

66 i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ 66

67 Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Faithfulness Features EDIT(g, ɛ ) EDIT(+cons, ɛ ) EDIT(+voiced, ɛ ) 67

68 i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY Markedness Features BIGRAM(a, i) BIGRAM(-high, -low) BIGRAM(+back, -back) 68

69 i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY Markedness Features BIGRAM(a, i) BIGRAM(-high, -low) BIGRAM(+back, -back) Inspired by Optimality Theory: A popular Constraint-Based Phonology Formalism 69

70 Inference for Phonology 70

71 Bayesian View of the World observed data hidden data probability distribution 71

72 [Figure: the observed surface words r,εzɪgn’eɪʃn, d,æmn’eɪʃn, d’æmz.]

73 dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən 73

74 Bayesian View of the World. Observed data: some of the words. Hidden data: the rest of the words! All of the morphs, and the parameter vectors θ, φ. Linked by a probability distribution.

75 Statistical Inference. Observed: some surface forms of the language. To infer: 1. the underlying forms giving rise to those surface forms; 2. the surface forms for all other words, as predicted by the underlying forms via the probability distribution. (We also learn the edit parameters that best explain the visible part of the iceberg.)

76 Why this matters. Phonological grammars are usually hand-engineered by phonologists. Linguistics goal: Create an automated phonologist? Cognitive science goal: Model how babies learn phonology? Engineering goal: Analyze and generate words we haven’t heard before?

77 The Generative Story (defines which iceberg shapes are likely)
1. Sample the parameters φ and θ from priors. These parameters describe the grammar of a new language: what tends to happen in the language.
2. Now choose the lexicon of morphs and words:
– For each abstract morpheme a ∈ A, sample the morph M(a) ~ M_φ.
– For each abstract word a = a1 a2 ···, sample its surface pronunciation S(a) from S_θ(· | u), where u = M(a1)#M(a2) ···.
3. This lexicon can now be used to communicate. A word’s pronunciation is now just looked up, not sampled; so it is the same each time it is used.
(Example on the slide: rizaign, rizaign#s, riz’ajnz.)
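(A pseudocode-style sketch of steps 2-3 of this generative story. The samplers sample_morph and apply_phonology are placeholders standing in for M_φ and S_θ; the toy usage at the bottom hard-codes them just to show the data flow.)

```python
def generate_lexicon(abstract_morphemes, abstract_words,
                     sample_morph, apply_phonology):
    """Sample each morph once, then build every word from its morphs."""
    # Step 2a: one underlying morph per abstract morpheme, drawn from M_phi.
    morph = {a: sample_morph(a) for a in abstract_morphemes}

    lexicon = {}
    for word in abstract_words:                 # e.g. word = ("RESIGN", "S")
        # Step 2b: concatenate the morphs to get the underlying form u ...
        underlying = "#".join(morph[a] for a in word)
        # ... and sample the surface pronunciation from S_theta(. | u).
        lexicon[word] = apply_phonology(underlying)

    # Step 3: the lexicon is now fixed; later uses of a word look up its
    # pronunciation rather than resampling it.
    return morph, lexicon

# Toy usage with deterministic stand-ins for the real (stochastic) samplers:
morph, lex = generate_lexicon(
    ["RESIGN", "S"], [("RESIGN",), ("RESIGN", "S")],
    sample_morph=lambda a: {"RESIGN": "rizaign", "S": "s"}[a],
    apply_phonology=lambda u: u.replace("#", "").replace("gn", "n"))
print(lex)   # {('RESIGN',): 'rizain', ('RESIGN', 'S'): 'rizains'}
```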

78 Why Probability? A language’s morphology and phonology are fixed, but probability models the learner’s uncertainty about what they are. Advantages: – Quantification of irregularity (“singed” vs. “sang”) – Soft models admit efficient learning and inference Our use is orthogonal to the way phonologists currently use probability to explain gradient phenomena 78

79 Basic Methods for Inference and Learning 79

80 Train the Parameters using EM (Dempster et al. 1977) E-Step (“inference”): – Infer the hidden strings (posterior distribution) r,εz ɪ gn’e ɪʃ nd,æmn’e ɪʃ n d’æmz

81 Train the Parameters using EM (Dempster et al. 1977) E-Step (“inference”): – Infer the hidden strings (posterior distribution) dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

82 Train the Parameters using EM (Dempster et al. 1977). E-Step (“inference”): infer the hidden strings (posterior distribution). M-Step (“learning”): improve the continuous model parameters θ, φ (gradient descent: the E-step provides supervision). Repeat till convergence. [Figure: the edit alignment of rizaign#s to riz’ajnz.]
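(A schematic sketch of this EM loop. infer_hidden_strings and gradient_step are placeholders for the talk's inference and learning routines, passed in by the caller.)

```python
def train_em(observed, params, infer_hidden_strings, gradient_step, n_iters=20):
    """Alternate posterior inference over hidden strings with parameter updates."""
    for _ in range(n_iters):
        # E-step: posterior over underlying morphs and unobserved words,
        # e.g. computed (approximately) by belief propagation.
        posterior = infer_hidden_strings(observed, params)
        # M-step: treat the posterior expectations as supervision and take
        # a gradient step on the phonology / morph parameters (theta, phi).
        params = gradient_step(params, posterior, observed)
    return params
```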

83 Directed Graphical Model (defines the probability of a candidate solution) 1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology 83 Inference step: Find high-probability reconstructions of the hidden variables. High-probability if each string is likely given its parents. dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

84 1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology 84 Equivalent Factor Graph (defines the probability of a candidate solution) dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

85 85

86 86

87 87

88 88

89 89

90 90

91 91

92 92

93 93

94 Directed Graphical Model 94 1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

95 Equivalent Factor Graph Each ellipse is a random variable Each square is a “factor” – a function that jointly scores the values of its few neighboring variables 95 1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

96 ? ? riz’ajnz ? r,εz ɪ gn’e ɪʃ n ? ? riz’ajnd ? ? Dumb Inference by Hill-Climbing 1) Morpheme URs 2) Word URs 3) Word SRs 96

97 foo ? riz’ajnz ? r,εz ɪ gn’e ɪʃ n s ? riz’ajnd da bar Dumb Inference by Hill-Climbing 1) Morpheme URs 2) Word URs 3) Word SRs 97

98 Dumb Inference by Hill-Climbing foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 98

99 Dumb Inference by Hill-Climbing 8e-3 0.01 0.050.02 foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 99

100 Dumb Inference by Hill-Climbing 8e-3 0.01 0.050.02 foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 6e-1200 2e-1300 7e-1100 100

101 Dumb Inference by Hill-Climbing 8e-3 0.01 0.050.02 foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 6e-1200 2e-1300 7e-1100  101

102 Dumb Inference by Hill-Climbing foo far#s riz’ajnz far#foo r,εz ɪ gn’e ɪʃ n s far#da riz’ajnd da far 1) Morpheme URs 2) Word URs 3) Word SRs ? 102

103 Dumb Inference by Hill-Climbing foo size#s riz’ajnz size#foo r,εz ɪ gn’e ɪʃ n s size#da riz’ajnd da size 1) Morpheme URs 2) Word URs 3) Word SRs ? 103

104 Dumb Inference by Hill-Climbing foo …#s riz’ajnz …#foo r,εz ɪ gn’e ɪʃ n s …#da riz’ajnd da … 1) Morpheme URs 2) Word URs 3) Word SRs ? 104

105 Dumb Inference by Hill-Climbing foo rizajn#s riz’ajnz rizajn#foo r,εz ɪ gn’e ɪʃ n s rizajn#da riz’ajnd da rizajn 1) Morpheme URs 2) Word URs 3) Word SRs 105

106 Dumb Inference by Hill-Climbing foo rizajn#s riz’ajnz rizajn#foo r,εz ɪ gn’e ɪʃ n s rizajn#da riz’ajnd da rizajn 1) Morpheme URs 2) Word URs 3) Word SRs 0.012e-50.008 106

107 Dumb Inference by Hill-Climbing e ɪʃ n rizajn#s riz’ajnz rizajn#e ɪʃ n r,εz ɪ gn’e ɪʃ n s rizajn#d riz’ajnd d rizajn 1) Morpheme URs 2) Word URs 3) Word SRs 0.010.0010.015 107

108 Dumb Inference by Hill-Climbing e ɪʃ n rizajgn#s riz’ajnz rizajgn#e ɪʃ n r,εz ɪ gn’e ɪʃ n s rizajgn#d riz’ajnd d rizajgn 1) Morpheme URs 2) Word URs 3) Word SRs 0.008 0.013 108
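(A sketch of the hill-climbing idea walked through above, over a generic scoring function: repeatedly propose a new value for one hidden string and keep it only if the joint score does not drop. joint_score and propose_edit are placeholders for the model's probability and for some string-mutation proposal.)

```python
import random

def hill_climb(hidden, joint_score, propose_edit, n_steps=1000, seed=0):
    """Greedy coordinate-wise search over string-valued hidden variables."""
    rng = random.Random(seed)
    best = joint_score(hidden)
    for _ in range(n_steps):
        var = rng.choice(list(hidden))          # pick one hidden variable
        old = hidden[var]
        hidden[var] = propose_edit(old, rng)    # propose a nearby string
        new = joint_score(hidden)
        if new >= best:
            best = new                          # keep the improvement
        else:
            hidden[var] = old                   # revert: score got worse
    return hidden, best
```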

109 Dumb Inference by Hill-Climbing 109  Can we make this any smarter?  This naïve method would be very slow. And it could wander around forever, get stuck in local maxima, etc.  Alas, the problem of finding the best values in a factor graph is undecidable! (Can’t even solve by brute force because strings have unbounded length.)  Exact methods that might not terminate (but do in practice)  Approximate methods – which try to recover not just the best values, but the posterior distribution of values  All our methods are based on finite-state automata

110 A Generative Model of Phonology A Directed Graphical Model of the lexicon rˌɛzɪgnˈeɪʃən dˈæmz rizˈajnz 110 (Approximate) Inference MCMC – Bouchard-Côté (2007) Belief Propagation – Dreyer and Eisner (2009) Expectation Propagation – Cotterell and Eisner (2015) Dual Decomposition – Peng, Cotterell, & Eisner (2015)

111 About to sell our mathematical soul? 111 Insight Efficiency Exactness

112 112 General algos Give up lovely dynamic programming? Big Models Specialized algos

113 Give up lovely dynamic programming? [Diagram: Insight; General algos; Specialized algos.] Not quite! Yes, general algos … which call specialized algos as subroutines. Within a framework such as belief propagation, we may run parsers (Smith & Eisner 2008) or finite-state machines (Dreyer & Eisner 2009). A step of belief propagation takes time O(k^n) in general: to update one message from a factor that coordinates n variables that have k possible values each. If that’s slow, we can sometimes exploit special structure in the factor! Large n: a parser uses dynamic programming to coordinate many variables. Infinite k: FSMs use dynamic programming to coordinate strings.

114 [Figure: observed surface words rˌɛzɪgnˈeɪʃən, dˈæmz, rizˈajnz.] Distribution Over Surface Form (UR / Prob): dæmeɪʃən .80, dæmneɪʃən .10, dæmineɪʃən .001, dæmiineɪʃən .0001, …, chomsky .000001, … Encoded as a weighted finite-state automaton.

115 115

116 Experimental Design 116

117 Experimental Datasets. 7 languages from different families: Maori, Tangale, Indonesian, Catalan, English, Dutch, German. [Table: # of observed words per experiment: 67, 71, 54, 43 for the four homework-exercise languages; 200 to 800 for the CELEX languages.] Homework exercises: can we generalize correctly from small data? CELEX: can we scale up to larger datasets? Can we handle naturally occurring datasets that have more irregularity?

118 118 Evaluation Setup r,εz ɪ gn’e ɪʃ n d’æmz riz’ajnz

119 119 Evaluation Setup dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən did we guess this pronunciation right?

120 Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky.000001 … Exploring the Evaluation Metrics 120 1-best error rate – Is the 1-best correct? *

121 Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky.000001 … Exploring the Evaluation Metrics 121 1-best error rate – Is the 1-best correct? Cross Entropy – What is the probability of the correct answer?

122 Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky.000001 … Exploring the Evaluation Metrics 122 1-best error rate – Is the 1-best correct? Cross Entropy – What is the probability of the correct answer? Expected Edit Distance – How close am I on average?

123 Exploring the Evaluation Metrics. [Table: a predicted distribution over surface forms, e.g. dæmeɪʃən .80, dæmneɪʃən .10, dæmineɪʃən .001, …] 1-best error rate: is the 1-best correct? Cross-entropy: what is the probability of the correct answer? Expected edit distance: how close am I on average? Average over many training-test splits.

124 Evaluation Metrics: (Lower is Always Better) – 1-best error rate (did we get it right?) – cross-entropy (what probability did we give the right answer?) – expected edit-distance (how far away on average are we?) – Average each metric over many training-test splits Comparisons: – Lower Bound: Phonology as noisy concatenation – Upper Bound: Oracle URs from linguists 124
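(A sketch of the three metrics, with a predicted distribution represented as a plain dict from candidate surface forms to probabilities rather than as a weighted automaton. Levenshtein distance stands in for whatever edit distance the evaluation actually uses.)

```python
import math

def edit_distance(a, b):
    """Levenshtein distance by dynamic programming (rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # delete ca
                           cur[j - 1] + 1,         # insert cb
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def evaluate(pred_dist, gold):
    """1-best error, cross-entropy (bits), and expected edit distance."""
    one_best = max(pred_dist, key=pred_dist.get)
    err = 0.0 if one_best == gold else 1.0
    xent = -math.log2(pred_dist.get(gold, 1e-12))
    exp_edit = sum(p * edit_distance(cand, gold)
                   for cand, p in pred_dist.items())
    return err, xent, exp_edit

print(evaluate({"dæmeɪʃən": 0.80, "dæmneɪʃən": 0.10, "dæmineɪʃən": 0.10},
               gold="dæmneɪʃən"))
# err = 1.0 (the 1-best was dæmeɪʃən), xent is about 3.32 bits, expected edit distance = 0.9
```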

125 Evaluation Philosophy We’re evaluating a language learner, on languages we didn’t examine when designing the learner. We directly evaluate how well our learner predicts held-out words that the learner didn’t see. No direct evaluation of intermediate steps: – Did we get the “right” underlying forms? – Did we learn a “simple” or “natural” phonology? – It’s hard to judge the answers. Anyway, we only want the answers to be “yes” because we suspect that this will give us a more predictive theory. So let’s just see if the theory is predictive. Proof is in the pudding! Caveat: Linguists and child language learners also have access to other kinds of data that we’re not considering yet. 125

126 Results (using Loopy Belief Propagation for inference) 126

127 German Results 127 Error Bars with bootstrap resampling

128 CELEX Results 128

129 Phonological Exercise Results 129

130 Gold UR Recovery 130

131 Formalizing Our Setup Many scoring functions on strings (e.g., our phonology model) can be represented using FSMs 131

132 1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology 132 What class of functions will we allow for the factors? (black squares) dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

133 Real-Valued Functions on Strings. We’ll need to define some nonnegative functions: f(x) = score of a string; f(x,y) = score of a pair of strings. Can represent deterministic processes: f(x,y) ∈ {1,0}, e.g. is y the observable result of deleting latent material from x? Can represent probability distributions, e.g., f(x) = p(x) under some “generating process”; f(x,y) = p(x,y) under some “joint generating process”; f(x,y) = p(y | x) under some “transducing process”.

134 Restrict to Finite-State Functions. One string input → Boolean output (acceptor with arcs such as a, c); add weights (a/.5, c/.7, ε/.5, …) → real output. Two string inputs (on 2 tapes: arcs such as a:x, c:z, ε:y), again unweighted or weighted (a:x/.5, c:z/.7, ε:y/.5, …). Path weight = product of arc weights. Score of input = total weight of accepting paths.

135 Example: Stochastic Edit Distance, p(y|x). [Figure: a one-state transducer with arcs a:ε, b:ε (O(k) deletion arcs), ε:a, ε:b (O(k) insertion arcs), a:b, b:a (O(k²) substitution arcs), and a:a, b:b (O(k) identity arcs).] Likely edits = high-probability arcs.

136 Computing p(y|x). Given (x,y), construct a graph of all accepting paths in the original FSM. These are different explanations for how x could have been edited to yield y (x-to-y alignments). Use dynamic programming to find the highest-probability path, or the total probability of all paths. [Figure: the alignment lattice for an example pair, indexed by position in the upper string (c l a r a) × position in the lower string.]
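(A minimal sketch of that lattice computation: sum the probabilities of all edit paths that turn x into y. The context-independent copy/substitute/insert/delete probabilities below are made up, and the normalization details of a real stochastic transducer are glossed over.)

```python
from functools import lru_cache

# Made-up, context-independent edit probabilities; a real model conditions
# these on the surrounding context and on the particular symbols involved.
P_COPY, P_SUB, P_INS, P_DEL, P_STOP = 0.80, 0.05, 0.05, 0.05, 0.05

def prob_y_given_x(x, y):
    """Total probability of all alignments (edit paths) from x to y."""
    @lru_cache(maxsize=None)
    def alpha(i, j):
        # Total probability of producing y[:j] from x[:i].
        if i == 0 and j == 0:
            return 1.0
        total = 0.0
        if i > 0 and j > 0:                       # copy or substitute
            total += alpha(i - 1, j - 1) * (P_COPY if x[i - 1] == y[j - 1]
                                            else P_SUB)
        if i > 0:
            total += alpha(i - 1, j) * P_DEL      # delete x[i-1]
        if j > 0:
            total += alpha(i, j - 1) * P_INS      # insert y[j-1]
        return total

    return alpha(len(x), len(y)) * P_STOP

print(prob_y_given_x("rizaign#s", "rizainz"))
```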

137 Why restrict to finite-state functions? Can always compute f(x,y) efficiently: construct the graph of accepting paths by FSM composition; sum over them via dynamic programming, or by solving a linear system. Finite-state functions are closed under useful operations: Marginalization: h(x) = ∑_y f(x,y). Pointwise product: h(x) = f(x) ∙ g(x). Join: h(x,y,z) = f(x,y) ∙ g(y,z).

138 Define a function family Use finite-state machines (FSMs). The arc weights are parameterized. We tune the parameters to get weights that predict our training data well. The FSM topology defines a function family. In practice, generalizations of stochastic edit distance.  So, we are learning the edit probabilities.  With more states, these can depend on left and right context. 138

139 Probabilistic FSTs 139

140 Probabilistic FSTs 140

141 Finite-State Graphical Model Over String-Valued Random Variables. ● Joint distribution over many strings. ● Variables range over Σ*, the infinite set of all strings. ● Relations among variables are usually specified by (multi-tape) FSTs. A probabilistic approach to language change (Bouchard-Côté et al., NIPS 2008). Graphical models over multiple strings (Dreyer and Eisner, EMNLP 2009). Large-scale cognate recovery (Hall and Klein, EMNLP 2011).

142 Useful 3-tape FSMs. f(x,y,z) = p(z | x, y); typically z is functionally dependent on x,y, so this represents a deterministic process. Concatenation: f(dog, s, dogs) = 1. Binyan: f(ktb, a _ _ u _, aktub) = 1. Remark: WLOG, we can do everything with 2-tape FSMs. Similar to binary constraints in CSP.
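(The concatenation factor, for instance, is just an indicator function; a non-FST sketch of it, with an optional '#' morpheme boundary as in the underlying forms shown earlier:)

```python
def concat_factor(x, y, z, sep=""):
    """f(x, y, z) = 1 if z is the concatenation of x and y.
    Pass sep="#" to require an explicit morpheme boundary, as in
    underlying forms like rizaign#s."""
    return 1.0 if z == x + sep + y else 0.0

print(concat_factor("dog", "s", "dogs"))                     # 1.0, as on the slide
print(concat_factor("rizaign", "s", "rizaign#s", sep="#"))   # 1.0
```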

143 Computational Hardness Ordinary graphical model inference is sometimes easy and sometimes hard, depending on graph topology But over strings, it can be hard even with simple topologies and simple finite-state factors  143

144 Simple model family can be NP-hard. Multi-sequence alignment problem: generalize edit distance to k strings of length O(n); dynamic programming would seek the best path in a hypercube of size O(n^k). Similar to the Steiner string problem (“consensus”). [Figure: the alignment lattice again.]

145 Simple model family can be undecidable (!). Post’s Correspondence Problem (1946): given a 2-tape FSM f(x,y) of a certain form, is there a string z for which f(z,z) > 0? No Turing machine can decide this in general. So no Turing machine can determine in general whether this simple factor graph has any positive-weight solutions. [Figure: a factor graph with a single string variable z attached to the factor f on both tapes, and a worked PCP example with z = bbaabbbaa built up tile by tile.]

146 Inference by Belief Propagation 146

147 147

148 148

149 149

150 150

151 151

152 152

153 153

154 Loopy belief propagation (BP) The beautiful observation (Dreyer & Eisner 2009):  Each message is a 1-tape FSM that scores all strings.  Messages are iteratively updated based on other messages, according to the rules of BP.  The BP rules require only operations under which FSMs are closed! Achilles’ heel:  The FSMs may grow large as the algorithm iterates. So the algorithm may be slow, or not converge at all. 154
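(A sketch of the two BP update rules in a deliberately simplified setting where every message is a plain dict over a small explicit candidate set of strings. In the actual system each message is a weighted FSA over all of Σ*, and these sums and products are carried out by the finite-state operations listed on slide 137.)

```python
from itertools import product
from math import prod

def msg_var_to_factor(candidates, other_factor_msgs):
    """Variable-to-factor message: pointwise product of the messages that
    arrive at this variable from its OTHER neighboring factors."""
    return {x: prod(m.get(x, 0.0) for m in other_factor_msgs)
            for x in candidates}

def msg_factor_to_var(factor, target_candidates, other_var_msgs):
    """Factor-to-variable message: for each value of the target variable,
    sum the factor's score over all joint values of the other neighboring
    variables, weighted by their incoming messages.

    `factor` maps a tuple (target_value, *other_values) to a nonnegative
    score; `other_var_msgs` holds one dict per other neighboring variable."""
    out = {}
    for x in target_candidates:
        total = 0.0
        for combo in product(*(list(m) for m in other_var_msgs)):
            w = prod(m[v] for m, v in zip(other_var_msgs, combo))
            total += factor((x,) + combo) * w
        out[x] = total
    return out
```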

155 Belief Propagation (BP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X6X6 X5X5

156 dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz 156

157 Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Factor to Variable Messages 157

158 Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Variable to Factor Messages 158

159 Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Encoded as Finite- State Machines r in g u e ε s e h a r in g u e ε e e s e h a r in g u e ε e e s e h a r in g u e ε e e s e h a r in g u e ε s e h a r in g u e ε s e h a 159

160 Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz 160

161 Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz r in g u e ε e e s e h a r in g u e ε e e s e h a 161

162 Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz r i n g u e ε e e s e h a r i n g u e ε e e s e h a r i n g u e ε e e s e h a Point-wise product (finite-state intersection) yields marginal belief 162

163 Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmnz Distribution Over Underlying Forms: UR Prob rizajgnz.95 rezajnz.02 rezigz.02 rezgz.0001 … chomsky.000001 … r i n g u e ε e e s e h a r i n g u e ε e e s e h a r i n g u e ε e e s e h a 163

164 Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5

165 X1X1 X2X2 X3X3 X4X4 X7X7 X5X5

166 Belief Propagation (BP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X6X6 X5X5 r in g u e ε e e s e h a r in g u e ε s e h a r in g u e ε e e s e h a r in g u e ε s e h a

167 Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a

168 X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 C C r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Computation of belief results in large state space

169 Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 C C r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Computation of belief results in large state space What a hairball!

170 Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Approximation Required!!!

171 BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex. X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a

172 BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a

173 BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a aa

174 BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a a a a

175 BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a a a a a aa a a a a aaa aaaaaaaaa

176 Inference by Expectation Propagation 176

177 Expectation Propagation: The Details. A belief at a variable is just the pointwise product of its incoming messages. Key idea: for each message, we seek an approximate message that lies in a tractable family. Algorithm: for each message, compute it, then replace it by its approximation before passing it on.

178 Expectation Propagation (EP) X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 exponential-family approximations inside Belief at X 3 will be simple! Messages to and from X 3 will be simple! 178

179 Expectation propagation (EP) EP solves the problem by simplifying each message once it is computed.  Projects the message back into a tractable family. 179

180 Expectation Propagation (EP) exponential-family approximations inside Belief at X 3 will be simple! Messages to and from X 3 will be simple! X3X3

181 Expectation propagation (EP). EP solves the problem by simplifying each message once it is computed: it projects the message back into a tractable family. In our setting, we can use n-gram models: f_approx(x) = product of the weights of the n-grams in x. Just need to choose weights that give a good approximation. Best to use variable-length n-grams.
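(A sketch of one such projection, again pretending a message or belief is an explicit dict over strings: fit a character-bigram model to its expected bigram counts, i.e. the maximum-likelihood bigram fit. The real system projects weighted automata and prefers variable-length n-grams.)

```python
from collections import defaultdict

def project_to_bigrams(belief, bos="^", eos="$"):
    """Approximate a distribution over strings by a character-bigram model
    whose conditionals come from the belief's expected bigram counts."""
    pair = defaultdict(float)     # expected count of (previous, next) char
    ctx = defaultdict(float)      # expected count of the previous char
    for s, p in belief.items():
        padded = bos + s + eos
        for a, b in zip(padded, padded[1:]):
            pair[(a, b)] += p
            ctx[a] += p
    return {ab: c / ctx[ab[0]] for ab, c in pair.items()}

def bigram_score(model, s, bos="^", eos="$"):
    """Score that the fitted bigram model assigns to a string s."""
    padded = bos + s + eos
    p = 1.0
    for a, b in zip(padded, padded[1:]):
        p *= model.get((a, b), 0.0)
    return p

approx = project_to_bigrams({"rizajgnz": 0.95, "rezajnz": 0.05})
print(bigram_score(approx, "rizajgnz"))   # some mass leaks to other strings
```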

182 Expectation Propagation (EP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a

183 X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a

184 X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a

185 X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a

186 X1X1 X2X2 X3X3 X4X4 X7X7 X5X5

187 Expectation propagation (EP) 187

188 Variable Order Approximations Use only the n-grams you really need!

189 Approximating beliefs with n-grams. How to optimize this? Option 1: Greedily add n-grams by expected count in f. Stop when adding the next batch hurts the objective. Option 2: Select n-grams from a large set using a convex relaxation + projected gradient (= tree-structured group lasso). Must incrementally expand the large set (“active set” method).

190 Results using Expectation Propagation 190

191 Trigram EP (Cyan) – slow, very accurate Baseline (Black) – slow, very accurate (pruning) Penalized EP (Red) – pretty fast, very accurate Bigram EP (Blue) – fast but inaccurate Unigram EP (Green) – fast but inaccurate Speed ranking (upper graph) Accuracy ranking (lower graph) … essentially opposites … 191

192 192

193 Inference by Dual Decomposition Exact 1-best inference! (can’t terminate in general because of undecidability, but does terminate in practice) 193

194 Challenges in Inference. A global discrete optimization problem. Variables range over an infinite set … cannot be solved by ILP or even brute force. Undecidable! Our previous papers used approximate algorithms: Loopy Belief Propagation, or Expectation Propagation. Q: Can we do exact inference? A: If we can live with 1-best and not marginal inference, then we can use Dual Decomposition … which is exact (if it terminates! the problem is undecidable in general …).

195 Graphical Model for Phonology. Jointly decide the values of the inter-dependent latent variables, which range over an infinite set. 1) Morpheme URs; 2) Word URs (concatenation, e.g.); 3) Word SRs (phonology, PFST). [Figure: morphemes s, rizajgn, eɪʃən, dæmn, rεzign; underlying words rεzɪgn#eɪʃən, rizajn#s, dæmn#eɪʃən, dæmn#s; surface words r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz.]

196 General Idea of Dual Decomp 196 srizajgn e ɪʃ ən dæmn rεz ɪ gn#e ɪʃ ən rizajn#s dæmn#e ɪʃ ən dæmn#s r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rεzignrεzign e ɪʃ ən

197 General Idea of Dual Decomp zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 197

198 I prefer rεz ɪ gn I prefer rizajn General Idea of Dual Decomp zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 198

199 zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 199

200 Substring Features and Active Set z rizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεzɪgnrεzɪgn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1 200 I prefer rεz ɪ gn Less ε, ɪ, g; more i, a, j (to match others) Less ε, ɪ, g; more i, a, j (to match others) I prefer rizajn Less i, a, j; more ε, ɪ, g (to match others) Less i, a, j; more ε, ɪ, g (to match others)

201 Features: “Active set” method How many features? Infinitely many possible n-grams! Trick: Gradually increase feature set as needed. – Like Paul & Eisner (2012), Cotterell & Eisner (2015) 1.Only add features on which strings disagree. 2.Only add abcd once abc and bcd already agree. – Exception: Add unigrams and bigrams for free. 201

202 Fragment of Our Graph for Catalan 202 ? ? grizos ? gris ? ? grize ?? grizes ? ? Stem of “grey” Separate these 4 words into 4 subproblems as before …

203 203 ? ? grizos ? gris ? ? grize ? ? ? ? grizes Redraw the graph to focus on the stem …

204 204 ? ? grizos ? gris ? ? grize ? ? grizes ? ? ?? ? Separate into 4 subproblems – each gets its own copy of the stem

205 205 ? ? grizos ? gris ε ? grize ? ? grizes ? ? εε ε nonzero features: { } Iteration: 1

206 206 ? ? grizos ? gris g ? grize ? ? grizes ? ? gg g nonzero features: { } Iteration: 3

207 207 ? ? grizos ? gris ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$ } Iteration: 4 Feature weights (dual variable)

208 208 ? ? grizos ? gris ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 5 Feature weights (dual variable)

209 209 ? ? grizos ? gris ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 6 Iteration: 13 Feature weights (dual variable)

210 210 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 14 Feature weights (dual variable)

211 211 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 17 Feature weights (dual variable)

212 212 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizegriz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 18 Feature weights (dual variable)

213 213 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizegriz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 19 Iteration: 29 Feature weights (dual variable)

214 214 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 30 Feature weights (dual variable)

215 215 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 30 Converged!

216 Why n-gram features? Positional features don’t understand insertion: “I’ll try to arrange for r not i at position 2, i not z at position 3, z not ε at position 4.” In contrast, our “z” feature counts the number of “z” phonemes, without regard to position. These solutions already agree on “g”, “i”, “z” counts … they’re only negotiating over the “r” count. [Figure: copies giz vs. griz; “I need more r’s.”]

217 Why n-gram features? 217 Adjust weights λ until the “r” counts match: Next iteration agrees on all our unigram features: – Oops! Features matched only counts, not positions  – But bigram counts are still wrong … so bigram features get activated to save the day – If that’s not enough, add even longer substrings … giz griz I need more r’s … somewhere. girz griz I need more gr, ri, iz, less gi, ir, rz.

218 Results using Dual Decomposition 218

219 7 Inference Problems (graphs) EXERCISE (small) o 4 languages: Catalan, English, Maori, Tangale o 16 to 55 underlying morphemes. o 55 to 106 surface words. CELEX (large) o 3 languages: English, German, Dutch o 341 to 381 underlying morphemes. o 1000 surface words for each language. 219 # vars (unknown strings) # subproblems

220 Experimental Setup. o Model 1: very simple phonology with only 1 parameter, trained by grid search. o Model 2S: sophisticated phonology with phonological features, trained by hand-crafted morpheme URs: full supervision. o Model 2E: sophisticated phonology as in Model 2S, trained by EM. o Evaluating inference on recovered latent variables under the different settings.

221 Experimental Questions o Is exact inference by DD practical? o Does it converge? o Does it get better results than approximate inference methods? o Does exact inference help EM? 221

222 ● DD seeks the best λ via a subgradient algorithm: reduce the dual objective → tighten the upper bound on the primal objective. ● If λ gets all sub-problems to agree (x_1 = … = x_K): constraints satisfied → the dual value is also the value of a primal solution → which must be the max primal! (and min dual). [Figure: primal objective (a function of the strings x) ≤ dual objective (a function of the weights λ).]
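(A schematic sketch of this loop: each subproblem decodes its own copy of the shared string under λ-adjusted scores, and λ takes a subgradient step in the direction of the feature-count disagreements. solve_subproblem is a placeholder for the talk's FST-based decoders; it is assumed to maximize its own score plus the λ-weighted n-gram counts of its answer.)

```python
def ngram_features(s, max_n=2):
    """Counts of character n-grams up to length max_n (toy feature set)."""
    feats = {}
    for n in range(1, max_n + 1):
        for i in range(len(s) - n + 1):
            feats[s[i:i + n]] = feats.get(s[i:i + n], 0) + 1
    return feats

def dual_decomposition(subproblems, solve_subproblem, n_iters=100, step=0.5):
    """Subgradient ascent on the dual; one multiplier vector per subproblem."""
    lam = [dict() for _ in subproblems]
    for _ in range(n_iters):
        # Each subproblem returns its preferred value of the shared string.
        sols = [solve_subproblem(sp, lam[k]) for k, sp in enumerate(subproblems)]
        if all(s == sols[0] for s in sols):
            return sols[0]                        # agreement = exact MAP
        # Subgradient step: push each copy's n-gram counts toward the mean.
        feats = [ngram_features(s) for s in sols]
        for g in set().union(*feats):
            mean = sum(f.get(g, 0) for f in feats) / len(feats)
            for k, f in enumerate(feats):
                lam[k][g] = lam[k].get(g, 0.0) - step * (f.get(g, 0) - mean)
    return sols[0]                                # not converged; best guess
```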

223 Convergence behavior (full graph): Catalan, Maori, English, Tangale. [Plots: the dual objective decreasing (tightening the upper bound) and the primal objective increasing (improving the strings) until they meet at the optimum.]

224 Comparisons. ● Compare DD with two types of Belief Propagation (BP) inference. [2×2 layout: Approximate MAP inference = max-product BP (baseline; a Viterbi approximation). Approximate marginal inference = sum-product BP (TACL 2015; a variational approximation). Exact MAP inference = dual decomposition (this paper). Exact marginal inference: we don’t know how!]

225 Inference accuracy. Model 1 = trivial phonology; Model 2S = oracle phonology; Model 2E = learned phonology (inference used within EM).
Approximate MAP inference (max-product BP) (baseline): Model 1, EXERCISE: 90%; Model 1, CELEX: 84%; Model 2S, CELEX: 99%; Model 2E, EXERCISE: 91%.
Approximate marginal inference (sum-product BP) (TACL 2015): Model 1, EXERCISE: 95%; Model 1, CELEX: 86%; Model 2S, CELEX: 96%; Model 2E, EXERCISE: 95%.
Exact MAP inference (dual decomposition) (this paper): Model 1, EXERCISE: 97%; Model 1, CELEX: 90%; Model 2S, CELEX: 99%; Model 2E, EXERCISE: 98%.
(Slide annotations: improves, improves more!, worse.)

226 Conclusion A general DD algorithm for MAP inference on graphical models over strings. On the phonology problem, terminates in practice, guaranteeing the exact MAP solution. Improved inference for supervised model; improved EM training for unsupervised model. Try it for your own problems generalizing to new strings! 226

227 observed data hidden data probability distribution Future Work 227

228 Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to figure that out from raw text:  Related spellings  Related contexts 228 shared morphemes?

229 Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to figure that out from raw text:  “resignation” and “resigns” are spelled similarly And appear in similar semantic contexts (topics, dependents)  “resignation” and “damnation” are spelled similarly And appear in similar syntactic contexts (singular nouns)  Abstract morphemes fall into classes: RESIGN-, DAMN- are verbs while -ATION, -S attach to verbs 229

230 How is morphology like clustering?

231 Linguistics quiz: Find a morpheme Blah blah blah snozzcumber blah blah blah. Blah blah blahdy abla blah blah. Snozzcumbers blah blah blah abla blah. Blah blah blah snezzcumbri blah blah snozzcumber.

232 Dreyer & Eisner 2011 – “select & mutate” Many possible morphological slots

233 Many possible phylogenies Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati NEW Andrews, Dredze, & Eisner 2014 – “select & mutate”

234 Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to get that from raw text. To infer the abstract morphemes from context, we need to extend our generative story to capture regularity in morpheme sequences.  Neural language models …  But, must deal with unknown, unbounded vocabulary of morphemes 234

235 Future: Improving the factors. More attention to representations and features (autosegmental phonology). Algorithms require us to represent each factor as a WFST defining ψ(x,y). Good modeling reasons to use a Bayes net: then ψ(x,y) = p(y | x), so each factor is a PFST. Alas, a PFST is left/right asymmetric (label bias)! Can we substitute a WFST that defines p(y | x) only up to a normalizing constant Z(x)? “Double intractability,” since x is unknown: expensive even to explore x by Gibbs sampling! How about features that depend on all of x? CRF / RNN / LSTM?

236 Reconstructing the (multilingual) lexicon:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS
(other columns would include translations, topics, counts, embeddings, …)

237 Conclusions Unsupervised learning of how all the words in a language (or across languages) are interrelated.  This is what kids and linguists do.  Given data, estimate a posterior distribution over the infinite probabilistic lexicon.  While training parameters that model how lexical entries are related (language-specific derivational processes or soft constraints). Starting to look feasible!  We now have a lot of the ingredients – generative models and algorithms. 237

