1
1 ASRU, Dec. 2015. Graphical Models Over String-Valued Random Variables. Jason Eisner, with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, Michael Paul
2
2 ASRU, Dec. 2015. Probabilistic Inference of Strings / Pronunciation Dictionaries. Jason Eisner, with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, Michael Paul
3
3
4
(Figure: the lexicon of word types sits at the center of a web of variables: semantics, sentences, discourse, context, resources, and N tokens, linked by relations such as entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, annotation.) To recover variables, model and exploit their correlations.
5
5
6
Bayesian View of the World observed data hidden data probability distribution 6
7
Bayesian NLP Some good NLP questions: Underlying facts or themes that help explain this document collection? An underlying parse tree that helps explain this sentence? An underlying meaning that helps explain that parse tree? An underlying grammar that helps explain why these sentences are structured as they are? An underlying grammar or evolutionary history that helps explain why these words are spelled as they are? 7
8
Today’s Challenge Too many words in a language! 8
9
Natural Language is Built from Words 9
10
Can store info about each word in a table:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS
(other columns would include translations, topics, counts, embeddings, …)
11
Problem: Too Many Words! Google analyzed 1 trillion words of English text Found > 13M distinct words with count ≥ 200 The problem isn’t storing such a big table … it’s acquiring the info for each row separately Need lots of evidence, or help from human speakers Hard to get for every word of the language Especially hard for complex or “low-resource” languages Omit rare words? Maybe, but many sentences contain them (Zipf’s Law) 11
12
Technically speaking, # words = ∞. Really the set of (possible) words is Σ*: names, neologisms, typos, and productive processes: friend → friendless → friendlessness → friendlessnessless → …; hand+bag → handbag (sometimes can iterate).
13
Technically speaking, # words = ∞. Really the set of (possible) words is Σ*: names, neologisms, typos, and productive processes: friend → friendless → friendlessness → friendlessnessless → …; hand+bag → handbag (sometimes can iterate). Turkish word: uygarlaştıramadıklarımızdanmışsınızcasına = uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına, "(behaving) as if you are among those whom we could not cause to become civilized".
14
14 A typical Polish verb ("to contain"): about 100 inflected forms per verb, sort of predictable from one another (verbs are more or less regular).
Imperfective infinitive: zawierać; perfective infinitive: zawrzeć.
Present (imperfective): zawieram, zawierasz, zawiera; zawieramy, zawieracie, zawierają.
Past (imperfective): zawierałem/zawierałam, zawierałeś/zawierałaś, zawierał/zawierała/zawierało; zawieraliśmy/zawierałyśmy, zawieraliście/zawierałyście, zawierali/zawierały.
Past (perfective): zawarłem/zawarłam, zawarłeś/zawarłaś, zawarł/zawarła/zawarło; zawarliśmy/zawarłyśmy, zawarliście/zawarłyście, zawarli/zawarły.
Future (imperfective): będę zawierał/zawierała, będziesz zawierał/zawierała, będzie zawierał/zawierała/zawierało; będziemy zawierali/zawierały, będziecie zawierali/zawierały, będą zawierali/zawierały.
Future (perfective): zawrę, zawrzesz, zawrze; zawrzemy, zawrzecie, zawrą.
Conditional (imperfective): zawierałbym/zawierałabym, zawierałbyś/zawierałabyś, zawierałby/zawierałaby/zawierałoby; zawieralibyśmy/zawierałybyśmy, zawieralibyście/zawierałybyście, zawieraliby/zawierałyby.
Conditional (perfective): zawarłbym/zawarłabym, zawarłbyś/zawarłabyś, zawarłby/zawarłaby/zawarłoby; zawarlibyśmy/zawarłybyśmy, zawarlibyście/zawarłybyście, zawarliby/zawarłyby.
Imperative (imperfective): zawieraj, zawierajmy, zawierajcie, niech zawiera, niech zawierają.
Imperative (perfective): zawrzyj, zawrzyjmy, zawrzyjcie, niech zawrze, niech zawrą.
Present active participles: zawierający, -a, -e; -y, -e. Present passive participles: zawierany, -a, -e; -, -e. Past passive participles: zawarty, -a, -e; -, -te. Adverbial participle: zawierając.
16
Solution: Don’t model every cell separately Noble gases Positive ions 16
17
Can store info about each word in a table:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS
(other columns would include translations, topics, counts, embeddings, …)
18
What’s in the table? NLP strings are diverse … Use Orthographic (spelling) Phonological (pronunciation) Latent (intermediate steps not observed directly) Size Morphemes (meaningful subword units) Words Multi-word phrases, including “named entities” URLs 18
19
Language English, French, Russian, Hebrew, Chinese, … Related languages (Romance langs, Arabic dialects, …) Dead languages (common ancestors) – unobserved? Transliterations into different writing systems Medium Misspellings Typos Wordplay Social media 19 What’s in the table? NLP strings are diverse …
20
Some relationships within the table: spelling ↔ pronunciation; word ↔ noisy word (e.g., with a typo); word ↔ related word in another language (loanwords, language evolution, cognates); singular ↔ plural (for example); (root, binyan) ↔ word; underlying form ↔ surface form.
21
Chains of relations can be useful. Misspelling or pun = spelling → pronunciation → spelling. Cognate = word → historical parent → historical child.
22
Reconstructing the (multilingual) lexicon:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS
(other columns would include translations, topics, distributional info such as counts, …)
Ultimate goal: Probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text. Needed: Exploit the relationships (arrows). May have to discover those relationships. Approach: Linguistics + generative modeling + statistical inference. Modeling ingredients: Finite-state machines, graphical models, CRP. Inference ingredients: MCMC, BP/EP, DD.
23
Today’s Focus: Phonology (but methods also apply to other relationships among strings) 23
24
What is Phonology? Orthography: cat. Phonology: [kæt]. Phonology explains regular sound patterns.
25
What is Phonology? Orthography: cat. Phonology: [kæt]. Phonetics: (acoustics). Phonology explains regular sound patterns; it is not phonetics, which deals with acoustics.
26
Q: What do phonologists do? A: They find patterns among the pronunciations of words. 26
27
A Phonological Exercise Tenses Verbs [tɔk] [tɔks] [tɔkt] 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] TALK THANK HACK CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] 27
28
Matrix Completion: Collaborative Filtering Movies Users -37 29 19 29 -36 67 77 22 -24 61 74 12 -79 -41 -52 -39 28
29
Matrix Completion: Collaborative Filtering 29 19 Movies Users 29 67 77 22 61 74 12 -79 -41 -39 -6 -3 2 [ 4 1 -5] [ 7 -2 0] [ 6 -2 3] [-9 1 4] [ 3 8 -5] [ [ 9 -2 1 [ [ 9 -7 2 [ [ 4 3 -2 [ [ -37 -36 -24 -52 29
30
Matrix Completion: Collaborative Filtering Prediction! 59 -80 6 46 -37 29 19 29 -36 67 77 22 -24 61 74 12 -79 -41 -52 -39 -6 -3 2 [ [ 9 -2 1 [ [ 9 -7 2 [ [ [ [ [ 4 1 -5] [ 7 -2 0] [ 6 -2 3] [-9 1 4] [ 3 8 -5] Movies Users 4 3 -2 [ 30
31
Matrix Completion: Collaborative Filtering. A user vector [1,-4,3] and a movie vector [-5,2,1] have dot product -10; with Gaussian noise, the observed rating is -11.
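A minimal numpy sketch of the collaborative-filtering idea on these slides (the noise level and most vectors below are invented for illustration, not the slides' actual data): each observed rating is modeled as the dot product of a user vector and a movie vector plus Gaussian noise, and a missing cell is predicted by the dot product alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical low-dimensional embeddings (3 latent traits per user / movie).
user_vecs = {"u1": np.array([1.0, -4.0, 3.0]), "u2": np.array([7.0, -2.0, 0.0])}
movie_vecs = {"m1": np.array([-5.0, 2.0, 1.0]), "m2": np.array([-6.0, -3.0, 2.0])}

def sample_rating(user, movie, noise_sd=1.0):
    """Generative story: rating = dot(user_vec, movie_vec) + Gaussian noise."""
    mean = user_vecs[user] @ movie_vecs[movie]
    return mean + rng.normal(0.0, noise_sd)

def predict_rating(user, movie):
    """Prediction for a missing cell: just the dot product (the noise has mean 0)."""
    return user_vecs[user] @ movie_vecs[movie]

print(sample_rating("u1", "m1"))   # a noisy observation near -10, like the slide's -11
print(predict_rating("u1", "m1"))  # exactly -10 = 1*(-5) + (-4)*2 + 3*1
```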
32
Matrix Completion: Collaborative Filtering Prediction! 59 -80 6 46 -37 29 19 29 -36 67 77 22 -24 61 74 12 -79 -41 -52 -39 -6 -3 2 [ [ 9 -2 1 [ [ 9 -7 2 [ [ [ [ [ 4 1 -5] [ 7 -2 0] [ 6 -2 3] [-9 1 4] [ 3 8 -5] Movies Users 4 3 -2 [ 32
33
A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Tenses Verbs [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] 33
34
A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ 34
35
A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ Suffixes Stems 35
36
A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæk] [kɹæks] [kɹækt] [slæp] [slæps] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ Prediction! THANK Suffixes Stems 36
37
Why “talks” sounds like that: tɔk + s → (Concatenate) → tɔks (“talks”).
38
A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /bæt/ /koʊd/ /slæp/ /kɹæk/ THANK 38
39
A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /bæt/ /koʊd/ /slæp/ /kɹæk/ z instead of s ɪd instead of t THANK 39
40
A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT EAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] [it] [eɪt] [itən] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /it/ /bæt/ /koʊd/ /slæp/ /kɹæk/ eɪt instead of i t ɪd 40
41
Why “codes” sounds like that: koʊd + s → (Concatenate) → koʊd#s → (Phonology, stochastic) → koʊdz (“codes”). Modeling word forms using latent underlying morphs and phonology. Cotterell et al., TACL 2015.
42
Why “resignation” sounds like that: rizaign + ation → (Concatenate) → rizaign#ation → (Phonology, stochastic) → rεzɪgneɪʃn (“resignation”).
43
Fragment of Our Graph for English. 1) Morphemes: rizaign, s, dæmn, eɪʃən (the 3rd-person singular suffix s is very common!). 2) Underlying words (Concatenation): rizaign#s, rizaign#eɪʃən, dæmn#s, dæmn#eɪʃən. 3) Surface words (Phonology): rizˈajnz (“resigns”), rˌεzɪgnˈeɪʃn (“resignation”), dˈæmz (“damns”), dˌæmnˈeɪʃn (“damnation”).
44
Handling Multimorphemic Words: “geliebt” (German: loved). Morphemes gə + liːb + t → underlying gəliːbt → surface gəliːpt. Matrix completion: each word built from one stem (row) + one suffix (column). WRONG. Graphical model: a word can be built from any # of morphemes (parents). RIGHT.
45
Limited to concatenation? No, could extend to templatic morphology … 45
46
A (Simple) Model of Phonology 46
47
[1,-4,3] [-5,2,1] -10 -11 Dot Product Gaussian Noise 47
48
rizaign + s → (Concatenate) → rizaign#s → (Phonology S_θ, stochastic) → rizainz (“resigns”).
49
Upper Left ContextUpper Right Context Lower Left Context Phonology as an Edit Process r r i i z z a a i i g g n n s s 49
50
Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r COPY 50
51
Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r i i COPY 51
52
Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r i i COPY z z 52
53
Upper Left Context Lower Left Context i i i i Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z COPY a a 53
54
i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY 54
55
i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ 55
56
i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ COPY n n 56
57
i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ n n SUB z z 57
58
i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ n n SUB z z 58
59
i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ COPY n n 59
60
Phonology as an Edit Process: the upper string r i z a i g n s is read left to right, and the lower string r i z a i has been written so far; the g is now deleted (DEL → ε). The upper left context, upper right context, and lower left context feed a feature function whose weights determine the action probabilities. Action | Prob: DEL .75, COPY .01, SUB(A) .05, SUB(B) .03, …, INS(A) .02, INS(B) .01, …
61
Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features 61
62
Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form 62
63
i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form Transduction 63
64
Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form Transduction Upper String 64
65
Phonological Attributes Binary Attributes (+ and -) 65
66
i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ 66
67
Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Faithfulness Features EDIT(g, ɛ ) EDIT(+cons, ɛ ) EDIT(+voiced, ɛ ) 67
68
i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY Markedness Features BIGRAM(a, i) BIGRAM(-high, -low) BIGRAM(+back, -back) 68
69
i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY Markedness Features BIGRAM(a, i) BIGRAM(-high, -low) BIGRAM(+back, -back) Inspired by Optimality Theory: A popular Constraint-Based Phonology Formalism 69
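A toy sketch (not the authors' implementation) of the locally normalized edit model sketched on the last several slides: given the upper character being read and the contexts, each action (COPY, DEL, SUB(·), INS(·)) is scored by weighted features, including faithfulness-style features on the edit itself and markedness-style features on the resulting lower-tape bigram, and the scores are normalized with a softmax. The feature names and weights below are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical feature weights (learned from data in the real model).
weights = defaultdict(float, {
    "EDIT(g,eps)": 2.0,      # faithfulness: deleting this g is cheap in context
    "COPY": 1.5,
    "BIGRAM(i,g)": -1.0,     # markedness: penalize creating "ig" on the lower tape
})

def actions(alphabet):
    yield "COPY"
    yield "DEL"
    for c in alphabet:
        yield "SUB(%s)" % c
        yield "INS(%s)" % c

def features(action, upper_char, lower_left):
    """Faithfulness features look at the (upper, lower) edit; markedness at the lower bigram."""
    feats = [action if action in ("COPY", "DEL") else action.split("(")[0]]
    if action == "DEL":
        feats.append("EDIT(%s,eps)" % upper_char)
        new_char = None
    elif action == "COPY":
        new_char = upper_char
    else:                                   # SUB(c) or INS(c)
        new_char = action[action.index("(") + 1:-1]
        old = upper_char if action.startswith("SUB") else "eps"
        feats.append("EDIT(%s,%s)" % (old, new_char))
    if new_char is not None and lower_left:
        feats.append("BIGRAM(%s,%s)" % (lower_left[-1], new_char))
    return feats

def action_probs(upper_char, lower_left, alphabet="aginrsz"):
    scores = {a: sum(weights[f] for f in features(a, upper_char, lower_left))
              for a in actions(alphabet)}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}

# Probability of deleting the g of rizaign after having written "rizai" so far:
print(action_probs("g", "rizai")["DEL"])
```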
70
Inference for Phonology 70
71
Bayesian View of the World observed data hidden data probability distribution 71
72
rˌεzɪgnˈeɪʃn, dˌæmnˈeɪʃn, dˈæmz
73
Morphemes: rizaign, s, dæmn, eɪʃən. Underlying words: rizaign#s, rizaign#eɪʃən, dæmn#s, dæmn#eɪʃən. Surface words: rˌεzɪgnˈeɪʃn, rizˈajnz, dˌæmnˈeɪʃn, dˈæmz.
74
Bayesian View of the World: observed data = some of the words; hidden data = the rest of the words, all of the morphs, and the parameter vectors θ, φ; related by a probability distribution.
75
Statistical Inference: from some surface forms of the language (and the probability distribution), infer 1. the underlying forms giving rise to those surface forms, and 2. the surface forms for all other words, as predicted by the underlying forms. (We also learn the edit parameters that best explain the visible part of the iceberg.)
76
Why this matters Phonological grammars are usually hand- engineered by phonologists. Linguistics goal: Create an automated phonologist? Cognitive science goal: Model how babies learn phonology? Engineering goal: Analyze and generate words we haven’t heard before? 76
77
The Generative Story (defines which iceberg shapes are likely) 1. Sample the parameters φ and θ from priors. These parameters describe the grammar of a new language: what tends to happen in the language. 2. Now choose the lexicon of morphs and words: – For each abstract morpheme a ∈ A, sample the morph M(a) ~ M_φ. – For each abstract word a = a₁a₂···, sample its surface pronunciation S(a) from S_θ(· | u), where u = M(a₁)#M(a₂)···. 3. This lexicon can now be used to communicate. A word’s pronunciation is now just looked up, not sampled; so it is the same each time it is used. (Example: rizaign → rizaign#s → rizˈajnz.)
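A schematic sketch of this generative story (assuming priors and the morph/phonology distributions M_φ and S_θ are given; sample_prior, sample_morph, and sample_surface below are placeholder callables, not real library functions):

```python
def generate_lexicon(abstract_morphemes, abstract_words,
                     sample_prior, sample_morph, sample_surface):
    # 1. Sample the grammar parameters phi, theta from their priors.
    phi, theta = sample_prior()

    # 2a. For each abstract morpheme a, sample its underlying morph M(a) ~ M_phi.
    M = {a: sample_morph(a, phi) for a in abstract_morphemes}

    # 2b. For each abstract word a = (a1, a2, ...), concatenate its morphs with "#"
    #     and sample the surface pronunciation S(a) ~ S_theta(. | u).
    S = {}
    for word in abstract_words:                  # e.g. ("RESIGN", "-S")
        u = "#".join(M[a] for a in word)         # e.g. "rizaign#s"
        S[word] = sample_surface(u, theta)       # e.g. "riz'ajnz"

    # 3. The lexicon is now fixed: each word's pronunciation is looked up, not resampled.
    return M, S
```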
78
Why Probability? A language’s morphology and phonology are fixed, but probability models the learner’s uncertainty about what they are. Advantages: – Quantification of irregularity (“singed” vs. “sang”) – Soft models admit efficient learning and inference Our use is orthogonal to the way phonologists currently use probability to explain gradient phenomena 78
79
Basic Methods for Inference and Learning 79
80
Train the Parameters using EM (Dempster et al. 1977) E-Step (“inference”): – Infer the hidden strings (posterior distribution) r,εz ɪ gn’e ɪʃ nd,æmn’e ɪʃ n d’æmz
81
Train the Parameters using EM (Dempster et al. 1977) E-Step (“inference”): – Infer the hidden strings (posterior distribution) dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən
82
Train the Parameters using EM (Dempster et al. 1977). E-Step (“inference”): infer the hidden strings (posterior distribution). M-Step (“learning”): improve the continuous model parameters θ, φ (gradient descent: the E-step provides supervision). Repeat till convergence. (Figure: the edit process aligning rizaign#s with rizˈajnz, deleting the g.)
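A high-level EM skeleton matching this slide (only a sketch: e_step and m_step below stand in for posterior inference over the hidden strings and for the gradient-based update of θ, φ; neither is defined here):

```python
def train_em(observed_surface_forms, init_params, e_step, m_step,
             max_iters=50, tol=1e-4):
    """Alternate inference over hidden strings with parameter updates."""
    params = init_params
    prev_ll = float("-inf")
    for _ in range(max_iters):
        # E-step ("inference"): posterior over underlying morphs / words
        # given the observed surface forms and current parameters.
        posterior, log_likelihood = e_step(observed_surface_forms, params)

        # M-step ("learning"): improve theta, phi by gradient descent,
        # treating the E-step posterior as (soft) supervision.
        params = m_step(posterior, params)

        if log_likelihood - prev_ll < tol:   # converged
            break
        prev_ll = log_likelihood
    return params
```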
83
Directed Graphical Model (defines the probability of a candidate solution). 1) Morphemes: rizaign, s, dæmn, eɪʃən. 2) Underlying words (Concatenation): rizaign#s, rizaign#eɪʃən, dæmn#s, dæmn#eɪʃən. 3) Surface words (Phonology): rizˈajnz, rˌεzɪgnˈeɪʃn, dˈæmz, dˌæmnˈeɪʃn. Inference step: Find high-probability reconstructions of the hidden variables. High probability if each string is likely given its parents.
84
1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology 84 Equivalent Factor Graph (defines the probability of a candidate solution) dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən
85
85
86
86
87
87
88
88
89
89
90
90
91
91
92
92
93
93
94
Directed Graphical Model 94 1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən
95
Equivalent Factor Graph. Each ellipse is a random variable. Each square is a “factor” – a function that jointly scores the values of its few neighboring variables. 1) Morphemes: rizaign, s, dæmn, eɪʃən. 2) Underlying words (Concatenation factors): rizaign#s, rizaign#eɪʃən, dæmn#s, dæmn#eɪʃən. 3) Surface words (Phonology factors): rizˈajnz, rˌεzɪgnˈeɪʃn, dˈæmz, dˌæmnˈeɪʃn.
96
? ? riz’ajnz ? r,εz ɪ gn’e ɪʃ n ? ? riz’ajnd ? ? Dumb Inference by Hill-Climbing 1) Morpheme URs 2) Word URs 3) Word SRs 96
97
foo ? riz’ajnz ? r,εz ɪ gn’e ɪʃ n s ? riz’ajnd da bar Dumb Inference by Hill-Climbing 1) Morpheme URs 2) Word URs 3) Word SRs 97
98
Dumb Inference by Hill-Climbing foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 98
99
Dumb Inference by Hill-Climbing 8e-3 0.01 0.050.02 foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 99
100
Dumb Inference by Hill-Climbing 8e-3 0.01 0.050.02 foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 6e-1200 2e-1300 7e-1100 100
101
Dumb Inference by Hill-Climbing 8e-3 0.01 0.050.02 foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 6e-1200 2e-1300 7e-1100 101
102
Dumb Inference by Hill-Climbing foo far#s riz’ajnz far#foo r,εz ɪ gn’e ɪʃ n s far#da riz’ajnd da far 1) Morpheme URs 2) Word URs 3) Word SRs ? 102
103
Dumb Inference by Hill-Climbing foo size#s riz’ajnz size#foo r,εz ɪ gn’e ɪʃ n s size#da riz’ajnd da size 1) Morpheme URs 2) Word URs 3) Word SRs ? 103
104
Dumb Inference by Hill-Climbing foo …#s riz’ajnz …#foo r,εz ɪ gn’e ɪʃ n s …#da riz’ajnd da … 1) Morpheme URs 2) Word URs 3) Word SRs ? 104
105
Dumb Inference by Hill-Climbing foo rizajn#s riz’ajnz rizajn#foo r,εz ɪ gn’e ɪʃ n s rizajn#da riz’ajnd da rizajn 1) Morpheme URs 2) Word URs 3) Word SRs 105
106
Dumb Inference by Hill-Climbing foo rizajn#s riz’ajnz rizajn#foo r,εz ɪ gn’e ɪʃ n s rizajn#da riz’ajnd da rizajn 1) Morpheme URs 2) Word URs 3) Word SRs 0.012e-50.008 106
107
Dumb Inference by Hill-Climbing e ɪʃ n rizajn#s riz’ajnz rizajn#e ɪʃ n r,εz ɪ gn’e ɪʃ n s rizajn#d riz’ajnd d rizajn 1) Morpheme URs 2) Word URs 3) Word SRs 0.010.0010.015 107
108
Dumb Inference by Hill-Climbing e ɪʃ n rizajgn#s riz’ajnz rizajgn#e ɪʃ n r,εz ɪ gn’e ɪʃ n s rizajgn#d riz’ajnd d rizajgn 1) Morpheme URs 2) Word URs 3) Word SRs 0.008 0.013 108
109
Dumb Inference by Hill-Climbing: can we make this any smarter? This naïve method would be very slow, and it could wander around forever, get stuck in local maxima, etc. Alas, the problem of finding the best values in a factor graph is undecidable! (We can’t even solve it by brute force, because strings have unbounded length.) Our options: exact methods that might not terminate (but do in practice), and approximate methods, which try to recover not just the best values but the posterior distribution of values. All our methods are based on finite-state automata.
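A sketch of the naive hill-climbing just illustrated (hypothetical helpers: score would return the product of the factor scores for a complete assignment of strings to variables, and neighbors would propose local changes such as editing one character of one latent string):

```python
def hill_climb(assignment, score, neighbors, max_steps=100000):
    """Greedy local search over string assignments; easily stuck in local maxima."""
    current = dict(assignment)
    current_score = score(current)
    for _ in range(max_steps):
        best_move, best_score = None, current_score
        for var, new_string in neighbors(current):
            candidate = dict(current)
            candidate[var] = new_string
            s = score(candidate)
            if s > best_score:
                best_move, best_score = (var, new_string), s
        if best_move is None:        # no improving move: a local maximum
            return current, current_score
        current[best_move[0]] = best_move[1]
        current_score = best_score
    return current, current_score
```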
110
A Generative Model of Phonology A Directed Graphical Model of the lexicon rˌɛzɪgnˈeɪʃən dˈæmz rizˈajnz 110 (Approximate) Inference MCMC – Bouchard-Côté (2007) Belief Propagation – Dreyer and Eisner (2009) Expectation Propagation – Cotterell and Eisner (2015) Dual Decomposition – Peng, Cotterell, & Eisner (2015)
111
About to sell our mathematical soul? 111 Insight Efficiency Exactness
112
112 General algos Give up lovely dynamic programming? Big Models Specialized algos
113
113 Give up lovely dynamic programming? Not quite! Yes, general algos … which call specialized algos as subroutines. Within a framework such as belief propagation, we may run parsers (Smith & Eisner 2008) or finite-state machines (Dreyer & Eisner 2009). A step of belief propagation takes time O(k^n) in general: to update one message from a factor that coordinates n variables that have k possible values each. If that’s slow, we can sometimes exploit special structure in the factor! Large n: a parser uses dynamic programming to coordinate many vars. Infinite k: FSMs use dynamic programming to coordinate strings.
114
Distribution Over Surface Form (UR | Prob): dæmeɪʃən .80, dæmneɪʃən .10, dæmineɪʃən .001, dæmiineɪʃən .0001, …, chomsky .000001, …. Encoded as a Weighted Finite-State Automaton.
115
115
116
Experimental Design 116
117
Experimental Datasets: 7 languages from different families: Maori, Tangale, Indonesian, Catalan, English, Dutch, German. (# of observed words per experiment: 200 to 800; 67; 71; 54; 43.) Homework exercises: can we generalize correctly from small data? CELEX: can we scale up to larger datasets? Can we handle naturally occurring datasets that have more irregularity?
118
118 Evaluation Setup r,εz ɪ gn’e ɪʃ n d’æmz riz’ajnz
119
119 Evaluation Setup dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən did we guess this pronunciation right?
120
Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky.000001 … Exploring the Evaluation Metrics 120 1-best error rate – Is the 1-best correct? *
121
Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky.000001 … Exploring the Evaluation Metrics 121 1-best error rate – Is the 1-best correct? Cross Entropy – What is the probability of the correct answer?
122
Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky.000001 … Exploring the Evaluation Metrics 122 1-best error rate – Is the 1-best correct? Cross Entropy – What is the probability of the correct answer? Expected Edit Distance – How close am I on average?
123
Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky.000001 … Exploring the Evaluation Metrics 123 1-best error rate – Is the 1-best correct? Cross Entropy – What is the probability of the correct answer? Expected Edit Distance – How close am I on average? Average over many training-test splits
124
Evaluation Metrics: (Lower is Always Better) – 1-best error rate (did we get it right?) – cross-entropy (what probability did we give the right answer?) – expected edit-distance (how far away on average are we?) – Average each metric over many training-test splits Comparisons: – Lower Bound: Phonology as noisy concatenation – Upper Bound: Oracle URs from linguists 124
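A small sketch of these three metrics, assuming a prediction is given as an explicit dictionary from candidate strings to probabilities (in the real system the distribution is a weighted FSA, so the same quantities are computed with finite-state operations). The edit_distance helper is plain Levenshtein distance, and the example distribution is illustrative only.

```python
import math

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def one_best_error(pred_dist, gold):
    best = max(pred_dist, key=pred_dist.get)
    return 0.0 if best == gold else 1.0

def cross_entropy(pred_dist, gold, floor=1e-12):
    return -math.log(pred_dist.get(gold, floor))

def expected_edit_distance(pred_dist, gold):
    return sum(p * edit_distance(x, gold) for x, p in pred_dist.items())

pred = {"dæmeɪʃən": 0.80, "dæmneɪʃən": 0.10, "dæmineɪʃən": 0.001}
print(one_best_error(pred, "dæmneɪʃən"),
      cross_entropy(pred, "dæmneɪʃən"),
      expected_edit_distance(pred, "dæmneɪʃən"))
```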
125
Evaluation Philosophy We’re evaluating a language learner, on languages we didn’t examine when designing the learner. We directly evaluate how well our learner predicts held-out words that the learner didn’t see. No direct evaluation of intermediate steps: – Did we get the “right” underlying forms? – Did we learn a “simple” or “natural” phonology? – It’s hard to judge the answers. Anyway, we only want the answers to be “yes” because we suspect that this will give us a more predictive theory. So let’s just see if the theory is predictive. Proof is in the pudding! Caveat: Linguists and child language learners also have access to other kinds of data that we’re not considering yet. 125
126
Results (using Loopy Belief Propagation for inference) 126
127
German Results 127 Error Bars with bootstrap resampling
128
CELEX Results 128
129
Phonological Exercise Results 129
130
Gold UR Recovery 130
131
Formalizing Our Setup Many scoring functions on strings (e.g., our phonology model) can be represented using FSMs 131
132
1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology 132 What class of functions will we allow for the factors? (black squares) dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən
133
Real-Valued Functions on Strings. We’ll need to define some nonnegative functions. f(x) = score of a string. f(x,y) = score of a pair of strings. Can represent deterministic processes: f(x,y) ∈ {1,0}, e.g., is y the observable result of deleting latent material from x? Can represent probability distributions, e.g., f(x) = p(x) under some “generating process”; f(x,y) = p(x,y) under some “joint generating process”; f(x,y) = p(y | x) under some “transducing process”.
134
Restrict to Finite-State Functions. One string input, Boolean output: an acceptor with arcs such as a, c. One string input, real output: a weighted acceptor with arcs such as a/.5, c/.7, ε/.5 and final weight .3. Two string inputs (on 2 tapes): a transducer with arcs such as a:x, c:z, ε:y, or weighted, a:x/.5, c:z/.7, ε:y/.5 with final weight .3. Path weight = product of arc weights. Score of input = total weight of accepting paths.
135
Example: Stochastic Edit Distance p(y|x). O(k) deletion arcs such as a:ε, b:ε. O(k) insertion arcs such as ε:a, ε:b. O(k²) substitution arcs such as a:b, b:a. O(k) identity arcs such as a:a, b:b. Likely edits = high-probability arcs.
136
Computing p(y|x). Given (x,y), construct a graph of all accepting paths in the original FSM. These are different explanations for how x could have been edited to yield y (x-to-y alignments). Use dynamic programming to find the highest-prob path, or the total prob of all paths. (Figure: the alignment lattice for upper string clara, indexed by position in the upper string (0–5) × position in the lower string (0–4); its arcs are deletions such as c:ε, insertions such as ε:c, and substitutions/copies such as c:c.)
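A sketch of this dynamic program for the simplest, memoryless stochastic edit distance: sum the probabilities of all alignment paths that turn x into y, where each step is a deletion, insertion, substitution, or copy. The probability table below is invented; the real model conditions these probabilities on context via FSM states and learned features.

```python
def prob_y_given_x(x, y, p_copy=0.85, p_sub=0.02, p_del=0.05, p_ins=0.02, p_stop=0.1):
    """Total probability of all edit paths turning x into y (memoryless model)."""
    n, m = len(x), len(y)
    # alpha[i][j] = total prob of all paths consuming x[:i] and emitting y[:j]
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            a = alpha[i][j]
            if a == 0.0:
                continue
            if i < n:                              # delete x[i]
                alpha[i + 1][j] += a * p_del
            if j < m:                              # insert y[j]
                alpha[i][j + 1] += a * p_ins
            if i < n and j < m:                    # copy or substitute
                match = p_copy if x[i] == y[j] else p_sub
                alpha[i + 1][j + 1] += a * match
    return alpha[n][m] * p_stop                    # weight of reaching the final state

print(prob_y_given_x("rizaigns", "rizainz"))       # small but nonzero
```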
137
Why restrict to finite-state functions? Can always compute f(x,y) efficiently: construct the graph of accepting paths by FSM composition; sum over them via dynamic programming, or by solving a linear system. Finite-state functions are closed under useful operations: Marginalization: h(x) = ∑_y f(x,y). Pointwise product: h(x) = f(x) ∙ g(x). Join: h(x,y,z) = f(x,y) ∙ g(y,z).
138
Define a function family Use finite-state machines (FSMs). The arc weights are parameterized. We tune the parameters to get weights that predict our training data well. The FSM topology defines a function family. In practice, generalizations of stochastic edit distance. So, we are learning the edit probabilities. With more states, these can depend on left and right context. 138
139
Probabilistic FSTs 139
140
Probabilistic FSTs 140
141
Finite-State Graphical Model Over String-Valued Random Variables ● Joint distribution over many strings ● Variables range over Σ*, the infinite set of all strings ● Relations among variables usually specified by (multi-tape) FSTs. A probabilistic approach to language change (Bouchard-Côté et al., NIPS 2008). Graphical models over multiple strings (Dreyer and Eisner, EMNLP 2009). Large-scale cognate recovery (Hall and Klein, EMNLP 2011).
142
Useful 3-tape FSMs f(x,y,z) = p(z | x, y) typically z is functionally dependent on x,y, so this represents a deterministic process Concatenation: f( dog, s, dogs) = 1 Binyan: f(ktb, a _ _ u _, aktub) = 1 Remark: WLOG, we can do everything with 2- tape FSMs. Similar to binary constraints in CSP. 142
143
Computational Hardness Ordinary graphical model inference is sometimes easy and sometimes hard, depending on graph topology But over strings, it can be hard even with simple topologies and simple finite-state factors 143
144
Simple model family can be NP-hard. Multi-sequence alignment problem: generalize edit distance to k strings of length O(n). Dynamic programming would seek the best path in a hypercube of size O(n^k). Similar to the Steiner string problem (“consensus”). (Figure: the pairwise alignment lattice from the previous slide.)
145
Simple model family can be undecidable (!). Post’s Correspondence Problem (1946): given a 2-tape FSM f(x,y) of a certain form, is there a string z for which f(z,z) > 0? No Turing machine can decide this in general. So no Turing machine can determine in general whether this simple factor graph, a single string variable z attached on both tapes to a factor f, has any positive-weight solutions. (Example from the slide: z = bbaabbbaa, segmented as bba+ab+bba+a on one tape against bb+aa+bb+baa on the other.)
146
Inference by Belief Propagation 146
147
147
148
148
149
149
150
150
151
151
152
152
153
153
154
Loopy belief propagation (BP) The beautiful observation (Dreyer & Eisner 2009): Each message is a 1-tape FSM that scores all strings. Messages are iteratively updated based on other messages, according to the rules of BP. The BP rules require only operations under which FSMs are closed! Achilles’ heel: The FSMs may grow large as the algorithm iterates. So the algorithm may be slow, or not converge at all. 154
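A schematic sketch of one round of loopy BP, written with finite table-valued messages so the update rules are visible (the caller is assumed to initialize messages, e.g. uniformly, for every variable-factor pair in both directions). In the paper each message is instead a one-tape weighted FSM, the pointwise product is weighted intersection, and the factor marginalization is composition followed by projection.

```python
from itertools import product

def bp_iteration(variables, factors, messages):
    """One round of (synchronous) loopy BP with table-valued messages.

    variables: dict var -> list of possible values (finite here; FSAs in the paper)
    factors:   dict fac -> (list_of_vars, score_fn over a tuple of values)
    messages:  dict (sender, receiver) -> {value: weight}
    """
    new = {}
    # Variable -> factor messages: pointwise product of the other incoming messages.
    for v, domain in variables.items():
        for f, (vs, _) in factors.items():
            if v not in vs:
                continue
            msg = {}
            for val in domain:
                w = 1.0
                for g, (gs, _) in factors.items():
                    if g is not f and v in gs:
                        w *= messages[(g, v)].get(val, 0.0)
                msg[val] = w
            new[(v, f)] = msg
    # Factor -> variable messages: marginalize the factor times the other messages.
    for f, (vs, score) in factors.items():
        for i, v in enumerate(vs):
            msg = {val: 0.0 for val in variables[v]}
            for assignment in product(*(variables[u] for u in vs)):
                w = score(assignment)
                for j, u in enumerate(vs):
                    if j != i:
                        w *= messages[(u, f)].get(assignment[j], 0.0)
                msg[assignment[i]] += w
            new[(f, v)] = msg
    return new
```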
155
Belief Propagation (BP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X6X6 X5X5
156
dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz 156
157
Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Factor to Variable Messages 157
158
Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Variable to Factor Messages 158
159
Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Encoded as Finite- State Machines r in g u e ε s e h a r in g u e ε e e s e h a r in g u e ε e e s e h a r in g u e ε e e s e h a r in g u e ε s e h a r in g u e ε s e h a 159
160
Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz 160
161
Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz r in g u e ε e e s e h a r in g u e ε e e s e h a 161
162
Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz r i n g u e ε e e s e h a r i n g u e ε e e s e h a r i n g u e ε e e s e h a Point-wise product (finite-state intersection) yields marginal belief 162
163
Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmnz Distribution Over Underlying Forms: UR Prob rizajgnz.95 rezajnz.02 rezigz.02 rezgz.0001 … chomsky.000001 … r i n g u e ε e e s e h a r i n g u e ε e e s e h a r i n g u e ε e e s e h a 163
164
Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5
165
X1X1 X2X2 X3X3 X4X4 X7X7 X5X5
166
Belief Propagation (BP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X6X6 X5X5 r in g u e ε e e s e h a r in g u e ε s e h a r in g u e ε e e s e h a r in g u e ε s e h a
167
Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a
168
X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 C C r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Computation of belief results in large state space
169
Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 C C r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Computation of belief results in large state space What a hairball!
170
Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Approximation Required!!!
171
BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex. X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a
172
BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a
173
BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a aa
174
BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a a a a
175
BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a a a a a aa a a a a aaa aaaaaaaaa
176
Inference by Expectation Propagation 176
177
Expectation Propagation: The Details. A belief at a variable is just the pointwise product of its incoming messages. Key idea: for each message, we seek an approximate message from a tractable family. Algorithm: for each message, compute it and project it back into the tractable family.
178
Expectation Propagation (EP) X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 exponential-family approximations inside Belief at X 3 will be simple! Messages to and from X 3 will be simple! 178
179
Expectation propagation (EP) EP solves the problem by simplifying each message once it is computed. Projects the message back into a tractable family. 179
180
Expectation Propagation (EP) exponential-family approximations inside Belief at X 3 will be simple! Messages to and from X 3 will be simple! X3X3
181
Expectation propagation (EP). EP solves the problem by simplifying each message once it is computed: it projects the message back into a tractable family. In our setting, we can use n-gram models: f_approx(x) = product of the weights of the n-grams in x. We just need to choose weights that give a good approximation. Best to use variable-length n-grams.
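A toy sketch of the projection step, with the belief given as an explicit dictionary over strings rather than a large FSA: we fit a simple "emit characters i.i.d., then stop" model by matching its expected character counts and expected length to the target's, which maximizes the expected log of the approximation within that family. The probabilities below are illustrative (adjusted to sum to one); the paper uses richer variable-length n-gram families.

```python
from collections import Counter

def project_to_unigram(p):
    """Fit a 'pick a stop prob, emit chars i.i.d.' model to a distribution over strings.

    p: dict mapping string -> probability (assumed to sum to 1).
    Returns (char_probs, stop_prob) matching expected character counts and expected length.
    """
    expected_counts = Counter()
    expected_length = 0.0
    for x, px in p.items():
        expected_length += px * len(x)
        for ch in x:
            expected_counts[ch] += px
    total = sum(expected_counts.values())            # equals expected_length
    char_probs = {ch: c / total for ch, c in expected_counts.items()}
    stop_prob = 1.0 / (1.0 + expected_length)
    return char_probs, stop_prob

def approx_score(x, char_probs, stop_prob):
    """q(x): product of per-character weights times the stopping term."""
    q = stop_prob
    for ch in x:
        q *= (1.0 - stop_prob) * char_probs.get(ch, 0.0)
    return q

belief = {"rizajgnz": 0.95, "rezajnz": 0.02, "rezigz": 0.02, "rezgz": 0.01}
chars, stop = project_to_unigram(belief)
print(approx_score("rizajgnz", chars, stop))
```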
182
Expectation Propagation (EP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a
183
X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a
184
X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a
185
X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a
186
X1X1 X2X2 X3X3 X4X4 X7X7 X5X5
187
Expectation propagation (EP) 187
188
Variable Order Approximations Use only the n-grams you really need!
189
Approximating beliefs with n-grams: how to optimize this? Option 1: Greedily add n-grams by expected count in f. Stop when adding the next batch hurts the objective. Option 2: Select n-grams from a large set using a convex relaxation + projected gradient (= tree-structured group lasso). Must incrementally expand the large set (“active set” method).
190
Results using Expectation Propagation 190
191
Trigram EP (Cyan) – slow, very accurate Baseline (Black) – slow, very accurate (pruning) Penalized EP (Red) – pretty fast, very accurate Bigram EP (Blue) – fast but inaccurate Unigram EP (Green) – fast but inaccurate Speed ranking (upper graph) Accuracy ranking (lower graph) … essentially opposites … 191
192
192
193
Inference by Dual Decomposition Exact 1-best inference! (can’t terminate in general because of undecidability, but does terminate in practice) 193
194
Challenges in Inference. A global discrete optimization problem: the variables range over an infinite set … it cannot be solved by ILP or even brute force. Undecidable! Our previous papers used approximate algorithms: Loopy Belief Propagation, or Expectation Propagation. Q: Can we do exact inference? A: If we can live with 1-best rather than marginal inference, then we can use Dual Decomposition … which is exact (if it terminates! the problem is undecidable in general …).
195
Graphical Model for Phonology: jointly decide the values of the inter-dependent latent variables, which range over an infinite set. 1) Morpheme URs: s, rizajgn, eɪʃən, dæmn, rεzign. 2) Word URs (Concatenation, e.g.): rεzɪgn#eɪʃən, rizajn#s, dæmn#eɪʃən, dæmn#s. 3) Word SRs (Phonology, PFST): rˌεzɪgnˈeɪʃn, rizˈajnz, dˌæmnˈeɪʃn, dˈæmz.
196
General Idea of Dual Decomp 196 srizajgn e ɪʃ ən dæmn rεz ɪ gn#e ɪʃ ən rizajn#s dæmn#e ɪʃ ən dæmn#s r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rεzignrεzign e ɪʃ ən
197
General Idea of Dual Decomp zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 197
198
I prefer rεz ɪ gn I prefer rizajn General Idea of Dual Decomp zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 198
199
zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 199
200
Substring Features and Active Set z rizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεzɪgnrεzɪgn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1 200 I prefer rεz ɪ gn Less ε, ɪ, g; more i, a, j (to match others) Less ε, ɪ, g; more i, a, j (to match others) I prefer rizajn Less i, a, j; more ε, ɪ, g (to match others) Less i, a, j; more ε, ɪ, g (to match others)
201
Features: “Active set” method. How many features? Infinitely many possible n-grams! Trick: Gradually increase the feature set as needed – like Paul & Eisner (2012), Cotterell & Eisner (2015). 1. Only add features on which strings disagree. 2. Only add abcd once abc and bcd already agree. – Exception: Add unigrams and bigrams for free.
202
Fragment of Our Graph for Catalan 202 ? ? grizos ? gris ? ? grize ?? grizes ? ? Stem of “grey” Separate these 4 words into 4 subproblems as before …
203
203 ? ? grizos ? gris ? ? grize ? ? ? ? grizes Redraw the graph to focus on the stem …
204
204 ? ? grizos ? gris ? ? grize ? ? grizes ? ? ?? ? Separate into 4 subproblems – each gets its own copy of the stem
205
205 ? ? grizos ? gris ε ? grize ? ? grizes ? ? εε ε nonzero features: { } Iteration: 1
206
206 ? ? grizos ? gris g ? grize ? ? grizes ? ? gg g nonzero features: { } Iteration: 3
207
207 ? ? grizos ? gris ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$ } Iteration: 4 Feature weights (dual variable)
208
208 ? ? grizos ? gris ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 5 Feature weights (dual variable)
209
209 ? ? grizos ? gris ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 6 Iteration: 13 Feature weights (dual variable)
210
210 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 14 Feature weights (dual variable)
211
211 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 17 Feature weights (dual variable)
212
212 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizegriz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 18 Feature weights (dual variable)
213
213 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizegriz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 19 Iteration: 29 Feature weights (dual variable)
214
214 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 30 Feature weights (dual variable)
215
215 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 30 Converged!
216
Why n-gram features? Positional features don’t understand insertion: a positional negotiator would say “I’ll try to arrange for r not i at position 2, i not z at position 3, z not at position 4.” In contrast, our “z” feature counts the number of “z” phonemes, without regard to position. These solutions (giz vs. griz) already agree on the “g”, “i”, “z” counts … they’re only negotiating over the “r” count (“I need more r’s”).
217
Why n-gram features? 217 Adjust weights λ until the “r” counts match: Next iteration agrees on all our unigram features: – Oops! Features matched only counts, not positions – But bigram counts are still wrong … so bigram features get activated to save the day – If that’s not enough, add even longer substrings … giz griz I need more r’s … somewhere. girz griz I need more gr, ri, iz, less gi, ir, rz.
218
Results using Dual Decomposition 218
219
7 Inference Problems (graphs) EXERCISE (small) o 4 languages: Catalan, English, Maori, Tangale o 16 to 55 underlying morphemes. o 55 to 106 surface words. CELEX (large) o 3 languages: English, German, Dutch o 341 to 381 underlying morphemes. o 1000 surface words for each language. 219 # vars (unknown strings) # subproblems
220
Experimental Setup o Model 1: very simple phonology with only 1 parameter, trained by grid search. o Model 2S: sophisticated phonology with phonological features, trained on hand-crafted morpheme URs: full supervision. o Model 2E: sophisticated phonology as in Model 2S, trained by EM. o Evaluating inference on recovered latent variables under the different settings.
221
Experimental Questions o Is exact inference by DD practical? o Does it converge? o Does it get better results than approximate inference methods? o Does exact inference help EM? 221
222
● DD seeks the best λ via a subgradient algorithm: reducing the dual objective tightens the upper bound on the primal objective. ● If λ gets all sub-problems to agree (x₁ = … = x_K), the constraints are satisfied, so the dual value is also the value of a primal solution, which must be the max primal! (and the min dual). primal (a function of the strings x) ≤ dual (a function of the weights λ).
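A schematic sketch of the dual decomposition loop described on these slides: each subproblem maximizes its local score plus a linear bonus λ·features on its own copy of the shared string, and λ is updated by a subgradient step proportional to the disagreement in n-gram feature counts. solve_subproblem below stands in for the finite-state 1-best computation and is not defined here; the active-set growth of the feature set is also omitted.

```python
from collections import Counter

def ngram_features(s, max_n=2):
    """Counts of unigrams and bigrams (longer n-grams would be added on demand)."""
    feats = Counter()
    padded = "^" + s + "$"
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats

def dual_decomposition(subproblems, solve_subproblem, step_size=0.1, max_iters=500):
    """subproblems: K local problems sharing one unknown string (e.g. a stem)."""
    K = len(subproblems)
    lambdas = [Counter() for _ in range(K)]          # one dual weight vector per copy
    for _ in range(max_iters):
        # Each subproblem returns its preferred value of the shared string,
        # maximizing (local score + lambdas[k] . features).
        guesses = [solve_subproblem(sp, lambdas[k]) for k, sp in enumerate(subproblems)]
        feats = [ngram_features(g) for g in guesses]
        if all(g == guesses[0] for g in guesses):     # agreement: exact MAP found
            return guesses[0]
        # Subgradient step: push each copy's feature counts toward the average.
        avg = Counter()
        for f in feats:
            for key, v in f.items():
                avg[key] += v / K
        for k in range(K):
            for key in set(avg) | set(feats[k]):
                lambdas[k][key] -= step_size * (feats[k][key] - avg[key])
    return guesses[0]                                 # no agreement within budget
```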
223
Convergence behavior (full graph) for Catalan, Maori, English, and Tangale. (Plots show the dual, which tightens the upper bound, and the primal, which improves the strings, meeting at the optimum.)
224
Comparisons ● Compare DD with two types of Belief Propagation (BP) inference: approximate MAP inference (max-product BP, the baseline); approximate marginal inference (sum-product BP, TACL 2015); and exact MAP inference (dual decomposition, this paper). Exact marginal inference: we don’t know how! (The two BP methods are a Viterbi approximation and a variational approximation, respectively.)
225
Inference accuracy (Model 1: trivial phonology; Model 2S: oracle phonology; Model 2E: learned phonology, with inference used within EM):
Approximate MAP inference (max-product BP) (baseline): Model 1, EXERCISE: 90%; Model 1, CELEX: 84%; Model 2S, CELEX: 99%; Model 2E, EXERCISE: 91%.
Approximate marginal inference (sum-product BP) (TACL 2015): Model 1, EXERCISE: 95%; Model 1, CELEX: 86%; Model 2S, CELEX: 96%; Model 2E, EXERCISE: 95%.
Exact MAP inference (dual decomposition) (this paper): Model 1, EXERCISE: 97%; Model 1, CELEX: 90%; Model 2S, CELEX: 99%; Model 2E, EXERCISE: 98%.
(Slide annotations: improves; improves more!; worse.)
226
Conclusion A general DD algorithm for MAP inference on graphical models over strings. On the phonology problem, terminates in practice, guaranteeing the exact MAP solution. Improved inference for supervised model; improved EM training for unsupervised model. Try it for your own problems generalizing to new strings! 226
227
observed data hidden data probability distribution Future Work 227
228
Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to figure that out from raw text: Related spellings Related contexts 228 shared morphemes?
229
Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to figure that out from raw text: “resignation” and “resigns” are spelled similarly And appear in similar semantic contexts (topics, dependents) “resignation” and “damnation” are spelled similarly And appear in similar syntactic contexts (singular nouns) Abstract morphemes fall into classes: RESIGN-, DAMN- are verbs while -ATION, -S attach to verbs 229
230
How is morphology like clustering?
231
Linguistics quiz: Find a morpheme Blah blah blah snozzcumber blah blah blah. Blah blah blahdy abla blah blah. Snozzcumbers blah blah blah abla blah. Blah blah blah snezzcumbri blah blah snozzcumber.
232
Dreyer & Eisner 2011 – “select & mutate” Many possible morphological slots
233
Many possible phylogenies Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati NEW Andrews, Dredze, & Eisner 2014 – “select & mutate”
234
Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to get that from raw text. To infer the abstract morphemes from context, we need to extend our generative story to capture regularity in morpheme sequences. Neural language models … But, must deal with unknown, unbounded vocabulary of morphemes 234
235
Future: Improving the factors. More attention to representations and features (autosegmental phonology). Our algorithms require us to represent each factor as a WFST defining f(x,y). There are good modeling reasons to use a Bayes net: then f(x,y) = p(y | x), so each factor is a PFST. Alas, a PFST is left/right asymmetric (label bias)! Can we substitute a WFST that defines p(y | x) only up to a normalizing constant Z(x)? “Double intractability”, since x is unknown: expensive even to explore x by Gibbs sampling! How about features that depend on all of x? CRF / RNN / LSTM?
236
Reconstructing the (multilingual) lexicon:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca | | [si.ei] | NNP (abbrev)
124 | can | | [kɛɪn] | NN
125 | can | | [kæn], [kɛn], … | MD
126 | cane | | [keɪn] | NN (mass)
127 | cane | | [keɪn] | NN
128 | canes | | [keɪnz] | NNS
(other columns would include translations, topics, counts, embeddings, …)
237
Conclusions Unsupervised learning of how all the words in a language (or across languages) are interrelated. This is what kids and linguists do. Given data, estimate a posterior distribution over the infinite probabilistic lexicon. While training parameters that model how lexical entries are related (language-specific derivational processes or soft constraints). Starting to look feasible! We now have a lot of the ingredients – generative models and algorithms. 237