1 ASRU, Dec. 2015. Graphical Models Over String-Valued Random Variables. Jason Eisner, with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, Michael Paul.

2 ASRU, Dec. 2015. Probabilistic Inference of Strings (with Pronunciation Dictionaries). Jason Eisner, with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, Michael Paul.

3

(Diagram labels: lexicon (word types), semantics, sentences, discourse context, resources; entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, annotation; N tokens.) To recover variables, model and exploit their correlations

5

Bayesian View of the World observed data hidden data probability distribution 6

Bayesian NLP Some good NLP questions:  Underlying facts or themes that help explain this document collection?  An underlying parse tree that helps explain this sentence? An underlying meaning that helps explain that parse tree?  An underlying grammar that helps explain why these sentences are structured as they are?  An underlying grammar or evolutionary history that helps explain why these words are spelled as they are? 7

Today’s Challenge Too many words in a language! 8

Natural Language is Built from Words 9

Can store info about each word in a table:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca |  | [si.ei] | NNP (abbrev)
124 | can |  | [kɛɪn] | NN
125 | can |  | [kæn], [kɛn], … | MD
126 | cane |  | [keɪn] | NN (mass)
127 | cane |  | [keɪn] | NN
128 | canes |  | [keɪnz] | NNS
(other columns would include translations, topics, counts, embeddings, …) 10

Problem: Too Many Words! Google analyzed 1 trillion words of English text Found > 13M distinct words with count ≥ 200 The problem isn’t storing such a big table … it’s acquiring the info for each row separately  Need lots of evidence, or help from human speakers Hard to get for every word of the language Especially hard for complex or “low-resource” languages  Omit rare words? Maybe, but many sentences contain them (Zipf’s Law) 11

Technically speaking, # words = ∞. Really the set of (possible) words is ∑*: Names, Neologisms, Typos, Productive processes: friend → friendless → friendlessness → friendlessnessless → …; hand+bag → handbag (sometimes can iterate) 12

Technically speaking, # words = ∞. Really the set of (possible) words is ∑*: Names, Neologisms, Typos, Productive processes: friend → friendless → friendlessness → friendlessnessless → …; hand+bag → handbag (sometimes can iterate) 13 Turkish word: uygarlaştiramadiklarimizdanmişsinizcasina = uygar+laş+tir+ama+dik+lar+imiz+dan+miş+siniz+casina (behaving) as if you are among those whom we could not cause to become civilized

14 A typical Polish verb (“to contain”):
infinitive: zawierać (imperfective) / zawrzeć (perfective)
present (imperfective): zawieram, zawierasz, zawiera; zawieramy, zawieracie, zawierają
past (imperfective): zawierałem/zawierałam, zawierałeś/zawierałaś, zawierał/zawierała/zawierało; zawieraliśmy/zawierałyśmy, zawieraliście/zawierałyście, zawierali/zawierały
past (perfective): zawarłem/zawarłam, zawarłeś/zawarłaś, zawarł/zawarła/zawarło; zawarliśmy/zawarłyśmy, zawarliście/zawarłyście, zawarli/zawarły
future (imperfective): będę zawierał/zawierała, będziesz zawierał/zawierała, będzie zawierał/zawierała/zawierało; będziemy zawierali/zawierały, będziecie zawierali/zawierały, będą zawierali/zawierały
future (perfective): zawrę, zawrzesz, zawrze; zawrzemy, zawrzecie, zawrą
conditional (imperfective): zawierałbym/zawierałabym, zawierałbyś/zawierałabyś, zawierałby/zawierałaby/zawierałoby; zawieralibyśmy/zawierałybyśmy, zawieralibyście/zawierałybyście, zawieraliby/zawierałyby
conditional (perfective): zawarłbym/zawarłabym, zawarłbyś/zawarłabyś, zawarłby/zawarłaby/zawarłoby; zawarlibyśmy/zawarłybyśmy, zawarlibyście/zawarłybyście, zawarliby/zawarłyby
imperative (imperfective): zawierajmy; zawieraj, zawierajcie; niech zawiera, niech zawierają
imperative (perfective): zawrzyjmy; zawrzyj, zawrzyjcie; niech zawrze, niech zawrą
present active participle: zawierający, -a, -e; -y, -e
present passive participle: zawierany, -a, -e; -, -e
past passive participle: zawarty, -a, -e; -, -te
adverbial participle: zawierając
100 inflected forms per verb. Sort of predictable from one another! (verbs are more or less regular)

Solution: Don’t model every cell separately Noble gases Positive ions 16

Can store info about each word in a table:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca |  | [si.ei] | NNP (abbrev)
124 | can |  | [kɛɪn] | NN
125 | can |  | [kæn], [kɛn], … | MD
126 | cane |  | [keɪn] | NN (mass)
127 | cane |  | [keɪn] | NN
128 | canes |  | [keɪnz] | NNS
17 (other columns would include translations, topics, counts, embeddings, …)

What’s in the table? NLP strings are diverse … Use  Orthographic (spelling)  Phonological (pronunciation)  Latent (intermediate steps not observed directly) Size  Morphemes (meaningful subword units)  Words  Multi-word phrases, including “named entities”  URLs 18

Language  English, French, Russian, Hebrew, Chinese, …  Related languages (Romance langs, Arabic dialects, …)  Dead languages (common ancestors) – unobserved?  Transliterations into different writing systems Medium  Misspellings  Typos  Wordplay  Social media 19 What’s in the table? NLP strings are diverse …

Some relationships within the table: spelling → pronunciation; word → noisy word (e.g., with a typo); word → related word in another language (loanwords, language evolution, cognates); singular → plural (for example); (root, binyan) → word; underlying form → surface form 20

Chains of relations can be useful: Misspelling or pun = spelling → pronunciation → spelling. Cognate = word → historical parent → historical child. 21

Reconstructing the (multilingual) lexicon:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca |  | [si.ei] | NNP (abbrev)
124 | can |  | [kɛɪn] | NN
125 | can |  | [kæn], [kɛn], … | MD
126 | cane |  | [keɪn] | NN (mass)
127 | cane |  | [keɪn] | NN
128 | canes |  | [keɪnz] | NNS
(other columns would include translations, topics, distributional info such as counts, …)
Ultimate goal: Probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text. Needed: Exploit the relationships (arrows). May have to discover those relationships. Approach: Linguistics + generative modeling + statistical inference. Modeling ingredients: Finite-state machines, graphical models, CRP. Inference ingredients: MCMC, BP/EP, DD. 22

Today’s Focus: Phonology (but methods also apply to other relationships among strings) 23

What is Phonology? Orthography: cat. Phonology: [kæt]. Phonology explains regular sound patterns. 24

What is Phonology? Orthography: cat. Phonology: [kæt]. Phonetics: (the acoustic signal). Phonology explains regular sound patterns. Not phonetics, which deals with acoustics. 25

Q: What do phonologists do? A: They find patterns among the pronunciations of words. 26

A Phonological Exercise Tenses Verbs [tɔk] [tɔks] [tɔkt] 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] TALK THANK HACK CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] 27

Matrix Completion: Collaborative Filtering Movies Users

Matrix Completion: Collaborative Filtering. (Figure: the Movies × Users rating matrix, with a latent vector attached to each movie and each user, e.g., [-9 1 4].)

Matrix Completion: Collaborative Filtering. Prediction! (Figure: fill in an unobserved cell of the Movies × Users matrix from the latent vectors, e.g., [-9 1 4].) 30

Matrix Completion: Collaborative Filtering. A rating is modeled as the Dot Product of the user and movie vectors (e.g., [1,-4,3] · [-5,2,1]) plus Gaussian Noise. 31
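To make the collaborative-filtering analogy concrete, here is a minimal Python sketch (not code from the talk): every user and every movie gets a latent vector, an observed rating is modeled as their dot product plus Gaussian noise, and a missing cell is predicted by the dot product alone. The names, vectors, and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent vectors; in practice these are learned from the observed cells.
user_vecs = {"alice": np.array([1.0, -4.0, 3.0]),
             "bob":   np.array([-5.0, 2.0, 1.0])}
movie_vecs = {"matrix": np.array([0.5, -1.0, 2.0]),
              "amelie": np.array([-2.0, 0.5, 1.5])}

def sample_rating(user, movie, noise_sd=0.5):
    """Generative story: rating = dot(user_vec, movie_vec) + Gaussian noise."""
    mean = user_vecs[user] @ movie_vecs[movie]
    return mean + rng.normal(0.0, noise_sd)

def predict_rating(user, movie):
    """Prediction for a missing cell: just the dot product (the noise has mean 0)."""
    return user_vecs[user] @ movie_vecs[movie]

print(predict_rating("alice", "matrix"))   # fill in an unobserved cell
```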

Matrix Completion: Collaborative Filtering. Prediction! (Figure: fill in an unobserved cell of the Movies × Users matrix from the latent vectors, e.g., [-9 1 4].) 32

A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Tenses Verbs [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] 33

A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ 34

A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæks] [kɹækt] [slæp] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ Suffixes Stems 35

A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP [kɹæk] [kɹæks] [kɹækt] [slæp] [slæps] [slæpt] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /slæp/ /kɹæk/ Prediction! THANK Suffixes Stems 36

Why “talks” sounds like that: tɔk + s → (Concatenate) → tɔks (“talks”) 37

A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /bæt/ /koʊd/ /slæp/ /kɹæk/ THANK 38

A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /bæt/ /koʊd/ /slæp/ /kɹæk/ z instead of s ɪd instead of t THANK 39

A Phonological Exercise [tɔk] [tɔks] [tɔkt] TALK THANK HACK 1P Pres. Sg. 3P Pres. Sg.Past Tense Past Part. Suffixes Stems [tɔkt] [θeɪŋk] [θeɪŋks] [θeɪŋkt] [hæk] [hæks] [hækt] CRACK SLAP CODE BAT EAT [kɹæks] [kɹækt] [slæp] [slæpt] [koʊdz] [koʊdɪd] [bæt] [bætɪd] [it] [eɪt] [itən] /Ø//Ø/ /s//s/ /t//t//t//t/ /tɔk/ /θeɪŋk/ /hæk/ /it/ /bæt/ /koʊd/ /slæp/ /kɹæk/ eɪt instead of i t ɪd 40

Why “codes” sounds like that: koʊd + s → (Concatenate) → koʊd#s → (Phonology, stochastic) → koʊdz (“codes”). Modeling word forms using latent underlying morphs and phonology. Cotterell et al., TACL 2015. 41

Why “resignation” sounds like that: rizaign + ation → (Concatenate) → rizaign#ation → (Phonology, stochastic) → rεzɪgneɪʃn (“resignation”) 42

Fragment of Our Graph for English. 1) Morphemes: rizaign, dæmn, eɪʃən, s (the 3rd-person singular suffix: very common!). 2) Underlying words (by Concatenation): rizaign#eɪʃən, rizaign#s, dæmn#s, dæmn#eɪʃən. 3) Surface words (by Phonology): r,εzɪgn’eɪʃn (“resignation”), riz’ajnz (“resigns”), d,æmn’eɪʃn (“damnation”), d’æmz (“damns”). 43

Handling Multimorphemic Words: gə + liːb + t → underlying gəliːbt → surface gəliːpt, “geliebt” (German: loved). 44 Matrix completion: each word built from one stem (row) + one suffix (column). WRONG. Graphical model: a word can be built from any # of morphemes (parents). RIGHT.

Limited to concatenation? No, could extend to templatic morphology … 45

A (Simple) Model of Phonology 46

[1,-4,3] [-5,2,1] Dot Product Gaussian Noise 47

rizaign + s → (Concatenate) → rizaign#s → (Phonology S_θ, stochastic) → rizainz (“resigns”) 48

Upper Left ContextUpper Right Context Lower Left Context Phonology as an Edit Process r r i i z z a a i i g g n n s s 49

Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r COPY 50

Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r i i COPY 51

Upper Left Context Lower Left Context Upper Right Context Phonology as an Edit Process r r i i z z a a i i g g n n s s r r i i COPY z z 52

Upper Left Context Lower Left Context i i i i Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z COPY a a 53

i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY 54

i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ 55

i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ COPY n n 56

i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ n n SUB z z 57

i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ n n SUB z z 58

i i i i Lower Left Context Upper Left Context Upper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i ɛ ɛ COPY n n 59

Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Weights Feature Function ActionProb DEL.75 COPY.01 SUB(A).05 SUB(B) INS(A).02 INS(B)

Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features 61

Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form 62

i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form Transduction 63

Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Feature Function Weights Features Surface Form Transduction Upper String 64

Phonological Attributes Binary Attributes (+ and -) 65

i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ 66

Lower Left Context Upper Left ContextUpper Right Context i i i i Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i DEL ɛ ɛ Faithfulness Features EDIT(g, ɛ ) EDIT(+cons, ɛ ) EDIT(+voiced, ɛ ) 67

i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY Markedness Features BIGRAM(a, i) BIGRAM(-high, -low) BIGRAM(+back, -back) 68

i i i i Lower Left Context Upper Left ContextUpper Right Context Phonology as an Edit Process r r z z a a i i g g n n s s r r z z a a i i COPY Markedness Features BIGRAM(a, i) BIGRAM(-high, -low) BIGRAM(+back, -back) Inspired by Optimality Theory: A popular Constraint-Based Phonology Formalism 69
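One way to make the edit process above concrete is as a locally normalized model: at each step, the probability of an action (COPY, DEL, SUB(·), INS(·)) is a softmax over weighted features of the upper and lower contexts, in the spirit of the faithfulness and markedness features just shown. The sketch below uses a tiny invented alphabet, feature set, and weight vector; it illustrates the shape of the computation, not the authors' trained model.

```python
import math
from collections import defaultdict

ACTIONS = ["COPY", "DEL"] + [f"SUB({c})" for c in "ab"] + [f"INS({c})" for c in "ab"]

# Invented weights; in the talk these are learned (e.g., inside EM by gradient descent).
weights = defaultdict(float)
weights[("DEL", "upper=g")] = 2.0   # e.g., reward deleting when the upper symbol is 'g'
weights[("COPY", "bias")] = 1.0     # mild default preference for copying

def features(action, upper_left, upper_sym, upper_right, lower_left):
    """Toy contextual features; the real model uses phonological attributes
    (faithfulness features like EDIT(g, eps), markedness bigram features, ...)."""
    return [(action, "bias"),
            (action, f"upper={upper_sym}"),
            (action, f"lower_left={lower_left[-1:]}")]

def action_probs(upper_left, upper_sym, upper_right, lower_left):
    """Softmax over edit actions given the upper/lower contexts."""
    scores = {a: sum(weights[f] for f in features(a, upper_left, upper_sym,
                                                  upper_right, lower_left))
              for a in ACTIONS}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}

# e.g. the step that considers deleting the 'g' while rewriting rizaigns as rizainz:
print(action_probs(upper_left="rizai", upper_sym="g", upper_right="ns", lower_left="rizai"))
```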

Inference for Phonology 70

Bayesian View of the World observed data hidden data probability distribution 71

(Figure: the surface words r,εzɪgn’eɪʃn, d,æmn’eɪʃn, d’æmz.) 72

(Figure: the full graph: morphemes rizaign, dæmn, eɪʃən, s; underlying words rizaign#eɪʃən, rizaign#s, dæmn#s, dæmn#eɪʃən; surface words r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz.) 73

Bayesian View of the World: observed data = some of the words; hidden data = the rest of the words, all of the morphs, and the parameter vectors θ, φ; related by a probability distribution. 74

Statistical Inference: from some surface forms of the language, infer (via the probability distribution) 1. the underlying forms giving rise to those surface forms, and 2. the surface forms for all other words, as predicted by the underlying forms. (We also learn the edit parameters that best explain the visible part of the iceberg.) 75

Why this matters Phonological grammars are usually hand-engineered by phonologists. Linguistics goal: Create an automated phonologist? Cognitive science goal: Model how babies learn phonology? Engineering goal: Analyze and generate words we haven’t heard before? 76

The Generative Story (defines which iceberg shapes are likely) 1. Sample the parameters φ and θ from priors. These parameters describe the grammar of a new language: what tends to happen in the language. 2. Now choose the lexicon of morphs and words: – For each abstract morpheme a ∈ A, sample the morph M(a) ~ M_φ – For each abstract word a = a₁a₂···, sample its surface pronunciation S(a) from S_θ(· | u), where u = M(a₁)#M(a₂)··· 3. This lexicon can now be used to communicate. A word’s pronunciation is now just looked up, not sampled; so it is the same each time it is used. (Example: rizaign → rizaign#s → riz’ajnz) 77
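A toy rendering of that generative story in Python, with invented placeholder distributions: morphs are sampled once per abstract morpheme (standing in for M_φ), a word's underlying form is the '#'-concatenation of its morphs, its surface form is sampled once from a crude stand-in for the stochastic phonology S_θ, and thereafter the pronunciation is looked up rather than re-sampled.

```python
import random
random.seed(0)

MORPH_PRIOR = {"RESIGN": ["rizajgn", "rizajn"], "DAMN": ["dæmn"],
               "3SG": ["s", "z"], "ATION": ["eɪʃən"]}   # invented morph candidates
morph_of = {}      # M(a): sampled once per abstract morpheme, then reused

def sample_morph(a):
    if a not in morph_of:                       # step 2a of the story
        morph_of[a] = random.choice(MORPH_PRIOR[a])
    return morph_of[a]

def sample_surface(u):
    """Crude stand-in for S_theta(. | u): drop '#', and usually delete 'g'."""
    out = u.replace("#", "")
    return out.replace("g", "") if random.random() < 0.9 else out

surface_of = {}    # S(a1 a2 ...): sampled once per word, afterwards just looked up (step 3)

def word(*morphemes):
    if morphemes not in surface_of:             # step 2b
        u = "#".join(sample_morph(a) for a in morphemes)
        surface_of[morphemes] = sample_surface(u)
    return surface_of[morphemes]

print(word("RESIGN", "3SG"), word("DAMN", "ATION"))   # surface pronunciations
```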

Why Probability? A language’s morphology and phonology are fixed, but probability models the learner’s uncertainty about what they are. Advantages: – Quantification of irregularity (“singed” vs. “sang”) – Soft models admit efficient learning and inference Our use is orthogonal to the way phonologists currently use probability to explain gradient phenomena 78

Basic Methods for Inference and Learning 79

Train the Parameters using EM (Dempster et al. 1977) E-Step (“inference”): – Infer the hidden strings (posterior distribution) r,εz ɪ gn’e ɪʃ nd,æmn’e ɪʃ n d’æmz

Train the Parameters using EM (Dempster et al. 1977) E-Step (“inference”): – Infer the hidden strings (posterior distribution) dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

Train the Parameters using EM (Dempster et al. 1977) E-Step (“inference”): – Infer the hidden strings (posterior distribution) M-Step (“learning”): – Improve the continuous model parameters θ, φ (gradient descent: the E-step provides supervision) Repeat till convergence. (Figure: the edit lattice aligning rizaign#s to riz’ajnz.) 82
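Schematically, that training loop looks like the skeleton below (a sketch, not the authors' code): the E-step calls whatever inference routine is available (hill-climbing, BP, EP, or DD in the following slides) to get a posterior over the hidden strings, and the M-step takes gradient steps on θ and φ with that posterior as supervision. `infer_hidden_strings` and `gradient` are hypothetical callbacks; trivial stubs are supplied only so the skeleton executes.

```python
import numpy as np

def em_train(observed_words, theta, phi, infer_hidden_strings, gradient,
             n_iters=20, lr=0.1):
    """Schematic EM: alternate posterior inference over latent strings (E-step)
    with gradient-based updates of the continuous parameters (M-step)."""
    for _ in range(n_iters):
        posterior = infer_hidden_strings(observed_words, theta, phi)   # E-step
        g_theta, g_phi = gradient(posterior, observed_words, theta, phi)
        theta, phi = theta + lr * g_theta, phi + lr * g_phi            # M-step
    return theta, phi

# Trivial stand-ins so the skeleton runs; real versions would perform BP/EP/DD
# inference and differentiate the phonology and morph-lexicon models.
dummy_infer = lambda words, th, ph: {}
dummy_grad = lambda post, words, th, ph: (np.zeros_like(th), np.zeros_like(ph))
theta, phi = em_train(["riz'ajnz"], np.zeros(3), np.zeros(3), dummy_infer, dummy_grad)
```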

Directed Graphical Model (defines the probability of a candidate solution). 1) Morphemes, 2) Underlying words, 3) Surface words, linked by Concatenation and Phonology. 83 Inference step: Find high-probability reconstructions of the hidden variables. High-probability if each string is likely given its parents. (Same example graph: rizaign, dæmn, eɪʃən, s; rizaign#eɪʃən, rizaign#s, dæmn#s, dæmn#eɪʃən; r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz.)

Equivalent Factor Graph (defines the probability of a candidate solution). 1) Morphemes, 2) Underlying words, 3) Surface words, now connected through Concatenation and Phonology factors. 84 (Same example graph as before, redrawn with factor nodes.)

85

86

87

88

89

90

91

92

93

Directed Graphical Model 94 1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

Equivalent Factor Graph Each ellipse is a random variable Each square is a “factor” – a function that jointly scores the values of its few neighboring variables 95 1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

? ? riz’ajnz ? r,εz ɪ gn’e ɪʃ n ? ? riz’ajnd ? ? Dumb Inference by Hill-Climbing 1) Morpheme URs 2) Word URs 3) Word SRs 96

foo ? riz’ajnz ? r,εz ɪ gn’e ɪʃ n s ? riz’ajnd da bar Dumb Inference by Hill-Climbing 1) Morpheme URs 2) Word URs 3) Word SRs 97

Dumb Inference by Hill-Climbing foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 98

Dumb Inference by Hill-Climbing 8e foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 99

Dumb Inference by Hill-Climbing 8e foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 6e e e

Dumb Inference by Hill-Climbing 8e foo bar#s riz’ajnz bar#foo r,εz ɪ gn’e ɪʃ n s bar#da riz’ajnd da bar 1) Morpheme URs 2) Word URs 3) Word SRs 6e e e-1100  101

Dumb Inference by Hill-Climbing foo far#s riz’ajnz far#foo r,εz ɪ gn’e ɪʃ n s far#da riz’ajnd da far 1) Morpheme URs 2) Word URs 3) Word SRs ? 102

Dumb Inference by Hill-Climbing foo size#s riz’ajnz size#foo r,εz ɪ gn’e ɪʃ n s size#da riz’ajnd da size 1) Morpheme URs 2) Word URs 3) Word SRs ? 103

Dumb Inference by Hill-Climbing foo …#s riz’ajnz …#foo r,εz ɪ gn’e ɪʃ n s …#da riz’ajnd da … 1) Morpheme URs 2) Word URs 3) Word SRs ? 104

Dumb Inference by Hill-Climbing foo rizajn#s riz’ajnz rizajn#foo r,εz ɪ gn’e ɪʃ n s rizajn#da riz’ajnd da rizajn 1) Morpheme URs 2) Word URs 3) Word SRs 105

Dumb Inference by Hill-Climbing foo rizajn#s riz’ajnz rizajn#foo r,εz ɪ gn’e ɪʃ n s rizajn#da riz’ajnd da rizajn 1) Morpheme URs 2) Word URs 3) Word SRs 0.012e

Dumb Inference by Hill-Climbing e ɪʃ n rizajn#s riz’ajnz rizajn#e ɪʃ n r,εz ɪ gn’e ɪʃ n s rizajn#d riz’ajnd d rizajn 1) Morpheme URs 2) Word URs 3) Word SRs

Dumb Inference by Hill-Climbing e ɪʃ n rizajgn#s riz’ajnz rizajgn#e ɪʃ n r,εz ɪ gn’e ɪʃ n s rizajgn#d riz’ajnd d rizajgn 1) Morpheme URs 2) Word URs 3) Word SRs

Dumb Inference by Hill-Climbing 109  Can we make this any smarter?  This naïve method would be very slow. And it could wander around forever, get stuck in local maxima, etc.  Alas, the problem of finding the best values in a factor graph is undecidable! (Can’t even solve by brute force because strings have unbounded length.)  Exact methods that might not terminate (but do in practice)  Approximate methods – which try to recover not just the best values, but the posterior distribution of values  All our methods are based on finite-state automata

A Generative Model of Phonology A Directed Graphical Model of the lexicon rˌɛzɪgnˈeɪʃən dˈæmz rizˈajnz 110 (Approximate) Inference MCMC – Bouchard-Côté (2007) Belief Propagation – Dreyer and Eisner (2009) Expectation Propagation – Cotterell and Eisner (2015) Dual Decomposition – Peng, Cotterell, & Eisner (2015)

About to sell our mathematical soul? 111 Insight Efficiency Exactness

112 General algos Give up lovely dynamic programming? Big Models Specialized algos

113 Give up lovely dynamic programming? General algos Insight Specialized algos Not quite! – Yes, general algos … which call specialized algos as subroutines Within a framework such as belief propagation, we may run – parsers (Smith & Eisner 2008) – finite-state machines (Dreyer & Eisner 2009) A step of belief propagation takes time O(k^n) in general – To update one message from a factor that coordinates n variables that have k possible values each – If that’s slow, we can sometimes exploit special structure in the factor! large n: parser uses dynamic programming to coordinate many vars infinite k: FSMs use dynamic programming to coordinate strings

(Figure: the observed words rˌɛzɪgnˈeɪʃən, dˈæmz, rizˈajnz.) 114 Distribution Over Surface Form: UR / Prob: dæmeɪʃən .80, dæmneɪʃən .10, dæmineɪʃən .001, dæmiineɪʃən .0001, …, chomsky … Encoded as a Weighted Finite-State Automaton.

115

Experimental Design 116

Experimental Datasets: 7 languages from different families – Maori, Tangale, Indonesian, Catalan, English, Dutch, German. Homework exercises (small): can we generalize correctly from small data? CELEX (larger): can we scale up to larger datasets? Can we handle naturally occurring datasets that have more irregularity? (Figure: # of observed words per experiment.)

118 Evaluation Setup r,εz ɪ gn’e ɪʃ n d’æmz riz’ajnz

119 Evaluation Setup dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən did we guess this pronunciation right?

Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky … Exploring the Evaluation Metrics best error rate – Is the 1-best correct? *

Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky … Exploring the Evaluation Metrics best error rate – Is the 1-best correct? Cross Entropy – What is the probability of the correct answer?

Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky … Exploring the Evaluation Metrics best error rate – Is the 1-best correct? Cross Entropy – What is the probability of the correct answer? Expected Edit Distance – How close am I on average?

Distribution Over Surface Form: UR Prob dæme ɪʃ ən.80 dæmne ɪʃ ən.10 dæmine ɪʃ ən..001 dæmiine ɪʃ ən.0001 … … chomsky … Exploring the Evaluation Metrics best error rate – Is the 1-best correct? Cross Entropy – What is the probability of the correct answer? Expected Edit Distance – How close am I on average? Average over many training-test splits

Evaluation Metrics: (Lower is Always Better) – 1-best error rate (did we get it right?) – cross-entropy (what probability did we give the right answer?) – expected edit-distance (how far away on average are we?) – Average each metric over many training-test splits Comparisons: – Lower Bound: Phonology as noisy concatenation – Upper Bound: Oracle URs from linguists 124
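To make the three metrics concrete, here is a small self-contained sketch that scores a predicted distribution over surface forms against the gold pronunciation: 1-best error, cross-entropy in bits, and expected edit distance under the prediction. The example distribution is invented, loosely following the earlier "damnation" table.

```python
import math

def edit_distance(a, b):
    """Standard Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def evaluate(pred_dist, gold):
    """pred_dist: dict mapping candidate surface forms to probabilities."""
    one_best = max(pred_dist, key=pred_dist.get)
    return {
        "1-best error": float(one_best != gold),
        "cross-entropy (bits)": -math.log2(pred_dist.get(gold, 1e-12)),
        "expected edit distance": sum(p * edit_distance(x, gold)
                                      for x, p in pred_dist.items()),
    }

# Invented predicted distribution for the surface form of "damnation":
print(evaluate({"dæmneɪʃən": 0.10, "dæmeɪʃən": 0.80, "dæmineɪʃən": 0.10}, "dæmneɪʃən"))
```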

Evaluation Philosophy We’re evaluating a language learner, on languages we didn’t examine when designing the learner. We directly evaluate how well our learner predicts held-out words that the learner didn’t see. No direct evaluation of intermediate steps: – Did we get the “right” underlying forms? – Did we learn a “simple” or “natural” phonology? – It’s hard to judge the answers. Anyway, we only want the answers to be “yes” because we suspect that this will give us a more predictive theory. So let’s just see if the theory is predictive. Proof is in the pudding! Caveat: Linguists and child language learners also have access to other kinds of data that we’re not considering yet. 125

Results (using Loopy Belief Propagation for inference) 126

German Results 127 Error Bars with bootstrap resampling

CELEX Results 128

Phonological Exercise Results 129

Gold UR Recovery 130

Formalizing Our Setup Many scoring functions on strings (e.g., our phonology model) can be represented using FSMs 131

1) Morphemes 2) Underlying words 3) Surface words Concatenation Phonology 132 What class of functions will we allow for the factors? (black squares) dæmn e ɪʃ ən srizaign r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rizaign#e ɪʃ ən rizaign#s dæmn#s dæmn#e ɪʃ ən

Real-Valued Functions on Strings We’ll need to define some nonnegative functions. f(x) = score of a string f(x,y) = score of a pair of strings Can represent deterministic processes: f(x,y) ∈ {1,0}  Is y the observable result of deleting latent material from x? Can represent probability distributions, e.g.,  f(x) = p(x) under some “generating process”  f(x,y) = p(x,y) under some “joint generating process”  f(x,y) = p(y | x) under some “transducing process” 133

Restrict to Finite-State Functions. One string input, Boolean output: an acceptor with arcs a, c. Two string inputs (on 2 tapes), Boolean output: arcs a:x, c:z, ε:y. One string input, real output: weighted arcs a/.5, c/.7, ε/.5, .3. Two string inputs, real output: weighted arcs a:x/.5, c:z/.7, ε:y/.5, .3. Path weight = product of arc weights. Score of input = total weight of accepting paths. 134

Example: Stochastic Edit Distance p(y|x). Arcs: a:ε, b:ε (O(k) deletion arcs); ε:a, ε:b (O(k) insertion arcs); a:b, b:a (O(k²) substitution arcs); a:a, b:b (O(k) identity arcs). Likely edits = high-probability arcs. 135

Computing p(y|x). Given (x,y), construct a graph of all accepting paths in the original FSM. These are different explanations for how x could have been edited to yield y (x-to-y alignments). Use dynamic programming to find the highest-prob path, or the total prob of all paths. (Figure: the alignment lattice for upper string c l a r a and lower string c a …, indexed by position in the upper string and position in the lower string, with arcs like c:ε, ε:c, c:c.) 136
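A minimal sketch of that dynamic program for a memoryless (single-state) stochastic edit model: fill a table indexed by (position in the upper string x, position in the lower string y), summing the probabilities of deletion, insertion, and copy/substitution arcs. The edit probabilities are invented, the alphabet size is fixed at 26, and the stopping probability that a proper PFST would include is omitted for brevity.

```python
def p_y_given_x(x, y, p_del=0.05, p_ins=0.05, p_sub=0.01, p_copy=0.80):
    """Total probability of all x-to-y alignment paths in a single-state
    stochastic edit transducer (insertions/substitutions spread uniformly
    over a toy alphabet of size 26; stop probability omitted)."""
    k = 26.0
    n, m = len(x), len(y)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n:                       # delete x[i]
                f[i + 1][j] += f[i][j] * p_del
            if j < m:                       # insert y[j]
                f[i][j + 1] += f[i][j] * p_ins / k
            if i < n and j < m:             # copy or substitute
                p = p_copy if x[i] == y[j] else p_sub / k
                f[i + 1][j + 1] += f[i][j] * p
    return f[n][m]

print(p_y_given_x("rizajgnz", "rizajnz"))   # sums over alignments that delete the 'g'
```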

Why restrict to finite-state functions? Can always compute f(x,y) efficiently.  Construct the graph of accepting paths by FSM composition.  Sum over them via dynamic prog., or by solving a linear system. Finite-state functions are closed under useful operations:  Marginalization: h(x) = ∑ y f(x,y)  Pointwise product: h(x) = f(x) ∙ g(x)  Join: h(x,y,z) = f(x,y) ∙ g(y,z) 137

Define a function family Use finite-state machines (FSMs). The arc weights are parameterized. We tune the parameters to get weights that predict our training data well. The FSM topology defines a function family. In practice, generalizations of stochastic edit distance.  So, we are learning the edit probabilities.  With more states, these can depend on left and right context. 138

Probabilistic FSTs 139

Probabilistic FSTs 140

Finite-State Graphical Model Over String-Valued Random Variables ● Joint distribution over many strings ● Variables range over Σ*, the infinite set of all strings ● Relations among variables are usually specified by (multi-tape) FSTs 141 A probabilistic approach to language change (Bouchard-Côté et al., NIPS 2008). Graphical models over multiple strings (Dreyer and Eisner, EMNLP 2009). Large-scale cognate recovery (Hall and Klein, EMNLP 2011).

Useful 3-tape FSMs. f(x,y,z) = p(z | x, y): typically z is functionally dependent on x,y, so this represents a deterministic process. Concatenation: f(dog, s, dogs) = 1. Binyan: f(ktb, a _ _ u _, aktub) = 1. Remark: WLOG, we can do everything with 2-tape FSMs. Similar to binary constraints in CSP. 142

Computational Hardness Ordinary graphical model inference is sometimes easy and sometimes hard, depending on graph topology But over strings, it can be hard even with simple topologies and simple finite-state factors  143

Simple model family can be NP-hard. Multi-sequence alignment problem: generalize edit distance to k strings of length O(n). Dynamic programming would seek the best path in a hypercube of size O(n^k). Similar to the Steiner string problem (“consensus”). (Figure: the multi-dimensional alignment lattice.) 144

Simple model family can be undecidable (!) Post’s Correspondence Problem (1946): Given a 2-tape FSM f(x,y) of a certain form, is there a string z for which f(z,z) > 0? No Turing Machine can decide this in general. So no Turing machine can determine in general whether this simple factor graph has any positive-weight solutions: a single string variable z attached to the factor f on both tapes. (Example instance: z = bbaabbbaa, read as bba+ab+bba+a on one tape and bb+aa+bb+baa on the other.) 145

Inference by Belief Propagation 146

147

148

149

150

151

152

153

Loopy belief propagation (BP) The beautiful observation (Dreyer & Eisner 2009):  Each message is a 1-tape FSM that scores all strings.  Messages are iteratively updated based on other messages, according to the rules of BP.  The BP rules require only operations under which FSMs are closed! Achilles’ heel:  The FSMs may grow large as the algorithm iterates. So the algorithm may be slow, or not converge at all. 154

Belief Propagation (BP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X6X6 X5X5

dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz 156

Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Factor to Variable Messages 157

Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Variable to Factor Messages 158

Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz Encoded as Finite- State Machines r in g u e ε s e h a r in g u e ε e e s e h a r in g u e ε e e s e h a r in g u e ε e e s e h a r in g u e ε s e h a r in g u e ε s e h a 159

Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz 160

Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz r in g u e ε e e s e h a r in g u e ε e e s e h a 161

Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmz r i n g u e ε e e s e h a r i n g u e ε e e s e h a r i n g u e ε e e s e h a Point-wise product (finite-state intersection) yields marginal belief 162
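As a toy stand-in for those finite-state operations, the sketch below represents each message as a dictionary over a small explicit set of candidate strings and computes a belief as the normalized pointwise product of the incoming messages. In the actual model the messages are weighted FSAs over all of Σ*, so the product is computed by automaton intersection instead; the candidate strings and scores here are invented.

```python
from functools import reduce

def pointwise_product(messages):
    """Belief(x) is proportional to the product of incoming message scores for x.
    Messages here are dicts over a finite candidate set; in the actual model they
    are weighted FSAs and the pointwise product is finite-state intersection."""
    support = set.intersection(*(set(m) for m in messages))
    belief = {x: reduce(lambda a, b: a * b, (m[x] for m in messages), 1.0)
              for x in support}
    z = sum(belief.values())
    return {x: v / z for x, v in belief.items()}

# Invented incoming messages about the underlying form of "resigns":
msg_from_concat = {"rizajgnz": 0.6, "rizajnz": 0.3, "rezignz": 0.1}
msg_from_phonology = {"rizajgnz": 0.5, "rizajnz": 0.45, "rezignz": 0.05}
print(pointwise_product([msg_from_concat, msg_from_phonology]))
```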

Belief Propagation (BP) in a Nutshell dˌæmnˈeɪʃən rizˈajnz rˌɛzɪgnˈeɪʃən dæmnz rizajgnz rizajgneɪʃən dæmneɪʃən eɪʃən z z rizajgn dæmn dˈæmnz Distribution Over Underlying Forms: UR Prob rizajgnz.95 rezajnz.02 rezigz.02 rezgz.0001 … chomsky … r i n g u e ε e e s e h a r i n g u e ε e e s e h a r i n g u e ε e e s e h a 163

Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5

X1X1 X2X2 X3X3 X4X4 X7X7 X5X5

Belief Propagation (BP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X6X6 X5X5 r in g u e ε e e s e h a r in g u e ε s e h a r in g u e ε e e s e h a r in g u e ε s e h a

Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a

X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 C C r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Computation of belief results in large state space

Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 C C r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Computation of belief results in large state space What a hairball!

Computing Marginal Beliefs X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a Approximation Required!!!

BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex. X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a

BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a

BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a aa

BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a a a a

BP over String-Valued Variables In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! X2X2 X1X1 ψ2ψ2 a a ε a a a a ψ1ψ1 a a a a a a a aa a a a a aaa aaaaaaaaa

Inference by Expectation Propagation 176

Expectation Propagation: The Details. A belief at a variable is just the point-wise product of its incoming messages. Key Idea: For each message, we seek an approximate message from a tractable family. Algorithm: for each message, compute it and then project it back into the tractable family. 177

Expectation Propagation (EP) X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 exponential-family approximations inside Belief at X 3 will be simple! Messages to and from X 3 will be simple! 178

Expectation propagation (EP) EP solves the problem by simplifying each message once it is computed.  Projects the message back into a tractable family. 179

Expectation Propagation (EP) exponential-family approximations inside Belief at X 3 will be simple! Messages to and from X 3 will be simple! X3X3

Expectation propagation (EP) EP solves the problem by simplifying each message once it is computed.  Projects the message back into a tractable family. In our setting, we can use n-gram models.  f_approx(x) = product of weights of the n-grams in x  Just need to choose weights that give a good approx. Best to use variable-length n-grams. 181
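A small sketch of what such an approximate message looks like: a product-of-weights scorer over the (variable-length) n-grams of a padded string. The weights below are invented; in EP they would be chosen so that the approximation matches the exact message well (e.g., by matching expected n-gram counts).

```python
def ngrams(x, max_n=3):
    """All substrings of length 1..max_n of the padded string ^x$."""
    s = "^" + x + "$"
    return [s[i:i + n] for n in range(1, max_n + 1) for i in range(len(s) - n + 1)]

def f_approx(x, weights, max_n=3):
    """Approximate message: product of the weights of the n-grams occurring in x
    (n-grams with no stored weight contribute a neutral factor of 1)."""
    score = 1.0
    for g in ngrams(x, max_n):
        score *= weights.get(g, 1.0)
    return score

# Invented variable-order weights (a real EP projection step would fit these):
w = {"z$": 3.0, "gn": 0.2, "ajn": 2.5}
for cand in ["rizajgnz", "rizajnz"]:
    print(cand, f_approx(cand, w))
```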

Expectation Propagation (EP) in a Nutshell X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a

X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a r in g u e ε s e h a

X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a r in g u e ε s e h a

X1X1 X2X2 X3X3 X4X4 X7X7 X5X5 r in g u e ε s e h a

X1X1 X2X2 X3X3 X4X4 X7X7 X5X5

Expectation propagation (EP) 187

Variable Order Approximations Use only the n-grams you really need!

Approximating beliefs with n-grams How to optimize this?  Option 1: Greedily add n-grams by expected count in f. Stop when adding the next batch hurts the objective.  Option 2: Select n-grams from a large set using a convex relaxation + projected gradient (= tree-structured group lasso). Must incrementally expand the large set (“active set” method). 189

Results using Expectation Propagation 190

Trigram EP (Cyan) – slow, very accurate Baseline (Black) – slow, very accurate (pruning) Penalized EP (Red) – pretty fast, very accurate Bigram EP (Blue) – fast but inaccurate Unigram EP (Green) – fast but inaccurate Speed ranking (upper graph) Accuracy ranking (lower graph) … essentially opposites … 191

192

Inference by Dual Decomposition Exact 1-best inference! (termination can’t be guaranteed in general because of undecidability, but it does terminate in practice) 193

Challenges in Inference 194 Global discrete optimization problem. Variables range over an infinite set … cannot be solved by ILP or even brute force. Undecidable! Our previous papers used approximate algorithms: Loopy Belief Propagation, or Expectation Propagation. Q: Can we do exact inference? A: If we can live with 1-best and not marginal inference, then we can use Dual Decomposition … which is exact. (if it terminates! the problem is undecidable in general …)

Graphical Model for Phonology 195 Jointly decide the values of the inter-dependent latent variables, which range over an infinite set. 1) Morpheme URs 2) Word URs 3) Word SRs Concatenation (e.g.) Phonology (PFST) srizajgn e ɪʃ ən dæmn rεz ɪ gn#e ɪʃ ən rizajn#s dæmn#e ɪʃ ən dæmn#s r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rεzignrεzign e ɪʃ ən

General Idea of Dual Decomp 196 srizajgn e ɪʃ ən dæmn rεz ɪ gn#e ɪʃ ən rizajn#s dæmn#e ɪʃ ən dæmn#s r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz rεzignrεzign e ɪʃ ən

General Idea of Dual Decomp zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 197

I prefer rεz ɪ gn I prefer rizajn General Idea of Dual Decomp zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 198

zrizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεz ɪ gn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem 1Subproblem 2Subproblem 3Subproblem 4 199

Substring Features and Active Set z rizajn e ɪʃ ən dæmn e ɪʃ ən zdæmn rεzɪgnrεzɪgn rεz ɪ gn#e ɪʃ ən rizajn#z dæmn#e ɪʃ ən dæmn#z r,εz ɪ gn’e ɪʃ n riz’ajnz d,æmn’e ɪʃ n d’æmz Subproblem I prefer rεz ɪ gn Less ε, ɪ, g; more i, a, j (to match others) Less ε, ɪ, g; more i, a, j (to match others) I prefer rizajn Less i, a, j; more ε, ɪ, g (to match others) Less i, a, j; more ε, ɪ, g (to match others)

Features: “Active set” method How many features? Infinitely many possible n-grams! Trick: Gradually increase feature set as needed. – Like Paul & Eisner (2012), Cotterell & Eisner (2015) 1. Only add features on which strings disagree. 2. Only add abcd once abc and bcd already agree. – Exception: Add unigrams and bigrams for free. 201
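A sketch of that active-set rule, assuming features are counts of n-grams of the padded string: propose only n-grams whose counts disagree across the subproblems' current solutions, always allow unigrams and bigrams, and admit a longer n-gram only once its two shorter sub-parts no longer disagree. The strings in the example are borrowed from the Catalan "grey" illustration on the following slides; everything else is invented.

```python
from collections import Counter

def ngram_counts(x, n):
    s = "^" + x + "$"
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def disagreeing_ngrams(strings, n):
    counts = [ngram_counts(x, n) for x in strings]
    grams = set().union(*counts)
    return {g for g in grams if len({c[g] for c in counts}) > 1}

def grow_active_set(strings, active, max_n=4):
    """Add n-grams on which the subproblems' current solutions disagree.
    Unigrams/bigrams are free; a longer n-gram is added only once its two
    (n-1)-gram sub-parts no longer disagree (they've already been resolved)."""
    new = set(active)
    for n in range(1, max_n + 1):
        for g in disagreeing_ngrams(strings, n):
            if n <= 2 or (g[:-1] not in disagreeing_ngrams(strings, n - 1)
                          and g[1:] not in disagreeing_ngrams(strings, n - 1)):
                new.add(g)
    return new

print(sorted(grow_active_set(["griz", "gris", "grizo"], active=set())))
```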

Fragment of Our Graph for Catalan 202 ? ? grizos ? gris ? ? grize ?? grizes ? ? Stem of “grey” Separate these 4 words into 4 subproblems as before …

203 ? ? grizos ? gris ? ? grize ? ? ? ? grizes Redraw the graph to focus on the stem …

204 ? ? grizos ? gris ? ? grize ? ? grizes ? ? ?? ? Separate into 4 subproblems – each gets its own copy of the stem

205 ? ? grizos ? gris ε ? grize ? ? grizes ? ? εε ε nonzero features: { } Iteration: 1

206 ? ? grizos ? gris g ? grize ? ? grizes ? ? gg g nonzero features: { } Iteration: 3

207 ? ? grizos ? gris ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$ } Iteration: 4 Feature weights (dual variable)

208 ? ? grizos ? gris ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 5 Feature weights (dual variable)

209 ? ? grizos ? gris ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 6 Iteration: 13 Feature weights (dual variable)

210 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizgrizo griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 14 Feature weights (dual variable)

211 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } nonzero features: {s, z, is, iz, s$, z$, o, zo, o$ } Iteration: 17 Feature weights (dual variable)

212 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizegriz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 18 Feature weights (dual variable)

213 ? ? grizos ? gris griz ? grize ? ? grizes ? ? grizegriz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 19 Iteration: 29 Feature weights (dual variable)

214 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 30 Feature weights (dual variable)

215 ? ? grizos ? gris griz ? grize ? ? grizes ? ? griz nonzero features: {s, z, is, iz, s$, z$, o, zo, o$, e, ze, e$} Iteration: 30 Converged!

Why n-gram features? 216 Positional features don’t understand insertion: (“I’ll try to arrange for r not i at position 2, i not z at position 3, z not ε at position 4.”) In contrast, our “z” feature counts the number of “z” phonemes, without regard to position. These solutions already agree on “g”, “i”, “z” counts … they’re only negotiating over the “r” count. (giz vs. griz: “I need more r’s.”)

Why n-gram features? 217 Adjust weights λ until the “r” counts match: Next iteration agrees on all our unigram features: – Oops! Features matched only counts, not positions  – But bigram counts are still wrong … so bigram features get activated to save the day – If that’s not enough, add even longer substrings … giz griz I need more r’s … somewhere. girz griz I need more gr, ri, iz, less gi, ir, rz.

Results using Dual Decomposition 218

7 Inference Problems (graphs) EXERCISE (small) o 4 languages: Catalan, English, Maori, Tangale o 16 to 55 underlying morphemes. o 55 to 106 surface words. CELEX (large) o 3 languages: English, German, Dutch o 341 to 381 underlying morphemes. o 1000 surface words for each language. 219 # vars (unknown strings) # subproblems

Experimental Setup o Model 1: very simple phonology with only 1 parameter, trained by grid search. o Model 2S: sophisticated phonology with phonological features, trained with hand-crafted morpheme URs: full supervision. o Model 2E: sophisticated phonology as Model 2S, trained by EM. o Evaluating inference on recovered latent variables under the different settings. 220

Experimental Questions o Is exact inference by DD practical? o Does it converge? o Does it get better results than approximate inference methods? o Does exact inference help EM? 221

● DD seeks the best λ via a subgradient algorithm: reduce the dual objective, i.e., tighten the upper bound on the primal objective. ● If λ gets all sub-problems to agree (x₁ = … = x_K), the constraints are satisfied, so the dual value is also the value of a primal solution, which must be the max primal! (and min dual) 222 (primal, a function of the strings x) ≤ (dual, a function of the weights λ)
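A toy sketch of that subgradient loop with two subproblems that share one string variable, each restricted to a tiny explicit candidate set (the real subproblems are decoded with finite-state dynamic programming). Each copy maximizes its own score plus or minus λ·features, λ is pushed along the disagreement in feature counts with a decaying step size, and agreement certifies an exact MAP solution. Candidates, scores, and features are invented.

```python
from collections import Counter

def feats(x):                        # unigram-count features of the padded string
    return Counter("^" + x + "$")

def solve_subproblem(candidates, scores, lam, sign):
    """argmax over a toy candidate set of score(x) + sign * (lam . feats(x))."""
    def total(x):
        return scores[x] + sign * sum(lam.get(g, 0.0) * c for g, c in feats(x).items())
    return max(candidates, key=total)

# Two invented subproblems that share one string variable (a stem):
cands = ["gris", "griz", "grizo"]
scores1 = {"gris": 2.0, "griz": 1.5, "grizo": 0.0}   # subproblem 1 prefers 'gris'
scores2 = {"gris": 1.0, "griz": 2.0, "grizo": 0.5}   # subproblem 2 prefers 'griz'

lam = Counter()
for it in range(100):
    x1 = solve_subproblem(cands, scores1, lam, sign=+1)
    x2 = solve_subproblem(cands, scores2, lam, sign=-1)
    if x1 == x2:
        print("agreed on", x1, "at iteration", it)   # exact MAP for the joint problem
        break
    step = 0.5 / (it + 1)            # decaying subgradient step size
    for g in set(feats(x1)) | set(feats(x2)):
        lam[g] -= step * (feats(x1)[g] - feats(x2)[g])
```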

Convergence behavior (full graph): Catalan, Maori, English, Tangale. 223 (Plot annotations: Dual (tighten upper bound), primal (improve strings), optimal!)

Comparisons ● Compare DD with two types of Belief Propagation (BP) inference. Approximate MAP inference (max-product BP) (baseline) Approximate marginal inference (sum-product BP) (TACL 2015) Exact MAP inference (dual decomposition) (this paper) 224 Exact marginal inference (we don’t know how!) variational approximation Viterbi approximation

Inference accuracy 225
Approximate MAP inference (max-product BP) (baseline): Model 1, EXERCISE: 90%; Model 1, CELEX: 84%; Model 2S, CELEX: 99%; Model 2E, EXERCISE: 91%.
Approximate marginal inference (sum-product BP) (TACL 2015): Model 1, EXERCISE: 95%; Model 1, CELEX: 86%; Model 2S, CELEX: 96%; Model 2E, EXERCISE: 95%.
Exact MAP inference (dual decomposition) (this paper): Model 1, EXERCISE: 97%; Model 1, CELEX: 90%; Model 2S, CELEX: 99%; Model 2E, EXERCISE: 98%.
Model 1 – trivial phonology; Model 2S – oracle phonology; Model 2E – learned phonology (inference used within EM). (Annotations: improves / improves more! / worse.)

Conclusion A general DD algorithm for MAP inference on graphical models over strings. On the phonology problem, terminates in practice, guaranteeing the exact MAP solution. Improved inference for supervised model; improved EM training for unsupervised model. Try it for your own problems generalizing to new strings! 226

observed data hidden data probability distribution Future Work 227

Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to figure that out from raw text:  Related spellings  Related contexts 228 shared morphemes?

Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to figure that out from raw text:  “resignation” and “resigns” are spelled similarly And appear in similar semantic contexts (topics, dependents)  “resignation” and “damnation” are spelled similarly And appear in similar syntactic contexts (singular nouns)  Abstract morphemes fall into classes: RESIGN-, DAMN- are verbs while -ATION, -S attach to verbs 229

How is morphology like clustering?

Linguistics quiz: Find a morpheme Blah blah blah snozzcumber blah blah blah. Blah blah blahdy abla blah blah. Snozzcumbers blah blah blah abla blah. Blah blah blah snezzcumbri blah blah snozzcumber.

Dreyer & Eisner 2011 – “select & mutate” Many possible morphological slots

Many possible phylogenies Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati NEW Andrews, Dredze, & Eisner 2014 – “select & mutate”

Future: Which words are related? So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to get that from raw text. To infer the abstract morphemes from context, we need to extend our generative story to capture regularity in morpheme sequences.  Neural language models …  But, must deal with unknown, unbounded vocabulary of morphemes 234

Future: Improving the factors More attention to representations and features (autosegmental phonology) Algorithms require us to represent each factor as a WFST defining ψ(x,y)  Good modeling reasons to use a Bayes net.  Then ψ(x,y) = p(y | x) so each factor is a PFST.  Alas, PFST is left/right asymmetric (label bias)!  Can we substitute a WFST that defines p(y | x) only up to a normalizing constant Z(x)? “Double intractability” since x is unknown: expensive even to explore x by Gibbs sampling!  How about features that depend on all of x? CRF / RNN / LSTM? 235

Reconstructing the (multilingual) lexicon:
Index | Spelling | Meaning | Pronunciation | Syntax
123 | ca |  | [si.ei] | NNP (abbrev)
124 | can |  | [kɛɪn] | NN
125 | can |  | [kæn], [kɛn], … | MD
126 | cane |  | [keɪn] | NN (mass)
127 | cane |  | [keɪn] | NN
128 | canes |  | [keɪnz] | NNS
236 (other columns would include translations, topics, counts, embeddings, …)

Conclusions Unsupervised learning of how all the words in a language (or across languages) are interrelated.  This is what kids and linguists do.  Given data, estimate a posterior distribution over the infinite probabilistic lexicon.  While training parameters that model how lexical entries are related (language-specific derivational processes or soft constraints). Starting to look feasible!  We now have a lot of the ingredients – generative models and algorithms. 237