1 Parameterized Finite-State Machines and Their Training. Jason Eisner, Johns Hopkins University. March 4, 2004 — Saarbrücken

2 Outline – The Vision Slide! 1. Finite-state machines as a shared language for models & data. 2. The training gizmo (an algorithm). We should use out-of-the-box finite-state gizmos to build and train most of our current models. Easier, faster, better, and enables fancier models.

3 Function from strings to... Four kinds of machine: unweighted acceptor (FSA) → {false, true}; weighted acceptor → numbers; unweighted transducer (FST) → strings; weighted transducer → (string, num) pairs. [Diagram: example arcs such as a:x/.5, c:z/.7, ε:y/.5]

4 Sample functions. Unweighted acceptor: Grammatical? Weighted acceptor: How grammatical? Better, how likely? Unweighted transducer: Markup, Correction, Translation. Weighted transducer: Good markups, Good corrections, Good translations.

5 Sample data, encoded same way. Unweighted acceptors: Input string, Corpus, Dictionary. Weighted acceptors: Input lattice, Reweighted corpus, Weighted dictionary. Unweighted transducers: Bilingual corpus, Bilingual lexicon, Database (WordNet). Weighted transducers: Prob. bilingual lexicon, Weighted database. [Diagram: input lattice over "banana"/"bandaid"]

6 Some Applications: Prediction, classification, generation of text; more generally, "filling in the blanks" (probabilistic reconstruction of hidden data); Speech recognition; Machine translation, OCR, other noisy-channel models; Sequence alignment / Edit distance / Computational biology; Text normalization, segmentation, categorization; Information extraction; Stochastic phonology/morphology, including lexicon; Tagging, chunking, finite-state parsing; Syntactic transformations (smoothing PCFG rulesets)

7 Finite-state "programming". Programming Langs vs. Finite-State World: Function ↔ Function on strings; Source code (written by programmer) ↔ Regular expression (written by programmer), e.g. a?c*; compiler gives object code ↔ regexp compiler gives finite-state machine; optimizer gives better object code ↔ determinization, minimization, pruning give a better machine.

8 Finite-state "programming". Programming Langs vs. Finite-State World: Function composition ↔ FST/WFST composition; Function inversion (available in Prolog) ↔ FST inversion; Higher-order functions... ↔ Finite-state operators...; Small modular cooperating functions (structured programming) ↔ Small modular regexps, combined via operators.

9 Finite-state "programming". Programming Langs vs. Finite-State World: More features you wish other languages had!

10 Currently Two Finite-State Camps: Plain FSTs vs. Probabilistic FSTs. What they represent: functions on strings, or nondeterministic functions (relations) / probability distributions p(x,y) or p(y|x). How they're currently used: encode expert knowledge about Arabic morphology, etc. / noisy channel models p(x) p(y|x) p(z|y) … (much more limited). How they're currently built: fancy regular expressions (or sometimes TBL) / build parts by hand, get arc weights for each part somehow, then combine parts (much more limited).

11 How About Learning??? Training Probabilistic FSMs. State of the world – surprising: training for HMMs, alignment, many variants; but no basic training algorithm for all FSAs; fancy toolkits for building them, but no learning. New algorithm: training for FSAs, FSTs, … (collectively FSMs); supervised, unsupervised, incompletely supervised; train components separately or all at once; epsilon-cycles OK; complicated parameterizations OK. "If you build it, it will train"

12 Current Limitation. Big FSM must be made of separately trainable parts. Knight & Graehl transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text). Need explicit training data for this part (smaller loanword corpus) – a pity, would like to use guesses. Topology must be simple enough to train by current methods – a pity, would like to get some of that expert knowledge in here! Topology: sensitive to syllable struct? Parameterization: /t/ and /d/ are similar phonemes … parameter tying?

13 Probabilistic FSA. Example: ab is accepted along 2 paths: p(ab) = (.5 × .7 × .3) + (.5 × .6 × .4) = .225. [Diagram: FSA with arcs ε/.5, a/1, a/.7, b/.3, ε/.5, b/.6 and final weight .4] Regexp: (a*_.7 b) +_.5 (a b*_.6), where the subscripts are the probabilities attached to * and +. Theorem: Any probabilistic FSM has a regexp like this.
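
The path sum above can be checked with a few lines of code. The sketch below (Python; not from the talk) hard-codes an FSA with the topology implied by the regexp (a*_.7 b) +_.5 (a b*_.6), with invented state numbers, and sums over all accepting paths for "ab".

# A minimal sketch: brute-force path summation in the small probabilistic FSA above.
ARCS = {                      # state -> list of (label, prob, next_state); None = epsilon
    0: [(None, 0.5, 1), (None, 0.5, 3)],
    1: [("a", 0.7, 1), ("b", 0.3, 2)],
    3: [("a", 1.0, 4)],
    4: [("b", 0.6, 4)],
}
FINAL = {2: 1.0, 4: 0.4}      # final states and their stopping weights

def prob(state, symbols):
    """Total probability of all accepting paths from `state` that emit `symbols`."""
    total = FINAL.get(state, 0.0) if not symbols else 0.0
    for label, p, nxt in ARCS.get(state, []):
        if label is None:                       # epsilon arc: consume nothing
            total += p * prob(nxt, symbols)
        elif symbols and symbols[0] == label:   # consume one input symbol
            total += p * prob(nxt, symbols[1:])
    return total

print(prob(0, "ab"))   # 0.225 = (.5 * .7 * .3) + (.5 * .6 * .4)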

14 Weights Need Not Be Reals. Example: ab is accepted along 2 paths: weight(ab) = (p ⊗ q ⊗ r) ⊕ (w ⊗ x ⊗ y ⊗ z). [Diagram: the same FSA with arcs ε/p, a/q, b/r on one path and ε/w, a/x, b/y with final weight z on the other] If ⊕, ⊗, * satisfy the "semiring" axioms, the finite-state constructions continue to work correctly.

15 Goal: Parameterized FSMs. A parameterized FSM is an FSM whose arc probabilities depend on parameters: they are formulas. [Diagram: arcs with weights such as ε/p, ε/(1-p), a/q, a/r, a/q·exp(t+u), a/exp(t+v), b/(1-q)r, 1-s] Expert first: construct the FSM (topology & parameterization). Automatic takes over: given training data, find parameter values that optimize arc probs.

16 Goal: Parameterized FSMs. A parameterized FSM is an FSM whose arc probabilities depend on parameters: they are formulas. [Diagram: the same machine with concrete arc probabilities, e.g. ε/.1, ε/.9, a/.44, a/.2, a/.3, a/.56, b/.8, .7] Expert first: construct the FSM (topology & parameterization). Automatic takes over: given training data, find parameter values that optimize arc probs.

17 Goal: Parameterized FSMs. An FSM whose arc probabilities are formulas. Knight & Graehl transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text). "Would like to get some of that expert knowledge in here." Use probabilistic regexps like (a*_.7 b) +_.5 (a b*_.6) … If the probabilities are variables, (a*_x b) +_y (a b*_z) …, then the arc weights of the compiled machine are nasty formulas. (Especially after minimization!)

18 Goal: Parameterized FSMs. An FSM whose arc probabilities are formulas. Knight & Graehl transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text). "/t/ and /d/ are similar …" Tied probs for doubling them: the arcs /t/:/tt/ and /d/:/dd/ share the same probability p.

19 Goal: Parameterized FSMs. An FSM whose arc probabilities are formulas. Knight & Graehl transliteration: p(English text) o p(English text → English phonemes) o p(English phonemes → Japanese phonemes) o p(Japanese phonemes → Japanese text). "/t/ and /d/ are similar …" Loosely coupled probabilities: /t/:/tt/ gets exp(p+q+r) (coronal, stop, unvoiced); /d/:/dd/ gets exp(p+q+s) (coronal, stop, voiced) (with normalization).

20 Outline of this talk. 1. What can you build with parameterized FSMs? 2. How do you train them?

21 Finite-State Operations. Projection GIVES YOU marginal distribution: p(x) = domain( p(x,y) ); p(y) = range( p(x,y) ). [Diagram: arc a:b/0.3]

22 Finite-State Operations. Probabilistic union GIVES YOU mixture model: combine the machines for p(x) and q(x) into one machine for their union. [Diagram: machines for p(x) and q(x) joined by a choice at the start]

23 Finite-State Operations. Probabilistic union GIVES YOU mixture model: p(x) +_λ q(x) = λ p(x) + (1-λ) q(x). [Diagram: union machine whose two branches carry weights λ and 1-λ] Learn the mixture parameter λ!

24 Finite-State Operations. Composition GIVES YOU chain rule: p(x|z) = p(x|y) o p(y|z); p(x, z) = p(x|y) o p(y|z) o z. The most popular statistical FSM operation. Cross-product construction.
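
As a concrete illustration of what composition computes, here is a tiny Python sketch with invented distributions (not from the talk): composing p(x|y) with p(y|z) marginalizes out the shared middle tape y, i.e., p(x|z) = Σ_y p(x|y) p(y|z).

# Chain rule via "composition" of two small conditional distributions (toy example).
p_x_given_y = {"y1": {"x1": 0.9, "x2": 0.1},
               "y2": {"x1": 0.2, "x2": 0.8}}
p_y_given_z = {"z1": {"y1": 0.7, "y2": 0.3},
               "z2": {"y1": 0.4, "y2": 0.6}}

def compose(pxy, pyz):
    """Marginalize out the middle variable y, as WFST composition does on its shared tape."""
    out = {}
    for z, ys in pyz.items():
        out[z] = {}
        for y, pz in ys.items():
            for x, px in pxy[y].items():
                out[z][x] = out[z].get(x, 0.0) + px * pz
    return out

p_x_given_z = compose(p_x_given_y, p_y_given_z)
print(p_x_given_z["z1"])   # {'x1': 0.69, 'x2': 0.31}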

25 Finite-State Operations. Concatenation and probabilistic closure HANDLE unsegmented text: p(x) q(x) (concatenation), p(x)* (closure). Just glue together machines for the different segments, and let them figure out how to align with the text.

26 Finite-State Operations. Directed replacement MODELS noise or postprocessing: p(x, noisy y) = p(x,y) o D, where D is a noise model defined by directed replacement. Resulting machine compensates for noise or postprocessing.

27 Finite-State Operations. Intersection GIVES YOU product models, e.g., exponential / maxent, perceptron, Naïve Bayes, …: p(x)·q(x) = p(x) & q(x); e.g. p_NB(y|x) ∝ p(y) & p(A(x)|y) & p(B(x)|y) & … Cross-product construction (like composition). Need a normalization op too – computes Σ_x f(x), the "pathsum" or "partition function".

28 Finite-State Operations. Conditionalization (new operation): p(y|x) = condit( p(x,y) ). Construction: reciprocal(determinize(domain( p(x,y) ))) o p(x,y) – not possible for all weighted FSAs. Resulting machine can be composed with other distributions: p(y|x) * q(x).

29 Other Useful Finite-State Constructions. Complete graphs YIELD n-gram models. Other graphs YIELD fancy language models (skips, caching, etc.). Compilation from other formalisms → FSM: wordlist (cf. trie), pronunciation dictionary; speech hypothesis lattice; decision tree (Sproat & Riley); weighted rewrite rules (Mohri & Sproat); TBL or probabilistic TBL (Roche & Schabes); PCFG (approximation!) (e.g., Mohri & Nederhof); Optimality Theory grammars (e.g., Eisner); logical description of set (Vaillette; Klarlund).

30 Regular Expression Calculus as a Programming Language. Programming Langs vs. Finite-State World: Function ↔ Function on strings; Source code (written by programmer) ↔ Regular expression (written by programmer), e.g. a?c*; compiler gives object code ↔ regexp compiler gives finite-state machine; optimizer gives better object code ↔ determinization, minimization, pruning give a better machine.

31 Regular Expression Calculus as a Modelling Language. Oops! Statistical FSMs still done "in assembly language"! Build machines by manipulating arcs and states. For training: get the weights by some exogenous procedure and patch them onto arcs; you may need extra training data for this; you may need to devise and implement a new variant of EM. Would rather build models declaratively: ((a*_.7 b) +_.5 (ab*_.6)) or repl_.9((a:(b +_.3 ε))*, L, R).

32 A Simple Example: Segmentation. tapirseatgrass → tapirs eat grass? tapir seat grass? ... Strategy: build a finite-state model of p(spaced text, spaceless text), then maximize p(???, tapirseatgrass). Start with a distribution p(English word) ≈ a machine D (for dictionary). Construct p(spaced text) ≈ (D space)*_0.99 D. Compose with p(spaceless | spaced) ≈ ((Σ − space) + (space:ε))*.

33 A Simple Example: Segmentation (train on spaced or spaceless text). Strategy: build a finite-state model of p(spaced text, spaceless text), then maximize p(???, tapirseatgrass). Start with a distribution p(English word) ≈ a machine D (for dictionary). Construct p(spaced text) ≈ (D space)*_0.99 D. Compose with p(spaceless | spaced) ≈ ((Σ − space) + (space:ε))*. D should include novel words: D = KnownWord + (Letter*_0.85 Suffix). Could improve to consider letter n-grams, morphology... Noisy channel could do more than just delete spaces: vowel deletion (Semitic); OCR garbling (cl → d, ri → n, rn → m ...).
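
To make the segmentation idea concrete, here is a toy Python sketch that stands in for the FST pipeline: a hypothetical unigram dictionary D (words and probabilities invented) and a Viterbi-style dynamic program over the spaceless text, choosing the segmentation whose product of word probabilities is largest.

# Toy stand-in for the finite-state segmentation model: unigram dictionary + DP.
D = {"tapir": 0.04, "tapirs": 0.02, "eat": 0.05, "seat": 0.03, "grass": 0.04}

def best_segmentation(text):
    # best[i] = (score, segmentation of text[:i]); score = product of word probs
    best = {0: (1.0, [])}
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 8), i):        # words up to 8 letters long
            word = text[j:i]
            if j in best and word in D:
                score = best[j][0] * D[word]
                if i not in best or score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best.get(len(text))

print(best_segmentation("tapirseatgrass"))
# With these invented probabilities, ['tapir', 'seat', 'grass'] wins
# (0.04*0.03*0.04 > 0.02*0.05*0.04 for ['tapirs', 'eat', 'grass']).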

34 Outline. 1. What can you build with parameterized FSMs? 2. How do you train them? Hint: Make the finite-state machinery do the work.

35 How Many Parameters? Final machine p(x,z): 17 weights – 4 sum-to-one constraints = 13 apparently free parameters. But really I built it as p(x,y) o p(z|y): 5 free parameters + 1 free parameter.

36 How Many Parameters? But really I built it as p(x,y) o p(z|y): 5 free parameters + 1 free parameter. Even these 6 numbers could be tied... or derived by formula from a smaller parameter set.

37 How Many Parameters? But really I built it as p(x,y) o p(z|y): 5 free parameters + 1 free parameter. Really I built this as (a:p)*_.7 (b:(p +_.2 q))*_.5: 3 free parameters.

38 Training a Parameterized FST. Given: an expression (or code) to build the FST from a parameter vector θ. 1. Pick an initial value of θ. 2. Build the FST – implements fast prob. model. 3. Run FST on some training examples to compute an objective function F(θ). 4. Collect E-counts or gradient ∇F(θ). 5. Update θ to increase F(θ). 6. Unless we converged, return to step 2.
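
The loop above can be written down directly. The Python sketch below is only a skeleton: build_fst and objective_and_gradient are hypothetical placeholders for the finite-state machinery described later, and plain gradient ascent stands in for whatever update rule is actually used.

# Skeleton of the outer training loop (steps 1-6 above), under the assumptions noted.
def train(theta, data, build_fst, objective_and_gradient,
          step_size=0.1, max_iters=100, tol=1e-6):
    prev = float("-inf")
    for _ in range(max_iters):
        T = build_fst(theta)                       # step 2: compile FST from parameters
        F, grad = objective_and_gradient(T, data)  # steps 3-4: run on training examples
        theta = [t + step_size * g for t, g in zip(theta, grad)]  # step 5: increase F
        if F - prev < tol:                         # step 6: convergence check
            break
        prev = F
    return theta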

39 Training a Parameterized FST. T = (our current FST, reflecting our current guess of the parameter vector). At each training pair (x_i, y_i), collect E-counts or gradients that indicate how to increase p(x_i, y_i). [Diagram: a stream of training pairs (x_1, y_1), (x_2, y_2), (x_3, y_3), …]

40 What are x_i and y_i? T = (our current FST, reflecting our current guess of the parameter vector). [Diagram: x_i feeding into T, producing y_i]

41 What are x_i and y_i? x_i = banana, y_i = bandaid. T = (our current FST, reflecting our current guess of the parameter vector).

42 What are x_i and y_i? Fully supervised: x_i = the string banana, y_i = the string bandaid, each encoded as a straight-line machine. T = (our current FST, reflecting our current guess of the parameter vector).

43 What are x_i and y_i? Loosely supervised: x_i = a lattice of possible inputs (e.g., alternatives around "banana"), y_i = bandaid. T = (our current FST, reflecting our current guess of the parameter vector).

44 What are x_i and y_i? Unsupervised, e.g., Baum-Welch: the transition seq x_i is hidden, the emission seq y_i is observed. x_i = Σ* (anything), y_i = bandaid. T = (our current FST, reflecting our current guess of the parameter vector).

45 Building the Trellis. COMPOSE to get trellis: x_i o T o y_i. Extracts paths from T that are compatible with (x_i, y_i). Tends to unroll loops of T, as in HMMs, but not always.

46 Summing the Trellis. x_i o T o y_i (where x_i, y_i are regexps denoting strings or sets of strings) extracts paths from T that are compatible with (x_i, y_i); tends to unroll loops of T, as in HMMs, but not always. Let t_i = total probability of all paths in trellis = p(x_i, y_i). This is what we want to increase! How to compute t_i? If acyclic (exponentially many paths): dynamic programming. If cyclic (infinitely many paths): solve sparse linear system.
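
For the acyclic case, the dynamic program is just a forward pass in topological order. The Python sketch below sums all path probabilities of a small invented trellis.

# Total path weight of an acyclic trellis by dynamic programming (toy arcs).
from collections import defaultdict

arcs = [            # (from_state, to_state, probability)
    ("start", "q1", 0.6), ("start", "q2", 0.4),
    ("q1", "final", 0.5), ("q1", "q2", 0.5),
    ("q2", "final", 1.0),
]
topo_order = ["start", "q1", "q2", "final"]

alpha = defaultdict(float)          # alpha[q] = total weight of start -> q paths
alpha["start"] = 1.0
for q in topo_order:
    for src, dst, p in arcs:
        if src == q:
            alpha[dst] += alpha[q] * p

print(alpha["final"])   # t_i = total probability of all paths = p(x_i, y_i)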

47 Summing the Trellis. Let t_i = total probability of all paths in trellis = p(x_i, y_i). This is what we want to increase! minimize( epsilonify( x_i o T o y_i ) ) = t_i, where epsilonify replaces all arc labels with ε. Remark: In principle, the FSM minimization algorithm already knows how to compute t_i, although this is not the best method.

48 Example: Baby Think & Baby Talk. [Diagram: a transducer from "think" symbols to "talk" sounds, with arcs such as Mama:m, ε:m/.05, IWant:ε/.1, IWant:u/.8, X:m/.4, X:b/.2] Observe talk mm; recover think, by composition. Consistent analyses and their probabilities: Mama/.005, Mama Iwant/.0005, Mama Iwant Iwant/…, XX/.032; Total = sum of these.

49 Joint Prob. by Double Composition. [Diagram: compose think = Σ*, the transducer T, and talk = mm] p(Σ* : mm) = sum of paths in the composed machine.

50 Joint Prob. by Double Composition. [Diagram: the same composition, highlighting the path for think = Mama IWant] p(Mama IWant : mm) = .0005, one of the paths being summed.

51 Joint Prob. by Double Composition. [Diagram: compose think = Σ*, the transducer T, and talk = mm] p(Σ* : mm) = sum of paths in the composed machine.

52 Summing Over All Paths. [Diagram: compose think = Σ*, T, and talk = mm, then replace all arc labels with ε:ε, keeping weights such as .05, .1, .4, .2] p(Σ* : mm) = sum of paths.

53 Summing Over All Paths. [Diagram: after compose + minimize, a single ε:ε arc whose weight is the path sum] p(Σ* : mm) = sum of paths.

54 Where We Are Now. minimize( epsilonify( x_i o T o y_i ) ) = t_i, which obtains t_i = sum of trellis paths = p(x_i, y_i). Want to change parameters to make t_i increase. Solution: annotate every probability with bookkeeping info, so probabilities know how they depend on parameters. Then the probability t_i will know, too! It will emerge annotated with info (a vector) about how to increase it: find ∇t_i, or use EM. The machine T is built with annotations from the ground up.

55 Probabilistic FSA. Example: ab is accepted along 2 paths: p(ab) = (.5 × .7 × .3) + (.5 × .6 × .4) = .225. [Diagram: FSA with arcs ε/.5, a/1, a/.7, b/.3, ε/.5, b/.6 and final weight .4]

56 Weights Need Not Be Reals. Example: ab is accepted along 2 paths: weight(ab) = (p ⊗ q ⊗ r) ⊕ (w ⊗ x ⊗ y ⊗ z). [Diagram: arcs ε/p, a/q, b/r on one path; ε/w, a/x, b/y with final weight z on the other] If ⊕, ⊗, * satisfy the "semiring" axioms, the finite-state constructions continue to work correctly.

57 Semiring Definitions. Weight of a string is the total weight of its accepting paths. Union: p ⊕ q. Concat: p ⊗ q. Closure: p*. Intersect, Compose: p ⊗ q.

58 The Probability Semiring. Weight of a string is the total weight of its accepting paths. Union: p ⊕ q = p+q. Concat: p ⊗ q = pq. Closure: p* = 1 + p + p² + … = (1-p)⁻¹. Intersect, Compose: p ⊗ q = pq.

59 The (Probability, Gradient) Semiring. Union: (p,x) ⊕ (q,y) = (p+q, x+y). Concat: (p,x) ⊗ (q,y) = (pq, py + qx). Closure: (p,x)* = ((1-p)⁻¹, (1-p)⁻² x). Intersect, Compose: (p,x) ⊗ (q,y) = (pq, py + qx). Base case: an arc with probability p gets weight (p, ∇p), where ∇p is its gradient.
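
These operations transcribe directly into code. The sketch below (Python; gradients are one-dimensional for simplicity) implements the three operations and checks them on an arc whose probability is p = θ² at θ = 0.5, so its weight is (p, dp/dθ) = (0.25, 1.0).

# The (probability, gradient) semiring: each weight is a pair (p, dp).
def splus(a, b):                       # union: (p,x) + (q,y) = (p+q, x+y)
    (p, x), (q, y) = a, b
    return (p + q, x + y)

def stimes(a, b):                      # concat/compose: (p,x) * (q,y) = (pq, py+qx)
    (p, x), (q, y) = a, b
    return (p * q, p * y + q * x)

def sclosure(a):                       # closure: (p,x)* = ((1-p)^-1, (1-p)^-2 x)
    p, x = a
    return (1.0 / (1.0 - p), x / (1.0 - p) ** 2)

arc = (0.25, 1.0)                      # p = theta^2 at theta = 0.5
path = stimes(arc, arc)                # prob theta^4: (0.0625, 0.5), since 4*theta^3 = 0.5 here
print(path)
print(splus(path, arc))                # two alternative paths: (0.3125, 1.5)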

60 We Did It! We now have a clean algorithm for computing the gradient. Let t_i = total annotated probability of all paths in the trellis x_i o T o y_i = (p(x_i, y_i), ∇p(x_i, y_i)). Aggregate over i (training examples). How to compute t_i? Just like before, when t_i = p(x_i, y_i), but in the new semiring: if acyclic (exponentially many paths), dynamic programming; if cyclic (infinitely many paths), solve a sparse linear system; or can always just use minimize( epsilonify( x_i o T o y_i ) ).

61 An Alternative: EM. It would be easy to train probabilities if we'd seen the paths the machine followed. 1. E-step: Which paths probably generated the observed data? (according to current probabilities) 2. M-step: Reestimate probabilities (or θ) as if those guesses were right. 3. Repeat. Guaranteed to converge to a local optimum.

62 Baby Says mm. [Diagram: x_i = think Σ*, the transducer T, y_i = talk mm; the paths of T consistent with (x_i, y_i)] Consistent analyses and their probabilities: Mama/.005, Mama Iwant/.0005, Mama Iwant Iwant/…, XX/.032; Total = sum of these.

63 Which Arcs Did We Follow? p(mm) = p(Σ* : mm) = sum of all paths. Joint probs: p(Mama : mm) = .005; p(Mama Iwant : mm) = .0005; p(Mama Iwant Iwant : mm) = .00005; etc.; p(XX : mm) = .032. Relative probs: p(Mama | mm) = .005/p(mm); p(Mama Iwant | mm) = .0005/p(mm); p(Mama Iwant Iwant | mm) = .00005/p(mm); p(XX | mm) = .032/p(mm). [Diagram: paths consistent with (Σ*, mm)]

64 Count Uses of Original Arcs. Relative probs: p(Mama | mm) = .005/p(mm); p(Mama Iwant | mm) = .0005/p(mm); p(Mama Iwant Iwant | mm) = .00005/p(mm); p(XX | mm) = .032/p(mm). [Diagram: each arc of the trellis (paths consistent with (Σ*, mm)) is mapped back to the original arc of T it came from]

65 Count Uses of Original Arcs. Relative probs: p(Mama | mm) = .005/p(mm); p(Mama Iwant | mm) = .0005/p(mm); p(Mama Iwant Iwant | mm) = .00005/p(mm); p(XX | mm) = .032/p(mm). [Diagram: trellis arcs mapped back to the original arcs of T] Expect ≈ 2 traversals of a given original arc (on the example (Σ*, mm)), i.e., the relative-prob-weighted count of its uses.

66 Expected-Value Formulation. Associate a value with each arc we wish to track: b_1 = (1,0,0,0,0,0,0), b_2 = (0,1,0,0,0,0,0), b_3 = (0,0,1,0,0,0,0), b_4 = (0,0,0,1,0,0,0), b_5 = (0,0,0,0,1,0,0), b_6 = (0,0,0,0,0,1,0), b_7 = (0,0,0,0,0,0,1). [Diagram: T with arcs weighted .3/b_1, .4/b_2, .1/b_3, .05/b_4, .8/b_5, .1/b_6, .1/b_7, and 1/0]

67 Expected-Value Formulation. Associate a value with each arc we wish to track (b_1, …, b_7 as above). x_i o T o y_i = [Diagram: a single path using the .4/b_2 arc twice and the .1/b_3 arc once] has total value b_2 + b_2 + b_3 = (0,2,1,0,0,0,0). Tells us the observed counts of arcs in T.

68 Expected-Value Formulation. Associate a value with each arc we wish to track. A single path in x_i o T o y_i with total value b_2 + b_2 + b_3 = (0,2,1,0,0,0,0) tells us the observed counts of arcs in T. But what if x_i o T o y_i had multiple paths? We want the expected path value for the E-step of EM, since some paths are more likely than others: expected value = Σ_path value(path) p(path | x_i, y_i) = Σ_path value(path) p(path) / Σ_path p(path). We'll arrange for t_i = (Σ_path p(path), Σ_path value(path) p(path)).

69 The Expectation Semiring. Union: (p,x) ⊕ (q,y) = (p+q, x+y). Concat: (p,x) ⊗ (q,y) = (pq, py + qx). Closure: (p,x)* = ((1-p)⁻¹, (1-p)⁻² x). Intersect, Compose: (p,x) ⊗ (q,y) = (pq, py + qx). Same as before! Base case: an arc with probability p and value v gets weight (p, pv). Then t_i = (Σ_path p(path), Σ_path value(path) p(path)).
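
The same operations work when the second component is a vector of p-weighted arc values; dividing the accumulated value by the accumulated probability then gives the expected arc counts needed for the E-step. A small Python sketch with invented numbers:

# Expectation semiring: weights are (p, v) with v a vector of p-weighted arc counts.
def eplus(a, b):
    (p, x), (q, y) = a, b
    return (p + q, [xi + yi for xi, yi in zip(x, y)])

def etimes(a, b):
    (p, x), (q, y) = a, b
    return (p * q, [p * yi + q * xi for xi, yi in zip(x, y)])

def arc(p, index, n_arcs):
    """Base case: an arc with prob p and one-hot value b_index gets weight (p, p*b_index)."""
    v = [0.0] * n_arcs
    v[index] = p
    return (p, v)

# Two alternative paths through a 3-arc machine: arc0 then arc1, or arc2 alone.
path1 = etimes(arc(0.6, 0, 3), arc(0.5, 1, 3))
path2 = arc(0.3, 2, 3)
p_total, counts = eplus(path1, path2)
print(p_total)                          # 0.6*0.5 + 0.3 = 0.6
print([c / p_total for c in counts])    # expected arc counts given the observation: [0.5, 0.5, 0.5]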

70 That's the algorithm! Existing mechanisms do all the work: keeps count of original arcs despite composition, loop unrolling, etc.; cyclic sums handled internally by the minimization step, which heavily uses the semiring closure operation. Flexible: can define arc values as we like. Example: log-linear (maxent) parameterization. M-step: must reestimate θ from feature counts (e.g., Iterative Scaling). If an arc's weight is exp(θ_2 + θ_5), let its value be (0,1,0,0,1,...); then the total value of the correct path for (x_i, y_i) counts the observed features. E-step: needs to find the expected value of the path for (x_i, y_i).

71 Some Optimizations. Let t_i = total annotated probability of all paths in the trellis x_i o T o y_i = (p(x_i, y_i), bookkeeping information). Exploit (partial) acyclicity. Avoid expensive vector operations. Exploit sparsity. Rebuild quickly after parameter update.

72 Fast Updating. 1. Pick an initial value of θ. 2. Build the FST – implements fast prob. model. ... 6. Unless we converged, return to step 2. But step 2 might be slow! It recompiles the FST from its parameterized regexp, using the new parameters θ. This involves a lot of structure-building, not just arithmetic: matching arc labels in intersection and composition; memory allocation/deallocation; heuristic decisions about time-space tradeoffs.

73 Fast Updating. Solution: weights remember underlying formulas. A weight is a pointer into a formula DAG. [Diagram: a formula DAG with nodes such as exp, +, *, and constants] A weight may or may not be used in the objective function, so update on demand: each node caches its current value; when (some) parameters are updated, invalidate (some) caches. Similar to a heap. Allows approximate updates.
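
A minimal sketch of this idea (Python; the Node class and its methods are invented for illustration): each weight is a node in a formula DAG that caches its value, marks itself and its ancestors dirty when a parameter leaf changes, and recomputes lazily on demand.

# Formula DAG with cached, lazily recomputed values.
import math

class Node:
    def __init__(self, op, children=(), value=None):
        self.op, self.children, self._value = op, list(children), value
        self.parents = []
        for c in self.children:
            c.parents.append(self)
        self.dirty = value is None

    def set(self, value):                 # update a parameter leaf
        self._value = value
        self._invalidate()

    def _invalidate(self):                # propagate dirtiness upward only
        for p in self.parents:
            if not p.dirty:
                p.dirty = True
                p._invalidate()

    def value(self):                      # recompute lazily, on demand
        if self.dirty:
            kids = [c.value() for c in self.children]
            ops = {"+": sum, "*": math.prod, "exp": lambda k: math.exp(k[0])}
            self._value = ops[self.op](kids)
            self.dirty = False
        return self._value

theta = Node("param", value=0.7)
arc_weight = Node("exp", [Node("*", [Node("param", value=2.0), theta])])
print(arc_weight.value())     # exp(2 * 0.7)
theta.set(0.9)                # invalidates only the caches that depend on theta
print(arc_weight.value())     # exp(2 * 0.9), recomputed on demand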

74 The Sunny Future. Easy to experiment with interesting models: change a model = edit declarative specification; combine models = give a simple regexp; train the model = push a button; share your model/data = upload to archive; speed up training = download latest version (conj. gradient, pruning …); avoid local maxima = download latest version (deterministic annealing …).

75 Marrying Two Finite-State Traditions. Classic stat models & variants → simple FSMs (HMMs, edit distance, sequence alignment, n-grams, segmentation): trainable from data. Expert knowledge → hand-crafted FSMs (extended regexps, phonology/morphology, info extraction, syntax …): tailored to task. The marriage: tailor the model, then train it end-to-end. Design a complex finite-state model for the task: any extended regexp; any machine topology; epsilon-cycles OK; parameterize as desired to make it probabilistic; combine models freely, tying parameters at will; then find the best param values from data (by EM or CG).

76 How about non-finite-state models? Can approximate by finite-state... but a chart parser is more efficient than its finite-state approximation. Can we generalize the approach? Our semirings do generalize to inside-outside; regexps don't generalize as nicely. We still want a declarative specification language, with generic algorithms behind the scenes; classical algorithms should be special cases.

77 Recurring Problem, Recurring Methodology. Problem: lots of possibilities for hidden structure – taggings, parses, translations, alignments, segmentations... (exponential blowup in each case). Formal problem: combinatorial optimization. Solution: dynamic programming + parameter training + EM.

78 Devil in the Details. Problem: combinatorial optimization + training. Solution: dynamic programming + EM. Major work to try a new idea / modify a system: chart management / indexing; prioritization of partial solutions (best-first, A*); parameter management; inside-outside formulas; different systems for training and decoding; careful code engineering (memory, speed); conjugate gradient, annealing, ...; parallelization. But we'd like to play with many new models!

79 Happens Over & Over In Our Work: synchronous grammars; alignment templates; two-level coercive transduction; generation-heavy MT; syntax projection and induction; morphology projection and induction; parsing; stack decoding; long-range language models; all finite-state models... Each of these allows many variants in the details; most papers only have time to try one.

80 Secret Weapon. A new programming language, Dyna: a declarative language (like Prolog); just specify the structure of the optimization problem; most of our algorithms are only a few lines; an optimizing compiler produces careful, fast code. CKY lattice parsing in 3 lines: constit(X,I,J) += word(W,I,J) * rewrite(X,W). constit(X,I,J) += constit(Y,I,K) * constit(Z,K,J) * rewrite(X,Y,Z). goal += constit(s,0,J) * end(J).

81 Secret Weapon. Dyna looks similar to Prolog but: terms have values (e.g., trainable probabilities); no backtracking! (tabling instead); execution strategy is bottom-up & priority-driven; high-level algorithm transformations → more speed; doesn't take over your project – compiles to a C++ class within your main program. Already started to use it for parsing, MT, grammar induction... and for the finite-state work described in this talk.

FIN. That's all folks (for now). Wish lists to …

83 Log-Linear Parameterization

84 Need Faster Minimization. The hard step is the minimization: we want the total semiring weight of all paths, and weighted ε-closure must invert a semiring matrix. Want to beat this! (takes O(n³) time). Optimizations exploit features of the problem. [Diagram: ε:ε machine with weights .1, .05, .4, .2]

85 All-Pairs vs. Single-Source. For each q, r, ε-closure finds the total weight of all q → r paths. But we only need the total weight of init → final paths. Solve a linear system instead of inverting the matrix: let α(r) = total weight of init → r paths; α(r) = Σ_q α(q) * weight(q → r); α(init) = 1 + Σ_q α(q) * weight(q → init). But still O(n³) in the worst case.
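
When every cycle has total weight below 1, this single-source system can be solved by plain relaxation rather than matrix inversion. The Python sketch below iterates α to a fixed point on a tiny invented cyclic graph.

# Fixed-point iteration for alpha(r) = [r == init] + sum_q alpha(q) * weight(q -> r).
arcs = [("init", "q", 0.5), ("q", "q", 0.2), ("q", "final", 0.8)]
states = ["init", "q", "final"]

alpha = {s: 0.0 for s in states}
for _ in range(200):                              # relaxation sweeps
    new = {s: (1.0 if s == "init" else 0.0) for s in states}
    for src, dst, w in arcs:
        new[dst] += alpha[src] * w
    alpha = new

print(alpha["final"])   # total weight of init -> final paths: 0.5 * 0.8 / (1 - 0.2) = 0.5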

86 Cycles Are Usually Local. In the HMM case, the trellis T_i built by composing x_i, T, and y_i is an acyclic lattice: acyclicity allows linear-time dynamic programming to find our sum over paths. If not acyclic, first decompose into minimal cyclic components (Tarjan 1972, 1981; Mohri 1998): now the full O(n³) algorithm must be run for several small n instead of one big n, and the results reassembled. More powerful decompositions available (Tarjan 1981); block-structured matrices.

87 Avoid Semiring Operations. Our semiring operations aren't O(1): they manipulate vector values. To see how this slows us down, consider HMMs. Our algorithm computes the sum over paths in the lattice; if acyclic, it requires a forward pass only. Where's the backward pass? What we're pushing forward is (p,v): arc values v go forward to be downweighted by later probs, instead of probs going backward to downweight arcs. The vector v rapidly loses sparsity, so this is slow!

88 Avoid Semiring Operations. We're already computing forward probabilities α(q); also compute backward probabilities β(r). Total probability of paths through an arc q → r with prob p = α(q) * p * β(r). E[path value] = Σ_{q,r} (α(q) * p(q→r) * β(r)) * value(q→r). Exploits the structure of the semiring. Now α, β are probabilities, not vector values.
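
A sketch of this optimization on a small invented acyclic trellis: compute real-valued forward probabilities α and backward probabilities β, then read off each arc's expected count as α(q) · p · β(r) / Z, with no vector values pushed through the semiring.

# Forward-backward expected arc counts with plain real-valued alpha and beta.
arcs = [("s", "a", 0.6), ("s", "b", 0.4), ("a", "f", 1.0), ("b", "f", 1.0)]
topo = ["s", "a", "b", "f"]

alpha = {q: 0.0 for q in topo}; alpha["s"] = 1.0
for q in topo:                                    # forward pass
    for src, dst, p in arcs:
        if src == q:
            alpha[dst] += alpha[q] * p

beta = {q: 0.0 for q in topo}; beta["f"] = 1.0
for q in reversed(topo):                          # backward pass
    for src, dst, p in arcs:
        if dst == q:
            beta[src] += p * beta[q]

Z = alpha["f"]                                    # total path probability
for src, dst, p in arcs:
    print(src, "->", dst, "expected count:", alpha[src] * p * beta[dst] / Z)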

89 Avoid Semiring Operations. Now our linear systems are over the reals: let α(r) = total weight of init → r paths; α(r) = Σ_q α(q) * weight(q → r); α(init) = 1 + Σ_q α(q) * weight(q → init). Well studied! Still O(n³) in the worst case, but: proportionately faster for a sparser graph – O(|states| |arcs|) by iterative methods like conjugate gradient, and usually |arcs| << |states|²; approximate solutions possible – relaxation (Mohri 1998) and back-relaxation (Eisner 2001), or stop the iterative method earlier; lower space requirement: O(|states|) vs. O(|states|²).

90 Ways to Improve Toolkit. Experiment with other learning algs …: conjugate gradient is a trivial variation and should be faster; annealing etc. to avoid local optima. Experiment with other objective functions …: trivial to incorporate a Bayesian prior; discriminative training: maximize p(y|x), not p(x,y). Experiment with other parameterizations …: mixture models; maximum entropy (log-linear): track expected feature counts, not arc counts. Generalize more: incorporate graphical modelling.

91 Some Applications: Prediction, classification, generation; more generally, "filling in of blanks"; Speech recognition; Machine translation, OCR, other noisy-channel models; Sequence alignment / Edit distance / Computational biology; Text normalization, segmentation, categorization; Information extraction; Stochastic phonology/morphology, including lexicon; Tagging, chunking, finite-state parsing; Syntactic transformations (smoothing PCFG rulesets). Quickly specify & combine models; tie parameters & train end-to-end; unsupervised, partly supervised, erroneously supervised.