Finite-State and the Noisy Channel


1 Finite-State and the Noisy Channel
Intro to NLP - J. Eisner

2 theprophetsaidtothecity
Word Segmentation
x = theprophetsaidtothecity
What does this say? And what other words are substrings?
Could segment with parsing (how?), but slow.
Given L = a “lexicon” FSA that matches all English words. What do these regexps do?
L*
L* & x (what does this FSA look like? how many paths?)
[ L “ ” ]*
[ L “ ”:0 ]* .o. x
[ L “ ” ]* .o. [ [“ ”:0] | \“ ” ]* .o. x (equivalent)
Intro to NLP - J. Eisner

3 Word Segmentation theprophetsaidtothecity x = [ [ L “ ”:0 ]* .o. x ].u
What does this say? And what other words are substrings?
Could segment with parsing (how?), but slow.
Given L = a “lexicon” FSA that matches all English words.
Let’s try to recover the spaces with a single regexp:
[ [ L “ ”:0 ]* .o. x ].u
[ L “ ”:0 ]*: word sequence transduced to spaceless version
.o. x: restrict to paths that produce observed string x
.u: language of upper strings of those paths
Intro to NLP - J. Eisner

4 Word Segmentation theprophetsaidtothecity x = [ [ L “ ”:0 ]* .o. x ].u
Given L = a “lexicon” FSA that matches all English words.
[ [ L “ ”:0 ]* .o. x ].u
[ L “ ”:0 ]*: word sequence transduced to spaceless version
.o. x: restrict to paths that produce observed string x
.u: language of upper strings of those paths
Point of the lecture: This solution generalizes!
What if Lexicon is weighted? From unigrams to bigrams?
Smooth L to include unseen words?
Make upper strings be over “alphabet” of words, not letters?
Intro to NLP - J. Eisner
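
To make the weighted-lexicon idea concrete, here is a minimal sketch in Python that plays the same game without an FST toolkit: the lexicon is just a dictionary of words with unigram probabilities (invented here for illustration), and dynamic programming stands in for composing with x and taking the best path.

    # Toy weighted lexicon; the words and probabilities are made up.
    import math

    LEXICON = {"the": 0.07, "prophet": 0.0001, "said": 0.003, "to": 0.03,
               "city": 0.0005, "prop": 0.00001, "he": 0.01}

    def segment(x):
        """Highest-probability split of x into lexicon words (unigram model)."""
        # best[i] = (log-prob, segmentation) of the best way to cover x[:i]
        best = [(-math.inf, [])] * (len(x) + 1)
        best[0] = (0.0, [])
        for i in range(1, len(x) + 1):
            for j in range(max(0, i - 20), i):      # candidate last word x[j:i]
                w = x[j:i]
                if w in LEXICON and best[j][0] > -math.inf:
                    score = best[j][0] + math.log(LEXICON[w])
                    if score > best[i][0]:
                        best[i] = (score, best[j][1] + [w])
        return best[len(x)][1]

    print(segment("theprophetsaidtothecity"))
    # expected with this toy lexicon: ['the', 'prophet', 'said', 'to', 'the', 'city']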

5 Spelling correction Spelling correction also needs a lexicon L
But there is distortion beyond deleting spaces …
Let T be a transducer that models common typos and other spelling errors:
ance → ence (deliverance, ...)
e ↔ ε (deliverance, ...)
ε → e // Cons _ Cons (athlete, ...)
rr → r (embarrass, occurrence, …)
ge → dge (privilege, …)
etc.
What does L .o. T represent? How’s that useful?
Should L and T have probabilities?
We’ll want T to include “all possible” errors …
Intro to NLP - J. Eisner
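
Here is a hedged sketch of what L .o. T buys you, with T simplified to a uniform single-edit channel instead of the rule-based typo model above; the lexicon, its unigram probabilities, and the channel probability are all made up for illustration.

    import string

    LEXICON = {"privilege": 1e-5, "deliverance": 1e-6, "occurrence": 1e-5,
               "athlete": 2e-5, "embarrass": 1e-5}

    def edits1(y):
        """All strings reachable from y by one deletion, insertion, or substitution."""
        letters = string.ascii_lowercase
        splits = [(y[:i], y[i:]) for i in range(len(y) + 1)]
        dels = [a + b[1:] for a, b in splits if b]
        subs = [a + c + b[1:] for a, b in splits if b for c in letters]
        ins  = [a + c + b for a, b in splits for c in letters]
        return set(dels + subs + ins)

    def correct(x, p_typo=0.001):
        """Pick the lexicon word y maximizing p(y) * p(x | y) under a crude channel."""
        best, best_score = None, 0.0
        for y in LEXICON:
            if x == y:
                channel = 1.0 - p_typo       # no error
            elif x in edits1(y):
                channel = p_typo              # one simple edit
            else:
                continue                      # this crude T gives (near-)zero probability
            score = LEXICON[y] * channel
            if score > best_score:
                best, best_score = y, score
        return best

    print(correct("privelege"))   # -> 'privilege' (one substitution away)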

6 Noisy Channel Model
real language Y
noisy channel Y → X
observed string X
want to recover Y from X Intro to NLP - J. Eisner

7 Noisy Channel Model
real language Y: correct spelling
noisy channel Y → X: typos
observed string X: misspelling
want to recover Y from X Intro to NLP - J. Eisner

8 Noisy Channel Model
real language Y: (lexicon space)*
noisy channel Y → X: delete spaces
observed string X: text w/o spaces
want to recover Y from X Intro to NLP - J. Eisner

9 Noisy Channel Model
real language Y: (lexicon space)* (the language model)
noisy channel Y → X: pronunciation (the acoustic model)
observed string X: speech
want to recover Y from X Intro to NLP - J. Eisner

10 Noisy Channel Model
real language Y: tree (probabilistic CFG)
noisy channel Y → X: delete everything but terminals
observed string X: text
want to recover Y from X Intro to NLP - J. Eisner

11 Noisy Channel Model
real language Y: p(Y)
noisy channel Y → X: p(X | Y)
observed string X; p(Y) * p(X | Y) = p(X,Y)
want to recover y ∈ Y from x ∈ X: choose y that maximizes p(y | x), or equivalently p(x,y) Intro to NLP - J. Eisner

12 Remember: Noisy Channel and Bayes’ Theorem
The channel messes up y into x; the “decoder” finds the most likely reconstruction of y given x:
maximize p(Y=y | X=x)
= p(Y=y) p(X=x | Y=y) / p(X=x)
= p(Y=y) p(X=x | Y=y) / Σy’ p(Y=y’) p(X=x | Y=y’)
Intro to NLP - J. Eisner

13 Noisy Channel Model: p(Y) .o. p(X | Y) = p(X,Y) (composing the machines multiplies the probabilities)
p(Y): a:a/0.7, b:b/0.3
p(X | Y): a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
p(X,Y): a:C/0.07, a:D/0.63, b:C/0.24, b:D/0.06
Note p(x,y) sums to 1. Suppose x=“C”; what is best “y”? Intro to NLP - J. Eisner

14 Noisy Channel Model: p(Y) .o. p(X | Y) = p(X,Y)
p(Y): a:a/0.7, b:b/0.3
p(X | Y): a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
p(X,Y): a:C/0.07, a:D/0.63, b:C/0.24, b:D/0.06
Suppose x=“C”; what is best “y”? Intro to NLP - J. Eisner

15 Noisy Channel Model: p(Y) .o. p(X | Y) .o. (X==x)? = p(x,Y)
p(Y): a:a/0.7, b:b/0.3
p(X | Y): a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
(X==x)? 0 or 1: C:C/1 (restrict just to paths compatible with output “C”)
= p(x,Y): a:C/0.07, b:C/0.24; take the best path
Intro to NLP - J. Eisner

16 Noisy Channel Model: [ Y .o. T .o. x ].u
Y: source model FSA, accepts string y with weight p(y) (i.e., paths accepting y have total prob p(y))
T: noisy channel FST, accepts pair (x,y) with weight p(x | y) (i.e., paths transducing y → x have total prob p(x | y))
.o. x: restrict to paths that produce observed string x
.u: language of upper strings of those paths
Intro to NLP - J. Eisner

17 Just Bayes’ Theorem Again
p(Y) = prior on Y: a:a/0.7, b:b/0.3
.o. p(X | Y) = likelihood of Y, for each X: a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
.o. evidence X=x: C:C/1 (restrict just to paths compatible with output “C”)
= p(x,Y): a:C/0.07, b:C/0.24; divide by constant p(x) to get posterior p(Y | x), then take the best path
Intro to NLP - J. Eisner
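
The toy numbers on these slides can be checked by brute force; this sketch just multiplies the prior by the channel, verifies that the joint sums to 1, and conditions on x = “C” to find the best y (which is b, since 0.24 beats 0.07).

    # Brute-force check of the toy example; it mirrors, but does not use, the FST composition.
    prior   = {"a": 0.7, "b": 0.3}                        # p(Y)
    channel = {("a", "C"): 0.1, ("a", "D"): 0.9,          # p(X | Y)
               ("b", "C"): 0.8, ("b", "D"): 0.2}

    joint = {(y, x): prior[y] * channel[(y, x)] for (y, x) in channel}
    print(joint)                 # {('a','C'): 0.07, ('a','D'): 0.63, ('b','C'): 0.24, ('b','D'): 0.06}
    print(sum(joint.values()))   # sums to 1 (up to floating-point rounding)

    x = "C"
    posterior_scores = {y: joint[(y, x)] for y in prior}      # proportional to p(y | x)
    print(max(posterior_scores, key=posterior_scores.get))    # 'b'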

18 Morpheme Segmentation
Let Lexicon be a machine that matches all Turkish words.
Same problem as word segmentation, just at a lower level: morpheme segmentation.
Turkish word: uygarlaştıramadıklarımızdanmışsınızcasına = uygar+laş+tır+ma+dık+ları+mız+dan+mış+sınız+ca+sı+na
“(behaving) as if you are among those whom we could not cause to become civilized”
Some constraints on morpheme sequence: bigram probs.
Generative model: concatenate, then fix up the joints (stop + -ing = stopping, fly + -s = flies, vowel harmony).
Use a cascade of transducers to handle all the fixups.
But this is just morphology! Can use probabilities here too (but people often don’t).
Intro to NLP - J. Eisner

19 An FST: Baby Think & Baby Talk
[FST diagram mapping baby think to baby talk, with arcs such as Mama:m, IWant:ε/.1, IWant:u/.8, 👶:m/.4, ε:m/.05 and stopping weights .1, .2 (👶 = just babbling, no thought).]
Observe talk “mm”; recover think, by composition. Candidate analyses with weights as on the slide: Mama/.005, Mama IWant/.0005, Mama IWant IWant/.00005, Mama/.05, 👶👶/.032.
Intro to NLP - J. Eisner

20 Joint Prob. by Double Composition
think :m/.05 Mama:m 👶:b/.2 👶:m/.4 .2 IWant: /.1 IWant:u/.8 .1 m talk To get really into the spirit, it’s better to say we’re composing not just these 2 machines but actually 3 machines, one that says the input must match sigma* (I.e, anything goes) and the other that says the output must match mm. So we’re looking at all paths that match those constraints. And their total probability is … Last question: How would we manipulate the machine to compute that probability? compose .1 IWant: /.1 :m/.05 Mama:m 👶:m/.4 .2 p(* : mm) = = sum of all paths Intro to NLP - J. Eisner

21 Cute method for summing all paths
think :m/.05 Mama:m 👶:b/.2 👶:m/.4 .2 IWant: /.1 IWant:u/.8 .1 m talk To get really into the spirit, it’s better to say we’re composing not just these 2 machines but actually 3 machines, one that says the input must match sigma* (I.e, anything goes) and the other that says the output must match mm. So we’re looking at all paths that match those constraints. And their total probability is … Last question: How would we manipulate the machine to compute that probability? compose .1 IWant: /.1 :m/.05 Mama:m 👶:m/.4 .2 p(* : mm) = = sum of all paths Intro to NLP - J. Eisner

22 Cute method for summing all paths
think : :m/.05 Mama:m 👶:b/.2 👶:m/.4 .2 IWant: /.1 IWant:u/.8 .1 m: talk Let’s change the two constraint automata into transducers to and from epsilon. They’re essentially going to delete the input and output of the paths they keep. … Now this is a nondeterministic machine – it describes a bunch of ways to transduce epsilon to epsilon. In fact infinitely many because of the loop. How do we add up all those ways and get the total probability of transducing epsilon to epsilon? Well, we can certainly describe that transduction in a single state – so minimize! .1 :/.1 :/.05 : :/.4 .2 compose p(* : mm) = = sum of all paths Intro to NLP - J. Eisner

23 Cute method for summing all paths
think : :m/.05 Mama:m 👶:b/.2 👶:m/.4 .2 IWant: /.1 IWant:u/.8 .1 m: talk compose + minimize p(* : mm) = = sum of all paths Intro to NLP - J. Eisner

24 p(Mama … | mm)
[Diagram: compose + minimize twice, once with the input restricted to Mama Σ* and once with input Σ*.]
p(Mama Σ* : mm) = .005555 = sum of Mama … paths
p(Σ* : mm) = .0375555 = sum of all paths
p(Mama Σ* | mm) = .005555 / .0375555 = .148
Intro to NLP - J. Eisner

25 minimum edit distance (best alignment)
Older baby said caca. Was probably thinking clara (?)
Do these match up well? How well? Minimum edit distance = cost of the best alignment.
Candidate alignments of clara with caca:
3 substitutions + 1 deletion = total cost 4
2 deletions + 1 insertion = total cost 3
1 deletion + 1 substitution = total cost 2
5 deletions + 4 insertions = total cost 9
Intro to NLP - J. Eisner 25
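
The minimum over all alignments can be computed without enumerating them, by the standard dynamic program sketched below (deletions, insertions, and substitutions cost 1, matches cost 0); it reproduces the total cost of 2 for clara vs. caca and is the tabular counterpart of the min-cost-path picture on the next few slides.

    def edit_distance(upper, lower):
        """Levenshtein distance between two strings."""
        m, n = len(upper), len(lower)
        # dist[i][j] = cost of the best alignment of upper[:i] with lower[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i                      # i deletions
        for j in range(n + 1):
            dist[0][j] = j                      # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if upper[i - 1] == lower[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # delete upper[i-1]
                                 dist[i][j - 1] + 1,        # insert lower[j-1]
                                 dist[i - 1][j - 1] + sub)  # substitute or match
        return dist[m][n]

    print(edit_distance("clara", "caca"))   # -> 2, matching the best alignment above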

26 Edit distance as min-cost path
[Grid diagram: columns = position in upper string (c l a r a), rows = position in lower string (c a c a); edges are labeled with letter pairs such as c:c, l:ε, ε:a.]
Minimum-cost path shows the best alignment, and its cost is the edit distance.
Intro to NLP - J. Eisner 26

27 Edit distance as min-cost path
[Same grid, now with the partially matched prefixes and running costs (0, 1, 2, …) written at the nodes.]
Minimum-cost path shows the best alignment, and its cost is the edit distance.
Intro to NLP - J. Eisner 27

28 Edit distance as min-cost path
A deletion edge has cost 1. It advances in the upper string only, so it’s horizontal. It pairs the next letter of the upper string with ε (empty) in the lower string.
[Grid diagram highlighting the horizontal deletion edges c:ε, l:ε, a:ε, r:ε, a:ε.]
Intro to NLP - J. Eisner 28

29 Edit distance as min-cost path
An insertion edge has cost 1. It advances in the lower string only, so it’s vertical. It pairs ε (empty) in the upper string with the next letter of the lower string.
[Grid diagram highlighting the vertical insertion edges ε:c and ε:a.]
Intro to NLP - J. Eisner 29

30 Edit distance as min-cost path
A substitution edge has cost 0 or 1. It advances in the upper and lower strings simultaneously, so it’s diagonal. It pairs the next letter of the upper string with the next letter of the lower string. Cost is 0 or 1 depending on whether those letters are identical!
[Grid diagram highlighting the diagonal substitution edges such as c:c, l:c, a:c, r:a, a:a.]
Intro to NLP - J. Eisner 30

31 Edit distance as min-cost path
[Grid diagram as before.]
We’re looking for a path from upper left to lower right (so as to get through both strings). Solid edges have cost 0, dashed edges have cost 1. So we want the path with the fewest dashed edges.
Intro to NLP - J. Eisner 31

32 Edit distance as min-cost path
[Grid diagram highlighting the alignment of clara with caca using 3 substitutions + 1 deletion = total cost 4.]
Intro to NLP - J. Eisner 32

33 Edit distance as min-cost path
[Grid diagram highlighting the alignment using 2 deletions + 1 insertion = total cost 3.]
Intro to NLP - J. Eisner 33

34 Edit distance as min-cost path
[Grid diagram highlighting the best alignment: 1 deletion + 1 substitution = total cost 2.]
Intro to NLP - J. Eisner 34

35 Edit distance as min-cost path
[Grid diagram highlighting the alignment using 5 deletions + 4 insertions = total cost 9.]
Intro to NLP - J. Eisner 35

36 Edit Distance Transducer
[Edit distance transducer over an alphabet of k letters:]
O(k) deletion arcs (a:ε, b:ε, …)
O(k²) substitution arcs (a:b, b:a, …)
O(k) insertion arcs (ε:a, ε:b, …)
O(k) no-change arcs (a:a, b:b, …)
Intro to NLP - J. Eisner 36

37 Edit Distance Transducer
Stochastic Edit Distance Transducer: the same arcs as before (O(k) deletion, O(k²) substitution, O(k) insertion, O(k) identity arcs), but now likely edits = high-probability arcs. Defines p(x | y), e.g., p(caca | clara).
Intro to NLP - J. Eisner 37
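
A hedged sketch of that stochastic view: instead of minimizing cost, sum the probabilities of all edit alignments of y with x to get p(x | y). The per-edit probabilities below are invented and only roughly normalized; a real channel model would be trained.

    # Toy edit probabilities (assumed for illustration): match, substitute, insert, delete.
    P_MATCH, P_SUB, P_INS, P_DEL = 0.90, 0.02, 0.02, 0.02

    def channel_prob(y, x):
        """p(x | y): total probability over all edit alignments of y to x."""
        m, n = len(y), len(x)
        # f[i][j] = probability of producing x[:j] from y[:i]
        f = [[0.0] * (n + 1) for _ in range(m + 1)]
        f[0][0] = 1.0
        for i in range(m + 1):
            for j in range(n + 1):
                if f[i][j] == 0.0:
                    continue
                if i < m:                                    # delete y[i]
                    f[i + 1][j] += f[i][j] * P_DEL
                if j < n:                                    # insert x[j]
                    f[i][j + 1] += f[i][j] * P_INS
                if i < m and j < n:                          # match or substitute
                    p = P_MATCH if y[i] == x[j] else P_SUB
                    f[i + 1][j + 1] += f[i][j] * p
        return f[m][n]

    print(channel_prob("clara", "caca"))    # small but nonzero
    print(channel_prob("clara", "clara"))   # much larger: mostly matches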

38 Most likely path that maps clara to caca?
Best path (by Dijkstra’s algorithm):
clara .o. (stochastic edit distance transducer) .o. caca = the grid machine of all alignments of clara with caca; its best path is the most likely way the channel maps clara to caca.
[Diagram: the straight-line acceptor for clara, the edit distance transducer, the straight-line acceptor for caca, and the grid-shaped result of composing them.]
Intro to NLP - J. Eisner 38

39 What string produced caca? (posterior distribution)
Compose Lexicon* .o. (edit distance transducer) .o. caca. The result is a more complicated FST, with cycles. Each path is a possible input string aligned to caca, and a path’s weight is proportional to the posterior probability of that path as an explanation for caca. Could find the best path.
Intro to NLP - J. Eisner 39

40 Speech Recognition by FST Composition (Pereira & Riley 1996)
trigram language model: p(word seq)
.o. pronunciation model: p(phone seq | word seq)
.o. acoustic model: p(acoustics | phone seq)
.o. observed acoustics
Intro to NLP - J. Eisner 40
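
As a stand-in for what this composition computes, here is a hedged brute-force sketch: score each candidate word sequence by p(words) · p(phones | words) · p(acoustics | phones) and keep the argmax. Every candidate and probability below is invented, the phone sequences are collapsed into the word strings for brevity, and a real recognizer would search the composed FSTs rather than enumerate candidates.

    # Hypothetical candidates and model scores, for illustration only.
    candidates = ["the cat", "the cad", "thick at"]

    def p_words(w):                    # stand-in for the trigram language model
        return {"the cat": 0.6, "the cad": 0.1, "thick at": 0.3}[w]

    def p_phones_given_words(w):       # stand-in for the pronunciation model
        return {"the cat": 0.9, "the cad": 0.9, "thick at": 0.8}[w]

    def p_acoustics_given_phones(w):   # stand-in for the acoustic model on the observed audio
        return {"the cat": 0.05, "the cad": 0.04, "thick at": 0.01}[w]

    best = max(candidates,
               key=lambda w: p_words(w) * p_phones_given_words(w) * p_acoustics_given_phones(w))
    print(best)   # -> 'the cat'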

41 Speech Recognition by FST Composition (Pereira & Riley 1996)
trigram language model: p(word seq)
.o. p(phone seq | word seq): arcs such as CAT : k æ t
.o. p(acoustics | phone seq): models of each phone in context (phone context … phone context)
Intro to NLP - J. Eisner 41

42 HMM Tagging Can think of this as a finite-state noisy channel, too!
See slides in the “tagging” lecture. Intro to NLP - J. Eisner 42

