Finite-State and the Noisy Channel


1 Finite-State and the Noisy Channel
Intro to NLP - J. Eisner

2 theprophetsaidtothecity
Word Segmentation
x = theprophetsaidtothecity
What does this say? And what other words are substrings?
Could segment with parsing (how?), but slow.
Given L = a “lexicon” FSA that matches all English words. What do these regexps do?
L*
L* & x (what does this FSA look like? how many paths?)
[ L “ ” ]*
[ L “ ”:0 ]* .o. x
[ L “ ” ]* .o. [ [“ ”:0] | \“ ” ]* .o. x (equivalent)
Intro to NLP - J. Eisner

3 Word Segmentation theprophetsaidtothecity x = [ [ L “ ”:0 ]* .o. x ].u
What does this say? And what other words are substrings?
Could segment with parsing (how?), but slow.
Given L = a “lexicon” FSA that matches all English words.
Let’s try to recover the spaces with a single regexp:
[ [ L “ ”:0 ]* .o. x ].u
[ L “ ”:0 ]*: word sequence transduced to spaceless version
.o. x: restrict to paths that produce observed string x
.u: language of upper strings of those paths
Intro to NLP - J. Eisner

4 Word Segmentation theprophetsaidtothecity x = [ [ L “ ”:0 ]* .o. x ].u
Given L = a “lexicon” FSA that matches all English words.
[ [ L “ ”:0 ]* .o. x ].u
[ L “ ”:0 ]*: word sequence transduced to spaceless version
.o. x: restrict to paths that produce observed string x
.u: language of upper strings of those paths
Point of the lecture: This solution generalizes!
What if Lexicon is weighted? From unigrams to bigrams?
Smooth L to include unseen words?
Make upper strings be over “alphabet” of words, not letters?
Intro to NLP - J. Eisner
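
To make the weighted-lexicon idea concrete, here is a minimal sketch in Python that plays the same game without an FST toolkit: the lexicon is just a dictionary of words with unigram probabilities (invented here for illustration), and dynamic programming stands in for composing with x and taking the best path.

    # Toy weighted lexicon; the words and probabilities are made up.
    import math

    LEXICON = {"the": 0.07, "prophet": 0.0001, "said": 0.003, "to": 0.03,
               "city": 0.0005, "prop": 0.00001, "he": 0.01}

    def segment(x):
        """Highest-probability split of x into lexicon words (unigram model)."""
        # best[i] = (log-prob, segmentation) of the best way to cover x[:i]
        best = [(-math.inf, [])] * (len(x) + 1)
        best[0] = (0.0, [])
        for i in range(1, len(x) + 1):
            for j in range(max(0, i - 20), i):      # candidate last word x[j:i]
                w = x[j:i]
                if w in LEXICON and best[j][0] > -math.inf:
                    score = best[j][0] + math.log(LEXICON[w])
                    if score > best[i][0]:
                        best[i] = (score, best[j][1] + [w])
        return best[len(x)][1]

    print(segment("theprophetsaidtothecity"))
    # expected with this toy lexicon: ['the', 'prophet', 'said', 'to', 'the', 'city']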

5 Spelling correction Spelling correction also needs a lexicon L
But there is distortion beyond deleting spaces …
Let T be a transducer that models common typos and other spelling errors:
ance → ence (deliverance, ...)
e ↔ ε (deliverance, ...)
ε → e // Cons _ Cons (athlete, ...)
rr → r (embarrass, occurrence, …)
ge → dge (privilege, …)
etc.
What does L .o. T represent? How’s that useful?
Should L and T have probabilities?
We’ll want T to include “all possible” errors …
Intro to NLP - J. Eisner
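
Here is a hedged sketch of what L .o. T buys you, with T simplified to a uniform single-edit channel instead of the rule-based typo model above; the lexicon, its unigram probabilities, and the channel probability are all made up for illustration.

    import string

    LEXICON = {"privilege": 1e-5, "deliverance": 1e-6, "occurrence": 1e-5,
               "athlete": 2e-5, "embarrass": 1e-5}

    def edits1(y):
        """All strings reachable from y by one deletion, insertion, or substitution."""
        letters = string.ascii_lowercase
        splits = [(y[:i], y[i:]) for i in range(len(y) + 1)]
        dels = [a + b[1:] for a, b in splits if b]
        subs = [a + c + b[1:] for a, b in splits if b for c in letters]
        ins  = [a + c + b for a, b in splits for c in letters]
        return set(dels + subs + ins)

    def correct(x, p_typo=0.001):
        """Pick the lexicon word y maximizing p(y) * p(x | y) under a crude channel."""
        best, best_score = None, 0.0
        for y in LEXICON:
            if x == y:
                channel = 1.0 - p_typo       # no error
            elif x in edits1(y):
                channel = p_typo              # one simple edit
            else:
                continue                      # this crude T gives (near-)zero probability
            score = LEXICON[y] * channel
            if score > best_score:
                best, best_score = y, score
        return best

    print(correct("privelege"))   # -> 'privilege' (one substitution away)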

6 Noisy Channel Model
real language Y
noisy channel Y → X
observed string X
want to recover Y from X Intro to NLP - J. Eisner

7 Noisy Channel Model
real language Y: correct spelling
noisy channel Y → X: typos
observed string X: misspelling
want to recover Y from X Intro to NLP - J. Eisner

8 Noisy Channel Model
real language Y: (lexicon space)*
noisy channel Y → X: delete spaces
observed string X: text w/o spaces
want to recover Y from X Intro to NLP - J. Eisner

9 Noisy Channel Model
real language Y: (lexicon space)* (the language model)
noisy channel Y → X: pronunciation (the acoustic model)
observed string X: speech
want to recover Y from X Intro to NLP - J. Eisner

10 Noisy Channel Model
real language Y: tree (probabilistic CFG)
noisy channel Y → X: delete everything but terminals
observed string X: text
want to recover Y from X Intro to NLP - J. Eisner

11 Noisy Channel Model
real language Y: p(Y)
noisy channel Y → X: p(X | Y)
observed string X; p(Y) * p(X | Y) = p(X,Y)
want to recover y ∈ Y from x ∈ X: choose y that maximizes p(y | x), or equivalently p(x,y) Intro to NLP - J. Eisner

12 Remember: Noisy Channel and Bayes’ Theorem
The channel messes up y into x; the “decoder” finds the most likely reconstruction of y given x:
maximize p(Y=y | X=x)
= p(Y=y) p(X=x | Y=y) / p(X=x)
= p(Y=y) p(X=x | Y=y) / Σy’ p(Y=y’) p(X=x | Y=y’)
Intro to NLP - J. Eisner

13 Noisy Channel Model: p(Y) .o. p(X | Y) = p(X,Y) (composing the machines multiplies the probabilities)
p(Y): a:a/0.7, b:b/0.3
p(X | Y): a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
p(X,Y): a:C/0.07, a:D/0.63, b:C/0.24, b:D/0.06
Note p(x,y) sums to 1. Suppose x=“C”; what is best “y”? Intro to NLP - J. Eisner

14 Noisy Channel Model: p(Y) .o. p(X | Y) = p(X,Y)
p(Y): a:a/0.7, b:b/0.3
p(X | Y): a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
p(X,Y): a:C/0.07, a:D/0.63, b:C/0.24, b:D/0.06
Suppose x=“C”; what is best “y”? Intro to NLP - J. Eisner

15 Noisy Channel Model: p(Y) .o. p(X | Y) .o. (X==x)? = p(x,Y)
p(Y): a:a/0.7, b:b/0.3
p(X | Y): a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
(X==x)? 0 or 1: C:C/1 (restrict just to paths compatible with output “C”)
= p(x,Y): a:C/0.07, b:C/0.24; take the best path
Intro to NLP - J. Eisner

16 Noisy Channel Model: [ Y .o. T .o. x ].u
Y: source model FSA, accepts string y with weight p(y) (i.e., paths accepting y have total prob p(y))
T: noisy channel FST, accepts pair (x,y) with weight p(x | y) (i.e., paths transducing y → x have total prob p(x | y))
.o. x: restrict to paths that produce observed string x
.u: language of upper strings of those paths
Intro to NLP - J. Eisner

17 Just Bayes’ Theorem Again
p(Y) = prior on Y: a:a/0.7, b:b/0.3
.o. p(X | Y) = likelihood of Y, for each X: a:C/0.1, a:D/0.9, b:C/0.8, b:D/0.2
.o. evidence X=x: C:C/1 (restrict just to paths compatible with output “C”)
= p(x,Y): a:C/0.07, b:C/0.24; divide by constant p(x) to get posterior p(Y | x), then take the best path
Intro to NLP - J. Eisner
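
The toy numbers on these slides can be checked by brute force; this sketch just multiplies the prior by the channel, verifies that the joint sums to 1, and conditions on x = “C” to find the best y (which is b, since 0.24 beats 0.07).

    # Brute-force check of the toy example; it mirrors, but does not use, the FST composition.
    prior   = {"a": 0.7, "b": 0.3}                        # p(Y)
    channel = {("a", "C"): 0.1, ("a", "D"): 0.9,          # p(X | Y)
               ("b", "C"): 0.8, ("b", "D"): 0.2}

    joint = {(y, x): prior[y] * channel[(y, x)] for (y, x) in channel}
    print(joint)                 # {('a','C'): 0.07, ('a','D'): 0.63, ('b','C'): 0.24, ('b','D'): 0.06}
    print(sum(joint.values()))   # sums to 1 (up to floating-point rounding)

    x = "C"
    posterior_scores = {y: joint[(y, x)] for y in prior}      # proportional to p(y | x)
    print(max(posterior_scores, key=posterior_scores.get))    # 'b'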

18 Morpheme Segmentation
Let Lexicon be a machine that matches all Turkish words.
Same problem as word segmentation, just at a lower level: morpheme segmentation.
Turkish word: uygarlaştıramadıklarımızdanmışsınızcasına = uygar+laş+tır+ma+dık+ları+mız+dan+mış+sınız+ca+sı+na
“(behaving) as if you are among those whom we could not cause to become civilized”
Some constraints on morpheme sequence: bigram probs.
Generative model: concatenate, then fix up the joints (stop + -ing = stopping, fly + -s = flies, vowel harmony).
Use a cascade of transducers to handle all the fixups.
But this is just morphology! Can use probabilities here too (but people often don’t).
Intro to NLP - J. Eisner

19 An FST: Baby Think & Baby Talk
[FST diagram mapping baby think to baby talk, with arcs such as Mama:m, IWant:ε/.1, IWant:u/.8, 👶:m/.4, ε:m/.05 and stopping weights .1, .2 (👶 = just babbling, no thought).]
Observe talk “mm”; recover think, by composition. Candidate analyses with weights as on the slide: Mama/.005, Mama IWant/.0005, Mama IWant IWant/.00005, Mama/.05, 👶👶/.032.
Intro to NLP - J. Eisner

20 Joint Prob. by Double Composition
think :m/.05 Mama:m 👶:b/.2 👶:m/.4 .2 IWant: /.1 IWant:u/.8 .1 m talk To get really into the spirit, it’s better to say we’re composing not just these 2 machines but actually 3 machines, one that says the input must match sigma* (I.e, anything goes) and the other that says the output must match mm. So we’re looking at all paths that match those constraints. And their total probability is … Last question: How would we manipulate the machine to compute that probability? compose .1 IWant: /.1 :m/.05 Mama:m 👶:m/.4 .2 p(* : mm) = = sum of all paths Intro to NLP - J. Eisner

21 Cute method for summing all paths
think :m/.05 Mama:m 👶:b/.2 👶:m/.4 .2 IWant: /.1 IWant:u/.8 .1 m talk To get really into the spirit, it’s better to say we’re composing not just these 2 machines but actually 3 machines, one that says the input must match sigma* (I.e, anything goes) and the other that says the output must match mm. So we’re looking at all paths that match those constraints. And their total probability is … Last question: How would we manipulate the machine to compute that probability? compose .1 IWant: /.1 :m/.05 Mama:m 👶:m/.4 .2 p(* : mm) = = sum of all paths Intro to NLP - J. Eisner

22 Cute method for summing all paths
think : :m/.05 Mama:m 👶:b/.2 👶:m/.4 .2 IWant: /.1 IWant:u/.8 .1 m: talk Let’s change the two constraint automata into transducers to and from epsilon. They’re essentially going to delete the input and output of the paths they keep. … Now this is a nondeterministic machine – it describes a bunch of ways to transduce epsilon to epsilon. In fact infinitely many because of the loop. How do we add up all those ways and get the total probability of transducing epsilon to epsilon? Well, we can certainly describe that transduction in a single state – so minimize! .1 :/.1 :/.05 : :/.4 .2 compose p(* : mm) = = sum of all paths Intro to NLP - J. Eisner

23 Cute method for summing all paths
think : :m/.05 Mama:m 👶:b/.2 👶:m/.4 .2 IWant: /.1 IWant:u/.8 .1 m: talk compose + minimize p(* : mm) = = sum of all paths Intro to NLP - J. Eisner

24 p(Mama … | mm)
[Diagram: compose + minimize twice, once with the input restricted to Mama Σ* and once with input Σ*.]
p(Mama Σ* : mm) = .005555 = sum of Mama … paths
p(Σ* : mm) = .0375555 = sum of all paths
p(Mama Σ* | mm) = .005555 / .0375555 = .148
Intro to NLP - J. Eisner

25 minimum edit distance (best alignment)
Older baby said caca. Was probably thinking clara (?)
Do these match up well? How well? Minimum edit distance = cost of the best alignment.
Candidate alignments of clara with caca:
3 substitutions + 1 deletion = total cost 4
2 deletions + 1 insertion = total cost 3
1 deletion + 1 substitution = total cost 2
5 deletions + 4 insertions = total cost 9
Intro to NLP - J. Eisner 25
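
The minimum over all alignments can be computed without enumerating them, by the standard dynamic program sketched below (deletions, insertions, and substitutions cost 1, matches cost 0); it reproduces the total cost of 2 for clara vs. caca and is the tabular counterpart of the min-cost-path picture on the next few slides.

    def edit_distance(upper, lower):
        """Levenshtein distance between two strings."""
        m, n = len(upper), len(lower)
        # dist[i][j] = cost of the best alignment of upper[:i] with lower[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i                      # i deletions
        for j in range(n + 1):
            dist[0][j] = j                      # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if upper[i - 1] == lower[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # delete upper[i-1]
                                 dist[i][j - 1] + 1,        # insert lower[j-1]
                                 dist[i - 1][j - 1] + sub)  # substitute or match
        return dist[m][n]

    print(edit_distance("clara", "caca"))   # -> 2, matching the best alignment above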

26 Edit distance as min-cost path
[Grid diagram: columns = position in upper string (c l a r a), rows = position in lower string (c a c a); edges are labeled with letter pairs such as c:c, l:ε, ε:a.]
Minimum-cost path shows the best alignment, and its cost is the edit distance.
Intro to NLP - J. Eisner 26

27 Edit distance as min-cost path
[Same grid, now with the partially matched prefixes and running costs (0, 1, 2, …) written at the nodes.]
Minimum-cost path shows the best alignment, and its cost is the edit distance.
Intro to NLP - J. Eisner 27

28 Edit distance as min-cost path
A deletion edge has cost 1. It advances in the upper string only, so it’s horizontal. It pairs the next letter of the upper string with ε (empty) in the lower string.
[Grid diagram highlighting the horizontal deletion edges c:ε, l:ε, a:ε, r:ε, a:ε.]
Intro to NLP - J. Eisner 28

29 Edit distance as min-cost path
An insertion edge has cost 1. It advances in the lower string only, so it’s vertical. It pairs ε (empty) in the upper string with the next letter of the lower string.
[Grid diagram highlighting the vertical insertion edges ε:c and ε:a.]
Intro to NLP - J. Eisner 29

30 Edit distance as min-cost path
A substitution edge has cost 0 or 1. It advances in the upper and lower strings simultaneously, so it’s diagonal. It pairs the next letter of the upper string with the next letter of the lower string. Cost is 0 or 1 depending on whether those letters are identical!
[Grid diagram highlighting the diagonal substitution edges such as c:c, l:c, a:c, r:a, a:a.]
Intro to NLP - J. Eisner 30

31 Edit distance as min-cost path
[Grid diagram as before.]
We’re looking for a path from upper left to lower right (so as to get through both strings). Solid edges have cost 0, dashed edges have cost 1. So we want the path with the fewest dashed edges.
Intro to NLP - J. Eisner 31

32 Edit distance as min-cost path
[Grid diagram highlighting the alignment of clara with caca using 3 substitutions + 1 deletion = total cost 4.]
Intro to NLP - J. Eisner 32

33 Edit distance as min-cost path
[Grid diagram highlighting the alignment using 2 deletions + 1 insertion = total cost 3.]
Intro to NLP - J. Eisner 33

34 Edit distance as min-cost path
[Grid diagram highlighting the best alignment: 1 deletion + 1 substitution = total cost 2.]
Intro to NLP - J. Eisner 34

35 Edit distance as min-cost path
[Grid diagram highlighting the alignment using 5 deletions + 4 insertions = total cost 9.]
Intro to NLP - J. Eisner 35

36 Edit Distance Transducer
[Edit distance transducer over an alphabet of k letters:]
O(k) deletion arcs (a:ε, b:ε, …)
O(k²) substitution arcs (a:b, b:a, …)
O(k) insertion arcs (ε:a, ε:b, …)
O(k) no-change arcs (a:a, b:b, …)
Intro to NLP - J. Eisner 36

37 Edit Distance Transducer
Stochastic Edit Distance Transducer: the same arcs as before (O(k) deletion, O(k²) substitution, O(k) insertion, O(k) identity arcs), but now likely edits = high-probability arcs. Defines p(x | y), e.g., p(caca | clara).
Intro to NLP - J. Eisner 37
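
A hedged sketch of that stochastic view: instead of minimizing cost, sum the probabilities of all edit alignments of y with x to get p(x | y). The per-edit probabilities below are invented and only roughly normalized; a real channel model would be trained.

    # Toy edit probabilities (assumed for illustration): match, substitute, insert, delete.
    P_MATCH, P_SUB, P_INS, P_DEL = 0.90, 0.02, 0.02, 0.02

    def channel_prob(y, x):
        """p(x | y): total probability over all edit alignments of y to x."""
        m, n = len(y), len(x)
        # f[i][j] = probability of producing x[:j] from y[:i]
        f = [[0.0] * (n + 1) for _ in range(m + 1)]
        f[0][0] = 1.0
        for i in range(m + 1):
            for j in range(n + 1):
                if f[i][j] == 0.0:
                    continue
                if i < m:                                    # delete y[i]
                    f[i + 1][j] += f[i][j] * P_DEL
                if j < n:                                    # insert x[j]
                    f[i][j + 1] += f[i][j] * P_INS
                if i < m and j < n:                          # match or substitute
                    p = P_MATCH if y[i] == x[j] else P_SUB
                    f[i + 1][j + 1] += f[i][j] * p
        return f[m][n]

    print(channel_prob("clara", "caca"))    # small but nonzero
    print(channel_prob("clara", "clara"))   # much larger: mostly matches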

38 Most likely path that maps clara to caca?
Best path (by Dijkstra’s algorithm):
clara .o. (stochastic edit distance transducer) .o. caca = the grid machine of all alignments of clara with caca; its best path is the most likely way the channel maps clara to caca.
[Diagram: the straight-line acceptor for clara, the edit distance transducer, the straight-line acceptor for caca, and the grid-shaped result of composing them.]
Intro to NLP - J. Eisner 38

39 What string produced caca? (posterior distribution)
Compose Lexicon* .o. (edit distance transducer) .o. caca. The result is a more complicated FST, with cycles. Each path is a possible input string aligned to caca, and a path’s weight is proportional to the posterior probability of that path as an explanation for caca. Could find the best path.
Intro to NLP - J. Eisner 39

40 Speech Recognition by FST Composition (Pereira & Riley 1996)
trigram language model: p(word seq)
.o. pronunciation model: p(phone seq | word seq)
.o. acoustic model: p(acoustics | phone seq)
.o. observed acoustics
Intro to NLP - J. Eisner 40
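
As a stand-in for what this composition computes, here is a hedged brute-force sketch: score each candidate word sequence by p(words) · p(phones | words) · p(acoustics | phones) and keep the argmax. Every candidate and probability below is invented, the phone sequences are collapsed into the word strings for brevity, and a real recognizer would search the composed FSTs rather than enumerate candidates.

    # Hypothetical candidates and model scores, for illustration only.
    candidates = ["the cat", "the cad", "thick at"]

    def p_words(w):                    # stand-in for the trigram language model
        return {"the cat": 0.6, "the cad": 0.1, "thick at": 0.3}[w]

    def p_phones_given_words(w):       # stand-in for the pronunciation model
        return {"the cat": 0.9, "the cad": 0.9, "thick at": 0.8}[w]

    def p_acoustics_given_phones(w):   # stand-in for the acoustic model on the observed audio
        return {"the cat": 0.05, "the cad": 0.04, "thick at": 0.01}[w]

    best = max(candidates,
               key=lambda w: p_words(w) * p_phones_given_words(w) * p_acoustics_given_phones(w))
    print(best)   # -> 'the cat'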

41 Speech Recognition by FST Composition (Pereira & Riley 1996)
trigram language model: p(word seq)
.o. p(phone seq | word seq): arcs such as CAT : k æ t
.o. p(acoustics | phone seq): models of each phone in context (phone context … phone context)
Intro to NLP - J. Eisner 41

42 HMM Tagging Can think of this as a finite-state noisy channel, too!
See slides in the “tagging” lecture. Intro to NLP - J. Eisner 42

