Computational Linguistics - Jason Eisner
It's All About High-Probability Paths in Graphs
Airport Travel, Edit Distance, Hidden Markov Models, Finite-State Machines (regular languages), Parsing (if you generalize)

HMM trellis: a graph with 2^33 ≈ 8 billion paths, yet small: only 2*34 = 68 states and 2*67 = 134 edges.
[Trellis figure: a Start state, then a C (Cold) and an H (Hot) state for each day of the ice-cream diary (Day 1: 2 cones, Day 2: 3 cones, Day 3: 3 cones, …, Day 32: 2 cones, Day 33: 2 cones), ending in a Stop state on Day 34 when the diary is lost. Each edge is weighted by a product such as p(H|Start)*p(2|H), p(H|H)*p(3|H), p(C|H)*p(3|C), or p(Stop|H).]
We don't know the correct path, but we know how likely each path is (a posteriori), at least according to our current model. So which is the most likely path?

Finding the Minimum-Cost Path (a.k.a. the "shortest path problem")
This is a classic problem in graph algorithms. [Figure: a graph of airports with edge weights in miles; example from Goodrich & Tamassia.]
How many paths from FriendHouse to MyHouse? ∞ (cycles)
How many miles is the longest such path? Impossible to compute?
How many miles is the shortest such path? 2410
What is the shortest such path? FriendHouse → BOS → JFK → DFW → ORD → MyHouse

Minimum-Cost Path (in Dyna)

path_to(start) min= 0.
path_to(B) min= path_to(A) + edge(A,B).
goal min= path_to(end).

Understanding the Central Rule

path_to(B) min= path_to(A) + edge(A,B).

The shortest path from start to state B must first go to some previous state A and then on to B, so the total cost is the cost of the shortest path from start to A plus one extra edge. But there may be many choices of A, so choose the minimum of all such possibilities.

e.g., path_to("DFW") is the minimum over
  path_to("MIA") + edge("MIA", "DFW") = 2389
  path_to("JFK") + edge("JFK", "DFW") = 1588
  …
so path_to("DFW") is defined to be 1588.

Minimum-Cost Path (in Dyna)

start := "FriendHouse".
end := "MyHouse".
path_to(B) min= 0 for B==start.   % or use = instead of min=
path_to(B) min= path_to(A) + edge(A,B).
goal min= path_to(end).

"Can get to start for free" (just stay put!): length-0 paths from start to B (must have B==start).
"Can get to B by going to a previous state A + paying for the A→B edge": length > 0 paths from start to B (must have a next-to-last state A). We take min= over all paths.
"In particular, here's how much it costs to get to end." (Note: goal has no value if there's no way to get to end.)

Minimum-Cost Path (in Dyna)

path_to(start) min= 0.
path_to(B) min= path_to(A) + edge(A,B).
goal min= path_to(end).

Note: This runs fine in Dyna. But if you want to write a procedural algorithm instead of a declarative specification, you can use Dijkstra's algorithm; Dyna is doing something like that internally for this program. If the graph has no cycles, you can use a simpler algorithm that visits the vertices "in order": compute path_to(B) only after computing path_to(A) for all states A such that edge(A,B) is defined.
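As a rough illustration of that procedural route (not what Dyna actually does internally), here is a minimal Dijkstra-style sketch in Python; the dict-of-edge-lists representation is my own, and edge costs are assumed nonnegative (here they are miles).

import heapq

def min_cost_path(edges, start, end):
    """Dijkstra's algorithm. `edges` maps each state to a list of
    (next_state, cost) pairs. Returns (cost of the cheapest start-to-end
    path, backpointer dict); the cost is inf if end is unreachable."""
    best = {start: 0.0}           # cheapest known cost to reach each state
    back = {start: None}          # backpointer: which state we came from
    frontier = [(0.0, start)]     # priority queue of (cost, state)
    while frontier:
        cost, a = heapq.heappop(frontier)
        if cost > best.get(a, float("inf")):
            continue              # stale queue entry
        if a == end:
            return cost, back
        for b, w in edges.get(a, []):
            new_cost = cost + w   # path_to(B) min= path_to(A) + edge(A,B)
            if new_cost < best.get(b, float("inf")):
                best[b] = new_cost
                back[b] = a
                heapq.heappush(frontier, (new_cost, b))
    return float("inf"), back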

How to find the min-cost path itself?

path_to(start) min= 0.
path_to(B) min= path_to(A) + edge(A,B) with_key A.

Store a backpointer, e.g., from "DFW" back to "JFK": remember that path_to("DFW") got its min value when A was "JFK". We'll define $key(path_to("DFW")) to be "JFK". The "with_key" construction gives an automatic definition: it lets us store information in $key(…) about how the minimum was achieved.

How to find the min-cost path itself?

path_to(start) min= 0 with_key [].   % base case used by bestpath(start)
path_to(B) min= path_to(A) + edge(A,B) with_key A.
bestpath(B) = [B | bestpath($key(path_to(B)))].

Now we can trace backpointers from any B back to start:
  bestpath("FriendHouse") = ["FriendHouse"]
  bestpath("BOS") = ["BOS", "FriendHouse"]
  bestpath("JFK") = ["JFK", "BOS", "FriendHouse"]
  bestpath("DFW") = ["DFW" | bestpath($key(path_to("DFW")))]
                  = ["DFW" | bestpath("JFK")], which prepends "DFW" to ["JFK", "BOS", …]
                  = ["DFW", "JFK", "BOS", "FriendHouse"]

How to find the min-cost path itself?

path_to(start) min= 0 with_key [].   % base case used by bestpath(start)
path_to(B) min= path_to(A) + edge(A,B) with_key A.
bestpath(B) = [B | bestpath($key(path_to(B)))].
goal min= path_to(end).
optimal_path = bestpath(end).

How to find the min-cost path itself?

path_to(start) min= 0 with_key [].
path_to(B) min= path_to(A) + edge(A,B) with_key A.
bestpath(B) = [B | bestpath($key(path_to(B)))].
goal min= path_to(end).
optimal_path = bestpath(end).

Or the key can be the whole path back from B (not just the 1 preceding state A):

path_to(start) min= 0 with_key [start].
path_to(B) min= path_to(A) + edge(A,B) with_key [B | $key(path_to(A))].
goal min= path_to(end).
optimal_path = $key(path_to(end)).
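In a procedural setting the same trick is just a dictionary of backpointers. A small sketch, reusing the hypothetical min_cost_path above (whose back dict maps each state to the state it was reached from):

def best_path(back, end):
    """Trace backpointers from `end` to the start state; return the path in order."""
    path = []
    state = end                  # raises KeyError below if end was never reached
    while state is not None:     # the start state's backpointer is None
        path.append(state)
        state = back[state]
    path.reverse()               # we built the list end -> start
    return path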

Defining the Input Graph

path_to(start) min= 0.
path_to(B) min= path_to(A) + edge(A,B).
goal min= path_to(end).

We need a graph with weights on the edges:

start := "FriendHouse".
end := "MyHouse".
edge("BOS", "JFK") := 187.
edge("BOS", "MIA") := …
edge("JFK", "DFW") := …
edge("JFK", "SFO") := …
…

What if there are multiple edges from A to B? Pick the shortest: for example, define edge(A,B) using min=.

Defining the Input Graph

We could define distances between airports by rule, using the Euclidean distance formula (assuming a flat earth):

path_to(start) min= 0.
path_to(B) min= path_to(A) + edge(A,B).
goal min= path_to(end).

dist( &point(X,Y), &point(X2,Y2) ) = sqrt((X-X2)**2 + (Y-Y2)**2).
edge(A,B) = dist( loc(A), loc(B) ) for has_flight(A,B).

loc("BOS") = &point(2927, -3767).
loc("JFK") = &point(2808, -3914).
loc("MIA") = &point(1782, -4260).
…
has_flight("BOS", "JFK").
has_flight("JFK", "MIA").
has_flight("BOS", "MIA").
…

In Dyna, the value of &point(X,Y) is just point(X,Y) itself: a location. If we wrote point(X,Y), Dyna would want rules defining the point function.
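The same graph construction in plain Python might look like the sketch below (coordinates and flights copied from the slide; the variable names and the `edges` dict format, which feeds the min_cost_path sketch above, are my own):

import math

loc = {"BOS": (2927, -3767), "JFK": (2808, -3914), "MIA": (1782, -4260)}
has_flight = [("BOS", "JFK"), ("JFK", "MIA"), ("BOS", "MIA")]

def dist(p, q):
    """Euclidean distance between two (x, y) points (flat-earth assumption)."""
    (x, y), (x2, y2) = p, q
    return math.sqrt((x - x2) ** 2 + (y - y2) ** 2)

# Build the weighted edge lists: one edge per flight, weighted by distance.
edges = {}
for a, b in has_flight:
    edges.setdefault(a, []).append((b, dist(loc[a], loc[b])))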

Edit Distance

Baby actually said caca. Baby was probably thinking clara (?). Do these match up well? How well?

Some alignments of clara with caca, and their costs:
  3 substitutions + 1 deletion = total cost 4
  2 deletions + 1 insertion = total cost 3
  1 deletion + 1 substitution = total cost 2   ← minimum edit distance (best alignment)
  5 deletions + 4 insertions = total cost 9

Edit distance as min-cost path

[Figure: a grid-shaped graph. The horizontal axis is the position in the upper string (clara); the vertical axis is the position in the lower string (caca). Edges are labeled with edit operations such as c:ε, l:ε (deletions), ε:c, ε:a (insertions), and c:c, l:c, a:a (substitutions).]

The minimum-cost path shows the best alignment, and its cost is the edit distance.

Edit distance as min-cost path

[Figure: the same grid, with each state labeled by the prefixes consumed so far: the upper-string prefix (c, cl, cla, clar, clara) and the lower-string prefix (c, ca, cac, caca), starting from 0.]

The minimum-cost path shows the best alignment, and its cost is the edit distance.

Edit distance as min-cost path

A deletion edge has cost 1. It advances in the upper string only, so it's horizontal. It pairs the next letter of the upper string with ε (empty) in the lower string. [Figure: the horizontal edges c:ε, l:ε, a:ε, r:ε, a:ε in the grid.]

Edit distance as min-cost path

An insertion edge has cost 1. It advances in the lower string only, so it's vertical. It pairs ε (empty) in the upper string with the next letter of the lower string. [Figure: the vertical edges ε:c, ε:a, ε:c, ε:a in the grid.]

Edit distance as min-cost path

A substitution edge has cost 0 or 1. It advances in the upper and lower strings simultaneously, so it's diagonal. It pairs the next letter of the upper string with the next letter of the lower string. The cost is 0 or 1 depending on whether those letters are identical. [Figure: the diagonal edges c:c, l:c, a:a, r:a, … in the grid.]

Edit distance as min-cost path

We're looking for a path from the upper left to the lower right (so as to get through both strings). Solid edges have cost 0; dashed edges have cost 1. So we want the path with the fewest dashed edges. [Figure: the full grid of deletion, insertion, and substitution edges.]

Edit distance as min-cost path

[Figure: one path through the grid, aligning clara with caca.] 3 substitutions + 1 deletion = total cost 4.

Edit distance as min-cost path

[Figure: another path through the grid.] 2 deletions + 1 insertion = total cost 3.

Edit distance as min-cost path

[Figure: another path through the grid.] 1 deletion + 1 substitution = total cost 2.

Edit distance as min-cost path

[Figure: another path through the grid.] 5 deletions + 4 insertions = total cost 9.

Edit distance as min-cost path

Again, we have to define the graph by rule:

edge( &state(U-1, L-1), &state(U, L) ) = subst_cost( upper(U), lower(L) ).
edge( &state(U, L-1), &state(U, L) ) = ins_cost( lower(L) ).
edge( &state(U-1, L), &state(U, L) ) = del_cost( upper(U) ).
start = &state(0,0).
end = &state(upper_length, lower_length).

Upper string: upper(1) := "c". upper(2) := "l". upper(3) := "a". …  upper_length := 5.
Lower string: lower(1) := "c". lower(2) := "a". lower(3) := "c". lower(4) := "a".  lower_length := 4.

In Dyna, the value of &state(U,L) is state(U,L) itself: a compound name. If we wrote state(U,L), Dyna would want rules defining the state function.
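Viewed procedurally, this grid is the familiar edit-distance dynamic program. A minimal Python sketch, assuming unit insertion/deletion costs and a 0/1 substitution cost as on these slides:

def edit_distance(upper, lower):
    """Min-cost path through the (len(upper)+1) x (len(lower)+1) grid.
    d[u][l] = cheapest way to consume the first u letters of `upper`
    and the first l letters of `lower`."""
    U, L = len(upper), len(lower)
    d = [[0] * (L + 1) for _ in range(U + 1)]
    for u in range(1, U + 1):
        d[u][0] = u                                 # u deletions
    for l in range(1, L + 1):
        d[0][l] = l                                 # l insertions
    for u in range(1, U + 1):
        for l in range(1, L + 1):
            subst = 0 if upper[u - 1] == lower[l - 1] else 1
            d[u][l] = min(d[u - 1][l] + 1,          # deletion edge (horizontal)
                          d[u][l - 1] + 1,          # insertion edge (vertical)
                          d[u - 1][l - 1] + subst)  # substitution edge (diagonal)
    return d[U][L]

print(edit_distance("clara", "caca"))   # 2, matching the best alignment above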

We've seen lots of "min-cost path" problems

Same algorithm in all cases, just different graphs (and you can run other useful algorithms on those graphs too):
  Airports
  Edit distance
  Viterbi tagging in an HMM: ice cream → weather; words → part-of-speech tags
  Parsing: actually a little more general. It's still a dynamic programming problem and still uses min= in Dyna, but we need a "hypergraph" with hyperedges like "S" → ["NP", "VP"], and we find a "hyperpath" from the start state (the "START" nonterminal) to the end state (the collection of all input words).

HMM trellis: a graph with 2^33 ≈ 8 billion paths, yet small: only 2*34 = 68 states and 2*67 = 134 edges.
[The same trellis figure as before: Start, then C and H states for Day 1 (2 cones) through Day 33 (2 cones), then Stop on Day 34 when the diary is lost, with edge weights such as p(H|Start)*p(2|H) and p(C|H)*p(3|C).]
Paths are different ways that we could explain the observed evidence. Which is the most likely path? (according to our current model)

Max-Probability Path in an HMM

Finds the max-prob path instead of the min-cost path. This is called the Viterbi algorithm.

path_to(start) max= 1.
path_to(B) max= path_to(A) * edge(A,B).
goal max= path_to(end).

Again, we have to define the graph by rule:

start = &state(0, &start_tag).
end = &state(length+1, &end_tag).
edge( &state(Time-1, PrevTag), &state(Time, Tag) ) = p_transition(PrevTag, Tag) * p_emission(Tag, word(Time)).
e.g., edge( &state(1, "C"), &state(2, "H") ) = p_transition("C", "H") * p_emission("H", word(2)).

In Dyna, the value of &state(Time,Tag) is just state(Time,Tag) itself. Similarly, &start_tag, &end_tag, &eos are just symbols, not items.
To extract the actual path, use with_key and follow backpointers.
[Figure: the first two columns of the trellis, with edge weights such as p(H|Start)*p(2|H), p(C|H)*p(3|C), p(H|C)*p(3|H).]
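For comparison, here is a small Viterbi sketch in Python over the same kind of trellis. The dict-based model format (p_transition[(prev_tag, tag)], p_emission[(tag, word)]) is my own stand-in for the Dyna items above.

def viterbi(words, tags, p_transition, p_emission, start="START", stop="STOP"):
    """Return (max-probability tag sequence for `words`, its probability)."""
    best = {start: 1.0}   # best prob of any path from Start to each tag so far
    backpointers = []     # backpointers[t][tag] = best previous tag
    for word in words:
        new_best, back = {}, {}
        for tag in tags:
            # max over previous tags A, mirroring path_to(B) max= path_to(A) * edge(A,B)
            prev = max(best, key=lambda a: best[a] * p_transition.get((a, tag), 0.0))
            new_best[tag] = (best[prev] * p_transition.get((prev, tag), 0.0)
                             * p_emission.get((tag, word), 0.0))
            back[tag] = prev
        best = new_best
        backpointers.append(back)
    # final transition into the Stop state
    last = max(best, key=lambda a: best[a] * p_transition.get((a, stop), 0.0))
    prob = best[last] * p_transition.get((last, stop), 0.0)
    # trace backpointers from the last tag back to the first
    path = [last]
    for back in reversed(backpointers[1:]):
        path.append(back[path[-1]])
    path.reverse()
    return path, prob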

Max-Probability Path in an HMM

Again, we have to define the graph by rule (the start, end, and edge rules are the same as on the previous slide); we also need an initial model and an input sentence.

Initial model:
p_emission("H", "3") := 0.7.  …
p_transition("C", "H") := 0.1.  …
p_transition(&start_tag, "C") := 0.5.  …
p_emission(&end_tag, &eos) := 1.  …

Input sentence:
word(1) := "2".  word(2) := "3".  word(3) := "3".  …
length := 33.
word(length+1) := &eos.

Max-Probability Path in an HMM

The same program does part-of-speech tagging if we swap in a different model and input sentence (again, the start, end, and edge rules are the same as before).

Initial model:
p_emission("PlNoun", "horses") := …
p_transition("PlNoun", "Conj") := …
p_transition(&start_tag, "PlNoun") := …
p_emission(&end_tag, &eos) := 1.  …

Input sentence:
word(1) := "horses".  word(2) := "and".  word(3) := "Lukasiewicz".  …

Viterbi tagging

Paths that explain our 2-word input sentence:
  Det Adj  0.35
  Det N    0.2
  N V      0.45

The most probable path gives the single best tag sequence: N V (0.45). Find it by following backpointers.

But for a long sentence with many ambiguous words, there might be a gazillion paths to explain it. So even the best path might have probability ≈ 0. Do we really trust it to be the right answer?

HMM trellis: a graph with 2^33 ≈ 8 billion paths, yet small: only 2*34 = 68 states and 2*67 = 134 edges.
[The same trellis figure again.]
We know how likely each path is (a posteriori), at least according to our current model. Don't just find the single best path (the "Viterbi path"). If we chose random paths from the posterior distribution, which states and edges would we usually see? That is, which states and edges are probably correct, according to the model?

Alternative to Viterbi tagging: Posterior tagging

Give each word the tag that's most probable in context.
  Det Adj  0.35   exp # correct tags = 0.55 + 0.35 = 0.9
  Det N    0.2    exp # correct tags = 0.55 + 0.2  = 0.75
  N V      0.45   exp # correct tags = 0.45 + 0.45 = 0.9
  Output is Det V (a tag sequence whose own path probability is 0)   exp # correct tags = 0.55 + 0.45 = 1.0

Defensible: it maximizes the expected # of correct tags. But it is not a coherent sequence, and may screw up subsequent processing (e.g., can't find any parse).

How do we compute the highest-prob tag for each word? Forward-backward algorithm!
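A tiny Python sketch of that arithmetic, using the three path probabilities from the slide:

from collections import defaultdict

# Posterior probabilities of the whole-sequence explanations (from the slide).
paths = {("Det", "Adj"): 0.35, ("Det", "N"): 0.2, ("N", "V"): 0.45}

# Marginal (posterior) probability of each tag at each position.
marginal = defaultdict(float)
for tags, p in paths.items():
    for position, tag in enumerate(tags):
        marginal[(position, tag)] += p

def expected_correct(tagging):
    """Expected number of correct tags if we output `tagging`."""
    return sum(marginal[(position, tag)] for position, tag in enumerate(tagging))

print(expected_correct(("N", "V")))     # ≈ 0.9  (the Viterbi sequence)
print(expected_correct(("Det", "V")))   # ≈ 1.0  (the posterior tagging)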

Remember the Forward-Backward Algorithm

(Here a, b, c are the total weights of path prefixes arriving at a state, x, y, z the weights of path suffixes leaving it, and p, p1, … are edge weights.)
All paths through a state C: ax + ay + az + bx + by + bz + cx + cy + cz = (a+b+c)(x+y+z) = α(C) · β(C).
All paths through an edge from H to C with weight p: apx + apy + apz + bpx + bpy + bpz + cpx + cpy + cpz = (a+b+c) · p · (x+y+z) = α(H) · p · β(C).
All paths to a state H: α = (a·p1 + b·p1 + c·p1) + (d·p2 + e·p2 + f·p2) = α1·p1 + α2·p2.
All paths from a state C: β = (p3·u + p3·v + p3·w) + (p4·x + p4·y + p4·z) = p3·β3 + p4·β4.
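Written as general recurrences over the trellis (standard notation; this just restates the Dyna rules on the next slides):

\alpha(\mathrm{start}) = 1, \qquad \alpha(B) = \sum_{A} \alpha(A)\,\mathrm{edge}(A,B), \qquad Z = \alpha(\mathrm{end})
\beta(\mathrm{end}) = 1, \qquad \beta(A) = \sum_{B} \mathrm{edge}(A,B)\,\beta(B)
p_{\mathrm{posterior}}(B) = \frac{\alpha(B)\,\beta(B)}{Z}, \qquad p_{\mathrm{posterior}}(A \to B) = \frac{\alpha(A)\,\mathrm{edge}(A,B)\,\beta(B)}{Z}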

Forward-Backward Algorithm in Dyna

Most probable path from start to each B:
path_to(start) max= 1.
path_to(B) max= path_to(A) * edge(A,B).
goal max= path_to(end).   % max of all complete paths

Forward-Backward Algorithm in Dyna

Total probability of all paths from start to each B:
alpha(start) += 1.
alpha(B) += alpha(A) * edge(A,B).
z += alpha(end).   % total of all complete paths

Total probability of all paths from each A to end:
beta(end) += 1.
beta(A) += edge(A,B) * beta(B).
z_another_way += beta(start).   % total of all complete paths

Total prob of paths through state B or edge A→B:
alphabeta(B) = alpha(B) * beta(B).
alphabeta(A,B) = alpha(A) * edge(A,B) * beta(B).

Forward-Backward Algorithm in Dyna

Total probability of all paths from start to each B:
alpha(start) += 1.
alpha(B) += alpha(A) * edge(A,B).
z += alpha(end).   % total of all complete paths

Total probability of all paths from each A to end:
beta(end) += 1.
beta(A) += edge(A,B) * beta(B).
z_another_way += beta(start).   % total of all complete paths

Total posterior prob of paths through state B or edge A→B (i.e., what fraction of paths go through B or A→B?) -- use for posterior tagging:
p_posterior(B) = alpha(B) * beta(B) / z.
p_posterior(A,B) = alpha(A) * edge(A,B) * beta(B) / z.
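A minimal Python counterpart over the same hypothetical trellis representation as the Viterbi sketch above: sums (+=) where Viterbi took maxes. It returns the state posteriors used for posterior tagging (the edge posteriors are analogous and omitted here for brevity).

def forward_backward(words, tags, p_transition, p_emission, start="START", stop="STOP"):
    """Return (z, posteriors) where posteriors[t][tag] = P(tag at position t | words).
    Assumes the model gives the observed words nonzero probability (z > 0)."""
    T = len(words)
    # Forward pass: alpha[t][tag] = total prob of all paths reaching `tag` at time t.
    alpha = [dict() for _ in range(T)]
    for t, word in enumerate(words):
        prev = alpha[t - 1] if t > 0 else {start: 1.0}
        for tag in tags:
            alpha[t][tag] = (sum(prev[a] * p_transition.get((a, tag), 0.0) for a in prev)
                             * p_emission.get((tag, word), 0.0))
    z = sum(alpha[T - 1][a] * p_transition.get((a, stop), 0.0) for a in tags)

    # Backward pass: beta[t][tag] = total prob of all paths from `tag` at time t to Stop.
    beta = [dict() for _ in range(T)]
    for tag in tags:
        beta[T - 1][tag] = p_transition.get((tag, stop), 0.0)
    for t in range(T - 2, -1, -1):
        for tag in tags:
            beta[t][tag] = sum(p_transition.get((tag, b), 0.0)
                               * p_emission.get((b, words[t + 1]), 0.0)
                               * beta[t + 1][b] for b in tags)

    # Posterior probability of each tag at each position: alpha * beta / z.
    posteriors = [{tag: alpha[t][tag] * beta[t][tag] / z for tag in tags} for t in range(T)]
    return z, posteriors

Posterior tagging then just outputs, at each position t, the tag with the largest posteriors[t] value.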

Forward Algorithm

Total probability of all paths from start to each B:
alpha(start) += 1.
alpha(B) += alpha(A) * edge(A,B).
z += alpha(end).   % total of all complete paths

z is now the probability of the evidence (the total probability of all ways of generating the evidence): p(word sequence) or p(ice cream sequence). We can apply the same idea to other noisy channels …

Forward algorithm applied to edit distance

alpha(start) += 1.
alpha(B) += alpha(A) * edge(A,B).
z += alpha(end).   % total of all complete paths

Baby was thinking clara? Or something else? It went through a noisy channel and came out as caca. To reconstruct the underlying form, use Bayes' Theorem! Assume we have a prior p(clara). What z tells us is p(caca | clara) … if we define the edge weights to be the probabilities of the insertions, deletions, or substitutions on those specific edges, e.g., p(ε | l), p(c | r). So each path describes a sequence of edits that might happen given clara. The paths in our graph are all edit sequences yielding caca; we're summing their probs. [Figure: the clara/caca edit-distance grid again.]

Reestimating HMM parameters

Having computed which states and edges are likely on random paths, we can now summarize what tends to happen on random paths:
  How many of the H states fall on 3-ice-cream days?
  How many of the H states are followed by another H?
We use these faux "observed" counts to re-estimate the params (we can add 1 to these counts for smoothing):

count_emission(Tag, word(Time)) += p_posterior( &state(Time,Tag) ).
count_transition(PrevTag, Tag) += p_posterior( &state(Time-1,PrevTag), &state(Time,Tag) ).
p_emission(Tag, Word) = count_emission(Tag, Word) / count(Tag).
p_transition(Prev, Tag) = count_transition(Prev, Tag) / count(Prev).

[Figure: the HMM trellis again.]
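A hedged Python sketch of that M step. It assumes the E step produced, for each sentence, state posteriors like those returned by the forward_backward sketch above plus analogous edge posteriors keyed by (prev_tag, tag); both input formats are my own.

from collections import defaultdict

def m_step(sentences, state_posteriors, edge_posteriors, smoothing=0.0):
    """Reestimate p_emission and p_transition from expected ("faux observed") counts."""
    count_emission = defaultdict(float)     # (tag, word)     -> expected count
    count_transition = defaultdict(float)   # (prev_tag, tag) -> expected count
    for s, words in enumerate(sentences):
        for t, word in enumerate(words):
            for tag, p in state_posteriors[s][t].items():
                count_emission[(tag, word)] += p
            for (prev_tag, tag), p in edge_posteriors[s][t].items():
                count_transition[(prev_tag, tag)] += p
    # Normalize per conditioning tag (smoothing here is only over events actually
    # seen -- true add-1 smoothing would also add counts for unseen words/tags).
    p_emission, p_transition = {}, {}
    tag_total = defaultdict(float)
    for (tag, _), c in count_emission.items():
        tag_total[tag] += c + smoothing
    for (tag, word), c in count_emission.items():
        p_emission[(tag, word)] = (c + smoothing) / tag_total[tag]
    prev_total = defaultdict(float)
    for (prev_tag, _), c in count_transition.items():
        prev_total[prev_tag] += c + smoothing
    for (prev_tag, tag), c in count_transition.items():
        p_transition[(prev_tag, tag)] = (c + smoothing) / prev_total[prev_tag]
    return p_emission, p_transition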

Reestimating parameters: Expectation-Maximization (EM) in General

Start by devising a noisy channel: any model that predicts the corpus observations via some hidden structure (tags, parses, …).
Initially guess the parameters of the model! An educated guess is best, but random can work.
Expectation step: use the current parameters (and the observations) to reconstruct the hidden structure.
Maximization step: use that hidden structure (and the observations) to reestimate the parameters.
Repeat until convergence!
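The overall loop is short; a rough Python sketch (e_step here is a hypothetical wrapper that runs forward-backward over each sentence and returns the posteriors expected by the m_step sketch above):

def em(sentences, tags, p_emission, p_transition, iterations=10):
    """Skeleton of EM: alternate E and M steps, starting from an initial guess."""
    for _ in range(iterations):   # or: until the log-likelihood stops improving
        # E step: reconstruct the hidden structure (posterior distributions over tags).
        state_post, edge_post = e_step(sentences, tags, p_emission, p_transition)
        # M step: reestimate the parameters from those expected counts.
        p_emission, p_transition = m_step(sentences, state_post, edge_post)
    return p_emission, p_transition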

Expectation-Maximization (EM) in General

[Diagram: an initial guess of the unknown parameters (probabilities) feeds the E step, which combines the parameters with the observed structure (words, ice cream) to produce a guess of the unknown hidden structure (tags, parses, weather); the M step then uses that hidden structure to reestimate the parameters, and the cycle repeats.]

EM for Hidden Markov Models

[The same cycle diagram, specialized to HMMs: initial guess of parameters → E step (with the observed words or ice cream) → guess of the hidden tags/weather → M step → new parameter guess → …]

EM for Grammar Reestimation

[Diagram: test sentences go through the PARSER, whose output trees are compared by a scorer against correct test trees to measure accuracy. Training sentences (cheap, plentiful and appropriate) are parsed as well (the E step); the LEARNER then reestimates the Grammar from those training trees (the M step), instead of relying on hand-annotated trees that are expensive and/or from the wrong sublanguage.]

Two Versions of EM

The Viterbi approximation (max=):
  Expectation: pick the best parse of each sentence.
  Maximization: retrain on this best-parsed corpus.
  Advantage: speed!

Real EM (+=):
  Expectation: find all parses of each sentence.
  Maximization: retrain on all parses in proportion to their probability (as if we observed fractional counts).
  Advantage: p(training corpus) is guaranteed to increase.
  There are exponentially many parses, so we need something clever (the inside-outside algorithm, which generalizes forward-backward).

Summary: Graphs and EM

Given incomplete data, construct a graph (or hypergraph) of all possible ways to complete it. There may be exponentially or infinitely many paths (or hyperpaths), yet the number of states and edges is manageable.
  HMM tagging (observe a word sequence; the tag sequence is unknown) [trellis figure]
  Edit distance (observe 2 strings; the alignment and edit sequence are unknown) [grid figure]
  Parsing (observe a string; the tree is unknown)

Summary: Graphs and EM

Given incomplete data, construct a graph (or hypergraph) of all possible ways to complete it. The E step uses += or max= to reason efficiently about the paths; it collects a set of probable edges (and how probable they were). Notice that the states are tied to positions in the input. On each edge, something happened as a result of rolling a die or dice:
  Edit distance: an r:c edge (from grid state 3,2 to 4,3) with probability p(c | r)
  HMM tagging: an H:3 edge (from trellis state C,7 to H,8) with probability p(H|C)*p(3|H)
  Parsing: a hyperedge for the rule S → NP VP (building 0S7 from 0NP1 and 1VP7) with probability p(NP VP | S)

Summary: Graphs and EM

Given incomplete data, construct a graph (or hypergraph) of all possible ways to complete it. The E step uses += or max= to reason efficiently about the paths and collects a set of probable edges (and how probable they were). The M step treats these edges as training data: on each edge, something happened as a result of rolling a die or dice, and we reestimate the model parameters to predict these "observed" dice rolls.
  Edit distance: the r:c edge with probability p(c | r)
  HMM tagging: the H:3 edge with probability p(H|C)*p(3|H)
  Parsing: the rule S → NP VP with probability p(NP VP | S)

Summary: Graphs and EM

Given incomplete data, construct a graph (or hypergraph) of all possible ways to complete it. The E step collects a set of probable edges; the M step treats these edges as training data. To train what? How about a conditional log-linear model? Then the E step counts features of the "observed" edges, and the M step adjusts θ until the expected feature counts equal the "observed" counts.
What linguistic features might help define the probabilities below?
  Edit distance: p(c | r) for the r:c edge
  HMM tagging: p(H|C)*p(3|H) for the H:3 edge
  Parsing: p(NP VP | S) for the rule S → NP VP

Use this paradigm across NLP …

First define a probability distribution over structured objects. To compute things about the unseen parts, you just have to construct the right graph! Examples:
  Change what's observed vs. unknown below; some things may be partly observed.
  Put more context in the states: trigram HMM, contextual edit distance, FSTs.
  Go beyond edit distance: complex models of string pairs for machine translation.
(The running examples again: HMM tagging, where we observe a word sequence and the tag sequence is unknown; edit distance, where we observe 2 strings and the alignment and edit sequence are unknown; parsing, where we observe a string and the tree is unknown.)