Hidden Markov Models (HMMs) for Information Extraction

Hidden Markov Models (HMMs) for Information Extraction. Daniel S. Weld, CSE 454.

Extraction with Finite State Machines, e.g. Hidden Markov Models (HMMs): the standard sequence model in genomics, speech, NLP, …

What's an HMM? A set of states, a set of potential observations, initial probabilities, transition probabilities, and emission probabilities. The HMM generates an observation sequence o1 o2 o3 o4 o5.

Hidden Markov Models (HMMs). A finite state machine generates a hidden state sequence, which in turn generates the observation sequence o1 o2 o3 o4 o5 o6 o7 o8. The model makes Markov and independence assumptions, as depicted in its graphical model. Emissions are traditionally atomic entities, like words, and a big part of the power comes from the finite-state context (Viterbi). We call this a generative model because it has parameters for the probability of producing the observed emissions given the state. Adapted from Cohen & McCallum.

Hidden Markov Models (HMMs): the graphical-model view. Hidden states: random variable yt takes values from {s1, s2, s3, s4}. Observations: random variable xt takes values from {o1, o2, o3, o4, o5, …}. The hidden state sequence … yt-2 yt-1 yt … generates the observation sequence … xt-2 xt-1 xt …. Adapted from Cohen & McCallum.

HMM Graphical Model. Hidden states … yt-2 yt-1 yt … generate observations … xt-2 xt-1 xt …; yt takes values from {s1, s2, s3, s4}, and xt takes values from {o1, o2, o3, o4, o5, …}.

HMM Graphical Model: parameters needed. Start state probabilities: P(y1=sk). Transition probabilities: P(yt=si | yt-1=sk). Observation (emission) probabilities: P(xt=oj | yt=sk), usually a multinomial over an atomic, fixed alphabet. Training: maximize the probability of the training observations.

Example: The Dishonest Casino. A casino has two dice: a fair die with P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6, and a loaded die with P(1) = P(2) = P(3) = P(4) = P(5) = 1/10 and P(6) = 1/2. The dealer switches back and forth between the fair and loaded die about once every 20 turns. Game: you bet $1, you roll (always with a fair die), the casino player rolls (maybe with the fair die, maybe with the loaded die), and the highest number wins $2. Slides from Serafim Batzoglou.

The dishonest casino HMM. States FAIR and LOADED; P(stay) = 0.95 and P(switch) = 0.05 in each direction. Emissions: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2. Slides from Serafim Batzoglou.
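For concreteness, here is a minimal Python sketch of this casino HMM's parameters (not part of the original deck; the uniform start distribution is an assumption, since the slide does not give one):

```python
import numpy as np

# States and observations of the dishonest-casino HMM from the slide.
states = ["FAIR", "LOADED"]
observations = [1, 2, 3, 4, 5, 6]

# Initial state distribution (assumed uniform; the slide does not specify it).
start_p = np.array([0.5, 0.5])

# Transition probabilities: stay with probability 0.95, switch with 0.05.
trans_p = np.array([[0.95, 0.05],
                    [0.05, 0.95]])

# Emission probabilities: the fair die is uniform, the loaded die favors 6.
emit_p = np.array([[1/6] * 6,                              # FAIR
                   [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]])   # LOADED
```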

Question #1 – Evaluation. GIVEN: a sequence of rolls by the casino player, 124552646214614613613666166466163661636616361… QUESTION: how likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs. Slides from Serafim Batzoglou.

Question #2 – Decoding. GIVEN: a sequence of rolls by the casino player, 1245526462146146136136661664661636616366163… QUESTION: what portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs. Slides from Serafim Batzoglou.

Question #3 – Learning. GIVEN: a sequence of rolls by the casino player, 124552646214614613613666166466163661636616361651… QUESTION: how "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs. Slides from Serafim Batzoglou.

What's this have to do with Info Extraction? (Recall the FAIR/LOADED casino HMM above.)

What's this have to do with Info Extraction? Relabel the states as TEXT and NAME, keeping P(stay) = 0.95 and P(switch) = 0.05, and use word emissions instead of die rolls: P(the | T) = 0.003, P(from | T) = 0.002, …; P(Dan | N) = 0.005, P(Sue | N) = 0.003, …

IE with Hidden Markov Models. Given a sequence of observations ("Yesterday Pedro Domingos spoke this example sentence.") and a trained HMM with states such as person name, location name, and background, find the most likely state sequence (Viterbi). Any words generated by the designated "person name" state are extracted as a person name: Person name: Pedro Domingos. Slide by Cohen & McCallum.

IE with Hidden Markov Models. For sparse extraction tasks: use a separate HMM for each type of target. Each HMM should model the entire document and consist of target and non-target states; it need not be fully connected. Slide by Okan Basegmez.

Or … a combined HMM. Example: research paper headers. Slide by Okan Basegmez.

HMM Example: "Nymble" [Bikel et al. 1998], [BBN "IdentiFinder"]. Task: named entity extraction, with states such as start-of-sentence, Person, Org, five other name classes, Other, and end-of-sentence. Train on ~500k words of newswire text. Results (Case, Language, F1): Mixed English 93%; Upper English 91%; Mixed Spanish 90%. Slide adapted from Cohen & McCallum.

Finite State Model vs. Path. The finite state model (states: start-of-sentence, Person, Org, five other name classes, Other, end-of-sentence) versus a path through it: a state sequence y1 y2 y3 y4 y5 y6 … emitting observations x1 x2 x3 x4 x5 x6 ….

Question #1 – Evaluation. GIVEN: a sequence of observations x1 x2 x3 x4 …… xN and a trained HMM θ (its start, transition, and emission probabilities). QUESTION: how likely is this sequence given our HMM, i.e. P(x | θ)? Why do we care? We need it during learning, to choose among competing models!

A parse of a sequence. Given a sequence x = x1 …… xN, a parse of x is a sequence of states y = y1, ……, yN (e.g. person, other, location at each position). Slide by Serafim Batzoglou.

Question #2 – Decoding. GIVEN: a sequence of observations x1 x2 x3 x4 …… xN and a trained HMM θ. QUESTION: how do we choose the corresponding parse (state sequence) y1 y2 y3 y4 …… yN which "best" explains x1 x2 x3 x4 …… xN? There are several reasonable optimality criteria: the single optimal sequence, average statistics for individual states, …

Question #3 – Learning. GIVEN: a sequence of observations x1 x2 x3 x4 …… xN. QUESTION: how do we learn the model parameters θ (start, transition, and emission probabilities) that maximize P(x | θ)?

Three Questions. Evaluation: forward algorithm. Decoding: Viterbi algorithm. Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization).

Naive Solution to #1: Evaluation. Given observations x = x1 … xN and HMM θ, what is p(x)? Enumerate every possible state sequence y = y1 … yN, compute the probability of x given that particular y (the product of the emission probabilities) and the probability of that particular y (the product of the transition probabilities), and sum over all possible state sequences. That is roughly 2N multiplications per sequence, and there are K^N state sequences: even for a small HMM with K = 10 states and N = 10 observations, that is 10 billion sequences!
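A brute-force sketch of this naive evaluation in Python, assuming the numpy-array parameter layout from the casino sketch above and observations given as integer indices (illustrative only; it enumerates all K^N state sequences):

```python
from itertools import product

def naive_likelihood(x, start_p, trans_p, emit_p):
    """Brute-force P(x): sum P(x, y) over every possible state sequence y.
    Exponential in the sequence length -- for illustration only."""
    K = trans_p.shape[0]
    total = 0.0
    for y in product(range(K), repeat=len(x)):       # K**N sequences
        p = start_p[y[0]] * emit_p[y[0], x[0]]
        for t in range(1, len(x)):
            p *= trans_p[y[t-1], y[t]] * emit_p[y[t], x[t]]
        total += p
    return total
```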

Many calculations are repeated. Use dynamic programming: cache and reuse the inner sums ("forward variables").

Solution to #1: Evaluation. Use dynamic programming. Define the forward variable αt(i): the probability that at time t the state is Si and the partial observation sequence x = x1 … xt has been emitted.

Base Case: Forward Variable. αt(i) is the probability that the state at time t has value Si and the partial observation sequence x = x1 … xt has been seen. Base case, t = 1: α1(i) = p(y1=Si) p(x1=o1 | y1=Si).

Inductive Case: Forward Variable. αt(i) is the probability that the state at time t has value Si and the partial observation sequence x = x1 … xt has been seen. It is computed from the previous column of the trellis: αt(i) = [Σj αt-1(j) p(yt=Si | yt-1=Sj)] p(xt=ot | yt=Si).

The Forward Algorithm t-1(1) t-1(2) t-1(3) t(3) t-1(K) yt-1 yt SK

The Forward Algorithm. Initialization: α1(i) = p(y1=Si) p(x1=o1 | y1=Si). Induction: αt(i) = [Σj αt-1(j) p(yt=Si | yt-1=Sj)] p(xt=ot | yt=Si). Termination: p(x) = Σi αN(i). Time: O(K²N); space: O(KN), where K = |S| is the number of states and N is the length of the sequence.
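A minimal sketch of the forward algorithm in Python, matching the initialization/induction/termination steps above and assuming the same parameter layout as the casino sketch (observations are integer indices into the emission table):

```python
import numpy as np

def forward(x, start_p, trans_p, emit_p):
    """alpha[t, i] = P(x_1..x_t, y_t = S_i).  O(K^2 N) time, O(K N) space."""
    N, K = len(x), trans_p.shape[0]
    alpha = np.zeros((N, K))
    alpha[0] = start_p * emit_p[:, x[0]]                 # initialization
    for t in range(1, N):                                # induction
        alpha[t] = (alpha[t-1] @ trans_p) * emit_p[:, x[t]]
    return alpha, alpha[-1].sum()                        # termination: P(x)
```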

The Backward Algorithm

The Backward Algorithm. Define βt(i) = p(xt+1 … xN | yt=Si). Initialization: βN(i) = 1. Induction: βt(i) = Σj p(yt+1=Sj | yt=Si) p(xt+1=ot+1 | yt+1=Sj) βt+1(j). Termination: p(x) = Σi p(y1=Si) p(x1=o1 | y1=Si) β1(i). Time: O(K²N); space: O(KN).
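A matching sketch of the backward algorithm under the same assumptions:

```python
import numpy as np

def backward(x, trans_p, emit_p):
    """beta[t, i] = P(x_{t+1}..x_N | y_t = S_i).  Same O(K^2 N) cost."""
    N, K = len(x), trans_p.shape[0]
    beta = np.ones((N, K))                               # initialization: beta_N = 1
    for t in range(N - 2, -1, -1):                       # induction, backwards in time
        beta[t] = trans_p @ (emit_p[:, x[t+1]] * beta[t+1])
    return beta
```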

Three Questions. Evaluation: forward algorithm (also the backward algorithm). Decoding: Viterbi algorithm. Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization).

#2 – Decoding Problem. Given x = x1 … xN and HMM θ, what is the "best" parse y1 … yN? There are several possible meanings. 1. States which are individually most likely: the most likely state y*t at time t is y*t = argmaxi P(yt=Si | x, θ).

#2 – Decoding Problem. Given x = x1 … xN and HMM θ, what is the "best" parse y1 … yN? Several possible meanings of 'solution': (1) states which are individually most likely; (2) the single best state sequence. We want the sequence y1 … yN such that P(x, y) is maximized: y* = argmaxy P(x, y). Again, we can use dynamic programming!

δt(i). This is like αt(i), the probability that the state at time t has value Si and the partial observation sequence x = x1 … xt has been seen. Define δt(i) = the probability of the most likely state sequence ending with state Si, given observations x1, …, xt: maxy1,…,yt-1 P(y1, …, yt-1, yt=Si | o1, …, ot, θ).

δt(i) = probability of the most likely state sequence ending with state Si, given observations x1, …, xt: maxy1,…,yt-1 P(y1, …, yt-1, yt=Si | o1, …, ot, θ). Base case, t = 1: δ1(i) = P(y1=Si) P(x1=o1 | y1=Si).

Inductive Step. Take the max over the previous state: δt(i) = maxj δt-1(j) P(yt=Si | yt-1=Sj) P(xt=ot | yt=Si).

The Viterbi Algorithm. Define δt(i) as above. Initialization: δ1(i) = P(y1=Si) P(x1=o1 | y1=Si). Induction: δt(i) = maxj δt-1(j) P(yt=Si | yt-1=Sj) P(xt=ot | yt=Si), keeping a backpointer to the maximizing j. Termination: the probability of the best sequence is maxi δN(i). Then backtrack to get the state sequence y*.

The Viterbi Algorithm (trellis view). At column xj, δj(i) = maxk δj-1(k) * Ptrans * Pobs. Remember: δt(i) = probability of the most likely state sequence ending with yt = state Si. Slides from Serafim Batzoglou.

Terminating Viterbi. (Trellis figure: at the final column xT, choose the maximum over the δT(i).)

Terminating Viterbi. How did we compute δ*? δ* = maxi δT-1(i) * Ptrans * Pobs. Now backchain to find the final sequence. Time: O(K²T); space: O(KT), linear in the length of the sequence.
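A minimal Viterbi sketch under the same assumptions as the forward sketch, including the backpointer table and the final backchaining step:

```python
import numpy as np

def viterbi(x, start_p, trans_p, emit_p):
    """Most likely state sequence y* = argmax_y P(x, y).  O(K^2 N) time."""
    N, K = len(x), trans_p.shape[0]
    delta = np.zeros((N, K))            # best score ending in each state
    ptr = np.zeros((N, K), dtype=int)   # backpointers to the best previous state
    delta[0] = start_p * emit_p[:, x[0]]
    for t in range(1, N):
        scores = delta[t-1][:, None] * trans_p           # K x K: prev state -> cur state
        ptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit_p[:, x[t]]
    # Backchain from the best final state.
    y = [int(delta[-1].argmax())]
    for t in range(N - 1, 0, -1):
        y.append(int(ptr[t, y[-1]]))
    return list(reversed(y))
```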

The Viterbi Algorithm. (Figure: a decoding run on the "Pedro Domingos" example sentence.)

Three Questions. Evaluation: forward algorithm (could also go in the other direction, i.e. backward). Decoding: Viterbi algorithm. Learning: Baum-Welch algorithm (aka "forward-backward"), a kind of EM (expectation maximization).

Solution to #3 – Learning, if we have labeled training data! Input: the states & edges (person name, location name, background), but no probabilities, plus many labeled sentences (e.g. "Yesterday Pedro Domingos spoke this example sentence."). Output: initial state & transition probabilities p(y1), p(yt|yt-1), and emission probabilities p(xt | yt).

Supervised Learning. Input: the states & edges (person name, location name, background) and four labeled sentences: "Yesterday Pedro Domingos spoke this example sentence." "Daniel Weld gave his talk in Mueller 153." "Sieg 128 is a nasty lecture hall, don't you think?" "The next distinguished lecture is by Oren Etzioni on Thursday." Output: initial state probabilities p(y1): P(y1=name) = 1/4, P(y1=location) = 1/4, P(y1=background) = 2/4.

Supervised Learning. Input: the same states & edges and the four labeled sentences above. Output: state transition probabilities p(yt|yt-1): P(yt=name | yt-1=name) = 3/6, P(yt=name | yt-1=background) = 2/22, etc.
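A sketch of this counting-based supervised estimation, assuming a hypothetical input format of labeled (word, state) sequences standing in for the hand-labeled sentences on the slides:

```python
from collections import Counter, defaultdict

def train_supervised(labeled_sentences):
    """Estimate HMM parameters by counting over labeled sequences.
    `labeled_sentences` is a list of [(word, state), ...] lists."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sent in labeled_sentences:
        start[sent[0][1]] += 1                      # state of the first word
        for word, state in sent:
            emit[state][word] += 1                  # emission counts
        for (_, prev), (_, cur) in zip(sent, sent[1:]):
            trans[prev][cur] += 1                   # transition counts
    def normalize(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    return (normalize(start),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})
```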

Supervised Learning. Input: the states & edges plus many labeled sentences. Output: initial state & transition probabilities p(y1), p(yt|yt-1), and emission probabilities p(xt | yt), estimated the same way by counting.

Solution to #3 – Learning. Given x1 … xN, how do we learn θ (start, transition, and emission probabilities) to maximize P(x)? Unfortunately, there is no known way to analytically find a global maximum θ* such that θ* = argmax P(x | θ). But it is possible to find a local maximum: given an initial model θ, we can always find a model θ' such that P(x | θ') ≥ P(x | θ).

Chicken & Egg Problem. If we knew the actual sequence of states, it would be easy to learn the transition and emission probabilities; but we can't observe the states, so we don't. If we knew the transition & emission probabilities, it would be easy to estimate the sequence of states (Viterbi); but we don't know them! Slide by Daniel S. Weld.

Simplest Version: a mixture of two distributions. We know the form of each distribution and its variance; we just need the mean of each distribution. (Figure: data points along the axis from .01 to .09.) Slide by Daniel S. Weld.

Input Looks Like: (figure of unlabeled data points along the same axis). Slide by Daniel S. Weld.

We Want to Predict: which distribution generated each point? (figure). Slide by Daniel S. Weld.

Chicken & Egg. Note that coloring the instances would be easy if we knew the Gaussians… (figure). Slide by Daniel S. Weld.

Chicken & Egg. And finding the Gaussians would be easy if we knew the coloring. (figure). Slide by Daniel S. Weld.

Expectation Maximization (EM). Pretend we do know the parameters: initialize randomly, e.g. set μ1 = ?, μ2 = ?. (figure). Slide by Daniel S. Weld.

Expectation Maximization (EM). Pretend we do know the parameters (initialize randomly). [E step] Compute the probability of each instance having each possible value of the hidden variable. Slide by Daniel S. Weld.

Expectation Maximization (EM). Pretend we do know the parameters (initialize randomly). [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld.
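A minimal sketch of this E-step/M-step loop for the two-Gaussian case, assuming a known shared standard deviation `sigma` and equal mixing weights (both simplifications of the general mixture setting, and not code from the original deck):

```python
import numpy as np

def em_two_gaussians(x, sigma, n_iter=50, seed=0):
    """EM for a mixture of two Gaussians with known, shared variance:
    only the two means are re-estimated; mixing weights are assumed equal."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False)        # random initialization
    for _ in range(n_iter):
        # E step: responsibility of component 0 for each point.
        # (The Gaussian normalizing constant cancels in the ratio.)
        p0 = np.exp(-(x - mu[0]) ** 2 / (2 * sigma ** 2))
        p1 = np.exp(-(x - mu[1]) ** 2 / (2 * sigma ** 2))
        r = p0 / (p0 + p1)
        # M step: each point counts fractionally toward both means.
        mu = np.array([np.sum(r * x) / np.sum(r),
                       np.sum((1 - r) * x) / np.sum(1 - r)])
    return mu
```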

ML Mean of a Single Gaussian: μML = argminμ Σi (xi – μ)². Slide by Daniel S. Weld.

Expectation Maximization (EM), iterated. [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Repeat the E and M steps until the estimates stop changing much. Slide by Daniel S. Weld.

EM for HMMs. [E step] Compute the probability of each instance having each possible value of the hidden variable: compute the forward and backward probabilities for the given model parameters and our observations. [M step] Treating each instance as fractionally having every value, compute the new parameter values: re-estimate the model parameters by simple counting.

Summary – Learning. Use hill-climbing, known as the Baum-Welch algorithm (also the "forward-backward algorithm"). Idea: start from an initial parameter instantiation, then loop: compute the forward and backward probabilities for the given model parameters and our observations, and re-estimate the parameters, until the estimates don't change much.
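A sketch of one Baum-Welch (forward-backward) update for a single observation sequence, reusing the forward() and backward() sketches above; the re-estimation formulas below follow the standard expected-count derivation, which the slides summarize only as "re-estimate by counting":

```python
import numpy as np

def baum_welch_step(x, start_p, trans_p, emit_p):
    """One EM (Baum-Welch) update.  E step: forward-backward posteriors;
    M step: re-estimate parameters from expected counts."""
    alpha, px = forward(x, start_p, trans_p, emit_p)
    beta = backward(x, trans_p, emit_p)
    gamma = alpha * beta / px                        # gamma[t, i] = P(y_t=i | x)
    # Expected transition counts: xi[t, i, j] = P(y_t=i, y_{t+1}=j | x).
    xi = (alpha[:-1, :, None] * trans_p[None, :, :]
          * (emit_p[:, x[1:]].T * beta[1:])[:, None, :]) / px
    new_start = gamma[0]
    new_trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.zeros_like(emit_p)
    for t, o in enumerate(x):                        # expected emission counts
        new_emit[:, o] += gamma[t]
    new_emit /= gamma.sum(axis=0)[:, None]
    return new_start, new_trans, new_emit
```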

The Problem with HMMs. We want more than an atomic view of words; we want many arbitrary, overlapping features of words: the identity of the word, ends in "-ski", is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in a hyperlink anchor, the last person name was female, the next two words are "and Associates", … (e.g. xt is "Wisniewski": part of a noun phrase, ends in "-ski"). Slide by Cohen & McCallum.

Problems with the Joint Model. These arbitrary features are not independent: multiple levels of granularity (chars, words, phrases), multiple dependent modalities (words, formatting, layout), past & future. Two choices: (1) Model the dependencies: each state would have its own Bayes net, but we are already starved for training data! (2) Ignore the dependencies: this causes "over-counting" of evidence (a la naive Bayes), a big problem when combining evidence, as in Viterbi! (Similar issues arise in bio-sequence modeling.) Slide by Cohen & McCallum.

Discriminative vs. Generative Models. So far, all of our models have been generative. Generative models model P(y, x); discriminative models model P(y | x). P(y|x) does not include a model of P(x), so it does not need to model the dependencies between features!

Discriminative models are often better. Eventually, what we care about is p(y|x)! A Bayes net describes a family of joint distributions over (y, x) whose conditionals take a certain form, but there are many other joint models whose conditionals also have that form. We want to make independence assumptions among y, but not among x.

Conditional Sequence Models. We prefer a model that is trained to maximize a conditional probability rather than a joint probability, P(y|x) instead of P(y,x): it can examine features but is not responsible for generating them, it doesn't have to explicitly model their dependencies, and it doesn't "waste modeling effort" trying to generate what we are given at test time anyway. Slide by Cohen & McCallum.

Finite State Models: the generative/conditional family. Generative directed models: Naïve Bayes (single class), HMMs (sequence), general directed models (general graphs). Conditional counterparts: logistic regression, linear-chain CRFs, general CRFs.

Linear-Chain Conditional Random Fields: from HMMs to CRFs. The HMM joint distribution p(y, x) = ∏t p(yt | yt-1) p(xt | yt) can also be written as p(y, x) = (1/Z) ∏t exp( Σi,j θij 1{yt=i, yt-1=j} + Σi,o μio 1{yt=i, xt=o} ), by setting θij = log p(yt=i | yt-1=j), μio = log p(xt=o | yt=i), and Z = 1. If we let the new parameters vary freely, we need the normalization constant Z.

Linear-Chain Conditional Random Fields. Introduce feature functions: one per transition, fij(yt, yt-1, xt) = 1{yt=i, yt-1=j}, and one per state-observation pair, fio(yt, yt-1, xt) = 1{yt=i, xt=o}. Then the conditional distribution is p(y | x) = (1/Z(x)) ∏t exp( Σk λk fk(yt, yt-1, xt) ). This is a linear-chain CRF, but it includes only the current word's identity as a feature.

Linear-Chain Conditional Random Fields. The conditional p(y|x) that follows from the joint p(y,x) of an HMM is a linear-chain CRF with certain feature functions!

Linear-Chain Conditional Random Fields. Definition: a linear-chain CRF is a distribution of the form p(y | x) = (1/Z(x)) ∏t exp( Σk λk fk(yt, yt-1, xt) ), where Z(x) is a normalization function, the λk are parameters, and the fk are feature functions.
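A small sketch of a linear-chain CRF's unnormalized score under this definition; `feature_fns`, `lambdas`, and the two feature-factory helpers are hypothetical names introduced here, and the normalization Z(x) (a sum over all label sequences) is omitted:

```python
import numpy as np

def crf_score(y, x, lambdas, feature_fns):
    """Unnormalized score of label sequence y for observations x:
    exp( sum_t sum_k lambda_k * f_k(y_t, y_{t-1}, x, t) )."""
    score = 0.0
    for t in range(len(x)):
        prev = y[t-1] if t > 0 else None
        score += sum(lam * f(y[t], prev, x, t)
                     for lam, f in zip(lambdas, feature_fns))
    return np.exp(score)

# Example HMM-like features: one per (state, word) pair and one per transition.
def make_emission_feature(state, word):
    return lambda y_t, y_prev, x, t: float(y_t == state and x[t] == word)

def make_transition_feature(s_from, s_to):
    return lambda y_t, y_prev, x, t: float(y_prev == s_from and y_t == s_to)
```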

Linear-Chain Conditional Random Fields (graphical structures): an HMM-like linear-chain CRF, and a linear-chain CRF in which the transition score depends on the current observation.