Hidden Markov Models (HMMs) for Information Extraction

Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454

Administrivia: Group meetings next week. Feel free to revise proposals through the weekend.

What is “Information Extraction”? As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Slides from Cohen & McCallum

Landscape of IE Techniques: Models
- Lexicons (Alabama, Alaska, … Wisconsin, Wyoming): "Abraham Lincoln was born in Kentucky." Is a token a member of the list?
- Classify Pre-segmented Candidates: a classifier asks "which class?"
- Sliding Window: try alternate window sizes.
- Boundary Models: classify BEGIN and END positions.
- Finite State Machines: "Abraham Lincoln was born in Kentucky." Most likely state sequence?
- Context Free Grammars (NNP V P NP PP VP S): most likely parse?
Each model can capture words, formatting, or both. Slides from Cohen & McCallum

Finite State Models. Generative directed models: Naïve Bayes (single class) → HMMs (sequence) → general graphs. Conditional counterparts: Logistic Regression (single class) → Linear-chain CRFs (sequence) → General CRFs (general graphs).

Warning: Graphical models add another set of arcs between nodes, where the arcs mean something completely different, which is confusing. Skip for 454; the new slides are too abstract.

Recap: Naïve Bayes Classifier. A hidden state y (“Spam?”) probabilistically generates, via causal dependencies P(xi | y=spam) and P(xi | y≠spam), the observable Boolean random variables x1, x2, x3 (“Nigeria?”, “Widow?”, “CSE 454?”).

Recap: Naïve Bayes. Assumption: features are independent given the label. Generative classifier: model the joint distribution p(x,y). Inference and learning: counting. Can we use it for IE directly? Labels of neighboring words are dependent! Example: "The article appeared in the Seattle Times." Is "Seattle" a city? Features: capitalization, length, suffix. Need to consider the sequence!
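For reference (the slide's formula was rendered as an image), the Naïve Bayes joint factorization being described is:

    p(y, x_1, \dots, x_n) = p(y) \prod_{i=1}^{n} p(x_i \mid y)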

Hidden Markov Models. Finite state model: a state sequence y1 y2 … y8 (e.g. person, other, location, …) emits an observation sequence x1 x2 … x8 (e.g. "Yesterday Pedro …"). Generative Sequence Model: two assumptions make the joint distribution tractable. 1. Each state depends only on its immediate predecessor. 2. Each observation depends only on the current state. Graphical model: transitions between states, observations emitted from states.

Hidden Markov Models. Generative Sequence Model, Model Parameters: start state probabilities, transition probabilities, observation probabilities. (Same finite state model and graphical model as the previous slide: a state sequence y1 … y8 emitting an observation sequence x1 … x8.)
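The joint distribution these parameters define (shown as an image on the slide) is the standard HMM factorization:

    p(\mathbf{x}, \mathbf{y}) = p(y_1)\, p(x_1 \mid y_1) \prod_{t=2}^{N} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)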

HMM Formally Set of states: {yi} Set of possible observations {xi} Probability of initial state Transition Probabilities Emission Probabilities
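In the notation used in later slides (state values Si, state variables yt), these parameters are conventionally written as:

    \pi_i = P(y_1 = S_i), \qquad a_{ij} = P(y_{t+1} = S_j \mid y_t = S_i), \qquad b_i(o) = P(x_t = o \mid y_t = S_i)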

Example: Dishonest Casino. The casino has two dice. Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6. Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2. The casino player switches dice approximately once every 20 turns. Game: You bet $1. You roll (always with a fair die). The casino player rolls (maybe with the fair die, maybe with the loaded die). Highest number wins $2. Slides from Serafim Batzoglou

The dishonest casino model. Two states, FAIR and LOADED. Transitions: P(FAIR→LOADED) = 0.05, P(LOADED→FAIR) = 0.05, self-transitions 0.95 each. Emissions: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2. Slides from Serafim Batzoglou
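For concreteness, a minimal Python sketch encoding these parameters (the variable names are illustrative, and the uniform start probabilities are an assumption; the slide does not specify them):

    # Dishonest-casino HMM parameters from the slide above.
    # Start probabilities are assumed uniform (not specified on the slide).
    states = ["FAIR", "LOADED"]
    start_p = {"FAIR": 0.5, "LOADED": 0.5}
    trans_p = {
        "FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
        "LOADED": {"FAIR": 0.05, "LOADED": 0.95},
    }
    emit_p = {
        "FAIR":   {face: 1 / 6 for face in range(1, 7)},
        "LOADED": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
    }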

IE with Hidden Markov Models. Given a sequence of observations: "Yesterday Pedro Domingos spoke this example sentence." and a trained HMM (states: person name, location name, background), find the most likely state sequence (Viterbi). Any words determined to have been generated by the designated “person name” state are extracted as a person name: Person name: Pedro Domingos. Slide by Cohen & McCallum

IE with Hidden Markov Models. For sparse extraction tasks: build a separate HMM for each type of target. Each HMM should model the entire document and consist of target and non-target states; it is not necessarily fully connected. Slide by Okan Basegmez

Or … a Combined HMM. Example: Research Paper Headers. Slide by Okan Basegmez

HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”]. Task: Named Entity Extraction. States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence. Train on ~500k words of news wire text.
Results:
Case   Language  F1
Mixed  English   93%
Upper  English   91%
Mixed  Spanish   90%
Slide adapted from Cohen & McCallum

Finite State Model vs. Path. The finite state model (states Person, Org, five other name classes, Other, with start-of-sentence and end-of-sentence) versus an unrolled path: a state sequence y1 y2 y3 y4 y5 y6 … emitting observations x1 x2 x3 x4 x5 x6 …

Question #1 – Evaluation. GIVEN: a sequence of observations x1 x2 x3 x4 ……xN and a trained HMM θ = (π, a, b). QUESTION: how likely is this sequence given our HMM, i.e. P(x | θ)? Why do we care? We need it for learning, to choose among competing models!

Question #2 - Decoding. GIVEN: a sequence of observations x1 x2 x3 x4 ……xN and a trained HMM θ = (π, a, b). QUESTION: how do we choose the corresponding parse (state sequence) y1 y2 y3 y4 ……yN which “best” explains x1 x2 x3 x4 ……xN? There are several reasonable optimality criteria: single optimal sequence, average statistics for individual states, …

A parse of a sequence. Given a sequence x = x1……xN, a parse of x is a sequence of states y = y1, ……, yN. (Trellis figure: K states per position, e.g. person, other, location, over observations x1 x2 x3 … xN.) Slide by Serafim Batzoglou

Question #3 - Learning. GIVEN: a sequence of observations x1 x2 x3 x4 ……xN. QUESTION: how do we learn the model parameters θ = (π, a, b) which maximize P(x | θ)?

Three Questions. 1. Evaluation: the Forward algorithm (could also go the other direction). 2. Decoding: the Viterbi algorithm. 3. Learning: the Baum-Welch algorithm (aka “forward-backward”), a kind of EM (expectation maximization).

A Solution to #1: Evaluation. Given observations x = x1 …xN and HMM θ, what is p(x)? Enumerate every possible state sequence y = y1 …yN; for each, multiply the probability of x given that particular y by the probability of that particular y, and sum over all possible state sequences. That is about 2N multiplications per sequence, and K^N state sequences: for a small HMM with K = 10 states and N = 10 observations there are 10 billion sequences!
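Written out in the π/a/b notation introduced above (the slide's formulas were images), the brute-force evaluation is:

    P(x \mid \theta) = \sum_{\mathbf{y}} P(x \mid \mathbf{y}, \theta)\, P(\mathbf{y} \mid \theta)
                     = \sum_{y_1, \dots, y_N} \pi_{y_1}\, b_{y_1}(x_1) \prod_{t=2}^{N} a_{y_{t-1} y_t}\, b_{y_t}(x_t)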

Solution to #1: Evaluation. Use Dynamic Programming: define the forward variable, the probability that at time t the state is Si and the partial observation sequence x = x1 …xt has been emitted.

Forward Variable t(i) Prob - that the state at time t vas value Si and - the partial obs sequence x=x1 …xt has been seen 1 1 1 1 person … … 2 2 2 2 other … … … … … … K K K K location … … x1 x2 x3 xt

Forward Variable t(i) prob - that the state at t vas value Si and - the partial obs sequence x=x1 …xt has been seen 1 1 1 1 person … … 2 2 2 2 other … … Si … … … … K K K K location … … x1 x2 x3 xt

Solution to #1: Evaluation. Use Dynamic Programming: cache and reuse inner sums. Define forward variables: the probability that at time t the state is yt = Si and the partial observation sequence x = x1 …xt has been emitted.

The Forward Algorithm

The Forward Algorithm: initialization, induction, termination (see below). Time: O(K²N); Space: O(KN), where K = |S| is the number of states and N is the length of the sequence.
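The equations on this slide were images; the standard forward recursion they describe is:

    Initialization:  \alpha_1(i) = \pi_i\, b_i(x_1)
    Induction:       \alpha_{t+1}(j) = \left[ \sum_{i=1}^{K} \alpha_t(i)\, a_{ij} \right] b_j(x_{t+1})
    Termination:     P(x \mid \theta) = \sum_{i=1}^{K} \alpha_N(i)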

The Backward Algorithm
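Only the title of this slide survived extraction; the standard backward recursion, which the learning slides below rely on, is:

    Initialization:  \beta_N(i) = 1
    Induction:       \beta_t(i) = \sum_{j=1}^{K} a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)
    Termination:     P(x \mid \theta) = \sum_{i=1}^{K} \pi_i\, b_i(x_1)\, \beta_1(i)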

Three Questions. 1. Evaluation: the Forward algorithm (could also go the other direction). 2. Decoding: the Viterbi algorithm. 3. Learning: the Baum-Welch algorithm (aka “forward-backward”), a kind of EM (expectation maximization).

Looks like the forward algorithm, but with deltas (δ: max in place of sum). Need a new slide!

Solution to #2 - Decoding. Given x = x1 …xN and HMM θ, what is the “best” parse y1 …yN? Several optimality criteria: 1. States which are individually most likely. 2. The single best state sequence: we want to find the sequence y1 …yN such that P(x,y) is maximized, y* = argmax_y P(x, y). Again, we can use dynamic programming!

The Viterbi Algorithm: definition, initialization, induction, termination (see below), then backtracking to get the state sequence y*.
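The formulas on this slide were images; the standard Viterbi recursion is:

    Define:          \delta_t(i) = \max_{y_1, \dots, y_{t-1}} P(y_1, \dots, y_{t-1}, y_t = S_i, x_1, \dots, x_t \mid \theta)
    Initialization:  \delta_1(i) = \pi_i\, b_i(x_1)
    Induction:       \delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] b_j(x_{t+1}), \quad \psi_{t+1}(j) = \arg\max_i \delta_t(i)\, a_{ij}
    Termination:     P^* = \max_i \delta_N(i), \quad y_N^* = \arg\max_i \delta_N(i)
    Backtracking:    y_t^* = \psi_{t+1}(y_{t+1}^*)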

The Viterbi Algorithm. Time: O(K²T); Space: O(KT): linear in the length of the sequence. (Trellis figure: states 1 … K against positions x1 x2 ……xT; each cell δj(i) is computed by a max over its predecessors.) Remember: δt(i) = probability of the most likely state sequence ending in state Si at time t. Slides from Serafim Batzoglou
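A minimal Python sketch of this recursion (an illustration, not code from the lecture); the function and variable names are my own, and it reuses the states, start_p, trans_p, emit_p variables from the casino sketch above:

    def viterbi(obs, states, start_p, trans_p, emit_p):
        # delta[t][s]: probability of the most likely state sequence ending in state s at time t
        # back[t][s]:  the predecessor state achieving that maximum
        delta = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
        back = [{}]
        for t in range(1, len(obs)):
            delta.append({})
            back.append({})
            for s in states:
                prob, prev = max(
                    (delta[t - 1][r] * trans_p[r][s] * emit_p[s][obs[t]], r) for r in states
                )
                delta[t][s] = prob
                back[t][s] = prev
        # Termination and backtracking
        last = max(states, key=lambda s: delta[-1][s])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(back[t][path[-1]])
        return list(reversed(path))

    # Example: a run of sixes makes the LOADED die the most likely explanation.
    print(viterbi([6, 6, 6, 2, 6, 6], states, start_p, trans_p, emit_p))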

The Viterbi Algorithm. Pedro Domingos

Three Questions. 1. Evaluation: the Forward algorithm (could also go the other direction). 2. Decoding: the Viterbi algorithm. 3. Learning: the Baum-Welch algorithm (aka “forward-backward”), a kind of EM (expectation maximization).

Solution to #3 - Learning. Given x1 …xN, how do we learn θ = (π, a, b) to maximize P(x | θ)? Unfortunately, there is no known way to analytically find a global maximum θ*, i.e. θ* = argmax_θ P(x | θ). But it is possible to find a local maximum: given an initial model θ, we can always find a model θ’ such that P(x | θ’) ≥ P(x | θ).

Chicken & Egg Problem. If we knew the actual sequence of states, it would be easy to learn the transition and emission probabilities. But we can’t observe the states, so we don’t! If we knew the transition & emission probabilities, then it’d be easy to estimate the sequence of states (Viterbi). But we don’t know them! Slide by Daniel S. Weld

Simplest Version: a mixture of two distributions. Know: the form of the distributions & the variance (σ = 5); we just need the mean of each distribution. Slide by Daniel S. Weld

Input Looks Like (figure: unlabeled data points along the .01 to .09 axis). Slide by Daniel S. Weld

We Want to Predict: ? (figure: the same data, with one point marked “?”). Slide by Daniel S. Weld

Chicken & Egg. Note that coloring the instances would be easy if we knew the Gaussians…. Slide by Daniel S. Weld

Chicken & Egg. And finding the Gaussians would be easy if we knew the coloring. Slide by Daniel S. Weld

Expectation Maximization (EM). Pretend we do know the parameters. Initialize randomly: set μ1 = ?, μ2 = ?. Slide by Daniel S. Weld

Expectation Maximization (EM). Pretend we do know the parameters; initialize randomly. [E step] Compute the probability of each instance having each possible value of the hidden variable. Slide by Daniel S. Weld

Expectation Maximization (EM). Pretend we do know the parameters; initialize randomly. [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld

ML Mean of a Single Gaussian: μ_ML = argmin_μ Σ_i (x_i − μ)². Slide by Daniel S. Weld
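Setting the derivative to zero gives the familiar closed form (not on the slide, but it is exactly what the M step computes for a single Gaussian):

    \mu_{ML} = \arg\min_{\mu} \sum_i (x_i - \mu)^2 = \frac{1}{n} \sum_{i=1}^{n} x_i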

Expectation Maximization (EM), continued: iterate. [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Repeat until the estimates stop changing. Slide by Daniel S. Weld

EM for HMMs. [E step] Compute the probability of each instance having each possible value of the hidden variable: compute the forward and backward probabilities for the given model parameters and our observations. [M step] Treating each instance as fractionally having both values, compute the new parameter values: re-estimate the model parameters by simple counting.
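Concretely, in the standard Baum-Welch formulation (the slide's formulas are not reproduced in this transcript), the E step computes from the forward and backward variables

    \gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(x \mid \theta)}, \qquad
    \xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)}{P(x \mid \theta)}

and the M step re-estimates the parameters by expected counting:

    \hat{\pi}_i = \gamma_1(i), \qquad
    \hat{a}_{ij} = \frac{\sum_{t=1}^{N-1} \xi_t(i,j)}{\sum_{t=1}^{N-1} \gamma_t(i)}, \qquad
    \hat{b}_i(o) = \frac{\sum_{t:\, x_t = o} \gamma_t(i)}{\sum_{t=1}^{N} \gamma_t(i)}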

Summary - Learning. Use hill-climbing, called the forward-backward (or Baum-Welch) algorithm. Idea: use an initial parameter instantiation, then loop: compute the forward and backward probabilities for the given model parameters and our observations; re-estimate the parameters; until the estimates don’t change much.

Following slides are unclear

The Problem with HMMs. We want more than an atomic view of words; we want many arbitrary, overlapping features of words: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, last person name was female, next two words are “and Associates”. (Figure: states y at t-1, t, t+1 over observations x at t-1, t, t+1, where e.g. x_t is “Wisniewski”: part of a noun phrase, ends in “-ski”.) Slide by Cohen & McCallum

Finite State Models (diagram revisited, with a question mark on the conditional side). Generative directed models: Naïve Bayes → HMMs (sequence) → general graphs. Conditional counterparts: Logistic Regression → Linear-chain CRFs (sequence) → General CRFs (general graphs).

Problems with Richer Representation and a Joint Model. These arbitrary features are not independent: multiple levels of granularity (chars, words, phrases), multiple dependent modalities (words, formatting, layout), past & future. Two choices: (1) Model the dependencies: each state would have its own Bayes Net, but we are already starved for training data! (2) Ignore the dependencies: this causes “over-counting” of evidence (a la naïve Bayes), a big problem when combining evidence, as in Viterbi! (Similar issues in bio-sequence modeling, ...) Slide by Cohen & McCallum

Discriminative and Generative Models. So far: all models generative. Generative models … model P(x,y). Discriminative models … model P(y|x). P(y|x) does not include a model of P(x), so it does not need to model the dependencies between features!

Discriminative Models are often better. Eventually, what we care about is p(y|x)! A Bayes Net describes a family of joint distributions of (x, y) whose conditionals take a certain form. But there are many other joint models whose conditionals also have that form. We want to make independence assumptions among y, but not among x.

Conditional Sequence Models We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(y|x) instead of P(y,x): Can examine features, but not responsible for generating them. Don’t have to explicitly model their dependencies. Don’t “waste modeling effort” trying to generate what we are given at test time anyway. Slide by Cohen & McCallum

Finite State Models Generative directed models HMMs Naïve Bayes Sequence General Graphs Conditional Conditional Conditional Logistic Regression General CRFs Linear-chain CRFs General Graphs Sequence

Linear-Chain Conditional Random Fields. The conditional p(y|x) that follows from the joint p(y,x) of an HMM is a linear-chain CRF with certain feature functions!

Linear-Chain Conditional Random Fields. Definition: a linear-chain CRF is a distribution that takes the form below, where Z(x) is a normalization function and the exponent is a weighted sum of feature functions (the weights are the parameters).
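The defining formula on this slide was an image; the standard definition is:

    p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left\{ \sum_{t=1}^{N} \sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, x_t) \right\}, \qquad
    Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\!\left\{ \sum_{t=1}^{N} \sum_{k} \lambda_k\, f_k(y_t, y_{t-1}, x_t) \right\}

Here the λ_k are the parameters and the f_k are the feature functions referred to above.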

Linear-Chain Conditional Random Fields (figures): an HMM-like linear-chain CRF, and a linear-chain CRF in which the transition score depends on the current observation.