Hidden Markov Models for Information Extraction CSE 454.

© Daniel S. Weld 2 Course Overview Systems Foundation: Networking & Clusters Datamining Synchronization & Monitors Crawler Architecture Case Studies: Nutch, Google, AltaVista Information Retrieval Precision vs Recall Inverted Indices P2P Security Web Services Semantic Web Info Extraction Ecommerce

© Daniel S. Weld 3 What is Information Extraction (IE)? The task of populating database slots with corresponding phrases from text. Slide by Okan Basegmez

© Daniel S. Weld 4 What are HMMs? An HMM is a finite-state automaton with stochastic transitions and symbol emissions. (Rabiner 1989)

© Daniel S. Weld 5 Why use HMMs for IE? Advantages: strong statistical foundations; well suited to natural language domains; handle new data robustly; computationally efficient to develop. Disadvantages: require an a priori notion of model topology; need large amounts of training data. Slide by Okan Basegmez

© Daniel S. Weld 6 Defn: Markov Model Q: set of states π: init prob distribution A: transition probability distribution, where p_ij is the probability of transitioning from s_i to s_j (each row of A sums to 1). [State-graph figure: states s_0 … s_6, with transition p_12 from s_1 to s_2 highlighted.]
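
A minimal sketch of this definition in code (not part of the slides): the state names and probabilities below are made up for illustration, just to make Q, π, and A concrete.

```python
# Minimal Markov model: states Q, initial distribution pi, transition matrix A.
# The state names and probabilities below are invented for illustration.
import random

Q = ["s0", "s1", "s2"]
pi = [0.6, 0.3, 0.1]                      # P(first state)
A = [[0.1, 0.7, 0.2],                     # A[i][j] = P(next = Q[j] | current = Q[i])
     [0.4, 0.4, 0.2],
     [0.3, 0.3, 0.4]]                     # every row sums to 1

def sample_chain(length):
    """Sample a state sequence from the Markov model."""
    states = [random.choices(range(len(Q)), weights=pi)[0]]
    for _ in range(length - 1):
        states.append(random.choices(range(len(Q)), weights=A[states[-1]])[0])
    return [Q[i] for i in states]

print(sample_chain(10))
```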

© Daniel S. Weld 7 E.g. Predict Web Behavior Q: set of states (pages) π: init prob distribution (likelihood of site entry point) A: transition probability distribution (user navigation model) When will the visitor leave the site?

© Daniel S. Weld 8 Diversion: Relational Markov Models

© Daniel S. Weld 9 Probability Distribution, A Forward causality: the probability of s_t does not depend directly on the values of future states. The probability of the new state could depend on the whole history of states visited: Pr(s_t | s_{t-1}, s_{t-2}, …, s_0). Markovian assumption: Pr(s_t | s_{t-1}, s_{t-2}, …, s_0) = Pr(s_t | s_{t-1}). Stationary model assumption: Pr(s_t | s_{t-1}) = Pr(s_k | s_{k-1}) for all k.
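
Under these assumptions the probability of a whole state sequence factors into the initial probability times one time-independent transition probability per step; a toy sketch with hypothetical states and numbers:

```python
# Under the Markov and stationarity assumptions, Pr(s_0, ..., s_T) factors into
# pi[s_0] times one transition probability per step, using the same table at
# every step. States and numbers are invented; A is only a partial table.
pi = {"s0": 0.6, "s1": 0.3, "s2": 0.1}
A = {("s0", "s1"): 0.7, ("s1", "s1"): 0.4, ("s1", "s2"): 0.2}

def sequence_probability(states):
    """Pr(s_0, ..., s_T) = pi[s_0] * prod_t A[(s_{t-1}, s_t)]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[(prev, cur)]               # same table at every t (stationarity)
    return p

print(sequence_probability(["s0", "s1", "s1", "s2"]))   # 0.6 * 0.7 * 0.4 * 0.2
```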

© Daniel S. Weld 10 Defn: Hidden Markov Model Q: set of states (hidden!) π: init prob distribution A: transition probability distribution O: set of possible observations b_i(o_t): probability of s_i emitting o_t
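
A small container for these quantities, again with made-up numbers; each row of B is an emission distribution b_i(·).

```python
# A hidden Markov model adds an observation alphabet O and emission
# probabilities b_i(o) to the Markov model above. Values are illustrative only.
import numpy as np

class HMM:
    def __init__(self, pi, A, B):
        self.pi = np.asarray(pi)   # pi[i]   = P(q_1 = s_i)
        self.A = np.asarray(A)     # A[i, j] = P(q_{t+1} = s_j | q_t = s_i)
        self.B = np.asarray(B)     # B[i, k] = b_i(o_k) = P(emit symbol k in state i)

# Two hidden states, three observable symbols (made-up numbers).
model = HMM(pi=[0.5, 0.5],
            A=[[0.8, 0.2],
               [0.3, 0.7]],
            B=[[0.6, 0.3, 0.1],
               [0.1, 0.3, 0.6]])
```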

© Daniel S. Weld 11 HMMs and their Usage HMMs very common in Computational Linguistics: Speech recognition (observed: acoustic signal, hidden: words) Handwriting recognition (observed: image, hidden: words) Part-of-speech tagging (observed: words, hidden: part-of-speech tags) Machine translation (observed: foreign words, hidden: words in target language) Information Extraction (observed: words, hidden: extraction-target labels) Slide by Bonnie Dorr

© Daniel S. Weld 12 Information Extraction with HMMs Example - Research Paper Headers Slide by Okan Basegmez

© Daniel S. Weld 13 The Three Basic HMM Problems Problem 1 (Evaluation): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we compute the probability of O given the model, P(O | λ)? Slide by Bonnie Dorr

© Daniel S. Weld 14 The Three Basic HMM Problems Problem 2 (Decoding): Given the observation sequence O = o_1, …, o_T and an HMM λ, how do we find the state sequence that best explains the observations? Slide by Bonnie Dorr

© Daniel S. Weld 15 The Three Basic HMM Problems Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)? Slide by Bonnie Dorr

© Daniel S. Weld 16 Information Extraction with HMMs Given a model M and its parameters, Information Extraction is performed by determining the state sequence that was most likely to have generated the entire document. This sequence can be recovered by dynamic programming with the Viterbi algorithm. Slide by Okan Basegmez

© Daniel S. Weld 17 Information Extraction with HMMs Probability of a string x being emitted by an HMM M: P(x | M). V(x | M): the state sequence that has the highest probability of having produced the observation sequence. Slide by Okan Basegmez

© Daniel S. Weld 18 Simple Example Rain_{t-1} → Rain_t → Rain_{t+1}, each with an Umbrella observation. Sensor model P(U_t | R_t): R_t = t: 0.9, R_t = f: 0.2. Transition model P(R_t | R_{t-1}): R_{t-1} = t: 0.7, R_{t-1} = f: 0.3.

© Daniel S. Weld 19 Simple Example [Unrolled network: Rain_1 … Rain_4, each taking values true/false, with an Umbrella observation at each step.]
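
A generative sketch of this model: sample rain from the transition table and an umbrella observation from the sensor table at each step. The table values come from the slide; the simulation itself and the sequence length are just for illustration.

```python
# Simulate the rain/umbrella model: rain evolves by the transition table,
# the umbrella observation is drawn from the sensor table at each step.
import random

P_RAIN_GIVEN_PREV = {True: 0.7, False: 0.3}      # P(Rain_t = true | Rain_{t-1})
P_UMBRELLA_GIVEN_RAIN = {True: 0.9, False: 0.2}  # P(Umbrella_t = true | Rain_t)

def simulate(steps, rain_prior=0.5):
    """Return a list of (rain, umbrella) pairs for the given number of steps."""
    rain = random.random() < rain_prior
    history = []
    for _ in range(steps):
        rain = random.random() < P_RAIN_GIVEN_PREV[rain]
        umbrella = random.random() < P_UMBRELLA_GIVEN_RAIN[rain]
        history.append((rain, umbrella))
    return history

print(simulate(4))    # e.g. [(True, True), (True, True), (False, False), (False, True)]
```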

© Daniel S. Weld 20 Forward Probabilities What is the probability that, given an HMM, at time t the state is i and the partial observation o_1 … o_t has been generated? α_t(i) = P(o_1 … o_t, q_t = s_i | λ) Slide by Bonnie Dorr

© Daniel S. Weld 21 Problem 1: Probability of an Observation Sequence What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM. Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences. Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths. The solution to this and Problem 2 is to use dynamic programming. Slide by Bonnie Dorr

© Daniel S. Weld 22 Forward Probabilities [Trellis diagram: α_{t+1}(j) combines the α_t(i) of all predecessor states i, weighted by a_ij, then multiplied by b_j(o_{t+1}).] Slide by Bonnie Dorr

© Daniel S. Weld 23 Forward Algorithm Initialization: α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N. Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}). Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i). Slide by Bonnie Dorr
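
A sketch of the forward algorithm as stated above, using NumPy; the two-state, two-symbol model at the bottom is hypothetical and only there to exercise the function.

```python
# Forward algorithm for Problem 1: compute P(O | lambda) in O(N^2 T) time.
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1..o_{t+1}, q_{t+1} = s_i | lambda), 0-based t.
    Returns (alpha, P(O | lambda))."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction
    return alpha, alpha[-1].sum()                     # termination

# Hypothetical two-state, two-symbol model.
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.4], [0.2, 0.8]])
alpha, prob = forward(pi, A, B, obs=[0, 1, 0])
print(prob)
```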

© Daniel S. Weld 24 Forward Algorithm Complexity In the naïve approach to solving Problem 1, it takes on the order of 2T·N^T computations. The forward algorithm takes on the order of N^2·T computations. Slide by Bonnie Dorr

© Daniel S. Weld 25 Backward Probabilities Analogous to the forward probability, just in the other direction. What is the probability that, given an HMM and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated? Slide by Bonnie Dorr

© Daniel S. Weld 26 Backward Probabilities β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ) Slide by Bonnie Dorr

© Daniel S. Weld 27 Backward Algorithm Initialization: β_T(i) = 1, 1 ≤ i ≤ N. Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j). Termination: P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i). Slide by Bonnie Dorr
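
The mirror-image sketch for the backward pass, on the same hypothetical model shape as the forward example; the final line checks the termination formula against the forward result.

```python
# Backward algorithm: beta values, then P(O | lambda) via the termination formula.
import numpy as np

def backward(pi, A, B, obs):
    """beta[t, i] = P(o_{t+2}..o_T | q_{t+1} = s_i, lambda), 0-based t."""
    N, T = len(pi), len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                         # initialization
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])     # induction
    return beta

# Same hypothetical model as in the forward sketch.
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.6, 0.4], [0.2, 0.8]])
obs = [0, 1, 0]
beta = backward(pi, A, B, obs)
print((pi * B[:, obs[0]] * beta[0]).sum())   # termination: matches the forward P(O | lambda)
```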

© Daniel S. Weld 28 Problem 2: Decoding The solution to Problem 1 (Evaluation) efficiently gives us the sum over all paths through an HMM. For Problem 2, we want to find the path with the highest probability. We want to find the state sequence Q = q_1 … q_T such that Q = argmax_Q' P(Q' | O, λ). Slide by Bonnie Dorr

© Daniel S. Weld 29 Viterbi Algorithm Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum. Forward: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(o_{t+1}). Viterbi recursion: δ_{t+1}(j) = [max_{1≤i≤N} δ_t(i) a_ij] b_j(o_{t+1}). Slide by Bonnie Dorr

© Daniel S. Weld 30 Viterbi Algorithm Initialization: δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0. Induction: δ_t(j) = [max_{1≤i≤N} δ_{t-1}(i) a_ij] b_j(o_t), ψ_t(j) = argmax_{1≤i≤N} δ_{t-1}(i) a_ij. Termination: P* = max_{1≤i≤N} δ_T(i), q*_T = argmax_{1≤i≤N} δ_T(i). Read out path: q*_t = ψ_{t+1}(q*_{t+1}), t = T-1, …, 1. Slide by Bonnie Dorr
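
A Viterbi sketch following the steps above (initialization, induction with max instead of sum, termination, backtrace); it assumes the same pi/A/B layout as the forward example.

```python
# Viterbi decoding for Problem 2: the single most likely state sequence.
import numpy as np

def viterbi(pi, A, B, obs):
    """Return (best state path as a list of state indices, probability of that path)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]                        # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A              # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                  # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]    # induction (max, not sum)
    path = [int(delta[-1].argmax())]                    # termination
    for t in range(T - 1, 0, -1):                       # read out path via backpointers
        path.append(int(psi[t][path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```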

© Daniel S. Weld 31 Information Extraction We want specific info from text documents. For example, from colloquium announcements we want: Speaker name, Location, Start time.

© Daniel S. Weld 32 Simple HMM for Job Titles

© Daniel S. Weld 33 HMMs for Info Extraction For sparse extraction tasks: a separate HMM for each type of target. Each HMM should model the entire document, consist of target and non-target states, and need not be fully connected. Given an HMM, how do we extract info? Slide by Okan Basegmez
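
Once the Viterbi path over a document's tokens is in hand, extraction is just reading off the runs of tokens labeled with target states. A sketch; the token and state names below are invented for illustration.

```python
# Extract fields from a document by reading off runs of target-state tokens
# in the decoded (Viterbi) state sequence.
def extract_spans(tokens, state_path, target_states):
    """Return the contiguous token runs whose decoded state is a target state."""
    spans, current = [], []
    for token, state in zip(tokens, state_path):
        if state in target_states:
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["Talk", "at", "3:30", "pm", "in", "EE1", "045"]
path   = ["bg", "bg", "time", "time", "bg", "loc", "loc"]   # hypothetical decoding
print(extract_spans(tokens, path, target_states={"time"}))  # ['3:30 pm']
print(extract_spans(tokens, path, target_states={"loc"}))   # ['EE1 045']
```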

© Daniel S. Weld 34 How Do We Learn an HMM? Two questions: structure & parameters

© Daniel S. Weld 35 Simplest Case Fix the structure; learn transition & emission probabilities. Training data? Label each word as target or non-target. Challenges: sparse training data; unseen words have zero probability unless we smooth. Smoothing!
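
A sketch of the fixed-structure, supervised case for emission probabilities, with add-one (Laplace) smoothing so unseen words do not get zero probability; this is one simple smoothing choice for illustration, not necessarily the scheme used in the papers cited at the end.

```python
# Supervised estimation of emission probabilities from labeled tokens,
# with add-one smoothing over the vocabulary.
from collections import Counter

def estimate_emissions(labeled_tokens, vocab, k=1.0):
    """labeled_tokens: list of (word, state) pairs; returns P(word | state) per state."""
    counts, totals = {}, Counter()
    for word, state in labeled_tokens:
        counts.setdefault(state, Counter())[word] += 1
        totals[state] += 1
    V = len(vocab)
    return {state: {w: (counts[state][w] + k) / (totals[state] + k * V)
                    for w in vocab}
            for state in counts}

data = [("seminar", "bg"), ("at", "bg"), ("3:30", "time"), ("pm", "time")]
vocab = {"seminar", "at", "3:30", "pm", "noon"}   # "noon" never seen in state "time"
probs = estimate_emissions(data, vocab)
print(probs["time"]["noon"])    # nonzero thanks to smoothing
```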

© Daniel S. Weld 36 Problem 3: Learning So far we have assumed we know the underlying model λ = (A, B, π). Often these parameters are estimated on annotated training data, which has two drawbacks: annotation is difficult and/or expensive, and the training data is different from the current data. We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ' such that λ' = argmax_λ P(O | λ). Slide by Bonnie Dorr

© Daniel S. Weld 37 Problem 3: Learning Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that λ' = argmax_λ P(O | λ). But it is possible to find a local maximum! Given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ). Slide by Bonnie Dorr

© Daniel S. Weld 38 Parameter Re-estimation Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm. Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters. Slide by Bonnie Dorr

© Daniel S. Weld 39 Parameter Re-estimation Three sets of parameters need to be re-estimated: Initial state distribution: π_i Transition probabilities: a_ij Emission probabilities: b_i(o_t) Slide by Bonnie Dorr

© Daniel S. Weld 40 Re-estimating Transition Probabilities What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters? ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ) Slide by Bonnie Dorr

© Daniel S. Weld 41 Re-estimating Transition Probabilities ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ) Slide by Bonnie Dorr

© Daniel S. Weld 42 Re-estimating Transition Probabilities The intuition behind the re-estimation equation for transition probabilities is: (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i). Formally: â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} Σ_{j'=1}^{N} ξ_t(i, j'). Slide by Bonnie Dorr

© Daniel S. Weld 43 Re-estimating Transition Probabilities Defining γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j) as the probability of being in state s_i, given the complete observation O, we can say: â_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i). Slide by Bonnie Dorr

© Daniel S. Weld 44 Review of Probabilities Forward probability α_t(i): the probability of being in state s_i at time t, having generated the partial observation o_1, …, o_t. Backward probability β_t(i): the probability of generating the partial observation o_{t+1}, …, o_T, given being in state s_i at time t. Transition probability ξ_t(i, j): the probability of being in state s_i at time t and state s_j at time t+1, given the complete observation o_1, …, o_T. State probability γ_t(i): the probability of being in state s_i at time t, given the complete observation o_1, …, o_T. Slide by Bonnie Dorr

© Daniel S. Weld 45 Re-estimating Initial State Probabilities Initial state distribution: π_i is the probability that s_i is a start state. Re-estimation is easy: π̂_i = expected frequency in state s_i at time 1. Formally: π̂_i = γ_1(i). Slide by Bonnie Dorr

© Daniel S. Weld 46 Re-estimation of Emission Probabilities Emission probabilities are re-estimated as: b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i). Formally: b̂_i(k) = Σ_{t=1}^{T} δ(o_t, v_k) γ_t(i) / Σ_{t=1}^{T} γ_t(i), where δ(o_t, v_k) = 1 if o_t = v_k and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!! Slide by Bonnie Dorr

© Daniel S. Weld 47 The Updated Model Coming from λ = (A, B, π) we get to λ̂ = (Â, B̂, π̂) by the update rules above for â_ij, b̂_i(k), and π̂_i. Slide by Bonnie Dorr
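
A single Baum-Welch re-estimation step tying together α, β, ξ, γ and the three update rules above; a sketch for one observation sequence (a real training loop would iterate to convergence and typically rescale or work in log space to avoid underflow).

```python
# One forward-backward (Baum-Welch) re-estimation step for a single sequence.
import numpy as np

def baum_welch_step(pi, A, B, obs):
    N, T = len(pi), len(obs)
    # Forward and backward passes, as defined on the earlier slides.
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()                                   # P(O | lambda)

    # E step: xi[t, i, j] and gamma[t, i].
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / prob
    gamma = alpha * beta / prob

    # M step: the three update rules from the slides.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, prob
```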

© Daniel S. Weld 48 Expectation Maximization The forward-backward algorithm is an instance of the more general EM algorithm. The E step: compute the forward and backward probabilities for a given model. The M step: re-estimate the model parameters. Slide by Bonnie Dorr

© Daniel S. Weld 49 Importance of HMM Topology Certain structures better capture the observed phenomena in the prefix, target and suffix sequences Building structures by hand does not scale to large corpora Human intuitions don’t always correspond to structures that make the best use of HMM potential Slide by Okan Basegmez

© Daniel S. Weld 50 How Do We Learn Structure?

© Daniel S. Weld 51 Conclusion IE is performed by recovering the most likely state sequence (Viterbi) Transition and Emission Parameters can be learned from training data (Baum-Welch) Shrinkage improves parameter estimation Task-specific state-transition structure can be automatically discovered Slide by Okan Basegmez

© Daniel S. Weld 52 References Information Extraction with HMM Structures Learned by Stochastic Optimization, Dayne Freitag and Andrew McCallum. Information Extraction with HMMs and Shrinkage, Dayne Freitag and Andrew McCallum. Learning Hidden Markov Model Structure for Information Extraction, Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Inducing Probabilistic Grammars by Bayesian Model Merging, Andreas Stolcke and Stephen Omohundro. Information Extraction Using Hidden Markov Models, T. R. Leek, Master's thesis, UCSD. Slide by Okan Basegmez