Hidden Markov Models for Information Extraction CSE 454
© Daniel S. Weld 2 Course Overview Systems Foundation: Networking & Clusters, Datamining, Synchronization & Monitors, Crawler Architecture. Case Studies: Nutch, Google, Altavista. Information Retrieval: Precision vs. Recall, Inverted Indices. P2P, Security, Web Services, Semantic Web, Info Extraction, Ecommerce.
© Daniel S. Weld 3 What is Information Extraction (IE)? The task of populating database slots with corresponding phrases from text. Slide by Okan Basegmez
© Daniel S. Weld 4 What are HMMs? An HMM is a finite state automaton with stochastic transitions and symbol emissions. (Rabiner 1989)
© Daniel S. Weld 5 Why use HMMs for IE?
Advantages:
- Strong statistical foundations
- Well suited to natural language domains
- Handles new data robustly
- Computationally efficient to develop
Disadvantages:
- A priori notion of model topology
- Large amounts of training data
Slide by Okan Basegmez
© Daniel S. Weld 6 Defn: Markov Model
- Q: set of states
- π: initial probability distribution
- A: transition probability distribution, where p_12 is the probability of transitioning from s_1 to s_2
- For every state s_i, the outgoing transition probabilities sum to one: Σ_j p_ij = 1
(Figure: a chain of states s_0 … s_6 with transition arcs.)
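To make the definition concrete, here is a minimal sketch (not from the slides; the three-state model and its numbers are invented for illustration) of a Markov model as an initial distribution plus a row-stochastic transition matrix, with a check that every row sums to 1 and a sampler that walks the chain:

    import random

    # Hypothetical 3-state Markov model (illustrative numbers only).
    states = ["s0", "s1", "s2"]
    init = {"s0": 0.6, "s1": 0.3, "s2": 0.1}          # initial distribution
    trans = {                                          # A: row-stochastic transition matrix
        "s0": {"s0": 0.5, "s1": 0.4, "s2": 0.1},
        "s1": {"s0": 0.2, "s1": 0.6, "s2": 0.2},
        "s2": {"s0": 0.3, "s1": 0.3, "s2": 0.4},
    }

    # Each row of A must sum to 1 (so must the initial distribution).
    assert abs(sum(init.values()) - 1.0) < 1e-9
    for s in states:
        assert abs(sum(trans[s].values()) - 1.0) < 1e-9

    def sample_path(length):
        """Sample a state sequence by repeatedly applying the transition model."""
        s = random.choices(states, weights=[init[x] for x in states])[0]
        path = [s]
        for _ in range(length - 1):
            s = random.choices(states, weights=[trans[s][x] for x in states])[0]
            path.append(s)
        return path

    print(sample_path(10))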
© Daniel S. Weld 7 E.g. Predict Web Behavior
- Q: set of states (pages)
- π: initial probability distribution (likelihood of site entry point)
- A: transition probability distribution (user navigation model)
When will a visitor leave the site?
© Daniel S. Weld 8 Diversion: Relational Markov Models
© Daniel S. Weld 9 Probability Distribution, A
Forward Causality: the probability of s_t does not depend directly on values of future states.
The probability of a new state could depend on the history of states visited: Pr(s_t | s_{t-1}, s_{t-2}, …, s_0)
Markovian Assumption: Pr(s_t | s_{t-1}, s_{t-2}, …, s_0) = Pr(s_t | s_{t-1})
Stationary Model Assumption: Pr(s_t | s_{t-1}) = Pr(s_k | s_{k-1}) for all k.
© Daniel S. Weld 10 Defn: Hidden Markov Model
- Q: set of states
- π: initial probability distribution
- A: transition probability distribution
- O: set of possible observations
- b_i(o_t): probability of s_i emitting o_t
(The states themselves are hidden!)
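As a concrete (hypothetical) representation of these components, the sketch below bundles Q, π, A, O and b_i(o_t) into a small Python container; the field names and dictionary layout are my own choices, not anything prescribed by the slides:

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class HMM:
        states: List[str]                      # Q: hidden states
        init: Dict[str, float]                 # pi: initial state distribution
        trans: Dict[str, Dict[str, float]]     # A: trans[i][j] = P(q_{t+1}=j | q_t=i)
        emit: Dict[str, Dict[str, float]]      # B: emit[i][o] = b_i(o)

        def b(self, state: str, obs: str) -> float:
            """Emission probability b_i(o_t); unseen symbols get probability 0 here."""
            return self.emit[state].get(obs, 0.0)

The later algorithm sketches assume this dictionary-of-dictionaries layout for A and B.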
© Daniel S. Weld 11 HMMs and their Usage HMMs are very common in Computational Linguistics:
- Speech recognition (observed: acoustic signal, hidden: words)
- Handwriting recognition (observed: image, hidden: words)
- Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
- Machine translation (observed: foreign words, hidden: words in target language)
- Information Extraction (observed: words, hidden: the fields being extracted)
Slide by Bonnie Dorr
© Daniel S. Weld 12 Information Extraction with HMMs Example - Research Paper Headers Slide by Okan Basegmez
© Daniel S. Weld 13 The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we compute the probability of O given the model, P(O | λ)?
Slide by Bonnie Dorr
© Daniel S. Weld 14 The Three Basic HMM Problems
Problem 2 (Decoding): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we find the state sequence that best explains the observations?
Slide by Bonnie Dorr
© Daniel S. Weld 15 The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Slide by Bonnie Dorr
© Daniel S. Weld 16 Information Extraction with HMMs Given a model M and its parameters, information extraction is performed by determining the state sequence that was most likely to have generated the entire document. This sequence can be recovered by dynamic programming with the Viterbi algorithm. Slide by Okan Basegmez
© Daniel S. Weld 17 Information Extraction with HMMs
- P(x | M): probability of a string x being emitted by an HMM M (summed over all state sequences)
- V(x | M): the state sequence with the highest probability of having produced the observation sequence
Slide by Okan Basegmez
© Daniel S. Weld 18 Simple Example (rain and umbrella)
(Figure: a temporal model with hidden states Rain_{t-1}, Rain_t, Rain_{t+1} and observations Umbrella_{t-1}, Umbrella_t, Umbrella_{t+1}.)
Sensor model: P(U_t = true | R_t = true) = 0.9, P(U_t = true | R_t = false) = 0.2
Transition model: P(R_t = true | R_{t-1} = true) = 0.7, P(R_t = true | R_{t-1} = false) = 0.3
© Daniel S. Weld 19 Simple Example (continued)
(Figure: the model unrolled over four time steps, Rain_1 … Rain_4, each with true/false values and an Umbrella observation at each step.)
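Using the two tables above, forward filtering on this model can be worked out directly. The sketch below assumes a uniform prior P(R_0 = true) = 0.5, which the slides do not state, and estimates P(Rain_t = true | umbrella observations so far):

    # Umbrella model from the slide: P(R_t=T | R_{t-1}=T)=0.7, P(R_t=T | R_{t-1}=F)=0.3,
    # P(U_t=T | R_t=T)=0.9, P(U_t=T | R_t=F)=0.2.
    P_R_given_prev = {True: 0.7, False: 0.3}   # P(Rain_t = true | Rain_{t-1})
    P_U_given_R = {True: 0.9, False: 0.2}      # P(Umbrella_t = true | Rain_t)

    def filter_rain(observations, prior_rain=0.5):
        """Return P(Rain_t = true | u_1..u_t) for each t (forward filtering)."""
        belief = prior_rain                    # assumed uniform prior P(R_0 = true) = 0.5
        estimates = []
        for umbrella in observations:
            # Predict: push the current belief through the transition model.
            pred_true = belief * P_R_given_prev[True] + (1 - belief) * P_R_given_prev[False]
            # Update: weight by the evidence likelihood and renormalize.
            like_true = P_U_given_R[True] if umbrella else 1 - P_U_given_R[True]
            like_false = P_U_given_R[False] if umbrella else 1 - P_U_given_R[False]
            num_true = like_true * pred_true
            num_false = like_false * (1 - pred_true)
            belief = num_true / (num_true + num_false)
            estimates.append(belief)
        return estimates

    # Two umbrella sightings give roughly 0.818 and then 0.883 for P(Rain = true).
    print(filter_rain([True, True]))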
© Daniel S. Weld 20 Forward Probabilities What is the probability that, given an HMM, at time t the state is i and the partial observation o_1 … o_t has been generated? α_t(i) = P(o_1 … o_t, q_t = s_i | λ) Slide by Bonnie Dorr
© Daniel S. Weld 21 Problem 1: Probability of an Observation Sequence
What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences. Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
The solution to this and to Problem 2 is to use dynamic programming.
Slide by Bonnie Dorr
© Daniel S. Weld 22 Forward Probabilities
(Figure: trellis diagram illustrating the forward recursion α_t(j) = [ Σ_i α_{t-1}(i) · a_{ij} ] · b_j(o_t).)
Slide by Bonnie Dorr
© Daniel S. Weld 23 Forward Algorithm
Initialization: α_1(i) = π_i · b_i(o_1), 1 ≤ i ≤ N
Induction: α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) · a_{ij} ] · b_j(o_t), 2 ≤ t ≤ T
Termination: P(O | λ) = Σ_{i=1..N} α_T(i)
Slide by Bonnie Dorr
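A direct transcription of these three steps might look like the following sketch, assuming π, A and B are dictionaries keyed by state (the same layout as the earlier HMM container); the function name and variables are mine, not from the slides:

    def forward(obs, states, pi, A, B):
        """Return P(O | model) and the table of forward probabilities alpha[t][i]."""
        alpha = [{} for _ in obs]
        # Initialization: alpha_1(i) = pi_i * b_i(o_1)
        for i in states:
            alpha[0][i] = pi[i] * B[i][obs[0]]
        # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
        for t in range(1, len(obs)):
            for j in states:
                alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
        # Termination: P(O | model) = sum_i alpha_T(i)
        return sum(alpha[-1][i] for i in states), alpha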
© Daniel S. Weld 24 Forward Algorithm Complexity In the naïve approach to solving Problem 1, it takes on the order of 2T · N^T computations. The forward algorithm takes on the order of N^2 · T computations. Slide by Bonnie Dorr
© Daniel S. Weld 25 Backward Probabilities Analogous to the forward probability, just in the other direction. What is the probability that, given an HMM and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated? β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ) Slide by Bonnie Dorr
© Daniel S. Weld 26 Backward Probabilities
(Figure: trellis diagram illustrating the backward recursion β_t(i) = Σ_j a_{ij} · b_j(o_{t+1}) · β_{t+1}(j).)
Slide by Bonnie Dorr
© Daniel S. Weld 27 Backward Algorithm
Initialization: β_T(i) = 1, 1 ≤ i ≤ N
Induction: β_t(i) = Σ_{j=1..N} a_{ij} · b_j(o_{t+1}) · β_{t+1}(j), t = T-1, …, 1
Termination: P(O | λ) = Σ_{i=1..N} π_i · b_i(o_1) · β_1(i)
Slide by Bonnie Dorr
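The backward pass is the mirror image of the forward sketch above, under the same assumed dictionary layout:

    def backward(obs, states, A, B):
        """Return the table of backward probabilities beta[t][i]."""
        T = len(obs)
        beta = [{} for _ in obs]
        # Initialization: beta_T(i) = 1
        for i in states:
            beta[T - 1][i] = 1.0
        # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        for t in range(T - 2, -1, -1):
            for i in states:
                beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in states)
        return beta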
© Daniel S. Weld 28 Problem 2: Decoding The solution to Problem 1 (Evaluation) gives us the sum over all paths through an HMM efficiently. For Problem 2, we want to find the single path with the highest probability. We want to find the state sequence Q = q_1 … q_T such that Q = argmax_{Q'} P(Q' | O, λ). Slide by Bonnie Dorr
© Daniel S. Weld 29 Viterbi Algorithm Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.
Forward: α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) · a_{ij} ] · b_j(o_t)
Viterbi recursion: δ_t(j) = [ max_{i=1..N} δ_{t-1}(i) · a_{ij} ] · b_j(o_t)
Slide by Bonnie Dorr
© Daniel S. Weld 30 Viterbi Algorithm
Initialization: δ_1(i) = π_i · b_i(o_1); ψ_1(i) = 0
Induction: δ_t(j) = [ max_i δ_{t-1}(i) · a_{ij} ] · b_j(o_t); ψ_t(j) = argmax_i δ_{t-1}(i) · a_{ij}
Termination: p* = max_i δ_T(i); q*_T = argmax_i δ_T(i)
Read out path: q*_t = ψ_{t+1}(q*_{t+1}), t = T-1, …, 1
Slide by Bonnie Dorr
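A sketch of the full procedure, again under the assumed dictionary layout, with back-pointers ψ stored so the best path can be read out at the end:

    def viterbi(obs, states, pi, A, B):
        """Return the most likely state sequence and its probability."""
        delta = [{} for _ in obs]
        backptr = [{} for _ in obs]
        # Initialization: delta_1(i) = pi_i * b_i(o_1)
        for i in states:
            delta[0][i] = pi[i] * B[i][obs[0]]
        # Induction: keep the best (max) incoming transition instead of the sum.
        for t in range(1, len(obs)):
            for j in states:
                best_i = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
                delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
                backptr[t][j] = best_i
        # Termination and path read-out via the stored back-pointers.
        last = max(states, key=lambda i: delta[-1][i])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(backptr[t][path[-1]])
        path.reverse()
        return path, delta[-1][last]

In practice log probabilities are usually used to avoid underflow on long documents; that detail is omitted here to keep the sketch close to the slide's formulas.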
© Daniel S. Weld 31 Information Extraction We want specific info from text documents. For example, from colloquium-announcement emails we want the:
- Speaker name
- Location
- Start time
© Daniel S. Weld 32 Simple HMM for Job Titles
© Daniel S. Weld 33 HMMs for Info Extraction For sparse extraction tasks: a separate HMM for each type of target. Each HMM should:
- Model the entire document
- Consist of target and non-target states
- Not necessarily be fully connected
Given the HMM, how do we extract info? (See the sketch below.)
Slide by Okan Basegmez
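One way to answer that question, as a sketch: run the Viterbi decoder from the earlier slide over the whole document and keep the tokens whose most likely state is a target state. The viterbi function and the "speaker" state name below are the hypothetical ones from the earlier sketches, not part of the slides:

    def extract(tokens, states, pi, A, B, target_states):
        """Run Viterbi over the whole document and keep tokens tagged with target states."""
        path, _ = viterbi(tokens, states, pi, A, B)
        return [tok for tok, state in zip(tokens, path) if state in target_states]

    # e.g. extract(email_tokens, states, pi, A, B, target_states={"speaker"})
    # would return the tokens the model believes form the speaker name.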
© Daniel S. Weld 34 How Learn HMM? Two questions: structure & parameters
© Daniel S. Weld 35 Simplest Case
- Fix the structure; learn the transition & emission probabilities
- Training data? Label each word as target or non-target
- Challenges: sparse training data; unseen words would get zero probability, hence Smoothing! (see the sketch below)
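A minimal sketch of this supervised case: count labeled transitions and emissions, then smooth. Laplace (add-one) smoothing is used here as one common choice; the slide itself only says "Smoothing!", so the exact scheme, the function name and the variable names are all assumptions:

    from collections import Counter, defaultdict

    def train_supervised(labeled_docs, vocab_size):
        """labeled_docs: list of [(word, state), ...]; states are e.g. 'target'/'non-target'."""
        trans_counts = defaultdict(Counter)
        emit_counts = defaultdict(Counter)
        for doc in labeled_docs:
            # Count state-to-state transitions between adjacent labeled words.
            for (_, s), (_, s_next) in zip(doc, doc[1:]):
                trans_counts[s][s_next] += 1
            # Count which words each state emits.
            for w, s in doc:
                emit_counts[s][w] += 1
        states = list(emit_counts)
        # Add-one smoothed transition probabilities.
        A = {s: {s2: (trans_counts[s][s2] + 1) / (sum(trans_counts[s].values()) + len(states))
                 for s2 in states} for s in states}
        def emit_prob(s, w):
            # Add-one smoothing so unseen words do not get zero probability.
            return (emit_counts[s][w] + 1) / (sum(emit_counts[s].values()) + vocab_size)
        return states, A, emit_prob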
© Daniel S. Weld 36 Problem 3: Learning So far we have assumed that we know the underlying model λ = (A, B, π). Often these parameters are estimated on annotated training data, which has two drawbacks: annotation is difficult and/or expensive, and the training data is different from the current data. We want to find the parameters that maximize the probability of the current data, i.e., we are looking for a model λ' such that λ' = argmax_λ P(O | λ). Slide by Bonnie Dorr
© Daniel S. Weld 37 Problem 3: Learning Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that P(O | λ') ≥ P(O | λ) for every λ. But it is possible to find a local maximum! Given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ). Slide by Bonnie Dorr
© Daniel S. Weld 38 Parameter Re-estimation Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm. Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters. Slide by Bonnie Dorr
© Daniel S. Weld 39 Parameter Re-estimation Three parameters need to be re-estimated:
- Initial state distribution: π_i
- Transition probabilities: a_{i,j}
- Emission probabilities: b_i(o_t)
Slide by Bonnie Dorr
© Daniel S. Weld 40 Re-estimating Transition Probabilities What is the probability of being in state s_i at time t and going to state s_j, given the current model and parameters? ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ) Slide by Bonnie Dorr
© Daniel S. Weld 41 Re-estimating Transition Probabilities ξ_t(i, j) = α_t(i) · a_{i,j} · b_j(o_{t+1}) · β_{t+1}(j) / P(O | λ) Slide by Bonnie Dorr
© Daniel S. Weld 42 Re-estimating Transition Probabilities The intuition behind the re-estimation equation for transition probabilities is: â_{i,j} = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i). Formally: â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} Σ_{j'=1..N} ξ_t(i, j') Slide by Bonnie Dorr
© Daniel S. Weld 43 Re-estimating Transition Probabilities Defining γ_t(i) = Σ_{j=1..N} ξ_t(i, j) as the probability of being in state s_i, given the complete observation O, we can say: â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i) Slide by Bonnie Dorr
© Daniel S. Weld 44 Review of Probabilities
- Forward probability α_t(i): the probability of being in state s_i, given the partial observation o_1, …, o_t
- Backward probability β_t(i): the probability of being in state s_i, given the partial observation o_{t+1}, …, o_T
- Transition probability ξ_t(i, j): the probability of going from state s_i to state s_j, given the complete observation o_1, …, o_T
- State probability γ_t(i): the probability of being in state s_i, given the complete observation o_1, …, o_T
Slide by Bonnie Dorr
© Daniel S. Weld 45 Re-estimating Initial State Probabilities Initial state distribution: π_i is the probability that s_i is a start state. Re-estimation is easy: π̂_i = expected frequency in state s_i at time 1. Formally: π̂_i = γ_1(i). Slide by Bonnie Dorr
© Daniel S. Weld 46 Re-estimation of Emission Probabilities Emission probabilities are re-estimated as: b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i). Formally: b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) · γ_t(i) / Σ_{t=1..T} γ_t(i), where δ(o_t, v_k) = 1 if o_t = v_k and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm! Slide by Bonnie Dorr
© Daniel S. Weld 47 The Updated Model Coming from λ = (A, B, π), we get to λ' = (Â, B̂, π̂) by the following update rules: π̂_i = γ_1(i); â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i); b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) · γ_t(i) / Σ_{t=1..T} γ_t(i). Slide by Bonnie Dorr
© Daniel S. Weld 48 Expectation Maximization The forward-backward algorithm is an instance of the more general EM algorithm.
The E Step: compute the forward and backward probabilities for a given model.
The M Step: re-estimate the model parameters.
Slide by Bonnie Dorr
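Putting the E step and M step together, one Baum-Welch iteration for a single observation sequence might look like the sketch below. It reuses the hypothetical forward and backward functions from the earlier sketches and, to stay short, re-estimates only the transition matrix; the π and emission updates on the re-estimation slides follow the same pattern:

    def baum_welch_step(obs, states, pi, A, B):
        """One EM iteration: E-step computes alpha/beta, M-step re-estimates A (sketch)."""
        prob_O, alpha = forward(obs, states, pi, A, B)
        beta = backward(obs, states, A, B)
        T = len(obs)
        # E-step: xi_t(i,j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O)
        #         gamma_t(i) = sum_j xi_t(i,j)
        xi_sum = {i: {j: 0.0 for j in states} for i in states}
        gamma_sum = {i: 0.0 for i in states}
        for t in range(T - 1):
            for i in states:
                for j in states:
                    xi = alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob_O
                    xi_sum[i][j] += xi
                    gamma_sum[i] += xi
        # M-step: a_ij = expected transitions i->j / expected transitions out of i
        new_A = {i: {j: xi_sum[i][j] / gamma_sum[i] for j in states} for i in states}
        return new_A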
© Daniel S. Weld 49 Importance of HMM Topology
- Certain structures better capture the observed phenomena in the prefix, target and suffix sequences
- Building structures by hand does not scale to large corpora
- Human intuitions don't always correspond to structures that make the best use of HMM potential
Slide by Okan Basegmez
© Daniel S. Weld 50 How Learn Structure?
© Daniel S. Weld 51 Conclusion
- IE is performed by recovering the most likely state sequence (Viterbi)
- Transition and emission parameters can be learned from training data (Baum-Welch)
- Shrinkage improves parameter estimation
- Task-specific state-transition structure can be automatically discovered
Slide by Okan Basegmez
© Daniel S. Weld 52 References
- Information Extraction with HMM Structures Learned by Stochastic Optimization, Dayne Freitag and Andrew McCallum
- Information Extraction with HMMs and Shrinkage, Dayne Freitag and Andrew McCallum
- Learning Hidden Markov Model Structure for Information Extraction, Kristie Seymore, Andrew McCallum, Roni Rosenfeld
- Inducing Probabilistic Grammars by Bayesian Model Merging, Andreas Stolcke, Stephen Omohundro
- Information Extraction using Hidden Markov Models, T. R. Leek, Master's thesis, UCSD
Slide by Okan Basegmez