Hidden Markov Models for Information Extraction CSE 454
© Daniel S. Weld 2 Course Overview Systems Foundation: Networking & Clusters, Datamining, Synchronization & Monitors, Crawler Architecture. Case Studies: Nutch, Google, Altavista. Information Retrieval: Precision vs. Recall, Inverted Indices. P2P, Security, Web Services, Semantic Web, Info Extraction, Ecommerce.
© Daniel S. Weld 3 What is Information Extraction (IE)? The task of populating database slots with corresponding phrases from text. Slide by Okan Basegmez
© Daniel S. Weld 4 What are HMMs? An HMM is a finite state automaton with stochastic transitions and symbol emissions. (Rabiner 1989)
© Daniel S. Weld 5 Why use HMMs for IE?
Advantages:
- Strong statistical foundations
- Well suited to natural language domains
- Handles new data robustly
- Computationally efficient to develop
Disadvantages:
- A priori notion of model topology
- Large amounts of training data
Slide by Okan Basegmez
© Daniel S. Weld 6 Defn: Markov Model
- Q: set of states
- π: initial probability distribution
- A: transition probability distribution, where p_12 is the probability of transitioning from s_1 to s_2
- For every state s_i, the outgoing transition probabilities sum to one: Σ_j p_ij = 1
(Figure: a chain of states s_0 … s_6 with transition arcs.)
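To make the definition concrete, here is a minimal sketch (not from the slides; the three-state model and its numbers are invented for illustration) of a Markov model as an initial distribution plus a row-stochastic transition matrix, with a check that every row sums to 1 and a sampler that walks the chain:

    import random

    # Hypothetical 3-state Markov model (illustrative numbers only).
    states = ["s0", "s1", "s2"]
    init = {"s0": 0.6, "s1": 0.3, "s2": 0.1}          # initial distribution
    trans = {                                          # A: row-stochastic transition matrix
        "s0": {"s0": 0.5, "s1": 0.4, "s2": 0.1},
        "s1": {"s0": 0.2, "s1": 0.6, "s2": 0.2},
        "s2": {"s0": 0.3, "s1": 0.3, "s2": 0.4},
    }

    # Each row of A must sum to 1 (so must the initial distribution).
    assert abs(sum(init.values()) - 1.0) < 1e-9
    for s in states:
        assert abs(sum(trans[s].values()) - 1.0) < 1e-9

    def sample_path(length):
        """Sample a state sequence by repeatedly applying the transition model."""
        s = random.choices(states, weights=[init[x] for x in states])[0]
        path = [s]
        for _ in range(length - 1):
            s = random.choices(states, weights=[trans[s][x] for x in states])[0]
            path.append(s)
        return path

    print(sample_path(10))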
© Daniel S. Weld 7 E.g. Predict Web Behavior
- Q: set of states (pages)
- π: initial probability distribution (likelihood of site entry point)
- A: transition probability distribution (user navigation model)
When will a visitor leave the site?
© Daniel S. Weld 8 Diversion: Relational Markov Models
© Daniel S. Weld 9 Probability Distribution, A
Forward Causality: the probability of s_t does not depend directly on values of future states.
The probability of a new state could depend on the history of states visited: Pr(s_t | s_{t-1}, s_{t-2}, …, s_0)
Markovian Assumption: Pr(s_t | s_{t-1}, s_{t-2}, …, s_0) = Pr(s_t | s_{t-1})
Stationary Model Assumption: Pr(s_t | s_{t-1}) = Pr(s_k | s_{k-1}) for all k.
© Daniel S. Weld 10 Defn: Hidden Markov Model
- Q: set of states
- π: initial probability distribution
- A: transition probability distribution
- O: set of possible observations
- b_i(o_t): probability of s_i emitting o_t
(The states themselves are hidden!)
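As a concrete (hypothetical) representation of these components, the sketch below bundles Q, π, A, O and b_i(o_t) into a small Python container; the field names and dictionary layout are my own choices, not anything prescribed by the slides:

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class HMM:
        states: List[str]                      # Q: hidden states
        init: Dict[str, float]                 # pi: initial state distribution
        trans: Dict[str, Dict[str, float]]     # A: trans[i][j] = P(q_{t+1}=j | q_t=i)
        emit: Dict[str, Dict[str, float]]      # B: emit[i][o] = b_i(o)

        def b(self, state: str, obs: str) -> float:
            """Emission probability b_i(o_t); unseen symbols get probability 0 here."""
            return self.emit[state].get(obs, 0.0)

The later algorithm sketches assume this dictionary-of-dictionaries layout for A and B.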
© Daniel S. Weld 11 HMMs and their Usage HMMs are very common in Computational Linguistics:
- Speech recognition (observed: acoustic signal, hidden: words)
- Handwriting recognition (observed: image, hidden: words)
- Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
- Machine translation (observed: foreign words, hidden: words in target language)
- Information Extraction (observed: words, hidden: the fields being extracted)
Slide by Bonnie Dorr
© Daniel S. Weld 12 Information Extraction with HMMs Example - Research Paper Headers Slide by Okan Basegmez
© Daniel S. Weld 13 The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we compute the probability of O given the model, P(O | λ)?
Slide by Bonnie Dorr
© Daniel S. Weld 14 The Three Basic HMM Problems
Problem 2 (Decoding): Given the observation sequence O = o_1, …, o_T and an HMM λ = (A, B, π), how do we find the state sequence that best explains the observations?
Slide by Bonnie Dorr
© Daniel S. Weld 15 The Three Basic HMM Problems
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
Slide by Bonnie Dorr
© Daniel S. Weld 16 Information Extraction with HMMs Given a model M and its parameters, information extraction is performed by determining the state sequence that was most likely to have generated the entire document. This sequence can be recovered by dynamic programming with the Viterbi algorithm. Slide by Okan Basegmez
© Daniel S. Weld 17 Information Extraction with HMMs
- P(x | M): probability of a string x being emitted by an HMM M (summed over all state sequences)
- V(x | M): the state sequence with the highest probability of having produced the observation sequence
Slide by Okan Basegmez
© Daniel S. Weld 18 Simple Example (rain and umbrella)
(Figure: a temporal model with hidden states Rain_{t-1}, Rain_t, Rain_{t+1} and observations Umbrella_{t-1}, Umbrella_t, Umbrella_{t+1}.)
Sensor model: P(U_t = true | R_t = true) = 0.9, P(U_t = true | R_t = false) = 0.2
Transition model: P(R_t = true | R_{t-1} = true) = 0.7, P(R_t = true | R_{t-1} = false) = 0.3
© Daniel S. Weld 19 Simple Example (continued)
(Figure: the model unrolled over four time steps, Rain_1 … Rain_4, each with true/false values and an Umbrella observation at each step.)
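Using the two tables above, forward filtering on this model can be worked out directly. The sketch below assumes a uniform prior P(R_0 = true) = 0.5, which the slides do not state, and estimates P(Rain_t = true | umbrella observations so far):

    # Umbrella model from the slide: P(R_t=T | R_{t-1}=T)=0.7, P(R_t=T | R_{t-1}=F)=0.3,
    # P(U_t=T | R_t=T)=0.9, P(U_t=T | R_t=F)=0.2.
    P_R_given_prev = {True: 0.7, False: 0.3}   # P(Rain_t = true | Rain_{t-1})
    P_U_given_R = {True: 0.9, False: 0.2}      # P(Umbrella_t = true | Rain_t)

    def filter_rain(observations, prior_rain=0.5):
        """Return P(Rain_t = true | u_1..u_t) for each t (forward filtering)."""
        belief = prior_rain                    # assumed uniform prior P(R_0 = true) = 0.5
        estimates = []
        for umbrella in observations:
            # Predict: push the current belief through the transition model.
            pred_true = belief * P_R_given_prev[True] + (1 - belief) * P_R_given_prev[False]
            # Update: weight by the evidence likelihood and renormalize.
            like_true = P_U_given_R[True] if umbrella else 1 - P_U_given_R[True]
            like_false = P_U_given_R[False] if umbrella else 1 - P_U_given_R[False]
            num_true = like_true * pred_true
            num_false = like_false * (1 - pred_true)
            belief = num_true / (num_true + num_false)
            estimates.append(belief)
        return estimates

    # Two umbrella sightings give roughly 0.818 and then 0.883 for P(Rain = true).
    print(filter_rain([True, True]))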
© Daniel S. Weld 20 Forward Probabilities What is the probability that, given an HMM, at time t the state is i and the partial observation o_1 … o_t has been generated? α_t(i) = P(o_1 … o_t, q_t = s_i | λ) Slide by Bonnie Dorr
© Daniel S. Weld 21 Problem 1: Probability of an Observation Sequence
What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive. Given T observations and N states, there are N^T possible state sequences. Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.
The solution to this and to Problem 2 is to use dynamic programming.
Slide by Bonnie Dorr
© Daniel S. Weld 22 Forward Probabilities
(Figure: trellis diagram illustrating the forward recursion α_t(j) = [ Σ_i α_{t-1}(i) · a_{ij} ] · b_j(o_t).)
Slide by Bonnie Dorr
© Daniel S. Weld 23 Forward Algorithm
Initialization: α_1(i) = π_i · b_i(o_1), 1 ≤ i ≤ N
Induction: α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) · a_{ij} ] · b_j(o_t), 2 ≤ t ≤ T
Termination: P(O | λ) = Σ_{i=1..N} α_T(i)
Slide by Bonnie Dorr
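A direct transcription of these three steps might look like the following sketch, assuming π, A and B are dictionaries keyed by state (the same layout as the earlier HMM container); the function name and variables are mine, not from the slides:

    def forward(obs, states, pi, A, B):
        """Return P(O | model) and the table of forward probabilities alpha[t][i]."""
        alpha = [{} for _ in obs]
        # Initialization: alpha_1(i) = pi_i * b_i(o_1)
        for i in states:
            alpha[0][i] = pi[i] * B[i][obs[0]]
        # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
        for t in range(1, len(obs)):
            for j in states:
                alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]
        # Termination: P(O | model) = sum_i alpha_T(i)
        return sum(alpha[-1][i] for i in states), alpha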
© Daniel S. Weld 24 Forward Algorithm Complexity In the naïve approach to solving Problem 1, it takes on the order of 2T · N^T computations. The forward algorithm takes on the order of N^2 · T computations. Slide by Bonnie Dorr
© Daniel S. Weld 25 Backward Probabilities Analogous to the forward probability, just in the other direction. What is the probability that, given an HMM and given that the state at time t is i, the partial observation o_{t+1} … o_T is generated? β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ) Slide by Bonnie Dorr
© Daniel S. Weld 26 Backward Probabilities
(Figure: trellis diagram illustrating the backward recursion β_t(i) = Σ_j a_{ij} · b_j(o_{t+1}) · β_{t+1}(j).)
Slide by Bonnie Dorr
© Daniel S. Weld 27 Backward Algorithm
Initialization: β_T(i) = 1, 1 ≤ i ≤ N
Induction: β_t(i) = Σ_{j=1..N} a_{ij} · b_j(o_{t+1}) · β_{t+1}(j), t = T-1, …, 1
Termination: P(O | λ) = Σ_{i=1..N} π_i · b_i(o_1) · β_1(i)
Slide by Bonnie Dorr
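The backward pass is the mirror image of the forward sketch above, under the same assumed dictionary layout:

    def backward(obs, states, A, B):
        """Return the table of backward probabilities beta[t][i]."""
        T = len(obs)
        beta = [{} for _ in obs]
        # Initialization: beta_T(i) = 1
        for i in states:
            beta[T - 1][i] = 1.0
        # Induction: beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        for t in range(T - 2, -1, -1):
            for i in states:
                beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in states)
        return beta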
© Daniel S. Weld 28 Problem 2: Decoding The solution to Problem 1 (Evaluation) gives us the sum over all paths through an HMM efficiently. For Problem 2, we want to find the single path with the highest probability. We want to find the state sequence Q = q_1 … q_T such that Q = argmax_{Q'} P(Q' | O, λ). Slide by Bonnie Dorr
© Daniel S. Weld 29 Viterbi Algorithm Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.
Forward: α_t(j) = [ Σ_{i=1..N} α_{t-1}(i) · a_{ij} ] · b_j(o_t)
Viterbi recursion: δ_t(j) = [ max_{i=1..N} δ_{t-1}(i) · a_{ij} ] · b_j(o_t)
Slide by Bonnie Dorr
© Daniel S. Weld 30 Viterbi Algorithm
Initialization: δ_1(i) = π_i · b_i(o_1); ψ_1(i) = 0
Induction: δ_t(j) = [ max_i δ_{t-1}(i) · a_{ij} ] · b_j(o_t); ψ_t(j) = argmax_i δ_{t-1}(i) · a_{ij}
Termination: p* = max_i δ_T(i); q*_T = argmax_i δ_T(i)
Read out path: q*_t = ψ_{t+1}(q*_{t+1}), t = T-1, …, 1
Slide by Bonnie Dorr
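A sketch of the full procedure, again under the assumed dictionary layout, with back-pointers ψ stored so the best path can be read out at the end:

    def viterbi(obs, states, pi, A, B):
        """Return the most likely state sequence and its probability."""
        delta = [{} for _ in obs]
        backptr = [{} for _ in obs]
        # Initialization: delta_1(i) = pi_i * b_i(o_1)
        for i in states:
            delta[0][i] = pi[i] * B[i][obs[0]]
        # Induction: keep the best (max) incoming transition instead of the sum.
        for t in range(1, len(obs)):
            for j in states:
                best_i = max(states, key=lambda i: delta[t - 1][i] * A[i][j])
                delta[t][j] = delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
                backptr[t][j] = best_i
        # Termination and path read-out via the stored back-pointers.
        last = max(states, key=lambda i: delta[-1][i])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(backptr[t][path[-1]])
        path.reverse()
        return path, delta[-1][last]

In practice log probabilities are usually used to avoid underflow on long documents; that detail is omitted here to keep the sketch close to the slide's formulas.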
© Daniel S. Weld 31 Information Extraction We want specific info from text documents. For example, from colloquium-announcement emails we want the:
- Speaker name
- Location
- Start time
© Daniel S. Weld 32 Simple HMM for Job Titles
© Daniel S. Weld 33 HMMs for Info Extraction For sparse extraction tasks: a separate HMM for each type of target. Each HMM should:
- Model the entire document
- Consist of target and non-target states
- Not necessarily be fully connected
Given the HMM, how do we extract info? (See the sketch below.)
Slide by Okan Basegmez
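One way to answer that question, as a sketch: run the Viterbi decoder from the earlier slide over the whole document and keep the tokens whose most likely state is a target state. The viterbi function and the "speaker" state name below are the hypothetical ones from the earlier sketches, not part of the slides:

    def extract(tokens, states, pi, A, B, target_states):
        """Run Viterbi over the whole document and keep tokens tagged with target states."""
        path, _ = viterbi(tokens, states, pi, A, B)
        return [tok for tok, state in zip(tokens, path) if state in target_states]

    # e.g. extract(email_tokens, states, pi, A, B, target_states={"speaker"})
    # would return the tokens the model believes form the speaker name.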
© Daniel S. Weld 34 How Learn HMM? Two questions: structure & parameters
© Daniel S. Weld 35 Simplest Case
- Fix the structure; learn the transition & emission probabilities
- Training data? Label each word as target or non-target
- Challenges: sparse training data; unseen words would get zero probability, hence Smoothing! (see the sketch below)
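A minimal sketch of this supervised case: count labeled transitions and emissions, then smooth. Laplace (add-one) smoothing is used here as one common choice; the slide itself only says "Smoothing!", so the exact scheme, the function name and the variable names are all assumptions:

    from collections import Counter, defaultdict

    def train_supervised(labeled_docs, vocab_size):
        """labeled_docs: list of [(word, state), ...]; states are e.g. 'target'/'non-target'."""
        trans_counts = defaultdict(Counter)
        emit_counts = defaultdict(Counter)
        for doc in labeled_docs:
            # Count state-to-state transitions between adjacent labeled words.
            for (_, s), (_, s_next) in zip(doc, doc[1:]):
                trans_counts[s][s_next] += 1
            # Count which words each state emits.
            for w, s in doc:
                emit_counts[s][w] += 1
        states = list(emit_counts)
        # Add-one smoothed transition probabilities.
        A = {s: {s2: (trans_counts[s][s2] + 1) / (sum(trans_counts[s].values()) + len(states))
                 for s2 in states} for s in states}
        def emit_prob(s, w):
            # Add-one smoothing so unseen words do not get zero probability.
            return (emit_counts[s][w] + 1) / (sum(emit_counts[s].values()) + vocab_size)
        return states, A, emit_prob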
© Daniel S. Weld 36 Problem 3: Learning So far we have assumed that we know the underlying model λ = (A, B, π). Often these parameters are estimated on annotated training data, which has two drawbacks: annotation is difficult and/or expensive, and the training data is different from the current data. We want to find the parameters that maximize the probability of the current data, i.e., we are looking for a model λ' such that λ' = argmax_λ P(O | λ). Slide by Bonnie Dorr
© Daniel S. Weld 37 Problem 3: Learning Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ' such that P(O | λ') ≥ P(O | λ) for every λ. But it is possible to find a local maximum! Given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ). Slide by Bonnie Dorr
© Daniel S. Weld 38 Parameter Re-estimation Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm. Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters. Slide by Bonnie Dorr
© Daniel S. Weld 39 Parameter Re-estimation Three parameters need to be re-estimated:
- Initial state distribution: π_i
- Transition probabilities: a_{i,j}
- Emission probabilities: b_i(o_t)
Slide by Bonnie Dorr
© Daniel S. Weld 40 Re-estimating Transition Probabilities What is the probability of being in state s_i at time t and going to state s_j, given the current model and parameters? ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ) Slide by Bonnie Dorr
© Daniel S. Weld 41 Re-estimating Transition Probabilities ξ_t(i, j) = α_t(i) · a_{i,j} · b_j(o_{t+1}) · β_{t+1}(j) / P(O | λ) Slide by Bonnie Dorr
© Daniel S. Weld 42 Re-estimating Transition Probabilities The intuition behind the re-estimation equation for transition probabilities is: â_{i,j} = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i). Formally: â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} Σ_{j'=1..N} ξ_t(i, j') Slide by Bonnie Dorr
© Daniel S. Weld 43 Re-estimating Transition Probabilities Defining γ_t(i) = Σ_{j=1..N} ξ_t(i, j) as the probability of being in state s_i, given the complete observation O, we can say: â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i) Slide by Bonnie Dorr
© Daniel S. Weld 44 Review of Probabilities
- Forward probability α_t(i): the probability of being in state s_i, given the partial observation o_1, …, o_t
- Backward probability β_t(i): the probability of being in state s_i, given the partial observation o_{t+1}, …, o_T
- Transition probability ξ_t(i, j): the probability of going from state s_i to state s_j, given the complete observation o_1, …, o_T
- State probability γ_t(i): the probability of being in state s_i, given the complete observation o_1, …, o_T
Slide by Bonnie Dorr
© Daniel S. Weld 45 Re-estimating Initial State Probabilities Initial state distribution: π_i is the probability that s_i is a start state. Re-estimation is easy: π̂_i = expected frequency in state s_i at time 1. Formally: π̂_i = γ_1(i). Slide by Bonnie Dorr
© Daniel S. Weld 46 Re-estimation of Emission Probabilities Emission probabilities are re-estimated as: b̂_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i). Formally: b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) · γ_t(i) / Σ_{t=1..T} γ_t(i), where δ(o_t, v_k) = 1 if o_t = v_k and 0 otherwise. Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm! Slide by Bonnie Dorr
© Daniel S. Weld 47 The Updated Model Coming from λ = (A, B, π), we get to λ' = (Â, B̂, π̂) by the following update rules: π̂_i = γ_1(i); â_{i,j} = Σ_{t=1..T-1} ξ_t(i, j) / Σ_{t=1..T-1} γ_t(i); b̂_i(k) = Σ_{t=1..T} δ(o_t, v_k) · γ_t(i) / Σ_{t=1..T} γ_t(i). Slide by Bonnie Dorr
© Daniel S. Weld 48 Expectation Maximization The forward-backward algorithm is an instance of the more general EM algorithm.
The E Step: compute the forward and backward probabilities for a given model.
The M Step: re-estimate the model parameters.
Slide by Bonnie Dorr
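Putting the E step and M step together, one Baum-Welch iteration for a single observation sequence might look like the sketch below. It reuses the hypothetical forward and backward functions from the earlier sketches and, to stay short, re-estimates only the transition matrix; the π and emission updates on the re-estimation slides follow the same pattern:

    def baum_welch_step(obs, states, pi, A, B):
        """One EM iteration: E-step computes alpha/beta, M-step re-estimates A (sketch)."""
        prob_O, alpha = forward(obs, states, pi, A, B)
        beta = backward(obs, states, A, B)
        T = len(obs)
        # E-step: xi_t(i,j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O)
        #         gamma_t(i) = sum_j xi_t(i,j)
        xi_sum = {i: {j: 0.0 for j in states} for i in states}
        gamma_sum = {i: 0.0 for i in states}
        for t in range(T - 1):
            for i in states:
                for j in states:
                    xi = alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / prob_O
                    xi_sum[i][j] += xi
                    gamma_sum[i] += xi
        # M-step: a_ij = expected transitions i->j / expected transitions out of i
        new_A = {i: {j: xi_sum[i][j] / gamma_sum[i] for j in states} for i in states}
        return new_A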
© Daniel S. Weld 49 Importance of HMM Topology
- Certain structures better capture the observed phenomena in the prefix, target and suffix sequences
- Building structures by hand does not scale to large corpora
- Human intuitions don't always correspond to structures that make the best use of HMM potential
Slide by Okan Basegmez
© Daniel S. Weld 50 How Learn Structure?
© Daniel S. Weld 51 Conclusion
- IE is performed by recovering the most likely state sequence (Viterbi)
- Transition and emission parameters can be learned from training data (Baum-Welch)
- Shrinkage improves parameter estimation
- Task-specific state-transition structure can be automatically discovered
Slide by Okan Basegmez
© Daniel S. Weld 52 References
- Information Extraction with HMM Structures Learned by Stochastic Optimization, Dayne Freitag and Andrew McCallum
- Information Extraction with HMMs and Shrinkage, Dayne Freitag and Andrew McCallum
- Learning Hidden Markov Model Structure for Information Extraction, Kristie Seymore, Andrew McCallum, Roni Rosenfeld
- Inducing Probabilistic Grammars by Bayesian Model Merging, Andreas Stolcke, Stephen Omohundro
- Information Extraction using Hidden Markov Models, T. R. Leek, Master's thesis, UCSD
Slide by Okan Basegmez