Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore, Andrew McCallum, & Ronald Rosenfeld
Hidden Markov Model Structures
- Machine learning tool applied to information extraction
- Part-of-speech tagging (Kupiec 1992)
- Topic detection & tracking (Yamron et al. 1998)
- Dialog act modeling (Stolcke, Shriberg, et al. 1998)
HMMs in Information Extraction
- Gene names and locations (Leek 1997)
- Named-entity extraction (Nymble system – Bikel et al. 1997)
- Information extraction strategy:
  - 1 HMM = 1 field
  - 1 state / class
  - Hand-built models using human inspection of the data
HMM Advantages
- Strong statistical foundations
- Used successfully in natural language processing
- Handles new data robustly
- Established training algorithms that are computationally efficient to develop and evaluate
HMM Disadvantages
- Require an a priori notion of model topology
- Need large amounts of training data
Authors’ Contribution
- Automatically determine model structure from data
- One HMM to extract all the information
- Introduce DISTANTLY-LABELED DATA
OUTLINE
- Information extraction basics with HMMs
- Learning model structure from data
- Training data
- Experiment results
- Model selection
- Error breakdown
- Conclusions
- Future work
Information Extraction Basics with HMMs
- OBJECTIVE – label every word of a CS research paper header with its class:
  - Title
  - Author
  - Date
  - Keyword
  - Etc.
- 1 HMM / 1 header (the whole header is one observation sequence)
- The model runs from an initial state to a final state
Discrete-Output, First-Order HMM
- Q – set of states
- q_I – initial state
- q_F – final state
- Σ = {σ_1, σ_2, ..., σ_m} – discrete output vocabulary
- X = x_1 x_2 ... x_l – output string

PROCESS: initial state → transition to a new state → emit an output symbol → transition to another state → emit another output symbol → ... → FINAL STATE

PARAMETERS (see the sketch below):
- P(q → q′) – transition probabilities
- P(q ↑ σ) – emission probabilities
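A minimal sketch of this parameterization (my own toy illustration, not the authors' code; the states, vocabulary, and probabilities below are invented):

```python
# Q: states; q_I / q_F are the distinguished initial and final states.
states = ["q_I", "title", "author", "q_F"]

# Sigma: discrete output vocabulary.
vocab = ["learning", "hidden", "markov", "kristie", "seymore"]

# P(q -> q'): transition probabilities; each row sums to 1.
transitions = {
    "q_I":    {"title": 1.0},
    "title":  {"title": 0.7, "author": 0.2, "q_F": 0.1},
    "author": {"author": 0.6, "q_F": 0.4},
}

# P(q ^ sigma): emission probabilities for each emitting state.
emissions = {
    "title":  {"learning": 0.4, "hidden": 0.3, "markov": 0.3},
    "author": {"kristie": 0.5, "seymore": 0.5},
}
```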
The probability of string x being emitted by an HMM M is computed as a sum over all possible state paths, where q_0 and q_{l+1} are restricted to be q_I and q_F respectively, and x_{l+1} is an end-of-string token; the sum is computed efficiently with the Forward algorithm.
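The equation itself was lost in the slide export; reconstructed in LaTeX from the definitions above (my rendering of the paper's formula):

```latex
P(x \mid M) = \sum_{q_1, \ldots, q_l \in Q^{l}} \;
              \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k)
```

with q_0 = q_I, q_{l+1} = q_F, and x_{l+1} the end-of-string token.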
The output is observable, but the underlying state sequence is HIDDEN
To recover the state sequence V(x|M) that has the highest probability of having produced the observation sequence, the Viterbi algorithm is used.
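A sketch of the standard Viterbi recursion, written against the toy `transitions`/`emissions` tables from the earlier sketch (an illustration, not the authors' implementation):

```python
import math

def viterbi(words, transitions, emissions):
    """Most likely hidden state path for a header word sequence.

    Log-probabilities are used to avoid numeric underflow on long
    headers; states that cannot emit a word are pruned.
    """
    delta, back = {}, []   # delta: best log-prob of a path ending in each state
    for t, w in enumerate(words):
        new_delta, pointers = {}, {}
        for s, emit in emissions.items():
            if w not in emit:
                continue                        # state s cannot emit w
            if t == 0:
                if s in transitions["q_I"]:
                    new_delta[s] = math.log(transitions["q_I"][s]) + math.log(emit[w])
            else:
                best = max(
                    ((score + math.log(transitions[p][s]), p)
                     for p, score in delta.items() if s in transitions[p]),
                    default=None)
                if best is not None:
                    new_delta[s] = best[0] + math.log(emit[w])
                    pointers[s] = best[1]
        delta = new_delta
        back.append(pointers)
    # Close the path into the final state q_F, then backtrack.
    finals = {s: sc + math.log(transitions[s]["q_F"])
              for s, sc in delta.items() if "q_F" in transitions[s]}
    path = [max(finals, key=finals.get)]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Each word is tagged with the class of the state that emitted it:
print(viterbi(["learning", "hidden", "kristie"], transitions, emissions))
# -> ['title', 'title', 'author']
```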
HMM Application
- Each state has a class (e.g., title, author)
- Each word in the header is an observation
- Each state emits words from the header with its associated CLASS TAG (e.g., tagging "learning hidden markov model structure" as title and "kristie seymore" as author)
- The parameters are learned from TRAINING DATA
Learning Model Structure from Data
- Decide on the states and their associated transitions
- Set up labeled training data
- Use MERGE techniques:
  - Neighbor-merging (merge adjacent states with the same label, e.g., linking all adjacent words in a title)
  - V-merging (merge 2 states with the same label and the same transitions, leaving one transition into and out of title)
- Apply Bayesian model merging to maximize result accuracy
Example Hidden Markov Model
Bayesian model merging seeks to find the model structure that maximizes the probability of the model (M) given some training data (D), by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
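A structural sketch of that greedy loop; `model.candidate_merges()`, `model.merged(a, b)`, and `score` are hypothetical stand-ins for the paper's machinery, not its API:

```python
def bayesian_model_merging(model, data, score):
    """Greedily merge states until P(M|D) stops improving.

    `score(model, data)` should behave like log P(D|M) + log P(M),
    where the prior P(M) favors smaller models -- that prior is what
    produces the fit-versus-size tradeoff described above.
    """
    best = score(model, data)
    while True:
        candidates = [(score(model.merged(a, b), data), a, b)
                      for a, b in model.candidate_merges()]
        if not candidates:
            return model
        top, a, b = max(candidates, key=lambda c: c[0])
        if top <= best:          # tradeoff reached: no merge helps
            return model
        model, best = model.merged(a, b), top
```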
Three Types of Training Data
- Labeled data
- Unlabeled data
- Distantly-labeled data
Labeled Data
- Manual and expensive to produce
- Provides counts c() from which the model parameters are estimated
Formulas for deriving the parameters from the counts c():

(4) Transition probabilities:  P(q → q′) = c(q → q′) / Σ_{s∈Q} c(q → s)

(5) Emission probabilities:  P(q ↑ σ) = c(q ↑ σ) / Σ_{ρ∈Σ} c(q ↑ ρ)
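A toy sketch of these ratio-of-counts estimates (my illustration; `tagged_headers`, a list of (word, class) sequences, is a hypothetical input format):

```python
from collections import Counter, defaultdict

def ml_estimates(tagged_headers):
    """Estimate P(q -> q') and P(q ^ sigma) as ratios of counts c()."""
    trans, emit = Counter(), Counter()
    for header in tagged_headers:
        classes = ["q_I"] + [c for _, c in header] + ["q_F"]
        for q, q2 in zip(classes, classes[1:]):
            trans[(q, q2)] += 1                  # c(q -> q')
        for word, c in header:
            emit[(c, word)] += 1                 # c(q ^ sigma)
    # Normalize each count by the total number of events leaving q.
    trans_tot, emit_tot = defaultdict(int), defaultdict(int)
    for (q, _), n in trans.items():
        trans_tot[q] += n
    for (q, _), n in emit.items():
        emit_tot[q] += n
    p_trans = {k: n / trans_tot[k[0]] for k, n in trans.items()}
    p_emit = {k: n / emit_tot[k[0]] for k, n in emit.items()}
    return p_trans, p_emit

# e.g. ml_estimates([[("learning", "title"), ("seymore", "author")]])
```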
Unlabeled Data
- Needs estimated parameters from labeled data as a starting point
- Use the Baum-Welch training algorithm (a sketch follows):
  - An iterative expectation-maximization (EM) algorithm that adjusts the model parameters to locally maximize the likelihood of the unlabeled data
  - Sensitive to the initial parameters
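A compact sketch of one way to implement that EM loop for a discrete HMM (NumPy; not the authors' code): the initial `A`, `B`, `pi` would come from the labeled-data estimates above.

```python
import numpy as np

def baum_welch(obs, A, B, pi, n_iter=10):
    """Re-estimate transition (A: NxN) and emission (B: NxM) matrices to
    locally maximize the likelihood of an unlabeled symbol sequence
    `obs` (1-D int array).  Scaling keeps the passes numerically stable.
    Because EM only finds a local optimum, the result depends heavily on
    the initial parameters -- the sensitivity noted on the slide.
    """
    A, B = A.copy(), B.copy()
    N, T = len(pi), len(obs)
    for _ in range(n_iter):
        # E-step: scaled forward (alpha) and backward (beta) passes.
        alpha, beta, scale = np.zeros((T, N)), np.zeros((T, N)), np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]
        scale[0] = alpha[0].sum()
        alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum()
            alpha[t] /= scale[t]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
        gamma = alpha * beta                      # P(state at t | obs)
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi_sum = np.zeros((N, N))                 # expected transition counts
        for t in range(T - 1):
            xi = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])
            xi_sum += xi / xi.sum()
        # M-step: expected counts play the role of the labeled-data counts.
        A = xi_sum / xi_sum.sum(axis=1, keepdims=True)
        for k in range(B.shape[1]):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= B.sum(axis=1, keepdims=True)
    return A, B
```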
Distantly-Labeled Data
- Data labeled for another purpose
- Partially applicable to this domain for training
- EXAMPLE: for CS research headers – BibTeX bibliographic entries with labeled fields
Experiment Results
- Prepare the text with a computer program (sketched below):
  - Header = from the beginning of the paper to INTRODUCTION or the end of the 1st page
  - Remove punctuation, case, & newlines
  - Insert labels:
    - +ABSTRACT+ – the abstract
    - +INTRO+ – "Introduction"
    - +PAGE+ – end of the 1st page
- Manually label 1,000 headers
  - Minus 65 discarded due to poor formatting
- Derive fixed word vocabularies from the training data
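A rough sketch of that preparation step; the marker handling and regexes are my guesses at the described behavior, not the paper's actual tool:

```python
import re

def prepare_header(first_page):
    """Cut the header out of a paper's first page and normalize it.

    The header runs from the start of the text to "Introduction" (or
    the end of the first page); case, punctuation, and newlines are
    stripped, and +TOKENS+ mark the abstract and the header boundary.
    """
    parts = re.split(r"(?i)\bintroduction\b", first_page, maxsplit=1)
    header, found_intro = parts[0], len(parts) > 1
    header = header.replace("\n", " ").lower()           # newlines & case
    header = re.sub(r"\babstract\b", "+ABSTRACT+", header)
    header = re.sub(r"[^\w\s+]", " ", header)            # punctuation
    return header + (" +INTRO+" if found_intro else " +PAGE+")
```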
Sources & Amounts of Training Data
Model Selection
- All models use one state per class
- MODEL 1 – fully connected HMM with uniform transition estimates between states
- MODEL 2 – maximum-likelihood self-transition estimates, with the other transitions uniform
- MODEL 3 – maximum-likelihood estimates for all transitions; used as the BASELINE HMM
- MODEL 4 – adds smoothing, so no transition probability is zero
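The slide does not say which smoothing method Model 4 uses, so here is one standard option (additive, add-λ smoothing) that guarantees the "no zero results" property; the paper's actual choice may differ:

```python
def smoothed_prob(count, total, n_outcomes, lam=0.001):
    """Additive (add-lambda) smoothing: every outcome keeps a small
    nonzero probability even when its count c() is zero.
    """
    return (count + lam) / (total + lam * n_outcomes)

# An unseen transition no longer gets probability 0:
# smoothed_prob(0, total=50, n_outcomes=10) -> 0.001 / 50.01 ~ 2e-5
```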
ACCURACY OF MODELS (by % word classification accuracy)
- L – labeled data
- L+D – labeled and distantly-labeled data
Multiple states / class – state structures built by hand from distantly-labeled data vs. derived automatically from distantly-labeled data
Compared the BASELINE, best MULTI-STATE, and V-MERGED models
UNLABELED DATA & TRAINING INITIAL L + D + U λ = each emission distribution λ varies optimum distribution PP includes smoothing
Error Breakdown
- Errors broken down by CLASS TAG
- BOLD – tags that come from distantly-labeled data
Conclusions
- The approach works for research paper headers
- Improvement factors:
  - Multi-state classes
  - Distantly-labeled data (≈10% error reduction)
  - Distantly-labeled data can reduce the amount of labeled data needed
Future Work
- Use Bayesian model merging to completely automate model learning
- Also describe layout by position on the page
- Model internal state structure
Model of Internal State Structure
- First 2 words – explicit states
- Multiple affiliations possible
- Last 2 words – explicit states
My Assessment
- Highly mathematical and complex
- Even the unlabeled data is in a preset order
- The model requires substantial work to set up the training data
- A change in the target data will completely change the model
- Valuable experiments on how heuristics and smoothing affect the results
- Wish they had included a sample 1st page
QUESTIONS