1
Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore, Andrew McCallum, & Ronald Rosenfeld
2
Hidden Markov Model Structures
- Machine learning tool applied to Information Extraction
- Part of speech tagging (Kupiec 1992)
- Topic detection & tracking (Yamron et al. 1998)
- Dialog act modeling (Stolcke, Shriberg, & others 1998)
3
HMM in Information Extraction
- Gene names and locations (Leek 1997)
- Named-entity extraction (Nymble system – Bikel et al. 1997)
- Information Extraction Strategy
  - 1 HMM = 1 field
  - 1 state / class
  - Hand-built models using human data inspection
4
HMM Advantages
- Strong statistical foundations
- Used widely in natural language processing
- Handles new data robustly
- Uses established training algorithms which are computationally efficient to develop and evaluate
5
HMM Disadvantages
- Require an a priori notion of model topology
- Need large amounts of training data
6
Authors’ Contribution
- Automatically determined model structure from data
- One HMM to extract all information
- Introduced DISTANTLY-LABELED DATA
7
OUTLINE
- Information Extraction basics with HMM
- Learning model structure from data
- Training data
- Experiment results
- Model selection
- Error breakdown
- Conclusions
- Future work
8
Information Extraction basics with HMM
- OBJECTIVE – to tag every word of a CS research paper header with its class
  - Title
  - Author
  - Date
  - Keyword
  - Etc.
- 1 HMM / 1 header (one model covers the entire header)
- Each header is modeled as a path from the initial state to the final state
9
Discrete output, First-order HMM
- Q – set of states
- q_I – initial state
- q_F – final state
- ∑ = {σ_1, σ_2, ..., σ_m} – discrete output vocabulary
- X = x_1 x_2 ... x_l – output string
PROCESS: start in the initial state, transition to a new state, emit an output symbol, transition to another state, emit another output symbol, ... until the FINAL STATE is reached
PARAMETERS
- P(q -> q') – transition probabilities
- P(q ↑ σ) – emission probabilities
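A minimal sketch of this setup (my own illustration, not code from the paper): the transition and emission probabilities stored as nested dictionaries, plus the generative process described above. State and symbol names are assumptions for illustration.

```python
import random

# A minimal sketch (not from the paper) of a discrete-output, first-order HMM.
# Parameters live in nested dicts; state and symbol names are illustrative.
class DiscreteHMM:
    def __init__(self, trans, emit, init_state="INIT", final_state="FINAL"):
        self.trans = trans              # P(q -> q'): trans[q][q']
        self.emit = emit                # P(q "emits" sigma): emit[q][sigma]
        self.init_state = init_state    # q_I
        self.final_state = final_state  # q_F

    def sample(self):
        """The generative process from the slide: start in q_I, move to a new
        state, emit a symbol, and repeat until q_F is reached."""
        output, state = [], self.init_state
        while True:
            next_states, probs = zip(*self.trans[state].items())
            state = random.choices(next_states, weights=probs)[0]
            if state == self.final_state:
                return output
            symbols, probs = zip(*self.emit[state].items())
            output.append(random.choices(symbols, weights=probs)[0])

# Tiny example: a two-class header model (title words, then author words).
hmm = DiscreteHMM(
    trans={"INIT": {"title": 1.0},
           "title": {"title": 0.6, "author": 0.4},
           "author": {"author": 0.5, "FINAL": 0.5}},
    emit={"title": {"learning": 0.5, "hidden": 0.5},
          "author": {"seymore": 0.5, "mccallum": 0.5}})
print(hmm.sample())
```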
10
The probability of string x being emitted by an HMM M is computed as a sum over all possible paths, where q_0 and q_{l+1} are restricted to be q_I and q_F respectively, and x_{l+1} is an end-of-string token (computed efficiently with the Forward algorithm)
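A reconstruction of that summed-path probability in the notation of the previous slide (my rendering of the equation the slide refers to, not a verbatim quote):

```latex
P(x \mid M) \;=\; \sum_{q_1, \ldots, q_l} \; \prod_{k=1}^{l+1}
  P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k),
\qquad q_0 = q_I,\; q_{l+1} = q_F
```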
11
The output is observable, but the underlying state sequence is HIDDEN
12
To recover the state sequence V(x|M) that has the highest probability of having produced the observation sequence, the Viterbi algorithm is used
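A compact Viterbi sketch (my own illustration, not the authors' code), reusing the DiscreteHMM layout from the earlier sketch:

```python
import math

def viterbi(hmm, x):
    """Most likely state sequence for output string x under a DiscreteHMM
    (hmm.trans[q][q'], hmm.emit[q][symbol]); a sketch, not optimized."""
    delta = {hmm.init_state: 0.0}   # best log-probability of reaching each state
    back = []                       # back[t][q] = best predecessor of q at step t
    for symbol in x:
        new_delta, pointers = {}, {}
        for q, score in delta.items():
            for q2, p_trans in hmm.trans.get(q, {}).items():
                p_emit = hmm.emit.get(q2, {}).get(symbol, 0.0)
                if p_trans == 0.0 or p_emit == 0.0:
                    continue
                cand = score + math.log(p_trans) + math.log(p_emit)
                if cand > new_delta.get(q2, float("-inf")):
                    new_delta[q2], pointers[q2] = cand, q
        delta = new_delta
        back.append(pointers)
    # finish with the transition into the final state q_F
    best_q = max(
        (q for q in delta if hmm.trans.get(q, {}).get(hmm.final_state, 0.0) > 0.0),
        key=lambda q: delta[q] + math.log(hmm.trans[q][hmm.final_state]),
    )
    path = [best_q]
    for pointers in reversed(back):   # trace predecessors back to q_I
        path.append(pointers[path[-1]])
    return path[::-1][1:]             # drop q_I, return q_1 ... q_l

# Example, continuing the DiscreteHMM sketch above:
# viterbi(hmm, ["learning", "hidden", "seymore"]) -> ["title", "title", "author"]
```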
13
HMM application
- Each state has a class (e.g. title, author)
- Each word in the header is an observation
- Each state emits words from the header with its associated CLASS TAG
- This is learned from TRAINING DATA
14
Learning model structure from data
- Decide on states and the transitions between them
- Set up labeled training data
- Use MERGE techniques
  - Neighbor merging – collapse adjacent states that share a label (e.g. link all adjacent words of a title); see the sketch below
  - V-merging – merge 2 states with the same label and shared transitions (one transition into the title states and one out)
- Apply Bayesian model merging to maximize result accuracy
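A rough sketch of neighbor merging (my own illustration, under the assumption that each header starts as one state per word; not the authors' code):

```python
def neighbor_merge(labeled_header):
    """labeled_header: list of (word, label) pairs for one header.
    Collapses runs of identically labeled words into single states,
    each with an implied self-transition count."""
    merged = []
    for word, label in labeled_header:
        if merged and merged[-1][0] == label:
            merged[-1][1].append(word)         # absorb into the previous state
            merged[-1][2] += 1                 # one more self-transition
        else:
            merged.append([label, [word], 0])  # new state for this label
    return [tuple(state) for state in merged]

# Three adjacent title words collapse into one "title" state with two
# self-transitions, followed by a single "author" state.
print(neighbor_merge([("learning", "title"), ("hidden", "title"),
                      ("markov", "title"), ("seymore", "author")]))
# -> [('title', ['learning', 'hidden', 'markov'], 2), ('author', ['seymore'], 0)]
```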
15
Example Hidden Markov Model
16
Bayesian model merging seeks to find the model structure that maximizes the probability of the model (M) given some training data (D), by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
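The objective implied here is the standard Bayesian one (my rendering, not quoted from the slides): pick the model structure that maximizes P(M|D), trading the prior's preference for smaller models against fit to the data.

```latex
M^{*} \;=\; \arg\max_{M} P(M \mid D)
      \;=\; \arg\max_{M} \frac{P(M)\, P(D \mid M)}{P(D)}
      \;=\; \arg\max_{M} P(M)\, P(D \mid M)
```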
17
Three types of training data
- Labeled data
- Unlabeled data
- Distantly-labeled data
18
Labeled data
- Manual and expensive to produce
- Provides the counts c() used to estimate model parameters
19
Formulas for deriving parameters using the counts c(): (4) transition probabilities, (5) emission probabilities
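A reconstruction of the count-based maximum-likelihood estimates those formula numbers refer to (my reading in the notation of the earlier slides, not a verbatim quote):

```latex
P(q \rightarrow q') \;=\; \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}
\qquad\qquad
P(q \uparrow \sigma) \;=\; \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}
```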
20
Unlabeled Data
- Needs estimated parameters from labeled data as a starting point
- Use the Baum-Welch training algorithm
  - Iterative expectation-maximization algorithm which adjusts the model parameters to locally maximize the likelihood of the unlabeled data
  - Sensitive to initial parameters
21
Distantly-labeled data
- Data labeled for another purpose
- Partially applicable to this domain for training
- EXAMPLE – for CS research paper headers, BibTeX bibliographic entries with labeled fields
22
Experiment results
- Prepare text using a computer program
  - Header – from the beginning of the document to INTRODUCTION or the end of the 1st page
  - Remove punctuation, case, & newlines
  - Labels
    - +ABSTRACT+ – abstract
    - +INTRO+ – introduction
    - +PAGE+ – end of 1st page
- Manually label 1000 headers
  - Minus 65 discarded due to poor format
- Derive fixed word vocabularies from the training data
23
Sources & Amounts of Training Data
24
Model selection
- MODELS 1-4 – 1 state / class
- MODEL 1 – fully connected HMM with uniform transition estimates between states
- MODEL 2 – maximum likelihood estimates for observed transitions, others uniform
- MODEL 3 – maximum likelihood estimates for all transitions (BASELINE HMM model)
- MODEL 4 – adds smoothing, so no transition probability is zero
25
ACCURACY OF MODELS (% word classification accuracy)
- L – labeled data
- L+D – labeled and distantly-labeled data
26
Multiple states / class – multi-state models built by hand and derived automatically, both using distantly-labeled data
27
Compared the BASELINE model to the best MULTI-STATE and V-MERGED models
28
UNLABELED DATA & TRAINING
- INITIAL – L + D + U (labeled, distantly-labeled, and unlabeled data)
- λ = 0.5 – each emission distribution weighted 0.5
- λ varies – optimum distribution weights
- PP – includes smoothing
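My reading of the λ weighting referred to above (an assumption, not a formula quoted from the slides): the emission distribution used in training interpolates two estimates with weight λ, with λ = 0.5 weighing them equally.

```latex
P(\sigma \mid q) \;=\; \lambda\, P_{1}(\sigma \mid q) \;+\; (1 - \lambda)\, P_{2}(\sigma \mid q),
\qquad 0 \le \lambda \le 1
```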
29
Error breakdown
- Errors by CLASS TAG
- BOLD – distantly-labeled data tags
31
Conclusions
- HMM extraction of research paper headers works well
- Improvement factors
  - Multi-state classes
  - Distantly-labeled data (10%)
  - Distantly-labeled data can reduce the amount of labeled data needed
32
Future work
- Use Bayesian model merging to completely automate model learning
- Also describe layout by position on the page
- Model internal state structure
33
Model of Internal State Structure
- First 2 words – explicit
- Multiple affiliations possible
- Last 2 words – explicit
34
My Assessment
- Highly mathematical and complex
- Even unlabeled data is in a preset order
- Model requires work setting up training data
- A change in target data will completely change the model
- Valuable experiments with heuristics and smoothing impacting results
- Wish they had included a sample 1st page
35
QUESTIONS