Learning Hidden Markov Model Structure for Information Extraction. Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld.


1 Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld

2 Hidden Markov Model Structures
- Machine learning tool applied to Information Extraction
- Part-of-speech tagging (Kupiec 1992)
- Topic detection and tracking (Yamron et al. 1998)
- Dialog act modeling (Stolcke, Shriberg, et al. 1998)

3 HMM in Information Extraction
- Gene names and locations (Leek 1997)
- Named-entity extraction (Nymble system; Freitag & McCallum 1999)
- Typical Information Extraction strategy:
  - one HMM per field
  - one state per class
  - models built by hand after human inspection of the data

4 HMM Advantages
- Strong statistical foundations
- Used successfully in natural language processing
- Handles new data robustly
- Established training algorithms that are computationally efficient to develop and evaluate

5 HMM Disadvantages
- Require an a priori notion of model topology
- Need large amounts of training data

6 Authors’ Contribution
- Model structure determined automatically from data
- One HMM extracts all the fields
- Introduced DISTANTLY-LABELED DATA

7 OUTLINE
- Information Extraction basics with HMMs
- Learning model structure from data
- Training data
- Experiment results
- Model selection
- Error breakdown
- Conclusions
- Future work

8 Information Extraction basics with HMMs
- OBJECTIVE: label every word of a CS research paper header with its class
  - Title, Author, Date, Keyword, etc.
- One HMM per header
- The model runs from an initial state to a final state

9 Discrete output, first-order HMM
- Q: set of states
- q_I: initial state
- q_F: final state
- Σ = {σ_1, σ_2, ..., σ_m}: discrete output vocabulary
- X = x_1 x_2 ... x_l: output string
PROCESS: start in the initial state, transition to a new state and emit an output symbol, transition to another state and emit another output symbol, and so on until the FINAL STATE is reached (a toy sketch of this generative process follows).
PARAMETERS:
- P(q -> q'): transition probabilities
- P(q ↑ σ): emission probabilities
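
A minimal sketch of this generative process in Python. The state names, toy probabilities, and sampling helper below are illustrative assumptions, not the authors' model:

```python
import random

# Toy header HMM: non-emitting initial/final states q_I / q_F,
# transition probabilities P(q -> q'), emission probabilities P(q ^ sigma).
transitions = {
    "q_I":    {"title": 1.0},
    "title":  {"title": 0.6, "author": 0.4},
    "author": {"author": 0.5, "q_F": 0.5},
}
emissions = {
    "title":  {"learning": 0.5, "hmm": 0.5},
    "author": {"kristie": 0.5, "seymore": 0.5},
}

def sample(max_len=20):
    """Walk from q_I toward q_F, emitting one word per emitting state visited."""
    state, output = "q_I", []
    while state != "q_F" and len(output) < max_len:
        nxt = transitions[state]
        state = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if state != "q_F":
            words = emissions[state]
            word = random.choices(list(words), weights=list(words.values()))[0]
            output.append((state, word))
    return output

print(sample())   # e.g. [('title', 'learning'), ('title', 'hmm'), ('author', 'kristie')]
```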

10 The probability of string x being emitted by an HMM M is computed as a sum over all possible paths, where q_0 and q_{l+1} are restricted to be q_I and q_F respectively, and x_{l+1} is an end-of-string token (uses the Forward algorithm).
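
The equation itself did not survive in this transcript. Reconstructed from the notation defined on slide 9 (my reconstruction, so treat the exact indexing as an assumption), it reads:

```latex
P(x \mid M) \;=\; \sum_{q_1, \ldots, q_l \in Q^{l}}
  \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k),
\qquad q_0 = q_I,\; q_{l+1} = q_F .
```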

11 The output is observable, but the underlying state sequence is HIDDEN

12 To recover the state sequence V(x|M) that has the highest probability of having produced the observation sequence, the Viterbi algorithm is used.
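
A compact Viterbi sketch, again an illustrative assumption rather than the authors' code, using the same kind of toy tables as the sketch on slide 9:

```python
# Toy parameters: transition probabilities P(q -> q') and emissions P(q ^ sigma).
transitions = {"q_I": {"title": 1.0},
               "title": {"title": 0.6, "author": 0.4},
               "author": {"author": 0.5, "q_F": 0.5}}
emissions = {"title": {"learning": 0.5, "hmm": 0.5},
             "author": {"kristie": 0.5, "seymore": 0.5}}

def viterbi(words, start="q_I", end="q_F"):
    """Return (probability, state path) of the most likely hidden state sequence."""
    states = list(emissions)
    # best (probability, path) ending in each state after the first word
    best = {s: (transitions[start].get(s, 0.0) * emissions[s].get(words[0], 0.0), [s])
            for s in states}
    for word in words[1:]:
        best = {s: max(((p * transitions[prev].get(s, 0.0) * emissions[s].get(word, 0.0),
                         path + [s])
                        for prev, (p, path) in best.items()), key=lambda t: t[0])
                for s in states}
    # fold in the transition into the non-emitting final state
    return max(((p * transitions[s].get(end, 0.0), path)
                for s, (p, path) in best.items()), key=lambda t: t[0])

print(viterbi(["learning", "hmm", "kristie"]))
# -> (0.015, ['title', 'title', 'author'])
```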

13 HMM application
- Each state has a class (e.g., title, author)
- Each word in the header is an observation
- Each state emits header words tagged with its CLASS TAG
- These associations are learned from TRAINING DATA

14 Learning model structure from data
- Decide on the states and the transitions between them
- Set up labeled training data
- Use MERGE techniques (sketched below):
  - Neighbor merging: collapse adjacent states that share a label (e.g., link all adjacent words of a title into one title state)
  - V-merging: merge two states that have the same label and share transitions (leaving a single transition into and out of the title)
- Apply Bayesian model merging to select the structure that maximizes the probability of the model given the data
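
A rough sketch of neighbor merging (my assumption of the procedure, not the authors' code): starting from one state per word of a labeled header, adjacent states that share a label are collapsed into a single state with a self-transition, so a run of title words becomes one looping title state.

```python
from itertools import groupby

def neighbor_merge(labeled_header):
    """labeled_header: list of (word, label) pairs in reading order."""
    states = []                          # one merged state per run of equal labels
    for label, run in groupby(labeled_header, key=lambda pair: pair[1]):
        words = [w for w, _ in run]
        states.append({"label": label,
                       "emits": words,
                       "self_loops": len(words) - 1})
    return states

header = [("learning", "title"), ("hidden", "title"), ("markov", "title"),
          ("kristie", "author"), ("seymore", "author")]
for state in neighbor_merge(header):
    print(state)
```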

15 Example Hidden Markov Model

16 Bayesian model merging seeks the model structure that maximizes the probability of the model M given the training data D, iteratively merging states until an optimal tradeoff between fit to the data and model size is reached.
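
In symbols (a standard statement of this objective, not copied from the slide): by Bayes' rule the structure search maximizes

```latex
M^{*} \;=\; \arg\max_{M} P(M \mid D)
      \;=\; \arg\max_{M} \frac{P(D \mid M)\, P(M)}{P(D)}
      \;=\; \arg\max_{M} P(D \mid M)\, P(M),
```

where P(D|M) rewards fit to the training data and the prior P(M) penalizes large models, giving the size-versus-fit tradeoff described above.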

17 Three types of training data
- Labeled data
- Unlabeled data
- Distantly-labeled data

18 Labeled data
- Labeled manually, so it is expensive to produce
- Provides counts c() from which the model parameters are estimated

19 Formulas for deriving parameters from the counts c(): (4) transition probabilities and (5) emission probabilities (reconstructed below).
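
The two formulas are images in the original slides and are missing from this transcript. The standard ratio-of-counts (maximum-likelihood) estimates they correspond to are, up to notation, my reconstruction:

```latex
% (4) transition probabilities
P(q \rightarrow q') \;=\; \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}
\qquad\qquad
% (5) emission probabilities
P(q \uparrow \sigma) \;=\; \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}
```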

20 Unlabeled Data
- Needs initial parameter estimates from labeled data
- Trained with the Baum-Welch algorithm:
  - an iterative expectation-maximization (EM) procedure that adjusts the model parameters to locally maximize the likelihood of the unlabeled data
  - sensitive to the initial parameters

21 Distantly-labeled data
- Data labeled for another purpose
- Can be partially applied to this domain for training
- EXAMPLE: labeled BibTeX bibliographic citations share many fields with CS research paper headers

22 Experiment results
- Prepare the text with a computer program (see the preprocessing sketch below):
  - Header = everything from the beginning of the paper to INTRODUCTION or the end of the 1st page
  - Remove punctuation, case, and newlines
  - Insert labels: +ABSTRACT+ (abstract), +INTRO+ (introduction), +PAGE+ (end of 1st page)
- Manually label 1000 headers (65 discarded due to poor formatting)
- Derive fixed word vocabularies from the training data
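
A hedged sketch of that preparation step; the exact tool and token conventions used by the authors are not given here, so the regexes and marker handling below are assumptions:

```python
import re

def prepare_header(text):
    """Return the tokenized header: text up to 'Introduction', lowercased,
    punctuation and newlines stripped, with a +INTRO+ marker appended."""
    head = re.split(r"(?i)\bintroduction\b", text, maxsplit=1)[0]
    head = head.lower()
    head = re.sub(r"\babstract\b", " +ABSTRACT+ ", head)   # label the abstract heading
    head = re.sub(r"[^\w+ ]", " ", head)                   # drop punctuation and newlines
    return head.split() + ["+INTRO+"]

print(prepare_header("Learning HMM Structure\nKristie Seymore\nAbstract: We...\n1 Introduction ..."))
# -> ['learning', 'hmm', 'structure', 'kristie', 'seymore', '+ABSTRACT+', 'we', '1', '+INTRO+']
```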

23 Sources & Amounts of Training Data

24 Model selection
- MODELS 1-4: one state per class
- MODEL 1: fully connected HMM with uniform transition estimates between states
- MODEL 2: maximum-likelihood estimates for self-transitions, uniform estimates for the others
- MODEL 3: maximum-likelihood estimates for all transitions; used as the BASELINE HMM
- MODEL 4: adds smoothing so that no probability is zero

25 ACCURACY OF MODELS (% word classification accuracy)
L = labeled data; L+D = labeled plus distantly-labeled data

26 Multiple states per class: hand-selected vs. automatically selected states, with distantly-labeled data

27 Comparison of the BASELINE, best MULTI-STATE, and V-MERGED models

28 UNLABELED DATA & TRAINING
(Results table: initial parameters from L + D + U; λ = 0.5 for each emission distribution vs. λ varied to its optimum per distribution; PP includes smoothing.)

29 Error breakdown
- Errors by CLASS TAG
- BOLD = tags covered by distantly-labeled data

30

31 Conclusions
- The approach works for research paper headers
- Factors that improve accuracy:
  - multiple states per class
  - distantly-labeled data (roughly a 10% improvement)
- Distantly-labeled data can reduce the amount of labeled data needed

32 Future work
- Use Bayesian model merging to completely automate model structure learning
- Also describe layout by position on the page
- Model internal state structure

33 Model of Internal State Structure
- First two words: explicit states
- Multiple affiliations possible
- Last two words: explicit states

34 My Assessment
- Highly mathematical and complex
- Even the unlabeled data is in a preset order
- The model requires substantial work to set up the training data
- A change in the target data would completely change the model
- Valuable experiments showing how heuristics and smoothing affect results
- I wish the authors had included a sample first page

35 QUESTIONS

