Learning Hidden Markov Model Structure for Information Extraction. Kristie Seymore, Andrew McCallum, & Ronald Rosenfeld.

Learning Hidden Markov Model Structure for Information Extraction
Kristie Seymore, Andrew McCallum, & Ronald Rosenfeld

Hidden Markov Model Structures
A machine learning tool applied to information extraction:
- Part-of-speech tagging (Kupiec 1992)
- Topic detection & tracking (Yamron et al. 1998)
- Dialog act modeling (Stolcke, Shriberg, et al. 1998)

HMM in Information Extraction
- Gene names and locations (Leek 1997)
- Named-entity extraction (the Nymble system, Bikel et al. 1997)
Typical information extraction strategy:
- 1 HMM = 1 field
- 1 state per class
- Hand-built models based on human inspection of the data

HMM Advantages
- Strong statistical foundations
- Used widely in natural language processing
- Handles new data robustly
- Established training algorithms that are computationally efficient to run and evaluate

HMM Disadvantages
- Require an a priori notion of the model topology
- Need large amounts of training data

Authors' Contribution
- Model structure determined automatically from data
- One HMM to extract all fields
- Introduced distantly-labeled data

OUTLINE
- Information extraction basics with HMMs
- Learning model structure from data
- Training data
- Experiment results
- Model selection
- Error breakdown
- Conclusions
- Future work

Information Extraction basics with HMM
Objective: label every word of a CS research paper header with its class:
- Title
- Author
- Date
- Keyword
- Etc.
1 HMM per header, running from an initial state to a final state

Discrete output, first-order HMM
- Q – set of states
- q_I – initial state
- q_F – final state
- Σ = {σ_1, σ_2, ..., σ_m} – discrete output vocabulary
- X = x_1 x_2 ... x_l – output string
Process: start in the initial state, transition to a new state and emit an output symbol, transition to another state and emit another symbol, ..., until the final state is reached.
Parameters:
- P(q → q′) – transition probabilities
- P(q ↑ σ) – emission probabilities

The probability of string x being emitted by an HMM M is computed as a sum over all possible paths, where q_0 and q_{l+1} are restricted to be q_I and q_F respectively, and x_{l+1} is an end-of-string token (computed efficiently with the Forward algorithm).
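The formula itself did not survive the transcript; the following is a reconstruction consistent with the notation and constraints stated above (a sketch, not necessarily the paper's exact typesetting):

```latex
P(x \mid M) \;=\; \sum_{q_1, \ldots, q_l \in Q^{l}} \; \prod_{k=1}^{l+1} P(q_{k-1} \rightarrow q_k)\, P(q_k \uparrow x_k),
\qquad q_0 = q_I, \quad q_{l+1} = q_F
```

where x_{l+1} is the end-of-string token emitted by the final state.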

The output is observable, but the underlying state sequence is HIDDEN

To recover the state sequence V(x|M) that has the highest probability of having produced the observation sequence, the Viterbi algorithm is used.
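Below is a minimal sketch of Viterbi decoding for a discrete first-order HMM like the one defined above. The dictionary-based model representation (trans, emit), the INIT/FINAL state names, and the 1e-12 floor for unseen events are illustrative assumptions, not the authors' implementation.

```python
import math

def viterbi(words, states, trans, emit, start="INIT", end="FINAL"):
    """Most likely state sequence for `words` under a discrete first-order HMM.
    states: list of state names; trans[q][q2] = P(q -> q2); emit[q][w] = P(q emits w).
    Unseen events get a small floor probability purely for illustration."""
    logp = lambda p: math.log(p if p > 0 else 1e-12)
    # delta[q] = best log-probability of any path ending in state q after the current word
    delta = {q: logp(trans[start].get(q, 0)) + logp(emit[q].get(words[0], 0)) for q in states}
    back = []  # back[i][q] = best predecessor of state q at position i+1
    for w in words[1:]:
        new_delta, ptr = {}, {}
        for q in states:
            prev, score = max(((p, delta[p] + logp(trans[p].get(q, 0))) for p in states),
                              key=lambda t: t[1])
            new_delta[q], ptr[q] = score + logp(emit[q].get(w, 0)), prev
        delta, back = new_delta, back + [ptr]
    # close the path with the transition into the final state, then trace back
    last = max(states, key=lambda q: delta[q] + logp(trans[q].get(end, 0)))
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Reading off the class label attached to each decoded state gives the word tagging used in the application described on the next slide.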

HMM application
- Each state has a class (e.g. title, author)
- Each word in the header is an observation
- Each state emits header words carrying that state's class tag
- The model is learned from training data

Learning model structure from data
- Decide on the states and the allowed transitions between them
- Set up labeled training data
- Use merging techniques:
  - Neighbor merging – link all adjacent words with the same label (e.g. in a title); see the sketch after this slide
  - V-merging – collapse two states with the same label and shared transitions (so there is one transition into the title and one out)
- Apply Bayesian model merging to maximize the accuracy of the result
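A minimal sketch of the neighbor-merging idea described above: adjacent words that share a label are assigned to a single state with a self-transition. The (word, label) pair representation and the dictionary fields are assumptions made for illustration, not the authors' code.

```python
def neighbor_merge(labeled_header):
    """Collapse runs of identically-labeled adjacent words into one state each.
    labeled_header: list of (word, label) pairs for a single header."""
    states = []
    for word, label in labeled_header:
        if states and states[-1]["label"] == label:
            states[-1]["emissions"].append(word)  # the same state also emits this word
            states[-1]["self_loops"] += 1         # one more self-transition observed
        else:
            states.append({"label": label, "emissions": [word], "self_loops": 0})
    return states

# e.g. a three-word title followed by two author words yields two states:
# a 'title' state with two self-loops and an 'author' state with one self-loop.
```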

Example Hidden Markov Model

Bayesian model merging seeks to find the model structure that maximizes the probability of the model (M) given some training data (D), by iteratively merging states until an optimal tradeoff between fit to the data and model size has been reached
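In symbols (a standard statement of the objective just described; the specific model prior the authors used is not shown here):

```latex
M^{*} \;=\; \operatorname*{argmax}_{M} P(M \mid D) \;=\; \operatorname*{argmax}_{M} P(D \mid M)\, P(M)
```

where the prior P(M) favors smaller models and the likelihood P(D|M) measures fit to the training data; states are merged iteratively as long as this posterior improves.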

Three types of training data
- Labeled data
- Unlabeled data
- Distantly-labeled data

Labeled data
- Manual and expensive to produce
- Provides the counts c() from which the model parameters are estimated

Formulas for deriving the parameters from the counts c(): (4) transition probabilities and (5) emission probabilities.
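The two formulas were images in the original slides; they are the standard ratio-of-counts (maximum likelihood) estimates implied by the count function c() above. This reconstruction omits whatever smoothing the authors applied:

```latex
P(q \rightarrow q') \;=\; \frac{c(q \rightarrow q')}{\sum_{s \in Q} c(q \rightarrow s)}
\qquad\qquad
P(q \uparrow \sigma) \;=\; \frac{c(q \uparrow \sigma)}{\sum_{\rho \in \Sigma} c(q \uparrow \rho)}
```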

Unlabeled Data
- Needs initial parameter estimates from labeled data
- Trained with the Baum-Welch algorithm:
  - An iterative expectation-maximization (EM) algorithm that adjusts the model parameters to locally maximize the likelihood of the unlabeled data
  - Sensitive to the initial parameters
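Conceptually, Baum-Welch replaces the observed counts in formulas (4) and (5) with expected counts under the current model. A standard statement of the update (not the authors' exact notation), for an unlabeled set U:

```latex
\hat{c}(q \rightarrow q') = \sum_{x \in U} \sum_{k} P(q_k = q,\; q_{k+1} = q' \mid x, M),
\qquad
\hat{c}(q \uparrow \sigma) = \sum_{x \in U} \sum_{k:\, x_k = \sigma} P(q_k = q \mid x, M)
```

The posteriors are computed with the Forward-Backward algorithm, the expected counts are renormalized as in (4) and (5), and the process repeats until it converges to a local maximum.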

Distantly-labeled data
- Data labeled for another purpose
- Partially applicable to this domain for training
- Example: labeled BibTeX bibliographic citations reused for CS research paper headers

Experiment results
- Prepare the text with a computer program:
  - Header = from the beginning of the paper to "Introduction" or the end of the first page
  - Remove punctuation, case, and newlines
  - Replace section markers with tokens:
    - +ABSTRACT+ for "Abstract"
    - +INTRO+ for "Introduction"
    - +PAGE+ for the end of the first page
- Manually label 1000 headers (65 discarded due to poor formatting)
- Derive fixed word vocabularies from the training data
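A minimal sketch of the header preprocessing the slide describes (lowercasing, punctuation removal, and the three marker tokens). The regular expressions and the function name are illustrative assumptions, not the authors' program.

```python
import re

def preprocess_header(first_page_text):
    """Normalize a paper header as described on the slide: keep text up to
    'Introduction' (or the end of the first page), lowercase, strip punctuation,
    collapse newlines, and insert the +ABSTRACT+/+INTRO+/+PAGE+ marker tokens."""
    text = first_page_text.lower()
    intro = re.search(r"\bintroduction\b", text)
    if intro:
        text = text[:intro.start()] + " +INTRO+ "   # header ends at the introduction
    else:
        text = text + " +PAGE+ "                    # otherwise at the end of page 1
    text = re.sub(r"\babstract\b", " +ABSTRACT+ ", text, count=1)
    text = re.sub(r"[^\w+]+", " ", text)            # drop punctuation; newlines become spaces
    return text.split()                             # whitespace-separated word tokens
```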

Sources & Amounts of Training Data

Model selection (models with one state per class)
- Model 1 – fully connected HMM with uniform transition estimates between states
- Model 2 – maximum likelihood transition estimates where available, with the others uniform
- Model 3 – all transitions set to maximum likelihood estimates; used as the BASELINE model
- Model 4 – adds smoothing, so that no transition probability is zero

ACCURACY OF MODELS (by % word classification accuracy). Results table omitted; columns: L = labeled data, L+D = labeled and distantly-labeled data.

Multiple states per class: structures set by hand vs. derived automatically, both using distantly-labeled data.

Comparison of the BASELINE model with the best MULTI-STATE and V-MERGED models.

UNLABELED DATA & TRAINING. Results table omitted; conditions compared: Initial, L + D + U, λ the same for each emission distribution, λ varied to its optimum per distribution; PP includes smoothing.

Error breakdown
- Errors by class tag
- BOLD marks tags that have distantly-labeled data

Conclusions
- HMMs work well on research paper headers
- Factors that improve accuracy:
  - Multi-state classes
  - Distantly-labeled data (about 10%)
  - Distantly-labeled data can reduce the amount of labeled data needed

Future work
- Use Bayesian model merging to completely automate model structure learning
- Also describe layout by position on the page
- Model internal state structure

Model of Internal State Structure. Example figure omitted; the first two words and the last two words are modeled with explicit states, and multiple affiliations are possible.

My Assessment
- Highly mathematical and complex
- Even the unlabeled data is in a preset order
- The model requires considerable work to set up the training data
- A change in the target data will completely change the model
- Valuable experiments on how heuristics and smoothing affect the results
- I wish they had included a sample first page

QUESTIONS