
1 Hidden Markov Models Applied to Information Extraction
Part I: Concept (HMM Tutorial)
Part II: Sample Application (AutoBib: web information extraction)
Larry Reeve, INFO629: Artificial Intelligence
Dr. Weber, Fall 2004

2 Part I: Concept - HMM Motivation
The real world has structures and processes that have (or produce) observable outputs.
These outputs are usually sequential (the process unfolds over time).
The event producing the output cannot be seen directly. Example: speech signals.
Problem: how to construct a model of the structure or process given only the observations.

3 HMM Background
The basic theory was developed and published in the 1960s and 70s, but there was no widespread understanding and application until the late 80s. Why?
The theory was published in mathematics journals that were not widely read by practicing engineers.
There was insufficient tutorial material for readers to understand and apply the concepts.

4 HMM Uses
Speech recognition: recognizing spoken words and phrases
Text processing: parsing raw records into structured records
Bioinformatics: protein sequence prediction
Financial: stock market forecasts (price pattern prediction)
Comparison shopping services

5 HMM Overview
A machine learning method that makes use of state machines and is based on probabilistic models.
Useful in problems having sequential steps.
Only the output from the states can be observed, not the states themselves.
Example: speech recognition
Observed: acoustic signals
Hidden states: phonemes (the distinctive sounds of a language)
[State machine diagram]

6 Observable Markov Model Example
Weather: once each day the weather is observed.
State 1: rain; State 2: cloudy; State 3: sunny.
Each state corresponds to a physically observable event.
What is the probability that the weather for the next 7 days will be: sun, sun, rain, rain, sun, cloudy, sun?

State transition matrix:
         Rainy  Cloudy  Sunny
Rainy     0.4    0.3    0.3
Cloudy    0.2    0.6    0.2
Sunny     0.1    0.1    0.8
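A minimal sketch of this computation, assuming (as the question implies) that today is sunny with probability 1: the answer is simply the product of the transition probabilities along the observed sequence.

```python
import numpy as np

states = {"rainy": 0, "cloudy": 1, "sunny": 2}

# Transition matrix from the slide: A[i][j] = P(tomorrow is j | today is i)
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

sequence = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]

prob, today = 1.0, "sunny"        # assume we start in the sunny state
for tomorrow in sequence:
    prob *= A[states[today], states[tomorrow]]
    today = tomorrow

print(prob)  # 0.8 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 1.536e-04
```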

7 Observable Markov Model

8 Hidden Markov Model Example
Coin toss: a sequence of heads and tails produced with 2 coins.
You are in a room with a wall; a person behind the wall flips a coin and tells you the result.
The coin selection and toss are hidden: you cannot observe the events, only their output (heads, tails).
The problem is then to build a model to explain the observed sequence of heads and tails.

9 HMM Components
A set of states (the x's)
A set of possible output symbols (the y's)
A state transition matrix (the a's): the probability of making a transition from one state to the next
An output emission matrix (the b's): the probability of emitting/observing a symbol at a particular state
An initial probability vector: the probability of starting at a particular state (not always shown; sometimes assumed to be 1 for a designated start state)
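As a concrete illustration, here is one way to write these components down for the two-coin example above. The particular numbers are assumptions chosen for illustration, not values from the slides.

```python
import numpy as np

states = ["coin1", "coin2"]    # hidden states (x's)
symbols = ["heads", "tails"]   # output symbols (y's)

# State transition matrix (a's): A[i][j] = P(next state j | current state i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Emission matrix (b's): B[i][k] = P(symbol k | state i); coin 2 is biased
B = np.array([[0.5, 0.5],
              [0.8, 0.2]])

# Initial probability vector: P(first state is i)
pi = np.array([0.6, 0.4])

# Every row must be a probability distribution
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```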

10 HMM Components

11 Common HMM Types
Ergodic (fully connected): every state of the model can be reached in a single step from every other state of the model.
Bakis (left-right): as time increases, states proceed from left to right.
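The difference shows up directly in the shape of the transition matrix; a small sketch with assumed values:

```python
import numpy as np

# Ergodic: every entry positive, so any state reaches any other in one step
A_ergodic = np.array([[0.4, 0.3, 0.3],
                      [0.2, 0.6, 0.2],
                      [0.1, 0.1, 0.8]])

# Bakis (left-right): upper-triangular, so the state index never decreases
A_bakis = np.array([[0.5, 0.3, 0.2],
                    [0.0, 0.6, 0.4],
                    [0.0, 0.0, 1.0]])
```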

12 HMM Core Problems
Three problems must be solved for HMMs to be useful in real-world applications:
1) Evaluation
2) Decoding
3) Learning

13 HMM Evaluation Problem
Purpose: score how well a given model matches a given observation sequence.
Example (speech recognition): assume HMMs (models) have been built for the words 'home' and 'work'. Given a speech signal, evaluation can determine the probability that each model represents the utterance.
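Evaluation is classically solved with the forward algorithm. A minimal sketch using the A, B, pi conventions above; it is unscaled, so it is only suitable for short sequences (real implementations rescale at each step to avoid numerical underflow):

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """P(obs | model). A: (n,n) transitions, B: (n,m) emissions,
    pi: (n,) initial vector, obs: list of symbol indices."""
    alpha = pi * B[:, obs[0]]          # joint prob. of first symbol and each state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one step, weight by emission
    return alpha.sum()

# Usage idea: score the same utterance under two hypothetical word models
# and pick the word whose model assigns it the higher probability, e.g.:
# best = max(models, key=lambda m: forward_likelihood(m.A, m.B, m.pi, obs))
```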

14 HMM Decoding Problem
Given a model and a sequence of observations, which hidden states are most likely to have generated the observations?
Useful for learning about internal model structure, determining state statistics, and so forth.
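Decoding is classically solved with the Viterbi algorithm. A sketch in log space (zero probabilities simply become -inf and drop out of the argmax):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely hidden-state sequence for obs (a list of symbol indices)."""
    with np.errstate(divide="ignore"):      # allow log(0) = -inf
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    T, n = len(obs), A.shape[0]
    delta = logpi + logB[:, obs[0]]         # best log-score ending in each state
    back = np.zeros((T, n), dtype=int)      # best-predecessor pointers
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]            # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```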

15 HMM Learning Problem
The goal is to learn the HMM parameters (training):
State transition probabilities
Observation probabilities at each state
Training is crucial: it allows optimal adaptation of the model parameters to training data observed from real-world phenomena.
There is no known method for obtaining globally optimal parameters from data; iterative procedures such as Baum-Welch only approximate them (finding a local optimum).
This can be a bottleneck in HMM usage.
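The standard iterative procedure is Baum-Welch, an EM algorithm. A minimal, unscaled sketch of one re-estimation step; production implementations rescale alpha and beta to avoid underflow and repeat the step until the likelihood converges:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (EM) re-estimation step; returns updated (A, B, pi)."""
    obs = np.asarray(obs)
    n, T = A.shape[0], len(obs)
    # Forward and backward passes (unscaled: fine only for short sequences)
    alpha = np.zeros((T, n))
    beta = np.zeros((T, n))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # gamma[t, i]: P(state i at time t | obs); xi[t, i, j]: P(i -> j at t | obs)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = alpha[:-1, :, None] * A[None, :, :] \
        * (B[:, obs[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    # Re-estimate parameters from the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```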

16 HMM Concept Summary
Build models representing the hidden states of a process or structure, using only observations.
Use evaluation to score the probability that a model represents a particular observation sequence.
Use that evaluation information in an application to recognize speech, parse addresses, and many other tasks.

17 Part II: Application - AutoBib System
Provides a uniform view of several computer science bibliographic web data sources.
An automated web information extraction system that requires little human input.
Web pages are designed differently from site to site, and IE requires training samples.
HMMs are used to parse unstructured bibliographic records into a structured format (an NLP task).

18 Web Information Extraction: Converting Raw Records

19 Approach
1) Provide a seed database of structured records
2) Extract raw records from relevant Web pages
3) Match structured records to raw records (to build training samples)
4) Train the HMM-based parser
5) Parse unmatched raw records into structured records
6) Merge the new structured records into the database

20 AutoBib Architecture

21 Step 1 - Seeding
Provide a seed database of structured records: take a small collection of BibTeX-format records and insert them into the database.
A cleaning step normalizes the record fields. Examples:
"Proc." -> "Proceedings"
"Jan" -> "January"
This is a manual step, executed only once.
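A hypothetical sketch of such a cleaning pass; AutoBib's actual rules are not specified beyond the two examples above, so the mapping here is an assumption:

```python
# Assumed abbreviation table; only the first two entries come from the slide.
EXPANSIONS = {"Proc.": "Proceedings", "Jan": "January", "Feb": "February"}

def normalize_field(value: str) -> str:
    """Replace known abbreviations with canonical forms, token by token."""
    return " ".join(EXPANSIONS.get(tok, tok) for tok in value.split())

print(normalize_field("Proc. of the 8th IDEAS, Jan 2004"))
# -> "Proceedings of the 8th IDEAS, January 2004"
```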

22 Step 2 - Extract Raw Records
Extract raw records from the relevant Web pages.
The user specifies which Web pages to extract from and how to follow 'next page' links across multiple pages.
Raw records are then extracted using record-boundary discovery techniques:
Subtree of Interest = the largest subtree of HTML tags
Record separators = frequent HTML tags

23 Tokenized Records (Replace all HTML tags with ^)
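A sketch of the tokenization named in the slide title: every HTML tag is collapsed to a single '^' so tag positions can serve as field delimiters. The sample record is invented for illustration.

```python
import re

def tokenize_record(html: str) -> str:
    """Replace each HTML tag with '^', keeping the text content."""
    return re.sub(r"<[^>]+>", "^", html)

print(tokenize_record('<li><a href="x">J. Smith</a>, Title, 2004.</li>'))
# -> '^^J. Smith^, Title, 2004.^'
```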

24 Step 3 - Matching
Match raw records R to structured records S by applying 4 heuristic tests:
1) At least one author in R matches an author in S
2) S.year must appear in R
3) If S.pages exists, R must contain it
4) S.title is 'approximately contained' in R (Levenshtein edit distance, i.e. approximate string matching)
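A hedged sketch of these tests. The field names and similarity threshold are illustrative assumptions, and difflib's longest-common-block ratio stands in for a true Levenshtein-based containment check:

```python
from difflib import SequenceMatcher  # stand-in for a Levenshtein library

def approx_contained(needle: str, hay: str, threshold: float = 0.9) -> bool:
    """Crude 'approximately contained' test: the longest common block
    must cover most of the needle. A real edit-distance check is stricter."""
    m = SequenceMatcher(None, needle.lower(), hay.lower())
    match = m.find_longest_match(0, len(needle), 0, len(hay))
    return match.size >= threshold * len(needle)

def matches(raw: str, s: dict) -> bool:
    """The four heuristic tests, applied to one raw record string."""
    return (any(author in raw for author in s["authors"])   # 1) author match
            and s["year"] in raw                            # 2) year appears
            and (not s.get("pages") or s["pages"] in raw)   # 3) pages, if any
            and approx_contained(s["title"], raw))          # 4) title approx.
```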

25 Step 4 - Parser Training
Train the HMM-based parser: for each matching pair of R and S, annotate the tokens in the raw record with field names.
The annotated raw records are fed into the HMM parser in order to learn:
State transition probabilities
Symbol probabilities at each state
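Because the training records are annotated, the states are visible during training, so both probability tables can be estimated by simple counting. A minimal sketch (smoothing for unseen tokens is omitted):

```python
from collections import Counter, defaultdict

def train_supervised(annotated):
    """annotated: list of records, each a list of (token, field_label) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for record in annotated:
        prev = "start"
        for token, label in record:
            trans[prev][label] += 1   # count state-to-state transitions
            emit[label][token] += 1   # count symbol emissions per state
            prev = label
        trans[prev]["end"] += 1
    # Normalize the counts into probability tables
    A = {s: {t: c / sum(cs.values()) for t, c in cs.items()}
         for s, cs in trans.items()}
    B = {s: {t: c / sum(cs.values()) for t, c in cs.items()}
         for s, cs in emit.items()}
    return A, B
```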

26 Parser Training, continued
A key consideration is the HMM structure for navigating record fields (fields, delimiters).
Special states: start, end
Normal states: author, title, year, etc.
Best structure found: have multiple delimiter and tag states, one for each normal state.
Example: author-delimiter, author-tag
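Rendered as a state inventory (the field list beyond those named on the slide is an assumption):

```python
# One delimiter state and one tag state per normal (field) state,
# plus the special start/end states.
FIELDS = ["author", "title", "year", "pages"]  # "pages" is assumed

states = ["start", "end"]
for field in FIELDS:
    states += [field, field + "-delimiter", field + "-tag"]

print(states)
```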

27 Sample HMM (Method 3) Source: http://www.cs.duke.edu/~geng/autobib/web/hmm.jpg

28 Step 5 - Conversion
Parse the unmatched raw records into structured records using the HMM parser.
Matched raw records can be converted directly without parsing, because they were already annotated in the matching step.

29 Step 6 - Merging
Merge the new structured records into the database.
The initial seed database has now grown, and the new records will be used for improved matching on the next run.

30 Evaluation
Success rate = (# of tokens labeled by the HMM) / (# of tokens labeled by a person)
DBLP (Computer Science Bibliography): 98.9%
CSWD (CompuScience WWW-Database): 93.4%

31 HMM Advantages / Disadvantages
Advantages:
Effective
Can handle variations in record structure (optional fields, varying field ordering)
Disadvantages:
Requires training on annotated data, so it is not completely automatic and may require manual markup
The size of the training data may be an issue

32 Other Methods
Wrappers: specifications of the areas of interest on a Web page
Hand-crafted, or learned via wrapper induction (which requires manual training)
Not always accommodating of changing page structure
Syntax-based; no semantic labeling

33 Application to Other Domains
E-commerce: comparison shopping sites.
Extract product/pricing information from many sites, convert it into a structured format, and store it.
Provide an interface to look up product information and then display pricing gathered from many sites.
This saves users time: rather than navigating to and searching many sites, users can consult a single site.

34 References
Concept: Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.
Application: Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS'04), 193-204.

