Machine Learning for Information Extraction: An Overview Kamal Nigam Google Pittsburgh With input, slides and suggestions from William Cohen, Andrew McCallum and Ion Muslea

Example: A Problem [Screenshots: a genomics job posting; keyword-search results that conflate Mt. Baker, the school district; Baker Hostetler, the company; and Baker, a job opening]

Example: A Solution

Job Openings: Category = Food Services Keyword = Baker Location = Continental U.S.

Extracting Job Openings from the Web Title: Ice Cream Guru Description: If you dream of cold creamy… Contact: Category: Travel/Hospitality Function: Food Services

Potential Enabler of Faceted Search

Lots of Structured Information in Text

IE from Research Papers

What is Information Extraction? Recovering structured data from formatted text:
–Identifying fields (e.g. named entity recognition)
–Understanding relations between fields (e.g. record association)
–Normalization and deduplication
Today, focus mostly on field identification & a little on record association

IE Posed as a Machine Learning Task
Training data: documents marked up with ground truth
In contrast to text classification, local features crucial. Features of:
–Contents
–Text just before item
–Text just after item
–Begin/end boundaries
[Figure: "… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …" with the prefix, contents, and suffix regions marked]

Good Features for Information Extraction
Example word features:
–identity of word
–is in all caps
–ends in "-ski"
–is part of a noun phrase
–is in a list of city names
–is under node X in WordNet or Cyc
–is in bold font
–is in hyperlink anchor
–features of past & future
–last person name was female
–next two words are "and Associates"
More example features: begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30
Creativity and Domain Knowledge Required!

Good Features for Information Extraction
–Is Capitalized, Is Mixed Caps, Is All Caps, Initial Cap, Contains Digit, All lowercase, Is Initial, Punctuation, Period, Comma, Apostrophe, Dash, Preceded by HTML tag
–Character n-gram classifier says string is a person name (80% accurate)
–In stopword list (the, of, their, etc)
–In honorific list (Mr, Mrs, Dr, Sen, etc)
–In person suffix list (Jr, Sr, PhD, etc)
–In name particle list (de, la, van, der, etc)
–In Census lastname list; segmented by P(name)
–In Census firstname list; segmented by P(name)
–In locations lists (states, cities, countries)
–In company name list ("J. C. Penny")
–In list of company suffixes (Inc, & Associates, Foundation)
Word Features
–lists of job titles
–lists of prefixes
–lists of suffixes
–350 informative phrases
HTML/Formatting Features
–{begin, end, in} x {,,, } x {lengths 1, 2, 3, 4, or longer}
–{begin, end} of line
Creativity and Domain Knowledge Required!
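
To make these feature lists concrete, here is a minimal sketch (in Python) of a token-level feature extractor; the particular feature names, word lists, and context checks are illustrative stand-ins rather than the feature set used in any of the systems above.

```python
import re

# Illustrative word lists; real systems would load much larger lexicons.
HONORIFICS = {"mr", "mrs", "dr", "sen"}
CITY_NAMES = {"pittsburgh", "boston", "seattle"}

def token_features(tokens, i):
    """Return a dict of features for the token at position i,
    including a little context from the neighboring tokens."""
    w = tokens[i]
    feats = {
        "word": w.lower(),
        "is_all_caps": w.isupper(),
        "is_capitalized": w[:1].isupper(),
        "contains_digit": any(c.isdigit() for c in w),
        "ends_with_ski": w.lower().endswith("ski"),
        "is_honorific": w.lower().rstrip(".") in HONORIFICS,
        "in_city_list": w.lower() in CITY_NAMES,
        "only_punctuation": bool(re.fullmatch(r"\W+", w)),
    }
    # Features of past & future (context window).
    if i > 0:
        feats["prev_word"] = tokens[i - 1].lower()
    if i + 2 < len(tokens):
        feats["next_two_are_and_associates"] = (
            tokens[i + 1].lower() == "and" and tokens[i + 2].lower() == "associates"
        )
    return feats

# Example: features for "Thrun" in a seminar announcement fragment.
print(token_features("Speaker : Sebastian Thrun".split(), 3))
```

Each token gets a dictionary of overlapping features drawn from its surface form, lexicon membership, and neighbors, which is the kind of input the classifiers and sequence models in the rest of the deck consume.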

IE History
Pre-Web: mostly news articles
–De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
–Message Understanding Conference (MUC): DARPA ['87-'95], TIPSTER ['92-'96]
–Most early work dominated by hand-built models, e.g. SRI's FASTUS, hand-built FSMs
–But by the 1990's, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]
Web
–AAAI '94 Spring Symposium on "Software Agents": much discussion of ML applied to the Web (Maes, Mitchell, Etzioni)
–Tom Mitchell's WebKB, '96: build KB's from the Web
–Wrapper induction: initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …

Landscape of ML Techniques for IE: Any of these models can be used to capture words, formatting or both.
–Classify Candidates: run a classifier over each candidate string ("which class?")
–Sliding Window: classify each window of tokens ("which class?"); try alternate window sizes
–Boundary Models: classify positions as BEGIN or END of a field
–Finite State Machines: find the most likely state sequence
–Wrapper Induction: learn and apply an extraction pattern for a website
(Running example in the figure: extracting the PersonName "Abraham Lincoln" from "Abraham Lincoln was born in Kentucky.")

Sliding Windows & Boundary Detection

Information Extraction by Sliding Windows GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement


Information Extraction with Sliding Windows [Freitag 97, 98; Soderland 97; Califf 98]
[Figure: candidate window w_t … w_t+n ("contents"), with w_t-m … w_t-1 as prefix and w_t+n+1 … w_t+n+m as suffix, over "… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …"]
Standard supervised learning setting
–Positive instances: candidates with real label
–Negative instances: all other candidates
–Features based on candidate, prefix and suffix
Special-purpose rule learning systems work well:
courseNumber(X) :- tokenLength(X,=,2), every(X, inTitle, false), some(X, A,, inTitle, true), some(X, B, <>. tripleton, true)
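
A hedged sketch of the sliding-window setup described above: enumerate every candidate window up to a maximum length, classify each one independently using its prefix, contents, and suffix, and keep the confident ones. The classifier interface, threshold, and toy scorer are assumptions for illustration, not Freitag's or Califf's actual learners.

```python
def candidate_windows(tokens, max_len=5):
    """Enumerate all (start, end) spans up to max_len tokens long."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end

def extract_by_sliding_window(tokens, score_fn, threshold=0.5):
    """Classify every candidate window independently and keep confident ones.
    score_fn(prefix, contents, suffix) -> probability the contents are a target field."""
    hits = []
    for start, end in candidate_windows(tokens):
        prefix, contents, suffix = tokens[:start], tokens[start:end], tokens[end:]
        p = score_fn(prefix, contents, suffix)
        if p >= threshold:
            hits.append((start, end, p))
    return hits

# Toy scorer: pretend a speaker name follows the tokens "Speaker :".
toy_scorer = lambda pre, mid, suf: 0.9 if pre[-2:] == ["Speaker", ":"] else 0.0
tokens = "Speaker : Sebastian Thrun Place : Wean Hall".split()
print(extract_by_sliding_window(tokens, toy_scorer))
```

Note that several overlapping windows can clear the threshold, which previews the independence problem discussed a few slides later.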

Rule-learning approaches to sliding-window classification: Summary
–Representations for classifiers allow restriction of the relationships between tokens, etc.
–Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
–Use of these "heavyweight" representations is complicated, but seems to pay off in results

IE by Boundary Detection GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement


BWI: Learning to detect boundaries [Freitag & Kushmerick, AAAI 2000]
Another formulation: learn three probabilistic classifiers:
–START(i) = Prob(position i starts a field)
–END(j) = Prob(position j ends a field)
–LEN(k) = Prob(an extracted field has length k)
Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i)
LEN(k) is estimated from a histogram
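
A small sketch of the START(i) * END(j) * LEN(j-i) scoring rule; the probability tables below are made-up placeholders for what BWI would actually learn from data.

```python
def score_extraction(start_prob, end_prob, len_hist, i, j):
    """Score span (i, j) as START(i) * END(j) * LEN(j - i),
    following the scoring rule described above."""
    return start_prob.get(i, 0.0) * end_prob.get(j, 0.0) * len_hist.get(j - i, 0.0)

# Placeholder probabilities for a short document (not learned values).
START = {2: 0.8, 6: 0.3}
END = {4: 0.7, 8: 0.4}
LEN = {1: 0.1, 2: 0.6, 3: 0.2}   # length histogram estimated from training data

spans = [(i, j) for i in START for j in END if j > i]
best = max(spans, key=lambda s: score_extraction(START, END, LEN, *s))
print(best, score_extraction(START, END, LEN, *best))
```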

BWI: Learning to detect boundaries
–BWI uses boosting to find "detectors" for START and END
–Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i)
–Each "pattern" is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, …
–The weak learner for "patterns" uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns
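
The BEFORE/AFTER patterns can be pictured as below; this sketch only shows how one weak detector made of literal tokens and wildcards fires at a boundary position, not BWI's boosting or greedy pattern growth, and the wildcard names are simplified placeholders.

```python
def wildcard(name):
    """Map a wildcard name to a token predicate (illustrative names only)."""
    return {
        "anyToken": lambda t: True,
        "anyNumber": lambda t: t.isdigit(),
        "anyAlphabeticToken": lambda t: t.isalpha(),
        "anyCapitalizedWord": lambda t: t[:1].isupper(),
    }[name]

def matches(pattern, tokens):
    """True if every pattern element (literal string or predicate) matches its token."""
    if len(pattern) != len(tokens):
        return False
    return all(p(t) if callable(p) else p == t for p, t in zip(pattern, tokens))

def fires(before, after, tokens, i):
    """Does a weak START detector with BEFORE/AFTER patterns fire at boundary i?"""
    return (matches(before, tokens[i - len(before):i]) and
            matches(after, tokens[i:i + len(after)]))

tokens = "Speaker : Sebastian Thrun".split()
before = ["Speaker", ":"]                       # literal tokens before the boundary
after = [wildcard("anyCapitalizedWord")]        # one capitalized token after it
print([i for i in range(1, len(tokens)) if fires(before, after, tokens, i)])
```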

BWI: Learning to detect boundaries
Results (F1 by field): Person Name: 30%; Location: 61%; Start Time: 98%

Problems with Sliding Windows and Boundary Finders
Decisions in neighboring parts of the input are made independently from each other.
–A naïve-Bayes sliding window may predict a "seminar end time" before the "seminar start time".
–It is possible for two overlapping windows to both be above threshold.
–In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Finite State Machines

Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
[Figure: finite state model and graphical model; transitions link states S_t-1 → S_t, and each state emits an observation O_t; the model generates a state sequence and an observation sequence o_1 o_2 … o_8]
Parameters, for all states S = {s_1, s_2, …}:
–Start state probabilities: P(s_1)
–Transition probabilities: P(s_t | s_t-1)
–Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize probability of training observations (w/ prior)
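
A tiny generative sketch of such an HMM, with made-up states, vocabulary, and probabilities: a state sequence is sampled from the transition distributions, and each state emits an observation from its emission distribution.

```python
import random

# Toy HMM with two states and a tiny vocabulary; all numbers are made up.
states = ["name", "other"]
start_p = {"name": 0.2, "other": 0.8}
trans_p = {"name": {"name": 0.5, "other": 0.5},
           "other": {"name": 0.2, "other": 0.8}}
emit_p = {"name": {"Lawrence": 0.5, "Saul": 0.5},
          "other": {"spoke": 0.4, "yesterday": 0.3, ".": 0.3}}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(length):
    """Generate a state sequence and an observation sequence from the HMM."""
    s = sample(start_p)
    path, obs = [s], [sample(emit_p[s])]
    for _ in range(length - 1):
        s = sample(trans_p[s])
        path.append(s)
        obs.append(sample(emit_p[s]))
    return path, obs

print(generate(6))
```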

IE with Hidden Markov Models
Given a sequence of observations and a trained HMM, find the most likely state sequence (Viterbi): argmax_s P(s, o).
Any words said to be generated by the designated "person name" state are extracted as a person name.
Example: "Yesterday Lawrence Saul spoke this example sentence." → Person name: Lawrence Saul
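
A hedged sketch of Viterbi decoding for this kind of extraction: given start, transition, and emission probabilities, recover the most likely state sequence and then pull out the tokens tagged with the "name" state. The two-state model and its hand-set numbers are toys, not a trained extractor.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the observed tokens under the HMM."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s](tokens[0])) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                       + math.log(emit_p[s](tokens[t])))
            back[t][s] = best_prev
    # Trace back the best path.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-state model: "name" prefers capitalized tokens, "other" the rest.
# Emissions are simple likelihood functions rather than full vocabulary multinomials.
states = ["name", "other"]
start_p = {"name": 0.05, "other": 0.95}
trans_p = {"name": {"name": 0.5, "other": 0.5}, "other": {"name": 0.2, "other": 0.8}}
emit_p = {"name": lambda w: 0.6 if w[:1].isupper() else 0.05,
          "other": lambda w: 0.1 if w[:1].isupper() else 0.4}

tokens = "Yesterday Lawrence Saul spoke this example sentence .".split()
path = viterbi(tokens, states, start_p, trans_p, emit_p)
print([w for w, s in zip(tokens, path) if s == "name"])
```

With these toy numbers the best path assigns "Lawrence" and "Saul" to the name state, so those two tokens are extracted as the person name.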

Generative Extraction with HMMs [McCallum, Nigam, Seymore & Rennie '00]
Parameters: {P(s_t | s_t-1), P(o_t | s_t)} for all states s_t, words o_t
Parameters define the generative model: P(s, o) = ∏_t P(s_t | s_t-1) P(o_t | s_t)

HMM Example: "Nymble" [Bikel, et al 97]
Task: Named Entity Extraction. Train on 450k words of news wire text.
States: start-of-sentence, Person, Org, (five other name classes), Other, end-of-sentence
Transition probabilities: P(s_t | s_t-1, o_t-1), backing off to P(s_t | s_t-1), then P(s_t)
Observation probabilities: P(o_t | s_t, s_t-1) or P(o_t | s_t, o_t-1), backing off to P(o_t | s_t), then P(o_t)
Results (F1): Mixed case English: 93%; Upper case English: 91%; Mixed case Spanish: 90%
Other examples of HMMs in IE: [Leek '97; Freitag & McCallum '99; Seymore et al. 99]
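
The back-off idea can be sketched as a simple interpolation of the three estimates. Note that Nymble computes its back-off weights from training counts rather than using the fixed lambdas assumed here, and the count keys in the example are invented.

```python
def backoff_emission(word, state, prev_word, counts, lambdas=(0.6, 0.3, 0.1)):
    """Interpolate P(o_t | s_t, o_{t-1}) with P(o_t | s_t) and P(o_t),
    so rare contexts still receive a sensible probability.
    `counts` holds the raw training counts needed for each estimate."""
    def ratio(num, den):
        return counts.get(num, 0) / counts[den] if counts.get(den, 0) else 0.0

    p_full = ratio((state, prev_word, word), (state, prev_word))
    p_state = ratio((state, word), (state,))
    p_unigram = ratio((word,), ("TOTAL",))
    l1, l2, l3 = lambdas
    return l1 * p_full + l2 * p_state + l3 * p_unigram

# Invented counts for illustration only.
counts = {("PER", "Mr.", "Smith"): 3, ("PER", "Mr."): 4,
          ("PER", "Smith"): 5, ("PER",): 40,
          ("Smith",): 6, ("TOTAL",): 1000}
print(backoff_emission("Smith", "PER", "Mr.", counts))
```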

Regrets from Atomic View of Tokens: would like richer representation of text: multiple overlapping features, whole chunks of text.
Example word features:
–identity of word
–is in all caps
–ends in "-ski"
–is part of a noun phrase
–is in a list of city names
–is under node X in WordNet or Cyc
–is in bold font
–is in hyperlink anchor
–features of past & future
–last person name was female
–next two words are "and Associates"
Example line, sentence, or paragraph features:
–length
–is centered in page
–percent of non-alphabetics
–white-space aligns with next line
–containing sentence has two verbs
–grammatically contains a question
–contains links to "authoritative" pages
–emissions that are uncountable
–features at multiple levels of granularity

Problems with Richer Representation and a Generative Model
These arbitrary features are not independent:
–Overlapping and long-distance dependences
–Multiple levels of granularity (words, characters)
–Multiple modalities (words, formatting, layout)
–Observations from past and future
HMMs are generative models of the text: generative models do not easily handle these non-independent features. Two choices:
–Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
–Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Conditional Sequence Models
We would prefer a conditional model: P(s|o) instead of P(s,o):
–Can examine features, but not responsible for generating them.
–Don't have to explicitly model their dependencies.
–Don't "waste modeling effort" trying to generate what we are given at test time anyway.
If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.

Conditional Markov Models
[Figure: Generative (traditional HMM): transitions S_t-1 → S_t, with each state generating its observation O_t. Conditional: transitions S_t-1 → S_t, with each transition conditioned on the observation O_t.]
Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally.
Examples: Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi, 1996]; SNoW-based Markov Model [Punyakanok & Roth, 2000]

Exponential Form for "Next State" Function
Capture dependency on s_t-1 with |S| independent functions, P_s_t-1(s_t | o_t). Each state contains a "next-state classifier" that, given the next observation, produces a probability of the next state:
P_s'(s | o) = (1/Z(o, s')) exp( Σ_k λ_k f_k(o, s) )   (weights λ_k, features f_k)
Recipe:
–Labeled data is assigned to transitions.
–Train each state's exponential model by maximum entropy.
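
A minimal sketch of one state's maximum-entropy next-state classifier: score each possible next state by a weighted sum of features of the observation, then normalize with a softmax. The feature names and weights are invented for illustration; a real MEMM trains one such model per source state by maximum entropy.

```python
import math

def next_state_probs(obs_feats, weights, states):
    """P_{s'}(s | o): softmax over sum_k lambda_k * f_k(o, s) for each next state s."""
    scores = {s: sum(weights.get((f, s), 0.0) for f in obs_feats) for s in states}
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}

# One next-state classifier (say, for the state we are leaving after "Speaker :").
states = ["name", "other"]
weights = {("is_capitalized", "name"): 2.0,
           ("prev_word=:", "name"): 1.5,
           ("is_capitalized", "other"): -0.5}

obs_feats = {"is_capitalized", "prev_word=:"}   # features of the next observation
print(next_state_probs(obs_feats, weights, states))
```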

Label Bias Problem
Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3' = 0.5 * 1 * 1
Pr(0123|rib) = 1   Pr(0453|rob) = 1
Because each state's outgoing probabilities are normalized locally, a state with a single outgoing transition passes all its mass forward regardless of the observation, so the model cannot prefer the correct path over the incorrect one.

From HMMs to MEMMs to CRFs
[Figure: graphical models of the HMM (a special case of MEMMs and CRFs), the MEMM, and the CRF over states S_t-1, S_t, S_t+1 and observations O_t-1, O_t, O_t+1]
Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]

Conditional Random Fields (CRFs)
[Figure: linear chain of states S_t … S_t+4, all conditioned on the observation sequence O = O_t, O_t+1, O_t+2, O_t+3, O_t+4]
Markov on s, conditional dependency on o. The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form, an exponential function of the cliques in the graph:
P(s | o) = (1/Z(o)) ∏_t exp( Σ_k λ_k f_k(s_t-1, s_t, o, t) )
Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
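
A hedged sketch of linear-chain CRF inference: local potentials exp(Σ_k λ_k f_k(s_t-1, s_t, o, t)) are combined by the forward recursion to get the partition function Z(o), which normalizes the score of any particular labeling into P(s|o). The features, weights, and two-state label set below are toy assumptions.

```python
import math
from itertools import product

STATES = ["name", "other"]

def potential(prev_s, s, tokens, t, weights):
    """exp of the weighted feature sum for one clique (s_{t-1}, s_t, o, t)."""
    feats = [("word_is_cap", s, tokens[t][:1].isupper()),
             ("transition", prev_s, s)]
    return math.exp(sum(weights.get(f, 0.0) for f in feats))

def z_forward(tokens, weights):
    """Partition function Z(o) by the forward recursion, O(|o| |S|^2)."""
    alpha = {s: potential(None, s, tokens, 0, weights) for s in STATES}
    for t in range(1, len(tokens)):
        alpha = {s: sum(alpha[p] * potential(p, s, tokens, t, weights) for p in STATES)
                 for s in STATES}
    return sum(alpha.values())

def prob_of_labels(labels, tokens, weights):
    """P(s | o) for one labeling: product of potentials over Z(o)."""
    num = potential(None, labels[0], tokens, 0, weights)
    for t in range(1, len(tokens)):
        num *= potential(labels[t - 1], labels[t], tokens, t, weights)
    return num / z_forward(tokens, weights)

weights = {("word_is_cap", "name", True): 2.0,
           ("word_is_cap", "other", True): -1.0,
           ("transition", "name", "name"): 0.5}

tokens = "Lawrence Saul spoke".split()
print(prob_of_labels(["name", "name", "other"], tokens, weights))
# Sanity check: probabilities over all 2^3 labelings sum to 1.
print(sum(prob_of_labels(list(l), tokens, weights) for l in product(STATES, repeat=3)))
```

The final line brute-forces all labelings to check that they sum to 1, confirming that the forward recursion and the clique potentials agree.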

Training CRFs [Sha & Pereira 2002], [Malouf 2002]
Methods:
–iterative scaling (quite slow)
–conjugate gradient (much faster)
–conjugate gradient with preconditioning (super fast)
–limited-memory quasi-Newton methods (also super fast)
Complexity comparable to standard Baum-Welch
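
As a small, generic illustration of the limited-memory quasi-Newton option (not the implementations referenced above), here is SciPy's L-BFGS minimizing a toy regularized conditional log-likelihood; a real CRF trainer would plug the forward-backward gradient of the CRF likelihood into the same optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Toy binary maxent model: x is a feature vector, y in {0, 1}. Data is made up.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])
y = np.array([1, 1, 0, 0])

def neg_log_likelihood(w, l2=0.1):
    """-sum_i log P(y_i | x_i; w) plus a small L2 penalty,
    analogous to the Gaussian prior typically used when training CRFs."""
    z = X @ w
    log_p1 = -np.logaddexp(0.0, -z)      # log sigmoid(z)
    log_p0 = -np.logaddexp(0.0, z)       # log (1 - sigmoid(z))
    return -(y * log_p1 + (1 - y) * log_p0).sum() + l2 * (w @ w)

result = minimize(neg_log_likelihood, x0=np.zeros(2), method="L-BFGS-B")
print(result.x, result.fun)
```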

Sample IE Applications of CRFs
–Noun phrase segmentation [Sha & Pereira 03]
–Named entity recognition [McCallum & Li 03]
–Protein names in bio abstracts [Settles 05]
–Addresses in web pages [Culotta et al. 05]
–Semantic roles in text [Roth & Yih 05]
–RNA structural alignment [Sato & Sakakibara 05]

Examples of Recent CRF Research
Semi-Markov CRFs [Sarawagi & Cohen 05]
–Awkwardness of token level decisions for segments
–Segment sequence model alleviates this
–Two-level model with sequences of segments, which are sequences of tokens
Stochastic Meta-Descent [Vishwanathan 06]
–Stochastic gradient optimization for training
–Take gradient steps with small batches of examples
–Order of magnitude faster than L-BFGS
–Same resulting accuracies for extraction

Further Reading about CRFs
Charles Sutton and Andrew McCallum. "An Introduction to Conditional Random Fields for Relational Learning." In Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar. MIT Press.