1
Information Extraction from the World Wide Web. CSE 454. Based on slides by William W. Cohen (Carnegie Mellon University) and Andrew McCallum (University of Massachusetts Amherst), from KDD 2003.
2
Todo: add explanation of top-down rule learning; add explanation of wrappers for BWI; better example of HMM structure for the Cora DB; fix slide 79; review learning with complete info.
3
Group Meetings Thursday ?? Tuesday
4
Class Overview: Apps on the Net (INet advertising, info extraction, cryptography on the Net, Facebook, human computation); Network Layer (crawling); IR (ranking, indexing, query processing); Content Analysis (sliding windows, HMMs, self-supervised).
5
Information Extraction
6
Example: The Problem Martin Baker, a person Genomics job Employers job posting form Slides from Cohen & McCallum
7
Example: A Solution Slides from Cohen & McCallum
8
Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1 Slides from Cohen & McCallum
9
Job Openings: Category = Food Services Keyword = Baker Location = Continental U.S. Slides from Cohen & McCallum
10
What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Slides from Cohen & McCallum
11
What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. IE Slides from Cohen & McCallum
15
What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. * * * * Slides from Cohen & McCallum
16
IE in Context. Pipeline: spider the Web to build a document collection, filter by relevance, run IE to load the database, then query/search and data mine. Supporting steps: create an ontology, label training data, and train extraction models. Slides from Cohen & McCallum
17
IE History Pre-Web Mostly news articles –De Jong’s FRUMP [1982] Hand-built system to fill Schank-style “scripts” from news wire –Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92- ’96] Most early work dominated by hand-built models –E.g. SRI’s FASTUS, hand-built FSMs. –But by 1990’s, some machine learning: Soderland/Lehnert, Cardie, Grishman –Then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98] Web AAAI ’94 Spring Symposium on “Software Agents” –Much discussion of ML applied to Web. Maes, Mitchell, Etzioni. Tom Mitchell’s WebKB, ‘96 –Build KB’s from the Web. Wrapper Induction –First by hand, then ML: [Doorenbos ‘96], [Soderland ’96], [Kushmerick ’97],… Slides from Cohen & McCallum
18
Landscape of IE Tasks (1/4): Pattern Feature Domain. A spectrum of inputs: text paragraphs without formatting (e.g. "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."); grammatical sentences and some formatting & links; non-grammatical snippets, rich formatting & links; tables. Slides from Cohen & McCallum
19
Landscape of IE Tasks (2/4): Pattern Scope. Web site specific (e.g. Amazon book pages; cue: formatting); genre specific (e.g. resumes; cue: layout); wide, non-specific (e.g. university names; cue: language). Slides from Cohen & McCallum
20
Landscape of IE Tasks (3/4): Pattern Complexity, e.g. word patterns. Closed set (U.S. states): "He was born in Alabama…", "The big Wyoming sky…". Regular set (U.S. phone numbers): "Phone: (413) 545-1323", "The CALD main office can be reached at 412-268-1299". Complex pattern (U.S. postal addresses): "University of Arkansas P.O. Box 140 Hope, AR 71802", "Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210". Ambiguous patterns, needing context and many sources of evidence (person names): "…was among the six houses sold by Hope Feldman that year.", "Pawel Opalinski, Software Engineer at WhizBang Labs." Slides from Cohen & McCallum
21
Landscape of IE Tasks (4/4): Pattern Combinations. Example text: "Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt." Single entity (“named entity” extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut. Binary relationship: Relation: Person-Title (Person: Jack Welch, Title: CEO); Relation: Company-Location (Company: General Electric, Location: Connecticut). N-ary record: Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt). Slides from Cohen & McCallum
22
Landscape of IE Models, illustrated on the sentence "Abraham Lincoln was born in Kentucky.": Lexicons (Alabama, Alaska, …, Wisconsin, Wyoming: member?); Sliding Window (classifier: which class? try alternate window sizes); Boundary Models (classifier: BEGIN or END?); Classify Pre-segmented Candidates (classifier: which class?); Finite State Machines (most likely state sequence?); Context Free Grammars (most likely parse?); …and beyond. Any of these models can be used to capture words, formatting or both. Slides from Cohen & McCallum
23
Evaluation of Single Entity Extraction. TRUTH: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. PRED: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. Precision = # correctly predicted segments / # predicted segments = 2 / 6. Recall = # correctly predicted segments / # true segments = 2 / 4. F1 = harmonic mean of precision and recall = 1 / (((1/P) + (1/R)) / 2). Slides from Cohen & McCallum
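A minimal sketch of this exact-match scoring in Python (not from the slides): segments are represented as (start, end) token offsets, and the toy spans below are invented only to reproduce the slide's counts of 2 correct, 6 predicted, and 4 true segments.

```python
# Hypothetical scoring sketch: exact-match precision/recall/F1 over segments.
def prf1(true_segments, pred_segments):
    true_set, pred_set = set(true_segments), set(pred_segments)
    correct = len(true_set & pred_set)              # exact-match segments only
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of P and R
    return precision, recall, f1

# Invented spans chosen to match the slide: 2 correct / 6 predicted / 4 true.
truth = [(0, 2), (3, 5), (10, 13), (14, 16)]
pred = [(0, 2), (3, 5), (6, 7), (8, 9), (10, 11), (14, 15)]
print(prf1(truth, pred))                            # (0.333..., 0.5, 0.4)
```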
24
State of the Art Performance Named entity recognition –Person, Location, Organization, … –F1 in high 80’s or low- to mid-90’s Binary relation extraction –Contained-in (Location1, Location2) Member-of (Person1, Organization1) –F1 in 60’s or 70’s or 80’s Wrapper induction –Extremely accurate performance obtainable –Human effort (~30min) required on each site Slides from Cohen & McCallum
25
Current Research: Open Information Extraction; Large-Scale Extraction (e.g. 5000 relations; distant supervision); Rapid Supervision (minimal labeled examples, minimal tuning).
26
Landscape: Focus of this Tutorial. Pattern complexity: closed set, regular, complex, ambiguous. Pattern feature domain: words, words + formatting, formatting. Pattern scope: site-specific, genre-specific, general. Pattern combinations: entity, binary, n-ary. Models: lexicon, regex, window, boundary, FSM, CFG. Slides from Cohen & McCallum
27
Quick Review
28
Bayes Theorem (Thomas Bayes, 1702-1761)
29
Bayesian Categorization. Let the set of categories be {c1, c2, …, cn}. Let E be the description of an instance. Determine the category of E by determining, for each ci, P(ci | E) = P(ci) P(E | ci) / P(E). P(E) can be determined since the categories are complete and disjoint: P(E) = Σj P(cj) P(E | cj).
30
Naïve Bayesian Motivation. Problem: too many possible instances (exponential in m) to estimate all P(E | ci) directly. If we assume the m features of an instance are independent given the category ci (conditionally independent), then P(E | ci) = P(e1, e2, …, em | ci) = Πj P(ej | ci). Therefore, we then only need to know P(ej | ci) for each feature and category.
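As a concrete illustration of "learning is just counting," here is a minimal Naïve Bayes sketch in Python; the tiny spam example and the Laplace smoothing are my own additions, not from the slides.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):                       # docs: list of (tokens, label) pairs
    prior, cond = Counter(), defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        prior[label] += 1                 # class counts
        cond[label].update(tokens)        # feature counts per class
        vocab.update(tokens)
    return prior, cond, vocab

def classify_nb(tokens, prior, cond, vocab):
    total = sum(prior.values())
    best, best_lp = None, float("-inf")
    for c in prior:
        lp = math.log(prior[c] / total)
        denom = sum(cond[c].values()) + len(vocab)      # Laplace smoothing
        for t in tokens:
            lp += math.log((cond[c][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [(["nigeria", "widow", "money"], "spam"),
        (["cse", "454", "lecture"], "ham")]
prior, cond, vocab = train_nb(docs)
print(classify_nb(["widow", "money"], prior, cond, vocab))   # -> "spam"
```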
31
Probabilistic Graphical Models. Random variables (Boolean): a hidden state y (Spam?) and observables x1, x2, x3 (Nigeria? Widow? CSE 454?). Causal dependency (probabilistic): P(xi | y = spam), P(xi | y ≠ spam). Nodes = random variables; directed edges = causal connections, each associated with a CPT (conditional probability table).
32
Recap: Naïve Bayes. Assumption: features independent given the label. A generative classifier: model the joint distribution p(x,y); inference; learning is just counting. Can we use it for IE directly? Example: "The article appeared in the Seattle Times." Is a given word part of a city name? Word features: length, capitalization, suffix. But we need to consider the sequence: the labels of neighboring words are dependent!
33
Sliding Windows Slides from Cohen & McCallum
34
Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement Slides from Cohen & McCallum
38
A “Naïve Bayes” Sliding Window Model [Freitag 1997]. Example text: "…00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …". The window is split into a prefix (w_t-m … w_t-1), contents (w_t … w_t+n), and suffix (w_t+n+1 … w_t+n+m). Estimate Pr(LOCATION | window) using Bayes rule; try all "reasonable" windows (vary length, position); assume independence for length, prefix, suffix, and content words; estimate from data quantities like Pr("Place" in prefix | LOCATION). If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it. Other examples of sliding windows: [Baluja et al 2000] (decision tree over individual words & their context). Slides from Cohen & McCallum
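A hedged sketch of this window scoring in Python: log P(LOC) plus log P(length) plus summed log-probabilities of prefix, content, and suffix words, each assumed independent as on the slide. All the probability values below are invented placeholders, not estimates from real data.

```python
import math

# Hypothetical model tables; in practice these would be estimated from
# labeled seminar announcements (e.g. Pr("Place" in prefix | LOCATION)).
loc_model = {
    "prior": 0.01,
    "len": {4: 0.2},
    "prefix": {"Place": 0.3, ":": 0.4},
    "contents": {"Wean": 0.2, "Hall": 0.2, "Rm": 0.1, "5409": 0.05},
    "suffix": {"Speaker": 0.2, ":": 0.4},
}

def window_log_score(prefix, contents, suffix, model):
    lp = math.log(model["prior"]) + math.log(model["len"].get(len(contents), 1e-6))
    for part, words in (("prefix", prefix), ("contents", contents), ("suffix", suffix)):
        for w in words:
            lp += math.log(model[part].get(w, 1e-6))   # tiny floor for unseen words
    return lp

score = window_log_score(["Place", ":"], ["Wean", "Hall", "Rm", "5409"],
                         ["Speaker", ":"], loc_model)
print(score)   # extract the window as a LOCATION if the score beats a tuned threshold
```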
39
“Naïve Bayes” Sliding Window Results. GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. Domain: CMU UseNet Seminar Announcements. Field / F1: Person Name: 30%, Location: 61%, Start Time: 98%. Slides from Cohen & McCallum
40
Realistic sliding-window-classifier IE. What windows to consider? All windows containing as many tokens as the shortest example, but no more tokens than the longest example. How to represent a classifier? It might: restrict the length of the window; restrict the vocabulary or formatting used before/after/inside the window; restrict the relative order of tokens; etc. Learning methods: SRV: top-down rule learning [Freitag AAAI ‘98]; Rapier: bottom-up [Califf & Mooney, AAAI ‘99]. Slides from Cohen & McCallum
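A small sketch of the candidate-window generation described above (my own illustration, not from the slides): enumerate every token window whose length lies between the shortest and longest field seen in training.

```python
def candidate_windows(tokens, min_len, max_len):
    # Yield (start, end, window) for every window within the length bounds.
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            if start + length <= len(tokens):
                yield start, start + length, tokens[start:start + length]

tokens = "3:30 pm 7500 Wean Hall".split()
for start, end, window in candidate_windows(tokens, min_len=2, max_len=3):
    print(start, end, window)   # each window would then be passed to the classifier
```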
41
Rapier: results – precision/recall Slides from Cohen & McCallum
42
Rapier – results vs. SRV Slides from Cohen & McCallum
43
Rule-learning approaches to sliding- window classification: Summary SRV, Rapier, and WHISK [Soderland KDD ‘97] –Representations for classifiers allow restriction of the relationships between tokens, etc –Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog) –Use of these “heavyweight” representations is complicated, but seems to pay off in results Can simpler representations for classifiers work? Slides from Cohen & McCallum
44
BWI: Learning to detect boundaries Another formulation: learn 3 probabilistic classifiers: –START(i) = Prob( position i starts a field) –END(j) = Prob( position j ends a field) –LEN(k) = Prob( an extracted field has length k) Score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i) LEN(k) is estimated from a histogram [Freitag & Kushmerick, AAAI 2000] Slides from Cohen & McCallum
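A sketch of this scoring rule in Python; the detectors and length histogram below are toy stand-ins, not the boosted detectors BWI actually learns.

```python
# Score an extraction (i, j) as START(i) * END(j) * LEN(j - i).
def score_extraction(i, j, start_prob, end_prob, length_hist):
    return start_prob(i) * end_prob(j) * length_hist.get(j - i, 0.0)

length_hist = {1: 0.1, 2: 0.5, 3: 0.3, 4: 0.1}   # e.g. histogram of field lengths
start_prob = lambda i: 0.8 if i == 4 else 0.05   # toy stand-in boundary detectors
end_prob = lambda j: 0.7 if j == 6 else 0.05
print(score_extraction(4, 6, start_prob, end_prob, length_hist))   # 0.8 * 0.7 * 0.5 = 0.28
```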
45
BWI: Learning to detect boundaries BWI uses boosting to find “detectors” for START and END Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i). Each “pattern” is a sequence of –tokens and/or –wildcards like: anyAlphabeticToken, anyNumber, … Weak learner for “patterns” uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE,AFTER patterns Slides from Cohen & McCallum
46
BWI: Learning to detect boundaries. (Results figure; the Naïve Bayes numbers shown, Field / F1: Speaker: 30%, Location: 61%, Start Time: 98%.) Slides from Cohen & McCallum
47
Problems with Sliding Windows and Boundary Finders Decisions in neighboring parts of the input are made independently from each other. –Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”. –It is possible for two overlapping windows to both be above threshold. –In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step. Slides from Cohen & McCallum
48
Landscape of IE Techniques (1/1): Models, illustrated on the sentence "Abraham Lincoln was born in Kentucky.": Lexicons (Alabama, Alaska, …, Wisconsin, Wyoming: member?); Classify Pre-segmented Candidates (classifier: which class?); Sliding Window (classifier: which class? try alternate window sizes); Boundary Models (classifier: BEGIN or END?); Finite State Machines (most likely state sequence?); Context Free Grammars (most likely parse?). Each model can capture words, formatting, or both. Slides from Cohen & McCallum
49
Finite State Machines Slides from Cohen & McCallum
50
Hidden Markov Models (HMMs): the standard sequence model in genomics, speech, NLP, … A finite state model and a graphical model over states S = {s1, s2, …}. Parameters: start state probabilities P(s1); transition probabilities P(st | st-1); observation (emission) probabilities P(ot | st), usually a multinomial over an atomic, fixed alphabet. Training: maximize the probability of the training observations (w/ prior). Generates: a state sequence (transitions) and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8 … Slides from Cohen & McCallum
51
Example: The Dishonest Casino. A casino has two dice: a fair die, with P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6, and a loaded die, with P(1) = P(2) = P(3) = P(4) = P(5) = 1/10 and P(6) = 1/2. The casino player switches back-&-forth between the fair and loaded die once every 20 turns. Game: 1. You bet $1. 2. You roll (always with a fair die). 3. The casino player rolls (maybe with the fair die, maybe with the loaded die). 4. Highest number wins $2. Slides from Serafim Batzoglou
52
Question # 1 – Evaluation GIVEN A sequence of rolls by the casino player 12455264621461461361366616646616366163661636 1… QUESTION How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem in HMMs Slides from Serafim Batzoglou
53
Question # 2 – Decoding GIVEN A sequence of rolls by the casino player 1245526462146146136136661664661636616366163 … QUESTION What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs Slides from Serafim Batzoglou
54
Question # 3 – Learning GIVEN A sequence of rolls by the casino player 124552646214614613613666166466163661636616361651… QUESTION How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs Slides from Serafim Batzoglou
55
The dishonest casino model. States: FAIR and LOADED; P(stay in the same state) = 0.95, P(switch) = 0.05. Emissions: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6; P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2. Slides from Serafim Batzoglou
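The same model written out as arrays, as a sketch (numpy assumed); the uniform start-state distribution is my assumption, since the slide does not specify one.

```python
import numpy as np

# State 0 = FAIR, state 1 = LOADED; observation k stands for die face k + 1.
start = np.array([0.5, 0.5])            # assumed uniform start distribution
trans = np.array([[0.95, 0.05],         # P(s_t | s_t-1): stay with prob 0.95,
                  [0.05, 0.95]])        #                 switch with prob 0.05
emit = np.array([[1/6] * 6,             # fair die: uniform over faces
                 [1/10] * 5 + [1/2]])   # loaded die: P(6) = 1/2
```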
56
What’s this have to do with Info Extraction? (The same FAIR/LOADED HMM as above: P(switch) = 0.05, P(stay) = 0.95; P(1|F) = … = P(6|F) = 1/6; P(1|L) = … = P(5|L) = 1/10, P(6|L) = 1/2.)
57
What’s this have to do with Info Extraction? Replace the states with TEXT and NAME: P(switch) = 0.05, P(stay) = 0.95; P(the | T) = 0.003, P(from | T) = 0.002, …; P(Dan | N) = 0.005, P(Sue | N) = 0.003, …
58
IE with Hidden Markov Models. Given a sequence of observations ("Yesterday Pedro Domingos spoke this example sentence.") and a trained HMM (with states such as person name, location name, background), find the most likely state sequence (Viterbi). Any words said to be generated by the designated “person name” state are extracted as a person name: Person name: Pedro Domingos. Slide by Cohen & McCallum
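A small sketch of the final extraction step (my own illustration): once Viterbi has produced a state per token, collect maximal runs of tokens labeled with the "person name" state.

```python
def extract_spans(tokens, states, target="person name"):
    # Return maximal runs of tokens whose predicted state equals `target`.
    spans, current = [], []
    for tok, st in zip(tokens, states):
        if st == target:
            current.append(tok)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = "Yesterday Pedro Domingos spoke this example sentence .".split()
states = ["background", "person name", "person name", "background",
          "background", "background", "background", "background"]
print(extract_spans(tokens, states))   # ['Pedro Domingos']
```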
59
IE with Hidden Markov Models For sparse extraction tasks : Separate HMM for each type of target Each HMM should –Model entire document –Consist of target and non-target states –Not necessarily fully connected 59 Slide by Okan Basegmez
60
Or … Combined HMM Example – Research Paper Headers 60 Slide by Okan Basegmez
61
HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”]. Task: named entity extraction; states include Person, Org, (five other name classes), Other, start-of-sentence, and end-of-sentence. Transition and observation probabilities are estimated with back-off to less specific distributions (other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]). Train on ~500k words of news wire text. Results (Case / Language / F1): Mixed / English / 93%; Upper / English / 91%; Mixed / Spanish / 90%. Slide by Cohen & McCallum
62
HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”]. Task: named entity extraction; states include Person, Org, (five other name classes), Other, start-of-sentence, and end-of-sentence. Train on ~500k words of news wire text. Results (Case / Language / F1): Mixed / English / 93%; Upper / English / 91%; Mixed / Spanish / 90%. Slide adapted from Cohen & McCallum
63
Finite State Model vs. Path. The finite state model has states Person, Org, (five other name classes), Other, start-of-sentence, and end-of-sentence; a path through it corresponds to a sequence of hidden states y1 y2 y3 y4 y5 y6 … paired with observations x1 x2 x3 x4 x5 x6 …
64
Question #1 – Evaluation. GIVEN: a sequence of observations x1 x2 x3 x4 … xN and a trained HMM θ (start, transition, and emission probabilities). QUESTION: How likely is this sequence given our HMM, i.e. P(x | θ)? Why do we care? We need it, for learning, to choose among competing models!
65
Question #2 - Decoding. GIVEN: a sequence of observations x1 x2 x3 x4 … xN and a trained HMM θ (start, transition, and emission probabilities). QUESTION: How do we choose the corresponding parse (state sequence) y1 y2 y3 y4 … yN which “best” explains x1 x2 x3 x4 … xN? There are several reasonable optimality criteria: single optimal sequence, average statistics for individual states, …
66
A parse of a sequence. Given a sequence x = x1 … xN, a parse of x is a sequence of states y = y1, …, yN (e.g. person, other, location at each position). Slide by Serafim Batzoglou
67
Question #3 - Learning. GIVEN: a sequence of observations x1 x2 x3 x4 … xN. QUESTION: How do we learn the model parameters θ (start, transition, and emission probabilities) which maximize P(x | θ)?
68
Three Questions Evaluation –Forward algorithm –(Could also go other direction) Decoding –Viterbi algorithm Learning –Baum-Welch Algorithm (aka “forward-backward”) A kind of EM (expectation maximization) –Perceptron Updates
69
A Solution to #1: Evaluation. Given observations x = x1 … xN and HMM θ, what is P(x)? Naive approach: enumerate every possible state sequence y = y1 … yN. The probability of x given a particular y is P(x | y) = Πt P(xt | yt); the probability of a particular y is P(y) = Πt P(yt | yt-1). Summing over all possible state sequences we get P(x) = Σy P(x | y) P(y). But with K states there are K^N state sequences and about 2N multiplications per sequence; even for a small HMM with K = 10 and N = 10 that is 10 billion sequences!
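For intuition, here is that brute-force sum written out in Python (a sketch, using the casino HMM's numbers and an assumed uniform start distribution); it is exactly the K^N enumeration that the forward algorithm below avoids.

```python
from itertools import product

def brute_force_likelihood(x, start, trans, emit):
    # P(x) = sum over all K^N state sequences y of P(y) * P(x | y).
    K, total = len(start), 0.0
    for y in product(range(K), repeat=len(x)):
        p = start[y[0]] * emit[y[0]][x[0]]
        for t in range(1, len(x)):
            p *= trans[y[t - 1]][y[t]] * emit[y[t]][x[t]]
        total += p
    return total

start = [0.5, 0.5]                                   # assumed uniform start
trans = [[0.95, 0.05], [0.05, 0.95]]
emit = [[1/6] * 6, [1/10] * 5 + [1/2]]
print(brute_force_likelihood([5, 5, 0, 2], start, trans, emit))   # rolls 6, 6, 1, 3
```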
70
Solution to #1: Evaluation. Use dynamic programming: define the forward variable αt(i), the probability that at time t the state is Si and the partial observation sequence x = x1 … xt has been emitted.
71
Forward Variable αt(i): the probability that the state at time t has value Si and the partial observation sequence x = x1 … xt has been seen.
73
Solution to #1: Evaluation. Use dynamic programming: cache and reuse inner sums. Define forward variables αt(i), the probability that at time t the state is yt = Si and the partial observation sequence x = x1 … xt has been emitted.
74
The Forward Algorithm
75
INITIALIZATION: α1(i) = P(y1 = Si) P(x1 | Si). INDUCTION: αt+1(j) = [Σi αt(i) P(Sj | Si)] P(xt+1 | Sj). TERMINATION: P(x) = Σi αN(i). Time: O(K²N), space: O(KN), where K = |S| is the number of states and N is the length of the sequence.
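A compact forward-algorithm sketch matching the recursion above (numpy assumed; a real implementation would scale or work in log space to avoid underflow on long sequences).

```python
import numpy as np

def forward(x, start, trans, emit):
    # alpha[t, i] = P(x_1 .. x_t, y_t = S_i)
    K, N = len(start), len(x)
    alpha = np.zeros((N, K))
    alpha[0] = start * emit[:, x[0]]                       # initialization
    for t in range(1, N):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, x[t]]  # induction
    return alpha, alpha[-1].sum()                          # termination: P(x)

start = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
emit = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
alpha, px = forward([5, 5, 0, 2], start, trans, emit)
print(px)   # matches the brute-force value for the same sequence
```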
76
The Backward Algorithm
77
INITIALIZATION: βN(i) = 1. INDUCTION: βt(i) = Σj P(Sj | Si) P(xt+1 | Sj) βt+1(j). TERMINATION: P(x) = Σi P(y1 = Si) P(x1 | Si) β1(i). Time: O(K²N), space: O(KN).
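And the matching backward pass, under the same assumptions as the forward sketch above.

```python
import numpy as np

def backward(x, start, trans, emit):
    # beta[t, i] = P(x_{t+1} .. x_N | y_t = S_i)
    K, N = len(start), len(x)
    beta = np.zeros((N, K))
    beta[-1] = 1.0                                           # initialization
    for t in range(N - 2, -1, -1):
        beta[t] = trans @ (emit[:, x[t + 1]] * beta[t + 1])  # induction
    px = np.sum(start * emit[:, x[0]] * beta[0])             # termination: P(x)
    return beta, px

start = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
emit = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
beta, px = backward([5, 5, 0, 2], start, trans, emit)
print(px)   # equals the forward-algorithm value for the same sequence
```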
78
Three Questions Evaluation –Forward algorithm –(Could also go other direction) Decoding –Viterbi algorithm Learning –Baum-Welch Algorithm (aka “forward-backward”) –A kind of EM (expectation maximization)
79
Need new slide! Looks like this but deltas
80
Solution to #2 - Decoding. Given x = x1 … xN and HMM θ, what is the “best” parse y1 … yN? There are several notions of optimal: 1. the states which are individually most likely; 2. the single best state sequence. We want to find the sequence y1 … yN such that P(x,y) is maximized: y* = argmaxy P(x, y). Again, we can use dynamic programming!
81
The Viterbi Algorithm. DEFINE: δt(i) = the probability of the most likely state sequence ending in state Si at time t, having emitted x1 … xt. INITIALIZATION: δ1(i) = P(y1 = Si) P(x1 | Si). INDUCTION: δt+1(j) = [maxi δt(i) P(Sj | Si)] P(xt+1 | Sj), remembering the maximizing i as a backpointer. TERMINATION: P* = maxi δN(i). Backtracking through the backpointers recovers the state sequence y*.
82
The Viterbi Algorithm. Time: O(K²N), space: O(KN); linear in the length of the sequence. Remember: δt(i) = probability of the most likely state sequence ending with state Si at time t. Slides from Serafim Batzoglou
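A Viterbi sketch with backpointers, in the same style as the forward pass (numpy assumed; log probabilities would be safer for long sequences).

```python
import numpy as np

def viterbi(x, start, trans, emit):
    K, N = len(start), len(x)
    delta = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)
    delta[0] = start * emit[:, x[0]]
    for t in range(1, N):
        scores = delta[t - 1][:, None] * trans      # scores[i, j] = delta[t-1, i] * P(Sj|Si)
        back[t] = scores.argmax(axis=0)             # best previous state for each j
        delta[t] = scores.max(axis=0) * emit[:, x[t]]
    path = [int(delta[-1].argmax())]
    for t in range(N - 1, 0, -1):                   # backtracking
        path.append(int(back[t][path[-1]]))
    return path[::-1], delta[-1].max()

start = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
emit = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
print(viterbi([5, 5, 5, 0, 2, 5], start, trans, emit))   # 0 = FAIR, 1 = LOADED
```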
83
The Viterbi Algorithm (pseudocode figure). Slide by Pedro Domingos
84
Three Questions Evaluation –Forward algorithm –(Could also go other direction) Decoding –Viterbi algorithm Learning –Baum-Welch Algorithm (aka “forward-backward”) –A kind of EM (expectation maximization)
85
Solution to #3 - Learning. Given x1 … xN, how do we learn θ (start, transition, and emission probabilities) to maximize P(x)? Unfortunately, there is no known way to analytically find a global maximum θ*, i.e. θ* = argmaxθ P(x | θ). But it is possible to find a local maximum: given an initial model θ, we can always find a model θ′ such that P(x | θ′) ≥ P(x | θ).
86
86 Chicken & Egg Problem If we knew the actual sequence of states –It would be easy to learn transition and emission probabilities –But we can’t observe states, so we don’t! If we knew transition & emission probabilities –Then it’d be easy to estimate the sequence of states (Viterbi) –But we don’t know them! Slide by Daniel S. Weld
87
Simplest Version: a mixture of two distributions. Know: form of distribution & variance, % = 5. Just need the mean of each distribution. Slide by Daniel S. Weld
88
Input Looks Like: (figure: unlabeled data points on an axis from .01 to .09). Slide by Daniel S. Weld
89
We Want to Predict: (figure: the same data points; for each point, which distribution generated it? One point is marked “?”). Slide by Daniel S. Weld
90
Chicken & Egg: note that coloring the instances (assigning each to a distribution) would be easy if we knew the Gaussians… Slide by Daniel S. Weld
91
Chicken & Egg: and finding the Gaussians would be easy if we knew the coloring. Slide by Daniel S. Weld
92
Expectation Maximization (EM). Pretend we do know the parameters: initialize randomly, e.g. set the two means μ1 = ?, μ2 = ?. Slide by Daniel S. Weld
93
Expectation Maximization (EM). Pretend we do know the parameters (initialize randomly). [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld
96
ML Mean of a Single Gaussian: μML = argminμ Σi (xi – μ)². Slide by Daniel S. Weld
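A small EM sketch for exactly this picture: two one-dimensional Gaussians with known, equal variance and unknown means. The data points, initial means, and variance below are invented for illustration.

```python
import math

def em_two_gaussians(xs, mu1, mu2, sigma=0.01, iters=20):
    for _ in range(iters):
        # E step: responsibility of component 1 for each point (equal priors assumed)
        r = []
        for x in xs:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            r.append(p1 / (p1 + p2))
        # M step: fractional (responsibility-weighted) means
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return mu1, mu2

xs = [0.01, 0.02, 0.03, 0.07, 0.08, 0.09]            # invented 1-D data points
print(em_two_gaussians(xs, mu1=0.04, mu2=0.06))      # converges near (0.02, 0.08)
```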
97
Expectation Maximization (EM), iterating: repeat the E step and M step until the parameter estimates stop changing. Slide by Daniel S. Weld
101
101 EM for HMMs [E step] Compute probability of instance having each possible value of the hidden variable –Compute the forward and backward probabilities for given model parameters and our observations [M step] Treating each instance as fractionally having both values compute the new parameter values - Re-estimate the model parameters - Simple Counting
102
Summary - Learning Use hill-climbing –Called the forward-backward (or Baum/Welch) algorithm Idea –Use an initial parameter instantiation –Loop Compute the forward and backward probabilities for given model parameters and our observations Re-estimate the parameters –Until estimates don’t change much
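A compact Baum-Welch sketch following this recipe (numpy assumed; no scaling or priors, and the observation sequence is invented). The E step reuses the forward and backward recursions from earlier; in practice one would work with scaled or log-space values.

```python
import numpy as np

def forward_backward(x, start, trans, emit):
    N, K = len(x), len(start)
    alpha = np.zeros((N, K)); beta = np.zeros((N, K))
    alpha[0] = start * emit[:, x[0]]
    for t in range(1, N):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, x[t]]
    beta[-1] = 1.0
    for t in range(N - 2, -1, -1):
        beta[t] = trans @ (emit[:, x[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()

def baum_welch(x, start, trans, emit, iters=10):
    K, M = trans.shape[0], emit.shape[1]
    for _ in range(iters):
        alpha, beta, px = forward_backward(x, start, trans, emit)   # E step
        gamma = alpha * beta / px                    # P(y_t = i | x)
        xi = np.zeros((K, K))                        # expected transition counts
        for t in range(len(x) - 1):
            xi += (alpha[t][:, None] * trans *
                   (emit[:, x[t + 1]] * beta[t + 1])[None, :]) / px
        start = gamma[0]                             # M step: re-estimate by counting
        trans = xi / xi.sum(axis=1, keepdims=True)
        emit = np.zeros((K, M))
        for t, o in enumerate(x):
            emit[:, o] += gamma[t]
        emit /= emit.sum(axis=1, keepdims=True)
    return start, trans, emit

x = [5, 5, 5, 0, 2, 5, 1, 3, 5, 5]                   # die rolls, 0..5 = faces 1..6
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.full((2, 6), 1/6); emit[1] = [0.1] * 5 + [0.5]   # break symmetry
print(baum_welch(x, start, trans, emit))
```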
103
IE Resources Data –RISE, http://www.isi.edu/~muslea/RISE/index.html –Linguistic Data Consortium (LDC) Penn Treebank, Named Entities, Relations, etc. –http://www.biostat.wisc.edu/~craven/ie –http://www.cs.umass.edu/~mccallum/data Code –TextPro, http://www.ai.sri.com/~appelt/TextPro –MALLET, http://www.cs.umass.edu/~mccallum/mallet –SecondString, http://secondstring.sourceforge.net/ Both –http://www.cis.upenn.edu/~adwait/penntools.html Slides from Cohen & McCallum
104
References
[Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP'97, p. 194-201.
[Califf & Mooney 1999] Califf, M. E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
[Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002).
[Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
[Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.
[Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).
[Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).
[Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
[De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.
[Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
[Freitag, 1999] Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
[Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).
[Freitag & Kushmerick 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).
[Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
[Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118 (pp. 15-68).
[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
[Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master's thesis. UC San Diego.
[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F.: Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000.
[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226-233.
Slides from Cohen & McCallum