
1 Computer Science Department University of Massachusetts Amherst
11/17/2018 Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Ben Wellner, Khashayar Rohanimanesh, Wei Li

2 Goal: Mine actionable knowledge from unstructured text.
11/17/2018 Goal: Mine actionable knowledge from unstructured text. I want to help people make good decisions by leveraging the knowledge on the Web and other bodies of text. Sometimes document retrieval is enough for this, but sometimes you need to find patterns in structured data that span many pages. Let me give you some examples of what I mean.

3 Extracting Job Openings from the Web
11/17/2018 Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: DateExtracted: January 8, 2001 Source: OtherCompanyJobs: foodscience.com-Job1 WB’s first application area is extracting job openings from the web. Start at company home page, automatically follow links to find the job openings section, segment job openings from each other, and extract job title, location, description, and all of these fields. (Notice that capitalization, position on page, bold face are just as much indicators of this being a job title as the words themselves. Notice also that part of the IE task is figuring out which links to follow. And often data for a relation came from multiple different pages.)

4 A Portal for Job Openings
11/17/2018 A Portal for Job Openings

5 Job Openings: Category = High Tech Keyword = Java Location = U.S.
11/17/2018 Category = High Tech Job Openings: Keyword = Java Location = U.S. Not only did this support extremely useful and targeted search and data navigation, with highly summarized display

6 Data Mining the Extracted Job Information
11/17/2018 Data Mining the Extracted Job Information It also supported something that a web page search engine could never do: data mine the extracted information. Once a month, we data-mined our database of jobs and the changes in them to produce a Job Opportunity Index report … used by several state and local governments to help set policy.

7 IE from Chinese Documents regarding Weather
11/17/2018 IE from Chinese Documents regarding Weather Department of Terrestrial System, Chinese Academy of Sciences 200k+ documents several millennia old - Qing Dynasty Archives - memos - newspaper articles - diaries The Chinese Academy of Sciences has more than 200k documents dating back several millennia: memos, news articles, diaries. Among other things, they contain a wealth of information about weather (floods, droughts, temperature extremes). They are trying to mine this to help them understand today's global weather trends. The work is so important that they are currently doing it by hand, but they need to automate it to get through the volume.

8 IE from Research Papers
11/17/2018 [McCallum et al ‘99] Some of you might know about Cora, a contemporary of CiteSeer. This is a system created by Kamal Nigam, Kristie Seymore, Jason Rennie and myself that mines the web for Computer Science research papers. You can view search results in terms of summaries from extracted data. Here you see the automatically extracted titles, authors, institutions and abstracts.

9 IE from Research Papers
11/17/2018 IE from Research Papers CiteSeer, which we all know and love, is made possible by the automatic extraction of title and author information from research paper headers and references in order to build the reference graph that we use to find related work.

10 Mining Research Papers
11/17/2018 Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004] …and look at some mined aggregate statistics about the citation graph, Or as presented here two days ago, look at the most prominent authors in certain sub-fields [Giles et al]

11 What is “Information Extraction”
11/17/2018 What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + clustering + association October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Bill Veghte VP Richard Stallman founder Free Software Foundation Technically-speaking, what is involved in doing IE? As a family of techniques, it consists of Segmentation to find the beginning and ending boundaries of the text snippets that will go in the DB. Classification…

12 What is “Information Extraction”
11/17/2018 What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Bill Veghte VP Richard Stallman founder Free Software Foundation

13 What is “Information Extraction”
11/17/2018 What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Bill Veghte VP Richard Stallman founder Free Software Foundation

14 What is “Information Extraction”
11/17/2018 What is “Information Extraction” As a family of techniques: Information Extraction = segmentation + classification + association + clustering October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Bill Veghte VP Richard Stallman founder Free Software Foundation
Extracted relational table:
NAME               TITLE     ORGANIZATION
Bill Gates         CEO       Microsoft
Bill Veghte        VP        Microsoft
Richard Stallman   founder   Free Software Foundation

15 Larger Context IE Data Mining Spider Filter Segment Classify Associate
11/17/2018 Larger Context [Pipeline diagram] Document collection -> Spider -> Filter -> IE (Segment, Classify, Associate, Cluster) -> Database -> Data Mining (Discover patterns: entity types, links / relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support)

16 Outline Examples of IE and Data Mining.
11/17/2018 Outline Examples of IE and Data Mining. Brief review of Conditional Random Fields Joint inference: Motivation and examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Joint Segmentation and Co-ref (Iterated Conditional Samples) Efficiently training large, multi-component models: Piecewise Training

11/17/2018 Hidden Markov Models HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, … Finite state model / graphical model: a state sequence S(t-1), S(t), S(t+1), … with transitions, and observations O(t-1), O(t), O(t+1), … (o1 o2 o3 o4 o5 o6 o7 o8). Generates: a state sequence and an observation sequence. HMMs: standard tool…, an FSM with probabilities on transitions and emissions. Operate by starting... rolling a die... emitting a symbol, generating simultaneously a state sequence and an observation sequence. Its Markov and independence assumptions are as depicted in this graphical model. The random variable depicted by each node takes on the values of the states in the FSM. Emissions are traditionally atomic entities, like words. We say this is a generative model because it has parameters for the probability of producing the observed emissions (given the state). Parameters: for all states S={s1,s2,…} Start state probabilities: P(st) Transition probabilities: P(st|st-1) Observation (emission) probabilities: P(ot|st), usually a multinomial over an atomic, fixed alphabet. Training: maximize probability of training observations (w/ prior).

18 IE with Hidden Markov Models
11/17/2018 IE with Hidden Markov Models Given a sequence of observations: Yesterday Rich Caruana spoke this example sentence. and a trained HMM with states: person name, location name, background. Find the most likely state sequence: (Viterbi) How do these apply to IE? Given an HMM (some of whose states are associated with certain database fields), we find the most likely state sequence. Then any words Viterbi says were generated from the designated field states, we extract as field values. HMMs are an old standard, but... Yesterday Rich Caruana spoke this example sentence. Any words said to be generated by the designated “person name” state we extract as a person name: Person name: Rich Caruana
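A minimal decoding sketch (not from the talk): a toy HMM with background / person-name / location-name states, hand-set probabilities that are purely assumptions, and Viterbi decoding, so that words assigned to the person-name state are extracted as person names.

```python
# Sketch only: toy HMM states, probabilities, and emission model are assumptions.
import math

states = ["background", "person_name", "location_name"]
start_p = {"background": 0.8, "person_name": 0.1, "location_name": 0.1}
trans_p = {
    "background":    {"background": 0.8, "person_name": 0.1, "location_name": 0.1},
    "person_name":   {"background": 0.6, "person_name": 0.4, "location_name": 0.0},
    "location_name": {"background": 0.6, "person_name": 0.0, "location_name": 0.4},
}

def emit_p(state, word):
    # Toy emission model: capitalized words look like names; background emits anything.
    if state == "person_name":
        return 0.5 if word[:1].isupper() else 0.01
    if state == "location_name":
        return 0.2 if word[:1].isupper() else 0.01
    return 0.1

def viterbi(words):
    V = [{s: math.log(start_p[s]) + math.log(emit_p(s, words[0])) for s in states}]
    back = []
    for w in words[1:]:
        scores, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] + math.log(trans_p[p][s] + 1e-12))
            scores[s] = (V[-1][best_prev] + math.log(trans_p[best_prev][s] + 1e-12)
                         + math.log(emit_p(s, w)))
            ptr[s] = best_prev
        V.append(scores)
        back.append(ptr)
    # Trace back the most likely state sequence.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path

words = "Yesterday Rich Caruana spoke this example sentence .".split()
tags = viterbi(words)
print([w for w, t in zip(words, tags) if t == "person_name"])  # words labeled as person name
```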

19 We want More than an Atomic View of Words
11/17/2018 We want More than an Atomic View of Words Would like a richer representation of text: many arbitrary, overlapping features of the words: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, last person name was female, next two words are “and Associates”. (Figure: states S(t-1), S(t), S(t+1) over observations; example features: is “Wisniewski”, part of noun phrase, ends in “-ski”.) 1. Prefer a richer representation, one that allows for multiple, arbitrary features. In many cases, for example names and other large-vocabulary fields, the identity of the word alone is not very predictive because you have probably never seen it before. For example, “Wisniewski”: not a city. In many other cases, such as table segmentation, the relevant features are not really the words at all, but attributes of the line as a whole: emissions = whole lines of text at a time.
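To make this concrete, here is a small, assumed sketch of the kind of arbitrary, overlapping feature functions the slide lists; the gazetteer and feature names are illustrative stand-ins rather than a real resource.

```python
# Sketch only: overlapping, arbitrary word features; CITY_NAMES is a toy gazetteer.
CITY_NAMES = {"amherst", "boston", "cape town"}

def word_features(tokens, t):
    w = tokens[t]
    feats = {
        "word=" + w.lower(): 1.0,
        "ends_in_ski": float(w.lower().endswith("ski")),
        "is_capitalized": float(w[:1].isupper()),
        "in_city_list": float(w.lower() in CITY_NAMES),
    }
    # Features may also look at neighboring observations -- something an HMM's
    # per-state emission model cannot easily do.
    if t + 2 < len(tokens):
        feats["next_two=and_associates"] = float(
            tokens[t + 1].lower() == "and" and tokens[t + 2].lower() == "associates")
    return feats

print(word_features("Yesterday Rich Wisniewski and Associates opened".split(), 2))
```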

20 From HMMs to Conditional Random Fields
11/17/2018 From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001] St-1 St St+1 Joint ... Ot-1 Ot Ot+1 ... Conditional St-1 St St+1 ... For context, we'll begin again with our old friend the HMM; Joint. Here is the graphical model. To simplify drawing, temporarily collapsing many feature variables into one fat one. Want conditional: divide by the marginal prob of the observation seq. Don't want to estimate P(o|s); no reason we need a probability there; make it simply a potential; Z. Use a sparse representation for the potential function: log-linear model; here are features; param lambda corresponds to P(feature|state), but does not need to be a probability. Undirected graphical model we train to maximize conditional prob. This is the model. This is a super-special case, b/c as we discussed before, features of past & future... Here is a crash course in conditional-prob FSMs. Don’t let all the equations scare you: this slide is easier than it looks. Here is the old, traditional HMM: Prob of state and obs sequence is the product of each state transition and observation emission in the sequence. MEMMs: replace these with conditionals, where we no longer generate the obs, but condition on them. Corresponds to reversing the direction of the arrows... Belief propagation = the new forward-backward. Q: What functional form? For various reasons, an excellent choice is an exponential function: sum of features, each with a learned weight, exp()'ed, and normalized. This functional form (w/ the normalizer where it is here) corresponds to a graphical model with undirected edges, which is also known as a Markov Random Field in statistical physics. Ot-1 Ot Ot+1 ... P(s|o) = (1/Z(o)) * prod_t exp( sum_k lambda_k f_k(s_t-1, s_t, o_t) ), where Z(o) normalizes over all state sequences. (A super-special case of Conditional Random Fields.) Set parameters by maximum likelihood, using an optimization method on L.
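The log-linear form above can be sketched directly. The toy code below is an assumption for illustration, not the talk's implementation: it scores a label sequence with weighted feature functions and normalizes over whole label sequences, which is the role Z(o) plays in the formula (brute force here; dynamic programming in practice).

```python
# Sketch of p(s | o) ∝ exp( sum_t sum_k lambda_k f_k(s_{t-1}, s_t, o, t) ).
# Feature functions and weights are made up for illustration.
import itertools, math

LABELS = ["O", "PER"]
weights = {("prev=O", "cur=PER", "cap"): 1.2,
           ("prev=PER", "cur=PER", "cap"): 0.8,
           ("cur=O", "lower"): 0.5}

def clique_score(prev, cur, obs, t, w=weights):
    word = obs[t]
    feats = []
    if word[:1].isupper():
        feats.append(("prev=" + (prev or "O"), "cur=" + cur, "cap"))
    else:
        feats.append(("cur=" + cur, "lower"))
    return sum(w.get(f, 0.0) for f in feats)

def sequence_score(labels, obs):
    return sum(clique_score(labels[t - 1] if t else None, labels[t], obs, t)
               for t in range(len(obs)))

def prob(labels, obs):
    # Whole-sequence normalization: Z sums over every labeling (fine for toy inputs;
    # forward-backward dynamic programming does this efficiently in practice).
    Z = sum(math.exp(sequence_score(list(y), obs))
            for y in itertools.product(LABELS, repeat=len(obs)))
    return math.exp(sequence_score(labels, obs)) / Z

obs = ["Yesterday", "Rich", "Caruana", "spoke"]
print(prob(["O", "PER", "PER", "O"], obs))
```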

21 Conditional Random Fields
11/17/2018 Conditional Random Fields [Lafferty, McCallum, Pereira 2001] 1. FSM special-case: linear chain among unknowns, parameters tied across time steps. St St+1 St+2 St+3 St+4 O = Ot, Ot+1, Ot+2, Ot+3, Ot+4 "Non-super" special case The graphical model on the previous slide was a special case. In general, observations don’t have to be chopped up, and feature functions can ask arbitrary questions about any range of the observation sequence. Exact inference is very efficient with dynamic programming, as long as the graph among the unknowns is a linear chain or a tree. Not going to get into the method for learning the lambda weights from training data, but it is simply a matter of maximum likelihood: this is what we are trying to maximize, differentiate, and solve with Conjugate Gradient. Again, DP makes it efficient. 2. In general: CRFs = "Conditionally-trained Markov Network" arbitrary structure among unknowns 3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: Parameters tied across hits from SQL-like queries ("clique templates")

22 Linear-chain CRFs vs. HMMs
11/17/2018 Linear-chain CRFs vs. HMMs Comparable computational efficiency for inference Features may be arbitrary functions of any or all observations Parameters need not fully specify generation of observations; can require less training data Easy to incorporate domain knowledge …all features used in HMMs can be used here, just discriminatively-trained HMM. …need not fully specify: I’ll show you a strong example of this later. …domain knowledge == arbitrary features you make up
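For the "comparable computational efficiency" claim, here is a short sketch of the dynamic program: computing log Z for a linear chain costs O(T * |S|^2), the same order as HMM forward-backward. It reuses the illustrative clique_score function from the earlier sketch; both are assumptions, not the talk's code.

```python
# Sketch: partition function of a linear-chain CRF by the forward recursion.
import math

def log_Z(obs, labels, clique_score):
    alpha = {s: clique_score(None, s, obs, 0) for s in labels}
    for t in range(1, len(obs)):
        alpha = {s: math.log(sum(math.exp(alpha[p] + clique_score(p, s, obs, t))
                                 for p in labels))
                 for s in labels}
    return math.log(sum(math.exp(a) for a in alpha.values()))
```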

23 Table Extraction from Government Reports
11/17/2018 Table Extraction from Government Reports Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below Marketings totaled 154 billion pounds, 1 percent above Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than Calves were fed 78 percent of this milk with the remainder consumed in producer households. [Excerpt of the report table “Milk Cows and Production of Milk and Milkfat: United States”, with multi-line headers (Number of Milk Cows 1/; Production of Milk and Milkfat 2/: Per Milk Cow, Percentage of Fat in All Milk Produced, Total; units: 1,000 Head, Pounds, Percent, Million Pounds), data rows, and footnotes: 1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.] You can find reams and reams of government reports at fedstats.gov. Including this piece of a document about cow milk production Complicated structure: rich headers, data, footnotes saying things like, "Excludes..."

24 Table Extraction from Government Reports
11/17/2018 Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR] 100+ documents from fedstats.gov. Labels (12 in all), predicted by a CRF: Non-Table, Table Title, Table Header, Table Data Row, Table Section Data Row, Table Footnote, ... [The slide repeats the milk-production report excerpt from the previous slide, with its prose, table headers, data rows, and footnotes.] Our job is to find the tables, separating them from the rest of the text, and then label lines that have different functions in the table, such as title, header, data row... Do this with a CRF, one random variable per line of text. Large array of features assessing contents and formatting ...as well as, importantly, conjunctions: exponential models are log-LINEAR models, so to form complex decision boundaries you need higher dimensionality. Just like: you can't represent XOR with sum and threshold, but you can with sum, conjunction and threshold. Features: Percentage of digit chars; Percentage of alpha chars; Indented; Contains 5+ consecutive spaces; Whitespace in this line aligns with prev.; ...; Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
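A hedged sketch of the per-line features and conjunctions described here; the thresholds, offsets, and feature names below are simplified assumptions, not the paper's exact feature set.

```python
# Sketch only: per-line features for table extraction, plus cross-line conjunctions.
def line_features(lines, i):
    line = lines[i]
    n = max(len(line), 1)
    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": float(line.startswith(" ")),
        "five_spaces": float("     " in line),
    }

def with_conjunctions(lines, i, offsets=(-1, 0, 1, 2)):
    # Conjunctions of "on" features across nearby lines give the log-linear
    # model the non-linear (XOR-like) boundaries the notes mention.
    base = line_features(lines, i)
    feats = dict(base)
    for off in offsets:
        j = i + off
        if 0 <= j < len(lines) and j != i:
            other = line_features(lines, j)
            for a, va in base.items():
                for b, vb in other.items():
                    if va >= 0.5 and vb >= 0.5:
                        feats[f"{a}&{b}@{off}"] = 1.0
    return feats

lines = ["Milk Cows and Production of Milk and Milkfat",
         "  1,000 Head   Pounds   Percent",
         "  9,589        16,400   3.66"]
print(with_conjunctions(lines, 2))
```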

25 Table Extraction Experimental Results
11/17/2018 Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Method             Line labels, percent correct   Table segments, F1
HMM                65 %                           64 %
Stateless MaxEnt   85 %                           -
CRF                95 %                           92 %
Here is an HMM (generative model). Here is the conditional model, but without the benefit of context from a FSM. Conditional RF w/out conjunctions (pretty important!) The main model as introduced. The story is similar when scoring not individual lines, but whole segments being correct.

26 IE from Research Papers
11/17/2018 [McCallum et al ‘99] Some of you might know about Cora, a contemporary of CiteSeer. This is a system created by Kamal Nigam, Kristie Seymore, Jason Rennie and myself that mines the web for Computer Science research papers. You can view search results in terms of summaries from extracted data. Here you see the automatically extracted titles, authors, institutions and abstracts.

27 IE from Research Papers
11/17/2018 IE from Research Papers Field-level F1:
Hidden Markov Models (HMMs)        [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]   (40% error reduction)
HMMs have FSM sequence modeling, but are generative. SVMs are discriminative, but don’t do full inference on sequences.

28 Named Entity Recognition
11/17/2018 Named Entity Recognition CRICKET - MILLNS SIGNS FOR BOLAND CAPE TOWN South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional. Labels and examples:
PER: Yayuk Basuki, Innocent Butare
ORG: 3M, KDP, Cleveland
LOC: Cleveland, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally
Nirmal Hriday is a hospice house run by Mother Teresa.

29 Named Entity Extraction Results
11/17/2018 Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]
Method                          F1
HMMs: BBN's Identifinder        73%
CRFs w/out Feature Induction    83%
CRFs with Feature Induction     90%
(Feature induction based on LikelihoodGain)

30 Outline Examples of IE and Data Mining.
11/17/2018 Outline Examples of IE and Data Mining. Brief review of Conditional Random Fields Joint inference: Motivation and examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Joint Segmentation and Co-ref (Iterated Conditional Samples) Efficiently training large, multi-component models: Piecewise Training

31 Larger Context IE Data Mining Spider Filter Segment Classify Associate
11/17/2018 Larger Context [Pipeline diagram] Document collection -> Spider -> Filter -> IE (Segment, Classify, Associate, Cluster) -> Database -> Data Mining (Discover patterns: entity types, links / relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support)

32 Problem: Combined in serial juxtaposition,
11/17/2018 Problem: Combined in serial juxtaposition, IE and KD are unaware of each others’ weaknesses and opportunities. KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties. IE is unaware of emerging patterns and regularities in the DB.  The accuracy of both suffers, and significant mining of complex text sources is beyond reach. IE research mildly stuck… this is proposed solution. Although information extraction and data mining appear together in many applications, their interface in most current systems would better be described as serial juxtaposition than as tight integration. Information extraction populates slots in a database by identifying relevant subsequences of text, but is usually not aware of the emerging patterns and regularities in the database. Data mining methods begin from a populated database, and are often unaware of where the data came from, or its inherent uncertainties. The result is that the accuracy of both suffers, and significant mining of complex text sources is beyond reach.

33 Solution: IE Data Mining Uncertainty Info Emerging Patterns Spider
11/17/2018 Solution: [Diagram] Document collection -> Spider -> Filter -> IE (Segment, Classify, Associate, Cluster) -> Database -> Data Mining (Discover patterns: entity types, links / relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support), with IE passing Uncertainty Info forward to Data Mining and Data Mining feeding Emerging Patterns back to IE.

34 Solution: IE Unified Model Complex Inference and Learning Data Mining
11/17/2018 Solution: Unified Model. [Diagram] Document collection -> Spider -> Filter -> a single Probabilistic Model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events) -> Actionable knowledge (Prediction, Outlier detection, Decision support). The probabilistic model: Conditional Random Fields [Lafferty, McCallum, Pereira], Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…], discriminatively-trained undirected graphical models. Complex Inference and Learning: just what we researchers like to sink our teeth into!

35 Larger-scale Joint Inference for IE
11/17/2018 Larger-scale Joint Inference for IE What model structures will capture salient dependencies? Will joint inference improve accuracy? How to do inference in these large graphical models? How to efficiently train these models, which are built from multiple large components?

36 1. Jointly labeling cascaded sequences Factorial CRFs
11/17/2018 1. Jointly labeling cascaded sequences Factorial CRFs [Sutton, Khashayar, McCallum, ICML 2004] Named-entity tag Noun-phrase boundaries Part-of-speech English words In some situations we want to output not a single label sequence but several, for example… Rather than performing each of these labeling tasks serially in a cascade, Factorial CRFs perform all three tasks jointly, Providing matching accuracy with only 50% of the training data

37 1. Jointly labeling cascaded sequences Factorial CRFs
11/17/2018 1. Jointly labeling cascaded sequences Factorial CRFs [Sutton, Khashayar, McCallum, ICML 2004] Named-entity tag Noun-phrase boundaries Part-of-speech English words In some situations we want to output not a single label sequence but several, for example… Rather than performing each of these labeling tasks serially in a cascade, Factorial CRFs perform all three tasks jointly, Providing matching accuracy with only 50% of the training data

38 1. Jointly labeling cascaded sequences Factorial CRFs
11/17/2018 1. Jointly labeling cascaded sequences Factorial CRFs [Sutton, Khashayar, McCallum, ICML 2004] Named-entity tag Noun-phrase boundaries Part-of-speech English words In some situations we want to output not a single label sequence but several, for example… Rather than performing each of these labeling tasks serially in a cascade, Factorial CRFs perform all three tasks jointly, Providing matching accuracy with only 50% of the training data But errors cascade--must be perfect at every stage to do well.

39 1. Jointly labeling cascaded sequences Factorial CRFs
11/17/2018 1. Jointly labeling cascaded sequences Factorial CRFs [Sutton, Khashayar, McCallum, ICML 2004] Named-entity tag Noun-phrase boundaries Part-of-speech English words In some situations we want to output not a single label sequence but several, for example… Rather than performing each of these labeling tasks serially in a cascade, Factorial CRFs perform all three tasks jointly, Providing matching accuracy with only 50% of the training data Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data. Inference: Tree reparameterization BP [Wainwright et al, 2002]

40 2. Jointly labeling distant mentions Skip-chain CRFs
11/17/2018 2. Jointly labeling distant mentions Skip-chain CRFs [Sutton, McCallum, SRL 2004] Senator Joe Green said today … Green ran for … Dependency among similar, distant mentions ignored. Skip-chain CRFs allow the model to capture dependencies among the labels of distant words, Such as this occurrence of “Green” in which there is a lot of local evidence that it is a person’s name, and this one in which there isn’t.

41 2. Jointly labeling distant mentions Skip-chain CRFs
11/17/2018 2. Jointly labeling distant mentions Skip-chain CRFs [Sutton, McCallum, SRL 2004] Senator Joe Green said today … Green ran for … 14% reduction in error on most repeated field in seminar announcements. Skip-chain CRFs allow the model to capture dependencies among the labels of distant words, Such as this occurrence of “Green” in which there is a lot of local evidence that it is a person’s name, and this one in which there isn’t. Inference: Tree reparameterization BP [Wainwright et al, 2002]
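A rough sketch of how the skip edges might be constructed on top of the linear chain; the heuristic of linking identical capitalized tokens is an assumption for illustration. The added edges make the graph loopy, hence the tree-reparameterization BP mentioned on the slide.

```python
# Sketch only: build linear-chain edges plus "skip" edges between distant tokens
# that are likely mentions of the same entity (identical capitalized words).
def skip_edges(tokens):
    edges = [(t, t + 1) for t in range(len(tokens) - 1)]   # linear chain
    seen = {}
    for t, w in enumerate(tokens):
        if w[:1].isupper():
            for prev in seen.get(w, []):
                edges.append((prev, t))                     # skip edge
            seen.setdefault(w, []).append(t)
    return edges

tokens = "Senator Joe Green said today ... Green ran for ...".split()
print(skip_edges(tokens))
```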

42 3. Joint co-reference among all pairs Affinity Matrix CRF
11/17/2018 3. Joint co-reference among all pairs Affinity Matrix CRF “Entity resolution” “Object correspondence” . . . Mr Powell . . . 45 . . . Powell . . . Y/N Y/N -99 Y/N ~25% reduction in error on co-reference of proper nouns in newswire. 11 Traditionally in NLP co-reference has been performed by making independent coreference decisions on each pair of entity mentions. An Affinity Matrix CRF jointly makes all coreference decisions together, accounting for multiple constraints. . . . she . . . Inference: Correlational clustering graph partitioning [McCallum, Wellner, IJCAI WS 2003, NIPS 2004] [Bansal, Blum, Chawla, 2002]

43 Coreference Resolution
11/17/2018 Coreference Resolution AKA "record linkage", "database record deduplication", "entity resolution", "object correspondence", "identity uncertainty" Input Output News article, with named-entity "mentions" tagged Number of entities, N = 3 #1 Secretary of State Colin Powell he Mr. Powell Powell #2 Condoleezza Rice she Rice #3 President Bush Bush Today Secretary of State Colin Powell met with he Condoleezza Rice Mr Powell she Powell . . . President Bush Rice Bush One instantiation of this problem is known as co-reference in NLP. Common problem in many fields: DB, bibliometrics, computer vision, generally... Input is...

44 Inside the Traditional Solution
11/17/2018 Inside the Traditional Solution Pair-wise Affinity Metric between Mention (3) ". . . Mr Powell . . ." and Mention (4) ". . . Powell . . ." Each test fires Y/N and contributes its weight:
Y/N   Feature                                           Weight
N     Two words in common                               29
Y     One word in common                                13
Y     "Normalized" mentions are string identical        39
Y     Capitalized word in common                        17
Y     > 50% character tri-gram overlap                  19
N     < 25% character tri-gram overlap                  -34
Y     In same sentence                                  9
Y     Within two sentences                              8
N     Further than 3 sentences apart                    -1
Y     "Hobbs Distance" <
N     Number of entities in between two mentions = 0    12
N     Number of entities in between two mentions > 4    -3
Y     Font matches                                      1
Y     Default
OVERALL SCORE =        > threshold = 0
Traditional solution is to use a learned pair-wise affinity metric... (Tree distance in parse tree, called "Hobbs distance".)
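A small sketch of such a pair-wise affinity metric: boolean tests contribute their weights, and the pair is merged if the total clears the threshold. The weights echo the slide, but the string tests themselves are simplified stand-ins.

```python
# Sketch only: a weighted sum of boolean affinity tests, thresholded.
def trigrams(s):
    s = s.lower().replace(" ", "")
    return {s[i:i + 3] for i in range(len(s) - 2)}

def affinity(m1, m2, threshold=0.0):
    w1, w2 = set(m1.lower().split()), set(m2.lower().split())
    overlap = len(trigrams(m1) & trigrams(m2)) / max(len(trigrams(m1) | trigrams(m2)), 1)
    score = 0.0
    score += 29 if len(w1 & w2) >= 2 else 0          # two words in common
    score += 13 if len(w1 & w2) == 1 else 0          # one word in common
    score += 17 if any(a == b and a[:1].isupper()    # capitalized word in common
                       for a in m1.split() for b in m2.split()) else 0
    score += 19 if overlap > 0.50 else 0             # > 50% trigram overlap
    score += -34 if overlap < 0.25 else 0            # < 25% trigram overlap
    return score, score > threshold

print(affinity("Mr Powell", "Powell"))
```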

45 11/17/2018 The Problem Pair-wise merging decisions are being made independently from each other . . . Mr Powell . . . affinity = 98 Y . . . Powell . . . N affinity = -104 They should be made in relational dependence with each other. Y affinity = 11 The problem is that pairwise noisy... you can get inconsistent-triangles and then poor accuracy. The way the algorithm works is to sort the pair-wise affinities, and performing greedy agglomerative clustering with Union-Find. Stops merging when it gets to the threshold. Never even sees the inconsistent No. Problem=...being made independently They should... . . . she . . . Affinity measures are noisy and imperfect.

46 A Generative Model Solution
11/17/2018 [Russell 2001], [Pasula et al 2002], [Milch et al 2003], [Marthi et al 2003] (Applied to citation matching, and object correspondence in vision) N id context words id surname distance fonts gender age . Directed model, generative of observations Stuart Russell, Hanna Pasula, Brian Milch and Baskara Marthi… they capture relations. He uses a generative BN, in which generate N, entities, mentions. However two problems: 1) ...so could take this same random var structure, and turn into a conditional Markov random field, as we did with HMMs but... Alternating parameter estimation and model structure changes. Markov-chain Monte Carlo, Metropolis Hastings... .

47 A Markov Random Field for Co-reference
11/17/2018 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003, ICML] . . . Mr Powell . . . 45 Make pair-wise merging decisions in dependent relation to each other by - calculating a joint prob. - including all edge weights - adding dependence on consistent triangles. . . . Powell . . . Y/N Y/N -30 Y/N Shape of the picture looks the same, but optimization criterion and algorithm are quite different. As with most MRF approaches, begin by writing down the random variables, and connecting them into cliques that bring together the variables whose compatibility you want to check. As before x's are the given observations, the shaded nodes (here the mentions), and the y's are the unknowns, the clear nodes (here the co-reference decisions) Let's check compatibility between pairs of mentions like before, but this time, calculate the probability of all Y/N variables at once, as a joint probability (indicated with the y-vector). We follow Hammersley-Clifford theorem again, write down an exponential function of the cliques of the graph, and we get this. Notice that the probability depends on all pairs. 11 . . . she . . .

48 A Markov Random Field for Co-reference
11/17/2018 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003] . . . Mr Powell . . . 45 Make pair-wise merging decisions in dependent relation to each other by - calculating a joint prob. - including all edge weights - adding dependence on consistent triangles. . . . Powell . . . Y/N Y/N -30 Y/N Shape of the picture looks the same, but optimization criterion and algorithm are quite different. As with most MRF approaches, begin by writing down the random variables, and connecting them into cliques that bring together the variables whose compatibility you want to check. As before x's are the given observations, the shaded nodes (here the mentions), and the y's are the unknowns, the clear nodes (here the co-reference decisions) Let's check compatibility between pairs of mentions like before, but this time, calculate the probability of all Y/N variables at once, as a joint probability (indicated with the y-vector). We follow Hammersley-Clifford theorem again, write down an exponential function of the cliques of the graph, and we get this. Notice that the probability depends on all pairs. 11 . . . she . . .

49 A Markov Random Field for Co-reference
11/17/2018 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003] . . . Mr Powell . . . -(45) . . . Powell . . . N N -(-30) Y Let's consider some examples. We define f() to have value 0 when obs-test (like cap-word in common) is false, otherwise 1 if Y, -1 if N. We "like this" with score -4. +(11) . . . she . . . -4

50 A Markov Random Field for Co-reference
11/17/2018 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003] . . . Mr Powell . . . +(45) . . . Powell . . . Y N -(-30) Y "like this with score -inf +(11) -infinity . . . she . . .

51 A Markov Random Field for Co-reference
11/17/2018 A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003] . . . Mr Powell . . . +(45) . . . Powell . . . Y N -(-30) N Like this with score 64. When viewed in isolation we would have prefered to include "she" with "Powell", but that would have also required "she" = "Mr Powell". -(11) . . . she . . . 64

52 Inference in these MRFs = Graph Partitioning
11/17/2018 Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002] . . . Mr Powell . . . 45 . . . Powell . . . -106 -30 -134 11 . . . Condoleezza Rice . . . . . . she . . . 10

53 Inference in these MRFs = Graph Partitioning
11/17/2018 Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002] . . . Mr Powell . . . 45 . . . Powell . . . -106 -30 -134 11 . . . Condoleezza Rice . . . . . . she . . . 10 = -22

54 Inference in these MRFs = Graph Partitioning
11/17/2018 Inference in these MRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002] . . . Mr Powell . . . 45 . . . Powell . . . -106 -30 -134 11 . . . Condoleezza Rice . . . . . . she . . . 10 = 314
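The partitioning objective behind these three slides can be written down directly: a partition's score adds within-cluster edge weights and subtracts cross-cluster ones. Using the edge weights shown on the slides, this reproduces the two worked scores (314 for the good partition, -22 for the poor one); which weight attaches to which mention pair is inferred from those worked scores.

```python
# Sketch of the partition objective: sum within-cluster weights, subtract cross-cluster weights.
from itertools import combinations

weights = {
    ("Mr Powell", "Powell"): 45, ("Mr Powell", "she"): -30,
    ("Powell", "she"): 11, ("Condoleezza Rice", "she"): 10,
    ("Mr Powell", "Condoleezza Rice"): -106,
    ("Powell", "Condoleezza Rice"): -134,
}

def w(a, b):
    return weights.get((a, b), weights.get((b, a), 0))

def partition_score(clusters):
    mentions = [m for c in clusters for m in c]
    cluster_of = {m: i for i, c in enumerate(clusters) for m in c}
    return sum(w(a, b) if cluster_of[a] == cluster_of[b] else -w(a, b)
               for a, b in combinations(mentions, 2))

print(partition_score([{"Mr Powell", "Powell"}, {"Condoleezza Rice", "she"}]))  # 314
print(partition_score([{"Mr Powell"}, {"Powell", "Condoleezza Rice", "she"}]))  # -22
```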

55 Co-reference Experimental Results
11/17/2018 Co-reference Experimental Results [McCallum & Wellner, 2003] Proper noun co-reference.
DARPA ACE broadcast news transcripts, 117 stories:
Method                     Partition F1   Pair F1
Single-link threshold      16 %           18 %
Best prev match [Morton]   83 %           89 %
MRFs                       %              92 %    (error reduction: 30% / 28%)
DARPA MUC-6 newswire article corpus, 30 stories:
Method                     Partition F1   Pair F1
Single-link threshold      11 %           7 %
Best prev match [Morton]   70 %           76 %
MRFs                       %              80 %    (error reduction: 13% / 17%)

56 Joint co-reference among all pairs Affinity Matrix CRF
11/17/2018 Joint co-reference among all pairs Affinity Matrix CRF . . . Mr Powell . . . 45 . . . Powell . . . Y/N Y/N -99 Y/N ~25% reduction in error on co-reference of proper nouns in newswire. 11 Traditionally in NLP co-reference has been performed by making independent coreference decisions on each pair of entity mentions. An Affinity Matrix CRF jointly makes all coreference decisions together, accounting for multiple constraints. . . . she . . . Inference: Correlational clustering graph partitioning [McCallum, Wellner, IJCAI WS 2003, NIPS 2004] [Bansal, Blum, Chawla, 2002]

57 4. Joint segmentation and co-reference
11/17/2018 4. Joint segmentation and co-reference Extraction from and matching of research paper citations. [Graphical-model figure: observed citation strings (o), segmentation variables (s), citation attributes (c), co-reference decisions (y), database field values (p), and World Knowledge. Example citations: "Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990." and "Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, , 1990."] Recently we have been working with models that have one component for segmenting a citation string into fields, and another for coreference---citation matching---and performing both segmentation and coreference jointly. 35% reduction in co-reference error by using segmentation uncertainty. 6-14% reduction in segmentation error by using co-reference. Inference: Variant of Iterated Conditional Modes [Wellner, McCallum, Peng, Hay, UAI 2004] see also [Marthi, Milch, Russell, 2003] [Besag, 1986]

58 Joint IE and Coreference from Research Paper Citations
11/17/2018 4. Joint segmentation and co-reference Joint IE and Coreference from Research Paper Citations Textual citation mentions (noisy, with duplicates) Paper database, with fields, clean, duplicates collapsed AUTHORS TITLE VENUE Cowell, Dawid… Probab… Springer Montemerlo, Thrun…FastSLAM… AAAI… Kjaerulff Approxi… Technic… Here is plan... Start with a quick review of technical foundations from my previous talk.

59 Citation Segmentation and Coreference
11/17/2018 Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , ,

60 Citation Segmentation and Coreference
11/17/2018 Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , , Segment citation fields

61 Citation Segmentation and Coreference
11/17/2018 Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , Y ? N Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , , Segment citation fields Resolve coreferent citations

62 Citation Segmentation and Coreference
11/17/2018 Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , Y ? N Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , , AUTHOR = Brenda Laurel TITLE = Interface Agents: Metaphors with Character PAGES = BOOKTITLE = The Art of Human-Computer Interface Design EDITOR = T. Smith PUBLISHER = Addison-Wesley YEAR = 1990 Segment citation fields Resolve coreferent citations Form canonical database record Resolving conflicts

63 Citation Segmentation and Coreference
11/17/2018 Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , Y ? N Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , , AUTHOR = Brenda Laurel TITLE = Interface Agents: Metaphors with Character PAGES = BOOKTITLE = The Art of Human-Computer Interface Design EDITOR = T. Smith PUBLISHER = Addison-Wesley YEAR = 1990 Segment citation fields Resolve coreferent citations Form canonical database record Perform jointly.

64 IE + Coreference Model CRF Segmentation Observed citation
11/17/2018 AUT AUT YR TITL TITL CRF Segmentation s Observed citation x J Besag 1986 On the…

65 Citation mention attributes
IE + Coreference Model 11/17/2018 AUTHOR = “J Besag” YEAR = “1986” TITLE = “On the…” Citation mention attributes c CRF Segmentation s Observed citation x J Besag 1986 On the…

66 Structure for each citation mention
IE + Coreference Model 11/17/2018 Smyth , P Data mining… Structure for each citation mention c s x Smyth Data Mining… J Besag 1986 On the…

67 Binary coreference variables for each pair of mentions
IE + Coreference Model 11/17/2018 Smyth , P Data mining… Binary coreference variables for each pair of mentions c s x Smyth Data Mining… J Besag 1986 On the…

68 Binary coreference variables for each pair of mentions
IE + Coreference Model 11/17/2018 Smyth , P Data mining… Binary coreference variables for each pair of mentions y n n c s x Smyth Data Mining… J Besag 1986 On the…

69 Research paper entity attribute nodes
IE + Coreference Model 11/17/2018 Smyth , P Data mining… AUTHOR = “P Smyth” YEAR = “2001” TITLE = “Data Mining…” ... Research paper entity attribute nodes y n Structures only appear if matching Y or N. … In in talk on Case Factor Diagrams. n c s x Smyth Data Mining… J Besag 1986 On the…

70 Research paper entity attribute node
IE + Coreference Model 11/17/2018 Smyth , P Data mining… Research paper entity attribute node y y Structures only appear if matching Y or N. … In in talk on Case Factor Diagrams. y c s x Smyth Data Mining… J Besag 1986 On the…

71 IE + Coreference Model 11/17/2018 Smyth , P Data mining… y n Structures only appear if matching Y or N. … In in talk on Case Factor Diagrams. n c s x Smyth Data Mining… J Besag 1986 On the…

72 Such a highly connected graph makes exact inference intractable, so…
11/17/2018 Such a highly connected graph makes exact inference intractable, so… In other large-scale, real-world domains where they are using graphical models: Structured approximate inference Sampling

73 Approximate Inference 1
11/17/2018 m1(v2) m2(v3) Loopy Belief Propagation v1 v2 v3 m2(v1) m3(v2) messages passed between nodes v4 v5 v6

74 Approximate Inference 1
11/17/2018 m1(v2) m2(v3) Loopy Belief Propagation Generalized Belief v1 v2 v3 m2(v1) m3(v2) messages passed between nodes v4 v5 v6 v6 v5 v3 v2 v1 v4 v9 v8 v7 messages passed between regions ..in some case convergence can be improved by taking advantage of natural structures in the graph… Here, a message is a conditional probability table passed among nodes. But when message size grows exponentially with size of overlap between regions!
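For readers who want the message update itself, here is a toy loopy-BP sketch on a small pairwise MRF with a cycle; the potentials and graph are assumptions, and as the notes say, convergence on loopy graphs is only hoped for.

```python
# Sketch: loopy belief propagation on a pairwise MRF with binary variables.
import math

nodes = ["v1", "v2", "v3"]
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v1")]            # a cycle
node_pot = {"v1": [1.0, 2.0], "v2": [1.0, 1.0], "v3": [3.0, 1.0]}

def edge_pot(xi, xj):
    return 2.0 if xi == xj else 1.0                           # prefers agreement

def neighbors(n):
    return [b if a == n else a for a, b in edges if n in (a, b)]

msgs = {}
for a, b in edges:
    msgs[(a, b)] = [1.0, 1.0]
    msgs[(b, a)] = [1.0, 1.0]

for _ in range(50):                                           # iterate until (roughly) converged
    for (i, j) in list(msgs):
        new = []
        for xj in (0, 1):
            total = 0.0
            for xi in (0, 1):
                incoming = math.prod(msgs[(k, i)][xi] for k in neighbors(i) if k != j)
                total += node_pot[i][xi] * edge_pot(xi, xj) * incoming
            new.append(total)
        z = sum(new)
        msgs[(i, j)] = [v / z for v in new]

def belief(n):                                                # approximate marginal
    b = [node_pot[n][x] * math.prod(msgs[(k, n)][x] for k in neighbors(n)) for x in (0, 1)]
    z = sum(b)
    return [v / z for v in b]

print({n: belief(n) for n in nodes})
```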

75 Approximate Inference 2
11/17/2018 Iterated Conditional Modes (ICM) [Besag 1986] Nodes v1 v2 v3 v4 v5 (= held constant); update v6: v6^(i+1) = argmax over v6 of P(v6^(i) | v \ v6^(i))

76 Approximate Inference 2
11/17/2018 Iterated Conditional Modes (ICM) [Besag 1986] Nodes v1 v2 v3 v4 v6 (= held constant); update v5: v5^(j+1) = argmax over v5 of P(v5^(j) | v \ v5^(j))

77 Approximate Inference 2
11/17/2018 Iterated Conditional Modes (ICM) [Besag 1986] Nodes v1 v2 v3 v5 v6 (= held constant); update v4: v4^(k+1) = argmax over v4 of P(v4^(k) | v \ v4^(k)) Structured inference scales well here, but it is greedy, and easily falls into local minima.

78 Approximate Inference 2
11/17/2018 Iterated Conditional Modes (ICM) [Besag 1986] Iterated Conditional Sampling (ICS) (our name): instead of selecting only the argmax, keep a sample of high-scoring values of P(v4^(k) | v \ v4^(k)), e.g. an N-best list (the top N values). Nodes v1 v2 v3 v5 v6 held constant; v4^(k+1) = argmax over v4 of P(v4^(k) | v \ v4^(k)). We have called this ICS; we’d be happy to hear about pre-existing names in the literature. Can use a “Generalized Version” of this, doing exact inference on a region of several nodes at once. Here, a “message” grows only linearly with overlap region size and N!
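A compact sketch of ICM and of the N-best variant described here; the toy scoring function and variable names are assumptions for illustration.

```python
# Sketch: ICM (greedy argmax per variable) and an ICS-style N-best step.
import itertools

def conditional_scores(var, assignment, values, score_fn):
    # Score each candidate value of `var`, all other variables held constant.
    scores = {}
    for v in values:
        trial = dict(assignment)
        trial[var] = v
        scores[v] = score_fn(trial)
    return scores

def icm(assignment, values, score_fn, sweeps=10):
    for _ in range(sweeps):
        for var in assignment:
            scores = conditional_scores(var, assignment, values, score_fn)
            assignment[var] = max(scores, key=scores.get)       # greedy: argmax only
    return assignment

def ics_step(var, assignment, values, score_fn, n_best=3):
    scores = conditional_scores(var, assignment, values, score_fn)
    return sorted(scores, key=scores.get, reverse=True)[:n_best]  # top-N, not just argmax

# Toy usage: three binary variables that prefer to agree.
def score(a):
    return sum(1.0 for x, y in itertools.combinations(a.values(), 2) if x == y)

print(icm({"v4": 0, "v5": 1, "v6": 1}, [0, 1], score))
print(ics_step("v4", {"v4": 0, "v5": 1, "v6": 1}, [0, 1], score, n_best=2))
```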

79 Features of this Inference Method
11/17/2018 Structured or “factored” representation (ala GBP) Uses samples to approximate density Closed-loop message-passing on loopy graph (ala BP) Related Work Beam search “Forward”-only inference Particle filtering, e.g. [Doucet 1998] Usually on tree-shaped graph, or “feedforward” only. MC Sampling…Embedded HMMs [Neal, 2003] Sample from high-dim continuous state space; do forward-backward Sample Propagation [Paskin, 2003] Messages = samples, on a junction tree Fields to Trees [Hamze & de Freitas, UAI 2003] Rao-Blackwellized MCMC, partitioning G into non-overlapping trees Factored Particles for DBNs [Ng, Peshkin, Pfeffer, 2002] Combination of Particle Filtering and Boyen-Koller for DBNs --- Blocking Gibbs Sampling??

80 IE + Coreference Model Exact inference on these linear-chain regions
11/17/2018 Smyth , P Data mining… Exact inference on these linear-chain regions From each chain pass an N-best List into coreference Smyth Data Mining… J Besag 1986 On the…

81 IE + Coreference Model Approximate inference by graph partitioning…
11/17/2018 Smyth , P Data mining… Approximate inference by graph partitioning… Make scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000] …integrating out uncertainty in samples of extraction Smyth Data Mining… J Besag 1986 On the…
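The canopies trick mentioned here can be sketched as follows: a cheap similarity groups mentions into overlapping canopies, and the expensive pairwise model is run only on pairs inside a canopy. The cheap metric and thresholds below are assumptions, not the exact choices of McCallum, Nigam & Ungar.

```python
# Sketch only: canopies with a loose and a tight threshold on a cheap similarity.
def cheap_sim(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def canopies(mentions, t_loose=0.2, t_tight=0.6):
    remaining = set(range(len(mentions)))
    result = []
    while remaining:
        center = next(iter(remaining))
        canopy = {i for i in range(len(mentions))
                  if cheap_sim(mentions[center], mentions[i]) >= t_loose}
        result.append(canopy)
        # Points within the tight threshold never serve as centers again,
        # but may still fall into other (overlapping) canopies.
        remaining -= {i for i in canopy
                      if cheap_sim(mentions[center], mentions[i]) >= t_tight}
        remaining.discard(center)
    return result

def candidate_pairs(mentions):
    pairs = set()
    for canopy in canopies(mentions):
        pairs |= {(i, j) for i in canopy for j in canopy if i < j}
    return pairs

cites = ["Smyth P Data mining", "P Smyth Data Mining", "J Besag 1986 On the statistical analysis"]
print(candidate_pairs(cites))
```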

82 Exact (exhaustive) inference over entity attributes
IE + Coreference Model 11/17/2018 Smyth , P Data mining… Exact (exhaustive) inference over entity attributes y n Structures only appear if matching Y or N. … In in talk on Case Factor Diagrams. n Smyth Data Mining… J Besag 1986 On the…

83 IE + Coreference Model 11/17/2018 Smyth , P Data mining… Revisit exact inference on IE linear chain, now conditioned on entity attributes y n Structures only appear if matching Y or N. … In in talk on Case Factor Diagrams. n Smyth Data Mining… J Besag 1986 On the…

84 Separately for different regions
Parameter Estimation 11/17/2018 Separately for different regions: IE linear-chain: exact MAP. Coref graph edge weights: MAP on individual edges. Entity attribute potentials: MAP, pseudo-likelihood. In all cases: climb the MAP gradient with a quasi-Newton method. (Structures only appear if matching Y or N, as in the talk on Case Factor Diagrams.)

85 Experimental Results Set of citations from CiteSeer
11/17/2018 Experimental Results Set of citations from CiteSeer 1500 citation mentions to 900 paper entities Hand-labeled for coreference and field-extraction Divided into 4 subsets, each on a different topic RL, Face detection, Reasoning, Constraint Satisfaction Within each subset many citations share authors, publication venues, publishers, etc. 70% of the citation mentions are singletons

86 Coreference cluster recall
11/17/2018 Coreference Results Coreference cluster recall (columns: Reinforce, Face, Reason, Constraint):
N = 1 (Baseline):  0.946  0.96  0.94
N = 3:             0.95   0.98
N = 7:             0.97
N = 9:             0.982
Optimal:           0.99
Average error reduction is 35%. “Optimal” makes best use of the N-best list by using true labels. Indicates that even more improvement can be obtained

87 Information Extraction Results
11/17/2018 Information Extraction Results Segmentation F1 (columns: Reinforce, Face, Reason, Constraint):
Baseline:     .943   .908   .929   .934
w/ Coref:     .949   .914   .935
Err. Reduc.:  .101   .062   .090   .142
P-value:      .0442  .0014  .0001
Error reduction ranges from 6-14%. Small, but significant at 95% confidence level (p-value < 0.05). Biggest limiting factor in both sets of results: the data set is small, and does not have large coreferent sets.

88 Separately for different regions
Parameter Estimation 11/17/2018 Separately for different regions: IE linear-chain: exact MAP. Coref graph edge weights: MAP on individual edges. Entity attribute potentials: MAP, pseudo-likelihood. In all cases: climb the MAP gradient with a quasi-Newton method. (Structures only appear if matching Y or N, as in the talk on Case Factor Diagrams.)

89 Outline Examples of IE and Data Mining.
11/17/2018 Outline Examples of IE and Data Mining. Brief review of Conditional Random Fields Joint inference: Motivation and examples Joint Labeling of Cascaded Sequences (Belief Propagation) Joint Labeling of Distant Entities (BP by Tree Reparameterization) Joint Co-reference Resolution (Graph Partitioning) Joint Segmentation and Co-ref (Iterated Conditional Samples) Efficiently training large, multi-component models: Piecewise Training

90 Piecewise Training 11/17/2018

91 Piecewise Training with NOTA
11/17/2018 Piecewise Training with NOTA

92 Experimental Results
11/17/2018 Experimental Results Named entity tagging (CoNLL-2003). Training set = 15k newswire sentences, 9 labels.
Method   Test F1   Training time
CRF      89.87     9 hours
MEMM
CRF-PT   90.50
CRF-PT: stat. sig. improvement at p = 0.001

93 Experimental Results 2
11/17/2018 Experimental Results 2 Part-of-speech tagging (Penn Treebank, small subset). Training set = 1154 newswire sentences, 45 labels.
Method   Test F1   Training time
CRF      88.1      14 hours
MEMM
CRF-PT   88.8
CRF-PT: stat. sig. improvement at p = 0.001

94 “Parameter Independence Diagrams”
11/17/2018 “Parameter Independence Diagrams” Graphical models = formalism for representing independence assumptions among variables. Here we represent independence assumptions among parameters (in factor graph)

95 “Parameter Independence Diagrams”
11/17/2018 “Parameter Independence Diagrams”

96 Train some in pieces, some globally
11/17/2018 Train some in pieces, some globally

97 Piecewise Training Research Questions
11/17/2018 Piecewise Training Research Questions How to select the boundaries of “pieces”? What choices of limited interaction are best? How to sample sparse subsets of NOTA instances? Application to simpler models (classifiers) Application to more complex models (parsing)

98 Main Application Project:
11/17/2018 Main Application Project:

99 11/17/2018

100 Main Application Project:
11/17/2018 Main Application Project: Cites Research Paper

101 Main Application Project:
11/17/2018 Main Application Project: Expertise Cites Grant Research Paper Person Venue University Groups

102 Summary Conditional Random Fields
11/17/2018 Summary Conditional Random Fields Conditional probability models of structured data Data mining complex unstructured text suggests the need for joint inference IE -> DM. Early examples Factorial finite state models Jointly labeling distant entities Coreference analysis Segmentation uncertainty aiding coreference Piecewise Training Faster + higher accuracy by making independence assumptions

103 11/17/2018 End of Talk

104 4. Joint segmentation and co-reference
11/17/2018 4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004] Extraction from and matching of research paper citations. [Graphical-model figure: observed citation strings (o), segmentation variables (s), citation attributes (c), co-reference decisions (y), database field values (p), and World Knowledge. Example citations: "Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990." and "Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, , 1990."] Recently we have been working with models that have one component for segmenting a citation string into fields, and another for coreference---citation matching---and performing both segmentation and coreference jointly. 35% reduction in co-reference error by using segmentation uncertainty. 6-14% reduction in segmentation error by using co-reference. Inference: Variant of Iterated Conditional Modes [Besag, 1986]

105 Label Bias Problem in Conditional Sequence Models
11/17/2018 Label Bias Problem in Conditional Sequence Models Example (after Bottou ‘91): Bias toward states with few “siblings”. Per-state normalization in MEMMs does not allow “probability mass” to transfer from one branch to the other. r o b “rob” start r i b “rib” Problem which we call “label bias”, and has also been pointed out by Leon Bottou and others. Best illustrated by a simple example. Think of this FSM as a speech recognition system that distinguishes just two words: “rib” and “rob”. First let’s look at how HMMs correctly solve this problem. Say we observe the sounds “rob”. We assess the probability flowing through each path. Begin at start state; prob mass splits… Now let’s look at how MEMMs solve this problem… Problem is per-state normalization.
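A toy numeric illustration of the label-bias effect on this rib/rob machine; the per-state probabilities are invented to make the effect visible, not taken from the slide.

```python
# Toy illustration of label bias with MEMM-style per-state normalization.
# States: start -> r1 -> i -> b_rib ("rib")  and  start -> r2 -> o -> b_rob ("rob").
def p_next(state, obs):
    if state == "start":
        # Suppose training saw "rib" more often, so the r1 branch is preferred;
        # both branches emit the same first letter 'r', so the observation cannot help here.
        return {"r1": 0.7, "r2": 0.3}
    # Interior states have a single successor, so per-state normalization forces
    # probability 1 regardless of the observation: mass cannot leave the branch.
    successors = {"r1": "i", "i": "b_rib", "r2": "o", "o": "b_rob"}
    return {successors[state]: 1.0}

def path_prob(path, obs_seq):
    p, state = 1.0, "start"
    for nxt, obs in zip(path, obs_seq):
        p *= p_next(state, obs).get(nxt, 0.0)
        state = nxt
    return p

obs = ["r", "o", "b"]                        # the input actually spells "rob"
print(path_prob(["r1", "i", "b_rib"], obs))  # 0.7 -- the "rib" branch still wins
print(path_prob(["r2", "o", "b_rob"], obs))  # 0.3
```

A CRF avoids this by normalizing over whole sequences, so the middle observation can still move probability mass between branches.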

106 Proposed Solutions Determinization: Use fully-connected models:
11/17/2018 Proposed Solutions Determinization: not always possible state-space explosion Use fully-connected models: lacks prior structural knowledge. Our solution: Conditional random fields (CRFs): Probabilistic conditional models generalizing MEMMs. Allow some transitions to vote more strongly than others in computing state sequence probability. Whole sequence rather than per-state normalization. o b “rob” r start i b “rib”

