Information Extraction, Data Mining and Joint Inference

Presentation transcript:

1 Information Extraction, Data Mining and Joint Inference
Andrew McCallum, Computer Science Department, University of Massachusetts Amherst. Joint work with Charles Sutton, Aron Culotta, Xuerui Wang, Ben Wellner, David Mimno, and Gideon Mann.

2 Goal: Mine actionable knowledge from unstructured text.
I want to help people make good decisions by leveraging the knowledge on the Web and other bodies of text. Sometimes document retrieval is enough for this, but sometimes you need to find patterns in structured data that span many pages. Let me give you some examples of what I mean.

3 Extracting Job Openings from the Web
foodscience.com-Job2
  JobTitle: Ice Cream Guru
  Employer: foodscience.com
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone:
  DateExtracted: January 8, 2001
  Source:
  OtherCompanyJobs: foodscience.com-Job1
WB's first application area was extracting job openings from the web: start at a company home page, automatically follow links to find the job-openings section, segment job openings from each other, and extract the job title, location, description, and all of these fields. (Notice that capitalization, position on the page, and bold face are just as much indicators of this being a job title as the words themselves. Notice also that part of the IE task is figuring out which links to follow, and that data for one relation often comes from multiple different pages.)

4 A Portal for Job Openings

5 Job Openings: Category = High Tech, Keyword = Java, Location = U.S.
Not only did this support extremely useful and targeted search and data navigation with a highly summarized display…

6 Data Mining the Extracted Job Information
It also supported something that a web-page search engine could never do: data mining the extracted information. Once a month, we data-mined our database of jobs, and the changes in it, to produce a Job Opportunity Index report, used by several state and local governments to help set policy.

7 IE from Research Papers
[McCallum et al '99] Some of you might know about Cora, a contemporary of CiteSeer: a system created by Kamal Nigam, Kristie Seymore, Jason Rennie, and myself that mines the web for computer science research papers. You can view search results as summaries built from the extracted data; here you see the automatically extracted titles, authors, institutions, and abstracts.

8 IE from Research Papers
CiteSeer, which we all know and love, is made possible by the automatic extraction of title and author information from research paper headers and references in order to build the reference graph that we use to find related work.

9 Mining Research Papers
[Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004] …and look at some mined aggregate statistics about the citation graph, or, as presented here two days ago, look at the most prominent authors in certain sub-fields [Giles et al].

10 IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences. 200k+ documents, some several millennia old: Qing Dynasty archives, memos, newspaper articles, diaries. The Chinese Academy of Sciences has more than 200k documents dating back several millennia: memos, news articles, diaries. Among other things they contain a wealth of information about weather (floods, droughts, temperature extremes), which they are trying to mine to help understand today's global weather trends. The work is so important that they currently do it by hand, but it must be automated to get through the volume.

11 What is “Information Extraction”
As a family of techniques, Information Extraction = segmentation + classification + association + clustering.

October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels, the coveted code behind the Windows operating system, to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted mentions: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Bill Veghte, VP, Richard Stallman, founder, Free Software Foundation.

Technically speaking, what is involved in doing IE? As a family of techniques, it consists of: segmentation, finding the beginning and ending boundaries of the text snippets that will go into the DB; classification…

14 What is “Information Extraction”
As a family of techniques, Information Extraction = segmentation + classification + association + clustering. (The same news passage as on slide 11, with the extracted mentions now assembled into a table:)

  NAME              TITLE    ORGANIZATION
  Bill Gates        CEO      Microsoft
  Bill Veghte       VP       Microsoft
  Richard Stallman  founder  Free Software Foundation
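
To make the four steps concrete, here is a toy Python sketch of the whole pipeline on a condensed version of the passage above. Every heuristic in it (gazetteer lookup for segmentation and classification, a window rule for association, substring match for clustering) is a deliberately crude stand-in for the trained models a real IE system would use; all names are mine, not the deck's.

    # Gazetteers stand in for trained models at every step of this sketch.
    TEXT = ("Microsoft Corporation CEO Bill Gates railed against open-source . "
            "We love shared source said Bill Veghte , a Microsoft VP . "
            "Richard Stallman , founder of the Free Software Foundation , "
            "countered . Gates disagreed")

    PER = {"Bill Gates", "Bill Veghte", "Richard Stallman", "Gates"}
    ORG = {"Microsoft Corporation", "Microsoft", "Free Software Foundation"}
    TITLE = {"CEO", "VP", "founder"}

    def segment_and_classify(text):
        """Steps 1-2: greedy longest-match segmentation, gazetteer labels."""
        tokens, spans, i = text.split(), [], 0
        while i < len(tokens):
            for j in range(min(len(tokens), i + 4), i, -1):   # longest first
                phrase = " ".join(tokens[i:j])
                label = ("PER" if phrase in PER else "ORG" if phrase in ORG
                         else "TITLE" if phrase in TITLE else None)
                if label:
                    spans.append((phrase, label))
                    i = j
                    break
            else:
                i += 1
        return spans

    def associate(spans):
        """Step 3: group nearby PER/TITLE/ORG spans into employment
        records, consuming each span at most once."""
        records, used = [], set()
        for k in range(len(spans) - 2):
            window = [(i, spans[i]) for i in (k, k + 1, k + 2) if i not in used]
            if (len(window) == 3 and
                    sorted(s[1] for _, s in window) == ["ORG", "PER", "TITLE"]):
                records.append({lab: txt for _, (txt, lab) in window})
                used.update(i for i, _ in window)
        return records

    def cluster(spans):
        """Step 4: coreference by merging same-type mentions when one
        string contains the other (e.g. 'Gates' into 'Bill Gates')."""
        entities = []
        for txt, lab in spans:
            for ent in entities:
                if ent["label"] == lab and any(txt in m or m in txt
                                               for m in ent["mentions"]):
                    ent["mentions"].add(txt)
                    break
            else:
                entities.append({"label": lab, "mentions": {txt}})
        return entities

    spans = segment_and_classify(TEXT)
    print(associate(spans))   # three (PER, TITLE, ORG) records
    print(cluster(spans))     # 'Microsoft' and 'Gates' fold into longer mentions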

15 From Text to Actionable Knowledge
(Pipeline figure:) Document collection → Spider / Filter → IE (segment, classify, associate, cluster) → Database → Data Mining (discover patterns: entity types, links / relations, events) → Actionable knowledge (prediction, outlier detection, decision support).

16 Solution: Uncertainty Info and Emerging Patterns
The same pipeline, with two additions: the IE stage passes uncertainty information forward into data mining, and data mining feeds emerging patterns back to guide extraction. Document collection → Spider / Filter → IE (segment, classify, associate, cluster) ⇄ Database ⇄ Data Mining → Actionable knowledge (prediction, outlier detection, decision support).

17 Solution: A Unified Model
A single probabilistic model spanning the whole pipeline, from document collection to actionable knowledge: Conditional Random Fields [Lafferty, McCallum, Pereira]; Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]; discriminatively-trained undirected graphical models. This requires complex inference and learning: just what we researchers like to sink our teeth into!

18 Scientific Questions
- What model structures will capture salient dependencies?
- Will joint inference actually improve accuracy?
- How to do inference in these large graphical models?
- How to do parameter estimation efficiently in these models, which are built from multiple large components?
- How to do structure discovery in these models?

20 Outline
- Examples of IE and Data Mining; motivate Joint Inference
- Brief introduction to Conditional Random Fields
- Joint inference examples:
  - Joint Labeling of Cascaded Sequences (Loopy Belief Propagation)
  - Joint Co-reference Resolution (Graph Partitioning)
  - Joint Co-reference with Weighted 1st-order Logic (MCMC)
  - Joint Relation Extraction and Data Mining (Bootstrapping)
- Ultimate application area: Rexa, a Web portal for researchers

21 (Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001] An undirected graphical model, trained to maximize the conditional probability of the output (label) sequence given the input (observation) sequence. (Figure: a finite-state model drawn as a linear-chain graphical model, FSM states y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3} with labels such as OTHER, PERSON, ORG, TITLE, above observations x_{t-1}, …, x_{t+3}, here the words "said Jones a Microsoft VP".) The conditional distribution is

  p(y | x) = (1 / Z(x)) * prod_t exp( sum_k lambda_k f_k(y_{t-1}, y_t, x, t) )

where the f_k are feature functions on cliques of the graph and the lambda_k their learned weights. A CRF is simply an undirected graphical model trained to maximize a conditional probability. First explorations with these models centered on finite-state models, represented as linear-chain graphical models, with the distribution over the state sequence calculated as a normalized product of potentials on cliques of the graph. As is traditional in NLP and other application areas, these potentials are defined as log-linear combinations of weights on features of the clique values. The chief excitement, from an application point of view, is the ability to use rich and arbitrary features of the input without complicating inference. This yielded many good results and an explosion of interest across many conferences, with positive experimental results in many applications: noun-phrase and named-entity tagging [HLT'03], [CoNLL'03]; Asian word segmentation [COLING'04], [ACL'04]; IE from research papers [HLT'04]; object classification in images [CVPR '04]; protein structure prediction [ICML'04]; IE from bioinformatics text [Bioinformatics '04]; …
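
The conditional probability above can be computed exactly with the forward algorithm. Below is a minimal NumPy sketch, assuming the weighted feature sums have already been folded into per-position emission scores and a transition matrix; the function names are mine, not from any CRF library.

    import numpy as np

    def log_partition(emissions, transitions):
        """Forward algorithm in log space: log Z(x) for a linear-chain CRF.
        emissions: (T, S) scores per position/state; transitions: (S, S)."""
        alpha = emissions[0]
        for t in range(1, len(emissions)):
            # alpha'_j = logsumexp_i(alpha_i + transitions[i, j]) + emissions[t, j]
            scores = alpha[:, None] + transitions + emissions[t][None, :]
            m = scores.max(axis=0)
            alpha = m + np.log(np.exp(scores - m).sum(axis=0))
        m = alpha.max()
        return m + np.log(np.exp(alpha - m).sum())

    def log_prob(y, emissions, transitions):
        """log p(y | x) = score(x, y) - log Z(x)."""
        score = emissions[0, y[0]] + sum(
            transitions[y[t - 1], y[t]] + emissions[t, y[t]]
            for t in range(1, len(y)))
        return score - log_partition(emissions, transitions)

    rng = np.random.default_rng(0)
    em, tr = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
    print(log_prob([0, 2, 1, 1], em, tr))   # a log-probability, always <= 0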

22 Outline (recap; next: joint labeling of cascaded sequences)

23 Jointly Labeling Cascaded Sequences: Factorial CRFs
[Sutton, Rohanimanesh, McCallum, ICML 2004] Layers: named-entity tag / noun-phrase boundaries / part-of-speech / English words. In some situations we want to output not a single label sequence but several, for example part-of-speech, noun-phrase boundaries, and named entities.

25 Jointly Labeling Cascaded Sequences: Factorial CRFs
(Same figure.) In a serial cascade, errors compound: you must be near-perfect at every stage to do well.

26 Jointly Labeling Cascaded Sequences: Factorial CRFs
(Same figure.) Rather than performing each labeling task serially in a cascade, Factorial CRFs perform all three tasks jointly. Joint prediction of part-of-speech and noun-phrase boundaries on newswire matches the cascade's accuracy with only 50% of the training data. Inference: loopy belief propagation.

27 Outline (recap; next: joint co-reference resolution)

28 Joint Co-reference Among All Pairs: Affinity Matrix CRF
Also known as "entity resolution" or "object correspondence." Traditionally in NLP, co-reference has been performed by making an independent decision for each pair of entity mentions. An Affinity Matrix CRF makes all co-reference decisions jointly, accounting for multiple constraints. (Figure: three mentions, ". . . Mr Powell . . .", ". . . Powell . . .", and ". . . she . . .", with a Y/N decision on every pair and affinity scores of 45 between "Mr Powell" and "Powell", 11 between "Mr Powell" and "she", and -99 between "Powell" and "she".) Result: ~25% reduction in error on co-reference of proper nouns in newswire. Inference: correlational clustering / graph partitioning [McCallum, Wellner, IJCAI WS 2003, NIPS 2004], [Bansal, Blum, Chawla, 2002].
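
The graph-partitioning inference can be approximated greedily; here is a toy Python sketch under that assumption. Correlation clustering as actually used solves a harder optimization (it is NP-hard in general), so treat this greedy merge as an illustration, with the affinity scores taken from the figure above.

    def correlation_cluster(affinity):
        """Greedy agglomerative approximation to correlation clustering:
        repeatedly merge the pair of clusters with the largest positive
        total cross-affinity; stop when every merge would lower the score."""
        clusters = [{i} for i in range(len(affinity))]
        while True:
            best_gain, best_pair = 0.0, None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    gain = sum(affinity[i][j] if i < j else affinity[j][i]
                               for i in clusters[a] for j in clusters[b])
                    if gain > best_gain:
                        best_gain, best_pair = gain, (a, b)
            if best_pair is None:
                return clusters
            a, b = best_pair
            clusters[a] |= clusters.pop(b)

    # mentions 0='Mr Powell', 1='Powell', 2='she' with the slide's scores
    aff = [[0, 45, 11], [0, 0, -99], [0, 0, 0]]
    print(correlation_cluster(aff))   # -> [{0, 1}, {2}]

Note what joint inference buys here: independent pairwise decisions would link "Mr Powell" to "she" (score 11 > 0) while separating "Powell" from "she", violating transitivity; the partitioning resolves all three decisions together.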

29 Joint Co-reference for Multiple Entity Types
[Culotta & McCallum 2005] (Figure: three people mentions, "Stuart Russell", "Stuart Russell", and "S. Russel", with a pairwise Y/N co-reference decision on each pair.)

30 Joint Co-reference for Multiple Entity Types
[Culotta & McCallum 2005] (Figure: the people mentions "Stuart Russell", "Stuart Russell", "S. Russel" alongside organization mentions "University of California at Berkeley", "Berkeley", "Berkeley", with pairwise Y/N decisions within each type and links between the two networks, so decisions about people and organizations can influence each other.)

31 Joint Co-reference for Multiple Entity Types
[Culotta & McCallum 2005] (Same figure.) Jointly resolving people and organizations reduces error by 22%.

32 Outline (recap; next: joint co-reference with weighted 1st-order logic)

33 Sometimes pairwise comparisons are not enough.
Entities have multiple attributes (name, institution, location, …); we need to measure "compatibility" among them. Having 2 "given names" in a cluster is common, but not 4, so we need to measure the size of clusters of mentions. Is there a pair of last-name strings that differ by more than 5 (edit distance)? We need measures on hypothesized "entities". We need first-order logic.

34 Pairwise Co-reference Features
Mentions: Howard Dean, Dean Martin, Howard Martin. Questions: SamePerson(Howard Dean, Howard Martin)? SamePerson(Dean Martin, Howard Martin)? SamePerson(Dean Martin, Howard Dean)? Pairwise features: StringMatch(x1,x2), EditDistance(x1,x2). Today I'll be focusing on one example of higher-order representations: identity uncertainty (or co-reference resolution).

35 Cluster-wise (higher-order) Representations
Mentions: Howard Dean, Dean Martin, Howard Martin (N = 3). Question: SamePerson(Howard Dean, Howard Martin, Dean Martin)? First-order features: ∀x1,x2 StringMatch(x1,x2); ∃x1,x2 ¬StringMatch(x1,x2); ∃x1,x2 EditDistance>.5(x1,x2); ThreeDistinctStrings(x1,x2,x3).
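
A small Python sketch of what cluster-wise feature functions like these could look like. The implementations are my own guesses (difflib's similarity ratio stands in for normalized edit distance, with EditDistance>.5 expressed as similarity < 0.5); the point is only that each feature is evaluated on a whole hypothesized entity, not on one pair.

    from difflib import SequenceMatcher
    from itertools import combinations

    def similarity(a, b):
        """Cheap stand-in for a normalized edit-distance score in [0, 1]."""
        return SequenceMatcher(None, a, b).ratio()

    def cluster_features(mentions):
        """First-order features over all mentions in one hypothesized entity."""
        pairs = list(combinations(mentions, 2))
        return {
            "forall_string_match": all(a == b for a, b in pairs),
            "exists_string_mismatch": any(a != b for a, b in pairs),
            "exists_low_similarity_pair": any(similarity(a, b) < 0.5
                                              for a, b in pairs),
            "num_distinct_strings": len(set(mentions)),
            "num_mentions": len(mentions),
        }

    print(cluster_features(["Howard Dean", "Howard Martin", "Dean Martin"]))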

36 Cluster-wise (higher-order) Representations
Mentions: Dean Martin, Howard Dean, Howard Martin, Dino, Howie Martin. The model must now score SamePerson(x1,x2), SamePerson(x1,x2,x3), SamePerson(x1,x2,x3,x4), SamePerson(x1,x2,x3,x4,x5), SamePerson(x1,x2,x3,x4,x5,x6), and so on: a combinatorial explosion!

37 This space complexity is common in first-order probabilistic models

38 Markov Logic (Weighted 1st-order Logic): Using 1st-order Logic as a Template to Construct a CRF
[Richardson & Domingos 2005] Grounding the Markov network requires space O(n^r), where n is the number of constants and r the highest clause arity; for example, 1,000 constants and a clause of arity 3 already yield 10^9 ground clauses.

39 How can we perform inference and learning in models that cannot be grounded? To scale these first-order models to real-world tasks, we need learning and inference methods that operate even when the model cannot be completely grounded.

40 Inference in First-Order Models: SAT Solvers
- Weighted SAT solvers [Kautz et al 1997]: originally used for MAP inference in Markov Logic Networks (e.g., GSAT), but they require a complete grounding of the network.
- LazySAT [Singla & Domingos 2006]: saves memory during inference by storing only the clauses that may become unsatisfied, but still requires exponential time to visit all ground clauses at initialization.

41 Inference in First-Order Models: Sampling
- Gibbs sampling: it is difficult to move between high-probability configurations by changing single variables (although consider MC-SAT [Poon & Domingos '06]).
- An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006], with two parts, a proposal distribution and an acceptance distribution. It can be extended to partial configurations, instantiating only the relevant variables, and was successfully used in BLOG models [Milch et al 2005]. As Milch et al point out, it can be superior to Gibbs sampling when you cannot move between high-probability configurations by changing single variables.

42 Proposal Distribution
(Figure: the current clustering y of the mentions Dean Martin, Howie Martin, Howard Martin, and Dino, and a proposed clustering y' obtained by moving mentions between clusters.)

43 Proposal Distribution
(Figure: another proposed move y → y' over the same mentions.)

44 Proposal Distribution
(Figure: a third proposed move y → y' over the same mentions.)

45 Inference with Metropolis-Hastings
y: a configuration (here, a clustering of mentions). p(y')/p(y): the likelihood ratio; because it is a ratio of conditionals p(y|x), the partition function Z_x cancels. q(y'|y): the proposal distribution, the probability of proposing the move y → y'. The move is accepted with probability

  alpha(y, y') = min( 1, ( p(y') q(y|y') ) / ( p(y) q(y'|y) ) )
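
Putting the pieces of this slide together, a compact Python sketch of Metropolis-Hastings over clusterings. Here `score` stands for any unnormalized log p(y|x), so Z_x cancels in the ratio, and the move-one-mention `propose` is an invented toy proposal, not the paper's; both names are mine.

    import math, random

    def propose(y):
        """Toy proposal: move one random mention to another (possibly new)
        cluster. Returns (y', log q(y'|y), log q(y|y')); for simplicity we
        treat the move as symmetric, which a careful sampler should not."""
        y2 = [set(c) for c in y]
        src = random.randrange(len(y2))
        m = random.choice(sorted(y2[src]))
        y2[src].discard(m)
        y2 = [c for c in y2 if c]              # drop emptied clusters
        t = random.randrange(len(y2) + 1)      # existing cluster or a new one
        if t == len(y2):
            y2.append({m})
        else:
            y2[t].add(m)
        return y2, 0.0, 0.0

    def mh_coref(mentions, score, steps=5000):
        """Metropolis-Hastings: accept y -> y' with prob min(1, ratio)."""
        y = [{m} for m in mentions]            # start from singletons
        for _ in range(steps):
            y2, log_q_fwd, log_q_rev = propose(y)
            log_alpha = (score(y2) - score(y)) + (log_q_rev - log_q_fwd)
            if random.random() < math.exp(min(0.0, log_alpha)):
                y = y2
        return y

    # toy score: reward clusters that agree on a surname, penalize ones that don't
    def score(y):
        return sum(len(c) - 1 if len({m.split()[-1] for m in c}) == 1
                   else -float(len(c)) for c in y)

    print(mh_coref(["Dean Martin", "Howie Martin", "Howard Dean"], score))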

46 Experiments
Tasks: paper-citation co-reference and author co-reference. First-order features: all titles match; ∃ year mismatch; average string edit distance > X; number of mentions.

47 Results on Citation Data
Citeseer paper co-reference results (pair F1):

  dataset     First-Order  Pairwise
  constraint  82.3         76.7
  reinforce   93.4         78.7
  face        88.9         83.2
  reason      81.0         84.9

Author co-reference results (pair F1):

  dataset     First-Order  Pairwise
  miller_d    41.9         61.7
  li_w        43.2         36.2
  smith_b     65.4         25.4

Note that the first-order model is not uniformly better; why does it do worse on some datasets (reason, miller_d)?

48 Outline (recap; next: joint relation extraction and data mining)

49 Data
- 270 Wikipedia articles, 1000 paragraphs, 4700 relations
- 52 relation types: JobTitle, BirthDay, Friend, Sister, Husband, Employer, Cousin, Competition, Education, …
- Articles targeted for density of relations: the Bush / Kennedy / Manning / Coppola families and friends

50 More generally, we'd like to extract relations while considering the larger context in which they occur.

51 …his father George H. W. Bush… …his cousin John Prescott Ellis…
George W. Bush: "…his father George H. W. Bush…", "…his cousin John Prescott Ellis…". George H. W. Bush: "…his sister Nancy Ellis Bush…". Nancy Ellis Bush: "…her son John Prescott Ellis…". Cousin = Father's Sister's Son. (Figure: the implied family tree: George H. W. Bush, father of George W. Bush, is the sibling of Nancy Ellis Bush, whose son is John Prescott Ellis; hence John Prescott Ellis is George W. Bush's cousin.) Given such sentences, we'd like to extract the entire family tree they imply. We can imagine a pattern-detection method that discovers that GWB is the son of GHWB, and maybe even that GHWB is the son of PB, but without any further background knowledge no standard pattern-matching method will detect that GWB is the grandson of PB. Rather than hand-code the requisite definition of the Grandfather relation, we propose an extraction method that learns such rules from training data and uses them to improve extraction accuracy.

52 …celebrated with Stuart Forbes…
John Kerry: "…celebrated with Stuart Forbes…", likely a cousin. (Figure: the inferred relations: James Forbes is the sibling of Rosemary Forbes; John Kerry is the son of Rosemary Forbes; Stuart Forbes is the son of James Forbes; hence Stuart Forbes is likely John Kerry's cousin.)

53 Iterative DB Construction
From text such as "Joseph P. Kennedy, Sr … son John F. Kennedy with Rose Fitzgerald", we fill the DB with a "first-pass" CRF, e.g. a record with Name = Joseph P. Kennedy, Son = John F. Kennedy, Wife = Rose Fitzgerald. At test time, the database may not be relevant to the current document; how can we use relational features in that case? We run the first-pass CRF to populate the database with related entities, then run a "second-pass" CRF using relational features from the (possibly noisy) DB. Note that the rules generalize to new entities. (The figure also shows a noisy DB entry, Ronald Reagan / George W. Bush, scored 0.3.)

54 Results
MemberOf: 0.4211 -> 0.6093.

          ME     CRF    RCRF   RCRF.9  RCRF.5  RCRF-Truth  RCRF-Truth.5
  F1      .5489  .5995  .6100  .6008   .6136   .6791       .6363
  Prec    .6475  .7019  .6799  .7177   .7095   .7553       .7343
  Recall  .4763  .5232  .5531  .5166   .5406   .6169       .5614

ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features.

55 Examples of Discovered Relational Features
Mother: Father → Wife. Cousin: Mother → Husband → Nephew. Friend: Education → Student. Education: Father → Education. Boss: Boss → Son. MemberOf: Grandfather → MemberOf. Competition: PoliticalParty → Member → Competition.
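
These mined features are compositions of relation paths (read "Mother: Father → Wife" as "your mother is your father's wife"). A tiny Python sketch of evaluating such a path against a triple store; the triples and helper names are illustrative, echoing the Bush example from earlier, not the paper's implementation.

    # Toy relation DB as (subject, relation, object) triples.
    DB = {
        ("George W. Bush", "Father", "George H. W. Bush"),
        ("George H. W. Bush", "Sibling", "Nancy Ellis Bush"),
        ("Nancy Ellis Bush", "Son", "John Prescott Ellis"),
    }

    def follow(entities, relation, db):
        """One hop: all objects reachable from `entities` via `relation`."""
        return {o for (s, r, o) in db if r == relation and s in entities}

    def path(entity, relations, db):
        """Compose relations left to right, e.g. Cousin = Father-Sibling-Son."""
        frontier = {entity}
        for rel in relations:
            frontier = follow(frontier, rel, db)
        return frontier

    # Cousin = Father's Sibling's Son:
    print(path("George W. Bush", ["Father", "Sibling", "Son"], DB))
    # -> {'John Prescott Ellis'}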

56 Outline (recap; next: Rexa, a Web portal for researchers)

57 Mining our Research Literature
- Better understand the structure of our own research area.
- Structure helps us learn a new field.
- Aid collaboration.
- Map how ideas travel through social networks of researchers.
- Aids for hiring and finding reviewers!

58 Previous Systems

59 (screenshot; no transcript text)

60 Previous Systems
(Figure: schema with a single entity type, Research Paper, and a single relation, Cites.)

61 More Entities and Relations
(Figure: a richer schema: entities Research Paper, Person, Grant, Venue, University, and Groups, with relations such as Cites and Expertise.)

62-75 (screenshot slides; no transcript text)

76 Topical Transfer
[Mann, Mimno, McCallum, JCDL 2006] Citation counts from one topic to another. Map "producers and consumers".

77 Impact Diversity
Topic diversity: the entropy of the distribution of citing topics.
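
Concretely, the diversity score is just Shannon entropy over where a topic's citations come from. A minimal Python sketch; the example counts are made up:

    import math

    def topic_diversity(citing_topic_counts):
        """Entropy (bits) of the distribution of citing topics: higher
        means a topic's impact is spread across more fields."""
        total = sum(citing_topic_counts.values())
        return -sum((c / total) * math.log2(c / total)
                    for c in citing_topic_counts.values() if c > 0)

    # e.g., citations into one topic, broken out by the citing topic
    print(topic_diversity({"NLP": 120, "vision": 40, "bio-text": 40}))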

78 Summary
- Joint inference is needed to avoid cascading errors in information extraction and data mining.
- Challenge: making inference & learning scale to massive graphical models (Markov-chain Monte Carlo).
- Rexa: a new research-paper search engine, mining the interactions in our community.

