Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst.

Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Ben Wellner, Khashayar Rohanimanesh, Wei Li, Andres Corrada, Xuerui Wang

Goal: Mine actionable knowledge from unstructured text.

Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings

Job Openings: Category = High Tech Keyword = Java Location = U.S.

Data Mining the Extracted Job Information

IE from Chinese Documents regarding Weather Department of Terrestrial System, Chinese Academy of Sciences 200k+ documents several millennia old - Qing Dynasty Archives - memos - newspaper articles - diaries

IE from Research Papers [McCallum et al ‘99]

IE from Research Papers

Mining Research Papers [Giles et al] [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

What is “Information Extraction” Information Extraction = segmentation + classification + clustering + association As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open- source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“ Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. * * * *

Larger Context Segment Classify Associate Cluster Filter Prediction Outlier detection Decision support IE Document collection Database Discover patterns - entity types - links / relations - events Data Mining Spider Actionable knowledge

Outline Examples of IE and Data Mining. Brief review of Conditional Random Fields Joint inference: Motivation and examples –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Iterated Conditional Samples) Piecewise Training for large multi-component models Two example projects –Email, contact management, and Social Network Analysis –Research Paper search and analysis 

Hidden Markov Models S t-1 S t O t S t+1 O t +1 O t - 1... Finite state model Graphical model Parameters: for all states S={s 1,s 2,…} Start state probabilities: P(s t ) Transition probabilities: P(s t |s t-1 ) Observation (emission) probabilities: P(o t |s t ) Training: Maximize probability of training observations (w/ prior) HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …... transitions observations o 1 o 2 o 3 o 4 o 5 o 6 o 7 o 8 Generates: State sequence Observation sequence Usually a multinomial over atomic, fixed alphabet

IE with Hidden Markov Models Yesterday Rich Caruana spoke this example sentence. Person name: Rich Caruana Given a sequence of observations: and a trained HMM: Find the most likely state sequence: (Viterbi) Any words said to be generated by the designated “person name” state extract as a person name: person name location name background

We want More than an Atomic View of Words Would like richer representation of text: many arbitrary, overlapping features of the words. S t-1 S t O t S t+1 O t +1 O t - 1 identity of word ends in “-ski” is capitalized is part of a noun phrase is in a list of city names is under node X in WordNet is in bold font is indented is in hyperlink anchor last person name was female next two words are “and Associates” … … part of noun phrase is “Wisniewski” ends in “-ski”

Problems with Richer Representation and a Joint Model These arbitrary features are not independent. –Multiple levels of granularity (chars, words, phrases) –Multiple dependent modalities (words, formatting, layout) –Past & future Two choices: Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data! Ignore the dependencies. This causes “over-counting” of evidence (ala naïve Bayes). Big problem when combining evidence, as in Viterbi! S t-1 S t O t S t+1 O t +1 O t - 1 S t-1 S t O t S t+1 O t +1 O t - 1

Conditional Sequence Models We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o): –Can examine features, but not responsible for generating them. –Don’t have to explicitly model their dependencies. –Don’t “waste modeling effort” trying to generate what we are given at test time anyway.

Joint Conditional S t-1 StSt OtOt S t+1 O t+1 O t-1 S t-1 StSt OtOt S t+1 O t+1 O t-1... (A super-special case of Conditional Random Fields.) [Lafferty, McCallum, Pereira 2001] where From HMMs to Conditional Random Fields Set parameters by maximum likelihood, using optimization method on  L.

Conditional Random Fields StSt S t+1 S t+2 O = O t, O t+1, O t+2, O t+3, O t+4 S t+3 S t+4 1. FSM special-case: linear chain among unknowns, parameters tied across time steps. [Lafferty, McCallum, Pereira 2001] 2. In general: CRFs = "Conditionally-trained Markov Network" arbitrary structure among unknowns 3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: Parameters tied across hits from SQL-like queries ("clique templates")

(Linear Chain) Conditional Random Fields y t-1 y t x t y t+1 x t +1 x t - 1 Finite state modelGraphical model Undirected graphical model, trained to maximize conditional probability of outputs given inputs... FSM states observations y t+2 x t +2 y t+3 x t +3 said Veght a Microsoft VP … where OTHER PERSON OTHER ORG TITLE … output seq input seq Asian word segmentation [COLING’04], [ACL’04] IE from Research papers [HTL’04] Object classification in images [CVPR ‘04] Fast-growing, wide-spread interest, many positive experimental results. Noun phrase, Named entity [HLT’03], [CoNLL’03] Protein structure prediction [ICML’04] IE from Bioinformatics text [Bioinformatics ‘04],… [Lafferty, McCallum, Pereira 2001]

Training CRFs Feature count using correct labels Feature count using predicted labels Smoothing penalty --

Linear-chain CRFs vs. HMMs Comparable computational efficiency for inference Features may be arbitrary functions of any or all observations –Parameters need not fully specify generation of observations; can require less training data –Easy to incorporate domain knowledge

Table Extraction from Government Reports Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households. Milk Cows and Production of Milk and Milkfat: United States, 1993-95 -------------------------------------------------------------------------------- : : Production of Milk and Milkfat 2/ : Number :------------------------------------------------------- Year : of : Per Milk Cow : Percentage : Total :Milk Cows 1/:-------------------: of Fat in All :------------------ : : Milk : Milkfat : Milk Produced : Milk : Milkfat -------------------------------------------------------------------------------- : 1,000 Head --- Pounds --- Percent Million Pounds : 1993 : 9,589 15,704 575 3.66 150,582 5,514.4 1994 : 9,500 16,175 592 3.66 153,664 5,623.7 1995 : 9,461 16,451 602 3.66 155,644 5,694.3 -------------------------------------------------------------------------------- 1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves.

Table Extraction from Government Reports Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households. Milk Cows and Production of Milk and Milkfat: United States, 1993-95 -------------------------------------------------------------------------------- : : Production of Milk and Milkfat 2/ : Number :------------------------------------------------------- Year : of : Per Milk Cow : Percentage : Total :Milk Cows 1/:-------------------: of Fat in All :------------------ : : Milk : Milkfat : Milk Produced : Milk : Milkfat -------------------------------------------------------------------------------- : 1,000 Head --- Pounds --- Percent Million Pounds : 1993 : 9,589 15,704 575 3.66 150,582 5,514.4 1994 : 9,500 16,175 592 3.66 153,664 5,623.7 1995 : 9,461 16,451 602 3.66 155,644 5,694.3 -------------------------------------------------------------------------------- 1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves. CRF Labels: Non-Table Table Title Table Header Table Data Row Table Section Data Row Table Footnote... (12 in all) [Pinto, McCallum, Wei, Croft, 2003 SIGIR] Features: Percentage of digit chars Percentage of alpha chars Indented Contains 5+ consecutive spaces Whitespace in this line aligns with prev.... Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}. 100+ documents from www.fedstats.gov

Table Extraction Experimental Results Line labels, percent correct Table segments, F1 95 %92 % 65 %64 % 85 %- HMM Stateless MaxEnt CRF [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

IE from Research Papers [McCallum et al ‘99]

IE from Research Papers Field-level F1 Hidden Markov Models (HMMs)75.6 [Seymore, McCallum, Rosenfeld, 1999] Support Vector Machines (SVMs)89.7 [Han, Giles, et al, 2003] Conditional Random Fields (CRFs)93.9 [Peng, McCallum, 2004]  error 40%

Named Entity Recognition CRICKET - MILLNS SIGNS FOR BOLAND CAPE TOWN 1996-08-22 South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional. Labels: Examples: PERYayuk Basuki Innocent Butare ORG3M KDP Cleveland LOCCleveland Nirmal Hriday The Oval MISCJava Basque 1,000 Lakes Rally

Automatically Induced Features IndexFeature 0inside-noun-phrase (o t-1 ) 5stopword (o t ) 20capitalized (o t+1 ) 75word=the (o t ) 100in-person-lexicon (o t-1 ) 200word=in (o t+2 ) 500word=Republic (o t+1 ) 711word=RBI (o t ) & header=BASEBALL 1027header=CRICKET (o t ) & in-English-county-lexicon (o t ) 1298company-suffix-word (firstmention t+2 ) 4040location (o t ) & POS=NNP (o t ) & capitalized (o t ) & stopword (o t-1 ) 4945moderately-rare-first-name (o t-1 ) & very-common-last-name (o t ) 4474word=the (o t-2 ) & word=of (o t ) [McCallum & Li, 2003, CoNLL]

Named Entity Extraction Results MethodF1 HMMs BBN's Identifinder73% CRFs w/out Feature Induction83% CRFs with Feature Induction90% based on LikelihoodGain [McCallum & Li, 2003, CoNLL]

Outline Examples of IE and Data Mining. Brief review of Conditional Random Fields Joint inference: Motivation and examples –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Iterated Conditional Samples) Piecewise Training for large multi-component models Two example projects –Email, contact management, and Social Network Analysis –Research Paper search and analysis  

Larger Context Segment Classify Associate Cluster Filter Prediction Outlier detection Decision support IE Document collection Database Discover patterns - entity types - links / relations - events Data Mining Spider Actionable knowledge

Problem: Combined in serial juxtaposition, IE and DM are unaware of each others’ weaknesses and opportunities. 1)DM begins from a populated DB, unaware of where the data came from, or its inherent uncertainties. 2)IE is unaware of emerging patterns and regularities in the DB. The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Segment Classify Associate Cluster Filter Prediction Outlier detection Decision support IE Document collection Database Discover patterns - entity types - links / relations - events Data Mining Spider Actionable knowledge Uncertainty Info Emerging Patterns Solution:

Segment Classify Associate Cluster Filter Prediction Outlier detection Decision support IE Document collection Probabilistic Model Discover patterns - entity types - links / relations - events Data Mining Spider Actionable knowledge Solution: Conditional Random Fields [Lafferty, McCallum, Pereira] Conditional PRMs [Koller…], [Jensen…], [Geetor…], [Domingos…] Discriminatively-trained undirected graphical models Complex Inference and Learning Just what we researchers like to sink our teeth into! Unified Model

Larger-scale Joint Inference for IE What model structures will capture salient dependencies? Will joint inference improve accuracy? How do to inference in these large graphical models? How to efficiently train these models, which are built from multiple large components?

1. Jointly labeling cascaded sequences Factorial CRFs Part-of-speech Noun-phrase boundaries Named-entity tag English words [Sutton, Khashayar, McCallum, ICML 2004]

1. Jointly labeling cascaded sequences Factorial CRFs Part-of-speech Noun-phrase boundaries Named-entity tag English words [Sutton, Khashayar, McCallum, ICML 2004] But errors cascade--must be perfect at every stage to do well.

1. Jointly labeling cascaded sequences Factorial CRFs Part-of-speech Noun-phrase boundaries Named-entity tag English words [Sutton, Khashayar, McCallum, ICML 2004] Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data. Inference: Tree reparameterization BP [Wainwright et al, 2002]

2. Jointly labeling distant mentions Skip-chain CRFs Senator Joe Green said today …. Green ran for … … [Sutton, McCallum, SRL 2004] Dependency among similar, distant mentions ignored.

2. Jointly labeling distant mentions Skip-chain CRFs Senator Joe Green said today …. Green ran for … … [Sutton, McCallum, SRL 2004] 14% reduction in error on most repeated field in email seminar announcements. Inference: Tree reparameterization BP [Wainwright et al, 2002]

3. Joint co-reference among all pairs Affinity Matrix CRF... Mr Powell...... Powell...... she... 45  99 Y/N 11 [McCallum, Wellner, IJCAI WS 2003, NIPS 2004] ~25% reduction in error on co-reference of proper nouns in newswire. Inference: Correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002] “Entity resolution” “Object correspondence”

Coreference Resolution Input AKA "record linkage", "database record deduplication", "entity resolution", "object correspondence", "identity uncertainty" Output News article, with named-entity "mentions" tagged Number of entities, N = 3 #1 Secretary of State Colin Powell he Mr. Powell Powell #2 Condoleezza Rice she Rice #3 President Bush Bush Today Secretary of State Colin Powell met with................................................. he................... Condoleezza Rice......... Mr Powell..........she..................... Powell............... President Bush.................................. Rice................ Bush.........................................................................

Inside the Traditional Solution Mention (3) Mention (4)... Mr Powell...... Powell... NTwo words in common29 YOne word in common13 Y"Normalized" mentions are string identical39 YCapitalized word in common17 Y> 50% character tri-gram overlap19 N< 25% character tri-gram overlap-34 YIn same sentence9 YWithin two sentences8 NFurther than 3 sentences apart-1 Y"Hobbs Distance" < 311 NNumber of entities in between two mentions = 012 NNumber of entities in between two mentions > 4-3 YFont matches1 YDefault-19 OVERALL SCORE = 98 > threshold=0 Pair-wise Affinity Metric Y/N?

The Problem... Mr Powell...... Powell...... she... affinity = 98 affinity = 11 affinity =  104 Pair-wise merging decisions are being made independently from each other Y Y N Affinity measures are noisy and imperfect. They should be made in relational dependence with each other.

A Generative Model Solution [Russell 2001], [Pasula et al 2002], [Milch et al 2003], [Marthi et al 2003] id wordscontext distancefonts id surname agegender N............ (Applied to citation matching, and object correspondence in vision)

A Markov Random Field for Co-reference... Mr Powell...... Powell...... she... 45  30 Y/N [McCallum & Wellner, 2003, ICML] (MRF) Make pair-wise merging decisions in dependent relation to each other by - calculating a joint prob. - including all edge weights - adding dependence on consistent triangles. 11

A Markov Random Field for Co-reference... Mr Powell...... Powell...... she... 45  30 Y/N [McCallum & Wellner, 2003] (MRF) Make pair-wise merging decisions in dependent relation to each other by - calculating a joint prob. - including all edge weights - adding dependence on consistent triangles. 11

A Markov Random Field for Co-reference... Mr Powell...... Powell...... she... Y N N [McCallum & Wellner, 2003] (MRF) 44  45)  30)  (11)

A Markov Random Field for Co-reference... Mr Powell...... Powell...... she... Y Y N [McCallum & Wellner, 2003] (MRF)  infinity  45)  30)  (11)

A Markov Random Field for Co-reference... Mr Powell...... Powell...... she... N Y N [McCallum & Wellner, 2003] (MRF) 64  45)  30)  (11)

Inference in these MRFs = Graph Partitioning [Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]... Mr Powell...... Powell...... she... 45 11  30... Condoleezza Rice...  134 10  106

Inference in these MRFs = Graph Partitioning [Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]... Mr Powell...... Powell...... she...... Condoleezza Rice... =  22 45 11  30  134 10  106

Inference in these MRFs = Graph Partitioning [Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]... Mr Powell...... Powell...... she...... Condoleezza Rice... = 314 45 11  30  134 10  106

Co-reference Experimental Results Proper noun co-reference DARPA ACE broadcast news transcripts, 117 stories Partition F1Pair F1 Single-link threshold16 %18 % Best prev match [Morton]83 %89 % MRFs88 %92 %  error=30%  error=28% DARPA MUC-6 newswire article corpus, 30 stories Partition F1Pair F1 Single-link threshold11%7 % Best prev match [Morton]70 %76 % MRFs74 %80 %  error=13%  error=17% [McCallum & Wellner, 2003]

Y/N Joint Co-reference of Multiple Fields X. Li Predicting the Stock Market International Conference on Knowledge Discovery and Data Mining [Culotta & McCallum 2005] Xuerui Li Predict the Stock Market KDD CitationsVenues

Joint Co-reference Experimental Results CiteSeer Dataset 1500 citations, 900 unique papers, 350 unique venues Paper Venue indepjointindepjoint constraint88.991.079.494.1 reinforce92.292.256.560.1 face88.293.780.982.8 reason97.497.075.679.5 Micro Average91.793.473.179.1  error=20%  error=22% [Culotta & McCallum 2005]

Joint co-reference among all pairs Affinity Matrix CRF... Mr Powell...... Powell...... she... 45  99 Y/N 11 [McCallum, Wellner, IJCAI WS 2003, NIPS 2004] ~25% reduction in error on co-reference of proper nouns in newswire. Inference: Correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]

p Database field values c 4. Joint segmentation and co-reference o s o s c c s o Citation attributes y y y Segmentation [Wellner, McCallum, Peng, Hay, UAI 2004] Inference: Variant of Iterated Conditional Modes Co-reference decisions Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison- Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990. [Besag, 1986] World Knowledge 35% reduction in co-reference error by using segmentation uncertainty. 6-14% reduction in segmentation error by using co-reference. Extraction from and matching of research paper citations. see also [Marthi, Milch, Russell, 2003]

Joint IE and Coreference from Research Paper Citations Textual citation mentions (noisy, with duplicates) Paper database, with fields, clean, duplicates collapsed AUTHORS TITLE VENUE Cowell, Dawid… Probab…Springer Montemerlo, Thrun…FastSLAM… AAAI… Kjaerulff Approxi… Technic… 4. Joint segmentation and co-reference

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields 2) Resolve coreferent citations Citation Segmentation and Coreference Y?NY?N

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. Citation Segmentation and Coreference Y?NY?N Segmentation QualityCitation Co-reference (F1) No Segmentation78% CRF Segmentation91% True Segmentation93% 1) Segment citation fields 2) Resolve coreferent citations

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields 2) Resolve coreferent citations 3) Form canonical database record Citation Segmentation and Coreference AUTHOR =Brenda Laurel TITLE =Interface Agents: Metaphors with Character PAGES =355-366 BOOKTITLE =The Art of Human-Computer Interface Design EDITOR =T. Smith PUBLISHER =Addison-Wesley YEAR =1990 Y?NY?N Resolving conflicts

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields 2) Resolve coreferent citations 3) Form canonical database record Citation Segmentation and Coreference AUTHOR =Brenda Laurel TITLE =Interface Agents: Metaphors with Character PAGES =355-366 BOOKTITLE =The Art of Human-Computer Interface Design EDITOR =T. Smith PUBLISHER =Addison-Wesley YEAR =1990 Y?NY?N Perform jointly.

x s Observed citation CRF Segmentation IE + Coreference Model J Besag 1986 On the… AUT AUT YR TITL TITL

x s Observed citation CRF Segmentation IE + Coreference Model Citation mention attributes J Besag 1986 On the… AUTHOR = “J Besag” YEAR = “1986” TITLE = “On the…” c

x s IE + Coreference Model c J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Structure for each citation mention

x s IE + Coreference Model c Binary coreference variables for each pair of mentions J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining…

x s IE + Coreference Model c y n n J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Binary coreference variables for each pair of mentions

y n n x s IE + Coreference Model c J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Research paper entity attribute nodes AUTHOR = “P Smyth” YEAR = “2001” TITLE = “Data Mining…”...

y y y x s IE + Coreference Model c J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Research paper entity attribute node

y n n x s IE + Coreference Model c J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining…

Such a highly connected graph makes exact inference intractable…

Loopy Belief Propagation v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 m1(v2)m1(v2) m2(v3)m2(v3) m3(v2)m3(v2)m2(v1)m2(v1) messages passed between nodes Approximate Inference 1

Loopy Belief Propagation Generalized Belief Propagation v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 m1(v2)m1(v2) m2(v3)m2(v3) m3(v2)m3(v2)m2(v1)m2(v1) v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v9v9 v8v8 v7v7 messages passed between nodes messages passed between regions Here, a message is a conditional probability table passed among nodes. But when message size grows exponentially with size of overlap between regions! Approximate Inference 1

Iterated Conditional Modes (ICM) [Besag 1986] v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v 6 i+1 = argmax P(v 6 i | v \ v 6 i ) v6iv6i = held constant Approximate Inference 2

Iterated Conditional Modes (ICM) [Besag 1986] v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v 5 j+1 = argmax P(v 5 j | v \ v 5 j ) v5jv5j = held constant Approximate Inference 2

Iterated Conditional Modes (ICM) [Besag 1986] v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v 4 k+1 = argmax P(v 4 k | v \ v 4 k ) v4kv4k = held constant Approximate Inference 2 but greedy, and easily falls into local minima. Structured inference scales well here,

Iterated Conditional Modes (ICM) [Besag 1986] Iterated Conditional Sampling (ICS) (our name) Instead of selecting only argmax, sample of argmaxes of P(v 4 k | v \ v 4 k ) e.g. an N-best list (the top N values) v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 v 4 k+1 = argmax P(v 4 k | v \ v 4 k ) v4kv4k = held constant v6v6 v5v5 v3v3 v2v2 v1v1 v4v4 Approximate Inference 2 Can use “Generalized Version” of this; doing exact inference on a region of several nodes at once. Here, a “message” grows only linearly with overlap region size and N!

Features of this Inference Method 1)Structured or “factored” representation (ala GBP) 2)Uses samples to approximate density 3)Closed-loop message-passing on loopy graph (ala BP) Beam search –“Forward”-only inference Particle filtering, e.g. [Doucet 1998] –Usually on tree-shaped graph, or “feedforward” only. MC Sampling…Embedded HMMs [Neal, 2003] –Sample from high-dim continuous state space; do forward-backward Sample Propagation [Paskin, 2003] –Messages = samples, on a junction tree Fields to Trees [Hamze & de Freitas, UAI 2003] –Rao-Blackwellized MCMC, partitioning G into non-overlapping trees Factored Particles for DBNs [Ng, Peshkin, Pfeffer, 2002] –Combination of Particle Filtering and Boyan-Koller for DBNs Related Work

IE + Coreference Model Exact inference on these linear-chain regions J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… From each chain pass an N-best List into coreference

IE + Coreference Model J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Approximate inference by graph partitioning… …integrating out uncertainty in samples of extraction Make scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000]

NameTitle… Laurel, BInterface Agents: Metaphors with Character The … Laurel, B.Interface Agents: Metaphors with Character … Laurel, B. Interface Agents Metaphors with Character … When calculating similarity with another citation, have more opportunity to find correct, matching fields. NameTitleBook TitleYear Laurel, B. InterfaceAgents: Metaphors with Character The Art of Human Computer Interface Design 1990 Laurel, B.Interface Agents: Metaphors with Character The Art of Human Computer Interface Design 1990 Laurel, B. InterfaceAgents: Metaphors with Character The Art of Human Computer Interface Design 1990 Inference: Sample = N-best List from CRF Segmentation y ? n

y n n IE + Coreference Model J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Exact (exhaustive) inference over entity attributes

y n n IE + Coreference Model J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Revisit exact inference on IE linear chain, now conditioned on entity attributes

y n n Parameter Estimation Coref graph edge weights MAP on individual edges Separately for different regions IE Linear-chain Exact MAP Entity attribute potentials MAP, pseudo-likelihood In all cases: Climb MAP gradient with quasi-Newton method

Experimenal Results Set of citations from CiteSeer –1500 citation mentions –to 900 paper entities Hand-labeled for coreference and field-extraction Divided into 4 subsets, each on a different topic –RL, Face detection, Reasoning, Constraint Satisfaction –Within each subset many citations share authors, publication venues, publishers, etc. 70% of the citation mentions are singletons

p Database field values c 4. Joint segmentation and co-reference o s o s c c s o Citation attributes y y y Segmentation [Wellner, McCallum, Peng, Hay, UAI 2004] Inference: Variant of Iterated Conditional Modes Co-reference decisions Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison- Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990. [Besag, 1986] World Knowledge 35% reduction in co-reference error by using segmentation uncertainty. 6-14% reduction in segmentation error by using co-reference. Extraction from and matching of research paper citations.

NReinforceFaceReasonConstraint 1 (Baseline)0.9460.960.940.96 30.950.980.96 70.950.980.950.97 90.9820.970.960.97 Optimal0.99 Coreference cluster recall Average error reduction is 35%. “Optimal” makes best use of N-best list by using true labels. Indicates that even more improvement can be obtained Coreference Results

ReinforceFaceReasonConstraint Baseline.943.908.929.934 w/ Coref.949.914.935.943 Err. Reduc..101.062.090.142 P-value.0442.0014.0001 Segmentation F1 Error reduction ranges from 6-14%. Small, but significant at 95% confidence level (p-value < 0.05) Information Extraction Results Biggest limiting factor in both sets of results: data set is small, and does not have large coreferent sets.

y n n Parameter Estimation Coref graph edge weights MAP on individual edges Separately for different regions IE Linear-chain Exact MAP Entity attribute potentials MAP, pseudo-likelihood In all cases: Climb MAP gradient with quasi-Newton method

Outline Examples of IE and Data Mining. Brief review of Conditional Random Fields Joint inference: Motivation and examples –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Iterated Conditional Samples) Piecewise Training for large multi-component models Two example projects –Email, contact management, and Social Network Analysis –Research Paper search and analysis   

Piecewise Training

Piecewise Training with NOTA

Experimental Results Named entity tagging (CoNLL-2003) Training set = 15k newswire sentences 9 labels Test F1Training time CRF89.879 hours MEMM88.901 hour CRF-PT5.3 hours90.50 stat. sig. improvement at p = 0.001

Experimental Results 2 Part-of-speech tagging (Penn Treebank, small subset) Training set = 1154 newswire sentences 45 labels Test F1Training time CRF88.114 hours MEMM88.12 hours CRF-PT2.5 hours88.8 stat. sig. improvement at p = 0.001

“Parameter Independence Diagrams” Graphical models = formalism for representing independence assumptions among variables. Here we represent independence assumptions among parameters (in factor graph)

“Parameter Independence Diagrams”

Train some in pieces, some globally

Piecewise Training Research Questions How to select the boundaries of “pieces”? What choices of limited interaction are best? How to sample sparse subsets of NOTA instances? Application to simpler models (classifiers) Application to more complex models (parsing)

Application to Classifiers

Piecewise Training in Factorial CRFs for Transfer Learning Emailed seminar ann’mt entities Email English words [Sutton, McCallum, 2005] Too little labeled training data. 60k words training. GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Piecewise Training in Factorial CRFs for Transfer Learning Newswire named entities Newswire English words [Sutton, McCallum, 2005] Train on “related” task with more data. 200k words training. CRICKET - MILLNS SIGNS FOR BOLAND CAPE TOWN 1996-08-22 South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all- rounder Phillip DeFreitas as Boland's overseas professional.

Piecewise Training in Factorial CRFs for Transfer Learning Newswire named entities Email English words [Sutton, McCallum, 2005] At test time, label email with newswire NEs...

Piecewise Training in Factorial CRFs for Transfer Learning Newswire named entities Emailed seminar ann’mt entities Email English words [Sutton, McCallum, 2005] …then use these labels as features for final task

Piecewise Training in Factorial CRFs for Transfer Learning Newswire named entities Seminar Announcement entities English words [Sutton, McCallum, 2005] Piecewise training of a joint model.

CRF Transfer Experimental Results Seminar Announcements Dataset [Freitag 1998] CRFstimeetimelocationspeaker overall No transfer99.197.381.073.7 87.8 Cascaded transfer99.296.084.374.2 88.4 Joint transfer99.196.085.376.3 89.2 New “best published” accuracy on common dataset [Sutton, McCallum, 2005]

Outline Examples of IE and Data Mining. Brief review of Conditional Random Fields Joint inference: Motivation and examples –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Iterated Conditional Samples) Piecewise Training for large multi-component models Two example projects –Email, contact management, and Social Network Analysis –Research Paper search and analysis    

Workplace effectiveness ~ Ability to leverage network of acquaintances But filling Contacts DB by hand is tedious, and incomplete. Email Inbox Contacts DB WWW Automatically Managing and Understanding Connections of People in our Email World

System Overview Contact Info and Person Name Extraction Person Name Extraction Name Coreference Homepage Retrieval Social Network Analysis Keyword Extraction CRF WWW names Email

An Example To: “Andrew McCallum” mccallum@cs.umass.edu Subject... First Name: Andrew Middle Name: Kachites Last Name: McCallum JobTitle:Associate Professor Company:University of Massachusetts Street Address: 140 Governor’s Dr. City:Amherst State:MA Zip:01003 Company Phone: (413) 545-1323 Links:Fernando Pereira, Sam Roweis,… Key Words: Information extraction, social network,… Search for new people

Summary of Results Token Acc Field Prec Field Recall Field F1 CRF94.5085.7376.3380.76 PersonKeywords William CohenLogic programming Text categorization Data integration Rule learning Daphne KollerBayesian networks Relational models Probabilistic models Hidden variables Deborah McGuiness Semantic web Description logics Knowledge representation Ontologies Tom MitchellMachine learning Cognitive states Learning apprentice Artificial intelligence Contact info and name extraction performance (25 fields) Example keywords extracted 1.Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large org’s by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!) 2.Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]

STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN Example topics induced from a large collection of text FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE [Tennenbaum et al]

STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE Example topics induced from a large collection of text [Tennenbaum et al]

From LDA to Author-Recipient-Topic (ART)

Inference and Estimation Gibbs Sampling: - Easy to implement - Reasonably fast r

Enron Email Corpus 250k email messages 23k people Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT) From: debra.perlingiere@enron.com To: steve.hooser@enron.com Subject: Enron/TransAltaContract dated Jan 1, 2001 Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions. DP Debra Perlingiere Enron North America Corp. Legal Department 1400 Smith Street, EB 3885 Houston, Texas 77002 dperlin@enron.com

Topics, and prominent sender/receivers discovered by ART

Beck = “Chief Operations Officer” Dasovich = “Government Relations Executive” Shapiro = “Vice Presidence of Regulatory Affairs” Steffes = “Vice President of Government Affairs”

Comparing Role Discovery connection strength (A,B) = distribution over authored topics Traditional SNA distribution over recipients distribution over authored topics Author-TopicART

Comparing Role Discovery Tracy Geaconne  Dan McCarty Traditional SNAAuthor-TopicART Similar roles Different roles Geaconne = “Secretary” McCarty = “Vice President”

Traditional SNAAuthor-TopicART Different roles Very similarNot very similar Geaconne = “Secretary” Hayslett = “Vice President & CTO” Comparing Role Discovery Tracy Geaconne  Rod Hayslett

Traditional SNAAuthor-TopicART Different roles Very differentVery similar Blair = “Gas pipeline logistics” Watson = “Pipeline facilities planning” Comparing Role Discovery Lynn Blair  Kimberly Watson

Traditional SNAAuthor-TopicART Block structured Not Comparing Group Discovery Enron TransWestern Division

McCallum Email Corpus 2004 January - October 2004 23k email messages 825 people From: kate@cs.umass.edu Subject: NIPS and.... Date: June 14, 2004 2:27:41 PM EDT To: mccallum@cs.umass.edu There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for: NIPS registration receipt. CALO registration receipt. Thanks, Kate

McCallum Email Blockstructure

Four most prominent topics in discussions with ____?

Two most prominent topics in discussions with ____?

Topic 37

Topic 40

Pairs with highest rank difference between ART & SNA 5 other professors 3 other ML researchers

Role-Author-Recipient-Topic Models

Outline Examples of IE and Data Mining. Brief review of Conditional Random Fields Joint inference: Motivation and examples –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Iterated Conditional Samples) Piecewise Training for large multi-component models Two example projects –Email, contact management, and Social Network Analysis –Research Paper search and analysis     

Previous Systems

Research Paper Cites Previous Systems

Research Paper Cites Person UniversityVenue Grant Groups Expertise More Entities and Relations

Summary Conditional Random Fields –Conditional probability models of structured data Data mining complex unstructured text suggests the need for joint inference IE + DM. Early examples –Factorial finite state models –Jointly labeling distant entities –Coreference analysis –Segmentation uncertainty aiding coreference Piecewise Training –Faster + higher accuracy Current projects –Email, contact management, expert-finding, SNA –Mining the scientific literature

End of Talk

Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst.

Similar presentations

Presentation on theme: "Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst.

Similar presentations

Presentation on theme: "Toward Unified Graphical Models of Information Extraction and Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst."— Presentation transcript:

Similar presentations

About project

Feedback