
1 Unified Models of Information Extraction and Data Mining with Application to Social Network Analysis Andrew McCallum Information Extraction and Synthesis Laboratory Computer Science Department University of Massachusetts Amherst Joint work with David Jensen Knowledge Discovery and Dissemination (KDD) Conference September 2004 Intelligence Technology Innovation Center ITIC

2 Goal: Improve the state-of-the-art in our ability to mine actionable knowledge from unstructured text.

3 Traditional Pipeline. [diagram: Spider → Document collection → IE (Segment, Classify, Associate, Cluster, Filter) → Database → Knowledge Discovery (Discover patterns: entity types, links/relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)]

4 Extracting Job Openings from the Web foodscience.com-Job2 Employer: foodscience.com JobTitle: Ice Cream Guru JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1

5 Data Mining the Extracted Job Information

6 IE from Research Papers

7 Mining Research Papers [Giles et al] [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

8 IE from Chinese Documents regarding Weather. Department of Terrestrial System, Chinese Academy of Sciences. 200k+ documents, several millennia old: Qing Dynasty archives, memos, newspaper articles, diaries.

9 Traditional Pipeline. [diagram: Spider → Document collection → IE (Segment, Classify, Associate, Cluster, Filter) → Database → Knowledge Discovery (Discover patterns: entity types, links/relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)]

10 Problem: Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities. 1) KD begins from a populated DB, unaware of where the data came from or its inherent uncertainties. 2) IE is unaware of emerging patterns and regularities in the DB. The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

11 Solution: [diagram: the same pipeline (Spider → Document collection → IE → Database → Data Mining → Actionable knowledge), now with feedback links: Uncertainty Info passed from IE to Data Mining, and Emerging Patterns passed from Data Mining back to IE]

12 Research & Approach: Unified Model. [diagram: Spider → Document collection → IE (Segment, Classify, Associate, Cluster, Filter) → Probabilistic Model → Data Mining (Discover patterns: entity types, links/relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)] Conditionally-trained undirected graphical models: Conditional Random Fields [Lafferty, McCallum, Pereira], Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…], … Complex inference and learning: just what we researchers like to sink our teeth into!

13 Accomplishments, Discoveries & Results:
– Extracting answers, and also uncertainty/confidence: formally justified as marginalization in graphical models; applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE.
– Joint inference, with efficient methods: multiple cascaded label sequences (Factorial CRFs); multiple distant but related mentions (Skip-chain CRFs); multiple co-reference decisions (Affinity Matrix CRF); integrating extraction with co-reference (graphs & chains).
– Put it into a large-scale, working system: social network analysis from email and the Web; a new portal: research, people, connections.


15 Types of Uncertainty in Knowledge Discovery from Text Confidence that extractor correctly obtained statements the author intended. Confidence that what was written is truthful –Author could have had misconceptions. –…or have been purposefully trying to mislead. Confidence that the emerging, discovered pattern is a reliable fact or generalization.

16 1. Labeling Sequence Data: Linear-chain CRFs [Lafferty, McCallum, Pereira 2001]. [diagram: finite-state model and undirected graphical model over an observed input sequence x_{t-1}, x_t, x_{t+1}, … and output sequence of FSM states y_{t-1}, y_t, y_{t+1}, …; example input "… said Arden Bement, NSF Director …" labeled OTHER PERSON PERSON ORG TITLE] An undirected graphical model, trained to maximize the conditional probability of outputs given inputs. Applications: noun-phrase and named-entity tagging [HLT'03], [CoNLL'03]; protein structure prediction [ICML'04]; IE from bioinformatics text [Bioinformatics '04]; Asian word segmentation [COLING'04], [ACL'04]; IE from research papers [HLT'04]; object classification in images [CVPR '04]; segmenting tables in textual gov't reports, with an 85% reduction in error over HMMs.
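To make the linear-chain model concrete, here is a minimal Viterbi (MAP) decoder. The states and example sentence follow the slide, but the log-potentials are invented for illustration, not trained CRF weights; a real CRF would derive them from weighted feature functions.

```python
import numpy as np

# Label set from the slide's example.
STATES = ["OTHER", "PERSON", "ORG", "TITLE"]

def viterbi(emission, transition):
    """MAP label sequence under log-potentials:
    argmax_y sum_t [emission[t, y_t] + transition[y_{t-1}, y_t]]."""
    T, S = emission.shape
    delta = np.zeros((T, S))            # best score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = emission[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + transition + emission[t][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [STATES[s] for s in reversed(path)]

# Illustrative log-potentials for the tokens "said Arden Bement NSF".
emission = np.array([[2., 0., 0., 0.],   # "said"   -> OTHER
                     [0., 2., 0., 0.],   # "Arden"  -> PERSON
                     [0., 2., 0., 0.],   # "Bement" -> PERSON
                     [0., 0., 2., 0.]])  # "NSF"    -> ORG
transition = np.zeros((4, 4))            # uniform transitions, for simplicity
print(viterbi(emission, transition))     # -> ['OTHER', 'PERSON', 'PERSON', 'ORG']
```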

17 Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]. [diagram: the full finite-state lattice of FSM states (OTHER, TITLE, ORG, PERSON) over the observed input sequence "… said Arden Bement, NSF Director …"]

18 Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]. [diagram: the same lattice, now showing constrained forward-backward: only paths consistent with the labeling of interest are summed]

19 Forward-Backward Confidence Estimation improves accuracy/coverage. [plot: accuracy vs. coverage for four methods: optimal; our forward-backward confidence; traditional token-wise confidence; no use of confidence]
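The confidence score behind these curves is a ratio of two forward passes: a constrained forward algorithm sums only the label paths consistent with the extracted field, and dividing by the unconstrained partition function gives the field's posterior probability. A small log-space sketch with two states and hand-picked potentials (illustrative, not the paper's experimental setup):

```python
import numpy as np

def forward_logZ(emission, transition, constraint=None):
    """Log sum over all label paths; if constraint[t] is a state index,
    only paths passing through that state at position t are counted."""
    T, S = emission.shape
    alpha = emission[0].copy()
    for t in range(T):
        if t > 0:
            alpha = np.logaddexp.reduce(alpha[:, None] + transition,
                                        axis=0) + emission[t]
        if constraint is not None and constraint[t] is not None:
            mask = np.full(S, -np.inf)   # forbid all states but the constrained one
            mask[constraint[t]] = 0.0
            alpha = alpha + mask
    return np.logaddexp.reduce(alpha)

# Two states (0 = OTHER, 1 = PERSON), four tokens, illustrative log-potentials.
emission = np.array([[2., 0.], [0., 1.], [0., 1.], [2., 0.]])
transition = np.zeros((2, 2))

# Confidence that tokens 1-2 form a PERSON field:
# Z(paths with y_1 = y_2 = PERSON) / Z(all paths).
conf = np.exp(forward_logZ(emission, transition, [None, 1, 1, None])
              - forward_logZ(emission, transition))
print(round(float(conf), 4))  # -> 0.5344
```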

20 Confidence Estimation Applied.
– New word discovery in Chinese word segmentation: improves segmentation accuracy by ~25% [Peng, Feng, McCallum, COLING 2004].
– Highlighting fields for interactive information extraction: after the user fixes the least confident field, constrained Viterbi automatically reduces error by another 23% [Kristjansson, Culotta, Viola, McCallum, AAAI 2004; Honorable Mention Award].

21 Accomplishments, Discoveries & Results:
– Extracting answers, and also uncertainty/confidence: formally justified as marginalization in graphical models; applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE.
– Joint inference, with efficient methods: multiple cascaded label sequences (Factorial CRFs); multiple distant but related mentions (Skip-chain CRFs); multiple co-reference decisions (Affinity Matrix CRF); integrating extraction with co-reference (graphs & chains).
– Put it into a large-scale, working system: social network analysis from email and the Web; a new portal: research, people, connections.

22 1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]. [diagram: stacked label chains over English words: part-of-speech, noun-phrase boundaries, named-entity tags]


24 1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]. [diagram as above] But in a pipeline, errors cascade: you must be perfect at every stage to do well.

25 1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]. [diagram as above] Joint prediction of part-of-speech and noun-phrase boundaries in newswire, matching accuracy with only 50% of the training data. Inference: tree reparameterization BP [Wainwright et al., 2002].

26 2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]. [diagram: "Senator Joe Green said today …. Green ran for …", with a skip edge connecting the two "Green" mentions] In a linear chain, the dependency among similar, distant mentions is ignored.

27 2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]. [diagram as above] 14% reduction in error on the most-repeated field in email seminar announcements. Inference: tree reparameterization BP [Wainwright et al., 2002].
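The skip edges themselves can be constructed with a simple heuristic, for example connecting repeated capitalized tokens so their labels are predicted jointly (the heuristic and token handling here are illustrative, not the paper's exact rule):

```python
# Build skip-chain edges between identical capitalized words, following the
# slide's example sentence. Each edge adds a pairwise factor to the CRF.
tokens = "Senator Joe Green said today . Green ran for".split()
skip_edges = [(i, j)
              for i in range(len(tokens))
              for j in range(i + 1, len(tokens))
              if tokens[i] == tokens[j] and tokens[i][0].isupper()]
print(skip_edges)  # -> [(2, 6)], linking the two mentions of "Green"
```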

28 3. Joint co-reference among all pairs: Affinity Matrix CRF ("entity resolution", "object correspondence") [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]. [diagram: mentions "… Mr Powell …", "… Powell …", "… she …" with pairwise affinities 45, 11, and -99, and Y/N co-reference decisions] 25% reduction in error on co-reference of proper nouns in newswire. Inference: correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002].
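A greedy approximation to the correlational-clustering inference: merge two clusters only if the total affinity across them is positive. The mentions and scores below mirror the slide; the merge procedure is a sketch, not the partitioning algorithm of Bansal et al.:

```python
# Pairwise affinities from the slide: "Mr Powell" / "Powell" attract (45),
# "Powell" / "she" weakly attract (11), "Mr Powell" / "she" repel (-99).
mentions = ["Mr Powell", "Powell", "she"]
affinity = {(0, 1): 45.0, (1, 2): 11.0, (0, 2): -99.0}

def cluster(n, affinity):
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    # Consider edges in decreasing affinity; merge only when the total
    # cross-cluster weight (summing all pairwise affinities) is positive.
    for (i, j), w in sorted(affinity.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri == rj or w <= 0:
            continue
        left = [k for k in range(n) if find(k) == ri]
        right = [k for k in range(n) if find(k) == rj]
        total = sum(affinity.get((min(a, b), max(a, b)), 0.0)
                    for a in left for b in right)
        if total > 0:
            parent[rj] = ri
    return [find(k) for k in range(n)]

labels = cluster(len(mentions), affinity)
print(labels)  # "Mr Powell" and "Powell" merge; "she" stays separate
```

Note that "she" is kept apart even though it weakly attracts "Powell": the strong negative affinity to "Mr Powell" outweighs it, which is exactly the benefit of deciding all pairs jointly.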

29 4. Joint segmentation and co-reference: joint IE and coreference from research paper citations. [diagram: textual citation mentions (noisy, with duplicates) → paper database with fields (clean, duplicates collapsed); example rows with AUTHORS | TITLE | VENUE: Cowell, Dawid… | Probab… | Springer; Montemerlo, Thrun… | FastSLAM… | AAAI…; Kjaerulff | Approxi… | Technic…]

30 Citation Segmentation and Coreference. Two mentions of the same paper:
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990.

31 Citation Segmentation and Coreference: 1) Segment citation fields. [the same two citation mentions, now with field boundaries marked]

32 Citation Segmentation and Coreference: 1) Segment citation fields; 2) Resolve coreferent citations. [the same two citation mentions, with a Y/N? co-reference decision between them]

33 Citation Segmentation and Coreference. Segmentation quality affects co-reference:
Segmentation Quality | Citation Co-reference (F1)
No Segmentation | 78%
CRF Segmentation | 91%
True Segmentation | 93%

34 Citation Segmentation and Coreference: 1) Segment citation fields; 2) Resolve coreferent citations; 3) Form canonical database record, resolving conflicts. [canonical record: AUTHOR = Brenda Laurel; TITLE = Interface Agents: Metaphors with Character; PAGES = 355-366; BOOKTITLE = The Art of Human-Computer Interface Design; EDITOR = T. Smith; PUBLISHER = Addison-Wesley; YEAR = 1990]

35 Citation Segmentation and Coreference: 1) Segment citation fields; 2) Resolve coreferent citations; 3) Form canonical database record. Perform jointly.

36 IE + Coreference Model. [diagram: observed citation x = "J Besag 1986 On the…", with CRF segmentation labels s = AUT AUT YR TITL TITL]

37 IE + Coreference Model. [diagram: citation-mention attributes c derived from x and s: AUTHOR = "J Besag", YEAR = "1986", TITLE = "On the…"]

38 IE + Coreference Model. [diagram: the same (x, s, c) structure for each citation mention: "J Besag 1986 On the…", "Smyth. 2001 Data Mining…", "Smyth, P Data mining…"]

39 IE + Coreference Model. [diagram: binary co-reference variables added for each pair of mentions]

40 IE + Coreference Model. [diagram: the pairwise co-reference variables take values y (the two Smyth mentions) and n (the pairs involving Besag)]

41 IE + Coreference Model. [diagram: research-paper entity attribute nodes added, e.g. AUTHOR = "P Smyth", YEAR = "2001", TITLE = "Data Mining…"]

42 IE + Coreference Model. [diagram: an alternative configuration with all co-reference variables set to y, pooling all three mentions into one entity]

43 IE + Coreference Model. [diagram: the full model, with co-reference decisions y, n, n]

44 Such a highly connected graph makes exact inference intractable, so…

45 Approximate Inference 1: Loopy Belief Propagation. [diagram: a loopy graph over nodes v1…v6, with messages m1(v2), m2(v1), m2(v3), m3(v2) passed between nodes]

46 Approximate Inference 1: Loopy Belief Propagation and Generalized Belief Propagation. [diagrams: messages passed between nodes (v1…v6); messages passed between regions (v1…v9)] Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with the size of the overlap between regions!
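The node-level message update is m_{i->j}(x_j) = sum over x_i of phi_i(x_i) * psi_ij(x_i, x_j) * product over k != j of m_{k->i}(x_i). A minimal sum-product sketch on a three-node cycle, with binary variables and illustrative potentials chosen so the loopy iteration converges:

```python
import numpy as np

# A triangle of binary variables: the cycle makes BP "loopy" (approximate).
edges = [(0, 1), (1, 2), (0, 2)]
unary = [np.array([0.9, 0.1]),   # node 0 prefers state 0
         np.array([0.5, 0.5]),
         np.array([0.5, 0.5])]
pair = np.array([[0.8, 0.2],     # symmetric "agreement" potential
                 [0.2, 0.8]])

# One message per directed edge, initialized uniform.
msgs = {(a, b): np.ones(2) for (i, j) in edges for (a, b) in ((i, j), (j, i))}
for _ in range(20):  # synchronous updates until (approximate) convergence
    new = {}
    for (i, j) in msgs:
        prod = unary[i].copy()
        for (k, t) in msgs:          # product of messages into i, except from j
            if t == i and k != j:
                prod = prod * msgs[(k, t)]
        m = pair.T @ prod            # marginalize out x_i
        new[(i, j)] = m / m.sum()
    msgs = new

# Approximate marginals ("beliefs"): unary potential times incoming messages.
beliefs = []
for i in range(3):
    b = unary[i].copy()
    for (k, t) in msgs:
        if t == i:
            b = b * msgs[(k, t)]
    beliefs.append(b / b.sum())
print([round(float(b[0]), 3) for b in beliefs])  # all lean toward state 0
```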

47 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]. [diagram: v6^(i+1) = argmax over v6 of P(v6 | v \ v6), with all other nodes held constant]

48 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]. [diagram: v5^(j+1) = argmax over v5 of P(v5 | v \ v5), with all other nodes held constant]

49 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]. [diagram: v4^(k+1) = argmax over v4 of P(v4 | v \ v4), with all other nodes held constant] Structured inference scales well here, but it is greedy, and easily falls into local minima.
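ICM is coordinate ascent on the joint score: hold everything fixed, set one variable to its conditional argmax, repeat until nothing changes. The toy pairwise model below (illustrative potentials, not from the paper) also demonstrates the slide's caveat: from one start ICM stays in a poor local optimum, from another it reaches a higher-scoring configuration.

```python
import numpy as np

# Three binary variables, fully connected, with an "agreement" potential.
unary = [np.array([0.9, 0.1]),
         np.array([0.5, 0.5]),
         np.array([0.4, 0.6])]
pair = np.array([[0.8, 0.2],
                 [0.2, 0.8]])
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

def icm(start):
    """Iterated Conditional Modes: sweep the variables, taking the
    conditional argmax of each given the current values of its neighbors."""
    assign = list(start)
    changed = True
    while changed:
        changed = False
        for i in range(3):
            scores = unary[i].copy()
            for j in neighbors[i]:
                scores = scores * pair[:, assign[j]]
            best = int(scores.argmax())
            if best != assign[i]:
                assign[i] = best
                changed = True
    return assign

print(icm([1, 1, 1]))  # -> [1, 1, 1]: stuck in a poorer local optimum
print(icm([1, 1, 0]))  # -> [0, 0, 0]: reaches the higher-scoring configuration
```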

50 Approximate Inference 2: from Iterated Conditional Modes (ICM) [Besag 1986] to Iterated Conditional Sampling (ICS) (our name). Instead of selecting only the argmax of P(v4 | v \ v4), keep a sample of its top values, e.g. an N-best list (the top N values). Can also use a "generalized" version of this, doing exact inference on a region of several nodes at once. Here, a "message" grows only linearly with the overlap-region size and N!

51 Features of this Inference Method: 1) structured or "factored" representation (a la GBP); 2) uses samples to approximate density; 3) closed-loop message passing on a loopy graph (a la BP). Related work:
– Beam search: "forward"-only inference.
– Particle filtering, e.g. [Doucet 1998]: usually on a tree-shaped graph, or "feedforward" only.
– MC sampling / Embedded HMMs [Neal, 2003]: sample from a high-dimensional continuous state space; do forward-backward.
– Sample Propagation [Paskin, 2003]: messages = samples, on a junction tree.
– Fields to Trees [Hamze & de Freitas, UAI, earlier today]: Rao-Blackwellized MCMC, partitioning G into non-overlapping trees.
– Factored Particles for DBNs [Ng, Peshkin, Pfeffer, 2002]: a combination of particle filtering and Boyen-Koller for DBNs.

52 IE + Coreference Model. [diagram: the three citation mentions] Exact inference on the linear-chain (segmentation) regions; from each chain, pass an N-best list into coreference.

53 IE + Coreference Model. [diagram: the three citation mentions] Approximate inference by graph partitioning, integrating out the uncertainty in the samples of extraction. Make it scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].
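Canopies keep all-pairs coreference tractable: a cheap similarity first groups mentions into possibly-overlapping canopies, and the expensive model only ever compares pairs inside a shared canopy. In this sketch the cheap similarity is simple token overlap, and the thresholds and citation strings are illustrative:

```python
def token_overlap(a, b):
    """Cheap similarity: number of shared lowercase tokens."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def canopies(items, sim, loose=1, tight=2):
    """Greedy canopy construction: items at least `loose`-similar to a
    center join its canopy; items at least `tight`-similar also leave the
    pool of future centers. Canopies may overlap; they only gate which
    pairs the expensive model scores."""
    remaining = set(range(len(items)))
    groups = []
    while remaining:
        center = remaining.pop()
        canopy = {center}
        for i in list(remaining):
            s = sim(items[center], items[i])
            if s >= loose:
                canopy.add(i)
            if s >= tight:
                remaining.discard(i)
        groups.append(canopy)
    return groups

cites = ["Smyth 2001 Data Mining",
         "Smyth P Data mining",
         "Besag 1986 On the statistical analysis"]
groups = canopies(cites, token_overlap)
print(groups)  # the two Smyth citations share a canopy; Besag stands alone
```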

54 Inference: Sample = N-best List from CRF Segmentation. [tables: the N-best list gives several segmentations of the same citation, differing in where the Name / Title boundary falls, alongside the candidate match's fields: Name = Laurel, B.; Title = Interface Agents: Metaphors with Character; Book Title = The Art of Human Computer Interface Design; Year = 1990] When calculating similarity with another citation, we have more opportunity to find the correct, matching fields.

55 IE + Coreference Model. [diagram, with co-reference decisions y, n, n] Exact (exhaustive) inference over entity attributes.

56 IE + Coreference Model. [diagram, with co-reference decisions y, n, n] Revisit exact inference on the IE linear chain, now conditioned on entity attributes.

57 Parameter Estimation. Trained separately for the different regions:
– Co-reference graph edge weights: MAP on individual edges.
– IE linear chain: exact MAP.
– Entity attribute potentials: MAP, pseudo-likelihood.
In all cases: climb the MAP gradient with a quasi-Newton method.

58 4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]: extraction from and matching of research paper citations. [diagram: observed citations (o), segmentation (s), citation attributes (c), co-reference decisions (y), database field values (p), and world knowledge, over two example citations of the Laurel paper] Inference: a variant of Iterated Conditional Modes [Besag, 1986]. Results: 35% reduction in co-reference error by using segmentation uncertainty; 6-14% reduction in segmentation error by using co-reference.

59 Experimental Results. A set of citations from CiteSeer: 1500 citation mentions, referring to 900 paper entities, hand-labeled for coreference and field extraction. Divided into 4 subsets, each on a different topic (RL, face detection, reasoning, constraint satisfaction); within each subset many citations share authors, publication venues, publishers, etc. 70% of the citation mentions are singletons.

60 Coreference Results: coreference cluster recall.
N | Reinforce | Face | Reason | Constraint
1 (Baseline) | 0.946 | 0.96 | 0.94 | 0.96
3 | 0.95 | 0.98 | 0.96 | –
7 | 0.95 | 0.98 | 0.95 | 0.97
9 | 0.982 | 0.97 | 0.96 | 0.97
Optimal | 0.99 | – | – | –
Average error reduction is 35%. "Optimal" makes the best use of the N-best list by using true labels; it indicates that even more improvement can be obtained.

61 Information Extraction Results: segmentation F1.
 | Reinforce | Face | Reason | Constraint
Baseline | .943 | .908 | .929 | .934
w/ Coref | .949 | .914 | .935 | .943
Err. Reduc. | .101 | .062 | .090 | .142
P-value | .0442 | .0014 | .0001 | –
Error reduction ranges from 6-14%: small, but significant at the 95% confidence level (p-value < 0.05). The biggest limiting factor in both sets of results: the data set is small, and does not have large coreferent sets.

62 Accomplishments, Discoveries & Results:
– Extracting answers, and also uncertainty/confidence: formally justified as marginalization in graphical models; applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE.
– Joint inference, with efficient methods: multiple cascaded label sequences (Factorial CRFs); multiple distant but related mentions (Skip-chain CRFs); multiple co-reference decisions (Affinity Matrix CRF); integrating extraction with co-reference (graphs & chains).
– Put it into a large-scale, working system: social network analysis from email and the Web; a new portal: research, people, connections.

63 One Application Project. Workplace effectiveness ~ ability to leverage your network of acquaintances ("the power of your little black book"). But filling a Contacts DB by hand is tedious, and incomplete. [diagram: Email Inbox + WWW → Contacts DB, filled automatically]

64 System Overview. [diagram: Email → person-name extraction (CRF) → name coreference → homepage retrieval (WWW) → contact-info and person-name extraction → keyword extraction → social network analysis]

65 An Example. From an email header (To: "Andrew McCallum" mccallum@cs.umass.edu, Subject …), the system fills a contact record: First Name: Andrew; Middle Name: Kachites; Last Name: McCallum; Job Title: Associate Professor; Company: University of Massachusetts; Street Address: 140 Governor's Dr.; City: Amherst; State: MA; Zip: 01003; Company Phone: (413) 545-1323; Links: Fernando Pereira, Sam Roweis, …; Key Words: information extraction, social network, … Then: search for new people.

66 Summary of Results. Contact info and name extraction performance (25 fields): CRF achieves 94.50 token accuracy, 85.73 field precision, 76.33 field recall, 80.76 field F1.
Example keywords extracted:
Person | Keywords
William Cohen | logic programming; text categorization; data integration; rule learning
Daphne Koller | Bayesian networks; relational models; probabilistic models; hidden variables
Deborah McGuinness | Semantic Web; description logics; knowledge representation; ontologies
Tom Mitchell | machine learning; cognitive states; learning apprentice; artificial intelligence
Applications:
1. Expert finding: when solving some task, find friends-of-friends with relevant expertise. Avoid "stove-piping" in large organizations by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job (a hiring aid!).
2. Social network analysis: understand the social structure of your organization; suggest structural changes for improved efficiency.

67 Main Application Project:

68 Research Paper Cites

69 Main Application Project. [diagram: entities and relations in the research-paper domain: research paper, person, cites, university, venue, grant, groups, expertise]

70 Main Application Project. Status:
– Spider running; over 1.5M PDFs in hand.
– Best-in-world published results in IE from research paper headers and references.
– First version of multi-entity co-reference running.
– First version of Web servlet interface up.
– Well-engineered: Java, servlets, SQL, Lucene, SOAP, etc.
– Public launch this Fall.

71 Software Infrastructure. MALLET: Machine Learning for Language Toolkit, released as open-source software: http://mallet.cs.umass.edu
– ~80k lines of Java: document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, …
– New package: graphical models and modern inference methods (variational, tree reparameterization, stochastic sampling, contrastive divergence, …).
– New documentation and interfaces.
– Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and 100k's of training examples, as needed for NLP.
In use at UMass, MIT, CMU, Stanford, Berkeley, UPenn, UT Austin, Purdue, …

72 Publications and Contact Info: http://www.cs.umass.edu/~mccallum
– Conditional Models of Identity Uncertainty with Application to Noun Coreference. Andrew McCallum and Ben Wellner. NIPS, 2004.
– An Integrated, Conditional Model of Information Extraction and Coreference with Application to Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay. UAI, 2004.
– Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. ICML Workshop on Statistical Relational Learning, 2004.
– Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron Bekkerman and Andrew McCallum. CEAS, 2004.
– Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. ICML, 2004.
– Interactive Information Extraction with Constrained Conditional Random Fields. Trausti Kristjansson, Aron Culotta, Paul Viola and Andrew McCallum. AAAI, 2004. (Winner of Honorable Mention Award.)
– Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun Peng and Andrew McCallum. HLT-NAACL, 2004.
– Chinese Segmentation and New Word Detection using Conditional Random Fields. Fuchun Peng, Fangfang Feng, and Andrew McCallum. COLING, 2004.
– Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. HLT-NAACL, 2004.

73 End of Talk

