Reference Reconciliation in Complex Information Spaces. Xin (Luna) Dong, Alon Halevy, Jayant Madhavan. SIGMOD 2005, University of Washington.

Semex: Personal Information Management System. MentionedIn(315), AuthorOfArticles(52), RecipientOfEmails(8547), SenderOfEmails(7595), Homepage(1).

Semex: Personal Information Management System. Contacts(1145), Co-authors(24).

Semex: Personal Information Management System. Authors, FromFile, CitedBy, Cites(33), PublishedIn. Article: Reference Reconciliation in Complex Information Spaces.

Semex: Personal Information Management System. Names / Emails of one person: Xin (Luna) Dong, xin dong, ¶­ðà, xinluna dong, luna dongxin, x. dong, Lab-#dong, xin dong, xin luna.

Semex Without Deduplication. Search results for "luna": luna dong with SenderOfEmails(3043), RecipientOfEmails(2445), MentionedIn(94); 23 persons.

Semex Without Deduplication. Search results for "luna": Xin (Luna) Dong with AuthorOfArticles(49), MentionedIn(20); 23 persons.

Semex Without Deduplication. A Platform for Personal Information Management and Integration.

Semex Without Deduplication. 9 persons: dong xin, xin dong.

Semex NEEDS Deduplication (Reference Reconciliation)

Reference Reconciliation in Complex Information Spaces. Xin (Luna) Dong, Alon Halevy, Jayant Madhavan. SIGMOD 2005, University of Washington.

Complex Information Space Example – An Abstract View of Personal Information
Article: a1 = (“Distributed Query Processing”, “ ”, {p1, p2, p3}, c1); a2 = (“Distributed query processing”, “ ”, {p4, p5, p6}, c2)
Venue: c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”); c2 = (“ACM SIGMOD”, “1978”, null)
Person: p1 = (“Robert S. Epstein”, null); p2 = (“Michael Stonebraker”, null); p3 = (“Eugene Wong”, null); p4 = (“Epstein, R.S.”, null); p5 = (“Stonebraker, M.”, null); p6 = (“Wong, E.”, null)

Complex Information Space Example – An Abstract View of Personal Information
Article: a1 = (“Distributed Query Processing”, “ ”, {p1, p2, p3}, c1); a2 = (“Distributed query processing”, “ ”, {p4, p5, p6}, c2)
Venue: c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”); c2 = (“ACM SIGMOD”, “1978”, null)
Person: p1 = (“Robert S. Epstein”, null); p2 = (“Michael Stonebraker”, null); p3 = (“Eugene Wong”, null); p4 = (“Epstein, R.S.”, null); p5 = (“Stonebraker, M.”, null); p6 = (“Wong, E.”, null); p7 = (“Eugene Wong”, <email>); p8 = (null, <email>); p9 = (“mike”, <email>)
Annotations: Class, Reference, Atomic Attribute, Association Attribute.
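For concreteness, here is one way the example references above could be represented. This is an illustrative sketch only (Python, with field names of my own choosing, not the paper's data model); the e-mail addresses and the articles' second attribute, which do not survive on the slide, are left as None rather than guessed.

```python
from dataclasses import dataclass, field

@dataclass
class Reference:
    """A reference: an instance of a class, with atomic and association attributes."""
    cls: str                                    # class name, e.g. "Person", "Article", "Venue"
    atomic: dict = field(default_factory=dict)  # atomic attributes (value, or None when unknown)
    assoc: dict = field(default_factory=dict)   # association attributes (lists of reference ids)

refs = {
    # Person references extracted from the two citations (name only, no e-mail).
    "p1": Reference("Person", {"name": "Robert S. Epstein", "email": None}),
    "p2": Reference("Person", {"name": "Michael Stonebraker", "email": None}),
    "p3": Reference("Person", {"name": "Eugene Wong", "email": None}),
    "p4": Reference("Person", {"name": "Epstein, R.S.", "email": None}),
    "p5": Reference("Person", {"name": "Stonebraker, M.", "email": None}),
    "p6": Reference("Person", {"name": "Wong, E.", "email": None}),
    # Venue references.
    "c1": Reference("Venue", {"name": "ACM Conference on Management of Data",
                              "year": "1978", "location": "Austin, Texas"}),
    "c2": Reference("Venue", {"name": "ACM SIGMOD", "year": "1978", "location": None}),
    # Article references; authors and venue are association attributes.
    "a1": Reference("Article", {"title": "Distributed Query Processing"},
                    {"authors": ["p1", "p2", "p3"], "venue": ["c1"]}),
    "a2": Reference("Article", {"title": "Distributed query processing"},
                    {"authors": ["p4", "p5", "p6"], "venue": ["c2"]}),
}
```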

Other Complex Information Spaces: citation portals (e.g., CiteSeer, Cora); online product catalogs in e-commerce.

Real-World Objects
Article: a1 = (“Distributed Query Processing”, “ ”, {p1, p2, p3}, c1); a2 = (“Distributed query processing”, “ ”, {p4, p5, p6}, c2)
Venue: c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”); c2 = (“ACM SIGMOD”, “1978”, null)
Person: p1 = (“Robert S. Epstein”, null); p2 = (“Michael Stonebraker”, null); p3 = (“Eugene Wong”, null); p4 = (“Epstein, R.S.”, null); p5 = (“Stonebraker, M.”, null); p6 = (“Wong, E.”, null); p7 = (“Eugene Wong”, <email>); p8 = (null, <email>); p9 = (“mike”, <email>)

Reference Reconciliation
Input: a set of references R.
Output: a partitioning over R such that
- each partition refers to a single real-world object (high precision), and
- different partitions refer to different objects (high recall).

Related Work
A very active area of research in databases, data mining, and AI. Most current approaches assume matching tuples from a single database table.
Traditional approaches (surveyed in [Cohen et al., 2003]):
- Step I. Compare attributes.
- Step II. Combine attribute similarities to decide tuple match/non-match.
- Step III. Compute transitive closures to get partitions.
Newer approaches explore relationships between reconciliation decisions using probabilistic models [Russell et al., 2002] [Domingos et al., 2004]; this is harder for complex information spaces.
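A minimal sketch of the traditional three-step pipeline above, assuming flat records given as attribute dictionaries, a generic string similarity, and a hand-picked threshold (all placeholder choices, not the functions of any particular system): compare attributes, combine the scores into a match decision, then take the transitive closure of the matches to form partitions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def attribute_similarity(a, b):
    """Step I: compare one pair of attribute values (0 when either is missing)."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def tuple_match(r1, r2, threshold=0.85):
    """Step II: combine attribute similarities (here: a simple average) into a match decision."""
    keys = set(r1) & set(r2)
    if not keys:
        return False
    avg = sum(attribute_similarity(r1[k], r2[k]) for k in keys) / len(keys)
    return avg >= threshold

def partitions(records):
    """Step III: transitive closure of the pairwise matches via union-find."""
    parent = {rid: rid for rid in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for r1, r2 in combinations(records, 2):
        if tuple_match(records[r1], records[r2]):
            parent[find(r1)] = find(r2)
    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())
```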

Challenges in Complex Information Spaces (illustrated on the example references above):
1. Multiple classes.
2. Limited information.
3. Multi-valued attributes.

Intuition: a complex information space can be viewed as a network of instances and associations between them. Key: exploit the network, specifically the clues hidden in the associations.

Outline
- Introduction and problem definition
- Reconciliation algorithm
- Experimental results
- Conclusions

Framework: Dependency Graph
p2 = (“Michael Stonebraker”, null, {p1, p3}); p3 = (“Eugene Wong”, null, {p1, p2}); p7 = (“Eugene Wong”, <email>, {p8}); p8 = (null, <email>, {p7}); p9 = (“mike”, <email>, null)
Reference-similarity nodes: (p2, p8), (p3, p7), (p1, p7). Attribute-similarity nodes, e.g. (“Michael Stonebraker”, <email of p8>) for cross-attribute similarity, and nodes that compare the references' contacts.

Framework: Dependency Graph
p2 = (“Michael Stonebraker”, null, {p1, p3}); p3 = (“Eugene Wong”, null, {p1, p2}); p7 = (“Eugene Wong”, <email>, {p8}); p8 = (null, <email>, {p7}); p9 = (“mike”, <email>, null)
Reference-similarity nodes: (p8, p9), (p2, p8), (p2, p9), (p3, p7). Attribute-similarity nodes: (“Michael Stonebraker”, “mike”), (“Michael Stonebraker”, <email of p8>), (“Eugene Wong”, “Eugene Wong”).

Exploit the Dependency Graph
Reference-similarity nodes: (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2).
Attribute-similarity nodes: (“Distributed…”, “Distributed …”), (“ ”, “ ”), (“Robert S. Epstein”, “Epstein, R.S.”), (“Michael Stonebraker”, “Stonebraker, M.”), (“Eugene Wong”, “Wong, E.”), (“ACM …”, “ACM SIGMOD”), (“1978”, “1978”).

Dependency Graph Example II
Reference-similarity nodes: (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2); attribute-similarity nodes as above, plus a node that compares the papers authored by the two person references.

Strategy I. Consider Richer Evidence
Cross-attribute similarity – Name & Email: p5 = (“Stonebraker, M.”, null), p8 = (null, <email>).
Context information I – Contact list: p5 = (“Stonebraker, M.”, null, {p4, p6}), p8 = (null, <email>, {p7}), and p6 = p7.
Context information II – Authored articles: p2 = (“Michael Stonebraker”, null), p5 = (“Stonebraker, M.”, null), and p2 and p5 authored the same article.
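As an illustration of the cross-attribute (Name & Email) evidence, a hedged sketch of one possible similarity: compare a person's name tokens against the local part of an e-mail address. The heuristic and the sample address in the comment are my own assumptions, not the paper's similarity function, and the actual addresses from the example are not shown on the slide.

```python
import re

def name_email_similarity(name, email):
    """Cross-attribute similarity between a Name and an Email (hypothetical heuristic):
    the fraction of name tokens that appear in the local part of the address."""
    if not name or not email or "@" not in email:
        return 0.0
    local = email.split("@", 1)[0].lower()
    tokens = [t for t in re.split(r"[^a-z]+", name.lower()) if len(t) > 1]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in local)
    return hits / len(tokens)

# Hypothetical usage: name_email_similarity("Stonebraker, M.", "stonebraker@example.edu") -> 1.0
```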

Considering Only Attribute-wise Similarities Cannot Merge Persons Well
[Chart: number of merged person references compared with the gold standard of real-world persons.]

Considering Richer Evidence Improves the Recall
[Chart: person references: 24076; real-world persons: 1750.]

Exploit the Dependency Graph
Reference-similarity nodes: (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2); attribute-similarity nodes: (“Distributed…”, “Distributed …”), (“ ”, “ ”), (“Robert S. Epstein”, “Epstein, R.S.”), (“Michael Stonebraker”, “Stonebraker, M.”), (“Eugene Wong”, “Wong, E.”), (“ACM …”, “ACM SIGMOD”), (“1978”, “1978”). Legend: Reconciled / Similar.

Strategy II. Propagate Information between Reconciliation Decisions
After changing the similarity score of one node, re-compute the similarity scores of its neighbors. This process converges if
- the similarity score of a node is monotone in the similarity values of its neighbors, and
- neighbor similarities are recomputed only when the similarity increase is not too small.
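A sketch, under assumed data structures, of how this propagation could be driven: a work queue of nodes whose scores have changed; a node's neighbors are recomputed and re-enqueued only when its score increases by more than a small epsilon, which together with monotone similarity functions gives the convergence condition above. The graph object (a node-to-score dict plus a neighbors() method) and similarity_fn are hypothetical.

```python
from collections import deque

def propagate(graph, similarity_fn, eps=0.01):
    """graph.nodes: dict node_id -> score; graph.neighbors(n): ids whose score depends on n.
    similarity_fn(graph, n): recompute the score of node n from its neighbors' current scores."""
    queue = deque(graph.nodes)            # start by visiting every node once
    queued = set(graph.nodes)
    while queue:
        n = queue.popleft()
        queued.discard(n)
        old = graph.nodes[n]
        new = similarity_fn(graph, n)
        if new <= old + eps:              # only propagate non-trivial increases
            continue
        graph.nodes[n] = new
        for m in graph.neighbors(n):      # a change in n may raise its neighbors' scores
            if m not in queued:
                queue.append(m)
                queued.add(m)
    return graph.nodes
```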

Propagating Information between Reconciliation Decisions Further Improves Recall
[Chart: person references: 24076; real-world persons: 1750.]

Strategy III. Enrich References in Reconciliation
Enrich the knowledge of a real-world object for later reconciliation.
Naïve approach: construct the graph, compute similarities, compute the transitive closure. Problems: dependency-graph construction is expensive, and reference enrichment does not take effect until the next pass.
Solution: instant enrichment by adding neighbors in the dependency graph.
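A sketch of instant enrichment under assumed data structures: when two references are reconciled, a union-find merges them, the merged reference carries the union of both references' attribute values, and the similarity nodes that depend on either reference are put back on the propagation queue, so the new evidence takes effect immediately instead of waiting for another pass. The helper names and structures here are hypothetical.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def reconcile_and_enrich(uf, attrs, dependents, queue, r1, r2):
    """Merge references r1 and r2 and enrich the merged reference.
    attrs: rid -> {attribute -> set of values}; dependents: rid -> similarity nodes to recompute."""
    a, b = uf.find(r1), uf.find(r2)
    if a == b:
        return
    uf.union(a, b)
    root = uf.find(a)
    # Enrichment: the merged reference carries the union of both references' values,
    # e.g. a name learned from one mention and an e-mail learned from another.
    merged = {}
    for rid in (a, b):
        for attr, values in attrs.get(rid, {}).items():
            merged.setdefault(attr, set()).update(values)
    attrs[root] = merged
    # Re-activate similarity nodes that involve either reference so the new
    # evidence is used right away (instant enrichment, no second pass needed).
    for node in dependents.get(a, []) + dependents.get(b, []):
        queue.append(node)
```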

Enrich References by Adding Neighbors
p2 = (“Michael Stonebraker”, null, {p1, p3}); p3 = (“Eugene Wong”, null, {p1, p2}); p7 = (“Eugene Wong”, <email>, {p8}); p8 = (null, <email>, {p7}); p9 = (“mike”, <email>, null)
Reference-similarity nodes: (p8, p9), (p2, p8), (p2, p9), (p3, p7); attribute-similarity nodes: (“Michael Stonebraker”, “mike”), (“Michael Stonebraker”, <email of p8>). Legend: Reconciled / Similar.

Enrich References by Adding Neighbors
p2, p3, p7, p8, p9 as above. Reference-similarity nodes: (p8, p9), (p2, p8), (p3, p7); attribute-similarity nodes: (“Michael Stonebraker”, “mike”), (“Michael Stonebraker”, <email of p8>). Legend: Reconciled / Similar. (The node (p2, p9) no longer appears once p8 and p9 are reconciled.)

Reference Enrichment Improves Recall More than Information Propagation
[Chart: person references: 24076; real-world persons: 1750.]

Applying Both Information Propagation and Reference Enrichment Gives the Highest Recall
[Chart: person references: 24076; real-world persons: 1750.]

Outline
- Introduction and problem definition
- Reconciliation algorithm
- Experimental results
- Conclusions

Experiment Settings
Datasets: four personal datasets, and the Cora dataset for citations. The same parameters and thresholds are used for all datasets.
Measures:
- Precision, recall, and F-measure. Precision: the percentage of correctly reconciled reference pairs over all reconciled reference pairs. Recall: the percentage of correctly reconciled reference pairs over all pairs of references that refer to the same real-world object.
- Diversity and dispersion. Diversity: for every result partition, how many real-world objects it includes; ideally 1 (related to precision). Dispersion: for every real-world object, how many result partitions include it; ideally 1 (related to recall).
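A straightforward rendering of these measures (my own sketch, not the authors' evaluation code), with both the predicted result and the gold standard given as lists of sets of reference ids over the same set of references.

```python
from itertools import combinations

def _pairs(partitions):
    """All unordered pairs of references placed in the same partition."""
    return {frozenset(p) for part in partitions for p in combinations(sorted(part), 2)}

def pairwise_prf(predicted, gold):
    """Pairwise precision, recall, and F-measure."""
    pred, true = _pairs(predicted), _pairs(gold)
    precision = len(pred & true) / len(pred) if pred else 1.0
    recall = len(pred & true) / len(true) if true else 1.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

def diversity_dispersion(predicted, gold):
    """Diversity: avg. number of real-world objects per result partition (ideal 1, ~precision).
    Dispersion: avg. number of result partitions per real-world object (ideal 1, ~recall).
    Assumes predicted and gold cover the same references."""
    obj_of = {r: i for i, part in enumerate(gold) for r in part}
    part_of = {r: i for i, part in enumerate(predicted) for r in part}
    diversity = sum(len({obj_of[r] for r in part}) for part in predicted) / len(predicted)
    dispersion = sum(len({part_of[r] for r in part}) for part in gold) / len(gold)
    return diversity, dispersion
```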

Recall Results on One Personal Dataset
[Chart: person references: 24076; real-world persons: 1750.]

Results Considering All Occurrences of Person Instances
[Table: precision/recall, F-measure, and number of partitions for attribute-wise matching vs. the dependency-graph algorithm on datasets A (1750 persons / 24076 references), B (1989 / 36359), C (1570 / 15160), D (1518 / 17199), plus averages.]
Both precision and recall increase compared with attribute-wise matching.

Results Considering Only Distinct Person References
[Table: precision/recall, F-measure, and number of partitions for attribute-wise matching vs. the dependency-graph algorithm on datasets A (1750 persons / 3114 distinct references), B (1989 / 3211), C (1570 / 2430), D (1518 / 2188), plus averages.]
Precision and recall increase substantially compared with attribute-wise matching.

Diversity and Dispersion Are Very Close to 1
[Table: diversity/dispersion for attribute-wise matching vs. the dependency-graph algorithm on datasets A (1750/24076), B (1989/36359), C (1570/15160), D (1518/17199), plus averages.]

Our Algorithm Equals or Outperforms Attribute-wise Matching in All Classes
[Table: precision and recall of attribute-wise matching vs. the dependency-graph algorithm for the Person, Article, and Venue classes.]

Results on the Cora Dataset Are Competitive with Other Reported Results
Results reported in other record-linkage papers: Precision/Recall = 0.990/0.925 [Cohen et al., 2002]; Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]; F-measure reported in [Bilenko and Mooney, 2003].
[Table: precision/recall and F-measure of attribute-wise matching vs. the dependency-graph algorithm for the Article, Person, and Venue classes on Cora.]

Conclusions
Contributions: a dependency-graph-based reconciliation algorithm that
- exploits rich evidence,
- propagates information between reconciliation decisions, and
- enriches references during reconciliation.
Extended work: propagate negative information through the dependency graph.

Reference Reconciliation in Complex Information Spaces. Xin (Luna) Dong, Alon Halevy, Jayant Madhavan. SIGMOD 2005.

Strategy IV. Enforce Constraints
Problem: [figure with person references P1, P2, P3.]
Solution: propagate negative information, i.e., constraints. A non-merge node records that its two elements are guaranteed to be different and should never be merged.
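A sketch of how a non-merge constraint could be enforced when clusters are merged, using a union-find that tracks, for each cluster, the clusters it must never join. The data structure is an assumption of mine for illustration; the slides that follow show the paper's own view of this as negative information propagated through the dependency graph.

```python
class ConstrainedUnionFind:
    """Union-find that refuses to merge clusters connected by a non-merge constraint."""
    def __init__(self):
        self.parent = {}
        self.cannot = {}          # root -> set of roots it must never be merged with

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def add_non_merge(self, a, b):
        ra, rb = self.find(a), self.find(b)
        self.cannot.setdefault(ra, set()).add(rb)
        self.cannot.setdefault(rb, set()).add(ra)

    def merge(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if rb in self.cannot.get(ra, set()):
            return False          # constraint: guaranteed different, never merge
        self.parent[ra] = rb
        # Carry the constraints of the absorbed root over to the surviving root.
        moved = self.cannot.pop(ra, set())
        self.cannot.setdefault(rb, set()).update(moved)
        for other in moved:
            s = self.cannot.setdefault(other, set())
            s.discard(ra)
            s.add(rb)
        return True
```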

Enforce Constraints by Propagating Negative Information
p2 = (“Michael Stonebraker”, null, {p1, p3}); p3 = (“Eugene Wong”, null, {p1, p2}); p7 = (“Eugene Wong”, <email>, {p8}); p8 = (null, <email>, {p7}); p9 = (“matt”, <email>, null)
Reference-similarity nodes: (p8, p9), (p2, p8), (p2, p9), (p3, p7); attribute-similarity nodes: (“Michael Stonebraker”, “matt”), (“Michael Stonebraker”, <email of p8>).

Enforce Constraints by Propagating Negative Information
Same graph as above, with a constraint (non-merge node) on (p8, p9). Legend: Reconciled / Similar / Non-merge.

Enforcing Constraints Improves Precision
[Table: precision and number of entities incorrectly reconciled with others, with vs. without constraints.]

Similarity Computation
The similarity function s(N) for a node N takes the similarity scores of N's neighbors as input and outputs a similarity score for N in the range 0 to 1. The similarity function can be defined by applying domain knowledge, learning from training data, resorting to global knowledge, etc.
S = S_rv + S_sb + S_wb, where S_rv comes from real-valued neighbors (a decision-tree-shaped function), S_sb from strong-boolean-valued neighbors, and S_wb from weak-boolean-valued neighbors.
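A hedged sketch of the combination S = S_rv + S_sb + S_wb: real-valued neighbor scores go through a simple decision-tree-shaped rule, strong boolean evidence adds a large fixed boost, weak boolean evidence a small one, and the result is clamped to [0, 1]. The particular thresholds and weights below are placeholders, not the tuned values used in the paper.

```python
def node_similarity(real_valued, strong_boolean, weak_boolean):
    """real_valued: list of neighbor scores in [0, 1];
    strong_boolean / weak_boolean: lists of True/False neighbor decisions."""
    # S_rv: decision-tree-shaped function of the real-valued neighbors (placeholder rules).
    best = max(real_valued, default=0.0)
    if best > 0.9:
        s_rv = 0.8
    elif best > 0.6:
        s_rv = 0.5
    else:
        s_rv = 0.2 * best
    # S_sb: strong boolean evidence (e.g. a reconciled associated reference) gives a big boost.
    s_sb = 0.15 * sum(strong_boolean)
    # S_wb: weak boolean evidence gives a small boost.
    s_wb = 0.05 * sum(weak_boolean)
    return min(1.0, s_rv + s_sb + s_wb)   # clamp to the [0, 1] range required of scores
```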

Framework: Dependency Graph
Definition:
- For every pair of references A and B, a node representing their similarity.
- For every attribute of A and corresponding attribute of B, a node representing the attribute similarity, and an edge between the attribute-similarity node and the reference-similarity node, representing the dependency between the two similarities.
- Each node is associated with a similarity score between 0 and 1.
Construction: include only nodes whose two elements have the potential to be similar.
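A sketch of this construction under assumed inputs: one node per candidate reference pair, one node per comparable attribute-value pair, and edges recording that each side's similarity depends on the other. Only pairs that pass a cheap potentially_similar filter (a hypothetical stand-in for the construction condition above) get nodes at all.

```python
from itertools import combinations

def build_dependency_graph(refs, comparable_attrs, potentially_similar):
    """refs: rid -> {attr: value}; comparable_attrs: list of (attr1, attr2) pairs to compare;
    potentially_similar(r1, r2): cheap filter deciding whether to create nodes at all."""
    nodes = {}        # node_id -> similarity score in [0, 1], initialised to 0
    edges = {}        # node_id -> set of node ids whose similarity depends on it
    for r1, r2 in combinations(sorted(refs), 2):
        if not potentially_similar(refs[r1], refs[r2]):
            continue
        ref_node = ("ref", r1, r2)                    # reference-similarity node
        nodes[ref_node] = 0.0
        edges.setdefault(ref_node, set())
        for a1, a2 in comparable_attrs:
            v1, v2 = refs[r1].get(a1), refs[r2].get(a2)
            if v1 is None or v2 is None:
                continue
            attr_node = ("attr", r1, a1, r2, a2)      # attribute-similarity node
            nodes[attr_node] = 0.0
            # The reference similarity depends on the attribute similarity and vice versa.
            edges.setdefault(attr_node, set()).add(ref_node)
            edges[ref_node].add(attr_node)
    return nodes, edges
```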