Reference Reconciliation in Complex Information Spaces
Xin (Luna) Dong, Alon Halevy, Jayant Madhavan
SIGMOD 2005
University of Washington
Semex : Personal Information Management System
[Screenshots: Semex browsing a person instance with the associations MentionedIn(315), AuthorOfArticles(52), RecipientOf s(8547), SenderOf s(7595), Homepage(1), Contacts(1145), Co-authors(24); and an article instance, "Reference Reconciliation in Complex Information Spaces", with the associations Authors, FromFile, CitedBy, Cites(33), PublishedIn]
Semex : Personal Information Management System Xin (Luna) Dong xin dong xinluna dong luna dongxin x. dong Lab-#dong xin dong xin luna Names s
Semex Without Deduplication Search results for luna luna dong SenderOf s(3043) RecipientOf s(2445) MentionedIn(94) 23 persons
Semex Without Deduplication Search results for luna Xin (Luna) Dong AuthorOfArticles(49) MentionedIn(20) 23 persons
Semex Without Deduplication A Platform for Personal Information Management and Integration
Semex Without Deduplication 9 Persons: dong xin xin dong
Semex NEEDS Deduplication (Reference Reconciliation)
Complex Information Space Example: An Abstract View of Personal Information
Article:
  a1 = (“Distributed Query Processing”, “ ”, {p1, p2, p3}, c1)
  a2 = (“Distributed query processing”, “ ”, {p4, p5, p6}, c2)
Venue:
  c1 = (“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2 = (“ACM SIGMOD”, “1978”, null)
Person:
  p1 = (“Robert S. Epstein”, null)
  p2 = (“Michael Stonebraker”, null)
  p3 = (“Eugene Wong”, null)
  p4 = (“Epstein, R.S.”, null)
  p5 = (“Stonebraker, M.”, null)
  p6 = (“Wong, E.”, null)
Complex Information Space Example: An Abstract View of Personal Information (continued)
Additional Person references:
  p7 = (“Eugene Wong”,
  p8 = (null,
  p9 = (“mike”,
Legend: Class, Reference, Atomic Attribute, Association Attribute
Other Complex Information Spaces
  Citation portals, e.g., Citeseer, Cora
  Online product catalogs in e-commerce
Real-World Objects
[Figure: the example references a1, a2, c1, c2, and p1 through p9, grouped by the real-world object each one refers to; the grouping itself is not recoverable from the extracted text]
Reference Reconciliation
Input: A set of references R
Output: A partitioning over R, such that
  Each partition refers to a single real-world object (high precision)
  Different partitions refer to different objects (high recall)
Related Work
A very active area of research in Databases, Data Mining and AI
Most current approaches assume matching tuples from a single database table
Traditional approaches (surveyed in [Cohen, et al. 2003])
  Step I. Compare attributes
  Step II. Combine attribute similarities to decide tuple match/non-match
  Step III. Compute transitive closures to get partitions
New approaches explore the relationship between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]
Harder for complex information spaces
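The three traditional steps can be sketched in a few lines of Python. This is only an illustration of the generic pipeline, not the method of any cited paper: the averaging rule, the 0.8 threshold, and the names `attribute_similarity`, `tuples_match`, and `partition` are assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attribute_similarity(a, b):
    """Step I: string similarity in [0, 1]; a missing value gives no evidence."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuples_match(t1, t2, threshold=0.8):
    """Step II: combine attribute similarities (assumed rule: plain average)."""
    scores = [attribute_similarity(x, y) for x, y in zip(t1, t2)]
    return sum(scores) / len(scores) >= threshold


def partition(records, threshold=0.8):
    """Step III: transitive closure of pairwise matches via union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if tuples_match(records[i], records[j], threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical single-table input: (venue name, year).
venues = [("ACM SIGMOD", "1978"), ("ACM SIGMOD", "1978"), ("VLDB", "1980")]
print(partition(venues))  # [[0, 1], [2]]
```

Each record is compared only on its own attributes, which is why this style of matching struggles when references carry little information or span several classes.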
Challenges in Complex Information Spaces
[Illustrated on the example references a1, a2, c1, c2, and p1 through p9]
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes
Intuition
Complex information spaces can be considered as networks of instances and associations between the instances
Key: exploit the network, specifically, the clues hidden in the associations
Outline Introduction and problem definition Reconciliation algorithm Experimental results Conclusions
Framework: Dependency Graph
p2 = (“Michael Stonebraker”, null, {p1, p3})
p3 = (“Eugene Wong”, null, {p1, p2})
p7 = (“Eugene Wong”, {p8})
p8 = (null, {p7})
p9 = (“mike”, null)
[Figure: dependency graph built up over three frames. Reference-similarity nodes: (p2, p8), (p3, p7), (p8, p9), (p2, p9), (p1, p7). Attribute-similarity nodes: (“Eugene Wong”, “Eugene Wong”), (“Michael Stonebraker”, “mike”), and further nodes truncated in the source. Edge labels: “Compare contacts”, “Cross-attr similarity”.]
Exploit the Dependency Graph
[Figure: dependency graph for the article example.
Reference-similarity nodes: (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2)
Attribute-similarity nodes: (“Distributed…”, “Distributed …”), (“ ”, “ ”), (“Robert S. Epstein”, “Epstein, R.S.”), (“Michael Stonebraker”, “Stonebraker, M.”), (“Eugene Wong”, “Wong, E.”), (“ACM …”, “ACM SIGMOD”), (“1978”, “1978”)]
Dependency Graph Example II
[Figure: dependency graph for the article example, with reference-similarity nodes (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2), their attribute-similarity nodes, and the edge label “Compare authored papers”]
Strategy I. Consider Richer Evidence
Cross-attribute similarity: name and email
  p5 = (“Stonebraker, M.”, null)
  p8 = (null,
Context Information I: Contact list
  p5 = (“Stonebraker, M.”, null, {p4, p6})
  p8 = (null, {p7})
  p6 = p7
Context Information II: Authored articles
  p2 = (“Michael Stonebraker”, null)
  p5 = (“Stonebraker, M.”, null)
  p2 and p5 authored the same article
Considering Only Attribute-wise Similarities Cannot Merge Persons Well
[Chart: surviving data label 1409; the counts for person references and real-world persons (gold standard) are missing from the extracted text]
Considering Richer Evidence Improves the Recall
[Chart] Person references: 24076. Real-world persons: 1750.
Exploit the Dependency Graph
[Figure, animated over several frames: the dependency graph for the article example, with nodes (a1, a2), (p1, p4), (p2, p5), (p3, p6), (c1, c2) and their attribute-similarity nodes progressively marked as Reconciled or Similar]
Strategy II. Propagate Information between Reconciliation Decisions
After changing the similarity score of one node, re-compute the similarity scores of its neighbors
This process converges if
  The similarity score is monotone in the similarity values of neighbors
  Neighbor similarities are re-computed only if the similarity increase is not too small
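A minimal sketch of this propagation loop, assuming a queue-based worklist. The function name `propagate`, the `epsilon` cut-off of 0.01, the 0.85 merge threshold, and the toy `update` rule are all assumptions made for illustration; the slides do not give these values.

```python
from collections import deque


def propagate(scores, neighbors, update, merge_threshold=0.85, epsilon=0.01):
    """Re-compute scores until no score rises by epsilon or more.

    scores:    dict node -> similarity in [0, 1] (modified in place)
    neighbors: dict node -> adjacent nodes (dependencies treated as symmetric)
    update:    function(node, scores) -> new score, monotone in neighbor scores
    """
    queue = deque(scores)
    queued = set(scores)
    while queue:
        node = queue.popleft()
        queued.discard(node)
        new_score = max(scores[node], update(node, scores))  # never decrease
        increase = new_score - scores[node]
        scores[node] = new_score
        if increase >= epsilon:  # skip neighbors when the change is too small
            for nb in neighbors.get(node, []):
                if nb not in queued:
                    queue.append(nb)
                    queued.add(nb)
    return {n for n, s in scores.items() if s >= merge_threshold}


# Hypothetical starting scores for an article pair, a person pair, a venue pair.
base = {"a1~a2": 0.9, "p2~p5": 0.625, "c1~c2": 0.5}
neighbors = {"a1~a2": ["p2~p5", "c1~c2"], "p2~p5": ["a1~a2"], "c1~c2": ["a1~a2"]}


def update(node, scores):
    # Assumed rule: every neighbor already at or above 0.85 adds 0.25.
    boost = sum(0.25 for nb in neighbors[node] if scores[nb] >= 0.85)
    return min(1.0, base[node] + boost)


scores = dict(base)
merged = propagate(scores, neighbors, update)
print(sorted(merged))  # ['a1~a2', 'p2~p5']
```

Because scores are bounded by 1 and a node re-queues its neighbors only after rising by at least `epsilon`, the loop performs a finite number of updates, which mirrors the two convergence conditions on the slide.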
Propagating Information between Reconciliation Decisions Further Improves Recall
[Chart] Person references: 24076. Real-world persons: 1750.
Strategy III. Enrich References in Reconciliation
Enrich knowledge of a real-world object for later reconciliation
Naïve: Construct graph → Compute similarity → Transitive closure
Problems
  Dependency-graph construction is expensive
  Reference enrichment does not take effect until the next pass
Solution
  Instant enrichment by adding neighbors in the dependency graph
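One way to picture instant enrichment is as folding pair nodes together once two references are reconciled, so that evidence gathered for one reference becomes available to the other immediately. The sketch below is an assumption-laden illustration: the function `enrich`, the dictionary layout, and the choice of which reference is kept are hypothetical, not taken from the paper.

```python
def enrich(neighbors, kept, absorbed):
    """After reconciling `kept` and `absorbed`, fold every pair node that
    mentions `absorbed` into the corresponding pair node for `kept`.

    neighbors: dict mapping a reference pair (frozenset of two ids) to the
               set of evidence nodes attached to that pair.
    """
    for node in list(neighbors):
        if absorbed not in node or kept in node:
            continue  # unrelated pair, or the reconciled pair itself
        (other,) = node - {absorbed}
        target = frozenset((kept, other))
        evidence = neighbors.pop(node)
        neighbors.setdefault(target, set()).update(evidence)
    return neighbors


# p2 and p8 have just been reconciled; (p2, p9) carried name evidence.
graph = {
    frozenset(("p2", "p8")): set(),
    frozenset(("p2", "p9")): {("Michael Stonebraker", "mike")},
    frozenset(("p8", "p9")): set(),
}
enrich(graph, kept="p8", absorbed="p2")
print(graph[frozenset(("p8", "p9"))])  # {('Michael Stonebraker', 'mike')}
```

No graph is rebuilt here: the existing node for (p8, p9) simply gains a neighbor, so the new evidence can influence its score in the same pass.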
Enrich References by Adding Neighbors
p2 = (“Michael Stonebraker”, null, {p1, p3})
p3 = (“Eugene Wong”, null, {p1, p2})
p7 = (“Eugene Wong”, {p8})
p8 = (null, {p7})
p9 = (“mike”, null)
[Figure, animated over several frames: dependency graph with reference-similarity nodes (p2, p8), (p3, p7), (p8, p9), (p2, p9) and attribute-similarity node (“Michael Stonebraker”, “mike”); nodes are marked Reconciled or Similar. In the final frames the node (p2, p9) no longer appears.]
Reference Enrichment Improves Recall More than Information Propagation
[Chart] Person references: 24076. Real-world persons: 1750.
Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
[Chart] Person references: 24076. Real-world persons:
Outline Introduction and problem definition Reconciliation algorithm Experimental results Conclusions
Experiment Settings
Datasets
  Four personal datasets
  Cora dataset for citations
  Use the same parameters and thresholds for all datasets
Measure
  Precision and recall, F-measure
    Precision: The percentage of correctly reconciled reference pairs over all reconciled reference pairs
    Recall: The percentage of correctly reconciled reference pairs over pairs of references that refer to the same real-world object
  Diversity and Dispersion
    Diversity: For every result partition, how many real-world objects are included; ideally should be 1 (related to precision)
    Dispersion: For every real-world object, how many result partitions include them; ideally should be 1 (related to recall)
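The pairwise measures above can be computed directly from a result partitioning and a gold standard. The sketch below follows the slide's definitions; reporting diversity and dispersion as averages over partitions and objects is an assumption, and the function names and sample data are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered reference pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_scores(result, gold):
    """Pairwise precision, recall, and F-measure of `result` against `gold`."""
    predicted, truth = same_cluster_pairs(result), same_cluster_pairs(gold)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 1.0
    recall = correct / len(truth) if truth else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def diversity_and_dispersion(result, gold):
    """Average objects per result partition, average partitions per object."""
    owner = {ref: i for i, c in enumerate(gold) for ref in c}
    part = {ref: i for i, c in enumerate(result) for ref in c}
    diversity = sum(len({owner[r] for r in c}) for c in result) / len(result)
    dispersion = sum(len({part[r] for r in c}) for c in gold) / len(gold)
    return diversity, dispersion


# Hypothetical gold standard and reconciliation result over six references.
gold = [{"p1", "p4"}, {"p2", "p5", "p9"}, {"p3"}]
result = [{"p1", "p4"}, {"p2", "p5"}, {"p3", "p9"}]
print(pairwise_scores(result, gold))
print(diversity_and_dispersion(result, gold))
```

In this sample, wrongly grouping p9 with p3 lowers precision and raises diversity, while failing to group p9 with p2 and p5 lowers recall and raises dispersion.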
Recall Results on One Personal Dataset
[Chart] Person references: 24076. Real-world persons:
Results Considering All Occurrences of Person Instances
Columns: Dataset (#per/#ref) | Attr-wise Matching: Prec/Recall, F, #Par | Dependency Graph: Prec/Recall, F, #Par
Rows:
  A (1750/24076)
  B (1989/36359)
  C (1570/15160)
  D (1518/17199)
  Avg
[Cell values are missing from the extracted text; only one value, 0.999, survives]
Both precision and recall increase compared with attr-wise matching.
Results Considering Only Distinct Person References
Columns: Dataset (#per/#dist-ref) | Attr-wise Matching: Prec/Recall, F, #Par | Dependency Graph: Prec/Recall, F, #Par
Rows:
  A (1750/3114)
  B (1989/3211)
  C (1570/2430)
  D (1518/2188)
  Avg
[Cell values are missing from the extracted text; only one value, 0.995, survives]
Precision and recall increase substantially compared with attr-wise matching.
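Diversity and Dispersion Are Very Close to 1
Columns: Dataset (#per/#ref) | Attr-wise Matching: Diversity/Dispersion | Dependency Graph: Diversity/Dispersion
Rows:
  A (1750/24076)
  B (1989/36359)
  C (1570/15160)
  D (1518/17199)
  Avg
[Cell values are missing from the extracted text; only the first value, 1.18, and the last value, 1.008, survive]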
Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes
Columns: Class | Attr-wise Matching: Precision, Recall | Dependency Graph: Precision, Recall
Rows: Person, Article, Venue
[Cell values are missing from the extracted text]
Results on Cora Dataset Are Competitive with Other Reported Results
Results reported in other record linkage papers:
  Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
  Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
  F-measure = [Bilenko and Mooney, 2003]
Columns: Class | Attr-wise Matching: Prec/Recall, F-measure | Dependency Graph: Prec/Recall, F-measure
Rows: Article, Person, Venue
[Cell values are missing from the extracted text; only one value, 0.985, survives]
Conclusions
Contributions: Dependency-graph-based reconciliation algorithm
  Exploit rich evidence
  Propagate information between reconciliation decisions
  Enrich references during reconciliation
Extended Work
  Propagate negative information through the dependency graph
Strategy IV. Enforce Constraints
Problem: [Figure with nodes P1, P2, P3; content not recoverable from the extracted text]
Solution: Propagate negative information (constraints)
  Non-merge node: the two elements are guaranteed to be different and should never be merged
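A non-merge constraint can be sketched as a check made before two clusters are joined: if any reference in one cluster is marked as different from any reference in the other, the merge is refused. The class `ConstrainedClusters` and its methods are hypothetical names for this illustration, and treating the pair (p2, p9) as a non-merge pair is an assumption for the example.

```python
class ConstrainedClusters:
    """Clusters of references that refuse merges violating non-merge pairs."""

    def __init__(self, refs):
        self.members = {r: {r} for r in refs}  # representative -> members
        self.rep = {r: r for r in refs}        # reference -> representative
        self.non_merge = set()                 # pairs that must stay apart

    def forbid(self, a, b):
        """Record that a and b are guaranteed to be different objects."""
        self.non_merge.add(frozenset((a, b)))

    def merge(self, a, b):
        """Merge the clusters of a and b; return False if a constraint blocks it."""
        ra, rb = self.rep[a], self.rep[b]
        if ra == rb:
            return True
        for x in self.members[ra]:
            for y in self.members[rb]:
                if frozenset((x, y)) in self.non_merge:
                    return False  # negative information reaches the whole cluster
        for y in self.members.pop(rb):
            self.rep[y] = ra
            self.members[ra].add(y)
        return True


clusters = ConstrainedClusters(["p2", "p8", "p9"])
clusters.merge("p2", "p8")         # p2 and p8 are reconciled
clusters.forbid("p2", "p9")        # assumed constraint: p2 and p9 differ
print(clusters.merge("p8", "p9"))  # False
```

Because p8 already shares a cluster with p2, the constraint on (p2, p9) also blocks (p8, p9), which is the sense in which negative information propagates.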
Enforce Constraints by Propagating Negative Information
p2 = (“Michael Stonebraker”, null, {p1, p3})
p3 = (“Eugene Wong”, null, {p1, p2})
p7 = (“Eugene Wong”, {p8})
p8 = (null, {p7})
p9 = (“matt”, null)
[Figure, animated over several frames: dependency graph with reference-similarity nodes (p2, p8), (p3, p7), (p2, p9), (p8, p9) and attribute-similarity node (“Michael Stonebraker”, “matt”); nodes are marked Reconciled, Similar, or Non-merge, and a Constraint label appears in the later frames]
Enforcing Constraints Improves Precision
Columns: Method | Precision | #(Entities reconciled with others incorrectly)
Rows: Constraint, No Constraint
[Cell values are missing from the extracted text]
Similarity Computation
Similarity function for node N: s(N)
  Input: similarity scores of N’s neighbors
  Output: similarity score of N, ranging from 0 to 1
Similarity function can be defined by applying domain knowledge, learning from training data, resorting to global knowledge, etc.
S = S_rv + S_sb + S_wb
  S_rv: from real-valued neighbors. Decision-tree shape.
  S_sb: from strong-boolean-valued neighbors
  S_wb: from weak-boolean-valued neighbors
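The additive form S = S_rv + S_sb + S_wb can be illustrated with a toy similarity function for a pair of person references. Everything numeric here is an assumption: the thresholds inside `s_rv`, the weights 0.2 and 0.05, and the choice of name and email as the real-valued neighbors are hypothetical and are not the values used in the paper.

```python
def s_rv(name_sim, email_sim):
    """Decision-tree-shaped rule over real-valued neighbors (assumed thresholds)."""
    if email_sim >= 0.95:
        return 0.9
    if name_sim >= 0.9:
        return 0.8
    if name_sim >= 0.5:
        return 0.5
    return 0.0


def similarity(name_sim, email_sim, strong_merged, weak_merged,
               strong_weight=0.2, weak_weight=0.05):
    """S = S_rv + S_sb + S_wb, capped at 1.

    strong_merged: number of strong-boolean neighbors already reconciled
    weak_merged:   number of weak-boolean neighbors already reconciled
    """
    s_sb = strong_weight * strong_merged
    s_wb = weak_weight * weak_merged
    return min(1.0, s_rv(name_sim, email_sim) + s_sb + s_wb)


# Moderate name similarity, one reconciled strong neighbor, two weak ones.
print(round(similarity(0.6, 0.0, strong_merged=1, weak_merged=2), 3))  # 0.8
```

Each term never decreases as its inputs grow, so this toy function is monotone in the neighbor scores, which is the property the propagation strategy relies on for convergence.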
Framework: Dependency Graph
Definition
  For every pair of references A and B: a node representing their similarity
  For every attribute of A and attribute of B: a node representing attribute similarity
  An edge between an attr-sim node and a ref-sim node, representing the dependency between the similarities
  Each node is associated with a similarity score between 0 and 1
Construction: include only nodes whose two elements have the potential to be similar
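The definition above maps naturally onto two dictionaries: one holding a score per node and one holding the edges. The sketch below is a simplification under stated assumptions: it compares only same-named attributes (no cross-attribute nodes), uses `difflib` string similarity, and keeps a pair only if some attribute similarity reaches a hypothetical `candidate_threshold` of 0.3.

```python
from difflib import SequenceMatcher
from itertools import combinations


def string_sim(a, b):
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_dependency_graph(references, candidate_threshold=0.3):
    """references: dict id -> dict attribute -> string value (or None).

    Returns (scores, edges): scores maps every node to a similarity in [0, 1];
    edges maps each attribute-similarity node to the reference-similarity
    nodes that depend on it.
    """
    scores, edges = {}, {}
    for r1, r2 in combinations(sorted(references), 2):
        attr_nodes = []
        for attr, v1 in references[r1].items():
            v2 = references[r2].get(attr)
            if v1 is None or v2 is None:
                continue
            sim = string_sim(v1, v2)
            if sim >= candidate_threshold:  # keep only potentially similar values
                attr_nodes.append((("attr", v1, v2), sim))
        if not attr_nodes:
            continue                        # no evidence: omit this reference pair
        ref_node = ("ref", r1, r2)
        scores[ref_node] = 0.0              # to be filled in by similarity computation
        for node, sim in attr_nodes:
            scores[node] = sim
            edges.setdefault(node, set()).add(ref_node)
    return scores, edges


people = {
    "p3": {"name": "Eugene Wong"},
    "p7": {"name": "Eugene Wong"},
    "p9": {"name": "mike"},
}
scores, edges = build_dependency_graph(people)
print(sorted(n for n in scores if n[0] == "ref"))  # [('ref', 'p3', 'p7')]
```

Keying attribute-similarity nodes by their value pair means two reference pairs that compare the same strings share one node, so a single score update can reach several reconciliation decisions.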