Exploiting Relationships for Domain-Independent Data Cleaning Dmitri V. Kalashnikov Sharad Mehrotra Stella Chen Computer Science Department University of California, Irvine (RESCUE) Copyright(c) by Dmitri V. Kalashnikov, 2005 SIAM Data Mining Conference, 2005
2 Talk Overview Examples –motivating data cleaning (DC) –motivating analysis of relationships for DC Reference disambiguation –one of the DC problems –this work addresses Proposed approach –RelDC (Relationship-based Data Cleaning) –employs analysis of relationships for DC –the main contribution Experiments
3 Why do we need “Data Cleaning”? An actual excerpt from a person’s CV –sanitized for privacy –quite common in CVs, etc. –this particular person –argues he is good –because his work is well-cited –but there is a problem with using the CiteSeer ranking –in general, it is not valid (in CVs) –let’s see why... “... In June 2004, I was listed as the 1000th most cited author in computer science (of 100,000 authors) by CiteSeer, available at
4 What is the problem in the example? Suspicious entries –Let us go to the DBLP website –which stores bibliographic entries of many CS authors –Let us check who are –“A. Gupta” –“L. Zhang” [Figure panels: CiteSeer, the top-k most cited authors; DBLP]
5 Comparing raw and cleaned CiteSeer
[Figure panels: CiteSeer top-k; Cleaned CiteSeer top-k. Table columns: Rank, Author, Location, # citations. Only the rank and author entries below survive.]
1 (100.00%) douglas
2 (100.00%) rakesh
3 (100.00%) hector
4 (100.00%) sally
5 (100.00%) jennifer
6 (100.00%) david
6 (100.00%) thomas
7 (100.00%) rajeev
8 (100.00%) willy
9 (100.00%) van
10 (100.00%) rajeev
11 (100.00%) john
12 (100.00%) joseph
13 (100.00%) andrew
14 (100.00%) peter
15 (100.00%) serge
6 What is the lesson? –data should be cleaned first –e.g., determine the (unique) real authors of publications –solving such challenges is not always “easy” –that explains the large body of work on data cleaning –note –CiteSeer is aware of the problem with its ranking –there are more issues with CiteSeer –many not related to data cleaning “Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
7 High-level view of the problem
8 Traditional Domain-Independent DC Methods
9 What is “Reference Disambiguation”?
Author table (clean):
A1, ‘Dave White’, ‘Intel’
A2, ‘Don White’, ‘CMU’
A3, ‘Susan Grey’, ‘MIT’
A4, ‘John Black’, ‘MIT’
A5, ‘Joe Brown’, unknown
A6, ‘Liz Pink’, unknown
Publication table (to be cleaned):
P1, ‘Databases...’, ‘John Black’, ‘Don White’
P2, ‘Multimedia...’, ‘Sue Grey’, ‘D. White’
P3, ‘Title3...’, ‘Dave White’
P4, ‘Title5...’, ‘Don White’, ‘Joe Brown’
P5, ‘Title6...’, ‘Joe Brown’, ‘Liz Pink’
P6, ‘Title7...’, ‘Liz Pink’, ‘D. White’
Analysis (‘D. White’ in P2, our approach):
1. ‘Don White’ has a paper (P1) with ‘John Black’, who is at MIT
2. ‘Dave White’ is not connected to MIT in any way
3. ‘Sue Grey’, the other coauthor of P2, is also at MIT
Thus: ‘D. White’ in P2 is probably Don (since we know he collaborates with MIT people)
Analysis (‘D. White’ in P6, our approach):
1. ‘Don White’ has a paper (P4) with Joe Brown; Joe has a paper (P5) with Liz Pink; Liz Pink is a coauthor of P6
2. ‘Dave White’ does not have papers with Joe or Liz
Thus: ‘D. White’ in P6 is probably Don (since co-author networks often form clusters)
10 Attributed Relational Graph (ARG) View dataset as a graph –nodes for entities –papers, authors, organizations –e.g., P2, Susan, MIT –edges for relationships –“writes”, “affiliated with” –e.g. Susan → P2 (“writes”) “Choice” nodes –for uncertain relationships –mutual exclusion –“1” and “2” in the figure Analysis can be viewed as –application of the “Context AP” –to this graph –defined next... Q: How come domain-independent?
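The graph view can be sketched on the toy author and publication tables from the reference-disambiguation slide. This is a minimal illustration, not the RelDC implementation: the adjacency-dict layout, the helper name `add_edge`, the `choice:` node labels, and the equal starting option-weights of 0.5 are all assumptions, and ‘Sue Grey’ is assumed to be the author ‘Susan Grey’.

```python
from collections import defaultdict

# Undirected graph as {node: {neighbor: weight}} (an assumed, illustrative layout).
graph = defaultdict(dict)

def add_edge(u, v, weight=1.0):
    """Add an undirected edge; weight is the confidence that the relationship exists."""
    graph[u][v] = weight
    graph[v][u] = weight

# "affiliated with" edges from the clean author table
for author, org in [("Dave White", "Intel"), ("Don White", "CMU"),
                    ("Susan Grey", "MIT"), ("John Black", "MIT")]:
    add_edge(author, org)

# certain "writes" edges (references that match exactly one author)
for author, paper in [("John Black", "P1"), ("Don White", "P1"),
                      ("Susan Grey", "P2"), ("Dave White", "P3"),
                      ("Don White", "P4"), ("Joe Brown", "P4"),
                      ("Joe Brown", "P5"), ("Liz Pink", "P5"),
                      ("Liz Pink", "P6")]:
    add_edge(author, paper)

# uncertain references 'D. White' in P2 and P6: one choice node per reference,
# with one option-edge per candidate author (weights are placeholders)
for paper in ("P2", "P6"):
    choice = "choice:" + paper
    add_edge(paper, choice)
    add_edge(choice, "Don White", 0.5)
    add_edge(choice, "Dave White", 0.5)
```

Nothing in this structure is specific to publications: entities become nodes and relationships become weighted edges regardless of the domain.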
11 Context Attraction Principle (CAP)
In designing the RelDC approach, our goal was to use CAP as an axiom and then solve the problem formally, without heuristics.
CAP: if
–reference r, made in the context of entity x, refers to an entity y_j
–but the description provided by r matches multiple entities y_1, …, y_j, …, y_N,
then
–x and y_j are likely to be more strongly connected to each other via chains of relationships
–than x and y_k (k = 1, 2, …, N; k ≠ j).
[Figure labels: “J. Smith”, publication P1, John E. Smith, SSN = 123, Joe A. Smith, Jane Smith]
12 Analyzing paths: linking entities and contexts
D. White is a reference
–in the context of P2, P6
–can link P2, P6 to Don
–cannot link P2, P6 to Dave
–more complex paths in general
Analysis (‘D. White’ in P2): path P2→Don
1. ‘Don White’ has a paper (P1) with ‘John Black’, who is at MIT
2. ‘Dave White’ is not connected to MIT in any way
3. ‘Sue Grey’, the other coauthor of P2, is also at MIT
Thus: ‘D. White’ is probably Don White
Analysis (‘D. White’ in P6): path P6→Don
1. ‘Don White’ has a paper (P4) with Joe Brown; Joe has a paper (P5) with Liz Pink; Liz Pink is a coauthor of P6
2. ‘Dave White’ does not have papers with Joe or Liz
Thus: ‘D. White’ is probably Don White
13 Questions to answer 1. Does the CAP principle hold over real datasets? That is, if we disambiguate references based on it, will the references be correctly disambiguated? 2. Can we design a generic solution to exploiting relationships for disambiguation?
14 Problem formalization
Notation: meaning
–X = {x_1, x_2, ..., x_N}: the set of all entities in the database
–x_i.r_k: the k-th reference of entity x_i (e.g., the name of the k-th author of paper x_i, such as ‘J. Smith’)
–a reference: a description of an object, with multiple attributes
–d[x_i.r_k]: the “answer” for x_i.r_k, the real entity x_i.r_k refers to (unknown, the goal is to find it; e.g., the true k-th author of paper x_i)
–CS[x_i.r_k]: the “choice set” for x_i.r_k, the set of all entities matching the description provided by x_i.r_k (e.g., ‘John A. Smith’, ‘Jane B. Smith’, ...)
–y_1, y_2, ..., y_N: the “options” for x_i.r_k, the elements of CS[x_i.r_k]
–v[x_i]: the node in the graph for entity x_i
15 Handling References: Linking (references correspond to relationships)
Entity-Relationship Graph: RelDC views the dataset as a graph
–undirected
–nodes for entities (nodes don’t have weights)
–edges for relationships (edges have weights)
–a weight is a real number in [0,1]: the confidence that the relationship exists
Linking:
if |CS[x_i.r_k]| = 1 then
–we know the answer d[x_i.r_k]
–link x_i and d[x_i.r_k] directly, w = 1
else
–the answer for x_i.r_k is uncertain
–create a “choice” node and link it
–“option-weights” w_1, ..., w_N, with w_1 + ... + w_N = 1
–option-weights are variables
[Figure: reference “J. Smith” in P1 with options “Jane Smith” and “John Smith”]
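The linking rule can be sketched as follows. The slides do not say how a description is matched against entities, so `matches` is a hypothetical name matcher (same last name, compatible first name or initial), and the equal starting weights for uncertain references are an assumed initialization, not something the talk prescribes.

```python
def matches(reference, full_name):
    """Hypothetical matcher: same last name and a compatible first name or initial."""
    ref_first, ref_last = reference.split()[0], reference.split()[-1]
    first, last = full_name.split()[0], full_name.split()[-1]
    if ref_last != last:
        return False
    if ref_first.endswith("."):              # an initial such as 'D.'
        return first.startswith(ref_first[:-1])
    return ref_first == first

def choice_set(reference, entities):
    """CS[x_i.r_k]: all entities whose description matches the reference."""
    return [e for e in entities if matches(reference, e)]

def link(context, reference, entities):
    """Return weighted edges (context, entity, weight) for one reference."""
    options = choice_set(reference, entities)
    if len(options) == 1:
        # the answer is known: link directly with weight 1
        return [(context, options[0], 1.0)]
    # uncertain: one option-edge per candidate, weights summing to 1
    return [(context, y, 1.0 / len(options)) for y in options]

authors = ["Dave White", "Don White", "Susan Grey",
           "John Black", "Joe Brown", "Liz Pink"]
print(link("P4", "Don White", authors))   # one certain edge
print(link("P2", "D. White", authors))    # two option-edges
```

In RelDC the option-weights are variables to be solved for; the equal values here only stand in for them.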
16 Objective of Reference Disambiguation
Definition: To resolve a reference x_i.r_k means
–to pick one y_j from CS[x_i.r_k] as d[x_i.r_k].
Graph interpretation
–among w_1, w_2, ..., w_N, assign w_j = 1 to one w_j
–means y_j is chosen as the answer d[x_i.r_k]
Definition: Reference x_i.r_k is resolved correctly if the chosen y_j = d[x_i.r_k].
Definition: Reference x_i.r_k is unresolved or uncertain if it is not yet resolved.
Goal: Resolve all uncertain references as correctly as possible.
17 Alternative Goal
Alternative goal
–for each reference x_i.r_k
–assign option-weights w_1, ..., w_N
–but in [0,1], not binary as before
–w_j reflects the degree of confidence that y_j = d[x_i.r_k]
–w_1 + w_2 + ... + w_N = 1
Mapping the alternative goal to the original
–use an interpretation procedure
–pick y_j with the max w_j as the answer for x_i.r_k
–a final step
RelDC deals with the alternative goal!
–the bulk of the discussion is on computing those option-weights
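The interpretation procedure is small enough to sketch directly. The function name `interpret` and the dict layout mapping each option to its weight are assumptions for illustration.

```python
def interpret(option_weights):
    """Pick the option y_j with the largest option-weight w_j as the answer."""
    return max(option_weights, key=option_weights.get)

# fractional weights for the reference 'D. White' (illustrative values)
weights = {"Don White": 0.8, "Dave White": 0.2}
answer = interpret(weights)
```

How ties are broken is not covered in the slides; `max` here simply returns the first option with the largest weight.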
18 Formalizing the CAP
CAP
–is based on “connection strength”
–c(u,v) for entities u and v
–measures how strongly u and v are connected to each other via relationships
–e.g. c(u,v) > c(u,z) in the figure
–will formalize c(u,v) later
Context Attraction Principle (CAP): if c(x_i, y_j) ≥ c(x_i, y_k) then w_j ≥ w_k (most of the time)
We use proportionality: c(x_i, y_j) ∙ w_k = c(x_i, y_k) ∙ w_j
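A short worked derivation shows what the proportionality rule implies when it is combined with the constraint that the option-weights of one reference sum to 1. The shorthand c_j = c(x_i, y_j) is introduced here for brevity, and the total connection strength is assumed to be positive.

```latex
% Proportionality for every pair (j, k), with c_j := c(x_i, y_j):
%   c_j \, w_k = c_k \, w_j
% Sum both sides over k = 1, ..., N and use \sum_k w_k = 1:
\begin{align*}
  c_j \sum_{k=1}^{N} w_k &= w_j \sum_{k=1}^{N} c_k \\
  c_j &= w_j \sum_{k=1}^{N} c_k \\
  w_j &= \frac{c_j}{\sum_{k=1}^{N} c_k}
  \qquad \text{(assuming } \textstyle\sum_k c_k > 0\text{)}
\end{align*}
% Example with N = 2, c_1 = 0.3, c_2 = 0.1:
%   w_1 = 0.3 / 0.4 = 0.75, \quad w_2 = 0.1 / 0.4 = 0.25
```

Each weight is the option's share of the total connection strength, so a more strongly connected option always receives the larger weight.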
19 RelDC approach
Input: the ARG for the dataset
1. Computing connection strengths
–for each unresolved reference x_i.r_k
–determine equations for all (i.e., N) c(x_i, y_j)’s
–c(x_i, y_j) = g_ij(w)
–a function of other option-weights
2. Determining equations for option-weights
–use CAP to relate all w_j’s and connection strengths
–since c(x_i, y_j) = g_ij(w), hence w_ij = f_ij(w)
3. Computing option-weights
–solve the system of equations from Step 2
4. Resolving references
–use the interpretation procedure to turn the option-weights into answers
20 Computing connection strength (Step 1) Computation of c(u,v) consists of two phases –Phase 1: Discover connections –all L-short simple paths between u and v –bottleneck –optimizations, not in SDM05 –Phase 2: Measure the strength –in the discovered connections –many c(u,v) models exist –we use a random-walks-in-graphs model
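Phase 1 can be sketched as a depth-first enumeration of simple paths with at most L edges. This is a plain, unoptimized version for illustration (the slide notes that path discovery is the bottleneck and that optimizations exist); the function name and the adjacency-dict input format are assumptions.

```python
def l_short_simple_paths(graph, source, target, max_len):
    """Yield every simple path from source to target with at most max_len edges.

    graph maps each node to an iterable of its neighbors.
    """
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 == max_len:
            continue  # path-length limit L reached, do not extend further
        for nxt in graph.get(node, ()):
            if nxt not in path:  # simple path: never revisit a node
                stack.append((nxt, path + [nxt]))

# the chain from the 'D. White' in P6 analysis
toy = {
    "P6": ["Liz Pink"],
    "Liz Pink": ["P6", "P5"],
    "P5": ["Liz Pink", "Joe Brown"],
    "Joe Brown": ["P5", "P4"],
    "P4": ["Joe Brown", "Don White"],
    "Don White": ["P4"],
}
paths = list(l_short_simple_paths(toy, "P6", "Don White", max_len=5))
```

With the limit lowered to 4 edges the same call finds nothing, which is one reason the choice of L matters.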
21 Measuring connection strength
Note:
–c(u,v) returns an equation
–because paths can go via various option-edges
–c_uv = c(u,v) = g_uv(w)
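One simplified way to measure strength with a random-walk flavor is sketched below. It is an assumption for illustration, not necessarily the exact model used in RelDC: the strength of a path is taken to be the probability that a walk starting at its first node follows exactly that path, choosing among not-yet-visited neighbors in proportion to edge weight, and c(u,v) is the sum over the discovered paths. Weights are fixed numbers here, whereas in RelDC the option-weights are variables, so c(u,v) is an equation rather than a value.

```python
def path_probability(graph, path):
    """Probability that a random walk from path[0] follows exactly this path.

    graph is {node: {neighbor: weight}}.
    """
    prob = 1.0
    for i in range(len(path) - 1):
        node, nxt = path[i], path[i + 1]
        visited = set(path[: i + 1])
        # the walk may only continue to nodes that are not yet on the path
        total = sum(w for n, w in graph[node].items() if n not in visited)
        if total == 0:
            return 0.0
        prob *= graph[node][nxt] / total
    return prob

def connection_strength(graph, paths):
    """c(u, v) as the sum of the probabilities of the given u-v paths."""
    return sum(path_probability(graph, p) for p in paths)

toy = {
    "u": {"a": 1.0, "b": 1.0},
    "a": {"u": 1.0, "v": 1.0},
    "b": {"u": 1.0},
    "v": {"a": 1.0},
}
strength = connection_strength(toy, [["u", "a", "v"]])
```

In this toy graph the walk leaves u toward a with probability 1/2 and then has only v left to visit, so the single path contributes 0.5.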
22 Equations for option-weights (Step 2) CAP (proportionality): System (over-constrained): Add slack:
23 Solving the system (Steps 3 and 4) Step 3: Solve the system of equations 1. use a math solver, or 2. an iterative method (approximate solution), or 3. a bounding-interval-based method (tech. report). Step 4: Interpret option-weights –to determine the answer for each reference –pick y_j with the largest weight as the answer
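The iterative option in Step 3 can be sketched as a fixed-point loop for a single reference: recompute the connection strengths from the current option-weights, renormalize them into new weights, and repeat until the weights stop changing. This is only an assumed illustration of the idea. The name `solve_iteratively`, the per-option strength callables, and the stopping rule are hypothetical, and the real system couples the weights of many references at once.

```python
def solve_iteratively(strength_fns, n_iter=50, tol=1e-9):
    """Approximate the option-weights of one reference.

    strength_fns maps each option y_j to a function that takes the current
    weights and returns the connection strength c(x_i, y_j).
    """
    options = list(strength_fns)
    weights = {y: 1.0 / len(options) for y in options}  # assumed starting point
    for _ in range(n_iter):
        strengths = {y: strength_fns[y](weights) for y in options}
        total = sum(strengths.values())
        if total == 0:
            break  # no evidence either way: keep the current weights
        new = {y: strengths[y] / total for y in options}
        converged = max(abs(new[y] - weights[y]) for y in options) < tol
        weights = new
        if converged:
            break
    return weights

# constant strengths, purely illustrative values
fns = {"Don White": lambda w: 0.3, "Dave White": lambda w: 0.1}
weights = solve_iteratively(fns)
answer = max(weights, key=weights.get)  # Step 4: pick the largest weight
```

With constant strengths the loop settles after two passes; when strengths depend on other option-weights, convergence is not guaranteed by this sketch.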
24 Experimental Setup
Parameters
–when looking for L-short simple paths, L = 7
–L is the path-length limit
RealPub dataset:
–CiteSeer + HPSearch
–publications (255K)
–authors (176K)
–organizations (13K)
–departments (25K)
–ground truth is not known
–accuracy...
SynPub datasets:
–many datasets of two types
–emulation of RealPub
–publications (5K)
–authors (1K)
–organizations (25K)
–departments (125K)
–ground truth is known
RealMov:
–movies (12K)
–people (22K): actors, directors, producers
–studios (1K): producing, distributing
25 Sample Publication Data CiteSeer: publication records HPSearch: author records
26 Efficiency and Long paths Non-exponential cost Longer paths do help
27 Accuracy on SynPub
28 Sample Movies Data
29 Accuracy on RealMov [Figure panels: References to Directors; References to Studios]
30 Summary DC and “Garbage in, Garbage out” principle Our main contributions –showing that analyzing relationships can help DC –an approach that achieves this RelDC –developed in Aug 2003 –domain-independent data cleaning framework –not about cleaning CiteSeer –uses relationships for data cleaning Ongoing work –“learning” the importance of relationships from data
31 Contact Information RelDC project (RESCUE) Dmitri V. Kalashnikov (contact author) Sharad Mehrotra Zhaoqi Chen
32 Summary DC and “Garbage in, Garbage out” principle Analyzing relationships can help data cleaning RelDC –developed in Aug 2003 –domain-independent data cleaning framework –not about cleaning CiteSeer –uses relationships for data cleaning –employs CAP as an axiom –converts the problem to an optimization problem –can disambiguate different types of references at once –in theory, not tested yet