Exploiting Relationships for Object Consolidation
Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine (RESCUE)
ACM IQIS 2005
Copyright (c) Dmitri V. Kalashnikov, 2005. Work supported by NSF Grants IIS and IIS.
2 Talk Overview
Examples
–motivating data cleaning (DC)
–motivating the analysis of relationships for DC
Object consolidation
–one of the DC problems
–the one this work addresses
Proposed approach
–RelDC framework
–relationship analysis and graph partitioning
Experiments
3 Why do we need “Data Cleaning”?
[Cartoon: Jane Smith (fresh Ph.D.) and Tom (recruiter)]
Jane: “Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.”
Tom: “OK, let me check something quickly…” [checks her publications and her CiteSeer rank] “???”
Tom: “Wow! Unbelievable! You must be a really hard worker! I am sure we will accept a candidate like that!”
4 Suspicious entries
–Let’s go to the DBLP website
–which stores bibliographic entries of many CS authors
–Let’s check two people
–“A. Gupta”
–“L. Zhang”
What is the problem?
[Figure: CiteSeer’s top-k most cited authors vs. DBLP]
5 Comparing raw and cleaned CiteSeer
[Table: CiteSeer top-k vs. cleaned CiteSeer top-k; columns: Rank, Author, Location, # citations. Surviving raw entries: 1 douglas, 2 rakesh, 3 hector, 4 sally, 5 jennifer, 6 david, 6 thomas, 7 rajeev, 8 willy, 9 van, 10 rajeev, 11 john, 12 joseph, 13 andrew, 14 peter, 15 serge]
6 What is the lesson?
–data should be cleaned first
–e.g., determine the (unique) real authors of publications
–solving such challenges is not always “easy”
–which explains the large body of work on data cleaning
–note
–CiteSeer is aware of the problem with its ranking
–there are more issues with CiteSeer
–many not related to data cleaning
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
7 RelDC Framework
8 Object Consolidation
Notation (sketched as data structures below)
– O = {o1, ..., o|O|}: the set of entities
–unknown in general
– X = {x1, ..., x|X|}: the set of representations
– d[xi]: the entity that xi refers to
–unknown in general
– C[xi]: all representations that refer to d[xi]
–the “group set”
–unknown in general
–the goal is to find it for each xi
– S[xi]: all representations that could be xi
–the “consolidation set”
–determined by FBS
–we assume C[xi] ⊆ S[xi]
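A minimal sketch of this notation as Python data structures; the class and variable names are hypothetical, not from the paper's implementation.

```python
# A minimal sketch of the notation (names are illustrative).
from dataclasses import dataclass

@dataclass
class Representation:
    rid: str        # identifier of a representation x_i
    features: dict  # attribute values that FBS compares

# S[x_i]: consolidation set -- all representations that FBS cannot
# tell apart from x_i. The true group set C[x_i] is assumed to
# satisfy C[x_i] <= S[x_i]; the algorithm's goal is to recover it.
consolidation_sets: dict[str, set[str]] = {}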
9 Attributed Relational Graph (ARG)
ARG in RelDC (see the sketch below)
Nodes
–one per cluster of representations
–one per representation (for “tough” cases)
Edges
–regular
–similarity
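A minimal sketch of ARG construction with networkx; the paper does not prescribe a graph library, and the node/attribute names here are assumptions.

```python
# A sketch of ARG construction (library choice and names assumed).
import networkx as nx

G = nx.Graph()

# One node per consolidated cluster; "tough" cases instead get one
# node per individual representation.
G.add_node("paper:p1", kind="cluster")
G.add_node("author:J.Smith#1", kind="representation")
G.add_node("author:J.Smith#2", kind="representation")

# Regular edges encode real relationships (e.g., authorship);
# similarity edges connect representations FBS could not separate.
G.add_edge("paper:p1", "author:J.Smith#1", etype="regular", rel="writes")
G.add_edge("author:J.Smith#1", "author:J.Smith#2", etype="similarity")
```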
10 Context Attraction Principle (CAP)
Take a guess: who is “J. Smith”?
–Jane?
–John?
11 Questions to Answer
1. Does the CAP principle hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
2. Can we design a generic solution that exploits relationships for disambiguation?
12 Consolidation Algorithm (a sketch of the loop follows)
1. Construct the ARG and identify all VCSs
–use FBS in constructing the ARG
2. Choose a VCS and compute the c(u,v)’s
–for each pair of representations connected via a similarity edge
3. Partition the VCS
–use a graph partitioning algorithm
–partitioning is based on the c(u,v)’s
–after partitioning, adjust the ARG accordingly
–go to Step 2 if more VCSs exist
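A high-level sketch of the three-step loop; the helper functions (find_vcs, similarity_edges, partition_vcs, merge_clusters_into_arg) are hypothetical names for the steps the slide describes, not the paper's API.

```python
# A sketch of the consolidation loop (helper names are hypothetical).
def consolidate(G):
    while True:
        vcs = find_vcs(G)              # Step 1: next unresolved VCS
        if vcs is None:
            break
        # Step 2: c(u, v) for each similarity-edge pair in the VCS
        weights = {(u, v): connection_strength(G, u, v)
                   for u, v in similarity_edges(G, vcs)}
        # Step 3: partition the VCS on those weights, then fold the
        # resulting clusters back into the ARG before continuing
        clusters = partition_vcs(vcs, weights)
        merge_clusters_into_arg(G, clusters)
```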
13 Connection Strength
Computation of c(u,v)
Phase 1: discover connections (sketched below)
–all L-short simple paths between u and v
–the bottleneck
–optimizations, not in IQIS’05
Phase 2: measure the strength
–of the discovered connections
–many c(u,v) models exist
–we use a model similar to diffusion kernels
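A sketch of Phase 1 using networkx's built-in simple-path enumeration; the paper's own optimized discovery (not in IQIS'05) is not shown here.

```python
# Phase 1 sketch: enumerate L-short simple paths.
import networkx as nx

def l_short_simple_paths(G, u, v, L=7):
    # all_simple_paths with cutoff=L yields every simple path between
    # u and v of at most L edges -- the bottleneck the slide mentions.
    return list(nx.all_simple_paths(G, source=u, target=v, cutoff=L))
```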
14 Existing c(u,v) Models
Models for c(u,v)
–many exist
–diffusion kernels, random walks, etc.
–none is fully adequate
–they cannot learn similarity from data
Diffusion kernels (reconstructed below)
–σ1(x,y): “base similarity”
–via direct links (of length 1)
–σk(x,y): “indirect similarity”
–via links of length k
–B: base similarity matrix, where Bxy = B¹xy = σ1(x,y)
–B^k: indirect similarity matrix
–K: total similarity matrix, or “kernel”
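The slide's formulas were garbled in extraction; below is a reconstruction in the standard exponential diffusion-kernel form (following Kondor & Lafferty). Whether the slide used the exponential or a truncated sum is not recoverable, so treat this as the conventional definition.

```latex
% Diffusion-kernel similarity, standard exponential form:
\[
  B_{xy} = \sigma_1(x,y), \qquad
  \sigma_k(x,y) = \bigl(B^k\bigr)_{xy}, \qquad
  K = e^{\lambda B} = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}\, B^k .
\]
```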
15 Our c(u,v) Model
Our c(u,v) model (a sketch follows)
–regular edges have types T1, ..., Tn
–types T1, ..., Tn have weights w1, ..., wn
–σ(x,y) = wi
–get the type Ti of a given edge
–assign its weight wi as the base similarity
–paths with similarity edges
–might not exist, so use heuristics
Our model & diffusion kernels
–virtually identical, but...
–we do not compute the whole matrix K
–we compute one c(u,v) at a time
–we limit path lengths by L
–the σ(x,y) are unknown in general
–the analyst assigns them
–learning them from data (ongoing work)
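A minimal sketch of the weight-based model: each regular edge takes the weight of its type, and a path contributes the product of its edge weights. The type names and weights are illustrative, and summing path contributions is one plausible aggregation, not necessarily the paper's exact rule. It reuses l_short_simple_paths from the earlier sketch.

```python
# Sketch of c(u,v): type weights are analyst-assigned (illustrative).
TYPE_WEIGHTS = {"writes": 1.0, "affiliated_with": 0.5, "cites": 0.2}

def connection_strength(G, u, v, L=7):
    total = 0.0
    for path in l_short_simple_paths(G, u, v, L):
        w = 1.0
        for a, b in zip(path, path[1:]):
            # similarity edges carry no type weight; the paper handles
            # them with heuristics -- here they simply contribute 0
            w *= TYPE_WEIGHTS.get(G[a][b].get("rel"), 0.0)
        total += w
    return total
```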
16 Consolidation via Partitioning
Observations
–each VCS contains representations of at least one object
–if a representation is in a VCS, then the remaining representations of the same object are in it too
Partitioning (sketched below)
–two cases
–k, the number of entities in the VCS, is known
–k is unknown
–when k is known, use any partitioning algorithm
–maximize intra-cluster connections, minimize inter-cluster connections
–we use [Shi, Malik 2000]
–normalized cut
–when k is unknown
–split into two, just to see the cut
–compare the cut against a threshold
–then decide whether to actually split
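A sketch of the "k unknown" case: attempt a 2-way spectral split in the spirit of Shi & Malik's normalized cut, and keep the split only if the cut is cheap enough. The use of scikit-learn and the exact threshold test are assumptions, not the paper's implementation.

```python
# Sketch: trial 2-way split of a VCS, accepted only below a threshold.
import numpy as np
from sklearn.cluster import SpectralClustering

def maybe_split(W, threshold):
    # W: symmetric matrix of nonnegative c(u, v) values within one VCS
    labels = SpectralClustering(n_clusters=2,
                                affinity="precomputed").fit_predict(W)
    a, b = labels == 0, labels == 1
    cut = W[np.ix_(a, b)].sum()            # cut(A, B)
    assoc_a, assoc_b = W[a].sum(), W[b].sum()
    ncut = cut / assoc_a + cut / assoc_b   # Shi-Malik normalized cut
    # accept the split only if the normalized cut is small enough
    return labels if ncut < threshold else np.zeros(len(W), dtype=int)
```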
17 Measuring Quality of Outcome
Existing measures
–dispersion [DMKD’04]
–for an entity, into how many clusters its representations are scattered; ideal is 1
–diversity
–for a cluster, how many distinct entities it covers; ideal is 1
–easy, clear semantics
–but they have problems, see figure
Entropy (helper below)
–for an entity: if, out of its m representations, m1 go to cluster C1; ...; mn go to cluster Cn, then H(E) = −Σi (mi/m) log(mi/m)
–for a cluster consisting of m1 representations of E1; ...; mn of En, H(C) is defined the same way
–ideal entropy is zero
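A small helper that reads the entropy definition directly off the formula above; this is a direct transcription of the measure, not the authors' code.

```python
# Entropy of a distribution of counts m_1..m_n (entity across
# clusters, or cluster across entities); ideal value is 0.
import math

def entropy(counts):
    m = sum(counts)
    return -sum((mi / m) * math.log(mi / m) for mi in counts if mi > 0)

print(entropy([5]))      # all 5 representations together -> 0.0 (ideal)
print(entropy([3, 2]))   # split across two clusters -> positive
```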
18 Experimental Setup
Parameters
–L-short simple paths, L = 7
–L is the path-length limit
Note
–the algorithm is applied to the “tough” cases, after FBS has already successfully consolidated many entries!
RealMov
–movies (12K)
–people (22K)
–actors
–directors
–producers
–studios (1K)
–producing
–distributing
Uncertainty
–d1, d2, ..., dn are director entities
–pick a fraction, e.g. d1, d2, ..., d10
–group them, e.g. in groups of two
–{d1,d2}, ..., {d9,d10}
–make all representations of d1 and d2 indiscernible by FBS, etc.
Baseline 1
–one cluster per VCS, regardless
–dumb?... but ideal dispersion & H(E)
Baseline 2
–knows the grouping statistics
–guesses the number of entities in a VCS
–randomly assigns representations to clusters
19 Sample Movies Data
20 The Effect of L on Quality
[Plots: Cluster Entropy & Diversity; Entity Entropy & Dispersion]
21 Effect of Threshold and Scalability
22 Summary RelDC –developed in Aug 2003 (reference disambiguation) –domain-independent data cleaning framework –uses relationships for data cleaning –reference disambiguation [SDM’05] –object consolidation [IQIS’05] Ongoing work –“learning” the importance of relationships from data
23 Contact Information RelDC project (RESCUE) Zhaoqi Chen Dmitri V. Kalashnikov Sharad Mehrotra