Published by Della Primrose Cunningham. Modified over 9 years ago.
Exploiting Relationships for Object Consolidation
Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)
ACM IQIS 2005
Work supported by NSF Grants IIS-0331707 and IIS-0083489
2 Talk Overview
Motivation
Object consolidation problem
Proposed approach
– RelDC: Relationship-based data cleaning
– Relationship analysis and graph partitioning
Experiments
3 Why do we need “Data Cleaning”?
Jane Smith (fresh Ph.D.): Hi, my name is Jane Smith. I’d like to apply for a faculty position at your university.
Tom (recruiter): OK, let me check something quickly …
On screen: Publications (1.…… 2.…… 3.……); CiteSeer Rank
Tom (recruiter): Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?
Jane Smith: ???
4 What is the problem?
Names often do not uniquely identify people.
(Screenshots: CiteSeer, the top-k most cited authors; DBLP)
5 Comparing raw and cleaned CiteSeer
(Panels: CiteSeer top-k; Cleaned CiteSeer top-k)

Rank          Author               Location         # citations
1 (100.00%)   douglas schmidt      cs@wustl         5608
2 (100.00%)   rakesh agrawal       almaden@ibm      4209
3 (100.00%)   hector garciamolina  @                4167
4 (100.00%)   sally floyd          @aciri           3902
5 (100.00%)   jennifer widom       @stanford        3835
6 (100.00%)   david culler         cs@berkeley      3619
6 (100.00%)   thomas henzinger     eecs@berkeley    3752
7 (100.00%)   rajeev motwani       @stanford        3570
8 (100.00%)   willy zwaenepoel     cs@rice          3624
9 (100.00%)   van jacobson         lbl@gov          3468
10 (100.00%)  rajeev alur          cis@upenn        3577
11 (100.00%)  john ousterhout      @pacbell         3290
12 (100.00%)  joseph halpern       cs@cornell       3364
13 (100.00%)  andrew kahng         @ucsd            3288
14 (100.00%)  peter stadler        tbi@univie       3187
15 (100.00%)  serge abiteboul      @inria           3060
6 Object Consolidation Problem
Cluster representations that correspond to the same “real” world object/entity
Two instances: real world objects are known/unknown
(Figure: representations of objects in the database, r1, r2, ..., rN, are mapped to real objects in the database, o1, o2, ..., oM)
7 RelDC Approach
Exploit relationships among objects to disambiguate when the traditional approach of clustering based on feature similarity does not work.
RelDC Framework (Relationship-based Data Cleaning) = Traditional Methods (features and context) + Relationship Analysis (ARG)
8 Attributed Relational Graph (ARG)
View the database as an ARG
Nodes
– per cluster of representations (if already resolved by the feature-based approach)
– per representation (for “tough” cases)
Edges
– Regular: correspond to relationships between entities
– Similarity: created using feature-based methods on representations
9 Context Attraction Principle (CAP)
Who is “J. Smith”?
– Jane?
– John?
10 Questions to Answer
1. Does the CAP hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?
2. Can we design a generic strategy that exploits the CAP for consolidation?
11 Consolidation Algorithm
1. Construct the ARG and identify all virtual clusters (VCSs)
– use FBS (feature-based similarity) in constructing the ARG
2. Choose a VCS and compute the connection strength between nodes
– for each pair of representations connected via a similarity edge
3. Partition the VCS
– use a graph partitioning algorithm
– partitioning is based on connection strength
– after partitioning, adjust the ARG accordingly
– go to Step 2 if more potential clusters exist
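Step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a VCS is a connected component of the graph restricted to similarity edges, and the function name `find_virtual_clusters` and the node labels are hypothetical.

```python
from collections import defaultdict

def find_virtual_clusters(nodes, similarity_edges):
    """Group nodes into connected components using only similarity edges.

    Assumption (not stated on the slide): a VCS is one such component.
    """
    adjacency = defaultdict(set)
    for u, v in similarity_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    clusters = []
    for start in nodes:
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited node.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(component)
    return clusters

# Hypothetical representations; r1, r2, r3 are linked by similarity edges.
nodes = ["r1", "r2", "r3", "r4", "r5"]
similarity_edges = [("r1", "r2"), ("r2", "r3")]
vcs_list = find_virtual_clusters(nodes, similarity_edges)
```

Here r1, r2 and r3 end up in one candidate cluster, while r4 and r5 each form a trivial cluster that needs no partitioning.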
12 Connection Strength c(u,v)
Models for c(u,v)
– many possibilities: diffusion kernels, random walks, etc.
– none is fully adequate
– cannot learn similarity from data
Diffusion kernels
– $\sigma(x,y) = \sigma_1(x,y)$: “base similarity”, via direct links (of size 1)
– $\sigma_k(x,y)$: “indirect similarity”, via links of size k
– $B$, where $B_{xy} = B^1_{xy} = \sigma_1(x,y)$: base similarity matrix
– $B^k$: indirect similarity matrix
– $K$: total similarity matrix, or “kernel”
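The relation between the base matrix B and the indirect matrices B^k can be illustrated with plain matrix powers: entry (x,y) of B^k sums, over all walks of k links from x to y, the product of the base similarities along the walk. The sketch below is only an illustration. In particular, the way the powers are combined into K (a sum truncated at `max_len` with a geometric `decay` factor) is an assumption made for this example, since the slide does not give the formula for K.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def indirect_similarity(base, k):
    """Return B^k: similarity via links of size k (walks, not simple paths)."""
    result = base
    for _ in range(k - 1):
        result = mat_mul(result, base)
    return result

def truncated_kernel(base, max_len, decay=0.5):
    """Assumed form of K: sum of decay**k * B^k for k = 1..max_len."""
    n = len(base)
    total = [[0.0] * n for _ in range(n)]
    power = base
    for k in range(1, max_len + 1):
        if k > 1:
            power = mat_mul(power, base)
        weight = decay ** k
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
    return total

# Hypothetical path graph a - b - c with base similarity 1.0 on each link.
B = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
B2 = indirect_similarity(B, 2)
K = truncated_kernel(B, max_len=2)
```

Nodes a and c have no direct link, so their base similarity is zero, but B2 records the one two-link connection between them through b.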
13 Connection Strength c(u,v) (cont.)
Instantiating parameters
– Determining $\sigma(x,y)$
  – regular edges have types $T_1, \ldots, T_n$
  – types $T_1, \ldots, T_n$ have weights $w_1, \ldots, w_n$
  – $\sigma(x,y) = w_i$: get the type $T_i$ of a given edge and assign its weight as the base similarity
– Handling similarity edges
  – $\sigma(x,y)$ is assigned a value proportional to the similarity (heuristic)
  – an approach to learn $\sigma(x,y)$ from data (ongoing work)
Implementation
– we do not compute the whole matrix K
– we compute one c(u,v) at a time
– we limit path lengths by L
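Computing one c(u,v) at a time with a path-length limit might look like the sketch below. It is an assumption-laden illustration rather than the model from the talk: it scores each simple path by the product of its edge-type weights and adds the scores up, which is one plausible reading of “similar to diffusion kernels”. The graph, the edge types and the weights are all hypothetical, and u and v are assumed to be distinct nodes.

```python
def connection_strength(graph, type_weights, u, v, max_len):
    """Sum, over simple paths from u to v with at most max_len edges,
    of the product of the edge-type weights along the path.

    graph maps a node to a list of (neighbour, edge_type) pairs.
    The sum-of-products scoring is an assumption made for this sketch.
    """
    total = 0.0

    def extend(node, visited, product, length):
        nonlocal total
        if node == v:
            total += product
            return
        if length == max_len:
            return
        for neighbour, edge_type in graph.get(node, []):
            if neighbour not in visited:  # keep the path simple
                visited.add(neighbour)
                extend(neighbour, visited,
                       product * type_weights[edge_type], length + 1)
                visited.remove(neighbour)

    extend(u, {u}, 1.0, 0)
    return total

# Hypothetical ARG fragment: two author representations whose papers
# appeared in the same venue.
graph = {
    "smith_1": [("paper_A", "writes")],
    "smith_2": [("paper_B", "writes")],
    "paper_A": [("smith_1", "writes"), ("venue_X", "published_in")],
    "paper_B": [("smith_2", "writes"), ("venue_X", "published_in")],
    "venue_X": [("paper_A", "published_in"), ("paper_B", "published_in")],
}
type_weights = {"writes": 0.5, "published_in": 0.5}

strength_l7 = connection_strength(graph, type_weights, "smith_1", "smith_2", 7)
strength_l3 = connection_strength(graph, type_weights, "smith_1", "smith_2", 3)
```

The only connection between the two representations has four edges, so it is found with L = 7 but missed with L = 3. This shows how the limit L trades coverage for cost.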
14 Consolidation via Partitioning
Observations
– each VCS contains representations of at least one object
– if a representation is in a VCS, then the remaining representations of the same object are in it too
Partitioning
– two cases: k, the number of entities in the VCS, is known or unknown
– when k is known, use any partitioning algorithm
  – maximize inside connections, minimize outside connections
  – we use normalized cut [Shi, Malik 2000]
– when k is unknown
  – split into two, just to see the cut
  – compare the cut against a threshold
  – decide “to split” or “not to split”
  – iterate
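The unknown-k case can be sketched as a recursive bipartitioning loop. Two points here are assumptions, not taken from the slides: the best split is found by exhaustive search over bipartitions (feasible only for small VCSs) instead of the spectral method of Shi and Malik, and a split is accepted when its normalized-cut value falls below the threshold. The connection strengths in the example are hypothetical.

```python
from itertools import combinations

def weight(c, a, b):
    """Look up the symmetric connection strength between a and b."""
    return c.get((a, b), c.get((b, a), 0.0))

def normalized_cut(c, part_a, part_b):
    """Ncut(A,B) = cut/assoc(A,V) + cut/assoc(B,V), with V = A | B."""
    nodes = part_a | part_b
    cut = sum(weight(c, a, b) for a in part_a for b in part_b)
    assoc_a = sum(weight(c, a, n) for a in part_a for n in nodes if n != a)
    assoc_b = sum(weight(c, b, n) for b in part_b for n in nodes if n != b)
    if assoc_a == 0 or assoc_b == 0:
        return 0.0  # one side is completely disconnected
    return cut / assoc_a + cut / assoc_b

def best_bipartition(c, nodes):
    """Exhaustively search for the bipartition with the smallest Ncut."""
    ordered = sorted(nodes)
    first, rest = ordered[0], ordered[1:]
    best = None
    # Fix `first` in part A so mirrored splits are not tried twice.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            part_a = {first, *extra}
            part_b = set(ordered) - part_a
            value = normalized_cut(c, part_a, part_b)
            if best is None or value < best[0]:
                best = (value, part_a, part_b)
    return best

def consolidate(c, nodes, threshold):
    """Split recursively while the best cut is weaker than the threshold."""
    if len(nodes) < 2:
        return [set(nodes)]
    value, part_a, part_b = best_bipartition(c, nodes)
    if value >= threshold:  # "not to split"
        return [set(nodes)]
    return consolidate(c, part_a, threshold) + consolidate(c, part_b, threshold)

# Hypothetical connection strengths inside one VCS.
c = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "c"): 0.05}
clusters = consolidate(c, {"a", "b", "c", "d"}, threshold=0.3)
```

With these values the weak b-c link is cut first, and neither resulting pair is split further because cutting a strongly connected pair costs far more than the threshold.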
15 Measuring Quality of Outcome
– dispersion: for an entity, into how many clusters its representations are clustered; ideal is 1
– diversity: for a cluster, how many distinct entities it covers; ideal is 1
– entity uncertainty: for an entity, if out of m representations $m_1$ go to $C_1$; ...; $m_n$ go to $C_n$, then $H(E) = -\sum_{i=1}^{n} \frac{m_i}{m} \log \frac{m_i}{m}$
– cluster uncertainty: if a cluster consists of representations $m_1$ of $E_1$; ...; $m_n$ of $E_n$, then the entropy is computed in the same way
– ideal entropy is zero
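The four measures can be computed from a list of (true entity, assigned cluster) pairs, one pair per representation. The sketch below assumes that the uncertainty measures are Shannon entropies with logarithm base 2; the base is not given on the slide. The function names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (base 2, an assumption) of a list of counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((m / total) * math.log2(m / total) for m in counts if m > 0)

def quality(assignments):
    """assignments: iterable of (entity, cluster) pairs, one per representation."""
    by_entity = defaultdict(Counter)
    by_cluster = defaultdict(Counter)
    for entity, cluster in assignments:
        by_entity[entity][cluster] += 1
        by_cluster[cluster][entity] += 1
    return {
        # number of clusters an entity is spread over (ideal: 1)
        "dispersion": {e: len(cs) for e, cs in by_entity.items()},
        # number of distinct entities in a cluster (ideal: 1)
        "diversity": {c: len(es) for c, es in by_cluster.items()},
        # entropy of an entity's spread over clusters (ideal: 0)
        "entity_entropy": {e: entropy(cs.values()) for e, cs in by_entity.items()},
        # entropy of a cluster's mix of entities (ideal: 0)
        "cluster_entropy": {c: entropy(es.values()) for c, es in by_cluster.items()},
    }

# Hypothetical outcome: entity E1 is split evenly over C1 and C2,
# and C2 also contains the single representation of E2.
assignments = [("E1", "C1"), ("E1", "C1"), ("E1", "C2"), ("E1", "C2"), ("E2", "C2")]
result = quality(assignments)
```

In this example E1 has dispersion 2 and entity entropy 1.0, while C1 is a pure cluster with diversity 1 and entropy 0.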
16 Experimental Setup
Parameters
– L-short simple paths, L = 7
– L is the path-length limit
Note
– The algorithm is applied to “tough cases”, after FBS has already successfully consolidated many entries!
RealMov
– movies (12K)
– people (22K): actors, directors, producers
– studios (1K): producing, distributing
Uncertainty
– d1, d2, ..., dn are director entities
– pick a fraction d1, d2, ..., dm
– group entries in groups of size k, e.g. in groups of two: {d1,d2}, ..., {d9,d10}
– make all representations of a group indiscernible by FBS, ...
Baseline 1
– one cluster per VCS, regardless
– equivalent to using only FBS
– ideal dispersion & H(E)!
Baseline 2
– knows grouping statistics
– guesses the number of entities in a VCS
– randomly assigns representations to clusters
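The way uncertainty is introduced can be sketched as follows. This is a hypothetical illustration of the grouping step only: choosing the fraction by random sampling, dropping an incomplete last group, and the `alias` mapping that stands in for “indiscernible by FBS” are all assumptions made for this example.

```python
import random

def make_indiscernible_groups(directors, fraction, group_size, seed=0):
    """Pick a fraction of the directors and split them into groups of
    group_size. Random sampling is an assumption; the slide only says
    that a fraction is picked."""
    rng = random.Random(seed)
    count = int(len(directors) * fraction)
    count -= count % group_size  # keep only complete groups
    picked = rng.sample(directors, count)
    return [picked[i:i + group_size] for i in range(0, count, group_size)]

# Hypothetical director entities d1..d20; half of them are grouped in pairs.
directors = [f"d{i}" for i in range(1, 21)]
groups = make_indiscernible_groups(directors, fraction=0.5, group_size=2)

# Every member of a group shares one label, so a feature-based method
# that only sees the label cannot tell the members apart.
alias = {d: f"group_{g}" for g, members in enumerate(groups) for d in members}
```

Each group then plays the role of one VCS that the algorithm has to split back into its true director entities.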
17 Sample Movies Data
18 The Effect of L on Quality
(Plots: Cluster Entropy & Diversity; Entity Entropy & Dispersion)
19 Effect of Threshold and Scalability
20 Summary
RelDC
– domain-independent data cleaning framework
– uses relationships for data cleaning
– reference disambiguation [SDM’05]
– object consolidation [IQIS’05]
Ongoing work
– “learning” the importance of relationships from data
– exploiting relationships among entities for other data cleaning problems
21 Contact Information
RelDC project
– www.ics.uci.edu/~dvk/RelDC
– www.itr-rescue.org (RESCUE)
Zhaoqi Chen
– chenz@ics.uci.edu
Dmitri V. Kalashnikov
– www.ics.uci.edu/~dvk
– dvk@ics.uci.edu
Sharad Mehrotra
– www.ics.uci.edu/~sharad
– sharad@ics.uci.edu
22 extra slides…
23 What is the lesson?
– data should be cleaned first
  – e.g., determine the (unique) real authors of publications
– solving such challenges is not always “easy”
  – that explains the large body of work on data cleaning
– note
  – CiteSeer is aware of the problem with its ranking
  – there are more issues with CiteSeer, many not related to data cleaning
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
24 Object Consolidation Notation
– $O = \{o_1, \ldots, o_{|O|}\}$: the set of entities (unknown in general)
– $X = \{x_1, \ldots, x_{|X|}\}$: the set of representations
– $d[x_i]$: the entity that $x_i$ refers to (unknown in general)
– $C[x_i]$: all representations that refer to $d[x_i]$
  – the “group set”
  – unknown in general
  – the goal is to find it for each $x_i$
– $S[x_i]$: all representations that may refer to the same entity as $x_i$
  – the “consolidation set”
  – determined by FBS
  – we assume $C[x_i] \subseteq S[x_i]$
25 Object Consolidation Problem
Let $O = \{o_1, \ldots, o_{|O|}\}$ be the set of entities (unknown in general)
Let $X = \{x_1, \ldots, x_{|X|}\}$ be the set of representations
Map each $x_i$ to its corresponding entity $o_j$ in $O$
– $d[x_i]$: the entity that $x_i$ refers to (unknown in general)
– $C[x_i]$: all representations that refer to $d[x_i]$
  – the “group set”
  – unknown in general
  – the goal is to find it for each $x_i$
– $S[x_i]$: all representations that may refer to the same entity as $x_i$
  – the “consolidation set”
  – determined by FBS
  – we assume $C[x_i] \subseteq S[x_i]$
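A toy instance may help to read the notation. The mapping `d` below plays the role of the (normally unknown) ground truth and `S` stands for consolidation sets as a feature-based method might produce them. All values are hypothetical.

```python
def group_sets(d):
    """Derive C[x] for every representation x from the mapping d[x]."""
    by_entity = {}
    for x, entity in d.items():
        by_entity.setdefault(entity, set()).add(x)
    return {x: by_entity[entity] for x, entity in d.items()}

# Ground truth: x1 and x2 refer to o1, x3 refers to o2.
d = {"x1": "o1", "x2": "o1", "x3": "o2"}

# FBS cannot tell the three representations apart, so each
# consolidation set contains all of them.
S = {x: {"x1", "x2", "x3"} for x in d}

C = group_sets(d)

# The assumption from the slide: every group set is contained in the
# corresponding consolidation set.
assumption_holds = all(C[x] <= S[x] for x in d)
```

Consolidation then amounts to shrinking each S[x] down to C[x], here from three representations to {x1, x2} for x1 and x2 and to {x3} for x3.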
26 RelDC Framework
27 Connection Strength
Computation of c(u,v)
Phase 1: Discover connections
– all L-short simple paths between u and v
– bottleneck
– optimizations (not in IQIS’05)
Phase 2: Measure the strength
– in the discovered connections
– many c(u,v) models exist
– we use a model similar to diffusion kernels
28 Our c(u,v) Model
Our c(u,v) model
– regular edges have types $T_1, \ldots, T_n$
– types $T_1, \ldots, T_n$ have weights $w_1, \ldots, w_n$
– $\sigma(x,y) = w_i$: get the type $T_i$ of a given edge and assign its weight as the base similarity
– paths with similarity edges
  – might not exist, use heuristics
Our model & diffusion kernels
– virtually identical, but...
– we do not compute the whole matrix K
– we compute one c(u,v) at a time
– we limit path lengths by L
– $\sigma(x,y)$ is unknown in general
  – the analyst assigns the values
  – learn from data (ongoing work)