1 Adaptive Graphical Approach to Entity Resolution
Dmitri V. Kalashnikov
Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
Additional information is available at http://www.ics.uci.edu/~dvk
Copyright © by Dmitri V. Kalashnikov, 2007
ACM IEEE Joint Conference on Digital Libraries 2007
2 Structure of the Talk
Motivation
Generic Disambiguation Framework
–High-level
Entity Resolution Approach
–Part of the Framework
Experiments
3 Entity Resolution & Data Cleaning
Analysis on bad data leads to wrong conclusions!
Uncertainty
Errors
Missing data
4 Why do we need “Entity Resolution”?
Jane Smith (fresh Ph.D.): Hi, I’m Jane Smith. I’d like to apply for a faculty position.
Tom (recruiter): Wow! I am sure we will accept a strong candidate like that!
Tom (recruiter): OK, let me check something quickly …
???
Publications: 1.…… 2.…… 3.……
Publications: 1.…… 2.…… 3.……
CiteSeer Rank
5 Suspicious entries
–Let’s go to the DBLP website
  –which stores bibliographic entries of many CS authors
–Let’s check two people
  –“A. Gupta”
  –“L. Zhang”
What is the problem?
CiteSeer: the top-k most cited authors
DBLP
6 Comparing raw and cleaned CiteSeer
Rank           Author               Location
1 (100.00%)    douglas schmidt      cs@wustl
2 (100.00%)    rakesh agrawal       almaden@ibm
3 (100.00%)    hector garciamolina  @
4 (100.00%)    sally floyd          @aciri
5 (100.00%)    jennifer widom       @stanford
6 (100.00%)    david culler         cs@berkeley
6 (100.00%)    thomas henzinger     eecs@berkeley
7 (100.00%)    rajeev motwani       @stanford
8 (100.00%)    willy zwaenepoel     cs@rice
9 (100.00%)    van jacobson         lbl@gov
10 (100.00%)   rajeev alur          cis@upenn
11 (100.00%)   john ousterhout      @pacbell
12 (100.00%)   joseph halpern       cs@cornell
13 (100.00%)   andrew kahng         @ucsd
14 (100.00%)   peter stadler        tbi@univie
15 (100.00%)   serge abiteboul      @inria
Raw CiteSeer’s Top-K Most Cited Authors
Cleaned CiteSeer’s Top-K Most Cited Authors
7 What is the lesson?
–Data should be cleaned first
  –E.g., determine the (unique) real authors of publications
–Solving such challenges is not always “easy”
  –This explains the large body of work on Entity Resolution
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
8 Typical Data Processing Flow
9 Two most common types of Entity Resolution
Fuzzy lookup
–match references to objects
–list of all objects is given
–[SDM’05], [TODS’06]
Fuzzy grouping
–group references that co-refer
–[IQIS’05], [JCDL’07]
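To make the fuzzy lookup case concrete, here is a minimal sketch that matches one reference against a given list of objects using string similarity only. It is an illustration, not the method of the cited papers: the function name `fuzzy_lookup`, the `cutoff` value, and the sample names are assumptions.

```python
from difflib import get_close_matches


def fuzzy_lookup(reference, known_objects, cutoff=0.6):
    """Return the known object whose name is closest to `reference`.

    Hypothetical helper: the full list of objects is given up front,
    which is what distinguishes lookup from grouping.
    Returns None when nothing is at least `cutoff` similar.
    """
    matches = get_close_matches(reference, known_objects, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up list of known objects for illustration.
authors = ["Jane Smith", "John White"]
best = fuzzy_lookup("Jane Smth", authors)
```

Fuzzy grouping differs in that no such list exists: the references themselves have to be clustered so that each cluster corresponds to one real-world entity.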
10 Structure of the Talk
Motivation
Generic Framework
–High-level
Approach
–Part of the Framework
Experiments
11 Traditional Approach to Entity Resolution
s(X,Y) = f(X,Y)
Similarity = Similarity of Features
12 Key Observation: More Info is Available
13 Solution: Main Idea
s(X,Y) = c(X,Y) + γ f(X,Y)
Similarity = Similarity of Features + “Connection Strength”
New Paradigm
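The combined score can be sketched as follows. This is a minimal illustration, not the authors' implementation. It reads c(X,Y) as the connection strength and f(X,Y) as the feature similarity, consistent with the traditional formula s(X,Y) = f(X,Y). It assumes a string ratio for f and a Jaccard overlap of co-author sets as a stand-in for c. The `Reference` class, both helper functions, the default `gamma`, and the name "P. Jones" are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Reference:
    """One occurrence of a name in the data (hypothetical structure)."""
    name: str
    coauthors: frozenset = frozenset()


def feature_similarity(x, y):
    """f(X, Y): assumed here to be a string ratio of the two names, in [0, 1]."""
    return SequenceMatcher(None, x.name.lower(), y.name.lower()).ratio()


def connection_strength(x, y):
    """c(X, Y): stand-in measure, Jaccard overlap of the co-author sets."""
    union = x.coauthors | y.coauthors
    if not union:
        return 0.0
    return len(x.coauthors & y.coauthors) / len(union)


def similarity(x, y, gamma=1.0):
    """s(X, Y) = c(X, Y) + gamma * f(X, Y)."""
    return connection_strength(x, y) + gamma * feature_similarity(x, y)


a = Reference("J. Smith", frozenset({"A. Gupta", "L. Zhang"}))
b = Reference("J. Smith", frozenset({"A. Gupta"}))
c = Reference("J. Smith", frozenset({"P. Jones"}))

# The three names are identical, so f cannot tell the candidates apart;
# only the connection term ranks (a, b) above (a, c).
closer_to_b = similarity(a, b) > similarity(a, c)
```

With feature similarity alone, all three references would look equally alike. The connection term is what lets shared context break the tie.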
14 Illustrative Example
“Indirect connections”
–Suppose your co-worker’s name is “John White”
–Suppose you see on the Web, on my homepage
  –My name: “Dmitri …”
  –Somebody named: “John White”
–Who is the “John White”?
–From data you might establish a connection:
  –“Dmitri” might be connected to more “John White”’s…
15 Key Features of the Framework
Our goal is/was to create a framework, such that:
–solid theoretic foundation
–lookup
–domain-independent framework
–self-tuning
–scales to large datasets
–robust under uncertainty
–high disambiguation quality
16 Structure of the Talk
Motivation
Generic Framework
–High-level
Approach
–Part of the Framework
Experiments
17 Approach
Graph Creation
–Entity-Relationship Graph
Consolidation Algorithm
–Bottom-up clustering
Adaptiveness to data
–That is, self-tuning
–Supervised learning
External Data
–To improve the quality further
–A theoretic possibility
–Not tested yet
18 ER Graph Creation
19 Virtual Connected Subgraph (VCS)
VCS
–Similarity edges form VCSs
–Subgraphs in the ER graph
1. “Virtual”
–Contains only similarity edges
2. “Connected”
–A path between any 2 nodes
3. Completeness
–Adding more nodes/edges would violate (1) or (2)
Logically, the Goal is
–Partition each VCS properly
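Taken together, the three properties describe the connected components of the graph restricted to its similarity edges. A minimal sketch of finding them with union-find is shown below. The function name, the node labels, and the edge-list representation are assumptions for illustration, not taken from the talk.

```python
from collections import defaultdict


def virtual_connected_subgraphs(nodes, similarity_edges):
    """Group nodes into maximal sets connected through similarity edges only.

    `nodes` is an iterable of hashable labels and `similarity_edges` an
    iterable of (u, v) pairs. A node without any similarity edge ends up
    in a component of its own.
    """
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in similarity_edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


# Hypothetical references r1..r5 with three similarity edges.
components = virtual_connected_subgraphs(
    ["r1", "r2", "r3", "r4", "r5"],
    [("r1", "r2"), ("r2", "r3"), ("r4", "r5")],
)
```

Each returned set is one VCS. Deciding which references inside it actually co-refer is the partitioning step named as the goal.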
20 Consolidation Algorithm: Merging
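Only the slide title survives here, so the following is a generic bottom-up (agglomerative) merging sketch rather than the authors' consolidation algorithm. It assumes average linkage over a pairwise similarity function and a fixed stopping `threshold`. Both choices, the function names, and the sample scores are hypothetical.

```python
from itertools import combinations


def bottom_up_merge(items, sim, threshold):
    """Repeatedly merge the two most similar clusters until none is similar enough.

    `items` must be distinct hashable values; `sim(x, y)` returns a pairwise
    similarity score. Starts from singleton clusters, as in any bottom-up scheme.
    """
    clusters = [frozenset([item]) for item in items]

    def linkage(a, b):
        # Average pairwise similarity between two clusters (assumed linkage).
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        if linkage(a, b) < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Made-up pairwise scores for three references within one VCS.
scores = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.2,
    frozenset(("b", "c")): 0.1,
}
result = bottom_up_merge(
    ["a", "b", "c"],
    lambda x, y: scores[frozenset((x, y))],
    threshold=0.5,
)
```

In this toy run, "a" and "b" merge first, and "c" stays separate because its average similarity to the merged cluster falls below the threshold.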
21 Self-tuning via Supervised Learning
22 Self-tuning (2)
23 External Knowledge to Improve Quality
24 Structure of the Talk
Motivation
Generic Framework
–High-level
Approach
–Part of the Framework
Experiments
25 Quality
“Context” is proposed in [Bhattacharya et al., DMKD’04]
The two algorithms are proposed in [Dong et al., SIGMOD’05]
26 Scalability & Efficiency
27 Impact of Random Relationships
28 Contact Information
Info about our disambiguation project
–http://www.ics.uci.edu/~dvk
Overall design
–Dmitri V. Kalashnikov
–dvk [at] domain
Implementation details in JCDL’07
–Zhaoqi (Stella) Chen
–chenz [at] domain
–domain = ics.uci.edu