Adaptive Graphical Approach to Entity Resolution Dmitri V. Kalashnikov Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra Computer Science Department.

Adaptive Graphical Approach to Entity Resolution
Dmitri V. Kalashnikov
Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department
University of California, Irvine
Additional information is available at http://www.ics.uci.edu/~dvk
Copyright © by Dmitri V. Kalashnikov, 2007
ACM IEEE Joint Conference on Digital Libraries 2007

2 Structure of the Talk
→ Motivation
Generic Disambiguation Framework
–High-level
Entity Resolution Approach
–Part of the Framework
Experiments

3 Entity Resolution & Data Cleaning
Analysis on bad data leads to wrong conclusions!
Uncertainty
Errors
Missing data

4 Why do we need “Entity Resolution”?
Jane Smith – Fresh Ph.D.
Tom – Recruiter
Jane: Hi, I’m Jane Smith. I’d like to apply for a faculty position.
Tom: Wow! I am sure we will accept a strong candidate like that!
Tom: OK, let me check something quickly …
???
Publications: 1.…… 2.…… 3.……
Publications: 1.…… 2.…… 3.……
CiteSeer Rank

5 Suspicious entries
–Let’s go to the DBLP website
–which stores bibliographic entries of many CS authors
–Let’s check two people
–“A. Gupta”
–“L. Zhang”
What is the problem?
CiteSeer: the top-k most cited authors
DBLP

6 Comparing raw and cleaned CiteSeer

Rank            Author              Location
1 (100.00%)     douglas schmidt     cs@wustl
2 (100.00%)     rakesh agrawal      almaden@ibm
3 (100.00%)     hector garciamolina @
4 (100.00%)     sally floyd         @aciri
5 (100.00%)     jennifer widom      @stanford
6 (100.00%)     david culler        cs@berkeley
6 (100.00%)     thomas henzinger    eecs@berkeley
7 (100.00%)     rajeev motwani      @stanford
8 (100.00%)     willy zwaenepoel    cs@rice
9 (100.00%)     van jacobson        lbl@gov
10 (100.00%)    rajeev alur         cis@upenn
11 (100.00%)    john ousterhout     @pacbell
12 (100.00%)    joseph halpern      cs@cornell
13 (100.00%)    andrew kahng        @ucsd
14 (100.00%)    peter stadler       tbi@univie
15 (100.00%)    serge abiteboul     @inria

Raw CiteSeer’s Top-K Most Cited Authors
Cleaned CiteSeer’s Top-K Most Cited Authors

7 What is the lesson?
–Data should be cleaned first
–E.g., determine the (unique) real authors of publications
–Solving such challenges is not always “easy”
–This explains the large body of work on Entity Resolution
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.

8 Typical Data Processing Flow

9 Two most common types of Entity Resolution
Fuzzy lookup
–match references to objects
–list of all objects is given
–[SDM’05], [TODS’06]
Fuzzy grouping
–group references that co-refer
–[IQIS’05], [JCDL’07]

10 Structure of the Talk
Motivation
→ Generic Framework
–High-level
Approach
–Part of the Framework
Experiments

11 Traditional Approach to Entity Resolution
s(X,Y) = f(X,Y)
Similarity = Similarity of Features

12 Key Observation: More Info is Available

13 Solution: Main Idea
s(X,Y) = c(X,Y) + γ f(X,Y)
Similarity = Similarity of Features + “Connection Strength”
New Paradigm
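A minimal sketch of the combined score s(X,Y) = c(X,Y) + γ f(X,Y), to make the formula concrete. The slide does not say how f or c are computed, so everything below is an assumption for illustration: f is taken to be a mean Jaccard overlap of attribute tokens, c is passed in as a precomputed number, and the names `feature_similarity` and `combined_similarity` are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def feature_similarity(x: dict, y: dict) -> float:
    """f(X, Y): ASSUMED here to be the mean token-level Jaccard overlap
    over the attributes that both references have."""
    shared = x.keys() & y.keys()
    if not shared:
        return 0.0
    total = 0.0
    for key in shared:
        total += jaccard(set(x[key].lower().split()),
                         set(y[key].lower().split()))
    return total / len(shared)


def combined_similarity(x: dict, y: dict,
                        connection_strength: float,
                        gamma: float = 1.0) -> float:
    """s(X, Y) = c(X, Y) + gamma * f(X, Y).

    connection_strength stands in for c(X, Y); how it is derived from
    the graph is not specified on the slide, so it is an input here.
    """
    return connection_strength + gamma * feature_similarity(x, y)


ref_a = {"name": "John White"}
ref_b = {"name": "J White"}
score = combined_similarity(ref_a, ref_b, connection_strength=0.5, gamma=2.0)
```

With γ = 0 the score reduces to the connection strength alone, and with c = 0 it reduces to the traditional feature-only similarity scaled by γ.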

14 Illustrative Example
“Indirect connections”
–Suppose your co-worker’s name is “John White”
–Suppose you see on the Web, on my homepage
–My name: “Dmitri …”
–Somebody named: “John White”
–Who is the “John White”?
–From data you might establish a connection:
–“Dmitri” might be connected to more “John White”’s…

15 Key Features of the Framework
Our goal is/was to create a framework, such that:
–solid theoretic foundation
–lookup
–domain-independent framework
–self-tuning
–scales to large datasets
–robust under uncertainty
–high disambiguation quality

16 Structure of the Talk
Motivation
Generic Framework
–High-level
→ Approach
–Part of the Framework
Experiments

17 Approach
Graph Creation
–Entity-Relationship Graph
Consolidation Algorithm
–Bottom-up clustering
Adaptiveness to data
–That is, self-tuning
–Supervised learning
External Data
–To improve the quality further
–A theoretic possibility
–Not tested yet

18 ER Graph Creation

19 Virtual Connected Subgraph (VCS)
VCS
–Similarity edges form VCSs
–Subgraphs in the ER graph
1. “Virtual”
–Contains only similarity edges
2. “Connected”
–A path between any 2 nodes
3. Completeness
–Adding more nodes/edges would violate (1) and (2)
Logically, the Goal is
–Partition each VCS properly
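Read together, the three VCS properties describe maximal connected components of the graph restricted to similarity edges. A minimal sketch under that reading follows; the function name, the edge-list input format, and the choice to return references without any similarity edge as singleton components are assumptions, not details from the talk.

```python
from collections import defaultdict


def virtual_connected_subgraphs(nodes, similarity_edges):
    """Group nodes into maximal components connected by similarity edges.

    nodes: iterable of hashable reference ids.
    similarity_edges: iterable of (u, v) pairs; only these edges are
    followed ("virtual"), so relationship edges are ignored entirely.
    """
    adjacency = defaultdict(set)
    for u, v in similarity_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start
        # ("connected"); stopping only when nothing more can be added
        # gives the "completeness" property.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


refs = ["a1", "a2", "a3", "b1", "c1"]
edges = [("a1", "a2"), ("a2", "a3")]
vcs_list = virtual_connected_subgraphs(refs, edges)
```

Each returned component is then the unit that the consolidation step has to partition.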

20 Consolidation Algorithm: Merging
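The transcript keeps only the slide title here, so the authors' actual merging procedure is not recoverable. As a generic illustration of bottom-up clustering, the sketch below starts from singleton clusters and repeatedly merges the most similar pair until no pair reaches a threshold. Average linkage, the stopping threshold, and all names are assumptions, and this is not presented as the algorithm from the talk.

```python
from itertools import combinations


def consolidate(references, pair_similarity, threshold):
    """Greedy bottom-up (agglomerative) clustering of references.

    pair_similarity(x, y) -> float is any symmetric similarity function.
    Merging stops once the best cluster pair scores below threshold.
    """
    clusters = [frozenset([r]) for r in references]

    def cluster_similarity(a, b):
        # ASSUMED average linkage: mean similarity over all cross pairs.
        total = sum(pair_similarity(x, y) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best_score, best_pair = max(
            ((cluster_similarity(a, b), (a, b))
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if best_score < threshold:
            break
        a, b = best_pair
        # Replace the two merged clusters with their union.
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a | b)
    return clusters


# Hypothetical pairwise scores for three references.
scores = {
    frozenset(("r1", "r2")): 0.9,
    frozenset(("r1", "r3")): 0.1,
    frozenset(("r2", "r3")): 0.1,
}
result = consolidate(["r1", "r2", "r3"],
                     lambda x, y: scores[frozenset((x, y))],
                     threshold=0.5)
```

In this toy input r1 and r2 merge first, and the merged cluster's average similarity to r3 falls below the threshold, so r3 stays on its own.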

21 Self-tuning via Supervised Learning

22 Self-tuning (2)

23 External Knowledge to Improve Quality

24 Structure of the Talk
Motivation
Generic Framework
–High-level
Approach
–Part of the Framework
→ Experiments

25 Quality
“Context” is proposed in [Bhattacharya et al., DMKD’04]
The two algorithms are proposed in [Dong et al., SIGMOD’05]

26 Scalability & Efficiency

27 Impact of Random Relationships

28 Contact Information
Info about our disambiguation project
–http://www.ics.uci.edu/~dvk
Overall design
–Dmitri V. Kalashnikov
–dvk [at] domain
Implementation details in JCDL’07
–Zhaoqi (Stella) Chen
–chenz [at] domain
–domain = ics.uci.edu
