Published by Eustace Lewis. Modified over 9 years ago.
1
Progressive Approach to Relational Entity Resolution
Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra
2
Progressive ER
3
Relational Dataset

Paper
Id   Title                                     Authors    Venue
p1   Transaction Support in Read Optimized …   {a1, a2}   u1
p2   Read Optimized File System Designs: …     {a1}       u2
p3   Transaction Support in Read Optimized …   {a3, a4}   u3
p4   Berkeley DB: A Retrospective..            {a3}       u4

Author
Id   Name                 Papers
a1   Marge Seltzer        {p1, p2}
a2   Michael Stonebraker  {p1}
a3   Margo I. Seltzer     {p3, p4}
a4   M. Stonebraker       {p3}

Venue
Id   Name                  Papers
u1   Very Large Data Bases {p1}
u2   ICDE Conference       {p2}
u3   VLDB                  {p3}
u4   IEEE Data Eng. Bull   {p4}
4
Graph Representation
[Figure: graph with nodes (p1, p3) and (u1, u3); labels: Resolve, duplicate]
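The graph representation can be sketched in code. The following is a minimal illustration, not the authors' implementation. It assumes that a node is a pair of same-type entities that might be duplicates, and that an edge links two nodes when the entities of one node reference the entities of the other (here, papers p1 and p3 appear in venues u1 and u3). The names `PAPER_VENUE`, `pair_node`, and `dependency_edges` are hypothetical, and the sketch enumerates all paper pairs rather than applying any blocking.

```python
from itertools import combinations

# Paper -> venue references, copied from the Relational Dataset slide.
PAPER_VENUE = {"p1": "u1", "p2": "u2", "p3": "u3", "p4": "u4"}


def pair_node(a, b):
    """Return a canonical (sorted) tuple for a candidate duplicate pair."""
    return tuple(sorted((a, b)))


def dependency_edges(paper_venue):
    """Link each paper-pair node to the venue-pair node it references.

    Assumption: resolving one endpoint of an edge as "duplicate" is
    evidence that may help resolve the other endpoint.
    """
    edges = set()
    for p, q in combinations(sorted(paper_venue), 2):
        u, v = paper_venue[p], paper_venue[q]
        if u != v:  # papers sharing one venue record create no venue pair
            edges.add((pair_node(p, q), pair_node(u, v)))
    return edges


edges = dependency_edges(PAPER_VENUE)
print((("p1", "p3"), ("u1", "u3")) in edges)
```

In this toy setting, the edge between (p1, p3) and (u1, u3) is the one shown on the slide: deciding that u1 and u3 are the same venue makes it more plausible that p1 and p3 are the same paper, and vice versa.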
5
Problem Definition
Given a relational dataset D and a cost budget BG, our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.
6
ER Graph
[Figure: blocks R1, R2, S1, S2, T1, T2]
7
ER Graph
[Figure: blocks R1, R2, S1, S2, T1, T2 with nodes v1 through v12]
8
Partially Constructed Graph
[Figure: blocks R1, S1, T1 and R2, S2, T2 with nodes v1 through v12]
9
Resolution Windows
Window 1, Window 2, …, Window n (within the budget BG)
Each window:
1. Plan Generation.
2. Plan Execution ( ).
Resolution Plan ( ):
Set of blocks ( ) to be instantiated.
Set of nodes ( ) to be resolved.
Lazy Resolution Strategy
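The window loop described above can be sketched as follows. This is an outline under stated assumptions, not the authors' code: it assumes the budget BG is split evenly across the n windows (the slides do not say how it is divided), and `generate_plan` and `execute_plan` are hypothetical callbacks standing in for the two phases.

```python
def run_windows(budget, n_windows, generate_plan, execute_plan):
    """Run plan generation, then plan execution, once per window.

    Assumption: each window receives an equal share of the total budget.
    """
    window_budget = budget / n_windows
    spent = 0.0
    for _ in range(n_windows):
        plan = generate_plan(window_budget)  # phase 1: pick blocks and nodes
        spent += execute_plan(plan)          # phase 2: returns the cost used
        if spent >= budget:                  # stop once the budget is used up
            break
    return spent


# Toy usage: a plan is a list of node costs that fits the window budget.
def toy_generate(window_budget):
    plan, total = [], 0.0
    for cost in [4.0, 3.0, 2.0, 1.0]:
        if total + cost <= window_budget:
            plan.append(cost)
            total += cost
    return plan


def toy_execute(plan):
    return sum(plan)


total_spent = run_windows(30.0, 3, toy_generate, toy_execute)
print(total_spent)
```

Keeping the two phases as separate callbacks mirrors the slide's split between deciding what to do in a window and doing it.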
10
Plan Cost and Benefit
11
Node Benefit
[Figure: nodes v1 through v6; labels: State, Direct Benefit, Indirect Benefit]
12
Plan Generation Phase
1. Benefit-vs-Cost Analysis: Each node and block has an updated cost and benefit.
2. Generate a plan such that: h. is maximized.
NP-hard: Oregon-Trail Knapsack
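Because the plan generation problem is NP-hard, a heuristic is a natural fit. The sketch below shows a generic greedy benefit-per-cost heuristic for a knapsack-style selection. It is a stand-in for illustration, not the plan generation algorithm from the talk; the `Candidate` type and its example numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str       # hypothetical node or block identifier
    cost: float     # estimated cost of resolving or instantiating it (> 0)
    benefit: float  # estimated benefit of doing so


def greedy_plan(candidates, budget):
    """Pick candidates in decreasing benefit/cost order within the budget."""
    plan, remaining = [], budget
    ranked = sorted(candidates, key=lambda c: c.benefit / c.cost, reverse=True)
    for cand in ranked:
        if cand.cost <= remaining:
            plan.append(cand)
            remaining -= cand.cost
    return plan


# Made-up costs and benefits, for illustration only.
candidates = [
    Candidate("v1", cost=2.0, benefit=6.0),   # ratio 3.0
    Candidate("v2", cost=4.0, benefit=8.0),   # ratio 2.0
    Candidate("v6", cost=5.0, benefit=5.0),   # ratio 1.0
    Candidate("v10", cost=1.0, benefit=2.5),  # ratio 2.5
]
plan = greedy_plan(candidates, budget=7.0)
print([c.name for c in plan])
```

A greedy ratio ordering is not guaranteed to be optimal for knapsack problems, which is one reason the costs and benefits would need to be re-estimated as resolution proceeds.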
13
Plan Generation Algorithm
Step#1, Step#2
[Figure: Uninstantiated Blocks R1, R2, R4, R5, R6, R8, R9; Instantiated Unresolved Nodes v1, v2, v4, v6, v7, v10, v13, v15, v16, v21, followed by the subset v1, v2, v6, v10, v16]
14
Plan Generation Algorithm
Step#3: If > else return and
[Figure: blocks R1, R8, R6, R2, …; nodes v1, v2, v6, v10, v16; nodes v1, v2, v10, v30, v32, v34, v36, v38, v40, v42, v45, v47, v48]
15
Experimental Evaluation
CiteSeerX Dataset
1. Papers (P) = (Title, Abstract, Keywords, Authors, Venue).
2. Authors (A) = (Name, Email, Affiliation, Address, Paper).
3. Venues (U) = (Name, Year, Pages, Papers).

     Number of Entities   Blocking Functions   Similarity Functions   Resolve Function
P    30,000               2                    3                      Naïve Bayes
A    83,152               1                    4                      Naïve Bayes
U    30,000               1                    3                      Naïve Bayes
16
Experimental Evaluation
Algorithms:
1. DepGraph: X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.
2. Static: S. E. Whang et al. Joint entity resolution. ICDE.
3. Full: No lazy resolution strategy.
4. Random: Lazy resolution strategy but with random order.
[Figure: block sequences for R (R1, R4, R5, …), T (T6, T1, T3, …), and S (S2, S6, S5, …)]
17
Time vs. Recall
18
Lazy Resolution with Workflow

                       Our Approach   Random   Full
Execution Time (sec)   300.33         396.55   542.43
Plan Generation        4.76%          3.81%    2.58%
Plan Execution         95.11%         96.17%   97.40%

                       Our Approach   Random   Full
Execution Time (sec)   300.33         396.55   542.43
Plan Generation        4.76%          3.81%    2.58%
Reading Blocks         4.70%          3.75%    2.90%
Graph Creation         8.40%          6.25%    4.72%
Node Resolution        82.01%         86.17%   89.78%

Reading Blocks. Creating Nodes. Resolving Nodes.
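The percentages on this slide can be turned into absolute times. The short calculation below uses only the numbers from the slide; the variable names are illustrative.

```python
# Total execution time (sec) and node-resolution share (%), from the slide.
total_sec = {"Our Approach": 300.33, "Random": 396.55, "Full": 542.43}
node_resolution_pct = {"Our Approach": 82.01, "Random": 86.17, "Full": 89.78}

# Absolute seconds spent resolving nodes = total time * percentage / 100.
node_resolution_sec = {
    name: total_sec[name] * node_resolution_pct[name] / 100.0
    for name in total_sec
}

# Ratio of the total execution time of Full to that of Our Approach.
ratio_full = total_sec["Full"] / total_sec["Our Approach"]

for name, secs in node_resolution_sec.items():
    print(f"{name}: {secs:.1f} s resolving nodes")
print(f"Full / Our Approach total time: {ratio_full:.2f}x")
```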
19
Conclusion
Progressive Approach to Relational ER.
Cost and benefit model for generating a resolution plan.
Lazy resolution strategy to resolve nodes with the least amount of cost.
Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.
20
Questions