1
Record Linkage in a Distributed Environment
Huang Yipeng. Wing group meeting, 11 March 2011
2
Record Linkage: determining whether pairs of personal records refer to the same entity. E.g. distinguishing between data belonging to <Yipeng, author of this presentation> and <Yipeng, son of PM Lee>. Introduction
3
The Distributed Environment
Why? Dealing with large data and the C(n, 2) pairwise-comparison limitation of blocking. Advantages: parallel computation, data source flexibility, complementary to blocking methods. (Figure: example records Amanda, Beverley, Katherine.) Introduction
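Without blocking, linking n records means comparing every pair, i.e. C(n, 2) = n(n-1)/2 comparisons. This quadratic growth is what motivates the distributed approach; a minimal sketch of the count:

```python
def pairwise_comparisons(n):
    """Number of record pairs compared without blocking: C(n, 2) = n(n-1)/2."""
    return n * (n - 1) // 2

# The count grows quadratically with the number of records:
print(pairwise_comparisons(1_000))      # 499500
print(pairwise_comparisons(1_000_000))  # 499999500000
```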
4
The Distributed Environment
MapReduce: a distributed environment for large data sets. Hadoop: an open-source implementation; a convenient model for scaling Record Linkage; protects users from system-level concerns. Introduction
5
Research Problem: the disconnect between a generic parallel framework and the specific Record Linkage problem. The goal: tailor Hadoop for Record Linkage tasks. Introduction
6
Outline Introduction Related Work Methodology Evaluation Conclusion
7
Related Work
Record Linkage literature: blocking techniques. Parallel Record Linkage literature: P-Febrl (P. Christen 2003), P-Swoosh (H. Kawai 2006), Parallel Linkage (H. Kim 2007). Hadoop literature. Evaluation metrics: pairwise comparisons (T. Elsayed 2008). Related Work
8
Outline Introduction Related Work Methodology Evaluation Conclusion
9
MapReduce Workflow Partitioner Methodology
10
Implementation. Map purpose: parallelism, data manipulation, blocking; reads lines of input and outputs <key, value> pairs. Reduce purpose: parallelism, Record Linkage operations; records with the same <key> go to the same Reduce(), which outputs the linkage results. Methodology
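The Map/Reduce split above can be illustrated with a plain-Python sketch of the same dataflow (not Hadoop code; the choice of the name field as blocking key and the similarity test are hypothetical placeholders):

```python
from collections import defaultdict
from itertools import combinations

def map_phase(lines):
    """Map: read lines of input and emit <key, value> pairs,
    keyed on a blocking key (here, hypothetically, the name field)."""
    for line in lines:
        record = line.strip().split(",")
        yield record[0].lower(), record

def reduce_phase(pairs):
    """Reduce: records sharing a <key> reach the same reducer,
    which runs the pairwise linkage comparisons within each block."""
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    matches = []
    for records in blocks.values():
        for a, b in combinations(records, 2):
            if a[1:] == b[1:]:  # placeholder similarity test
                matches.append((a, b))
    return matches

lines = ["jack,10 main st", "jack,10 main st", "emma,5 side rd"]
print(reduce_phase(map_phase(lines)))  # one matching pair in the 'jack' block
```

Only records inside the same block are ever compared, which is how blocking cuts the C(n, 2) cost.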
11
Hash Partitioner: the default implementation, Hash(Key) mod N.
Good for uniform data but not for skewed distributions.
Name       Distribution   Comparisons
joshua     50             1225
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465
Methodology
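The skew problem can be seen in a small sketch: the default partitioner routes keys by hash value alone, ignoring how much comparison work each key carries (counts as in the skewed example; the two-reducer setup is illustrative):

```python
def hash_partition(key, num_reducers):
    """Sketch of Hadoop's default partitioning rule: hash(key) mod N."""
    return hash(key) % num_reducers

# Comparison workload of a block of k records is k*(k-1)/2.
counts = {"joshua": 50, "emiily": 48, "jack": 35,
          "thomas": 33, "lachlan": 32, "benjamin": 31}
load = [0, 0]
for name, k in counts.items():
    load[hash_partition(name, 2)] += k * (k - 1) // 2

# All 4437 comparisons are split by hash bucket only, so one node can
# easily receive far more than half of the work.
print(load)
```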
12
Record Linkage Partitioner
Preprocessing. Partitioning: balances the number of comparisons assigned to each node in an online fashion, to attain a more consistent running time across nodes. Merging: TODO: external sorting ??? Methodology
13
Record Linkage Partitioner
Goal: have all nodes finish the reduce phase at the same time; attain a better runtime while retaining the same level of accuracy. Methodology
14
Domain principles. Counting pairwise comparisons gives a more accurate picture of the true computational workload. The distribution of names tends to follow a power-law distribution in many countries (D. Zanette 2001; S. Miyazima 2000):
United States & Berlin: D. H. Zanette and S. C. Manrubia, Physica A 295, 1 (2001).
Taiwan: W. J. Reed and B. D. Hughes, Physica A 319, 579 (2003).
Japan: S. Miyazima, Y. Lee, T. Nagamine, and H. Miyajima, Physica A 278, 282 (2000).
England & Wales (first names): Douglas A. Galbi, Long-Term Trends in Personal Given Name Frequencies in the UK, FCC, 2002.
Korea, China: exponential rather than Zipf. Methodology
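These two principles compound: under a power-law (Zipf-like) name distribution, the few most frequent names dominate the quadratic comparison workload. An illustrative sketch with synthetic Zipf-like counts (made-up numbers, not data from the papers cited):

```python
# Synthetic Zipf-like name counts: frequency proportional to 1/rank.
counts = [100 // r for r in range(1, 11)]          # 100, 50, 33, 25, ...
comparisons = [k * (k - 1) // 2 for k in counts]   # quadratic per-name cost

# The single most frequent name carries well over half the total workload.
top_share = comparisons[0] / sum(comparisons)
print(f"top name: {top_share:.0%} of all comparisons")  # top name: 65% of all comparisons
```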
15
Record Linkage Workflow
Round 1: range partition based on comparison workload. Round 2: merge the comparisons lost in Round 1. Round 3: remove cross duplicates. Methodology
16
Round 1 (Input, Distribution, Map Phase)
1. Calculate the average comparison workload over N nodes.
2. Check whether a record block will exceed the average; if yes, divide it by the minimum number of nodes needed to drop below the average.
3. Assign records to nodes and update the average comparison workload to reflect lost comparisons, if any.
4. Recurse until the comparison load can be evenly distributed among the nodes.
Methodology
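A simplified, sequential sketch of the splitting step above (function and variable names are hypothetical; the full scheme also recurses after updating the average for lost comparisons):

```python
import math

def round1_split(block_sizes, num_nodes):
    """For each block (records sharing a key), compare its workload
    k*(k-1)/2 against the average workload per node; blocks above the
    average are divided across the minimum number of nodes needed to
    bring each share below it. Comparisons between records in different
    shares are 'lost' here and merged back in Round 2."""
    comps = lambda k: k * (k - 1) // 2
    avg = sum(comps(k) for k in block_sizes.values()) / num_nodes
    plan = {}
    for key, k in block_sizes.items():
        parts = 1
        # smallest split whose per-share workload drops below the average
        while comps(math.ceil(k / parts)) > avg and parts < num_nodes:
            parts += 1
        plan[key] = parts
    return plan

print(round1_split({"joshua": 50, "jack": 10}, 2))  # {'joshua': 2, 'jack': 1}
```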
17
Round 2 (diagram: lists A and B across rounds R1 and R2) Methodology
18
Round 2 (Job 1) only acts on the lost comparisons.
Because its input is indistinct, a third round of deduplication may be needed. (Diagram: lists A, B, C across Jobs 1-3.) Methodology
19
Outline Introduction Related Work Methodology Evaluation Conclusion
20
Performance Metrics: evaluation in absolute runtime, speedup, and scaleup on a shared cluster. "It's what users care about": representative of real operations. Evaluation
21
Input Records: 10 million records, 0.9 million original, 0.1 million duplicate; up to 9 duplicates per record, 1 modification per field, 1 modification per record; duplicates follow a Poisson distribution. Example: <rec org, talyor, swift, 5, canterbury crescent, , cooks hill, 4122, , , 38, , , 9> Methodology
22
Data sets Synthetic data produced with Febrl data generator
Artificially skewed distribution:
Name       Distribution   Comparisons
joshua     50             1225
emiily     48             1128
jack       35             595
thomas     33             528
lachlan    32             496
benjamin   31             465
Methodology
23
Utilization Evaluation
24
Utilization Evaluation
25
Utilization (chart: nodes A, B, C) Evaluation
26
Utilization (chart: nodes A, B) Evaluation
27
Round 2 (chart: node utilization % for nodes A, B, C across jobs J1-J6)
28
Results so far…
                                          Default Workflow   RL Workflow
2 nodes, 5000 records, 2433 duplicates    71.5 secs          75 secs
2 nodes, 7000 records, 4814 duplicates    >10 mins           196.8 secs
Evaluation
29
Results so far… RL Workflow runtime: similar to the hash-based runtime on small datasets; better as the dataset size grows. Evaluation
30
Conclusion. Parallelism is a step in the right direction for record linkage, complementary to existing approaches. Hadoop can be tailored for Record Linkage tasks; the "Record Linkage" Partitioner / Workflow is just one example of possible improvements. Conclusion