Published by Berenice Cross. Modified over 9 years ago.
1
Literature Review
2
Record linkage
Runtime reduction techniques
◦ Blocking
◦ Canopies
◦ Sorted Neighborhood
Shift to parallel computing
Research directions
3
Determining if pairs of records refer to the same entity
◦ E.g. Distinguishing between data belonging to… Yipeng, the NUS student, and Yipeng, the son of PM Lee
4
DB 1, DB 2, DB 3
Sample entries: Amanda; Beverley; Catherine, Katherine; Daniel, David, Amanda; Elaine, Amanda
Dedup Two Lists: O(M*N)
Dedup Single List: O(N^2)
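The two costs above can be illustrated with a minimal Python sketch that counts comparisons. The matcher `same_entity` is a hypothetical stand-in (exact match after case folding), not a method taken from the slides; a real linkage system would use a fuzzier similarity test.

```python
from itertools import combinations, product


def same_entity(a: str, b: str) -> bool:
    # Hypothetical matcher: exact match after case folding.
    return a.casefold() == b.casefold()


def dedup_two_lists(left, right):
    """Compare every record in `left` with every record in `right` (M*N comparisons)."""
    comparisons = 0
    matches = []
    for a, b in product(left, right):
        comparisons += 1
        if same_entity(a, b):
            matches.append((a, b))
    return matches, comparisons


def dedup_single_list(records):
    """Compare every unordered pair in one list (N*(N-1)/2 comparisons, i.e. O(N^2))."""
    comparisons = 0
    matches = []
    for a, b in combinations(records, 2):
        comparisons += 1
        if same_entity(a, b):
            matches.append((a, b))
    return matches, comparisons
```

For lists of 3 and 2 records the first function performs 6 comparisons; for a single list of 4 records the second performs 6 as well, and that count grows quadratically with the list length.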
5
Pairwise comparison is increasingly expensive
Blocking techniques
◦ Reduce the search space
Example records: Amanda, Daniel, David
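A rough sketch of how blocking reduces the search space: records are grouped by a cheap key, and only records sharing a key are compared. The key used here (first letter of the name) is an assumption chosen for illustration; the slides do not say which blocking key was intended.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    # Assumed blocking key: the first letter, lower-cased.
    return name[:1].lower()


def candidate_pairs(records):
    """Return only the pairs that fall inside the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    pairs = []
    for key in sorted(blocks):
        pairs.extend(combinations(blocks[key], 2))
    return pairs
```

With the three example names, only Daniel and David share a block, so one candidate pair is produced instead of the three that exhaustive comparison would need. The trade-off is that true matches with different keys (for example Catherine and Katherine under this key) are never compared.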
7
Figure: Record 1 through Record 10, Comparison Window: 2w−1
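A possible reading of the windowed comparison, sketched in Python: records are sorted by a key, and each record is compared only with the next w−1 records in that order. Counting the w−1 records before it as well, each record's neighbourhood spans 2w−1 positions, which is one way to interpret the window label above. That interpretation, the function name, and the default sort key are all assumptions.

```python
def sorted_neighborhood_pairs(records, w, key=str.lower):
    """Return candidate pairs from a sliding window over the sorted records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, record in enumerate(ordered):
        # Pair the record with the next w-1 records; the previous w-1
        # records were already paired with it in earlier iterations.
        for j in range(i + 1, min(i + w, len(ordered))):
            pairs.append((record, ordered[j]))
    return pairs
```

The number of candidate pairs grows roughly as N*(w−1) rather than N^2, so the window size w controls the trade-off between runtime and the chance of missing matches that sort far apart.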
8
Pairwise comparison is increasingly expensive
Blocking techniques
◦ Reduce the search space
◦ Limitations: single node computation, localized data source, conflicting in function
Example records: Amanda, Daniel, David
9
Multi node computation
Data source flexibility
Complementary to blocking methods
Frontrunners:
◦ P-Febrl (P Christen 2003)
◦ P-Swoosh (H Kawai 2006)
◦ Parallel Linkage (H Kim 2007)
10
Peter Christen
◦ Parallelized Febrl with MPI
◦ Linear speedup but did not scale up well
Hideki Kawai
◦ Designed P-Swoosh in a simulated environment
◦ Match-based parallelism
◦ 2x speedup with use of domain knowledge
11
Hung-sik Kim, Dongwon Lee
◦ Explored parallel record linkage for different input cases in MATLAB
◦ Consistent speedup
◦ Not validated with very large datasets
12
Handles system level concerns…
◦ E.g. Data distribution, fault tolerance, dynamic load balancing, portability and scalability
Convenient model for scaling record linkage
◦ Better scaleup on pairwise comparisons (T Elsayed 2008)
◦ Runtime increased linearly with dataset size (R Vernica 2010)
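To make the programming model concrete, here is a single-process Python sketch of how pairwise comparison might be phrased as map, shuffle and reduce steps, in the style of the MapReduce model the slides name elsewhere. It only simulates the data flow and uses no Hadoop API. The grouping key and the exact-match test are hypothetical choices, not taken from the cited work.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records):
    # Emit (key, record) pairs; the key (first letter) is an assumed choice.
    for record in records:
        yield record[:1].lower(), record


def shuffle(pairs):
    # Group values by key, as the framework would do between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Pairwise comparison inside one group; groups are independent of each
    # other, so each could in principle be processed on a different node.
    return [(a, b) for a, b in combinations(sorted(values), 2)
            if a.lower() == b.lower()]


def run(records):
    groups = shuffle(map_phase(records))
    matches = []
    for key in sorted(groups):
        matches.extend(reduce_phase(key, groups[key]))
    return matches
```

In a real deployment the framework, not the user code, would handle the shuffle along with the data distribution and fault tolerance concerns listed above.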
13
Tailoring Hadoop for record linkage problems
◦ E.g. Bin packing blocks of different sizes
Experimenting with different problem types
◦ E.g. Bipartite data centers
Adapting existing parallel clustering algorithms onto the MapReduce model
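One way the bin packing idea could look in practice is a greedy heuristic that hands the most expensive block to the currently least-loaded worker (largest-first greedy scheduling). This is a swapped-in illustrative technique, not a method the slides describe; the cost model (pairs within a block) and all names below are assumptions.

```python
import heapq


def comparison_cost(block_size: int) -> int:
    # Assumed cost model: number of record pairs inside the block.
    return block_size * (block_size - 1) // 2


def assign_blocks(block_sizes: dict, workers: int):
    """Greedily assign blocks to workers, largest cost first."""
    heap = [(0, w) for w in range(workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(workers)}
    # Sort by descending cost, breaking ties by block name for determinism.
    ordered = sorted(block_sizes.items(),
                     key=lambda kv: (-comparison_cost(kv[1]), kv[0]))
    for name, size in ordered:
        load, worker = heapq.heappop(heap)  # least-loaded worker so far
        assignment[worker].append(name)
        heapq.heappush(heap, (load + comparison_cost(size), worker))
    loads = {w: sum(comparison_cost(block_sizes[b]) for b in blocks)
             for w, blocks in assignment.items()}
    return assignment, loads
```

Because comparison cost is quadratic in block size, balancing by cost rather than by record count keeps one oversized block from dominating the runtime of a single worker.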
14
Parallelism is a step in the right direction
◦ Complementary to existing approaches
◦ Consistent with the object orientation
But…
◦ Parallel design and implementation is difficult
◦ MapReduce is a viable solution