
1 Literature Review 1

2
• Record linkage
• Runtime reduction techniques
  ◦ Blocking
  ◦ Canopies
  ◦ Sorted Neighborhood
• Shift to parallel computing
• Research directions

3
• Determining if pairs of records refer to the same entity
  ◦ E.g. distinguishing between data belonging to… Yipeng, the NUS student, and Yipeng, the son of PM Lee
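The pairwise decision described on this slide can be sketched as follows. This is a minimal illustration, not the method of any system cited in the deck: the field names, the averaging rule and the 0.85 threshold are all assumptions, and the similarity measure is Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Declare a match when the mean similarity over shared fields
    reaches the (assumed) threshold."""
    fields = rec_a.keys() & rec_b.keys()
    score = sum(field_similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
    return score >= threshold


# Hypothetical records for the two people named on the slide.
student = {"name": "Yipeng", "description": "NUS student"}
son = {"name": "Yipeng", "description": "son of PM Lee"}

same_person = is_match(student, son)  # identical names, different descriptions
```

Comparing the name alone would merge the two records; adding a second field is what lets the comparison tell them apart.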

4
[Figure: name lists from DB 1, DB 2 and DB 3 (Amanda, Beverley, Catherine, Katherine, Daniel, David, Elaine)]
• Dedup Two Lists: O(M*N)
• Dedup Single List: O(N²)
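The two complexity figures on this slide follow from counting candidate pairs. The sketch below is illustrative only; the function names and list sizes are assumptions.

```python
from itertools import combinations, product


def pairs_two_lists(m: int, n: int) -> int:
    """Every record of one list is compared with every record of the other."""
    return m * n


def pairs_single_list(n: int) -> int:
    """Every unordered pair within one list: n(n-1)/2, which is O(n^2)."""
    return n * (n - 1) // 2


# Cross-check the formulas by enumerating pairs for small inputs.
list_a = list(range(4))
list_b = list(range(5))
enumerated_two = len(list(product(list_a, list_b)))
enumerated_single = len(list(combinations(list_b, 2)))
```

At a million records, deduplicating a single list already means roughly 5 × 10^11 comparisons, which is the motivation for the runtime reduction techniques that follow.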

5
• Pairwise comparison is increasingly expensive
• Blocking techniques
  ◦ Reduce the search space
[Figure: Amanda, Daniel, David]
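A minimal sketch of blocking, assuming the first letter of the name as the blocking key (the deck does not state which key it uses). Only records that share a key are compared.

```python
from collections import defaultdict
from itertools import combinations


def block_by_key(records, key_func):
    """Group records into blocks that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key_func(record)].append(record)
    return blocks


def candidate_pairs(records, key_func):
    """Yield pairs only from within each block."""
    for block in block_by_key(records, key_func).values():
        yield from combinations(block, 2)


names = ["Amanda", "Beverley", "Catherine", "Katherine", "Daniel", "David", "Elaine"]
blocked = list(candidate_pairs(names, lambda name: name[0]))
unblocked = list(combinations(names, 2))
```

Here blocking cuts 21 comparisons down to one (Daniel, David). The same key also separates Catherine from Katherine, so a poorly chosen key can hide true matches.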


7
[Figure: Record 1 through Record 10 listed in order; Comparison Window: 2w−1]
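The figure appears to illustrate the Sorted Neighborhood method from the outline. A minimal sketch follows, under one assumed reading of the window: records are sorted by a key, and each record is compared with the w−1 records that follow it. Every record is therefore paired with up to w−1 neighbours on each side, a span of 2w−1 positions. The sort key and the value of w are assumptions.

```python
def sorted_neighborhood_pairs(records, key_func, w):
    """Sort by key, then pair each record with the next w-1 records."""
    ordered = sorted(records, key=key_func)
    pairs = []
    for i, record in enumerate(ordered):
        for neighbour in ordered[i + 1 : i + w]:
            pairs.append((record, neighbour))
    return pairs


# Ten records, standing in for Record 1 to Record 10 in the figure.
record_ids = list(range(1, 11))
window_pairs = sorted_neighborhood_pairs(record_ids, key_func=lambda r: r, w=3)
```

With n records the number of comparisons grows as roughly n(w−1) rather than n(n−1)/2, at the cost of missing matches whose sort keys fall far apart.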

8
• Pairwise comparison is increasingly expensive
• Blocking techniques
  ◦ Reduce the search space
  ◦ Limitations
    • Single-node computation
    • Localized data source
    • Conflicting in function
[Figure: Amanda, Daniel, David]

9
• Multi-node computation
• Data source flexibility
• Complementary to blocking methods
• Frontrunners:
  ◦ P-Febrl (P Christen 2003)
  ◦ P-Swoosh (H Kawai 2006)
  ◦ Parallel Linkage (H Kim 2007)

10
• Peter Christen
  ◦ Parallelized Febrl with MPI
  ◦ Linear speedup, but did not scale up well
• Hideki Kawai
  ◦ Designed P-Swoosh in a simulated environment
  ◦ Match-based parallelism
  ◦ 2x speedup with use of domain knowledge
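For reference, the two metrics contrasted on this slide are conventionally defined as below. The notation T(p, N), the runtime on p nodes for N records, is an assumption introduced here and does not come from the deck.

```latex
% Speedup: fixed problem size, more nodes. Linear speedup means S(p) \approx p.
S(p) = \frac{T(1, N)}{T(p, N)}

% Scaleup: problem size grows with the node count. Ideal scaleup stays near 1.
\mathit{Scaleup}(p) = \frac{T(1, N)}{T(p, pN)}
```

Under these definitions a system can show linear speedup on a fixed dataset and still scale up poorly once the data grows with the cluster.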

11
• Hung-sik Kim, Dongwon Lee
  ◦ Explored parallel record linkage for different input cases in MATLAB
  ◦ Consistent speedup
  ◦ Not validated with very large datasets

12
• Handles system-level concerns…
  ◦ E.g. data distribution, fault tolerance, dynamic load balancing, portability and scalability
• Convenient model for scaling record linkage
  ◦ Better scale-up on pairwise comparisons (T Elsayed 2008)
  ◦ Runtime increased linearly with dataset size (R Vernica 2010)
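Assuming this slide describes the MapReduce model named in the conclusion, the sketch below simulates how blocked record linkage might map onto it. It runs in a single Python process and does not use the Hadoop API; the function names, blocking key and match rule are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(records, key_func):
    """Map: emit (blocking key, record) for every input record."""
    for record in records:
        yield key_func(record), record


def shuffle(mapped):
    """Shuffle: group values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(values, match_func):
    """Reduce: compare all pairs within one key group and keep the matches."""
    return [(a, b) for a, b in combinations(values, 2) if match_func(a, b)]


def run_job(records, key_func, match_func):
    groups = shuffle(map_phase(records, key_func))
    matches = []
    for key in sorted(groups):  # each key could go to a different reducer
        matches.extend(reduce_phase(groups[key], match_func))
    return matches


records = ["amanda", "Amanda", "Daniel", "David", "AMANDA"]
matches = run_job(
    records,
    key_func=lambda r: r[0].lower(),
    match_func=lambda a, b: a.lower() == b.lower(),
)
```

Each reduce call touches only its own key group, which is why the framework can distribute the groups across nodes and handle the system-level concerns listed above.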

13
• Tailoring Hadoop for record linkage problems
  ◦ E.g. bin packing blocks of different sizes
• Experimenting with different problem types
  ◦ E.g. bipartite data centers
• Adapting existing parallel clustering algorithms onto the MapReduce model
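One way the bin-packing idea might look is sketched below, using the classic first-fit decreasing heuristic. The deck does not say which packing strategy it has in mind. The block sizes, the per-task capacity and the choice of n(n−1)/2 comparisons as the cost of a block are assumptions.

```python
def comparisons(n: int) -> int:
    """Pairwise comparisons needed inside a block of n records."""
    return n * (n - 1) // 2


def first_fit_decreasing(block_sizes: dict, capacity: int) -> list:
    """Pack blocks into tasks so that each task's comparison load stays
    within capacity. A block that exceeds capacity gets a task of its own."""
    tasks = []  # each task: {"load": total comparisons, "blocks": [names]}
    by_cost = sorted(
        block_sizes.items(), key=lambda kv: comparisons(kv[1]), reverse=True
    )
    for name, size in by_cost:
        cost = comparisons(size)
        for task in tasks:
            if task["load"] + cost <= capacity:
                task["blocks"].append(name)
                task["load"] += cost
                break
        else:  # no existing task had room
            tasks.append({"load": cost, "blocks": [name]})
    return tasks


# Hypothetical blocks: name -> number of records.
block_sizes = {"A": 10, "B": 6, "C": 5, "D": 4, "E": 3}
tasks = first_fit_decreasing(block_sizes, capacity=45)
```

Because cost grows quadratically with block size, one large block can outweigh several small ones, so packing by comparison count rather than record count gives a more even load.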

14
• Parallelism is a step in the right direction
  ◦ Complementary to existing approaches
  ◦ Consistent with the object orientation
• But…
  ◦ Parallel design and implementation is difficult
  ◦ MapReduce is a viable solution

