Download presentation
Presentation is loading. Please wait.
Published byRobyn George Modified over 9 years ago
1
Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica
2
2 Motivation: Data Cleaning StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Tom HanksToy Story 32010Animation SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonThe man2006Crime Find movies starring Tom Hanks
3
3 Movies starring S..warz…ne…ger? StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Tom HanksToy Story 32010Animation SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonThe man2006Crime
4
4 Similarity Search StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonIron man2008Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonThe man2006Crime Find movies with a star “similar to” Schwarrzenger.
5
5 Record linkage Star Keanu Reeves Samuel Jackson Schwarzenegger … Table RTable S Star Keanu Reeves Samuel L. Jackson Schwarzenegger …
6
Step 2: Verification 6 Two-step solution Star … Table RTable S Star … Step 1: Similarity Join
7
Similarity join for large data sets Techniques applicable to other domains, e.g.: › Finding similar documents › Finding customers with similar patterns 7 Focus of this talk
8
Formulation: set-similarity join Hadoop-based solutions Experiments More results: see SIGMOD2010 paper 8 Talk Outline
9
9 Set-Similarity Join Finding pairs of records with a similarity on their join attributes > t
10
10 Why this formulation? “Samuel L. Jackson” {Samuel, L., Jackson} “Samuel Jackson” {Samuel, Jackson} Word tokens: Gram tokens: S c h w a r z e n e g g e r
11
Set-similarity functions Jaccard Dice Cosine Hamming … All solvable in this framework 11
12
Formulation of set-similarity join Hadoop-based solutions Experiments 12 Talk Outline
13
Large amounts of data Data or processing does not fit in one machine Assumptions: › Self join: R = S › Two similar sets share at least 1 token 13 Why Hadoop?
14
Map: (a, 23), (b, 23), (c, 23) 14 A naïve solution Too much data to transfer Too many pairs to verify . Reduce:(a,23),(a,29),(a,50), … Verify each pair
15
Solving frequency skew: prefix filtering prefix r1 r2 15 Prefixes of similar sets should share tokens Sort tokens by frequency (ascending) Prefix of a set: least frequent tokens Sorted by frequency Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5
16
Prefix filtering: example 16 Each set has 5 tokens “Similar”: they share at least 4 tokens Prefix length: 2 Record 1 Record 2
17
Stage 1: Order tokens by frequency Stage 2: Finding “similar” id pairs Stage 3: id pairs record paris 17 Hadoop Solution: Overview
18
18 Stage 1: Sort tokens by frequency Compute token frequencies Sort them MapReduce phase 1 MapReduce phase 2
19
19 Stage 2: Find “similar” id pairs Partition using prefixesVerify similarity
20
20 Stage 3: id pairs record pairs (phase 1) Bring records for each id in each pair
21
21 Stage 3: id pairs record pairs (phase 2) Join two half filled records
22
Formulation of set-similarity join Hadoop-based solutions Experiments 22 Talk Outline
23
Hardware › 10-node IBM x3650 cluster › Intel Xeon processor E5520 2.26GHz with four cores › Four 300GB hard disks › 12GB RAM Software › Ubuntu 9.06, 64-bit, server edition OS › Java 1.6, 64-bit, server › Hadoop 0.20.1 Datasets: publications (DBLP and CITESEERX) 23 Experimental Setting
24
24 Running time Stage 2 Stage 1 Stage 3
25
25 Speedup
26
26 Speedup Breakdown Stage 2 has good speedup
27
27 Scaleup Good scaleup
28
Other methods for the 3 stages Case: R <> S Dealing with limited memory 28 Additional results
29
Set-similarity joins in Hadoop: Three-stage approach using Hadoop Experimental study 29 Summary
30
Thank you Chen Li @ UC Irvine Source code available at: http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/ Acknowledgements: NSF, Google, IBM.
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.