Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Slides:



Advertisements
Similar presentations
1 Efficient Merging and Filtering Algorithms for Approximate String Searches Jiaheng Lu, University of California, Irvine Joint work with Chen Li, Yiming.
Advertisements

Jiaheng Lu, University of California, Irvine
String Similarity Measures and Joins with Synonyms
Experiences with Hadoop and MapReduce
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
An Efficient Multi-Dimensional Index for Cloud Data Management Xiangyu Zhang Jing Ai Zhongyuan Wang Jiaheng Lu Xiaofeng Meng School of Information Renmin.
Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang University of New South Wales and NICTA.
A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses George Candea (EPFL & Aster Data) Neoklis Polyzotis (UC Santa Cruz) Radek Vingralek.
The Flamingo Software Package on Approximate String Queries Chen Li UC Irvine and Bimaple
ACL, June Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas W. Oard University of Maryland,
Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore VLDB’2005 * Liang Jin and.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
Liang Jin and Chen Li VLDB’2005 Supported by NSF CAREER Award IIS Selectivity Estimation for Fuzzy String Predicates in Large Data Sets.
1 Notes 06: Efficient Fuzzy Search Professor Chen Li Department of Computer Science UC Irvine CS122B: Projects in Databases and Web Applications Spring.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently Xiaochun Yang, Bin Wang Chen Li Northeastern.
Hinrich Schütze and Christina Lioma Lecture 4: Index Construction
MapReduce Simplified Data Processing On large Clusters Jeffery Dean and Sanjay Ghemawat.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Background: MapReduce and FREERIDE Co-clustering on FREERIDE Experimental.
Efficient Exact Similarity Searches using Multiple Token Orderings Jongik Kim 1 and Hongrae Lee 2 1 Chonbuk National University, South Korea 2 Google Inc.
Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Summary of Contributions Background: MapReduce and FREERIDE Wavelet.
VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams Chen Li Bin Wang and Xiaochun Yang Northeastern University,
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Efficient Parallel Set-Similarity Joins Using MapReduce Rares Vernica, Michael J. Carey, Chen Li Speaker : Razvan Belet.
Click to edit Present’s Name Xiaoyang Zhang 1, Jianbin Qin 1, Wei Wang 1, Yifang Sun 1, Jiaheng Lu 2 HmSearch: An Efficient Hamming Distance Query Processing.
Record Linkage: A 10-Year Retrospective Chen Li and Sharad Mehrotra UC Irvine 1.
Experiments An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints Entity Extraction A Document An Efficient Filter.
Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang Univ. of New South Wales, Austrailia ICDE ’09 9 Feb 2011 Taewhi Lee Based.
RecBench: Benchmarks for Evaluating Performance of Recommender System Architectures Justin Levandoski Michael D. Ekstrand Michael J. Ludwig Ahmed Eldawy.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Facilitating Document Annotation using Content and Querying Value.
ISchool, Cloud Computing Class Talk, Oct 6 th Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective Tamer Elsayed,
Dynamic Slot Allocation Technique for MapReduce Clusters School of Computer Engineering Nanyang Technological University 25 th Sept 2013 Shanjiang Tang,
Liang Jin * UC Irvine Nick Koudas University of Toronto Chen Li * UC Irvine Anthony K.H. Tung National University of Singapore * Liang Jin and Chen Li:
Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.
MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2.
Chen Li Department of Computer Science Joint work with Liang Jin, Nick Koudas, Anthony Tung, and Rares Vernica Answering Approximate Queries Efficiently.
Efficient Merging and Filtering Algorithms for Approximate String Searches Chen Li, Jiaheng Lu and Yiming Lu Univ. of California, Irvine, USA ICDE ’08.
By: Joel Dominic and Carroll Wongchote 4/18/2012.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Presenter: Yue Zhu, Linghan Zhang A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint.
Big Data is a Big Deal!.
COMP9313: Big Data Management Lecturer: Xin Cao Course web site:
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
By Chris immanuel, Heym Kumar, Sai janani, Susmitha
Distributed Network Traffic Feature Extraction for a Real-time IDS
Privacy Preserving Subgraph Matching on Large Graphs in Cloud
Introduction to HDFS: Hadoop Distributed File System
Hadoop Clusters Tess Fulkerson.
Jiannan Wang (Tsinghua, China) Guoliang Li (Tsinghua, China)
TT-Join: Efficient Set Containment Join
MapReduce Simplied Data Processing on Large Clusters
Cse 344 May 2nd – Map/reduce.
Cse 344 May 4th – Map/Reduce.
Guoliang Li (Tsinghua, China) Dong Deng (Tsinghua, China)
Experiences with Hadoop and MapReduce
Relaxing Join and Selection Queries
An Efficient Partition Based Method for Exact Set Similarity Joins
Dong Deng+, Yu Jiang+, Guoliang Li+, Jian Li+, Cong Yu^
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Accelerating Regular Path Queries using FPGA
Presentation transcript:

Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica

2 Motivation: Data Cleaning StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Tom HanksToy Story 32010Animation SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonThe man2006Crime Find movies starring Tom Hanks

3 Movies starring S..warz…ne…ger? StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Tom HanksToy Story 32010Animation SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonThe man2006Crime

4 Similarity Search StarTitleYearGenre Keanu ReevesThe Matrix1999Sci-Fi Samuel JacksonIron man2008Sci-Fi SchwarzeneggerThe Terminator1984Sci-Fi Samuel JacksonThe man2006Crime Find movies with a star “similar to” Schwarrzenger.

5 Record linkage Star Keanu Reeves Samuel Jackson Schwarzenegger … Table RTable S Star Keanu Reeves Samuel L. Jackson Schwarzenegger …

Step 2: Verification 6 Two-step solution Star … Table RTable S Star … Step 1: Similarity Join

 Similarity join for large data sets  Techniques applicable to other domains, e.g.: › Finding similar documents › Finding customers with similar patterns 7 Focus of this talk

 Formulation: set-similarity join  Hadoop-based solutions  Experiments More results: see SIGMOD2010 paper 8 Talk Outline

9 Set-Similarity Join Finding pairs of records with a similarity on their join attributes > t

10 Why this formulation? “Samuel L. Jackson”  {Samuel, L., Jackson} “Samuel Jackson”  {Samuel, Jackson}  Word tokens:  Gram tokens: S c h w a r z e n e g g e r

Set-similarity functions Jaccard Dice Cosine Hamming … All solvable in this framework 11

 Formulation of set-similarity join  Hadoop-based solutions  Experiments 12 Talk Outline

 Large amounts of data  Data or processing does not fit in one machine  Assumptions: › Self join: R = S › Two similar sets share at least 1 token 13 Why Hadoop?

 Map:  (a, 23), (b, 23), (c, 23) 14 A naïve solution  Too much data to transfer   Too many pairs to verify .  Reduce:(a,23),(a,29),(a,50), …  Verify each pair

Solving frequency skew: prefix filtering prefix r1 r2 15  Prefixes of similar sets should share tokens  Sort tokens by frequency (ascending)  Prefix of a set: least frequent tokens Sorted by frequency Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5

Prefix filtering: example 16  Each set has 5 tokens  “Similar”: they share at least 4 tokens  Prefix length: 2 Record 1 Record 2

 Stage 1: Order tokens by frequency  Stage 2: Finding “similar” id pairs  Stage 3: id pairs  record paris 17 Hadoop Solution: Overview

18 Stage 1: Sort tokens by frequency Compute token frequencies Sort them MapReduce phase 1 MapReduce phase 2

19 Stage 2: Find “similar” id pairs Partition using prefixesVerify similarity

20 Stage 3: id pairs  record pairs (phase 1) Bring records for each id in each pair

21 Stage 3: id pairs  record pairs (phase 2) Join two half filled records

 Formulation of set-similarity join  Hadoop-based solutions  Experiments 22 Talk Outline

 Hardware › 10-node IBM x3650 cluster › Intel Xeon processor E GHz with four cores › Four 300GB hard disks › 12GB RAM  Software › Ubuntu 9.06, 64-bit, server edition OS › Java 1.6, 64-bit, server › Hadoop  Datasets: publications (DBLP and CITESEERX) 23 Experimental Setting

24 Running time Stage 2 Stage 1 Stage 3

25 Speedup

26 Speedup Breakdown Stage 2 has good speedup

27 Scaleup Good scaleup

 Other methods for the 3 stages  Case: R <> S  Dealing with limited memory 28 Additional results

 Set-similarity joins in Hadoop:  Three-stage approach using Hadoop  Experimental study 29 Summary

Thank you Chen UC Irvine Source code available at: Acknowledgements: NSF, Google, IBM.