An Efficient Partition Based Method for Exact Set Similarity Joins

Presentation transcript:

An Efficient Partition Based Method for Exact Set Similarity Joins Dong Deng, Guoliang Li, He Wen, Jianhua Feng Database Group, Tsinghua University

Real World Data is Dirty. Typos and inconsistent formats, e.g. "Argyrios Zymnis" vs. "Argyris Zymnis".

Fuzzy Matching: A use case :-) Find speeches similar to Melania Trump's RNC speech This slide is borrowed from Chen Li: https://chenli.ics.uci.edu/

Fuzzy Matching: A use case :-) This slide is borrowed from Chen Li: https://chenli.ics.uci.edu/

Token-based Similarity: $\mathrm{Jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$, $\mathrm{Cosine}(X, Y) = \frac{|X \cap Y|}{\sqrt{|X|\,|Y|}}$, $\mathrm{Dice}(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}$
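The three token-based measures can be written down directly. The following is a minimal sketch, assuming each record is already tokenized into a Python set; the function names and the convention of returning 0.0 when both sets are empty are choices made here, not something the slides specify.

```python
import math


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|; returns 0.0 for two empty sets (assumed convention)."""
    x, y = set(x), set(y)
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def cosine(x, y):
    """|X ∩ Y| / sqrt(|X| * |Y|); returns 0.0 if either set is empty."""
    x, y = set(x), set(y)
    denom = math.sqrt(len(x) * len(y))
    return len(x & y) / denom if denom else 0.0


def dice(x, y):
    """2 * |X ∩ Y| / (|X| + |Y|); returns 0.0 for two empty sets."""
    x, y = set(x), set(y)
    denom = len(x) + len(y)
    return 2 * len(x & y) / denom if denom else 0.0
```

For example, jaccard({1, 2, 3}, {2, 3, 4}) evaluates to 0.5, since the sets share 2 of 4 distinct tokens.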

Similarity Join Problem Definition. Input: a collection of sets R, a token-based similarity function Sim, and a similarity threshold δ. Output: all pairs $(X, Y) \in R \times R$ s.t. $\mathrm{Sim}(X, Y) \ge \delta$. *We also addressed the two-relation similarity join problem in the paper.

An Example. Input: the collection R [table not preserved], Sim = Jaccard similarity, δ = 0.73. Output: (X1, X5), since Jaccard(X1, X5) = 0.82 ≥ δ.

Challenge: brute force compares $O(n^2)$ pairs! For n = 1 million, there are 1 trillion pairs!
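To make the quadratic cost concrete, here is a minimal sketch of the brute-force baseline for the self-join with Jaccard similarity. The function name brute_force_join is hypothetical, and the input is assumed to be a list of Python sets. It enumerates all n(n−1)/2 unordered pairs, which is the behavior the filtering techniques in the rest of the talk try to avoid.

```python
from itertools import combinations


def brute_force_join(sets, delta):
    """Return all index pairs (i, j), i < j, with Jaccard(sets[i], sets[j]) >= delta."""
    results = []
    for (i, x), (j, y) in combinations(enumerate(sets), 2):
        inter = len(x & y)
        union = len(x) + len(y) - inter
        # Skip pairs of empty sets to avoid dividing by zero.
        if union and inter / union >= delta:
            results.append((i, j))
    return results
```

With sets = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {7, 8, 9}] and delta = 0.73, the only result is the pair (0, 1), whose Jaccard similarity is 4/5.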

Filter-and-Verification Framework. Input: string r, string s, threshold δ. Filter: Signature(s) ∩ Signature(r) = ∅? If yes, the pair is pruned; if no, verify: Sim(r, s) ≥ δ? Pairs that pass verification are the results.

Related Work. Most recent related work fits in the Prefix Filter Framework. “Based on our findings, we do not expect significant impact from future techniques that sit on top of the prefix filter, but see opportunities in fast candidate generation.” (Mann et al., VLDB 2016 experimental paper)

Prefix Filter Framework. [Figure: the list of all elements in order (universe); each set, RID 1 to 4, is split into a prefix and a suffix.] Prefix filter: Sim(X, Y) ≥ δ only if Prefix(X) ∩ Prefix(Y) ≠ ∅.
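The prefix filter can be sketched as follows for Jaccard similarity. The prefix length |X| − ⌈δ|X|⌉ + 1 used here is the usual choice in the prefix-filter literature and is not stated on the slide; the global order is assumed to be the natural order of the tokens (practical systems typically order tokens by increasing frequency). The names prefix and prefix_filter_candidates are hypothetical.

```python
import math
from collections import defaultdict


def prefix(x, delta):
    """First |X| - ceil(delta * |X|) + 1 tokens of X under the global order."""
    tokens = sorted(x)  # assumed global order: natural token order
    # Note: delta * len(tokens) is a float product; exact arithmetic may be
    # preferable when the product is expected to be an integer.
    length = len(tokens) - math.ceil(delta * len(tokens)) + 1
    return tokens[:length]


def prefix_filter_candidates(sets, delta):
    """Return index pairs (i, j), i < j, whose prefixes share at least one token."""
    index = defaultdict(list)  # token -> ids of sets with this token in their prefix
    candidates = set()
    for rid, x in enumerate(sets):
        for tok in prefix(x, delta):
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)
    return candidates
```

Only the candidate pairs returned here need to be verified with the actual similarity function; pairs with disjoint prefixes cannot reach the threshold.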

Prefix Filter Framework: the pruning power is limited! Two dissimilar sets become a candidate pair if they share just 1 element in their prefixes.

Partition-based Framework. [Figure: the list of all elements in order (universe); each set, RID 1 to 4, is split into subsets.] Sim(X, Y) ≥ δ only if Subsets(X) ∩ Subsets(Y) ≠ ∅.

Partition-based Framework What is the minimum number of partitions that can guarantee completeness?

The number of partitions. Intuition: (1) Deduce an overlap lower bound based on the similarity function and the threshold: $\mathrm{Sim}(X, Y) \ge \delta \rightarrow |X \cap Y| \ge m$. (2) Partition them into $m+1$ subsets. Then two similar sets must share at least 1 subset.
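The pigeonhole argument behind this slide can be sketched as follows. The slide states the bound in terms of overlap; the sketch below instead assumes m is used as an upper bound on the number of mismatching elements |X Δ Y|, which for Jaccard follows from Jaccard(X, Y) ≥ δ ⟹ |X Δ Y| ≤ (1 − δ)/(1 + δ) · (|X| + |Y|). With at most m mismatches spread over m + 1 partitions, at least one partition holds identical subsets. The modulo-based partitioning of integer tokens and all function names are illustrative assumptions, not the paper's scheme.

```python
import math


def mismatch_bound(size_x, size_y, delta):
    """Upper bound on |X Δ Y| when Jaccard(X, Y) >= delta (assumed form of m)."""
    # A small epsilon guards against a float result landing just below an integer.
    return math.floor((1 - delta) / (1 + delta) * (size_x + size_y) + 1e-9)


def partition(x, parts):
    """Split integer tokens into `parts` disjoint subsets by token modulo `parts`."""
    subsets = [set() for _ in range(parts)]
    for tok in x:
        subsets[tok % parts].add(tok)
    return [frozenset(s) for s in subsets]


def share_a_subset(x, y, delta):
    """Filter test: do X and Y agree exactly on at least one of the m + 1 partitions?"""
    m = mismatch_bound(len(x), len(y), delta)
    px, py = partition(x, m + 1), partition(y, m + 1)
    # Two empty subsets in the same partition also count as equal.
    return any(a == b for a, b in zip(px, py))
```

For X = {1, 2, 3, 4}, Y = {1, 2, 3, 4, 5} and δ = 0.73, the bound is m = 1, so the tokens are split into 2 partitions and the two sets agree on the even tokens {2, 4}.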

Element Skew Problem. Some subsets have a limited number of elements. The ‘empty’ subsets yield quadratic candidates. Solution: add some flexibility. Skip the subsets with fewer elements; select more signatures from subsets with more elements to guarantee completeness.

Signatures: 1-deletion neighborhoods. Given a non-empty set Z, its 1-deletion neighborhoods are its subsets of size |Z| − 1, denoted del(Z). [Figure: an example set Z and del(Z).]

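Using 1-deletion neighborhoods: $U \ne V,\ V \notin \mathrm{del}(U),\ U \notin \mathrm{del}(V) \rightarrow |U \,\Delta\, V| \ge 2$. [Figure: an example of U, del(U), V, del(V) with $|U \,\Delta\, V| = 2$.] Skipping x subsets and selecting 1-deletion neighborhoods from another x subsets still guarantees completeness!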
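A minimal sketch of 1-deletion neighborhoods and the matching test they enable. The function names are hypothetical. The check mirrors the implication on the slide: if U and V are different and neither lies in the other's 1-deletion neighborhood, they differ in at least 2 elements.

```python
def deletion_neighborhood(z):
    """All subsets of Z of size |Z| - 1, i.e. del(Z)."""
    z = frozenset(z)
    return {z - {e} for e in z}


def within_one_mismatch(u, v):
    """True iff U == V, V is in del(U), or U is in del(V), i.e. |U Δ V| <= 1."""
    u, v = frozenset(u), frozenset(v)
    return u == v or v in deletion_neighborhood(u) or u in deletion_neighborhood(v)
```

For instance, {1, 2, 3} and {1, 2} match because {1, 2} is in del({1, 2, 3}), while {1, 2, 3} and {1, 2, 4} do not match: their symmetric difference {3, 4} has size 2.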

Optimal Allocation Strategy. $v_i$ = 0: skip the i-th subset; 1: only use the i-th subset as signature; 2: use both the i-th subset and its 1-deletions. Constraint: $\sum_{i=1}^{m+1} v_i = m+1$. Objective: minimize $\sum_{i=1}^{m+1} c^i_{v_i}$, where $c^i_0 = 0$, $c^i_1$ is the # of sets sharing the i-th subset, and $c^i_2$ is the # of sets sharing (subset or 1-deletion) signatures.
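The allocation problem has the shape of a small knapsack, so one way to solve it is a dynamic program over the subsets and the allocation used so far. This is a sketch under assumptions rather than the paper's implementation: the cost triples are taken as given inputs (in practice they would come from inverted-index list lengths), and the name optimal_allocation is hypothetical.

```python
def optimal_allocation(costs, budget):
    """Choose v_i in {0, 1, 2} with sum(v_i) == budget minimizing sum(costs[i][v_i]).

    costs[i] is an assumed triple (c0, c1, c2) for the i-th subset, with c0 == 0.
    Returns (minimum total cost, list of v_i).
    """
    inf = float("inf")
    n = len(costs)
    # best[i][b]: minimum cost over the first i subsets using total allocation b.
    best = [[inf] * (budget + 1) for _ in range(n + 1)]
    choice = [[None] * (budget + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(1, n + 1):
        for b in range(budget + 1):
            for v in (0, 1, 2):
                if v <= b and best[i - 1][b - v] + costs[i - 1][v] < best[i][b]:
                    best[i][b] = best[i - 1][b - v] + costs[i - 1][v]
                    choice[i][b] = v
    if best[n][budget] == inf:
        raise ValueError("no allocation reaches the requested budget")
    # Walk the recorded choices backwards to recover the allocation.
    alloc, b = [], budget
    for i in range(n, 0, -1):
        v = choice[i][b]
        alloc.append(v)
        b -= v
    alloc.reverse()
    return best[n][budget], alloc
```

With m + 1 subsets and budget m + 1, the table has O(m²) cells and each cell tries three choices, so one allocation costs O(m²) time.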

Time complexity. Dynamic programming. Optimal: minimizes the # of candidates, with $O(s^2)$ time complexity as $m = O(s)$, where s is the set size. Each set is partitioned $s - \delta s + 1 = O(s)$ times, so the allocation time complexity is $O(s^3)$ for each set. Next we reduce it to $O(s \log s)$.

Greedy Method for Allocation Selection. Heap-based method; 2-approximation algorithm. Time complexity: $O(s^2) \rightarrow O(s \log s)$.

Adaptive Grouping. [Figure: size axis δs, δs+1, ..., s−2, s−1, s for a set of size s.]

Adaptive Grouping. The k-th group includes all the sets with size within $[\frac{l_{min}}{\alpha^{k-1}}, \frac{l_{min}}{\alpha^{k}})$, where $\alpha \in [\frac{1}{2}, 1]$. The size ranges become broader and broader. The number of partitionings is bounded by $\log_\alpha \delta + 1 = O(1)$ for any set. The time complexity is $O(s \log s \log_\alpha \delta) = O(s \log s)$.
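The grouping rule can be illustrated by computing the group index of a given set size. This is a minimal sketch, assuming l_min is the smallest set size in the collection and that α is strictly below 1 (with α = 1 every group would be empty); the function name group_index is hypothetical. Exact rational arithmetic avoids floating-point trouble at the group boundaries.

```python
from fractions import Fraction


def group_index(size, l_min, alpha):
    """1-based k with l_min / alpha**(k-1) <= size < l_min / alpha**k."""
    alpha = Fraction(alpha)
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1 in this sketch")
    if size < l_min:
        raise ValueError("size must be at least l_min")
    k = 1
    upper = Fraction(l_min) / alpha  # exclusive upper end of group 1
    while size >= upper:
        upper /= alpha  # move to the next, wider group
        k += 1
    return k
```

With l_min = 10 and α = 1/2, the groups cover sizes [10, 20), [20, 40), [40, 80), and so on, so each group's size range is twice as wide as the previous one.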

Experiments. Datasets: [table not preserved]. State-of-the-art methods: ppjoin, adaptjoin. Setup: C++, GCC 4.8.2 with -O3; 24 Intel Xeon X5670 2.93GHz; 64 GB memory.

Adaptive Grouping. Observations: Greedy and Optimal largely reduced the # of candidates and the elapsed time compared to Framework. Greedy has almost the same # of candidates as Optimal. Greedy outperformed Optimal.

Greedy Selection w/o Adaptive Grouping. Observations: Greedy outperformed Optimal. Greedy and Optimal are not competitive with Framework without the adaptive grouping. This is because the adaptive grouping technique reduces the number of partitionings from O(s) to O(1) for each set.

Comparison with the State of the Art. [Figures: the elapsed time; the number of candidates.]

Scalability and R-S Join

Conclusion: Partition-based Framework; 1-Deletion Neighborhood; Optimal Allocation Algorithm; 2-approximation Greedy Algorithm; Adaptive Grouping Mechanism.

Project homepage: http://people.csail.mit.edu/dongdeng/projects/setjoin Thank you! Q & A