Weighted Exact Set Similarity Join

Weighted Exact Set Similarity Join
Dongwon Lee, The Pennsylvania State University (dongwon@psu.edu)
Wisconsin DB Seminar, 2009

Set Similarity Join
Def. Set Similarity Join (SSJoin): between collections A and B, find X pairs of objects whose similarity > t.
- If X = "MOST" → Approximate SSJoin
- If X = "ALL" → Exact SSJoin
[Figure: collections A and B of sets, e.g., {Lake, Monona, Wisc, Dane, County} and {University, Mendota, Wisc, Dane, …}, connected by edges labeled with pairwise similarities 0.7, 0.5, 0.4, 0.2, 0.9, 0.1.]

Set Similarity Join: Weighted vs. Unweighted
- Weighting quantifies the relative importance of each token; e.g., "Microsoft" is more important than "Corp."
- How to assign meaningful weights to tokens is an important problem in itself; it is not discussed further here.

Set Similarity Join
- Approximate SSJoin: allows some false positives/negatives; e.g., LSH as a solution.
- Exact SSJoin: does not allow any false positives/negatives; needs to be scalable.
- Weighted + Exact SSJoin will simply be called "WESSJoin".

              unweighted   weighted
  exact       UESSJoin     WESSJoin
  approx.     UASSJoin     WASSJoin

Applications of WESSJoin
- Entity resolution
- Web document genre classification: find all pairs of documents with similar contents
- Query refinement for web search: for a query, find another query with similar search results
- Movie recommendation: identify users who have similar movie tastes w.r.t. the movies they rented
→ Focus on string data represented as a SET, e.g., a document, web page, or record

Research Issues
- Why not express WESSJoin in SQL with the join predicate as a UDF? That forces a Cartesian product followed by UDF processing → inefficient evaluation.
- Special handling for WESSJoin is needed:
  - Scalability
  - Support for diverse similarity (or distance) functions, e.g., Overlap, Jaccard, Cosine vs. Edit distance, …
  - Support for diverse computation models, e.g., threshold vs. top-k

Similarity/Distance Functions
- Jaccard coefficient: J(x,y) = |x ∩ y| / |x ∪ y|
- Overlap similarity: O(x,y) = |x ∩ y|
- Cosine similarity: C(x,y) = |x ∩ y| / sqrt(|x| · |y|)
- Hamming distance: H(x,y) = |x| + |y| - 2·|x ∩ y| (size of the symmetric difference)
- Levenshtein distance L(x,y): min # of edit operations needed to transform x into y
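To make these definitions concrete, here is a minimal Python sketch of the set-based measures plus a textbook dynamic program for Levenshtein distance (all function names are ours, not from the slides):

```python
from math import sqrt

def jaccard(x, y):
    """J(x,y) = |x ∩ y| / |x ∪ y|."""
    return len(x & y) / len(x | y) if x or y else 0.0

def overlap(x, y):
    """O(x,y) = |x ∩ y|."""
    return len(x & y)

def cosine(x, y):
    """C(x,y) = |x ∩ y| / sqrt(|x| * |y|)."""
    return len(x & y) / sqrt(len(x) * len(y)) if x and y else 0.0

def hamming(x, y):
    """H(x,y) = |x| + |y| - 2|x ∩ y|, i.e. the size of the symmetric difference."""
    return len(x ^ y)

def levenshtein(s, t):
    """L(s,t): min # of single-character edits (insert/delete/substitute) turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete from s
                           cur[j - 1] + 1,             # insert into s
                           prev[j - 1] + (cs != ct)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
print(jaccard(x, y), overlap(x, y), cosine(x, y), hamming(x, y))   # 0.4 2 0.577... 3
print(levenshtein("Monona", "Mendota"))                            # 3
```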

Properties of sim()
Similarity thresholds can be rewritten into one another equivalently:
- J(x,y) > t  ⇔  O(x,y) > t / (1 + t) · (|x| + |y|)
- O(x,y) > t  ⇔  H(x,y) < |x| + |y| - 2t
- C(x,y) > t  ⇔  O(x,y) > t · sqrt(|x| · |y|)
E.g., x: {Lake, Mendota, Monona}, y: {Wisc, Dane, Mendota, Lake}:
  J(x,y) > 0.5  ⇔  O(x,y) > 0.5/1.5 · 7 ≈ 2.3
Set representation: k-gram, word, phrase, …
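A quick sanity check of the first equivalence on the slide's example (plain Python; variable names are ours):

```python
# Check the Jaccard-to-overlap threshold conversion on the slide's example.
x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
t = 0.5
o_threshold = t / (1 + t) * (len(x) + len(y))   # 0.5/1.5 * 7 ≈ 2.33
o = len(x & y)                                  # overlap = 2
j = len(x & y) / len(x | y)                     # Jaccard = 2/5 = 0.4
assert (j > t) == (o > o_threshold)             # both sides are False here
print(o_threshold, o, j)
```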

Naïve Solution
- All pair-wise comparisons between A and B: a nested loop with |A|·|B| comparisons.
- Each sim() evaluation may itself be costly, e.g., the Generalized Jaccard similarity costs O(|x|^3).

  For x in A:
    For y in B:
      If sim(x,y) > t, return (x,y);

  (A, B: tables; x, y: records represented as sets)
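A directly runnable version of the nested-loop join above, using the example collections from the next slide (a minimal sketch; names are ours):

```python
def naive_ssjoin(A, B, sim, t):
    """Exhaustively compare every x in A with every y in B; |A|*|B| sim() calls."""
    results = []
    for id_x, x in A.items():
        for id_y, y in B.items():
            if sim(x, y) > t:
                results.append((id_x, id_y))
    return results

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

overlap = lambda x, y: len(x & y)
print(naive_ssjoin(A, B, overlap, 2))   # O(x,y) > 2: [(2, 6), (3, 6)]
```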

Naïve Solution Example
A:  ID 1 {Lake, Mendota}   ID 2 {Lake, Monona, Area}   ID 3 {Lake, Mendota, Monona, Dane}
B:  ID 4 {Lake, Monona, University}   ID 5 {Monona, Research, Area}   ID 6 {Lake, Mendota, Monona, Area}
Which pairs satisfy O(x,y) > 2?

  O(x,y)   ID=4   ID=5   ID=6
  ID=1      1      0      2
  ID=2      2      2      3
  ID=3      2      1      3

Answer: (2, 6) and (3, 6).

Naïve Solution Example
Same A and B as above. Which pairs satisfy J(x,y) > 0.6?

  J(x,y)   ID=4   ID=5   ID=6
  ID=1     0.25   0      0.5
  ID=2     0.5    0.5    0.75
  ID=3     0.4    0.17   0.6

Answer: (2, 6).

2-Step Framework
- Step 1, "Blocking": using an index / heuristics / filtering / etc., reduce the # of candidates to compare.
- Step 2: evaluate sim() only within the candidate sets → O(|A|·|C|) comparisons, with |C| << |B|.

  For x in A:
    Using Foo, find a candidate set C in B
    For y in C:
      If sim(x,y) > t, return (x,y);
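The framework as a reusable skeleton, a minimal sketch in which find_candidates stands in for whatever "Foo" is plugged in (all names are ours):

```python
def filter_verify_join(A, B, find_candidates, sim, t):
    """Generic 2-step SSJoin: cheap blocking via find_candidates, then exact verification."""
    results = []
    for id_x, x in A.items():
        for id_y in find_candidates(x, B):   # Step 1: candidate generation ("Foo")
            if sim(x, B[id_y]) > t:          # Step 2: exact sim() on candidates only
                results.append((id_x, id_y))
    return results

# Trivial instantiation of "Foo" (no filtering at all): falls back to the naive join.
no_filter = lambda x, B: B.keys()
```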

Variants for "Foo"
"Foo": how to identify the candidate set C; it should be fast and accurate (no false positives/negatives).
Many variants for "Foo":
- Inverted index [Sarawagi et al, SIGMOD 04]
- Size filtering [Arasu et al, VLDB 06]
- Prefix index [Chaudhuri et al, ICDE 06]
- Prefix + inverted index [Bayardo et al, WWW 07]
- Bound filtering [On et al, ICDE 07]
- Position index [Xiao et al, WWW 08]

Inverted Index [Sarawagi et al, SIGMOD 04]
A:  ID 1 {Lake, Mendota}   ID 2 {Lake, Monona, Area}   ID 3 {Lake, Mendota, Monona, Dane}
B:  ID 4 {Lake, Monona, University}   ID 5 {Monona, Research, Area}   ID 6 {Lake, Mendota, Monona, Area}
Inverted index (IDX) for A:
  Area → 2;  Dane → 3;  Lake → 1, 2, 3;  Mendota → 1, 3;  Monona → 2, 3
Inverted index (IDX) for B:
  Area → 5, 6;  Lake → 4, 6;  Mendota → 6;  Monona → 4, 5, 6;  Research → 5;  University → 4

Inverted Index [Sarawagi et al, SIGMOD 04]
(Same A, B, and inverted index for B as above.)

  For x in A:
    Using IDX, find a candidate set C in B
    For y in C:
      If sim(x,y) > t, return (x,y);

Probing with x = ID 1: {Lake, Mendota}:
  Lake → {4, 6}, Mendota → {6}
  Candidate set C = {4, 6} ∪ {6} = {4, 6}

Inverted Index [Sarawagi et al, SIGMOD 04]
(Same A, B, and inverted index for B as above.)
Probing with x = ID 1: {Lake, Mendota}, and counting how often each candidate appears in the merged posting lists:
  ID 4: frequency 1    ID 6: frequency 2
The frequency of a candidate equals its overlap with x, so for O(x,y) > 2 no candidate qualifies for this x.
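A minimal sketch of this count-based probing (names are ours): build the inverted index for B, then for each x merge the posting lists of its tokens and count occurrences per candidate id; for the overlap measure, that count is already the exact similarity.

```python
from collections import defaultdict

def build_inverted_index(B):
    """Map each token to the list of record ids in B that contain it."""
    idx = defaultdict(list)
    for rid, rec in B.items():
        for token in rec:
            idx[token].append(rid)
    return idx

def overlap_join_inverted(A, B, t):
    """Return pairs (id_x, id_y) with |x ∩ y| > t using posting-list counting."""
    idx = build_inverted_index(B)
    results = []
    for id_x, x in A.items():
        counts = defaultdict(int)
        for token in x:                        # merge the posting lists of x's tokens
            for id_y in idx.get(token, []):
                counts[id_y] += 1              # count = overlap of x with candidate y
        results.extend((id_x, id_y) for id_y, c in counts.items() if c > t)
    return results

A = {1: {"Lake", "Mendota"}, 2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"}, 5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}
print(overlap_join_inverted(A, B, 2))          # [(2, 6), (3, 6)]
```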

Size Filtering [Arasu et al, VLDB 06]
Idea: build an index on the sizes of the input sets.
- Jaccard coefficient: J(x,y) = |x ∩ y| / |x ∪ y|
- Upper bound for Jaccard: J(x,y) <= min(|x|, |y|) / max(|x|, |y|)
- Bounding |y| w.r.t. |x|: J(x,y) >= t implies t·|x| <= |y| <= |x| / t
- Combining the two: only sets y whose size falls in [t·|x|, |x| / t] can join with x.

Size Filtering [Arasu et al, VLDB 06]
Intuition: once t and |x| are given, |y| is bounded.
E.g., x: {Lake, Mendota}, y: {Lake, Mendota, Monona, Area}. Is J(x,y) > 0.8?
With |x| = 2 and t = 0.8, the bound gives 1.6 <= |y| <= 2.5.
However, |y| = 4, so y cannot satisfy t = 0.8 → no need to compute J(x,y) at all.

Size Filtering [Arasu et al, VLDB 06]
Algorithm:
- For all input sets, build a B-tree index on their sizes.
- Given a set x, use the B-tree to find candidates y in B with t·|x| <= |y| <= |x| / t.

  For x in A:
    Using the size index, find a candidate set C in B
    For y in C:
      If sim(x,y) > t, return (x,y);
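A minimal sketch of size filtering, using a sorted array plus binary search as a stand-in for the B-tree (names are ours; the size window follows the bound t·|x| <= |y| <= |x|/t above):

```python
from bisect import bisect_left, bisect_right
from math import ceil, floor

def jaccard(x, y):
    return len(x & y) / len(x | y)

def size_filter_join(A, B, t):
    """Jaccard join with size filtering: compare x only to y with t*|x| <= |y| <= |x|/t."""
    by_size = sorted(B.items(), key=lambda kv: len(kv[1]))   # stand-in for a B-tree on sizes
    sizes = [len(rec) for _, rec in by_size]
    results = []
    for id_x, x in A.items():
        lo = bisect_left(sizes, ceil(t * len(x)))            # first candidate with |y| >= t*|x|
        hi = bisect_right(sizes, floor(len(x) / t))          # past the last with |y| <= |x|/t
        for id_y, y in by_size[lo:hi]:
            if jaccard(x, y) > t:
                results.append((id_x, id_y))
    return results
```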

Prefix Index [Chaudhuri et al, ICDE 06]
Intuition: if two sets are very similar, then once their tokens are put into a canonical order, their prefixes must share some tokens.
E.g., x: {Dane, University, Monona, Mendota}, y: {Area, Lake, Mendota, Monona, Wisc}. Is O(x,y) > 3?
Ordered: x': {Dane, Mendota, Monona, University}, y': {Area, Lake, Mendota, Monona, Wisc} → compare their prefixes.

Prefix Index [Chaudhuri et al, ICDE 06]
Theorem 1: if there is no overlap between Prefix(x) and Prefix(y), then sim(x,y) < t, so only pairs with a common prefix token need to be verified. The prefix lengths are:
- If sim() = Overlap: |Prefix(x)| = |x| - (t - 1)
- If sim() = Jaccard: |Prefix(x)| = |x| - ceiling(t·|x|) + 1
Algorithm using Theorem 1: given a set x,
- For each token t_x in the prefix of x:
  - Using an index, locate candidates y that contain t_x in their prefix
  - If sim(x,y) > t, return (x,y)
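A minimal sketch of prefix filtering for an overlap threshold, assuming a global rare-tokens-first order as on the following slides (names are ours; prefix length |x| - t + 1 guarantees no missed pairs for O(x,y) >= t, hence also for > t):

```python
from collections import defaultdict

def canonical(rec, order):
    """Sort a record's tokens by the global order (rarest tokens first)."""
    return sorted(rec, key=lambda tok: order[tok])

def prefix(rec_sorted, t):
    """Prefix for an overlap threshold t: the first |rec| - t + 1 ordered tokens."""
    return rec_sorted[:max(len(rec_sorted) - t + 1, 0)]

def build_prefix_index(B_sorted, t):
    """Map each token occurring in some B record's prefix to the ids of those records."""
    idx = defaultdict(set)
    for rid, rec in B_sorted.items():
        for tok in prefix(rec, t):
            idx[tok].add(rid)
    return idx

def prefix_filter_candidates(x_sorted, prefix_index, t):
    """Ids of B records whose indexed prefix shares at least one token with x's prefix."""
    cands = set()
    for tok in prefix(x_sorted, t):
        cands |= prefix_index.get(tok, set())
    return cands
```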

Prefix + Inverted Index [Bayardo et al, WWW 07]
A:  ID 1 {Lake, Mendota}   ID 2 {Lake, Monona, Area}   ID 3 {Lake, Mendota, Monona, Dane}
B:  ID 4 {Lake, Monona, University}   ID 5 {Monona, Research, Area}   ID 6 {Lake, Mendota, Monona, Area}
One inverted index (IDX) over both A and B, with document frequency (DF):

  Token        ID list          DF   Order
  Area         2, 5, 6          3    4
  Dane         3                1    1
  Lake         1, 2, 3, 4, 6    5    6
  Mendota      1, 3, 6          3    5
  Monona       2, 3, 4, 5, 6    5    7
  Research     5                1    2
  University   4                1    3

Create a universal order by putting rare tokens first:
Order: Dane > Research > University > Area > Mendota > Lake > Monona

Prefix + Inverted Index [Bayardo et al, WWW 07]
Records with tokens re-sorted by the universal order (Dane > Research > University > Area > Mendota > Lake > Monona):
Ordered A:  ID 1 {Mendota, Lake}   ID 2 {Area, Lake, Monona}   ID 3 {Dane, Mendota, Lake, Monona}
Ordered B:  ID 4 {University, Lake, Monona}   ID 5 {Research, Area, Monona}   ID 6 {Area, Mendota, Lake, Monona}

Prefix + Inverted Index [Bayardo et al, WWW 07]
Join predicate: O(x,y) > 2, so |Prefix(x)| = |x| - (t - 1) = |x| - 1.
Prefix inverted index for B (only tokens in each record's prefix are indexed):
  Area → 5, 6;  Lake → 4, 6;  Mendota → 6;  Research → 5;  University → 4
Probing with x = ID 1: {Mendota, Lake}, prefix {Mendota}:
  Candidate set C = {6}

Prefix + Inverted Index [Bayardo et al, WWW 07]
(Same ordered records, prefix inverted index, and threshold as above.)
Probing with x = ID 2: {Area, Lake, Monona}, prefix {Area, Lake}:
  Candidate set C = {5, 6} ∪ {4, 6} = {4, 5, 6}

Prefix + Inverted Index [Bayardo et al, WWW 07]
(Same ordered records, prefix inverted index, and threshold as above.)
Probing with x = ID 3: {Dane, Mendota, Lake, Monona}, prefix {Dane, Mendota, Lake}:
  Candidate set C = {6} ∪ {4, 6} = {4, 6}  (Dane appears in no prefix of B)
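Putting the ordering, prefix index, and verification together on the running example; this is an illustrative re-implementation, not the authors' code, and the threshold is expressed as O(x,y) >= t (so O > 2 becomes t = 3):

```python
from collections import defaultdict

def global_order(*collections):
    """Rank tokens by document frequency, rarest first (ties broken alphabetically)."""
    df = defaultdict(int)
    for coll in collections:
        for rec in coll.values():
            for tok in rec:
                df[tok] += 1
    return {tok: rank for rank, tok in enumerate(sorted(df, key=lambda w: (df[w], w)))}

def ordered_prefix(rec, order, t):
    """Tokens sorted rarest-first, truncated to the first |rec| - t + 1 positions."""
    toks = sorted(rec, key=order.get)
    return toks[:max(len(toks) - t + 1, 0)]

def prefix_join(A, B, t):
    """Overlap join O(x, y) >= t via a prefix inverted index plus verification."""
    order = global_order(A, B)
    idx = defaultdict(set)                          # prefix inverted index for B
    for rid, rec in B.items():
        for tok in ordered_prefix(rec, order, t):
            idx[tok].add(rid)
    results = []
    for id_x, x in A.items():
        cands = set()
        for tok in ordered_prefix(x, order, t):     # probe with x's prefix tokens
            cands |= idx.get(tok, set())
        results.extend((id_x, id_y) for id_y in cands if len(x & B[id_y]) >= t)
    return results

A = {1: {"Lake", "Mendota"}, 2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"}, 5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}
print(sorted(prefix_join(A, B, 3)))   # O(x,y) >= 3, i.e. > 2: [(2, 6), (3, 6)]
```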

Position Index [Xiao et al, WWW 08]
Order: Dane > Research > University > Area > Mendota > Lake > Monona
E.g., x: {Dane, Research, Area, Mendota, Lake}, y: {Research, Area, Mendota, Lake, Monona}. Is O(x,y) > 4?
→ |Prefix(x)| = |Prefix(y)| = 5 - (4 - 1) = 2
"Research" is common between the prefixes → (x,y) is a candidate pair → need to compute sim(x,y).

Position Index [Xiao et al, WWW 08]
Same example as above, but now use the position of the matching token ("Research" is the 2nd token of x and the 1st token of y):
  Estimated max overlap = overlap within the prefixes + min(# of unseen tokens in x, # of unseen tokens in y)
                        = 1 + min(3, 4) = 4,
which is not greater than t = 4 → no need to compute sim(x,y) at all!
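A minimal sketch of this positional upper bound for two records already sorted by the global order (names are ours; it follows the slide's prefix length |·| - (t - 1) for the predicate O(x,y) > t):

```python
def positional_prune(x_sorted, y_sorted, t):
    """True if (x, y) can be skipped for the predicate O(x,y) > t (tokens in global order)."""
    prefix_x = x_sorted[:max(len(x_sorted) - (t - 1), 0)]
    prefix_y = y_sorted[:max(len(y_sorted) - (t - 1), 0)]
    common = set(prefix_x) & set(prefix_y)
    if not common:
        return True                                  # plain prefix filtering already prunes
    ix = max(x_sorted.index(tok) for tok in common)  # position of the last shared prefix token
    iy = max(y_sorted.index(tok) for tok in common)
    upper_bound = len(common) + min(len(x_sorted) - ix - 1, len(y_sorted) - iy - 1)
    return upper_bound <= t                          # even the best case cannot exceed t

x = ["Dane", "Research", "Area", "Mendota", "Lake"]
y = ["Research", "Area", "Mendota", "Lake", "Monona"]
print(positional_prune(x, y, 4))   # 1 + min(3, 4) = 4, not > 4, so True (pruned)
```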

Bound Filtering [On et al, ICDE 07]
Generalized Jaccard (GJ) similarity:
- Two sets x = {a_1, …, a_|x|} and y = {b_1, …, b_|y|}, with an element-level similarity sim(a_i, b_j).
- GJ(x,y) is the normalized weight of the maximum weight bipartite matching M in the bipartite graph (N = x ∪ y, E = x × y):
  GJ(x,y) = ( Σ_{(a_i, b_j) ∈ M} sim(a_i, b_j) ) / ( |x| + |y| - |M| )

Bound Filtering [On et al, ICDE 07] x y 0.7 0.7 0.5 0.5 0.4 0.4 0.2 0.9 0.2 0.9 0.1 0.1 x y M: maximum weight bipartite matching Wisconsin DB Seminar, 2009

Bound Filtering [On et al, ICDE 07]
Issues:
- GJ captures more semantics between two sets than Jaccard, via the weighted bipartite matching.
- But it is more costly to compute: a maximum weight bipartite matching takes O(V^2·E) with Bellman-Ford-based augmentation, or O(V^3) with the Hungarian algorithm.

  For x in A:
    Using Foo, find a candidate set C in B
    For y in C:
      If GJ(x,y) > t, return (x,y);

Bound Filtering [On et al, ICDE 07]
Bipartite matching is expensive because of the constraint that no node in the bipartite graph can have more than one matching edge incident on it. Relax this constraint:
- For each element a_i in x, find the element b_j in y with the highest element-level similarity → S1
- For each element b_j in y, find the element a_i in x with the highest element-level similarity → S2
Complexity drops to a single scan of the element-level similarities, instead of a cubic matching computation.

Bound Filtering [On et al, ICDE 07] x y 0.7 0.7 S1 S1 0.5 0.5 0.4 0.4 0.2 0.9 0.2 0.9 0.1 0.1 x y 0.7 S2 0.5 S2 0.4 0.2 0.9 0.1 x y Wisconsin DB Seminar, 2009

Bound Filtering [On et al, ICDE 07]
Properties:
- The numerator of UB is at least as large as that of GJ.
- The denominator of UB is no larger than that of GJ.
- Similar arguments hold for LB.
Theorem 2: LB <= GJ <= UB

Bound Filtering [On et al, ICDE 07]

  For x in A:
    Using Foo, find a candidate set C in B
    For y in C:
      If GJ(x,y) > t, return (x,y);

Algorithm for each candidate pair (x, y), using LB <= GJ <= UB:
- Compute UB(x,y). If UB(x,y) <= t, then GJ(x,y) <= t → (x,y) is not an answer.
- Else compute LB(x,y). If LB(x,y) > t, then GJ(x,y) > t → (x,y) is an answer.
- Else compute GJ(x,y) exactly.
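A minimal sketch of the bounded verification step. The slides do not spell out the exact UB/LB formulas from On et al., so this uses one concrete realization consistent with them (UB from the union of the best-match edge sets S1 ∪ S2, LB from mutual best matches, both normalized Jaccard-style); all names are ours, element similarities are assumed to lie in [0, 1], and the exact GJ falls back to SciPy's Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gj(sim_matrix):
    """Exact GJ: normalized weight of the maximum weight bipartite matching."""
    rows, cols = linear_sum_assignment(sim_matrix, maximize=True)
    matched_weight = sim_matrix[rows, cols].sum()
    n, m = sim_matrix.shape
    return matched_weight / (n + m - len(rows))

def bounds(sim_matrix):
    """Cheap LB/UB on GJ from per-row (S1) and per-column (S2) best matches; sims in [0, 1]."""
    n, m = sim_matrix.shape
    s1 = {(i, int(sim_matrix[i].argmax())) for i in range(n)}      # best b_j for each a_i
    s2 = {(int(sim_matrix[:, j].argmax()), j) for j in range(m)}   # best a_i for each b_j
    ub_edges, lb_edges = s1 | s2, s1 & s2                          # mutual bests form a valid matching
    ub = sum(sim_matrix[i, j] for i, j in ub_edges) / (n + m - len(ub_edges))
    lb = sum(sim_matrix[i, j] for i, j in lb_edges) / (n + m - len(lb_edges))
    return lb, ub

def verify(sim_matrix, t):
    """Decide GJ > t, running the expensive matching only when the bounds are inconclusive."""
    lb, ub = bounds(sim_matrix)
    if ub <= t:
        return False        # cannot be an answer
    if lb > t:
        return True         # guaranteed answer
    return gj(sim_matrix) > t

S = np.array([[0.9, 0.1], [0.2, 0.7], [0.4, 0.3]])   # element-level sims for |x| = 3, |y| = 2
print(bounds(S), gj(S), verify(S, 0.3))
```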

Takeaways
- WESSJoin finds ALL pairs of sets between two collections whose similarity > t; it is a good abstraction for various problems.
- The 2-step framework is promising: Step 1 reduces the candidates, Step 2 computes similarity only among the candidates.
- Less-researched issues: comparisons among the different WESSJoin methods; WESSJoin + top-k / skyline / MapReduce / etc.

References
[Sarawagi et al, SIGMOD 04] Sunita Sarawagi, Alok Kirpal. Efficient Set Joins on Similarity Predicates. SIGMOD 2004.
[Arasu et al, VLDB 06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient Exact Set-Similarity Joins. VLDB 2006.
[Chaudhuri et al, ICDE 06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
[Bayardo et al, WWW 07] Roberto J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All Pairs Similarity Search. WWW 2007.
[On et al, ICDE 07] Byung-Won On, Nick Koudas, Dongwon Lee, Divesh Srivastava. Group Linkage. ICDE 2007.
[Xiao et al, WWW 08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
Wei Wang. Efficient Exact Similarity Join Algorithms: http://www.cse.unsw.edu.au/~weiw/project/PPJoin-UTS-Oct-2008.pdf
Jeffrey D. Ullman. High-Similarity Algorithms: http://infolab.stanford.edu/~ullman/mining/2009/similarity4.pdf