Top-k Set Similarity Joins. Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang, Univ. of New South Wales, Australia. ICDE ’09. Presented 9 Feb 2011 by Taewhi Lee, based on Chuan Xiao’s presentation slides in ICDE ’09.

Presentation transcript:

Top-k Set Similarity Joins
Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang
Univ. of New South Wales, Australia
ICDE ’09
9 Feb 2011, Taewhi Lee
Based on Chuan Xiao’s presentation slides in ICDE ’09

Outline
• Introduction
• Problem Definition
• Existing Approaches
• Top-k Similarity Join Algorithms
• Experiments

Motivation
• Data Cleaning

University                      City        State        Postal Code
University of New South Wales   Sydney      NSW          2052
University of Sydney            Sydney      NSW          2006
University of Melbourne         Melbourne   Victoria     3010
University of Queensland        Brisbane    Queensland   4072
University of New South Vales   Sydney      NSW          2052

More Applications
• Near-duplicate Web page detection
  – Headline: “Obama Has Busy Final Day Before Taking Office as Bush Says Farewells”
  – New York Times, Jan 19th, 2009
  – iht.com Jan 20,


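(Traditional) Set Similarity Join
• Each record is tokenized into a set
• Given a collection of records, the set similarity join problem is to find all pairs of records ⟨x, y⟩ such that sim(x, y) ≥ t
• Common similarity functions:
  – jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|}$
  – cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}}$
  – dice: $D(x, y) = \frac{2\,|x \cap y|}{|x| + |y|}$
• Example: x = {A,B,C,D,E}, y = {B,C,D,E,F}
  – jaccard: 4/6 ≈ 0.67
  – cosine: 4/5 = 0.8
  – dice: 8/10 = 0.8
• What if t is unknown beforehand?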
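As a quick check of the example values, the three similarity functions can be sketched in Python in their usual set-based form. The function names are illustrative and the sketch assumes non-empty token sets.

```python
import math

def jaccard(x: set, y: set) -> float:
    # |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def cosine(x: set, y: set) -> float:
    # |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / math.sqrt(len(x) * len(y))

def dice(x: set, y: set) -> float:
    # 2 * |x ∩ y| / (|x| + |y|)
    return 2 * len(x & y) / (len(x) + len(y))

x = set("ABCDE")
y = set("BCDEF")
print(round(jaccard(x, y), 2), round(cosine(x, y), 2), round(dice(x, y), 2))
```

The two records share four tokens (B, C, D, E), which gives 4/6, 4/5 and 8/10 respectively.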

What If t is Unknown Beforehand?
• Example – using the jaccard similarity function
  – w = {A, B, C, D, E}
  – x = {A, B, C, E, F}
  – y = {B, C, D, E, F}
  – z = {B, C, F, G, H}
  – If t = 0.7 → no results
  – If t = 0.4 → ⟨w, x⟩, ⟨w, y⟩, ⟨x, y⟩, ⟨x, z⟩, ⟨y, z⟩ (too many results and long running time)
• Return the top-k results ranked by their similarity values
  – if k = 1 → a single pair with the highest similarity (4/6 ≈ 0.67 here)
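The sensitivity to t can be reproduced with a brute-force threshold join over the four example records. This is only a sketch for checking the example: `threshold_join` is a hypothetical helper that compares every pair, not one of the algorithms discussed in the talk.

```python
from itertools import combinations

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)

def threshold_join(records: dict, t: float) -> list:
    """Return (id_a, id_b, similarity) for every pair with jaccard >= t."""
    results = []
    for (a, x), (b, y) in combinations(records.items(), 2):
        sim = jaccard(x, y)
        if sim >= t:
            results.append((a, b, sim))
    return results

records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}

print(threshold_join(records, 0.7))       # nothing reaches 0.7
print(len(threshold_join(records, 0.4)))  # five of the six pairs qualify
```

Three pairs tie at 4/6 and two more sit at 3/7, so a small change in t moves the result from empty to nearly every pair.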

Top-k Set Similarity Join
• Return top-k pairs of records, ranked by similarity scores
• Advantages over traditional similarity join
  – Without specifying a threshold
  – Output results progressively → benefit interactive applications
  – Produce most meaningful results under limited resources/time constraints
• Can be stopped at any time, but still guarantee sim(output results) ≥ sim(unseen pairs)


Straightforward Solution
• Start from a certain t, repeat the following steps:
  – answer traditional sim-join with t as threshold
  – if # of results ≥ k, stop and output k results with highest sim
  – else, decrease t
• Example (jaccard, k = 2)
  – w = {A, B, C, E}
  – x = {A, B, C, E, F}
  – y = {B, C, D, E, F}
  – z = {B, C, F, G, H}
  – t = 0.9 → no result
  – t = 0.8 → ⟨w, x⟩
  – t = 0.7 → ⟨w, x⟩ (results don’t change!)
  – t = 0.6 → ⟨w, x⟩, ⟨x, y⟩
• Which thresholds shall we enumerate? 0.8,
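The loop above can be sketched as follows. The step size of 0.1, the starting threshold of 0.9, and the brute-force `threshold_join` standing in for a real sim-join algorithm are all assumptions made for illustration.

```python
from itertools import combinations

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)

def threshold_join(records: dict, t: float) -> list:
    # Stand-in for a traditional sim-join: compares every pair.
    return [(a, b, jaccard(x, y))
            for (a, x), (b, y) in combinations(records.items(), 2)
            if jaccard(x, y) >= t]

def topk_by_decreasing_threshold(records: dict, k: int, start_tenths: int = 9):
    """Re-run the join with t = 0.9, 0.8, ... until at least k pairs appear."""
    tried, results = [], []
    for tenths in range(start_tenths, -1, -1):
        t = tenths / 10
        tried.append(t)
        results = threshold_join(records, t)  # each round starts from scratch
        if len(results) >= k:
            break
    results.sort(key=lambda r: r[2], reverse=True)
    return results[:k], tried

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
top2, tried = topk_by_decreasing_threshold(records, k=2)
print(tried)
print([(a, b, round(s, 2)) for a, b, s in top2])
```

Every round repeats the whole join, and the round at t = 0.7 returns exactly what the round at t = 0.8 did, which is the wasted work the slide points at.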

Naïve and Index-Based Algorithms
• Naïve Algorithm:
  – Compare every pair of objects → O(n²) time complexity
• Index-based Algorithm [Sarawagi et al. SIGMOD04]:
  – Record Set → Index Construction → Candidate Generation → Verification → Result Pairs
  – Inverted lists (token → record_id):
      A → w, x, y
      B → x, z, …
      C → y, z, …
  – Candidate pairs: ⟨w, x⟩, ⟨w, y⟩, ⟨x, y⟩, ⟨x, z⟩, …
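A minimal sketch of the index-based pipeline (index construction, candidate generation, verification) is shown below. It follows the general idea only, not the specific algorithm of Sarawagi et al.; the example records and the name `index_join` are assumptions for illustration.

```python
from collections import defaultdict

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)

def index_join(records: dict, t: float):
    inverted = defaultdict(list)   # token -> ids of records seen so far
    candidates = set()
    for rid, tokens in records.items():
        for tok in sorted(tokens):
            # Candidate generation: any earlier record sharing this token.
            for other in inverted[tok]:
                candidates.add((other, rid))
            # Index construction: add this record to the token's list.
            inverted[tok].append(rid)
    # Verification: compute the real similarity of each candidate pair.
    results = {}
    for a, b in candidates:
        sim = jaccard(records[a], records[b])
        if sim >= t:
            results[(a, b)] = sim
    return results, len(candidates)

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
results, n_candidates = index_join(records, t=0.6)
print(sorted(results), n_candidates)
```

Because every record here contains token B, all six pairs become candidates even though only two pass verification. Indexing every token prunes little when tokens are frequent, which motivates indexing only a prefix.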

Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
• Sort the tokens by a global ordering
  – increasing order of document frequency
• Only need to index the first few tokens (prefix) for each record
• Example: jaccard t = 0.8 → |x ∩ y| ≥ 4 if |x| = |y| = 5
  – x = A B | E F G and y = C D | E F G (tokens sorted, prefix shown before the bar)
  – the prefixes share no token, so upper bound O(x,y) = 3 < 4!
• Must share at least one token in prefix to be a candidate pair
  – For jaccard, prefix length = |x| * (1 – t) + 1 → each t is associated with a prefix length
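The prefix filter can be sketched as below. The prefix length is computed as |x| − ⌈t·|x|⌉ + 1, which is the integer form of the slide's |x|·(1 − t) + 1 and avoids a floating-point pitfall (5 × (1 − 0.8) evaluates to slightly less than 1). Breaking frequency ties alphabetically and the helper names are assumptions.

```python
import math
from collections import Counter

def global_order(records: dict):
    """Sort key: increasing document frequency, ties broken alphabetically."""
    df = Counter(tok for tokens in records.values() for tok in tokens)
    return lambda tok: (df[tok], tok)

def prefix_length(size: int, t: float) -> int:
    # Integer form of size * (1 - t) + 1; the epsilon guards against t * size
    # landing just above an integer because of rounding.
    return size - math.ceil(t * size - 1e-9) + 1

def prefix(tokens, t: float, key) -> list:
    ordered = sorted(tokens, key=key)
    return ordered[:prefix_length(len(ordered), t)]

def is_candidate(x, y, t: float, key) -> bool:
    # A pair survives only if the two prefixes share at least one token.
    return bool(set(prefix(x, t, key)) & set(prefix(y, t, key)))

records = {"x": set("ABEFG"), "y": set("CDEFG")}
key = global_order(records)
print(prefix_length(5, 0.8))
print(prefix(records["x"], 0.8, key), prefix(records["y"], 0.8, key))
print(is_candidate(records["x"], records["y"], 0.8, key))
```

With the rare tokens A, B, C, D sorted first, the two prefixes are disjoint, so the pair is pruned without computing its similarity.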


Necessary Thresholds
• Each prefix is associated with a threshold
  – the maximum possible similarity a record can achieve with other records
[Figure: prefix tokens A, B, C of a record x, each prefix with its threshold t; records x, y, z]
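For jaccard, one way to obtain these thresholds is to invert the prefix-length relation from the prefix filter slide: prefix length p = |x|·(1 − t) + 1 gives t = 1 − (p − 1)/|x| = (|x| − p + 1)/|x|. The sketch below assumes that reading; the function name is illustrative.

```python
def necessary_thresholds(size: int) -> list:
    """Threshold tied to each prefix length p = 1..size of a record (jaccard).

    If the first shared token sits at position p, at most size - p + 1 tokens
    of the record can be shared, so the similarity is at most
    (size - p + 1) / size.
    """
    return [(size - p + 1) / size for p in range(1, size + 1)]

print(necessary_thresholds(5))
print(necessary_thresholds(4))
```

Only these values matter as thresholds for a given record: lowering t between two consecutive values does not change its prefix.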

Event-driven Model
• Problem: repeated invocation of sim-join algorithm
  – t is decreasing → run sim-join algorithm in an incremental way
• Prefix Event
  – Initialize prefix length for each record as 1
  – For each prefix event:
    · Probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into temp results
    · Insert the record into the token’s inverted list (e.g. insert x into A’s inverted list)
    · Extend prefix by one token
  – Maintain prefix events with a max-heap on t
  – Stop when t ≤ k-th temp result’s similarity

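Topk-join - Example (jaccard, k = 2)
• Records (tokens sorted): w = A B C E; x = A B C E F; y = B C D E F; z = B C F G H
• Inverted lists (token → record_id):
    A → w, x
    B → y, z, x, w
    C → y, z
• Temporary results: (w,x) = 0.8, (x,y) = 0.67, (y,z) = 0.43
• Some pairs are verified twice!
• Stop: the next prefix event has t = 0.6 ≤ 2nd temp result’s sim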
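The event-driven loop can be sketched in Python as below. This is a simplified reading of the slides, not the authors' implementation: it assumes jaccard with per-prefix threshold (|x| − p + 1)/|x|, breaks ties between equal thresholds by input order, and uses a plain hash table of verified pairs to record repeats.

```python
import heapq

def jaccard(x, y) -> float:
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def topk_join(records: dict, k: int):
    """records: id -> token list, already sorted by the global ordering."""
    order = {rid: i for i, rid in enumerate(records)}  # tie-break by input order
    # Max-heap on t (negated for heapq); every record starts with prefix length 1.
    events = [(-1.0, order[rid], rid, 1) for rid in records]
    heapq.heapify(events)
    inverted = {}   # token -> ids of records inserted so far
    verified = {}   # pair -> similarity
    repeats = []    # pairs that were identified a second time
    topk = []       # min-heap of (similarity, pair), at most k entries

    while events:
        neg_t, _, rid, p = heapq.heappop(events)
        if len(topk) == k and -neg_t <= topk[0][0]:
            break   # no unseen pair can beat the k-th temp result
        tokens = records[rid]
        token = tokens[p - 1]
        for other in inverted.get(token, []):       # probe
            pair = tuple(sorted((other, rid)))
            if pair in verified:
                repeats.append(pair)
                continue
            sim = verified[pair] = jaccard(records[other], tokens)
            if len(topk) < k:
                heapq.heappush(topk, (sim, pair))
            elif sim > topk[0][0]:
                heapq.heapreplace(topk, (sim, pair))
        inverted.setdefault(token, []).append(rid)   # insert
        if p < len(tokens):                          # extend prefix by one token
            n = len(tokens)
            heapq.heappush(events, (-(n - p) / n, order[rid], rid, p + 1))
    return sorted(topk, reverse=True), inverted, repeats

records = {"w": list("ABCE"), "x": list("ABCEF"),
           "y": list("BCDEF"), "z": list("BCFGH")}
result, inverted, repeats = topk_join(records, k=2)
print([(pair, round(sim, 2)) for sim, pair in result])
print(inverted)
print(repeats)
```

On the four example records this stops at the event with t = 0.6, leaves the inverted lists A → w, x; B → y, z, x, w; C → y, z, and reports (y, z) and (w, x) as the pairs reached a second time.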

Optimizations - Verification
• In the above example, (w,x) and (y,z) have been verified twice
• How to avoid repeated verification?
  – Memorize all verified pairs with a hash table → too much memory consumption
  – Check if this pair will be identified again when it is verified for the first time
  – Keep only those that will be identified again before the algorithm stops
  – Guarantee no pair will be verified twice
• Example: x = A B D E F, y = A C D E F
  – if k-th temp result’s sim = 0.7, the pair won’t be identified again!
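One possible form of the "will this pair be identified again?" check is sketched below. It is an interpretation of the slide, not the paper's exact test: it assumes jaccard thresholds (|x| − p + 1)/|x|, token lists sorted by the same global ordering, and that the algorithm stops once t ≤ the k-th temp result's similarity.

```python
def prefix_threshold(size: int, p: int) -> float:
    # Threshold tied to prefix length p of a record with `size` tokens.
    return (size - p + 1) / size

def will_be_identified_again(x: list, y: list, token: str, kth_sim: float) -> bool:
    """x, y: sorted token lists; token: the shared token that produced the pair."""
    y_tokens = set(y)
    for tok in x[x.index(token) + 1:]:
        if tok in y_tokens:
            # The pair reappears once both records have reached this token,
            # i.e. at the smaller of the two prefix thresholds.
            t = min(prefix_threshold(len(x), x.index(tok) + 1),
                    prefix_threshold(len(y), y.index(tok) + 1))
            # Events with t <= kth_sim are never processed.
            return t > kth_sim
    return False

x, y = list("ABDEF"), list("ACDEF")
print(will_be_identified_again(x, y, "A", kth_sim=0.7))
print(will_be_identified_again(x, y, "A", kth_sim=0.5))
```

For the slide's pair the next shared token is D at position 3 in both records, with threshold 3/5 = 0.6. With a k-th similarity of 0.7 the algorithm stops before reaching it, so the pair need not be remembered.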

Optimizations - Indexing
• How to reduce inverted list size to save memory?
  – t is decreasing → calculate the upper bound of similarity for future probings into inverted lists
  – Don’t insert into inverted list if upper bound ≤ k-th temp result’s similarity
• Example: x = A C D E F, y = B C D E F
  – max. similarity = 4/6 = 0.67
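One way to arrive at the 4/6 on this slide is a positional upper bound: if the first token two records share sits at position i in one and j in the other, the overlap is at most the shorter remaining suffix. The sketch below assumes that reading and jaccard similarity; the function names are hypothetical.

```python
def jaccard_upper_bound(len_x: int, pos_x: int, len_y: int, pos_y: int) -> float:
    """Upper bound on jaccard when the first shared token is at the given
    1-based positions, so nothing before those positions is shared."""
    max_overlap = min(len_x - pos_x + 1, len_y - pos_y + 1)
    return max_overlap / (len_x + len_y - max_overlap)

def should_index(upper_bound: float, kth_sim: float) -> bool:
    # Slide rule: skip the insertion if upper bound <= k-th temp result's similarity.
    return upper_bound > kth_sim

# x = A C D E F and y = B C D E F first meet at token C, position 2 in both.
ub = jaccard_upper_bound(5, 2, 5, 2)
print(round(ub, 2))
print(should_index(ub, kth_sim=0.7), should_index(ub, kth_sim=0.5))
```

When the bound cannot exceed the current k-th similarity, inserting the record into that token's list can never produce a result, so the list entry is skipped.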


Experiment Settings
• Algorithms
  – topk-join
  – pptopk: modified ppjoin [Xiao et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90,
• Measure
  – Compare topk-join and pptopk (candidate size, running time)
  – Output results progressively
• Dataset

dataset                          # of records   avg. record size
DBLP (author, title)             855k           14.0
TREC (author, title, abstract)   348k           130.1
TREC-3GRAM                       348k           868.5
UNIREF-3GRAM (protein seq.)      500k

Experiment Results


Thank You! Any questions or comments?

Related Work
• Index-based approaches
  – S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004
  – C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008
• Prefix-based approaches
  – S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006
  – R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007
  – C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008
• PartEnum
  – A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB,