Top-k Set Similarity Joins Chuan Xiao, Wei Wang, Xuemin Lin and Haichuan Shang University of New South Wales and NICTA.

Presentation transcript:

2 Motivation
- Data Cleaning
    University                    | City      | State      | Postal Code
    University of New South Wales | Sydney    | NSW        | 2052
    University of Sydney          | Sydney    | NSW        | 2006
    University of Melbourne       | Melbourne | Victoria   | 3010
    University of Queensland      | Brisbane  | Queensland | 4072
    University of New South Vales | Sydney    | NSW        | 2052

3 More Applications
- "Obama Has Busy Final Day Before Taking Office as Bush Says Farewells"
    New York Times, Jan 19th, 2009
    iht.com, Jan 20, 2009

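4 (Traditional) Set Similarity Join
- Each record is tokenized into a set.
- Given a collection of records, the set similarity join problem is to find all pairs of records <x, y> such that $sim(x, y) \ge t$.
- Common similarity functions:
    jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|}$
    cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}}$
    dice: $D(x, y) = \frac{2\,|x \cap y|}{|x| + |y|}$
- Example: x = {A,B,C,D,E}, y = {B,C,D,E,F} gives jaccard = 4/6 = 0.67, cosine = 4/5 = 0.8, dice = 8/10 = 0.8.
- What if t is unknown beforehand?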
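The three similarity functions on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes records are non-empty Python sets; the function names are illustrative and do not come from the slides.

```python
from math import sqrt

def jaccard(x: set, y: set) -> float:
    # |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def cosine(x: set, y: set) -> float:
    # |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / sqrt(len(x) * len(y))

def dice(x: set, y: set) -> float:
    # 2 * |x ∩ y| / (|x| + |y|)
    return 2 * len(x & y) / (len(x) + len(y))

# The example records from the slide.
x = set("ABCDE")
y = set("BCDEF")
print(round(jaccard(x, y), 2), round(cosine(x, y), 2), round(dice(x, y), 2))
```

For the slide's records the overlap is 4, the union has 6 tokens, and both records have 5 tokens, which gives the three values quoted above.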

5 What if t is unknown beforehand?
- Example, using the jaccard similarity function:
    w = {A, B, C, D, E}
    x = {A, B, C, E, F}
    y = {B, C, D, E, F}
    z = {B, C, F, G, H}
- If t = 0.7: no results.
- If t = 0.4: <w, x>, <w, y>, <x, y>, <x, z>, <y, z>, i.e., too many results and a long running time.
- Instead, return the top-k results ranked by their similarity values; for k = 1, a single pair with the highest similarity (4/6 = 0.67) is returned.

6 Top-k Set Similarity Join
- Return the top-k pairs of records, ranked by similarity score.
- Advantages over the traditional similarity join:
    - No threshold needs to be specified.
    - Results are output progressively, which benefits interactive applications.
    - The most meaningful results are produced under limited resources or time constraints.
    - The join can be stopped at any time and still guarantees $sim(\text{output results}) \ge sim(\text{unseen pairs})$.
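To make the problem statement concrete, here is a brute-force sketch of a top-k join that scores every pair and keeps the k best. It is only a reference point, not the algorithm of the talk; the function name `naive_topk` and the dictionary layout are assumptions, and the records are the ones from the threshold example (slide 5).

```python
import heapq
from itertools import combinations

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)

def naive_topk(records: dict, k: int) -> list:
    """Score all pairs (quadratic cost) and return the k highest as (sim, id1, id2)."""
    scored = (
        (jaccard(records[a], records[b]), a, b)
        for a, b in combinations(sorted(records), 2)
    )
    return heapq.nlargest(k, scored)

# Records from the jaccard example on slide 5.
records = {
    "w": set("ABCDE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
for sim, a, b in naive_topk(records, 3):
    print(a, b, round(sim, 2))
```

No threshold is supplied: the caller only chooses k, and ties (three pairs share the similarity 4/6 here) are broken arbitrarily by record id.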

7 Straightforward Solution
- Start from a certain t and repeat the following steps:
    - Answer a traditional sim-join with t as the threshold.
    - If the number of results ≥ k, stop and output the k results with the highest similarity.
    - Otherwise, decrease t.
- Example (jaccard, k = 2):
    w = {A, B, C, E}
    x = {A, B, C, E, F}
    y = {B, C, D, E, F}
    z = {B, C, F, G, H}
    t = 0.9: no result
    t = 0.8: <w, x>
    t = 0.7: <w, x> (results don't change!)
    t = 0.6: <w, x>, <x, y>
- Which thresholds shall we enumerate? 0.8, 0.6
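The loop on this slide can be sketched as follows. The brute-force `threshold_join` is a stand-in for any traditional similarity join, and the starting threshold 0.9 and step 0.1 are assumptions taken from the slide's example rather than fixed parts of the method.

```python
from itertools import combinations

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)

def threshold_join(records: dict, t: float) -> list:
    """Stand-in for a traditional sim-join: all pairs with jaccard >= t."""
    out = []
    for a, b in combinations(sorted(records), 2):
        sim = jaccard(records[a], records[b])
        if sim >= t:
            out.append((sim, a, b))
    return out

def topk_by_decreasing_threshold(records: dict, k: int,
                                 start: float = 0.9, step: float = 0.1) -> list:
    t = start
    while True:
        results = threshold_join(records, t)   # the whole join is re-run each round
        if len(results) >= k or t <= 0.0:
            return sorted(results, reverse=True)[:k]
        t = max(0.0, round(t - step, 10))      # rounding avoids float drift

records = {
    "w": set("ABCE"),
    "x": set("ABCEF"),
    "y": set("BCDEF"),
    "z": set("BCFGH"),
}
print(topk_by_decreasing_threshold(records, 2))
```

On the slide's data the join runs four times (t = 0.9, 0.8, 0.7, 0.6) although only 0.8 and 0.6 change the result, which is the inefficiency the slide points out.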

8 Naïve and Index-Based Algorithms
- Naïve algorithm: compare every pair of objects -> $O(n^2)$ time complexity.
- Index-based algorithm [Sarawagi et al. SIGMOD04]:
    Record Set -> Index Construction -> Candidate Generation -> Verification -> Result Pairs
- Inverted lists:
    token | record_id
    A     | w, x, y
    B     | x, z, …
    C     | y, z, …
- Candidate pairs: <w, x>, <w, y>, <x, y>, <x, z>, …
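Index construction and candidate generation can be sketched in one pass over the records. The record contents below are hypothetical, chosen only so that the resulting inverted lists match the ones shown on the slide; verification of the candidates is left out.

```python
from collections import defaultdict

def candidate_pairs(records: dict):
    """Build token -> record ids lists and collect every pair sharing a token."""
    index = defaultdict(list)
    candidates = set()
    for rid in sorted(records):
        for token in records[rid]:
            # Every record already in this token's list shares the token with rid.
            for other in index[token]:
                candidates.add((other, rid))
            index[token].append(rid)
    return dict(index), candidates

# Hypothetical records that reproduce the slide's inverted lists.
records = {"w": {"A"}, "x": {"A", "B"}, "y": {"A", "C"}, "z": {"B", "C"}}
index, candidates = candidate_pairs(records)
print(index)
print(sorted(candidates))
```

Only pairs that share at least one token are ever generated, so pairs with no overlap (here <w, z>) are never compared.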

9 Prefix Filter [Chaudhuri et al. ICDE06, Bayardo et al. WWW07]
- Sort the tokens by a global ordering, e.g., increasing order of document frequency.
- Only the first few tokens (the prefix) of each record need to be indexed.
- Example: jaccard, t = 0.8 implies $|x \cap y| \ge 4$ if |x| = |y| = 5.
    x = [A, B | E, F, G]   (sorted; prefix = A, B)
    y = [C, D | E, F, G]   (sorted; prefix = C, D)
    The prefixes share no token, so the overlap upper bound is O(x, y) = 3 < 4.
- Two records must share at least one token in their prefixes to be a candidate pair.
- For jaccard, prefix length = |x| * (1 - t) + 1, so each t is associated with a prefix length.
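A small sketch of the prefix filter for jaccard, using the prefix-length formula from this slide. Rounding the product down to an integer and using exact fractions to avoid floating-point error are assumptions of this sketch; the records are assumed to be token lists already sorted by the global ordering.

```python
import math
from fractions import Fraction

def prefix_length(size: int, t) -> int:
    """|x| * (1 - t) + 1, with the product rounded down (assumption)."""
    t = Fraction(str(t))
    return math.floor(size * (1 - t)) + 1

def survives_prefix_filter(x: list, y: list, t) -> bool:
    """True if the two sorted records share a token within their prefixes."""
    px = x[:prefix_length(len(x), t)]
    py = y[:prefix_length(len(y), t)]
    return bool(set(px) & set(py))

# The slide's example: prefixes [A, B] and [C, D] at t = 0.8.
x = list("ABEFG")
y = list("CDEFG")
print(prefix_length(5, 0.8), survives_prefix_filter(x, y, 0.8))
```

At t = 0.8 the pair is pruned without computing its similarity; at a lower threshold such as 0.4 the prefixes grow to four tokens, overlap on E, and the pair becomes a candidate.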

10 Necessary Thresholds
- Each prefix is associated with a threshold, i.e., the maximum possible similarity a record can achieve with other records.
- Which thresholds shall we enumerate? All the thresholds with which prefixes are associated.
- These are the necessary thresholds: for any two different necessary thresholds, there exists a database instance on which the results change.
- To move to the next threshold, extend the prefix by one token and consider the new t.
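One way to enumerate these thresholds for jaccard is to invert the prefix-length formula of the prefix-filter slide: a prefix of length p on a record of size |x| corresponds to t = 1 - (p - 1) / |x|. The sketch below assumes that inversion; the function name is illustrative.

```python
from fractions import Fraction

def necessary_thresholds(size: int) -> list:
    """Thresholds associated with prefix lengths 1..size of one record (jaccard)."""
    return [1 - Fraction(p - 1, size) for p in range(1, size + 1)]

print([str(t) for t in necessary_thresholds(5)])
```

For a record of five tokens this yields 1, 4/5, 3/5, 2/5 and 1/5, so extending the prefix by one token lowers the associated threshold by 1/|x|.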

11 Event-driven Model
- Problem: repeated invocation of the sim-join algorithm.
- Since t is decreasing, run the sim-join algorithm in an incremental way.
- Prefix events:
    - Initialize the prefix length of each record to 1.
    - For each prefix event of a record x:
        - probe the inverted list of the token for candidate pairs, verify the candidate pairs, and insert them into the temporary results;
        - insert x into the token's inverted list (A's list in the example);
        - extend the prefix by one token.
    - Maintain the prefix events in a max-heap on t.
    - Stop when t ≤ the k-th temporary result's similarity.

12 topk-join - Example (jaccard, k = 2)
- Records:
    w = A B C E
    x = A B C E F
    y = B C D E F
    z = B C F G H
- Inverted lists:
    token | record_id
    A     | w, x
    B     | y, z, x, w
    C     | y, z
- Temporary results: (w,x) = 0.8, (x,y) = 0.67, (y,z) = 0.43
- Some pairs are verified twice!
- Stop when t = 0.6 ≤ the 2nd temporary result's similarity.
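The example can be replayed with a compact sketch of the event-driven loop. It is a simplification under stated assumptions, not the authors' implementation: each record is a token list already sorted by the global ordering, the event bound for jaccard is taken as 1 - (p - 1) / |x| for prefix length p, and every verified pair is remembered in a dictionary (the memory-hungry option) so that nothing is verified twice.

```python
import heapq
from collections import defaultdict
from fractions import Fraction

def jaccard(x, y) -> Fraction:
    sx, sy = set(x), set(y)
    return Fraction(len(sx & sy), len(sx | sy))

def topk_join(records: dict, k: int) -> list:
    index = defaultdict(list)   # token -> ids whose prefix already contains it
    results = {}                # verified pair -> similarity
    # heapq is a min-heap, so bounds are negated to pop the largest t first.
    events = [(-Fraction(1), rid, 0) for rid in sorted(records)]
    heapq.heapify(events)

    def kth_similarity() -> Fraction:
        sims = sorted(results.values(), reverse=True)
        return sims[k - 1] if len(sims) >= k else Fraction(-1)

    while events:
        neg_t, rid, pos = heapq.heappop(events)
        if -neg_t <= kth_similarity():
            break               # no unseen pair can beat the current k-th result
        token = records[rid][pos]
        for other in index[token]:          # probe, then verify new candidates
            pair = (other, rid) if other < rid else (rid, other)
            if pair not in results:
                results[pair] = jaccard(records[pair[0]], records[pair[1]])
        index[token].append(rid)            # insert into the token's inverted list
        if pos + 1 < len(records[rid]):     # extend the prefix by one token
            bound = 1 - Fraction(pos + 1, len(records[rid]))
            heapq.heappush(events, (-bound, rid, pos + 1))
    return sorted(((s, p) for p, s in results.items()), reverse=True)[:k]

records = {"w": list("ABCE"), "x": list("ABCEF"),
           "y": list("BCDEF"), "z": list("BCFGH")}
print(topk_join(records, 2))
```

On the slide's records the loop stops when the next event has t = 3/5, which is no larger than the second-best similarity 2/3, and it returns (w, x) with 4/5 and (x, y) with 2/3.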

13 Optimizations - Verification
- In the above example, (w,x) and (y,z) have been verified twice.
- How to avoid repeated verification?
    - Memorizing all verified pairs in a hash table consumes too much memory.
    - Instead, when a pair is verified for the first time, check whether it will be identified again.
    - Keep only the pairs that will be identified again before the algorithm stops.
    - This guarantees that no pair is verified twice.
- Example: x = A B D E F, y = A C D E F. If the k-th temporary result's similarity is 0.7, <x, y> won't be identified again.
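The check described here can be sketched as follows. The reasoning is an assumption of this sketch rather than a quotation of the method: a pair is identified again only through its next shared token, the later of the two corresponding prefix events carries the smaller bound, and that event is processed only if its bound still exceeds the k-th temporary similarity. Records are token lists in the global ordering with no repeated tokens, and the jaccard event bound 1 - pos / |x| (0-based position) is assumed.

```python
from fractions import Fraction

def event_bound(size: int, pos: int) -> Fraction:
    """Bound attached to the prefix event at 0-based position pos (jaccard)."""
    return 1 - Fraction(pos, size)

def may_be_identified_again(x: list, y: list, i: int, j: int, kth_sim) -> bool:
    """i, j: 0-based positions in x and y of the shared token that just produced the pair."""
    later_in_y = {tok: pos for pos, tok in enumerate(y) if pos > j}
    for pos_x in range(i + 1, len(x)):
        pos_y = later_in_y.get(x[pos_x])
        if pos_y is not None:
            # The pair reappears when the later of the two events is processed.
            bound = min(event_bound(len(x), pos_x), event_bound(len(y), pos_y))
            return bound > kth_sim
    return False    # no further shared token

x = list("ABDEF")
y = list("ACDEF")
print(may_be_identified_again(x, y, 0, 0, Fraction(7, 10)))
```

For the slide's records the next shared token is D at the third position of both, whose bound 3/5 is below 0.7, so the pair need not be remembered. Because the k-th similarity never decreases, a negative answer stays valid for the rest of the run.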

14 Optimizations - Indexing
- How to reduce the inverted list size to save memory?
- Example: x = A C D E F, y = B C D E F. The pair <x, y> can be identified through a shared token, yet the maximum similarity it can achieve is 4/6 = 0.67.
- Since t is decreasing, calculate the upper bound of the similarity for future probings into the inverted lists.
- Don't insert a record into an inverted list if this upper bound ≤ the k-th temporary result's similarity.
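One bound that reproduces the slide's 4/6 figure can be derived as follows; it is offered as an assumption-laden sketch, not as the exact bound used in the talk. Suppose record x is about to be inserted at an event with jaccard bound t, so at most t * |x| of its tokens can still match. Any later probing record y has an event bound no larger than t, so its overlap o with x satisfies o ≤ t * |x| and o ≤ t * |y|. Then jaccard = o / (|x| + |y| - o) ≤ o / (2o/t - o) = t / (2 - t).

```python
from fractions import Fraction

def future_probe_bound(t: Fraction) -> Fraction:
    """Upper bound on jaccard for pairs found by probings after an event with bound t."""
    return t / (2 - t)

def should_index(t: Fraction, kth_sim: Fraction) -> bool:
    """Insert into the inverted list only if a future probing could still beat the k-th result."""
    return future_probe_bound(t) > kth_sim

# Second token of a five-token record: t = 1 - 1/5 = 4/5.
t = Fraction(4, 5)
print(future_probe_bound(t), should_index(t, Fraction(7, 10)))
```

With t = 4/5 the bound is 2/3, matching the slide's 4/6, so the record is not indexed once the k-th temporary similarity has reached 0.7.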

15 Experiment Settings
- Algorithms:
    - topk-join
    - pptopk: modified ppjoin [Xiao et al. WWW08], a prefix-filter based approach, with t = 0.95, 0.90, …
- Measures:
    - compare topk-join and pptopk (candidate size, running time)
    - output results progressively
- Datasets:
    dataset                        | # of records | avg. record size
    DBLP (author, title)           | 855k         | 14.0
    TREC (author, title, abstract) | 348k         | 130.1
    TREC-3GRAM                     | 348k         | 868.5
    UNIREF-3GRAM (protein seq.)    | 500k         | 372.9

16 Experiment Results

17 Experiment Results

18 Thank you! Questions?

19 Related Work
- Index-based approaches
    - S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
    - C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE.
- Prefix-based approaches
    - S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
    - R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007.
    - C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008.
- PartEnum
    - A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.