Similarity Join Wu Yang 2009.4.9. Main work MS--A Primitive Operator for Similarity Joins in Data Cleaning ICDE 2006 Google--Scaling Up All Pairs Similarity.

Presentation transcript:

Similarity Join Wu Yang

Main work
MS: A Primitive Operator for Similarity Joins in Data Cleaning, ICDE 2006
Google: Scaling Up All Pairs Similarity Search, WWW 2007
University of New South Wales & NICTA, Australia (Chuan Xiao, Wei Wang, Xuemin Lin):
  PPJoin: Efficient Similarity Joins for Near Duplicate Detection, WWW 2008
  Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints, VLDB 2008
  Approximate Entity Extraction with Edit Distance Constraints, SIGMOD 2009
  Top-k Set Similarity Joins, ICDE

Outline
Motivation
Algorithms
Experiments
Thinking

Near Duplicate Data
On one end, a winded Pete Sampras tried to summon enough energy to give the New York fans another memorable win to talk about it on the subway ride home. On the other side, Roger Federer wore a sly grin like he knew age was about to catch up to the former world No. 1 - the man who owns the record of 14 Grand Slams he wants.
03/11/2008 | 11:28 AM
By JAY COHEN, AP Sports Writer, Mar 11, 4:23 am EDT

App: deduplication
Identify spam
Plagiarism
Copyright protection
Replicate Web collections

App: data integration / record linkage

Applications
For Web search engines:
  Perform focused crawling
  Increase the quality and diversity of query results
  Identify spam
For Web mining:
  Perform document clustering
  Find replicate Web collections
  Detect plagiarism
SPAM TEMPLATE
Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal address attached to ticket number with serial main number drew lucky star winning numbers which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours,
Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.

Algorithms Data set Similarity function Algorithms

Data set dblp.raw texas.raw trec.raw uniref.500K.raw

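Similarity Function
Common similarity functions:
  Jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|}$
  Cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}}$
  Overlap: $O(x, y) = |x \cap y|$
Jaccard can be equivalently converted to Overlap: $J(x, y) \ge t \iff O(x, y) \ge \frac{t}{1 + t} (|x| + |y|)$
Example: x = {A,B,C,D,E}, y = {B,C,D,E,F}
  Jaccard: 4/6 ≈ 0.67
  Cosine: 4/5 = 0.8
  Overlap: 4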
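The three functions on this slide, and the Jaccard-to-overlap conversion, can be checked on the slide's own x and y with a few lines of Python. This is a minimal sketch over Python sets; the function names, and the small epsilon used to guard the ceiling against floating-point noise, are assumptions for illustration rather than anything taken from the papers.

```python
import math

def overlap(x, y):
    """Number of tokens shared by the two sets."""
    return len(set(x) & set(y))

def jaccard(x, y):
    """Shared tokens divided by the size of the union."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def cosine(x, y):
    """Set-based cosine: shared tokens over sqrt(|x| * |y|)."""
    x, y = set(x), set(y)
    return len(x & y) / math.sqrt(len(x) * len(y))

def jaccard_to_overlap(t, size_x, size_y):
    """Smallest integer overlap that guarantees Jaccard >= t.

    Uses O(x, y) >= t / (1 + t) * (|x| + |y|); the epsilon is an
    assumption added here to avoid rounding up on float noise.
    """
    return math.ceil(t / (1 + t) * (size_x + size_y) - 1e-9)

x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
print(overlap(x, y), jaccard(x, y), cosine(x, y))
print(jaccard_to_overlap(0.8, len(x), len(y)))
```

With a Jaccard threshold of 0.8, the two five-token records need an overlap of at least 5, so this pair (overlap 4) is not a match.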

Similarity Function
Hamming distance: $H(x, y) = |(x - y) \cup (y - x)|$
Edit distance
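Both distances can be sketched briefly. The set Hamming distance is the size of the symmetric difference, as in the formula on the slide; for edit distance the slide gives no formula, so the version below assumes the standard Levenshtein definition (unit-cost insert, delete, substitute). Function names are hypothetical.

```python
def set_hamming(x, y):
    """|(x - y) U (y - x)|: size of the symmetric difference."""
    return len(set(x) ^ set(y))

def edit_distance(s, t):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))          # distance from "" to t[:j]
    for i, cs in enumerate(s, 1):
        cur = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]

print(set_hamming({"A", "B", "C", "D", "E"}, {"B", "C", "D", "E", "F"}))
print(edit_distance("kitten", "sitting"))
```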

Algorithms - classification

Algorithms: objects
Similarity between sets
  Binary similarity functions: contains, intersects
  Numerical similarity functions: overlap, Jaccard, dice, cosine
Similarity between strings
  Treat strings as sets: Jaccard (on q-grams)
  Edit distance
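Treating a string as a set of q-grams can be illustrated as follows. This is a simplified sketch: it uses a plain Python set, so repeated q-grams in a string are counted once, and the default q = 2 and the function names are assumptions made for the example.

```python
def qgrams(s, q=2):
    """Set of all length-q substrings of s (duplicates collapse)."""
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_jaccard(s, t, q=2):
    """Jaccard similarity between the q-gram sets of two strings."""
    a, b = qgrams(s, q), qgrams(t, q)
    return len(a & b) / len(a | b)

print(sorted(qgrams("abcd")))
print(qgram_jaccard("abcd", "abce"))
```

Here "abcd" and "abce" share the 2-grams ab and bc out of four distinct 2-grams in total, which is why a single-character difference still leaves a sizeable set similarity.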

algorithms SSJoin All-pairs PPJoin, PPJoin+ Top-k Set Similarity Joins

SSJoin
Based on sets
Why string to set? (cited from Efficient Exact Set-Similarity Joins, MS)
  Generalizes to many string similarity functions
  Powerful primitive
  Sets ≈ Relations
  Leverage relational data processing

SSJoin
Find {(r, s) | r ∈ R, s ∈ S, overlap(r, s) ≥ t}
A fundamental “operator”
Can handle other similarity functions (Jaccard, cosine, Hamming, dice, edit distance, …) via transformation
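Read literally, the operator's definition is a filter over all pairs from R and S. The sketch below spells out only that semantics with a nested loop; it is not how SSJoin is implemented (the paper builds it from relational operators and prefix filtering), and the function name and sample data are hypothetical.

```python
def ssjoin_naive(R, S, t):
    """Return index pairs (i, j) with overlap(R[i], S[j]) >= t.

    Quadratic reference implementation of the operator's definition,
    useful mainly as a correctness baseline for faster algorithms.
    """
    return [(i, j)
            for i, r in enumerate(R)
            for j, s in enumerate(S)
            if len(set(r) & set(s)) >= t]

R = [{"A", "B", "C"}, {"D", "E"}]
S = [{"B", "C", "X"}, {"D", "E", "F"}, {"Z"}]
print(ssjoin_naive(R, S, 2))
```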

Prefix filtering-based similarity joins
1. SSJoin [Chaudhuri et al., ICDE06]: formalizes the prefix-filtering principle
2. All-Pairs [Bayardo et al., WWW07]: uses prefix filtering in an asymmetric way
3. PPJoin+ [Xiao et al., WWW08]: employs prefix filtering, positional filtering and suffix filtering
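The prefix-filtering principle shared by these three algorithms can be sketched as a small self-join. Assumptions in this sketch: a Jaccard threshold t, tokens ordered globally by ascending document frequency, a probing and indexing prefix of length |x| - ceil(t * |x|) + 1 for every record, and full verification of each candidate. It shows the general idea, not the exact All-Pairs or PPJoin implementation; all names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def prefix_length(size, t):
    """Prefix length for a record of `size` tokens under Jaccard threshold t."""
    return size - math.ceil(t * size - 1e-9) + 1

def prefix_filter_self_join(records, t):
    """Pairs (i, j), i < j, whose Jaccard similarity is at least t."""
    # Global token order: rare tokens first, ties broken alphabetically.
    df = Counter(tok for rec in records for tok in set(rec))
    canon = [sorted(set(rec), key=lambda tok: (df[tok], tok)) for rec in records]

    index = defaultdict(list)   # token -> ids of records with it in their prefix
    results = []
    for rid, x in enumerate(canon):
        if not x:
            continue
        candidates = set()
        for tok in x[:prefix_length(len(x), t)]:
            candidates.update(index[tok])   # earlier records sharing a prefix token
            index[tok].append(rid)
        for cid in sorted(candidates):      # verify every surviving candidate
            y = canon[cid]
            if len(set(x) & set(y)) / len(set(x) | set(y)) >= t:
                results.append((cid, rid))
    return results

records = [list("ABCDE"), list("BCDEF"), list("ABCDEG")]
print(prefix_filter_self_join(records, 0.8))
```

Only records that share at least one prefix token are ever compared, which is where the savings over the all-pairs nested loop come from.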

All-Pairs

Prefix + Positional Information
We use the prefix filter (All-Pairs [WWW07]) as the basic framework
Intuition
  Tokens are sorted -> rank, or position, of tokens within a record
  Estimate tighter upper bounds of the overlap between x and y with positional information
Contributions
  Index construction: index not only tokens, but also their positions in the record -> ppjoin algorithm
  Candidate generation: probe tokens in the suffix and compare their positions in the record -> ppjoin+ algorithm
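The positional upper bound can be made concrete with a small sketch. Assume both records are sorted in the same global token order and that a common token is found at 0-based position i in x and j in y; then at most min(|x| - i - 1, |y| - j - 1) further tokens can still match. The function names are hypothetical, and the example reuses the two five-token records from the Similarity Function slide.

```python
import math

def required_overlap(t, size_x, size_y):
    """Minimum overlap equivalent to Jaccard >= t for records of these sizes."""
    return math.ceil(t / (1 + t) * (size_x + size_y) - 1e-9)

def positional_upper_bound(size_x, i, size_y, j, overlap_so_far):
    """Best overlap still reachable after matching x[i] with y[j]."""
    return overlap_so_far + 1 + min(size_x - i - 1, size_y - j - 1)

def survives_positional_filter(x, i, y, j, overlap_so_far, t):
    """True if the pair can still reach the required overlap."""
    bound = positional_upper_bound(len(x), i, len(y), j, overlap_so_far)
    return bound >= required_overlap(t, len(x), len(y))

x = ["A", "B", "C", "D", "E"]
y = ["B", "C", "D", "E", "F"]
# The shared prefix token "B" sits at position 1 in x and position 0 in y.
print(positional_upper_bound(len(x), 1, len(y), 0, 0))
print(survives_positional_filter(x, 1, y, 0, 0, 0.8))
```

The prefix filter alone would keep this pair as a candidate because the prefixes share B; the positional bound of 4 falls short of the required overlap of 5, so the pair can be discarded without verification.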

Experiments

Thinking
Further performance optimization:
1. Index for similarity functions (e.g., cosine)
2. Better pruning techniques
3. Optimize for the specific similarity/distance function

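Thinking
How existing methods handle tokens
  Inverted-list based methods
  TF, IDF: the common weighting scheme in IR, $w_{i,j} = tf \times idf$
Intuition: since token weights affect the efficiency of the algorithms, is there a better way to order the tokens? Does the ordering affect the results?
Question: while ordering tokens, could some words be masked out altogether, and should explicit weights be defined for others?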
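As a starting point for the question raised on this slide, the weighting w = tf × idf and a frequency-based token order can be sketched as follows. The slide does not say which idf variant is meant, so idf = ln(N / df) is an assumption here, as are the function names and the toy documents.

```python
import math
from collections import Counter

def document_frequency(docs):
    """Number of documents in which each token occurs."""
    return Counter(tok for doc in docs for tok in set(doc))

def tf_idf_weights(docs):
    """Per-document token weights w = tf * ln(N / df) (assumed idf variant)."""
    n = len(docs)
    df = document_frequency(docs)
    return [{tok: tf * math.log(n / df[tok]) for tok, tf in Counter(doc).items()}
            for doc in docs]

def global_token_order(docs):
    """Tokens sorted rare-first, the ordering typically used for prefix filtering."""
    df = document_frequency(docs)
    return sorted(df, key=lambda tok: (df[tok], tok))

docs = [["a", "b", "b"], ["a", "c"]]
print(tf_idf_weights(docs))
print(global_token_order(docs))
```

A token that occurs in every document gets weight 0 under this variant, which is one simple way to think about "masking out" words: they contribute nothing to a weighted similarity and sit at the end of the global order.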

Continued
The algorithms introduced in this report are all set-based, as shown in the figure.

Related Work
Approximate:
  LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB.
  Shingling: A. Z. Broder. On the resemblance and containment of documents. In SEQS.
Exact:
  Index-based: S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD.
  Prefix-based: S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE.
  All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW.
  PPJoin, PPJoin+: Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
  Pigeon-hole principle based: PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.

References
[SEQS97] A. Z. Broder. On the resemblance and containment of documents. In SEQS.
[MIR] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1st edition, May.
[VLDB99] LSH: A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB.
[SIGMOD04] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD.
[ICDE06] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE.
[VLDB06] PartEnum: A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB.
[WWW07] All-Pairs: R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW.
[WWW 2008] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
[VLDB 2008] Ed-Join: An Efficient Algorithm for Similarity Joins with Edit Distance Constraints. VLDB.
[ICDE 2009] Chuan Xiao, Wei Wang, Xuemin Lin, Haichuan Shang. Top-k Set Similarity Joins.