An Overview of Similarity Query Processing. Jongik Kim, Division of Computer Engineering, Chonbuk National University.

Presentation transcript:

Table of Contents
01. Applications of similarity query processing
02. Problem Formulation
03. String Decomposition
04. Similarity Function
05. A naïve approach
06. Overlap Similarity
07. Similarity Query Processing with Inverted lists
08. Similarity Function Revisited
09. Filter and Verification Framework
10. Prefix Filtering based Approach
11. Exploiting Document Frequency Ordering

Some examples and figures in this presentation are taken from the following materials:
Marios Hadjieleftheriou and Chen Li, Efficient Approximate Search on String Collections (tutorial), ICDE 2009 and VLDB 2009
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Efficient Similarity Joins for Near Duplicate Detection, WWW 2008 (slides)
Jongik Kim and Hongrae Lee, Efficient Exact Similarity Searches using Multiple Token Orderings, ICDE 2012 (slides)

Applications of similarity query processing (1/8)
Actual queries gathered by Google Web Search

Applications of similarity query processing (2/8)
Data integration and data cleaning
R: informix, microsoft, …
S: infromix, mcrosoft, …
(Figure annotation: Should be “Niels Bohr”)

Applications of similarity query processing (3/8)
Duplicate (Web) Document Detection

Applications of similarity query processing (4/8)
Identifying spam
SPAM TEMPLATE
Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal address attached to ticket number with serial main number drew lucky star winning numbers which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours,

Applications of similarity query processing (5/8)
Detect plagiarism

Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.

Q. What are the advantages of RAID5 over RAID4?
A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.

Applications of similarity query processing (6/8)
Recommendation of friends in an SNS service
The friends of a person can be represented as a binary vector (the friends vector).

Applications of similarity query processing (7/8)
Read (a fragment of genome sequence) alignment
Reference sequence: GCTGATGTGCCGCCTCACTCCGGTGG …
Short reads:
CACTCCTGTGG
CTCACTCCTGTGG
GCTGATGTGCCACCTCA
GATGTGCCACCTCACTC
GTGCCGCCTCACTCCTG
CTCCTGTGG

Applications of similarity query processing (8/8)
Query relaxation, supported by Oracle Text

    CREATE TABLE engdict(word VARCHAR(20), len INT);

Create preferences for text indexing:

    begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
    end;
    /

    CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
      INDEXTYPE IS ctxsys.context
      PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

    SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters, e.g. Katherine versus Catherine.

Problem Formulation (1/2)
Find strings similar to a given string

Problem Formulation (2/2)
Similar to: a domain-specific function returns a similarity value between two strings.
Common similarity functions:
Jaccard coefficient, Cosine similarity, Dice similarity (these functions require set data)
Edit distance

String Decomposition
Word tokens for long strings (e.g. web pages)
x = “yes as soon as possible”
y = “as soon as possible please”

word   yes  as  soon  as_1  possible  please
token  A    B   C     D     E         F

(as_1 denotes the second occurrence of “as”, which gets its own token.)
x = {A, B, C, D, E}
y = {B, C, D, E, F}

q-gram tokens for short strings (e.g. keyword queries)
x = “universal”
G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}
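The two decompositions above can be sketched in a few lines of Python. This is only an illustration: the function names `qgrams` and `word_tokens` and the `word#k` naming for repeated words are assumptions made here, not something the slides prescribe.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping q-grams of s, from left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def word_tokens(text: str) -> list[str]:
    """Split on whitespace and tag repeated words with an occurrence index,
    so the result can be used as a set (e.g. 'as' and 'as#1')."""
    seen = Counter()
    tokens = []
    for word in text.split():
        # The '#k' suffix is a hypothetical convention for the k-th repeat.
        tokens.append(word if seen[word] == 0 else f"{word}#{seen[word]}")
        seen[word] += 1
    return tokens


x = word_tokens("yes as soon as possible")
y = word_tokens("as soon as possible please")
print(qgrams("universal"))     # the eight 2-grams of "universal"
print(len(set(x) & set(y)))    # number of shared word tokens
```

With the slide's example, the two word-token sets share four tokens, matching x = {A, B, C, D, E} and y = {B, C, D, E, F}.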

Similarity Function
Set-based functions: Jaccard similarity, Cosine similarity, Dice similarity
x = {A, B, C, D, E}
y = {B, C, D, E, F}
Edit Distance
ED(x, y) = minimum number of edit operations (insertion, deletion, substitution) needed to change x into y
x: Tom Hanks
y: Ton Hank
ED(x, y) = 2
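The formulas on this slide did not survive in the transcript. The sketch below uses the standard set-based definitions of Jaccard, cosine and Dice similarity, which is presumably what the slide showed, together with a textbook dynamic-programming edit distance. All function names are chosen here for illustration.

```python
import math


def jaccard(x: set, y: set) -> float:
    """|x ∩ y| / |x ∪ y|"""
    return len(x & y) / len(x | y)


def cosine(x: set, y: set) -> float:
    """|x ∩ y| / sqrt(|x| * |y|)"""
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x: set, y: set) -> float:
    """2 * |x ∩ y| / (|x| + |y|)"""
    return 2 * len(x & y) / (len(x) + len(y))


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn a into b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
print(jaccard(x, y), cosine(x, y), dice(x, y))
print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, as on the slide
```

For the slide's sets the overlap is 4, so Jaccard is 4/6 while cosine and Dice are both 4/5.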

A naïve approach
Given a collection of strings C, a query string x, and a threshold t of a similarity function sim,
1. decompose each string in C and the query string into tokens.
2. output those strings y ∈ C such that sim(x, y) ≥ t.
Since C contains a large number of strings and every one of them is compared with the query, this approach is inefficient.
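The two steps above amount to a full scan of the collection. A minimal sketch, assuming 2-gram sets as tokens and Jaccard similarity as `sim` (both choices, and the names `naive_search` and `qgram_set`, are illustrative):

```python
def qgram_set(s: str, q: int = 2) -> set[str]:
    """Set of q-grams of s (duplicate q-grams collapse in this sketch)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(x: set, y: set) -> float:
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def naive_search(collection: list[str], query: str, t: float, q: int = 2) -> list[str]:
    """Compare the query against every string: cost grows linearly with |C|."""
    gx = qgram_set(query, q)
    return [y for y in collection if jaccard(gx, qgram_set(y, q)) >= t]


C = ["area", "artisan", "artist", "tisk"]
print(naive_search(C, "artist", 0.5))
```

Every query touches every string in C, which is what the index-based approaches on the following slides avoid.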

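Overlap Similarity (1/2)
Given a similarity threshold t, a threshold on Jaccard, cosine, or Dice similarity can be rewritten as a threshold on the overlap similarity, that is, on the number of tokens that x and y share.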
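The formulas of this slide are missing from the transcript. Under the standard definitions of the three set-based functions, and writing $O(x, y) = |x \cap y|$ for the overlap (a symbol chosen here), the equivalent overlap constraints can be derived as follows. This is a reconstruction and may differ in notation from the original slide.

```latex
\begin{align*}
J(x, y) = \frac{|x \cap y|}{|x| + |y| - |x \cap y|} \ge t
  &\iff O(x, y) \ge \frac{t}{1 + t}\,\bigl(|x| + |y|\bigr) \\
\cos(x, y) = \frac{|x \cap y|}{\sqrt{|x|\,|y|}} \ge t
  &\iff O(x, y) \ge t\,\sqrt{|x|\,|y|} \\
\mathrm{Dice}(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} \ge t
  &\iff O(x, y) \ge \frac{t}{2}\,\bigl(|x| + |y|\bigr)
\end{align*}
```

For example, with $|x| = |y| = 5$ and $t = 0.6$, the Jaccard constraint requires $O(x, y) \ge 0.375 \cdot 10 = 3.75$, that is, at least 4 shared tokens.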

Overlap Similarity (2/2)
Given an edit distance threshold d:
d edit operations can affect at most d × q q-grams; that is, d edit operations on x can mutate at most d × q q-grams of x.
x = “universal” and G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}
2 edit operations on x mutate at most 2 × 2 = 4 q-grams.
Hence, y should contain at least |G(x, 2)| − 2 × 2 = 4 q-grams of G(x, 2).
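The bound above is easy to check numerically. In this sketch, `min_shared_qgrams` computes |G(x, q)| − d × q and `shared_qgrams` counts common q-grams as multisets; both names, and the test string "unversl" (two deletions from "universal"), are examples made up here.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def min_shared_qgrams(x: str, d: int, q: int = 2) -> int:
    """Lower bound on the q-grams of x that survive d edit operations."""
    return max(0, len(qgrams(x, q)) - d * q)


def shared_qgrams(x: str, y: str, q: int = 2) -> int:
    """Number of q-grams common to x and y, counted as multisets."""
    return sum((Counter(qgrams(x, q)) & Counter(qgrams(y, q))).values())


bound = min_shared_qgrams("universal", d=2)    # 8 - 2 * 2
shared = shared_qgrams("universal", "unversl")  # two deletions: 'i' and 'a'
print(bound, shared)
```

Here the two deletions destroy exactly four 2-grams, so the shared count meets the bound with equality. A string sharing fewer q-grams than the bound cannot be within edit distance d and can be pruned without computing the edit distance.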

Similarity Query Processing with Inverted lists

ID  String   Record (token set)
1   area     {ar, re, ea}
2   artisan  {ar, rt, ti, is, sa, an}
3   artist   {ar, rt, ti, is, st}
4   tisk     {ti, is, sk}
…   …        …

1. Make inverted lists, one per token: ar, re, ea, rt, ti, is, sa, an, st, sk
2. Query: “artist” → {ar, rt, ti, is, st}, overlap threshold: 4
3. Merge the inverted lists of the query tokens to count the occurrences of each string ID
4. Answers of the query: 2: “artisan”, 3: “artist”
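A small sketch of this procedure on the four example strings follows. It counts occurrences with a plain counter rather than a specific merge algorithm, and the names `build_index` and `count_search` are illustrative, not taken from the slides.

```python
from collections import Counter, defaultdict


def qgram_set(s: str, q: int = 2) -> set[str]:
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_index(strings: dict[int, str], q: int = 2) -> dict[str, list[int]]:
    """Map each q-gram to the sorted list of string IDs containing it."""
    index = defaultdict(list)
    for sid in sorted(strings):
        for gram in qgram_set(strings[sid], q):
            index[gram].append(sid)
    return index


def count_search(index, query: str, threshold: int, q: int = 2) -> list[int]:
    """Return IDs that occur in at least `threshold` of the query's lists."""
    counts = Counter()
    for gram in qgram_set(query, q):
        counts.update(index.get(gram, []))
    return sorted(sid for sid, c in counts.items() if c >= threshold)


strings = {1: "area", 2: "artisan", 3: "artist", 4: "tisk"}
index = build_index(strings)
print(index["ar"], index["ti"])          # [1, 2, 3] [2, 3, 4]
print(count_search(index, "artist", 4))  # [2, 3]
```

String 2 appears in four of the five query lists and string 3 in all five, so both reach the overlap threshold of 4, while strings 1 and 4 appear in only one and two lists.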

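Merge Algorithm – HeapMerge
Count threshold: t ≥ 3
A min-heap over the inverted lists yields string IDs in increasing order together with their counts. In the figure, an ID with count 2 < t is rejected (X), and ID 2 with count 3 = t is accepted (O), …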
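Only a fragment of the HeapMerge figure survives, so the following is a generic heap-based count merge rather than a transcription of the slide. It assumes each inverted list is sorted and contains no duplicate IDs; the function name `heap_merge` is made up here.

```python
import heapq


def heap_merge(lists: list[list[int]], t: int) -> list[int]:
    """Return the IDs that appear in at least t of the sorted input lists."""
    # Heap entries: (current ID, list number, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        current = heap[0][0]
        count = 0
        # Pop every list head equal to the smallest ID and advance that list.
        while heap and heap[0][0] == current:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= t:
            result.append(current)
    return result


# Inverted lists of the query "artist" from the earlier example:
# ar, rt, ti, is, st
lists = [[1, 2, 3], [2, 3], [2, 3, 4], [2, 3, 4], [3]]
print(heap_merge(lists, 4))  # [2, 3]
```

Each ID is popped once per list it occurs in, so the total work is proportional to the combined length of the lists times the logarithm of the number of lists.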

Similarity Function Revisited
Given a query x with a similarity threshold t, determining the overlap threshold requires the size of y, which varies from string to string in the collection. What is needed instead is an overlap bound that holds for all y.
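The slide's formulas are not in the transcript. One way to obtain a bound that holds for all $y$, using only $|x \cap y| \le |y|$ and $|x \cup y| \ge |x|$, is sketched below. The symbol $\alpha$ follows the later slides; the derivation itself is a reconstruction under the standard definitions, not necessarily the one shown in the presentation.

```latex
\begin{align*}
\text{Jaccard:}\quad & |x \cap y| \ge t\,|x \cup y| \ge t\,|x|
  && \Rightarrow\ \alpha = \lceil t\,|x| \rceil \\
\text{Cosine:}\quad & |y| \ge |x \cap y| \ge t\sqrt{|x|\,|y|}
  \ \Rightarrow\ |y| \ge t^{2}|x|
  \ \Rightarrow\ |x \cap y| \ge t\sqrt{|x| \cdot t^{2}|x|} = t^{2}|x|
  && \Rightarrow\ \alpha = \lceil t^{2}\,|x| \rceil \\
\text{Dice:}\quad & |x \cap y| \ge \tfrac{t}{2}\bigl(|x| + |y|\bigr)
  \ge \tfrac{t}{2}\bigl(|x| + |x \cap y|\bigr)
  \ \Rightarrow\ |x \cap y| \ge \frac{t}{2 - t}\,|x|
  && \Rightarrow\ \alpha = \Bigl\lceil \frac{t}{2 - t}\,|x| \Bigr\rceil
\end{align*}
```

For the query "artist" with $|x| = 5$ and a Jaccard threshold $t = 0.8$, this gives $\alpha = \lceil 4 \rceil = 4$, the overlap threshold used in the running example.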

Filter and Verification Framework
FILTER: Find those strings that share at least α tokens with the query string, where α is an overlap lower bound. The filter stage is itself split into two steps:
  FILTER: Quickly generate initial candidates using a minimum constraint.
  REFINEMENT: Refine the candidates using α.
VERIFICATION: Verify each string found in the filtering stage by directly applying a similarity function.

Prefix Filtering based Approach
Query x = “artist” → {ar, rt, ti, is, st} and overlap threshold α = 4
Inverted lists for the query: ar, is, rt, st, ti
Sort the lists by their sizes, that is, sort the tokens by their document frequencies (document frequency ordering): st, rt, ar, is, ti
Prefix lists: the first |G(x, 2)| – α + 1 lists
Suffix lists: the remaining α – 1 lists
Filtering phase (the prefix filtering): merge the prefix lists to generate candidates.
Refinement phase: search the suffix lists for each candidate. Each candidate is looked up in every suffix list to check whether the list contains it. Binary search is used because suffix lists are usually very long.
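The filtering and refinement phases can be sketched as follows. The inverted lists are the ones implied by the four example strings (area, artisan, artist, tisk); the function names, the dictionary layout, and the early-exit check in the refinement loop are assumptions of this sketch rather than details given on the slide.

```python
from bisect import bisect_left


def contains(sorted_ids: list[int], sid: int) -> bool:
    """Binary search for sid in a sorted inverted list."""
    i = bisect_left(sorted_ids, sid)
    return i < len(sorted_ids) and sorted_ids[i] == sid


def prefix_filter_search(index: dict[str, list[int]],
                         query_tokens: list[str], alpha: int) -> list[int]:
    tokens = list(dict.fromkeys(query_tokens))  # de-duplicate, keep order
    # Document frequency ordering: shortest inverted lists first.
    lists = sorted((index.get(tok, []) for tok in tokens), key=len)
    n_prefix = len(lists) - alpha + 1
    if n_prefix <= 0:
        return []
    prefix, suffix = lists[:n_prefix], lists[n_prefix:]

    # Filtering phase: every ID in a prefix list becomes a candidate.
    counts: dict[int, int] = {}
    for lst in prefix:
        for sid in lst:
            counts[sid] = counts.get(sid, 0) + 1

    # Refinement phase: probe the (long) suffix lists by binary search.
    results = []
    for sid, count in sorted(counts.items()):
        for k, lst in enumerate(suffix):
            if count + (len(suffix) - k) < alpha:
                break  # cannot reach alpha even if all remaining lists match
            if contains(lst, sid):
                count += 1
        if count >= alpha:
            results.append(sid)
    return results


index = {"ar": [1, 2, 3], "rt": [2, 3], "ti": [2, 3, 4],
         "is": [2, 3, 4], "st": [3]}
print(prefix_filter_search(index, ["ar", "rt", "ti", "is", "st"], 4))  # [2, 3]
```

With α = 4 the prefix consists of the two shortest lists (st and rt), so only strings 2 and 3 become candidates; strings 1 and 4 are never touched. A string that shares at least α of the query's tokens must appear in at least one prefix list, because only α – 1 lists remain in the suffix.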

Exploiting Document Frequency Ordering (1/2)
General goal: minimize the number of candidates initially generated by making use of the document frequency ordering.
Query x = “artist” → {ar, rt, ti, is, st} and overlap threshold α = 4
Lists in arbitrary order: rt, st, ti, ar, is
Lists after sorting the tokens by their document frequencies: st, rt, ar, is, ti
In both cases:
Prefix lists: the first |G(x, 2)| – α + 1 lists
Suffix lists: the remaining α – 1 lists
With the document frequency ordering we can reduce
1. the time for merging, because the prefix lists are short
2. the number of candidates, and hence the time for verification

Exploiting Document Frequency Ordering (2/2)
Query x = {w1, w2} and overlap threshold α = 2
Without partitioning: w2 is the prefix list; # of candidates is 5.
With two partitions: in one partition w2 is the prefix list and # of candidates is 0; in the other partition w1 is the prefix list and # of candidates is 0. The total number of candidates is 0.
Partition Observation
By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition. We evaluate a query in each partition and take the union of the results.
We can reduce the number of candidates by utilizing different token orderings among partitions.
Because partitions have different token orderings, we need to sort the tokens of a query record separately in each partition.
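The effect can be reproduced with a small made-up data set. The records below are hypothetical and were chosen so that the candidate counts match the numbers on the slide (5 without partitioning, 0 + 0 with two partitions); the figure's actual data is not in the transcript, and `prefix_candidates` is a name introduced here.

```python
def prefix_candidates(records: dict[int, set[str]],
                      query: list[str], alpha: int) -> int:
    """Number of candidates that prefix filtering generates for `query`
    within one partition (records: ID -> token set)."""
    lists = [[rid for rid in sorted(records) if tok in records[rid]]
             for tok in query]
    lists.sort(key=len)  # this partition's own document frequency ordering
    prefix = lists[:len(query) - alpha + 1]
    return len({rid for lst in prefix for rid in lst})


# Hypothetical data: records 1-6 contain w1 only, records 7-11 contain w2 only.
records = {rid: {"w1"} for rid in range(1, 7)}
records.update({rid: {"w2"} for rid in range(7, 12)})
query, alpha = ["w1", "w2"], 2

whole = prefix_candidates(records, query, alpha)
part_a = prefix_candidates({r: t for r, t in records.items() if r <= 6}, query, alpha)
part_b = prefix_candidates({r: t for r, t in records.items() if r > 6}, query, alpha)
print(whole, part_a + part_b)  # 5 0
```

In the unpartitioned data w2 is the rarer token, so its five records all become candidates even though none of them contains w1. In each partition one of the two tokens has document frequency 0, so that token's empty list becomes the prefix and no candidates are generated.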

Q&A