Fast Indexes and Algorithms For Set Similarity Selection Queries
M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava

Strings as sets
- s1 = "Main St. Maine": 'Main' 'St.' 'Maine' 'Mai' 'ain' 'in ' 'n S' ' St' 'St.' 't. ' …
- s2 = "Main St. Main": 'Main' 'St.' 'Main'
- How similar are s1 and s2?
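A minimal sketch (not from the slides) of the two decompositions shown in the example, word tokens and 3-grams; the helper names are illustrative:

```python
# Sketch: decomposing a string into word tokens and 3-grams,
# as in the s1 / s2 example (helper names are my own).
def word_tokens(s):
    return s.split()

def qgrams(s, q=3):
    # sliding window of length q over the raw string, spaces included
    return [s[i:i + q] for i in range(len(s) - q + 1)]

s1 = "Main St. Maine"
print(word_tokens(s1))   # ['Main', 'St.', 'Maine']
print(qgrams(s1)[:4])    # ['Mai', 'ain', 'in ', 'n S']
```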

TF/IDF weighted similarity
- Inverse Document Frequency (idf): 'Main' is common, 'Maine' is not
  - idf(t) = log2(1 + N / df(t))
- Term Frequency (tf): 'Main' appears twice in s2
- Similarity: inner product
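The idf formula from the slide, as a one-line sketch (df(t) is the number of sets containing token t, N the total number of sets):

```python
import math

def idf(df_t, N):
    # idf(t) = log2(1 + N / df(t)): rare tokens get high weight
    return math.log2(1 + N / df_t)

# a rare token ('Maine', df = 1) outweighs a common one ('Main', df = 3)
print(idf(1, 3))  # 2.0
print(idf(3, 3))  # 1.0
```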

Is TF important?
- Information retrieval: given a query string, retrieve relevant documents
- Relational databases: given a query string, retrieve relevant strings
- In practice TF is small in many applications

IDF similarity
- Query q = {t1, …, tn}; set s = {r1, …, rm}
- Length: len(s) = (Σ_{t ∈ s} idf(t)²)^{1/2}
- I(q, s) = Σ_{t ∈ q ∩ s} idf(t)² / (len(q) · len(s))
- IDF is as good as TF/IDF in practice!

How can I build an index?
- Let w(t, s) = idf(t) / len(s)
- Then I(q, s) = Σ_{t ∈ q ∩ s} w(t, q) · w(t, s)
- So:
  - Decompose strings into tokens
  - Compute the idf of each token
  - Create one inverted list per token
  - Sort lists by string id: do a merge join
  - Sort lists by w: run TA/NRA
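A sketch of the index-building step under the decomposition above (the function shape is my own; only the w(t, s) definition comes from the slide):

```python
import math
from collections import defaultdict

def build_index(sets, idf):
    # one inverted list per token; each posting is (string id, w(t, s))
    # with w(t, s) = idf(t) / len(s)
    index = defaultdict(list)
    for sid, tokens in sets.items():
        ln = math.sqrt(sum(idf[t] ** 2 for t in tokens))
        for t in tokens:
            index[t].append((sid, idf[t] / ln))
    for lst in index.values():
        lst.sort(key=lambda p: p[1], reverse=True)  # by w, for TA/NRA
        # (sort by p[0] instead for the sort-by-id / merge-join variant)
    return index
```

With this layout, I(q, s) becomes a sum of per-list contributions w(t, q)·w(t, s), which is exactly the monotone aggregate that TA/NRA can process.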

Example: Sort by id

Example: Sort by w
- NRA: round-robin list accesses
- Main-memory hash table
- Computes lower and upper bounds per entry
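A simplified NRA sketch of the three bullets above (assumed shape, not the paper's exact procedure): round-robin over w-sorted lists, a hash table of lower bounds, and upper bounds formed by adding the last value seen on each still-unread list.

```python
# lists: {token: [(id, contribution), ...]} sorted by descending contribution
def nra(lists, k):
    lower = {}                       # sum of contributions actually seen
    seen = {}                        # which lists each id was seen in
    pos = {t: 0 for t in lists}
    last = {t: float('inf') for t in lists}   # last value read per list
    while True:
        progressed = False
        for t, lst in lists.items():          # round-robin access
            if pos[t] < len(lst):
                sid, w = lst[pos[t]]
                pos[t] += 1
                last[t] = w
                lower[sid] = lower.get(sid, 0.0) + w
                seen.setdefault(sid, set()).add(t)
                progressed = True
        if not progressed:
            break                              # all lists exhausted
        top = sorted(lower, key=lower.get, reverse=True)[:k]
        if len(top) < k:
            continue
        threshold = min(lower[s] for s in top)

        def upper(sid):
            # optimistic completion: last value of every unseen list
            return lower[sid] + sum(last[t] for t in lists if t not in seen[sid])

        unseen_ub = sum(last.values())         # bound for never-seen ids
        if unseen_ub <= threshold and all(
                upper(s) <= threshold for s in lower if s not in top):
            return top
    return sorted(lower, key=lower.get, reverse=True)[:k]
```

This stops as soon as no candidate outside the current top-k can still overtake it, without ever doing random accesses.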

Semantic properties of IDF
- Order Preservation: for all t1 ≠ t2, if w(t1, s) < w(t1, r) then w(t2, s) < w(t2, r)
- Length Boundedness: for query q, set s and threshold θ:
  - I(q, s) ≥ θ ⇒ θ · len(q) < len(s) < len(q) / θ
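Length Boundedness acts as a pure length filter; a small sketch (function names are illustrative):

```python
# If I(q, s) >= theta, then theta * len(q) < len(s) < len(q) / theta,
# so any set whose length falls outside this window can be skipped
# without scoring it at all.
def length_window(len_q, theta):
    return theta * len_q, len_q / theta

def may_qualify(len_s, len_q, theta):
    lo, hi = length_window(len_q, theta)
    return lo < len_s < hi

print(length_window(2.0, 0.5))  # (1.0, 4.0)
```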

Improved NRA (iNRA)
- Order Preservation determines whether a given set appears in a list or not
  - ti: encounter s1, then s2
  - tk: encounter s2 first
- Length Boundedness restricts the search to a small portion of the lists

Something surprising
- Lemma: NRA reads arbitrarily more elements than iNRA
- Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property

Any other strategies?
- NRA style is breadth-first
- Try depth-first: sort query lists in decreasing idf order
  - Let q = {t1, …, tn} and idf(t1) > idf(t2) > … > idf(tn)
- Let λi be the maximum length a set s in list ti can have s.t. I(q, s) ≥ θ, assuming that s exists in all lists tk, k ≥ i
  - λi = Σ_{i ≤ k ≤ n} idf(tk)² / (θ · len(q))
- λi is a natural cutoff point: λ1 > λ2 > … > λn
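Since λi assumes s appears in every remaining list tk (k ≥ i), it is a suffix sum over the squared idfs; a short sketch of that computation:

```python
# Cutoff lengths lambda_i = sum_{i <= k <= n} idf(t_k)^2 / (theta * len(q)),
# computed right-to-left as a running suffix sum.
def cutoffs(idfs, len_q, theta):
    # idfs: query-token idfs in decreasing order
    lam, suffix = [], 0.0
    for v in reversed(idfs):
        suffix += v * v
        lam.append(suffix / (theta * len_q))
    lam.reverse()
    return lam                      # lambda_1 > lambda_2 > ... > lambda_n

print(cutoffs([2.0, 1.0], 1.0, 1.0))  # [5.0, 1.0]
```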

Shortest-First (SF)
- Sort q = {t1, …, tn} in decreasing idf order
- Let C be the candidate set
- For 1 ≤ i ≤ n:
  - Skip to the first entry with len(s) ≥ θ · len(q)
  - Compute λi
  - Let λ′i = min(λi, len(q) / θ)
  - Repeat:
    - s = pop next element from ti
    - Maintain lower/upper bounds of entries in C
  - Until len(s) > max(maxlen(C), λ′i)
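The candidate-gathering core of SF can be sketched as below; this is a simplification under stated assumptions: lists hold (id, set length) pairs in increasing length order, and the lower/upper bound maintenance for C is omitted.

```python
# Simplified SF candidate gathering: lists are traversed depth-first in
# decreasing idf order, pruned by the length window and the lambda'_i cutoff.
def sf_candidates(lists, lambdas, len_q, theta):
    cands = set()
    for i, lst in enumerate(lists):              # decreasing idf order
        lam_i = min(lambdas[i], len_q / theta)   # lambda'_i
        for sid, ln in lst:                      # increasing length order
            if ln < theta * len_q:
                continue                         # too short to qualify
            if ln > lam_i:
                break                            # cutoff reached for this list
            cands.add(sid)
    return cands
```

In the full algorithm the inner loop also keeps reading up to maxlen(C) to refine the bounds of existing candidates, as the slide's stop condition shows.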

Comparison with NRA
- Lemma: Let q = {t1, …, tn} and let d be the maximum depth SF descends over all lists. In the worst case iNRA will read (d − 1)(n − 1) more elements than SF
- But, surprisingly…

A hybrid strategy
- Run iNRA normally
- Use λi and maxlen(C) to stop reading from a particular list
- This guarantees that iNRA stops with or before SF
- Drawback of NRA variants: very high bookkeeping cost compared to SF

Experiments
- DBLP, IMDB and YellowPages datasets: actors, movies, authors, businesses, etc.
- Vary threshold, query size, query strings and mistakes
- Measure wall-clock time and pruning power
- Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, improved SQL-based

Wall-clock time vs. Threshold

Wall-clock time vs. Query size (TA, NRA, Sort-by-id, iTA, SF)

Space

Conclusion
- Proposed a simplified TF/IDF measure
- Identified strong monotonicity properties
- Used the properties to design efficient algorithms
- SF works best overall in practice
- Achieves sub-second answers in most practical cases

Q&A

Pruning power vs. Threshold

Pruning power vs. Query size (NRA, TA, iTA)