Automated Ranking of Database Query Results. Sanjay Agarwal, Microsoft Research; Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Aristides Gionis, Computer Science Dept., Stanford University. Presented by: Suvigya Jaiswal (Fall 10)

Ranking As the name suggests, ranking is the process of ordering a set of values (or data items) based on some parameter that is highly relevant to the user of the ranking process. Ranking and returning the most relevant results of a user's query is a popular paradigm in information retrieval.

Database Ranking Example

Introduction Automated ranking is widely used in Information Retrieval (IR). Database systems do not support automated ranking; they support only the Boolean query model. Two scenarios are not handled well by SQL systems: 1. Empty Answers (the query is too specific) 2. Many Answers (the query is not specific enough)

Introduction How can ranking functions from IR be adapted to the database ranking problem? 1. When every attribute in the relation is categorical, mimic the IR solution by applying the TF-IDF idea to frequencies of attribute values. 2. When attributes are also numerical, extend the TF-IDF concepts to numerical domains. In some cases the TF-IDF idea does not produce results with the desired accuracy; in these cases, workload information is used to arrive at better results.

Contributions of Paper IDF Similarity; QF Similarity; QFIDF Similarity; an Index-Based Threshold Algorithm (ITA).

IDF Similarity: Intro Given: a table R with attributes {A1, …, Am}, tuples {T1, …, Tn}, and an integer K. The query's WHERE clause is of the form "WHERE C1 AND C2 AND … AND Cm".

Attributes Numerical attributes (e.g., PRICE) vs. categorical attributes (e.g., MFR, COLOR, MODEL, TYPE); each row of the table is a tuple.

SNO | MFR    | PRICE | COLOR | MODEL  | TYPE
1   | AUDI   | …     | RED   | Q5     | SUV
2   | BMW    | …     | RED   | Z4     | …
3   | TOYOTA | …     | BLUE  | CAMRY  | SEDAN
4   | HONDA  | …     | GREEN | ACCORD | SEDAN
5   | NISSAN | …     | WHITE | 350Z   | CONVERTIBLE

Cosine Similarity Cosine similarity from IR can be applied when the database has only categorical attributes. The tuple and the query are each treated as a small document. A document is an m-dimensional vector over m words; the i-th element of the vector is the TF of the i-th word. Cosine similarity: SIM(T, Q) = (T · Q) / (|T| |Q|).

Cosine Similarity IDF is used to further refine cosine similarity: IDF(w) = log(N / F(w)), where N is the number of documents and F(w) is the number of documents in which w appears. The idea behind using IDF: rarely occurring words convey more information than frequently occurring words.

IDF Similarity For every value t in the domain of attribute Ak, IDFk(t) is defined as IDFk(t) = log(n / Fk(t)), where n is the number of tuples and Fk(t) is the frequency of tuples with Ak = t. A tuple is T = <t1, …, tm> and the query is Q = <q1, …, qm>, where the condition is of the form "WHERE A1 = q1 AND A2 = q2 AND … AND Am = qm". The per-attribute similarity is Sk(u, v) = IDFk(u) if u = v, and Sk(u, v) = 0 otherwise; the overall score SIM(T, Q) is the sum of the per-attribute similarities.
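
To make the definitions concrete, here is a minimal Python sketch (not from the paper; the helper names and toy data are hypothetical) that precomputes the IDF tables for categorical attributes and scores tuples against a point query:

```python
from collections import Counter
from math import log

def idf_tables(rows, attrs):
    """Precompute IDF_k(t) = log(n / F_k(t)) for every categorical value t."""
    n = len(rows)
    return {k: {t: log(n / f) for t, f in Counter(r[k] for r in rows).items()}
            for k in attrs}

def idf_similarity(tup, query, idf):
    """SIM(T, Q): sum over attributes of IDF_k(q_k) where the tuple matches Q."""
    return sum(idf[k].get(q, 0.0) for k, q in query.items() if tup[k] == q)

rows = [
    {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    {"MFR": "NISSAN", "TYPE": "SEDAN"},
    {"MFR": "HONDA", "TYPE": "SEDAN"},
    {"MFR": "AUDI", "TYPE": "SUV"},
]
idf = idf_tables(rows, ["MFR", "TYPE"])
query = {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"}
ranked = sorted(rows, key=lambda r: idf_similarity(r, query, idf), reverse=True)
```

Because TYPE = CONVERTIBLE is rarer here than MFR = NISSAN, a convertible by another maker outscores a Nissan sedan, which is exactly the behavior discussed on the next slide.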

Uses As an example, say we want to find all convertibles made by Nissan. The system will return, in order: 1. All convertibles made by Nissan. 2. All convertibles made by other manufacturers. 3. All Nissan cars that may not be convertibles. Why this order? Convertible is a rarer TYPE value than Nissan is a MFR value, so matches on TYPE contribute a higher IDF score than matches on MFR.

IDF Similarity for Numerical Data Why can the IDF similarity for categorical data not be used for numeric data? Consider: SELECT * FROM R WHERE PRICE = 300K AND BEDROOM = 10; S(u, v) will incorrectly evaluate to zero for the near-misses, since 315K and 305K are close to 300K (and 9 is close to 10) but not equal.

ID | PRICE | BEDROOM | CITY
1  | 315K  | 9       | DALLAS
2  | 300K  | 10      | FTW
3  | 305K  | 10      | ARLINGTON

IDF Similarity for Numerical Data Solution: let {t1, t2, …, tn} be the values of attribute A. For every value t, define, for a bandwidth parameter h, IDF(t) = log(n / Σi e^(-(t - ti)^2 / (2h^2))). The denominator is the sum of contributions to t from every other ti; the further t is from ti, the smaller the contribution from ti.

IDF Similarity for Numerical Data The similarity S(t, q) is defined via the density at t of a Gaussian distribution centered at q. Sanity check: suppose n1 tuples have the same value as t and the remaining n - n1 tuples have values far from t. 1. If q belongs to the far-away group, then S(t, q) is almost 0. 2. If q belongs to the n1 group, then S(t, q) ≈ log(n / n1).
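
A minimal sketch of the smoothed numeric IDF and the resulting similarity, assuming the Gaussian-kernel form above; the bandwidth h is left as a parameter (the paper derives it from the data spread, which is omitted here):

```python
from math import exp, log

def numeric_idf(values, q, h):
    """IDF(q) = log(n / sum_i exp(-0.5 * ((q - t_i) / h)**2)).
    Values near q contribute ~1 to the denominator; far values contribute ~0."""
    n = len(values)
    return log(n / sum(exp(-0.5 * ((q - ti) / h) ** 2) for ti in values))

def numeric_similarity(values, t, q, h):
    """S(t, q): Gaussian kernel at t centered on q, scaled by IDF(q), so that
    S(t, q) ~ log(n / n1) when q sits among n1 values equal to t, and ~0 when far."""
    return exp(-0.5 * ((t - q) / h) ** 2) * numeric_idf(values, q, h)

prices = [315_000, 300_000, 305_000]
print(numeric_similarity(prices, 305_000, 300_000, h=10_000))  # close values, nonzero score
```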

QF Similarity Why is IDF similarity not sufficient? Examples: 1. In a real-estate database, more homes were built in recent years than in earlier years (e.g., the 1980s), so the IDF of recently built homes is low; yet the demand for newer homes is higher. 2. In a bookstore database, demand for a particular author's works may be high even though the author has written many books; that author's IDF will nevertheless be low.

QF Similarity The idea behind QF similarity is that the importance of an attribute value is related to the frequency of its occurrence in the query strings of the workload. In the previous example, it is reasonable to assume that queries for newer homes appear more often than queries for older homes. Likewise, queries for a particular author may appear more often than for other authors if that author's books are more popular, despite the author having many books.

QF Similarity We define query frequency QF as QF(q) = RQF(q) / RQFMax, where RQF(q) is the raw frequency of occurrence of value q of attribute A in the query strings of the workload, and RQFMax is the raw frequency of the most frequently occurring value in the workload. S(t, q) = QF(q) if q = t, else 0.
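
A minimal sketch of QF, assuming the workload has already been parsed into the list of attribute values that appeared in its query strings (the parsing itself is omitted):

```python
from collections import Counter

def qf_table(workload_values):
    """QF(q) = RQF(q) / RQFMax over the values seen in the workload."""
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values())
    return {q: c / rqf_max for q, c in rqf.items()}

def qf_similarity(t, q, qf):
    return qf.get(q, 0.0) if t == q else 0.0

# Hypothetical workload: YEAR_BUILT values seen in past queries.
qf = qf_table([2008, 2008, 2009, 2008, 1985])
print(qf_similarity(2008, 2008, qf))  # 1.0: the most frequently queried year
```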

Similarity Between Different Attribute Values If we use IDF or QF similarity to measure either of the following, we get 0 as the answer: S(Toyota, Honda) = 0, S(Accord, Camry) = 0. 1. But we know that Honda and Toyota make cars directed toward the same market segment. 2. Accord and Camry are the same type of car, of comparable quality.

Similarity Between Different Attribute Values To solve this problem, we apply the intuition that if a pair of distinct values (t ≠ u) often occur together in the workload, then they are similar. For example, if we receive many queries with IN conditions of the form "MFR IN {Toyota, Honda, Nissan}", this suggests that Toyota, Honda, and Nissan are more similar to each other than they are to Ferrari or Mercedes. Hence, under this metric we might get S(Toyota, Honda) = 0.8 and S(Ferrari, Toyota) = 0.1.

Similarity Between Different Attribute Values Let W(t) be the subset of queries in workload W in which the categorical value t (in our example, say Toyota) appears in an IN clause. The Jaccard coefficient measures the similarity between W(t) and W(q): J(W(t), W(q)) = |W(t) ∩ W(q)| / |W(t) ∪ W(q)|. The similarity coefficient is then defined as S(t, q) = J(W(t), W(q)) * QF(q).
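
A sketch of this workload-based value similarity, assuming the workload has been parsed into a map from each categorical value to the set of query IDs whose IN clauses mention it:

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| over sets of query IDs."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def value_similarity(t, q, w, qf):
    """S(t, q) = J(W(t), W(q)) * QF(q)."""
    return jaccard(w.get(t, set()), w.get(q, set())) * qf.get(q, 0.0)

w = {"Toyota": {1, 2, 3}, "Honda": {1, 2, 4}, "Ferrari": {5}}
qf = {"Toyota": 0.9, "Honda": 0.8, "Ferrari": 0.1}
print(value_similarity("Toyota", "Honda", w, qf))    # 0.4: often co-queried
print(value_similarity("Ferrari", "Toyota", w, qf))  # 0.0: never co-queried
```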

QFIDF Similarity QF similarity can be unreliable in certain situations because it is purely workload-based; it does not take data values into account. To tackle this we define QFIDF similarity: S(t, q) = QF(q) * IDF(q) when t = q, and 0 otherwise, where QF(q) = (RQF(q) + 1) / (RQFMax + 1). The 1 is added to the numerator and denominator so that QF is never zero.
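
Combining the two earlier sketches, with the add-one smoothing from this slide (names hypothetical):

```python
def qfidf_similarity(t, q, rqf, rqf_max, idf):
    """S(t, q) = QF(q) * IDF(q) if t == q else 0,
    with QF(q) = (RQF(q) + 1) / (RQFMax + 1) so that QF is never zero."""
    if t != q:
        return 0.0
    return ((rqf.get(q, 0) + 1) / (rqf_max + 1)) * idf.get(q, 0.0)
```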

Many Answers Problem IDF similarity and QF similarity may sometimes run into a problem: many tuples may tie for the same similarity score and thus get ordered arbitrarily. The approach is to determine weights for the missing attribute values (those not specified in the query) that reflect their "global importance" for ranking purposes. If we seek homes with four bedrooms in the database, we can examine attributes other than the number of bedrooms to rank the result set. If we knew that Dallas is a more important location than Fort Worth in a global sense, we would rank four-bedroom homes in Dallas higher than four-bedroom homes in Fort Worth.

We use workload information to determine the global importance of missing attribute values. We define the global importance of a missing attribute value tk as log(QFk(tk)). Extend QF similarity to use the quantity Σk log(QFk(tk)), summed over the missing attributes, to break ties in each equivalence class (the larger this quantity, the higher the rank of the tuple). An alternative strategy is to rank tied tuples higher if their missing attribute values have small IDF, i.e., occur more frequently in the database.
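
A sketch of this tie-breaking rule as a secondary sort key (names hypothetical; sim stands in for whichever ranking function is in use):

```python
from math import log

def tie_break(tup, query, qf):
    """Sum of log QF_k(t_k) over the attributes the query did not specify.
    The tiny default avoids log(0) and ranks never-queried values last."""
    return sum(log(qf[k].get(tup[k], 1e-9)) for k in tup if k not in query)

def rank(rows, query, sim, qf):
    # Primary key: similarity score; secondary key: global importance
    # of the tuple's values on the unspecified attributes.
    return sorted(rows, reverse=True,
                  key=lambda r: (sim(r, query), tie_break(r, query, qf)))
```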

Implementation Two phases: 1. A pre-processing component 2. A query processing component

Pre-processing Component Computing IDF(t) (resp. QF(t)) for all categorical values t involves scanning the database (resp. scanning and parsing the workload) to compute the frequency of occurrence of each value in the database (resp. workload), and storing the results in auxiliary tables. We cannot pre-compute IDF(q) (resp. QF(q)) for every possible numerical query value; instead we store an approximate representation of the smooth function IDF(·) (resp. QF(·)) so that the function value at any q can be retrieved at runtime.

Query Processing Component The main task of the query processing component is, given a query Q and an integer K, to efficiently retrieve the Top-K tuples from the database using one of the ranking functions.

A Simpler Query Processing Problem Inputs: (a) a database table R with m categorical columns, clustered on key column TID, where standard database indexes exist on a subset of columns; (b) a query expressed as a conjunction of m single-valued conditions of the form Ak = qk; and (c) an integer K. Similarity function: Overlap Similarity. Output: the Top-K tuples of R most similar to Q.

An Index-Based Top-K Implementation Monotonicity property: if T and U are two tuples such that Sk(tk, qk) <= Sk(uk, qk) for all k, then SIM(T, Q) <= SIM(U, Q). This lets us adapt Fagin's Threshold Algorithm (TA). Two types of access methods are required: 1. Sorted access 2. Random access The algorithm uses an early stopping condition, by which it can detect that the final Top-K tuples have been retrieved before all tuples have been processed.

Threshold Algorithm Read all grades of an object as soon as it is seen under sorted access; there is no need to wait until the lists yield k common objects. Do sorted accesses (with corresponding random accesses) until you have seen the top k answers. How do we know that the grades of seen objects are higher than the grades of possibly unseen objects? Predict the maximum possible grade of unseen objects with a threshold value: the aggregate of the last grades seen under sorted access in each list, e.g., T = min(0.72, 0.7) = 0.7.

Example – Threshold Algorithm Step 1: do parallel sorted accesses to each list. L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6). L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2). For each object seen: get all its grades by random access, determine Min(A1, A2), and keep it in the buffer if it is among the 2 highest seen.

ID | A1  | A2   | Min(A1, A2)
a  | 0.9 | 0.85 | 0.85
d  | 0.6 | 0.9  | 0.6

Example – Threshold Algorithm Step 2: determine the threshold value from the objects currently seen under sorted access: T = min(L1, L2) = min(0.9, 0.9) = 0.9. Are there 2 objects with overall grade ≥ the threshold value? No (a has 0.85, d has 0.6), so go to the next entry position in the sorted lists and repeat Step 1.

ID | A1  | A2   | Min(A1, A2)
a  | 0.9 | 0.85 | 0.85
d  | 0.6 | 0.9  | 0.6

Example – Threshold Algorithm Step 1 (again): continue the parallel sorted accesses. Object b is now seen; its grades are fetched by random access and Min(0.8, 0.7) = 0.7, so the buffer of the 2 highest objects seen becomes a and b.

ID | A1  | A2   | Min(A1, A2)
a  | 0.9 | 0.85 | 0.85
b  | 0.8 | 0.7  | 0.7

Example – Threshold Algorithm Step 2 (again): determine the threshold value from the grades currently seen under sorted access: T = min(0.8, 0.85) = 0.8. Are there 2 objects with overall grade ≥ 0.8? No (a has 0.85 but b has only 0.7), so repeat Step 1.

ID | A1  | A2   | Min(A1, A2)
a  | 0.9 | 0.85 | 0.85
b  | 0.8 | 0.7  | 0.7

Example – Threshold Algorithm Situation at the stopping condition: T = min(0.72, 0.7) = 0.7, and both buffered objects have overall grade ≥ 0.7 (a: 0.85, b: 0.7), so the algorithm stops and returns {a, b}.

ID | A1  | A2   | Min(A1, A2)
a  | 0.9 | 0.85 | 0.85
b  | 0.8 | 0.7  | 0.7

Algorithm
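
The original slide showed the algorithm as an image. Below is a minimal Python sketch of the Threshold Algorithm with min aggregation, matching the worked example above; it illustrates plain TA, not the paper's full index-based variant (ITA):

```python
import heapq

def threshold_algorithm(lists, k, agg=min):
    """Fagin's TA. lists: one list per attribute, each sorted by grade
    descending as [(object_id, grade), ...]. agg must be monotone.
    Random access is simulated here with per-list lookup dictionaries."""
    lookup = [dict(lst) for lst in lists]              # random access
    buf = {}                                           # object_id -> overall grade
    for depth in range(max(len(lst) for lst in lists)):
        last_seen = []
        for lst in lists:                              # one round of sorted access
            if depth >= len(lst):
                continue
            obj, grade = lst[depth]
            last_seen.append(grade)
            if obj not in buf:                         # fetch all grades, aggregate
                buf[obj] = agg(tbl[obj] for tbl in lookup)
        threshold = agg(last_seen)                     # best possible unseen grade
        top = heapq.nlargest(k, buf.items(), key=lambda kv: kv[1])
        if len(top) == k and all(g >= threshold for _, g in top):
            return top                                 # early stopping condition
    return heapq.nlargest(k, buf.items(), key=lambda kv: kv[1])

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
print(threshold_algorithm([L1, L2], k=2))  # [('a', 0.85), ('b', 0.7)]
```

Running it on the lists from the example stops at depth 3 with T = min(0.72, 0.7) = 0.7, exactly as in the slides.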

Experiment Results
Quality results:
- For queries with empty answers, QFIDF produced the best rankings, followed by QF, then IDF, and finally Overlap.
- For queries with empty answers, the ranking quality of QF improves with increasing workload size.
- For queries with numerous answers, QF produced better rankings than IDF.
Performance results:
- The preprocessing time and space requirements of all techniques scale linearly with data size.
- When all indexes are present, ITA is more efficient than SQL Server Top-K for all similarity functions.
- Even when only a subset of indexes is present, ITA can perform well.

References PPT slides by Ramya Soumri (Fall 09). R. Fagin. Fuzzy Queries in Multimedia Database Systems. PODS 1998.

Thank You