Automated Ranking Of Database Query Results  Sanjay Agarwal - Microsoft Research  Surajit Chaudhuri - Microsoft Research  Gautam Das - Microsoft Research.

Presentation transcript:

Automated Ranking Of Database Query Results  Sanjay Agarwal - Microsoft Research  Surajit Chaudhuri - Microsoft Research  Gautam Das - Microsoft Research  Aristides Gionis - Computer Science Dept Stanford University Ramya Somuri Nov ‘ Presented at the first Conference on Innovative Data Systems Research (CIDR) in the year 2003

 Introduction  Problem Formulation  Similarity Functions  Implementation  Experiments  Conclusion

Boolean Semantics of SQL: Success and Barrier?
- Query semantics: a condition is simply True or False for each tuple (the Boolean model).
- Problems with query result representation: Empty Answers and Many Answers.
- Example: Select * From Realtor R Where 400K < Price < 600K AND #Bedrooms = 4.
- What do we want? A ranked list, based on similarity, relevance, and preference.

In this case it would be desirable to return a ranked list of 'approximately' matching tuples without burdening the user with specifying any additional conditions. In other words, an automated approach for ranking and returning approximately matching tuples is needed.

As the name suggests, 'ranking' is the process of ordering a set of values (or data items) based on some parameter that is of high relevance to the user of the ranking process. Ranking and returning the most relevant results of a user's query is a popular paradigm in information retrieval.

 Automated ranking of query results is the process of taking a user query and mapping it to a Top-K query with a ranking function that depends on conditions specified in the user query.

Architecture Of Ranking Systems

Focus of Paper
- Develop a method for automatically ranking database records by relevance to a given query.
- Derive a similarity function.
- Apply the similarity function between the query and the records in the database.
- Rank the result set and return the Top-K records.

Workflow

- IDF Similarity: mimics the TF-IDF concept for heterogeneous data.
- QF Similarity: utilizes workload information.
- QFIDF Similarity: a combination of QF and IDF.

Problem Formulation

Example relation (each row is a tuple; V_k denotes the set of valid values of attribute A_k; PRICE is a numerical attribute, MFR, COLOR, MODEL and TYPE are categorical, and SNO is the key):

SNO | MFR    | PRICE | COLOR | MODEL  | TYPE
1   | AUDI   |       | RED   | Q5     | SUV
2   | BMW    |       | RED   | Z4     |
3   | TOYOTA |       | BLUE  | CAMRY  | SEDAN
4   | HONDA  |       | GREEN | ACCORD | SEDAN
5   | NISSAN |       | WHITE | 350Z   | CONVERTIBLE

R - a relation
{A_1, …, A_m} - the set of attributes
V_k - the set of valid attribute values for attribute A_k
{t_1, …, t_n} - the tuples/records
A tuple t is expressed as t = <t_1, …, t_m>, with value t_k ∈ V_k for each k
Q - a query over R

The WHERE clause of query Q is of the form "WHERE C_1 AND … AND C_k", where each C_i is of the form A_i IN {value_1, …, value_k} or A_i IN [lb, ub].
The similarity coefficient S_k(u, v) expresses the "similarity" between two values u, v of attribute A_k; in the simplest definition, S_k(u, v) = 1 if u = v and 0 if u and v are dissimilar.
w_k - the "importance" (weight) of attribute A_k, with 0 < w_k < 1 and Σ w_k = 1.
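The overall score is then the weighted sum of the per-attribute coefficients. A minimal sketch under these definitions (the attribute names, weights, and data below are illustrative, not from the paper):

# Sketch: SIM(T, Q) = sum_k w_k * S_k(t_k, q_k), summed over the attributes
# mentioned in the query. Names, weights and values are illustrative only.

def exact_match(u, v):
    # Baseline coefficient: S_k(u, v) = 1 if u == v, else 0.
    return 1.0 if u == v else 0.0

def similarity(t, q, weights, coeffs):
    # Weighted sum of per-attribute similarity coefficients.
    return sum(weights[a] * coeffs[a](t[a], q[a]) for a in q)

t = {"MFR": "TOYOTA", "TYPE": "SEDAN", "COLOR": "BLUE"}
q = {"MFR": "TOYOTA", "TYPE": "CONVERTIBLE"}
w = {"MFR": 0.6, "TYPE": 0.4}                  # 0 < w_k < 1, weights sum to 1
s = {"MFR": exact_match, "TYPE": exact_match}
print(similarity(t, q, w, s))                  # 0.6: only MFR matches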

IR technique (keyword queries over documents):
- Q = a set of keywords; d = a document.
- IDF(w) = log(N / F(w)), where N is the number of documents and F(w) is the number of documents in which w appears.
- TF(w, d) = frequency of occurrence of w in d.
- The similarity between query and document is the normalized dot product of the two corresponding vectors, i.e. cosine similarity with TF-IDF weighting.

Database analogue (categorical attributes only):
- T = <t_1, …, t_m> is a tuple and Q = <q_1, …, q_m> a query whose condition is "WHERE A_1 = q_1 AND … AND A_m = q_m".
- IDF_k(t) = log(n / F_k(t)), where n is the number of tuples in the database and F_k(t) is the frequency of tuples in the database where A_k = t.
- The similarity between T and Q is the sum of the corresponding similarity coefficients over all attributes; the dot product is un-normalized, and TF is irrelevant since a tuple either contains a value or does not.
- This similarity function is known as IDF similarity.
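A small sketch of the categorical IDF similarity just described, assuming the data fits in memory as a list of dictionaries (the column names and rows are illustrative):

import math
from collections import Counter

def idf_table(rows, attr):
    # IDF_k(t) = log(n / F_k(t)), where F_k(t) is how often value t occurs in column attr.
    n = len(rows)
    freq = Counter(row[attr] for row in rows)
    return {v: math.log(n / f) for v, f in freq.items()}

def idf_similarity(row, query, idfs):
    # Sum of IDF_k(q_k) over the attributes where the tuple value equals the query value.
    return sum(idfs[a][q] for a, q in query.items() if row.get(a) == q)

rows = [
    {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    {"MFR": "NISSAN", "TYPE": "SEDAN"},
    {"MFR": "TOYOTA", "TYPE": "SEDAN"},
    {"MFR": "HONDA",  "TYPE": "SEDAN"},
]
idfs = {a: idf_table(rows, a) for a in ("MFR", "TYPE")}
query = {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"}
ranked = sorted(rows, key=lambda r: idf_similarity(r, query, idfs), reverse=True)
# Order: the Nissan convertible (both values match), then the Nissan sedan (only MFR matches).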

Select model From automobile_database Where TYPE = "convertible" AND MFR = "Nissan";
The system generates tuples in the following order:
- Nissan convertibles
- convertibles by other manufacturers
- other cars/types by Nissan
"Convertible" is rare and therefore has a higher IDF than "Nissan", which is a common car manufacturer.

 No  Example Select * From automobile_database Where price=3000  S k (u,v) = 1 if (u=v) otherwise 0 is a bad definition since two numerical values might be close but not equal.

For a numeric attribute, S_k(u, v) = 1 - d / |u_k - l_k|, where d = |v - u| is the distance between the two values and [l_k, u_k] is the domain of A_k.
Example: Select * from Realtor R where #Bedrooms = 4; homes with bedroom counts near 4 get a similarity close to 1, and the similarity falls off linearly with the distance from 4.
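A tiny illustration of this coefficient, assuming the domain bounds of the attribute are known (the bedroom domain [1, 10] below is an assumption made for the example):

def numeric_similarity(u, v, lo, hi):
    # S_k(u, v) = 1 - |u - v| / (hi - lo), where [lo, hi] is the domain of the attribute.
    return 1.0 - abs(u - v) / (hi - lo)

# Query: #Bedrooms = 4, with an assumed domain of [1, 10].
for bedrooms in (4, 3, 5, 1):
    print(bedrooms, round(numeric_similarity(bedrooms, 4, 1, 10), 2))
# 4 -> 1.0, 3 -> 0.89, 5 -> 0.89, 1 -> 0.67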

IDF for numeric data:
- It is inappropriate to reuse the previous similarity coefficients directly, because the frequency of a numeric value also depends on nearby values, and discretizing a numeric attribute into a categorical one is problematic.
- Solution: let {t_1, t_2, …, t_n} be the values of attribute A. For every value t, its frequency is taken as the sum of the "contributions" of t from every other point, with each contribution modeled as a Gaussian; the IDF is then computed from this smoothed frequency.
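A sketch of the Gaussian-contribution idea; the bandwidth h below is a free parameter chosen purely for illustration (the paper derives its own setting):

import math

def numeric_idf(values, t, h):
    # F(t) = sum_i exp(-0.5 * ((t - t_i) / h)^2): every existing value contributes
    # to the "frequency" of t, weighted by a Gaussian; IDF(t) = log(n / F(t)).
    n = len(values)
    contributions = sum(math.exp(-0.5 * ((t - ti) / h) ** 2) for ti in values)
    return math.log(n / contributions)

years = [2008] * 30 + [2007] * 20 + [1981] * 2   # many recent homes, few old ones
print(numeric_idf(years, 2008, h=1.0))            # low IDF: dense neighbourhood
print(numeric_idf(years, 1981, h=1.0))            # high IDF: sparse neighbourhood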

Problem: in a realtor database, more homes were built in recent years such as 2007 and 2008 than in 1980 and 1981, so recent years have a small IDF. Yet newer homes are in higher demand.
Solution: QF Similarity.

The importance of attribute values is directly related to the frequency of their occurrence in the workload.
In the previous example, it is reasonable to assume that more queries request newer homes than older homes, so the year 2008 will appear more frequently in the workload than the year 1981.

Query frequency: QF(q) = RQF(q) / RQFMax, where RQF(q) is the raw frequency of occurrence of value q of attribute A in the query strings of the workload, and RQFMax is the raw frequency of the most frequently occurring value in the workload.
S(t, q) = QF(q) if q = t, and 0 otherwise.

Consider a workload W = {Q1, Q2, Q3, Q4}:
Q1 - Select * from Realtor R Where year = "2009"
Q2 - Select * from Realtor R Where year = "2009"
Q3 - Select * from Realtor R Where year = "2008"
Q4 - Select * from Realtor R Where year = "2007"
Attribute Year = {1981, …, 2009}.
QF(2008) = RQF(2008) / RQFMax = 1/2.
If a query requests an attribute value that never occurs in the workload, then QF = 0, e.g. QF(1981) = 0.
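A small sketch of this computation over the workload above; queries are represented simply by the year they request, since parsing real SQL is beside the point here:

from collections import Counter

def qf_table(requested_values):
    # QF(q) = RQF(q) / RQFMax: frequency of value q in the workload,
    # normalized by the most frequently requested value.
    rqf = Counter(requested_values)
    rqf_max = max(rqf.values())
    return {v: f / rqf_max for v, f in rqf.items()}

qf = qf_table([2009, 2009, 2008, 2007])   # years requested by Q1..Q4
print(qf.get(2008, 0.0))                  # 0.5  (RQF = 1, RQFMax = 2)
print(qf.get(1981, 0.0))                  # 0.0  (never requested)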

Problem/example: with exact-match coefficients, S_MFR(Toyota, Honda) = 0 and S_MODEL(Camry, Accord) = 0.
Solution: similarity coefficients that are non-zero even when the pair of categorical attribute values differs, e.g. S_MFR(Toyota, Honda) = 0.9.

Similarity between pairs of different categorical attribute values can also be derived from the workload.
The similarity coefficient between a tuple value t and a query value q is defined as the Jaccard coefficient of their workload query sets, scaled by the QF factor: S(t, q) = J(W(t), W(q)) * QF(q), where W(v) denotes the set of workload queries that reference value v.
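A sketch of this coefficient, assuming W(v) is represented as a set of workload query IDs (the workload sets and QF values below are made up for illustration):

def jaccard(a, b):
    # J(A, B) = |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cross_value_similarity(t, q, w_sets, qf):
    # S(t, q) = J(W(t), W(q)) * QF(q): non-zero even when t != q,
    # provided the two values co-occur in some workload queries.
    return jaccard(w_sets.get(t, set()), w_sets.get(q, set())) * qf.get(q, 0.0)

# Hypothetical workload: query IDs whose IN clauses mention each manufacturer.
w_sets = {"TOYOTA": {1, 2, 3}, "HONDA": {1, 2}, "NISSAN": {3}}
qf = {"TOYOTA": 1.0, "HONDA": 0.7, "NISSAN": 0.3}
print(cross_value_similarity("TOYOTA", "HONDA", w_sets, qf))   # (2/3) * 0.7 ≈ 0.47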

Analyzing the IN clauses of queries: if a certain pair of values often occurs together in the workload, the values are similar, e.g. queries with a condition C such as "MFR IN {TOYOTA, HONDA, NISSAN}".
Several recent queries in the workload by a specific user may repeatedly request TOYOTA and HONDA.
Numerical values that occur in the workload can also benefit from query frequency analysis.

Why QFIDF?
- QF is purely workload-based; it does not use the data at all.
- It fails when the workload is insufficient or unreliable.
What is QFIDF?
- QFIDF is a hybrid ranking function obtained by combining the QF and IDF weights, multiplying them together.
- For QFIDF similarity, S(t, q) = QF(q) * IDF(q) when t = q, and 0 otherwise, where QF(q) = (RQF(q) + 1) / (RQFMax + 1).
- Thus we get a small non-zero QF even if a value is never referenced in the workload.
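A sketch of the hybrid weight; the raw workload frequencies and IDF values below are illustrative placeholders, not numbers from the paper:

def qfidf(t, q, rqf, rqf_max, idf):
    # S(t, q) = QF(q) * IDF(q) when t == q, else 0, with
    # QF(q) = (RQF(q) + 1) / (RQFMax + 1) so that unseen values keep a small weight.
    if t != q:
        return 0.0
    qf = (rqf.get(q, 0) + 1) / (rqf_max + 1)
    return qf * idf.get(q, 0.0)

rqf = {"CONVERTIBLE": 4, "SEDAN": 9}       # raw workload frequencies (illustrative)
idf = {"CONVERTIBLE": 2.3, "SEDAN": 0.4}   # IDF weights from the data (illustrative)
print(qfidf("CONVERTIBLE", "CONVERTIBLE", rqf, rqf_max=9, idf=idf))   # (5/10) * 2.3 = 1.15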

In the case of the many-answers problem, the ranking functions discussed so far may fail, because many tuples can tie for the same similarity score. The same situation can arise for the empty-answers problem as well.
Breaking such ties requires looking beyond the attributes specified in the query, i.e. at the missing attributes.

Many Answers Problem: Breaking Ties

Solution: determine weights for the missing attribute values that reflect their "global importance" for ranking purposes, using workload information.
Extend QF similarity: use the quantity Σ log(QF_k(t_k)), summed over the missing attributes, to break ties.
Example: consider a query requesting 4-bedroom houses.
- The result set contains many homes.
- Examine the attributes other than the number of bedrooms (the missing attributes), e.g. location: Dallas is requested more often than Arlington in the workload.
- Rank the 4-bedroom homes in Dallas higher than those in Arlington.
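A sketch of this tie-breaking score, assuming a per-attribute QF table; the small epsilon that keeps the logarithm defined for never-requested values is an assumption, not from the paper:

import math

def tie_break_score(row, query, qf_tables, eps=1e-6):
    # Sum of log(QF_k(t_k)) over attributes NOT mentioned in the query:
    # tuples whose unspecified values are popular in the workload rank higher.
    missing = [a for a in row if a not in query]
    return sum(math.log(qf_tables[a].get(row[a], 0.0) + eps) for a in missing)

qf_tables = {"LOCATION": {"DALLAS": 0.9, "ARLINGTON": 0.2}}
query = {"BEDROOMS": 4}
dallas_home = {"BEDROOMS": 4, "LOCATION": "DALLAS"}
arlington_home = {"BEDROOMS": 4, "LOCATION": "ARLINGTON"}
print(tie_break_score(dallas_home, query, qf_tables) >
      tie_break_score(arlington_home, query, qf_tables))   # True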

Why not break ties with IDF instead?
- Ranking tuples with large IDF for the missing attributes higher: Arlington homes would then be preferred over Dallas homes, since Arlington has the higher IDF, which does not match real preferences.
- Ranking tuples with small IDF for the missing attributes higher: consider homes with decks; preference would go to homes without decks, since they have the smaller IDF, which again does not match real preferences.

Implementation has two components:
- a pre-processing component
- a query-processing component

Pre-processing component:
- Compute and store a representation of the similarity function in auxiliary database tables.
- For categorical data: compute IDF(t) (resp. QF(t)) by counting the frequency of occurrence of each value in the database (resp. the workload), and store the results in auxiliary database tables.
- For numeric data: an approximate representation of the smooth function IDF() (resp. QF()) is stored, so that the function value for any q can be retrieved at runtime.

Query-processing component:
- Main task: given a query Q and an integer K, retrieve the Top-K tuples from the database using one of the ranking functions extracted in the pre-processing phase.
- SQL-DBMS functionality is used to solve the Top-K problem.
- The paper first handles a simpler query-processing problem:
  Input: a table R with m categorical columns, a key column TID, a condition C that is a conjunction of the form A_k = q_k, and an integer K.
  Output: the Top-K tuples of R most similar to Q.
  Similarity function: overlap similarity.

- Traditional approach versus an index-based approach.
- The overlap similarity function satisfies the following monotonicity property: if T and U are two tuples such that S_k(t_k, q_k) ≤ S_k(u_k, q_k) for all k, then SIM(T, Q) ≤ SIM(U, Q).
- To adapt Fagin's Threshold Algorithm (TA), sorted access and random access methods are implemented.
- The algorithm performs sorted access on a list for each attribute, retrieves the complete tuple for each TID it encounters by random access, and maintains a buffer of the Top-K tuples seen so far.

Threshold Algorithm (TA)
- Read all grades of an object as soon as it is seen under sorted access; there is no need to wait until the lists yield k common objects.
- Do sorted access (and the corresponding random accesses) until you have seen the top k answers.
- How do we know that the grades of seen objects are higher than the grades of unseen objects? Predict the maximum possible grade of unseen objects: the threshold value, taken as the minimum of the grades at the current sorted-access positions.
(The slide shows two sorted lists L1 and L2 of (object, grade) entries, marking which objects are seen and which are possibly unseen; at the stopping point the threshold value is T = min(0.72, 0.7) = 0.7.)

Example - Threshold Algorithm
Step 1: do parallel sorted access to each list.
L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6)
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2)
For each object seen: get all its grades by random access, determine Min(A1, A2), and keep it in the buffer if it is among the 2 highest seen so far.
Buffer after the first sorted accesses (objects a and d):
ID | A1  | A2   | Min(A1, A2)
a  | 0.9 | 0.85 | 0.85
d  | 0.6 | 0.9  | 0.6

Example - Threshold Algorithm
Step 2: determine the threshold value based on the objects currently seen under sorted access, T = min(L1, L2) at the current positions.
L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2
At depth 1, T = min(0.9, 0.9) = 0.9.
Buffer: a (Min = 0.85), d (Min = 0.6).
Are there 2 objects with overall grade ≥ the threshold value? If so, stop; otherwise go to the next entry position in the sorted lists and repeat Step 1.

Example - Threshold Algorithm
Step 1 (again): parallel sorted access reaches (b, 0.8) in L1 and (a, 0.85) in L2.
For each object seen: get all grades by random access, determine Min(A1, A2), and keep it in the buffer if it is among the 2 highest seen.
Buffer: a (0.9, 0.85, Min = 0.85) and b (0.8, 0.7, Min = 0.7); d (Min = 0.6) drops out of the top 2.

Example - Threshold Algorithm
Step 2 (again): determine the threshold value based on the objects currently seen, T = min(L1, L2).
At depth 2, T = min(0.8, 0.85) = 0.8.
Buffer: a (Min = 0.85), b (Min = 0.7).
Are there 2 objects with overall grade ≥ 0.8? No, only a qualifies, so go to the next entry position in the sorted lists and repeat Step 1.

Example - Threshold Algorithm
Situation at the stopping condition:
At depth 3, T = min(0.72, 0.7) = 0.7.
Buffer: a (Min = 0.85), b (Min = 0.7). Both have overall grade ≥ 0.7, so the algorithm stops and returns {a, b} as the top-2 objects.
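A compact sketch of TA on the two lists from this example, using min() as the aggregation function as in the slides; a real implementation would drive sorted access from per-attribute indexes and random access from TID lookups rather than in-memory lists:

import heapq

def threshold_algorithm(lists, k):
    # Fagin's Threshold Algorithm with min() aggregation over pre-sorted
    # (object, grade) lists. Assumes every object appears in every list.
    grade = [{obj: g for obj, g in lst} for lst in lists]   # "random access" lookups
    best = {}                                               # object -> overall grade
    for depth in range(len(lists[0])):
        for lst in lists:                                   # sorted access, one entry per list
            obj = lst[depth][0]
            if obj not in best:
                best[obj] = min(table[obj] for table in grade)
        threshold = min(lst[depth][1] for lst in lists)     # best grade any unseen object can reach
        top = heapq.nlargest(k, best.items(), key=lambda kv: kv[1])
        if len(top) == k and all(g >= threshold for _, g in top):
            return top                                      # stopping condition met
    return heapq.nlargest(k, best.items(), key=lambda kv: kv[1])

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
print(threshold_algorithm([L1, L2], k=2))   # [('a', 0.85), ('b', 0.7)], stops at T = 0.7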

Sorted access Random access

Stopping Condition
- Hypothetical tuple: the current values a_1, …, a_p for attributes A_1, …, A_p, corresponding to the index seeks on lists L_1, …, L_p, combined with the values q_{p+1}, …, q_m for the remaining columns, taken directly from the query.
- Termination: stop when the similarity of the hypothetical tuple to the query is less than the similarity of the tuple in the Top-K buffer with the least similarity.

- An automated ranking infrastructure for SQL databases.
- Extended TF-IDF-based techniques from information retrieval to numeric and mixed data.
- An implementation of the ranking functions that exploits indexed access (Fagin's Threshold Algorithm).