Probabilistic Ranking of Database Query Results
Gautam Das, Surajit Chaudhuri, Vagelis Hristidis, Gerhard Weikum
Presented by: Z.M. Joseph
Spring 2006, CSE, UT Arlington
Introduction
- Addresses the Many-Answers problem
  - The query is not very selective, so it has many matching tuples
  - The results therefore need some ranking
- Every returned tuple matches all of the specified attributes
  - Ranking must therefore look into the non-specified attributes
Challenge
- How do you rank tuples based on non-specified attributes?
- Correlation information is difficult to get
- It is expensive to manage
Approach
- Build on Probabilistic Information Retrieval (PIR)
- Combine:
  - Global Score: contains the global importance of unspecified attributes
  - Conditional Score: captures the strength of correlation between unspecified and specified attributes
- Preprocessing at an Intermediate Knowledge Representation Layer
Recall from PIR
- We already know that for a tuple t: Score(t) ∝ P(t|R) / P(t|D)
- t can be broken down as:
  - X: the set of specified attributes
  - Y: the set of unspecified attributes
- R is the ideal set of result tuples
- D is a single database table, used to approximate ~R (the tuples not in R)
Structured Data
- Simplifies to: Score(t) ∝ P(Y|R) / P(Y|X,D)
- This automatically increases the score of tuples whose unspecified attribute values occur more often in the ideal tuple set R
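The simplification can be traced from the definitions of X, Y, R and D. The steps below are a sketch, assuming that every tuple in R satisfies the query (so P(X|R) = 1 and P(Y|X,R) = P(Y|R)) and that P(X|D) is identical for all tuples matching the query, so it can be dropped as a constant:

```latex
\begin{align*}
\mathrm{Score}(t) &\propto \frac{P(t \mid R)}{P(t \mid D)}
   = \frac{P(X, Y \mid R)}{P(X, Y \mid D)} \\
 &= \frac{P(X \mid R)\, P(Y \mid X, R)}{P(X \mid D)\, P(Y \mid X, D)} \\
 &\propto \frac{P(Y \mid R)}{P(Y \mid X, D)}
\end{align*}
```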
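Limited Independence Assumptions
- Possible to capture dependencies and correlations from structured data
- Efficient approach: values within X are assumed independent of each other, and likewise values within Y; dependencies between X and Y values are still allowed
- Allows derivation of: Score(t) ∝ ∏_{y∈Y} P(y|R)/P(y|D) · ∏_{y∈Y} ∏_{x∈X} 1/P(x|y,D)
- This assumption may not always be correct!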
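One way to reach a per-value product form is sketched below. It starts from the structured-data score Score(t) ∝ P(Y|R) / P(Y|X,D), assumes the limited independence described on this slide (including independence of the Y values when conditioned on a single x), and drops factors that are identical for every tuple matching the query:

```latex
\begin{align*}
\mathrm{Score}(t) &\propto \frac{P(Y \mid R)}{P(Y \mid X, D)}
   = \frac{P(Y \mid R)\, P(X \mid D)}{P(Y \mid D)\, P(X \mid Y, D)}
   \propto \frac{P(Y \mid R)}{P(Y \mid D)} \cdot \frac{1}{P(X \mid Y, D)} \\
P(X \mid Y, D) &= \prod_{x \in X} P(x \mid Y, D), \qquad
P(x \mid Y, D) = \frac{P(x \mid D) \prod_{y \in Y} P(y \mid x, D)}{\prod_{y \in Y} P(y \mid D)}
   = P(x \mid D)^{\,1 - |Y|} \prod_{y \in Y} P(x \mid y, D) \\
\mathrm{Score}(t) &\propto \prod_{y \in Y} \frac{P(y \mid R)}{P(y \mid D)}
   \cdot \prod_{y \in Y} \prod_{x \in X} \frac{1}{P(x \mid y, D)}
\end{align*}
```

The factor P(x|D)^{1-|Y|} disappears in the last step because x is fixed by the query and |Y| is the same for every result tuple.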
Workload-Based R Estimation
- To use these techniques, the ideal result set R must be known
- Use statistics gathered from the workload W
- View the workload as a set of "tuples", one per query, each containing the attributes that query specified
- Thus P(y|R) can be replaced with P(y|X,W)
- Properties of R can be obtained by examining the workload for past queries that also requested X
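Treating the workload as a set of tuples means its statistics reduce to counting. The sketch below is illustrative only: the function name `workload_probabilities`, the representation of a query as a set of (attribute, value) pairs, and the sample data are all assumptions, not part of the original slides.

```python
from collections import Counter
from itertools import permutations

def workload_probabilities(workload):
    """Estimate P(v | W) and P(x | y, W) by counting over past queries.

    `workload` is a list of queries; each query is the set of
    (attribute, value) pairs it specified (hypothetical representation).
    """
    n = len(workload)
    single = Counter()  # number of queries that specify value v
    pair = Counter()    # number of queries that specify both x and y
    for query in workload:
        single.update(query)
        pair.update(permutations(query, 2))
    p_single = {v: c / n for v, c in single.items()}
    # P(x | y, W) = count(x and y) / count(y), i.e. the confidence of rule y -> x
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_single, p_cond

# Hypothetical workload of four past queries
workload = [
    {("City", "Seattle"), ("View", "Waterfront")},
    {("City", "Seattle"), ("View", "Waterfront")},
    {("City", "Seattle")},
    {("City", "Dallas"), ("View", "Greenbelt")},
]
p_single, p_cond = workload_probabilities(workload)
```

Here `p_cond[(x, y)]` is keyed with the conditioned value first, so it reads as P(x|y,W).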
Workload-Based R Estimation
- Thus the ranking function is: Score(t) ∝ ∏_{y∈Y} P(y|W)/P(y|D) · ∏_{y∈Y} ∏_{x∈X} P(x|y,W)/P(x|y,D)
- Does not contain R
- Quantities are all 'atomic' and can be computed
- First part is global, second part is conditional
- Can use association rules for P(x|y,W), P(x|y,D), etc.
- These values are stored in the intermediate knowledge representation layer
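A ranking function with a global part and a conditional part built from atomic quantities can be evaluated with simple lookups. The sketch below assumes the product form Score(t) ∝ ∏ P(y|W)/P(y|D) · ∏∏ P(x|y,W)/P(x|y,D) and works in log space. The function names, the dictionary layout, and the `eps` floor for missing or zero probabilities are assumptions of this example.

```python
import math

def _log_prob(table, key, eps):
    # Floor at eps so that missing or zero entries do not cause log(0).
    return math.log(max(table.get(key, 0.0), eps))

def score(x_values, y_values, p_w, p_d, p_cond_w, p_cond_d, eps=1e-6):
    """Return (global_part, conditional_part, total), all as log-scores.

    p_w[y], p_d[y]                     : P(y|W), P(y|D)
    p_cond_w[(x, y)], p_cond_d[(x, y)] : P(x|y,W), P(x|y,D)
    """
    global_part = sum(
        _log_prob(p_w, y, eps) - _log_prob(p_d, y, eps) for y in y_values
    )
    conditional_part = sum(
        _log_prob(p_cond_w, (x, y), eps) - _log_prob(p_cond_d, (x, y), eps)
        for y in y_values
        for x in x_values
    )
    return global_part, conditional_part, global_part + conditional_part

# Hypothetical atomic quantities for one specified and one unspecified value
g, c, total = score(
    x_values=["Seattle"],
    y_values=["Waterfront"],
    p_w={"Waterfront": 0.5},
    p_d={"Waterfront": 0.25},
    p_cond_w={("Seattle", "Waterfront"): 0.8},
    p_cond_d={("Seattle", "Waterfront"): 0.4},
)
```

Summing logarithms instead of multiplying ratios preserves the ranking order and avoids underflow when Y contains many values.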
Implementation
- Atomic Probabilities Module – stores atomic quantities in the intermediate knowledge representation layer
- Index Module – uses inputs and association rules to create global and conditional scores
- Scan Algorithm – selects tuples that satisfy the query condition, then ranks them by score
- List Merge Algorithm – alternative to scanning
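The Scan Algorithm as described (select the matching tuples, then rank them) can be sketched in a few lines. This is a minimal illustration, not the presented system: `scan_top_k`, the dict-based rows, and the toy scoring function are all hypothetical, and a real implementation would plug in scores derived from the stored atomic probabilities.

```python
import heapq

def scan_top_k(table, condition, score_fn, k):
    """Scan `table`, keep rows matching `condition`, return the k best by score.

    table     : iterable of dict rows
    condition : dict of attribute -> required value (the specified attributes)
    score_fn  : callable mapping a row to a numeric score
    """
    matching = (
        row for row in table
        if all(row.get(attr) == val for attr, val in condition.items())
    )
    return heapq.nlargest(k, matching, key=score_fn)

# Hypothetical table and a toy score that favours waterfront views
rows = [
    {"id": 1, "City": "Seattle", "View": "Waterfront"},
    {"id": 2, "City": "Seattle", "View": "Street"},
    {"id": 3, "City": "Dallas", "View": "Waterfront"},
]
toy_score = lambda row: 1.0 if row["View"] == "Waterfront" else 0.0
top = scan_top_k(rows, {"City": "Seattle"}, toy_score, k=1)
```

Because every matching tuple is scored, the cost grows with the number of matches, which is the motivation for an alternative such as list merging.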
Conclusion
- Gives a ranking for the Many-Answers problem by factoring in unspecified attributes
- Automated
- Makes use of workload statistics and correlations
- Can still be adjusted by users and/or domain experts
- Can use user feedback as well