1
Probabilistic Ranking of Database Query Results
Gautam Das, Surajit Chaudhuri, Vagelis Hristidis, Gerhard Weikum. Presented by: Z.M. Joseph, Spring 2006, CSE, UT Arlington
2
Introduction Addresses the Many-Answers problem:
A query that is not very selective has many matching tuples, so some ranking is needed Since the specified attributes match in every result tuple, ranking must look into the non-specified attributes
3
Challenge How do you rank based on non-specified attributes?
Correlation information is difficult to obtain and expensive to manage
4
Approach Builds on Probabilistic Information Retrieval (PIR) Combines:
Global Score Captures the global importance of unspecified attribute values Conditional Score Captures the strength of correlation between unspecified and specified attribute values Preprocessing is done at an intermediate knowledge representation layer
5
Recall from PIR We already know that for a tuple t (with the non-relevant set R̄ approximated by D):
Score(t) ∝ p(t|R) / p(t|D)
t can be broken down as: X: the set of specified attributes Y: the set of unspecified attributes R is the ideal set of result tuples D is the single database table (which approximates R̄, since R is small)
6
Structured Data Since every answer tuple matches the same specified values X, the X factors are constant across results and cancel, so the score simplifies to:
Score(t) ∝ p(Y|X,R) / p(Y|X,D)
This automatically increases the score of tuples whose unspecified attribute values occur more frequently in the ideal tuple set R
7
Limited Independence Assumptions
It is possible to capture dependencies and correlations in structured data, but the efficient approach assumes the X values and the Y values are each independent within themselves (while X and Y may still be correlated with each other) This allows the score to factor as:
Score(t) ∝ ∏y∈Y p(y|X,R) / p(y|X,D)
This assumption may not always be correct!
8
Workload-Based R Estimation
In order to use these techniques, the ideal result set R must be estimated Statistics gathered from the workload are used: view the workload W as a set of "tuples", each recording the attribute values specified by a past query Then p(y|R) can be replaced with p(y|X,W) Properties of R are obtained by examining the workload for queries that specified X in the past
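A minimal sketch of how such workload statistics might be gathered, assuming each past query is recorded as a set of (attribute, value) pairs — this encoding and the function name are illustrative assumptions, not from the paper:

```python
from collections import Counter

def workload_frequencies(workload):
    """Estimate p(v | W) for every specified (attribute, value) pair v
    as the fraction of past workload queries that specified it."""
    counts = Counter(pair for query in workload for pair in query)
    n = len(workload)
    return {pair: c / n for pair, c in counts.items()}

# Three past queries over a hypothetical used-car table.
W = [
    {("make", "honda"), ("color", "red")},
    {("make", "honda")},
    {("color", "blue")},
]
freq = workload_frequencies(W)
# freq[("make", "honda")] == 2/3: honda was specified in 2 of 3 queries
```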
9
Workload-Based R Estimation
Thus the ranking function becomes:
Score(t) ∝ ∏y∈Y [p(y|W) / p(y|D)] × ∏x∈X ∏y∈Y [p(x|y,W) p(x|D)] / [p(x|W) p(x|y,D)]
It does not contain R All quantities are "atomic" (e.g. p(y|W), p(y|D), p(x|y,W), p(x|y,D)) and can be computed The first product is the global part, the second is the conditional part Association rules can be used to compute the conditional quantities These values are stored in the intermediate knowledge representation layer
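A sketch of how a score of this global-times-conditional form could be evaluated in log space, assuming the four atomic probability tables have already been estimated; the table names and the example numbers below are invented for illustration:

```python
import math

def rank_score(X, Y, pW, pD, pW_given, pD_given):
    """Log of the ranking function: a global part over the unspecified
    values Y plus a conditional part over all (x, y) pairs.
      pW[v], pD[v]       -- atomic p(v|W) and p(v|D)
      pW_given[(x, y)]   -- p(x|y,W); pD_given[(x, y)] -- p(x|y,D)
    """
    s = sum(math.log(pW[y] / pD[y]) for y in Y)      # global part
    for x in X:                                      # conditional part
        for y in Y:
            s += math.log((pW_given[(x, y)] * pD[x]) /
                          (pW[x] * pD_given[(x, y)]))
    return s

# Invented atomic probabilities for one specified and one unspecified value.
pW = {"y1": 0.5, "x1": 0.4}
pD = {"y1": 0.25, "x1": 0.4}
pW_given = {("x1", "y1"): 0.8}
pD_given = {("x1", "y1"): 0.4}
s = rank_score(["x1"], ["y1"], pW, pD, pW_given, pD_given)
# s == 2 * math.log(2): each part contributes a factor of 2
```

Working in log space keeps the products numerically stable when many attributes are involved.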
10
Implementation Atomic Probabilities Module – computes and stores the atomic quantities in the intermediate knowledge representation layer Index Module – uses these quantities and association rules to precompute the global and conditional scores Scan Algorithm – selects the tuples that satisfy the query condition, then ranks them using the scores List Merge Algorithm – an alternative to scanning that merges precomputed ranked lists
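At its simplest, the Scan Algorithm can be sketched as a filter followed by a top-k selection. This is a hedged sketch: tuples are modeled as plain dicts, and `score_fn` stands in for the precomputed global/conditional scoring:

```python
import heapq

def scan_rank(table, specified, score_fn, k):
    """Scan every tuple, keep those matching the specified attribute
    values, and return the k highest-scoring matches."""
    matches = (t for t in table
               if all(t.get(a) == v for a, v in specified.items()))
    return heapq.nlargest(k, matches, key=score_fn)

# Hypothetical example: rank matching cars by an invented score field.
cars = [
    {"make": "honda", "color": "red", "score": 2.5},
    {"make": "honda", "color": "blue", "score": 1.0},
    {"make": "ford", "color": "red", "score": 3.0},
]
top = scan_rank(cars, {"make": "honda"}, lambda t: t["score"], k=1)
# top[0]["color"] == "red": the higher-scoring honda wins
```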
11
Conclusion Provides a ranking for the Many-Answers problem by factoring in unspecified attributes The approach is automated, making use of workload statistics and correlations It can still be adjusted by users and/or domain experts, and can incorporate user feedback as well