Probabilistic Ranking of Database Query Results


1 Probabilistic Ranking of Database Query Results
Gautam Das, Surajit Chaudhuri, Vagelis Hristidis, Gerhard Weikum
Presented by: Z.M. Joseph, Spring 2006, CSE, UT Arlington

2 Introduction
Addresses the Many-Answers problem:
  A query that is not very selective has many matching tuples, so the results need some ranking.
  The specified attributes match in every result tuple, so the ranking must look at the non-specified attributes.

3 Challenge
How do you select based on non-specified attributes?
  Correlation information is difficult to get.
  It is also expensive to manage.

4 Approach
Build on Probabilistic Information Retrieval (PIR).
Combine:
  Global Score: contains the global importance of unspecified attributes.
  Conditional Score: captures the strength of correlation between unspecified and specified attributes.
Preprocessing takes place at an Intermediate Knowledge Representation Layer.

5 Recall from PIR
PIR already gives a score for a tuple t, in terms of the following:
  t can be broken down into X, the set of specified attributes, and Y, the list of unspecified attributes.
  R is the ideal set of result tuples.
  D is a single database table, used to approximate ~R (the tuples not in R).
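As a sketch of the PIR result being recalled here, assuming the usual odds-based ranking, Bayes' rule, and the approximation of ~R by D (the name Score and the notation \bar{R} for ~R are choices made for this illustration):

```latex
\mathrm{Score}(t) \;\propto\; \frac{P(R \mid t)}{P(\bar{R} \mid t)}
  \;=\; \frac{P(t \mid R)\,P(R)}{P(t \mid \bar{R})\,P(\bar{R})}
  \;\propto\; \frac{P(t \mid R)}{P(t \mid D)}
```

The priors P(R) and P(\bar{R}) are the same for every tuple, so they do not affect the ordering and can be dropped.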

6 Structured Data
For structured data, the PIR score simplifies.
The simplified score automatically increases for tuples whose unspecified attribute values occur more often in the ideal tuple set R.
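One way the simplification might go, as a sketch under stated assumptions rather than the slide's exact derivation: write t as (X, Y), assume every tuple in R satisfies the specified values X, so that P(X|R) = 1 and P(Y|X,R) = P(Y|R), and note that P(X|D) is the same for every tuple that matches the query:

```latex
\begin{align*}
\mathrm{Score}(t) &\propto \frac{P(t \mid R)}{P(t \mid D)}
  = \frac{P(X, Y \mid R)}{P(X, Y \mid D)}
  = \frac{P(X \mid R)\,P(Y \mid X, R)}{P(X \mid D)\,P(Y \mid X, D)} \\
  &\propto \frac{P(Y \mid R)}{P(Y \mid X, D)}
\end{align*}
```

The numerator P(Y|R) is what rewards unspecified values that are frequent in R.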

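7 Limited Independence Assumptions
It is possible to capture dependencies and correlations from structured data.
Efficient approach: the X values are assumed independent of each other, and the Y values are assumed independent of each other, while dependencies between X and Y values are still allowed.
This allows the score to be derived as a product of per-value terms.
This assumption may not always be correct!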
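A sketch of the kind of factorization this assumption permits. It assumes the starting point Score(t) ∝ P(Y|R) / P(Y|X,D), applies Bayes' rule to P(Y|X,D), and drops factors that depend only on the query values X, since they are equal for all matching tuples:

```latex
\mathrm{Score}(t) \;\propto\;
  \prod_{y \in Y} \frac{P(y \mid R)}{P(y \mid D)}
  \;\prod_{y \in Y} \prod_{x \in X} \frac{1}{P(x \mid y, D)}
```

Every factor now involves at most one specified value x and one unspecified value y, which is what makes the quantities cheap to precompute.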

8 Workload-Based R Estimation
To use these techniques, the ideal result set R must be known.
Use statistics gathered from the workload W instead:
  View the workload as a set of tuples, each containing a query and its specified attributes.
  P(y|R) can then be replaced with P(y|X,W).
  Properties of R can be obtained by examining the workload for queries that requested X in the past.
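A minimal sketch of how such workload statistics might be gathered, assuming each past query is a dict of attribute-value conditions and that probabilities are plain relative frequencies. The function name, the data layout, and the example workload are hypothetical and not taken from the presentation:

```python
from collections import Counter
from itertools import permutations


def workload_probabilities(workload):
    """Estimate P(v | W) and P(x | y, W) by counting (attribute, value) pairs.

    workload: list of dicts, one per past query, mapping attribute -> value.
    Returns (p_single, p_cond), where p_cond[(x, y)] approximates P(x | y, W).
    """
    n = len(workload)
    single = Counter()  # number of queries that specify a given value
    pair = Counter()    # number of queries that specify both x and y
    for query in workload:
        values = set(query.items())
        single.update(values)
        pair.update(permutations(values, 2))  # all ordered pairs (x, y)
    p_single = {v: c / n for v, c in single.items()}
    p_cond = {(x, y): c / single[y] for (x, y), c in pair.items()}
    return p_single, p_cond


# Hypothetical workload of four past queries.
workload = [
    {"city": "Seattle", "view": "waterfront"},
    {"city": "Seattle", "view": "waterfront"},
    {"city": "Seattle", "view": "greenbelt"},
    {"city": "Dallas", "view": "greenbelt"},
]
p_single, p_cond = workload_probabilities(workload)
```

Here `p_cond[(x, y)]` is the fraction of queries specifying y that also specify x, which plays the role of the confidence of an association rule from y to x. A real system would also need smoothing for pairs that never occur in the workload.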

9 Workload-Based R Estimation
The resulting ranking function:
  Does not contain R.
  Contains only 'atomic' quantities, all of which can be computed.
  Has two parts: the first is global, the second is conditional.
Association rules can be used to compute the conditional quantities.
These values are stored in the intermediate knowledge representation layer.
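A sketch of a ranking function with these properties. It assumes the limited-independence assumptions and the replacement of P(y|R) by P(y|X,W), with P(y|X,W) expanded by Bayes' rule into P(y|W) times a product of P(x|y,W) terms, dropping factors that depend only on X:

```latex
\mathrm{Score}(t) \;\propto\;
  \underbrace{\prod_{y \in Y} \frac{P(y \mid W)}{P(y \mid D)}}_{\text{global part}}
  \;\underbrace{\prod_{y \in Y} \prod_{x \in X} \frac{P(x \mid y, W)}{P(x \mid y, D)}}_{\text{conditional part}}
```

The four atomic quantities are P(y|W), P(y|D), P(x|y,W), and P(x|y,D). None of them refers to R, and each can be estimated by counting over the workload or over the table.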

10 Implementation
Atomic Probabilities Module: stores the atomic quantities in the intermediate knowledge representation layer.
Index Module: uses these inputs and association rules to create the global and conditional scores.
Scan Algorithm: selects the tuples that satisfy the query condition, then ranks them based on the scores.
List Merge Algorithm: an alternative to scanning.
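The scan approach can be sketched as follows. This is an illustration only, assuming a score that is the product of a global part P(y|W)/P(y|D) and a conditional part P(x|y,W)/P(x|y,D) over unspecified values y and specified values x. The function names, the `atoms` dictionary layout, and all numbers are hypothetical:

```python
import heapq
from math import prod


def score(row, query, atoms):
    """Global part times conditional part for one tuple."""
    xs = list(query.items())                                    # specified values
    ys = [(a, v) for a, v in row.items() if a not in query]     # unspecified values
    global_part = prod(atoms["p_w"][y] / atoms["p_d"][y] for y in ys)
    conditional_part = prod(
        atoms["p_xy_w"][(x, y)] / atoms["p_xy_d"][(x, y)]
        for y in ys
        for x in xs
    )
    return global_part * conditional_part


def scan_top_k(table, query, atoms, k):
    """Scan the table, keep tuples that satisfy the query, return the k best."""
    matches = [
        row for row in table
        if all(row.get(a) == v for a, v in query.items())
    ]
    return heapq.nlargest(k, matches, key=lambda row: score(row, query, atoms))


# Hypothetical table and precomputed atomic probabilities.
table = [
    {"city": "Seattle", "view": "waterfront"},
    {"city": "Seattle", "view": "greenbelt"},
    {"city": "Dallas", "view": "greenbelt"},
]
seattle = ("city", "Seattle")
water = ("view", "waterfront")
green = ("view", "greenbelt")
atoms = {
    "p_w": {water: 0.5, green: 0.5},
    "p_d": {water: 0.25, green: 0.75},
    "p_xy_w": {(seattle, water): 1.0, (seattle, green): 0.5},
    "p_xy_d": {(seattle, water): 0.5, (seattle, green): 0.5},
}
top = scan_top_k(table, {"city": "Seattle"}, atoms, k=1)
```

With these made-up numbers the waterfront home ranks first for the query city = Seattle, because waterfront views are requested more often in the workload than they occur in the table. The scan scores every matching tuple, which is the cost a list-merge style alternative would aim to avoid.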

11 Conclusion
Gives a ranking for the Many-Answers problem by factoring in unspecified attributes.
Automated: makes use of workload statistics and correlations.
Can still be adjusted by users and/or domain experts.
Can use user feedback as well.

