Probabilistic Ranking of Database Query Results

Gautam Das, Surajit Chaudhuri, Vagelis Hristidis, Gerhard Weikum
Presented by: Z.M. Joseph
Spring 2006, CSE, UT Arlington

Introduction
Addresses the Many-Answers problem:
  The query is not very selective, so it has many matching tuples
  Thus the results need some ranking
  All tuples match on the specified attributes, so the ranking must look into the non-specified attributes

Challenge
How do you rank tuples based on non-specified attributes?
  Correlation information is difficult to obtain
  It is expensive to manage

Approach
Build on Probabilistic Information Retrieval (PIR)
Combine:
  Global Score: contains the global importance of unspecified attributes
  Conditional Score: captures the strength of correlation between unspecified and specified attributes
Preprocessing at an Intermediate Knowledge Representation Layer

Recall from PIR
We already know how PIR scores a tuple t, using the following notation:
  t can be broken down as:
    X: the set of specified attributes
    Y: the set of unspecified attributes
  R is the ideal set of result tuples
  D is a single database table (used to approximate ~R, the complement of R)
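The PIR formula itself is missing from the transcript. A hedged reconstruction from the standard PIR argument (Bayes' rule, then approximating the non-relevant set by D), written with the slide's symbols t, R and D, would be:

```latex
\mathrm{Score}(t) \;\propto\; \frac{P(R \mid t)}{P(\bar{R} \mid t)}
  \;\propto\; \frac{P(t \mid R)}{P(t \mid \bar{R})}
  \;\approx\; \frac{P(t \mid R)}{P(t \mid D)}
```

The priors P(R) and P(\bar{R}) are the same for every tuple, so they drop out of the proportionality.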

Structured Data
For structured data the score simplifies, because every candidate tuple already matches X.
The simplified score automatically increases for unspecified attribute values that occur more often in the ideal tuple set R
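The simplified formula is missing from the transcript. A sketch of how it likely follows, starting from the PIR score P(t|R)/P(t|D) and assuming that every candidate tuple and every tuple in R satisfies X (so P(X|R)/P(X|D) is constant for a query, and conditioning on X inside R changes nothing):

```latex
\mathrm{Score}(t) \;\propto\; \frac{P(t \mid R)}{P(t \mid D)}
  = \frac{P(X, Y \mid R)}{P(X, Y \mid D)}
  = \frac{P(X \mid R)}{P(X \mid D)} \cdot \frac{P(Y \mid X, R)}{P(Y \mid X, D)}
  \;\propto\; \frac{P(Y \mid R)}{P(Y \mid X, D)}
```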

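Limited Independence Assumptions
Possible to capture dependencies and correlations from structured data
Efficient approach: the values within X are assumed independent of each other, and likewise the values within Y; dependencies between X and Y values are still allowed
This allows the score to be derived as a product of per-attribute terms
This assumption may not always be correct!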
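The derived formula is missing from the transcript. As a hedged reconstruction, assuming the structured-data score has the form P(Y|R)/P(Y|X,D), applying Bayes' rule to the denominator and then the limited independence assumption would give:

```latex
\begin{align*}
P(Y \mid X, D) &= \frac{P(X \mid Y, D)\, P(Y \mid D)}{P(X \mid D)} \\
\mathrm{Score}(t) &\propto \frac{P(Y \mid R)}{P(Y \mid D)} \cdot \frac{1}{P(X \mid Y, D)} \\
  &\propto \prod_{y \in Y} \frac{P(y \mid R)}{P(y \mid D)}
     \cdot \prod_{y \in Y} \prod_{x \in X} \frac{1}{P(x \mid y, D)}
\end{align*}
```

P(X|D) is dropped in the second line because it is the same for every tuple that answers the query.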

Workload-Based R Estimation
To use these techniques, the ideal result set R must be known.
  Use statistics gathered from the workload W
  View the workload as a set of tuples, one per query, each containing the attributes that query specified
  Thus P(y|R) can be replaced with P(y|X,W)
  Properties of R can be obtained by examining the workload for past queries that also specified X
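The substitution can be expanded into quantities that are countable from the workload. A sketch, assuming the X values are conditionally independent given y (the limited independence assumption) and noting that P(X|W) is constant for a fixed query:

```latex
P(y \mid R) \;\approx\; P(y \mid X, W)
  = \frac{P(X, y \mid W)}{P(X \mid W)}
  \;\propto\; P(y \mid W) \prod_{x \in X} P(x \mid y, W)
```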

Workload-Based R Estimation
Thus the ranking function has these properties:
  Does not contain R
  Quantities are all ‘atomic’ and can be computed
  First part is global, second part is conditional
  Can use association rules to estimate the conditional quantities, etc.
  These values are stored in the intermediate knowledge representation layer
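The ranking formula is missing from the transcript. Combining the PIR score with the workload substitution suggests the form Score(t) ∝ Π_y [P(y|W)/P(y|D)] · Π_y Π_x [P(x|y,W)/P(x|y,D)], with the first product as the global part and the second as the conditional part. The sketch below computes that score by simple counting. The function names, the toy home-listing data, and the smoothing floor `EPS` are all assumptions made for illustration, not part of the presentation.

```python
from collections import Counter
from itertools import permutations

EPS = 1e-6  # assumed smoothing floor so unseen values do not zero the product


def count_stats(rows):
    """Count single (attr, value) items and ordered co-occurring pairs.

    Each row is a dict attribute -> value; workload rows may be partial,
    listing only the attributes that a past query specified.
    """
    single, pair = Counter(), Counter()
    for row in rows:
        items = list(row.items())
        single.update(items)
        pair.update(permutations(items, 2))
    return single, pair, len(rows)


def prob(stats, v):
    """Estimate P(v) as a relative frequency, floored at EPS."""
    single, _, n = stats
    return max(single[v] / n, EPS)


def cond_prob(stats, x, y):
    """Estimate P(x | y) = count(x, y) / count(y), floored at EPS."""
    single, pair, _ = stats
    if single[y] == 0:
        return EPS
    return max(pair[(x, y)] / single[y], EPS)


def score(t, specified, wl_stats, db_stats):
    """Global part times conditional part for tuple t."""
    xs = [(a, t[a]) for a in specified]
    ys = [(a, v) for a, v in t.items() if a not in specified]
    s = 1.0
    for y in ys:
        s *= prob(wl_stats, y) / prob(db_stats, y)  # global part
        for x in xs:
            # conditional part: correlation between x and y
            s *= cond_prob(wl_stats, x, y) / cond_prob(db_stats, x, y)
    return s


# Hypothetical toy data: a table D and a workload W of past queries.
db = [
    {"city": "Seattle", "view": "water", "type": "condo"},
    {"city": "Seattle", "view": "street", "type": "condo"},
    {"city": "Dallas", "view": "street", "type": "house"},
    {"city": "Dallas", "view": "water", "type": "house"},
]
workload = [
    {"city": "Seattle", "view": "water"},
    {"city": "Seattle", "view": "water"},
    {"city": "Seattle"},
    {"city": "Dallas", "type": "house"},
]
db_stats = count_stats(db)
wl_stats = count_stats(workload)

# Query "city = Seattle": both Seattle tuples match, so rank them on Y.
water_score = score(db[0], ["city"], wl_stats, db_stats)
street_score = score(db[1], ["city"], wl_stats, db_stats)
```

In this toy workload, past Seattle queries often asked for a water view, so the water-view tuple ends up with the higher score even though the current query did not mention the view.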

Implementation
Atomic Probabilities Module: stores atomic quantities in the intermediate knowledge representation layer
Index Module: uses these inputs and association rules to create global and conditional scores
Scan Algorithm: selects tuples that satisfy the condition and then ranks them based on the scores
List Merge Algorithm: an alternative to scanning
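The Scan Algorithm described above can be sketched as a filter followed by a top-k selection. This is a minimal illustration, not the presented implementation: `scan_top_k`, the toy table, and `toy_score` (a stand-in for the product of global and conditional scores) are hypothetical names.

```python
import heapq


def scan_top_k(table, condition, score_fn, k):
    """Scan the table, keep tuples matching the condition, return the k best."""
    matches = (
        t for t in table
        if all(t.get(attr) == val for attr, val in condition.items())
    )
    # nlargest keeps only k candidates in memory and returns them best-first
    return heapq.nlargest(k, matches, key=score_fn)


# Hypothetical data and a placeholder score in place of the real ranking function.
table = [
    {"id": 1, "city": "Seattle", "view": "water"},
    {"id": 2, "city": "Seattle", "view": "street"},
    {"id": 3, "city": "Dallas", "view": "water"},
    {"id": 4, "city": "Seattle", "view": "park"},
]
view_weight = {"water": 3.0, "park": 2.0, "street": 1.0}


def toy_score(t):
    return view_weight[t["view"]]


top = scan_top_k(table, {"city": "Seattle"}, toy_score, 2)
top_ids = [t["id"] for t in top]
```

A scan touches every matching tuple, which is why an alternative such as the List Merge Algorithm becomes attractive when the query matches many tuples.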

Conclusion
Gives a ranking for the Many-Answers problem by factoring in unspecified attributes
Automated
Makes use of workload statistics and correlations
Can still be adjusted by users and/or domain experts
Can use user feedback as well
