Probabilistic Ranking of Database Query Results
Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik
Presented by:
Raghunath Ravi
Sivaramakrishnan Subramani
Roadmap
Motivation
Key Problems
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and Open Problems
Motivation
The many-answers problem
Two alternative solutions:
◦ Query reformulation
◦ Automatic ranking
This work: apply probabilistic models from IR to ranking database tuples
Example - Realtor Database
House attributes: Price, City, Bedrooms, Bathrooms, SchoolDistrict, Waterfront, BoatDock, Year
Query: City = 'Seattle' AND Waterfront = TRUE
Too many results!
Intuitively, houses with a lower Price, more Bedrooms, or a BoatDock are generally preferable
Rank According to Unspecified Attributes
The score of a result tuple t depends on:
Global score: global importance of unspecified attribute values [CIDR 2003]
◦ E.g., newer houses are generally preferred
Conditional score: correlations between specified and unspecified attribute values
◦ E.g., Waterfront → BoatDock; many Bedrooms → good SchoolDistrict
Key Problems
Given a query Q:
How to combine the global and conditional scores into a ranking function?
◦ Use Probabilistic Information Retrieval (PIR)
How to calculate the global and conditional scores?
◦ Use the query workload and the data
System Architecture
[architecture diagram]
PIR Review
Bayes' Rule, Product Rule
Document (tuple) t, query Q
R: relevant documents
R̄ = D − R: irrelevant documents
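For reference, the two identities named here written out (standard probability; A, B, C are arbitrary events), together with the PIR relevance odds they feed into:

```latex
p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}
\qquad
p(A, B \mid C) = p(A \mid C)\,p(B \mid A, C)
\qquad
\mathrm{Score}(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)}
```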
Adaptation of PIR to DB
A tuple t is treated as a document
Partition t into t(X) (specified attributes) and t(Y) (unspecified attributes)
For brevity, t(X) and t(Y) are written as X and Y
Derive from an initial scoring function until the final ranking function is obtained
Preliminary Derivation
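A sketch of the step this slide covers, reconstructed from the definitions above: apply Bayes' rule to the relevance odds, drop the tuple-independent factor p(R)/p(R̄), and split t into its specified part X and unspecified part Y (the exact presentation in the paper may differ):

```latex
\mathrm{Score}(t)
  \;=\; \frac{p(R \mid t)}{p(\bar{R} \mid t)}
  \;=\; \frac{p(t \mid R)\,p(R)}{p(t \mid \bar{R})\,p(\bar{R})}
  \;\propto\; \frac{p(t \mid R)}{p(t \mid \bar{R})}
  \;=\; \frac{p(X, Y \mid R)}{p(X, Y \mid \bar{R})}
```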
Limited Independence Assumptions
Given a query Q and a tuple t, the X values (and, separately, the Y values) are assumed to be independent among themselves, though dependencies between the X and Y values are allowed
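Written out, with C standing for either R or R̄ (a reconstruction consistent with the quantities the paper pre-computes): the product rule splits the joint, and limited independence factorizes within X and within Y while keeping the cross X–Y terms:

```latex
p(X, Y \mid C) = p(X \mid Y, C)\,p(Y \mid C), \qquad
p(Y \mid C) = \prod_{y \in Y} p(y \mid C), \qquad
p(X \mid Y, C) = \prod_{x \in X} p(x \mid Y, C)
```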
Continuing Derivation
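The endpoint of the derivation, assuming, as the paper does, that the relevant set R is approximated by properties of the workload W and the irrelevant set R̄ by the whole database D, that each p(x | Y, ·) is approximated through the pairwise quantities p(x | y, ·), and that factors depending only on the specified values X are constant across the answer set and can be dropped:

```latex
\mathrm{Score}(t) \;\propto\;
  \underbrace{\prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}}_{\text{global part}}
  \cdot
  \underbrace{\prod_{x \in X} \prod_{y \in Y} \frac{p(x \mid y, W)}{p(x \mid y, D)}}_{\text{conditional part}}
```

The first product is the global part of the score and the second the conditional part; these ratios are built from exactly the four atomic probabilities pre-computed on the next slide.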
Pre-computing Atomic Probabilities in the Ranking Function
From the workload W: P(y | W) = relative frequency of y in W; P(x | y, W) = (# of tuples in W that contain both x and y) / (# of tuples in W that contain y)
From the data D: P(y | D) = relative frequency of y in D; P(x | y, D) = (# of tuples in D that contain both x and y) / (# of tuples in D that contain y)
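A minimal sketch of how these relative frequencies might be computed, assuming W and D are available as lists of tuples, each represented as a set of (attribute, value) pairs; all names here are illustrative, not the paper's API:

```python
from collections import Counter

def relative_frequencies(collection):
    """Relative frequencies of single values and of value pairs over a
    collection of tuples, each given as a set of (attribute, value) pairs."""
    n = len(collection)
    singles, pairs = Counter(), Counter()
    for t in collection:
        vals = sorted(t)
        singles.update(vals)          # one count per value per tuple
        for i, x in enumerate(vals):
            for y in vals[i + 1:]:
                pairs[(x, y)] += 1    # unordered pair, stored sorted
    p_single = {v: c / n for v, c in singles.items()}
    p_pair = {xy: c / n for xy, c in pairs.items()}
    return p_single, p_pair

# Run once on the workload W and once on the database D, e.g.:
# p_W, pp_W = relative_frequencies(W)   # gives P(y | W) and P(x, y | W)
# P(x | y, W) then follows as pp_W[pair] / p_W[y].
```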
Architecture of Ranking Systems
[diagram]
Scan Algorithm
Preprocessing - Atomic Probabilities Module
◦ Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y
Execution
◦ Select the tuples that satisfy the query
◦ Scan and compute the score for each result tuple
◦ Return the top-k tuples
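A minimal sketch of the execution phase (the paper's implementation is C# over SQL Server; this Python version, its names, and the EPS smoothing are illustrative assumptions, not the paper's code). It assumes conditional probabilities are precombined into dictionaries pc_W, pc_D keyed by ordered (x, y) pairs, e.g. derived from the pair frequencies above:

```python
import heapq
from math import log

EPS = 1e-6  # crude stand-in for proper smoothing of unseen values

def scan_rank(result_tuples, x_vals, p_W, p_D, pc_W, pc_D, k):
    """Score every tuple satisfying the query and return the top-k.
    Each tuple is a set of attribute-value pairs; the values not in
    x_vals form its unspecified part Y."""
    scored = []
    for tid, t in enumerate(result_tuples):
        score = 0.0  # log space avoids floating-point underflow
        for y in (v for v in t if v not in x_vals):
            score += log(p_W.get(y, EPS)) - log(p_D.get(y, EPS))  # global part
            for x in x_vals:                                      # conditional part
                score += log(pc_W.get((x, y), EPS)) - log(pc_D.get((x, y), EPS))
        scored.append((score, tid))
    return heapq.nlargest(k, scored)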
Beyond the Scan Algorithm
The Scan algorithm is inefficient when the answer set contains many tuples
The other extreme is to pre-compute the top-k tuples for all possible queries, which is still infeasible in practice
Trade-off solution:
◦ Pre-compute ranked lists of tuples for all possible atomic queries
◦ At query time, merge the ranked lists to get the top-k tuples
Output from the Index Module
CondList Cx: {AttName, AttVal, TID, CondScore}
◦ B+-tree index on (AttName, AttVal, CondScore)
GlobList Gx: {AttName, AttVal, TID, GlobScore}
◦ B+-tree index on (AttName, AttVal, GlobScore)
Index Module
[diagram]
Preprocessing Component
Preprocessing
◦ For each distinct value x in the database, calculate and store the conditional (Cx) and global (Gx) lists as follows: for each tuple t containing x, calculate t's conditional and global scores and add them to Cx and Gx respectively
◦ Sort Cx and Gx by decreasing score
Execution
◦ Query Q: X1 = x1 AND … AND Xs = xs
◦ Execute the Threshold Algorithm [Fag01] on the lists Cx1, …, Cxs, and Gxb, where Gxb is the shortest list among Gx1, …, Gxs
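A sketch of the preprocessing under the same illustrative layout as the earlier snippets: the conditional score of tuple t for value x multiplies the pairwise ratios of x against t's other values, and the global score multiplies the per-value ratios (values shared by every answer shift all scores by the same constant, so ranking is unaffected):

```python
EPS = 1e-6  # same crude smoothing as in the Scan sketch

def build_lists(database, p_W, p_D, pc_W, pc_D):
    """Build, for every distinct value x, the conditional list C[x] and the
    global list G[x] as (score, tid) pairs sorted by decreasing score."""
    C, G = {}, {}
    for tid, t in enumerate(database):
        glob = 1.0
        for y in t:
            glob *= p_W.get(y, EPS) / p_D.get(y, EPS)
        for x in t:
            cond = 1.0
            for y in t:
                if y != x:
                    cond *= pc_W.get((x, y), EPS) / pc_D.get((x, y), EPS)
            C.setdefault(x, []).append((cond, tid))
            G.setdefault(x, []).append((glob, tid))
    for index in (C, G):
        for entries in index.values():
            entries.sort(reverse=True)  # decreasing score
    return C, G
```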
List Merge Algorithm
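The merge itself is an application of the Threshold Algorithm [Fag01]. Below is a compact TA-style sketch under the same assumptions, with product aggregation of one conditional score per specified value plus one global score; a real implementation would use sorted and random accesses against the B+-tree indexes rather than in-memory dictionaries:

```python
import heapq

def list_merge(x_vals, C, G, k):
    """Threshold Algorithm over C[x] for each query value x, plus the
    shortest global list among the G[x]."""
    g_best = min(x_vals, key=lambda x: len(G.get(x, [])))
    lists = [C[x] for x in x_vals] + [G[g_best]]
    if any(not lst for lst in lists):
        return []  # some conjunct matches no tuple at all
    random_access = [{tid: s for s, tid in lst} for lst in lists]
    top, seen = [], set()  # min-heap of (score, tid); ids already scored
    for depth in range(max(len(lst) for lst in lists)):
        frontier = []  # last score seen in each list at this depth
        for lst, ra in zip(lists, random_access):
            if depth >= len(lst):
                frontier.append(lst[-1][0])
                continue
            s, tid = lst[depth]
            frontier.append(s)
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in m for m in random_access):  # satisfies every conjunct
                total = 1.0
                for m in random_access:
                    total *= m[tid]
                heapq.heappush(top, (total, tid))
                if len(top) > k:
                    heapq.heappop(top)
        threshold = 1.0
        for s in frontier:
            threshold *= s
        if len(top) == k and top[0][0] >= threshold:
            break  # no unseen tuple can beat the current top-k
    return sorted(top, reverse=True)
```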
Experimental Setup
Datasets:
◦ MSR HomeAdvisor Seattle
◦ Internet Movie Database
Software and hardware:
◦ Microsoft SQL Server 2000 RDBMS
◦ P4 2.8-GHz PC, 1 GB RAM
◦ C#, connected to the RDBMS through DAO
Quality Experiments
Conducted on the Seattle Homes and Movies tables
Collect a workload from users
Compare the conditional ranking method of this paper with the global method [CIDR 2003]
Quality Experiment - Average Precision
For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples
Let each user mark the 10 tuples in Hi most relevant to Qi
Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
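The matching measure described here reduces to a set overlap; a one-line sketch with hypothetical names:

```python
def precision_at_10(user_marked, algo_top10):
    """Fraction of the user's 10 relevant tuples also in the algorithm's top 10."""
    return len(set(user_marked) & set(algo_top10)) / len(user_marked)
```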
Quality Experiment - Fraction of Users Preferring Each Algorithm
5 new queries
Users were given the top-5 results
Performance Experiments
Datasets: Seattle Homes and Movies tables
Compare two algorithms:
◦ Scan algorithm
◦ List Merge algorithm
Performance Experiments - Pre-computation Time
[chart]
Performance Experiments - Execution Time
[chart]
Conclusions - Future Work
Conclusions
◦ A completely automated approach to the many-answers problem that leverages data and workload statistics and correlations
◦ Based on PIR
Drawbacks
◦ Multi-table queries
◦ Non-categorical attributes
Future work
◦ The empty-answer problem
◦ Handling plain-text attributes
Questions?