Answering Why-Not Questions on Top-K Queries. Andy He and Eric Lo, The Hong Kong Polytechnic University.
Background: the database community has focused on performance issues for decades. Recently, attention has turned to usability issues, such as supporting keyword search, query auto-completion, and explaining query results (a.k.a. Why and Why-Not questions). 2/33
Why-Not Questions: you pose a query Q and the database returns a result R, but R gives you a "surprise". E.g., a tuple m that you were expecting in the result is missing, so you ask "WHY??!". You pose a why-not question (Q, R, m), and the database returns an explanation E. 3/33
The (short) history of Why-Not: Chapman and Jagadish, "Why Not?" [SIGMOD 09]: Select-Project-Join (SPJ) queries; explanation E = "tell you which operator excludes the expected tuple". Huang, Chen, A.H. Doan, and J. Naughton, "On the Provenance of Non-Answers to Queries Over Extracted Data" [PVLDB 09]: SPJ queries; explanation E = "tell you how to modify the data". 4/33
The (short) history of Why-Not: Herschel and Hernández, "Explaining Missing Answers to SPJUA Queries" [PVLDB 10]: SPJUA queries; explanation E = "tell you how to modify the data". Tran and C.Y. Chan, "How to ConQueR Why-Not Questions" [SIGMOD 10]: SPJA queries; explanation E = "tell you how to modify your query". 5/33
About this work: Why-Not questions on Top-k queries. Example: a Top-3 hotel query with weighting w_origin returns Rank 1: Sheraton; Rank 2: Westin; Rank 3: InterContinental. "WHY is my favorite Renaissance NOT in the Top-3 result?" Is my value of k too small? Should I revise my weighting? Or do I need to modify both k and the weighting? Explanation E = "tell you how to refine your Top-K query in order to get your favorites back into the result". 6/33
One possible answer: modify k only. Original query Q(k_origin=3, w_origin). The ranking of Renaissance under the original weighting w_origin: Rank 1: Sheraton; Rank 2: Westin; Rank 3: InterContinental; Rank 4: Hilton; Rank 5: Renaissance. Refined query #1: Q1(k=5, w=w_origin). 7/33
Another possible answer: modify the weighting only. Original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin). If we set a new weighting w': Rank 1: Hotel E; Rank 2: Hotel F; Rank 3: Renaissance. Refined query #2: Q2(k=3, w=w'). 8/33
Yet another possible answer: modify both. Original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin); refined query #2: Q2(k=3, w=w'). If we set yet another weighting w'': Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 10000: Renaissance. Refined query #3: Q3(k=10000, w=w''). 9/33
Our objective: find the refined query that minimizes a penalty function while including the missing tuple m in its Top-K result. Penalty settings: Prefer Modify K (PMK); Prefer Modify Weighting (PMW); Never Mind (NM, the default). 10/33
Basic idea: for each weighting w_i ∈ W, run PROGRESS(w_i, UNTIL-SEE-m) to obtain the ranking r_i of m under w_i and form a refined query Q_i(k=r_i, w=w_i); return the refined query with the least penalty. Problem: W is infinite!!! 11/33
Our approach: sampling. For each weighting w_i ∈ W, run PROGRESS(w_i, UNTIL-SEE-m) to obtain the ranking r_i of m under w_i and form a refined query Q_i(k=r_i, w=w_i); return the refined query with the least penalty. Here W is a set of weightings drawn from a restricted weighting space. Key theorem: the optimal refined query Q_best is either Q_1 (the refined query that keeps w_origin and only enlarges k) or else has a weighting w_best inside the restricted weighting space. A sketch of this loop follows below. 12/33
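A minimal sketch of this sampling loop in Python, assuming linear scoring (score = w · t) and a caller-supplied penalty function. PROGRESS is replaced here by a naive full sort, and every name and signature is illustrative rather than the paper's actual API:

```python
def progress_until_see(data, w, m):
    """Naive stand-in for PROGRESS(w, UNTIL-SEE-m): rank every tuple by its
    weighted score and report the rank of the missing tuple m (m must be in data)."""
    ranked = sorted(data, key=lambda t: -sum(wi * ti for wi, ti in zip(w, t)))
    return ranked.index(m) + 1


def answer_why_not(data, w_origin, k_origin, m, sampled_weightings, penalty):
    """Sampling-based refinement loop (illustrative sketch).

    sampled_weightings: weightings drawn from the restricted weighting space
    penalty           : callable penalty(delta_k, delta_w) -> float
    """
    best_query, best_penalty = None, float("inf")
    for w in sampled_weightings:
        r = progress_until_see(data, w, m)              # ranking r_i of m under w_i
        delta_k = max(0, r - k_origin)                  # only count increases of k
        delta_w = sum((a - b) ** 2
                      for a, b in zip(w, w_origin)) ** 0.5   # ||w - w_origin||_2
        p = penalty(delta_k, delta_w)
        if p < best_penalty:                            # keep the least-penalty query
            best_query, best_penalty = {"k": r, "w": w}, p
    return best_query, best_penalty
```

Including w_origin itself among the sampled weightings also covers the modify-k-only candidate Q_1 from the key theorem.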
How large should the sample size be? We say a refined query is a best-T% refined query if its penalty is smaller than that of the other (1 - T%) of refined queries, and we hope to obtain such a query with probability larger than a threshold Pr. 13/33
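To make the sample size concrete, here is one standard derivation under the assumption that samples are drawn independently and uniformly (an illustration, not necessarily the paper's exact bound), with T and Pr written as fractions:

```latex
% Probability that at least one of s independent samples is a best-T% refined query:
\[
  1 - (1 - T)^{s} \;\ge\; \mathrm{Pr}
  \quad\Longrightarrow\quad
  s \;\ge\; \left\lceil \frac{\ln(1 - \mathrm{Pr})}{\ln(1 - T)} \right\rceil
\]
% Example: T = 0.01 (best 1%), Pr = 0.95  =>  s >= ceil(ln 0.05 / ln 0.99) = 299
```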
The PROGRESS operation can be expensive. Original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin). If we set the weighting w'': Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 10000: Renaissance. Refined query: Q(k=10000, w=w''). Very slow!!! 14/33
Two optimization techniques: stop each PROGRESS operation early, and skip some PROGRESS operations. 15/33
Stop earlier. Original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin). If we set another weighting w: Rank 1: Hotel A; Rank 2: Hotel B; Rank 3: Hotel C; ...; Rank 5: Hotel D; ... Renaissance has not shown up by rank 5, so this weighting can no longer beat the best refined query found so far, and PROGRESS can stop early (see the sketch below). 16/33
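A sketch of the early stop, under my reading of this slide: max_rank is the deepest rank at which m could still yield a penalty below the best found so far (how that bound is derived is the paper's detail; here it is simply a parameter, and all names are illustrative):

```python
import heapq

def progress_until_see_bounded(data, w, m, max_rank):
    """PROGRESS(w, UNTIL-SEE-m) with an early stop: results are produced rank
    by rank, and we give up once m cannot show up within max_rank positions."""
    heap = [(-sum(wi * ti for wi, ti in zip(w, t)), i) for i, t in enumerate(data)]
    heapq.heapify(heap)
    for rank in range(1, max_rank + 1):
        if not heap:
            break
        _, i = heapq.heappop(heap)   # next-highest-scoring tuple under weighting w
        if data[i] == m:
            return rank
    return None   # m ranks below max_rank: this weighting cannot beat the best so far
```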
Skip PROGRESS operations (a): similar weightings may lead to similar rankings (based on the "Reverse Top-k" paper, ICDE'10). Therefore, the result of PROGRESS(w_x, UNTIL-SEE-m) could be used to deduce the result of PROGRESS(w_y, UNTIL-SEE-m), provided that w_x and w_y are similar. 17/33
Skip PROGRESS operations (a), e.g.: original query Q(k=3, w_origin); refined query #1: Q1(k=5, w=w_origin).
Scores under one weighting w_x: Sheraton 10, Westin 9, InterContinental 8, Hilton 7, Renaissance 6.
Scores under a similar weighting w_y: Sheraton 9, Westin 10, InterContinental 7, Hilton 8, Renaissance 5.
How would the scores look if we set yet another nearby weighting w? 18/33
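One way to see why similar weightings give similar scores (my illustration via the Cauchy-Schwarz inequality; the paper's actual deduction rule builds on the Reverse Top-k machinery):

```latex
% For any tuple p, the linear scores under two weightings w_x and w_y differ by at most
% ||w_x - w_y|| * ||p||, so when the weightings are close the rankings can only change
% among tuples whose scores are already close.
\[
  \bigl|\, w_x \cdot p - w_y \cdot p \,\bigr|
  \;=\; \bigl|\, (w_x - w_y) \cdot p \,\bigr|
  \;\le\; \lVert w_x - w_y \rVert_2 \, \lVert p \rVert_2
\]
```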
Skip PROGRESS operations (b): we can skip a weighting w outright if its change ∆w from the original weighting w_origin is already too large. E.g., if the best refined query found so far has penalty 0.5 and a candidate weighting w has ∆w = 1, its penalty cannot be lower, so we can skip it entirely (see the sketch below). 19/33
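The check could look like this sketch, assuming the basic penalty model λk·∆k + λw·∆w from the backup slides (the normalized model would use the normalized ∆w instead; names are illustrative):

```python
def can_skip(w, w_origin, lambda_w, best_penalty):
    """Skip weighting w when its weighting-change cost alone already matches or
    exceeds the best penalty found so far (the Δk term is never negative)."""
    delta_w = sum((a - b) ** 2 for a, b in zip(w, w_origin)) ** 0.5
    return lambda_w * delta_w >= best_penalty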
Experiments Case Study on NBA data Experiments on Synthetic Data 20/33
Case study on NBA data: compare with a pure random sampling version, which does not draw samples from the restricted weighting space but from the complete weighting space. 21/33
Find the top-3 centers in NBA history. 5 attributes (each weight = 1/5): POINTS, REBOUND, BLOCKING, FIELD GOAL, FREE THROW. Initial result: Rank 1: Chamberlain; Rank 2: Abdul-Jabbar; Rank 3: O'Neal. 22/33
Find the top-3 centers in NBA history: "Why not?!" Comparing the two sampling strategies:
Sampling on the restricted weighting space: refined query = Top-3, ∆k = 0.
Sampling on the whole weighting space: refined query = Top-7, ∆k = 4.
(Time in ms and the penalty of each refined query are also reported on the slide.)
We choose "Prefer Modify Weighting". 23/33
Synthetic Data Uniform, Anti-correlated, Correlated Scalability 24/33
Varying query dimensions 25/33
Varying k_origin 26/33
Varying the ranking of the missing object 27/33
Varying the number of missing objects 28/33
Varying T% (plots: time and quality) 29/33
Varying Pr 30/33
Optimization effectiveness 31/33
Conclusions: We are the first to answer why-not questions on top-k queries. We prove that finding the optimal answer is computationally expensive, so a sampling-based method is proposed; the optimal answer is proved to lie in a restricted weighting space. Two optimization techniques are proposed: stop each PROGRESS operation early, and skip some PROGRESS operations. 32/33
Thanks Q&A
Dealing with multiple missing objects M: we have to modify the algorithm a little bit. First do a simple filtering on the set of missing objects: if m_i dominates m_j in the data space, remove m_i from M, because every time m_j shows up in a top-k result, m_i must be there too. The condition UNTIL-SEE-m then becomes UNTIL-SEE-ALL-OBJECTS-IN-M. 34/33
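A minimal sketch of that filtering step, assuming larger attribute values are better, so under any non-negative weighting a dominating object always scores at least as high as the object it dominates:

```python
def filter_dominated(missing):
    """Remove m_i from M when m_i dominates some other missing object m_j:
    whenever m_j appears in a top-k result, m_i is already there as well."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [mi for mi in missing
            if not any(dominates(mi, mj) for mj in missing if mj is not mi)]
```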
Penalty model. Original query Q(3, w_origin); refined query Q1(5, w_origin). Penalty of changing k: ∆k = 5 - 3 = 2. Penalty of changing w: ∆w = ||w_origin - w_origin||_2 = 0. Basic penalty model: Penalty(5, w_origin) = λ_k·∆k + λ_w·∆w, where λ_k + λ_w = 1. 35/33
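The same basic model written out as code (the λ values used in the example comment are illustrative, not from the slides):

```python
def basic_penalty(k_new, w_new, k_origin, w_origin, lambda_k, lambda_w):
    """Basic penalty model: Penalty = λk·Δk + λw·Δw, with λk + λw = 1."""
    delta_k = k_new - k_origin                                       # e.g. 5 - 3 = 2
    delta_w = sum((a - b) ** 2 for a, b in zip(w_new, w_origin)) ** 0.5
    return lambda_k * delta_k + lambda_w * delta_w

# Slide's example, with illustrative lambda_k = lambda_w = 0.5:
# basic_penalty(5, w_origin, 3, w_origin, 0.5, 0.5) == 0.5 * 2 + 0.5 * 0 == 1.0
```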
Normalized penalty function 36/33