Published by Emil Allison. Modified over 9 years ago.
1
Answering Why-not Questions on Top-K Queries
Andy He and Eric Lo
The Hong Kong Polytechnic University
2
Background
- The database community has focused on performance issues for decades.
- Recently, more people have turned their attention to usability issues:
  - Supporting keyword search
  - Query auto-completion
  - Explaining query results (a.k.a. Why and Why-Not questions)
3
Why-Not Questions
- You pose a query Q.
- The database returns a result R.
- R surprises you. E.g., a tuple m that you expected in the result is missing, and you ask "WHY??!"
- You pose a why-not question (Q, R, m).
- The database returns an explanation E.
4
The (short) history of Why-Not
- Chapman and Jagadish, "Why Not?" [SIGMOD 09]
  - Select-Project-Join (SPJ) queries
  - Explanation E = "tell you which operator excludes the expected tuple"
- Huang, Chen, A.H. Doan, and J. Naughton, "On the Provenance of Non-Answers to Queries Over Extracted Data" [PVLDB 09]
  - SPJ queries
  - Explanation E = "tell you how to modify the data"
5
The (short) history of Why-Not
- Herschel and Hernández, "Explaining Missing Answers to SPJUA Queries" [PVLDB 10]
  - SPJUA queries
  - Explanation E = "tell you how to modify the data"
- Tran and C.Y. Chan, "How to ConQueR Why-Not Questions" [SIGMOD 10]
  - SPJA queries
  - Explanation E = "tell you how to modify your query"
6
About this work
- Why-not questions on top-k queries.
- Example: a Top-3 query over a Hotel table, under the original weighting w_origin.
- Result:
  - Rank 1: Sheraton
  - Rank 2: Westin
  - Rank 3: InterContinental
- "WHY is my favorite, Renaissance, NOT in the Top-3 result?"
  - Is my value of k too small?
  - Or should I revise my weighting?
  - Or do I need to modify both k and the weighting?
- Explanation E = "tell you how to refine your top-k query in order to get your favorites back into the result"
7
One possible answer: only modify k
- Original query: Q(k_original = 3, w_original)
- The ranking of Renaissance under the original weighting w_original:
  - Rank 1: Sheraton
  - Rank 2: Westin
  - Rank 3: InterContinental
  - Rank 4: Hilton
  - Rank 5: Renaissance
- Refined query #1: Q_1(k = 5, w = w_original), i.e. k is raised from 3 to 5
8
Another possible answer: only modify the weighting
- Original query: Q(k = 3, w_original)
- Refined query #1: Q_1(k = 5, w = w_original)
- If we set a different weighting w_2:
  - Rank 1: Hotel E
  - Rank 2: Hotel F
  - Rank 3: Renaissance
- Refined query #2: Q_2(k = 3, w = w_2)
9
Yet another possible answer: modify both
- Original query: Q(k = 3, w_original)
- Refined query #1: Q_1(k = 5, w = w_original)
- Refined query #2: Q_2(k = 3, w = w_2)
- If we set yet another weighting w_3:
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 10000: Renaissance
- Refined query #3: Q_3(k = 10000, w = w_3)
10
Our objective
- Find the refined query that minimizes a penalty function, with the missing tuple m in the top-k result.
- Penalty preference options:

  Option                     Abbreviation
  Prefer Modify K            PMK
  Prefer Modify Weighting    PMW
  Never Mind (Default)       NM
11
Basic idea
- For each weighting w_i ∈ W:
  - Run PROGRESS(w_i, UNTIL-SEE-m)
  - Obtain the ranking r_i of m under the weighting w_i
  - Form a refined query Q_i(k = r_i, w = w_i)
- Return the refined query with the least penalty
- Problem: W is infinite!!!
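The loop on this slide can be sketched in Python for a finite candidate set. This is only an illustration under stated assumptions: the hotel attribute values, the linear scoring function, the helper names (`progress_until_see`, `best_refined_query`), the penalty weights of 0.5 each, and the clamping of ∆k at zero are all hypothetical, and a full sort stands in for a progressive top-k operator.

```python
import math

# Hypothetical two-attribute hotel data (higher is better); not from the slides.
HOTELS = {
    "Sheraton": (9, 8),
    "Westin": (8, 8),
    "InterContinental": (8, 7),
    "Hilton": (7, 7),
    "Renaissance": (4, 9),
}

def score(w, attrs):
    """Linear score: weighted sum of attribute values (assumed scoring function)."""
    return sum(wi * ai for wi, ai in zip(w, attrs))

def progress_until_see(data, w, m):
    """Stand-in for PROGRESS(w, UNTIL-SEE-m): return the 1-based rank of m under w."""
    ranked = sorted(data, key=lambda name: score(w, data[name]), reverse=True)
    for rank, name in enumerate(ranked, start=1):
        if name == m:
            return rank
    raise KeyError(m)

def penalty(k_o, w_o, k, w, lam_k=0.5, lam_w=0.5):
    """Basic penalty: lam_k * delta_k + lam_w * delta_w (delta_k clamped at 0 by assumption)."""
    delta_k = max(0, k - k_o)
    delta_w = math.dist(w_o, w)  # Euclidean distance between weightings
    return lam_k * delta_k + lam_w * delta_w

def best_refined_query(data, m, k_o, w_o, candidates):
    """Try every candidate weighting and keep the refined query with the least penalty."""
    best = None
    for w in candidates:
        r = progress_until_see(data, w, m)   # rank of m under w
        p = penalty(k_o, w_o, r, w)          # refined query Q_i(k = r_i, w = w_i)
        if best is None or p < best[0]:
            best = (p, r, w)
    return best  # (penalty, k, w)
```

Calling `best_refined_query(HOTELS, "Renaissance", 3, (0.5, 0.5), candidates)` with a short list of candidate weightings returns the cheapest refined query among them; the sketch sidesteps the infinite-W problem only by taking the candidate list as given.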
12
Our approach: sampling
- For each weighting w_i ∈ W:
  - Run PROGRESS(w_i, UNTIL-SEE-m)
  - Obtain the ranking r_i of m under the weighting w_i
  - Form a refined query Q_i(k = r_i, w = w_i)
- Return the refined query with the least penalty
- W is now a set of weightings drawn from a restricted weighting space.
- Key Theorem: the optimal refined query Q_best is either Q_1, or else Q_best has a weighting w_best in a restricted weighting space.
13
How large should the sample size be?
- We say a refined query is the best-T% refined query if its penalty is smaller than that of (100 − T)% of the refined queries.
- We hope to get such a query with a probability larger than a threshold Pr.
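One standard way to turn these two parameters into a sample size, assuming the samples are drawn independently and each has probability T% of being a best-T% refined query: the chance that at least one of s samples qualifies is 1 − (1 − T%)^s, so requiring this to be at least Pr gives s ≥ log(1 − Pr) / log(1 − T%). The sketch below computes that bound; the function name `sample_size` is hypothetical, and the slides do not confirm that this exact formula is the one used.

```python
import math

def sample_size(t, pr):
    """Smallest s with 1 - (1 - t)**s >= pr, for t and pr given as fractions in (0, 1).

    Assumes independent samples, each a best-T% query with probability t.
    """
    if not (0 < t < 1 and 0 < pr < 1):
        raise ValueError("t and pr must be fractions strictly between 0 and 1")
    return math.ceil(math.log(1 - pr) / math.log(1 - t))
```

For example, `sample_size(0.005, 0.7)` gives the number of samples needed to hit the best 0.5% of refined queries with probability at least 0.7 under these assumptions.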
14
The PROGRESS operation can be expensive
- Original query: Q(k = 3, w_original)
- Refined query #1: Q_1(k = 5, w = w_original)
- If we set a different weighting w_2:
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 10000: Renaissance
- Refined query: Q_2(k = 10000, w = w_2)
- Very slow!!!
15
Two optimization techniques
- Stop each PROGRESS operation early
- Skip some PROGRESS operations
16
Stop earlier
- Original query: Q(k = 3, w_origin)
- Refined query #1: Q_1(k = 5, w = w_origin)
- If we set a different weighting w:
  - Rank 1: Hotel A
  - Rank 2: Hotel B
  - Rank 3: Hotel C
  - …
  - Rank 5: Hotel D
  - …
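A possible reading of this slide, sketched under the basic penalty model Penalty = λ_k ∆k + λ_w ∆w: once the rank reached by PROGRESS is so deep that the refined query could no longer beat the best penalty found so far, the scan can stop without ever seeing m. The stopping rule, the function name `progress_with_early_stop`, the linear scoring, and the default λ values are assumptions for illustration, not the authors' confirmed procedure.

```python
import heapq
import math

def progress_with_early_stop(data, w, m, k_o, w_o, best_penalty,
                             lam_k=0.5, lam_w=0.5):
    """Pop tuples in descending score order; return m's rank, or None if stopped early.

    data: dict name -> attribute tuple (higher is better, linear scoring assumed).
    best_penalty: penalty of the best refined query found so far.
    """
    delta_w = math.dist(w_o, w)
    # Max-heap via negated scores; popping one item at a time mimics progressive retrieval.
    heap = [(-sum(wi * ai for wi, ai in zip(w, attrs)), name)
            for name, attrs in data.items()]
    heapq.heapify(heap)
    rank = 0
    while heap:
        _, name = heapq.heappop(heap)
        rank += 1
        # Any refined query with k >= rank already costs at least this much.
        if lam_k * max(0, rank - k_o) + lam_w * delta_w >= best_penalty:
            return None  # cannot improve on the best query so far
        if name == m:
            return rank
    return None
```

With Q_1(k = 5, w = w_origin) as the current best, a weighting that has not produced m by the time the bound is reached is abandoned instead of being scanned down to, say, rank 10000.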
17
Skip PROGRESS operation (a)
- Similar weightings may lead to similar rankings (based on the "Reverse Top-K" paper, ICDE'10).
- Therefore, the query result of PROGRESS(w_x, UNTIL-SEE-m) could be used to deduce the query result of PROGRESS(w_y, UNTIL-SEE-m), provided that w_x and w_y are similar.
18
Skip PROGRESS operation (a)
- E.g., original query Q(k = 3, w_origin); refined query #1: Q_1(k = 5, w = w_origin)
- Scores under one weighting:

  Hotel               Score
  Sheraton            10
  Westin              9
  InterContinental    8
  Hilton              7
  Renaissance         6

- Scores under another weighting:

  Hotel               Score
  Sheraton            9
  Westin              10
  InterContinental    7
  Hilton              8
  Renaissance         5

- What do the scores look like if we set a third, similar weighting w?
19
Skip PROGRESS operation (b)
- We can skip a weighting w if its change ∆w from the original weighting w_origin is too large.
- E.g., suppose we already have a refined query with a penalty of 0.5. For a weighting w whose change ∆w is 1, we can skip it entirely, because its penalty cannot be lower than the one we already have.
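Under the basic penalty model Penalty = λ_k ∆k + λ_w ∆w with ∆k ≥ 0, the ∆w term alone is a lower bound on the penalty, so the check can be made before running PROGRESS at all. A minimal sketch, assuming λ_w = 0.5 (the slides do not state which value the example uses) and a hypothetical function name `can_skip`:

```python
def can_skip(delta_w, best_penalty, lam_w=0.5):
    """True if a weighting with change delta_w cannot beat best_penalty.

    Uses lam_w * delta_w as a lower bound on the penalty (the delta_k term is >= 0).
    """
    return lam_w * delta_w >= best_penalty
```

With these assumed values, `can_skip(1.0, 0.5)` holds for the slide's example, while a weighting with a small ∆w still has to be evaluated.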
20
Experiments
- Case study on NBA data
- Experiments on synthetic data
21
Case study on NBA data
- Compare with a pure random sampling version, which does not draw samples from the restricted weighting space but from the complete weighting space.
22
Find the top-3 centers in NBA history
- 5 attributes (weighting = 1/5 each):
  - POINTS
  - REBOUND
  - BLOCKING
  - FIELD GOAL
  - FREE THROW
- Initial result:
  - Rank 1: Chamberlain
  - Rank 2: Abdul-Jabbar
  - Rank 3: O'Neal
23
Find the top-3 centers in NBA history
- "Why Not?!" We choose "Prefer Modify Weighting".

                   Sampling on the restricted    Sampling on the whole
                   weighting space               weighting space
  Refined query    Top-3                         Top-7
  ∆k               0                             4
  Time (ms)        156                           154
  Penalty          0.069                         0.28
24
Synthetic Data
- Uniform, Anti-correlated, Correlated
- Scalability
25
Varying query dimensions
26
Varying k_o
27
Varying the ranking of the missing object
28
Varying the number of missing objects
29
Varying T% (panels: Time, Quality)
30
Varying Pr
31
Optimization effectiveness
32
Conclusions
- We are the first to answer why-not questions on top-k queries.
- We prove that finding the optimal answer is computationally expensive.
- A sampling-based method is proposed.
- The optimal answer is proved to be in a restricted sample space.
- Two optimization techniques are proposed:
  - Stop each PROGRESS operation early
  - Skip some PROGRESS operations
33
Thanks Q&A
34
Deal with multiple missing objects M
- We have to modify the algorithm a little bit:
  - Do a simple filtering on the set of missing objects: if m_i dominates m_j in the data space, remove m_i from M, because every time m_j shows up in a top-k result, m_i must be there too.
  - The condition UNTIL-SEE-m becomes UNTIL-SEE-ALL-OBJECTS-IN-M.
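The filtering step can be sketched as follows, assuming that larger attribute values are better and that the scoring function is monotone (for example, a weighted sum with non-negative weights), so that a dominating object always scores at least as high as the one it dominates. The function names `dominates` and `filter_missing` and the sample data are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every attribute and strictly better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def filter_missing(missing):
    """Drop every missing object that dominates another missing object.

    missing: dict name -> attribute tuple. A dominating object is implied by the
    object it dominates, so only the dominated ones need to be tracked.
    """
    return {
        name: attrs
        for name, attrs in missing.items()
        if not any(dominates(attrs, other)
                   for other_name, other in missing.items()
                   if other_name != name)
    }
```

For instance, with M = {A: (9, 9), B: (5, 6), C: (7, 3)}, A dominates both B and C and is removed, while B and C do not dominate each other and are both kept.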
35
Penalty Model
- Original query: $Q(3, w_{origin})$
- Refined query: $Q_1(5, w_{origin})$
- Penalty of changing k: $\Delta k = 5 - 3 = 2$
- Penalty of changing w: $\Delta w = \lVert w_{origin} - w_{origin} \rVert_2 = 0$
- Basic penalty model: $\mathrm{Penalty}(5, w_{origin}) = \lambda_k \Delta k + \lambda_w \Delta w$, where $\lambda_k + \lambda_w = 1$
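To see how the choice of λ_k and λ_w shifts the preferred answer, the sketch below evaluates the basic penalty for two refined queries under three settings. The numeric λ pairs attached to PMK, PMW, and NM, the weightings, and the function name `basic_penalty` are assumptions chosen for illustration; the slides do not give these values.

```python
import math

def basic_penalty(k_o, w_o, k, w, lam_k, lam_w):
    """Basic penalty model: lam_k * delta_k + lam_w * delta_w, with lam_k + lam_w = 1."""
    assert abs(lam_k + lam_w - 1.0) < 1e-9
    delta_k = k - k_o
    delta_w = math.dist(w_o, w)  # L2 norm of the weighting change
    return lam_k * delta_k + lam_w * delta_w

# Hypothetical settings: a small lam_k makes changing k cheap (prefer modifying k).
SETTINGS = {"PMK": (0.1, 0.9), "PMW": (0.9, 0.1), "NM": (0.5, 0.5)}

def preferred(k_o, w_o, queries, setting):
    """Return the label of the refined query with the least penalty under a setting."""
    lam_k, lam_w = SETTINGS[setting]
    return min(queries, key=lambda q: basic_penalty(k_o, w_o, queries[q][0],
                                                    queries[q][1], lam_k, lam_w))
```

With `queries = {"Q1": (5, (0.5, 0.5)), "Q2": (3, (0.3, 0.7))}` and an original query of k = 3, w = (0.5, 0.5), the k-only refinement Q1 wins under the assumed PMK setting, while Q2 wins under the assumed PMW and NM settings.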
36
Normalized penalty function