1 Answering Why-not Questions on Top-K Queries
  Andy He and Eric Lo, The Hong Kong Polytechnic University

2 Background
  - The database community has focused on performance issues for decades.
  - Recently, more attention has turned to usability issues:
      - Supporting keyword search
      - Query auto-completion
      - Explaining your query result (a.k.a. why and why-not questions)

3 Why-Not Questions
  - You pose a query Q.
  - The database returns a result R.
  - R gives you a "surprise": e.g., a tuple m that you expected in the result is missing, and you ask "WHY??!"
  - You pose a why-not question (Q, R, m).
  - The database returns an explanation E.

4 The (short) history of Why-Not
  - Chapman and Jagadish, "Why Not?" [SIGMOD 09]
      - Select-Project-Join (SPJ) queries
      - Explanation E = "tell you which operator excludes the expected tuple"
  - Huang, Chen, Doan, and Naughton, "On the Provenance of Non-Answers to Queries over Extracted Data" [PVLDB 09]
      - SPJ queries
      - Explanation E = "tell you how to modify the data"

5 The (short) history of Why-Not
  - Herschel and Hernández, "Explaining Missing Answers to SPJUA Queries" [PVLDB 10]
      - SPJUA queries
      - Explanation E = "tell you how to modify the data"
  - Tran and Chan, "How to ConQueR Why-Not Questions" [SIGMOD 10]
      - SPJA queries
      - Explanation E = "tell you how to modify your query"

6 About this work
  - Why-not questions on top-k queries.
  - Example: a top-3 hotel query with weighting w_origin.
  - Result:
      Rank 1: Sheraton
      Rank 2: Westin
      Rank 3: InterContinental
  - "WHY is my favorite hotel, the Renaissance, NOT in the top-3 result?"
      - Is my value of k too small?
      - Should I revise my weighting?
      - Or do I need to modify both k and the weighting?
  - Explanation E = "tell you how to refine your top-k query so that your favorite comes back into the result"
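For concreteness, here is a minimal sketch of how such a weighted top-k query scores and ranks tuples. The attribute names, values, and the uniform weighting below are invented for illustration (the actual weight vectors on the slides are not preserved in this transcript); the numbers are simply chosen so that the uniform weighting reproduces the ranking above, with the Renaissance at rank 5.

```python
import numpy as np

# Hypothetical hotel attributes (price, location, rating), normalized so larger is better.
hotels = {
    "Sheraton":         np.array([0.9, 0.8, 0.9]),
    "Westin":           np.array([0.8, 0.9, 0.8]),
    "InterContinental": np.array([0.7, 0.9, 0.7]),
    "Hilton":           np.array([0.8, 0.6, 0.7]),
    "Renaissance":      np.array([0.6, 0.7, 0.7]),
}

def top_k(w, k):
    """Score each hotel by the weighted sum of its attributes and return the k highest."""
    ranked = sorted(hotels, key=lambda name: -(hotels[name] @ w))
    return ranked[:k]

print(top_k(np.array([1/3, 1/3, 1/3]), 3))  # ['Sheraton', 'Westin', 'InterContinental']
```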

7 One possible answer: only modify k
  - Original query Q(k_original = 3, w = w_origin).
  - The ranking of the Renaissance under the original weighting w_origin:
      Rank 1: Sheraton
      Rank 2: Westin
      Rank 3: InterContinental
      Rank 4: Hilton
      Rank 5: Renaissance
  - Refined query #1: Q1(k = 5, w = w_origin), i.e., raise k from 3 to 5 and keep the weighting.

8 Another possible answer: only modify the weighting
  - Original query Q(k = 3, w = w_origin).
  - Refined query #1: Q1(k = 5, w = w_origin).
  - With a different weighting w:
      Rank 1: Hotel E
      Rank 2: Hotel F
      Rank 3: Renaissance
  - Refined query #2: Q2 keeps k = 3 but uses the new weighting.

9 Yet another possible answer: modify both
  - Original query Q(k = 3, w = w_origin).
  - Refined query #1: Q1(k = 5, w = w_origin).
  - Refined query #2: Q2(k = 3, w = a new weighting).
  - With yet another weighting w:
      Rank 1: Hotel A
      Rank 2: Hotel B
      Rank 3: Hotel C
      ...
      Rank 10000: Renaissance
  - Refined query #3: Q3(k = 10000, w = that weighting).

10 Our objective
  - Find the refined query that brings the missing tuple m into the top-k result while minimizing a penalty function.
  - The penalty function can be configured to reflect the user's preference:
      Prefer Modify K          (PMK)
      Prefer Modify Weighting  (PMW)
      Never Mind (default)     (NM)

11 Basic idea
  - For each weighting w_i in W:
      - Run PROGRESS(w_i, UNTIL-SEE-m).
      - Obtain the ranking r_i of m under the weighting w_i.
      - Form a refined query Q_i(k = r_i, w = w_i).
  - Return the refined query with the least penalty.
  - Problem: W is infinite!!!

12 Our approach: sampling
  - For each weighting w_i in W (see the sketch below):
      - Run PROGRESS(w_i, UNTIL-SEE-m).
      - Obtain the ranking r_i of m under the weighting w_i.
      - Form a refined query Q_i(k = r_i, w = w_i).
  - Return the refined query with the least penalty.
  - Here W is a set of weightings drawn from a restricted weighting space.
  - Key theorem: the optimal refined query Q_best is either Q_1 (modify k only) or else has a weighting w_best that lies in the restricted weighting space.
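A minimal sketch of this sample-and-evaluate loop, assuming the tuples are stored as rows of a NumPy matrix (larger attribute values are better), the basic unnormalized penalty model from the backup slides, and that `sampled_weightings` has already been drawn from the restricted space. All names here are illustrative, not the paper's API.

```python
import numpy as np

def rank_of(m_idx, data, w):
    """Rank of the missing tuple m under weighting w (1 = best).
    Stands in for PROGRESS(w, UNTIL-SEE-m); a real top-k engine would scan
    tuples in score order and stop as soon as m is seen."""
    scores = data @ w
    return int(np.sum(scores > scores[m_idx])) + 1

def best_refined_query(data, m_idx, sampled_weightings, k_orig, w_orig,
                       lam_k=0.5, lam_w=0.5):
    """Evaluate each sampled weighting, form the refined query
    (k = m's rank, w = the sample), and keep the one with the least penalty."""
    best = None
    for w in sampled_weightings:
        r = rank_of(m_idx, data, w)
        delta_k = max(0, r - k_orig)
        delta_w = np.linalg.norm(w - w_orig)
        p = lam_k * delta_k + lam_w * delta_w
        if best is None or p < best[0]:
            best = (p, r, w)
    return best  # (penalty, refined k, refined weighting)
```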

13 How large should the sample size be?
  - We say a refined query is a best-T% refined query if its penalty is smaller than that of (100 - T)% of all refined queries.
  - We want to obtain such a query with probability larger than a threshold Pr.
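One standard way to turn these two requirements into a sample size, under the assumption that each sampled weighting independently has probability T of yielding a best-T% refined query (T and Pr written as fractions):

```python
import math

def sample_size(T, Pr):
    """Smallest s with 1 - (1 - T)**s >= Pr: with probability at least Pr,
    at least one of s independent samples yields a best-T% refined query.
    T and Pr are fractions in (0, 1), e.g. T = 0.01, Pr = 0.95."""
    return math.ceil(math.log(1 - Pr) / math.log(1 - T))

print(sample_size(0.01, 0.95))  # about 299 samples
```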

14 The PROGRESS operation can be expensive
  - Original query Q(k = 3, w = w_origin).
  - Refined query #1: Q1(k = 5, w = w_origin).
  - With another weighting w:
      Rank 1: Hotel A
      Rank 2: Hotel B
      Rank 3: Hotel C
      ...
      Rank 10000: Renaissance
  - Refined query Q2(k = 10000, w = that weighting): running PROGRESS all the way down to rank 10000 is very slow!!!

15 Two optimization techniques
  - Stop each PROGRESS operation early.
  - Skip some PROGRESS operations.

16 Stop earlier
  - Original query Q(k = 3, w = w_origin).
  - Refined query #1: Q1(k = 5, w = w_origin).
  - With another weighting w:
      Rank 1: Hotel A
      Rank 2: Hotel B
      Rank 3: Hotel C
      ...
      Rank 5: Hotel D
      ...
  - The Renaissance has still not appeared by rank 5, so this weighting needs Δk of at least 2 on top of a nonzero Δw and can no longer beat refined query #1; PROGRESS can stop here without locating the Renaissance's exact rank.
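A minimal sketch of this early-stopping rule, continuing the hypothetical helpers above. Here `rank_cap` stands for the largest rank at which the weighting could still beat the best penalty found so far (derived from the penalty model); it is not a name from the paper.

```python
import numpy as np

def rank_with_early_stop(m_idx, data, w, rank_cap):
    """Return the rank of the missing tuple under w, or None as soon as it is
    clear the rank will exceed rank_cap; at that point the refined query
    built from w cannot beat the best penalty found so far."""
    scores = data @ w
    target = scores[m_idx]
    beaten = 0
    for s in scores:                  # a real engine would scan in score order
        if s > target:
            beaten += 1
            if beaten >= rank_cap:    # rank would be at least rank_cap + 1
                return None
    return beaten + 1
```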

17 Skip PROGRESS operations (a)
  - Similar weightings tend to lead to similar rankings (based on the "Reverse Top-k Queries" paper, ICDE'10).
  - Therefore, the result of PROGRESS(w_x, UNTIL-SEE-m) can be used to deduce the result of PROGRESS(w_y, UNTIL-SEE-m), provided that w_x and w_y are similar.

18 Skip PROGRESS operations (a): example
  - Original query Q(k = 3, w = w_origin); refined query #1: Q1(k = 5, w = w_origin).
  - Scores under w_origin:
      Sheraton 10, Westin 9, InterContinental 8, Hilton 7, Renaissance 6
  - Scores under a nearby weighting:
      Westin 10, Sheraton 9, Hilton 8, InterContinental 7, Renaissance 5
  - How would the scores look if we set w to yet another nearby weighting? The results already computed can help answer this without running PROGRESS again.
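One simple way to exploit this reuse, offered as an illustration rather than the exact deduction used in the paper: re-score only the tuples that PROGRESS already retrieved under w_x with the new weighting w_y. The number of them that still beat m is a lower bound on m's rank under w_y, which is enough to skip w_y when that bound already makes its penalty too high. The function and argument names are hypothetical.

```python
import numpy as np

def rank_lower_bound(m_idx, seen_idx, data, w_y):
    """Re-score only the tuples retrieved by an earlier PROGRESS run (seen_idx)
    under the new weighting w_y. Tuples outside seen_idx might also beat m,
    so this is a lower bound on m's rank; sufficient for pruning."""
    target = data[m_idx] @ w_y
    beating = sum(1 for i in seen_idx if i != m_idx and data[i] @ w_y > target)
    return beating + 1
```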

19 Skip PROGRESS operations (b)
  - We can also skip a weighting w outright if its change Δw from the original weighting w_origin is too large.
  - E.g., if the best refined query found so far has penalty 0.5 and a candidate weighting w has Δw = 1, the penalty contributed by Δw alone already prevents w from doing better, so we can skip it entirely.
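A minimal sketch of this pruning test under the basic penalty model from the backup slides (Δw taken as the L2 distance between weightings); the names are illustrative.

```python
import numpy as np

def can_skip(w, w_orig, best_penalty, lam_w):
    """Skip w without running PROGRESS if the weighting-change term alone
    already reaches the best penalty found so far; the Delta-k term can
    only add to the penalty, never reduce it."""
    delta_w = np.linalg.norm(np.asarray(w) - np.asarray(w_orig))
    return lam_w * delta_w >= best_penalty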

20 Experiments
  - Case study on NBA data.
  - Experiments on synthetic data.

21 Case study on NBA data
  - We compare with a pure random-sampling version, which draws samples from the complete weighting space rather than from the restricted weighting space.

22 Find the top-3 centers in NBA history
  - 5 attributes (each with weighting 1/5): POINTS, REBOUND, BLOCKING, FIELD GOAL, FREE THROW.
  - Initial result:
      Rank 1: Chamberlain
      Rank 2: Abdul-Jabbar
      Rank 3: O'Neal

23 Find the top-3 centers in NBA history
  - Why not?! We choose the "Prefer Modify Weighting" penalty setting and compare the two sampling strategies:

                      Sampling on the restricted   Sampling on the whole
                      weighting space              weighting space
      Refined query   Top-3                        Top-7
      Δk              0                            4
      Time (ms)       156                          154
      Penalty         0.069                        0.28

24 Synthetic data
  - Uniform, anti-correlated, and correlated distributions.
  - Scalability.

25 Varying query dimensions

26 Varying k_o

27 Varying the ranking of the missing object

28 Varying the number of missing objects

29 Varying T% (time and quality)

30 Varying Pr

31 Optimization effectiveness

32 Conclusions
  - We are the first to answer why-not questions on top-k queries.
  - We prove that finding the optimal answer is computationally expensive.
  - A sampling-based method is proposed; the optimal answer is proved to lie in a restricted sample space.
  - Two optimization techniques are proposed:
      - Stop each PROGRESS operation early.
      - Skip some PROGRESS operations.

33 Thanks Q&A

34 Dealing with multiple missing objects M
  - We have to modify the algorithm a little:
      - First do a simple filtering on the set of missing objects: if m_i dominates m_j in the data space, remove m_i from M, because every time m_j shows up in a top-k result, m_i must be there as well (a sketch of this filter follows below).
      - The condition UNTIL-SEE-m becomes UNTIL-SEE-ALL-OBJECTS-IN-M.
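A minimal sketch of this dominance filter, assuming attribute values are rows of a NumPy array indexed by object id, larger values are better, and weights are non-negative; the names are illustrative.

```python
import numpy as np

def filter_missing(M, data):
    """Drop every missing object that dominates another missing object:
    if m_i's attributes are all >= m_j's (and strictly greater somewhere),
    then m_i outranks m_j under any non-negative weighting, so seeing m_j
    in a top-k result guarantees m_i has already appeared."""
    def dominates(i, j):
        return np.all(data[i] >= data[j]) and np.any(data[i] > data[j])

    return [i for i in M if not any(dominates(i, j) for j in M if j != i)]
```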

35 Penalty model
  - Original query Q(3, w_origin); refined query Q1(5, w_origin).
  - Penalty of changing k: Δk = 5 - 3 = 2.
  - Penalty of changing w: Δw = ||w_origin - w_origin||_2 = 0.
  - Basic penalty model (applied to Q1): Penalty(5, w_origin) = λ_k·Δk + λ_w·Δw, with λ_k + λ_w = 1.
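A minimal sketch of this basic (unnormalized) penalty model; the function and parameter names are illustrative, and the normalization from the next slide is omitted.

```python
import numpy as np

def penalty(k_refined, w_refined, k_orig, w_orig, lam_k=0.5, lam_w=0.5):
    """Weighted sum of how much k grew and how far the weighting moved
    (L2 distance), with lam_k + lam_w = 1."""
    delta_k = max(0, k_refined - k_orig)
    delta_w = np.linalg.norm(np.asarray(w_refined) - np.asarray(w_orig))
    return lam_k * delta_k + lam_w * delta_w

# The example above: Q(3, w_origin) refined to Q1(5, w_origin)
# gives delta_k = 2 and delta_w = 0, so Penalty = 2 * lam_k.
```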

36 Normalized penalty function

