Best-Effort Top-k Query Processing Under Budgetary Constraints. Michal Shmueli-Scheuer (IBM Haifa Research Lab and UCI), Yosi Mass, Haggai Roitman, Chen Li, Ralf Schenkel, Gerhard Weikum
Motivating Example. Mediation systems: achieve high query throughput. Mobile applications: highly impatient users, need fast results. Online analytics (e.g., logs): achieve high query throughput. [Diagram: top-k queries go to the engine, which returns top-k results.]
Traditional top-k query. Pre-computed lists over multiple attributes, each sorted by score (n objects, m lists). R1: a 0.9, b 0.6, c 0.5, …, d 0.4. R2: d 0.87, a 0.85, f 0.5, …, c 0.2. Rm: c 0.9, b 0.6, g 0.5, …, a 0.4. Scores are combined by a monotonic aggregation function. Two access modes: sorted access (cost Cs) and random access (cost Cr). Objective: compute the k objects with the highest scores.
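NRA algorithm (Fagin et al.), step 1, with f = SUM. R1: a 0.9, b 0.6, c 0.5, …, d 0.4. R2: d 0.87, a 0.85, f 0.25, …, c 0.2. After one sorted access per list (high1 = 0.9, high2 = 0.87), every seen object carries an interval [worst score, best score]. Top-2: a [0.9, 1.77], d [0.87, 1.77]. No other candidates yet. Stopping rule: mink > best score of every candidate, where mink is the lowest worst score in the current top-k.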
NRA algorithm (Fagin et al.), step 2. R1: a 0.9, b 0.6, c 0.5, …, d 0.4. R2: d 0.87, a 0.85, f 0.25, …, c 0.2. After two sorted accesses per list (high1 = 0.6, high2 = 0.85). Top-2: a [1.75, 1.75], d [0.87, 1.47]. Candidates: b [0.6, 1.45]. mink = 0.87 is not above the candidate's best score 1.45, so the scan continues.
NRA algorithm (Fagin et al.), step 3. R1: a 0.9, b 0.6, c 0.5, …, d 0.4. R2: d 0.87, a 0.85, f 0.25, …, c 0.2. After three sorted accesses per list (high1 = 0.5, high2 = 0.25). Top-2: a [1.75, 1.75], d [0.87, 1.37]. Candidates: b [0.6, 0.85], c [0.5, 0.75], f [0.25, 0.75]. mink = 0.87 > 0.85, the best score of any candidate, so NRA stops.
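The three NRA steps above can be replayed with a short sketch. It is a minimal illustration, not the authors' implementation: it assumes f = SUM, round-robin sorted accesses, and uses only the three visible rows of each list (the elided rows are left out). The function name `nra` and the treatment of unseen objects as one extra virtual candidate are choices made for this sketch.

```python
# Visible rows of the running example (elided rows are omitted on purpose).
R1 = [("a", 0.9), ("b", 0.6), ("c", 0.5)]
R2 = [("d", 0.87), ("a", 0.85), ("f", 0.25)]


def nra(lists, k):
    """Sketch of NRA with f = SUM. Returns (top-k names, depth read per list)."""
    m = len(lists)
    seen = {}  # object -> {list index: score seen in that list}
    max_depth = max(len(lst) for lst in lists)
    topk = []
    for depth in range(1, max_depth + 1):
        # One round of sorted accesses, one row per list.
        for i, lst in enumerate(lists):
            if depth <= len(lst):
                obj, score = lst[depth - 1]
                seen.setdefault(obj, {})[i] = score
        # high[i]: last score read from list i (0.0 once a list is exhausted).
        high = [lst[depth - 1][1] if depth <= len(lst) else 0.0 for lst in lists]
        worst = {o: sum(parts.values()) for o, parts in seen.items()}
        best = {
            o: worst[o] + sum(high[i] for i in range(m) if i not in parts)
            for o, parts in seen.items()
        }
        ranked = sorted(seen, key=lambda o: (-worst[o], o))
        topk, candidates = ranked[:k], ranked[k:]
        if len(topk) == k:
            min_k = worst[topk[-1]]
            # Assumption of this sketch: unseen objects act as one virtual
            # candidate whose best score is sum(high).
            rivals = [best[o] for o in candidates] + [sum(high)]
            if all(min_k > b for b in rivals):
                return topk, depth
    return topk, max_depth


result = nra([R1, R2], k=2)
print(result)
```

On these rows the sketch stops after three rounds with a and d as the top-2, matching the third step of the slide sequence.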
Top-k with Budget Constraints. Access costs: sorted access cost Cs, random access cost Cr. Example with Cs = 1, Cr = 3, f = SUM. R1: s 0.95, u 0.93, t 0.92, d 0.9, x 0.5, y 0.4, z 0.2, … R2: a 1.0, b 0.9, c 0.85, d 0.8, e 0.7, t 0.6, f 0.4, … Exact top-2: d 1.7, t 1.52. Without a budget: NRA needs 12 Cs = 12; TA needs 7 Cs + 7 Cr = 28. With budget B = 10: NRA reaches precision 0.5; TA reaches precision 0. Problem: given budget B, maximize result quality.
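The precision of 0.5 quoted for sorted accesses under B = 10 can be checked with a small sketch. It assumes round-robin sorted accesses with Cs = 1, ranks objects by the sum of the scores seen so far, and measures precision as the overlap with the exact top-2 divided by k. Only the seven visible rows of each list are used; all function names are hypothetical.

```python
# Visible rows of the example lists (elided rows are omitted).
R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]


def partial_scores(lists, depths):
    """Sum of the scores seen when list i is read to depth depths[i]."""
    scores = {}
    for lst, depth in zip(lists, depths):
        for obj, score in lst[:depth]:
            scores[obj] = scores.get(obj, 0.0) + score
    return scores


def top_k(scores, k):
    """Top-k object names by score, ties broken by name."""
    return sorted(scores, key=lambda o: (-scores[o], o))[:k]


def round_robin_depths(lists, budget, cs):
    """Depth reached in each list when sorted accesses alternate until the budget is spent."""
    depths = [0] * len(lists)
    spent, i = 0, 0
    while spent + cs <= budget and any(d < len(l) for d, l in zip(depths, lists)):
        if depths[i] < len(lists[i]):
            depths[i] += 1
            spent += cs
        i = (i + 1) % len(lists)
    return depths


def precision(result, exact, k):
    return len(set(result) & set(exact)) / k


lists, k = [R1, R2], 2
exact = top_k(partial_scores(lists, [len(l) for l in lists]), k)
depths = round_robin_depths(lists, budget=10, cs=1)
budgeted = top_k(partial_scores(lists, depths), k)
print(depths, budgeted, precision(budgeted, exact, k))
```

With ten sorted accesses split evenly, d is fully scored but t has been seen in only one list, so the budgeted answer contains one of the two exact results.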
Contributions. Sorted accesses: efficient plan; solution with adaptive a. Sorted and random accesses. Experiments.
Results Under Limited Budget. [Figure labels: results for limited budget; k results for unlimited budget.]
Efficient Plan: Sorted Accesses. Assume that we know the k results for an unlimited budget (REXACT). Interesting positions: the positions where the k objects appear in the lists (P1, P2 in L1; Q1, Q2 in L2). A plan is an allocation of the sorted-access budget to the lists, e.g. Plan: {L1, 4}, {L2, 2}. Example lists, each entry paired with its score in that list (SL1 or SL2): L1: o1, o5, o8, o6. L2: o1, o2, o5, o4, o3. Top-2: o5, o1.
Efficient Plan: Sorted Accesses. Goal: find a plan t such that [formula lost in extraction]. The result is denoted as ROPT. Example lists, each entry paired with its score in that list (SL1 or SL2), with interesting positions P1, P2 in L1 and Q1, Q2 in L2: L1: o1, o5, o8, o6. L2: o1, o2, o5, o4, o3. Plans for B = 5: Plan: {L1, 2}, {L2, 3}.
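A plan can be read as one depth per list whose total sorted-access cost stays within B. The sketch below searches all such plans by brute force and keeps the one with the highest precision, breaking ties by lower cost. Because the o1 … o8 lists above carry no numeric scores, it reuses the two-list example with s, u, t, d (Cs = 1, B = 10, k = 2). Ranking by the partial sums seen so far and measuring precision against the exact top-k are assumptions of this sketch; the slide's own goal formula is not available, and all function names are hypothetical.

```python
from itertools import product

# Visible rows of the earlier two-list example (elided rows are omitted).
R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]


def partial_scores(lists, depths):
    """Sum of the scores seen when list i is read to depth depths[i]."""
    scores = {}
    for lst, depth in zip(lists, depths):
        for obj, score in lst[:depth]:
            scores[obj] = scores.get(obj, 0.0) + score
    return scores


def top_k(scores, k):
    """Top-k object names by score, ties broken by name."""
    return sorted(scores, key=lambda o: (-scores[o], o))[:k]


def best_plan(lists, k, budget, cs):
    """Exhaustive search over depth allocations with cs * sum(depths) <= budget."""
    exact = set(top_k(partial_scores(lists, [len(l) for l in lists]), k))
    best = None
    for depths in product(*(range(len(l) + 1) for l in lists)):
        cost = cs * sum(depths)
        if cost > budget:
            continue
        result = top_k(partial_scores(lists, depths), k)
        prec = len(exact & set(result)) / k
        key = (-prec, cost, depths)  # higher precision first, then cheaper
        if best is None or key < best[0]:
            best = (key, depths, prec, cost)
    return best[1], best[2], best[3]


plan = best_plan([R1, R2], k=2, budget=10, cs=1)
print(plan)
```

On this data the best plan reads R1 to depth 4 and R2 to depth 6, which are exactly the positions where d and t, the exact top-2, last appear. That matches the idea of restricting attention to interesting positions, and it reaches full precision with the same budget under which an even split reaches 0.5.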
Sorted Accesses. Observation: prefer high scores. [Figure: lists L1, L2, L3; object O1 with score SL1.]
Observations, contd. Prefer large score reductions. Example query: title = “war”, description = “weapon”.
Score Utilities. Score gain: [formula lost in extraction]. Score reduction: [formula lost in extraction]. Example values: o2, 1; o4, 0.9; y = 3.
Optimization Problem. Bi-objective optimization problem: util(Li, x) = a * gain + (1 - a) * reduction. Heuristics: Fair heuristic; Rank heuristic [formulas lost in extraction], where m is the number of lists.
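The weighted utility can be sketched as follows. The gain and reduction values are hypothetical inputs, since their formulas are not available here; only the combination util = a * gain + (1 - a) * reduction comes from the slide. The helper `pick_list` and the sample numbers are illustrative assumptions.

```python
def util(gain, reduction, a):
    """Weighted utility of reading further in a list: a * gain + (1 - a) * reduction."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * gain + (1 - a) * reduction


def pick_list(stats, a):
    """Name of the list with the highest utility; stats maps name -> (gain, reduction)."""
    return max(sorted(stats), key=lambda name: util(*stats[name], a))


# Hypothetical per-list estimates: L1 offers more gain, L2 more reduction.
stats = {"L1": (0.9, 0.1), "L2": (0.5, 0.6)}
print(pick_list(stats, a=0.8), pick_list(stats, a=0.2))
```

A large a favors the list with the higher score gain, while a small a favors the list with the larger score reduction, which is the trade-off the two observations describe.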
Adaptive. [Figure: weights of gain (a) and reduction (1 - a) over time.]
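Adaptive. Top-k: o1 [ws, bs], o3 [0.8, bs]. Candidates: o4 [0.6, bs], o6 [ws, bs]. d(o4) = 0.8 - 0.6 = 0.2, the gap between the worst score of o3, the weakest top-k object, and the worst score of candidate o4. [Figure: lists L1, L2, L3 with O1, SL1 and markers hight1, hight2.] (Theobald et al. VLDB04)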
Adaptive. [Plot: TREC query, k = 100.]
Efficient Plan: Random Accesses. Observation: random accesses can always be placed after all sorted accesses have finished. Schedule 1: {SA … RA … SA …}. Schedule 2: {SA … SA … RA …}. precision(schedule 1) = precision(schedule 2).
Observations, contd. Random accesses are only useful for objects in REXACT. [Figure: top-k and candidate queues holding [ws, bs] intervals for o1 to o5, next to list L2; a random access on o5, which is not in REXACT, either reduces precision or leaves it the same.]
Random Accesses. When to switch from SA to RA? Phase 1: gathering with sorted accesses. Phase 2: probing with random accesses. Switching too early: not enough good candidates, so RAs are wasted. Switching too late: not enough RAs left to prune the candidates. RA is much more expensive than SA. [Figure: timeline split between the two phases.]
Random Accesses. Switch from sorted to random: R = (1 - [symbol lost in extraction]) * S, where S is the total cost of sorted accesses and R is the total cost of random accesses. S + R > B. Which items to access? Do one RA on each candidate; maximize expected score.
Experimental Data. TREC Terabyte: 25M web pages; 50 queries with an average length of 3 words. IMDB: 375,000 movies; 20 queries, each with 4 attributes: {Title, Genre, Actors, Description}. Synthetic data: Zipf, #lists = [2, 6], #objects = [10000, 1000000]. Aggregate function: Sum.
Evaluation Methods. Measures: percentage of optimal precision; SME. [Figure labels: Ropt, Rexact, Ralg.]
Results: Sorted Accesses. [Plot: TREC, k = 100.] Less budget, more improvement.
Varied k. [Plot: IMDB, B = 400.] Lower k, more improvement.
Number of Lists. [Plot: Zipf, k = 100, B = 4000.] More lists, more improvement.
Results: Random Accesses. [Plots: TREC, k = 100, Cr = 10; TREC, k = 100, Cr = 100.]
Related Work. Minimize budget for optimal results: the algorithm computes the exact results with minimum cost (Bast et al. VLDB06, Bruno et al. ICDE02, Chang et al. SIGMOD02). This is the dual problem. Anytime top-k: the algorithm collects statistics during processing, which can be used to provide probabilistic guarantees at any time during processing (Arai et al. VLDB07). It does not do any optimization. Approximate top-k: approximate results with probabilistic guarantees (Theobald et al. VLDB04, Fagin et al. 2001).
Conclusions. First attempt to deal with budget constraints. For SA only, average precision is around 70%. There is a tradeoff between RAs and SAs: when the cost of an RA is relatively low, schedules that use RAs improve the results.
Thank You !
Top-k query. Given a set of n objects and m scoring lists sorted in decreasing order, find the top-k objects according to a scoring function f. Top-k: a set T of k objects such that f(rj1, …, rjm) ≤ f(ri1, …, rim) for every object Xi in T and every object Xj not in T. Assumption: the scoring function f is monotone, i.e. f(r1, …, rm) ≤ f(r1’, …, rm’) if ri ≤ ri’ for all i. Two access modes: sorted access (cost Cs) and random access (cost Cr). Objective: compute the top-k with the minimum cost.
Sorted Accesses. Observations: an object with high scores has a higher potential to be part of the top-k; an object with “mediocre” scores does not help. Prefer high scores. [Figure: lists L1, L2, L3 holding O1 with scores SL1, SL2, SL3.]
Example. [Figure labels: wireless zone; Q; useless.]
Applications. Mobile applications: highly impatient users, need fast results. Mediation systems: achieve high query throughput. Online analytics (e.g., logs).
Motivating Example. Query throughput: given #queries per time unit, allocate time for each query. [Diagram: user query, mediator, engine, servers.]
Terminology. Sorted access; random access; highi; top-k queue; candidates queue; mink; worstScore(d); bestScore(d).
Efficient Offline Solution: Sorted. Goal: find a trace t such that [formula lost in extraction]. The result is denoted as ROPT. Example with B = 5 and interesting positions P1, P2: L1: o1, o5, o8, o6. L2: o1, o2, o5, o4, o3 (each entry paired with its score SL1 or SL2). Traces, as depths in L1 and L2: t1: 5; t2: 1, 4; t3: 2, 3; t4, t5, t6: [values lost in extraction]. Feasible for k up to 100 and m up to 10.
Efficient Offline Solution: Sorted. Proof (by contradiction): Assume that t does not exist, and choose a trace s that is within the budget and has optimal precision. Construct s` whose entries s`i are the largest positions Pi that are less than or equal to si. By construction, the score of any object in S is the same as in S`.
Fair Heuristic. Assume budget = b. Runs in batches.
Efficient Offline Solution: Random. Budget for RAs = B - |t| * Cs. For each object o: best(o) - mink, where best(o) = worst(o) + RA. [Figure labels: Top-k; d; Rexact; objects o9, o5, o7, o8, … and o1, o2, o3, o4, o10, o14, …, each with its score S.]
Motivation. Many applications work in budget-constrained environments. Still, they wish to perform top-k queries. [Diagram: user query, mediator, engine, servers; budget-aware query processing.]
Future work. Different access costs for different lists. Time-aware top-k. Top-k with budget constraints for P2P.