Best-Effort Top-k Query Processing Under Budgetary Constraints

1 Best-Effort Top-k Query Processing Under Budgetary Constraints
Michal Shmueli-Scheuer (IBM Haifa Research Lab and UCI), Yosi Mass, Haggai Roitman, Chen Li, Ralf Schenkel, Gerhard Weikum

2 Motivating Example
Top-k engine: takes top-k queries and returns results.
Mobile Applications: highly impatient users, need fast results.
Mediation Systems: achieve high query throughput.
Online Analytics (e.g. logs): achieve high query throughput.

3 Traditional top-k query
R1: a 0.9, b 0.6, c 0.5, .., d 0.4
R2: d 0.87, a 0.85, f 0.5, .., c 0.2
Rm: c 0.9, b 0.6, g 0.5, .., a 0.4
(m sorted lists over n objects)
Pre-computed lists over multiple attributes.
Combine scores by some monotonic aggregation function.
Two access modes: sorted access (cost Cs) and random access (cost Cr).
Objective: compute the k objects with the highest scores.

4 NRA algorithm (Fagin et al.)
R1: a 0.9, b 0.6, c 0.5, .., d 0.4
R2: d 0.87, a 0.85, f 0.5, .., c 0.2
f = SUM
Top-2 [worst score, best score]: a [0.9, 1.77], d [0.87, 1.77]
Candidates: none yet
Labels: highi (current scan position in list i), mink (worst score of the last top-k object)
Stop when mink > best score of the candidates.

5 NRA algorithm (Fagin et al.)
R1: a 0.9, b 0.6, c 0.5, .., d 0.4
R2: d 0.87, a 0.85, f 0.25, .., c 0.2
Top-2 [worst score, best score]: a [1.75, 1.75], d [0.87, 1.47]
Candidates: b [0.6, 1.45]
Stop when mink > best score of the candidates.

6 NRA algorithm (Fagin et al.)
R1: a 0.9, b 0.6, c 0.5, .., d 0.4
R2: d 0.87, a 0.85, f 0.25, .., c 0.2
Top-2 [worst score, best score]: a [1.75, 1.75], d [0.87, 1.37]
Candidates: b [0.6, 0.85], c [0.5, 0.75], f [0.25, 0.75]
Stop when mink > best score of the candidates. Here mink = 0.87 exceeds every candidate's best score, so the scan stops.
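The three NRA slides can be replayed with a minimal Python sketch. It assumes SUM aggregation, non-negative scores, and round-robin sorted access. The function name `nra` and its return shape are illustrative choices, not taken from the slides. The input lists are only the visible prefixes of R1 and R2 (the entries hidden behind ".." are left out).

```python
def nra(lists, k):
    """No-random-access top-k with SUM aggregation (sketch).

    lists: one list per attribute of (object, score) pairs, sorted by
    descending score. Returns ([(object, worst, best)], rounds used).
    """
    seen = {}  # object -> {list index: score seen so far}
    rounds = max(len(lst) for lst in lists)
    for depth in range(1, rounds + 1):
        # One round: a single sorted access on every list that still has entries.
        for i, lst in enumerate(lists):
            if depth <= len(lst):
                obj, score = lst[depth - 1]
                seen.setdefault(obj, {})[i] = score
        # high[i]: score at the scan position of list i (last entry if exhausted).
        high = [lst[min(depth, len(lst)) - 1][1] for lst in lists]
        worst = {o: sum(s.values()) for o, s in seen.items()}
        best = {o: worst[o] + sum(h for i, h in enumerate(high) if i not in s)
                for o, s in seen.items()}
        ranked = sorted(seen, key=lambda o: worst[o], reverse=True)
        top, candidates = ranked[:k], ranked[k:]
        if len(top) == k:
            min_k = worst[top[-1]]
            # Stop when neither a candidate nor an unseen object can beat min_k.
            if all(best[o] < min_k for o in candidates) and sum(high) < min_k:
                break
    return [(o, worst[o], best[o]) for o in top], depth


# Visible prefixes of the lists on the slides.
R1 = [("a", 0.9), ("b", 0.6), ("c", 0.5)]
R2 = [("d", 0.87), ("a", 0.85), ("f", 0.25)]
result, depth = nra([R1, R2], k=2)
```

On this input the loop stops in the third round with a and d as the top-2, which matches the intervals on the last NRA slide.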

7 Top-k with Budget Constraints
Access costs: sorted access cost Cs, random access cost Cr. Here Cs = 1, Cr = 3, f = SUM.
R1: s 0.95, u 0.93, t 0.92, d 0.9, x 0.5, y 0.4, z 0.2
R2: a 1.0, b 0.9, c 0.85, d 0.8, e 0.7, t 0.6, f 0.4, ..
Exact top-2: d 1.7, t 1.52
NRA: 12 Cs = 12, precision = 0.5
TA: 7 Cs + 7 Cr = 28, precision = 0
Budget = 10?
Given budget B, maximize result quality.
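To make the budget effect concrete, here is a small Python sketch that spends a budget of 10 on round-robin sorted accesses over the two lists above and ranks objects by their partial (worst) scores. The function name `budgeted_round_robin` is hypothetical, and the scheduling policy is an assumption about how a budget-cut NRA run would behave, not something the slide states.

```python
def budgeted_round_robin(lists, k, budget, cs=1):
    """Round-robin sorted accesses until the budget is spent (sketch).

    Ranks objects by the sum of the scores seen so far (worst score).
    Returns (top-k object names, budget actually spent).
    """
    partial = {}  # object -> sum of scores seen so far
    spent = 0
    depth = 0
    while True:
        progressed = False
        for lst in lists:
            if depth < len(lst) and spent + cs <= budget:
                obj, score = lst[depth]
                partial[obj] = partial.get(obj, 0.0) + score
                spent += cs
                progressed = True
        if not progressed:
            break
        depth += 1
    ranked = sorted(partial, key=partial.get, reverse=True)
    return ranked[:k], spent


R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]
top, spent = budgeted_round_robin([R1, R2], k=2, budget=10)
precision = len(set(top) & {"d", "t"}) / 2  # exact top-2 from the slide
```

With five accesses per list the scan sees d in both lists but t only in R1, so the answer is d and a. That gives a precision of 0.5, which appears to be the value the slide quotes for NRA.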

8 Contributions
Sorted Accesses: Efficient Plan; Solution with Adaptive a
Sorted and Random Accesses
Experiments

9 Results Under Limited Budget
[Figure: results for limited budget compared with the k results for unlimited budget.]

10 Efficient Plan- Sorted Accesses
Assume that we know the k results for unlimited budget (REXACT).
L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1)
L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2)
Top-2: o5, o1
A plan is an allocation of resources (sorted accesses) to the lists, e.g. Plan: {L1,4} {L2,2}.
Interesting positions (P1, P2, Q1, Q2): the positions where the k objects appear in the lists.

11 Efficient Plan- Sorted Accesses
Goal: find plan t, such that: [formula not preserved in the slide text]
Plans for B=5, over the interesting positions P1, P2, Q1, Q2:
L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1)
L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2)
Plan: {L1,2} {L2,3}
Denoted as ROPT.
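A plan search can be sketched in Python as a brute-force enumeration of per-list depths. This is deliberately naive: it tries every depth combination rather than only the interesting positions, and the name `best_plan` is hypothetical. The sample data are the two lists and exact top-2 {d, t} from the budget example on slide 7, with Cs = 1 and B = 10.

```python
from itertools import product


def best_plan(lists, exact, k, budget, cs=1):
    """Try every allocation of sorted accesses that fits the budget (sketch).

    A plan is a tuple of depths, one per list. Each plan is scored by the
    precision of its top-k (ranked by partial score) against `exact`.
    Returns (best precision, first plan that reaches it).
    """
    best = (-1.0, None)
    for depths in product(*(range(len(lst) + 1) for lst in lists)):
        if sum(depths) * cs > budget:
            continue  # plan exceeds the budget
        partial = {}
        for lst, d in zip(lists, depths):
            for obj, score in lst[:d]:
                partial[obj] = partial.get(obj, 0.0) + score
        top = sorted(partial, key=partial.get, reverse=True)[:k]
        precision = len(set(top) & set(exact)) / k
        if precision > best[0]:
            best = (precision, depths)
    return best


R1 = [("s", 0.95), ("u", 0.93), ("t", 0.92), ("d", 0.9),
      ("x", 0.5), ("y", 0.4), ("z", 0.2)]
R2 = [("a", 1.0), ("b", 0.9), ("c", 0.85), ("d", 0.8),
      ("e", 0.7), ("t", 0.6), ("f", 0.4)]
plan = best_plan([R1, R2], exact={"d", "t"}, k=2, budget=10)
```

On this data the search returns the plan (4, 6) with precision 1.0: four accesses on R1 and six on R2 see both d and t completely within the budget of 10.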

12 Sorted Accesses
Observations: prefer high scores.
[Figure: lists L1, L2, L3 with entry (O1, SL1).]

13 Prefer large score reductions

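14 Score Utilities
Score gain: [formula not preserved in the slide text]
Score reduction: [formula not preserved in the slide text]
Figure labels: o2, 1; o4, 0.9; y = 3
util(Li, x) = a * gain + (1 - a) * reduction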

15 Optimization Problem
Bi-objective optimization problem:
util(Li, x) = a * gain + (1 - a) * reduction
Heuristics: Fair Heuristic, Rank Heuristic [formulas not preserved in the slide text], where m is the number of lists.
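The weighted utility can be illustrated with a short Python sketch. The slides do not preserve how gain and reduction are computed, so the sketch simply assumes both are already available as numbers per list. The names `pick_list` and `estimates` and the sample values are hypothetical.

```python
def util(gain, reduction, a):
    """util = a * gain + (1 - a) * reduction, with weight a in [0, 1]."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * gain + (1 - a) * reduction


def pick_list(estimates, a):
    """Return the list name with the highest utility.

    estimates: list name -> (gain, reduction); assumed inputs, since the
    slide text does not say how they are derived.
    """
    return max(estimates, key=lambda name: util(*estimates[name], a))


# Made-up estimates: L1 promises high scores, L2 a large score reduction.
estimates = {"L1": (0.9, 0.1), "L2": (0.4, 0.5)}
choice_gain_heavy = pick_list(estimates, a=0.8)
choice_reduction_heavy = pick_list(estimates, a=0.2)
```

A large a favors the list with the higher gain (L1 here), while a small a favors the list with the larger reduction (L2).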

16 Adaptive a
[Figure: plot over time; labels: gain, reduction, (1-a).]

17 Adaptive a
Top-k: o1 [ws, bs], o3 [0.8, bs]
Candidates: o4 [0.6, bs], o6 [ws, bs]
d(o4) = 0.8 - 0.6 = 0.2
[Figure labels: lists L1, L2, L3 with entry (O1, SL1); hight1, hight2.]
(Theobald et al. VLDB04)

18 Adaptive a
[Figure: TREC query, k=100.]

19 Efficient Plan- Random Accesses
Observation: random accesses can always be performed after all sorted accesses have finished.
schedule 1: {SA……RA……SA….}
schedule 2: {SA……SA……RA….}
precision(schedule1) = precision(schedule2)

20 Observations (contd.)
Random accesses are only useful for objects in REXACT.
[Figure: top-k and candidates queues with [ws, bs] intervals over list L2; o5 is not in REXACT. Labels: "Precision reduced", "Precision remains the same".]

21 Random Accesses
When to switch from SA to RA? Gathering with sorted accesses, probing with random accesses.
Switching too early: not enough good candidates, RAs are wasted.
Switching too late: not enough RAs to prune the candidates.

22 Random Accesses
Switch from sorted to random: R = (1 - [symbol not preserved]) * S
S: total cost of sorted accesses. R: total cost of random accesses. S+R > B
Which items to access? Do one RA on each candidate; maximize expected score.

23 Experimental Data
TREC Terabyte: 25M web pages, 50 queries with an average length of 3 words.
IMDB: 375,000 movies, 20 queries, each with 4 attributes: {Title, Genre, Actors, Description}.
Synthetic data: Zipf, #lists = [2,6], #objects = [10000,1000000].
Aggregate function: Sum.

24 Evaluation Methods
Measures: percentage of optimal precision, SME.
Result sets compared: Ralg, Ropt, Rexact.
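The slide's formulas are not preserved, so the following Python sketch rests on two assumptions: precision is the fraction of Rexact found in a result, and the percentage of optimal precision is the ratio of the algorithm's precision to that of Ropt. SME is not covered because its definition cannot be recovered here. Function names and sample data are hypothetical.

```python
def precision(result, exact):
    """Fraction of the exact top-k that the result contains (assumed definition)."""
    return len(set(result) & set(exact)) / len(exact)


def pct_of_optimal(r_alg, r_opt, r_exact):
    """Precision of the algorithm relative to the best plan, in percent (assumed)."""
    optimal = precision(r_opt, r_exact)
    if optimal == 0:
        return 0.0
    return 100.0 * precision(r_alg, r_exact) / optimal


# Made-up result sets for k = 4.
r_exact = ["o1", "o2", "o3", "o4"]
r_opt = ["o1", "o2", "o3", "o9"]   # best achievable under the budget
r_alg = ["o1", "o2", "o7", "o8"]   # what some heuristic returned
score = pct_of_optimal(r_alg, r_opt, r_exact)
```

Here Ropt reaches a precision of 0.75 and Ralg 0.5, so the heuristic achieves about 66.7% of the optimal precision.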

25 Results- Sorted Accesses
[Figure: TREC, k=100.]
Less budget, more improvement.

26 Varied k
[Figure: IMDB, B=400.]
Lower k, more improvement.

27 Number of Lists
[Figure: Zipf, k=100, B=4000.]
More lists, more improvement.

28 Results- Random Accesses
[Figures: TREC, k=100, Cr=10; TREC, k=100, Cr=100.]

29 Related Works
Minimize budget for optimal results: the algorithm computes the exact results with minimum cost (Bast et al. VLDB06, Bruno et al. ICDE02, Chang et al. SIGMOD02). This is the dual problem.
Anytime top-k: the algorithm collects statistics during processing, which can be used to provide probabilistic guarantees at any time during processing (Arai et al. VLDB07). It does not do any optimizations.
Approximate top-k: approximate results with probabilistic guarantees (Theobald et al. VLDB04, Fagin et al. 2001).

30 Conclusions
First attempt to deal with budget constraints.
For SA only, average precision is around 70%.
Tradeoff between RAs and SAs: when the cost of an RA is relatively low, schedules that use RAs improve.

33 Top-k query
Given a set of n objects and m scoring lists sorted in decreasing order, find the top-k objects according to a scoring function f.
top-k: a set T of k objects such that f(rj1,…,rjm) ≤ f(ri1,…,rim) for every object Xi in T and every object Xj not in T.
Assumption: the scoring function f is monotone: f(r1,…,rm) ≤ f(r1’,…,rm’) if ri ≤ ri’ for all i.
Two access modes: sorted access (cost Cs) and random access (cost Cr).
Objective: compute the top-k with the minimum cost.
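As a baseline for the definition above, a brute-force Python sketch scores every object with f and keeps the k largest. It reads all n * m scores, which is exactly the cost that sorted and random access strategies try to avoid. The function name `exact_topk` and the sample scores are hypothetical.

```python
import heapq


def exact_topk(scores, k, f=sum):
    """Return the k objects with the highest f(r1, ..., rm).

    scores: object -> tuple of its m per-list scores.
    f: monotone aggregation function (SUM by default).
    """
    return heapq.nlargest(k, scores, key=lambda obj: f(scores[obj]))


# Made-up scores for four objects over m = 2 lists.
scores = {
    "x1": (0.9, 0.85),
    "x2": (0.4, 0.87),
    "x3": (0.6, 0.1),
    "x4": (0.5, 0.2),
}
top2 = exact_topk(scores, k=2)
```

Any monotone f can be passed in place of `sum`, for example `min` or `max`.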

34 Sorted Accesses
Observations:
An object with high scores has a higher potential to be part of the top-k. An object with "mediocre" scores does not help.
[Figure: lists L1, L2, L3 with entries (O1, SL1), (O1, SL2), (O1, SL3).]
Prefer high scores.

35 Example
[Figure labels: wireless zone, Q, useless.]

36 Applications
Mobile Applications: highly impatient users, need fast results.
Mediation Systems: achieve high query throughput.
Online analytics (e.g. logs).

37 Motivating Example
Query throughput: given #queries per time unit, allocate time for each query.
[Figure: user query, mediator, engine, servers.]

38 Terminology
Sorted Access, Random Access, highi, Top-k queue, Candidates queue, mink, worstScore(d), bestScore(d)

39 Efficient Offline Solution- Sorted
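Goal: find trace t, such that: [formula not preserved in the slide text]
L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1)
L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2)
Positions: P1, P2
Traces for B=5: t1: 5; t2: 1, 4; t3: 2, 3; t4, t5, t6: [values not preserved]
Denoted as ROPT.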

40 Efficient Offline Solution- Sorted
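Goal: find trace t, such that: [formula not preserved in the slide text]
L1: (o1, SL1), (o5, SL1), (o8, SL1), (o6, SL1)
L2: (o1, SL2), (o2, SL2), (o5, SL2), (o4, SL2), (o3, SL2)
Positions: P1, P2
Traces for B=5: t1: 5; t2: 1, 4; t3: 2, 3; t4, t5, t6: [values not preserved]
Feasible for k up to 100 and m up to 10.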

41 Efficient Offline Solution- Sorted
Proof (by contradiction): Assume that t does not exist, and choose a trace s that is within the budget and has optimal precision. Construct s' whose positions s'i are the largest positions in Pi that are less than or equal to si. By construction, the score of any object in S is the same as in S'.

42 Fair Heuristic
Assume budget = b.
Runs in batches.

43 Efficient Offline Solution- Random
Budget for RAs = (B - |t| * Cs)
Figure label: best(o) - mink, where best(o) = worst(o) + RA
[Figure: top-k queue (d), Rexact, and object lists: o9, o5, o7, o8, ….; o1, o2, o3, o4, o10, o14, ….]

44 Motivation
Many applications work in budget-constrained environments. Still, they wish to perform top-k queries.
[Figure: user query, mediator, engine, servers; budget-aware query processing.]

45 Future work
Different access costs for different lists
Time-aware top-k
Top-k with budget constraints for P2P

