1
Top-k Query Evaluation with Probabilistic Guarantees
Martin Theobald, Gerhard Weikum, Ralf Schenkel
Presenter: Avinandan Sengupta
2
Presentation Outline
Introduction to Top-k query processing
The threshold algorithm and its variants
Are we solving the right problem?
A probabilistic algorithm
Implementation Details
Results
Conclusion
4
Data and a Query
Query: Top 10 midcap stocks with low β
Hypothetical DB of NASDAQ traded stocks. Data collated from Google Finance. Rows are objects, columns are attributes.

Scrip ID   Earnings Per Share   P/E ratio   β      ...   Average Market Cap (B$)
SNPS       1.27                 17.63       0.69   ...   3.27
IBM        12.28                13.85       0.72   ...   200
...        ...                  ...         ...    ...   ...
INFY       2.72                 19.51       1.17         30.4
MSFT       2.70                 9.32        1.03         210
GOOG       27.73                19.33       1.13         173
5
Hypothetical Graded Lists (made fit for consumption by Top-k processors)
P/E Ratio (norm), normalized as PE_j / highest PE: INFY: 1, GOOG: 0.99, SNPS: 0.90, IBM: 0.70, ..., MSFT: 0.47
β^-1 (norm), normalized as β_j^-1 / max(β_j^-1): SNPS: 1, IBM: 0.96, MSFT: 0.67, GOOG: 0.61, ..., INFY: 0.59
Average Market Cap (B$), graded by how close the market cap is to the midcap median (≅ 4.5B), normalized: SNPS: 1, INFY: 0.80, ..., GOOG: 0.05, IBM: 0.07, MSFT: 0.08
Aggregate function with weights 0.5, 1.0 and 1.0: f = 0.5*P/E + 1.0*β^-1 + 1.0*MCap
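The normalization and the aggregate function on this slide can be sketched as follows. This is only an illustration on the five stocks whose values are visible in the table; the market-cap grades are copied from the slide because the grading function itself is not given, and the names `normalize` and `aggregate` are made up for this example.

```python
# Raw attribute values for the five stocks visible in the slide's table.
pe = {"SNPS": 17.63, "IBM": 13.85, "INFY": 19.51, "MSFT": 9.32, "GOOG": 19.33}
beta = {"SNPS": 0.69, "IBM": 0.72, "INFY": 1.17, "MSFT": 1.03, "GOOG": 1.13}
# Market-cap grades copied from the slide (the grading function is not given).
mcap_grade = {"SNPS": 1.0, "INFY": 0.80, "GOOG": 0.05, "IBM": 0.07, "MSFT": 0.08}


def normalize(values):
    """Divide every value by the largest one, so the best grade is 1."""
    top = max(values.values())
    return {name: value / top for name, value in values.items()}


pe_norm = normalize(pe)                                     # PE_j / highest PE
inv_beta_norm = normalize({s: 1.0 / b for s, b in beta.items()})  # low beta is good


def aggregate(stock):
    """Weighted sum from the slide: f = 0.5*P/E + 1.0*beta^-1 + 1.0*MCap."""
    return 0.5 * pe_norm[stock] + 1.0 * inv_beta_norm[stock] + 1.0 * mcap_grade[stock]


# Rank the stocks by aggregate score, best first.
ranking = sorted(pe, key=aggregate, reverse=True)
print(ranking)
```

Computing every aggregate score up front like this is exactly what a top-k processor tries to avoid on large lists; the sketch only makes the scoring function concrete.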
6
Top-k Processor
Input: the three graded lists, P/E Ratio (norm), β^-1 (norm) and Average Market Cap (B$).
Output: the Top-k results list: SNPS, X; INFY, Y; ...; GOOG, Z
8
Fagin’s Threshold Algorithm (TA)
Access the n lists in parallel, in sorted order (sorted access).
As an object o_i is seen, perform a random access to the other lists to find the complete score for o_i. Do the same for all objects in the current row.
Now compute the threshold τ by applying the aggregation function (here, the sum) to the scores in the current row.
The algorithm stops once k objects have been found with a score of at least τ.
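The steps above can be sketched in a few lines. This is a minimal illustration and not the presenter's code: it assumes plain summation as the aggregation function, assumes every object occurs in every list, and simulates random access with dictionaries. The lists and object names are made up.

```python
import heapq


def threshold_algorithm(lists, k):
    """Return (top-k [(object, score)], rows read) using plain summation.

    lists: one list per attribute, each holding (object, score) pairs sorted
    by descending score. Assumes every object occurs in every list.
    """
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    full = {}                               # object -> complete score
    rows_read = 0
    for row in zip(*lists):                 # one sorted access per list
        rows_read += 1
        for obj, _ in row:
            if obj not in full:             # random access to all lists
                full[obj] = sum(table[obj] for table in lookup)
        tau = sum(score for _, score in row)  # threshold for this row
        top = heapq.nlargest(k, full.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= tau:
            return top, rows_read           # k objects score at least tau
    return heapq.nlargest(k, full.items(), key=lambda item: item[1]), rows_read


# Hypothetical graded lists, each sorted by descending score.
example_lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.2), ("e", 0.1)],
    [("b", 0.9), ("a", 0.7), ("d", 0.4), ("c", 0.2), ("e", 0.1)],
    [("a", 0.8), ("c", 0.6), ("b", 0.5), ("e", 0.3), ("d", 0.1)],
]
ta_top, ta_rows = threshold_algorithm(example_lists, k=2)
print(ta_top, ta_rows)
```

On this input the threshold after the first row is 2.6, which no seen object reaches; after the second row it drops to 2.1 and the algorithm stops with objects a and b.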
9
TA with No Random Access (TA-NRA)
Access the n lists in parallel.
For an item a, compute its (B)est score:
$B_a = f\bigl(f\{\mathrm{score}_j \mid j \in \text{seen-attributes}(a)\},\ f\{\mathrm{high}_k \mid k \notin \text{seen-attributes}(a)\}\bigr)$
where high_k = last seen score for the k-th attribute,
and its (W)orst score:
$W_a = f\bigl(f\{\mathrm{score}_j \mid j \in \text{seen-attributes}(a)\},\ f\{0 \mid k \notin \text{seen-attributes}(a)\}\bigr)$
Halt when k distinct objects have been seen and there is no object m outside the Top-k list with B_m ≥ W_k, where W_k is the worst score of the k-th (last) entry of the list. This means that we also maintain a table of all seen objects with their W/B scores.
Running Top-k list (contains the k objects with the largest W values; ties broken with B values): SNPS, W_1, B_1; INFY, W_2, B_2; ...; GOOG, W_k, B_k
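A small sketch of the worst-score/best-score bookkeeping follows. It assumes summation as the aggregation function and made-up lists; it also treats a not-yet-seen object as having a best score equal to the sum of the current high values, which the slide does not spell out.

```python
def nra(lists, k):
    """Return (top-k objects, rows read) using sorted access only.

    lists: one list per attribute of (object, score) pairs, sorted by
    descending score and of equal length. Aggregation is plain summation.
    """
    n = len(lists)
    seen = {}                       # object -> {list index: score seen there}
    ranked, depth = [], 0
    for depth, row in enumerate(zip(*lists), start=1):
        high = [score for _, score in row]          # last seen score per list
        for i, (obj, score) in enumerate(row):
            seen.setdefault(obj, {})[i] = score
        # Worst score: unseen attributes count as 0.
        worst = {o: sum(s.values()) for o, s in seen.items()}
        # Best score: unseen attributes count as the list's high value.
        best = {o: worst[o] + sum(high[i] for i in range(n) if i not in s)
                for o, s in seen.items()}
        # Largest W first, ties broken by B.
        ranked = sorted(seen, key=lambda o: (worst[o], best[o]), reverse=True)
        topk, rest = ranked[:k], ranked[k:]
        if len(topk) == k:
            w_k = worst[topk[-1]]
            unseen_best = sum(high)                 # bound for unseen objects
            if all(best[o] < w_k for o in rest) and unseen_best < w_k:
                return topk, depth
    return ranked[:k], depth


# Hypothetical graded lists, each sorted by descending score.
nra_lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.2), ("e", 0.1)],
    [("b", 0.9), ("a", 0.7), ("d", 0.4), ("c", 0.2), ("e", 0.1)],
    [("a", 0.8), ("c", 0.6), ("b", 0.5), ("e", 0.3), ("d", 0.1)],
]
nra_top, nra_rows = nra(nra_lists, k=2)
print(nra_top, nra_rows)
```

On this input the scan stops after three rows, once the best scores of the remaining candidates have fallen below the worst score of the second-ranked object. The candidate table `seen` grows with every new object, which is one source of the space cost mentioned on the next slide.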
10
Issues with TA and TA-NRA
High space-time costs
Overly conservative
12
Are we solving the right problem?
Is random access possible in most common scenarios?
– Web content
– XML data, hierarchical data sets
Does the user need an exact top-k query result?
– Or is she satisfied with an approximation?
13
How about an approximate solution?
Can we remove candidates (objects that we think can make it to the top-k list) from consideration early on in the process?
– Reach a solution quickly
14
Pictorially...
[Figure not reproduced in this transcript.]
Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
16
Probabilistic TA-NRA - 1
Predict the total score of an item for which a partial score is known.
Avoid the overly conservative best-score/worst-score bounds of the original TA-NRA.
– Instead, calculate the probability that the total score of the item exceeds a threshold (making the item interesting for the top-k result)
17
Probabilistic TA-NRA - 2
If this probability is sufficiently low (below a threshold ε), drop the item from the candidate list.
The probabilistic prediction involves computing the convolution of the score distributions of the different index lists.
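The pruning test can be illustrated with discrete score distributions. The sketch below is a simplification under stated assumptions, not the method of the paper: scores are measured in integer ticks, each unseen list has a known probability mass function over ticks, the lists are independent, and the fact that unseen scores are bounded by the last seen score is ignored. The names `convolve`, `exceed_probability` and `should_drop` are hypothetical.

```python
def convolve(p, q):
    """Distribution of X + Y for independent X ~ p and Y ~ q.

    p[i] is the probability that X equals i ticks; likewise for q.
    """
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def exceed_probability(partial_score, missing_pmfs, min_k):
    """P[partial_score + sum of the missing scores > min_k], all in ticks."""
    dist = [1.0]                       # distribution of an empty sum
    for pmf in missing_pmfs:
        dist = convolve(dist, pmf)
    gap = min_k - partial_score        # what the missing lists must add
    return sum(prob for ticks, prob in enumerate(dist) if ticks > gap)


def should_drop(partial_score, missing_pmfs, min_k, epsilon):
    """Drop the candidate when its chance of entering the top-k is below epsilon."""
    return exceed_probability(partial_score, missing_pmfs, min_k) < epsilon


# Two unseen lists, each uniform over 0..4 ticks (an assumed distribution).
uniform = [0.2] * 5
p_weak = exceed_probability(6, [uniform, uniform], min_k=12)    # needs > 6 more
p_strong = exceed_probability(9, [uniform, uniform], min_k=12)  # needs > 3 more
print(p_weak, p_strong)
```

With ε = 0.15, the first candidate (probability 3/25) would be dropped while the second (probability 15/25) stays in the candidate list. A conservative TA-NRA would keep both, since both best scores (14 and 17 ticks) exceed 12.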
18
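Score Distribution of Lists - How?
Example index list, β^-1 (norm): SNPS: 1, IBM: 0.96, MSFT: 0.67, GOOG: 0.61, ..., INFY: 0.59
[Figure: the scores of the list (range 0.59 to 1.0, median 0.65) go through parameter fitting / curve fitting to yield a pdf; the original slide numbers these steps 1 to 3.]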
19
What it is and What it is not
Probabilistic guarantees are not about query run-times but about query result quality.
Probabilistic guarantees refer to the approximation of the top-k ranks in a completely scored and exactly ranked result set.
20
The Math
[Equations not reproduced in this transcript; one annotation reads: set of seen attributes for an object.]
Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
21
More Math...
[Equations not reproduced in this transcript.]
Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
23
What distributions to consider?
Uniform distribution
– simplest assumptions
– convolutions based on moment-generating functions with generalized Chernoff-Hoeffding bounds
Poisson estimations
– efficiently evaluated; provide a reasonable fit for tf*idf based score distributions for Web corpora
Histograms
– when the above methods fail
– involve non-trivial computation (done offline per list)
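To give a feel for how a tail bound can replace an explicit convolution, the sketch below applies the standard Hoeffding inequality to independent scores that are uniform on [0, high_i]. This is the textbook inequality, not the generalized Chernoff-Hoeffding bounds used in the paper, and the function name and numbers are made up.

```python
import math


def hoeffding_tail_bound(highs, gap):
    """Upper bound on P[S_1 + ... + S_m >= gap].

    Assumes independent S_i, each uniform on [0, highs[i]], so that
    E[S_i] = highs[i] / 2. Hoeffding's inequality gives
    P[S - E[S] >= t] <= exp(-2 t^2 / sum(high_i^2)) for t > 0.
    """
    mean = sum(h / 2.0 for h in highs)
    t = gap - mean
    if t <= 0:
        return 1.0          # the bound is only informative above the mean
    return math.exp(-2.0 * t * t / sum(h * h for h in highs))


# Three unseen lists whose last seen scores are 0.8, 0.7 and 0.6 (assumed).
bound = hoeffding_tail_bound([0.8, 0.7, 0.6], gap=1.9)
print(round(bound, 3))
```

The bound needs only the per-list ranges, so it is cheap to evaluate at query time, at the price of being looser than an exact convolution.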
24
Solving Convolutions? Difficult
When the PDF is a uniform distribution, solving the convolution becomes difficult
– Use techniques other than convolution
– Off-load computation to available probabilistic engines (OpenMaple, etc.)
25
Queue Management
[Figure not reproduced in this transcript.]
Source: http://www.mpi-inf.mpg.de/~mtb/pub/imprs-topk-xml_poster.pdf (author’s webpage)
27
Results
[Figure not reproduced in this transcript.]
Source: www.mpi-inf.mpg.de/~mtb/pub/imprs-topk.pdf (author’s webpage)
28
Performance as a function of ε
[Figure not reproduced in this transcript.]
Source: Paper
29
Precision of probabilistic predictors for tf*idf, Uniform-, and Zipf-distributed scores
[Figure not reproduced in this transcript.]
Source: Paper
31
Conclusion
New algorithms were developed based on probabilistic score predictions
– Trade off a small amount of top-k result quality for a drastic reduction of sorted accesses
Intelligent management of priority queues for efficient implementation was presented
The aggregation function was assumed to be summation
Future work is to be based on ranked retrieval of XML data and on integration into the XXL search engine
32
Thanks!