Published by Adele Shelton. Modified over 9 years ago.
1
Top-k Query Evaluation with Probabilistic Guarantees (VLDB ´04)
Martin Theobald, Gerhard Weikum, Ralf Schenkel
Max-Planck Institute for Computer Science, Saarbrücken, Germany
2
Outline
- Computational model
- Fagin’s NRA as baseline
- Probabilistic threshold test & score predictions
- Queuing strategies for probabilistic candidate pruning
- Probabilistic guarantees for top-k results
- Experiments
3
Related Work
- Thresholding algorithms for top-k queries
  - Fagin [99, 01, 03]: TA, NRA, CA
  - Balke, Güntzer & Kießling [00, 01]
- Variants for multimedia search & structured databases
  - Natsev et al. [01], Agrawal & Chaudhuri [03], Chaudhuri & Gravano [04]
- R-tree-like multidimensional indexes & range queries
  - Hjaltason & Samet [99, 03], Böhm et al. [01], Bruno & Gravano [02], Ciaccia & Patella [02], Amato et al. [03]
- Multidimensional nearest-neighbor queries
  - Donjerkovic & Ramakrishnan [99], Ciaccia & Patella [00], Tao, Faloutsos & Papadias [03], Singitham et al. [04]
4
Computational Model for Top-k Queries over an m-Dimensional Data Space
- Cartesian product space D_1 × … × D_m and a data set D ⊆ D_1 × … × D_m of m-dimensional data points, e.g., with all coordinates in [0,1]
- Monotone score aggregation function aggr: (D_1 × … × D_m) × (D_1 × … × D_m) → [0,1] or ℝ₀⁺
- Partial match queries
  - Weak dimensions can be compensated by stronger ones
- Possible score aggregation functions
  - min, max, sum, product (sum over log s_i), cosine (normalized vectors)
- Application examples
  - Multimedia data, text documents, web documents, structured records (e.g., product catalogs)
- Access model
  - Inverted index with long index lists; random I/Os are expensive, so accesses are mainly sorted
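To illustrate how a weak dimension can be compensated by a stronger one, here is a minimal Python sketch of two of the aggregation functions listed above. The function names and the sample score vectors are illustrative assumptions, not part of the talk.

```python
import math

def aggr_sum(scores):
    """Sum aggregation: monotone in every per-dimension score."""
    return sum(scores)

def aggr_product(scores):
    """Product aggregation, computed as a sum over log s_i (requires every s_i > 0)."""
    return math.exp(sum(math.log(s) for s in scores))

# Hypothetical data points: one strong plus one weak dimension vs. two mediocre ones.
strong_weak = [0.9, 0.1]
balanced = [0.4, 0.4]
```

Under `aggr_sum` the point with one strong dimension ranks higher, while under `min` the balanced point does, so the choice of aggregation function decides how much compensation is possible.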
5
Fagin’s NRA [PODS ´01] at a Glance
NRA(q, L):
  top-k := ∅; candidates := ∅; min-k := 0;
  scan all lists L_i (i = 1..m) in parallel:
    consider item d at position pos_i in L_i;
    E(d) := E(d) ∪ {i}; high_i := s_i(q_i, d);
    worstscore(d) := aggr{s_ν(q_ν, d) | ν ∈ E(d)};
    bestscore(d) := aggr{aggr{s_ν(q_ν, d) | ν ∈ E(d)}, aggr{high_ν | ν ∉ E(d)}};
    if worstscore(d) > min-k then
      remove argmin_d’ {worstscore(d’) | d’ ∈ top-k} from top-k;
      add d to top-k;
      min-k := min{worstscore(d’) | d’ ∈ top-k};
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d};
    threshold := max{bestscore(d’) | d’ ∈ candidates};
    if threshold ≤ min-k then exit;
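The NRA loop can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it assumes sum aggregation, non-negative scores, round-robin sorted access, and k ≥ 1, and it recomputes all bounds in every round instead of maintaining priority queues. It also includes the bound for still-unseen items (the sum of all `high` values) in the threshold, which is an assumption added here for a safe stopping test.

```python
from collections import defaultdict

def nra(index_lists, k):
    """index_lists: m lists of (doc_id, score) pairs, each sorted by descending score.
    Returns the top-k (doc_id, worstscore) pairs under sum aggregation."""
    m = len(index_lists)
    seen = defaultdict(dict)   # doc_id -> {list index i: score}, i.e. E(d) with scores
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]
    worst, top_k = {}, []
    depth = max((len(lst) for lst in index_lists), default=0)

    for pos in range(depth):
        # One sorted access per list (round-robin scan).
        for i, lst in enumerate(index_lists):
            if pos < len(lst):
                doc, score = lst[pos]
                seen[doc][i] = score
                high[i] = score
            else:
                high[i] = 0.0  # exhausted list contributes nothing more

        # worstscore: seen dimensions only; bestscore: add high_i for unseen ones.
        worst = {d: sum(s.values()) for d, s in seen.items()}
        best = {d: worst[d] + sum(high[i] for i in range(m) if i not in s)
                for d, s in seen.items()}

        ranked = sorted(worst, key=lambda d: (-worst[d], d))
        top_k = ranked[:k]
        if len(top_k) < k:
            continue
        min_k = worst[top_k[-1]]
        # Best possible score of any remaining candidate or unseen item.
        threshold = max([best[d] for d in ranked[k:]] + [sum(high)])
        if threshold <= min_k:
            break  # no other item can still enter the top-k

    return [(d, worst[d]) for d in top_k]
```

When the scan stops early, the returned scores are worstscores, so they are lower bounds on the final scores rather than exact values.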
6
Evolution of a Candidate’s Score
- Approximate top-k: “What is the probability that d qualifies for the top-k?”
- [Figure: score (y-axis) vs. scan depth (x-axis), showing bestscore(d), worstscore(d), and min-k; annotation: drop d from the candidate queue]
7
Safe Thresholding vs. Probabilistic Guarantees
- NRA is based on the invariant worstscore(d) ≤ score(d) ≤ bestscore(d)
- Relaxed into a probabilistic threshold test: drop d if the probability that d can still qualify for the top-k falls below ε
- Or equivalently, expressed with the gap δ(d) between worstscore(d) and min-k
- [Figure: worstscore(d), bestscore(d), min-k, and δ(d)]
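The formulas on this slide did not survive extraction. One plausible way to write the relaxed test is sketched below, assuming sum aggregation and treating the unseen score in each list i ∉ E(d) as a random variable S_i on [0, high_i]. The exact notation is a reconstruction, not a quote from the slide.

```latex
% Probabilistic threshold test (reconstruction, sum aggregation assumed)
\[
  \text{drop } d \quad\text{if}\quad
  P\Big[\,\mathrm{worstscore}(d) + \sum_{i \notin E(d)} S_i > \text{min-}k\,\Big] < \varepsilon
\]
% Equivalent form with the gap \delta(d)
\[
  \delta(d) := \text{min-}k - \mathrm{worstscore}(d),
  \qquad
  P\Big[\,\sum_{i \notin E(d)} S_i > \delta(d)\,\Big] < \varepsilon
\]
```

The safe NRA test is the special case in which every S_i is replaced by its upper bound high_i, so a candidate is dropped only when bestscore(d) ≤ min-k.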
8
Probabilistic Score Predictions
- Assuming feature independence
  - Uniform
  - Zipf
  - Poisson
  - Histograms
- With feature correlations
  - Multi-dimensional histograms
  - Moment-generating functions & Chernoff-Hoeffding bounds [Siegel `95]
9
Guarantees with Uniform Distributions
- Consider score intervals [0, high_i]
- Treat each s_i as a random variable S_i and predict the tail probability of the sum of the S_i
- For two random variables S_1 and S_2
  - Density functions f_1(x) = 1/high_1 and f_2(x) = 1/high_2
  - Consider the convolution f(x) = ∫ f_1(z) f_2(x − z) dz
  - But each factor is non-zero only for 0 ≤ z ≤ high_1 and 0 ≤ x − z ≤ high_2, which leads to an awkward number of case distinctions
- Instead, consider moment-generating functions (i > 1)
  - Of the form M_i(s) = E[e^(s·S_i)]
  - The convolution then becomes a product of moment-generating functions
  - Apply Chernoff-Hoeffding bounds on the tail probabilities
  - Ability to capture feature correlations or heterogeneous distributions
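A tail bound of this kind can be sketched as below. Note that this uses the plain Hoeffding inequality for independent bounded variables, which is a simpler stand-in for the moment-generating-function approach named on the slide. The function name and the argument `delta` (the gap a candidate still has to close) are assumptions for illustration.

```python
import math

def hoeffding_tail_bound(highs, delta):
    """Upper bound on P[S_1 + ... + S_n > delta] for independent S_i,
    each assumed uniform on [0, highs[i]].

    Hoeffding: P[S - E[S] >= t] <= exp(-2 t^2 / sum((b_i - a_i)^2)).
    """
    if delta >= sum(highs):
        return 0.0                       # the sum can never exceed its maximum
    mean = sum(h / 2 for h in highs)     # E[S_i] = high_i / 2 under the uniform assumption
    t = delta - mean
    if t <= 0:
        return 1.0                       # bound is vacuous at or below the mean
    return math.exp(-2 * t * t / sum(h * h for h in highs))
```

A candidate whose remaining dimensions have bounds `highs` would be dropped when `hoeffding_tail_bound(highs, delta) < eps`. Because the result is an upper bound on the true probability, pruning with it is conservative.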
10
Guarantees with Poisson Estimators
- Approximate tf·idf scores by a Poisson distribution truncated over [0, high_i]
  - Reasonably fits large web corpora, e.g., “.Gov”
- Discretize all s_i in index list L_i with n_i items
  - Let S_i be a discrete random variable with n_i equidistant values v_j = 1 − j·high_i/n_i
- Consider the convolution of the per-list distributions (i > 1)
- Consider conditional probabilities, approximated by the incomplete Gamma function
11
Guarantees with Histograms
- Capture arbitrary distributions in a compact histogram
- Consider n buckets (lb, ub] with lb[j] = j/n and ub[j] = (j+1)/n
- Store the frequency freq[i] and the cumulative frequency cum[i] of scores for each cell
- Consider convolution in O(mn²) time using a binary convolution operation
- Consider conditional probabilities by truncating the basic histograms
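The histogram-based prediction might be sketched as follows. This is an illustrative simplification: each histogram is a list of n bucket probabilities over (0, 1], every bucket is represented by its upper bound (so the tail probability is overestimated rather than underestimated), and all function names are assumptions.

```python
from functools import reduce

def convolve(hist_a, hist_b):
    """Binary convolution of two bucket-probability lists (O(n^2))."""
    out = [0.0] * (len(hist_a) + len(hist_b) - 1)
    for i, pa in enumerate(hist_a):
        for j, pb in enumerate(hist_b):
            out[i + j] += pa * pb
    return out

def truncate(hist, high):
    """Condition a histogram on scores <= high: zero out buckets whose
    lower bound j/n is not below high, then renormalize."""
    n = len(hist)
    kept = [p if j / n < high else 0.0 for j, p in enumerate(hist)]
    total = sum(kept)
    return [p / total for p in kept] if total > 0 else kept

def tail_probability(hists, delta):
    """Estimate P[sum of scores > delta] for m histograms with n buckets each.
    Index s of the convolved histogram stands for the value (s + m) / n,
    i.e. the sum of the upper bounds of the contributing buckets."""
    n, m = len(hists[0]), len(hists)
    conv = reduce(convolve, hists)
    return sum(p for s, p in enumerate(conv) if (s + m) / n > delta)
```

Folding m histograms with `reduce` performs m − 1 binary convolutions, and the intermediate result grows with each step, so the cost of this naive version is higher than the O(mn²) stated on the slide unless the intermediate histograms are re-bucketed.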
12
Queuing Strategies for Probabilistic Candidate Pruning
- Non-negligible prediction overhead
  - 2^m − 1 possible convolutions of remainder dimensions per query
  - Frequent predictor updates due to decreasing high_i
- Probabilistic threshold test embedded into NRA
  - Periodic candidate pruning (after r sorted accesses)
  - “Reusable” convolutions are temporarily cached
- Queuing plays a key role in query evaluation
  - Which candidates are tested?
  - How often is a candidate tested?
  - What actions are taken when a candidate fails the test?
13
Conservative Queuing
- Prob-conservative
  - 2^m − 1 queues per query
  - Group candidates by remainder sets {1..m} \ E(d)
  - Top candidate dominates all candidates within each queue
  - Test top candidate only
  - For all queues q: drop queue q if P[top(q) can qualify for top-k] < ε
- Prob-progressive
  - 1 queue per query
  - Merge all candidates by their bestscores
  - No dominating candidate in terms of probabilistic prediction
  - Test all candidates periodically
  - For all candidates d in q: drop candidate d if P[d can qualify for top-k] < ε
- Stop by the safe min-k threshold test or when all queues are empty
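The Prob-conservative pruning step could look roughly like this in Python. The data layout and the predictor callback `qualify_prob` are hypothetical: any of the score predictors from the earlier slides could be plugged in, as long as it returns the probability that a candidate with the given worstscore and remainder set can still exceed min-k.

```python
from collections import defaultdict

def prune_conservative(candidates, m, min_k, qualify_prob, eps):
    """candidates: dict doc_id -> (worstscore, set of seen dimensions E(d)).
    qualify_prob(worstscore, remainder, min_k) -> probability (hypothetical callback).
    Returns the candidates that survive one pruning round."""
    all_dims = frozenset(range(m))

    # One queue per remainder set {0..m-1} \ E(d).
    queues = defaultdict(list)
    for doc, (worst, seen) in candidates.items():
        queues[all_dims - frozenset(seen)].append((worst, doc))

    survivors = {}
    for remainder, queue in queues.items():
        # Within a queue all candidates share the same unseen dimensions,
        # so the one with the highest worstscore dominates the others.
        top_worst, _ = max(queue)
        if qualify_prob(top_worst, remainder, min_k) >= eps:
            for _, doc in queue:
                survivors[doc] = candidates[doc]
        # Otherwise the whole queue is dropped after testing only its top candidate.
    return survivors
```

Testing only the top of each queue keeps the number of predictor calls per round at most 2^m − 1, independent of how many candidates are queued.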
14
Aggressive Queuing
- Prob-smart
  - 1 bounded queue per query
  - Merge all candidates by their bestscores
  - No dominating candidate in terms of probabilistic prediction
  - Update & rebuild entire queue
  - Bound size by b top candidates after each rebuild
  - Stop heuristically if P[top(q) can qualify for top-k] < ε
- Prob-aggressive
  - No queue
  - Consider virtual candidate d_v with E(d_v) = ∅
  - d_v dominates all yet unseen candidates
  - Stop heuristically if P[d_v can qualify for top-k] < ε
15
Guarantees for Top-k Results
- For Prob-conservative and Prob-progressive
  - The probability p_miss of missing a true top-k object equals the probability of erroneously dropping a candidate from the queue
  - For each candidate, p_miss ≤ ε
  - P[recall = r/k] = P[precision = r/k]
  - E[precision] = E[recall]
- Lower expected values for Prob-smart and Prob-aggressive
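The right-hand sides of the formulas on this slide were lost in extraction. Under the simplifying assumption that each of the k true top-k objects is missed independently with probability exactly ε (the worst case allowed by p_miss ≤ ε), the number of retained objects is binomially distributed. The sketch below illustrates that model and is not a quote of the slide.

```python
from math import comb

def recall_distribution(k, eps):
    """P[recall = r/k] for r = 0..k, assuming independent misses with probability eps."""
    return [comb(k, r) * (1 - eps) ** r * eps ** (k - r) for r in range(k + 1)]

def expected_recall(k, eps):
    """E[recall] under the same binomial model; equals 1 - eps."""
    return sum(r / k * p for r, p in enumerate(recall_distribution(k, eps)))
```

Because the result list always contains k items, every missed true top-k object is replaced by exactly one false positive, which is why precision and recall coincide in this model.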
16
Experimental Setup
Data collections:
- Gov
  - TREC-12 Web Track’s .Gov collection
  - 1,250,000 web documents (html, doc, pdf)
  - 50 keyword queries from the Topic Distillation task, m ≤ 5, e.g., “legalization marijuana”
- XGov
  - Gov with manual query expansion, m ≤ 20, e.g., “legalization law marijuana cannabis drug abuse pot …”
- IMDB
  - Structured (XML) version of the Internet Movie Database
  - 375,000 movie files, 1,200,000 person files
  - Mixed text and categorical attributes: Genre, Actor, Description
  - 20 queries of the type: “Genre {Western} Actor {John Wayne, Katherine Hepburn} Description {Sheriff, Marshall}”
17
Evaluation Metrics
- Efficiency
  - Number of sorted accesses
  - Time: wall-clock elapsed time
  - Memory: peak level of working memory for all priority queues
- Quality
  - Precision & recall (macro-averaged)
  - Rank distance
  - Score error
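Macro-averaged precision can be computed as sketched here. The function names are illustrative. The definitions of rank distance and score error are not recoverable from the slide text, so they are left out.

```python
def precision_at_k(approx_top_k, exact_top_k):
    """Fraction of the exact top-k that the approximate top-k contains.
    Both lists hold k items, so this value is also the recall."""
    return len(set(approx_top_k) & set(exact_top_k)) / len(exact_top_k)

def macro_average(per_query_values):
    """Macro-average: compute the metric per query, then take the unweighted mean."""
    return sum(per_query_values) / len(per_query_values)
```

Macro-averaging gives every query the same weight regardless of how many index lists or candidates it involves.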
18
Baseline (k = 20, ε = 0.1)

Gov
Method      # sorted accesses  execution time (sec.)  max. queue size  macro-avg. precision  rank distance  score distance
NRA         2,263,652          148.7                  10,849           1.0                   0.0
Prob-con    993,414            25.6                   29,207           0.87                  16.9           0.007
Prob-pro    1,659,706          44.2                   6,551            0.87                  16.8           0.006
Prob-smart  527,980            15.9                   400              0.69                  39.5           0.031
Prob-agg    20,435             0.6                    0                0.42                  75.1           0.089

XGov
Method      # sorted accesses  execution time (sec.)  max. queue size  macro-avg. precision  rank distance  score distance
NRA         1,003,650          201.9                  12,628           1.0                   0.0
Prob-con    463,562            17.8                   14,990           0.71                  119.9          0.18
Prob-pro    490,041            69.0                   9,173            0.75                  122.5          0.14
Prob-smart  403,981            12.7                   400              0.54                  126.7          0.25
Prob-agg    41,821             0.7                    0                0.18                  171.5          0.39

IMDB
Method      # sorted accesses  execution time (sec.)  max. queue size  macro-avg. precision  rank distance  score distance
NRA         22,403,490         7,908                  70,896           1.0                   0.0
Prob-con    10,165,677         6,448                  51,893           0.90                  10.9           0.038
Prob-pro    20,006,283         1,791                  12,435           0.95                  9.3            0.031
Prob-smart  18,287,636         1,066                  400              0.88                  14.5           0.035
Prob-agg    133,745            2                      0                0.35                  80.7           0.182
19
Sensitivity Studies for Gov
[Figure: precision vs. sorted accesses for k = 20]
20
Impact of Probabilistic Predictions for Gov
[Figure panels: original scores (tf·idf weights); artificially generated score distributions (Uniform, Zipf)]
21
Conclusions & Ongoing Work
- “Top-k is a heuristic anyway, so why not do it probabilistically?”
- Major performance gains, by a factor of up to 8 at ~80% precision
- Semantic extensions for query-specific weights and efficient query expansion
- Applied at this year’s TREC benchmark
  - Robust Track: hard queries
  - Web Track: tf·idf and global PageRank scores
  - Terabyte Track: 480 GB of web documents
- Further extensions for XML support, incl. path similarities