Top-k Query Evaluation with Probabilistic Guarantees By Martin Theobald, Gerald Weikum, Ralf Schenkel
Content: Problem; Past algorithms; Contribution in this paper; Approach – Differences; Results, Observation and Conclusion
Relevance Searching: Interested in only one or a few relevant and novel data items/links. User may not care if some of the links are not that useful. Precision: the fraction of the returned top-k that is actually in the true top-k.
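The precision measure on this slide can be written down directly. The sketch below is only an illustration: the function name `precision_at_k` and the document ids are made up for the example and do not come from the paper.

```python
def precision_at_k(approx_top_k, true_top_k):
    """Fraction of the returned top-k items that also appear in the true top-k."""
    if not approx_top_k:
        return 0.0
    true_set = set(true_top_k)
    hits = sum(1 for item in approx_top_k if item in true_set)
    return hits / len(approx_top_k)


# Hypothetical result lists: the approximate run returned d7 instead of d3.
approx = ["d1", "d2", "d7", "d4"]
exact = ["d1", "d2", "d3", "d4"]
print(precision_at_k(approx, exact))  # 3 of 4 items are correct -> 0.75
```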
Algorithms we have learned … Fagin’s TA algorithm. TA-Random – Problem with TA-Random: random accesses are expensive. TA-Sorted – Problem with TA-Sorted: sorted indices may not always be available.
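As a reminder of how the random-access variant works, here is a minimal sketch of the threshold algorithm. It assumes summation as the aggregation function, in-memory lists sorted by descending score, and a score of 0 for a document missing from a list. The function name and the sample data are illustrative only.

```python
import heapq


def threshold_algorithm(lists, k):
    """Top-k by summed score. Each list holds (doc, score) pairs, sorted descending."""
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen = {}  # doc -> full aggregated score
    max_depth = max(len(lst) for lst in lists)
    for depth in range(max_depth):
        last_scores = []
        for lst in lists:
            if depth < len(lst):
                doc, score = lst[depth]  # sorted access
                last_scores.append(score)
                if doc not in seen:
                    # Random access: look the doc up in every list.
                    seen[doc] = sum(lk.get(doc, 0.0) for lk in lookup)
            else:
                last_scores.append(0.0)
        # No unseen doc can score higher than the sum of the last seen scores.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


list1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
list2 = [("b", 0.9), ("c", 0.7), ("a", 0.2), ("d", 0.1)]
result = threshold_algorithm([list1, list2], k=2)
print([doc for doc, _ in result])  # ['b', 'a']
```

Every newly seen document triggers one lookup per list, which is the random-access cost the slide refers to.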
Contribution: Probabilistic threshold test p(d). Looking at the currently seen part of the score: “What is the probability that the tuple can be in the final top-k?”
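One way to picture the test p(d): take the score the candidate has accumulated so far, and estimate the probability that its unseen scores close the gap to the current min-k threshold. The sketch below is not the paper's estimator. It assumes each unseen score is independent and uniform on [0, high_i], where high_i is the last score read from list i, and it uses plain Monte Carlo sampling. The names `prob_can_reach_top_k`, `should_prune` and `epsilon` are hypothetical.

```python
import random


def prob_can_reach_top_k(worst_score, min_k, highs, samples=10_000, seed=0):
    """Estimate P[worst_score + sum of unseen scores > min_k].

    Assumption: unseen score i is uniform on [0, highs[i]], independent of the others.
    """
    delta = min_k - worst_score  # gap the unseen scores must close
    if delta < 0:
        return 1.0  # seen part alone already exceeds min_k
    if delta >= sum(highs):
        return 0.0  # even the best case cannot exceed min_k
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        if sum(rng.uniform(0.0, h) for h in highs) > delta:
            hits += 1
    return hits / samples


def should_prune(worst_score, min_k, highs, epsilon=0.1):
    """Drop the candidate when its chance of reaching the top-k is below epsilon."""
    return prob_can_reach_top_k(worst_score, min_k, highs) < epsilon


# One unseen list with high = 1.0 and a gap of 0.5: about half the mass is above the gap.
p = prob_can_reach_top_k(worst_score=0.5, min_k=1.0, highs=[1.0])
print(round(p, 1))
```

A smaller `epsilon` prunes less and stays closer to the exact result. A larger one prunes more and gives up more precision.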
Approach: Probabilistic score prediction –Uniform distribution –Histograms –Poisson distributions: an approximation technique that is computationally cheaper than histograms
Histogram: Probability buckets and value ranges. ∑ Probability = 1
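To predict the sum of several unseen scores from per-list histograms, the bucket probabilities can be convolved. The sketch below assumes equal-width buckets starting at 0 and treats each bucket as a point mass at its midpoint, which is a simplification. The function names and the sample histograms are illustrative and not taken from the paper.

```python
def convolve(hist_a, hist_b):
    """Distribution of the sum of two independent bucket indices."""
    out = [0.0] * (len(hist_a) + len(hist_b) - 1)
    for i, pa in enumerate(hist_a):
        for j, pb in enumerate(hist_b):
            out[i + j] += pa * pb
    return out


def combine(hists):
    """Convolve all histograms into one distribution over summed bucket indices."""
    combined = [1.0]
    for h in hists:
        combined = convolve(combined, h)
    return combined


def tail_probability(hists, width, delta):
    """P[sum of scores > delta], with each bucket reduced to its midpoint."""
    combined = combine(hists)
    n = len(hists)
    # Summed index s corresponds to a summed midpoint of (s + n / 2) * width.
    return sum(p for s, p in enumerate(combined) if (s + n / 2) * width > delta)


# Two lists, each with buckets [0, 0.5) and [0.5, 1.0) of probability 0.5.
hists = [[0.5, 0.5], [0.5, 0.5]]
print(combine(hists))                             # [0.25, 0.5, 0.25]
print(tail_probability(hists, width=0.5, delta=0.75))  # 0.75
```

The combined probabilities still sum to 1, as for each input histogram.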
Algorithms: Conservative Algorithm; Aggressive Algorithm; Progressive Algorithm; Smart Algorithm
Conservative Algorithm: Simply predicts the scores of each candidate object in every step. Maintains a priority queue for each group of candidates with the same unseen part. Incurs very high overhead for the probabilistic threshold test.
Aggressive Algorithm: If the score of an object falls below the threshold min-k, the algorithm stops immediately. Minimal overhead, but result precision is low.
Progressive Algorithm: Between conservative and aggressive. Tracks the best score changes at uniform intervals. Maintains a single priority queue.
Smart Algorithm: Rebuilding the entire queue is also costly when the queue is large, as with big datasets. Maintains only a bounded priority queue; whenever it is rebuilt, only the best b elements are kept.
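The bounded rebuild can be sketched with Python's `heapq`. This is a sketch under assumptions: candidates are ranked by a single best-possible score, and the names `rebuild_bounded_queue` and `best_scores` are hypothetical.

```python
import heapq


def rebuild_bounded_queue(best_scores, b):
    """Rebuild the candidate queue, keeping only the b candidates with the highest best score.

    best_scores maps a candidate id to its best possible score.
    Scores are negated because heapq is a min-heap.
    """
    kept = heapq.nlargest(b, best_scores.items(), key=lambda kv: kv[1])
    queue = [(-score, doc) for doc, score in kept]
    heapq.heapify(queue)
    return queue


best_scores = {"d1": 1.4, "d2": 0.9, "d3": 1.8, "d4": 0.3}
queue = rebuild_bounded_queue(best_scores, b=2)
print(len(queue))               # 2
print(heapq.heappop(queue)[1])  # d3, the most promising candidate
```

Candidates outside the best b are dropped at rebuild time, so the queue size stays bounded at the risk of discarding an item that would have reached the top-k.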
Experiment
Conclusion: Probabilistic score predictions can be very beneficial for execution time, in exchange for some loss of top-k result quality.