Published by Anastasia Kelly. Modified over 9 years ago.
1
A Sampling-based Estimator for Top-k Selection Query
Chung-Min Chen, Yibei Ling
ICDE 2002
Presented by Kan Kin Fai
2
Outline
– Introduction
– Histogram-based Method
– Sampling-based Method
– Experimental Results
– Conclusion
3
Introduction
Given a distance function and a query point q, a top-k query finds the k points in the dataset that are closest to q.
Example: searching for an apartment by specifying a price and a location
4
Introduction
Goal: find a good approximation of the top-k points quickly
Approach: translate a top-k query into a range query
Distance Functions:
– Euclidean distance (L2-norm distance)
– Summation distance (L1-norm distance)
– Maximum distance (L∞-norm distance)
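The three distance functions can be written out directly. The following is a minimal sketch; the function names are chosen for illustration and do not come from the paper.

```python
import math

def l1_distance(p, q):
    """Summation (L1-norm) distance: sum of per-dimension absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2_distance(p, q):
    """Euclidean (L2-norm) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def linf_distance(p, q):
    """Maximum (L-infinity-norm) distance: largest per-dimension absolute difference."""
    return max(abs(a - b) for a, b in zip(p, q))
```

For p = (0, 0) and q = (3, 4) these give 7, 5.0 and 4 respectively.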
5
Histogram-based Method
Idea: use histograms to determine the range query for a top-k query with query point q
Drawbacks
– poor scalability of histograms with data dimensionality
– non-trivial maintenance overhead of multidimensional histograms
6
Histogram-based Method Strategies: NoRestart, Start, Inter1 and Inter2
7
Sampling-based Method
Main idea
– take a random sample S of size s from the dataset D of size n (sampling rate r = s / n).
– given a query point q, compute the distances between q and all the points in S; sort the sample points in ascending order of the computed distance.
– take the first l points from the sorted sequence, where l = k · r, and determine the range query from them.
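The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name, the fixed seed, the use of Euclidean distance, and rounding l = k · r up to an integer of at least 1 are all choices made here.

```python
import math
import random

def nearest_sample_points(data, q, k, rate, seed=0):
    """Return the l = ceil(k * rate) sample points closest to q.

    data: list of equal-length numeric tuples; rate: sampling rate r = s / n.
    Assumptions (not from the slides): Euclidean distance, l rounded up,
    and both s and l clamped to at least 1.
    """
    rng = random.Random(seed)
    s = max(1, round(rate * len(data)))           # sample size s = r * n
    sample = rng.sample(data, s)                  # random sample S without replacement
    sample.sort(key=lambda x: math.dist(x, q))    # ascending distance to q
    l = max(1, math.ceil(k * rate))               # expected share of the top k in S
    return sample[:l]
```

The returned points are what the strategies on the following slides turn into a range query.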
8
Sampling-based Method
Determining the range query
– the Minimum Bounding Rectangle (MBR)
– Sym: set the side length on the i'th dimension to 2δ_i, where δ_i = max |q_i − x_i| over all (x_1, …, x_m) among the l points.
– Squ: set the side length on the i'th dimension to 2δ, where δ = max(δ_i) for 1 ≤ i ≤ m.
– the Minimum Bounding Square on Shape (MBSS)
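The Sym and Squ rules can be made concrete with a short sketch. The function names are hypothetical, and a range query is represented here as a list of (low, high) intervals, one per dimension, centered at q.

```python
def half_widths(q, points):
    """delta_i = max |q_i - x_i| over the given points, for each dimension i."""
    return [max(abs(qi - x[i]) for x in points) for i, qi in enumerate(q)]

def sym_box(q, points):
    """Sym: side length 2 * delta_i on dimension i, centered at q."""
    return [(qi - d, qi + d) for qi, d in zip(q, half_widths(q, points))]

def squ_box(q, points):
    """Squ: side length 2 * delta on every dimension, with delta = max_i delta_i."""
    delta = max(half_widths(q, points))
    return [(qi - delta, qi + delta) for qi in q]
```

With q = (5, 5) and the points (4, 7) and (6, 5), the half-widths are [1, 2], so Sym gives [(4, 6), (3, 7)] and Squ gives [(3, 7), (3, 7)].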
9
Sampling-based Method
10
– Para
  – use L∞ to sort the sample points regardless of the distance function
  – take l = c · r · k + 1 points from the sorted sequence; c is the magnification factor (MF)
  – set the range query to be the smallest square centered at q that encloses the l points.
Pros: gives an accurate result size
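A rough sketch of the Para strategy follows. Two readings are assumed here rather than confirmed by the slide: the sort uses the maximum (L∞) distance, which matches the "smallest square centered at q", and the point count is taken as l = ceil(c · r · k) + 1, capped at the sample size. The function name is hypothetical.

```python
import math

def para_range(sample, q, k, rate, c):
    """Smallest square centered at q enclosing the l nearest sample points.

    Assumptions: L-infinity ordering; l = ceil(c * rate * k) + 1, capped at len(sample).
    Returns a list of (low, high) intervals, one per dimension.
    """
    def linf(x):
        return max(abs(a - b) for a, b in zip(x, q))

    ordered = sorted(sample, key=linf)
    l = min(len(ordered), math.ceil(c * rate * k) + 1)
    delta = linf(ordered[l - 1])      # half side length set by the l-th nearest point
    return [(qi - delta, qi + delta) for qi in q]
```

Under L∞ the half side length is simply the distance of the l-th nearest sample point, so a larger c widens the square and should raise recall at the cost of a larger result.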
11
Sampling-based Method Let Q(D) be the result of the range query Q and top(D,q,k) be the set containing the k closest points to q.
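The formulas built on Q(D) and top(D,q,k) did not survive in this transcript. Assuming the usual definitions, recall = |Q(D) ∩ top(D,q,k)| / k and precision = |Q(D) ∩ top(D,q,k)| / |Q(D)|, the two quality measures could be computed as sketched below. The definitions, the function names and the Euclidean distance are assumptions, not taken from the slide.

```python
import math

def top_k(data, q, k):
    """Exact top-k by Euclidean distance (ties broken by sort order)."""
    return sorted(data, key=lambda x: math.dist(x, q))[:k]

def range_query(data, box):
    """Points lying inside the axis-aligned box given as (low, high) per dimension."""
    return [x for x in data
            if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box))]

def recall_precision(data, q, k, box):
    """Assumed definitions: hits / |top(D,q,k)| and hits / |Q(D)|."""
    result = set(range_query(data, box))
    exact = set(top_k(data, q, k))
    hits = len(result & exact)
    recall = hits / len(exact)
    precision = hits / len(result) if result else 0.0
    return recall, precision
```

Points are expected to be distinct tuples, since both sets are built from them.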
12
Sampling-based Method
Deciding the magnification factor c for a given recall
– fixing k, plot a graph of recall vs. MF
– use linear interpolation to compute the needed magnification factor c from the graph
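The interpolation step might look like the sketch below. It assumes the recall-vs-MF graph is available as a list of (MF, recall) pairs sorted by MF with non-decreasing recall; the function name is hypothetical and the curve in the usage note is made up for illustration, not measured data.

```python
def interpolate_mf(curve, target_recall):
    """Linearly interpolate the magnification factor c that reaches target_recall.

    curve: list of (mf, recall) pairs, sorted by mf, recall non-decreasing.
    Raises ValueError if the target lies outside the measured recall range.
    """
    for (c0, r0), (c1, r1) in zip(curve, curve[1:]):
        if r0 <= target_recall <= r1:
            if r1 == r0:              # flat segment: smallest MF already suffices
                return c0
            return c0 + (target_recall - r0) * (c1 - c0) / (r1 - r0)
    raise ValueError("target recall outside the measured curve")
```

For an illustrative curve [(1.0, 0.5), (2.0, 0.75), (4.0, 1.0)] and a target recall of 0.625, the result is c = 1.5.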
13
Experimental Results
17
Conclusions
This paper presents a sampling-based method to process approximate top-k queries.
Experimental results show that
– the proposed method outperforms the histogram-based method;
– the mapping scheme scales well for high-dimensional data.
Easy to implement and maintain!