1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
2 Frequency Related Problems
Find all elements with frequency > 0.1%
Top-k most frequent elements
What is the frequency of element 3?
What is the total frequency of elements between 8 and 14?
Find elements that occupy 0.1% of the tail.
Mean + Variance? Median?
How many elements have non-zero frequency?
3 Types of Histograms...
Equi-Depth Histograms
  –Idea: Select buckets such that counts per bucket are equal
V-Optimal Histograms
  –Idea: Select buckets to minimize frequency variance within buckets
[Figures: count per bucket plotted against domain values, one plot per histogram type]
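To make the equi-depth idea concrete, here is a minimal sketch that sorts a small in-memory list and reports the largest value in each bucket. The function name `equi_depth_boundaries` and the convention that bucket i ends at rank ceil(i * n / num_buckets) are assumptions made for this illustration; the streaming algorithms later in the lecture avoid the full sort.

```python
def equi_depth_boundaries(values, num_buckets):
    """Return the upper boundary value of each equi-depth bucket.

    Assumes the data fits in memory. Bucket i (1-based) ends at
    rank ceil(i * n / num_buckets) in sorted order.
    """
    ordered = sorted(values)
    n = len(ordered)
    if num_buckets <= 0 or n == 0:
        raise ValueError("need data and at least one bucket")
    boundaries = []
    for i in range(1, num_buckets + 1):
        rank = -(-i * n // num_buckets)  # ceiling division, 1-based rank
        boundaries.append(ordered[rank - 1])
    return boundaries
```

Each boundary is an exact quantile here; replacing it with an approximate quantile gives an approximate equi-depth histogram.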
4 Histograms: Applications
One Dimensional Data
  –Database Query Optimization [Selinger78]: selectivity estimation
  –Parallel Sorting [DNS91] [NowSort97]: Jim Gray’s sorting benchmark
  –[PIH96] [Poo97] introduced a taxonomy, algorithms, etc.
Multidimensional Data
  –OLTP: not much use (independent attribute assumption)
  –OLAP & Mining: yes
5 Finding The Median...
Exact median in main memory: O(n) time [BFPRT 73]
Exact median in one pass: n/2 memory [Pohl 68]
Exact median in p passes: O(n^(1/p)) memory [MP 80]
  –2 passes: O(sqrt(n)) memory
How about an approximate median?
6 Approximate Medians & Quantiles
φ-Quantile: element with rank φN, 0 < φ < 1 (φ = 0.5 means Median)
ε-Approximate φ-quantile: any element with rank (φ ± ε)N, 0 < ε < 1
  –Typical ε = 0.01 (1%)
  –ε-approximate median: ε-approximate φ-quantile with φ = 0.5
Multiple equi-spaced ε-approximate quantiles = Equi-depth Histogram
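The definition can be checked directly on a small data set. The sketch below is an illustration only: the name `is_approx_quantile` is hypothetical, ranks are 1-based, and rounding the rank bounds up with `ceil` is an assumption of this example rather than something fixed by the slide.

```python
import bisect
import math

def is_approx_quantile(data, candidate, phi, eps):
    """Check whether `candidate` is an eps-approximate phi-quantile of `data`."""
    ordered = sorted(data)
    n = len(ordered)
    # 1-based ranks occupied by `candidate`; duplicates occupy a range of ranks.
    lo_rank = bisect.bisect_left(ordered, candidate) + 1
    hi_rank = bisect.bisect_right(ordered, candidate)
    if hi_rank < lo_rank:
        return False  # candidate does not occur in the data
    lo_allowed = math.ceil((phi - eps) * n)
    hi_allowed = math.ceil((phi + eps) * n)
    # Accept if any rank of the candidate falls in the allowed window.
    return lo_rank <= hi_allowed and hi_rank >= lo_allowed
```

For example, with N = 1000 and ε = 0.01, any element whose rank lies roughly between 490 and 510 counts as an approximate median.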
7 Plan for Today...
Greenwald-Khanna Algorithm for arbitrary length stream
Munro-Paterson Algorithm for fixed N
Sampling-based Algorithms for arbitrary length stream
Randomized Algorithm for fixed N
Randomized Algorithm for arbitrary length stream
Generalization
8 Data distribution assumptions... Input sequence of ranks is arbitrary. e.g., warehouse data
9 Munro-Paterson Algorithm [MP 80]
Input: N and ε
b buffers, each of size k; Memory = bk
How do we collapse two sorted buffers into one?
  –Merge
  –Pick alternate elements
Minimize bk subject to following constraints:
  –Number of elements in leaves = k 2^b > N
  –Max relative error in rank = b/2k < ε
b ≈ log(εN), k ≈ (1/ε) log(εN)
Memory = bk = O((1/ε) log^2(εN))
[Figure: Munro-Paterson [1980] tree of buffers, drawn for b = 4]
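The collapse step (merge, then pick alternate elements) and the resulting tree of buffers can be sketched as follows. This is a simplified illustration, not the exact procedure of [MP 80]: the names `collapse` and `approx_median` are hypothetical, the stream length is assumed to be exactly k * 2^b, and always keeping the second element of each pair is an arbitrary choice made here.

```python
import heapq

def collapse(buf_a, buf_b):
    """Merge two sorted buffers of size k and keep every other element."""
    merged = list(heapq.merge(buf_a, buf_b))
    return merged[1::2]  # which alternate elements to keep is an assumption

def approx_median(stream, k):
    """Approximate median using sorted buffers of size k.

    levels[i] holds at most one buffer whose elements each stand for
    2**i input elements, so buffers combine like carries in a binary counter.
    """
    levels = {}
    current = []
    for x in stream:
        current.append(x)
        if len(current) == k:
            buf, level = sorted(current), 0
            current = []
            while level in levels:  # two buffers at the same level: collapse
                buf = collapse(levels.pop(level), buf)
                level += 1
            levels[level] = buf
    if current or len(levels) != 1:
        raise ValueError("this sketch assumes the stream length is k * 2**b")
    (final,) = levels.values()
    return final[(k - 1) // 2]
```

At most one buffer per level is alive at any time, which is where the bk memory figure comes from.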
10 Error Propagation...
Top-down analysis
[Figure: buffers at depth d and depth d+1, each element marked S (smaller than the median), L (larger) or “?” (uncertain)]
x “?” elements in a buffer at depth d
Number of “?” elements in its children at depth d+1 <= 2x+1
11 Error Propagation at Depth 0...
[Figure: buffers at depth 0 and depth 1; elements are marked S, M (the median) or L, and there are no “?” elements]
12 Error Propagation at Depth 1...
[Figure: buffers at depth 1 and depth 2; a depth-1 buffer with no “?” elements yields one “?” element at depth 2]
13 Error Propagation at Depth 2...
[Figure: buffers at depth 2 and depth 3; a depth-2 buffer with one “?” element yields three “?” elements at depth 3]
14 Error Propagation...
[Figure: buffers at depth d and depth d+1, each element marked S, L or “?”]
x “?” elements in a buffer at depth d
Number of “?” elements in its children at depth d+1 <= 2x+1
15 Error Propagation level by level
Number of elements at depth d = k 2^d
Let sum of “?” elements at depth d be X
Then fraction of “?” elements at depth d: f = X / (k 2^d)
Sum of “?” elements at depth d+1 is at most 2X + 2^d
Then fraction of “?” elements at depth d+1: f’ <= (2X + 2^d) / (k 2^(d+1)) = f + 1/2k
Increase in fractional error in rank is 1/2k per level
Fractional error in rank at depth 0 is 0. Max depth = b
So, total fractional error is <= b/2k
Constraint 2: b/2k < ε
[Figure: Munro-Paterson [1980] tree with b = 4 buffers, each of size k (Memory = bk), with depth d = 2 marked]
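Given the two constraints (leaves must cover the input, and b/2k must stay within ε), the buffer parameters can be found by a small search. The sketch below is an illustration under assumptions: the name `choose_parameters` is hypothetical, and the strict inequalities on the slides are relaxed to k * 2^b >= N and b/2k <= ε.

```python
import math

def choose_parameters(n, eps):
    """Return (b, k) minimizing b * k subject to
    k * 2**b >= n  and  b / (2 * k) <= eps."""
    best = None
    max_b = max(2, math.ceil(math.log2(n)) + 1)
    for b in range(1, max_b):
        # Smallest k that satisfies both constraints for this b.
        k = max(math.ceil(n / 2 ** b), math.ceil(b / (2 * eps)))
        if best is None or b * k < best[0] * best[1]:
            best = (b, k)
    return best
```

For N in the millions and ε = 0.01, the resulting memory bk is a few thousand elements, far below N.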
16 Generalized Munro-Paterson [MRL 98]
Each buffer has a ‘weight’ associated with it.
How do we collapse buffers with different weights?
17 Generalized Collapse
[Figure: input buffers of Weight 2, Weight 3 and Weight 1 are collapsed into one output buffer of Weight 6]
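One way to collapse buffers of different weights, in the spirit of [MRL 98], is to treat each element as if it were repeated `weight` times, merge, and keep k equally spaced elements; the output weight is the sum of the input weights. The sketch below is hypothetical in its details: the name `weighted_collapse` and the fixed starting offset of (W + 1) // 2 are assumptions of this example.

```python
def weighted_collapse(buffers):
    """Collapse weighted buffers into one.

    `buffers` is a list of (weight, sorted list of k elements).
    Returns (total_weight, list of k elements).
    """
    k = len(buffers[0][1])
    total_weight = sum(w for w, _ in buffers)
    # Conceptually each element appears `weight` times in the merged sequence.
    tagged = sorted((x, w) for w, buf in buffers for x in buf)
    target = (total_weight + 1) // 2  # assumed offset of the first pick
    out, position = [], 0
    for x, w in tagged:
        position += w  # copies of x occupy positions (position - w, position]
        while len(out) < k and target <= position:
            out.append(x)
            target += total_weight  # picks are total_weight positions apart
    return total_weight, out
```

With two buffers of weight 1 this reduces to the merge-and-pick-alternate-elements collapse of the original algorithm.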
18 Analysis of Generalized Munro-Paterson
Munro-Paterson: Memory = O((1/ε) log^2(εN))
Generalized Munro-Paterson: Memory = O((1/ε) log^2(εN))
  –But smaller constant
19 Reservoir Sampling [Vitter 85]
Input: sequence of length N
Maintain a uniform sample of size s
Approximate median = median of sample
If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, answer is an ε-approximate median
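A minimal sketch of reservoir sampling followed by taking the sample median. The function names and the optional `rng` parameter are choices made for this example; the replacement rule is the standard one in which the i-th element (0-based) enters the sample with probability s / (i + 1).

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Maintain a uniform random sample of size s over a stream."""
    rng = rng or random.Random()
    sample = []
    for i, x in enumerate(stream):
        if i < s:
            sample.append(x)  # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # uniform in [0, i]
            if j < s:
                sample[j] = x  # replace a random slot
    return sample

def sample_median(sample):
    """Median of the sample (lower median for even sizes)."""
    ordered = sorted(sample)
    return ordered[(len(ordered) - 1) // 2]
```

The stream length does not need to be known in advance, which is the main attraction of this scheme.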
20 “Non-Reservoir” Sampling
Choose 1 out of every N/s successive elements
At end of stream, sample size is s
Approximate median = median of sample
If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, answer is an ε-approximate median
[Figure: input stream cut into blocks of N/s elements, with one element chosen per block]
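Choosing 1 out of every N/s successive elements can be sketched as below. The name `block_sample` is hypothetical, `block_size` stands for N/s (so N must be known in advance), and picking a uniformly random position inside each block is an assumption of this example.

```python
import random

def block_sample(stream, block_size, rng=None):
    """Keep one randomly chosen element from each block of
    `block_size` successive elements."""
    rng = rng or random.Random()
    sample = []
    pick = rng.randrange(block_size)  # position to keep in the current block
    for i, x in enumerate(stream):
        pos = i % block_size
        if pos == pick:
            sample.append(x)
        if pos == block_size - 1:
            pick = rng.randrange(block_size)  # new position for the next block
    return sample
```

Unlike a reservoir, the sample here never has elements replaced, so each chosen element can be handed straight to a downstream algorithm.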
21 Non-uniform Sampling...
s out of s elements: Weight = 1
s out of 2s elements: Weight = 2
s out of 4s elements: Weight = 4
s out of 8s elements: Weight = 8
At end of stream, sample size is O(s log(N/s))
Approximate median = weighted median of sample
If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, answer is an ε-approximate median
[Figure: input stream split into successive segments of s, 2s, 4s, 8s, ... elements]
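The doubling scheme can be sketched as follows: the first s elements are kept with weight 1, then s samples are drawn 1-in-2 with weight 2, then 1-in-4 with weight 4, and so on. This is an illustrative sketch, not the exact [MRL 99] procedure; the function names are hypothetical and a trailing partial block is simply ignored.

```python
import random

def nonuniform_sample(stream, s, rng=None):
    """Return a list of (element, weight) pairs."""
    rng = rng or random.Random()
    sample = []
    weight, taken_at_weight = 1, 0
    count, candidate = 0, None
    for x in stream:
        count += 1
        if rng.randrange(count) == 0:
            candidate = x  # size-one reservoir within the current block
        if count == weight:
            sample.append((candidate, weight))
            count = 0
            taken_at_weight += 1
            if taken_at_weight == s:
                # s samples taken at this rate: halve the sampling rate.
                weight, taken_at_weight = 2 * weight, 0
    return sample

def weighted_median(sample):
    """Smallest element whose cumulative weight reaches half the total."""
    total = sum(w for _, w in sample)
    running = 0
    for x, w in sorted(sample):
        running += w
        if 2 * running >= total:
            return x
    return None
```

Because the block size doubles after every s samples, a stream of length N produces O(s log(N/s)) samples without knowing N in advance.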
22 Sampling + Generalized Munro-Paterson [MRL 98]
Output is an ε-approximate median with probability at least 1-δ.
Reservoir Sampling
  –Input: stream of unknown length, ε and δ
  –Maintain s = O((1/ε^2) log(1/δ)) samples
  –Compute exact median of samples
  –Memory required: O((1/ε^2) log(1/δ))
“1-in-N/s” Sampling + Generalized Munro-Paterson
  –Input: stream of known length N, ε and δ (advance knowledge of N)
  –Choose s = O((1/ε^2) log(1/δ)) samples
  –Compute ε-approximate median of samples with Generalized Munro-Paterson
  –Memory required: O((1/ε) log^2(εs))
23 Unknown-N Algorithm [MRL 99]
Non-uniform Sampling + Modified Deterministic Algorithm For Approximate Medians
Input: stream of unknown length, ε and δ
Output is an ε-approximate median with probability at least 1-δ.
Memory required:
24 Non-uniform Sampling...
s out of s elements: Weight = 1
s out of 2s elements: Weight = 2
s out of 4s elements: Weight = 4
s out of 8s elements: Weight = 8
At end of stream, sample size is O(s log(N/s))
Approximate median = weighted median of sample
If s = O((1/ε^2) log(1/δ)), then with probability at least 1-δ, answer is an ε-approximate median
[Figure: input stream split into successive segments of s, 2s, 4s, 8s, ... elements, with the first s elements all kept at Weight = 1]
25 Modified Deterministic Algorithm...
b buffers, each of size k
Sample input:
  –2s elements with W = 1
  –s elements with W = 2
  –s elements with W = 4
  –s elements with W = 8
  –s elements with W = 2^(L-h)
h = height of tree, L = highest level
Compute approximate median of weighted samples.
[Figure: tree with a Height axis marked h, h+1, h+2, h+3, ..., L; the weighted samples feed in at successive levels]
26 Modified Munro-Paterson Algorithm
b buffers, each of size k
Weighted Samples:
  –2s elements with W = 1
  –s elements with W = 2
  –s elements with W = 4
  –s elements with W = 8
  –s elements with W = 2^(H-b)
b = height of tree, H = highest level
Compute approximate median of weighted samples.
[Figure: tree with a Height axis marked b, b+1, b+2, b+3, ..., H; the weighted samples feed in at successive levels]
27 Error Analysis...
b buffers, each of size k
Weighted Samples:
  –2s elements with W = 1
  –s elements with W = 2
  –s elements with W = 4
  –s elements with W = 8
  –s elements with W = 2^(H-b)
b = height of small tree, b+h = total height
Increase in fractional error in rank is 1/2k per level
Total fractional error <= (b+h)/2k
[Figure: tree with a Height axis marked b, b+1, b+2, b+3, ..., b+h]
28 Error Analysis contd...
Minimize bk subject to following constraints:
  –Number of elements in leaves = k 2^b > s where s =
  –Max fractional error in rank = b/k < (1- )
b = O(log(εs)), k = O((1/ε) log(εs))
Memory = bk = O((1/ε) log^2(εs))
Almost the same as before
29 Summary of Algorithms...
Reservoir Sampling [Vitter 85] –Probabilistic
Munro-Paterson [MP 80] –Deterministic (requires advance knowledge of n)
Generalized Munro-Paterson [MRL 98] –Deterministic (requires advance knowledge of n)
Sampling + Generalized MP [MRL98] –Probabilistic (requires advance knowledge of n)
Non-uniform Sampling + GMP [MRL 99] –Probabilistic
Greenwald & Khanna [GK 01] –Deterministic
30 V-OPT Histograms
Given: Vector V = {v_1, v_2, …, v_N}
  –Frequency count vector
Goal: Partition into k contiguous buckets {(s_1 = 1, e_1), (s_2 = e_1 + 1, e_2), …, (s_i = e_(i-1) + 1, e_i), …, (s_k, e_k = N)} such that Err = Σ_i err_i is minimized
err_i (error for the ith bucket) = Σ_(j = s_i to e_i) (v_j – μ_i)^2
  –μ_i = (Σ_(j = s_i to e_i) v_j) / (e_i – s_i + 1)
  –Minimize sum of intra-bucket variance
  –Good for point queries (represent each bucket with its mean)
Observe: err_i = Σ_(j = s_i to e_i) v_j^2 – (e_i – s_i + 1) μ_i^2
31 Dynamic Programming
Dynamic Programming table:
  –T(i,j) = Error for OPT partition with j buckets for V[1…i]
  –‘i’ ranges from 1 to N
  –‘j’ ranges from 1 to k
T(i,j+1) = min over m < i of [T(m,j) + err(m+1, i)]
  –err(m+1, i): error in bucket with s = m+1 and e = i
  –Check for all m < i
  –T(i,1) = err(1, i), the sum of squared deviations of the first i values from their mean
Gives an O(N^2 k) time algorithm that uses O(Nk) space
  –Provided: Given indices s, e can calculate err(s,e) in O(1) time
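32 Dynamic Programming (contd.)
Define: S(j) = Σ_(i = 1 to j) v_i
  –Prefix Sum vector
  –j ranges from 1 to N.
Define: SS(j) = Σ_(i = 1 to j) v_i^2
  –Prefix “Sum squares” vector
  –j ranges from 1 to N.
With S(0) = SS(0) = 0: err(s,e) = SS(e) – SS(s-1) – (S(e) – S(s-1))^2 / (e – s + 1), computed in O(1) time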
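The dynamic program and the prefix-sum trick fit together as in the sketch below. It is an illustration under assumptions: the name `v_optimal_histogram` and the back-pointer table used to recover the buckets are choices made here, and indices follow the 1-based convention of the slides.

```python
def v_optimal_histogram(v, k):
    """Return (min_error, buckets), with buckets as 1-based (start, end) pairs."""
    n = len(v)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(v)")
    # Prefix sums S and prefix sums of squares SS, with S(0) = SS(0) = 0.
    s = [0.0] * (n + 1)
    ss = [0.0] * (n + 1)
    for j, x in enumerate(v, start=1):
        s[j] = s[j - 1] + x
        ss[j] = ss[j - 1] + x * x

    def err(a, b):
        """Squared error of bucket [a, b] around its mean, in O(1) time."""
        total = s[b] - s[a - 1]
        return ss[b] - ss[a - 1] - total * total / (b - a + 1)

    inf = float("inf")
    # t[j][i]: minimum error for v[1..i] using j buckets.
    t = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(1, n + 1):
        t[1][i] = err(1, i)
    for j in range(2, k + 1):
        for i in range(j, n + 1):
            for m in range(j - 1, i):  # last bucket is [m + 1, i]
                cand = t[j - 1][m] + err(m + 1, i)
                if cand < t[j][i]:
                    t[j][i], back[j][i] = cand, m
    # Walk the back pointers to recover the bucket boundaries.
    buckets, i = [], n
    for j in range(k, 0, -1):
        m = back[j][i]
        buckets.append((m + 1, i))
        i = m
    buckets.reverse()
    return t[k][n], buckets
```

The three nested loops give the O(N^2 k) running time, and the two tables account for the O(Nk) space.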
33 List of papers...
[Hoeffding63] W Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables”, Amer. Stat. Journal, p 13-30, 1963.
[MP80] J I Munro and M S Paterson, “Selection and Sorting in Limited Storage”, Theoretical Computer Science, 12, 1980.
[Vit85] J S Vitter, “Random Sampling with a Reservoir”, ACM Trans. on Math. Software, 11(1):37-57, 1985.
[MRL98] G S Manku, S Rajagopalan and B G Lindsay, “Approximate Medians and other Quantiles in One Pass and with Limited Memory”, ACM SIGMOD 98, 1998.
[MRL99] G S Manku, S Rajagopalan and B G Lindsay, “Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets”, ACM SIGMOD 99, 1999.
[GK01] M Greenwald and S Khanna, “Space-Efficient Online Computation of Quantile Summaries”, ACM SIGMOD 2001, p 58-66, 2001.