Approximate Counts and Quantiles over Sliding Windows
Arvind Arasu, Gurmeet Singh Manku
Stanford University
PODS, June 16, 2004

Sliding Window Model
[Figure: a window sliding along the time axis of a stream; the example aggregate over the window changes from SUM = 66 to SUM = 59 as the window advances.]

Statistics over Sliding Windows
- Easy if we store the entire window
- Storing the entire window is expensive
  - Space: "last 1 hour" at 1000 elements/sec
- Focus of much previous work: compute approximate statistics using limited space
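
As a baseline for the "store the entire window" option, here is a minimal Python sketch of an exact sliding-window SUM. The class name `ExactWindowSum` and the sample values are illustrative assumptions, not taken from the talk. The last line works out the window size for the "last 1 hour" example.

```python
from collections import deque


class ExactWindowSum:
    """Exact SUM over the last `size` elements, storing the whole window."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0

    def add(self, x):
        """Append x, evict the oldest element if needed, return the SUM."""
        self.window.append(x)
        self.total += x
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total


w = ExactWindowSum(3)
sums = [w.add(x) for x in [1, 2, 3, 4, 5]]  # SUM over the last 3 elements

# "Last 1 hour" at 1000 elements/sec: number of elements held in memory.
elements_in_window = 60 * 60 * 1000
```

The memory used grows linearly with the window size, which is the cost the approximate algorithms in the talk aim to avoid.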

Contributions
- Algorithms for computing approximate quantiles and approximate frequency counts over sliding windows
- Space requirement: (1/є) · poly-log(1/є, N)
  - є = error parameter
  - N = size of the window
  - Logarithmic in window size (N)
  - (Almost) linear in 1/є

Contributions over Previous Work
- Frequency counts: first known algorithm for the sliding window model
- Quantiles: improves over [LLXY ’04]
  - [LLXY ’04] space: (1/є²) · poly-log(1/є, N)
  - Quadratic in 1/є

Rest of the Talk
- Formal problem specification
  - Sliding windows
  - (Approximate) frequency counts
- Our algorithms
  - Fixed-size sliding windows
  - Variable-size sliding windows
- Frequency counts only; for quantiles, see the paper

Sliding Windows
- Two abstract window models
  - Fixed-size sliding windows
    - Row-based windows
  - Variable-size sliding windows
    - Time-based windows, shared windows

Fixed-Size Sliding Windows
[Figure: animation along the time axis; the window size stays at N = 5 while the window advances over the stream.]

Variable-Size Sliding Windows
[Figure: animation along the time axis; the window size N changes from frame to frame: 5, 6, 7, 6, 5, 4, 3.]

Frequency Counts
    Select Element, Count(*) From Multiset Group by Element
[Table: columns Element and Count; rows not recoverable.]
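
For readers who prefer code to SQL, the same exact frequency count can be sketched in Python. The function name `frequency_counts` and the sample multiset are illustrative assumptions.

```python
from collections import Counter


def frequency_counts(multiset):
    """Exact count of every distinct element (GROUP BY Element, COUNT(*))."""
    return dict(Counter(multiset))


counts = frequency_counts(["a", "b", "a", "c", "a", "b"])
```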

Approximate Frequency Counts
- Elements and their approximate counts
- Approximate Count: True Count − єM < Approximate Count ≤ True Count
  - Error parameter: є
  - Size of input: M
- Only elements with Approximate Count > 0
- References: [MG ’82, DLM ’02, MM ’02, KSP ’03]

Approximate Frequency Counts
- Input size: M = 20
- Error parameter: є = 0.25
- Absolute error: єM = 5
[Table: columns Element, True Count, Approx. Count, Error; rows not recoverable.]
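
The definition can be checked mechanically. The sketch below tests whether a candidate output is a valid є-approximate frequency count. The function name and the sample counts are hypothetical (the values of the original table did not survive); only M = 20 and є = 0.25 come from the slide.

```python
def is_eps_approximate(true_counts, approx_counts, eps):
    """Check: True - eps*M < Approx <= True for every element of the input,
    and only input elements with a positive approximate count are listed."""
    M = sum(true_counts.values())
    for element, true in true_counts.items():
        approx = approx_counts.get(element, 0)  # missing element counts as 0
        if not (true - eps * M < approx <= true):
            return False
    return all(
        count > 0 and element in true_counts
        for element, count in approx_counts.items()
    )


# Hypothetical input with M = 20 and eps = 0.25, so the absolute error is 5.
true_counts = {"a": 9, "b": 6, "c": 3, "d": 2}
valid = is_eps_approximate(true_counts, {"a": 5, "b": 2}, 0.25)
invalid = is_eps_approximate(true_counts, {"a": 4}, 0.25)
```

In the second call the output is rejected because "a" is undercounted by exactly єM = 5 (the bound is strict) and "b", whose frequency exceeds єM, is missing.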

Approximate Frequency Counts
- All elements with frequency ≥ єM appear in the output.
- There exists an output with ≤ 1/є elements.
- Theorem: An approximate frequency count of size O(1/є) can be produced in one pass over the input using O(1/є) space.
- References: [MG ’82, DLM ’02, KSP ’03]
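
One well-known way to obtain such a one-pass summary is the counter-based scheme of [MG ’82]. The following is a minimal sketch of that scheme, not necessarily the exact variant the talk has in mind; the function name, the choice of floor(1/є) counters, and the sample stream are assumptions made for illustration.

```python
import math


def misra_gries(stream, eps):
    """One-pass approximate frequency counts with at most floor(1/eps) counters.

    Each returned count f~ satisfies f - eps*M < f~ <= f, where f is the
    true count and M is the length of the stream.
    """
    m = math.floor(1 / eps)  # number of counters kept
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:
            counters[x] = 1
        else:
            # No free counter: decrement all, drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream with M = 20 elements and eps = 0.25.
stream = ["a"] * 9 + ["b"] * 6 + ["c"] * 3 + ["d", "e"]
approx = misra_gries(stream, 0.25)
```

Each decrement step discards m + 1 occurrences at once, so there are at most M / (m + 1) such steps, and m + 1 > 1/є keeps every undercount strictly below єM.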

Fixed-Size Sliding Windows
- Window size: N
- Error parameter: є
- Absolute error: єN

Overview
[Figure: animation over a window of size N; further details are not recoverable from the extracted text.]
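
Details
[Figure: the window of size N covered by blocks arranged in levels; legible labels: єN/4, log(1/є), the per-level error parameters є_0, є_1, є_2, and the annotation "= O(єN)".]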

Error Invariant
- Absolute error of all blocks identical: є_i N_i = єN / log(1/є)
  - є_i = error parameter for block
  - N_i = number of elements in block

Merge Operation
- Add approximate counts of elements. Absolute error adds up.
- Notation: f = true count, f̃ = approx. count
  Block 1: f_1 − є_1 N_1 < f̃_1 ≤ f_1
  Block 2: f_2 − є_2 N_2 < f̃_2 ≤ f_2
  Block 1 + Block 2: (f_1 + f_2) − (є_1 N_1 + є_2 N_2) < f̃_1 + f̃_2 ≤ (f_1 + f_2)
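
The merge step can be sketched in a few lines of Python. This only shows the "add approximate counts" part stated on the slide; the function name and the two sample blocks are hypothetical, and any further pruning of the merged summary is left out.

```python
def merge_counts(approx1, approx2):
    """Merge two approximate frequency counts by adding counts per element."""
    merged = dict(approx1)
    for element, count in approx2.items():
        merged[element] = merged.get(element, 0) + count
    return merged


# Hypothetical blocks, each with N_i = 8 and eps_i = 0.25 (absolute error 2).
# Block 1 true counts: a=5, b=3.  Block 2 true counts: a=2, c=6.
block1 = {"a": 4, "b": 2}
block2 = {"a": 1, "c": 5}
merged = merge_counts(block1, block2)

# Combined true counts and the combined error budget eps_1*N_1 + eps_2*N_2.
true_total = {"a": 7, "b": 3, "c": 6}
error_budget = 0.25 * 8 + 0.25 * 8
```

Every merged count stays within the summed error budget of 4, for example 7 − 4 < 5 ≤ 7 for element "a".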
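
Error Analysis
[Figure: the error over the window of size N written as a sum of terms; legible fragments include O(єN), єN, and log(1/є). The exact expression is not recoverable from the extracted text.]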

Space Requirement
[Figure: the block structure over the window of size N, with labels єN/4, log(1/є), and є_0, є_1, є_2.]
- Recall: an approximate frequency count with error parameter є has size O(1/є) [MG ’82, DLM ’02, KSP ’03].
- Space required for level-ℓ blocks:
  (size of approx. count) × (number of "active" blocks) = (1/є_ℓ) × (N/N_ℓ) = N / (єN / log(1/є)) = (1/є) log(1/є)
- Total space: (1/є) log(1/є) × log(1/є) = (1/є) log²(1/є)
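
The space calculation can be checked numerically. The sketch below assumes base-2 logarithms and block sizes that start at єN/4 and double with each level; both are assumptions read off the figure labels rather than statements from the slide, and the values є = 1/16 and N = 1024 are arbitrary.

```python
import math


def space_per_level(eps, N, block_size):
    """(1/eps_l) * (N/N_l), with eps_l fixed by the error invariant
    eps_l * N_l = eps*N / log(1/eps)."""
    levels = math.log2(1 / eps)
    eps_l = (eps * N / levels) / block_size
    return (1 / eps_l) * (N / block_size)


eps, N = 1 / 16, 1024
levels = int(math.log2(1 / eps))  # log(1/eps) levels

# Assumed block sizes per level: 16, 32, 64, 128.
sizes = [int(eps * N / 4) * 2 ** level for level in range(levels)]

per_level = [space_per_level(eps, N, n_l) for n_l in sizes]
total = sum(per_level)
expected_total = (1 / eps) * levels ** 2  # (1/eps) * log^2(1/eps)
```

Every level costs the same (1/є) log(1/є) because larger blocks are fewer but carry proportionally larger summaries.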

Fixed-Size Sliding Windows: Summary
- Theorem: є-approximate frequency counts can be maintained over a fixed-size sliding window of size N using O((1/є) log²(1/є)) space.

Variable-Size Windows
- Error parameter: є
- Variable window size: n
- Variable absolute error: єn

Fixed-Size Window Algorithm?
[Figure: the fixed-size block structure over a window of size N, with labels єN/4, log(1/є), and є_0, є_1, є_2.]
- Using F(є, N) on a window of size n: absolute error єN, so error parameter = єN / n

Limited Variability
- F(є/2, N) computes є-approximate frequency counts for window sizes n with N/2 ≤ n ≤ N.
- The absolute error (є/2) · N of F(є/2, N) is at most єn whenever n ≥ N/2.

Variable-Size Windows
- log(єn) instances: F(є/2, N), F(є/2, N/2), …, F(є/2, 2/є)
- N = 2^p ≥ n > N/2
[Figure: the instances drawn along the time axis against the current window of size n.]

Variable-Size Windows
[Figure: animation along the time axis with current window size n. The first frames show F(є/2, N), F(є/2, N/2), and F(є/2, 2/є); the middle frames show only F(є/2, N/2) and F(є/2, 2/є); in the last frame F(є/2, N) appears again.]
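
The ladder of instances F(є/2, N), F(є/2, N/2), …, F(є/2, 2/є) can be sketched as follows. The function name is hypothetical, 1/є is assumed to be a power of two so that all sizes are integers, and the reading that the top instance is dropped when the window shrinks and re-added when it grows is an interpretation of the animation, not a quoted statement.

```python
def instance_sizes(eps_inv, n):
    """Window sizes N, N/2, ..., 2/eps of the instances kept for window size n.

    eps_inv is 1/eps (assumed to be a power of two). N is the smallest size
    of the form (2/eps) * 2^k with N >= n.
    """
    smallest = 2 * eps_inv
    N = smallest
    while N < n:
        N *= 2
    sizes = []
    size = N
    while size >= smallest:
        sizes.append(size)
        size //= 2
    return sizes


# Hypothetical values: eps = 1/8 and current window size n = 100.
eps_inv, n = 8, 100
sizes = instance_sizes(eps_inv, n)
top = sizes[0]

# The top instance F(eps/2, N) has absolute error (eps/2)*N, within eps*n.
top_error = top / (2 * eps_inv)
allowed_error = n / eps_inv
```

With these values the ladder has log2(єN) = 4 instances, and a window that shrinks to n = 60 needs only the three smallest ones.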

Variable-Size Windows: Summary
- Theorem: є-approximate frequency counts can be maintained over variable-size windows using O((1/є) log²(1/є) log(єn)) space, where n is the current size of the sliding window.

See Paper for …
- Randomized algorithms for frequency counts
- Deterministic and randomized algorithms for quantiles
- A general technique for variable-size window algorithms
  - Converts fixed-size window algorithms to variable-size window algorithms
  - Works for Sum, Bit-Count

References used in Talk
[DLM ’02] E. D. Demaine, A. Lopez-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. ESA 2002.
[KSP ’03] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. TODS 2003.
[LLXY ’04] X. Lin, H. Lu, J. Xu, and Y. X. Yu. Continuously maintaining quantile summaries of the most recent N elements over a data stream. ICDE 2004.
[MG ’82] J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Programming 1982.
[MM ’02] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. VLDB 2002.