Download presentation
Presentation is loading. Please wait.
1
Data-Streams and Histograms
Sudipto Guha, Nick Koudas & Kyuseok Shim
2
Background Histogram EquiWidth, EquiDepth, MHIST, MaxDiff, V-OPT
Captures distribution statistics in an efficient manner Applications Query optimization Approximate query answering Data mining (time series in particular) Piecewise transmission of data EquiWidth, EquiDepth, MHIST, MaxDiff, V-OPT
3
Background Data Stream
An ordered sequence of points that can be read only once or a small number of times Applications Mission critical network components Dynamic traffic configuration, fault identification, troubleshooting Performance of algorithm measured by number of passes algorithm must make over the stream
4
Motivation Since the end use of a histogram is to approximate a data distribution, why not use a near-optimal approximation of the best histogram if it means linear time computation?
5
Motivation Approximate V-OPT histograms by improving the dynamic programming solution from quadratic to linear time Revised algorithm uses little space, hence suitable for data stream model Assumes cost of interval is monotonic under inclusion
6
Problem Statement Given: Constraint: Dynamic Programming solution:
non-negative integers v1, ..., vn k intervals or buckets to partition the index 1..n Constraint: Minimize k VARk where is the variance of values in the kth bucket Dynamic Programming solution: OPT[k, n] = min {OPT[k-1, x] + VAR[(x+1)..n]} Runs in O(n2k) time with O(n) space x<n
7
Intuition of Improvement
For a x b, VAR[a..n] VAR[x..n] VAR [b..n] (1) OPT[a..n] OPT[x..n] OPT[b..n] (2) Use this monotonicity property to reduce the search space by settling for an approximation Instead of storing the whole OPT function, approximate it by a histogram!
8
Intuition of Improvement
For all 1 p k, maintain intervals (a1,b1),…, (al, bl) Value of bi (1+)ai The number of intervals l depends on p The value for each interval substitutes for each value in the interval reducing space and time complexity
9
Results Theorem: A (1+) approximation for V-OPT runs in O((k2/)log n) space and time O((nk2/)log n) in the data stream model
10
Advantages and Disadvantages
Accuracy/runtime tradeoff can be controlled by the parameter For data-stream model, alternatives abound: Random sampling (simple, assumption of distribution) Other histogram techniques (faster, less optimal) Wavelet (flexibility) Sliding Windows (later paper)
11
Improvements
12
Conclusion The authors provided an algorithm for approximating a distribution that runs reasonably fast and with small space requirements Proposed solution can be applied to data-stream model because values are not referred to unless they are stored
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.