Published by Duane Dawson. Modified over 9 years ago.
1
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan
2
Review of Data Streams Motivation: huge data stream that needs to be mined for info “efficiently.” Applications: monitoring IP traffic, mining email and text message streams, etc.
3
The Mathematical Model Sequence of integers $A = a_1, \ldots, a_m$, where each $a_i \in N = \{1, \ldots, n\}$. For each $v \in N$, the frequency $m_v$ of $v$ is the number of occurrences of $v$ in $A$. Statistics to be estimated are functions on $A$, but usually just on the $m_v$'s (e.g. frequency moments).
4
What is Entropy? In physics: measure of disorder in a system. In math: measure of randomness (or uniformity) of a probability distribution. Formula: $H = -\sum_{v} \Pr[v] \log \Pr[v]$.
5
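Entropy on Data Streams For big $m$, $m_v/m \to \Pr[v]$. So formula becomes: $H = -\sum_{v} \frac{m_v}{m} \log \frac{m_v}{m} = \log m - \frac{1}{m} \sum_{v} m_v \log m_v$. Suffices to compute $m$ (easy) and $\mu = \sum_{v} m_v \log m_v$.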
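As a sanity check on this reduction, the sketch below computes the empirical entropy of a small stream both directly and through the sum of $m_v \log m_v$ terms. It stores all frequencies, so it is not a streaming algorithm. The function names and the choice of base-2 logarithms are assumptions made for illustration; the slides do not fix a log base.

```python
import math
from collections import Counter


def entropy_direct(stream):
    """Empirical entropy -sum (m_v/m) log2(m_v/m), using exact counts."""
    m = len(stream)
    counts = Counter(stream)
    return -sum((c / m) * math.log2(c / m) for c in counts.values())


def entropy_via_mu(stream):
    """Same quantity, rewritten as log2(m) - mu/m with mu = sum m_v log2(m_v)."""
    m = len(stream)
    counts = Counter(stream)
    mu = sum(c * math.log2(c) for c in counts.values())
    return math.log2(m) - mu / m


if __name__ == "__main__":
    demo = [1, 1, 2, 3]
    print(entropy_direct(demo), entropy_via_mu(demo))
```

Both functions agree up to floating-point error, which is why an estimate of the sum alone is enough once $m$ is known.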
6
The Goal Approximation algorithm to estimate $\mu$. Approximate means to output a number $Y$ such that $\Pr[|Y - \mu| > \lambda\mu] \le \varepsilon$, for any user-specified $\lambda, \varepsilon > 0$. Restrictions: $o(n)$, preferably $\tilde{O}(1)$, space, and only 1 pass over the data.
7
The Algorithm We want $Y$ to have $E[Y] = \mu$ and very small variance, so find a computable random variable $X$ with $E[X] = \mu$ and small variance, and compute it several times. $Y$ is the median of $s_2$ RVs $Y_i$, each of which is the mean of $s_1$ RVs $X_{ij}$, independent and identically computed copies of $X$.
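The median-of-means combination can be sketched as follows. The helper name `median_of_means` and the zero-argument callable `draw_x` are hypothetical; `draw_x` stands in for whatever procedure produces one independent copy of X. In a real one-pass setting all copies would be maintained in parallel during the single scan, which this sketch does not model.

```python
import statistics


def median_of_means(draw_x, s1, s2):
    """Return the median of s2 group means, each a mean of s1 draws of X.

    draw_x is an assumed zero-argument callable returning one sample of X.
    """
    group_means = []
    for _ in range(s2):
        samples = [draw_x() for _ in range(s1)]
        group_means.append(sum(samples) / s1)
    return statistics.median(group_means)
```

Averaging within a group reduces the variance, and taking the median across groups makes a large deviation of the final output unlikely.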
8
Computing X Choose $p \in \{1, \ldots, m\}$ uniformly at random. Let $r = \#\{q \ge p \mid a_q = a_p\}$ (so $r \ge 1$). $X = m[r \log r - (r - 1) \log (r - 1)]$, with the convention $0 \log 0 = 0$ when $r = 1$.
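A minimal sketch of one copy of X, under several assumptions: the stream length m is known before the pass, r counts the occurrences of the sampled item at positions q ≥ p, logarithms are base 2, and 0 log 0 is taken as 0. The names `x_at_position` and `sample_x` are illustrative, not from the slides.

```python
import math
import random


def x_at_position(stream, p):
    """Compute X for a fixed 1-indexed position p in a single pass."""
    m = len(stream)
    token = None
    r = 0
    for q, a in enumerate(stream, start=1):
        if q == p:
            token = a      # remember the sampled item a_p
            r = 1          # a_p itself counts once
        elif q > p and a == token:
            r += 1         # later occurrence of the same item
    # Convention: (r - 1) * log(r - 1) is 0 when r == 1.
    tail = (r - 1) * math.log2(r - 1) if r > 1 else 0.0
    return m * (r * math.log2(r) - tail)


def sample_x(stream, rng=random):
    """Draw p uniformly from {1, ..., m} and return the corresponding X."""
    p = rng.randrange(1, len(stream) + 1)
    return x_at_position(stream, p)
```

Only the sampled item and the counter r need to be stored during the pass. When m is not known in advance, reservoir sampling is one standard way to pick p, at the cost of resetting r whenever the sample is replaced.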
9
The Analysis Easy: $E[Y] = E[X] = \mu$. Hard: $\mathrm{Var}[Y]$ is very small. Turns out $s_1 = O(\log n)$, $s_2 = O(1)$ works. Each $X$ maintained in $O(\log n + \log m)$ space. Total: $O(s_1 s_2 (\log n + \log m))$, which is $O(\log n \log m)$ when $\log n = O(\log m)$.
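The "easy" half can be worked out as a telescoping sum. The derivation below assumes that r counts the occurrences of the sampled item from position p onward, that μ denotes the sum of the $m_v \log m_v$ terms, and that $0 \log 0 = 0$; the notation $r_p$ for the value of r at position p is introduced here for illustration.

```latex
\begin{align*}
E[X] &= \frac{1}{m} \sum_{p=1}^{m} m \bigl[ r_p \log r_p - (r_p - 1) \log (r_p - 1) \bigr] \\
     &= \sum_{v \,:\, m_v \ge 1} \; \sum_{j=1}^{m_v} \bigl[ j \log j - (j - 1) \log (j - 1) \bigr] \\
     &= \sum_{v} m_v \log m_v \;=\; \mu .
\end{align*}
```

The second line groups positions by item: the $m_v$ positions holding item $v$ give $r_p = m_v, m_v - 1, \ldots, 1$ in order of appearance, so the inner sum telescopes to $m_v \log m_v$.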
10
Future Directions Extension to insert/delete streams. Applications in: DBMSs where massive secondary storage cannot be scanned quickly enough to answer real-time queries; monitoring open flows through internet routers. A lower-bound proof showing the algorithm is optimal, or an improved algorithm.