

2 Maintaining Variance and k-Medians over Data Stream Windows Paper by Brian Babcock, Mayur Datar, Rajeev Motwani and Liadan O’Callaghan. Presentation by Anat Rapoport, December 2003.

3 Characteristics of the data stream • Data elements arrive continually • Only the most recent N elements are used when answering queries • Single linear scan algorithm (we get only one look at each element) • Store only a summary of the data seen thus far

4 Introduction • Two important and related problems: –Variance –k-median clustering

5 Problem 1 (Variance) Given a stream of numbers, maintain at every instant the variance of the last N values: V = Σ_{i=1}^{N} (x_i − μ)², where μ = (1/N)·Σ_{i=1}^{N} x_i denotes the mean of the last N values
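For concreteness, here is a brute-force baseline (my own illustrative sketch, not the paper's algorithm) that buffers the entire window; the streaming structure developed in the following slides approximates exactly this quantity without storing all N elements:

```python
from collections import deque

def windowed_variance(stream, N):
    """Exact variance of the last N values; O(N) memory, so only a reference point."""
    window = deque(maxlen=N)              # keeps only the N most recent elements
    for x in stream:
        window.append(x)
        mu = sum(window) / len(window)
        # The paper's V is the sum of squared deviations (not divided by N).
        yield sum((xi - mu) ** 2 for xi in window)

print(list(windowed_variance([1, 2, 3, 100], N=3)))  # [0.0, 0.5, 2.0, 6338.0]
```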

6 Problem 1 (Variance) • We cannot buffer the entire sliding window in memory • So we cannot compute the variance exactly at every instant • Instead we solve this problem approximately • Using O((1/ε²)·log(NR²)) memory we provide an estimate with relative error of at most ε • The time required per new element is amortized O(1)

7 Extend to k-median • Given a multiset X of objects in a metric space M with distance function l, the k-median problem is to pick k points c_1,…,c_k ∈ M so as to minimize Σ_{x∈X} l(x, C(x)), where C(x) is the closest of c_1,…,c_k to x • If C(x) = c_i then x is said to be assigned to c_i, and l(x, c_i) is called the assignment distance of x • The objective function is the sum of the assignment distances

8 Problem 2 (SWKM) Given a stream of points from a metric space M with distance function l, window size N, and parameter k, maintain at every instant t a median set c_1,…,c_k ∈ M minimizing Σ_{x∈X_t} l(x, C(x)), where X_t is the multiset of the N most recent points at time t
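Written out in code, the objective is just a sum of assignment distances. A minimal sketch with 1-D points and l(x, y) = |x − y| as an illustrative choice of metric:

```python
def kmedian_cost(points, medians, dist=lambda x, y: abs(x - y)):
    """Sum of assignment distances: each point pays its distance to the closest median."""
    return sum(min(dist(x, c) for c in medians) for x in points)

# k = 2 medians for a small multiset: cost is 1+0+1+0+1 = 3
print(kmedian_cost([1, 2, 9, 10, 11], medians=[2, 10]))
```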

9 Exponential Histogram • From last week: maintaining simple statistics over sliding windows • The exponential histogram estimates a class of aggregate functions over sliding windows • Their result applies to any function f satisfying the following properties for all multisets X, Y:

10 Where EH goes wrong • EH can estimate any function f defined over windows which satisfies: • Positive: f(X) ≥ 0 • Polynomially bounded: f(X) is at most polynomial in |X| (and in the magnitude of the data) • Composable: f(X ∪ Y) can be computed from small synopses of X and Y • Weakly additive: f(X) + f(Y) ≤ f(X ∪ Y) ≤ C_f·(f(X) + f(Y)), where C_f ≥ 1 is a constant • The “weakly additive” condition does not hold for variance or k-medians

11 Failure of “Weak Additivity” [Figure: values plotted over time for two adjacent buckets; the variance of each bucket is small, but the variance of the combined bucket is large] • We cannot afford to neglect the contribution of the last bucket
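A concrete instance of the failure, with made-up numbers: each bucket alone has zero variance, yet the combined variance is large, so no constant C_f can satisfy f(X ∪ Y) ≤ C_f·(f(X) + f(Y)):

```python
def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

old_bucket = [0, 0, 0]        # variance 0
new_bucket = [100, 100, 100]  # variance 0
print(variance(old_bucket), variance(new_bucket))  # 0.0 0.0
print(variance(old_bucket + new_bucket))           # 15000.0, not bounded by C_f * (0 + 0)
```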

12 The idea • Summarize intervals of the data stream using composable synopses • For efficient memory use, adjacent intervals are combined when doing so does not significantly increase the error • The synopsis of the last interval in the sliding window is inaccurate, since some of its points have expired… • HOWEVER –We will find a way to estimate this interval

13 Timestamp • Corresponds to the position of an active data element in the current window • We do not make explicit updates to timestamps as elements arrive • We use a wraparound counter of O(log N) bits • A timestamp can be extracted by comparison with the counter value of the current arrival
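A sketch of the wraparound bookkeeping. The slide's counter uses O(log N) bits; this sketch counts arrivals modulo 2N so that ages up to N are unambiguous, a concrete choice of mine rather than the paper's exact scheme:

```python
N = 8            # window size (illustrative)
MOD = 2 * N      # counter wraps around; log2(2N) bits suffice

def age(current, ts):
    """Arrivals since timestamp ts; valid while the true age is below 2N."""
    return (current - ts) % MOD

def expired(current, ts):
    """An element (or a bucket's newest element) N or more arrivals old has left the window."""
    return age(current, ts) >= N
```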

14 Model • We store the data elements in the buckets of the histogram • Every bucket stores a synopsis structure for a contiguous set of elements • The partition is based on arrival time • Each bucket also has a timestamp, that of the most recent data element in it • When the timestamp reaches N+1 we drop the bucket

15 Model • Buckets are numbered B_1,…,B_m –B_1 is the most recent –B_m is the oldest • t_1,…,t_m denote the bucket timestamps • All buckets except B_m contain only active data elements

16 Maintaining variance over sliding windows: the algorithm

17 Details • We would like to estimate the variance with relative error of at most ε • Maintain for each bucket B_i, besides its timestamp t_i: –number of elements n_i –mean μ_i –variance V_i

18 Details • Define another set of buckets B_1*,…,B_m* that represent suffixes of the data stream: B_i* contains all the points that arrived after those in bucket B_i • In particular, B_m* represents all the points that arrived after the oldest non-expired bucket • The statistics for these buckets are computed temporarily, when needed

19 Data structure: exponential histogram [Diagram: the window x_1,…,x_N of size N, ordered by timestamp from most recent to oldest, is partitioned into buckets B_1, B_2,…,B_{m-1}, B_m; the suffix bucket B_m* covers B_1,…,B_{m-1}]

20 Combination rule • In the algorithm we will need to combine adjacent buckets • Consider two buckets B_i and B_j that get combined to form a new bucket B_{i,j} • The statistics for B_{i,j} are: –n_{i,j} = n_i + n_j –μ_{i,j} = (n_i·μ_i + n_j·μ_j) / n_{i,j} –V_{i,j} = V_i + V_j + (n_i·n_j / n_{i,j})·(μ_i − μ_j)²
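The combination rule in code; the formula is exactly Lemma 1's, and the sanity check below compares it against a direct computation:

```python
def combine(n_i, mu_i, V_i, n_j, mu_j, V_j):
    """Merge the (count, mean, variance) statistics of two disjoint buckets."""
    n = n_i + n_j
    mu = (n_i * mu_i + n_j * mu_j) / n
    V = V_i + V_j + (n_i * n_j / n) * (mu_i - mu_j) ** 2
    return n, mu, V

def stats(xs):
    """Direct (count, mean, sum-of-squared-deviations) of a list."""
    mu = sum(xs) / len(xs)
    return len(xs), mu, sum((x - mu) ** 2 for x in xs)

a, b = [1.0, 2.0, 3.0], [10.0, 14.0]
print(combine(*stats(a), *stats(b)))  # (5, 6.0, 130.0)
print(stats(a + b))                   # (5, 6.0, 130.0), matches
```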

21 Lemma 1 The bucket combination procedure correctly computes n_{i,j}, μ_{i,j}, V_{i,j} for the new bucket Proof • Note that n_{i,j} and μ_{i,j} are correctly computed from the definitions of count and average • Define δ_i = μ_i − μ_{i,j} and δ_j = μ_j − μ_{i,j}

22 Proof (continued) V_{i,j} = Σ_{x∈B_i}(x − μ_{i,j})² + Σ_{x∈B_j}(x − μ_{i,j})² = Σ_{x∈B_i}((x − μ_i) + δ_i)² + Σ_{x∈B_j}((x − μ_j) + δ_j)² = V_i + n_i·δ_i² + V_j + n_j·δ_j² (the cross terms sum to zero) = V_i + V_j + (n_i·n_j / n_{i,j})·(μ_i − μ_j)², as claimed by the combination rule

23 Main Solution Idea • More careful estimation of the last bucket’s contribution • Decompose variance into two parts –“Internal” variance: within a bucket –“External” variance: between buckets [Diagram: total variance = internal variance of bucket i + internal variance of bucket j + external variance between them]

24 Estimation of the variance over the current active window • Let B_{m'} refer to the non-expired portion of bucket B_m (the set of active elements) • The estimates for n_{m'}, μ_{m'}, V_{m'}: –n_{m'}^{EST} = N + 1 − t_m (exact) –μ_{m'}^{EST} = μ_m –V_{m'}^{EST} = V_m / 2 • The combined statistics of B_{m'} and B_m* are sufficient for computing the variance at time t
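The estimation step as a function, reusing `combine` from slide 20; the tuple layout and the window-position convention for t_m (1 = most recent) are my own framing of the slide's notation:

```python
def estimate_variance(bucket_m, suffix_stats, N):
    """Estimate the variance of the active window from the oldest bucket B_m,
    given (t_m, mu_m, V_m) and the exact statistics (n, mu, V) of the suffix B_m*."""
    t_m, mu_m, V_m = bucket_m
    n_est = N + 1 - t_m   # exact: number of B_m's elements still in the window
    mu_est = mu_m         # heuristic: survivors keep the bucket mean
    V_est = V_m / 2       # heuristic: survivors carry half the bucket variance
    return combine(n_est, mu_est, V_est, *suffix_stats)[2]
```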

25 Estimation of the variance over the current active window • The estimate for B_{m'} can be found in O(1) time if we keep statistics for B_m • The error is due to the error in the estimated statistics for B_{m'} • Theorem: the relative error is ≤ ε provided V_m ≤ (ε²/9)·V_m* • Aim: maintain V_m ≤ (ε²/9)·V_m* using as few buckets as possible

26 Algorithm sketch • For every new element: –insert the new element into an existing bucket or into a new bucket –if B_m’s timestamp > N, delete it –if there are two adjacent buckets with small combined variance, combine them into one bucket

27 Algorithm 1 (insert x_t) 1. If x_t = μ_1 then insert x_t into B_1 by incrementing n_1 by 1 (the mean and variance are unchanged). Otherwise, create a new bucket for x_t; the new bucket becomes B_1 with V_1 = 0, μ_1 = x_t, n_1 = 1, and each old bucket B_i becomes B_{i+1}. 2. If B_m’s timestamp > N, delete the bucket; bucket B_{m-1} becomes the new oldest bucket. Maintain the statistics of B_{m-1}* (instead of B_m*), which can be computed from the previously maintained statistics for B_m* and B_{m-1} (the combination rule can also be run in reverse, so “deleting” a bucket’s contribution works too).

28 Algorithm 1 (insert x_t) 3. Let k = 9/ε² and let V_{i,i-1} be the variance of the combination of buckets B_i and B_{i-1}. While there exists an index i > 2 such that k·V_{i,i-1} ≤ V_{i-1}*, find the smallest such i and combine B_i and B_{i-1} according to the combination rule. The statistics for B_i* can be computed incrementally from the statistics for B_{i-1} and B_{i-1}*. 4. Output the estimated variance at time t according to the estimation procedure: V_{m',m*}, the variance of B_{m'} combined with B_m*.
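A compact end-to-end sketch of steps 1-4, under several simplifications of mine: it reuses `combine` from slide 20, recomputes suffix statistics on the fly instead of maintaining them incrementally, and sweeps once from oldest to newest rather than repeatedly locating the smallest qualifying index:

```python
class SlidingVariance:
    """Sketch of the variance-over-sliding-window structure (not the paper's exact bookkeeping)."""

    def __init__(self, N, eps):
        self.N = N
        self.k = 9 / eps ** 2   # combination threshold from step 3
        self.t = 0              # arrival counter
        self.buckets = []       # newest first: (ts, n, mu, V)

    def _suffix(self, i):
        """Statistics of everything newer than buckets[i] (the suffix bucket B*)."""
        n, mu, V = 0, 0.0, 0.0
        for _, nb, mub, Vb in self.buckets[:i]:
            n, mu, V = combine(n, mu, V, nb, mub, Vb)
        return n, mu, V

    def _sweep(self):
        # Step 3: merge an adjacent pair while its combined variance is small
        # relative to the variance of everything newer than the pair.
        j = len(self.buckets) - 1
        while j >= 2:           # never merge into the newest bucket B_1
            ts_o, n_o, mu_o, V_o = self.buckets[j]       # older of the pair
            ts_n, n_n, mu_n, V_n = self.buckets[j - 1]   # newer of the pair
            n, mu, V = combine(n_o, mu_o, V_o, n_n, mu_n, V_n)
            if self.k * V <= self._suffix(j - 1)[2]:
                self.buckets[j - 1:j + 1] = [(ts_n, n, mu, V)]
            j -= 1

    def insert(self, x):
        self.t += 1
        # Step 1: x joins B_1 if it equals B_1's mean (variance unchanged),
        # otherwise it opens a fresh bucket that becomes the new B_1.
        if self.buckets and x == self.buckets[0][2]:
            ts, n, mu, V = self.buckets[0]
            self.buckets[0] = (self.t, n + 1, mu, V)
        else:
            self.buckets.insert(0, (self.t, 1, float(x), 0.0))
        # Step 2: drop B_m once even its newest element has left the window.
        if self.t - self.buckets[-1][0] >= self.N:
            self.buckets.pop()
        self._sweep()

    def variance(self):
        # Step 4: estimate for B_m's active portion, merged with B_m*.
        ts, n_m, mu_m, V_m = self.buckets[-1]
        n_act = min(n_m, self.N - (self.t - ts))  # exact active count in B_m
        n_s, mu_s, V_s = self._suffix(len(self.buckets) - 1)
        return combine(n_act, mu_m, V_m / 2, n_s, mu_s, V_s)[2]

sv = SlidingVariance(N=70, eps=0.5)
for x in range(700):
    sv.insert(float(x % 7))
print(sv.variance())   # estimate; the exact answer for this window is 280.0
```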

29 • Invariant 1: for every bucket B_i, (9/ε²)·V_i ≤ V_i* –Ensures that the relative error is ≤ ε • Invariant 2: for every i > 1, (9/ε²)·V_{i,i-1} > V_{i-1}* –This invariant ensures that the total number of buckets is small: O((1/ε²)·log(NR²)) –Each bucket requires constant space

30 Lemma 2 The number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε²)·log(NR²)), where R is an upper bound on the absolute value of the data elements.

31 Proof sketch • From the combination rule: the variance of the union of two buckets is no less than the sum of the individual variances • For an algorithm that preserves Invariant 2, the variance of the suffix bucket B_i* doubles after every O(1/ε²) buckets • Total number of buckets: no more than O((1/ε²)·log V), where V is the variance of the last N points • V is at most NR², so the bound is O((1/ε²)·log(NR²))

32 Running time improvement • The algorithm as stated requires O((1/ε²)·log(NR²)) time per new element • Most time is spent in step 3, where we sweep over the buckets to combine them • That time is proportional to the size of the histogram, O((1/ε²)·log(NR²)) • The trick: skip step 3 until we have seen Θ((1/ε²)·log(NR²)) new elements • This makes the time of the algorithm amortized O(1) • It may violate Invariant 2 temporarily, but we restore the invariant every Θ((1/ε²)·log(NR²)) data points, when we execute step 3
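The batching trick, as a variant of the sketch above; `period` stands in for Θ((1/ε²)·log(NR²)):

```python
class AmortizedSlidingVariance(SlidingVariance):
    """Run the combining sweep only once per `period` arrivals. Invariant 2 may
    lapse between sweeps but is restored each time the sweep runs, so the
    bucket-count bound (and hence amortized O(1) insert time) still holds."""

    def __init__(self, N, eps, period):
        super().__init__(N, eps)
        self.period = period

    def _sweep(self):
        if self.t % self.period == 0:   # defer step 3 to every period-th arrival
            super()._sweep()
```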

33 Variance algorithm summary • Amortized O(1) time per new element (O((1/ε²)·log(NR²)) without batching) • O((1/ε²)·log(NR²)) memory • Relative error of at most ε

34 Clustering on sliding windows

35 Clustering Data Streams • Based on the k-median problem: –Data stream of points from a metric space –Find k clusters in the stream such that the sum of distances from data points to their closest centers is minimized

36 Clustering Data Streams • Constant-factor approximation algorithms –A simple two-step algorithm: Step 1: divide the stream into sets S_1, S_2,… of M = n^τ points each and find O(k) centers in each S_i –Local clustering: assign each point in S_i to its closest center Step 2: let S′ be the set of centers for S_1, S_2,…, with each center weighted by the number of points assigned to it; cluster S′ to find k centers • The solution cost is < 2 × the optimal solution cost • τ < 0.5 is a parameter which trades off the space bound against the approximation factor of 2^{O(1/τ)}
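A runnable sketch of the two-step scheme. The local clustering routine here is a farthest-first traversal (the classic 2-approximation heuristic for k-center), used purely as a stand-in for the paper's bicriteria O(k)-median subroutine, and step 2 ignores the weights that a real weighted k-median routine would use:

```python
def greedy_centers(points, k, dist):
    """Stand-in local clustering: farthest-first traversal (k-center heuristic)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def two_step(points, M, k, dist=lambda a, b: abs(a - b)):
    """Step 1: cluster each chunk S_i of M points, weighting each center by the
    number of points assigned to it. Step 2: cluster the weighted center set S'."""
    weighted = []                                     # S': (center, weight) pairs
    for i in range(0, len(points), M):
        chunk = points[i:i + M]
        cs = greedy_centers(chunk, min(k, len(chunk)), dist)
        counts = [0] * len(cs)
        for p in chunk:                               # local assignment step
            counts[min(range(len(cs)), key=lambda j: dist(p, cs[j]))] += 1
        weighted.extend(zip(cs, counts))
    return greedy_centers([c for c, _ in weighted], k, dist)

print(two_step([1, 2, 3, 50, 51, 52, 99, 100, 101], M=3, k=2))  # e.g. [1, 101]
```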

37 One-pass algorithm: first phase [Diagram: original data points 1–5 with k = 1 and M = 3; the first M points form S_1 and the remaining points form S_2; each set is clustered locally]

38 One-pass algorithm: second phase [Diagram: with M = 3 and k = 1, the local centers (point 1 with weight w = 3 and point 5 with weight w = 2) form the weighted set S′, which is then clustered to produce the final center]

39 Restate the algorithm… • Read N^τ points (level-0 medians) from the input data stream • Find O(k) medians of these points, store each with its associated weight, and discard the N^τ points: these are level-1 medians • When N^τ level-1 medians with associated weights have accumulated, find O(k) medians of them: level-2 medians • Repeat up to 1/τ times

40 The idea • In general, whenever there are N^τ medians at level i, they are clustered to form level-(i+1) medians [Diagram: a tree with data points at the leaves, level-i medians above them, and level-(i+1) medians above those]

41 Data structure: exponential histogram [Diagram: as before, the window x_1,…,x_N of size N is partitioned by timestamp into buckets B_1 (most recent),…,B_{m-1}, B_m (oldest); now each bucket consists of a collection of data points or intermediate medians]

42 Point representation • Each point is represented by a triple (p(x), w(x), c(x)): –p(x): identifier of x (its coordinates) –w(x): weight of x, the number of data points it represents –c(x): cost of x, an estimate of the sum of the costs l(x, y) over all leaves y of the tree of which x is the root • w(x) = Σ_i w(y_i) and c(x) = Σ_y (c(y) + w(y)·l(x, y)), over all y assigned to x • If x is a level-0 median: w(x) = 1, c(x) = 0 • Thus c(x) is an overestimate of the “true” cost of x
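The triple and its roll-up rule as code; `MedianPoint` and `promote` are names of my own, and `assigned` is expected to include the promoted center's own previous triple:

```python
class MedianPoint:
    """The triple (p(x), w(x), c(x)) for a point or intermediate median."""
    def __init__(self, p):
        self.p, self.w, self.c = p, 1, 0.0   # a raw data point is a level-0 median

def promote(center, assigned, dist=lambda a, b: abs(a - b)):
    """Turn `center` into a next-level median summarizing the medians assigned to it:
    w(x) = sum of child weights, c(x) = sum of child costs plus weighted distances."""
    x = MedianPoint(center.p)
    x.w = sum(y.w for y in assigned)
    x.c = sum(y.c + y.w * dist(x.p, y.p) for y in assigned)
    return x
```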

43 Bucket cost function • We maintain medians at intermediate levels • Whenever there are N^τ medians at the same level, we cluster them into O(k) medians at the next higher level • Each bucket can therefore be split into 1/τ groups, where group j contains the bucket’s level-j medians • Each group contains at most N^τ medians

44 Bucket cost function • A bucket’s cost function is an estimate of the cost of clustering the points represented by the bucket • Consider bucket B_i, and let Y_i be the set of medians in the bucket, clustered to k medians c_1,…,c_k • Cost function for B_i: f(B_i) = Σ_{x∈Y_i} (c(x) + w(x)·l(x, C(x))), where C(x) ∈ {c_1,…,c_k} is the median closest to x

45 Combination • Let B_i and B_j be two adjacent buckets that need to be combined to form B_{i,j} • Let G_0,…,G_{1/τ} be the groups of medians from the two buckets, merged level by level • If a group G_l now holds N^τ or more medians, cluster its points into O(k) level-(l+1) medians, add them to G_{l+1}, and set G_l to be empty • C_0 is the set of O(k) medians obtained by clustering G_0, and so on… • After at most 1/τ such unions we get B_{i,j} • Now we compute the new bucket’s cost
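A sketch of the merge, using `MedianPoint`, `promote`, and `greedy_centers` from the earlier sketches; `NTAU` stands in for N^τ, and each bucket is represented as a list of per-level lists of medians:

```python
NTAU = 32  # stands in for N**tau, the per-level capacity

def merge_buckets(groups_i, groups_j, k, dist=lambda a, b: abs(a - b)):
    """Merge two buckets level by level; when a level overflows N^tau medians,
    cluster it into O(k) medians one level up and empty it."""
    levels = max(len(groups_i), len(groups_j)) + 1
    merged = [[] for _ in range(levels)]
    for src in (groups_i, groups_j):
        for lvl, group in enumerate(src):
            merged[lvl].extend(group)
    for lvl in range(levels - 1):
        if len(merged[lvl]) >= NTAU:
            centers = greedy_centers([m.p for m in merged[lvl]], k, dist)
            assigned = {c: [] for c in centers}
            for m in merged[lvl]:          # assign each median to its closest center
                assigned[min(centers, key=lambda c: dist(m.p, c))].append(m)
            merged[lvl + 1].extend(
                promote(MedianPoint(c), ys, dist) for c, ys in assigned.items() if ys)
            merged[lvl] = []
    return merged
```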

46 Answer a query • Consider buckets B_1,…,B_{m-1} • Each bucket contains at most (1/τ)·N^τ medians • Cluster all of these medians together to produce k medians • Cluster bucket B_m separately to get k additional medians • Present the 2k medians as the answer

47 Algorithm (insert x_t) • If the number of level-0 medians in B_1 < k, add the point x_t as a level-0 median in bucket B_1; else create a new bucket B_1 to contain x_t and renumber the existing buckets accordingly • If bucket B_m’s timestamp > N, delete it; B_{m-1} then becomes the last bucket • Make a sweep over the buckets from most recent to least recent: while there exists an index i > 2 such that f(B_{i,i-1}) ≤ 2·f(B_{i-1}*), find the smallest such i and combine buckets B_i and B_{i-1} using the combination procedure described above

48 Invariant 3: for every bucket B_i, f(B_i) ≤ 2·f(B_i*) –Ensures a solution with 2k medians whose cost is within a multiplicative factor of 2^{O(1/τ)} of the cost of the optimal k-median solution Invariant 4: for every bucket B_i (i > 1), f(B_{i,i-1}) > 2·f(B_{i-1}*) –Ensures that the number of buckets never exceeds O((1/τ)·log N), assuming the cost is bounded by poly(N) (the bound given in the article)

49 Running time improvement • After each element arrives we check whether Invariant 3 holds • To reduce the running time we can execute bucket combination only after some number of points have accumulated in bucket B_1; only after it fills do we check the invariant • We assume the clustering subroutine is not called after each new arrival; instead, the structure maintains enough statistics to produce an answer when a query arrives

50 Producing exactly k clusters • With each median, we maintain an estimate (within a constant factor) of the number of active data points assigned to it • We don’t cluster B_m* and B_{m'} separately; instead we cluster the medians from all the buckets together, with the weights of the medians from B_m adjusted so that they reflect only the active data points

51 Conclusions • The goal of such algorithms is to maintain statistics or other information over the last N entries of a stream that grows in real time • The variance algorithm uses O((1/ε²)·log(NR²)) memory, maintains an estimate of the variance with relative error of at most ε, and spends amortized O(1) time per new element • The k-median algorithm provides a 2^{O(1/τ)} approximation for τ < 0.5; it uses O((1/τ)·log N) buckets and requires O(1) amortized time per new element

52 Questions? • More questions/comments can be sent to anatrapo@post.tau.ac.il

