
Maintaining Variance and k-Medians over Data Stream Windows Paper by Brian Babcock, Mayur Datar, Rajeev Motwani and Liadan O’Callaghan. Presentation by Anat Rapoport December 2003.

Characteristics of the data stream  Data elements arrive continually  Only the most recent N elements are used when answering queries  Single linear scan algorithm (we can only have one look)  Store only a summary of the data seen thus far.

Introduction  Two important and related problems: –Variance –k-median clustering

Problem 1 (Variance) Given a stream of numbers, maintain at every instant the variance of the last N values, V = Σ_{i=1..N} (x_i − μ)², where μ = (1/N)·Σ_{i=1..N} x_i denotes the mean of the last N values.

Problem 1 (Variance)  We cannot buffer the entire sliding window in memory  So we cannot compute the variance exactly at every instant  We will solve this problem approximately  We use O((1/ε²)·log NR²) memory and provide an estimate with relative error of at most ε  The time required per new element is amortized O(1)

Extend to k-median  Given a multiset X of objects in a metric space M with distance function l, the k-median problem is to pick k points c_1, …, c_k ∈ M so as to minimize Σ_{x∈X} l(x, C(x)), where C(x) is the closest of c_1, …, c_k to x.  If C(x) = c_i then x is said to be assigned to c_i, and l(x, c_i) is called the assignment distance of x  The objective function is the sum of the assignment distances.

Problem 2 (SWKM) Given a stream of points from a metric space M with distance function l, window size N, and parameter k, maintain at every instant t a median set c_1, …, c_k ∈ M minimizing Σ_{x∈X_t} l(x, C(x)), where X_t is the multiset of the N most recent points at time t.

Exponential Histogram  From last week: maintaining simple statistics over sliding windows  The exponential histogram estimates a class of aggregate functions over sliding windows  Their result applies to any function f satisfying the following properties for all multisets X, Y:

Where EH goes wrong  EH can estimate any function f defined over windows which satisfies:  Positive: f(X) ≥ 0  Polynomially bounded: f(X) ≤ poly(|X|)  Composable: f(X ∪ Y) can be computed from small summaries (sketches) of X and Y  Weakly additive: f(X) + f(Y) ≤ f(X ∪ Y) ≤ C_f·(f(X) + f(Y)), where C_f ≥ 1 is a constant The "Weakly Additive" condition is not valid for variance or k-medians

Failure of "Weak Additivity"  Cannot afford to neglect the contribution of the last bucket [Figure: data values plotted over time — the variance of each individual bucket is small, but the variance of the combined bucket is large]
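A concrete numeric illustration of this failure (my own example, not from the slides): two buckets that each have zero internal variance can combine into a bucket with arbitrarily large variance, so no constant C_f can satisfy the weak-additivity condition for variance.

```python
# Illustration (not from the original slides): variance violates weak additivity.
# As in the rest of the presentation, "variance" here is V = sum((x - mu)^2),
# the sum of squared deviations from the mean.

def V(values):
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)

bucket_x = [5.0, 5.0, 5.0]        # internal variance 0
bucket_y = [100.0, 100.0, 100.0]  # internal variance 0

print(V(bucket_x), V(bucket_y))   # 0.0 0.0
print(V(bucket_x + bucket_y))     # 13537.5 -- cannot be bounded by C_f * (0 + 0)
```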

The idea  Summarize intervals of the data stream using composable synopses  For efficient memory use adjacent intervals are combined, when it doesn’t increase the error significantly  The synopsis of the last interval in the sliding window is inaccurate. Some points have expired…  HOWEVER –We will find a way to estimate this interval

Timestamp  Corresponds to the position of an active data element in the current window  We do not make explicit updates  We use a wraparound counter of log N bits  The timestamp can be extracted by comparison with the counter value of the current arrival
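A minimal sketch (my own illustration, not from the paper) of how a wraparound arrival counter yields an element's age, i.e. its position in the current window; I assume the counter wraps at 2N so that ages up to N are unambiguous.

```python
# Hypothetical sketch (names and the WRAP = 2N choice are my own assumptions):
# each arrival is stamped with a counter kept modulo WRAP; the age of an element
# is recovered by comparing its stamp with the stamp of the newest arrival.

N = 1000
WRAP = 2 * N   # one extra bit beyond log N keeps ages in [0, N] unambiguous

def age(stamp, current_stamp):
    """Number of arrivals since the element with `stamp` arrived."""
    return (current_stamp - stamp) % WRAP

# Example: an element stamped at counter value 1990; the counter has since wrapped to 40.
print(age(1990, 40))   # 50 -> 50 elements have arrived since; it expires once age >= N
```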

Model  We store the data elements in the buckets of the histogram  Every bucket stores the synopsis structure for a contiguous set of elements  The partition is based on arrival time  Each bucket also has a timestamp: that of the most recent data element in it  When this timestamp reaches N+1 we drop the bucket

Model  Buckets are numbered B_1, …, B_m –B_1 is the most recent –B_m is the oldest  t_1, …, t_m denote the bucket timestamps  All buckets but B_m contain only active data elements

Algorithm: maintaining variance over sliding windows

Details  We would like to estimate the variance with relative error of at most ε  Maintain for each bucket B_i, besides its timestamp t_i, also: –the number of elements n_i –the mean μ_i –the variance V_i

Details  Define another set of buckets B_1*, …, B_j* that represent suffixes of the data stream  The bucket B_m* represents all the points that arrived after the oldest non-expired bucket B_m  The statistics for these buckets are computed temporarily

data structure: exponential histogram [Figure: a window of size N holding elements x_1, …, x_N, partitioned by arrival time into timestamped buckets B_1 (most recent), B_2, …, B_{m-1}, B_m (oldest); B_m* denotes the suffix of buckets more recent than B_m]

Combination rule  In the algorithm we will need to combine adjacent buckets  Consider two buckets B_i and B_j that get combined to form a new bucket B_{i,j}  The statistics for B_{i,j} are: –n_{i,j} = n_i + n_j –μ_{i,j} = (n_i·μ_i + n_j·μ_j) / n_{i,j} –V_{i,j} = V_i + V_j + (n_i·n_j / n_{i,j})·(μ_i − μ_j)²

Lemma 1 The bucket combination procedure correctly computes n_{i,j}, μ_{i,j}, V_{i,j} for the new bucket Proof  Note that n_{i,j} and μ_{i,j} are correctly computed from the definitions of count and average  Define δ_i = μ_i − μ_{i,j} and δ_j = μ_j − μ_{i,j}  Then Σ_{x∈B_i} (x − μ_{i,j})² = V_i + n_i·δ_i², and similarly for B_j, so V_{i,j} = V_i + V_j + n_i·δ_i² + n_j·δ_j², which equals the combination formula above
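A small runnable sketch of the combination rule (my own code, with names chosen for illustration), checked against a brute-force computation of V = Σ(x − μ)² on the union:

```python
# Sketch of the bucket combination rule (illustrative code, not the authors').
# Each bucket is summarized by (n, mu, V), where V is the sum of squared deviations.

def combine(n_i, mu_i, V_i, n_j, mu_j, V_j):
    n = n_i + n_j
    mu = (n_i * mu_i + n_j * mu_j) / n
    V = V_i + V_j + (n_i * n_j / n) * (mu_i - mu_j) ** 2
    return n, mu, V

def stats(values):
    """Brute-force (n, mu, V) for a list of numbers."""
    n = len(values)
    mu = sum(values) / n
    return n, mu, sum((x - mu) ** 2 for x in values)

xs, ys = [1.0, 2.0, 6.0], [4.0, 4.0, 9.0, 11.0]
print(combine(*stats(xs), *stats(ys)))   # (7, 5.2857..., 79.4285...)
print(stats(xs + ys))                    # matches the brute-force statistics of the union
```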

Main Solution Idea  More careful estimation of the last bucket's contribution  Decompose the variance into two parts: –"Internal" variance: within a bucket –"External" variance: between buckets [Figure: the total variance decomposes into the internal variance of bucket i, the internal variance of bucket j, and the external variance between them]

Estimation of the variance over the current active window  Let B_m' refer to the non-expired portion of the bucket B_m (the set of active elements)  The estimates for n_m', μ_m', V_m': –n_m'^EST = N + 1 − t_m (exact) –μ_m'^EST = μ_m –V_m'^EST = V_m / 2  The statistics for B_{m',m*}, the combination of B_m' and B_m*, are sufficient for computing the variance at time t.

Estimation of the variance over the current active window  The estimate for B_m' can be found in O(1) time if we keep statistics for B_m  The error is due to the error in the estimated statistics for B_m'  Theorem: relative error ≤ ε, provided V_m ≤ (ε²/9)·V_m*  Aim: maintain V_m ≤ (ε²/9)·V_m* using as few buckets as possible

Algorithm sketch  For every new element: –insert the new element into an existing bucket or into a new bucket –if B_m's timestamp > N, delete it –if there are two adjacent buckets with small combined variance, combine them into one bucket

Algorithm 1 (insert x_t) 1. If x_t = μ_1 then insert x_t into B_1 by incrementing n_1 by 1. Otherwise, create a new bucket for x_t. The new bucket becomes B_1 with V_1 = 0, μ_1 = x_t, n_1 = 1. Each old bucket B_i becomes B_{i+1}. 2. If B_m's timestamp > N, delete the bucket. Bucket B_{m-1} becomes the new oldest bucket. Maintain the statistics of B_{m-1}* (instead of B_m*), which can be computed using the previously maintained statistics for B_m* and B_{m-1}. ("Deletion" of buckets also works…)

Algorithm 1 (insert x_t) 3. Let k = 9/ε² and let V_{i,i-1} denote the variance of the combination of buckets B_i and B_{i-1}. While there exists an index i > 2 such that k·V_{i,i-1} ≤ V_{i-1}*, find the smallest such i and combine the buckets according to the combination rule. The statistics for B_i* can be computed incrementally from the statistics for B_{i-1} and B_{i-1}*. 4. Output the estimated variance at time t according to the estimation procedure: V_{m',m*}.
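Putting the pieces together, here is a compact runnable sketch of steps 1–4 (my own simplified reading of the slides, not the authors' implementation): it uses plain arrival indices instead of wraparound timestamps, and it recomputes the suffix statistics B_{i-1}* during the combination sweep rather than maintaining them incrementally, so it does not achieve the amortized O(1) bound — it only illustrates the logic.

```python
# Simplified sketch of the sliding-window variance algorithm (my own reading of the
# slides, not the authors' code). Buckets hold (timestamp of newest element, count n,
# mean mu, V = sum of squared deviations); buckets[0] is B_1 (newest), buckets[-1] is
# B_m (oldest).
from functools import reduce

class Bucket:
    def __init__(self, t, n, mu, V):
        self.t, self.n, self.mu, self.V = t, n, mu, V

def combine(a, b):
    """Combination rule; the merged bucket keeps the newer timestamp."""
    n = a.n + b.n
    mu = (a.n * a.mu + b.n * b.mu) / n
    V = a.V + b.V + (a.n * b.n / n) * (a.mu - b.mu) ** 2
    return Bucket(max(a.t, b.t), n, mu, V)

def union_stats(buckets):
    return reduce(combine, buckets)

class WindowVariance:
    def __init__(self, N, eps):
        self.N, self.k = N, 9.0 / eps ** 2
        self.buckets = []
        self.t = 0                       # arrival index of the most recent element

    def insert(self, x):
        self.t += 1
        # Step 1: extend B_1 if x equals its mean, otherwise open a new singleton bucket.
        if self.buckets and x == self.buckets[0].mu:
            self.buckets[0].n += 1
            self.buckets[0].t = self.t
        else:
            self.buckets.insert(0, Bucket(self.t, 1, x, 0.0))
        # Step 2: drop B_m once even its newest element has left the window.
        if self.t - self.buckets[-1].t >= self.N:
            self.buckets.pop()
        # Step 3: while some adjacent pair (B_i, B_{i-1}), i > 2, has combined variance
        # small relative to the suffix variance V_{i-1}*, combine the most recent such pair.
        while True:
            for idx in range(2, len(self.buckets)):          # buckets[idx] is B_{idx+1}
                pair = combine(self.buckets[idx], self.buckets[idx - 1])
                suffix = union_stats(self.buckets[:idx - 1])  # B_{i-1}*: all newer buckets
                if self.k * pair.V <= suffix.V:
                    self.buckets[idx - 1:idx + 1] = [pair]
                    break
            else:
                break

    def variance(self):
        """Step 4: estimate V over the last N elements using B_m' and B_m*."""
        b_m = self.buckets[-1]
        n_active = self.N - (self.t - b_m.t)   # exact count of active elements in B_m
        if n_active >= b_m.n:                  # B_m lies entirely inside the window
            est = b_m
        else:                                  # estimated statistics of the non-expired part B_m'
            est = Bucket(b_m.t, n_active, b_m.mu, b_m.V / 2.0)
        newer = self.buckets[:-1]
        return combine(est, union_stats(newer)).V if newer else est.V

# Example usage: stream 10,000 values, then read off the windowed variance estimate.
import random
wv = WindowVariance(N=1000, eps=0.1)
for _ in range(10_000):
    wv.insert(random.gauss(0.0, 1.0))
print(wv.variance())   # should be close to N * 1.0 for unit-variance Gaussian data
```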

 Invariant 1: For every bucket B_i, (9/ε²)·V_i ≤ V_i* –Ensures that the relative error is ≤ ε  Invariant 2: For every i > 1, (9/ε²)·V_{i,i-1} > V_{i-1}* –Ensures that the total number of buckets is small: O((1/ε²)·log NR²) –Each bucket requires constant space

Lemma 2 The number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε²)·log NR²), where R is an upper bound on the absolute value of the data elements.

Proof sketch  From the combination rule: the variance of the union of two buckets is no less than the sum of the individual variances.  For an algorithm that preserves Invariant 2, the variance of the suffix bucket B_i* doubles after every O(1/ε²) buckets.  Total number of buckets: no more than O((1/ε²)·log V), where V is the variance of the last N points; V is no more than NR².  Hence O((1/ε²)·log NR²) buckets.

Running time improvement  The algorithm requires O((1/ε²)·log NR²) time per new element.  Most time is spent in step 3, where we make the sweep to combine buckets.  That time is proportional to the size of the histogram, O((1/ε²)·log NR²).  The trick: skip step 3 until we have seen Θ((1/ε²)·log NR²) new elements.  This ensures that the time per element is amortized O(1).  It may violate Invariant 2 temporarily, but we restore it every Θ((1/ε²)·log NR²) data points, when we execute step 3.

Variance algorithm summary  O((1/ε²)·log NR²) worst-case time per new element, amortized O(1)  O((1/ε²)·log NR²) memory  estimate with relative error of at most ε

Clustering on sliding windows

Clustering Data Streams  Based on the k-median problem: –Data stream of points from a metric space –Find k clusters in the stream such that the sum of distances from data points to their closest center is minimized

Clustering Data Streams  Constant-factor approximation algorithms –A simple two-step algorithm (see the sketch below): step 1: For each set S_i of M = n^τ points, find O(k) centers in S_i –Local clustering: assign each point in S_i to its closest center step 2: Let S′ be the set of centers for S_1, …, S_M, with each center weighted by the number of points assigned to it; cluster S′ to find k centers  The solution cost is < 2 × the optimal solution cost  τ < 0.5 is a parameter which trades off the space bound against an approximation factor of 2^O(1/τ)
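A runnable sketch of this two-step structure (my own illustration; the `cluster` helper below is a simple greedy farthest-point heuristic standing in for the O(k)-median subroutine the slides assume, and points are one-dimensional floats for brevity):

```python
# Two-step small-space clustering sketch (illustrative; subroutine and names are my own).
# Step 1 clusters each chunk locally into weighted centers; step 2 clusters the
# weighted centers into the final k centers.

def cluster(points, weights, k):
    """Greedy farthest-point selection of k centers, then weighted assignment.
    A crude stand-in for the O(k)-median local clustering subroutine."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        centers.append(max(points, key=lambda p: min(abs(p - c) for c in centers)))
    assigned = [0.0] * len(centers)
    for p, w in zip(points, weights):
        nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
        assigned[nearest] += w
    return centers, assigned

def two_step(stream, k, chunk_size):
    level1_centers, level1_weights = [], []
    # Step 1: cluster each set S_i of chunk_size points into k weighted centers.
    for start in range(0, len(stream), chunk_size):
        chunk = stream[start:start + chunk_size]
        cs, ws = cluster(chunk, [1.0] * len(chunk), k)
        level1_centers += cs
        level1_weights += ws
    # Step 2: cluster the weighted set S' of level-1 centers into the final k centers.
    return cluster(level1_centers, level1_weights, k)

data = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8, 20.1, 19.9, 20.0]
print(two_step(data, k=3, chunk_size=3))   # three centers, one per natural group
```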

One-pass algorithm: first phase [Figure: the original data is read in chunks S_1, S_2, … of M = 3 points each; with k = 1, each chunk is clustered into a single local center]

One-pass algorithm: second phase [Figure: with M = 3 and k = 1, the local centers from the first phase, carrying weights such as w = 3 and w = 2, form the set S′, which is clustered again to obtain the final center]

Restate the algorithm…  Read N^τ points (level-0 medians) from the input data stream, find O(k) medians for them, store each median with its associated weight, and discard the N^τ points: these are level-1 medians  When N^τ level-1 medians with associated weights have accumulated, find O(k) medians among them to obtain level-2 medians  Repeat for up to 1/τ levels

The idea  In general, whenever there are N^τ medians at level i, they are clustered to form level-(i+1) medians [Figure: data points feed into level-i medians, which are in turn clustered into level-(i+1) medians]

data structure: exponential histogram [Figure: the same bucket structure as before — a window of size N of elements x_1, …, x_N partitioned by arrival time into timestamped buckets B_1 (most recent), …, B_{m-1}, B_m (oldest) — except that each bucket now consists of a collection of data points or intermediate medians]

Point representation  Each point is represented by a triple (p(x), w(x), c(x)): –p(x): identifier of x (its coordinates) –w(x): weight of x, the number of points it represents –c(x): cost of x, an estimate of the sum of the costs l(x, y) over all leaves y of the tree rooted at x  For a median x with assigned points y_1, …, y_i: –w(x) = w(y_1) + w(y_2) + … + w(y_i) –c(x) = Σ_y (c(y) + w(y)·l(x, y)), over all y assigned to x  If x is a level-0 median: w(x) = 1, c(x) = 0  Thus c(x) is an overestimate of the "true" cost of x
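A small sketch of this weighted-median representation (illustrative; the field and function names are my own, and distances are 1-d for brevity):

```python
# Sketch of the (p, w, c) triple and of forming a next-level median from assigned points.
from dataclasses import dataclass

@dataclass
class Median:
    p: float      # identifier / coordinate of the point
    w: float      # weight: number of original data points it represents
    c: float      # cost: overestimate of the total assignment cost of those points

def leaf(x):
    """A level-0 median: an original data point with weight 1 and cost 0."""
    return Median(p=x, w=1.0, c=0.0)

def promote(center, assigned, dist=lambda a, b: abs(a - b)):
    """Form the next-level median at `center` from the medians assigned to it:
    w(x) = sum of child weights, c(x) = sum of (c(y) + w(y) * l(x, y))."""
    w = sum(y.w for y in assigned)
    c = sum(y.c + y.w * dist(center, y.p) for y in assigned)
    return Median(p=center, w=w, c=c)

# Example: three raw points assigned to a median at 2.0
m = promote(2.0, [leaf(1.0), leaf(2.0), leaf(4.0)])
print(m)   # Median(p=2.0, w=3.0, c=3.0)
```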

Bucket cost function  We maintain medians at intermediate levels  Whenever there are N^τ medians at the same level, we cluster them into O(k) medians at the next higher level  Each bucket can be split into 1/τ groups, where the j-th group contains the bucket's level-j medians  Each group contains at most N^τ medians

Bucket cost function  The bucket's cost function is an estimate of the cost of clustering the points represented by the bucket  Consider bucket B_i, and let c_1, …, c_k be k medians obtained by clustering the medians x stored in B_i  Cost function for B_i: f(B_i) = Σ_x (c(x) + w(x)·l(x, C(x))), where C(x) ∈ {c_1, …, c_k} is the median closest to x

Combination  Let B_i and B_j be two adjacent buckets that need to be combined to form B_{i,j}  Group the medians of the two buckets by level, and let each level's group in B_{i,j} be the union of the corresponding groups from B_i and B_j  If a group now holds more than N^τ medians, cluster them into O(k) medians at the next higher level and set the group to be empty; this clustering may cascade upward, and so on…  After at most 1/τ such unions we get B_{i,j}  Now we compute the new bucket's cost

Answer a query  Consider buckets B_1, …, B_{m-1}  Each contains at most (1/τ)·N^τ medians, so together they contain at most m·(1/τ)·N^τ medians  Cluster them to produce k medians  Cluster bucket B_m to get k additional medians  Present the 2k medians as the answer

Algorithm (Insert x_t)  If the number of level-0 medians in B_1 is < k, add the point x_t as a level-0 median in bucket B_1; else create a new bucket B_1 to contain x_t and renumber the existing buckets accordingly.  If bucket B_m's timestamp > N, delete it; now B_{m-1} becomes the last bucket.  Make a sweep over the buckets from most recent to least recent: while there exists an index i > 2 such that f(B_{i,i-1}) ≤ 2·f(B_{i-1}*), find the smallest such i and combine buckets B_i and B_{i-1} using the combination procedure described above.

Invariant 3. For every bucket B_i, f(B_i) ≤ 2·f(B_i*) –Ensures a solution with 2k medians whose cost is within a multiplicative factor of 2^O(1/τ) of the cost of the optimal k-median solution Invariant 4. For every bucket B_i (i > 1), f(B_{i,i-1}) > 2·f(B_{i-1}*) –Ensures that the number of buckets never exceeds O(1/τ + log N) –We assume that the cost is bounded by poly(N); the bound is stated as O((1/τ)·log N) in the article

Running time improvement  After each element arrives, we check whether Invariant 3 holds.  To reduce the running time, we can execute the bucket combination only after some number of points have accumulated in bucket B_1; only once it fills do we check the invariant.  Likewise, the answering procedure is not called after each new entry; instead, the algorithm maintains enough statistics to produce an answer when a query arrives.

Producing exactly k clusters  With each median, we maintain an estimate, within a constant factor, of the number of active data points that are assigned to it.  We do not cluster B_m* and B_m' separately; instead we cluster the medians from all the buckets together. However, the weights of the medians from B_m are adjusted so that they reflect only the active data points.

Conclusions  The goal of such algorithms is to maintain statistics or information over the last N entries of a data stream that grows in real time.  The variance algorithm uses O((1/ε²)·log NR²) memory, maintains an estimate of the variance with relative error of at most ε, and takes amortized O(1) time per new element.  The k-median algorithm provides a 2^O(1/τ) approximation for τ < 0.5; it uses O(1/τ + log N) memory and requires O(1) amortized time per new element.

Questions?  More questions/comments can be sent to