Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan, Stanford University ACM Symp. on Principles.

Presentation transcript:

Maintaining Variance over Data Stream Windows
Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O’Callaghan, Stanford University
ACM Symp. on Principles of Database Systems (PODS 2003), June 2003
Presented by C.-L. Lin

Data streams
- Traditional DBMS (Database Management System): data stored in finite, persistent data sets; one-time queries; relatively low update rate.
- Data Stream Management System (DSMS): data arrives as continuous, possibly infinite data streams; continuous queries; etc.
- An example of a continuous query: in a telecom company, we are interested in finding all outgoing calls longer than 2 minutes.
- New applications: sensor networks; network monitoring and traffic engineering; telecom call records; network security; financial applications; manufacturing processes; web logs and clickstreams; massive data sets.

Data Stream Management System (DSMS)
[Diagram: the User/Application sends a Query to the Stream Query Processor and receives Query Results; the processor keeps summary data (sum, count, variance, ...) in limited memory and/or disk.]

Sliding Window Model
[Figure: a timeline of timestamped elements with time increasing; a window of size N ends at the current time, with expired data on one side and future data on the other.]
1. When N is large (many hours, days, or months), we cannot buffer the entire sliding window in memory: O(N log R) bits of memory would be required, where R is an upper bound on the absolute value of the data. So we cannot compute the sum, count, or variance exactly at every instant.
2. Instead, compute the variance over the sliding window approximately, using as little memory as possible.
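To make the memory cost concrete, here is a minimal sketch of the exact baseline that the slide rules out: it buffers all N elements of the window. The class name `ExactWindowVariance` is hypothetical, and "variance" is taken to be the unnormalized sum of squared deviations, which is an assumption consistent with the later bound V ≤ NR^2.

```python
from collections import deque


class ExactWindowVariance:
    """Exact baseline: stores every element of the window (O(N) memory)."""

    def __init__(self, window_size):
        # A bounded deque drops the oldest element automatically on overflow.
        self.window = deque(maxlen=window_size)

    def add(self, x):
        self.window.append(x)

    def variance(self):
        # Assumption: variance = sum of squared deviations (not divided by n).
        n = len(self.window)
        if n == 0:
            return 0.0
        mean = sum(self.window) / n
        return sum((x - mean) ** 2 for x in self.window)


w = ExactWindowVariance(window_size=3)
for value in [1, 2, 3, 4]:
    w.add(value)
# The window now holds 2, 3, 4; the element 1 has expired.
```

Every query is exact, but memory grows linearly with N, which is what the bucket-based approximation in the following slides avoids.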

Review (1) Mean (2) Variance (3) Relative estimation error
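The formulas for this slide did not survive in the transcript. As a hedged reconstruction, the standard definitions are given below, assuming that "variance" means the unnormalized sum of squared deviations (consistent with the later bound V ≤ NR^2) and writing \hat{V} for the estimate, a symbol introduced here for illustration:

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
V = \sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\frac{\lvert \hat{V} - V \rvert}{V} \le \varepsilon
```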

The Concept of Buckets (1/2)
[Figure: timestamped elements along a time axis, grouped into buckets B_1, B_2, B_3, B_4; the figure labels bucket timestamps and suffix buckets.]

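The Concept of Buckets (2/2)
[Figure: a window of size N along the time axis, covered by buckets B_1, B_2, B_3, ..., B_{m-1}, B_m, with suffix bucket B_{m*}.]
(1) For each bucket B_i, maintain its timestamp t_i, size n_i, mean μ_i, and variance V_i.
(2) Proof later.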

Estimated Variance
[Figure: a window of size N along the time axis over buckets B_1, B_2, B_3, ..., B_{m-1}, B_m, with suffix bucket B_{m*}; one region is labeled "Error!!!".]

Lemma 1
Proof. Define δ_i = μ_i - μ_{i,j} and δ_j = μ_j - μ_{i,j}.
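The statement of Lemma 1 is missing from the transcript. The δ definitions above match the standard identity for combining the statistics of two disjoint groups, so the sketch below assumes that is what the lemma states: n_{i,j} = n_i + n_j, μ_{i,j} = (n_i μ_i + n_j μ_j) / n_{i,j}, and V_{i,j} = V_i + V_j + (n_i n_j / n_{i,j}) (μ_i - μ_j)^2. The names `Bucket`, `merge`, and `summarize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    n: int        # number of elements
    mean: float   # mean of the elements
    var: float    # sum of squared deviations from the mean


def summarize(values):
    """Compute bucket statistics directly from raw values."""
    n = len(values)
    mean = sum(values) / n
    return Bucket(n, mean, sum((x - mean) ** 2 for x in values))


def merge(a, b):
    """Combine two buckets using only their summary statistics."""
    n = a.n + b.n
    mean = (a.n * a.mean + b.n * b.mean) / n
    # n_i*delta_i^2 + n_j*delta_j^2 simplifies to (n_i*n_j/n) * (mu_i - mu_j)^2.
    var = a.var + b.var + (a.n * b.n / n) * (a.mean - b.mean) ** 2
    return Bucket(n, mean, var)


left, right = [1, 2, 3], [10, 20]
merged = merge(summarize(left), summarize(right))
direct = summarize(left + right)
```

Here `merged` and `direct` agree up to floating-point rounding, which is why buckets can be merged without access to the raw elements.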

When a new element x_t arrives:
(1) Create a new bucket for x_t. The new bucket becomes B_1, with V_1 = 0, μ_1 = x_t, n_1 = 1. Each old bucket B_i becomes B_{i+1}.
(2) If t_m > N, delete bucket B_m. Bucket B_{m-1} becomes the new oldest bucket; update B_{m-1*}.
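The two arrival steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: timestamps are absolute arrival times (the slide's test t_m > N reads as an age relative to the current time), a bucket's timestamp is that of its most recent element, and merging and the suffix-bucket update are left out. The names `Bucket` and `on_arrival` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Bucket:
    timestamp: int  # assumed: arrival time of the bucket's most recent element
    n: int
    mean: float
    var: float


def on_arrival(buckets, x, t, window_size):
    """buckets[0] is B_1 (newest); buckets[-1] is B_m (oldest)."""
    # Step (1): the new element forms a singleton bucket B_1;
    # pushing on the left shifts every old B_i to B_{i+1}.
    buckets.appendleft(Bucket(timestamp=t, n=1, mean=float(x), var=0.0))
    # Step (2): drop the oldest bucket once its timestamp leaves the window,
    # which covers arrival times t - window_size + 1 through t.
    while buckets and buckets[-1].timestamp <= t - window_size:
        buckets.pop()


buckets = deque()
for t, x in enumerate([10, 20, 30, 40, 50], start=1):
    on_arrival(buckets, x, t, window_size=3)
```

Without the merge step every bucket stays a singleton, so this sketch still uses O(N) buckets; the invariants on the next slide are what keep the bucket count small.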

Bucket Merge
Invariant 1: For every bucket B_i,
- Ensures that the relative error is ≤ ε.
Invariant 2: For each i > 1, for every bucket B_i,
- Ensures that the total number of buckets is small: O((1/ε^2) log(NR^2)).

Number of Buckets
Lemma 2: The number of buckets maintained at any point in time by an algorithm that preserves Invariant 2 is O((1/ε^2) log(NR^2)), where R is an upper bound on the absolute value of the data elements.
- From the merge rule: the variance of the union of two buckets is no less than the sum of the individual variances.
- By Invariant 2, the variance of the suffix bucket B_{i*} doubles after every O(1/ε^2) buckets.
- Total number of buckets: no more than O((1/ε^2) log V), where V is the variance of the last N points. V is no more than NR^2, giving O((1/ε^2) log(NR^2)).
[Figure labels: V_{3*}, V_{3,4}, V_{5*}]

Space Complexity
(1) By Lemma 2, the number of buckets maintained at any point in time is O((1/ε^2) log(NR^2)).
(2) Each bucket requires constant space.
==> Overall memory is O((1/ε^2) log(NR^2)).
But, counted in bits, each field of a bucket needs:
(1) Timestamps: O(log N)
(2) Bucket size:
(3) Mean:
(4) Variance: O(log V) = O(log(NR^2))

Estimation Error Estimated Variance: Actual Variance: Error: (1)(2) (3)