Sampling for Windows on Data Streams by Vladimir Braverman


Data Stream. A sequence of elements D = p_1, p_2, …, p_N, where each p_i is drawn from [m]. Objective: calculate a function f(D). Restrictions: single pass, sub-linear memory, fast processing time (per element). (Figure: the stream p_1, p_2, …, p_N over time.)

Motivation. Today's applications: huge amounts of data whizzing by. Objective: mining the data, computing statistics, etc. Restriction: expensive processing overhead is not allowed. Useful for many applications: networking, databases, etc.

Data Stream. Intensive theoretical research. Streaming systems: STREAM (Stanford), Stream Mill (UCLA), Aurora (Brown), GigaScope (Rutgers), Nile (Purdue), Niagara (Wisconsin), Telegraph (Berkeley), etc.

Data Stream. The basic model allows insertions only. What about deletions? The turnstile model. Sliding windows. (Figure: the stream p_1, …, p_N over time.)

Sliding Windows. SW contains the n most recent elements, which are "active"; older elements are "expired". (Figure: a window of size n=5 sliding along the stream as new elements arrive.)

Sequence-based Windows. The window contains the n most recent elements ("active"); older elements are "expired". Here n=5 for illustration; in practice n is huge.

Timestamp-based Windows. (Figure: elements p_1, …, p_13 arriving over time; the window contains the recently arrived elements, so the number of active elements varies.)

What is known on sliding windows:
[BDM 02] Random sampling
[DGIM 02] Sum, count, average, L_p for 0<p≤2, weakly additive functions
[DM 02] Rarity, similarity
[GT 02] Distributed sum, count
[FKZ 02], [CS 04] Diameter
[BDMO 03] Variance, k-medians
[GDDLM 03] Frequent elements
[AM 04] Counts, quantiles
[AGHLRS 04] Longest increasing subsequence (LIS)
[LT 06] Frequent items
[LT 06] Count
[ZG 06] Variance
[CCM 07] Entropy

Random Sampling

Random sampling is a fundamental approximation method: pick a subset S of D and use f(S) to approximate f(D).

Types of k-sampling. With replacement: the samples x_1, …, x_k are independent. Without replacement: repetitions are forbidden, i.e., x_i ≠ x_j for i ≠ j.
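The two types can be illustrated with Python's standard library (an illustrative aside, not part of the talk; variable names are mine):

```python
import random

random.seed(7)
population = list(range(100))
k = 5

# k-sampling WITH replacement: the k draws are independent,
# so the same element may appear more than once.
with_repl = [random.choice(population) for _ in range(k)]

# k-sampling WITHOUT replacement: repetitions are forbidden (x_i != x_j).
without_repl = random.sample(population, k)
```

Note that `random.sample` guarantees all k picks are distinct, while `random.choice` repeated k times does not.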

Properties of Random Sampling. A general, simple, first-to-try method. Stores actual elements, not an aggregation. Allows f to be chosen a posteriori, so one sample can serve multiple statistics. Provides effective solutions with worst-case guarantees. For many problems it is the only known solution.

Some known methods for data streams:
Reservoir sampling [V 85]
Concise sampling [GM 98]
Inverse sampling [CMR 05]
Weighted sampling [CMN 99]
Biased sampling [A 06]
Priority sampling [ADLT 05]
Dynamic sampling [FIS 05]
Chain sampling [BDM 02]

Streaming Sampling. Easy if N is fixed: pick a random index I from {1, 2, …, N} and output p_I. But N is not known in advance. Naïve methods: store the whole stream (linear memory), or "guess" the final value of N (the result is not truly uniform).

Reservoir Sampling (Vitter 85). Maintains k uniform samples without replacement using Θ(k) space, and outputs a sample for every prefix. Intuition: the probability of picking p decreases as N grows, so the probabilities can be adjusted dynamically.

Reservoir Sampling (Vitter 85). A reservoir (array) of k elements, initially empty. Algorithm: insert the first k elements into the reservoir. For i > k, pick p_i with probability k/i; if p_i is chosen, pick one of the samples in the reservoir uniformly at random and replace it with p_i. (For a single sample, k=1, the probability is 1/i.)
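The algorithm above fits in a few lines; here is a minimal Python sketch (function and variable names are mine, not from the talk):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Maintain k uniform samples without replacement in Theta(k) space."""
    reservoir = []
    for i, p in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(p)                 # the first k elements fill the reservoir
        elif rng.random() < k / i:              # pick p_i with probability k/i
            reservoir[rng.randrange(k)] = p     # replace a uniformly chosen sample
    return reservoir

random.seed(1)
sample = reservoir_sample(range(1000), k=10)
```

At every prefix of the stream the reservoir is a uniform k-subset of the elements seen so far, which is what "outputs a sample for every prefix" means.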

Sampling on Sliding Windows: Problem Definition Maintain uniform random sampling on sliding windows Output a sample for every window Use provably optimal memory

Sampling for Sliding Windows. Can we use the previous methods? No: samples expire. (Figure: a window of size n=5; the stored sample falls out of the window.)

Naïve Approach. Store the whole window and compute f(W) directly; this requires linear memory.

Periodic Sampling. Pick a sample p_i from the first window. When p_i expires, take the newly arrived element as the sample. Continue in this manner. (Figure: a window of size n=5.)
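Periodic sampling is trivial to implement; the sketch below (names are mine) also shows why it is predictable: once one sample position is known, every future sample sits exactly n positions later.

```python
def periodic_samples(stream, n, first):
    """Sample position `first` in the first window; whenever the current
    sample expires, take the element that just arrived. The sampled
    positions are exactly first, first + n, first + 2n, ..."""
    return [p for i, p in enumerate(stream) if i % n == first % n]

picks = periodic_samples(range(20), n=5, first=2)
assert picks == [2, 7, 12, 17]   # positions are n apart: fully predictable
```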

Periodic Sampling: problems. Vulnerable to malicious behavior: given one sample, it is possible to predict all future samples. Poor representation of periodic data: bad if the period "agrees" with the sampling interval. Unacceptable for applications.

Sampling on Sliding Windows: Problem Definition Maintain uniform random sampling on sliding windows Use provably optimal memory Samples on distinct windows are independent

Chain and Priority Methods (Babcock, Datar, Motwani, SODA 2002). Maintain uniform random sampling on sliding windows. Chain sampling: sequence-based windows, with replacement; uses optimal memory in expectation but O(k log n) w.h.p.; samples on distinct windows are weakly dependent. Priority sampling: timestamp-based windows, with replacement; uses optimal memory in expectation and w.h.p.; samples on distinct windows are independent.

S 3 Algorithms Maintain uniform random sampling on sliding windows Supports all cases Provably optimal Samples on distinct windows are independent

Window Sampling Taxonomy. Two axes: sequence-based vs. timestamp-based windows, and sampling with vs. without replacement.

Sampling with replacement on sequence-based windows:

Sampling          Memory               Dependency
Naïve             O(n)                 No
Periodic          O(k)                 Yes
Chain [BDM 02]    O(k) in expectation  Weak
S^3 (our result)  O(k)                 No

Sampling without replacement on sequence-based windows:

Sampling   Memory  Dependency
Naïve      O(n)    No
Periodic   O(k)    Yes
S^3        O(k)    No

Sampling with replacement on time-based windows:

Sampling           Memory             Dependency
Naïve              O(n)               No
Priority [BDM 02]  O(k log n) w.h.p.  No
S^3                O(k log n)         No

Sampling without replacement on time-based windows:

Sampling  Memory  Dependency
Naïve     O(n)    No
S^3       O(k)    No

S^3: Recap.

Window sampling      Sequence-based  Timestamp-based
With replacement     O(k)            O(k log n)
Without replacement  O(k)            O(k log n)

Concepts. Prior algorithms use a replacement policy for expired samples. The S^3 algorithms instead divide the stream into buckets, maintain sample(s) for each bucket, and apply a combination rule.

Sampling With Replacement for Sequence-Based Windows

Notations. The stream is divided into buckets B_1, B_2, …, B_{N/n}, B_{N/n+1}, …; each element is either expired, active, or future. (Figure: the buckets B_{N/n} and B_{N/n+1} span the current window.)

The Algorithm (for one sample). Divide D into buckets of size n. Maintain a random sample for each bucket (by the reservoir algorithm). Combine the samples of the buckets that have active elements; there are at most two such buckets. (Figure: reservoir samples R_1, R_2, …, R_{N/n}, R_{N/n+1}, one per bucket.)
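The bucket structure can be sketched as follows. This is an illustrative sketch of the bookkeeping only, not the paper's full algorithm (names are mine): the stream is cut into buckets of size n, each bucket keeps one reservoir sample, and only the at-most-two buckets that may still intersect the window are retained.

```python
import random

def bucket_samples(stream, n, rng=random):
    """After each element, yield the reservoir samples of the (at most two)
    buckets that may still contain active elements of the size-n window."""
    buckets = []                              # each entry: [bucket_start_index, sample]
    for i, p in enumerate(stream):
        if i % n == 0:
            buckets.append([i, p])            # new bucket; its first element is the sample
            if len(buckets) > 2:
                buckets.pop(0)                # older buckets are fully expired
        else:
            j = i % n + 1                     # p is the j-th element of its bucket
            if rng.random() < 1.0 / j:        # reservoir rule for a single sample
                buckets[-1][1] = p
        yield [s for _, s in buckets]

random.seed(3)
out = list(bucket_samples(range(23), n=5))
assert len(out[-1]) == 2                      # two buckets overlap the final window
```

Each yielded list holds a uniform sample per retained bucket; the combination rule of the next slides turns these into one uniform sample of the window.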

(Figures: Cases 1 and 2 of the combination rule, showing how the output sample X is chosen from the reservoir samples R_1 and R_2 of the two buckets that overlap the window.)

Sampling Without Replacement for Sequence-Based Windows

The Algorithm. Divide D into buckets of size n. Maintain k random samples for each bucket. Combine the samples of the buckets that have active elements. (Figure: k=2 samples R_{i,1}, R_{i,2} per bucket.)

(Figure: the per-bucket samples R_{1,1}, R_{1,2}, R_{2,1}, R_{2,2} of the two active buckets are combined into the output X.)

Sampling With Replacement for Timestamp-Based Windows

Timestamp-based windows. Here n is unknown and can change arbitrarily. Does our approach still work? How do we divide the stream into buckets? How do we combine samples?

The main idea, revised. Suppose we can maintain buckets A and B as before, with a sample from each, where a = |A|, b = |B|, and c = |A ∩ W| is the number of still-active elements of A. Combination rule: if the sample from A has expired, X = the sample from B; if the sample from A is active, X = the sample from A with probability a/n, and otherwise X = the sample from B. (Figure: n=13, a=|A|=5, b=|B|=10, c=|A∩W|=3.)
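The combination rule can be checked empirically. The sketch below is illustrative only (names are mine, and n is treated as known purely to verify uniformity; the point of the later slides is precisely that n is unknown):

```python
import random
from collections import Counter

random.seed(42)
a, b, c = 5, 10, 3          # |A|, |B|, and |A ∩ W|, as in the figure
n = b + c                   # window size; known here only to verify the rule
A = list(range(a))          # elements 0 .. a-c-1 of A are expired, the rest are active
B = list(range(a, a + b))   # all of B is active
active = A[a - c:] + B

def combine():
    x = random.choice(A)             # uniform sample from A
    if x < a - c:                    # the sample from A has expired
        return random.choice(B)
    if random.random() < a / n:      # keep the active A-sample with probability a/n
        return x
    return random.choice(B)          # otherwise fall back to the sample from B

trials = 130_000
counts = Counter(combine() for _ in range(trials))
for e in active:                     # every active element appears ~1/n of the time
    assert abs(counts[e] / trials - 1 / n) < 0.01
```

Why it works: each active element of A is output with probability (1/a)(a/n) = 1/n, and each element of B with probability (1 - c/n)(1/b) = (b/n)(1/b) = 1/n.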

Correctness. (Figure: the same configuration, n=13, a=5, b=10, c=3; the rule outputs each active element with probability 1/n.)

Conclusions. The combination rule works if: (1) a ≤ n, and (2) it is possible to generate events with probability a/n. (Figure: the same configuration, n=13, a=5, b=10, c=3.)

The First Problem. How do we maintain A and B at any moment so that |A| is less than n?

The solution: ζ-decomposition. A list of buckets B_1, …, B_s that together contain all active elements, with 2 samples from each bucket; B_1 may contain expired elements as well. Define A and B from the decomposition, and ensure that |A| ≤ |B| and s = O(log n).

ζ-decomposition: implementation. The idea is similar to smooth histograms, with a slightly different structure.

The Second Problem. Assuming a ≤ b ≤ n, how do we generate events with probability a/n? Here a and b are known, c is unknown, and n = b + c. (Figure: n=13, a=5, b=10, c=3.)

Approach. Generate a "biased" sample Y on A such that Y expires with probability b/n, and use Y to obtain probability a/n. The details are in the paper.

Lemma 1. Given a random sample from A, it is possible to construct a random variable Y on A such that Y expires with probability b/n. (Figure: n=13, a=5, b=10, c=3.)

Generate a random vector V on D = A × {0,1}^a: V = (Q, H_1, …, H_a), a vector of independent random variables with Q ~ U(A) and H_i = 1 with probability ab/((b+i)(b+i+1)). Define a set of subspaces of D: A_i = {p_{N-b-i}} × {0,1}^{i-1} × {1} × {0,1}^{a-i}.

Lemma 2. Given Y from Lemma 1, it is possible to construct a 0-1 random variable Z such that P(Z=1) = a/n. Proof sketch: generate an event T that happens with probability a/b; this is possible since a ≤ b and both a and b are known.

Sampling Without Replacement for Timestamp-Based Windows

Main idea. Implement k-sampling without replacement using k independent samples. What can we do if the same point is sampled more than once? Approach: draw the k samples from different domains.

Cascading lemma. Let H_i^j denote a j-sample (without replacement) from {1, …, i}. Given H_i^j and H_{i+1}^1, we can construct H_{i+1}^{j+1}.

Cascading Lemma (illustration). Starting from the 1-samples H_{n-k+1}^1, H_{n-k+2}^1, …, H_n^1, the lemma cascades them into H_{n-k+2}^2, H_{n-k+3}^3, …, H_{n-1}^{k-1}, H_n^k.

Conclusions. Random sampling on sliding windows is optimally solved, and it gives worst-case solutions for many problems.

Thank you!