Introduction to Stream Computing and Reservoir Sampling

Data Streams
We do not know the entire data set in advance: Google queries, Twitter feeds, Internet traffic. Convenient to think of the data as infinite.

Contd.
Input data elements enter one after another (i.e., in a stream). We cannot store the entire stream in an accessible form. How do we make critical calculations about the stream using a limited amount of memory?

Applications
Mining query streams: Google wants to know which queries are more frequent today than yesterday.
Mining click streams: Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour.
Mining social network news feeds: e.g., look for trending topics on Twitter and Facebook.
From http://www.mmds.org

Applications
Sensor networks: many sensors feeding into a central controller.
Telephone call records: data feeds into customer bills as well as settlements between telephone companies.
IP packets monitored at a switch: gather information for optimal routing; detect denial-of-service attacks.
From http://www.mmds.org

Formalization: One-Pass Model
At time t, we observe x_t. For analysis we assume that what we have observed so far is a sequence D_n = {x_1, x_2, ..., x_n}, and we do not know n in advance. At any time t we have a limited memory budget, which is much less than t (or n), so storing every observation is out of the question. Assume our goal is to compute f(D_n). Essentially, the algorithm should, at any point in time t, be able to compute f(D_t). (Why? Because we do not know when the stream ends, so we must be able to answer at any moment.)
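As a concrete illustration of the one-pass model, here is a minimal sketch that maintains f(D_t) = mean of the stream so far using O(1) memory; the function and variable names are illustrative, not from the slides.

```python
def running_mean(stream):
    """One-pass model: report f(D_t) after each element using O(1) memory."""
    total, count = 0.0, 0
    for x in stream:          # observe x_t one element at a time
        total += x
        count += 1
        # At any time t we can report f(D_t) without storing D_t itself.
        yield total / count

# Example: consume a (finite) stream and print the estimate after each element.
for estimate in running_mean([3, 1, 4, 1, 5, 9]):
    print(estimate)
```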

Basic Question: Sampling
If we get a representative sample of the stream, then we can do our analysis on it. Example: find trending tweets from a large enough random sample of the stream. How do we sample from a stream?

Sampling a Fraction
Take a random sample of, say, 1/10 of the total size? How? For each arriving element, generate a random integer in [1, 10]; if that number is 1, keep the element (see the sketch below). Issues? The stream size is unknown, so the sample can grow without bound. How about sampling bias?
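A minimal sketch of this per-element coin flip, assuming a Python-style iterable of elements (the function name is an illustrative choice):

```python
import random

def sample_fraction(stream, k=10):
    """Keep each element independently with probability 1/k (here 1/10)."""
    kept = []
    for x in stream:
        if random.randint(1, k) == 1:   # random integer in [1, k]
            kept.append(x)              # sample size grows with the stream
    return kept
```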

Fraction of duplicates in the original vs. the sample?
Say the original data has U + 2D elements, where U elements are unique and each of the remaining D ones has exactly one duplicate (appears twice). Fraction of duplicates = 2D / (U + 2D). What is the fraction of duplicates in a 1/10 random sample? The sample will contain about U/10 of the singleton elements and 2D/10 of the duplicated elements at least once, but only D/100 complete pairs of duplicates (D/100 = 1/10 · 1/10 · D). Of the D "duplicates", 18D/100 appear exactly once (18D/100 = ((1/10 · 9/10) + (9/10 · 1/10)) · D). What happens to the estimate? The sample has (U + 2D)/10 elements in expectation, of which only 2D/100 belong to a fully sampled pair, so the estimated fraction is 2D / (10(U + 2D)): about ten times smaller than the truth (the simulation sketch below illustrates this).
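A small simulation under the same U + 2D setup (parameter values chosen arbitrarily for illustration) comparing the true duplicate fraction with the one estimated from a 1/10 per-element sample:

```python
import random
from collections import Counter

def duplicate_fraction_bias(U=100_000, D=10_000, k=10):
    # Build the U + 2D elements: U singletons plus D values appearing twice.
    data = list(range(U)) + 2 * [U + i for i in range(D)]
    true_frac = 2 * D / (U + 2 * D)

    # Keep each element independently with probability 1/k.
    sample = [x for x in data if random.randint(1, k) == 1]

    # Fraction of sampled elements whose value appears twice in the sample.
    counts = Counter(sample)
    dup_elems = sum(c for c in counts.values() if c == 2)
    est_frac = dup_elems / len(sample)
    return true_frac, est_frac

print(duplicate_fraction_bias())  # estimate is roughly 1/10 of the truth
```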

Fixed Sample Size
We want to maintain a sample of s elements from the stream. Whenever we stop, after n elements, we should have: every element seen so far has probability s/n of being in the sample, and we have exactly s elements. Can this be done?

Reservoir Sampling of Size s
Observe x_n. If n ≤ s, keep x_n. Else, with probability s/n, select x_n and let it replace one of the s elements we already have, chosen uniformly at random (a sketch implementation follows below).
Claim: at any time n, every element in the sequence x_1, x_2, ..., x_n has exactly s/n chance of being in the sample.
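A minimal sketch of the algorithm, assuming elements arrive via a Python iterable (the function name is an illustrative choice, not from the slides):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform sample of s elements from a stream of unknown length."""
    sample = []
    for n, x in enumerate(stream, start=1):   # n = number of elements seen so far
        if n <= s:
            sample.append(x)                  # first s elements fill the reservoir
        elif random.random() < s / n:         # keep x_n with probability s/n
            sample[random.randrange(s)] = x   # replace a uniformly chosen slot
    return sample

# Example: a uniform sample of 5 elements from a stream of 10,000 integers.
print(reservoir_sample(range(10_000), 5))
```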

Proof: By Induction
Inductive hypothesis: after n elements, the sample S contains each element seen so far with probability s/n.
Now element n+1 arrives.
Inductive step: for an element already in S, the probability that the algorithm keeps it in S is
(1 − s/(n+1)) + (s/(n+1)) · ((s−1)/s) = n/(n+1),
where the first term is the probability that element n+1 is discarded, and the second is the probability that element n+1 is selected but some other element of the sample is picked for replacement.
So, at time n, elements in S were there with probability s/n, and from time n to n+1 an element stays in S with probability n/(n+1). Hence the probability an element is in S at time n+1 is (s/n) · (n/(n+1)) = s/(n+1).
From http://www.mmds.org

Weighted Reservoir Sampling
Every element x_i has a weight w_i. We want to sample s elements from the stream. Whenever we stop, after n elements, we should have: every element x_i has probability w_i / Σ_i w_i of being sampled, and we have exactly s elements.

Attempt 1
Pretend there are w_i copies of x_i whenever you observe x_i, and run ordinary reservoir sampling on this expanded stream. Make sure Σ_i w_i is a large enough integer and every w_i is an integer. Issues?

Solution (Pavlos S. Efraimidis and Paul G. Spirakis, 2006)
Observe x_i. Generate r_i uniformly in [0, 1]. Set score_i = r_i^(1/w_i). Report the s elements with the top-s values of score_i.
Question: Can this be done on a stream? (See the min-heap sketch below.)
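A minimal streaming sketch of this scheme, keeping the s largest scores seen so far in a min-heap; the names are assumptions, and the stream is taken to yield (element, weight) pairs with positive weights:

```python
import heapq
import random

def weighted_reservoir_sample(stream, s):
    """Efraimidis-Spirakis style sampling: keep the s items with the largest
    scores r_i ** (1 / w_i), maintained in a min-heap of (score, element)."""
    heap = []                                   # min-heap keyed on score
    for x, w in stream:                         # stream of (element, weight), w > 0
        score = random.random() ** (1.0 / w)    # r_i uniform in [0, 1)
        if len(heap) < s:
            heapq.heappush(heap, (score, x))
        elif score > heap[0][0]:                # beats the smallest retained score
            heapq.heapreplace(heap, (score, x))
    return [x for _, x in heap]

# Example: heavier items are more likely to be in the sample.
items = [("a", 1.0), ("b", 5.0), ("c", 0.5), ("d", 2.0)]
print(weighted_reservoir_sample(iter(items), 2))
```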

Why It Works
Lemma: Let r_1, r_2 be independent uniformly distributed random variables over [0, 1], and let X_1 = r_1^(1/w_1) and X_2 = r_2^(1/w_2), where w_1, w_2 ≥ 0. Then Pr[X_1 ≤ X_2] = w_2 / (w_1 + w_2).
Proof: On board.
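A short derivation of the lemma, sketched here under the assumption w_1, w_2 > 0 (the slides leave the proof to the board):

```latex
% For X = r^{1/w} with r uniform on [0,1] and w > 0:
%   \Pr[X \le x] = \Pr[r \le x^{w}] = x^{w}, so the density of X_2 is w_2 x^{w_2 - 1}.
\Pr[X_1 \le X_2]
  = \int_0^1 \Pr[X_1 \le x]\, f_{X_2}(x)\, dx
  = \int_0^1 x^{w_1}\, w_2\, x^{w_2 - 1}\, dx
  = \frac{w_2}{w_1 + w_2}.
```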