1 More Stream-Mining: Counting How Many Elements, Computing “Moments”

2 Counting Distinct Elements
- Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far.
- Obvious approach: maintain the set of elements seen.
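For contrast, the obvious approach is a few lines of Python (a minimal sketch; it is exact, but the set can grow as large as the number of distinct elements):

```python
def distinct_count_exact(stream):
    """The obvious approach: remember every element ever seen."""
    seen = set()
    for a in stream:
        seen.add(a)
    return len(seen)
```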

3 Applications
- How many different words are found among the Web pages being crawled at a site?
  - Unusually low or high numbers could indicate artificial pages (spam?).
- How many different Web pages does each customer request in a week?

4 Using Small Storage
- Real problem: what if we do not have space to store the complete set?
- Estimate the count in an unbiased way.
- Accept that the count may be in error, but limit the probability that the error is large.

5 Flajolet-Martin* Approach
- Pick a hash function h that maps each of the n elements to log₂(n) bits, uniformly.
  - Important that the hash function be (almost) a random permutation of the elements.
- For each stream element a, let r(a) be the number of trailing 0’s in h(a).
- Record R = the maximum r(a) seen.
- Estimate = 2^R.

*Really based on a variant due to AMS (Alon, Matias, and Szegedy).
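A minimal sketch of the scheme in Python. All names here are illustrative, and md5 merely stands in for the uniform hash h the slide assumes:

```python
import hashlib
import random

def h(a, num_bits=32, seed=0):
    """Stand-in for the uniform hash the slide assumes: hash the
    element together with a seed and keep num_bits bits."""
    digest = hashlib.md5(f"{seed}:{a}".encode()).digest()
    return int.from_bytes(digest[:num_bits // 8], "big")

def trailing_zeros(x, num_bits=32):
    """Number of trailing 0 bits in x (num_bits if x == 0)."""
    if x == 0:
        return num_bits
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream, num_bits=32, seed=0):
    """Flajolet-Martin: R = max trailing-zero count over the stream;
    the estimate of the distinct count is 2^R."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(h(a, num_bits, seed), num_bits))
    return 2 ** R

# Example: 10,000 stream elements, about 1,000 of them distinct.
random.seed(42)
stream = [random.randrange(1000) for _ in range(10_000)]
print(fm_estimate(stream))  # typically within a small factor of 1,000
```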

6 Why It Works
- The probability that a given h(a) ends in at least r 0’s is 2^(−r).
- If there are m different elements, the probability that R ≥ r is 1 − (1 − 2^(−r))^m.
  - Here 1 − 2^(−r) is the probability that a given h(a) ends in fewer than r 0’s, and (1 − 2^(−r))^m is the probability that all h(a)’s end in fewer than r 0’s.

7 Why It Works --- (2)
- Since 2^(−r) is small, (1 − 2^(−r))^m ≈ e^(−m·2^(−r)), using the first two terms of the Taylor expansion of e^x.
- If 2^r >> m, then 1 − (1 − 2^(−r))^m ≈ 1 − (1 − m·2^(−r)) = m·2^(−r) ≈ 0.
- If 2^r << m, then 1 − (1 − 2^(−r))^m ≈ 1 − e^(−m·2^(−r)) ≈ 1.
- Thus, 2^R will almost always be around m.

8 Why It Doesn’t Work
- E(2^R) is actually infinite.
  - The probability halves each time R goes to R + 1, but the value 2^R doubles.
- The workaround involves using many hash functions and getting many samples.
- How are the samples combined?
  - Average? What if there is one very large value?
  - Median? All values are a power of 2.

9 Solution
- Partition your samples into small groups.
- Take the average of each group.
- Then take the median of the averages.
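A sketch of that combination rule (group_size is a tuning knob the slides leave open):

```python
import statistics

def median_of_means(samples, group_size):
    """Partition the samples into groups, average each group, and
    return the median of the group averages."""
    groups = [samples[i:i + group_size]
              for i in range(0, len(samples), group_size)]
    means = [sum(g) / len(g) for g in groups]
    return statistics.median(means)

# e.g., 100 estimates 2^R (one per hash function) in groups of 10:
# combined = median_of_means(estimates, group_size=10)
```

The averages smooth out the power-of-2 granularity of the individual estimates; the median then discards any group dragged upward by one huge 2^R.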

10 Moments (New Topic)
- Suppose a stream has elements chosen from a set of n values.
- Let m_i be the number of times value i occurs.
- The k-th moment is the sum of (m_i)^k over all i.

11 Special Cases
- 0th moment = number of different elements in the stream.
  - The problem just considered.
- 1st moment = sum of the numbers of occurrences = length of the stream.
  - Easy to compute.
- 2nd moment = surprise number = a measure of how uneven the distribution is.

12 Example: Surprise Number
- Stream of length 100; 11 values appear.
- Unsurprising: counts 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise # = 10² + 10·9² = 910.
- Surprising: counts 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Surprise # = 90² + 10·1² = 8,110.
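Both examples check out directly from the counts:

```python
def surprise_number(counts):
    """2nd moment: the sum of the squared occurrence counts m_i."""
    return sum(m * m for m in counts)

print(surprise_number([10] + [9] * 10))  # 910
print(surprise_number([90] + [1] * 10))  # 8110
```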

13 AMS Method
- Works for all moments; gives an unbiased estimate.
- We’ll just concentrate on the 2nd moment.
- Based on the calculation of many random variables X.
  - Each requires a count in main memory, so their number is limited.

14 One Random Variable
- Assume the stream has length n.
- Pick a random time to start, so that any time is equally likely.
- Let a be the stream element at the chosen time.
- X = n × ((twice the number of a’s in the stream starting at the chosen time) − 1).
  - Note: just store the count.
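A sketch of one such variable, written as if the whole stream were available as a list (the streaming algorithm, of course, keeps only the running count):

```python
import random

def ams_variable(stream):
    """One AMS sample X for the 2nd moment: pick a uniform start
    time t, and count occurrences of the element at t from t on."""
    n = len(stream)
    t = random.randrange(n)       # uniformly random starting time
    a = stream[t]
    count = stream[t:].count(a)   # this count is all we must store
    return n * (2 * count - 1)
```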

15 Expected Value of X
- The 2nd moment is Σ_a (m_a)².
- E(X) = (1/n) Σ_{all times t} n × ((twice the number of times the stream element at time t appears from that time on) − 1)
- = Σ_a (1/n) × n × (1 + 3 + 5 + … + (2m_a − 1))
- = Σ_a (m_a)², since the sum of the first m_a odd integers is (m_a)².
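A quick empirical check of this claim, reusing ams_variable from the sketch above (the specific stream here is arbitrary):

```python
from collections import Counter
import random

random.seed(1)
stream = [random.choice("abcde") for _ in range(200)]
exact = sum(m * m for m in Counter(stream).values())
estimate = sum(ams_variable(stream) for _ in range(20_000)) / 20_000
print(exact, round(estimate))  # the two agree up to sampling noise
```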

16 Combining Samples
- Compute as many variables X as can fit in available memory.
- Average them in groups.
- Take the median of the averages.
- A proper balance of group sizes and number of groups assures not only the correct expected value, but also that the expected error goes to 0 as the number of samples gets large.

17 Problem: Streams Never End
- We assumed there was a number n, the number of positions in the stream.
- But real streams go on forever, so n is a variable --- the number of inputs seen so far.

18 Fixups
1. The variables X have n as a factor --- keep n separately; just hold the count in X.
2. Suppose we can only store k counts. We must throw some X’s out as time goes on.
   - Objective: each starting time t is selected with probability k/n.

19 Solution to (2)
- Choose the first k times for the k variables.
- When the n-th element arrives (n > k), choose its time with probability k/n.
- If you choose it, throw out one of the previously stored variables, selected with equal probability.
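This is standard reservoir sampling; a sketch of the update rule follows (in the AMS setting, each stored item would be an element together with its running count):

```python
import random

def reservoir_update(reservoir, k, n, item):
    """After n arrivals, each item survives in the size-k reservoir
    with probability k/n."""
    if len(reservoir) < k:
        reservoir.append(item)
    elif random.randrange(n) < k:              # keep with prob. k/n
        reservoir[random.randrange(k)] = item  # evict one uniformly

# Usage: start a new count for each chosen position.
reservoir, n = [], 0
for element in range(1_000):
    n += 1
    reservoir_update(reservoir, k=10, n=n, item=element)
```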