More Stream-Mining: Counting Distinct Elements, Computing "Moments"


Counting Distinct Elements
Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far.
Obvious approach: maintain the set of elements seen.

Applications
How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?).
How many different Web pages does each customer request in a week?

Using Small Storage
Real problem: what if we do not have space to store the complete set?
Estimate the count in an unbiased way. Accept that the count may be in error, but limit the probability that the error is large.

Flajolet-Martin* Approach
Pick a hash function h that maps each of the n elements to log₂ n bits, uniformly. It is important that the hash function behave (almost) like a random permutation of the elements.
For each stream element a, let r(a) be the number of trailing 0's in h(a).
Record R = the maximum r(a) seen. Estimate = 2^R.
*Really based on a variant due to AMS (Alon, Matias, and Szegedy).
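To make the procedure concrete, here is a minimal Python sketch of one Flajolet-Martin estimate. The helper names (trailing_zeros, fm_estimate) and the use of a truncated SHA-1 digest as a stand-in for the idealized uniform hash are our own assumptions:

```python
import hashlib

def trailing_zeros(x):
    # Number of trailing 0 bits in x (returns 0 for x == 0 to stay bounded).
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream, salt=""):
    # One Flajolet-Martin estimate: track R = max trailing zeros of h(a),
    # then return 2^R. Different salts simulate different hash functions.
    R = 0
    for a in stream:
        digest = hashlib.sha1((salt + str(a)).encode()).digest()
        h = int.from_bytes(digest[:8], "big")  # 64 hash bits
        R = max(R, trailing_zeros(h))
    return 2 ** R
```

Unlike the obvious approach (len(set(stream))), this keeps only a single small integer R per hash function.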

Why It Works
The probability that a given h(a) ends in at least r 0's is 2^-r.
If there are m distinct elements in the stream, the probability that R ≥ r is 1 - (1 - 2^-r)^m.
If 2^r >> m, this probability is ≈ m/2^r (small). If 2^r << m, it is ≈ 1.
Thus, 2^R will almost always be around m.
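Spelling out the approximation behind this argument (a standard calculation, written in LaTeX for completeness):

```latex
\Pr[R \ge r] = 1 - \left(1 - 2^{-r}\right)^{m} \approx 1 - e^{-m/2^{r}}
\approx
\begin{cases}
  m/2^{r} & \text{if } 2^{r} \gg m \quad (\text{small}),\\
  1 & \text{if } 2^{r} \ll m.
\end{cases}
```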

Why It Doesn't Work
E(2^R) is actually infinite: the probability halves each time R grows from R to R+1, but the value 2^R doubles.
The workaround is to use many hash functions and get many samples of R. But how should the samples be combined? Average? One very large value could dominate. Median? All values are a power of 2, so the estimate could never fall between consecutive powers of 2.

Solution
Partition your samples into small groups. Take the average of each group. Then take the median of the averages.
Averaging within groups smooths out the power-of-2 granularity; taking the median of the averages protects against the occasional very large value.
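A minimal sketch of this combining rule (the name median_of_means and the group layout are our choices):

```python
import statistics

def median_of_means(samples, group_size):
    # Average within small groups, then take the median of those averages.
    groups = [samples[i:i + group_size]
              for i in range(0, len(samples), group_size)]
    return statistics.median(statistics.mean(g) for g in groups)
```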

Moments (New Topic)
Suppose a stream has elements chosen from a set of n values. Let m_i be the number of times value i occurs. The k-th moment is the sum of (m_i)^k over all i.
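For a stream small enough to hold in memory, the definition translates directly into code (the helper name kth_moment is ours):

```python
from collections import Counter

def kth_moment(stream, k):
    # m_i = number of occurrences of value i; k-th moment = sum of m_i^k.
    return sum(m ** k for m in Counter(stream).values())
```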

Special Cases
0th moment = number of different elements in the stream: the problem just considered.
1st moment = sum of the counts m_i = length of the stream. Easy to compute.
2nd moment = "surprise number" = a measure of how uneven the distribution is.

Example: Surprise Number
Stream of length 100; 11 distinct values appear.
Unsurprising: counts 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise number = 10^2 + 10·9^2 = 910.
Surprising: counts 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Surprise number = 90^2 + 10·1^2 = 8,110.
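A quick sanity check of these two numbers with the kth_moment helper sketched above (the concrete element values are arbitrary):

```python
unsurprising = ["a"] * 10 + list("bcdefghijk") * 9  # counts: 10 and ten 9's
surprising = ["a"] * 90 + list("bcdefghijk")        # counts: 90 and ten 1's
assert kth_moment(unsurprising, 2) == 910
assert kth_moment(surprising, 2) == 8110
```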

AMS Method
Works for all moments and gives an unbiased estimate. We'll concentrate on the 2nd moment.
Based on the calculation of many random variables X. Each requires a count in main memory, so their number is limited.

One Random Variable
Assume the stream has length n. Pick a random time to start, so that any time is equally likely. Let a be the stream element at the chosen time.
X = n × (2 × (the number of a's in the stream from the chosen time on) − 1).
Note: we need only store the element a and the running count.
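A minimal offline sketch of one such variable (it holds the whole stream in memory for clarity; a true streaming version would fix the position in advance and keep only the element and a running count; the name ams_x is ours):

```python
import random

def ams_x(stream):
    # One AMS variable for the 2nd moment.
    n = len(stream)
    t = random.randrange(n)        # uniformly random start time
    a = stream[t]                  # element at the chosen time
    count = stream[t:].count(a)    # occurrences of a from time t onward
    return n * (2 * count - 1)
```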

Expected Value of X
The 2nd moment is Σ_a (m_a)^2.
E(X) = (1/n) Σ_(times t) n × (2 × (number of times the element at time t appears from time t on) − 1)
     = Σ_a (1/n) × n × (1 + 3 + 5 + … + (2m_a − 1))
     = Σ_a (m_a)^2,
using the identity 1 + 3 + 5 + … + (2m_a − 1) = (m_a)^2.

Combining Samples
Compute as many variables X as can fit in available memory. Average them in groups, then take the median of the averages.
A proper balance of group sizes and number of groups assures not only the correct expected value but also that the expected error goes to 0 as the number of samples grows.
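Putting the sketches above together on the "surprising" stream from the earlier example (all helper names are our own):

```python
stream = ["a"] * 90 + list("bcdefghijk")  # true 2nd moment = 8110
samples = [ams_x(stream) for _ in range(1000)]
print(median_of_means(samples, group_size=20))  # typically near 8110
```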

Problem: Streams Never End
We assumed there was a number n, the number of positions in the stream. But real streams go on forever, so n is a variable: the number of elements seen so far.

Fixups
(1) The variables X have n as a factor, so we must rescale them as n grows.
(2) Suppose we can only store k counts. We must throw some X's out as time goes on. Objective: each X is selected with probability k/n.

Solution to (2)
Choose the first k elements. When the n-th element arrives (n > k), choose it with probability k/n. If you choose it, throw out one of the previously stored variables, selected uniformly at random, to make room.
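This is the classic reservoir-sampling scheme; a minimal sketch (the function name and calling convention are our assumptions):

```python
import random

def reservoir_update(reservoir, k, n, new_item):
    # Maintain a uniform sample of size k; call with n = 1, 2, 3, ...
    if n <= k:
        reservoir.append(new_item)  # keep the first k items
    elif random.random() < k / n:
        # Keep the n-th item with probability k/n, evicting a stored
        # item chosen uniformly at random.
        reservoir[random.randrange(k)] = new_item
```

By induction, after n arrivals each item is stored with probability exactly k/n, which is the stated objective.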