More Stream-Mining: Counting Distinct Elements, Computing "Moments"


Counting Distinct Elements
Problem: a data stream consists of elements chosen from a set of size n. Maintain a count of the number of distinct elements seen so far.
Obvious approach: maintain the set of elements seen.

Applications
How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?).
How many different Web pages does each customer request in a week?

Using Small Storage
Real problem: what if we do not have space to store the complete set?
Estimate the count in an unbiased way. Accept that the count may be in error, but limit the probability that the error is large.

Flajolet-Martin* Approach
Pick a hash function h that maps each of the n elements to log₂ n bits, uniformly. It is important that the hash function behave (almost) like a random permutation of the elements.
For each stream element a, let r(a) be the number of trailing 0's in h(a).
Record R = the maximum r(a) seen. Estimate = 2^R.
*Really based on a variant due to AMS (Alon, Matias, and Szegedy).
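To make the procedure concrete, here is a minimal Python sketch of one Flajolet-Martin estimate. The helper names (trailing_zeros, fm_estimate) and the use of a truncated SHA-1 digest as a stand-in for the idealized uniform hash are our own assumptions:

```python
import hashlib

def trailing_zeros(x):
    # Number of trailing 0 bits in x (returns 0 for x == 0 to stay bounded).
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream, salt=""):
    # One Flajolet-Martin estimate: track R = max trailing zeros of h(a),
    # then return 2^R. Different salts simulate different hash functions.
    R = 0
    for a in stream:
        digest = hashlib.sha1((salt + str(a)).encode()).digest()
        h = int.from_bytes(digest[:8], "big")  # 64 hash bits
        R = max(R, trailing_zeros(h))
    return 2 ** R
```

Unlike the obvious approach (len(set(stream))), this keeps only a single small integer R per hash function.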

Why It Works
The probability that a given h(a) ends in at least r 0's is 2^-r.
If there are m distinct elements in the stream, the probability that R ≥ r is 1 - (1 - 2^-r)^m.
If 2^r >> m, this probability is ≈ m/2^r (small). If 2^r << m, it is ≈ 1.
Thus, 2^R will almost always be around m.
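Spelling out the approximation behind this argument (a standard calculation, written in LaTeX for completeness):

```latex
\Pr[R \ge r] = 1 - \left(1 - 2^{-r}\right)^{m} \approx 1 - e^{-m/2^{r}}
\approx
\begin{cases}
  m/2^{r} & \text{if } 2^{r} \gg m \quad (\text{small}),\\
  1 & \text{if } 2^{r} \ll m.
\end{cases}
```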

Why It Doesn't Work
E(2^R) is actually infinite: the probability halves each time R grows from R to R+1, but the value 2^R doubles.
The workaround is to use many hash functions and get many samples of R. But how should the samples be combined? Average? One very large value could dominate. Median? All values are a power of 2, so the estimate could never fall between consecutive powers of 2.

Solution
Partition your samples into small groups. Take the average of each group. Then take the median of the averages.
Averaging within groups smooths out the power-of-2 granularity; taking the median of the averages protects against the occasional very large value.
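A minimal sketch of this combining rule (the name median_of_means and the group layout are our choices):

```python
import statistics

def median_of_means(samples, group_size):
    # Average within small groups, then take the median of those averages.
    groups = [samples[i:i + group_size]
              for i in range(0, len(samples), group_size)]
    return statistics.median(statistics.mean(g) for g in groups)
```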

Moments (New Topic)
Suppose a stream has elements chosen from a set of n values. Let m_i be the number of times value i occurs. The k-th moment is the sum of (m_i)^k over all i.
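For a stream small enough to hold in memory, the definition translates directly into code (the helper name kth_moment is ours):

```python
from collections import Counter

def kth_moment(stream, k):
    # m_i = number of occurrences of value i; k-th moment = sum of m_i^k.
    return sum(m ** k for m in Counter(stream).values())
```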

Special Cases
0th moment = number of different elements in the stream: the problem just considered.
1st moment = sum of the counts m_i = length of the stream. Easy to compute.
2nd moment = "surprise number" = a measure of how uneven the distribution is.

Example: Surprise Number
Stream of length 100; 11 distinct values appear.
Unsurprising: counts 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise number = 10^2 + 10·9^2 = 910.
Surprising: counts 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Surprise number = 90^2 + 10·1^2 = 8,110.
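A quick sanity check of these two numbers with the kth_moment helper sketched above (the concrete element values are arbitrary):

```python
unsurprising = ["a"] * 10 + list("bcdefghijk") * 9  # counts: 10 and ten 9's
surprising = ["a"] * 90 + list("bcdefghijk")        # counts: 90 and ten 1's
assert kth_moment(unsurprising, 2) == 910
assert kth_moment(surprising, 2) == 8110
```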

AMS Method
Works for all moments and gives an unbiased estimate. We'll concentrate on the 2nd moment.
Based on the calculation of many random variables X. Each requires a count in main memory, so their number is limited.

One Random Variable
Assume the stream has length n. Pick a random time to start, so that any time is equally likely. Let a be the stream element at the chosen time.
X = n × (2 × (the number of a's in the stream from the chosen time on) − 1).
Note: we need only store the element a and the running count.
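A minimal offline sketch of one such variable (it holds the whole stream in memory for clarity; a true streaming version would fix the position in advance and keep only the element and a running count; the name ams_x is ours):

```python
import random

def ams_x(stream):
    # One AMS variable for the 2nd moment.
    n = len(stream)
    t = random.randrange(n)        # uniformly random start time
    a = stream[t]                  # element at the chosen time
    count = stream[t:].count(a)    # occurrences of a from time t onward
    return n * (2 * count - 1)
```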

Expected Value of X
The 2nd moment is Σ_a (m_a)^2.
E(X) = (1/n) Σ_(times t) n × (2 × (number of times the element at time t appears from time t on) − 1)
     = Σ_a (1/n) × n × (1 + 3 + 5 + … + (2m_a − 1))
     = Σ_a (m_a)^2,
using the identity 1 + 3 + 5 + … + (2m_a − 1) = (m_a)^2.

Combining Samples
Compute as many variables X as can fit in available memory. Average them in groups, then take the median of the averages.
A proper balance of group sizes and number of groups assures not only the correct expected value but also that the expected error goes to 0 as the number of samples grows.
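Putting the sketches above together on the "surprising" stream from the earlier example (all helper names are our own):

```python
stream = ["a"] * 90 + list("bcdefghijk")  # true 2nd moment = 8110
samples = [ams_x(stream) for _ in range(1000)]
print(median_of_means(samples, group_size=20))  # typically near 8110
```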

Problem: Streams Never End
We assumed there was a number n, the number of positions in the stream. But real streams go on forever, so n is a variable: the number of elements seen so far.

Fixups
(1) The variables X have n as a factor, so we must rescale them as n grows.
(2) Suppose we can only store k counts. We must throw some X's out as time goes on. Objective: each X is selected with probability k/n.

Solution to (2)
Choose the first k elements. When the n-th element arrives (n > k), choose it with probability k/n. If you choose it, throw out one of the previously stored variables, selected uniformly at random, to make room.
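This is the classic reservoir-sampling scheme; a minimal sketch (the function name and calling convention are our assumptions):

```python
import random

def reservoir_update(reservoir, k, n, new_item):
    # Maintain a uniform sample of size k; call with n = 1, 2, 3, ...
    if n <= k:
        reservoir.append(new_item)  # keep the first k items
    elif random.random() < k / n:
        # Keep the n-th item with probability k/n, evicting a stored
        # item chosen uniformly at random.
        reservoir[random.randrange(k)] = new_item
```

By induction, after n arrivals each item is stored with probability exactly k/n, which is the stated objective.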