Getting the Most out of Your Sample Edith Cohen Haim Kaplan Tel Aviv University.

Slides:

Advertisements

Similar presentations

Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.

Advertisements

1+eps-Approximate Sparse Recovery Eric Price MIT David Woodruff IBM Almaden.

Tight Bounds for Distributed Functional Monitoring David Woodruff IBM Almaden Qin Zhang Aarhus University MADALGO.

The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.

Truthful Mechanisms for Combinatorial Auctions with Subadditive Bidders Speaker: Shahar Dobzinski Based on joint works with Noam Nisan & Michael Schapira.

Why Simple Hash Functions Work : Exploiting the Entropy in a Data Stream Michael Mitzenmacher Salil Vadhan And improvements with Kai-Min Chung.

Distributed Assignment of Encoded MAC Addresses in Sensor Networks By Curt Schcurgers Gautam Kulkarni Mani Srivastava Presented By Charuka Silva.

Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.

Fast Algorithms For Hierarchical Range Histogram Constructions

3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.

Introduction to Histograms Presented By: Laukik Chitnis

Mining Data Streams.

Congestion Games with Player- Specific Payoff Functions Igal Milchtaich, Department of Mathematics, The Hebrew University of Jerusalem, 1993 Presentation.

1 Summarizing Data using Bottom-k Sketches Edith Cohen AT&T Haim Kaplan Tel Aviv University.

Online Social Networks and Media. Graph partitioning The general problem – Input: a graph G=(V,E) edge (u,v) denotes similarity between u and v weighted.

Sampling with unequal probabilities STAT262. Introduction In the sampling schemes we studied – SRS: take an SRS from all the units in a population – Stratified.

Chapter 4: Network Layer

Maximum likelihood (ML) and likelihood ratio (LR) test

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006

Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:

Maximum likelihood (ML) and likelihood ratio (LR) test

G. Cowan Lectures on Statistical Data Analysis 1 Statistical Data Analysis: Lecture 8 1Probability, Bayes’ theorem, random variables, pdfs 2Functions of.

Statistical Background

Similar Sequence Similar Function Charles Yan Spring 2006.

1Bloom Filters Lookup questions: Does item “ x ” exist in a set or multiset? Data set may be very big or expensive to access. Filter lookup questions with.

Maximum likelihood (ML)

Computational aspects of stability in weighted voting games Edith Elkind (NTU, Singapore) Based on joint work with Leslie Ann Goldberg, Paul W. Goldberg,

Sample Size Determination Ziad Taib March 7, 2014.

Distance Queries from Sampled Data: Accurate and Efficient Edith Cohen Microsoft Research.

METU Informatics Institute Min 720 Pattern Classification with Bio-Medical Applications PART 2: Statistical Pattern Classification: Optimal Classification.

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.

Chapter 7 Estimation: Single Population

Coordinated weighted sampling for estimating aggregates over multiple weight assignments Edith Cohen, AT&T Research Haim Kaplan, Tel Aviv University Shubho.

Estimation for Monotone Sampling: Competitiveness and Customization Edith Cohen Microsoft Research.

Prof. Dr. S. K. Bhattacharjee Department of Statistics University of Rajshahi.

An Integration Framework for Sensor Networks and Data Stream Management Systems.

© 2009 IBM Corporation 1 Improving Consolidation of Virtual Machines with Risk-aware Bandwidth Oversubscription in Compute Clouds Amir Epstein Joint work.

Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.

ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Deterministic vs. Random Maximum A Posteriori Maximum Likelihood Minimum.

A Quantitative Analysis and Performance Study For Similar- Search Methods In High- Dimensional Space Presented By Umang Shah Koushik.

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. Structure-Aware Sampling:

Today Ensemble Methods. Recap of the course. Classifier Fusion

Jennifer Rexford Princeton University MW 11:00am-12:20pm Measurement COS 597E: Software Defined Networking.

Weikang Qian. Outline Intersection Pattern and the Problem Motivation Solution 2.

Chapter 2 Statistical Background. 2.3 Random Variables and Probability Distributions A variable X is said to be a random variable (rv) if for every real.

Chapter 7 Point Estimation of Parameters. Learning Objectives Explain the general concepts of estimating Explain important properties of point estimators.

Analysis of Algorithms CS 477/677 Instructor: Monica Nicolescu Lecture 7.

Confidence Interval & Unbiased Estimator Review and Foreword.

1 Introduction to Statistics − Day 4 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Lecture 2 Brief catalogue of probability.

Chapter 8 Estimation ©. Estimator and Estimate estimator estimate An estimator of a population parameter is a random variable that depends on the sample.

TU/e Algorithms (2IL15) – Lecture 12 1 Linear Programming.

Approximation Algorithms based on linear programming.

Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.

What you can do with Coordinated Sampling Edith Cohen Microsoft Research SVC Tel Aviv University Haim Kaplan Tel Aviv University.

Mining Data Streams (Part 1)

New Characterizations in Turnstile Streams with Applications

Computing and Compressive Sensing in Wireless Sensor Networks

Chapter 6: Sampling Distributions

The Variable-Increment Counting Bloom Filter

Ch9 Random Function Models (II)

Edith Cohen Google Research Tel Aviv University

Lecture 18: Uniformity Testing Monotonicity Testing

Foundations of Data Mining

Howard Huang, Sivarama Venkatesan, and Harish Viswanathan

Testing with Alternative Distances

Load Shedding Techniques for Data Stream Systems

1.6) Storing Integer:.

Summarizing Data by Statistics

Feifei Li, Ching Chang, George Kollios, Azer Bestavros

Range-Efficient Computation of F0 over Massive Data Streams

Presentation transcript:

Getting the Most out of Your Sample Edith Cohen Haim Kaplan Tel Aviv University

Why data is sampled Lots of Data: measurements, Web searches, tweets, downloads, locations, content, social networks, and all this both historic and current… To get value from data we need to be able to process queries over it But resources are constrained: Data too large to: transmit, store in full, to process even if stored in full…

Random samples A compact synopsis/summary of the data. Easier to store, transmit, update, and manipulate Aggregate queries over the data can be approximately (and efficiently) answered from the sample Flexible: Same sample supports many types of queries (which do not have to be known in advance)

Queries and estimators The value of our sampled data hinges on the quality of our estimators Estimation of some basic (sub)population statistics is well understood (sample mean, variance,…) We need to better understand how to estimate other basic queries: – difference norms (anomaly/change detection) – distinct counts

Our Aim We want to estimate sum aggregates over sampled data Classic sampling schemes Seek “variance optimal” estimators Understand (and meet) limits of what can be done

Example: sensor measurements timesensor1sensor2 7:0034 7: : : :0429 7:0513 7: : : : :1067 7: :1210 Could be radiation/pollution levels sugar levels temperature readings

Sensor measurements timesensor1sensor2max 7: : : : : : : : : : : : :1210 We are interested in the maximum reading in each timestamp

Sensor measurements: Sum of maximums over selected subset timesensor1sensor2max 7: : : : : : : : : : : : :1210

But we only have sampled data timesensor1sensor2max 7: :015? 7: : :04? 7:051? 7:065? 7:07 6 ? 7:0811? 7: :10? 7:1121? 7:1210?

Example: distinct count of IP addresses Interface1: Interface2: Active IP addresses …………… Active IP addresses …………………

Distinct Count Queries Interface1 Interface2 Query: How many distinct IP addresses from a certain AS accessed the router through one of the interfaces during this time period ? I1 Active IP addresses …………… I2 Active IP addresses ………………… Distinct in AS

This is also a sum aggregate We do not have all IP addresses going through each interface but just a sample

Example: Difference queries IP traffic between time periods IP prefixkBytes 03/05/ / / / / IP prefixkBytes 03/06/ / / / / / day1 day2

Difference Queries Query: Difference between day1 and day2 Query: Upward change from day1 to day2

Sum Aggregates We want to estimate sum aggregates from samples

Single-tuple “reduction” To estimate the sum we sum single-tuple estimates Almost WLOG since tuples are sampled (nearly) independently

Estimator properties Unbiased Nonnegative Pareto-optimal: No estimator has smaller variance on all data vectors (Tailor to data?) “Variance competitive”: Not far from minimum for all data Monotone: increases with information either or both must have Sometimes:

Sampling schemes Weighted vs. Random Access (weight-oblivious) – Sampling probability does/does not depend on value Independent vs. Coordinated sampling – In both cases sampling of an entry is independent of values of other entries. – Independent: joint inclusion probability is product of individual inclusion probabilities. – Coordinated: sharing random bits so that joint probability is higher Poisson / bottom-k sampling Start with a warm-up: Random Access Sampling

Random Access: Suppose each entry is sampled with probability p timesensor1sensor2max 7: :015? 7: : :04? 7:051? 7:065? 7:07 6 ? 7:0811? 7: :10? 7:1121? 7:1210? We know the maximum only when both entries are sampled = probability p 2 Tuple estimate: max/p 2 if both sampled. 0 otherwise. estimate

It’s a sum of unbiased estimators timesensor1sensor2e(max) 7:00344/p 2 7:0150 7: /p 2 7: /p 2 7:040 7:0510 7:0650 7: : : /p 2 7:100 7: :12100 This is the Horvitz-Thompson estimator: Nonnegative Unbiased:

The weakness which we address: timesensor1sensor2max 7: :015? 7: : :04? 7:051? 7:065? 7:07 6 ? 7:0811? 7: :10? 7:1121? 7:1210? Not optimal Ignores a lot of information in the sample

Doing the best for equal entries timesensor1sensor2max 7:00444 If none is sampled (1-p) 2 : timesensor1sensor2e(max) 7:00??0 If we sample at least one value 2p-p 2 : timesensor1sensor2e(max) 7:004?4/(2p-p 2 ) timesensor1sensor2e(max) 7:00?44/(2p-p 2 ) timesensor1sensor2e(max) 7:00444/(2p-p 2 )

What if entry values are different ? timesensor1sensor2max 7: timesensor1sensor2e(max) 7:003?3/(2p-p 2 ) timesensor1sensor2e(max) 7:00?44/(2p-p 2 ) If we sample both timesensor1sensor2e(max) 7:0034 ? Unbiasedness determines value If none is sampled: timesensor1sensor2e(max) 7:00??0 If we sample one value:

What if entry values are different ? timesensor1sensor2e(max) 7:00??0 timesensor1sensor2e(max) 7:003?3/(2p-p 2 ) timesensor1sensor2e(max) 7:00?44/(2p-p 2 ) timesensor1sensor2e(max) 7:0034X Unbiased: X>0 → nonnegative X> 4/(2p-p 2 ) → monotone

… The L estimator timesensor1sensor2e(max) 7:00??0 timesensor1sensor2e(max) 7:00v1v1 ?v 1 /(2p-p 2 ) timesensor1sensor2e(max) 7:00?v2v2 v 2 /(2p-p 2 ) timesensor1sensor2e(max) 7:00v1v1 v2v2 X Nonnegative Unbiased Pareto optimal Min possible variance for two equal entries Monotone (unique) and symmetric

Back to our sum aggregate timesensor1sensor2e(max) 7:0034 7:015 7: : :040 7:051 7:065 7:07 6 7:0811 7: :100 7:1121 7:1210

The L estimator timesensor1sensor2e(max) 7:00??0 timesensor1sensor2e(max) 7:00v1v1 ?v 1 /(2p-p 2 ) timesensor1sensor2e(max) 7:00?v2v2 v 2 /(2p-p 2 ) timesensor1sensor2e(max) 7:00v1v1 v2v2 X Nonnegative Unbiased Pareto optimal Min possible variance for two equal entries Monotone (unique) and symmetric

The U estimator timesensor1sensor2e(max) 7:00??0 timesensor1sensor2e(max) 7:00v1v1 ?v 1 /(p(1+[1-2p] + ) timesensor1sensor2e(max) 7:00?v2v2 v 2 /(p(1+[1-2p] + ) timesensor1sensor2e(max) 7:00v1v1 v2v2 (max(v 1,v 2 )-(v 1 +v 2 )(1-p)/(1+[1-2p] + ))/p 2 Q: What if we want to optimize the estimator for sparse vectors ? U estimator: Nonnegative, Unbiased, symmetric, Pareto optimal, not monotone

Variance ratio U/L vs. HT p=½

Order-based variance optimality An estimator is ‹-optimal if any estimator with lower variance on v has strictly higher variance on some z<v. The L max estimator is <-optimal for (v,v) < (v,v-x) The U max estimator is <-optimal for (v,0) < (v,x) We can construct unbiased <-optimal estimators for any other order < Fine-grained tailoring of estimator to data patterns

Weighted sampling: Estimating distinct count Interface1: Interface2: IP addresses IP addresses Q: How many distinct IP addresses accessed the router through one of the interfaces ?

Sampling is “weighted” Interface1: Interface2: If the IP address did not access the interface then we sample it with probability 0, otherwise with probability p IP addresses IP addresses

→ No unbiased nonnegative estimator when p<½ Independent “weighted” sampling with probability p IP addInterface1Interface2e(OR) ??0 IP addInterface1Interface2e(OR) ?1/p IP addInterface1Interface2e(OR) ?11/p IP addInterface1Interface2e(OR) x p 2 x + 2p(1-p)/p=1 → x=(2p-1)/ p 2 To be unbiased for (1,1) To be nonnegative for (0,0) To be unbiased for (1,0) To be unbiased for (0,1) <0 OR estimation

Negative result (??) Independent “weighted” sampling: There is no non-negative unbiased estimator for OR (related to M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya (PODS 2000), negative results for distinct element counts) Same holds for other functions like the ℓ-th largest value (ℓ < d), and range (Lp difference) Known seeds: Same sampling scheme but we make random bits “public:” can sometimes get information on values also when entry is not sampled (and get good estimators!)

Estimate OR How do we know if did not occur in interface2 or was not sampled by interface2 ? Sampled with probability p Sample not empty with probability 2p-p Interface1Interface

Known seeds Interface2 samples active IP iff H 2 ( ) < p H 2 ( ) < p   interface2 Interface1Interface2 H 2 ( ) > p  We do not know Interface1 samples active IP iff H 1 ( ) < p

HT estimator+known seeds With probability p 2, H 1 ( ) < p and H 2 ( ) < p. We know if occurred in both interfaces. If sampled in either interface then HT estimate is 1/ p 2. Interface1Interface2 H 2 ( ) > p or H 1 ( ) > p, the HT estimate is 0 We only use outcomes for which we know everything ?

Our L estimator: Interface1Interface2 ? Nonnegative Unbiased Monotone (unique) and symmetric Pareto optimal Min possible variance for (1,1) (both interfaces are accessed)

Unbiased For items in the intersection (minimum possible variance): For items in one interface:

OR (ind weighted+known seeds) variance of L, U, HT

Independent Sampling + Known Seeds Method to construct unbiased <-optimal estimators. Nonnegative either through explicit constraints or smart selection of “<“ on vectors. Results: Take home: use known seeds with your classic weighted sampling scheme

Estimating max sum: Independent, pps, known seeds # IP flows to dest IP address in two one-hour periods :

Coordinated Sampling Shared seeds coordinated sampling: Seeds are known and shared across instances: More similar instances have more similar samples. Allows for tighter estimates of some functions (difference norms, distinct counts) Precise characterization of functions for which nonnegative and unbiased estimators exist. (Also, bounded, bounded variances) Results: Construction of nonnegative order-optimal estimators

Estimating L 1 difference Independent / Coordinated, pps, known seeds destination IP addresses: #IP flows in two time periods

Independent / Coordinated, pps, Known seeds

Surname occurrences in 2007, 2008 books (Google ngrams) Independent / Coordinated, pps, Known seeds

Surname occurrences in 2007, 2008 books (Google ngrams) Independent / Coordinated, pps, Known seeds

Conclusion We present estimators for sum aggregates over samples: unbiased, nonnegative, variance optimal – Tailoring estimator to data (<-optimality) – Classic sampling schemes: independent/coordinated weighted/weight-oblivious Open/future: independent sampling (weighted+known seeds): precise characterization, derivation for >2 instances