Coordinated weighted sampling for estimating aggregates over multiple weight assignments Edith Cohen, AT&T Research; Haim Kaplan, Tel Aviv University; Shubho Sen, AT&T Research.

Similar presentations
Sampling From a Moving Window Over Streaming Data Brian Babcock * Mayur Datar Rajeev Motwani * Speaker Stanford University.

Chapter 10: Designing Databases
Overcoming Limitations of Sampling for Aggregation Queries. Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Mayur Datar, Stanford University.
Clustering Categorical Data The Case of Quran Verses
Resource Management §A resource can be a logical, such as a shared file, or physical, such as a CPU (a node of the distributed system). One of the functions.
Logistics Network Configuration
Fast Algorithms For Hierarchical Range Histogram Constructions
Near-Duplicates Detection
Copyright © Cengage Learning. All rights reserved. 7 Probability.
CSC 774 Advanced Network Security, Topic 7.3: Secure and Resilient Location Discovery in Wireless. Dr. Peng Ning, Computer Science.
Chapter 4 Probability and Probability Distributions
1 Summarizing Data using Bottom-k Sketches Edith Cohen AT&T Haim Kaplan Tel Aviv University.
Chapter 7 Probability 7.1 Experiments, Sample Spaces, and Events
Regression Part II One-factor ANOVA Another dummy variable coding scheme Contrasts Multiple comparisons Interactions.
ESTIMATION AND HYPOTHESIS TESTING
1 Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams Robert Schweller Ashish Gupta Elliot Parsons Yan Chen Computer.
Ph.D. Defence, University of Alberta. Approximation Algorithms for Frequency Related Query Processing on Streaming Data. Presented by Fan Deng. Supervisor:
Basic Probability. Theoretical versus Empirical Theoretical probabilities are those that can be determined purely on formal or logical grounds, independent.
Aggregation in Sensor Networks NEST Weekly Meeting Sam Madden Rob Szewczyk 10/4/01.
1 Anna Östlin Pagh and Rasmus Pagh IT University of Copenhagen Advanced Database Technology March 25, 2004 QUERY COMPILATION II Lecture based on [GUW,
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Flash Crowds And Denial of Service Attacks: Characterization and Implications for CDNs and Web Sites Aaron Beach Cs395 network security.
2.3. Measures of Dispersion (Variation):
Finding Similar Items.
Volatility Chapter 9 Risk Management and Financial Institutions 2e, Chapter 9, Copyright © John C. Hull
Distance Queries from Sampled Data: Accurate and Efficient Edith Cohen Microsoft Research.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
5.1 Basic Probability Ideas
Estimation for Monotone Sampling: Competitiveness and Customization Edith Cohen Microsoft Research.
Census A survey to collect data on the entire population.   Data The facts and figures collected, analyzed, and summarized for presentation and.
Chapter 4 Statistics. 4.1 – What is Statistics? Definition Data are observed values of random variables. The field of statistics is a collection.
Lecture 12 Statistical Inference (Estimation) Point and Interval estimation By Aziza Munir.
1 Sampling Distributions Lecture 9. 2 Background  We want to learn about the feature of a population (parameter)  In many situations, it is impossible.
© 2009 IBM Corporation 1 Improving Consolidation of Virtual Machines with Risk-aware Bandwidth Oversubscription in Compute Clouds Amir Epstein Joint work.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. Structure-Aware Sampling:
Chapter 3 Numerically Summarizing Data 3.2 Measures of Dispersion.
Challenges and Opportunities Posed by Power Laws in Network Analysis Bruno Ribeiro UMass Amherst MURI REVIEW MEETING Berkeley, 26 th Oct 2011.
Stratified K-means Clustering Over A Deep Web Data Source Tantan Liu, Gagan Agrawal Dept. of Computer Science & Engineering Ohio State University Aug.
Getting the Most out of Your Sample Edith Cohen Haim Kaplan Tel Aviv University.
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
1 Standard error Estimated standard error,s,. 2 Example 1 While measuring the thermal conductivity of Armco iron, using a temperature of 100F and a power.
PROBABILITY, PROBABILITY RULES, AND CONDITIONAL PROBABILITY
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
Massive Data Sets and Information Theory Ziv Bar-Yossef Department of Electrical Engineering Technion.
1 An Arc-Path Model for OSPF Weight Setting Problem Dr.Jeffery Kennington Anusha Madhavan.
+ Chapter 5 Overview 5.1 Introducing Probability 5.2 Combining Events 5.3 Conditional Probability 5.4 Counting Methods 1.
Statistics Sampling Distributions and Point Estimation of Parameters Contents, figures, and exercises come from the textbook: Applied Statistics and Probability.
CHAPTER 2: Basic Summary Statistics
How to build a better Google? Adam Bak IST 497E November 21, 2002.
Web Design Vocabulary #3. HTML Hypertext Markup Language - The coding scheme used to format text for use on the World Wide Web.
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.
Ranking: Compare, Don’t Score Ammar Ammar, Devavrat Shah (LIDS – MIT) Poster ( No preprint), WIDS 2011.
Estimating Volatilities and Correlations
Computing and Compressive Sensing in Wireless Sensor Networks
Near Duplicate Detection
The Variable-Increment Counting Bloom Filter
Streaming & sampling.
Edith Cohen Google Research Tel Aviv University
Foundations of Data Mining
Sublinear Algorithmic Tools 2
DDoS Attack Detection under SDN Context
Chapter 4 – Part 3.
Basic Concepts An experiment is the process by which an observation (or measurement) is obtained. An event is an outcome of an experiment,
Feifei Li, Ching Chang, George Kollios, Azer Bestavros
Introduction to Stream Computing and Reservoir Sampling
CHAPTER 2: Basic Summary Statistics
Data Pre-processing Lecture Notes for Chapter 2
Presentation transcript:

Coordinated weighted sampling for estimating aggregates over multiple weight assignments Edith Cohen, AT&T Research Haim Kaplan, Tel Aviv University Shubho Sen, AT&T Research

Data model: Universe U of items i1, i2, i3, i4, …, with multiple weight assignments w1, w2, w3 defined over the items, so each item i carries weights w1(i), w2(i), w3(i). Example: items are IP addresses; w1, w2, w3 are the traffic on day 1, day 2, day 3.

Data model (example): Universe U of items i1, i2, i3, i4, …. Items are Facebook users; w1(i) = # of male friends, w2(i) = # of female friends, w3(i) = time online per day.

Data model (example): Universe U of items i1, i2, i3, i4, …. Items are customers; w1(i), w2(i), w3(i) = wait time of customer i for item 1, item 2, item 3.

Data model (example): Universe U of items i1, i2, i3, i4, …. Items are license plates; w1(i), w2(i), w3(i) = observations at the GW Bridge on day 1, day 2, day 3.

Data model (example): Universe U of items i1, i2, i3, i4, …. Items are web pages; w1(i) = bytes, w2(i) = out-links, w3(i) = in-links.

Aggregation queries: Given a selection predicate d(i) over items in U, compute the sum of the weights of all items for which the predicate holds: Σ_{i: d(i)} w(i). Examples: total number of links out of pages of some web site; total number of flows into a particular network.

This work: aggregates depending on more than one weight function. Given a selection predicate d(i) over items in U, we want to compute the sum of f(i) over all items for which the predicate holds: Σ_{i: d(i)} f(i), where f(i) depends on some subset of the weights given to i. Simple examples: f(i) = max_b w_b(i), f(i) = min_b w_b(i), f(i) = max_b w_b(i) − min_b w_b(i).
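A tiny reference implementation may make the query form concrete. This is an illustrative sketch, not code from the paper: the data, item names, and the helper `aggregate` are made up. It computes the exact aggregate Σ_{i: d(i)} f(i) that the sampling schemes below are designed to approximate.

```python
# Exact multi-weight aggregation: the ground truth that sampling approximates.
# All data and names here are made-up illustrations.
weights = {
    "day1": {"ip1": 5.0, "ip2": 0.0, "ip3": 7.0},
    "day2": {"ip1": 2.0, "ip2": 9.0, "ip3": 7.0},
}
items = ["ip1", "ip2", "ip3"]

def aggregate(f, d, weights, items):
    """Sum f over the per-assignment weights of every item satisfying d."""
    return sum(f([w[i] for w in weights.values()]) for i in items if d(i))

# f(i) = max_b w_b(i) - min_b w_b(i) over all items: 3 + 9 + 0 = 12
total = aggregate(lambda ws: max(ws) - min(ws), lambda i: True, weights, items)
```

Exact aggregation like this requires seeing every weight of every item, which is exactly what the challenges slide below rules out.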

Data model (example): Universe U of items i1, i2, i3, i4, …. Items are customers; w_b(i) = wait time of customer i for item b. f(i) = max_b w_b(i), aggregated over a subset of customers, gives the total waiting time of these customers.

Data model: Universe U of items i1, i2, i3, i4, …. Items are IP addresses; w_b(i) = traffic on day b. f(i) = max_b w_b(i) − min_b w_b(i), aggregated over a network, gives the sum of the maximum changes in the number of flows.

Challenges: Exact aggregation can be very resource consuming or simply impossible: massive data volumes, too large to store or to transport elsewhere; w1 may not be available when we process w2, etc., since the assignments are collected at different places/times; many different types of queries (d(i), f(i)), not known in advance. Challenge: which summary to keep, and how to estimate the aggregate?

Solution: use sampling, and keep a sample as the summary. Desirable properties of the sampling scheme: scalability (efficient on data streams and distributed data, decoupling the summarization of different sets); applicability to a wide range of d(i) and f(i); good estimators (unbiased, small variance).

Sampling a single weight function: independent (Bernoulli) sampling with probability k·w(i)/W (Poisson, PPS sampling); sampling k times without replacement; order (bottom-k) sampling, which includes PPSWOR and priority sampling (Rosén 72, Rosén 97, CK07, DLT07). These methods can be implemented efficiently in a stream/reservoir/distributed setting. Order (bottom-k) sampling dominates the other methods (fixed sample size, with unbiased estimators [CK07, DLT07]).

Rank-based definition of order/bottom-k sampling: for each item, draw a random rank value from some distribution f_{w(i)} that depends on the item's weight (the random draw is the seed). Order/bottom-k: pick the k smallest-ranked items into your sample. Rank distributions: r(i) = u(i)/w(i) gives priority order samples; r(i) = −ln(u(i))/w(i) gives PPSWOR order samples.
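As a sketch, bottom-k sampling under either rank distribution fits in a few lines. The function name and toy weights are illustrative assumptions, not from the slides:

```python
import heapq
import math
import random

def bottom_k_sample(weights, k, rank="priority", rng=None):
    """Bottom-k (order) sample: draw one random rank per item, keep the k
    smallest. rank="priority": r(i) = u(i)/w(i);
    rank="ppswor":  r(i) = -ln(u(i))/w(i)."""
    rng = rng or random.Random()
    ranked = []
    for item, w in weights.items():
        u = 1.0 - rng.random()  # uniform in (0, 1]; avoids log(0)
        r = u / w if rank == "priority" else -math.log(u) / w
        ranked.append((r, item))
    return heapq.nsmallest(k, ranked)  # [(rank, item), ...], smallest first

sample = bottom_k_sample({"a": 10.0, "b": 1.0, "c": 5.0}, k=2,
                         rng=random.Random(7))
```

Because only the k smallest ranks are kept, this runs in one pass and bounded memory, matching the stream/reservoir setting mentioned above.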

Relations between samples by different weight functions. Independent sampling: get a sample from each weight function independently. Coordinated sampling: if w2(i) ≥ w1(i) then r2(i) ≤ r1(i); for example, use the same seed u(i) for all w_b: r1(i) = u(i)/w1(i), r2(i) = u(i)/w2(i). Coordination can be achieved efficiently via hashing.
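The hashing trick can be sketched as follows: deriving the seed u(i) from a hash of the item identifier means every weight assignment, even if sampled at a different place or time, draws the same u(i) with no communication. The helper names and the choice of SHA-256 are assumptions for illustration:

```python
import hashlib

def seed(item):
    """Shared seed u(item) in (0, 1], derived by hashing the item id, so every
    weight assignment (wherever it is sampled) uses the same u(item)."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
    return (h + 1) / 2.0**64

def rank(item, w):
    """Priority rank r_b(i) = u(i) / w_b(i) under the shared seed."""
    return seed(item) / w

# Coordination property: a larger weight can only lower the rank,
# so if w2(i) >= w1(i) then r2(i) <= r1(i).
r1, r2 = rank("ip1", 3.0), rank("ip1", 7.0)
```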

Coordination is critical for tight estimates. We develop estimators for coordinated samples and analyze them. There is a lot of previous work on coordination in the context of survey sampling and for the case of uniform weights.

Horvitz-Thompson estimators [HT52]: give each item i an adjusted weight a(i). Let p(i) be the probability that item i is sampled. If i is in the sample, a(i) = w(i)/p(i); otherwise a(i) = 0. Then a(Q) = Σ_{i∈Q} a(i) is an unbiased estimator of w(Q).
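A minimal sketch of the HT adjustment, assuming the inclusion probabilities p(i) are known (the data and names are illustrative):

```python
def ht_adjusted_weights(sampled, p, w):
    """Horvitz-Thompson adjusted weights: a(i) = w(i)/p(i) if i was sampled,
    else 0. Since E[a(i)] = p(i) * w(i)/p(i) = w(i), any sum of adjusted
    weights is an unbiased estimate of the corresponding sum of weights."""
    return {i: (w[i] / p[i] if i in sampled else 0.0) for i in w}

w = {"a": 4.0, "b": 6.0}
p = {"a": 0.5, "b": 1.0}                   # assumed known inclusion probs
a = ht_adjusted_weights({"a", "b"}, p, w)  # both happened to be sampled here
# a["a"] = 4.0/0.5 = 8.0, a["b"] = 6.0/1.0 = 6.0
```

The catch, addressed on the next slide, is that for order samples p(i) is not directly computable.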

Using HT for order samples. Problem: we cannot compute p(i) from an order sample, and therefore cannot compute the HT adjusted weights a(i) = w(i)/p(i). Solution: apply HT on a partition. For item i, if we can find a partition of the sample space such that we can compute the conditioned p(i) for the cell that the sampled set belongs to, then we apply HT using that conditioned p(i). This yields unbiased adjusted weights for each item.

Estimators for aggregates over multiple weights: suppose we want to estimate Σ_{i: d(i)} f(i), and we have a sample for each weight function, using independent or coordinated ranks.

Estimator for Σ_{i: d(i)} min_b w_b(i): identify the items i for which d(i) = true and for which min_b w_b(i) is known (= the items contained in all samples). Compute p(i), the probability that such an item is sampled for all weight functions (conditioned on a subspace that depends on the other ranks). Set a(i) = min_b w_b(i)/p(i), and sum up these adjusted weights.
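The conditioning needed for bottom-k samples is more involved; as a simplified sketch, the same idea is easy to verify with a fixed sampling threshold tau (coordinated Poisson PPS rather than bottom-k), where the probability that an item appears in all samples is exactly min(1, tau · min_b w_b(i)). All names and data below are illustrative assumptions:

```python
import random

def estimate_min_sum(weights_by_b, items, d, tau, rng):
    """Sketch of the min-aggregate estimator under coordinated *threshold*
    (Poisson PPS) sampling with a fixed tau, a simplification of the
    bottom-k scheme. With shared seed u(i), item i enters sample b iff
    u(i)/w_b(i) < tau, so i is in ALL samples iff u(i) < tau * min_b w_b(i);
    that inclusion probability is p(i) = min(1, tau * min_b w_b(i))."""
    est = 0.0
    for i in items:
        if not d(i):
            continue
        u = rng.random()
        m = min(w[i] for w in weights_by_b)
        if u < tau * m:                   # i sampled under every assignment
            est += m / min(1.0, tau * m)  # HT adjusted weight for min_b w_b(i)
    return est

# True value: min(2,6) + min(4,1) = 3; the estimator is unbiased, so the
# average over many independent runs should be close to 3.
weights_by_b = [{"a": 2.0, "b": 4.0}, {"a": 6.0, "b": 1.0}]
rng = random.Random(0)
runs = [estimate_min_sum(weights_by_b, ["a", "b"], lambda i: True, 0.3, rng)
        for _ in range(20000)]
avg = sum(runs) / len(runs)
```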

Independent vs. coordinated: the variance is small when p(i) is large, and coordinated sketches have larger p(i).

Estimator for Σ_{i: d(i)} max_b w_b(i): identify the items i for which d(i) = true and for which max_b w_b(i) is known. If you see all of an item's weights, then you know max_b w_b(i); but you never see them all if min_b w_b(i) = 0.

Estimator for Σ_{i: d(i)} max_b w_b(i), using the consistency of the ranks: if the largest weight you see for i has rank smaller than min_b {the k-th smallest rank among the items in I∖{i}}, then you know it is the maximum.

Estimator for Σ_{i: d(i)} max_b w_b(i), continued: when the largest weight you see has rank smaller than min_b {the k-th smallest rank among the items in I∖{i}}, you know it is the maximum. Compute p(i), the probability of this event; set a(i) = max_b w_b(i)/p(i); and sum up these adjusted weights.

More estimators: combining the two, we get an unbiased estimator for the L1 difference Σ_{i: d(i)} |w1(i) − w2(i)|, since |w1(i) − w2(i)| = max_b w_b(i) − min_b w_b(i). We can prove that this estimator is nonnegative.
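The reduction behind this estimator is the pointwise identity |w1(i) − w2(i)| = max(w1(i), w2(i)) − min(w1(i), w2(i)), so the L1 estimate is simply the max-estimate minus the min-estimate. A toy exact-computation check (data made up for illustration):

```python
# Toy exact check of the identity behind the L1-difference estimator.
w1 = {"a": 5.0, "b": 2.0}
w2 = {"a": 1.0, "b": 8.0}

sum_max = sum(max(w1[i], w2[i]) for i in w1)  # 5 + 8 = 13
sum_min = sum(min(w1[i], w2[i]) for i in w1)  # 1 + 2 = 3
l1 = sum_max - sum_min                        # |5-1| + |2-8| = 10
```

Replacing the exact sums with the unbiased max- and min-estimates keeps the difference unbiased by linearity of expectation.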

Empirical evaluation. Data sets: IP packet traces, items: (src, dest, src-port, dest-port), w_b: total # of bytes in hour b; Netflix data set, items: movies, w_b: total # of ratings in month b; stock data, items: ticker symbols, w_b: high value on day b.

Netflix results (plots).

Netflix results for various f (plots).

Summary: coordinated sketches improve accuracy, and sometimes estimation is not feasible without them. Coordination can be achieved with little overhead using state-of-the-art sampling techniques.

Thank you!

Application: Sensor nodes recording daily vehicle traffic in different locations in a city Items: vehicles (license plate numbers) Sets: All vehicles observed at a particular location/date Example queries: Number of distinct vehicles in Manhattan on election day (size of the union of all Manhattan locations on election day) Number of trucks with PA license plates that crossed both the Hudson and the East River on June 18, 2009

Application: items are IP addresses; sets h1, …, h24 correspond to destination IPs observed in different 1-hour time periods. Example queries: number of distinct IP destinations in the first 12 hours (size of the union of h1, …, h12); in at most one / at least 4 of h1, …, h24; in the intersection of h9, …, h17 (all business hours); the same queries restricted to blacklisted IPs, servers in a specific country, etc. Uses: traffic analysis, anomaly detection.

Application: text/hypertext corpus; items are features/terms, sets are documents (or vice versa). Example queries: number of distinct hyperlinks to financial news websites from websites in the .edu domain; fraction of pages citing A that also cite B; similarity (Jaccard coefficient) of two documents. Uses: research, duplicate elimination, similarity search or clustering.

Application: market basket data set; items are consumers/baskets, sets are goods (or vice versa). Example queries: likelihood of purchasing beer if diapers were purchased; number of baskets from a zip code containing shampoo; number of consumers purchasing baby food but not diapers. Uses: marketing, placement, pricing, choosing products.

Application: online digital library; items are viewers, each set corresponds to the viewers of a specific movie. Example queries: total number of distinct viewers of any or all James Bond titles (union, intersection), or of any independent title from 2008; the same restricted to selected viewers (females, a certain geographic location, under age 30). Uses: marketing research, content placement, etc. Many other applications: market basket analysis, text/hypertext analysis, etc.

Rank-based definition of sampling: for each item, draw a random rank value from some distribution f_{w(i)}. Poisson: pick all items with rank value below a threshold τ(k). Order/bottom-k: pick the k smallest-ranked items into your sample. Rank distributions: r(i) = u/w(i) gives Poisson PPS / priority order samples; r(i) = −ln(u)/w(i) gives PPSWOR order samples.

Horvitz-Thompson estimators [HT52]: give each item i an adjusted weight a(i). Let p(i) be the probability that item i is sampled. If i is in the sample, a(i) = w(i)/p(i); otherwise a(i) = 0. Then a(i) is an unbiased estimator of w(i).

Rank-conditioning estimator for a single bottom-k sample [CK07, DLT07]: use the principle of applying HT on a partition. For an item i in the sample, the partition is based on the k-th smallest rank value among all other items (excluding i) being some fixed value; in this specific example, k = 4 and the fixed value is 0.42, so i is sampled iff r(i) < 0.42. Compute p(i), the probability that the rank value of i is smaller than 0.42. For priority order samples, r(i) = u/w(i), so p(i) = Pr[r(i) < 0.42] = min{1, 0.42·w(i)}, and a(i) = w(i)/p(i) = max{w(i), 100/42}.
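For priority order samples, this rank-conditioning recipe collapses to the familiar priority-sampling estimator: condition on tau, the (k+1)-st smallest rank, and assign a(i) = max(w(i), 1/tau) to each of the k sampled items. A minimal sketch (the function name and toy weights are illustrative):

```python
import random

def priority_estimate(weights, k, rng):
    """Priority (bottom-k order) sampling with the rank-conditioning
    estimator: rank r(i) = u(i)/w(i); keep the k smallest ranks; condition
    on tau = the (k+1)-st smallest rank. Then p(i) = Pr[r(i) < tau]
    = min(1, tau*w(i)), giving a(i) = w(i)/p(i) = max(w(i), 1/tau)."""
    ranked = sorted(((1.0 - rng.random()) / w, i) for i, w in weights.items())
    tau = ranked[k][0]                            # (k+1)-st smallest rank
    return {i: max(weights[i], 1.0 / tau) for _, i in ranked[:k]}

# Unbiasedness check: the mean of the adjusted-weight sums over many runs
# should approach the true total weight 1 + 2 + 4 = 7.
weights = {"a": 1.0, "b": 2.0, "c": 4.0}
rng = random.Random(3)
sums = [sum(priority_estimate(weights, 2, rng).values()) for _ in range(40000)]
avg = sum(sums) / len(sums)
```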