Summarizing Data using Bottom-k Sketches. Edith Cohen (AT&T), Haim Kaplan (Tel Aviv University).

Similar presentations
Probability and Maximum Likelihood. How are we doing on the pass sequence? This fit is pretty good, but… Hand-labeled horizontal coordinate, t The red.
Estimation of Means and Proportions
Overcoming Limitations of Sampling for Aggregation Queries Surajit Chaudhuri, Microsoft Research; Gautam Das, Microsoft Research; Mayur Datar, Stanford University.
Point Estimation Notes of STAT 6205 by Dr. Fan.
Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.
Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck Microsoft Research.
Estimation  Samples are collected to estimate characteristics of the population of particular interest. Parameter – numerical characteristic of the population.
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
Maximum likelihood (ML) and likelihood ratio (LR) test
Resampling techniques Why resampling? Jackknife Cross-validation Bootstrap Examples of application of bootstrap.
Statistical Inference Chapter 12/13. COMP 5340/6340 Statistical Inference2 Statistical Inference Given a sample of observations from a population, the.
Building Low-Diameter P2P Networks Eli Upfal Department of Computer Science Brown University Joint work with Gopal Pandurangan and Prabhakar Raghavan.
Evaluating Hypotheses
Time-Decaying Sketches for Sensor Data Aggregation Graham Cormode AT&T Labs, Research Srikanta Tirthapura Dept. of Electrical and Computer Engineering.
Copyright © Cengage Learning. All rights reserved. 6 Point Estimation.
Market Risk VaR: Historical Simulation Approach
1 Inference About a Population Variance Sometimes we are interested in making inference about the variability of processes. Examples: –Investors use variance.
Pattern Recognition Topic 2: Bayes Rule Expectant mother:
Statistics in Bioinformatics May 12, 2005 Quiz 3-on May 12 Learning objectives-Understand equally likely outcomes, counting techniques (Example, genetic.
How confident are we that our sample means make sense? Confidence intervals.
Sampling Distributions & Point Estimation. Questions What is a sampling distribution? What is the standard error? What is the principle of maximum likelihood?
Business Statistics: Communicating with Numbers
Distance Queries from Sampled Data: Accurate and Efficient Edith Cohen Microsoft Research.
Leveraging Big Data: Lecture 11 Instructors: Edith Cohen Amos Fiat Haim Kaplan Tova Milo.
Estimation Goal: Use sample data to make predictions regarding unknown population parameters Point Estimate - Single value that is best guess of true parameter.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Probability theory: (lecture 2 on AMLbook.com)
Coordinated weighted sampling for estimating aggregates over multiple weight assignments Edith Cohen, AT&T Research Haim Kaplan, Tel Aviv University Shubho.
Census A survey to collect data on the entire population.   Data The facts and figures collected, analyzed, and summarized for presentation and.
Empirical Research Methods in Computer Science Lecture 2, Part 1 October 19, 2005 Noah Smith.
Prof. Dr. S. K. Bhattacharjee Department of Statistics University of Rajshahi.
Random Sampling, Point Estimation and Maximum Likelihood.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
Data Reduction. 1.Overview 2.The Curse of Dimensionality 3.Data Sampling 4.Binning and Reduction of Cardinality.
© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. Structure-Aware Sampling:
CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.
Getting the Most out of Your Sample Edith Cohen Haim Kaplan Tel Aviv University.
Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck.
BCS547 Neural Decoding. Population Code Tuning CurvesPattern of activity (r) Direction (deg) Activity
Lecture 3: Statistics Review I Date: 9/3/02  Distributions  Likelihood  Hypothesis tests.
Chapter 7 Point Estimation of Parameters. Learning Objectives Explain the general concepts of estimating Explain important properties of point estimators.
1 Standard error Estimated standard error,s,. 2 Example 1 While measuring the thermal conductivity of Armco iron, using a temperature of 100F and a power.
BCS547 Neural Decoding.
Data in Motion Michael Hoffman (Leicester) S Muthukrishnan (Google) Rajeev Raman (Leicester)
1 Online Computation and Continuous Maintaining of Quantile Summaries Tian Xia Database CCIS Northeastern University April 16, 2004.
Massive Data Sets and Information Theory Ziv Bar-Yossef Department of Electrical Engineering Technion.
Sampling and estimation Petter Mostad
Chapter 20 Classification and Estimation Classification – Feature selection Good feature have four characteristics: –Discrimination. Features.
Point Estimation of Parameters and Sampling Distributions Outlines:  Sampling Distributions and the central limit theorem  Point estimation  Methods.
1 Introduction to Statistics − Day 4 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Lecture 2 Brief catalogue of probability.
Sampling Theory and Some Important Sampling Distributions.
Class 5 Estimating  Confidence Intervals. Estimation of  Imagine that we do not know what  is, so we would like to estimate it. In order to get a point.
Statistics Sampling Distributions and Point Estimation of Parameters Contents, figures, and exercises come from the textbook: Applied Statistics and Probability.
Maximum likelihood estimators Example: Random data X i drawn from a Poisson distribution with unknown  We want to determine  For any assumed value of.
Chapter 9: Introduction to the t statistic. The t Statistic The t statistic allows researchers to use sample data to test hypotheses about an unknown.
1 IP Routing table compaction and sampling schemes to enhance TCAM cache performance Author: Ruirui Guo, Jose G. Delgado-Frias Publisher: Journal of Systems.
Learning Theory Reza Shadmehr Distribution of the ML estimates of model parameters Signal dependent noise models.
Fast Pseudo-Random Fingerprints Yoram Bachrach, Microsoft Research Cambridge Ely Porat – Bar Ilan-University.
© 2001 Prentice-Hall, Inc.Chap 8-1 BA 201 Lecture 12 Confidence Interval Estimation.
What you can do with Coordinated Sampling Edith Cohen Microsoft Research SVC Tel Aviv University Haim Kaplan Tel Aviv University.
Copyright © Cengage Learning. All rights reserved.
Statistical Estimation
Streaming & sampling.
Haim Kaplan and Uri Zwick
Sublinear Algorithmic Tools 2
Haim Kaplan and Uri Zwick
Range-Efficient Computation of F0 over Massive Data Streams
Learning From Observed Data
Presentation transcript:

1 Summarizing Data using Bottom-k Sketches. Edith Cohen (AT&T), Haim Kaplan (Tel Aviv University)

2 The basic application setup
There is a universe of items, each with a weight. Keep a sketch of k items, so that you can estimate the weight of any subpopulation from the sketch (also other aggregates and weight functions).
Example: items are flows going through a router, and w(i) is the number of packets of flow i. Queries supported by the sketch are estimates of the total weight of flows with a particular port, a particular size, a particular destination IP address, etc.
(Figure: items i1, i2, i3 with weights w(i1), w(i2), w(i3).)

3 Application setup: coordinated sketches for multiple subsets
A universe I of weighted items and a set of subsets over that universe. Keep a size-k sketch of each subset, so that you can support both queries within a subset and queries on subset relations, such as aggregates over the union or intersection, resemblance, etc. These sketches are coordinated so that subset-relation queries are supported better.

4 …Sketches for multiple subsets
Example applications:
- Items are features and subsets are documents: estimate similarity of documents.
- Items are documents and subsets are features: estimate "correlation" of features.
- Items are files and subsets are neighborhoods in a p2p network: estimate the total size of distinct items in a neighborhood.
- Items are goods and subsets are consumers: estimate the marketing cost of all goods in a subset of consumers.

5 Application setup: all-distances sketches
Items are located in a metric space (a data stream with a time stamp or sequence number, a network with distances). The all-distances sketch of a location v is a compact encoding of the sketches of ALL neighborhoods of v. This enables efficient time-decaying and spatially-decaying aggregation: for each query distance d, the sketch of the set of items within distance d from v can be retrieved from the all-distances sketch.

6 All-distances sketches

7 One approach: k-mins sketches
Each item i draws a rank r(i) from an exponential distribution with parameter w(i), i.e., r(i) = -ln(u)/w(i) with u ∊ [0,1]. Pick the item with the smallest rank into your sketch. Repeat k times (possibly concurrently). This is equivalent to weighted sampling of k items with replacement (convenient in distributed settings). The k-mins sketch (Cohen '97) is the vector of the k minimum-rank (item, rank) pairs, e.g., ((i2, r(1)(i2)), (i5, r(2)(i5)), …, (i2, r(k)(i2))).
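A minimal Python sketch of this construction, assuming items are given as a dict mapping item ids to positive weights; the function name and interface are illustrative, not from the paper.

```python
import random

def k_mins_sketch(weights, k, seed=0):
    """Draw a k-mins sketch: in each of k independent rank assignments,
    item i gets rank r(i) = -ln(u)/w(i), i.e. exponential with rate w(i),
    and the sketch keeps the minimum-rank item per assignment."""
    rng = random.Random(seed)
    sketch = []
    for _ in range(k):
        ranks = {i: rng.expovariate(w) for i, w in weights.items()}
        winner = min(ranks, key=ranks.get)
        sketch.append((winner, ranks[winner]))
    return sketch                     # list of k (item, rank) pairs
```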

8 … k-mins sketches
Multiple subsets: use the same rank assignments r(1),…,r(k) for all subsets.
All-distances: encode the minimum rank at any distance, e.g., (0, 0.8), (2, 0.6), (10, 0.5), (15, 0.2).
Estimators (examples):
- The fraction of blue items among the k in the sketch is an unbiased estimate of their fraction in the population.
- (k-1) / (sum of the k minimum ranks) is an unbiased estimate of the weight of the set (the error decreases with k).
- Multiple subsets: the fraction of "common" coordinates is an unbiased estimator of the resemblance (ratio of intersection to union).
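These estimators can be read directly off the sketch. A hedged illustration, reusing the (item, rank) pairs produced by the hypothetical k_mins_sketch above:

```python
def estimate_total_weight(kmins):
    """(k-1) / (sum of the k minimum ranks): unbiased for the total weight."""
    k = len(kmins)
    return (k - 1) / sum(rank for _, rank in kmins)

def estimate_subset_fraction(kmins, subset):
    """The fraction of sketch coordinates whose item falls in `subset`
    estimates the subset's fraction of the total weight."""
    return sum(1 for item, _ in kmins if item in subset) / len(kmins)

def estimate_resemblance(kmins_a, kmins_b):
    """For coordinated sketches of two subsets, the fraction of coordinates
    on which the two sketches agree estimates their resemblance."""
    matches = sum(1 for (i, _), (j, _) in zip(kmins_a, kmins_b) if i == j)
    return matches / len(kmins_a)
```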

9 Applications of k-mins sketches
- Graph theory: estimate the size of the transitive closure of a directed graph and of neighborhoods without explicitly computing them (Cohen '94).
- Sensor networks: estimate the number of items, variance, etc. in a neighborhood of a sensor (Cohen, Kaplan, SIGMOD '04); aggregation where weights decay with distance.
- Streaming: aggregation where weights decay with time (Cohen, Strauss, PODS '03).
- Databases: estimate the size of a join before actually computing it (Lewis, Cohen, SODA '97).
- Data mining: estimate resemblance of Web pages and Web sites (Broder '97; Bharat, Broder '99; …).
- Databases: estimate association rules (CDFGM, TKDE '01).

10 Our approach: bottom-k sketches
Each item appears once; this is equivalent to weighted sampling without replacement. Each item i draws a rank r(i) from some distribution that depends on its weight, say r(i) = -ln(u)/w(i) where u ∊ [0,1] (exponential with PDF w(i)e^(-w(i)x)). Pick the k smallest-ranked items into your sketch. Multiple subsets: pick the k items with the smallest ranks within the subset as the sketch of that subset.
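For contrast with the k-mins code above, a minimal bottom-k construction under the same assumptions (exponential ranks, a dict of item weights). Returning the (k+1)-smallest rank as well is a convenience assumed here because the rank-conditioning estimators discussed later use it.

```python
import random

def bottom_k_sketch(weights, k, seed=0):
    """Draw a single rank r(i) ~ Exp(w(i)) per item and keep the k items
    with the smallest ranks, plus the (k+1)-smallest rank."""
    rng = random.Random(seed)
    ranked = sorted((rng.expovariate(w), i) for i, w in weights.items())
    sketch = ranked[:k]                                   # k (rank, item) pairs
    r_kplus1 = ranked[k][0] if len(ranked) > k else float("inf")
    return sketch, r_kplus1
```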

11 Advantages of bottom-k sketches
- Intuitively, the sample is more informative, in particular for Zipf-like distributions, where there are a few large weights.
- Often more efficient to compute.
- Provide more accurate estimates.
- Bottom-k sketches can be used instead of k-mins sketches in almost every application.

12 Bottom-k sketches can replace k-mins sketches in most applications
Plain sketches for explicitly-represented subsets: bottom-k sketches can be computed much more efficiently.
All-distances sketches: bottom-k sketches can be computed as efficiently? Open: the Euclidean plane.

13 Multiple subsets with explicit representation
Items are processed one by one; the sketch is updated when a new item is processed.
k-mins: the new item draws a vector of k random numbers. We compare each coordinate to the "winner" so far in that coordinate and update if the new one wins ⇒ O(k) time to test and update.
bottom-k: the new item draws one random number. We compare it to the k smallest ranks so far and update if it is smaller ⇒ O(1) time to test, O(log k) time to update.
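A small illustration of the bottom-k cost argument, assuming items arrive together with their ranks; the class name is made up for the example.

```python
import heapq

class BottomKStream:
    """Keep the k smallest ranks seen so far in a max-heap, giving an O(1)
    test against the current k-th smallest rank and an O(log k) update
    whenever the new item wins."""

    def __init__(self, k):
        self.k = k
        self.heap = []                         # max-heap via negated ranks: (-rank, item)

    def process(self, item, rank):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (-rank, item))
        elif rank < -self.heap[0][0]:          # O(1) test against current k-th smallest
            heapq.heapreplace(self.heap, (-rank, item))   # O(log k) update

    def sketch(self):
        return sorted((-r, i) for r, i in self.heap)      # (rank, item), smallest first
```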

14 … Multiple subsets with explicit representation
The number of tests equals the sum of the sizes of the subsets. The number of updates is "generally" logarithmic (it depends on the item-weight distribution and the order in which items are processed) and about the same for k-mins and bottom-k sketches. (Precise analysis in the paper.)
k-mins: O(k) time to test and update. bottom-k: O(1) time to test, O(log k) time to update.

15 All-distances bottom-k sketches
The data structures are more complex than for all-distances k-mins sketches: we maintain the k smallest items in a single rank assignment instead of one smallest item in each of k independent assignments. We analyze the number of operations for constructing all-distances k-mins and bottom-k sketches, for different orders in which items are "processed", weight distributions, and relations between weight and location, and for querying the sketches ⇒ the number of operations is comparable for both types of sketches.
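One possible way to realize an all-distances bottom-k sketch for a single location v, under the simplifying assumption that items arrive as (distance, rank, id) triples; this is only a sketch of the idea, not the data structure analyzed in the paper.

```python
import heapq

def all_distances_bottom_k(items, k):
    """Process items by increasing distance from v; an item is stored iff
    its rank is among the k smallest ranks of items at distance no larger
    than its own, so the stored triples encode the bottom-k sketch of
    every neighborhood of v."""
    stored = []
    heap = []                                   # max-heap (negated ranks) of k smallest so far
    for dist, rank, item in sorted(items):
        if len(heap) < k or rank < -heap[0]:
            stored.append((dist, rank, item))
            if len(heap) < k:
                heapq.heappush(heap, -rank)
            else:
                heapq.heapreplace(heap, -rank)
    return stored

def query_within_distance(stored, d, k):
    """Recover the bottom-k sketch of the items within distance d of v."""
    in_range = [(rank, item) for dist, rank, item in stored if dist <= d]
    return sorted(in_range)[:k]
```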

16 Bottom-k sketches provide more accurate estimates
(Figure.) The red line shows an estimate based on a k-mins sketch; all other lines are various estimators that use a bottom-k sketch.

17 Estimating with bottom-k sketches (by "mimicking")
We can produce from a bottom-k sketch S a distribution D(S) over k-mins sketches. Drawing S and then a k-mins sketch from D(S) is equivalent to drawing a k-mins sketch directly ⇒ we can use all known estimators for k-mins sketches ⇒ we can do even better by taking the expectation over D(S), or by drawing from D(S) multiple times.
(Diagram: bottom-k sketches S mapped by D(S) to k-mins sketches.)

18 Maximum likelihood estimate using bottom-k sketches
Lemma: over sketches whose items have weights (w1, w2, …, wk), the rank differences are independent and exponentially distributed:
- r1 - r0 is exponential with parameter W = Σ_{i ∊ I} w(i),
- r2 - r1 is exponential with parameter W - w1,
- …
- rk - r_{k-1} is exponential with parameter W - w1 - w2 - … - w_{k-1}
(here r0 = 0 and r1 ≤ … ≤ rk are the ranks in the sketch).
The probability (density) of seeing the particular differences is the product of these exponential densities; the ML estimate is the W that maximizes it.
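Written out from the lemma above (with r_0 = 0 and s_{j-1} = w_1 + … + w_{j-1}), the density of the observed differences and the resulting first-order condition for the maximizing W are:

```latex
% Density of the observed rank differences; W is the unknown total weight.
L(W) \;=\; \prod_{j=1}^{k} \bigl(W - s_{j-1}\bigr)\,
           e^{-\left(W - s_{j-1}\right)\left(r_j - r_{j-1}\right)}
% Setting d(log L)/dW = 0 gives the ML condition:
\sum_{j=1}^{k} \frac{1}{W - s_{j-1}} \;=\; \sum_{j=1}^{k} (r_j - r_{j-1}) \;=\; r_k .
```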

19 Adjusted-weights estimators
An old technique by Horvitz and Thompson: let p_i be the probability that item i is sampled. Assign to item i an adjusted weight a(i) = w(i)/p_i if i is sampled and a(i) = 0 otherwise. Then a(i) is an unbiased estimator of w(i). To estimate the weight of a subset J we sum the adjusted weights of the sketch elements that are in J.
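A minimal illustration of the Horvitz-Thompson adjustment, assuming the inclusion probabilities p_i are available as a dict (how to obtain them from the sketch is exactly the subject of the next slides):

```python
def adjusted_weights(sampled, weights, inclusion_prob):
    """Horvitz-Thompson adjusted weights: a(i) = w(i)/p_i for sampled items,
    a(i) = 0 otherwise, so E[a(i)] = w(i) for every item."""
    return {i: (weights[i] / inclusion_prob[i] if i in sampled else 0.0)
            for i in weights}

def estimate_subset_weight(adjusted, subset):
    """Estimate w(J) by summing adjusted weights of sketched items in J."""
    return sum(a for i, a in adjusted.items() if i in subset)
```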

20 Subspace conditioning
Problem: p_i may be impossible or hard to compute from the information in the sketch.
Our idea: partition the probability space into subspaces and apply the technique within each subspace. p_i is now the probability that item i is in the sketch, conditioned on the sketch coming from a particular subspace.

21 Rank-conditioning estimators
Partition the sketches according to the (k+1)-smallest rank r_{k+1} (for an item i in the sketch, this is the k-th smallest rank among all other items). Then p_i is the probability that the rank of i is smaller than r_{k+1}.
Lemma: E[a(i)a(j)] = w(i)w(j) for i ≠ j (cov(i,j) = 0) ⇒ the variance of the estimator for a subset J is the sum of the variances of the adjusted weights of the items in J.
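Under the exponential-rank convention used above, conditioned on r_{k+1} item i is sampled with probability 1 - e^(-w(i)·r_{k+1}), so a hedged sketch of the rank-conditioning adjusted weights (matching the bottom_k_sketch output shown earlier) is:

```python
import math

def rank_conditioning_weights(sketch, r_kplus1, weights):
    """Adjusted weights conditioned on r_{k+1}: an Exp(w(i)) rank falls
    below r_{k+1} with probability 1 - exp(-w(i) * r_{k+1}), and the
    Horvitz-Thompson adjusted weight divides w(i) by that probability."""
    adjusted = {}
    for rank, i in sketch:                       # (rank, item) pairs of the sketch
        p_i = 1.0 - math.exp(-weights[i] * r_kplus1)
        adjusted[i] = weights[i] / p_i
    return adjusted
```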

22 Reducing the variance
Use a coarser partition of the probability space to do the conditioning.
Lemma: the variance is smaller for a coarser partition. Typically p_i then gets somewhat harder to compute.

23 Priority sampling (Duffield, Lund, Thorup) fits nicely in the rank-conditioning framework
These are bottom-k sketches with ranks drawn as follows: item i with weight w(i) gets rank u/w(i), where u ∊ [0,1].
Rank-conditioning adjusted weights: item i is in the sample if u/w(i) < r_{k+1}, which happens with probability min{w(i)·r_{k+1}, 1}. So the adjusted weight of i is a(i) = w(i)/min{w(i)·r_{k+1}, 1} = max{w(i), 1/r_{k+1}}.
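A small illustrative implementation of priority sampling in this form, with u drawn uniformly from (0, 1]; the function name and interface are assumptions, not from the cited work.

```python
import random

def priority_sample(weights, k, seed=0):
    """Rank each item by u/w(i), keep the k smallest ranks, and assign
    adjusted weights a(i) = max(w(i), 1/r_{k+1})."""
    rng = random.Random(seed)
    ranked = sorted(((1.0 - rng.random()) / w, i) for i, w in weights.items())
    r_kplus1 = ranked[k][0] if len(ranked) > k else float("inf")
    return {i: max(weights[i], 1.0 / r_kplus1) for _, i in ranked[:k]}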

24 Other weight functions
We can estimate aggregates with respect to other weight functions, such as the number of distinct items, the size distribution, etc. For a numeric property h(i), h(i)·a(i)/w(i) is an unbiased estimator of h(i). Therefore, the sum of h(i)·a(i)/w(i) over the sketched items in J is an unbiased estimator of h(J).
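Combining this with the adjusted weights above, a hedged helper for estimating h(J); the property function h and the subset J are whatever the query supplies.

```python
def estimate_property_sum(adjusted, weights, h, subset):
    """Unbiased estimate of h(J) = sum of h(i) over i in J: each sketched
    item i in J contributes h(i) * a(i) / w(i)."""
    return sum(h(i) * a / weights[i]
               for i, a in adjusted.items() if a > 0 and i in subset)
```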

25 Other aggregates
- Selectivity of a subpopulation
- Approximate quantiles
- Variance and higher moments
- Weighted random sample
- …

26 Summary: bottom-k sketches
Useful in many application setups:
– Sketch a single set
– Coordinated sketches of multiple subsets
– All-distances sketches
A better alternative to k-mins sketches. We facilitate the use of bottom-k sketches:
– Data structures, all-distances sketches
– Analysis of construction and query cost
– Estimators, variance, confidence intervals [also in further work]