© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. Structure-Aware Sampling:

Slides:

Advertisements

Similar presentations

1+eps-Approximate Sparse Recovery Eric Price MIT David Woodruff IBM Almaden.

Advertisements

Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.

Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.

Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.

An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna

Fast Algorithms For Hierarchical Range Histogram Constructions

Summarizing Distributed Data Ke Yi HKUST += ?. Small summaries for BIG data  Allow approximate computation with guarantees and small space – save space,

3/13/2012Data Streams: Lecture 161 CS 410/510 Data Streams Lecture 16: Data-Stream Sampling: Basic Techniques and Results Kristin Tufte, David Maier.

Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck Microsoft Research.

Multidimensional Indexing

Sorting Comparison-based algorithm review –You should know most of the algorithms –We will concentrate on their analyses –Special emphasis: Heapsort Lower.

Mining Data Streams.

1 Summarizing Data using Bottom-k Sketches Edith Cohen AT&T Haim Kaplan Tel Aviv University.

1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku

Approximating Sensor Network Queries Using In-Network Summaries Alexandra Meliou Carlos Guestrin Joseph Hellerstein.

Copyright 2004 David J. Lilja1 What Do All of These Means Mean? Indices of central tendency Sample mean Median Mode Other means Arithmetic Harmonic Geometric.

Probabilistic Aggregation in Distributed Networks Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz {hling, ravenben, adj,

Deterministic Wavelet Thresholding for Maximum-Error Metrics Minos Garofalakis Bell Laboratories Lucent Technologies 600 Mountain Avenue Murray Hill, NJ.

Iterative Optimization of Hierarchical Clusterings Doug Fisher Department of Computer Science, Vanderbilt University Journal of Artificial Intelligence.

Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.

Accessing Spatial Data

Finding Aggregates from Streaming Data in Single Pass Medha Atre Course Project for CS631 (Autumn 2002) under Prof. Krithi Ramamritham (IITB).

Statistics & Modeling By Yan Gao. Terms of measured data Terms used in describing data –For example: “mean of a dataset” –An objectively measurable quantity.

Analysis of greedy active learning Sanjoy Dasgupta UC San Diego.

Ensemble Learning: An Introduction

Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang.

FALL 2006CENG 351 Data Management and File Structures1 External Sorting.

B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.

1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.

Experimental Evaluation

B + -Trees COMP171 Fall AVL Trees / Slide 2 Dictionary for Secondary storage * The AVL tree is an excellent dictionary structure when the entire.

CS 580S Sensor Networks and Systems Professor Kyoung Don Kang Lecture 7 February 13, 2006.

One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.

Hydrologic Statistics

Fast Approximate Wavelet Tracking on Streams Graham Cormode Minos Garofalakis Dimitris Sacharidis

Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.

Game Trees: MiniMax strategy, Tree Evaluation, Pruning, Utility evaluation Adapted from slides of Yoonsuck Choe.

Coordinated weighted sampling for estimating aggregates over multiple weight assignments Edith Cohen, AT&T Research Haim Kaplan, Tel Aviv University Shubho.

Minimax Trees: Utility Evaluation, Tree Evaluation, Pruning CPSC 315 – Programming Studio Spring 2008 Project 2, Lecture 2 Adapted from slides of Yoonsuck.

1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,

Special Topics in Data Engineering Panagiotis Karras CS6234 Lecture, March 4 th, 2009.

Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.

Trevor Brown – University of Toronto B-slack trees: Space efficient B-trees.

Boltzmann Machine (BM) (§6.4) Hopfield model + hidden nodes + simulated annealing BM Architecture –a set of visible nodes: nodes can be accessed from outside.

Wavelet Synopses with Predefined Error Bounds: Windfalls of Duality Panagiotis Karras DB seminar, 23 March, 2006.

Constructing Optimal Wavelet Synopses Dimitris Sacharidis Timos Sellis

Sorting Fun1 Chapter 4: Sorting     29  9.

CSC 211 Data Structures Lecture 13

Getting the Most out of Your Sample Edith Cohen Haim Kaplan Tel Aviv University.

Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.

Computing Classic Closeness Centrality, at Scale Edith Cohen Joint with: Thomas Pajor, Daniel Delling, Renato Werneck.

Space-Efficient Online Computation of Quantile Summaries SIGMOD 01 Michael Greenwald & Sanjeev Khanna Presented by ellery.

Spatial Indexing Techniques Introduction to Spatial Computing CSE 5ISC Some slides adapted from Spatial Databases: A Tour by Shashi Shekhar Prentice Hall.

By: Gang Zhou Computer Science Department University of Virginia 1 Medians and Beyond: New Aggregation Techniques for Sensor Networks CS851 Seminar Presentation.

Query Processing CS 405G Introduction to Database Systems.

1. Searching The basic characteristics of any searching algorithm is that searching should be efficient, it should have less number of computations involved.

CSC321: Introduction to Neural Networks and Machine Learning Lecture 15: Mixtures of Experts Geoffrey Hinton.

Data Structures and Algorithms Instructor: Tesfaye Guta [M.Sc.] Haramaya University.

ENTROPY Entropy measures the uncertainty in a random experiment. Let X be a discrete random variable with range S X = { 1,2,3,... k} and pmf p k = P X.

Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo

Estimating standard error using bootstrap

Advanced Sorting 7 2  9 4   2   4   7

Chapter 11 Sorting Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and Mount.

Data Transformation: Normalization

Orthogonal Range Searching and Kd-Trees

Spatial Online Sampling and Aggregation

COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.

Linear sketching with parities

Linear sketching with parities

Presentation transcript:

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. Structure-Aware Sampling: Flexible and Accurate Summarization Edith Cohen, Graham Cormode, Nick Duffield AT&T Labs-Research

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Approximate summaries are vital in managing large data – E.g. sales records of a retailer, network activity for an ISP – Need to store compact summaries for later analysis  State-of-the-art summarization via sampling – Widely deployed in many settings – Models data as (key, weight) pairs – General purpose summary, enables subset-sum queries – Higher level analysis: quantiles, heavy hitters, other patterns & trends 2 Summaries and Sampling

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Current sampling methods are structure oblivious – But most queries are structure respecting!  Most queries are actually range queries – “How much traffic from region X to region Y between 2am and 4am?”  Much structure in data – Order (e.g. ordered timestamps, durations etc.) – Hierarchy (e.g. geographic and network hierarchies) – (Multidimensional) products of structures  Can we make sampling structure-aware and improve accuracy? 3 Limitations of Sampling

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Inclusion Probability Proportional to Size (IPPS): – Given parameter , probability of sampling key with weight w is min{1, w/  } – Key i has adjusted weight a i = w i /p  (w i ) = max{  w i } (Horvitz-Thompson) – Can pick a  so that expected sample size is k  VarOpt sampling methods are Variance Optimal over keys: – Produces a sample of size exactly k keys – Allow correlations between inclusion of keys (unlike Poisson sampling) – Give strong tail bounds on estimates via H-T estimates – But do not yet consider structure of keys 4 Background on Sampling

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Define a probabilistic aggregate of vector of sampling probs: – Let vector p  [0,1] n define sampling probabilities for n keys – Probabilistic aggregation to p’ sets entries to 0 or 1 so that:   i. E[p’ i ] = p i (Agreement in expectation)   i p’ i =  i p i (Agreement in sum)   key sets J. E[  i  J p’ i ]   i  J p i (Inclusion bounds)   key sets J. E[  i  J (1-p’ i )]   i  J (1-p i )(Exclusion bounds)  Apply probabilistic aggregation until all entries are set (0 or 1) – The 1 entries define the contents of the sample – This sample meets the requirements for a VarOpt sample 5 Probabilistic Aggregation

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Pair aggregation implements probabilistic aggregation – Pick two keys, i and j, such that neither is 0 or 1 – If p i + p j < 1, one of them gets set to 0:  Pick j to set to 0 with probability p i /(p i + p j ), or i with p j /(p i + p j )  The other gets set to p i + p j (preserving sum of probabilities) – If p i + p j  1, one of them gets set to 1:  Pick i with probability (1 - p j )/(2 - p i - p j ), or j with (1 - p i )/(2 - p i - p j )  The other gets set to p i + p j - 1 (preserving sum of probabilities) – This satisfies all requirements of probabilistic aggregation – There is complete freedom to pick which pair to aggregate at each step  Use this to provide structure awareness by picking “close” pairs 6 Pair Aggregation

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  We want to measure the quality of a sample on structured data  Define range discrepancy based on difference between number of keys sampled in a range, and the expected number – Given a sample S, drawn according to a sample distribution p: Discrepancy of range R is  (S, R) = abs(|S  R| -  i  R p i ) – Maximum range discrepancy maximizes over ranges and samples: Discrepancy over sample dbn  is  = max s   max R  R  (S,R) – Given range space R, seek sampling schemes with small discrepancy 7 Range Discrepancy

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Can give very tight bounds for one-dimensional range structures  R = Disjoint Ranges – Pair selection picks pairs where both keys are in same range R – Otherwise, pick any pair  R = Hierarchy – Pair selection picks pairs with lowest LCA  In both cases, for any R  R, |S  R|  {   i  R p i ,   i  R p i  } – The maximum discrepancy is optimal:  < 1 8 One-dimensional structures

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  R = order (i.e. points lie on a line in 1D) – Apply a left-to-right algorithm over the data in sorted order – For first two keys with 0 < p i, p j < 1, apply pair aggregation – Remember which key was not set, find next unset key, pair aggregate – Continue right until all keys are set  Sampling scheme for 1D order has discrepancy  < 2 – Analysis: view as a special case of hierarchy over all prefixes – Any R  R is the difference of 2 prefixes, so has  < 2  This is tight: cannot give VarOpt distribution with  < 2 – For given , we can construct a worst case input 9 One-dimensional order

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  More generally, we have multidimensional keys  E.g. (timestamp, bytes) is product of hierarchy with order  KDHierarchy approach partitions space into regions – Make probability mass in each region approximately equal – Use KD-trees to do this. For each dimension in turn:  If it is an ‘order’ dimension, use median to split keys  If it is a ‘hierarchy’, find the split that minimizes the size difference  Recurse over left and right branches until we reach leaves 10 Product Structures

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Any query rectangle fully contains some rectangles, and cuts others – In d-dimensions on s leaves, at most O(d s (d-1)/d log s) rectangles touched – Consequently, error is concentrated around O((d log 1/2 s)s (d-1)/2d) ) 11 KD-Hierarchy Analysis

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Building the KD-tree over all data consumes a lot of space  Instead, take two passes over data and use less space – Pass 1: Compute uniform sample of size s’ > s and build tree – Pass 2: Maintain one key for each node in the tree  When two keys fall in same node, use pair aggregation  At end, pair aggregate up the tree to generate final sample  Conclude with a sample of size s, guided by structure of tree  Variations of the same approach work for 1D structures 12 I/O efficient sampling for product spaces

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Compared structure aware I/O Efficient Sampling to: – VarOpt ‘obliv’ (structure unaware) sampling – Qdigest: Deterministic summary for range queries – Sketches: Randomized summary based on hashing – Wavelets: 2D Haar wavelets – generate all coefficients, then prune  Studied on various data sets with different size, structure – Shown here: network traffic data (product of 2 hierarchies) – Query loads: uniform area rectangles, and uniform weight rectangles 13 Experimental Study

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. 14 Accuracy results  Compared on uniform area queries, and uniform weight queries  Clear benefit to structure aware sampling  Wavelet sometimes competitive but very slow Absolute Error Summary Size Network Data, uniform area queries aware obliv wavelet qdigest 2-4x improvement

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Structure aware sampling is somewhat slower than VarOpt – But still much faster than everything else, particularly wavelets  Queries take same time to perform for both sampling methods – Just answer query over the sample 15 Scalability Results Time (s)

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Structure aware sampling can improve accuracy greatly – For structure-respecting queries – Result is still variance optimal  The streaming (one-pass) case is harder – There is a unique VarOpt sampling distribution – Instead, must relax VarOpt requirement – Initial results in SIGMETRICS’11 16 Concluding Remarks

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property. 17 Back up slides

© 2010 AT&T Intellectual Property. All rights reserved. AT&T and the AT&T logo are trademarks of AT&T Intellectual Property.  Two level hierarchy with 9 leaves – Draw a sample of size k=3  Structure-aware sampling would pick only 1 item from each branch  Uniform stream sampling picks each subset of 3 items uniformly – Probability of picking one from each branch = 3 3 /9C3 = 9/28  There is an optimal structure aware sampling scheme that always picks one from each branch – Gives exact estimates for weight of {A,B,C} {D,E,F} and {G,H,I} – But not possible to sample from this distribution in the stream 18 Toy Example v3v3 v1v1 v2v2 AGIHBCEFD