Range-Efficient Computation of F0 over Massive Data Streams


A. Pavan, Iowa State University
Srikanta Tirthapura, Iowa State University

Data Streams
Network monitoring: all packets on a network link
  Example statistics: average packet size, number of different source-destination pairs
Sensor data: average (mean, median) and other aggregates of sensor readings
Web-click streams: frequently requested items, change in request patterns over time
A one-pass algorithm is useful for data stored on disk
Iowa State University, ICDE 2005

Data Stream Characteristics
Massive data sets, one-pass processing
Limited workspace
  Much smaller than the size of the data
  Typically poly-logarithmic in the size of the data
Fast processing time per item
  Constant or logarithmic in the data size
Provide approximate answers to aggregate queries
  Frequency moments of the data (F0, F2, etc.)
  Quantiles, distances between streams

Reductions Between Data-Stream Problems
[Diagram: a reducer transforms Stream A into Stream B; function f is defined on A, function g on B, and g(B) = f(A)]

Reductions
A single element of Stream A may generate a list of elements in Stream B
Algorithm on Stream B: inefficient to process the elements of a list one by one
List-efficient algorithms process the list quickly
Range-efficient algorithms process a range of integers quickly

Reductions to Range-Efficient F0
[Diagram: Max-Dominance Norms, Counting Triangles in Graphs, and Duplicate-Insensitive Sketches each reduce to Range-Efficient F0]

Range-Efficient F0
Input stream: a sequence of ranges [l_1,r_1], [l_2,r_2], …, [l_m,r_m], where for each i, 0 ≤ l_i ≤ r_i < n, and l_i, r_i are integers
Output: | [l_1,r_1] ∪ [l_2,r_2] ∪ … ∪ [l_m,r_m] |, i.e. the number of distinct elements in the union (F0)
Constraints:
  Single pass through the data
  Small workspace
  Fast processing time

Example
Stream: [0,10], [100,200], [60,120], [5,25]
F0 is: |[0,25] ∪ [60,200]| = 167
[Figure: number line marking 5, 10, 25, 60, 100, 120, 200]
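As a point of reference for the example, the exact answer can be computed offline by sorting and merging the ranges. This is only a baseline sketch, not the streaming algorithm: it stores every range, which the single-pass and small-workspace constraints rule out. The function name `exact_range_f0` is made up for illustration.

```python
def exact_range_f0(ranges):
    """Exact number of distinct integers covered by closed ranges [l, r]."""
    total = 0
    cur_l = cur_r = None
    for l, r in sorted(ranges):
        if cur_r is None or l > cur_r + 1:
            # Disjoint from the current merged range: close that one out.
            if cur_r is not None:
                total += cur_r - cur_l + 1
            cur_l, cur_r = l, r
        else:
            # Overlapping or adjacent: extend the current merged range.
            cur_r = max(cur_r, r)
    if cur_r is not None:
        total += cur_r - cur_l + 1
    return total


print(exact_range_f0([(0, 10), (100, 200), (60, 120), (5, 25)]))  # 167
```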

Approximate Answers
Exact solutions are known to require too much workspace
(ε,δ)-approximation: return a random variable X such that Pr[|X − F0| > ε·F0] < δ

Max-Dominance Norm
Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ a_{i,j} ≤ n:
  a_{1,1} a_{1,2} … a_{1,m}
  a_{2,1} a_{2,2} … a_{2,m}
  …
  a_{k,1} a_{k,2} … a_{k,m}
Return Σ_{j=1..m} max_{1 ≤ i ≤ k} a_{i,j}
Cormode and Muthukrishnan, ESA 2003

Reduction From Max-Dominance Norm
Input stream I, output stream O: F0 of the output stream = dominance norm of the input stream
Assign a range of n integers to each of the m positions: [1,n], [n+1,2n], …, [(m-1)n+1, mn]
When element a_{i,j} is received, generate the range [(j-1)n+1, (j-1)n+a_{i,j}]
Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream
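The reduction can be checked on a tiny input. The sketch below assumes each position j owns its own block of n consecutive integers and that a value a_{i,j} covers the first a_{i,j} integers of that block; the names `dominance_ranges` and `covered` are hypothetical, and `covered` is a brute-force stand-in for a range-efficient F0 algorithm.

```python
def dominance_ranges(updates, n):
    """Map each update (i, j, a_ij) to a closed integer range.

    Position j (1-based) owns the block [(j-1)*n + 1, j*n]; the value
    a_ij covers the first a_ij integers of that block.
    """
    for _i, j, a in updates:
        base = (j - 1) * n
        yield (base + 1, base + a)


def covered(ranges):
    """Brute-force F0 of a stream of ranges (for checking only)."""
    seen = set()
    for l, r in ranges:
        seen.update(range(l, r + 1))
    return len(seen)


streams = [[3, 1, 4], [2, 5, 1]]  # k = 2 streams, m = 3 positions, n = 5
updates = [(i, j, a)
           for i, row in enumerate(streams, 1)
           for j, a in enumerate(row, 1)]
norm = sum(max(col) for col in zip(*streams))
print(norm, covered(dominance_ranges(updates, n=5)))  # 12 12
```

Because ranges within one block all start at the same point, their union has size equal to the largest value seen at that position, which is why the two numbers agree.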

Reduction from Sensor Data Aggregation
Problem: compute aggregates over sensor observations
Sensors transmit a "sketch" of the data instead of the full data
Multi-path routing
Duplicate-insensitive sketches
Duplicate-insensitive sum and average of sensor readings can be reduced to range-efficient F0
Distinct Summation Problem: Considine et al. (2004) and Nath et al. (2004)
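One way such a reduction might look is sketched below. The mapping is an assumption for illustration, not taken from the slides: each sensor has a 0-based integer id and reports one non-negative integer reading of at most `max_value`, and a reading becomes a range of that many integers inside a block reserved for the sensor. Duplicated deliveries of the same reading then map to the same range and do not change F0. The brute-force set stands in for a range-efficient F0 estimator, and all names are hypothetical.

```python
def reading_to_range(sensor_id, value, max_value):
    """Range of `value` integers inside the block owned by `sensor_id`."""
    base = sensor_id * max_value
    return (base, base + value - 1)


def distinct_sum(readings, max_value):
    """Sum of readings, counting each (sensor, value) pair only once."""
    seen = set()
    for sensor_id, value in readings:
        if value > 0:  # a zero reading contributes an empty range
            l, r = reading_to_range(sensor_id, value, max_value)
            seen.update(range(l, r + 1))
    return len(seen)


# Readings from sensors 0 and 1 arrive twice (e.g. over two routing paths).
readings = [(0, 7), (1, 3), (0, 7), (2, 5), (1, 3)]
print(distinct_sum(readings, max_value=10))  # 15
```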

Counting Triangles in Graphs
Problem: graph G = (V,E), where V = {1..n}
Elements of E arrive as a stream (i_1,j_1), (i_2,j_2), …
Compute the number of triangles in G
Bar-Yossef et al. (SODA 2002) show a reduction to range-efficient F0 and F2 on a stream of integers

Our Results
Input stream: a sequence of ranges [l_1,r_1], [l_2,r_2], …, [l_m,r_m], where for each i, 0 ≤ l_i ≤ r_i < n, and l_i, r_i are integers
Output: | [l_1,r_1] ∪ [l_2,r_2] ∪ … ∪ [l_m,r_m] |
Randomized (ε,δ)-approximation algorithm for range-efficient F0 of a data stream
Time complexity (n is the size of the universe):
  Amortized processing time per interval: O(log(1/δ) · log(n/ε))
  Time to answer a query for F0: O(log(1/δ))
Space complexity: O((1/ε^2) · log(1/δ) · log n)

Comparison to Previous Work
Range-Efficient F0
  Bar-Yossef et al. (2002): time per item = O((log^5 n) · (1/ε^5) · log(1/δ)); workspace = O((1/ε^3) · log n · log(1/δ))
  Our results: time per item = O((log n + log(1/ε)) · log(1/δ)); workspace = O((1/ε^2) · log n · log(1/δ))
Max-Dominance Norms
  Cormode, Muthukrishnan (2003): time per item = O((1/ε^4) · log n · log m · log(1/δ)); workspace = O((1/ε^2) · (log n + (1/ε) · log m · log log m) · log(1/δ))
  Our results: time per item = O((log n + log(1/ε)) · log(1/δ)); workspace = O((1/ε^2) · (log m + log n) · log(1/δ))

Related Work
F0 of a data stream
  Flajolet-Martin (JCSS 1985)
  Alon et al. (JCSS 1999)
  Gibbons and Tirthapura (SPAA 2001)
  Bar-Yossef et al. (RANDOM 2002)
  Lower bounds: Indyk-Woodruff (FOCS 2003)
L1 difference of data streams
  Feigenbaum et al. (FOCS 1999)
  Used range-summable hash functions

Algorithm
Random sampling, in two parts:
  Adaptive sampling
    Change sampling probabilities dynamically
    Gibbons and Tirthapura, SPAA 2001
  Range sampling
    Quickly sample from a range of integers
    Novel technical contribution

Adaptive Sampling for F0
Given a stream of numbers, find the number of distinct elements in the stream
Random sampling algorithm
  Random sample of the distinct elements seen so far
  Sampling level i (sampling probability = 1/2^i)
  If the sample size exceeds a threshold, then sub-sample to a smaller probability
Target workspace = O((1/ε^2) · log(1/δ)) integers

Adaptive Sampling Example
Target workspace = 4 numbers; stream: 5, 3, 7, 6, 8, 9, 2
Start: Sample = {}, p = 1
After 5: Sample = {5}, p = 1
After 3: Sample = {5,3}, p = 1
After 7: Sample = {5,3,7}, p = 1
After 6: Sample = {5,3,7,6}, p = 1
After 8: Sample = {5,3,7,6,8}, p = 1; overflow, sub-sample to Sample = {3,6,8}, p = ½
After 9: Sample = {3,6,8,9}, p = ½
After 2: Sample = {3,6,8,9,2}, p = ½; overflow, sub-sample to Sample = {6,9}, p = ¼
Finally: Sample = {6,9}, p = ¼; return (sample size) × 4 = 8
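The walkthrough can be turned into a small program. The sketch below is one possible rendering under stated assumptions, not the authors' code: sampling decisions come from a hash h(x) = (ax+b) mod p so that repeated elements are treated consistently, the prime 2^31 − 1 and the level threshold p/2^i are illustrative choices, inputs are assumed to be non-negative integers below the prime, and the class name `AdaptiveSampler` is hypothetical. Because the hash is random, it will not reproduce the specific sub-samples shown in the example.

```python
import random


class AdaptiveSampler:
    """Keeps a bounded sample of distinct elements at probability 1/2^level."""

    def __init__(self, capacity, seed=0, prime=2_147_483_647):
        rng = random.Random(seed)
        self.p = prime
        self.a = rng.randrange(1, prime)
        self.b = rng.randrange(0, prime)
        self.capacity = capacity
        self.level = 0
        self.sample = set()

    def _selected(self, x):
        # x is in the level-i sample iff h(x) < floor(p / 2^i).
        return (self.a * x + self.b) % self.p < (self.p >> self.level)

    def add(self, x):
        if self._selected(x):
            self.sample.add(x)
            # Overflow: halve the sampling probability and sub-sample.
            while len(self.sample) > self.capacity:
                self.level += 1
                self.sample = {y for y in self.sample if self._selected(y)}

    def estimate(self):
        # Sample size scaled by the inverse sampling probability.
        return len(self.sample) * (1 << self.level)


sampler = AdaptiveSampler(capacity=64)
for x in range(10_000):
    sampler.add(x)
    sampler.add(x)  # duplicates do not affect the sample
print(sampler.estimate())
```

A larger capacity gives a tighter estimate, in line with the target workspace on the earlier slide.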

Adaptive Sampling for Range F0
Naïve: given [x,y], successively insert x, x+1, x+2, …, y into the F0 algorithm
Problem: the time per range is very large

Range Sampling: Time Efficient
Quickly determine how many elements in the range [l,r] belong to the sample at the current probability

Hash Functions
Random sample through a hash function
  Consistent decisions for the same elements
Our hash function: h: {1,…,n} → {0,…,p-1}, h(x) = (ax+b) mod p
  Integers a and b chosen randomly from [0,p-1]
Element x belongs in the sample at level i if h(x) ∈ {0,…,v_i} for some pre-determined v_i
For a range [l,r], if h(x) ∈ {0,…,v_i} for some x ∈ [l,r], then the range is "useful"

Range Sampling Problem
f(x) = (ax+b) mod p
Compute |{x ∈ [x1,x2] : f(x) ∈ [0,v]}|
[Figure: the domain 1…n with the range [x1,x2], mapped by f onto 0…p-1 with the target interval [0,v]]
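The quantity being asked for is easy to state in code. The brute-force version below is only a reference definition, useful for testing a faster method; it takes time linear in the length of the range, which is exactly the cost range sampling is meant to avoid. The function name `count_hits_naive` is made up for illustration.

```python
def count_hits_naive(a, b, p, v, x1, x2):
    """|{x in [x1, x2] : (a*x + b) mod p in [0, v]}| by direct enumeration."""
    return sum(1 for x in range(x1, x2 + 1) if (a * x + b) % p <= v)


# f(x) = (3x + 1) mod 11 hits [0, 4] at x = 1, 4, 7, 8 within [1, 10].
print(count_hits_naive(a=3, b=1, p=11, v=4, x1=1, x2=10))  # 4
```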

Arithmetic Progression
f(x) = (ax+b) mod p
Common difference = a
[Figure: the range [x1,x2] within 1…n, and the values f(x1), f(x1+1), … on the circle 0…p-1 with the target interval [0,v]]

Low and High Revolutions
In each revolution, the number of hits on [0,v] is ⌊v/a⌋ (low revolution) or ⌊v/a⌋ + 1 (high revolution)
Task: count the number of low and high revolutions
[Figure: the modulo-p circle marking v, p-1, f(x1), f(x1+1)]

Starting Points of Revolutions
Can find r = v mod a such that:
  If the starting point is in [0,r], then it is a high revolution
  Else it is a low revolution
Task: count the number of revolutions with starting point in [0,r]
[Figure: the modulo-p circle marking v, p-1, f(x1), f(x1+1), r]
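The low/high distinction can be checked by simulation. The sketch below assumes 0 < a < p, treats a revolution as a maximal run of the progression that does not wrap past p-1, and takes the threshold to be r = v mod a with low count ⌊v/a⌋; the function name `revolutions` and the parameter values are illustrative. The first and last revolutions may be partial, so only the interior ones are checked.

```python
def revolutions(a, b, p, x1, x2):
    """Split f(x1), ..., f(x2) into revolutions (runs with no wrap past p-1).

    Assumes 0 < a < p, so a value smaller than its predecessor marks a wrap.
    """
    revs, current = [], []
    for x in range(x1, x2 + 1):
        y = (a * x + b) % p
        if current and y < current[-1]:
            revs.append(current)
            current = []
        current.append(y)
    revs.append(current)
    return revs


a, b, p, v = 7, 3, 101, 30
q, r = divmod(v, a)  # low count q = v // a, threshold r = v mod a
for rev in revolutions(a, b, p, 1, 200)[1:-1]:  # skip partial first/last
    hits = sum(1 for y in rev if y <= v)
    # High revolution (q + 1 hits) exactly when the start is in [0, r].
    assert hits == (q + 1 if rev[0] <= r else q)
```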

Recursive Algorithm
Observation: the starting points form an arithmetic progression on the modulo-a circle with common difference (−p mod a)
[Figure: the modulo-p circle marking p-1, and the modulo-a circle marking r, a-1]

Recursive Algorithm
Focus on the common difference
Two reductions are possible from common difference a:
  Common difference a − (p mod a)
  Common difference (p mod a)
The two sum to a, so at least one of them is at most a/2

Range Sampling
Range [x,y]:
  Time complexity: O(log(y-x))
  Space complexity: O(log(y-x) + log m)
Plug back into adaptive sampling to get the range-efficient F0 algorithm
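For experimentation, the same count can be obtained quickly with a different, well-known technique: a Euclidean-style "floor sum". This is not the recursive algorithm from the slides, only a stand-in that also avoids enumerating the range. It relies on the identity that (ax+b) mod p lies in [0,v] exactly when ⌊(ax+b)/p⌋ − ⌊(ax+b−v−1)/p⌋ = 1, for 0 ≤ v < p. The names `floor_sum` and `count_hits` are hypothetical.

```python
def floor_sum(n, m, a, b):
    """Sum of floor((a*i + b) / m) for i = 0..n-1, with m > 0 and n >= 0."""
    total = 0
    while n > 0:
        # Pull out whole multiples of m; divmod floors, so a negative
        # initial b is handled here too. Afterwards 0 <= a, b < m.
        qa, a = divmod(a, m)
        total += qa * n * (n - 1) // 2
        qb, b = divmod(b, m)
        total += qb * n
        y_max = a * n + b
        if y_max < m:
            break
        # Swap the roles of a and m, as in the Euclidean algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m
    return total


def count_hits(a, b, p, v, x1, x2):
    """|{x in [x1, x2] : (a*x + b) mod p in [0, v]}| for 0 <= v < p."""
    n = x2 - x1 + 1
    b0 = a * x1 + b  # shift so that the index i = x - x1 starts at 0
    return floor_sum(n, p, a, b0) - floor_sum(n, p, a, b0 - v - 1)


print(count_hits(a=3, b=1, p=11, v=4, x1=1, x2=10))  # 4
```

The loop shrinks the pair (a, m) the way the Euclidean algorithm does, so the number of iterations grows only logarithmically with p.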

Extensions
Distributed streams
  Each stream is observed by a different party
  Each party sends a "sketch" to a referee
  Estimate F0 over the union of the streams, using the sketches
Multi-dimensional ranges
Sliding windows

Open Problems
Simple range-efficient algorithms for Fk (k > 1)?
Time lower bounds for range-efficient F0
