Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)

Slides:



Advertisements
Similar presentations
Optimal Approximations of the Frequency Moments of Data Streams Piotr Indyk David Woodruff.
Advertisements

The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden.
The Average Case Complexity of Counting Distinct Elements David Woodruff IBM Almaden.
Efficient Algorithms via Precision Sampling Robert Krauthgamer (Weizmann Institute) joint work with: Alexandr Andoni (Microsoft Research) Krzysztof Onak.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Summarizing Distributed Data Ke Yi HKUST += ?. Small summaries for BIG data  Allow approximate computation with guarantees and small space – save space,
Sampling: Final and Initial Sample Size Determination
Heavy Hitters Piotr Indyk MIT. Last Few Lectures Recap (last few lectures) –Update a vector x –Maintain a linear sketch –Can compute L p norm of x (in.
Algorithms for data streams Foundations of Data Science 2014 Indian Institute of Science Navin Goyal.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 13 June 25, 2006
SIA: Secure Information Aggregation in Sensor Networks Bartosz Przydatek, Dawn Song, Adrian Perrig Carnegie Mellon University Carl Hartung CSCI 7143: Secure.
Fall 2006CENG 7071 Algorithm Analysis. Fall 2006CENG 7072 Algorithmic Performance There are two aspects of algorithmic performance: Time Instructions.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 12 June 18, 2006
1 Reversible Sketches for Efficient and Accurate Change Detection over Network Data Streams Robert Schweller Ashish Gupta Elliot Parsons Yan Chen Computer.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 8 May 4, 2005
Sublinear time algorithms Ronitt Rubinfeld Blavatnik School of Computer Science Tel Aviv University TexPoint fonts used in EMF. Read the TexPoint manual.
What ’ s Hot and What ’ s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishman Rutgers University ACM Principles of Database Systems.
A survey on stream data mining
Simple Linear Regression. Introduction In Chapters 17 to 19, we examine the relationship between interval variables via a mathematical equation. The motivation.
How Robust are Linear Sketches to Adaptive Inputs? Moritz Hardt, David P. Woodruff IBM Research Almaden.
Sketching, Sampling and other Sublinear Algorithms: Algorithms for parallel models Alex Andoni (MSR SVC)
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan.
Ragesh Jaiswal Indian Institute of Technology Delhi Threshold Direct Product Theorems: a survey.
Finding Frequent Items in Data Streams [Charikar-Chen-Farach-Colton] Paper report By MH, 2004/12/17.
Streaming Algorithm Presented by: Group 7 Advanced Algorithm National University of Singapore Min Chen Zheng Leong Chua Anurag Anshu Samir Kumar Nguyen.
Analysis of Algorithms
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
RESOURCES, TRADE-OFFS, AND LIMITATIONS Group 5 8/27/2014.
1 Joe Meehean.  Problem arrange comparable items in list into sorted order  Most sorting algorithms involve comparing item values  We assume items.
CSC 211 Data Structures Lecture 13
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
© Department of Statistics 2012 STATS 330 Lecture 20: Slide 1 Stats 330: Lecture 20.
Sublinear Algorithms via Precision Sampling Alexandr Andoni (Microsoft Research) joint work with: Robert Krauthgamer (Weizmann Inst.) Krzysztof Onak (CMU)
Sorting: Implementation Fundamental Data Structures and Algorithms Klaus Sutner February 24, 2004.
Data in Motion Michael Hoffman (Leicester) S Muthukrishnan (Google) Rajeev Raman (Leicester)
Analysis of Algorithms CS 477/677 Instructor: Monica Nicolescu Lecture 7.
Embedding and Sketching Sketching for streaming Alexandr Andoni (MSR)
Massive Data Sets and Information Theory Ziv Bar-Yossef Department of Electrical Engineering Technion.
11 Algorithmic Techniques for Massive Data (COMS ) Alex Andoni.
11 Lecture 24: MapReduce Algorithms Wrap-up. Admin PS2-4 solutions Project presentations next week – 20min presentation/team – 10 teams => 3 days – 3.
Sketching complexity of graph cuts Alexandr Andoni joint work with: Robi Krauthgamer, David Woodruff.
Mining of Massive Datasets Ch4. Mining Data Streams.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
SketchVisor: Robust Network Measurement for Software Packet Processing
Algorithms for Big Data: Streaming and Sublinear Time Algorithms
Stochastic Streams: Sample Complexity vs. Space Complexity
New Characterizations in Turnstile Streams with Applications
Approximating the MST Weight in Sublinear Time
A paper on Join Synopses for Approximate Query Answering
Finding Frequent Items in Data Streams
Estimating L2 Norm MIT Piotr Indyk.
Streaming & sampling.
Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang
Lecture 18: Uniformity Testing Monotonicity Testing
Sublinear Algorithmic Tools 3
Query-Friendly Compression of Graph Streams
Sublinear Algorithmic Tools 2
COMS E F15 Lecture 2: Median trick + Chernoff, Distinct Count, Impossibility Results Left to the title, a presenter can insert his/her own image.
CS 154, Lecture 6: Communication Complexity
Lecture 4: CountSketch High Frequencies
Lecture 7: Dynamic sampling Dimension Reduction
Turnstile Streaming Algorithms Might as Well Be Linear Sketches
Sublinear Algorihms for Big Data
CSCI B609: “Foundations of Data Science”
Range-Efficient Computation of F0 over Massive Data Streams
Lecture 6: Counting triangles Dynamic graphs & sampling
Heavy Hitters in Streams and Sliding Windows
Sublinear Algorihms for Big Data
(Learned) Frequency Estimation Algorithms
Presentation transcript:

Sketching, Sampling and other Sublinear Algorithms: Streaming Alex Andoni (MSR SVC)

A scenario IPFrequency IPFrequency Challenge: compute something on the table, using small space. Challenge: compute something on the table, using small space Example of “something”: # distinct IPs max frequency other statistics…

Sublinear: a panacea?  Sub-linear space algorithm for solving Travelling Salesperson Problem?  Sorry, perhaps a different lecture  Hard to solve sublinearly even very simple problems:  Ex: what is the count of distinct IPs seen  Will settle for:  Approximate algorithms: 1+  approximation true answer ≤ output ≤ (1+  ) * (true answer)  Randomized: above holds with probability 95%  Quick and dirty way to get a sense of the data IPFrequency

Streaming data  Data through a router  Data stored on a hard drive, or streamed remotely  More efficient to do a linear scan on a hard drive  Working memory is the (smaller) main memory

Application areas  Data can come from:  Network logs, sensor data  Real time data  Search queries, served ads  Databases (query planning)  …

Problem 1: # distinct elements i Frequency

Distinct Elements: idea 1 Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process (int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process (int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash [Flajolet-Martin’85, Alon-Matias-Szegedy’96]

Distinct Elements: idea 2 ZEROS(x) x= Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process (int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 Algorithm DISTINCT: Initialize: minHash=1 hash function h into [0,1] Process (int i): if (h(i) < minHash) minHash = h(index); Output: 1/minHash-1 Algorithm DISTINCT: Initialize: minHash2=0 hash function h into [0,1] Process (int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(index)); Output: 2^minHash2 Algorithm DISTINCT: Initialize: minHash2=0 hash function h into [0,1] Process (int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(index)); Output: 2^minHash2

Problem 2: max count  Problem: compute the maximum frequency of an element in the stream  Bad news:  Hard to distinguish whether an element repeated (max = 1 vs 2)  Good news:  Can find “heavy hitters”  elements with frequency > total frequency / s  using space proportional to s IPFrequency heavy hitters

Heavy Hitters: CountMin Algorithm CountMin: Initialize (r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process (int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate Algorithm CountMin: Initialize (r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process (int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate 11 [Charikar-Chen-FarachColton’04, Cormode-Muthukrishnan’05]

Heavy Hitters: analysis Algorithm CountMin: Initialize (r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process (int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate Algorithm CountMin: Initialize (r, L): array Sketch[L][w] L hash functions h[L], into {0,…w-1} Process (int i): for(j=0; j<L; j++) Sketch[j][ h[j](i) ] += 1; Output: foreach i in PossibleIP { freq[i] = int.MaxValue; for(j=0; j<L; j++) freq[i] = min(freq[i], Sketch[j][h[j](i)]); } // freq[] is the frequency estimate

Problem 3: Moments IP

Scenario 2: distributed traffic Two sketches should be sufficient to compute something on the difference or sum IPFrequency IPFrequency

Common primitive: estimate sum a1a1 a2a2 a3a3 a4a4 a1a1 a3a3

Precision Sampling Framework a1a1 a2a2 a3a3 a4a4 u1u1 u2u2 u3u3 u4u4 ã 1 ã 2 ã 3 ã 4

Formalization Sum EstimatorAdversary

Precision Sampling Lemma  Goal: estimate ∑a i from {a ̃ i } satisfying |a i -a ̃ i |<u i.  Precision Sampling Lemma: can get, with 90% success:  O(1) additive error and 1.5 multiplicative error: S – O(1) < S ̃ < 1.5*S + O(1)  with average cost equal to O(log n)  Example: distinguish Σ a i =3 vs Σ a i =0  Consider two extreme cases:  if three a i =1: enough to have crude approx for all (u i =0.1) if all a i =3/n: only few with good approx u i =1/n, and the rest with u i =1 ε1+ε S – ε < S̃ < (1+ ε)S + ε O(ε -3 log n) [A-Krauthgamer-Onak’11]

Precision Sampling Algorithm  Precision Sampling Lemma: can get, with 90% success:  O(1) additive error and 1.5 multiplicative error: S – O(1) < S ̃ < 1.5*S + O(1)  with average cost equal to O(log n)  Algorithm:  Choose each u i  [0,1] i.i.d.  Estimator: S ̃ = count number of i‘s s.t. a ̃ i / u i > 6 (up to a normalization constant)  Proof of correctness:  we use only a ̃ i which are 1.5-approximation to a i  E[ S ̃ ] ≈ ∑ Pr[a i / u i > 6] = ∑ a i /6.  E[1/u i ] = O(log n) w.h.p. function of [ã i /u i - 4/ε] + and u i ’s concrete distrib. = minimum of O(ε -3 ) u.r.v. O(ε -3 log n) ε1+ε S – ε < S̃ < (1+ ε)S + ε

x1x1 x2x2 x3x3 x4x4 x5x5 x6x6 y1+y3y1+y3 y4y4 y2+y5+y6y2+y5+y6 x= H=

Streaming++  LOTS of work in the area:  Surveys  Muthukrishnan:  McGregor: graphmining.pdfhttp://people.cs.umass.edu/~mcgregor/papers/08- graphmining.pdf  Chakrabarti: Fall11/Notes/lecnotes.pdfhttp:// Fall11/Notes/lecnotes.pdf  Open problems:  Examples:  Moments, sampling  Median estimation, longest increasing sequence  Graph algorithms  E.g., dynamic graph connectivity [AGG’12, KKM’13,…]  Numerical algorithms (e.g., regression, SVD approximation)  Fastest (sparse) regression […CW’13,MM’13,KN’13,LMP’13]  related to Compressed Sensing