Lp-Sampling
David Woodruff, IBM Almaden
Joint work with Morteza Monemizadeh, TU Dortmund

The problem: output i with probability |x_i|^p / F_p, where F_p = |x|_p^p = Σ_{i=1}^n |x_i|^p
- Given a stream of updates (i, a) to coordinates i of an n-dimensional vector x, where a is an integer, |a| < poly(n), and the stream length is < poly(n)
- Easy cases:
  - p = 1 and all updates of the form (i, 1) for some i. Solution: choose a random update in the stream and output the coordinate it updates [Alon, Matias, Szegedy]; generalizes to all positive updates
  - p = 0 and no deletions. Solution (min-wise hashing): hash all distinct coordinates as you see them and maintain the minimum hash value and its item [Broder, Charikar, Frieze, Mitzenmacher] [Indyk] [Cormode, Muthukrishnan]
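
For concreteness, here is a minimal sketch of the p = 0, insertion-only case via min-wise hashing. This is illustrative code, not the cited algorithms: a fully random hash (cached in a dictionary, so not sublinear space) stands in for the limited-independence hash families those papers use, and the class name is ours.

```python
import random

class MinWiseL0Sampler:
    """Keep the item whose hash value is smallest; since every distinct
    coordinate is equally likely to achieve the minimum, the survivor is
    a uniform sample of the distinct items (p = 0, no deletions)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.hashes = {}               # stand-in for a random hash function
        self.min_hash = float("inf")
        self.sample = None

    def _h(self, i):
        if i not in self.hashes:
            self.hashes[i] = self.rng.random()
        return self.hashes[i]

    def update(self, i):               # process an insertion of coordinate i
        hv = self._h(i)
        if hv < self.min_hash:
            self.min_hash, self.sample = hv, i

s = MinWiseL0Sampler(seed=1)
for i in [3, 7, 3, 1, 7, 9]:
    s.update(i)
print(s.sample)                        # uniform over the distinct items {1, 3, 7, 9}
```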

Our main result
- For every 0 ≤ p ≤ 2, there is an algorithm that fails with probability ≤ n^{-100}, and otherwise outputs an I in [n] for which, for all j in [n], Pr[I = j] = (1 ± ε)|x_j|^p / F_p
- So one can condition on every invocation succeeding in any poly(n)-time algorithm
- The algorithm is 1-pass, uses poly(ε^{-1} log n) space and update time, and also returns w_I = (1 ± ε)|x_I|^p / F_p
- Generalizes to a 1-pass, n^{1-2/p} poly(ε^{-1} log n)-space algorithm for p > 2
- "Additive-error" samplers, with Pr[I = j] = |x_j|^p / F_p ± ε, are given explicitly in [Jayram, W] and implicitly in [Andoni, DoBa, Indyk, W]

Lp-sampling solves and unifies many well-studied streaming problems:

Solves Sampling with Deletions:
- [Cormode, Muthukrishnan, Rozenbaum] want importance sampling with deletions: maintain a sample i with probability |x_i| / |x|_1. Set p = 1 in our theorem
- [Chaudhuri, Motwani, Narasayya] ask to sample from the result of a SQL operation, e.g., a self-join. Set p = 2 in our theorem
- [Frahling, Indyk, Sohler] study maintaining approximate range spaces and costs of Euclidean spanning trees; they need and obtain a routine to sample a point from a set undergoing insertions and deletions. Alternatively, set p = 0 in our theorem

Alternative solution to the Heavy Hitters Problem for any F_p:
- Output all i for which |x_i|^p > φ F_p; do not output any i for which |x_i|^p < (φ/2) F_p
- Studied by Charikar, Chen, Cormode, Farach-Colton, Ganguly, Muthukrishnan, and many others
- Invoke our algorithm Õ(1/φ) times and use the approximations to the values
- Optimal up to poly(ε^{-1} log n) factors
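
A sketch of what this reduction might look like, assuming an Lp-sampling oracle lp_sample() that returns a pair (i, w_i) with w_i = (1 ± ε)|x_i|^p / F_p; the oracle, the repetition constant, and the function names are assumptions, not the paper's code.

```python
import math

def heavy_hitters(lp_sample, phi, n, reps_const=10):
    """Draw about (1/phi) * log n independent Lp-samples: every phi-heavy
    coordinate appears with high probability, and the returned weight
    estimates let us discard coordinates lighter than phi/2."""
    reps = int(reps_const * math.log(n + 1) / phi)
    found = {}
    for _ in range(reps):
        i, w_i = lp_sample()
        found[i] = w_i                 # keep the latest weight estimate
    return {i: w for i, w in found.items() if w >= phi / 2}
```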

Solves Block Heavy Hitters: given an n x d matrix, return the indices i of rows R_i with |R_i|_p^p > φ · Σ_j |R_j|_p^p
- [Andoni, DoBa, Indyk] study the case p = 1; used by [Andoni, Indyk, Krauthgamer] for constructing a small-size sketch for the Ulam metric under the edit distance
- Treat R as a big (nd)-dimensional vector and sample an entry (i, j) using our theorem for general p
- The probability that row i is sampled is |R_i|_p^p / Σ_j |R_j|_p^p, so we can recover the IDs of all the heavy rows
- We do not use Cauchy random variables or Nisan's pseudorandom generator, so this could be more practical than [ADI]

Alternative solution to F_k-estimation for any k ≥ 2, optimal up to poly(ε^{-1} log n) factors:
- Reduction given by [Coppersmith, Kumar]:
  - Take r = O(n^{1-2/k}) L2-samples w_{i_1}, ..., w_{i_r}
  - In parallel, estimate F_2; call the estimate F_2'
  - Output (F_2'/r) · Σ_j w_{i_j}^{k-2}
- Proof: second moment method
- First algorithm for this problem not to use Nisan's pseudorandom generator
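
A toy version of this estimator, with an offline stand-in for the L2-sampler so that only the arithmetic of the reduction is shown; in a streaming implementation the sampler and the F_2 estimate would come from sketches.

```python
import random

x = [5.0, 1.0, 2.0, 0.0, 3.0]
F2 = sum(v * v for v in x)

def l2_sample():
    """Offline stand-in for a streaming L2-sampler: returns (i, |x_i|)
    with probability x_i^2 / F_2."""
    u, acc = random.random() * F2, 0.0
    for i, v in enumerate(x):
        acc += v * v
        if u <= acc:
            return i, abs(v)
    return len(x) - 1, abs(x[-1])

def fk_estimate(k, r, f2_estimate):
    # (F_2'/r) * sum_j w_{i_j}^{k-2}, as in the [Coppersmith, Kumar] reduction
    s = sum(l2_sample()[1] ** (k - 2) for _ in range(r))
    return (f2_estimate / r) * s

k = 3
print(fk_estimate(k, r=20000, f2_estimate=F2))
print(sum(abs(v) ** k for v in x))     # exact F_k, for comparison
```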

Solves Cascaded Moment Estimation: given an n x d matrix A, estimate F_k(F_p)(A) = Σ_j (F_p(A_j))^k = Σ_j |A_j|_p^{pk}
- Problem initiated by [Cormode, Muthukrishnan], who show F_2(F_0)(A) takes O(n^{1/2}) space if there are no deletions, and ask about the complexity for other k and p
- For any p in [0, 2], we get O(n^{1-1/k}) space for F_k(F_p)(A):
  - We get entry (i, j) with probability |A_{i,j}|^p / Σ_{i',j'} |A_{i',j'}|^p, so the probability that row A_i is returned is F_p(A_i) / Σ_j F_p(A_j)
  - If 2 passes are allowed, take O(n^{1-1/k}) samples A_i in the 1st pass, compute F_p(A_i) in the 2nd pass, and feed the results into the F_k AMS estimator
  - To get 1 pass, feed the row IDs into an O(n^{1-1/k})-space algorithm of [Jayram, W] for estimating F_k based only on item IDs
- The algorithm is space-optimal [Jayram, W]
- Our theorem with p = 0 gives O(n^{1/2}) space for F_2(F_0)(A) with deletions
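
A toy sketch of the 2-pass route just described, with offline stand-ins for the sampling machinery (all names are ours): pass 1 samples row indices proportionally to F_p(A_i), pass 2 computes F_p exactly for the sampled rows.

```python
import random

def fp_row(row, p):
    return sum(abs(v) ** p for v in row)

def cascaded_fk_fp(A, k, p, r):
    weights = [fp_row(row, p) for row in A]
    total = sum(weights)               # stands in for the pass-1 sampler
    rows = random.choices(range(len(A)), weights=weights, k=r)
    # sampling i w.p. F_p(A_i)/total gives E[F_p(A_i)^(k-1)] = F_k(F_p)(A)/total
    return total * sum(fp_row(A[i], p) ** (k - 1) for i in rows) / r

A = [[1, 2], [0, 5], [3, 3]]
k, p = 2, 1
print(cascaded_fk_fp(A, k, p, r=50000))
print(sum(fp_row(row, p) ** k for row in A))   # exact value, for comparison
```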

Ok, so how does it work?

General Framework [Indyk, W]
1. Form streams by subsampling
   - Level sets: S_t = {i : |x_i| in [η^{t-1}, η^t)} for η = 1 + Θ(ε) (assume p > 0 in the talk)
   - S_t contributes if |S_t| η^{pt} ≥ ζ F_p(x), where ζ = poly(ε / log n)
   - Let h: [n] -> [n] be a hash function; create log n substreams Stream_1, Stream_2, ..., Stream_{log n}, where Stream_j is the stream restricted to updates (i, c) with h(i) ≤ n/2^j
   - Suppose 2^j ≈ |S_t|. Then Stream_j contains about 1 item of S_t, and F_p(Stream_j) ≈ F_p(x)/2^j, so |S_t| η^{pt} ≥ ζ F_p(x) means η^{pt} ≥ ζ F_p(Stream_j)
2. Run a heavy hitters algorithm on the substreams
   - Can find the item of S_t in Stream_j with an F_p-heavy-hitters algorithm
3. Use the heavy hitters to estimate the contributing S_t
   - Repeat the sampling poly(ε^{-1} log n) times and count the number of times there was an item of S_t in Stream_j
   - Use this to estimate the sizes of the contributing S_t, and F_p(x) ≈ Σ_t |S_t| η^{pt}
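
A skeleton of the subsampling step, with illustrative names and a random permutation standing in for the hash function h: coordinate i survives into Stream_j iff h(i) ≤ n/2^j, so substream j keeps roughly a 2^{-j} fraction of the coordinates.

```python
import math, random

def level(value, eta):
    # the t with |value| in [eta^(t-1), eta^t), i.e., the level set of value
    return math.floor(math.log(abs(value), eta)) + 1

def substreams(updates, n, h):
    J = int(math.log2(n))
    streams = [[] for _ in range(J + 1)]
    for (i, c) in updates:
        for j in range(1, J + 1):
            if h[i] <= n / 2 ** j:     # survives the level-j threshold
                streams[j].append((i, c))
    return streams

n = 16
h = dict(zip(range(n), random.sample(range(1, n + 1), n)))  # toy hash
updates = [(0, 7), (3, 2), (3, 2), (5, -1), (9, 4)]
for j, s in enumerate(substreams(updates, n, h)):
    if j:
        print(j, s)                    # each level keeps about half of the last
print(level(7, eta=1.5))               # the level set a frequency of 7 falls into
```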

Additive Error Sampler [Jayram, W]
- For each contributing S_t, we also get poly(ε^{-1} log n) items from the heavy hitters routine
- If the subsampling is sufficiently random (Nisan's generator, min-wise independence), these items are random elements of S_t
- Since we have (1 ± ε)-approximations s'_t to all contributing |S_t|, we can:
  - Choose a contributing t with probability s'_t η^{pt} / Σ_{t'} s'_{t'} η^{pt'}
  - Output a random heavy hitter found in S_t
- For an item i in a contributing S_t: Pr[i output] = [s'_t η^{pt} / Σ_{t'} s'_{t'} η^{pt'}] · 1/|S_t| = (1 ± ε)|x_i|^p / F_p
- For an item i in a non-contributing S_t: Pr[i output] = 0
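
A minimal sketch of this two-stage draw, with assumed inputs: the estimates s'_t and, for each contributing class, the members recovered by the heavy hitters routine.

```python
import random

def additive_error_sample(classes, eta, p):
    """classes maps t -> (s_t_estimate, recovered members of S_t); pick t
    with probability proportional to s'_t * eta^(p*t), then output a
    uniformly random recovered member of S_t."""
    ts = list(classes)
    weights = [classes[t][0] * eta ** (p * t) for t in ts]
    t = random.choices(ts, weights=weights, k=1)[0]
    return random.choice(classes[t][1])

classes = {1: (3.0, [4, 9, 2]), 3: (1.0, [7])}
print(additive_error_sample(classes, eta=1.1, p=1))
```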

Relative Error in Words
- Force all classes to contribute: inject additional coordinates into each class whose purpose is to make every class contribute
- Inject just enough so that, overall, F_p does not change by more than a (1 + ε)-factor
- Run [Jayram, W]-sampling on the resulting vector; if the item sampled is an injected coordinate, discard it
- Repeat many times in parallel and take the first repetition that is not an injected coordinate
- Since the injected coordinates contribute only O(ε) of the F_p mass, a small number of repetitions suffices
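
A sketch of the rejection step, where sample_once and is_injected are assumed helpers (one independent [Jayram, W]-style draw over the augmented vector, and a test for injected coordinates).

```python
import random

def relative_error_sample(sample_once, is_injected, parallel_reps=64):
    # take the first of many parallel repetitions that is a real coordinate
    for _ in range(parallel_reps):
        i = sample_once()
        if not is_injected(i):
            return i
    return None   # every draw hit an injected coordinate (unlikely for small eps)

# toy demo: coordinates >= 100 play the role of the injected ones
print(relative_error_sample(lambda: random.randrange(105), lambda i: i >= 100))
```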

Some Minor Points
- Before seeing the stream, we don't know which classes contribute, so we inject coordinates into every class
- For S_t = {i : |x_i| in [η^{t-1}, η^t)}, inject Θ(ε F_p / (η^{pt} · #classes)) coordinates, where #classes = O(ε^{-1} log n)
- We need to know F_p: just guess it and verify the guess at the end of the stream
- For some classes Θ(ε F_p / (η^{pt} · #classes)) < 1, e.g. if t is very large, so we can't inject any new coordinates there; instead, find all elements in these classes and (1 ± ε)-approximations to their frequencies separately, using a heavy hitters algorithm
- When sampling, either choose a heavy hitter with the appropriate probability, or select from the contributing sets using [Jayram, W]

There is a Problem
- The [Jayram, W]-sampler fails with probability poly(ε / log n), in which case it can output any item
- This is due to some of the subroutines of [Indyk, W] that it relies on, which only succeed with this probability
- So the large poly(ε / log n) additive error is still there
- We cannot repeat [Jayram, W] multiple times for amplification: we would get a collection of samples with no obvious way of detecting failures
- On the other hand, for the simpler F_k-estimation problem one could just repeat [Indyk, W] and take the median
- Our solution: dig into the guts of the [Indyk, W] algorithm and amplify the success probability of its subroutines to ≥ 1 - n^{-100}

A Technical Point About [Indyk, W]
- In [Indyk, W], we create log n substreams Stream_j, where Stream_j includes each coordinate independently with probability 2^{-j}
- Can find the items of a contributing S_t in Stream_j with F_p-heavy hitters
- Repeat the sampling poly(ε^{-1} log n) times and observe the fraction of repetitions in which there is an item of S_t in Stream_j
- Since every class contributes, we can use [Indyk, W] to estimate every |S_t|
- Issue of misclassification: S_t = {i : |x_i| in [η^{t-1}, η^t)}, but the F_p-heavy hitters algorithm only reports approximate frequencies of the items i it finds
  - If |x_i| = η^t, it may be classified into S_t or S_{t+1}; this doesn't matter
  - Simpler solution than in [Indyk, W]: if an item is misclassified, just classify it consistently if we see it again
  - This is equivalent to sampling from a vector x' with |x'|_p = (1 ± ε)|x|_p
- Can ensure that, with probability ≥ 1 - n^{-100}, we obtain s'_t = (1 ± ε)|S_t| for all t

A Technical Point About [Jayram, W]
- Since we have s'_t = (1 ± ε)|S_t| for all t, we choose a class t with probability s'_t η^{pt} / Σ_{t'} s'_{t'} η^{pt'} and output a random heavy hitter found in S_t
- How do we output a random item of S_t? With a min-wise independent hash function h:
  - For each i in S_t, h(i) = min_{j in S_t} h(j) with probability (1 ± ε)/|S_t|
  - h can be an O(log 1/ε)-wise independent hash function
  - We recover the i* in S_t for which h(i*) is minimum
  - This is compatible with the subsampling, where Stream_j consists of the items i with h(i) ≤ n/2^j
- Our goal is to recover i* with probability ≥ 1 - n^{-100}
  - We have s'_t, and look at the level j* where |S_t|/2^{j*} = Θ(log n)
  - If h is O(log n)-wise independent, then with probability ≥ 1 - n^{-100}, i* is in Stream_{j*}
- A worry: maybe F_p(Stream_{j*}) >> F_p(x)/2^{j*}, so the heavy hitters algorithm doesn't work; this can be resolved with enough independent repetitions
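
A toy illustration of why min-hashing is compatible with the subsampling: the minimum-hash member of S_t is exactly the last one to survive as the threshold n/2^j shrinks, so recovering the minimum-hash survivor at level j* recovers i*. A random permutation stands in for the k-wise independent family.

```python
import random

n = 32
S_t = [3, 8, 14, 21, 30]
h = dict(zip(range(n), random.sample(range(1, n + 1), n)))  # toy hash

i_star = min(S_t, key=lambda i: h[i])     # the item the sampler should recover
for j in range(1, 6):
    survivors = [i for i in S_t if h[i] <= n / 2 ** j]
    print(j, survivors)                   # i_star is the last member to survive
```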

Beyond the Moraines: Sampling Records
- Given an n x d matrix M with rows M_1, ..., M_n, sample i with probability |M_i|_X / Σ_j |M_j|_X, where X is a norm
- If i is sampled, return a vector v for which |v|_X = (1 ± ε)|M_i|_X
- Applications: estimating planar EMD [Andoni, DoBa, Indyk, W]; sampling records in a relational database
- Define classes S_t = {i : |M_i|_X in [η^{t-1}, η^t)} for η = 1 + Θ(ε)
- If we have a heavy hitters algorithm for rows of a matrix, then we can apply a similar approach as before
- The space should be d · poly(ε^{-1} log n)

Heavy Hitters for Rows
- Algorithm in [Andoni, DoBa, Indyk, W]: partition the rows into B buckets and, in each bucket, maintain the vector sum of the rows hashed to it
- If |M_i|_X > γ Σ_j |M_j|_X and v is the vector in the bucket containing M_i, then by the triangle inequality:
  - |v|_X < |M_i|_X + |Noise|_X ≈ |M_i|_X + Σ_j |M_j|_X / B
  - |v|_X > |M_i|_X - |Noise|_X ≈ |M_i|_X - Σ_j |M_j|_X / B
- For B large enough, the noise translates into a relative error
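
A minimal sketch of this bucketing scheme, with X taken to be the l1 norm; the class and all names are ours, not [ADIW]'s code.

```python
import random

def l1(v):
    return sum(abs(c) for c in v)

class RowBucketSketch:
    """Hash each row index to one of B buckets and maintain the
    coordinate-wise sum of the rows in each bucket; a heavy row dominates
    its bucket, so the bucket sum's norm approximates the row's norm."""

    def __init__(self, B, d, seed=0):
        self.rng = random.Random(seed)
        self.bucket_of = {}                        # stand-in for a hash function
        self.sums = [[0.0] * d for _ in range(B)]
        self.B = B

    def update(self, i, row_delta):
        b = self.bucket_of.setdefault(i, self.rng.randrange(self.B))
        for j, c in enumerate(row_delta):
            self.sums[b][j] += c

    def estimate_row_norm(self, i):
        return l1(self.sums[self.bucket_of[i]])   # |M_i|_X plus bucket noise

sk = RowBucketSketch(B=8, d=2)
sk.update(0, [10.0, -7.0])                         # one heavy row
for i in range(1, 6):
    sk.update(i, [0.3, 0.1])                       # light rows contribute noise
print(sk.estimate_row_norm(0), l1([10.0, -7.0]))
```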

Lower Bounds
Recall the main result: for every 0 ≤ p ≤ 2, there is a randomized algorithm that outputs FAIL with probability ≤ n^{-100}, and otherwise outputs an I in [n] for which, for all j in [n], Pr[I = j] = (1 ± ε)|x_j|^p / F_p; it is 1-pass, uses poly(ε^{-1} log n) space and time, and returns w_I = (1 ± ε)|x_I|^p / F_p. For p > 2, it gives n^{1-2/p} poly(ε^{-1} log n) space.
- Can we use less space for p > 2? No: Ω(n^{1-2/p}) space is required for any ε, by a reduction from L1-estimation. This can be improved to Ω(n^{1-2/p} log n) using augmented L1-estimation [Jayram, W]
- Can we output FAIL with probability 0? No: Ω(n) space is required for any ε, by a reduction from 2-party equality testing with no error
- Given that we don't output FAIL, can we get a sampler with ε = 0? Yes for 2-pass algorithms, using rejection sampling. A 1-pass algorithm requires Ω(n) space if it outputs the corresponding probability w_I (needed in many applications), by a reduction from the 2-party INDEX problem

Some Open Questions
- 1-pass algorithms for Lp-sampling: if we output FAIL with probability ≤ n^{-100} and don't require outputting the sampled item's probability, can we get ε = 0 with low space?
- The ε and log n factors are large. What is the optimal dependence on them? This matters for F_k-estimation for k > 2, and for other applications
- Sampling from other distributions: given a vector (x_1, ..., x_n) in a data stream, for which functions g can we sample from the distribution μ(i) = |g(x_i)| / Σ_j |g(x_j)|? E.g., random walks

Thank you