Vladimir (Vova) Braverman, UCLA. Joint work with Rafail Ostrovsky.

Presentation transcript:

Plan:
General method for computing over frequencies with polylog space (zero-one frequency law)
Recursive sketching for vectors

[Figure: a stream, its frequencies, and the resulting frequency vector]

Frequency-Based Functions
$G : \mathbb{N} \to \mathbb{R}$
The objective function: $G\text{-Sum}(V) = \sum_i G(m_i)$
[Figure: the data as a frequency vector, and the modified vector with entries such as G(0), G(1), G(2), G(0), G(1), G(3)]

The (Basic) Streaming Model
Formal definition:
$D$ is a stream $p_1, \ldots, p_m$ where $p_j \in [n]$
Frequency: $m_i = |\{j : p_j = i\}|$
Frequency-based function: $G\text{-Sum}(D) = \sum_i G(m_i)$
$F_k$ frequency moment: $G(m_i) = m_i^k$
Limitations:
A single pass over $D$
Small (polylog) memory: $\left(\frac{1}{\varepsilon}\log(nm)\right)^{O(1)}$
What is needed:
Output a multiplicative approximation $X$ such that $P\left(|X - \sum_i G(m_i)| \le \varepsilon \sum_i G(m_i)\right) \ge 2/3$
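
As a point of reference for the definitions above, here is a minimal offline sketch that computes G-Sum exactly. It stores every frequency, so it uses linear rather than polylog memory and is not a streaming algorithm. The function name `g_sum` and the sample stream are illustrative assumptions, and elements that never appear are taken to contribute G(0) = 0.

```python
from collections import Counter
from typing import Callable, Iterable


def g_sum(stream: Iterable[int], g: Callable[[int], float]) -> float:
    """Exact G-Sum(D) = sum_i G(m_i), computed offline with linear memory."""
    freq = Counter(stream)  # m_i = |{j : p_j = i}|
    # Assumption: absent elements have m_i = 0 and contribute G(0) = 0,
    # so summing over the observed elements is enough.
    return sum(g(m) for m in freq.values())


stream = [3, 1, 3, 2, 3, 1]  # frequencies: m_1 = 2, m_2 = 1, m_3 = 3

f2 = g_sum(stream, lambda m: m ** 2)               # second moment: 4 + 1 + 9
f0 = g_sum(stream, lambda m: 1 if m > 0 else 0)    # number of distinct elements
print(f2, f0)
```

A streaming algorithm has to approximate the same quantity without keeping the `Counter`.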

Alon, Matias, Szegedy (STOC 1996, JCSS 1999, Gödel Prize 2005)
Frequency moments $G(x) = x^k$, in particular:
Polylog-space algorithms for $G(x) = x^0$ and $G(x) = x^2$
Lower bounds for $k > 2$
Algorithms for $k > 2$ (large but sublinear memory)
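
For illustration, here is a small sketch in the spirit of the Alon, Matias, Szegedy estimator for the second moment (G(x) = x^2). Each counter adds a random sign per stream element, the squared counters are averaged in groups, and the median of the group means is returned. The names `ams_f2`, `groups`, `per_group` and `seed` are assumptions made for this example. The signs are stored in explicit random tables for readability, whereas a space-efficient version would derive them from 4-wise independent hash functions. Elements are taken from `range(n)` rather than [n].

```python
import random
from statistics import median


def ams_f2(stream, n, groups=5, per_group=20, seed=0):
    """Estimate F_2 = sum_i m_i^2 in one pass over `stream`."""
    rng = random.Random(seed)
    k = groups * per_group
    # One random sign per (counter, element). Assumption: explicit tables
    # stand in for 4-wise independent hash functions.
    signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
    z = [0] * k
    for p in stream:                      # single pass over the stream
        for c in range(k):
            z[c] += signs[c][p]           # z[c] = sum_i sign_c(i) * m_i
    squares = [value * value for value in z]
    means = [
        sum(squares[g * per_group:(g + 1) * per_group]) / per_group
        for g in range(groups)
    ]
    return median(means)                  # median of means


# With a single distinct element every counter is +m or -m, so the
# estimate is exact; in general it is only an approximation.
single = ams_f2([4] * 7, n=10)
print(single)
```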

The open question of Alon, Matias, Szegedy (1996) What is the space complexity of estimating other functions G(x)?

The Main Result
Our result, for $G$ with $G(0) = 0$ and $G$ non-decreasing:
A function $G : \mathbb{R} \to \mathbb{R}$ is in the STREAM-POLYLOG class if there exists an algorithm $A$ such that for any data stream $D$ and for any $\varepsilon$, $A$ makes a single pass over $D$, uses $\left(\frac{1}{\varepsilon}\log(nm)\right)^{O(1)}$ memory bits and outputs $X$ such that $P\left(|X - \sum_i G(m_i)| \le \varepsilon \sum_i G(m_i)\right) \ge 2/3$.
$G$ is in STREAM-POLYLOG if and only if $G$ is tractable.

Related Work (a subset)
Alon, Gibbons, Matias, Szegedy, PODS 99; Alon, Matias, Szegedy, STOC 96; Andoni, Krauthgamer, Onak, 2010 (arXiv); Bar-Yossef, Jayram, Kumar, Sivakumar, JCSS 2004; Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan, RANDOM 2002; Beame, Jayram, Rudra, STOC 2007; Bhuvanagiri, Ganguly, Kesh, Saha, SODA 2006; Bhuvanagiri, Ganguly, ESA 2006; Chakrabarti, Do Ba, Muthukrishnan, SODA 2007; Chakrabarti, Cormode, McGregor, STOC 08, SODA 07; Chakrabarti, Khot, Sun, 2003; Chakrabarti, Regev, STOC 2011; Charikar, Chen, Farach-Colton, Th. Comp. Sc.; Coppersmith, Kumar, SODA 2004; Cormode, Datar, Indyk, Muthukrishnan, VLDB 2002; Cormode, Muthukrishnan, J. Alg.; Feigenbaum, Kannan, Strauss, Viswanathan, FOCS 99; Flajolet, Martin, JCSS 85; Ganguly, 2004, 2011; Ganguly, Cormode, RANDOM 2007; Guha, Indyk, McGregor, COLT 2007; Guha, McGregor, Venkatasubramanian, SODA 06; Harvey, Nelson, Onak, FOCS 08; Indyk, FOCS 2000; Indyk, Woodruff, FOCS 03, STOC 2005; Jayram, McGregor, Muthukrishnan, Vee, PODS 07; Kane, Nelson, Woodruff, PODS 2010, SODA 2010; Kane, Nelson, Porat, Woodruff, STOC 2011; Li, SODA 2009, KDD 07; McGregor, Indyk, SODA 2009; Monemizadeh, Woodruff, SODA 2010; Muthukrishnan, 2005; Nelson, Woodruff, PODS 2011; Saks, Sun, STOC 2002; Woodruff, SODA 2004

Lower Bounds
Reduction to the multiparty SET-DISJOINTNESS problem
The reduction requires monotonicity
Relatively straightforward (see the paper)

Lower Bounds (informal)
Assume first that $x = k \cdot y$
Pick $N \sim G(x)/G(y)$
[Figure: the stream, made of blocks of y copies of an element (i i … i, j j … j, …)]

Reduction (very informal)
If the sets intersect then, by monotonicity, the value of G-Sum is at least $N G(y) + G(x) \sim 2G(x)$
If they do not intersect then the value is at most $(N+k) G(y) \sim G(x)$
Any constant-factor approximation algorithm for G-Sum MUST recognize the difference
And thus requires $N/k^2$ space ([Chakrabarti, Khot, Sun]), which is larger than any polylog
Thus G is not tractable
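
To make the arithmetic concrete, the short sketch below plugs numbers into the two bounds quoted on this slide. The function `gap`, the sample function G(m) = 2^m - 1 and the values y = 1, k = 10 are assumptions chosen for illustration; they are not taken from the talk. It only evaluates the two expressions and does not build the reduction itself.

```python
def gap(g, y, k):
    """Evaluate the two G-Sum bounds from the informal reduction."""
    x = k * y                                 # assume x = k * y
    big_n = g(x) // g(y)                      # pick N ~ G(x) / G(y)
    intersecting = big_n * g(y) + g(x)        # lower bound if the sets intersect
    disjoint = (big_n + k) * g(y)             # upper bound if they do not
    return big_n, intersecting, disjoint


# Assumed example: G(m) = 2^m - 1, which is non-decreasing with G(0) = 0.
big_n, intersecting, disjoint = gap(lambda m: 2 ** m - 1, y=1, k=10)
print(big_n, intersecting, disjoint, intersecting / disjoint)
```

For this choice the two cases differ by a factor close to 2, which is the difference a constant-factor approximation would have to detect.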

Upper Bound: Basic Ideas
We follow the fundamental idea of Indyk and Woodruff
First we solve a specific case of G-heavy elements
Then we show that the general case can be solved by recursive sketching

[Diagram: a mimic F for G, and a certifier H with output 1 or 0]
IF H = 1 RETURN F ELSE RETURN 0

G-heavy elements
[Figure: frequency vector of size n, with entries G(1), G(10^10), G(1)]

If G is “good” then every G-heavy element is also F2-heavy
[Figure: frequencies under G(x)=x^2 and G(x)=x^3/2; a certifier over G1, G2, G3; a mimic F for G, and a certifier H with output 1 or 0]
IF H = 1 RETURN F ELSE RETURN 0

Lemma 0 (very informal)

Proof for L_p (0<p<2)

Proof (sketch)

Mimic Function
[Diagram: a mimic F for G, and a certifier H with output 1 or 0]
IF H = 1 RETURN F ELSE RETURN 0

Recursive Sketches

Lemma 1
Let $V \in \mathbb{R}^n$ be a vector with non-negative entries. Let $H \in \{0,1\}^n$ be a random vector with pairwise-independent uniform entries. Let S be s.t.: Define Then

The Hadamard product Had(U, V) of two vectors U and V is the vector with entries $v_i u_i$:
$\mathrm{Had}(U, V) = (v_1 u_1, v_2 u_2, \ldots, v_n u_n)$
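
A minimal sketch of the Hadamard product on plain Python lists follows. The function name `had` and the sample vectors are illustrative assumptions. When one argument is a 0/1 vector, the product keeps the entries where it is 1 and zeroes the rest, which is how it acts as a sampling step.

```python
def had(u, v):
    """Entrywise (Hadamard) product of two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return [ui * vi for ui, vi in zip(u, v)]


v = [5, 0, 2, 7]
h = [1, 1, 0, 1]          # a 0/1 vector: entry 2 is dropped
sampled = had(h, v)
print(sampled)            # [5, 0, 0, 7]
```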

Lemma 2 Denote for i=1,2,..,t Then are i.i.d. vectors

Lemma 3 Denote Then for

The general algorithm (informal)
Maintain $H_1, \ldots, H_t$
We can obtain $V_i$ by dropping all stream elements that are not “sampled”
For $t = O(\log n)$, the number of non-zero elements in $V_t$ is constant, with constant probability
Thus, given an oracle for “heavy” elements, the sum can be approximated using only log(n) calls to the “heavy” elements oracle
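
The following offline simulation sketches how such a recursion could be organized. It is an illustration under several assumptions, not the algorithm from the talk: the vector is held in memory, the heavy-element oracle is exact, the 0/1 entries are fully independent rather than pairwise independent, and the combining rule Y_j = 2 Y_{j+1} + sum over heavy i of (1 - 2 h_i) v_i is derived here from the fact that each h_i has mean 1/2. The names `heavy_indices`, `recursive_estimate` and `alpha` are hypothetical.

```python
import random


def heavy_indices(v, alpha):
    """Exact oracle (assumption): indices i with v_i > 0 and v_i >= alpha * |V|."""
    total = sum(v)
    return [i for i, x in enumerate(v) if x > 0 and x >= alpha * total]


def recursive_estimate(v, t, alpha, rng):
    """Estimate |V| = sum(v) from t sampling levels and one oracle call per level."""
    n = len(v)
    hs = [[rng.randint(0, 1) for _ in range(n)] for _ in range(t)]  # H_1..H_t
    levels = [list(v)]                                              # V_0 = V
    for h in hs:                                # V_j = Had(V_{j-1}, H_j)
        levels.append([a * b for a, b in zip(levels[-1], h)])
    y = sum(levels[t])        # deepest level, assumed small enough to sum exactly
    for j in range(t - 1, -1, -1):
        h = hs[j]                               # the vector that produced level j+1
        heavy = heavy_indices(levels[j], alpha)
        # Heavy entries are counted exactly; the rest is twice the sampled part.
        y = 2 * y + sum((1 - 2 * h[i]) * levels[j][i] for i in heavy)
    return y


v = [5, 0, 2, 7, 1, 1]
exact_case = recursive_estimate(v, t=4, alpha=0.0, rng=random.Random(1))
no_heavy = recursive_estimate(v, t=3, alpha=2.0, rng=random.Random(1))
print(exact_case, no_heavy)
```

With `alpha=0.0` every non-zero entry counts as heavy and the recursion returns the exact sum. With no heavy elements at all, the estimate degenerates to 2^t times the sum of the deepest level, which shows why the oracle is needed to control the variance.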

The Algorithm for large Frequency moments (informal)
The general algorithm works for any “separable” vector, in particular for the frequency moments vector
Also, such oracles for “heavy” elements exist for frequency moments, e.g., CountSketch by Charikar, Chen, Farach-Colton
The final algorithm requires $n^{1-2/k} \log(n)\log(m)\log(\log\ldots(\log(nm)))$ memory bits
Independently, Andoni, Krauthgamer, Onak improved the bound to $n^{1-2/k} \log(n)\log(m)$ (Precision Sampling: Alex’s talk yesterday)
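
As a rough illustration of the kind of structure a “heavy” elements oracle can be built on, here is a compact CountSketch-style frequency estimator. It is a sketch under assumptions: the class layout, the parameters `depth`, `width` and `seed`, and the explicit random bucket and sign tables are choices made for this example, whereas a space-efficient implementation would use hash functions with limited independence. Elements are taken from `range(n)`.

```python
import random
from statistics import median


class CountSketch:
    """Frequency estimator: `depth` rows of `width` signed counters."""

    def __init__(self, n, depth=5, width=64, seed=0):
        rng = random.Random(seed)
        # Assumption: explicit tables stand in for hash functions.
        self.bucket = [[rng.randrange(width) for _ in range(n)] for _ in range(depth)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def update(self, i, delta=1):
        """Process one stream element i (or `delta` copies of it)."""
        for r, row in enumerate(self.table):
            row[self.bucket[r][i]] += self.sign[r][i] * delta

    def estimate(self, i):
        """Median over rows of the signed counter that element i maps to."""
        return median(
            self.sign[r][i] * row[self.bucket[r][i]]
            for r, row in enumerate(self.table)
        )


cs = CountSketch(n=100)
for _ in range(9):
    cs.update(42)
only_item = cs.estimate(42)   # exact here because nothing else collides
print(only_item)
```

Elements whose estimated frequency is large relative to the rest of the vector are the candidates an oracle would report as heavy.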

Notes
We need to overcome additional technical issues
Heavy elements: from precise values to approximations

Open problems
Characterize non-monotonic functions (we made some progress)
Extend the results to sublinear algorithms (o(n) space)
Other models: deletions, sliding windows, etc.
Optimal algorithm for large frequency moments

Thank you!