Streaming Symmetric Norms via Measure Concentration


Streaming Symmetric Norms via Measure Concentration
Robert Krauthgamer, Weizmann Institute of Science
Joint with: Jaroslaw Blasiok, Vladimir Braverman, Stephen R. Chestnut, and Lin F. Yang
Weizmann, January 2018

Data-Stream Model
Today's massive data sets require ever more efficient algorithms.
Streaming algorithms (aka the data-stream model):
- Input is a stream of tokens
- Typical model: one pass, low memory (called storage/space), randomized

Frequency-Vector Model
Maintain $v \in \mathbb{R}^n$ (initially zero).
Each input token is an "update": token $(j,a)$ represents the update $v_j \leftarrow v_j + a$.
Goal: at the "end", estimate $f(v)$.
Typically $f : \mathbb{R}^n \to \mathbb{R}$ is a norm, like the $\ell_p$-norm for $0 \le p \le \infty$.
Standard assumptions:
- Integral updates, usually just $a \in \{-1,+1\}$
- Stream length is $m \le \mathrm{poly}(n)$, and every $|v_j| \le \mathrm{poly}(n)$
Why called frequency? It models the following important scenario: the stream is a sequence of items from a large domain $[n]$ (e.g., IP addresses), and coordinate $v_j$ counts the frequency of item $j$.
Known as the turnstile model: it allows insertions + deletions, even deficits.
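To pin down these semantics, here is a minimal (non-space-efficient) reference implementation of the turnstile model; it stores $v$ explicitly, which a real streaming algorithm cannot afford, and the function names are illustrative:

```python
import numpy as np

def process_stream(n, stream, f):
    """Reference semantics of the turnstile model: maintain v in R^n,
    apply updates (j, a), and evaluate f(v) at the end. A real streaming
    algorithm must approximate f(v) without storing all of v."""
    v = np.zeros(n)
    for j, a in stream:     # token (j, a) means v_j <- v_j + a
        v[j] += a
    return f(v)

# Example: insertions and deletions of items from the domain [4].
stream = [(0, +1), (3, +1), (0, +1), (3, -1)]
print(process_stream(4, stream, lambda v: np.linalg.norm(v, 2)))  # -> 2.0
```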

Which Functions/Norms?
Fantastic progress on $\ell_p$-norms:
Goal: $(1+\epsilon)$-approximation of $f(v) = (\sum_j |v_j|^p)^{1/p}$
Essentially tight storage bounds for every $p$!
- $0 \le p \le 2$: logarithmic storage $O(\log n)$ [Alon-Matias-Szegedy'96, Indyk'00, ...]
- $2 < p \le \infty$: polynomial storage $\tilde{O}(n^{1-2/p})$ [Indyk-Woodruff'05, Bar-Yossef-Jayram-Kumar-Sivakumar'04, ...]
Has led to many other results:
- Other functions, like entropy, cascaded norms, additive functions
- Other problems, like $\ell_p$-heavy hitters and $\ell_p$-sampling
- Powerful techniques, like linear sketching, embeddings, heavy hitters, precision sampling, even algorithms for sparse recovery
- Communication complexity of indexing, set disjointness, gap Hamming
Too many references to do justice here!
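As a concrete illustration of the $0 \le p \le 2$ regime, a toy sketch in the spirit of the AMS $\ell_2$ estimator (assumption: explicit random sign vectors stand in for the 4-wise independent hash functions a real low-memory implementation would use):

```python
import numpy as np

rng = np.random.default_rng(0)

def ams_l2(n, stream, k=200):
    """k independent counters z_i = <s_i, v> with random signs s_i in {-1,+1}^n,
    so E[z_i^2] = ||v||_2^2; averaging the squares reduces the variance.
    (Illustrative: stores the sign vectors explicitly; a real implementation
    generates them from 4-wise independent hashing in small space.)"""
    signs = rng.choice([-1.0, 1.0], size=(k, n))
    z = np.zeros(k)
    for j, a in stream:
        z += a * signs[:, j]              # one pass over the updates
    return np.sqrt(np.mean(z ** 2))

stream = [(j % 50, +1) for j in range(5000)]  # v: 50 coordinates equal to 100
print(ams_l2(50, stream))                     # ~ ||v||_2 = 100*sqrt(50) ≈ 707
```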

Outstanding Questions
Q1: Which norms can be computed in the streaming model?
- Key examples: earthmover distance, matrix norms (on $\mathbb{R}^{n \times n}$)
Q1': Which distances can be computed in the sketching model?
- Sketch = summary of the input (e.g., the memory image of a streaming algorithm at an "intermediate time")
Q2: Is there a universal sketch for "all" norms?
[See the list of open problems at http://sublinear.info, #30 and #5]

Broader Characterization of Norms?
We need a generic framework or tools!
Embedding = low-distortion mapping of a "new" norm into a "known" one:
- Embeddings are "universal" for sketching norms [Andoni-K.-Razenshteyn'15]
- But they are difficult to construct (and should be linear and explicit)
- And the "loss" is in approximation quality (not in storage)
Effective techniques for "additive" functions $f(v) = \sum_j \varphi(v_j)$:
- Heavy hitters = coordinates with a "large" contribution to the norm
- Hierarchical subsampling = a virtual stream on a subset of coordinates
We characterize all symmetric norms.

Symmetric Norms
Definition: A norm $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ is called symmetric if
- $\|x\|$ is invariant under coordinate permutations
- $\|x\|$ is invariant under sign flips
This implies monotonicity: $\forall i\ |x_i| \le |y_i| \Rightarrow \|x\| \le \|y\|$.
Examples:
- $\ell_p$-norms
- the top-$k$ norm, defined as $\Phi_k(x) = |x_{(1)}| + \dots + |x_{(k)}|$, the sum of the $k$ largest coordinate magnitudes
- the $k$-support norm
Non-examples:
- cascaded norms
- matrix operator norms
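A quick sanity check of the definition, using the top-$k$ norm (the helper top_k_norm is illustrative):

```python
import numpy as np

def top_k_norm(x, k):
    """Top-k norm: the sum of the k largest coordinate magnitudes
    (k=1 gives l_inf; k=n gives l_1)."""
    return np.sort(np.abs(x))[::-1][:k].sum()

x = np.array([3.0, -1.0, 4.0, 1.0, -5.0])
rng = np.random.default_rng(1)
perm = rng.permutation(len(x))
signs = rng.choice([-1.0, 1.0], size=len(x))
# Symmetry: permuting coordinates and flipping signs preserves the norm.
assert np.isclose(top_k_norm(x, 2), top_k_norm(signs * x[perm], 2))
print(top_k_norm(x, 2))  # -> 9.0 (|-5| + |4|)
```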

Modulus of Concentration
$b = \sup\{\|x\| : x \in S^{n-1}\}$, $M = \mathrm{median}\{\|x\| : x \in S^{n-1}\}$, and $\mathrm{mc} := b/M$.
Theorem [from Levy's Lemma]: If $f : S^{n-1} \to \mathbb{R}$ is $b$-Lipschitz with median $M$, then
$\Pr_{x \in S^{n-1}}[f(x) \notin M \pm 2b/\sqrt{n}] \le 1/3$, and
$\forall \epsilon > 0:\ \Pr_{x \in S^{n-1}}[f(x) \notin (1 \pm \epsilon)M] \le \sqrt{\pi/2}\, e^{-(M/b)^2 \epsilon^2 n/2}$.
Theorem [Dvoretzky, Milman]: For every norm $\|\cdot\|$ on $\mathbb{R}^n$ and $\epsilon > 0$ there is a linear subspace $S$ of dimension $\ge c_\epsilon\, n/\mathrm{mc}^2$, equipped with a Euclidean norm $\|\cdot\|_2'$, such that $\forall x \in S:\ \|x\|_2' \le \|x\| \le (1+\epsilon)\|x\|_2'$.
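These quantities are easy to probe numerically; a Monte Carlo sketch of the median $M$ on the sphere (illustrative, not part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_median(norm, n, trials=2000):
    """Monte Carlo estimate of M = median{ ||x|| : x uniform on S^{n-1} }."""
    g = rng.standard_normal((trials, n))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    return np.median(norm(x))

n = 1000
# l_1: b = sqrt(n) while M ≈ sqrt(2n/pi), so mc(l_1) = Theta(1).
print(sphere_median(lambda x: np.abs(x).sum(axis=1), n), np.sqrt(2 * n / np.pi))
# l_inf: b = 1 while M ≈ sqrt(2 ln(n)/n), so mc(l_inf) grows with n.
print(sphere_median(lambda x: np.abs(x).max(axis=1), n), np.sqrt(2 * np.log(n) / n))
```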

Examples (random $x$ on the unit sphere $S^{n-1}$):
- $\ell_1$: $b = \sqrt{n}$, $M \approx \sqrt{n}$, so $\mathrm{mc}(\ell_1) \approx 1$; streaming complexity $\tilde{\Theta}(1)$. (A random vector essentially hits the maximum.)
- $\ell_2$: $b = 1$, $M = 1$, so $\mathrm{mc}(\ell_2) = 1$; streaming complexity $\tilde{\Theta}(1)$.
- $\ell_p$ for $p > 2$: $b = 1$, $M \approx n^{1/p - 1/2}$, so $\mathrm{mc}(\ell_p) \approx n^{1/2 - 1/p}$; streaming complexity $\Theta(n^{1-2/p})$. (A random vector essentially hits the minimum.)

Wild Guess
Is $\Theta(\mathrm{mc}^2)$ the optimal space complexity for stream-computing every symmetric norm?
Reminder:
- For $\ell_1$: $b = \sqrt{n}$, $M = \Theta(\sqrt{n})$ $\Rightarrow$ $\mathrm{mc} = O(1)$
- For $\ell_\infty$: $b = 1$, $M = \Theta(\sqrt{\log n / n})$ $\Rightarrow$ $\mathrm{mc} = O(\sqrt{n/\log n})$
No! Define $\|\cdot\| = \max\{\|\cdot\|_\infty, \frac{1}{\sqrt{n}}\|\cdot\|_1\}$. Then $b = 1$ and $M = \Theta(1)$, so $\mathrm{mc} = O(1)$; yet it contains a copy of $\ell_\infty$ of dimension $\sqrt{n}$, and thus the space is $\Omega(\sqrt{n})$.
Lesson: subspaces can be an obstacle!
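A quick numerical look at this counterexample norm (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# The counterexample norm ||x|| = max(||x||_inf, ||x||_1 / sqrt(n)):
# on the sphere the l_1 part dominates and is ~sqrt(2/pi), so M = Theta(1),
# while b = 1 by Cauchy-Schwarz, giving mc = O(1) despite the l_inf copy.
n = 4096
g = rng.standard_normal((3000, n))
x = g / np.linalg.norm(g, axis=1, keepdims=True)
vals = np.maximum(np.abs(x).max(axis=1), np.abs(x).sum(axis=1) / np.sqrt(n))
print(np.median(vals))   # ≈ sqrt(2/pi) ≈ 0.80
```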

Main Result
Maximum Modulus of Concentration: $\mathrm{mmc} := \max_{k \le n} b^{(k)}/M^{(k)}$, where
$b^{(k)} = \sup\{\|(x, 0, \dots, 0)\| : x \in S^{k-1}\}$,
$M^{(k)} = \mathrm{median}\{\|(x, 0, \dots, 0)\| : x \in S^{k-1}\}$.
Theorem: Let $\|\cdot\|$ be a symmetric norm on $\mathbb{R}^n$. Then there is a 1-pass algorithm that $(1+\epsilon)$-approximates the norm using $\mathrm{mmc}^2 \cdot \mathrm{poly}(\frac{1}{\epsilon}\log n)$ bits of storage. Moreover, every such algorithm requires $\Omega(\mathrm{mmc}^2)$ bits of storage.
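A rough Monte Carlo estimate of mmc (the flat_sup heuristic for $b^{(k)}$ is an assumption: it is exact for $\ell_p$ and top-$k$ norms, but only heuristic in general):

```python
import numpy as np

rng = np.random.default_rng(0)

def flat_sup(norm, n, k):
    """Heuristic for b^(k): for a symmetric norm, search over 'flat' unit
    vectors with j <= k equal-magnitude coordinates."""
    best = 0.0
    for j in range(1, k + 1):
        x = np.zeros((1, n))
        x[0, :j] = 1.0 / np.sqrt(j)
        best = max(best, float(norm(x)[0]))
    return best

def mmc_estimate(norm, n, trials=2000):
    """Monte Carlo sketch of mmc = max_{k<=n} b^(k)/M^(k); illustrative only."""
    best = 0.0
    for k in [2 ** i for i in range(1, int(np.log2(n)) + 1)]:
        g = rng.standard_normal((trials, k))
        x = np.zeros((trials, n))
        x[:, :k] = g / np.linalg.norm(g, axis=1, keepdims=True)
        best = max(best, flat_sup(norm, n, k) / np.median(norm(x)))
    return best

# Top-10 norm on R^1024: mmc = Theta(sqrt(n/k)), up to a log factor.
top10 = lambda X: np.sort(np.abs(X), axis=1)[:, -10:].sum(axis=1)
print(mmc_estimate(top10, 1024), np.sqrt(1024 / 10))
```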

Two Examples
Top-$k$ norm: $\Phi_k(x) = |x_{(1)}| + \dots + |x_{(k)}|$, with $\mathrm{mmc} = \Theta(\sqrt{n/k})$.
- Like in sparse recovery or PCA (picking the signal / top eigenvalues)
$k$-support norm: unit ball $B_k = \mathrm{conv}\{x \in \mathbb{R}^n : |\mathrm{supp}(x)| \le k, \|x\|_2 \le 1\}$, with $\mathrm{mmc} = O(\log n)$.
- Was used in machine learning

Algorithmic Outline
Inspired by [Indyk-Woodruff'05]. Analysis:
0. Assume wlog the coordinates are nonnegative and sorted
1. Round the coordinates to powers of $1+\epsilon$, called levels
2. Forget some levels (of low contribution)
3. Estimate the size of the remaining levels
Here, the "storage" will be governed by mmc.
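An offline sketch of steps 0-1 (the streaming algorithm cannot compute this directly; it must recover the important level sizes via sketches):

```python
import numpy as np

def level_decomposition(v, eps=0.1):
    """Offline illustration of steps 0-1: take |v|, round each nonzero
    coordinate down to a power of (1+eps), and count the level sizes b_i."""
    w = np.abs(v)
    w = w[w > 0]
    levels = np.floor(np.log(w) / np.log(1 + eps)).astype(int)
    ids, counts = np.unique(levels, return_counts=True)
    return dict(zip(ids.tolist(), counts.tolist()))   # level i -> b_i

v = np.array([9.0, 9.5, 1.0, 1.02, 0.0, 100.0])
print(level_decomposition(v))  # levels near 1 (b=2), near 9 (b=2), near 100 (b=1)
```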

1. Round the Coordinates
Let $V \in \mathbb{R}^n$ be the vector $v$ after rounding.
By monotonicity, $\|v\|$ is changed by a factor $\le 1+\epsilon$.

2. Forget Some Levels
Let $V_i \in \mathbb{R}^n$ have the level-$i$ coordinates of $V$ (i.e., zero all other coordinates).
Definition: Level $i$ is $\beta$-contributing if $\|V_i\| \ge \beta \|V\|$.
Let $V' \in \mathbb{R}^n$ have all $\beta$-contributing levels of $V$ (i.e., zero every non-contributing level).
Lemma: $\|V'\| \ge (1 - \beta \log_{1+\epsilon} n)\|V\|$.
(There are at most $\log_{1+\epsilon} n$ levels, and by the triangle inequality dropping each non-contributing level decreases the norm by less than $\beta\|V\|$.)

2½. Analysis of Medians
Let $\mathbf{1}^{(n')}$ be the vector with $n'$ ones (padded with zeros).
Lemma 1 (Flat Median): For all $1 \le n' \le n$, $\frac{1}{\sqrt{n'}}\|\mathbf{1}^{(n')}\| \simeq M^{(n')}$.
- Proof uses Levy's Lemma (measure concentration on the sphere).
Lemma 2 (Median Monotonicity): For all $1 \le n' \le n'' \le n$, $\frac{1}{\sqrt{n'}}\|\mathbf{1}^{(n')}\| \lesssim \mathrm{mmc} \cdot \frac{1}{\sqrt{n''}}\|\mathbf{1}^{(n'')}\|$.
- Proof: $\frac{1}{\sqrt{n'}}\|\mathbf{1}^{(n')}\| \le b^{(n'')} \le \mathrm{mmc} \cdot M^{(n'')} \simeq \mathrm{mmc} \cdot \frac{1}{\sqrt{n''}}\|\mathbf{1}^{(n'')}\|$.
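Lemma 1 can be checked numerically, e.g., for $\ell_1$ (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Numerical check of Lemma 1 (Flat Median) for the l_1 norm:
# (1/sqrt(n')) * ||1^{(n')}||_1 = sqrt(n') should match M^{(n')} up to a constant.
for n_prime in [4, 16, 64, 256]:
    flat = np.sqrt(n_prime)
    g = rng.standard_normal((4000, n_prime))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)
    M = np.median(np.abs(x).sum(axis=1))       # M^{(n')} for l_1
    print(n_prime, flat / M)                   # stays ~sqrt(pi/2) ≈ 1.25
```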

3. Estimate Contributing Levels
Let $b_i$ denote the cardinality of level $i$.
Lemma 3 (Important Levels): If level $i$ is $\beta$-contributing, then
- $b_i (1+\epsilon)^{2i} \gtrsim \frac{\beta^2}{\mathrm{mmc}^2} \sum_{j<i} b_j (1+\epsilon)^{2j}$ (compares $\|V_i\|_2^2$ vs. $\|V_{<i}\|_2^2$)
- $b_i \gtrsim \frac{\beta^2}{\mathrm{mmc}^2} \sum_{j>i} b_j$ (compares the sizes of the levels)
The proof relies on our analysis of medians.
Estimate $\hat{b}_i$ of all important levels by the approach of [IW05] ($\ell_2$ heavy hitters via CountSketch).
Lemma 4: Given estimates $(1-\epsilon) b_i \le \hat{b}_i \le b_i$, the corresponding $\hat{V} \in \mathbb{R}^n$ satisfies $(1-\epsilon)\|V_i\| \le \|\hat{V}_i\| \le \|V_i\|$.
The proof relies on the norm being symmetric.
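A minimal CountSketch, the $\ell_2$ heavy-hitters primitive invoked here (illustrative: explicit random tables stand in for hash functions):

```python
import numpy as np

rng = np.random.default_rng(0)

class CountSketch:
    """Minimal CountSketch [Charikar-Chen-Farach-Colton'02], the l_2
    heavy-hitters primitive behind the [IW05]-style level estimation."""
    def __init__(self, n, rows=5, cols=256):
        self.rows = np.arange(rows)
        self.bucket = rng.integers(0, cols, size=(rows, n))
        self.sign = rng.choice([-1.0, 1.0], size=(rows, n))
        self.table = np.zeros((rows, cols))

    def update(self, j, a):          # turnstile update v_j += a
        self.table[self.rows, self.bucket[:, j]] += a * self.sign[:, j]

    def estimate(self, j):           # median over rows of per-row estimates
        return np.median(self.table[self.rows, self.bucket[:, j]] * self.sign[:, j])

cs = CountSketch(n=10000)
for _ in range(3000):
    cs.update(7, +1)                 # one heavy coordinate
for j in range(10000):
    cs.update(j, +1)                 # light background mass
print(cs.estimate(7))                # close to 3001
```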

The Lower Bound
Communication complexity of (promise) Set Disjointness with $t$ players, where player $i$ has a set $S_i \subseteq [n]$:
- Case 1: unique intersection (all the sets share exactly one common element)
- Case 2: no intersection (the sets are pairwise disjoint)
One-way communication; player $t$ outputs the decision.
Every randomized protocol must communicate a total of $\Omega(n/t)$ bits [Chakrabarti-Khot-Sun'03, Gronemeier'09].

Streaming Algorithm yields a Protocol
The protocol:
- Map each $S_i$ to a stream of updates $f(S_i)$
- Player 1 runs the streaming algorithm $A$ on $f(S_1)$ and passes the memory content $M(A)$ to the next player
- Each player $i$ continues the execution of $A$ on her $f(S_i)$, and so on
- The last player outputs a result based on the output of $A$
If the players succeed whp, then the total communication is $\Omega(n/t)$.
Thus at least one message has size $\Omega(n/t^2)$, and hence $A$ requires storage $\Omega(n/t^2)$.
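A toy simulation of this reduction (ToyStreamAlgo is a hypothetical stand-in for $A$; only the pattern of forwarding the memory image matters):

```python
import copy

class ToyStreamAlgo:
    """Stand-in for streaming algorithm A; its entire memory M(A) is one
    counter (purely for illustration)."""
    def __init__(self):
        self.count = 0
    def update(self, j, a):
        self.count += a
    def output(self):
        return self.count

def simulate_protocol(players_updates, make_algo=ToyStreamAlgo):
    """Each player continues A on her own stream f(S_i) and forwards the
    memory image M(A); communication = (t-1) * |M(A)| bits in total."""
    algo = make_algo()
    for updates in players_updates:
        algo = copy.deepcopy(algo)   # models shipping M(A) to the next player
        for j, a in updates:
            algo.update(j, a)
    return algo.output()

print(simulate_protocol([[(1, +1), (2, +1)], [(2, +1)], [(5, +1)]]))  # -> 4
```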

The Reduction
Given a symmetric norm $\|\cdot\|$:
- Fix a "bad" $v \in S^{n-1}$ that attains the maximum $\|v\| = b$
- Set the number of players to $t = \sqrt{n}/\mathrm{mmc} \approx \sqrt{n}\,M/b$
- The players have $n^2$ shared values $Z_{i,j} \sim N(0,1)$ for $i,j \in [n]$ (for intuition, think $Z_{i,j} \in \{\pm 1\}$)
- The players implicitly agree on $n$ vectors (cyclic shifts of $v$ with random Gaussian multipliers):
$V_1 = (v_1 Z_{11}, v_2 Z_{12}, \dots, v_n Z_{1n}) \in \mathbb{R}^n$
$V_2 = (v_2 Z_{21}, v_3 Z_{22}, \dots, v_1 Z_{2n}) \in \mathbb{R}^n$
...
$V_n = (v_n Z_{n1}, v_1 Z_{n2}, \dots, v_{n-1} Z_{nn}) \in \mathbb{R}^n$
- Each player $i$ adds to the stream all vectors $V_j$ such that $j \in S_i$
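A minimal sketch of the gadget (hard_instance_vectors is an illustrative name; the Gaussians model the shared randomness):

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_instance_vectors(v):
    """Lower-bound gadget: V_i is the i-th cyclic shift of the 'bad'
    maximizer v, multiplied coordinate-wise by shared Gaussians Z_{i,j}."""
    n = len(v)
    Z = rng.standard_normal((n, n))        # shared randomness Z_{i,j} ~ N(0,1)
    return np.array([np.roll(v, -i) * Z[i] for i in range(n)])

v = np.zeros(8)
v[0] = 1.0                                  # e.g., the maximizer for l_inf
V = hard_instance_vectors(v)
# With disjoint sets, a sum of distinct V_j looks like one random vector;
# a unique intersection k adds t*V_k, whose norm ~ t*b stands out.
```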

Analysis
If there is no intersection, the final (total) vector is $U = \sum_{j \in S_1 \cup S_2 \cup \dots \cup S_t} V_j$, which is essentially a random vector!
If there is a unique intersection, say at $k \in [n]$, the final vector is $W = \sum_{j \in (S_1 \cup \dots \cup S_t) \setminus \{k\}} V_j + t V_k$.
Lemma: with constant probability,
- $\|U\| \le 40\sqrt{n}\,M$ (because each entry has magnitude $O(1)$)
- $\|W\| \ge 60\sqrt{n}\,M$ (because $\|t V_k\| \gtrsim t b = \sqrt{n}\,M$)

Concluding Remarks
Extensions:
- Tight tradeoff between storage (space) and accuracy (approximation)
Further Directions:
- Simpler algorithm?
- Arbitrary norms? Matrix norms? (Several recent papers...)
- Reductions between problems?
Thank You!