Sparse Recovery (Using Sparse Matrices) Piotr Indyk MIT
Heavy Hitters
Also called frequent elements and elephants.
Define HH^p_φ(x) = { i : |x_i| ≥ φ ||x||_p }.
Lp Heavy Hitter Problem:
  Parameters: φ and φ' (often φ' = φ - ε)
  Goal: return a set S of coordinates such that
    S contains HH^p_φ(x)
    S is included in HH^p_φ'(x)
Lp Point Query Problem:
  Parameter: ε
  Goal: at the end of the stream, given i, report x*_i = x_i ± ε ||x||_p
Can solve the L2 point query problem, with parameter ε and failure probability P, by storing O(1/ε^2 log(1/P)) numbers (sketches) [A-Gibbons-M-S'99], [Gilbert-Kotidis-Muthukrishnan-Strauss'01]
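Not part of the original slides: a minimal Count-Sketch-style L2 point-query structure in the spirit of the sketches cited above. The class name, the exact bucket/row counts (constants omitted), the seeding scheme, and the use of Python's random module are illustrative assumptions, not the cited papers' construction.

```python
# Illustrative Count-Sketch-style L2 point query (a sketch, not the cited papers' code).
import math
import random
import statistics

class PointQuerySketch:
    def __init__(self, n, eps, fail_prob, seed=0):
        rng = random.Random(seed)
        self.w = max(1, round(1 / eps ** 2))                   # O(1/eps^2) buckets per row
        self.d = max(1, math.ceil(math.log2(1 / fail_prob)))   # O(log(1/P)) rows
        self.h = [[rng.randrange(self.w) for _ in range(n)] for _ in range(self.d)]
        self.s = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(self.d)]
        self.Z = [[0.0] * self.w for _ in range(self.d)]

    def update(self, i, delta):
        """Stream update x_i += delta (the sketch is linear, so increments suffice)."""
        for t in range(self.d):
            self.Z[t][self.h[t][i]] += self.s[t][i] * delta

    def query(self, i):
        """Report an estimate of x_i, accurate to roughly +/- eps * ||x||_2 w.h.p."""
        return statistics.median(self.s[t][i] * self.Z[t][self.h[t][i]]
                                 for t in range(self.d))
```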
Sparse Recovery
(approximation theory, learning Fourier coefficients, linear sketching, finite rate of innovation, compressed sensing, ...)
Setup: data/signal in n-dimensional space: x. E.g., x is a 256×256 image, n = 65536.
Goal: compress x into a "sketch" Ax, where A is an m × n "sketch matrix", m << n.
Requirements:
  Plan A: recover x from Ax. Impossible: underdetermined system of equations.
  Plan B: recover an "approximation" x* of x.
    Sparsity parameter k.
    Informally: want to recover the largest k coordinates of x.
    Formally: want x* such that ||x* - x||_p ≤ C(k) min_{x'} ||x' - x||_q over all x' that are k-sparse (at most k non-zero entries).
Want:
  Good compression (small m = m(k, n))
  Efficient algorithms for encoding and recovery
Why linear compression? [figure: the sketch as a matrix-vector product Ax = A·x]
Application I: Monitoring Network Traffic Data Streams
A router routes packets:
  Where do they come from?
  Where do they go to?
Ideally, would like to maintain a traffic matrix x[.,.]
  Easy to update: given a (src, dst) packet, increment x_{src,dst}
  Requires way too much space! (2^32 × 2^32 entries)
  Need to compress x, increment easily
Using linear compression we can:
  Maintain the sketch Ax under increments to x, since A(x + Δ) = Ax + AΔ
  Recover x* from Ax
[figure: the traffic matrix x, indexed by source and destination]
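A toy illustration (not from the slides) of why linearity matters for streaming: only the m-dimensional sketch Ax is stored, and each packet updates it via A(x + e_i) = Ax + A e_i. The Gaussian matrix and the 100×100 source/destination space are arbitrary stand-ins.

```python
# Toy streaming update of a linear sketch; sizes and the Gaussian A are stand-ins.
import numpy as np

NUM_HOSTS = 100                       # toy address space; real routers face 2^32 addresses
n = NUM_HOSTS * NUM_HOSTS             # flattened traffic matrix x[src, dst]
m = 200                               # sketch length, m << n
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))       # any linear sketch matrix illustrates the point

sketch = np.zeros(m)                  # all we store; x itself is never materialized

def on_packet(src, dst):
    """Increment x[src, dst] by 1, maintained only through the sketch."""
    global sketch
    i = src * NUM_HOSTS + dst
    sketch += A[:, i]                 # A(x + e_i) = Ax + A e_i

for src, dst in [(3, 7), (3, 7), (42, 1)]:
    on_packet(src, dst)
```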
Applications, ctd.
Single pixel camera [Wakin, Laska, Duarte, Baron, Sarvotham, Takhar, Kelly, Baraniuk'06]
Pooling experiments [Kainkaryam, Woolf'08], [Hassibi et al.'07], [Dai, Sheikh, Milenkovic, Baraniuk], [Shental-Amir-Zuk'09], [Erlich-Shental-Amir-Zuk'09]
Constructing matrix A
"Most" matrices A work.
  Sparse matrices: data stream algorithms, coding theory (LDPCs)
  Dense matrices: compressed sensing, complexity/learning theory (Fourier matrices)
"Traditional" tradeoffs:
  Sparse: computationally more efficient, explicit
  Dense: shorter sketches
Recent results: the "best of both worlds"
Results
(Cells were color-coded in the talk on the scale: Excellent / Very Good / Good / Fair. Cells not recoverable from the slide are left blank.)

| Paper | R/D | Sketch length | Encode time | Column sparsity | Recovery time | Approx |
| [CCF'02], [CM'06] | R | k log n | n log n | log n |  | l2 / l2 |
|  | R | k log^c n | n log^c n | log^c n |  | l2 / l2 |
| [CM'04] | R | k log n | n log n | log n | n log n | l1 / l1 |
| [CRT'04], [RV'05] | D | k log(n/k) | nk log(n/k) | k log(n/k) | n^c | l2 / l1 |
|  | D | k log^c n | n log n | k log^c n | n^c | l2 / l1 |
| [GSTV'06], [GSTV'07] |  |  |  |  | k^2 log^c n |  |
| [BGIKS'08] | D | k log(n/k) | n log(n/k) | log(n/k) | n^c | l1 / l1 |
| [GLR'08] |  | k log n · logloglog n | kn^{1-a} | n^{1-a} |  |  |
| [NV'07], [DM'08], [NT'08], [BD'08], [GK'09], … | D | k log(n/k) | nk log(n/k) | k log(n/k) | nk log(n/k) · log | l2 / l1 |
|  | D | k log^c n | n log n | k log^c n | n log n · log | l2 / l1 |
| [IR'08], [BIR'08], [BI'09] | D | k log(n/k) | n log(n/k) | log(n/k) | n log(n/k) · log | l1 / l1 |
| [GLSP'09] | R | k log(n/k) | n log^c n | log^c n | k log^c n | l2 / l1 |

Caveats: (1) most "dominated" results are not shown, (2) only results for general vectors x are displayed, (3) sometimes the matrix type matters (Fourier, etc.)
Part I

| Paper | R/D | Sketch length | Encode time | Column sparsity | Recovery time | Approx |
| [CM'04] | R | k log n | n log n | log n | n log n | l1 / l1 |

Theorem: There is a distribution over m×n matrices A, m = O(k log n), such that for any x, given Ax, we can recover x* such that
  ||x - x*||_1 ≤ C Err_1, where Err_1 = min_{k-sparse x'} ||x - x'||_1
with probability 1 - 1/n. The recovery algorithm runs in O(n log n) time.
This talk:
  Assume x ≥ 0; this simplifies the algorithm and analysis (see the original paper for the general case).
  Prove the following l∞/l1 guarantee: ||x - x*||_∞ ≤ C Err_1 / k.
  This is actually stronger than the l1/l1 guarantee (cf. [CM'06], see also the Appendix).
Note: [CM'04] originally proved a weaker statement where ||x - x*||_∞ ≤ C ||x||_1 / k. The stronger guarantee follows from the analysis of [CCF'02] (cf. [GGIKMS'02]), who applied it to Err_2.
First attempt
Matrix view: a 0-1 w×n matrix A, with one 1 per column; the i-th column has its 1 at position h(i), where h(i) is chosen uniformly at random from {1…w}.
Hashing view: Z = Ax; h hashes coordinates into "buckets" Z_1 … Z_w, with w = (4/α)k.
Estimator: x*_i = Z_{h(i)}.
[figure: the 0-1 matrix A, a coordinate x_i and its estimate x*_i, and the buckets Z_1 … Z_{(4/α)k}]
Closely related: [Estan-Varghese'03], "counting" Bloom filters.
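A minimal, self-contained rendering of this first attempt (the use of Python's random module as the hash function is an assumption): one row of (4/α)k counters and the estimate x*_i = Z_{h(i)}.

```python
# One-hash-function version from this slide; hashing via Python's random module is assumed.
import math
import random

def single_row_sketch(x, k, alpha, seed=0):
    """Hash each coordinate into one of w = (4/alpha)*k buckets; return (h, Z) with Z = Ax."""
    rng = random.Random(seed)
    n = len(x)
    w = math.ceil(4 * k / alpha)
    h = [rng.randrange(w) for _ in range(n)]   # h(i): position of the single 1 in column i of A
    Z = [0.0] * w
    for i, xi in enumerate(x):
        Z[h[i]] += xi
    return h, Z

def estimate(h, Z, i):
    """Estimator x*_i = Z_{h(i)}; for x >= 0 it can only overestimate, by the colliding mass."""
    return Z[h[i]]
```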
Analysis
We show that x*_i ≤ x_i + α Err/k with probability > 1/2 (for x ≥ 0 the estimator never underestimates).
Assume |x_{i1}| ≥ … ≥ |x_{in}| and let S = {i1 … ik} (the "elephants").
When is x*_i > x_i + α Err/k ?
  Event 1: S and i collide, i.e., h(i) ∈ h(S - {i}).
    Probability: at most k / ((4/α)k) = α/4 < 1/4 (if α < 1).
  Event 2: many "mice" collide with i, i.e., ∑_{t ∉ S ∪ {i}, h(t)=h(i)} x_t > α Err/k.
    Probability: at most 1/4, by the Markov inequality (the expected colliding mass is at most Err/w = α Err/(4k)).
Total probability of the "bad" events < 1/2.
[figure: elephants x_{i1}, x_{i2}, …, x_{ik} and the coordinate x_i hashed into buckets Z_1 … Z_{(4/α)k}]
Second try
Algorithm: maintain d hash functions h_1 … h_d and vectors Z^1 … Z^d.
Estimator: x*_i = min_t Z^t_{h_t(i)}.
Analysis:
  Pr[ |x*_i - x_i| ≥ α Err/k ] ≤ 1/2^d
  Setting d = O(log n) (and thus m = O(k log n)) ensures that w.h.p. |x*_i - x_i| < α Err/k.
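A minimal sketch of the second try, matching the slides' x ≥ 0 assumption: d independent rows and the min estimator (for general signals one would take a median over sign-adjusted rows instead). The seeding scheme and the default parameters are illustrative.

```python
# d independent hash rows + min estimator (Count-Min style); constants and seeding assumed.
import math
import random

def build_sketch(x, k, alpha, d, seed=0):
    n, w = len(x), math.ceil(4 * k / alpha)
    rows = []
    for t in range(d):
        rng = random.Random(seed + t)
        h = [rng.randrange(w) for _ in range(n)]
        Z = [0.0] * w
        for i, xi in enumerate(x):
            Z[h[i]] += xi
        rows.append((h, Z))
    return rows

def point_query(rows, i):
    # Each row overestimates x_i (for x >= 0), so we take the minimum over the d rows;
    # it exceeds x_i by more than alpha*Err/k only with probability <= 2^-d.
    return min(Z[h[i]] for h, Z in rows)
```

Usage: `point_query(build_sketch(x, k=10, alpha=0.5, d=20), i)` returns the estimate x*_i.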
Part II

| Paper | R/D | Sketch length | Encode time | Column sparsity | Recovery time | Approx |
| [CRT'04], [RV'05] | D | k log(n/k) | nk log(n/k) | k log(n/k) | n^c | l2 / l1 |
|  | D | k log^c n | n log n | k log^c n | n^c | l2 / l1 |
| [BGIKS'08] | D | k log(n/k) | n log(n/k) | log(n/k) | n^c | l1 / l1 |
| [IR'08], [BIR'08], [BI'09] | D | k log(n/k) | n log(n/k) | log(n/k) | n log(n/k) · log | l1 / l1 |
Compressed Sensing
A.k.a. compressive sensing or compressive sampling [Candes-Romberg-Tao'04; Donoho'04].
Signal acquisition/processing framework:
  Want to acquire a signal x = [x_1 … x_n]
  Acquisition proceeds by computing Ax + noise, of dimension m << n (we ignore the noise in this lecture)
  From Ax we want to recover an approximation x* of x
  Note: x* does not have to be k-sparse in general
Method: solve the following program: minimize ||x*||_1 subject to Ax* = Ax
Guarantee: for some C > 1,
  (1) ||x - x*||_1 ≤ C min_{k-sparse x''} ||x - x''||_1
  (2) ||x - x*||_2 ≤ C min_{k-sparse x''} ||x - x''||_1 / k^{1/2}
Note: the framework also applies to signals that are sparse in a different basis.
Linear Programming
Recovery: minimize ||x*||_1 subject to Ax* = Ax.
This is a linear program:
  minimize ∑_i t_i
  subject to
    -t_i ≤ x*_i ≤ t_i
    Ax* = Ax
Can solve in n^{Θ(1)} time.
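Not in the slides: a small end-to-end rendering of this LP using scipy.optimize.linprog. The Gaussian matrix, the problem sizes, and the variable ordering [x*; t] are illustrative choices.

```python
# Minimal L1-minimization LP: min sum(t) s.t. -t <= x* <= t, A x* = A x.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m, k = 200, 60, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)                 # a "nice" (RIP-style) matrix
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)  # k-sparse signal
b = A @ x                                                    # the sketch Ax

# Variables are [x*; t] in R^{2n}: minimize 1^T t.
c = np.concatenate([np.zeros(n), np.ones(n)])
I = np.eye(n)
A_ub = np.block([[ I, -I],       # x* - t <= 0
                 [-I, -I]])      # -x* - t <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([A, np.zeros((m, n))])                      # A x* = A x
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * n + [(0, None)] * n)
x_star = res.x[:n]
print("L1 recovery error:", np.linalg.norm(x_star - x, 1))
```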
Intuition
The LP: minimize ||x*||_1 subject to Ax* = Ax.
At first sight, somewhat mysterious, but the approach has a long history in signal processing, statistics, etc.
Intuition:
  The actual goal is to minimize ||x*||_0
  But this comes down to the same thing (if A is "nice")
  The choice of L1 is crucial (L2 does not work)
[figure: the constraint set Ax* = Ax, shown for n=2, m=1, k=1]
Restricted Isometry Property
A matrix A satisfies RIP of order k with constant δ if for any k-sparse vector x we have (1-δ) ||x||_2 ≤ ||Ax||_2 ≤ (1+δ) ||x||_2.
Theorem 1: If each entry of A is i.i.d. N(0,1), and m = Θ(k log(n/k)), then A satisfies (k, 1/3)-RIP w.h.p.
Theorem 2: (O(k), 1/3)-RIP implies the L2/L1 guarantee.
(Cf. the [CRT'04], [RV'05] rows in the results table.)
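An illustrative numerical check (not a proof, and only over sampled sparse vectors rather than a supremum over all of them) of the norm preservation behind Theorem 1. The constant 4 in m and the 1/√m column scaling are arbitrary/normalizing choices.

```python
# Monte-Carlo check that a random Gaussian matrix with m ~ k log(n/k) rows roughly
# preserves the L2 norm of k-sparse vectors (an illustration of RIP, not a certificate).
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 10
m = int(4 * k * np.log(n / k))                    # constant 4 chosen arbitrarily
A = rng.standard_normal((m, n)) / np.sqrt(m)      # scaled so that E||Ax||^2 = ||x||^2

ratios = []
for _ in range(200):
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
    ratios.append(np.linalg.norm(A @ x) / np.linalg.norm(x))
print("min/max ratio over sampled k-sparse x:", min(ratios), max(ratios))
```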
dense vs. sparse
Restricted Isometry Property (RIP) [Candes-Tao'04]: Δ is k-sparse ⇒ ||Δ||_2 ≤ ||AΔ||_2 ≤ C ||Δ||_2
Consider m × n 0-1 matrices with d ones per column. Do they satisfy RIP?
  No, unless m = Ω(k^2) [Chandar'07]
  However, they can satisfy the following RIP-1 property [Berinde-Gilbert-Indyk-Karloff-Strauss'08]:
    Δ is k-sparse ⇒ d(1-ε) ||Δ||_1 ≤ ||AΔ||_1 ≤ d ||Δ||_1
Sufficient (and necessary) condition: the underlying graph is a (k, d(1-ε/2))-expander.
Expanders
A bipartite graph is a (k, d(1-ε))-expander if for any left set S with |S| ≤ k, we have |N(S)| ≥ (1-ε) d |S|.
Objects well-studied in theoretical computer science and coding theory.
Constructions:
  Probabilistic: m = O(k log(n/k))
  Explicit: m = k quasipolylog n
High expansion implies RIP-1 [Berinde-Gilbert-Indyk-Karloff-Strauss'08]:
  Δ is k-sparse ⇒ d(1-ε) ||Δ||_1 ≤ ||AΔ||_1 ≤ d ||Δ||_1
[figure: bipartite graph with n left nodes of degree d and m right nodes; a set S and its neighborhood N(S)]
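A minimal sketch of the probabilistic construction: a random bipartite graph of left degree d, stored as its m×n 0-1 adjacency matrix. With m = O(k log(n/k)) such a graph is an expander w.h.p.; the explicit expansion check is omitted here, and the sizes below are arbitrary.

```python
# Random sparse 0-1 sketch matrix with d ones per column (adjacency matrix of a
# random bipartite graph); expansion holds w.h.p. but is not verified explicitly.
import numpy as np

def sparse_binary_matrix(m, n, d, seed=0):
    rng = np.random.default_rng(seed)
    A = np.zeros((m, n), dtype=np.int8)
    for i in range(n):
        rows = rng.choice(m, size=d, replace=False)   # the d neighbors of left node i
        A[rows, i] = 1
    return A

A = sparse_binary_matrix(m=200, n=2000, d=8)
assert (A.sum(axis=0) == 8).all()                     # exactly d ones per column
```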
Proof: d(1-ε/2)-expansion ⇒ RIP-1
Want to show that for any k-sparse Δ we have d(1-ε) ||Δ||_1 ≤ ||AΔ||_1 ≤ d ||Δ||_1.
The RHS inequality holds for any Δ.
LHS inequality:
  W.l.o.g. assume |Δ_1| ≥ … ≥ |Δ_k| ≥ |Δ_{k+1}| = … = |Δ_n| = 0.
  Consider the edges e = (i,j) in lexicographic order.
  For each edge e = (i,j) define r(e) such that
    r(e) = -1 if there exists an edge (i',j) < (i,j)
    r(e) = +1 if there is no such edge
  Claim 1: ||AΔ||_1 ≥ ∑_{e=(i,j)} |Δ_i| r(e)
  Claim 2: ∑_{e=(i,j)} |Δ_i| r(e) ≥ (1-ε) d ||Δ||_1
[figure: bipartite graph with n left nodes of degree d and m right nodes]
Recovery: algorithms
L1 minimization
Algorithm: minimize ||x*||_1 subject to Ax* = Ax.
Theoretical performance:
  RIP-1 replaces RIP-2
  L1/L1 guarantee
  (Cf. the [BGIKS'08] row in the results table.)
Experimental performance:
  [plot: recovery behavior as a function of k/m and m/n; thick line: Gaussian matrices, thin lines: probability levels for sparse matrices (d=10)]
  Same performance!
Matching Pursuit(s)
Iterative algorithm: given the current approximation x*:
  Find (possibly several) i such that A_i "correlates" with Ax - Ax*. This yields i and z such that ||x* + z e_i - x||_p << ||x* - x||_p.
  Update x*.
  Sparsify x* (keep only the k largest entries).
  Repeat.
Norms:
  p=2: CoSaMP, SP, IHT, etc. (RIP)
  p=1: SMP, SSMP (RIP-1)
  p=0: LDPC bit flipping (sparse matrices)
[figure: the residual A(x - x*) = Ax - Ax*, with column A_i highlighted]
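A minimal iterative hard thresholding (IHT) sketch, one representative of the p = 2 family listed above. The unit step size (which implicitly assumes A has spectral norm around 1) and the fixed iteration count are illustrative simplifications.

```python
# Iterative hard thresholding: gradient step on ||b - Ax||_2^2, then keep k largest entries.
import numpy as np

def iht(A, b, k, iters=100):
    """Approximately minimize ||b - Ax||_2 over k-sparse x (step size 1 assumed sensible)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x + A.T @ (b - A @ x)              # correlate columns with the residual
        keep = np.argsort(np.abs(x))[-k:]      # indices of the k largest entries
        mask = np.zeros_like(x)
        mask[keep] = 1
        x = x * mask                           # sparsify
    return x
```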
Sequential Sparse Matching Pursuit [Berinde-Indyk'09]
Algorithm:
  x* = 0
  Repeat T times:
    Repeat S = O(k) times:
      Find i and z that minimize* ||A(x* + z e_i) - Ax||_1
      x* = x* + z e_i
    Sparsify x* (set all but the k largest entries of x* to 0)
Similar to SMP, but the updates are done sequentially.
Recovery works by a local "voting" process: each left coordinate i consults its neighbors N(i) in the sketch; the algorithm works for arbitrary signals x.
* Set z = median[(Ax - Ax*)_{N(i)}]. Instead, one could first optimize i (gradient) and then z [Fuchs'09].
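A straightforward, unoptimized rendering of the SSMP loop above, assuming a 0-1 matrix A with d ones per column. Scanning all columns to pick i is done for clarity rather than speed, and the default S = 2k stands in for S = O(k).

```python
# Sequential Sparse Matching Pursuit sketch for a 0-1 matrix A with d ones per column.
import numpy as np

def ssmp(A, b, k, T=20, S=None):
    m, n = A.shape
    S = S or 2 * k                                         # S = O(k) inner steps
    neigh = [np.flatnonzero(A[:, i]) for i in range(n)]    # N(i) for each column
    x = np.zeros(n)
    for _ in range(T):
        for _ in range(S):
            r = b - A @ x                                  # current residual Ax - Ax*
            # Best single-coordinate update: z = median of the residual on N(i).
            zs = np.array([np.median(r[neigh[i]]) for i in range(n)])
            gains = np.array([np.sum(np.abs(r[neigh[i]])) -
                              np.sum(np.abs(r[neigh[i]] - zs[i])) for i in range(n)])
            i = int(np.argmax(gains))                      # coordinate with largest L1 drop
            x[i] += zs[i]
        keep = np.argsort(np.abs(x))[-k:]                  # sparsify: keep k largest entries
        mask = np.zeros(n)
        mask[keep] = 1
        x *= mask
    return x
```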
Experiments (256×256 image)
SSMP is run with S = 10000, T = 20. SMP is run for 100 iterations. Matrix sparsity is d = 8.
Conclusions
Sparse approximation using sparse matrices.
State of the art: deterministically we can do 2 out of 3:
  Near-linear encoding/decoding
  O(k log(n/k)) measurements
  Approximation guarantee with respect to the L2/L1 norm
(This talk: the first two, with an L1/L1 guarantee instead.)
Open problems:
  3 out of 3?
  Explicit constructions?
References
Survey: A. Gilbert, P. Indyk, "Sparse recovery using sparse matrices", Proceedings of the IEEE, June 2010.
Courses:
  "Streaming, sketching, and sub-linear space algorithms", Fall'07
  "Sub-linear algorithms" (with Ronitt Rubinfeld), Fall'10
Blogs: Nuit Blanche: nuit-blanche.blogspot.com/