Numerical Linear Algebra in the Streaming Model

Numerical Linear Algebra in the Streaming Model
Ken Clarkson (IBM) and David Woodruff (IBM)

The Problems
- Given an n x d matrix A and an n x d' matrix B, we want estimators for:
  - The matrix product A^T B
  - The matrix X* minimizing ||AX - B||, a slightly generalized version of linear regression
  - Given an integer k, the matrix A_k of rank k minimizing ||A - A_k||
- Throughout, ||·|| is the Frobenius norm: the square root of the sum of squared entries

General Properties of Our Algorithms
- 1 pass over the matrix entries, given in any order (multiple updates to an entry are allowed)
- Maintain compressed versions, or "sketches", of the matrices
- Do a small amount of work per entry to maintain the sketches (see the code sketch below)
- Output the result using only the sketches
- Randomized approximation algorithms
- Since we minimize space complexity, we restrict matrix entries to O(log(nc)) or O(log(nd)) bits
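Because the sketches are linear functions of A, a single-entry update can be folded into them without ever storing A. Below is a minimal Python/NumPy sketch of that bookkeeping; the dense Rademacher matrix S, its explicit storage, and the sizes are illustrative simplifications (the actual algorithms generate the needed entries of S from a limited-independence seed rather than storing S).

```python
import numpy as np

def sign_matrix(n, m, seed=0):
    """Dense n x m Rademacher (+1/-1) matrix. Stored explicitly here only for
    illustration; the streaming algorithms regenerate entries from a short seed."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(n, m))

class SketchSTA:
    """Maintains the m x d sketch S^T A under turnstile updates A[i, j] += delta,
    without storing A itself."""
    def __init__(self, S, d):
        self.S = S
        self.sta = np.zeros((S.shape[1], d))

    def update(self, i, j, delta):
        # S^T A is linear in A, so one entry update touches one column of the sketch.
        self.sta[:, j] += delta * self.S[i, :]

# Stream the entries of a small matrix in arbitrary order and check the sketch.
n, d, m = 60, 8, 25
S = sign_matrix(n, m)
A = np.random.default_rng(1).normal(size=(n, d))
sk = SketchSTA(S, d)
for i in range(n):
    for j in range(d):
        sk.update(i, j, A[i, j])
assert np.allclose(sk.sta, S.T @ A)
```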

Matrix Compression Methods
In a line of similar efforts…
- Element-wise sampling [AM01], [AHK06]
- Row / column sampling: pick a small random subset of the rows, columns, or both [DK01], [DKM04], [DMM08]
- Sketching / random projection: maintain a small number of random linear combinations of rows or columns [S06]
- These usually need more than 1 pass
- Here: sketching

Outline: Matrix Product, Linear Regression, Low-Rank Approximation

An Optimal Matrix Product Algorithm
- A and B have n rows and a total of c columns, and we want to estimate A^T B so that ||Est - A^T B|| ≤ ε ||A|| · ||B||
- Let S be an n x m sign (Rademacher) matrix
  - Each entry is +1 or -1 with probability 1/2
  - m is small, set to O(log(1/δ)) ε^{-2}
  - Entries are O(log(1/δ))-wise independent
- Observation: E[A^T S S^T B / m] = A^T E[S S^T] B / m = A^T B
- Wouldn't it be nice if all the algorithm did was maintain S^T A and S^T B, and output A^T S S^T B / m?
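A minimal NumPy illustration of this estimator, with a dense S whose entries are fully independent (the O(log 1/δ)-wise independence and the exact constant in m matter for the space bound, not for the idea): keep S^T A and S^T B, and report (S^T A)^T (S^T B)/m.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dp = 1000, 20, 15
A = rng.normal(size=(n, d))
B = rng.normal(size=(n, dp))

eps = 0.25
m = int(np.ceil(6 / eps**2))               # m = O(eps^{-2}); the constant 6 is illustrative
S = rng.choice([-1.0, 1.0], size=(n, m))   # sign (Rademacher) sketch

STA, STB = S.T @ A, S.T @ B                # all the streaming algorithm stores
est = STA.T @ STB / m                      # estimate of A^T B

err = np.linalg.norm(est - A.T @ B)        # Frobenius norm of the error
bound = eps * np.linalg.norm(A) * np.linalg.norm(B)
print(f"error = {err:.2f}, target eps*||A||*||B|| = {bound:.2f}")
```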

An Optimal Matrix Product Algorithm
- This does work, and we are able to improve the previous dependence on m
- New tail estimate: for δ, ε > 0, there is m = O(log(1/δ)) ε^{-2} so that
  Pr[ ||A^T S S^T B / m - A^T B|| > ε ||A|| ||B|| ] ≤ δ
  (again ||C|| = [Σ_{i,j} C_{i,j}^2]^{1/2})
- Follows from bounding the O(log(1/δ))-th moment of ||A^T S S^T B / m - A^T B||

Efficiency
- Easy to maintain the sketches given updates
  - O(m) time per update, O(mc log(nc)) bits of space for S^T A and S^T B
- Improves Sarlos' algorithm by a log c factor
  - Sarlos' algorithm is based on the JL lemma
  - JL preserves all entries of A^T B up to an additive error, whereas we only preserve the overall error
- Can compute [A^T S][S^T B]/m via fast rectangular matrix multiplication

Matrix Product Lower Bound
- Our algorithm is space-optimal for constant δ, via a new lower bound
- Reduction from a communication game: Augmented Indexing, with players Alice and Bob
  - Alice has a random x ∈ {0,1}^s
  - Bob has a random i ∈ {1, 2, …, s}, and also x_{i+1}, …, x_s
  - Alice sends Bob one message
  - Bob should output x_i with probability at least 2/3
- Theorem [MNSW]: the message must be Ω(s) bits on average

Lower Bound Proof
- Set s := Θ(c ε^{-2} log(cn))
- Alice makes a matrix U using x_1 … x_s
- Bob makes matrices U' and B using i and x_{i+1}, …, x_s
- The algorithm's input will be A := U + U' and B; A and B are n x c/2
- Alice: runs the streaming matrix product algorithm on U and sends the algorithm's state to Bob
- Bob: continues the algorithm with A := U + U' and B
- A^T B determines x_i with probability at least 2/3, by the choice of U, U', B, solving Augmented Indexing
- So the space of the algorithm must be Ω(s) = Ω(c ε^{-2} log(cn)) bits

Lower Bound Details
- U = U(1); U(2); …; U(log(cn)); 0s
- Each U(k) is a Θ(ε^{-2}) x c/2 submatrix with entries in {-10^k, 10^k}
  - U(k)_{i,j} = 10^k if the matched entry of x is 0, else U(k)_{i,j} = -10^k
- Bob's index i corresponds to U(k*)_{i*, j*}
- U' is such that A = U + U' = U(1); U(2); …; U(k*); 0s
  - U' is determined from x_{i+1}, …, x_s
- A^T B is the i*-th row of U(k*)
- ||A|| ≈ ||U(k*)|| since the entries of A are geometrically increasing
- ε^2 ||A||^2 · ||B||^2, the squared error, is small, so most entries of the approximation to A^T B have the correct sign

Outline: Matrix Product, Linear Regression, Low-Rank Approximation

Linear Regression
- The problem: min_X ||AX - B||
- The minimizing X* is X* = A^- B, where A^- is the pseudo-inverse of A
- Every matrix A = U Σ V^T via the singular value decomposition (SVD)
  - If A is n x d of rank k, then U is n x k with orthonormal columns, Σ is a k x k diagonal matrix with positive diagonal, and V is d x k with orthonormal columns
  - A^- = V Σ^{-1} U^T
- Normal equations: A^T A X = A^T B for the optimal X
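For reference (no streaming yet), a small NumPy sketch of the exact solution via the SVD-based pseudo-inverse, checked against the normal equations and the library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dp = 200, 10, 3
A = rng.normal(size=(n, d))
B = rng.normal(size=(n, dp))

# Thin SVD A = U Sigma V^T, then A^- = V Sigma^{-1} U^T and X* = A^- B.
U, sig, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / sig) @ U.T
X_star = A_pinv @ B

# Normal equations A^T A X* = A^T B, and agreement with the library solver.
assert np.allclose(A.T @ A @ X_star, A.T @ B)
assert np.allclose(X_star, np.linalg.lstsq(A, B, rcond=None)[0])
```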

Linear Regression
- Let S be an n x m sign matrix, m = O(d ε^{-1} log(1/δ))
- The algorithm:
  - Maintain S^T A and S^T B
  - Return X' solving min_X ||S^T(AX - B)||
- Space is O(d^2 ε^{-1} log(1/δ)) words
  - Improves Sarlos' space by a log c factor
  - Space is optimal, via a new lower bound
- Main claim: with probability at least 1 - δ, ||AX' - B|| ≤ (1+ε) ||AX* - B||
  - That is, the relative error of X' is small
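A minimal NumPy sketch of the sketched regression, assuming a dense S with fully independent signs; the constant in m is chosen for the demo rather than taken from the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 10
A = rng.normal(size=(n, d))
B = A @ rng.normal(size=(d, 1)) + 0.5 * rng.normal(size=(n, 1))   # noisy right-hand side

eps = 0.5
m = int(20 * d / eps)                        # m = O(d / eps); constant is illustrative
S = rng.choice([-1.0, 1.0], size=(n, m))
STA, STB = S.T @ A, S.T @ B                  # all the algorithm maintains

X_sketch = np.linalg.lstsq(STA, STB, rcond=None)[0]   # argmin_X ||S^T A X - S^T B||
X_opt = np.linalg.lstsq(A, B, rcond=None)[0]          # exact least squares, for comparison

r_sketch = np.linalg.norm(A @ X_sketch - B)
r_opt = np.linalg.norm(A @ X_opt - B)
print(f"||AX'-B|| / ||AX*-B|| = {r_sketch / r_opt:.4f} (target at most 1+eps = {1+eps})")
```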

Regression Analysis
- Why should X' solving min_X ||S^T(AX - B)|| be good?
  - S^T approximately preserves AX - B for fixed X
  - If this worked for all X, we'd be done
  - But S^T must preserve norms even for X', which is chosen using S
- First reduce to showing that ||A(X* - X')|| is small
  - Use the normal equation A^T A X* = A^T B
  - Implies ||AX' - B||^2 = ||AX* - B||^2 + ||A(X' - X*)||^2
- Bounding ||A(X' - X*)||^2 is equivalent to bounding ||U^T A(X' - X*)||^2, where A = U Σ V^T from the SVD, and U is an orthonormal basis of the column space of A

Regression Analysis Continued
- Bounding ||β||^2 := ||U^T A(X' - X*)||^2
- ||β|| ≤ ||U^T S S^T U β / m|| + ||U^T S S^T U β / m - β||
- Normal equations in the sketch space imply (S^T A)^T (S^T A) X' = (S^T A)^T (S^T B), so (using Uβ = A(X' - X*)):
  U^T S S^T U β = U^T S S^T A(X' - X*)
                = U^T S S^T A(X' - X*) + U^T S S^T (B - AX')   (the added term is 0 by the sketched normal equations)
                = U^T S S^T (B - AX*)
- ||U^T S S^T U β / m|| = ||U^T S S^T (B - AX*) / m||
    ≤ (ε/k)^{1/2} ||U|| · ||B - AX*||   (new tail estimate)
    = ε^{1/2} ||B - AX*||   (since ||U|| = k^{1/2})

Regression Analysis Continued
- Hence, with ||β||^2 := ||U^T A(X' - X*)||^2:
  ||β|| ≤ ||U^T S S^T U β / m|| + ||U^T S S^T U β / m - β||
       ≤ ε^{1/2} ||B - AX*|| + ||U^T S S^T U β / m - β||
- Recall the spectral norm ||A||_2 = sup_x ||Ax|| / ||x||; it implies ||CD|| ≤ ||C||_2 ||D||
  - So ||U^T S S^T U β / m - β|| ≤ ||U^T S S^T U / m - I||_2 ||β||
- Subspace JL: for m = Ω(k log(1/δ)), S^T approximately preserves the lengths of all vectors in a k-dimensional subspace
  - So ||U^T S S^T U β / m - β|| ≤ ||β|| / 2
- Therefore ||β|| ≤ 2 ε^{1/2} ||AX* - B||, and
  ||AX' - B||^2 = ||AX* - B||^2 + ||β||^2 ≤ ||AX* - B||^2 + 4ε ||AX* - B||^2

Regression Lower Bound
- Tight Ω(d^2 ε^{-1} log(nd)) space lower bound
- Again a reduction from Augmented Indexing, this time more complicated
- Embed ε^{-1} log(nd) independent regression sub-problems into the hard instance
- Uses deletions and the geometrically growing property, as in the matrix product lower bound
- Choose the entries of A and b so that the algorithm's output x encodes some entries of A

Regression Lower Bound
- A lower bound of Ω(d^2) is already tricky because of bit complexity
- Natural approach:
  - Alice has a random d x d sign matrix A^{-1}; b is a standard basis vector e_i
  - Alice computes A = (A^{-1})^{-1} and puts it into the stream
  - The solution x to min_x ||Ax - b|| is the i-th column of A^{-1}
  - Bob can isolate entries of A^{-1}, solving indexing
- Wrong: A can have entries that are exponentially small!

Regression Lower Bound
We design A and b together (Aug. Index). Example: A is upper triangular with unit diagonal, and b is chosen so that the solution exposes an entry of A:

  [ 1  A_{1,2}  A_{1,3}  A_{1,4}  A_{1,5} ] [ x_1 ]   [ 0       ]
  [ 0    1      A_{2,3}  A_{2,4}  A_{2,5} ] [ x_2 ]   [ A_{2,4} ]
  [ 0    0        1      A_{3,4}  A_{3,5} ] [ x_3 ] = [ A_{3,4} ]
  [ 0    0        0        1      A_{4,5} ] [ x_4 ]   [ 1       ]
  [ 0    0        0        0        1     ] [ x_5 ]   [ 0       ]

Back-substitution gives x_5 = 0, x_4 = 1, x_3 = 0, x_2 = 0, x_1 = -A_{1,4}, so the output encodes the entry A_{1,4}.

Outline: Matrix Product, Regression, Low-Rank Approximation

Best Low-Rank Approximation
- For any matrix A and integer k, there is a matrix A_k of rank k that is closest to A among all matrices of rank k
- Since A_k has rank k, it is the product C D^T of two k-column matrices C and D
- A_k can be found from the SVD (singular value decomposition): C = U_k, the top k left singular vectors (orthonormal columns), and D = V_k Σ_k
- This is a good compression of A: LSI, PCA, recommendation systems, clustering
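For reference, the (non-streaming) optimal A_k computed from the SVD and written in the factored form C D^T; a small NumPy sketch on a synthetic nearly low-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-5 signal plus small noise, so a rank-5 approximation captures almost everything.
A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 60)) + 0.01 * rng.normal(size=(100, 60))
k = 5

U, sig, Vt = np.linalg.svd(A, full_matrices=False)
C = U[:, :k]                      # n x k, orthonormal columns (top left singular vectors)
D = Vt[:k, :].T * sig[:k]         # d x k, right singular vectors scaled by singular values
A_k = C @ D.T                     # best rank-k approximation in Frobenius norm

print("rank:", np.linalg.matrix_rank(A_k), " ||A - A_k||:", np.linalg.norm(A - A_k))
```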

Best Low-Rank Approximation
- Previously, nothing was known for 1-pass low-rank approximation with relative error
  - Even for k = 1, the best upper bound was O(nd log(nd)) bits
  - Problem 28 of [Mut]: can one get sublinear space?
- We get 1 pass and O(k ε^{-2} (n + d ε^{-2}) log(nd)) space
  - Update time is O(k ε^{-4}), so total work is O(N k ε^{-4}), where N is the number of non-zero entries of A
- A new space lower bound shows this is optimal up to a factor of 1/ε

Best Low-Rank Approximation and S^T A
- The sketch S^T A holds information about A
- In particular, there is a rank-k matrix A_k' in the rowspace of S^T A nearly as close to A as the closest rank-k matrix A_k
  - The rowspace of S^T A is the set of linear combinations of its rows
- That is, ||A - A_k'|| ≤ (1+ε) ||A - A_k||
- Why is there such an A_k'?

Low-Rank Approximation via Regression
- Apply the regression results with A → A_k and B → A
- The X' minimizing ||S^T(A_k X - A)|| has ||A_k X' - A|| ≤ (1+ε) ||A_k X* - A||
- But here X* = I, and X' = (S^T A_k)^- S^T A
- So the matrix A_k X' = A_k (S^T A_k)^- S^T A:
  - has rank k
  - is in the rowspace of S^T A
  - is within 1+ε of the smallest distance of any rank-k matrix to A

Low-Rank Approximation in 2 Passes
- We can't use A_k (S^T A_k)^- S^T A without finding A_k
- Instead: maintain S^T A
- Can show that if G^T has orthonormal rows, then the best rank-k approximation to A in the rowspace of G^T is A'_k G^T, where A'_k is the best rank-k approximation to AG
- After the 1st pass, compute an orthonormal basis G^T for the rowspace of S^T A
- In the 2nd pass, maintain AG
- Afterwards, compute A'_k and A'_k G^T (a code sketch of this procedure is below)
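An illustrative NumPy version of the two-pass scheme. The "passes" are written as plain matrix products rather than streamed updates, S is a dense sign matrix, and its width m is a hypothetical constant times k; none of this is the paper's exact parameterization.

```python
import numpy as np

def best_rank_k(M, k):
    """Best rank-k approximation of M in Frobenius norm, via the SVD."""
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * sig[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
n, d, k = 300, 120, 4
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.05 * rng.normal(size=(n, d))

m = 10 * k                                  # sketch width; the constant is illustrative
S = rng.choice([-1.0, 1.0], size=(n, m))

# Pass 1: maintain S^T A; afterwards take an orthonormal basis G^T of its rowspace.
STA = S.T @ A
G = np.linalg.qr(STA.T)[0]                  # d x m, orthonormal columns; G^T spans rowspace(S^T A)

# Pass 2: maintain AG.
AG = A @ G

# Output A'_k G^T, where A'_k is the best rank-k approximation to AG.
approx = best_rank_k(AG, k) @ G.T

print("||A - approx|| / ||A - A_k|| =",
      np.linalg.norm(A - approx) / np.linalg.norm(A - best_rank_k(A, k)))
```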

Low-Rank Approximation in 2 Passes
- A'_k is the best rank-k approximation to AG
- For any rank-k matrix Z:
  ||AGG^T - A'_k G^T|| = ||AG - A'_k|| ≤ ||AG - Z|| ≤ ||AGG^T - ZG^T||
- For all Y: (AGG^T - YG^T) · (A - AGG^T)^T = 0, so we can apply the Pythagorean theorem twice:
  ||A - A'_k G^T||^2 = ||A - AGG^T||^2 + ||AGG^T - A'_k G^T||^2
                     ≤ ||A - AGG^T||^2 + ||AGG^T - ZG^T||^2
                     = ||A - ZG^T||^2

1 Pass Algorithm
- With high probability, (*) ||AX' - B|| ≤ (1+ε) ||AX* - B||, where X* minimizes ||AX - B|| and X' minimizes ||S^T A X - S^T B||
- Apply (*) with A → AR and B → A, with X' minimizing ||S^T(ARX - A)|| (R is another sign matrix, with a small number of columns)
- So X' = (S^T AR)^- S^T A has ||ARX' - A|| ≤ (1+ε) min_X ||ARX - A||
- The columnspace of AR contains a (1+ε)-approximation to A_k
- So ||ARX' - A|| ≤ (1+ε) min_X ||ARX - A|| ≤ (1+ε)^2 ||A - A_k||
- Key idea: ARX' = AR (S^T AR)^- S^T A
  - is easy to compute in 1 pass with small space and fast update time
  - behaves like A (similar to the SVD)
  - so use it instead of A in our 2-pass algorithm!

1 Pass Algorithm
- Algorithm:
  - Maintain AR and S^T A
  - Compute AR (S^T AR)^- S^T A
  - Let G^T be an orthonormal basis for the rowspace of S^T A, as before
  - Output the best rank-k approximation to AR (S^T AR)^- S^T A in the rowspace of S^T A
- Same as the 2-pass algorithm, except we don't need a second pass to project A onto the rowspace of S^T A
- The analysis is similar to that for regression (an illustrative code sketch follows)
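An illustrative NumPy version of the one-pass algorithm under the same simplifications: dense sign matrices R and S, sketch widths set to hypothetical multiples of k, the pass written as plain matrix products, and the n x d surrogate AR(S^T AR)^- S^T A formed explicitly rather than kept in factored form.

```python
import numpy as np

def best_rank_k(M, k):
    U, sig, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :k] * sig[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
n, d, k = 300, 120, 4
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.05 * rng.normal(size=(n, d))

r, m = 8 * k, 20 * k                          # sketch widths; constants are illustrative
R = rng.choice([-1.0, 1.0], size=(d, r))      # columnspace of AR should capture A_k
S = rng.choice([-1.0, 1.0], size=(n, m))      # rowspace of S^T A should capture A_k

# Single pass: both AR and S^T A are linear in A, so they can be updated per entry.
AR, STA = A @ R, S.T @ A

# The surrogate AR (S^T A R)^- S^T A, which stands in for A from here on.
surrogate = AR @ np.linalg.pinv(S.T @ AR) @ STA

# Best rank-k approximation to the surrogate within the rowspace of S^T A.
G = np.linalg.qr(STA.T)[0]                    # orthonormal basis of rowspace(S^T A)
approx = best_rank_k(surrogate @ G, k) @ G.T

print("||A - approx|| / ||A - A_k|| =",
      np.linalg.norm(A - approx) / np.linalg.norm(A - best_rank_k(A, k)))
```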

A Lower Bound
[Figure: the hard instance. Alice's binary string x is encoded as a matrix A with k ε^{-1} columns per block; successive blocks use sign patterns of geometrically growing magnitude (±10, ±100, ±1000, ±10000, …), carried in k rows, with the other n-k rows equal to 0. Bob holds the index i and x_{i+1}, x_{i+2}, …]
- The error is now dominated by the block of interest
- Bob also inserts a k x k identity submatrix into the block of interest

Lower Bound Details
[Figure: the block of interest spans k ε^{-1} columns; Bob inserts a k x k identity submatrix scaled by a large value P (i.e., P·I_k) into it, with 0s elsewhere in those k rows and in the remaining n-k rows.]
- Show that any rank-k approximation must err on all of the shaded region
- So a good rank-k approximation likely has the correct sign on Bob's entry

Concluding Remarks
- Space bounds are tight for matrix product and regression
  - Sharpen prior upper bounds
  - Prove optimal lower bounds
- Space bounds are off by a factor of ε^{-1} for low-rank approximation
  - First sub-linear (and near-optimal) 1-pass algorithm
  - We have better upper bounds for restricted cases
- Improve the dependence on ε in the update time
- Lower bounds for multi-pass algorithms?