Sketching as a Tool for Numerical Linear Algebra David Woodruff IBM Almaden

2 Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l_1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

3 Regression
Linear Regression
- Statistical method to study linear dependencies between variables in the presence of noise.
Example
- Ohm's law: V = R ∙ I
- Find the linear function that best fits the data

4 Regression
Standard Setting
- One measured variable b
- A set of predictor variables a_1, …, a_d
- Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε
- ε is assumed to be noise and the x_i are model parameters we want to learn
- Can assume x_0 = 0
- Now consider n observations of b

5 Regression analysis
Matrix form
- Input: an n × d matrix A and a vector b = (b_1, …, b_n)
  - n is the number of observations; d is the number of predictor variables
- Output: x* so that Ax* and b are close
- Consider the over-constrained case, when n ≫ d
- Assume that A has full column rank

6 Regression analysis
Least Squares Method
- Find x* that minimizes |Ax-b|_2² = Σ_i (b_i – ⟨A_i*, x⟩)²
- A_i* is the i-th row of A
- Certain desirable statistical properties
- Closed form solution: x* = (A^T A)^{-1} A^T b
Method of least absolute deviation (l_1-regression)
- Find x* that minimizes |Ax-b|_1 = Σ_i |b_i – ⟨A_i*, x⟩|
- Cost is less sensitive to outliers than least squares
- Can solve via linear programming
Time complexities are at least n·d², we want better!
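For reference, a minimal numpy sketch of the exact least squares solves just described; the normal-equations form mirrors the closed-form expression on the slide, while np.linalg.lstsq is the numerically safer call. Both cost on the order of n·d².

```python
import numpy as np

def least_squares_exact(A, b):
    """Exact least squares: minimize |Ax - b|_2. Costs on the order of n*d^2."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

def least_squares_normal_equations(A, b):
    """Closed-form solution x* = (A^T A)^{-1} A^T b from the slide (less stable numerically)."""
    return np.linalg.solve(A.T @ A, A.T @ b)
```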

7 Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l_1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

8 Sketching to solve least squares regression
- How to find an approximate solution x to min_x |Ax-b|_2?
- Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability
- Draw S from a k x n random family of matrices, for a value k << n
- Compute S*A and S*b
- Output the solution x' to min_x |(SA)x-(Sb)|_2

9 How to choose the right sketching matrix S?
- Recall: output the solution x' to min_x |(SA)x-(Sb)|_2
- Lots of matrices work
- S is a d/ε² x n matrix of i.i.d. Normal random variables
- Computing S*A may be slow…
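A minimal sketch-and-solve routine for the recipe on slides 8-9, assuming a dense Gaussian S; the ceil(d/ε²) sketch size and the 1/√k scaling are illustrative choices rather than tuned constants from the talk.

```python
import numpy as np

def sketch_and_solve_l2(A, b, eps=0.5, rng=np.random.default_rng(0)):
    """Approximate least squares via a dense Gaussian sketch (slides 8-9)."""
    n, d = A.shape
    k = min(n, int(np.ceil(d / eps**2)))          # k << n rows in the sketch
    S = rng.standard_normal((k, n)) / np.sqrt(k)  # i.i.d. Normal sketching matrix
    SA, Sb = S @ A, S @ b                         # compress to a k x d problem
    x_sketch, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_sketch
```

The dense multiplication S*A already costs on the order of n·d²/ε² here, which is why the next slides look for structured and sparse choices of S.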

10 How to choose the right sketching matrix S? [S]
- S is a Johnson-Lindenstrauss Transform
- S = P*H*D
  - D is a diagonal matrix with +1, -1 on the diagonal
  - H is the Hadamard transform
  - P just chooses a random (small) subset of rows of H*D
- S*A can be computed much faster
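A sketch of the P*H*D construction, assuming n is a power of two; the √(n/k) rescaling is the usual normalization and is an assumption here, as is forming the Hadamard matrix explicitly (a fast Walsh-Hadamard transform would give the speedup the slide alludes to instead of an O(n²) matrix product).

```python
import numpy as np
from scipy.linalg import hadamard

def srht_sketch(A, k, rng=np.random.default_rng(0)):
    """Subsampled randomized Hadamard transform: S*A with S = P*H*D (slide 10)."""
    n, d = A.shape
    D = rng.choice([-1.0, 1.0], size=n)                 # D: random +/-1 signs on the diagonal
    HDA = hadamard(n) @ (D[:, None] * A) / np.sqrt(n)   # H: (normalized) Hadamard transform of D*A
    rows = rng.choice(n, size=k, replace=False)         # P: random small subset of rows
    return np.sqrt(n / k) * HDA[rows]
```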

11 Even faster sketching matrices [CW]
- CountSketch matrix
- Define a k x n matrix S, for k = Õ(d²/ε²)
- S is really sparse: a single randomly chosen non-zero entry per column
Surprisingly, this works!
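A minimal CountSketch application that touches each entry of A once, so S*A costs O(nnz(A)); the ±1 values of the non-zero entries are the standard CountSketch choice and are an assumption beyond the slide's wording.

```python
import numpy as np

def countsketch(A, k, rng=np.random.default_rng(0)):
    """Apply a k x n CountSketch matrix S to A in O(nnz(A)) time (slide 11)."""
    n, d = A.shape
    h = rng.integers(0, k, size=n)         # column i of S has its single non-zero in row h[i]
    s = rng.choice([-1.0, 1.0], size=n)    # value of that non-zero entry
    SA = np.zeros((k, d))
    np.add.at(SA, h, s[:, None] * A)       # SA[h[i]] += s[i] * A[i] for every row i of A
    return SA
```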

12 Simpler and Sharper Proofs [MM, NN, N]
- Let B = [A, b] be an n x (d+1) matrix
- Let U be an orthonormal basis for the columns of B
- Suffices to show |SUx|_2 = 1 ± ε for all unit x
  - Implies |S(Ax-b)|_2 = (1 ± ε) |Ax-b|_2 for all x
- SU is a (d+1)²/ε² x (d+1) matrix
- Suffices to show |U^T S^T S U – I|_2 ≤ |U^T S^T S U – I|_F ≤ ε
- Matrix product result: |C S^T S D – C D|_F² ≤ [1/(# rows of S)] * |C|_F² |D|_F²
- Set C = U^T and D = U. Then |U|_F² = d+1 and (# rows of S) = (d+1)²/ε²
If |SBx|_2 = (1±ε) |Bx|_2 for all x, S is called a subspace embedding
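An empirical check of the subspace-embedding condition above; a Gaussian S is used purely for convenience, and the specific dimensions are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 4096, 10, 0.5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
B = np.column_stack([A, b])                    # B = [A, b], an n x (d+1) matrix
U, _ = np.linalg.qr(B)                         # orthonormal basis for the columns of B

k = int((d + 1) ** 2 / eps ** 2)               # sketch size from slide 12
S = rng.standard_normal((k, n)) / np.sqrt(k)
E = U.T @ S.T @ S @ U - np.eye(d + 1)
print(np.linalg.norm(E, 2))                    # spectral norm should be at most about eps
```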

13 Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l_1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

14 Sketching to solve l_1-regression
- How to find an approximate solution x to min_x |Ax-b|_1?
- Goal: output x' for which |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability
- Natural attempt: Draw S from a k x n random family of matrices, for a value k << n
- Compute S*A and S*b
- Output the solution x' to min_x |(SA)x-(Sb)|_1
- Turns out this does not work

15 Sketching to solve l_1-regression [SW]
- Why doesn't outputting the solution x' to min_x |(SA)x-(Sb)|_1 work?
- Don't know of k x n matrices S with small k for which, if x' is the solution to min_x |(SA)x-(Sb)|_1, then |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability
- Instead: can find an S so that |Ax'-b|_1 ≤ (d log d) min_x |Ax-b|_1
- S is a matrix of i.i.d. Cauchy random variables
- Property: |Ax-b|_1 ≤ |S(Ax-b)|_1 ≤ (d log d) |Ax-b|_1
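A bare-bones Cauchy sketch matching the construction above; the normalization constants needed for the stated |Ax-b|_1 ≤ |S(Ax-b)|_1 ≤ (d log d)|Ax-b|_1 guarantee are omitted, so treat this as an illustration of the construction only.

```python
import numpy as np

def cauchy_sketch(A, k, rng=np.random.default_rng(0)):
    """S*A where S is a k x n matrix of i.i.d. Cauchy random variables (slide 15)."""
    S = rng.standard_cauchy((k, A.shape[0]))
    return S @ A
```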

16 Cauchy random variables
- Cauchy random variables are not as nice as Normal (Gaussian) random variables
- They don't have a mean and have infinite variance
- The ratio of two independent Normal random variables is Cauchy
If a and b are scalars and C_1 and C_2 are independent Cauchys, then a*C_1 + b*C_2 ~ (|a|+|b|)*C for a Cauchy C

17 Sketching to solve l_1-regression
- Main Idea: Let B = [A, b]. Compute a QR-factorization of S*B
- Q has orthonormal columns and Q*R = S*B
- B*R^{-1} is a "well-conditioning" of B:
  - Σ_{i=1}^d |BR^{-1} e_i|_1 ≤ Σ_{i=1}^d |SBR^{-1} e_i|_1 ≤ (d log d)^{1/2} Σ_{i=1}^d |SBR^{-1} e_i|_2 ≤ d (d log d)^{1/2}
  - |x|_∞ ≤ |x|_2 = |SBR^{-1} x|_2 ≤ |SBR^{-1} x|_1 ≤ (d log d) |BR^{-1} x|_1
- These two properties make importance sampling work!

18 Importance Sampling
- Want to estimate Σ_{i=1}^n y_i by sampling, for y_i ≥ 0
- Suppose we sample y_i with probability p_i
- T = Σ_{i=1}^n δ(y_i sampled) y_i/p_i
- E[T] = Σ_{i=1}^n p_i · y_i/p_i = Σ_{i=1}^n y_i
- Var[T] ≤ Σ_{i=1}^n p_i · y_i²/p_i² ≤ (Σ_{i=1}^n y_i) · max_i y_i/p_i
- Bound max_i y_i/p_i by ε² (Σ_{i=1}^n y_i)
- For us, y_i = |(Ax-b)_i|, and this holds if p_i = |e_i BR^{-1}|_1 * poly(d/ε)!
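A small numerical illustration of this estimator; the probabilities below are arbitrary illustrative choices, not the BR^{-1}-based ones used in the algorithm.

```python
import numpy as np

def importance_sampling_estimate(y, p, rng):
    """Keep each y_i independently with probability p_i and reweight by 1/p_i;
    the reweighted sum is an unbiased estimate of sum(y) (slide 18)."""
    keep = rng.random(len(y)) < p
    return np.sum(y[keep] / p[keep])

rng = np.random.default_rng(1)
y = rng.exponential(size=10_000)
p = np.minimum(1.0, 500 * y / y.sum())   # roughly proportional to y_i, capped at 1
estimates = [importance_sampling_estimate(y, p, rng) for _ in range(5)]
print(y.sum(), estimates)                # estimates concentrate around the true sum
```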

19 Importance Sampling
- To get a bound for all x, use Bernstein's inequality and a net argument
- Sample poly(d/ε) rows of B*R^{-1}, where the i-th row is sampled proportional to its 1-norm
- T is a diagonal matrix with T_{i,i} = 0 if row i is not sampled, otherwise T_{i,i} = 1/Pr[row i sampled]
- |TBx|_1 = (1 ± ε) |Bx|_1 for all x
- Solve the regression on the (reweighted) samples!
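A sketch of how the diagonal reweighting T can be built from the conditioning step of slide 17; the `oversample` factor stands in for the unspecified poly(d/ε) constant, and `R` is assumed to come from the QR-factorization of S*B.

```python
import numpy as np

def l1_sampling_matrix(B, R, oversample=50.0, rng=np.random.default_rng(0)):
    """Diagonal of the reweighting matrix T from slides 17-19."""
    BRinv = B @ np.linalg.inv(R)                   # rows of the well-conditioned basis B*R^{-1}
    w = np.abs(BRinv).sum(axis=1)                  # |e_i^T B R^{-1}|_1 for each row i
    p = np.minimum(1.0, oversample * w / w.sum())  # sampling probability of row i
    keep = rng.random(B.shape[0]) < p
    return np.where(keep, 1.0 / p, 0.0)            # T_{i,i}; solve l1 regression on rows with T_{i,i} > 0
```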

20 Sketching to solve l_1-regression [MM]
- Most expensive operation is computing S*A, where S is the matrix of i.i.d. Cauchy random variables
- All other operations are in the "smaller space"
- Can speed this up by choosing S = Φ · diag(C_1, C_2, C_3, …, C_n), where Φ is a sparse CountSketch-type matrix and the C_i are i.i.d. Cauchy

21 Further sketching improvements [WZ]
- Can show that fewer sampled rows are needed in later steps if S is instead chosen as follows
- Instead of a diagonal of Cauchy random variables, use a diagonal of reciprocals of exponential random variables: S = Φ · diag(1/E_1, 1/E_2, 1/E_3, …, 1/E_n)
- Uses max-stability of exponentials [Andoni]: max_i y_i/E_i ~ |y|_1/E
For recent work on fast sampling-based algorithms, see Richard's talk!

22 Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l_1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

23 Low rank approximation
- A is an n x d matrix
- Typically well-approximated by a low rank matrix
- E.g., only high rank because of noise
- Want to output a rank-k matrix A', so that |A-A'|_F ≤ (1+ε) |A-A_k|_F, w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_F
- (For a matrix C, |C|_F = (Σ_{i,j} C_{i,j}²)^{1/2})
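For reference, the exact benchmark A_k is given by the truncated SVD (Eckart-Young); a minimal numpy version:

```python
import numpy as np

def best_rank_k(A, k):
    """A_k = argmin over rank-k matrices B of |A - B|_F, via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

Computing a full SVD of A is exactly the cost the sketching algorithms below try to avoid.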

24 Solution to low-rank approximation [S]
- Given an n x d input matrix A
- Compute S*A using a sketching matrix S with k << n rows. S*A takes random linear combinations of the rows of A
- Project the rows of A onto SA, then find the best rank-k approximation to the projected points inside of SA
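A minimal version of this two-step algorithm, using a Gaussian S for simplicity; `sketch_rows` stands in for the sketch size, which the slide leaves implicit.

```python
import numpy as np

def sketch_low_rank(A, k, sketch_rows, rng=np.random.default_rng(0)):
    """Sketch the rows, project onto row(SA), take the best rank-k approximation there (slide 24)."""
    n, d = A.shape
    S = rng.standard_normal((sketch_rows, n)) / np.sqrt(sketch_rows)
    SA = S @ A
    Q, _ = np.linalg.qr(SA.T)          # orthonormal basis for the row space of SA
    P = A @ Q                          # coordinates of each row of A in that basis
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    P_k = (U[:, :k] * s[:k]) @ Vt[:k]  # best rank-k approximation of the projected rows
    return P_k @ Q.T                   # map back to the original d-dimensional space
```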

25 Low Rank Approximation Idea
- Regression problem: min_X |A_k X – A|_F
- Solution is X = I, and the minimum is |A_k – A|_F
- This is a generalized regression problem!
- If S is a subspace embedding for the column space of A_k, and also if for any matrices B, C, |B S^T S C – BC|_F² ≤ [1/(# rows of S)] |B|_F² |C|_F²
- Then if X' is the minimizer of min_X |SA_k X – SA|_F, then |A_k X' – A|_F ≤ (1+ε) min_X |A_k X - A|_F = (1+ε) |A_k - A|_F
- But the minimizer X' = (SA_k)^- SA (pseudoinverse times SA) is in the row span of SA!
S can be a matrix of i.i.d. Normals, a Fast Johnson-Lindenstrauss matrix, or a CountSketch matrix

26 Caveat: projecting the points onto SA is slow
- Current algorithm:
  1. Compute S*A (easy)
  2. Project each of the rows onto S*A
  3. Find the best rank-k approximation of the projected points inside the rowspace of S*A (easy)
- Bottleneck is step 2
- [CW] Turns out you can approximate the projection
- Sketching for generalized regression again: min_X |X(SA)-A|_F²

27 Talk Outline
- Regression
  - Exact Regression Algorithms
  - Sketching to speed up Least Squares Regression
  - Sketching to speed up Least Absolute Deviation (l_1) Regression
- Low Rank Approximation
  - Sketching to speed up Low Rank Approximation
- Recent Results and Open Questions
  - M-Estimators and robust regression
  - CUR decompositions

28 M-Estimators and Robust Regression
- Solve min_x |Ax-b|_M
- M: R → R_{≥0}
- |y|_M = Σ_{i=1}^n M(y_i)
- Least squares and l_1-regression are special cases
- Huber function, given a parameter c:
  - M(y) = y²/(2c) for |y| ≤ c
  - M(y) = |y| - c/2 otherwise
- Enjoys the smoothness properties of l_2 and the robustness properties of l_1
[CW15] For M-estimators with at least linear and at most quadratic growth, can get an O(1)-approximation in nnz(A) + poly(d) time
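A small sketch of evaluating the Huber cost |Ax-b|_M defined above; this is only the loss evaluation, not the sketching-based solver of [CW15].

```python
import numpy as np

def huber_cost(A, x, b, c):
    """|Ax - b|_M for the Huber function: quadratic for small residuals, linear in the tails."""
    r = np.abs(A @ x - b)
    return np.where(r <= c, r**2 / (2 * c), r - c / 2).sum()
```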

29 CUR Decompositions [BW14]
Can find a CUR decomposition A ≈ C·U·R in O(nnz(A) log n) + n·poly(k/ε) time, with C containing O(k/ε) columns of A, R containing O(k/ε) rows of A, and rank(U) = k

30 Open Questions
- Recent monograph in NOW Publishers: D. Woodruff, "Sketching as a Tool for Numerical Linear Algebra"
- Other types of low rank approximation:
  - (Spectral) How quickly can we find a rank-k matrix A', so that |A-A'|_2 ≤ (1+ε) |A-A_k|_2, w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_2?
  - (Robust) How quickly can we find a rank-k matrix A', so that |A-A'|_1 ≤ (1+ε) |A-A_k|_1, w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_1?
- For other questions regarding Schatten norms and communication-efficiency, see the reference above.
Thanks!