Slide 1: Sketching as a Tool for Numerical Linear Algebra
David Woodruff, IBM Almaden
Slide 2: Talk Outline
- Regression
  - Exact regression algorithms
  - Sketching to speed up least squares regression
  - Sketching to speed up least absolute deviation (ℓ1) regression
- Low rank approximation
  - Sketching to speed up low rank approximation
- Recent results and open questions
  - M-estimators and robust regression
  - CUR decompositions
Slide 3: Regression
- Linear regression: a statistical method to study linear dependencies between variables in the presence of noise.
- Example: Ohm's law, V = R · I.
- Goal: find the linear function that best fits the data.
Slide 4: Regression (Standard Setting)
- One measured variable b, and a set of predictor variables a_1, …, a_d.
- Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε, where ε is assumed to be noise and the x_i are model parameters we want to learn.
- Can assume x_0 = 0.
- Now consider n observations of b.
Slide 5: Regression Analysis (Matrix Form)
- Input: an n × d matrix A and a vector b = (b_1, …, b_n); n is the number of observations and d is the number of predictor variables.
- Output: x* so that Ax* and b are close.
- Consider the over-constrained case, when n ≫ d.
- Assume that A has full column rank.
Slide 6: Regression Analysis (Least Squares Method)
- Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i - ⟨A_i*, x⟩)^2, where A_i* is the i-th row of A.
- Has certain desirable statistical properties.
- Closed-form solution: x* = (A^T A)^{-1} A^T b.
Method of least absolute deviation (ℓ1-regression):
- Find x* that minimizes |Ax-b|_1 = Σ_i |b_i - ⟨A_i*, x⟩|.
- Cost is less sensitive to outliers than least squares.
- Can solve via linear programming.
Time complexities are at least n·d^2; we want better! (A code sketch of both exact solvers follows.)
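A minimal, illustrative Python sketch of the two exact solvers, not from the talk: least squares via numpy's lstsq (equivalent to the closed form above, but numerically stabler) and ℓ1-regression as a linear program via scipy. The instance, sizes, and seeds are all made up for the demo.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Least squares: closed form x* = (A^T A)^{-1} A^T b; lstsq is the stable way.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

# l1-regression as an LP: minimize sum(t) subject to -t <= Ax - b <= t,
# over the stacked variables [x (d entries), t (n entries)].
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * d + [(0, None)] * n)
x_l1 = res.x[:d]
```

Both routines cost at least n·d^2 time, which is exactly the bottleneck the sketching techniques below attack.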
Slide 7: Talk Outline (recap of Slide 2); next up: sketching to speed up least squares regression.
Slide 8: Sketching to Solve Least Squares Regression
- How to find an approximate solution x to min_x |Ax-b|_2?
- Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability.
- Draw S from a k × n random family of matrices, for a value k ≪ n.
- Compute S·A and S·b.
- Output the solution x' to min_x |(SA)x - Sb|_2.
Slide 9: How to Choose the Right Sketching Matrix S?
- Recall: output the solution x' to min_x |(SA)x - Sb|_2.
- Lots of matrices work.
- S can be a d/ε^2 × n matrix of i.i.d. Normal random variables.
- But computing S·A may be slow… (a code sketch of this baseline follows)
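A hedged sketch of Gaussian sketch-and-solve, assuming numpy; k = d/ε^2 matches the slide up to constants, and the 1/√k normalization is a standard convention the slide does not spell out.

```python
import numpy as np

def sketch_and_solve_l2(A, b, eps=0.5, seed=1):
    """Solve min_x |(SA)x - Sb|_2 for a Gaussian sketch S with k ~ d/eps^2 rows."""
    n, d = A.shape
    k = int(np.ceil(d / eps**2))                  # k << n
    S = np.random.default_rng(seed).standard_normal((k, n)) / np.sqrt(k)
    # Only the small k x d problem is solved exactly.
    return np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
```

The dense product S @ A already costs k·n·d time here, which is why the structured sketches on the next slides matter.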
Slide 10: How to Choose the Right Sketching Matrix S? [S]
- S can be a Johnson-Lindenstrauss transform: S = P·H·D.
- D is a diagonal matrix with ±1 entries on the diagonal.
- H is the Hadamard transform.
- P just chooses a random (small) subset of rows of H·D.
- S·A can be computed much faster. (An illustrative implementation follows.)
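An illustrative implementation assuming n is a power of two: a simple fast Walsh-Hadamard transform computes H·D·A in O(nd log n) time, and P subsamples k rows. The scalings are standard choices, not dictated by the slide.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along axis 0 (n a power of 2)."""
    x = x.copy()
    n, h = x.shape[0], 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: (a, b) -> (a+b, a-b)
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def srht(A, k, seed=2):
    """Return S*A for S = sqrt(n/k) * P * (H/sqrt(n)) * D."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=n)            # random diagonal signs
    HDA = fwht(D[:, None] * A) / np.sqrt(n)        # H*D*A in O(n d log n) time
    rows = rng.choice(n, size=k, replace=False)    # P: random row subset
    return np.sqrt(n / k) * HDA[rows]
```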
Slide 11: Even Faster Sketching Matrices [CW]
- The CountSketch matrix: define a k × n matrix S, for k = O~(d^2/ε^2).
- S is really sparse: a single randomly chosen non-zero entry (±1) per column.
- (The slide shows a small example matrix with one ±1 entry in each column.)
- Surprisingly, this works! (A code sketch follows.)
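A minimal CountSketch sketch in Python, assuming numpy: one random ±1 per column of S means S·A is a single scatter-add pass over the rows of A, i.e., O(nnz(A)) time for sparse input.

```python
import numpy as np

def countsketch(A, k, seed=3):
    """Return S*A for a CountSketch S: one random +-1 per column of S."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    target = rng.integers(0, k, size=n)        # the nonzero row for each column of S
    sign = rng.choice([-1.0, 1.0], size=n)     # its random sign
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, target, sign[:, None] * A)   # scatter-add rows of A in one pass
    return SA
```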
Slide 12: Simpler and Sharper Proofs [MM, NN, N]
- Let B = [A, b] be an n × (d+1) matrix, and let U be an orthonormal basis for the columns of B.
- Suffices to show |SUx|_2 = 1 ± ε for all unit x; this implies |S(Ax-b)|_2 = (1 ± ε)|Ax-b|_2 for all x.
- SU is a (d+1)^2/ε^2 × (d+1) matrix.
- Suffices to show |U^T S^T S U - I|_2 ≤ |U^T S^T S U - I|_F ≤ ε.
- Matrix product result: |C S^T S D - CD|_F^2 ≤ [1/(# rows of S)] · |C|_F^2 |D|_F^2.
- Set C = U^T and D = U. Then |U|_F^2 = d+1 and (# rows of S) = (d+1)^2/ε^2.
- When |SBx|_2 = (1±ε)|Bx|_2 for all x, S is called a subspace embedding.
Slide 13: Talk Outline (recap of Slide 2); next up: sketching to speed up ℓ1-regression.
Slide 14: Sketching to Solve ℓ1-Regression
- How to find an approximate solution x to min_x |Ax-b|_1?
- Goal: output x' for which |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability.
- Natural attempt: draw S from a k × n random family of matrices for k ≪ n, compute S·A and S·b, and output the solution x' to min_x |(SA)x - Sb|_1.
- Turns out this does not work.
Slide 15: Sketching to Solve ℓ1-Regression [SW]
- Why doesn't outputting the solution x' to min_x |(SA)x - Sb|_1 work?
- We don't know of k × n matrices S with small k for which, if x' is the solution to min_x |(SA)x - Sb|_1, then |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability.
- Instead: can find an S so that |Ax'-b|_1 ≤ (d log d) min_x |Ax-b|_1.
- S is a matrix of i.i.d. Cauchy random variables.
- Property: |Ax-b|_1 ≤ |S(Ax-b)|_1 ≤ (d log d)|Ax-b|_1.
Slide 16: Cauchy Random Variables
- Cauchy random variables are not as nice as Normal (Gaussian) random variables: they have no mean and infinite variance.
- The ratio of two independent Normal random variables is Cauchy.
- 1-stability: if a and b are scalars and C_1, C_2 are independent Cauchys, then a·C_1 + b·C_2 ~ (|a|+|b|)·C for a Cauchy C. (Both facts are checked empirically in the sketch below.)
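A quick empirical check of the two facts above (illustrative only; the sample size and seed are arbitrary). Since a standard Cauchy has median absolute value 1, the two printed medians should come out near 1 and near |a|+|b| = 5.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 10**6
# Ratio of two independent Normals is Cauchy: median of |ratio| should be ~1.
ratio = rng.standard_normal(m) / rng.standard_normal(m)
# 1-stability: a*C1 + b*C2 ~ (|a|+|b|)*C, so median of |combo| should be ~5.
a, b = 2.0, -3.0
combo = a * rng.standard_cauchy(m) + b * rng.standard_cauchy(m)
print(np.median(np.abs(ratio)), np.median(np.abs(combo)))  # ~1.0 and ~5.0
```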
Slide 17: Sketching to Solve ℓ1-Regression
- Main idea: let B = [A, b] and compute a QR-factorization of S·B, so Q has orthonormal columns and Q·R = S·B.
- B·R^{-1} is a "well-conditioning" of B:
  Σ_{i=1}^d |BR^{-1} e_i|_1 ≤ Σ_{i=1}^d |SBR^{-1} e_i|_1 ≤ (d log d)^{1/2} Σ_{i=1}^d |SBR^{-1} e_i|_2 ≤ d·(d log d)^{1/2}
  |x|_∞ ≤ |x|_2 = |SBR^{-1} x|_2 ≤ |SBR^{-1} x|_1 ≤ (d log d) |BR^{-1} x|_1
- These two properties make importance sampling work!
Slide 18: Importance Sampling
- Want to estimate Σ_{i=1}^n y_i by sampling, for y_i ≥ 0.
- Suppose we sample y_i with probability p_i, and let T = Σ_{i=1}^n δ(y_i sampled) · y_i/p_i.
- E[T] = Σ_{i=1}^n p_i · y_i/p_i = Σ_{i=1}^n y_i.
- Var[T] ≤ Σ_{i=1}^n p_i · (y_i/p_i)^2 = Σ_{i=1}^n y_i^2/p_i ≤ (Σ_{i=1}^n y_i) · max_i y_i/p_i.
- Bound max_i y_i/p_i by ε^2 (Σ_{i=1}^n y_i).
- For us, y_i = |(Ax-b)_i|, and this holds if p_i = |e_i BR^{-1}|_1 · poly(d/ε)! (A code sketch of the estimator follows.)
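A minimal sketch of the estimator T, assuming numpy; the probabilities p are taken as given here, whereas the slide derives the right choice from |e_i BR^{-1}|_1.

```python
import numpy as np

def sampled_sum(y, p, seed=5):
    """Unbiased estimate of sum(y): keep y[i]/p[i] with probability p[i] in (0, 1]."""
    keep = np.random.default_rng(seed).random(len(y)) < p
    return float(np.sum(y[keep] / p[keep]))
```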
Slide 19: Importance Sampling (continued)
- To get a bound for all x, use Bernstein's inequality and a net argument.
- Sample poly(d/ε) rows of B·R^{-1}, where the i-th row is sampled proportional to its 1-norm.
- T is a diagonal matrix with T_{i,i} = 0 if row i is not sampled, and T_{i,i} = 1/Pr[row i sampled] otherwise.
- Then |TBx|_1 = (1 ± ε)|Bx|_1 for all x.
- Solve the regression problem on the (reweighted) samples!
Slide 20: Sketching to Solve ℓ1-Regression [MM]
- The most expensive operation is computing S·A, where S is the matrix of i.i.d. Cauchy random variables; all other operations are in the "smaller space".
- Can speed this up by choosing S = (CountSketch matrix) · diag(C_1, …, C_n): a sparse CountSketch matrix times a diagonal matrix of i.i.d. Cauchy random variables. (A code sketch follows.)
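A hedged sketch of this construction, reusing the countsketch() helper defined after Slide 11 (that reuse is an assumption of this write-up, not something the talk specifies).

```python
import numpy as np

def fast_cauchy_sketch(A, k, seed=6):
    """S*A for S = CountSketch * diag(C_1, ..., C_n) with C_i i.i.d. Cauchy."""
    C = np.random.default_rng(seed).standard_cauchy(A.shape[0])
    return countsketch(C[:, None] * A, k)   # scale rows, then one sparse pass
```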
Slide 21: Further Sketching Improvements [WZ]
- Can show you need fewer sampled rows in later steps if S is instead chosen as (CountSketch matrix) · diag(1/E_1, …, 1/E_n): in place of a diagonal of Cauchy random variables, use a diagonal of reciprocals of exponential random variables.
- Uses the max-stability of exponentials [Andoni]: max_i y_i/E_i ~ |y|_1/E.
- For recent work on fast sampling-based algorithms, see Richard's talk!
Slide 22: Talk Outline (recap of Slide 2); next up: sketching to speed up low rank approximation.
Slide 23: Low Rank Approximation
- A is an n × d matrix, typically well-approximated by a low rank matrix; e.g., it may only have high rank because of noise.
- Want to output a rank-k matrix A', so that |A-A'|_F ≤ (1+ε)|A-A_k|_F w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_F.
- (For a matrix C, |C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2}.)
Slide 24: Solution to Low-Rank Approximation [S]
- Given the n × d input matrix A, compute S·A using a sketching matrix S with k ≪ n rows; S·A takes random linear combinations of the rows of A.
- Project the rows of A onto the row space of SA, then find the best rank-k approximation to the projected points inside of SA. (A code sketch follows.)
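A minimal dense sketch of the two-step algorithm, assuming numpy; the exact projection in the middle is the slow step that Slide 26 revisits, and the Gaussian S and sizes here are illustrative.

```python
import numpy as np

def sketched_low_rank(A, k, sketch_rows, seed=7):
    """Rank-k approximation of A found inside the row space of S*A."""
    S = np.random.default_rng(seed).standard_normal((sketch_rows, A.shape[0]))
    SA = S @ A                               # random combinations of rows of A
    Q = np.linalg.qr(SA.T)[0].T              # orthonormal basis of rowspace(SA)
    P = A @ Q.T                              # project rows of A onto that space
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k] @ Q   # best rank-k in the span, mapped back
```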
Slide 25: Low Rank Approximation Idea
- Consider the regression problem min_X |A_k X - A|_F; the solution is X = I, and the minimum is |A_k - A|_F. This is a generalized regression problem!
- Suppose S is a subspace embedding for the column space of A_k, and also, for any matrices B and C, |B S^T S C - BC|_F^2 ≤ [1/(# rows of S)] |B|_F^2 |C|_F^2.
- Then if X' is the minimizer of min_X |S A_k X - SA|_F, we get |A_k X' - A|_F ≤ (1+ε) min_X |A_k X - A|_F = (1+ε)|A_k - A|_F.
- But the minimizer X' = (S A_k)^- SA (pseudoinverse) is in the row span of SA!
- S can be a matrix of i.i.d. Normals, a Fast Johnson-Lindenstrauss matrix, or a CountSketch matrix.
Slide 26: Caveat: Projecting the Points onto SA Is Slow
Current algorithm:
1. Compute S·A (easy).
2. Project each of the rows of A onto S·A.
3. Find the best rank-k approximation of the projected points inside the row space of S·A (easy).
- The bottleneck is step 2.
- [CW] Turns out you can approximate the projection: use sketching for generalized regression again, min_X |X(SA) - A|_F^2.
Slide 27: Talk Outline (recap of Slide 2); next up: recent results and open questions.
Slide 28: M-Estimators and Robust Regression
- Solve min_x |Ax-b|_M, where M: R → R_{≥0} and |y|_M = Σ_{i=1}^n M(y_i).
- Least squares and ℓ1-regression are special cases.
- The Huber function, given a parameter c: M(y) = y^2/(2c) for |y| ≤ c, and M(y) = |y| - c/2 otherwise. It enjoys the smoothness properties of ℓ2 and the robustness properties of ℓ1. (A code sketch follows.)
- [CW15] For M-estimators with at least linear and at most quadratic growth, can get an O(1)-approximation in nnz(A) + poly(d) time.
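A small illustrative implementation of the Huber cost |r|_M, matching the definition above; evaluating the cost is the easy part, and the [CW15] result is about solving the minimization fast.

```python
import numpy as np

def huber_cost(r, c):
    """Sum of Huber losses: quadratic for |r_i| <= c, linear beyond."""
    r = np.abs(r)
    return float(np.sum(np.where(r <= c, r**2 / (2 * c), r - c / 2)))
```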
Slide 29: CUR Decompositions
- [BW14] Can find a CUR decomposition in O(nnz(A) log n) + n·poly(k/ε) time, with O(k/ε) columns, O(k/ε) rows, and rank(U) = k.
Slide 30: Open Questions
- Recent monograph in NOW Publishers: D. Woodruff, "Sketching as a Tool for Numerical Linear Algebra".
- Other types of low rank approximation:
  - (Spectral) How quickly can we find a rank-k matrix A' so that |A-A'|_2 ≤ (1+ε)|A-A_k|_2 w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_2?
  - (Robust) How quickly can we find a rank-k matrix A' so that |A-A'|_1 ≤ (1+ε)|A-A_k|_1 w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_1?
- For other questions regarding Schatten norms and communication-efficiency, see the reference above.
Thanks!