Slide 1: Sketching as a Tool for Numerical Linear Algebra
David Woodruff, IBM Almaden
Slide 2: Talk Outline
- Regression
  - Exact regression algorithms
  - Sketching to speed up least squares regression
  - Sketching to speed up least absolute deviation (ℓ1) regression
- Low rank approximation
  - Sketching to speed up low rank approximation
- Recent results and open questions
  - M-estimators and robust regression
  - CUR decompositions
Slide 3: Regression
- Linear regression: a statistical method to study linear dependencies between variables in the presence of noise.
- Example: Ohm's law, V = R · I.
- Goal: find the linear function that best fits the data.
Slide 4: Regression (Standard Setting)
- One measured variable b, and a set of predictor variables a_1, …, a_d.
- Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + ε, where ε is assumed to be noise and the x_i are model parameters we want to learn.
- Can assume x_0 = 0.
- Now consider n observations of b.
Slide 5: Regression Analysis (Matrix Form)
- Input: an n × d matrix A and a vector b = (b_1, …, b_n); n is the number of observations and d is the number of predictor variables.
- Output: x* so that Ax* and b are close.
- Consider the over-constrained case, when n ≫ d.
- Assume that A has full column rank.
Slide 6: Regression Analysis (Least Squares Method)
- Find x* that minimizes |Ax-b|_2^2 = Σ_i (b_i - ⟨A_i*, x⟩)^2, where A_i* is the i-th row of A.
- Has certain desirable statistical properties.
- Closed-form solution: x* = (A^T A)^{-1} A^T b.
Method of least absolute deviation (ℓ1-regression):
- Find x* that minimizes |Ax-b|_1 = Σ_i |b_i - ⟨A_i*, x⟩|.
- Cost is less sensitive to outliers than least squares.
- Can solve via linear programming.
Time complexities are at least n·d^2; we want better! (A code sketch of both exact solvers follows.)
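A minimal, illustrative Python sketch of the two exact solvers, not from the talk: least squares via numpy's lstsq (equivalent to the closed form above, but numerically stabler) and ℓ1-regression as a linear program via scipy. The instance, sizes, and seeds are all made up for the demo.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Least squares: closed form x* = (A^T A)^{-1} A^T b; lstsq is the stable way.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

# l1-regression as an LP: minimize sum(t) subject to -t <= Ax - b <= t,
# over the stacked variables [x (d entries), t (n entries)].
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * d + [(0, None)] * n)
x_l1 = res.x[:d]
```

Both routines cost at least n·d^2 time, which is exactly the bottleneck the sketching techniques below attack.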
Slide 7: Talk Outline (recap of Slide 2); next up: sketching to speed up least squares regression.
Slide 8: Sketching to Solve Least Squares Regression
- How to find an approximate solution x to min_x |Ax-b|_2?
- Goal: output x' for which |Ax'-b|_2 ≤ (1+ε) min_x |Ax-b|_2 with high probability.
- Draw S from a k × n random family of matrices, for a value k ≪ n.
- Compute S·A and S·b.
- Output the solution x' to min_x |(SA)x - Sb|_2.
Slide 9: How to Choose the Right Sketching Matrix S?
- Recall: output the solution x' to min_x |(SA)x - Sb|_2.
- Lots of matrices work.
- S can be a d/ε^2 × n matrix of i.i.d. Normal random variables.
- But computing S·A may be slow… (a code sketch of this baseline follows)
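A hedged sketch of Gaussian sketch-and-solve, assuming numpy; k = d/ε^2 matches the slide up to constants, and the 1/√k normalization is a standard convention the slide does not spell out.

```python
import numpy as np

def sketch_and_solve_l2(A, b, eps=0.5, seed=1):
    """Solve min_x |(SA)x - Sb|_2 for a Gaussian sketch S with k ~ d/eps^2 rows."""
    n, d = A.shape
    k = int(np.ceil(d / eps**2))                  # k << n
    S = np.random.default_rng(seed).standard_normal((k, n)) / np.sqrt(k)
    # Only the small k x d problem is solved exactly.
    return np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
```

The dense product S @ A already costs k·n·d time here, which is why the structured sketches on the next slides matter.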
Slide 10: How to Choose the Right Sketching Matrix S? [S]
- S can be a Johnson-Lindenstrauss transform: S = P·H·D.
- D is a diagonal matrix with ±1 entries on the diagonal.
- H is the Hadamard transform.
- P just chooses a random (small) subset of rows of H·D.
- S·A can be computed much faster. (An illustrative implementation follows.)
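An illustrative implementation assuming n is a power of two: a simple fast Walsh-Hadamard transform computes H·D·A in O(nd log n) time, and P subsamples k rows. The scalings are standard choices, not dictated by the slide.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along axis 0 (n a power of 2)."""
    x = x.copy()
    n, h = x.shape[0], 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: (a, b) -> (a+b, a-b)
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def srht(A, k, seed=2):
    """Return S*A for S = sqrt(n/k) * P * (H/sqrt(n)) * D."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=n)            # random diagonal signs
    HDA = fwht(D[:, None] * A) / np.sqrt(n)        # H*D*A in O(n d log n) time
    rows = rng.choice(n, size=k, replace=False)    # P: random row subset
    return np.sqrt(n / k) * HDA[rows]
```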
Slide 11: Even Faster Sketching Matrices [CW]
- The CountSketch matrix: define a k × n matrix S, for k = O~(d^2/ε^2).
- S is really sparse: a single randomly chosen non-zero entry (±1) per column.
- (The slide shows a small example matrix with one ±1 entry in each column.)
- Surprisingly, this works! (A code sketch follows.)
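A minimal CountSketch sketch in Python, assuming numpy: one random ±1 per column of S means S·A is a single scatter-add pass over the rows of A, i.e., O(nnz(A)) time for sparse input.

```python
import numpy as np

def countsketch(A, k, seed=3):
    """Return S*A for a CountSketch S: one random +-1 per column of S."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    target = rng.integers(0, k, size=n)        # the nonzero row for each column of S
    sign = rng.choice([-1.0, 1.0], size=n)     # its random sign
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, target, sign[:, None] * A)   # scatter-add rows of A in one pass
    return SA
```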
Slide 12: Simpler and Sharper Proofs [MM, NN, N]
- Let B = [A, b] be an n × (d+1) matrix, and let U be an orthonormal basis for the columns of B.
- Suffices to show |SUx|_2 = 1 ± ε for all unit x; this implies |S(Ax-b)|_2 = (1 ± ε)|Ax-b|_2 for all x.
- SU is a (d+1)^2/ε^2 × (d+1) matrix.
- Suffices to show |U^T S^T S U - I|_2 ≤ |U^T S^T S U - I|_F ≤ ε.
- Matrix product result: |C S^T S D - CD|_F^2 ≤ [1/(# rows of S)] · |C|_F^2 |D|_F^2.
- Set C = U^T and D = U. Then |U|_F^2 = d+1 and (# rows of S) = (d+1)^2/ε^2.
- When |SBx|_2 = (1±ε)|Bx|_2 for all x, S is called a subspace embedding.
Slide 13: Talk Outline (recap of Slide 2); next up: sketching to speed up ℓ1-regression.
Slide 14: Sketching to Solve ℓ1-Regression
- How to find an approximate solution x to min_x |Ax-b|_1?
- Goal: output x' for which |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability.
- Natural attempt: draw S from a k × n random family of matrices for k ≪ n, compute S·A and S·b, and output the solution x' to min_x |(SA)x - Sb|_1.
- Turns out this does not work.
Slide 15: Sketching to Solve ℓ1-Regression [SW]
- Why doesn't outputting the solution x' to min_x |(SA)x - Sb|_1 work?
- We don't know of k × n matrices S with small k for which, if x' is the solution to min_x |(SA)x - Sb|_1, then |Ax'-b|_1 ≤ (1+ε) min_x |Ax-b|_1 with high probability.
- Instead: can find an S so that |Ax'-b|_1 ≤ (d log d) min_x |Ax-b|_1.
- S is a matrix of i.i.d. Cauchy random variables.
- Property: |Ax-b|_1 ≤ |S(Ax-b)|_1 ≤ (d log d)|Ax-b|_1.
Slide 16: Cauchy Random Variables
- Cauchy random variables are not as nice as Normal (Gaussian) random variables: they have no mean and infinite variance.
- The ratio of two independent Normal random variables is Cauchy.
- 1-stability: if a and b are scalars and C_1, C_2 are independent Cauchys, then a·C_1 + b·C_2 ~ (|a|+|b|)·C for a Cauchy C. (Both facts are checked empirically in the sketch below.)
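A quick empirical check of the two facts above (illustrative only; the sample size and seed are arbitrary). Since a standard Cauchy has median absolute value 1, the two printed medians should come out near 1 and near |a|+|b| = 5.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 10**6
# Ratio of two independent Normals is Cauchy: median of |ratio| should be ~1.
ratio = rng.standard_normal(m) / rng.standard_normal(m)
# 1-stability: a*C1 + b*C2 ~ (|a|+|b|)*C, so median of |combo| should be ~5.
a, b = 2.0, -3.0
combo = a * rng.standard_cauchy(m) + b * rng.standard_cauchy(m)
print(np.median(np.abs(ratio)), np.median(np.abs(combo)))  # ~1.0 and ~5.0
```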
Slide 17: Sketching to Solve ℓ1-Regression
- Main idea: let B = [A, b] and compute a QR-factorization of S·B, so Q has orthonormal columns and Q·R = S·B.
- B·R^{-1} is a "well-conditioning" of B:
  Σ_{i=1}^d |BR^{-1} e_i|_1 ≤ Σ_{i=1}^d |SBR^{-1} e_i|_1 ≤ (d log d)^{1/2} Σ_{i=1}^d |SBR^{-1} e_i|_2 ≤ d·(d log d)^{1/2}
  |x|_∞ ≤ |x|_2 = |SBR^{-1} x|_2 ≤ |SBR^{-1} x|_1 ≤ (d log d) |BR^{-1} x|_1
- These two properties make importance sampling work!
Slide 18: Importance Sampling
- Want to estimate Σ_{i=1}^n y_i by sampling, for y_i ≥ 0.
- Suppose we sample y_i with probability p_i, and let T = Σ_{i=1}^n δ(y_i sampled) · y_i/p_i.
- E[T] = Σ_{i=1}^n p_i · y_i/p_i = Σ_{i=1}^n y_i.
- Var[T] ≤ Σ_{i=1}^n p_i · (y_i/p_i)^2 = Σ_{i=1}^n y_i^2/p_i ≤ (Σ_{i=1}^n y_i) · max_i y_i/p_i.
- Bound max_i y_i/p_i by ε^2 (Σ_{i=1}^n y_i).
- For us, y_i = |(Ax-b)_i|, and this holds if p_i = |e_i BR^{-1}|_1 · poly(d/ε)! (A code sketch of the estimator follows.)
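A minimal sketch of the estimator T, assuming numpy; the probabilities p are taken as given here, whereas the slide derives the right choice from |e_i BR^{-1}|_1.

```python
import numpy as np

def sampled_sum(y, p, seed=5):
    """Unbiased estimate of sum(y): keep y[i]/p[i] with probability p[i] in (0, 1]."""
    keep = np.random.default_rng(seed).random(len(y)) < p
    return float(np.sum(y[keep] / p[keep]))
```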
Slide 19: Importance Sampling (continued)
- To get a bound for all x, use Bernstein's inequality and a net argument.
- Sample poly(d/ε) rows of B·R^{-1}, where the i-th row is sampled proportional to its 1-norm.
- T is a diagonal matrix with T_{i,i} = 0 if row i is not sampled, and T_{i,i} = 1/Pr[row i sampled] otherwise.
- Then |TBx|_1 = (1 ± ε)|Bx|_1 for all x.
- Solve the regression problem on the (reweighted) samples!
Slide 20: Sketching to Solve ℓ1-Regression [MM]
- The most expensive operation is computing S·A, where S is the matrix of i.i.d. Cauchy random variables; all other operations are in the "smaller space".
- Can speed this up by choosing S = (CountSketch matrix) · diag(C_1, …, C_n): a sparse CountSketch matrix times a diagonal matrix of i.i.d. Cauchy random variables. (A code sketch follows.)
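A hedged sketch of this construction, reusing the countsketch() helper defined after Slide 11 (that reuse is an assumption of this write-up, not something the talk specifies).

```python
import numpy as np

def fast_cauchy_sketch(A, k, seed=6):
    """S*A for S = CountSketch * diag(C_1, ..., C_n) with C_i i.i.d. Cauchy."""
    C = np.random.default_rng(seed).standard_cauchy(A.shape[0])
    return countsketch(C[:, None] * A, k)   # scale rows, then one sparse pass
```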
Slide 21: Further Sketching Improvements [WZ]
- Can show you need fewer sampled rows in later steps if S is instead chosen as (CountSketch matrix) · diag(1/E_1, …, 1/E_n): in place of a diagonal of Cauchy random variables, use a diagonal of reciprocals of exponential random variables.
- Uses the max-stability of exponentials [Andoni]: max_i y_i/E_i ~ |y|_1/E.
- For recent work on fast sampling-based algorithms, see Richard's talk!
Slide 22: Talk Outline (recap of Slide 2); next up: sketching to speed up low rank approximation.
Slide 23: Low Rank Approximation
- A is an n × d matrix, typically well-approximated by a low rank matrix; e.g., it may only have high rank because of noise.
- Want to output a rank-k matrix A', so that |A-A'|_F ≤ (1+ε)|A-A_k|_F w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_F.
- (For a matrix C, |C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2}.)
Slide 24: Solution to Low-Rank Approximation [S]
- Given the n × d input matrix A, compute S·A using a sketching matrix S with k ≪ n rows; S·A takes random linear combinations of the rows of A.
- Project the rows of A onto the row space of SA, then find the best rank-k approximation to the projected points inside of SA. (A code sketch follows.)
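A minimal dense sketch of the two-step algorithm, assuming numpy; the exact projection in the middle is the slow step that Slide 26 revisits, and the Gaussian S and sizes here are illustrative.

```python
import numpy as np

def sketched_low_rank(A, k, sketch_rows, seed=7):
    """Rank-k approximation of A found inside the row space of S*A."""
    S = np.random.default_rng(seed).standard_normal((sketch_rows, A.shape[0]))
    SA = S @ A                               # random combinations of rows of A
    Q = np.linalg.qr(SA.T)[0].T              # orthonormal basis of rowspace(SA)
    P = A @ Q.T                              # project rows of A onto that space
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k] @ Q   # best rank-k in the span, mapped back
```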
Slide 25: Low Rank Approximation Idea
- Consider the regression problem min_X |A_k X - A|_F; the solution is X = I, and the minimum is |A_k - A|_F. This is a generalized regression problem!
- Suppose S is a subspace embedding for the column space of A_k, and also, for any matrices B and C, |B S^T S C - BC|_F^2 ≤ [1/(# rows of S)] |B|_F^2 |C|_F^2.
- Then if X' is the minimizer of min_X |S A_k X - SA|_F, we get |A_k X' - A|_F ≤ (1+ε) min_X |A_k X - A|_F = (1+ε)|A_k - A|_F.
- But the minimizer X' = (S A_k)^- SA (pseudoinverse) is in the row span of SA!
- S can be a matrix of i.i.d. Normals, a Fast Johnson-Lindenstrauss matrix, or a CountSketch matrix.
Slide 26: Caveat: Projecting the Points onto SA Is Slow
Current algorithm:
1. Compute S·A (easy).
2. Project each of the rows of A onto S·A.
3. Find the best rank-k approximation of the projected points inside the row space of S·A (easy).
- The bottleneck is step 2.
- [CW] Turns out you can approximate the projection: use sketching for generalized regression again, min_X |X(SA) - A|_F^2.
Slide 27: Talk Outline (recap of Slide 2); next up: recent results and open questions.
Slide 28: M-Estimators and Robust Regression
- Solve min_x |Ax-b|_M, where M: R → R_{≥0} and |y|_M = Σ_{i=1}^n M(y_i).
- Least squares and ℓ1-regression are special cases.
- The Huber function, given a parameter c: M(y) = y^2/(2c) for |y| ≤ c, and M(y) = |y| - c/2 otherwise. It enjoys the smoothness properties of ℓ2 and the robustness properties of ℓ1. (A code sketch follows.)
- [CW15] For M-estimators with at least linear and at most quadratic growth, can get an O(1)-approximation in nnz(A) + poly(d) time.
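A small illustrative implementation of the Huber cost |r|_M, matching the definition above; evaluating the cost is the easy part, and the [CW15] result is about solving the minimization fast.

```python
import numpy as np

def huber_cost(r, c):
    """Sum of Huber losses: quadratic for |r_i| <= c, linear beyond."""
    r = np.abs(r)
    return float(np.sum(np.where(r <= c, r**2 / (2 * c), r - c / 2)))
```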
Slide 29: CUR Decompositions
- [BW14] Can find a CUR decomposition in O(nnz(A) log n) + n·poly(k/ε) time, with O(k/ε) columns, O(k/ε) rows, and rank(U) = k.
Slide 30: Open Questions
- Recent monograph in NOW Publishers: D. Woodruff, "Sketching as a Tool for Numerical Linear Algebra".
- Other types of low rank approximation:
  - (Spectral) How quickly can we find a rank-k matrix A' so that |A-A'|_2 ≤ (1+ε)|A-A_k|_2 w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_2?
  - (Robust) How quickly can we find a rank-k matrix A' so that |A-A'|_1 ≤ (1+ε)|A-A_k|_1 w.h.p., where A_k = argmin_{rank-k matrices B} |A-B|_1?
- For other questions regarding Schatten norms and communication-efficiency, see the reference above.
Thanks!