Subspace Embeddings for the L1 norm with Applications. Christian Sohler (TU Dortmund), David Woodruff (IBM Almaden)
Subspace Embeddings for the L1 norm with Applications to... Robust Regression and Hyperplane Fitting
3 Outline: Massive data sets; Regression analysis; Our results; Our techniques; Concluding remarks
4 Massive data sets
Examples: Internet traffic logs, financial data, etc.
Algorithms: want nearly linear time or less, usually at the cost of a randomized approximation
5 Regression analysis
Regression: statistical method to study dependencies between variables in the presence of noise.
8 Regression analysis
Linear regression: statistical method to study linear dependencies between variables in the presence of noise.
Example: Ohm's law, $V = R \cdot I$. Find the linear function that best fits the data.
9 Regression analysis
Linear regression: statistical method to study linear dependencies between variables in the presence of noise.
Standard setting
One measured variable $b$
A set of predictor variables $a_1, \ldots, a_d$
Assumption: $b = x_0 + a_1 x_1 + \cdots + a_d x_d + \varepsilon$
$\varepsilon$ is assumed to be noise, and the $x_i$ are model parameters we want to learn
Can assume $x_0 = 0$
Now consider $n$ observations of the measured variable
10 Regression analysis
Matrix form
Input: an $n \times d$ matrix $A$ and a vector $b = (b_1, \ldots, b_n)$; $n$ is the number of observations, $d$ is the number of predictor variables
Output: $x^*$ so that $Ax^*$ and $b$ are close
Consider the over-constrained case, when $n \gg d$
Can assume that $A$ has full column rank
11 Regression analysis
Least squares method: find $x^*$ that minimizes $\sum_i (b_i - \langle A_{i*}, x^* \rangle)^2$, where $A_{i*}$ is the $i$-th row of $A$. Has certain desirable statistical properties.
Method of least absolute deviation ($\ell_1$-regression): find $x^*$ that minimizes $\sum_i |b_i - \langle A_{i*}, x^* \rangle|$. Cost is less sensitive to outliers than least squares.
12 Regression analysis
Geometry of regression
We want to find an $x$ that minimizes $\|Ax - b\|_1$
The product $Ax$ can be written as $A_{*1} x_1 + A_{*2} x_2 + \cdots + A_{*d} x_d$, where $A_{*i}$ is the $i$-th column of $A$
The set of all such products is a linear $d$-dimensional subspace, the column space of $A$
The problem is equivalent to computing the point of the column space of $A$ nearest to $b$ in $\ell_1$-norm
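13 Regression analysis
Solving $\ell_1$-regression via linear programming
Minimize $(1, \ldots, 1) \cdot (\alpha^+ + \alpha^-)$
Subject to: $Ax + \alpha^+ - \alpha^- = b$, with $\alpha^+, \alpha^- \ge 0$
Generic linear programming gives poly(nd) time
Best known algorithm is $n d^5 \log n + \mathrm{poly}(d/\varepsilon)$ [Clarkson]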
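The linear program on this slide can be sketched as follows. This is a minimal illustration, assuming NumPy and SciPy are available; the function name `l1_regression`, the names `alpha_plus` and `alpha_minus` for the two nonnegative slack vectors, and the toy Ohm's-law data are all assumptions made for the example, not part of the talk.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    """Return (x, cost) minimizing ||Ax - b||_1 via a slack-variable LP."""
    n, d = A.shape
    # Variables are stacked as [x (d, free), alpha_plus (n), alpha_minus (n)].
    # Objective: sum of all slack entries, i.e. (1,...,1) . (alpha_plus + alpha_minus).
    c = np.concatenate([np.zeros(d), np.ones(2 * n)])
    # Equality constraint: A x + alpha_plus - alpha_minus = b.
    A_eq = np.hstack([A, np.eye(n), -np.eye(n)])
    # x is unconstrained in sign; both slack vectors are nonnegative.
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:d], res.fun

# Hypothetical measurements for V = R * I with true R = 2 and one gross outlier.
current = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
voltage = np.array([2.0, 4.0, 6.0, 8.0, 100.0])

r_l1, cost_l1 = l1_regression(current, voltage)
r_ls = np.linalg.lstsq(current, voltage, rcond=None)[0]
print("l1 fit:", r_l1[0], "least-squares fit:", r_ls[0])
```

On this toy data the least absolute deviation fit stays at R = 2 with cost 90, while the least-squares estimate is pulled to 560/55, roughly 10.18, by the single outlier. The dense constraint matrix has n rows and d + 2n columns, so this direct formulation is only practical for small n.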
14 Our Results
A $(1+\varepsilon)$-approximation algorithm for the $\ell_1$-regression problem
Time complexity is nd poly(d/ε) (Clarkson's is $n d^5 \log n + \mathrm{poly}(d/\varepsilon)$)
First 1-pass streaming algorithm with small space (poly(d log n / ε) bits)
Similar results for hyperplane fitting
16 Our Techniques
Notice that for any $d \times d$ change of basis matrix $U$,
$\min_{x \in \mathbb{R}^d} \|Ax - b\|_1 = \min_{x \in \mathbb{R}^d} \|AUx - b\|_1$
17 Our Techniques
Notice that for any $y \in \mathbb{R}^d$,
$\min_{x \in \mathbb{R}^d} \|Ax - b\|_1 = \min_{x \in \mathbb{R}^d} \|Ax - b + Ay\|_1$
We call $b - Ay$ the residual, denoted $b'$, and so
$\min_{x \in \mathbb{R}^d} \|Ax - b\|_1 = \min_{x \in \mathbb{R}^d} \|Ax - b'\|_1$
18 Rough idea behind algorithm of Clarkson
1. Compute a poly(d)-approximation: find $y$ such that $\|Ay - b\|_1 \le \mathrm{poly}(d) \cdot \min_{x \in \mathbb{R}^d} \|Ax - b\|_1$, and let $b' = b - Ay$ be the residual. Takes $n d^5 \log n$ time.
2. Compute a well-conditioned basis: find a basis $U$ so that for all $x \in \mathbb{R}^d$, $\|x\|_1 / \mathrm{poly}(d) \le \|AUx\|_1 \le \mathrm{poly}(d) \cdot \|x\|_1$. Takes $n d^5 \log n$ time.
3. Sample rows from the well-conditioned basis and the residual of the poly(d)-approximation: since $\min_{x \in \mathbb{R}^d} \|Ax - b\|_1 = \min_{x \in \mathbb{R}^d} \|AUx - b'\|_1$, sample poly(d/ε) rows of $[AU \mid b']$ proportional to their $\ell_1$-norm. Takes nd time.
4. Solve $\ell_1$-regression on the sample, obtaining vector $x$, and output $x$. Takes poly(d/ε) time, since generic linear programming is now efficient.
19 Our Techniques
Suffices to show how to quickly compute
1. A poly(d)-approximation
2. A well-conditioned basis
20 Our main theorem
Theorem: There is a probability space over $(d \log d) \times n$ matrices $R$ such that for any $n \times d$ matrix $A$, with probability at least 99/100 we have for all $x$:
$\|Ax\|_1 \le \|RAx\|_1 \le d \log d \cdot \|Ax\|_1$
The embedding
is linear
is independent of $A$
preserves lengths of an infinite number of vectors
21 Application of our main theorem
Computing a poly(d)-approximation
Compute $RA$ and $Rb$
Solve $x' = \operatorname{argmin}_x \|RAx - Rb\|_1$
Main theorem applied to $[A \mid b]$ ($A$ with $b$ adjoined as an extra column) implies $x'$ is a $d \log d$-approximation
$RA$ and $Rb$ have $d \log d$ rows, so can solve $\ell_1$-regression efficiently
Time is dominated by computing $RA$, a single matrix-matrix product
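The sketch-and-solve step on this slide might look like the following. It is a rough sketch under assumptions: NumPy and SciPy are available, the helper names `l1_regression`, `sketched_l1_regression` and `l1_cost` are invented for the example, and the number of sketch rows is left as a parameter rather than set to the theorem's d log d with its hidden constants. R is left unscaled, since scaling R does not change the minimizer.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    """Return (x, cost) minimizing ||Ax - b||_1 via a slack-variable LP."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(2 * n)])
    A_eq = np.hstack([A, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:d], res.fun

def sketched_l1_regression(A, b, num_rows, rng):
    """Solve the small problem min_x ||RAx - Rb||_1 for an i.i.d. Cauchy matrix R."""
    n = A.shape[0]
    R = rng.standard_cauchy(size=(num_rows, n))
    # RA and Rb have only num_rows rows, so this LP is small regardless of n.
    x_sketch, _ = l1_regression(R @ A, R @ b)
    return x_sketch

def l1_cost(A, b, x):
    """Cost of x on the original, unsketched problem."""
    return float(np.abs(A @ x - b).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((300, 3))
    b = A @ np.array([1.0, -2.0, 0.5]) + rng.laplace(scale=0.1, size=300)
    x_opt, opt = l1_regression(A, b)
    x_sk = sketched_l1_regression(A, b, num_rows=40, rng=rng)
    print("approximation ratio:", l1_cost(A, b, x_sk) / opt)
```

The printed ratio is always at least 1, because `x_opt` is optimal for the full problem; how close it gets to 1 depends on the random draw of R and on the number of rows chosen.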
22 Application of our main theorem
Computing a well-conditioned basis
1. Compute $RA$
2. Compute $U$ so that $RAU$ is orthonormal (in the $\ell_2$-sense)
3. Output $AU$
$AU$ is well-conditioned because:
$\|AUx\|_1 \le \|RAUx\|_1 \le (d \log d)^{1/2} \|RAUx\|_2 = (d \log d)^{1/2} \|x\|_2 \le (d \log d)^{1/2} \|x\|_1$
and
$\|AUx\|_1 \ge \|RAUx\|_1 / (d \log d) \ge \|RAUx\|_2 / (d \log d) = \|x\|_2 / (d \log d) \ge \|x\|_1 / (d^{3/2} \log d)$
Life is really simple!
Time dominated by computing $RA$ and $AU$, two matrix-matrix products
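The three steps above can be sketched with a QR factorization, which is one way (assumed here, not stated in the talk) to obtain a $U$ that makes $RAU$ orthonormal. The function name `well_conditioned_basis` and the row-count parameter are illustrative; the code assumes $RA$ has full column rank and at least $d$ rows.

```python
import numpy as np

def well_conditioned_basis(A, num_rows, rng):
    """Return (AU, U, R) where R is i.i.d. Cauchy and RAU has orthonormal columns."""
    n, d = A.shape
    R = rng.standard_cauchy(size=(num_rows, n))
    # Step 1: compute the small sketch RA (num_rows x d).
    RA = R @ A
    # Step 2: reduced QR gives RA = Q T with Q orthonormal and T upper triangular,
    # so U = T^{-1} yields RAU = Q.
    _, T = np.linalg.qr(RA)
    U = np.linalg.inv(T)
    # Step 3: output AU, the second matrix-matrix product.
    return A @ U, U, R

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((300, 3))
    AU, U, R = well_conditioned_basis(A, num_rows=40, rng=rng)
    gram = (R @ A @ U).T @ (R @ A @ U)
    print(np.round(gram, 6))  # should be close to the identity matrix
```

Only the $d \times d$ factor is inverted, so beyond the two products $RA$ and $AU$ the extra work does not depend on $n$.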
23 Application of our main theorem
It follows that we get an nd poly(d/ε) time algorithm for $(1+\varepsilon)$-approximate $\ell_1$-regression
24 What's left?
We should prove our main theorem
Theorem: There is a probability space over $(d \log d) \times n$ matrices $R$ such that for any $n \times d$ matrix $A$, with probability at least 99/100 we have for all $x$:
$\|Ax\|_1 \le \|RAx\|_1 \le d \log d \cdot \|Ax\|_1$
$R$ is simple: the entries of $R$ are i.i.d. Cauchy random variables
25 Cauchy random variables
$\mathrm{pdf}(z) = \frac{1}{\pi (1 + z^2)}$ for $z \in (-\infty, \infty)$
Infinite expectation and variance
1-stable: if $z_1, z_2, \ldots, z_n$ are i.i.d. Cauchy, then for $a \in \mathbb{R}^n$,
$a_1 z_1 + a_2 z_2 + \cdots + a_n z_n \sim \|a\|_1 \cdot z$, where $z$ is Cauchy
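The 1-stability property can be checked empirically. The sketch below assumes NumPy; the function name `median_abs_combination`, the sample count, and the example vector are illustrative choices. It relies on the fact that the median of $|z|$ for a standard Cauchy $z$ is 1, so the median of $|a_1 z_1 + \cdots + a_n z_n|$ should be close to $\|a\|_1$.

```python
import numpy as np

def median_abs_combination(a, num_samples=200_000, seed=0):
    """Estimate the median of |a_1 z_1 + ... + a_n z_n| for i.i.d. standard Cauchy z_i."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    # Each row holds one independent draw of (z_1, ..., z_n).
    z = rng.standard_cauchy(size=(num_samples, a.size))
    return float(np.median(np.abs(z @ a)))

if __name__ == "__main__":
    a = [3.0, -1.0, 0.5, 2.5]  # ||a||_1 = 7
    print("l1 norm of a:", np.abs(a).sum())
    print("median of |combination|:", median_abs_combination(a))
```

The median is used instead of the sample mean because the mean of Cauchy samples does not converge, which is the heavy-tail issue the proof has to work around.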
26 Proof of main theorem
By 1-stability, for all rows $r$ of $R$, $\langle r, Ax \rangle \sim \|Ax\|_1 \cdot Z$, where $Z$ is a Cauchy
$RAx \sim (\|Ax\|_1 \cdot Z_1, \ldots, \|Ax\|_1 \cdot Z_{d \log d})$, where $Z_1, \ldots, Z_{d \log d}$ are i.i.d. Cauchy
$\|RAx\|_1 = \|Ax\|_1 \sum_i |Z_i|$
The $|Z_i|$ are half-Cauchy
$\sum_i |Z_i| = \Omega(d \log d)$ with probability $1 - \exp(-d)$ by Chernoff
An ε-net argument on $\{Ax : \|Ax\|_1 = 1\}$ shows $\|RAx\|_1 = \|Ax\|_1 \cdot \Omega(d \log d)$ for all $x$
Scale $R$ by $1/(d \log d)$
But $\sum_i |Z_i|$ is heavy-tailed
27 Proof of main theorem
$\sum_i |Z_i|$ is heavy-tailed, so $\|RAx\|_1 = \|Ax\|_1 \sum_i |Z_i| / (d \log d)$ may be large
Each $|Z_i|$ has c.d.f. asymptotic to $1 - \Theta(1/z)$ for $z \in [0, \infty)$
No problem!
We know there exists a well-conditioned basis of $A$
We can assume the basis vectors are $A_{*1}, \ldots, A_{*d}$
$\|RA_{*i}\|_1 \sim \|A_{*i}\|_1 \cdot \sum_j |Z_j| / (d \log d)$
With constant probability, $\sum_i \|RA_{*i}\|_1 = O(\log d) \sum_i \|A_{*i}\|_1$
28 Proof of main theorem
Suppose $\sum_i \|RA_{*i}\|_1 = O(\log d) \sum_i \|A_{*i}\|_1$ for a well-conditioned basis $A_{*1}, \ldots, A_{*d}$
We will use the Auerbach basis, which always exists:
For all $x$, $\|x\|_\infty \le \|Ax\|_1$
$\sum_i \|A_{*i}\|_1 = d$
I don't know how to compute such a basis, but it doesn't matter!
$\sum_i \|RA_{*i}\|_1 = O(d \log d)$
$\|RAx\|_1 \le \sum_i \|RA_{*i} x_i\|_1 \le \|x\|_\infty \sum_i \|RA_{*i}\|_1 = \|x\|_\infty \cdot O(d \log d) = O(d \log d) \cdot \|Ax\|_1$
Q.E.D.
31 Regression for data streams
Streaming algorithm given additive updates to entries of $A$ and $b$
Pick random matrix $R$ according to the distribution of the main theorem
Maintain $RA$ and $Rb$ during the stream
Find $x'$ that minimizes $\|RAx' - Rb\|_1$ using linear programming
Compute $U$ so that $RAU$ is orthonormal
The hard thing is sampling rows from $[AU \mid b']$ proportional to their norm, where $b'$ is the residual of $x'$
Do not know $U$, $b'$ until end of stream
Surprisingly, there is still a way to do this in a single pass by treating $U$, $x'$ as formal variables and plugging them in at the end
Uses a noisy sampling data structure (omitted from talk)
Entries of $R$ do not need to be independent
32 Hyperplane Fitting
Given $n$ points in $\mathbb{R}^d$, find the hyperplane minimizing the sum of $\ell_1$-distances of the points to the hyperplane
Reduces to $d$ invocations of $\ell_1$-regression
33 Conclusion
Main results
Efficient algorithms for $\ell_1$-regression and hyperplane fitting
nd time improves previous $n d^5 \log n$ running time for $\ell_1$-regression
First oblivious subspace embedding for $\ell_1$