Sketching for M-Estimators: A Unified Approach to Robust Regression


1 Sketching for M-Estimators: A Unified Approach to Robust Regression
Kenneth Clarkson and David Woodruff, IBM Almaden

2 Regression
Linear regression: a statistical method to study linear dependencies between variables in the presence of noise.
Example: Ohm's law, V = R · I. Find the linear function that best fits the data.

3 Regression
Standard setting: one measured variable b, and a set of predictor variables a_1, …, a_d.
Assumption: b = x_0 + a_1 x_1 + … + a_d x_d + e, where e is assumed to be noise and the x_i are model parameters we want to learn. Can assume x_0 = 0.
Now consider n observations of b.

4 Regression
Matrix form. Input: an n × d matrix A and a vector b = (b_1, …, b_n), where n is the number of observations and d is the number of predictor variables.
Output: x* so that Ax* and b are close.
Consider the over-constrained case, when n ≫ d.

5 Fitness Measures
Least squares method: find x* that minimizes |Ax-b|_2^2.
Ax* is the projection of b onto the column span of A; it has certain desirable statistical properties and a closed-form solution x* = (A^T A)^{-1} A^T b.
Method of least absolute deviation (l_1-regression): find x* that minimizes |Ax-b|_1 = Σ_i |b_i − ⟨A_i, x⟩|, where A_i is the i-th row of A.
The cost is less sensitive to outliers than least squares, and one can solve it via linear programming.
What about the many other fitness measures used in practice?
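Both baselines fit in a few lines; here is a minimal numpy/scipy sketch (the function names and the LP reformulation of l_1-regression are illustrative, not from the talk):

    import numpy as np
    from scipy.optimize import linprog

    def least_squares(A, b):
        # Stable equivalent of the closed form x* = (A^T A)^{-1} A^T b.
        return np.linalg.lstsq(A, b, rcond=None)[0]

    def least_absolute_deviation(A, b):
        # min_x |Ax - b|_1 as an LP: minimize sum(t) s.t. -t <= Ax - b <= t.
        n, d = A.shape
        c = np.concatenate([np.zeros(d), np.ones(n)])         # objective: sum of t_i
        A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])  # Ax - t <= b, -Ax - t <= -b
        b_ub = np.concatenate([b, -b])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + n))
        return res.x[:d]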

6 M-Estimators
Measure function G: R → R_{≥0} with G(x) = G(−x), G(0) = 0, and G non-decreasing in |x|.
Define |y|_M = Σ_{i=1}^n G(y_i) and solve min_x |Ax-b|_M.
Least squares and l_1-regression are special cases.
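As a direct transcription of the definition (helper names are mine, not the talk's):

    import numpy as np

    def m_norm(y, G):
        # |y|_M = sum_i G(y_i) for a measure function G: R -> R_{>=0}.
        return np.sum(G(y))

    def m_cost(A, x, b, G):
        # Objective of M-estimator regression: |Ax - b|_M.
        return m_norm(A @ x - b, G)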

7 Huber Loss Function
G(x) = x^2/(2c) for |x| ≤ c, and G(x) = |x| − c/2 for |x| > c.
Enjoys the smoothness properties of l_2^2 and the robustness properties of l_1.
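A vectorized numpy transcription (c = 1 is an arbitrary default):

    import numpy as np

    def huber(x, c=1.0):
        # Quadratic near zero (smooth, like l_2^2), linear in the tails (robust, like l_1).
        ax = np.abs(x)
        return np.where(ax <= c, x**2 / (2 * c), ax - c / 2)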

8 Other Examples
L1-L2 estimator: G(x) = 2(√(1 + x^2/2) − 1).
Fair estimator: G(x) = c^2 [ |x|/c − log(1 + |x|/c) ].
Tukey estimator: G(x) = (c^2/6)(1 − [1 − (x/c)^2]^3) if |x| ≤ c, and G(x) = c^2/6 if |x| > c.
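The same three in numpy (defaults are illustrative; the Tukey tail value c^2/6 makes G continuous at |x| = c):

    import numpy as np

    def l1_l2(x):
        return 2.0 * (np.sqrt(1.0 + x**2 / 2.0) - 1.0)

    def fair(x, c=1.0):
        ax = np.abs(x)
        return c**2 * (ax / c - np.log1p(ax / c))

    def tukey(x, c=1.0):
        ax = np.abs(x)
        return np.where(ax <= c, (c**2 / 6.0) * (1.0 - (1.0 - (x / c)**2)**3), c**2 / 6.0)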

9 Nice M-Estimators
An M-estimator is nice if it has at least linear growth and at most quadratic growth: there is a C_G > 0 so that for all a, a' with |a| ≥ |a'| > 0,
|a/a'|^2 ≥ G(a)/G(a') ≥ C_G |a/a'|.
Any convex G satisfies the linear lower bound. Any sketchable G satisfies the quadratic upper bound.
Sketchable means: there is a distribution on t × n matrices S for which |Sx|_M = Θ(|x|_M) with probability 2/3, where t is a slowly growing function of n.
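One can sanity-check the growth condition for a candidate G numerically; the helper below (mine, a random-pair test rather than a proof) should pass for the Huber function above with, e.g., C_G = 1/2 at c = 1:

    import numpy as np

    def growth_ok(G, C_G, trials=10000, seed=0):
        # Tests |a/a'|^2 >= G(a)/G(a') >= C_G |a/a'| on random pairs |a| >= |a'| > 0.
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            lo, hi = np.sort(rng.uniform(1e-3, 1e3, size=2))
            ratio = G(hi) / G(lo)
            if not ((hi / lo) ** 2 >= ratio >= C_G * (hi / lo)):
                return False
        return True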

10 Our Results
Let nnz(A) denote the number of non-zero entries of an n × d matrix A.
[Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm to output x' so that w.h.p. |Ax'-b|_H ≤ (1+ε) min_x |Ax-b|_H.
[Nice M-Estimators] O(nnz(A)) + poly(d log n) time algorithm to output x' so that, for any constant C > 1, w.h.p. |Ax'-b|_M ≤ C · min_x |Ax-b|_M.
Remarks:
- For convex nice M-estimators one can solve via convex programming, but slowly: poly(nd) time.
- Our algorithm for nice M-estimators is universal.

11 Talk Outline
Huber result
Nice M-Estimators result

12 Naive Sampling Algorithm
Solve x' = argmin_x |SAx - Sb|_M in place of min_x |Ax-b|_M, where S uniformly samples poly(d/ε) rows.
This is a terrible algorithm: a uniform sample is likely to miss the few rows that dominate the cost.
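For concreteness, the naive sketch is just (illustrative helper):

    import numpy as np

    def uniform_sample(A, b, t, seed=0):
        # Keep t uniformly random rows; any single row essential to the
        # fit survives only with probability t/n.
        rng = np.random.default_rng(seed)
        idx = rng.choice(A.shape[0], size=t, replace=False)
        return A[idx], b[idx]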

13 Leverage Score Sampling
For l_p-norms, there are probabilities q_1, …, q_n with Σ_i q_i = poly(d/ε) so that sampling works: x' = argmin_x |SAx - Sb|_M, where S is diagonal with S_{i,i} = 1/q_i if row i is sampled and 0 otherwise.
For l_2, the q_i are the squared row norms in an orthonormal basis of A.
For l_p, the q_i are the p-th powers of the p-norms of the rows in a "well-conditioned basis" [Dasgupta et al.].
All q_i can be found in O(nnz(A) log n) + poly(d) time.
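A minimal sketch of the l_2 case (assuming a dense A and a QR factorization; the raw scores sum to d, so in practice they are scaled up by an oversampling factor to reach Σ_i q_i = poly(d/ε)):

    import numpy as np

    def l2_leverage_scores(A):
        # Squared row norms of an orthonormal basis Q for the column span of A.
        Q, _ = np.linalg.qr(A)
        return np.sum(Q**2, axis=1)

    def leverage_sample(A, b, q, seed=0):
        # Keep row i with probability min(1, q_i); kept rows are scaled
        # by 1/q_i, i.e. the diagonal sketch S from the slide.
        rng = np.random.default_rng(seed)
        p = np.minimum(1.0, q)
        keep = rng.random(A.shape[0]) < p
        w = 1.0 / p[keep]
        return A[keep] * w[:, None], b[keep] * w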

14 Huber Regression Algorithm
[Huber inequality] For z ∈ R^n, Θ(n^{-1/2}) min(|z|_1, |z|_2^2/(2c)) ≤ |z|_H ≤ |z|_1. Proof by case analysis.
Sample from a mixture of l_1-leverage scores and l_2-leverage scores: p_i = n^{1/2} · (q_i^{(1)} + q_i^{(2)}) (see the sketch after this list).
Our O(nnz(A) log n) + poly(d/ε) algorithm:
- After one step, the number of rows is < n^{1/2} poly(d/ε).
- Recursively solve a weighted Huber problem; the weights do not grow quickly.
- Once the size is < n^{0.01} poly(d/ε), solve by convex programming.
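A sketch of the mixed sampling probabilities (helper name mine; q1 and q2 would come from l_1- and l_2-leverage-score routines like the one above):

    import numpy as np

    def huber_sampling_probs(q1, q2):
        # p_i = n^{1/2} * (q_i^(1) + q_i^(2)), capped at 1 so each p_i
        # is a valid sampling probability.
        q1, q2 = np.asarray(q1), np.asarray(q2)
        return np.minimum(1.0, np.sqrt(len(q1)) * (q1 + q2))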

15 Talk Outline
Huber result
Nice M-Estimators result

16 CountSketch
For l_2 regression, CountSketch with poly(d) rows works [Clarkson, W]: compute SA in nnz(A) time, then compute x' = argmin_x |SAx - Sb|_2 in poly(d) time.
(Figure: the CountSketch matrix S; each column has a single random ±1 entry.)
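Applying CountSketch needs no explicit matrix: hash each row to a bucket with a random sign, touching every nonzero of A once. A dense-A sketch (illustrative, not the paper's code):

    import numpy as np

    def countsketch(A, t, seed=0):
        # Row i of A is added, with a random sign, into one of t buckets;
        # one pass over the nonzeros gives SA in O(nnz(A)) time.
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        bucket = rng.integers(0, t, size=n)
        sign = rng.choice([-1.0, 1.0], size=n)
        SA = np.zeros((t, A.shape[1]))
        np.add.at(SA, bucket, sign[:, None] * A)
        return SA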

17 The same M-Sketch works for all nice M-estimators!
x' = argmin_x |TAx - Tb|_M.
Many analyses of this data structure don't work, since they reduce the problem to a non-convex problem; we show it works via "lopsided" subspace embeddings.
T stacks the blocks S_0 · D_0, S_1 · D_1, S_2 · D_2, …, S_{log n} · D_{log n}.
This sketch was used for estimating frequency moments [Indyk, W] and earthmover distance [Verbin, Zhang].
The S_i are independent CountSketch matrices with poly(d) rows; D_i is an n × n diagonal matrix that uniformly samples a 1/(d log n)^i fraction of the n rows.
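A rough construction under my reading of the slide (the block size t, the per-level rate (d log n)^{-i}, and returning per-level weights are assumptions for illustration):

    import numpy as np

    def m_sketch_apply(A, t, seed=0):
        # Apply T = [S_0 D_0; ...; S_{log n} D_{log n}]: level i keeps each
        # row with probability (d log n)^{-i} (D_i), then CountSketches the
        # survivors into t buckets (S_i).
        rng = np.random.default_rng(seed)
        n, d = A.shape
        levels = int(np.log2(n)) + 1
        base = 1.0 / (d * max(np.log(n), 1.0))
        blocks, weights = [], []
        for i in range(levels):
            rate = base**i
            rows = np.where(rng.random(n) < rate)[0]       # D_i: uniform subsample
            bucket = rng.integers(0, t, size=len(rows))    # S_i: CountSketch hash
            sign = rng.choice([-1.0, 1.0], size=len(rows))
            SA = np.zeros((t, d))
            np.add.at(SA, bucket, sign[:, None] * A[rows])
            blocks.append(SA)
            weights.append(1.0 / rate)
        return blocks, weights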

18 M-Sketch Intuition
Consider a fixed y = Ax - b.
For M-Sketch T, output |Ty|_{w,M} = Σ_i w_i G((Ty)_i).
[Contraction] |Ty|_{w,M} ≥ ½ |y|_M with probability 1 − exp(−d log n).
[Dilation] |Ty|_{w,M} ≤ 2 |y|_M with probability 9/10.
Contraction allows for a net argument (no scale-invariance!); dilation implies the optimal y* does not dilate much.
The analysis uses "bucket crowding", "level sets", and Ky Fan norms.
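The weighted sketch norm is easy to evaluate with the blocks and weights from m_sketch_apply above; comparing it against |y|_M on random y is a sanity experiment, not the paper's proof:

    import numpy as np

    def weighted_m_norm(blocks, weights, G):
        # |Ty|_{w,M} = sum over levels i of w_i * sum_j G((T_i y)_j).
        return sum(w * np.sum(G(B)) for B, w in zip(blocks, weights))

    # e.g.: blocks, weights = m_sketch_apply(y.reshape(-1, 1), t=200)
    #       weighted_m_norm(blocks, weights, huber) vs. np.sum(huber(y))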

19 Conclusions
Summary:
[Huber] O(nnz(A) log n) + poly(d log n / ε) time algorithm.
[Nice M-Estimators] O(nnz(A)) + poly(d) time algorithm.
Follow-up work / questions:
1. Results for low-rank approximation [Clarkson, W15].
2. (Meta-question) Apply streaming techniques to linear algebra:
- CountSketch → l_2 regression
- p-stable random variables → l_p regression for p ∈ [1, 2]
- CountSketch + heavy hitters → nice M-estimators
- Pagh's TensorSketch → polynomial kernel regression

