Uniform Sampling for Matrix Approximation
Michael Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, Aaron Sidford
M.I.T.
OUTLINE
- Reducing Row Count
- Row Sampling and Leverage Scores
- Adaptive Uniform Sampling
- Proof of Structural Theorem
LARGE MATRICES
- Fundamental subproblem: reduce the number of rows.
- A is an n-by-d matrix with nnz non-zeros, where nnz > n >> d.
- Key routine in applications: size reduction via sampling.
- Goal: approximate A with a matrix A' that has fewer rows.
- Common requirement: $\|Ax\|_2 \approx \|A'x\|_2$ for all x. Denote this as A ≈ A'.
EXISTENCE OF APPROXIMATIONS
- $\|Ax\|_2^2 = x^T A^T A x$, where $A^T A$ is a d-by-d matrix.
- A' ← QR factorization of $A^T A$.
- A' = SA, where S is d-by-n.
- Runtime cost:
  - Computing $A^T A$: $O(\mathrm{nnz} \times d)$
  - Finding A' from $A^T A$: $O(d^{\omega})$
- When n >> d and nnz > poly(d), the cost is dominated by computing $A^T A$.
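The existence claim can be checked numerically. The sketch below is a minimal illustration, not the authors' procedure: it takes the triangular factor from a QR factorization of A itself (rather than of $A^T A$), which yields a d-by-d matrix R with $R^T R = A^T A$ and hence $\|Rx\|_2 = \|Ax\|_2$ for every x. The function name `exact_row_reduction` and the test sizes are assumptions made for the example.

```python
import numpy as np

def exact_row_reduction(A):
    """Return a d-by-d matrix R with R^T R = A^T A (assumes n >= d)."""
    # mode="r" returns only the upper-triangular factor of the QR factorization.
    return np.linalg.qr(A, mode="r")

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))   # n = 200 rows, d = 5 columns
R = exact_row_reduction(A)          # 5 rows instead of 200

x = rng.standard_normal(5)
norm_full = np.linalg.norm(A @ x)
norm_small = np.linalg.norm(R @ x)  # agrees with norm_full up to rounding
```

This dense factorization costs far more than O(nnz) on a sparse input, which is the gap the rest of the talk addresses.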
INPUT SPARSITY TIME
- Find A' ≈ A in O(nnz) time.
- Oblivious subspace embeddings, A' = SA:
  - Drineas, Magdon-Ismail, Mahoney, Woodruff '12: S = fast JL matrix
  - Clarkson, Woodruff '12: S with one non-zero per column
  - Nelson-Nguyen: near-optimal sparsity
OUR RESULT
- A' ≈ A consisting of O(d log d) rescaled rows of A, found in $O(\mathrm{nnz} + d^{\omega+\theta})$ time, where θ is any constant > 0.
- Based on uniform + importance sampling:
    rowSample(A):
        A' ← random half of rows of A
        A'' ← rowSample(A')
        τ ← approxProbability(A, A'')
        return sample(A, τ)
- [Cohen-P '14]: extends to p-norms
- [Lee-P-Spielman '14]: 'self constructing' solvers for linear systems in graph Laplacians
SUMMARY
- Goal: A' with fewer rows s.t. A ≈ A'
- Existence of A' evident via QR factorization
$L_2$ ROW SAMPLING
- A' = SA, where A' is a subset of rescaled rows of A.
- Matrix S: one non-zero per row, generated adaptively based on A.

                      Oblivious            Adaptive
Works for (w.h.p.)    Any A                Input A
Minimum nnz(S)        n                    O(d) ([BSS '09])
Rows of A'            Sums of rows of A    Rows of A
CONNECTIONS
- Approximate algebraic algorithms
- Preconditioner in scientific computing
- Graph sparsification, expanders:
  - A: edge-vertex incidence matrix
  - x: labels on vertices
  - Row for edge uv: $|a_i x|^2 = (x_u - x_v)^2$
  - x being a 0/1 indicator vector: $\|Ax\|_2^2$ is the size of the cut
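The graph connection can be made concrete with a small example. The sketch below builds an edge-vertex incidence matrix for a hypothetical five-edge graph (the graph and the helper name `incidence_matrix` are assumptions for illustration) and evaluates $\|Ax\|_2^2$ for a 0/1 indicator vector, which counts the edges crossing the cut.

```python
import numpy as np

def incidence_matrix(num_vertices, edges):
    """Edge-vertex incidence matrix: the row for edge (u, v) has +1 at u and -1 at v."""
    B = np.zeros((len(edges), num_vertices))
    for row, (u, v) in enumerate(edges):
        B[row, u] = 1.0
        B[row, v] = -1.0
    return B

# Hypothetical example graph: the 4-cycle 0-1-2-3-0 plus the chord 0-2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
B = incidence_matrix(4, edges)

# Indicator vector of the vertex set {0, 1}.
x = np.array([1.0, 1.0, 0.0, 0.0])

# Each row contributes (x_u - x_v)^2, which is 1 exactly when the edge is cut.
cut_size = float(np.sum((B @ x) ** 2))  # edges (1,2), (3,0), (0,2) cross: 3.0
```

A row sample of B that preserves $\|Bx\|_2$ for all x therefore preserves every cut, which is why row sampling yields graph sparsifiers.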
IMPORTANCE SAMPLING
Go from n to n' rows:
- Keep a row $a_i$ with probability $p_i$.
- If picked, rescale by $p_i^{-1/2}$ to keep the expectation.
- Uniform sampling: $p_i = n'/n$. Issue: a matrix with only one non-zero row.
- Norm sampling: $p_i = n' \|a_i\|_2^2 / \|A\|_F^2$. Issue: a column with only one entry.
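The sampling scheme on this slide can be sketched in a few lines. This is an illustrative version under assumptions: rows are kept independently, probabilities are capped at 1, and the names `sample_rows`, `uniform_p`, and `norm_p` are made up for the example.

```python
import numpy as np

def sample_rows(A, p, rng):
    """Keep row i independently with probability p[i]; rescale kept rows by p[i]**-0.5."""
    keep = rng.random(A.shape[0]) < p
    # Rescaling makes E[A'^T A'] = sum_i p_i * (a_i a_i^T / p_i) = A^T A.
    return A[keep] / np.sqrt(p[keep])[:, None]

rng = np.random.default_rng(0)
n, d, n_target = 1000, 4, 100
A = rng.standard_normal((n, d))

# Uniform sampling: p_i = n'/n.
uniform_p = np.full(n, n_target / n)

# Norm sampling: p_i = n' * ||a_i||^2 / ||A||_F^2, capped at 1.
row_norms_sq = np.sum(A ** 2, axis=1)
norm_p = np.minimum(1.0, n_target * row_norms_sq / row_norms_sq.sum())

A_uniform = sample_rows(A, uniform_p, rng)
A_norm = sample_rows(A, norm_p, rng)
```

Both choices are unbiased for $A^T A$; the issues listed on the slide concern concentration, since a single important row can be missed with constant probability.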
MATRIX-CHERNOFF BOUNDS
- τ: $L_2$ statistical leverage scores, $\tau_i = a_i^T (A^T A)^{+} a_i$, where $a_i$ is row i of A and $M^{+}$ is the pseudo-inverse of M.
- [Foster '49]: $\sum_i \tau_i = \mathrm{rank} \le d$.
- Rudelson, Vershynin '07, Tropp '12: sampling with $p_i \ge \tau_i \cdot O(\log d)$ gives A ≈ A' w.h.p., using O(d log d) rows.
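The definition and Foster's identity are easy to verify on a small dense matrix. The sketch below is a direct (not input-sparsity-time) computation; the oversampling constant `oversample` stands in for the unspecified constant inside O(log d) and is an assumption.

```python
import numpy as np

def leverage_scores(A):
    """tau_i = a_i^T (A^T A)^+ a_i for every row a_i of A."""
    gram_pinv = np.linalg.pinv(A.T @ A)
    # Row-wise quadratic forms, without forming the n-by-n matrix A G A^T.
    return np.einsum("ij,jk,ik->i", A, gram_pinv, A)

rng = np.random.default_rng(0)
n, d = 50, 4
A = rng.standard_normal((n, d))
tau = leverage_scores(A)

# Foster's theorem: the scores sum to rank(A), which is at most d.
total = tau.sum()

# Sampling probabilities p_i >= tau_i * O(log d); the constant is illustrative.
oversample = 4.0
p = np.minimum(1.0, oversample * np.log(d) * tau)
expected_rows = p.sum()  # at most oversample * d * log(d)
```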
HOW TO COMPUTE LEVERAGE SCORES?
- $\tau_i = a_i^T (A^T A)^{+} a_i$
- Given A' ≈ A with O(d log d) rows, can estimate the leverage scores of A in $O(\mathrm{nnz}(A) + d^{\omega+\theta})$ time.
- Chicken-and-egg problem:
  - Finding A' ≈ A needs leverage scores.
  - Efficient leverage score computation needs A' ≈ A.
- One solution: use projections, then refine into a row sample.
SUMMARY
- Goal: A' with fewer rows s.t. A ≈ A'
- Existence of good A' evident via QR factorization
- Leverage scores (w.r.t. a subset) → good row samples
- Total leverage score ≤ d
- Chicken-and-egg problem: need A' ≈ A to find A' ≈ A
ADAPTIVE ROW SAMPLING
- [Avron '11]: boost the effectiveness of randomized algorithms via preconditioning.
- Find an approximation A'.
- Use A' to compute upper bounds on the statistical leverage scores of A.
- Sample A using these bounds to obtain a better approximation A''.
WIDELY USED IN PRACTICE
- Nystrom method on matrices:
  - Pick a random subset of rows/columns.
  - Compute on the subset.
  - Extend the result onto the full matrix.
- Post-processing:
  - Theoretical: copy x over
  - Practical: projection, least-squares fitting
- Uniform sampling does not give spectral approximations!
- Why is this effective?
QUESTIONS THAT WE ADDRESS
- How well can uniform sampling perform?
- What do we gain from post-processing?
- Are there weaker notions than ≈ that can be used in algorithmic pipelines?
- Pipeline: uniform sample → post-process
ASIDE: WHAT IS LEVERAGE SCORE?
- How easily a row can be constructed from other rows: $\tau_i = \min \|x\|_2^2$ s.t. $xA = a_i$.
- This view implies:
  - $\tau_i \le 1$, since the indicator vector of row i is a feasible x
  - estimates $\tau_i'$ computed from a sample are good upper bounds
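The variational view can be compared against the closed-form definition. In the sketch below, `np.linalg.lstsq` is used because it returns the minimum-norm solution of the underdetermined system $A^T x = a_i$; the matrix size and the row index are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 4))
i = 7
a_i = A[i]

# Minimum-norm x with x A = a_i, written as A^T x = a_i (4 equations, 30 unknowns).
# For a consistent underdetermined system, lstsq returns the smallest-norm solution.
x, *_ = np.linalg.lstsq(A.T, a_i, rcond=None)
tau_from_min_norm = float(x @ x)

# Closed form: tau_i = a_i^T (A^T A)^+ a_i.
tau_from_formula = float(a_i @ np.linalg.pinv(A.T @ A) @ a_i)
```

Dropping rows of A removes coordinates that x may use, so the minimum can only grow, which is the sense in which scores computed from a sample are upper bounds.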
UNDEFINED LEVERAGE SCORES
- $\tau_i = a_i^T (A^T A)^{+} a_i$
- A' from a random sample may have smaller rank.
- Fix: add $a_i$ to A' when computing the leverage score of $a_i$.
- $A' \cup a_i$ is a subset of the rows of A → good upper bounds.
- Same cost: $O(\mathrm{nnz} + d^{\omega+\theta} + \text{time to approximate } A'^T A')$
- Need: bound the sum of these bounds.
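A tiny example shows why the row has to be added. The sketch below uses a hypothetical 3-by-2 matrix whose sampled rows span only one direction; the helper names `leverage_in` and `estimate_with_row_added` are assumptions, and the computation is done naively with a dense pseudo-inverse rather than at the cost stated on the slide.

```python
import numpy as np

def leverage_in(M, a):
    """a^T (M^T M)^+ a: the score of vector a with respect to the rows of M."""
    return float(a @ np.linalg.pinv(M.T @ M) @ a)

def estimate_with_row_added(A, sample_idx, i):
    """Leverage score of row i within the submatrix made of the sample plus row i."""
    idx = sorted(set(sample_idx) | {i})
    return leverage_in(A[idx], A[i])

# Rows 0 and 1 are identical; row 2 points in a direction the sample misses.
A = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
sample_idx = [0, 1]

naive = leverage_in(A[sample_idx], A[2])            # 0: the sample cannot see row 2
fixed = estimate_with_row_added(A, sample_idx, 2)   # 1: a valid upper bound
true_score = leverage_in(A, A[2])                   # 1
```

Without the fix the estimate for the missed row is 0, so that row would never be sampled even though its true leverage score is 1.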
WHAT WE SHOW
- Structural theorem: if we pick half the rows as A', the expected total estimate is ≤ 2d.
- Algorithmic consequence: recurse on A' to approximate $(A')^T (A')$:
    rowSample(A):
        A' ← random half of rows of A
        A'' ← rowSample(A')
        τ ← approxProbability(A, A'')
        return sample(A, τ)
- Runtime: $T(\mathrm{nnz}) = T(\mathrm{nnz}/2) + O(\mathrm{nnz} + d^{\omega+\theta})$
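The recursion can be sketched in Python as follows. This is a rough illustration under several assumptions, not the authors' implementation: it uses dense pseudo-inverses (so it does not achieve the stated runtime), the base-case size `base_rows` and the constant `oversample` are arbitrary, and each row is scored as if it were appended to the recursive sample, giving $t/(1+t)$ with $t = a_i^T (A''^T A'')^{+} a_i$ when $a_i$ lies in the row space of A'' and 1 otherwise.

```python
import numpy as np

def approx_probability(A, A_small, oversample):
    """Sampling probabilities from leverage-score upper bounds w.r.t. A_small."""
    d = A.shape[1]
    M = A_small.T @ A_small
    G = np.linalg.pinv(M)
    t = np.maximum(np.einsum("ij,jk,ik->i", A, G, A), 0.0)
    # G @ M projects onto the row space of A_small; a non-zero residual
    # means the row lies outside that space and gets the trivial bound 1.
    residual = np.linalg.norm(A - A @ (G @ M), axis=1)
    in_span = residual <= 1e-8 * (1.0 + np.linalg.norm(A, axis=1))
    tau_upper = np.where(in_span, t / (1.0 + t), 1.0)
    return np.minimum(1.0, oversample * np.log(max(d, 2)) * tau_upper)

def row_sample(A, rng, oversample=8.0, base_rows=64):
    """Recursive uniform-then-leverage row sampling (illustrative constants)."""
    n = A.shape[0]
    if n <= base_rows:
        return A  # small enough: keep every row
    half = rng.choice(n, size=n // 2, replace=False)   # A': random half of the rows
    A_half_approx = row_sample(A[half], rng, oversample, base_rows)  # A''
    p = approx_probability(A, A_half_approx, oversample)
    keep = rng.random(n) < p
    return A[keep] / np.sqrt(p[keep])[:, None]

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 5))
A_sampled = row_sample(A, rng)
```

Each level does work proportional to the current matrix plus a d-by-d solve, which mirrors the recurrence $T(\mathrm{nnz}) = T(\mathrm{nnz}/2) + O(\mathrm{nnz} + d^{\omega+\theta})$ in shape only.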
SUMMARY
- Goal: A' with fewer rows s.t. A ≈ A'
- Existence of A' evident via QR factorization
- Leverage scores (w.r.t. a subset) → good row samples
- Total leverage score ≤ d
- Chicken-and-egg problem: need A' ≈ A to find A' ≈ A
- Uniform sampling + correction widely used
- Strong guarantees for adaptive uniform sampling
RANDOM PROCESS
- Structural theorem: if we pick half the rows as A', the expected total estimate is ≤ 2d.
- Formally:
  - S: random subset of n/2 rows
  - $\tau_i'$ = leverage score of $a_i$ in $A_{S \cup \{i\}}$
  - Claim: $E_S \sum_i [\tau_i'] \le 2d$
- Foster's theorem: the sum over rows in A' is ≤ d.
- Need to bound the rest: $E_S \sum_{i \notin S} [\tau_i'] = n/2 \times E_{S,\, i \notin S}[\tau_i']$
SET WITH ONE EXTRA ELEMENT
- $E_{S,\, i \notin S}[\tau_i']$: expectation over a random subset S and a random $i \notin S$.
- Equivalent to picking $S' = S \cup \{i\}$ at random, with i picked at random from S':
  $E_{|S|=n/2,\, i \notin S}[\tau_i'] = E_{|S'|=n/2+1,\, i \in S'}[\tau_i']$
$E_{|S'|=n/2+1,\, i \in S'}[\tau_i']$
- Foster's theorem: the total leverage score in S' is ≤ d.
- Average leverage score of a row of S' is ≤ d / (n/2 + 1), so $E_{S,\, i \notin S}[\tau_i'] \le d / (n/2 + 1)$.
- Total: n/2 × d / (n/2 + 1) ≤ d.
- The overall bound follows by including the rows from S.
- Akin to:
  - Backwards analysis (e.g. [Seidel '92])
  - Randomized O(m) time MST [Klein-Karger-Tarjan '93]
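For a small matrix the expectation in the structural theorem can be computed exactly by enumerating every subset S of n/2 rows. The sketch below does this for an assumed 8-by-3 Gaussian matrix; the names `expected_total_estimate` and `bound_from_proof` are made up for the example. When every subset has full column rank, the counting argument above gives exactly d + (n/2) · d / (n/2 + 1), which is below 2d.

```python
import itertools
import numpy as np

def leverage_scores(M):
    """Leverage scores of all rows of M."""
    G = np.linalg.pinv(M.T @ M)
    return np.einsum("ij,jk,ik->i", M, G, M)

def expected_total_estimate(A):
    """Exact E_S[sum_i tau_i'] over all subsets S of n/2 rows."""
    n = A.shape[0]
    totals = []
    for S in itertools.combinations(range(n), n // 2):
        # Rows inside S: scored within A_S itself (Foster: sums to rank).
        total = leverage_scores(A[list(S)]).sum()
        # Rows outside S: scored within A_{S ∪ {i}}.
        for i in set(range(n)) - set(S):
            rows = list(S) + [i]
            total += leverage_scores(A[rows])[-1]
        totals.append(total)
    return float(np.mean(totals))

rng = np.random.default_rng(0)
n, d = 8, 3
A = rng.standard_normal((n, d))

expected = expected_total_estimate(A)
bound_from_proof = d + (n / 2) * d / (n / 2 + 1)   # 3 + 4 * 3 / 5 = 5.4
```

Enumeration is exponential in n and only serves as a sanity check of the backwards-analysis argument.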
WHAT WE ALSO SHOW:
- Coherence reducing reweighting: for any α < 1, can reweight d/α rows so that all leverage scores are ≤ α.
- This implies the sampling result:
  - Leverage scores ≤ 1/(2 log d): a uniform 1/2 sample A' gives A' ≈ A.
  - The sample only deviates on the rows that we changed.
  - Recomputing leverage scores w.r.t. A' finds these rows.
SUMMARY
- Goal: A' with fewer rows s.t. A ≈ A'
- Existence of A' evident via QR factorization
- Leverage scores (w.r.t. a subset) → good row samples
- Total leverage score ≤ d
- Chicken-and-egg problem: need A' ≈ A to find A' ≈ A
- Uniform sampling + correction widely used
- Strong guarantees for adaptive uniform sampling
- Backward analysis of adaptive uniform sampling
- Coherence reducing reweighting
FUTURE WORK
- Application in streaming sparsifiers?
- Algorithms for low rank approximations?
- Does norm sampling offer more?
- Backward analysis of other adaptive routines?
- How close to Nystrom methods can we get?
- More limited randomness?
Reference: