Uniform Sampling for Matrix Approximation
Michael Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, Aaron Sidford (M.I.T.)

OUTLINE: Reducing Row Count; Row Sampling and Leverage Scores; Adaptive Uniform Sampling; Proof of Structural Theorem.

LARGE MATRICES
Fundamental subproblem: reduce the number of rows.
n-by-d matrix A with nnz non-zeros, nnz > n >> d.
Key routine in applications: size reduction via sampling.
Goal: approximate A with a matrix A' that has fewer rows.
Common requirement: ‖Ax‖_2 ≈ ‖A'x‖_2 for all x; denote this A ≈ A'.

EXISTENCE OF APPROXIMATIONS
‖Ax‖_2^2 = x^T A^T A x, where A^T A is a d-by-d matrix.
A' ← QR factorization of A^T A; equivalently A' = SA with S a d-by-n matrix.
Runtime cost:
Computing A^T A: O(nnz × d).
Finding A' from A^T A: O(d^ω).
When n >> d and nnz > poly(d), the cost is dominated by computing A^T A.
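A minimal numpy sketch of this existence argument (my own illustration, not part of the slides): any factor R with R^T R = A^T A, for example the Cholesky factor, is a d-row matrix with ‖Ax‖_2 = ‖Rx‖_2 for every x. The sizes and random data below are arbitrary.

```python
import numpy as np

# Sketch only: an exact d-row "approximation" exists because A^T A is d-by-d.
# Any R with R^T R = A^T A (here the Cholesky factor) gives ||Ax||_2 = ||Rx||_2.
rng = np.random.default_rng(0)
n, d = 10000, 20
A = rng.standard_normal((n, d))

G = A.T @ A                      # d-by-d Gram matrix, O(nnz * d) to form
R = np.linalg.cholesky(G).T      # upper triangular with R^T R = G, O(d^3) to find

x = rng.standard_normal(d)
print(np.linalg.norm(A @ x), np.linalg.norm(R @ x))  # equal up to round-off
```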

INPUT SPARSITY TIME
Find A' ≈ A in O(nnz) time.
Oblivious subspace embeddings, A' = SA:
Drineas, Magdon-Ismail, Mahoney, Woodruff `12: S = fast JL matrix.
Clarkson, Woodruff `12: S with one non-zero per column.
Nelson, Nguyen: near-optimal sparsity.
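As an illustration of the second item, here is a hedged sketch of a sparse embedding in the Clarkson-Woodruff style: S has a single random ±1 entry per column, so SA can be formed in one pass over the rows of A. The sketch size k below is a loose illustrative choice, not the value prescribed by the theory.

```python
import numpy as np

# Sketch of a sparse embedding: each row of A is hashed to one of k buckets
# with a random sign, so forming SA takes time proportional to nnz(A).
def sparse_embed(A, k, rng):
    n, d = A.shape
    rows = rng.integers(0, k, size=n)        # which row of S hits each row of A
    signs = rng.choice([-1.0, 1.0], size=n)  # random +/-1 per row of A
    SA = np.zeros((k, d))
    for i in range(n):                       # one pass over the rows of A
        SA[rows[i]] += signs[i] * A[i]
    return SA

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 10))
SA = sparse_embed(A, k=400, rng=rng)
x = rng.standard_normal(10)
print(np.linalg.norm(A @ x), np.linalg.norm(SA @ x))  # roughly equal
```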

OUR RESULT
A' ≈ A consisting of O(d log d) rescaled rows of A, in O(nnz + d^(ω+θ)) time (θ: any constant > 0).
Based on uniform + importance sampling:
rowSample(A)
  A' ← random half of rows of A
  A'' ← rowSample(A')
  τ ← approxProbability(A, A'')
  return sample(A, τ)
[Cohen-P `14] extends to p-norms.
[Lee-P-Spielman `14]: 'self-constructing' solvers for linear systems in graph Laplacians.

SUMMARY
Goal: A' with fewer rows s.t. A ≈ A'.
Existence of such an A' is evident via QR factorization.

OUTLINE: Reducing Row Count; Row Sampling and Leverage Scores; Adaptive Uniform Sampling; Proof of Structural Theorem.

L2 ROW SAMPLING
A' = SA, where A' is a subset of rescaled rows of A.
Matrix S: one non-zero per row, generated adaptively based on A.
Oblivious vs. adaptive:
Works for (w.h.p.): oblivious, any A; adaptive, the input A.
Minimum nnz(S): oblivious, n; adaptive, O(d) ([BSS `09]).
Rows of A': oblivious, sums of rows of A; adaptive, rows of A.

CONNECTIONS
Approximate algebraic algorithms, preconditioners in scientific computing, graph sparsification, expanders.
A: edge-vertex incidence matrix; x: labels on vertices.
Row for edge uv: |a_i x|^2 = (x_u - x_v)^2.
With x a 0/1 indicator vector, this sums to the size of the cut.
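A tiny worked example of the cut connection (my own, not from the slides): the edge-vertex incidence matrix of a 4-cycle, with ‖Ax‖_2^2 counting the edges crossing the cut defined by a 0/1 indicator x.

```python
import numpy as np

# Edge-vertex incidence matrix of the 4-cycle 0-1-2-3-0; each row encodes
# x_u - x_v for one edge, so ||Ax||_2^2 is the cut size for an indicator x.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n_vertices = 4
A = np.zeros((len(edges), n_vertices))
for row, (u, v) in enumerate(edges):
    A[row, u], A[row, v] = 1.0, -1.0

x = np.array([1.0, 1.0, 0.0, 0.0])        # indicator of the vertex set {0, 1}
print(np.linalg.norm(A @ x) ** 2)          # 2.0 = edges crossing the cut
```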

IMPORTANCE SAMPLING
Go from n to n' rows: keep a row a_i with probability p_i; if picked, rescale it by p_i^(-1/2) to keep the expectation.
Uniform sampling: p_i = n'/n. Issue: fails if there is only one non-zero row.
Norm sampling: p_i = n' ‖a_i‖_2^2 / ‖A‖_F^2. Issue: fails if a column has only one entry.
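A hedged numpy sketch of this primitive: keep row i with probability p_i and rescale by p_i^(-1/2), shown with both uniform and norm-based probabilities on synthetic data (the sizes and target n' are arbitrary choices).

```python
import numpy as np

# Sketch of the row-sampling primitive: keep row i with probability p_i and
# rescale kept rows by p_i^(-1/2), so that E[A'^T A'] = A^T A.
def sample_rows(A, p, rng):
    keep = rng.random(A.shape[0]) < p
    return A[keep] / np.sqrt(p[keep])[:, None]

rng = np.random.default_rng(0)
n, d, n_target = 20000, 10, 2000
A = rng.standard_normal((n, d))

p_uniform = np.full(n, n_target / n)                              # uniform sampling
row_norms = np.sum(A * A, axis=1)
p_norm = np.minimum(1.0, n_target * row_norms / row_norms.sum())  # norm sampling

G = A.T @ A
for p in (p_uniform, p_norm):
    A_prime = sample_rows(A, p, rng)
    print(A_prime.shape[0], np.linalg.norm(G - A_prime.T @ A_prime) / np.linalg.norm(G))
```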

MATRIX-CHERNOFF BOUNDS
τ: L2 statistical leverage scores, τ_i = a_i^T (A^T A)^+ a_i, where a_i is row i of A and M^+ is the pseudo-inverse of M.
[Foster `49]: Σ_i τ_i = rank ≤ d, so O(d log d) rows suffice.
Rudelson, Vershynin `07, Tropp `12: sampling with p_i ≥ τ_i × O(log d) gives A ≈ A' w.h.p.
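A minimal sketch of leverage score sampling in numpy, using exact scores on dense data and a heuristic oversampling constant in place of the theorem's:

```python
import numpy as np

# Sketch: exact leverage scores tau_i = a_i^T (A^T A)^+ a_i, then rows kept
# with p_i proportional to tau_i, oversampled by a (heuristic) log factor.
rng = np.random.default_rng(0)
n, d = 20000, 10
A = rng.standard_normal((n, d))

G_pinv = np.linalg.pinv(A.T @ A)
tau = np.einsum('ij,jk,ik->i', A, G_pinv, A)   # all leverage scores at once
print(tau.sum())                               # equals rank(A), at most d

p = np.minimum(1.0, 4 * np.log(d) * tau)       # heuristic oversampling constant
keep = rng.random(n) < p
A_prime = A[keep] / np.sqrt(p[keep])[:, None]
G = A.T @ A
print(A_prime.shape[0], np.linalg.norm(G - A_prime.T @ A_prime) / np.linalg.norm(G))
```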

HOW TO COMPUTE LEVERAGE SCORES?
τ_i = a_i^T (A^T A)^+ a_i.
Given A' ≈ A with O(d log d) rows, we can estimate the leverage scores of A in O(nnz(A) + d^(ω+θ)) time.
Finding A' ≈ A needs leverage scores; computing leverage scores efficiently needs A' ≈ A: a chicken-and-egg problem.
One solution: use projections, then refine into a row sample.
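A hedged sketch of the estimation step: replace (A^T A)^+ by (A'^T A')^+ computed from the smaller matrix. Here A' is stood in for by a crude rescaled uniform sample, purely to keep the example self-contained.

```python
import numpy as np

# Sketch: estimate leverage scores of A using a smaller A' by replacing
# (A^T A)^+ with (A'^T A')^+.  When A' approximates A spectrally, the
# estimates are within the approximation factor of the true scores.
def estimated_leverage_scores(A, A_small):
    G_pinv = np.linalg.pinv(A_small.T @ A_small)
    return np.einsum('ij,jk,ik->i', A, G_pinv, A)

rng = np.random.default_rng(1)
A = rng.standard_normal((10000, 8))
keep = rng.random(10000) < 0.2                 # crude stand-in for A' here
A_small = A[keep] / np.sqrt(0.2)
tau_est = estimated_leverage_scores(A, A_small)
print(tau_est.sum())                           # roughly d = 8 for this A
```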

SUMMARY
Goal: A' with fewer rows s.t. A ≈ A'.
Existence of a good A' is evident via QR factorization.
Leverage scores (w.r.t. a subset) → good row samples; total leverage score ≤ d.
Chicken-and-egg problem: need A' ≈ A to find A' ≈ A.

OUTLINE: Reducing Row Count; Row Sampling and Leverage Scores; Adaptive Uniform Sampling; Proof of Structural Theorem.

ADAPTIVE ROW SAMPLING
[Avron `11]: boost the effectiveness of randomized algorithms via preconditioning.
Find an approximation A'.
Use A' to compute upper bounds on the statistical leverage scores of A.
Sample A using these bounds to obtain a better approximation A''.

WIDELY USED IN PRACTICE
Nystrom method on matrices: pick a random subset of rows/columns, compute on the subset, extend the result onto the full matrix.
Post-processing: theoretical analyses copy x over; in practice, projection or least-squares fitting.
Why is this effective? Uniform sampling does not give spectral approximations!
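For concreteness, a hedged sketch of the generic Nystrom idea on a PSD matrix K (the textbook column-sampling form, not the slides' specific variant): sample a uniform column subset S, compute on the small block, and extend via K ≈ C W^+ C^T.

```python
import numpy as np

# Sketch of Nystrom column sampling for a PSD matrix K: C = K[:, S],
# W = K[S, S], and K is extended as C W^+ C^T from the sampled block.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
K = X @ X.T                          # a PSD "kernel" matrix for illustration

S = rng.choice(500, size=50, replace=False)
C = K[:, S]
W = K[np.ix_(S, S)]
K_approx = C @ np.linalg.pinv(W) @ C.T
# Near-exact here because K has rank 5; in general this is an approximation.
print(np.linalg.norm(K - K_approx) / np.linalg.norm(K))
```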

QUESTIONS THAT WE ADDRESS
How well can uniform sampling perform?
What do we gain from post-processing?
Are there weaker notions than ≈ that can be used in algorithmic pipelines?
Pipeline: uniform sample → post-process.

ASIDE: WHAT IS A LEVERAGE SCORE?
How easily a row can be constructed from the other rows: τ_i = min ‖x‖_2^2 s.t. xA = a_i.
This view implies τ_i ≤ 1, and that the τ_i' computed from a sample are good upper bounds.
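A small numerical check of this view (my own illustration): the minimum-norm row vector x with xA = a_i is a_i^T (A^T A)^+ A^T, and its squared norm matches the usual leverage score formula.

```python
import numpy as np

# Check that the min-norm combination of rows reproducing a_i has squared
# norm equal to the leverage score a_i^T (A^T A)^+ a_i.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 6))
i = 17

G_pinv = np.linalg.pinv(A.T @ A)
tau_i = A[i] @ G_pinv @ A[i]           # usual leverage score formula
x = A @ G_pinv @ A[i]                   # min-norm combination of rows giving a_i
print(np.allclose(x @ A, A[i]))         # xA reproduces row i
print(tau_i, np.dot(x, x))              # the two quantities agree
```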

UNDEFINED LEVERAGE SCORES
A' from a random sample may have smaller rank, so the leverage score of a_i w.r.t. A' alone can be undefined.
Fix: add a_i to A' when computing the leverage score of a_i, i.e. τ_i' = a_i^T ((A')^T A' + a_i a_i^T)^+ a_i.
A' ∪ {a_i} is a subset of the rows of A → good upper bounds.
Same cost: O(nnz + d^(ω+θ) + time to approximate (A')^T A').
Need: bound the sum of these bounds.
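A tiny illustration (my own) of why the fix is well defined: with a rank-deficient sample and a row outside its span, scoring the row against the sample plus the row itself returns a finite value, here the cap of 1.

```python
import numpy as np

# If a_i lies outside the row span of the sample A', scoring it against
# A' together with a_i itself is always finite and at most 1.
def score_with_self(a_i, A_sample):
    G = A_sample.T @ A_sample + np.outer(a_i, a_i)   # Gram of A' union {a_i}
    return a_i @ np.linalg.pinv(G) @ a_i

A_sample = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])               # rank-deficient sample in R^3
a_i = np.array([0.0, 0.0, 3.0])                      # outside the sample's row span
print(score_with_self(a_i, A_sample))                # 1.0: finite and capped
```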

WHAT WE SHOW
Structural theorem: if we pick half the rows as A', the expected total estimate is ≤ 2d.
Algorithmic consequence: recurse on A' to approximate (A')^T A'.
rowSample(A)
  A' ← random half of rows of A
  A'' ← rowSample(A')
  τ ← approxProbability(A, A'')
  return sample(A, τ)
Runtime: T(nnz) = T(nnz/2) + O(nnz + d^(ω+θ)).
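A hedged end-to-end sketch of this recursion in numpy. It is not the paper's implementation: approxProbability is replaced by exact scoring against the recursively sampled half, the base case and oversampling constants are heuristic, and everything is dense.

```python
import numpy as np

# Sketch of the recursion: sample half the rows, recurse to shrink that half,
# score all rows of A against the shrunk half, then sample by those scores.
def row_sample(A, rng, base=2000):
    n, d = A.shape
    if n <= base:
        return A
    half = rng.random(n) < 0.5
    A_half = row_sample(A[half], rng, base)           # recurse on half the rows
    G_pinv = np.linalg.pinv(A_half.T @ A_half)
    tau = np.einsum('ij,jk,ik->i', A, G_pinv, A)      # scores of rows of A vs. the half
    p = np.minimum(1.0, 8.0 * np.log(d) * np.minimum(tau, 1.0))
    keep = rng.random(n) < p
    return A[keep] / np.sqrt(p[keep])[:, None]

rng = np.random.default_rng(0)
A = rng.standard_normal((40000, 10))
A_prime = row_sample(A, rng)
G, G_prime = A.T @ A, A_prime.T @ A_prime
print(A_prime.shape[0], np.linalg.norm(G - G_prime) / np.linalg.norm(G))
```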

SUMMARY
Goal: A' with fewer rows s.t. A ≈ A'.
Existence of A' is evident via QR factorization.
Leverage scores (w.r.t. a subset) → good row samples; total leverage score ≤ d.
Chicken-and-egg problem: need A' ≈ A to find A' ≈ A.
Uniform sampling + correction is widely used.
Strong guarantees for adaptive uniform sampling.

OUTLINE: Reducing Row Count; Row Sampling and Leverage Scores; Adaptive Uniform Sampling; Proof of Structural Theorem.

RANDOM PROCESS
Structural theorem: if we pick half the rows as A', the expected total estimate is ≤ 2d.
Formally: S is a random subset of n/2 rows, and τ_i' is the leverage score of a_i in A_{S ∪ {i}}.
Claim: E_S Σ_i [τ_i'] ≤ 2d.
Foster's theorem: the sum over rows in A' (i ∈ S) is ≤ d.
Need to bound the rest: E_S Σ_{i ∉ S} [τ_i'] = n/2 × E_{S, i ∉ S} [τ_i'].
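A small Monte Carlo check of the claim (illustrative only; the sizes, trial count, and Gaussian test matrix are arbitrary choices):

```python
import numpy as np

# Sample S of n/2 rows, score every row a_i against A restricted to
# S union {i}, and compare the average total with the 2d bound.
def total_estimate(A, rng):
    n, d = A.shape
    S = rng.choice(n, size=n // 2, replace=False)
    in_S = np.zeros(n, dtype=bool)
    in_S[S] = True
    G_S = A[S].T @ A[S]
    total = 0.0
    for i in range(n):
        G = G_S if in_S[i] else G_S + np.outer(A[i], A[i])
        total += A[i] @ np.linalg.pinv(G) @ A[i]
    return total

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
trials = [total_estimate(A, rng) for _ in range(20)]
print(np.mean(trials), "vs bound", 2 * A.shape[1])   # mean should be <= 2d
```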

SET WITH ONE EXTRA ELEMENT
E_{S, i ∉ S} [τ_i']: expectation over a random subset S and a random i ∉ S.
E_{|S|=n/2, i ∉ S} [τ_i'] = E_{|S'|=n/2+1, i ∈ S'} [τ_i'].
Equivalent to picking S' = S ∪ {i} at random, with i picked at random from S'.

E_{|S'|=n/2+1, i ∈ S'} [τ_i']
Foster's theorem: the total leverage score in S' is ≤ d, so the average leverage score of a row of S' is ≤ d / (n/2 + 1).
Hence E_{S, i ∉ S} [τ_i'] ≤ d / (n/2 + 1), and the total is n/2 × d / (n/2 + 1) ≤ d.
The overall bound follows by including the rows from S.
Akin to backwards analysis (e.g. [Seidel `92]) and the randomized O(m)-time MST algorithm [Karger-Klein-Tarjan `95].
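The whole calculation in one display, restating the lines above (τ_i' is the leverage score of a_i in A_{S ∪ {i}}); adding the ≤ d contribution from the rows in S then gives the ≤ 2d bound.

```latex
\mathbb{E}_{S,\ i \notin S}\left[\tau_i'\right]
  = \mathbb{E}_{|S'| = n/2 + 1,\ i \in S'}\left[\tau_i'\right]
  \le \frac{d}{n/2 + 1},
\qquad
\mathbb{E}_{S}\sum_{i \notin S} \tau_i'
  = \frac{n}{2}\,\mathbb{E}_{S,\ i \notin S}\left[\tau_i'\right]
  \le \frac{n/2}{n/2 + 1}\, d \le d.
```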

WHAT WE ALSO SHOW
Coherence-reducing reweighting: for any α < 1, we can reweight d/α rows so that all leverage scores are ≤ α.
This implies the sampling result: when all leverage scores are ≤ 1/(2 log d), a uniform half-sample A' gives A' ≈ A.
The sample only deviates on the rows that we changed, and recomputing leverage scores w.r.t. A' finds these rows.

SUMMARY
Goal: A' with fewer rows s.t. A ≈ A'.
Existence of A' is evident via QR factorization.
Leverage scores (w.r.t. a subset) → good row samples; total leverage score ≤ d.
Chicken-and-egg problem: need A' ≈ A to find A' ≈ A.
Uniform sampling + correction is widely used.
Strong guarantees for adaptive uniform sampling.
Backward analysis of adaptive uniform sampling.
Coherence-reducing reweighting.

FUTURE WORK
Applications in streaming sparsifiers?
Algorithms for low-rank approximation?
Does norm sampling offer more?
Backward analysis of other adaptive routines?
How close to Nystrom methods can we get?
More limited randomness?
Reference: