Statistical Leverage and Improved Matrix Algorithms. Michael W. Mahoney, Yahoo Research.

Modeling data as matrices. Matrices often arise with data: n objects (“documents,” genomes, images, web pages), each with m features, may be represented by an m-by-n matrix A. [Slide graphic: NLA, TCS, SDA.]

Least Squares (LS) Approximation. We are interested in over-constrained L2 regression problems: given an n-by-d matrix A and an n-vector b with n >> d, typically there is no x such that Ax = b, so we want to find the “best” x such that Ax ≈ b, i.e., x_opt = argmin_x ||Ax - b||_2. Ubiquitous in applications and central to theory: Statistical interpretation: x_opt is the best linear unbiased estimator. Geometric interpretation: orthogonally project b onto span(A).

Exact solution to LS Approximation. Cholesky decomposition: if A is full rank and well-conditioned, decompose A^T A = R^T R, where R is upper triangular, and solve the normal equations R^T R x = A^T b. QR decomposition: slower, but numerically stable, especially if A is rank-deficient; write A = QR and solve Rx = Q^T b. Singular value decomposition: most expensive, but best if A is very ill-conditioned; write A = U Σ V^T, in which case x_OPT = A^+ b = V Σ^{-1} U^T b, where A^+ = V Σ^{-1} U^T is the pseudoinverse of A and U U^T b is the projection of b onto the subspace spanned by the columns of A. Complexity is O(nd^2) for all of these, but the constant factors differ.
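
To make the three decompositions concrete, here is a minimal NumPy/SciPy sketch (mine, not from the talk) that solves the same over-constrained problem by the normal equations with a Cholesky factorization, by a thin QR factorization, and by the SVD; all three agree when A has full column rank and is well-conditioned.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, qr, svd

rng = np.random.default_rng(0)
n, d = 1000, 10                      # over-constrained: n >> d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# 1) Cholesky on the normal equations: A^T A x = A^T b
c, low = cho_factor(A.T @ A)
x_chol = cho_solve((c, low), A.T @ b)

# 2) QR: A = QR, then solve R x = Q^T b
Q, R = qr(A, mode='economic')
x_qr = np.linalg.solve(R, Q.T @ b)

# 3) SVD: A = U diag(s) V^T, so x = V diag(1/s) U^T b (the pseudoinverse applied to b)
U, s, Vt = svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)

assert np.allclose(x_chol, x_qr) and np.allclose(x_qr, x_svd)
```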

LS and Statistical Modeling. Assumptions underlying its use: the relationship between “outcomes” and “predictors” is (approximately) linear; the error term ε has mean zero; the error term ε has constant variance; the errors are uncorrelated; the errors are normally distributed (or we have an adequate sample size to rely on large-sample theory). Check to ensure these assumptions have not been (too) violated!

Statistical Issues and Regression Diagnostics. Statistical model: b = Ax + ε, where b is the response, the A_(i) are the carriers, and ε is the error process. Then b' = A x_opt = A (A^T A)^{-1} A^T b, and H = A (A^T A)^{-1} A^T is the “hat” matrix, i.e., the projection onto span(A). Note: H = U U^T, where U is any orthogonal basis matrix for span(A). Statistical interpretation: H_ij measures the leverage or influence exerted on b'_i by b_j, and H_ii is the leverage/influence score of the i-th constraint. Note: H_ii = ||U_(i)||_2^2, the squared row “lengths” of the spanning orthogonal matrix, and Trace(H) = d. Diagnostic rule of thumb: investigate if H_ii > 2d/n.
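
A small sketch (assuming NumPy and a full-column-rank A; the function name is mine) of how the leverage scores H_ii can be computed as squared row norms of an orthonormal basis for span(A), with the 2d/n rule of thumb applied at the end.

```python
import numpy as np

def leverage_scores(A):
    """Diagonal of the hat matrix H = A (A^T A)^{-1} A^T = U U^T,
    computed as squared row norms of an orthonormal basis U for span(A)."""
    U, _ = np.linalg.qr(A)            # thin QR: columns of U are an orthonormal basis for span(A)
    return np.sum(U**2, axis=1)       # H_ii = ||U_(i)||_2^2

rng = np.random.default_rng(1)
n, d = 200, 5
A = rng.standard_normal((n, d))
h = leverage_scores(A)

print("Trace(H) =", h.sum())          # equals d for full-rank A
flagged = np.where(h > 2 * d / n)[0]  # diagnostic rule of thumb
print("high-leverage constraints:", flagged)
```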

Overview: (1) statistical leverage and the hat matrix; (2) faster algorithms for least-squares approximation; (3) a better algorithm for the Column Subset Selection Problem. Even better, both algorithms perform very well empirically!

An (expensive) LS sampling algorithm (Drineas, Mahoney, and Muthukrishnan, SODA 2006). Algorithm: (1) randomly sample r constraints according to probabilities p_i; (2) solve the induced least-squares problem. Theorem: if the p_i satisfy p_i ≥ β ||U_(i)||_2^2 / d for some β ∈ (0,1], then, with probability ≥ 1-δ, the solution of the sampled problem is a relative-error approximation of the exact LS solution. Note: these p_i are statistical leverage scores! Here U_(i) is the i-th row of U, any orthonormal basis for span(A).
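
The following sketch (my code; the sample size and rescaling are illustrative, not the constants of the theorem) implements the two steps: sample r constraints with probabilities proportional to the leverage scores, rescale them, and solve the induced problem.

```python
import numpy as np

def sample_and_solve(A, b, r, rng):
    """Leverage-score row sampling for least squares:
    p_i proportional to ||U_(i)||_2^2, then solve the induced (rescaled) problem."""
    n, d = A.shape
    U, _ = np.linalg.qr(A)
    p = np.sum(U**2, axis=1) / d                 # leverage-score probabilities, sum to 1
    idx = rng.choice(n, size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])                # importance-sampling rescaling of the rows
    return np.linalg.lstsq(w[:, None] * A[idx], w * b[idx], rcond=None)[0]

rng = np.random.default_rng(2)
A = rng.standard_normal((5000, 20))
b = rng.standard_normal(5000)
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
x_approx = sample_and_solve(A, b, r=500, rng=rng)
print(np.linalg.norm(A @ x_approx - b) / np.linalg.norm(A @ x_exact - b))  # close to 1
```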

A structural lemma (Drineas, Mahoney, Muthukrishnan, and Sarlos 2007). Approximate the problem min_x ||Ax - b||_2 by the sketched problem min_x ||X(Ax - b)||_2, where X is any matrix and U is any orthonormal basis for span(A). Lemma: assume X satisfies two structural conditions, one on the singular values of XU and one on how X acts on the optimal residual b - A x_opt; then the solution of the sketched problem is a “relative-error” approximation, i.e., ||A x̃_opt - b||_2 ≤ (1+ε) ||A x_opt - b||_2.

A “fast” LS sampling algorithm (Drineas, Mahoney, Muthukrishnan, and Sarlos 2007). Algorithm: (1) pre-process A and b with a “randomized Hadamard transform”; (2) uniformly sample constraints; (3) solve the induced problem. Main theorem: a relative-error (1±ε)-approximation in o(nd^2) time, i.e., asymptotically faster than the exact O(nd^2) algorithms!!

Randomized Hadamard preprocessing. Here H_n is the n-by-n deterministic Hadamard matrix and D_n is an n-by-n diagonal matrix of random {+1, -1} signs. Fact 1: multiplication by H_n D_n doesn't change the solution (H_n D_n / √n is orthogonal, so Euclidean norms are preserved). Fact 2: multiplication by H_n D_n is fast: only O(n log r) time, where r is the number of elements of the output vector we need to “touch”. Fact 3: multiplication by H_n D_n approximately uniformizes all leverage scores. These facts are implicit or explicit in Ailon and Chazelle (2006) and Ailon and Liberty (2008).
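
A toy sketch of the three-step algorithm above (my code; it uses SciPy's dense hadamard matrix, so n must be a power of two and the transform is not applied in the fast O(n log r) manner): after preprocessing by H_n D_n, the leverage of a very coherent problem is spread out, and uniform row sampling already gives a good LS solution.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
n, d, r = 1024, 10, 200                       # n must be a power of two for hadamard()

# A "coherent" problem: the first d rows carry almost all of the leverage.
A = np.vstack([np.eye(d), 1e-3 * rng.standard_normal((n - d, d))])
b = rng.standard_normal(n)

# Step 1: preprocess A and b with the randomized Hadamard transform H_n D_n.
H = hadamard(n) / np.sqrt(n)                  # orthogonal, so the LS solution is unchanged
D = rng.choice([-1.0, 1.0], size=n)           # random sign flips (the diagonal of D_n)
HDA, HDb = H @ (D[:, None] * A), H @ (D * b)

# Step 2: uniformly sample r constraints (no leverage scores needed after preprocessing).
idx = rng.choice(n, size=r, replace=False)

# Step 3: solve the induced problem and compare with the exact solution.
x_fast = np.linalg.lstsq(HDA[idx], HDb[idx], rcond=None)[0]
x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(A @ x_fast - b) / np.linalg.norm(A @ x_exact - b))  # close to 1
```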

Overview: (1) statistical leverage and the hat matrix; (2) faster algorithms for least-squares approximation; (3) a better algorithm for the Column Subset Selection Problem. Even better, both algorithms perform very well empirically!

Column Subset Selection Problem (CSSP). Given an m-by-n matrix A and a rank parameter k, choose exactly k columns of A such that the m-by-k matrix C minimizes the error ||A - P_C A|| (in the spectral or Frobenius norm) over all O(n^k) choices for C. Notes: P_C = C C^+ is the projector matrix onto span(C). The “best” rank-k approximation A_k from the SVD gives a lower bound on this error. Complexity of the problem? O(n^k mn) time trivially works; the problem is NP-hard if k grows as a function of n (Civril & Magdon-Ismail ’07).
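
To fix ideas, a small NumPy sketch (names are mine) that evaluates the CSSP objective ||A - P_C A||_F for a given set of column indices and compares it against the rank-k SVD lower bound.

```python
import numpy as np

def cssp_error(A, cols):
    """Frobenius-norm CSSP error ||A - P_C A||_F, where P_C = C C^+ projects onto span(C)."""
    C = A[:, cols]
    P_C_A = C @ np.linalg.pinv(C) @ A
    return np.linalg.norm(A - P_C_A, 'fro')

def svd_lower_bound(A, k):
    """||A - A_k||_F: the best possible rank-k error, hence a lower bound for any choice of C."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sqrt(np.sum(s[k:]**2))

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 30)) @ rng.standard_normal((30, 100))   # a low-rank-ish test matrix
k = 5
print(cssp_error(A, cols=list(range(k))), ">=", svd_lower_bound(A, k))
```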

Prior work in NLA. Numerical linear algebra algorithms for the CSSP are deterministic, typically greedy approaches, with a deep connection to the rank-revealing QR factorization. Strongest results so far (spectral norm): a bound of the form ||A - P_C A||_2 ≤ p(k,n) ||A - A_k||_2 in O(mn^2) time, for some function p(k,n). Strongest results so far (Frobenius norm): a comparable relative-factor bound, but only in O(n^k) time.

Working on p(k,n): 1965 – today

Theoretical computer science contributions. Theoretical computer science algorithms for the CSSP: (1) randomized approaches, with some failure probability; (2) more than k columns are picked, e.g., O(poly(k)) columns chosen; (3) very strong bounds for the Frobenius norm in low polynomial time; (4) not many spectral norm bounds.

Prior work in TCS. Drineas, Mahoney, and Muthukrishnan 2005, “subspace sampling”: O(mn^2) time, O(k^2/ε^2) columns -> (1±ε)-approximation; O(mn^2) time, O(k log k/ε^2) columns -> (1±ε)-approximation. Deshpande and Vempala, “volume” and “iterative” sampling: O(mnk^2) time, O(k^2 log k/ε^2) columns -> (1±ε)-approximation; they also prove the existence of k columns of A forming a matrix C whose Frobenius-norm error improves on the prior best existence result.

The strongest Frobenius norm bound (Drineas, Mahoney, and Muthukrishnan 2006). Theorem: given an m-by-n matrix A, there exists an O(mn^2)-time algorithm that picks at most O(k log k / ε^2) columns of A such that, with constant probability, ||A - P_C A||_F ≤ (1+ε) ||A - A_k||_F. Algorithm: use the subspace sampling probabilities / leverage score probabilities to sample O(k log k / ε^2) columns.

Subspace sampling probabilities. In O(mn^2) time, compute V_k, the orthogonal matrix containing the top k right singular vectors of A, and Σ_k, the diagonal matrix containing the top k singular values of A, and set p_i = ||(V_k^T)^(i)||_2^2 / k, the (normalized) squared norm of the i-th column of V_k^T. NOTE: the rows of V_k^T are orthonormal, but its columns (V_k^T)^(i) are not. These p_i are statistical leverage scores! The rows of V_k^T form an orthonormal basis for the row space of A_k.
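
A minimal sketch (NumPy; variable names are mine) of these subspace sampling probabilities: take the top k right singular vectors and set p_i proportional to the squared norm of the i-th column of V_k^T.

```python
import numpy as np

def subspace_sampling_probs(A, k):
    """p_i = ||(V_k^T)^(i)||_2^2 / k: leverage scores of the columns of A
    with respect to the best rank-k subspace (they sum to 1)."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk_t = Vt[:k, :]                        # k-by-n: top k right singular vectors, as rows
    return np.sum(Vk_t**2, axis=0) / k      # squared column norms of V_k^T, divided by k

rng = np.random.default_rng(5)
A = rng.standard_normal((60, 200))
k, c = 5, 30
p = subspace_sampling_probs(A, k)
cols = rng.choice(A.shape[1], size=c, replace=True, p=p)   # sample c columns with these probabilities
```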

Other work bridging NLA/TCS. Woolfe, Liberty, Rokhlin, and Tygert 2007 (also Martinsson, Rokhlin, and Tygert 2006): O(mn log k) time, k columns; the same spectral norm bounds as prior work; an application of the Fast Johnson-Lindenstrauss transform of Ailon-Chazelle; nice empirical evaluation. Question: how to improve bounds for the CSSP? It is not obvious that the bounds improve if we allow the NLA methods to choose more columns, and not obvious how to get around the TCS need to over-sample to O(k log k) columns to preserve rank.

A hybrid two-stage algorithm (Boutsidis, Mahoney, and Drineas 2007). Algorithm: given an m-by-n matrix A and rank parameter k: (Randomized phase) randomly select c = O(k log k) columns according to “leverage score probabilities”. (Deterministic phase) run a deterministic algorithm on the selected columns to pick exactly k columns of A. (Not so simple: actually, run QR on the down-sampled k-by-O(k log k) version of V_k^T.) Theorem: let C be the m-by-k matrix of the selected columns. The algorithm runs in O(mn^2) time and satisfies, with constant probability, the spectral and Frobenius norm bounds compared on the next two slides.
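
The sketch below follows the two-stage structure described above, but it is only an illustration: the constant in c = O(k log k), the omission of sampled-column rescaling, and the use of column-pivoted QR as the deterministic routine are my simplifications, not the exact algorithm of the paper.

```python
import numpy as np
from scipy.linalg import qr

def two_stage_column_select(A, k, rng):
    """Hybrid two-stage column selection (illustrative sketch):
    randomized phase samples c = O(k log k) columns by leverage scores,
    deterministic phase runs column-pivoted QR on the sampled part of V_k^T."""
    m, n = A.shape
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk_t = Vt[:k, :]                                   # k-by-n
    p = np.sum(Vk_t**2, axis=0) / k                    # leverage-score probabilities

    c = int(np.ceil(4 * k * np.log(max(k, 2))))        # c = O(k log k); the constant 4 is illustrative
    sampled = rng.choice(n, size=c, replace=True, p=p)

    # Deterministic phase: column-pivoted QR on the k-by-c down-sampled V_k^T.
    _, _, piv = qr(Vk_t[:, sampled], pivoting=True)
    return np.unique(sampled[piv[:k]])                 # k selected columns of A (duplicates removed)

rng = np.random.default_rng(6)
A = rng.standard_normal((80, 40)) @ rng.standard_normal((40, 300))
cols = two_stage_column_select(A, k=8, rng=rng)
print("selected columns:", cols)
```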

Comparison: spectral norm. Our algorithm runs in O(mn^2) time and its spectral norm guarantee holds with constant probability. (1) Our running time is comparable with NLA algorithms for this problem. (2) Our spectral norm bound grows as a function of (n-k)^{1/4} instead of (n-k)^{1/2}! (3) Do notice that with respect to k our bound is a factor of k^{1/4} log^{1/2} k worse than previous work. (4) To the best of our knowledge, our result is the first asymptotic improvement of the work of Gu & Eisenstat.

Comparison: Frobenius norm. Our algorithm runs in O(mn^2) time and its Frobenius norm guarantee holds with constant probability. (1) We provide an efficient algorithmic result. (2) We guarantee a Frobenius norm bound that is at most a factor of (k log k)^{1/2} worse than the best known existential result.

Overview: (1) statistical leverage and the hat matrix; (2) faster algorithms for least-squares approximation; (3) a better algorithm for the Column Subset Selection Problem. Even better, both algorithms perform very well empirically!

TechTC Term-document data Representative examples that cluster well in the low-dimensional space.

TechTC Term-document data Representative examples that cluster well in the low-dimensional space.

Conclusion: (1) statistical leverage and the hat matrix; (2) faster algorithms for least-squares approximation; (3) a better algorithm for the Column Subset Selection Problem. Even better, both algorithms perform very well empirically!

Workshop on “Algorithms for Modern Massive Data Sets”, Stanford University and Yahoo! Research, June 25-28, 2008. Objectives: address algorithmic, mathematical, and statistical challenges in modern statistical data analysis; explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured data; bring together computer scientists, mathematicians, statisticians, and data analysis practitioners to promote cross-fertilization of ideas. Organizers: M. W. Mahoney, L-H. Lim, P. Drineas, and G. Carlsson. Sponsors: NSF, Yahoo! Research, PIMS, DARPA.