Randomized Algorithms for Linear Algebraic Computations & Applications
Petros Drineas
Rensselaer Polytechnic Institute, Computer Science Department
To access my web page: drineas

Matrices and network applications
Matrices represent networks:
- Graphs are often used in network applications (e.g., network traffic monitoring, web structure analysis, social network mining, protein interaction networks, etc.).
- Graphs are represented by matrices (e.g., adjacency matrices, edge-incidence matrices, etc.).
- More recently: time-evolving graphs, represented by tensors (multi-mode arrays) in the recent literature.
Goal: discover patterns and anomalies in spite of the high dimensionality of the data.
Linear algebra and numerical analysis provide the fundamental mathematical and algorithmic tools to deal with matrix and tensor computations.

Randomized algorithms
Randomization and sampling allow us to design provably accurate algorithms for problems that are:
- Massive (e.g., matrices/graphs so large that they cannot be stored at all, or can only be stored on slow, secondary memory devices), or
- Computationally expensive or even NP-hard (e.g., combinatorial problems such as the Column Subset Selection Problem).

Example: monitoring IP flows
A network administrator monitors the (source, destination) IP flows over time: with m sources and n destinations, A_ij = count of flows exchanged between the i-th source and the j-th destination.
Tasks: find patterns, summaries, anomalies, etc., either in the static setting or in the dynamic setting (tensor representation).
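As a concrete illustration, here is a tiny sketch of how such a count matrix A might be assembled from (source, destination) flow records; the record format and the example values are hypothetical.

```python
import numpy as np

# Hypothetical flow records: (source_ip, destination_ip) pairs.
flows = [("10.0.0.1", "10.0.1.9"), ("10.0.0.1", "10.0.1.9"), ("10.0.0.2", "10.0.1.7")]

sources = sorted({s for s, _ in flows})
dests = sorted({d for _, d in flows})
src_idx = {s: i for i, s in enumerate(sources)}
dst_idx = {d: j for j, d in enumerate(dests)}

# A[i, j] = number of flows from the i-th source to the j-th destination.
A = np.zeros((len(sources), len(dests)))
for s, d in flows:
    A[src_idx[s], dst_idx[d]] += 1
```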

Interplay
- Theoretical Computer Science: randomized and approximation algorithms.
- Numerical Linear Algebra: matrix computations and linear algebra (e.g., perturbation theory).
- Applications: data mining & information retrieval (e.g., human genetics, internet data, electronic circuit testing data, etc.).

Students and collaborators
Students:
- Asif Javed (PhD, graduated)
- Christos Boutsidis (PhD, 3rd year)
- Jamey Lewis (PhD, 2nd year)
- Elena Sebe (undergrad, now at RPI)
- John Payne (undergrad, now at CMU)
- Richard Alimi (undergrad, now at Yale)
Collaborators:
- M.W. Mahoney (Stanford U)
- J. Sun (IBM T.J. Watson Research Center)
- S. Muthu (Google)
- E. Ziv (UCSF, School of Medicine)
- K. K. Kidd (Yale U, Dept. of Genetics)
- P. Paschou (Democritus U., Greece, Genetics)
Our group keeps close ties with Yahoo! Research and Sandia National Laboratories. Funding from NSF and Yahoo! Research.

Overview
- From the Singular Value Decomposition (SVD) to CUR-type decompositions
- Additive and relative error CUR-type decompositions
- Future directions

The Singular Value Decomposition (SVD)
Matrix rows (m objects described by n features): points (vectors) in a Euclidean space. For example, given 2 objects (x & d), each described with respect to two features, we get a 2-by-2 matrix. Two objects are "close" if the angle between their corresponding vectors is small.

SVD, intuition
Let the blue circles represent m data points in a 2-D Euclidean space. Then the SVD of the m-by-2 matrix of the data will return:
- 1st (right) singular vector: the direction of maximal variance.
- 2nd (right) singular vector: the direction of maximal variance, after removing the projection of the data along the first singular vector.

Singular values
- σ1 measures how much of the data variance is explained by the first singular vector.
- σ2 measures how much of the data variance is explained by the second singular vector.

SVD: formal definition
A = U Σ V^T, where ρ is the rank of A:
- U (V): orthogonal matrix containing the left (right) singular vectors of A.
- Σ: diagonal matrix containing the singular values of A; let σ1 ≥ σ2 ≥ … ≥ σρ be its entries.
Exact computation of the SVD takes O(min{mn², m²n}) time. The top k left/right singular vectors/values can be computed faster using Lanczos/Arnoldi methods.

Rank-k approximations via the SVD
Writing A = U Σ V^T (m objects by n features) and keeping only the significant singular values/vectors separates the significant part of the data from the noise.

Rank-k approximations (A_k)
A_k = U_k Σ_k V_k^T, where:
- U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.
- Σ_k: diagonal matrix containing the top k singular values of A.
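A minimal NumPy sketch of A_k as defined above; the matrix and the choice of k are placeholders.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Return A_k = U_k * Sigma_k * V_k^T, the rank-k SVD truncation of A."""
    # Thin SVD: A = U diag(s) Vt, singular values sorted in decreasing order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the top-k singular triplets.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example: a random 100 x 20 matrix approximated with k = 5.
A = np.random.randn(100, 20)
A5 = rank_k_approximation(A, k=5)
print(np.linalg.norm(A - A5, "fro"))  # residual ||A - A_5||_F
```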

PCA and SVD
Principal Components Analysis (PCA) essentially amounts to the computation of the Singular Value Decomposition (SVD) of a covariance matrix. SVD is "the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra." (Dianne O'Leary, MMDS '06)

U_k solves an optimization problem
Given an m × n matrix A, we seek an m-by-k matrix C that minimizes the residual Frobenius norm

    min_C || A − P_C A ||_F

where P_C A is the projection of A onto the subspace spanned by the columns of C.
(1) It turns out that C = U_k is (one of many) solutions to the above problem.
(2) The minimal residual error is equal to the Frobenius norm of A − A_k.
(3) The above observations hold for any unitarily invariant norm (e.g., the spectral norm).
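A quick numerical check of observations (1) and (2), assuming the candidate C has orthonormal columns so that P_C A = C C^T A; the sizes are arbitrary placeholders.

```python
import numpy as np

A = np.random.randn(200, 50)
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]                               # top-k left singular vectors
Ak = Uk @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation

# For a matrix C with orthonormal columns, the projection is P_C A = C C^T A.
def residual(C, A):
    return np.linalg.norm(A - C @ (C.T @ A), "fro")

C_rand, _ = np.linalg.qr(np.random.randn(A.shape[0], k))  # random orthonormal basis

print(residual(Uk, A))                      # equals ||A - A_k||_F ...
print(np.linalg.norm(A - Ak, "fro"))        # ... the minimal residual
print(residual(C_rand, A))                  # a random subspace typically does worse
```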

SVD issues
SVD and PCA are often used to summarize and approximate matrices, and they have enjoyed enormous success in data analysis. BUT:
- For large sparse graphs they require large amounts of memory, exactly because the resulting matrices are not sparse any more. (Common networks such as the web, the Internet topology graph, and the who-trusts-whom social network are all large and sparse.)
- Running time becomes an issue.
- Assigning "meaning" to the singular vectors (reification) is tricky.

Alternative: CUR-type decompositions
Create an approximation to the original matrix that can be stored in much less space: C consists of a few (O(1)) columns of A, R of a few (O(1)) rows of A, and U is a carefully chosen small matrix.
1. A "sketch" consisting of a few rows/columns of the matrix may replace the SVD.
2. Rows/columns are drawn randomly, using various importance sampling schemes.
3. The choice of the sampling probabilities is critical for the accuracy of the approximation.

Advantages
In applications where large, sparse matrices appear, we can design algorithms for CUR-type decompositions that:
- are computable after two "passes" (sequential reads) through the matrices,
- require O(m+n) RAM space (compare to O(mn) for the SVD),
- can be computed very fast (extra O(m+n) time after the two passes), and
- preserve sparsity.
Caveat: accuracy loss.

Some notation
Given an m × n matrix A:
(i) A^(i) denotes the i-th column of the matrix as a column vector,
(ii) A_(i) denotes the i-th row of the matrix as a row vector,
(iii) |A^(i)| or |A_(i)| denotes the Euclidean norm of the corresponding column or row,
(iv) finally, ε will be an accuracy parameter in (0,1) and δ a failure probability.

Computing CUR (Drineas & Kannan SODA '03; Drineas, Kannan, & Mahoney SICOMP '06)
C consists of c = O(1/ε²) columns of A and R consists of r = O(1/ε²) rows of A.
C (and R) is created using importance sampling: columns (rows) are picked in i.i.d. trials with respect to probabilities proportional to their squared Euclidean lengths, e.g., for columns,

    p_j = |A^(j)|² / ||A||_F²
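A sketch of this sampling step, assuming probabilities proportional to squared Euclidean column lengths as in the Drineas-Kannan line of work cited above; the rescaling by 1/sqrt(c p_j) is the usual unbiasing step and the matrix sizes are placeholders.

```python
import numpy as np

def sample_columns(A, c, rng=np.random.default_rng(0)):
    """Pick c columns of A in i.i.d. trials with p_j = |A^(j)|^2 / ||A||_F^2,
    rescaling each kept column by 1/sqrt(c * p_j) (the usual unbiasing step)."""
    p = np.sum(A * A, axis=0)        # squared Euclidean column lengths
    p = p / p.sum()                  # p_j sums to 1
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    return A[:, idx] / np.sqrt(c * p[idx]), idx

A = np.random.randn(1000, 300)
C, cols = sample_columns(A, c=50)    # c (rescaled) columns of A
R, rows = sample_columns(A.T, c=50)  # the same procedure on A^T yields rows of A
R = R.T
```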

Computing U
Intuition: the CUR algorithm essentially expresses every row of the matrix A as a linear combination of a small subset of the rows of A. This small subset consists of the rows in R.
Given a row of A, say A_(i), the algorithm computes a good fit for A_(i) using the rows in R as the basis, by approximately solving

    min_u || A_(i) − u R ||   (restricted to the c sampled columns)

Notice that only c = O(1) elements of the i-th row are given as input; however, a vector of coefficients u can still be computed.
In the process of computing U, we fix its rank to be a positive constant k, which is part of the input.
Note: since CUR has rank k, ||A − CUR||_{2,F} ≥ ||A − A_k||_{2,F}. Thus, we should choose a k such that ||A − A_k||_{2,F} is small.
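To make the per-row fit concrete, here is a toy NumPy illustration of the intuition only, not the exact construction of U from the papers; the row/column indices, the sizes, and the uniform choice of R are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 200))
rows = rng.choice(A.shape[0], size=20, replace=False)   # indices defining R
cols = rng.choice(A.shape[1], size=30, replace=False)   # the c sampled columns
R = A[rows, :]                                           # r x n

i = 7                                                    # some row of A
# Fit A_(i) as a linear combination u of the rows of R, using only the
# c observed entries (least squares over the sampled columns).
u, *_ = np.linalg.lstsq(R[:, cols].T, A[i, cols], rcond=None)
approx_row = u @ R                                       # reconstruction of the full row
print(np.linalg.norm(A[i, :] - approx_row))
```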

Error bounds (Frobenius norm)
Assume A_k is the "best" rank-k approximation to A (through the SVD). Then

    ||A − CUR||_F ≤ ||A − A_k||_F + ε ||A||_F

We need to pick O(k/ε²) rows and O(k/ε²) columns.

Error bounds (2-norm)
Assume A_k is the "best" rank-k approximation to A (through the SVD). Then

    ||A − CUR||_2 ≤ ||A − A_k||_2 + ε ||A||_F

We need to pick O(1/ε²) rows and O(1/ε²) columns and set k = (1/ε) − 1.

Application to network flow data (Sun, Xie, Zhang, & Faloutsos SDM '07)
The data:
- A traffic trace consisting of TCP flow records collected at the backbone router of a class-B university network.
- Each record in the trace corresponds to a directional TCP flow between two hosts; timestamps indicate when the flow started and finished.
- Approx. 22,000 hosts/destinations; approx. 800,000 packets per hour between all host-destination pairs.
- Data from 10am-11am on January 6, 2005.

Application to network flow data (Sun, Xie, Zhang, & Faloutsos SDM '07)
Goal: detect abnormal (anomalous) hosts by measuring the reconstruction error for each host. In other words, if a host cannot be accurately expressed as a linear combination of a small set of other hosts, then it potentially represents an anomaly and should be flagged.
Simulation experiments: starting with real flow matrices, abnormalities are injected into the data by inserting:
- Abnormal source hosts: a source host is randomly selected and all the corresponding row entries are set to 1 (a scanner host that sends flows to every other destination in the network).
- Abnormal destination hosts: a destination is randomly picked, and 90% of its corresponding column entries are set to 1 (a denial-of-service attack from a large number of hosts).
Results: by applying a variant of CUR (removing duplicate rows/columns) and keeping about 500 columns and rows, recall is close to 100% and precision is close to 97%.
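A minimal sketch of the reconstruction-error test on a synthetic stand-in for a flow matrix; the uniform row/column choice and U = C⁺AR⁺ are placeholders for whichever CUR variant is used, and no claim is made about detection rates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a (source x destination) flow matrix, with one
# injected "scanner" source host: all entries of its row set to 1.
A = (rng.random((300, 300)) < 0.02).astype(float)
A[42, :] = 1.0

# Placeholder CUR factors: rows/columns drawn uniformly here, with
# U = C^+ A R^+ (the relative-error construction discussed later).
cols = rng.choice(300, size=60, replace=False)
rows = rng.choice(300, size=60, replace=False)
C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

# Flag hosts whose rows are poorly reconstructed by the CUR model.
errors = np.linalg.norm(A - C @ U @ R, axis=1)
print(np.argsort(errors)[-5:])   # candidate anomalous source hosts
```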

More accurate CUR decompositions?
An alternative perspective:
Q. Can we find the "best" set of columns and rows to include in C and R, together with the optimal U? Randomized and/or deterministic strategies are acceptable, even at the loss of efficiency (running time, memory, and sparsity).
Prior results by E. E. Tyrtyshnikov and collaborators (LAA '97) implied rather weak error bounds if we choose the columns and rows of A that define a parallelepiped of maximal volume.

Relative-error CUR-type decompositions
Goal: make (some norm of) A − CUR small, where C consists of O(1) columns of A, R of O(1) rows of A, and U is carefully chosen. For any matrix A, we can find C, U, and R such that the norm of A − CUR is almost equal to the norm of A − A_k.

From SVD to relative error CUR
Exploit structural properties of CUR to analyze data (rows: m objects, columns: n features). Instead of reifying the Principal Components:
- Use PCA (a.k.a. SVD) to find how many Principal Components are needed to "explain" the data.
- Run CUR and pick columns/rows instead of eigen-columns and eigen-rows.
- Assign meaning to actual columns/rows of the matrix. Much more intuitive!
Caveat: relative-error CUR-type decompositions retain sparsity, but necessitate the same amount of time as the computation of A_k and O(mn) memory.

Theorem: relative error CUR (Drineas, Mahoney, & Muthukrishnan SIMAX '08)
For any k, O(SVD_k(A)) time suffices to construct C, U, and R such that

    ||A − CUR||_F ≤ (1 + ε) ||A − A_k||_F

holds with probability at least 0.7, by picking O(k log k / ε²) columns and O(k log² k / ε⁶) rows.
O(SVD_k(A)): the time to compute the top k left/right singular vectors and values of A.

Comparison with additive error CUR
Let A_k be the "best" rank-k approximation to A. Then, after two passes through A, we can pick O(k/ε⁴) rows and O(k/ε⁴) columns such that

    ||A − CUR||_F ≤ ||A − A_k||_F + ε ||A||_F

The additive error term might be prohibitively large in many applications, so this "coarse" CUR might not capture the relevant structure in the data.

A constant factor CUR construction (Mahoney and Drineas, PNAS, to appear)
For any k, O(SVD_k(A)) time suffices to construct C, U, and R such that

    ||A − CUR||_F ≤ (2 + ε) ||A − A_k||_F

holds with probability at least 0.7, by picking O(k log k / ε²) columns and O(k log k / ε²) rows.

Relative-error CUR decomposition
Create an approximation to A using rows and columns of A: C consists of O(1) columns, R of O(1) rows, and U is carefully chosen.
1. How do we draw the columns and rows of A to include in C and R?
2. How do we construct U?
Goal: provide very good bounds for some norm of A − CUR.

Step 1: subspace sampling for C
INPUT: matrix A, rank parameter k, number of columns c
OUTPUT: matrix of selected columns C
- Compute the probabilities p_j;
- For each j = 1,2,…,n, pick the j-th column of A with probability min{1, c·p_j};
- Let C be the matrix containing the sampled columns (C has ≤ c columns in expectation).
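The selection step above, transcribed directly into NumPy; the probabilities p are assumed to be computed separately (e.g., via the leverage scores on the next slide).

```python
import numpy as np

def column_select(A, p, c, rng=np.random.default_rng(0)):
    """Given sampling probabilities p (summing to 1), keep column j of A
    independently with probability min(1, c * p_j); E[#columns kept] <= c."""
    keep = rng.random(A.shape[1]) < np.minimum(1.0, c * p)
    return A[:, keep], np.flatnonzero(keep)
```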

Subspace sampling (Frobenius norm)
V_k: orthogonal matrix containing the top k right singular vectors of A; Σ_k: diagonal matrix containing the top k singular values of A. The subspace sampling probabilities are

    p_j = (1/k) |(V_k^T)^(j)|²   (the normalization ensures that the p_j sum up to 1)

These are the leverage scores (many references in the statistics community).
Remark: the rows of V_k^T are orthonormal vectors, but its columns (V_k^T)^(j) are not.
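A sketch of the leverage-score probabilities just described, computed from an exact SVD (the "faster" alternatives are discussed later). Feeding these p_j into the column-selection sketch above, and repeating on A^T with U_k, yields C and R.

```python
import numpy as np

def leverage_score_probabilities(A, k):
    """p_j = (1/k) * ||j-th column of V_k^T||_2^2, where V_k holds the top-k
    right singular vectors of A; the p_j sum to 1 since V_k is orthonormal."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :].T                     # n x k, orthonormal columns
    return np.sum(Vk * Vk, axis=1) / k   # length-n vector of probabilities
```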

Step 1: subspace sampling for R
INPUT: matrix A, rank parameter k, number of rows r
OUTPUT: matrix of selected rows R
- Compute the probabilities p_i;
- For each i = 1,2,…,m, pick the i-th row of A with probability min{1, r·p_i};
- Let R be the matrix containing the sampled rows (R has ≤ r rows in expectation).

Subspace sampling (Frobenius norm)
U_k: orthogonal matrix containing the top k left singular vectors of A; Σ_k: diagonal matrix containing the top k singular values of A. The row sampling probabilities are

    p_i = (1/k) |(U_k)_(i)|²   (the normalization ensures that the p_i sum up to 1)

These are again leverage scores (many references in the statistics community).
Remark: the columns of U_k are orthonormal vectors, but its rows (U_k)_(i) are not.

Algorithm CUR
Example: C contains columns 7, 14, and 43 of A (ColumnSelect on A, using column leverage scores); R contains rows 10, 33, and 42 of A (ColumnSelect on A^T, using row leverage scores); U = C⁺AR⁺. In the worst case, c = r = O(k log(k) / ε²).
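Putting the pieces together, a compact sketch of this construction (leverage-score sampling for C and R, then U = C⁺AR⁺); all sizes and sampling parameters are arbitrary placeholders.

```python
import numpy as np

def leverage_probs(V):
    """p_i = ||row i of V||^2 / k for an orthonormal (n x k or m x k) factor V."""
    return np.sum(V * V, axis=1) / V.shape[1]

def cur_decomposition(A, k, c, r, rng=np.random.default_rng(0)):
    """Sketch of the leverage-score CUR above: ColumnSelect on A for C,
    ColumnSelect on A^T for R, and U = C^+ A R^+ (pseudoinverses)."""
    U_full, _, Vt = np.linalg.svd(A, full_matrices=False)
    keep_c = rng.random(A.shape[1]) < np.minimum(1.0, c * leverage_probs(Vt[:k, :].T))
    keep_r = rng.random(A.shape[0]) < np.minimum(1.0, r * leverage_probs(U_full[:, :k]))
    C, R = A[:, keep_c], A[keep_r, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# Demo on a noisy low-rank matrix.
rng = np.random.default_rng(1)
k = 10
A = rng.standard_normal((400, k)) @ rng.standard_normal((k, 250)) \
    + 0.1 * rng.standard_normal((400, 250))
C, U, R = cur_decomposition(A, k, c=80, r=80)
s = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(A - C @ U @ R, "fro"))   # CUR reconstruction error
print(np.sqrt(np.sum(s[k:] ** 2)))            # ||A - A_k||_F, for reference
```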

Computing subspace probabilities faster
Open problem: is it possible to compute/approximate the subspace sampling probabilities faster? (The exact computation via the SVD is expensive.)
Partial answer: assuming k is O(1), it is known that by leveraging the Fast Johnson-Lindenstrauss transform of (Ailon & Chazelle STOC '06), the singular vectors of a matrix can be approximated very accurately in O(mn) time. (Sarlos FOCS '06; Drineas, Mahoney, Muthukrishnan, & Sarlos '07; as well as work by Tygert, Rokhlin, and collaborators in PNAS.)
Problem: to the best of my knowledge, this does not immediately imply an algorithm to (provably) approximate the subspace sampling probabilities.
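For illustration only, here is a heuristic sketch that approximates the probabilities by sketching A with a plain Gaussian test matrix rather than the FJLT; as the slide notes, this is not a provable approximation scheme, just a way to see the idea in code.

```python
import numpy as np

def approx_leverage_probs(A, k, oversample=10, rng=np.random.default_rng(0)):
    """Heuristic: sketch A with a small Gaussian matrix S, take the top-k right
    singular vectors of the (k + oversample) x n sketch S A, and read off
    leverage-score-like probabilities. Illustration only; NOT the provable
    FJLT-based scheme discussed above."""
    m, n = A.shape
    S = rng.standard_normal((k + oversample, m)) / np.sqrt(k + oversample)
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    Vk = Vt[:k, :].T                 # n x k approximate right singular subspace
    p = np.sum(Vk * Vk, axis=1)
    return p / p.sum()
```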

CUR decompositions: a summary
- G.W. Stewart (Num. Math. '99, TR '04): C and R via variants of the QR algorithm; U minimizes ||A − CUR||_F. No a priori bounds; solid experimental performance.
- Goreinov, Tyrtyshnikov, & Zamarashkin (LAA '97, Cont. Math. '01): C columns that span maximal volume; R rows that span maximal volume; U = W⁺. Existential result; error bounds depend on ||W⁺||_2; spectral norm bounds!
- Williams & Seeger (NIPS '01): C and R chosen uniformly at random; U = W⁺. Experimental evaluation; A is assumed PSD; connections to the Nystrom method.
- Drineas and Kannan (SODA '03, SICOMP '06): C w.r.t. column lengths; R w.r.t. row lengths; U computed in linear/constant time. Randomized algorithm; provable, a priori bounds; explicit dependency on A − A_k.
- Drineas, Mahoney, & Muthukrishnan (SODA '06, SIMAX '08): C depends on the singular vectors of A; R depends on the singular vectors of C; U is (almost) W⁺. (1+ε) approximation to A − A_k; computable in O(mnk²) time (experimentally).
- Mahoney & Drineas (PNAS, to appear): C depends on the singular vectors of A; R depends on the singular vectors of A; U = C⁺AR⁺. (2+ε) approximation to A − A_k; computable in O(mnk²) time (experimentally).

Future directions
- Faster algorithms: leveraging preprocessing tools (i.e., the Fast Johnson-Lindenstrauss Transform, Ailon-Chazelle '06).
- Deterministic variants: upcoming work by C. Boutsidis.
- Matching lower bounds.
- Implementations and widely disseminated software.
- Similar results for non-linear dimensionality reduction techniques: kernel-PCA.