Learning-Based Low-Rank Approximations

Presentation transcript:

Learning-Based Low-Rank Approximations
Piotr Indyk, Ali Vakilian, Yang Yuan (MIT)

Linear Sketches
Many algorithms are obtained using linear sketches:
- Input: represented by x (or A)
- Sketching: compress x into Sx, where S is the sketch matrix
- Computation: performed on Sx
Examples:
- Dimensionality reduction (e.g., Johnson-Lindenstrauss lemma)
- Streaming algorithms (matrices are implicit, e.g., hash functions)
- Compressed sensing
- Linear algebra (regression, low-rank approximation, ...)
[Figure: Count-Min sketch applied to x_i]
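To make the sketching step concrete, here is a small NumPy illustration of compressing x into Sx with a CountSketch-style sparse matrix (one random ±1 entry per column). The dimensions n and m are placeholders, not values from the talk.

```python
import numpy as np

# A small illustration of "compress x into Sx" with a CountSketch-style
# sparse sketch matrix: one random +-1 entry per column.
n, m = 10_000, 100
rows = np.random.randint(0, m, size=n)           # hash each coordinate to a row of S
signs = np.random.choice([-1.0, 1.0], size=n)    # random sign per coordinate
S = np.zeros((m, n))
S[rows, np.arange(n)] = signs                    # sparse S (one nonzero per column)
x = np.random.randn(n)
sketch = S @ x                                   # downstream computation uses only Sx
```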

Learned Linear Sketches
S is almost always a random matrix: independent, FJLT, sparse, ...
- Pros: simple, efficient, worst-case guarantees
- Cons: does not adapt to data
Why not learn S from examples?
- Dimensionality reduction: e.g., PCA
- Compressed sensing: talk by Ali Mousavi
- Autoencoders: x → Sx → x'
- Streaming algorithms: talk by Ali Vakilian
- Linear algebra? This talk
[Figure: learned oracle routes each stream element to a unique bucket if predicted heavy, otherwise to a sketching algorithm (e.g., Count-Min)]

Low Rank Approximation
Singular Value Decomposition (SVD): any matrix A = U Σ V, where
- U has orthonormal columns
- Σ is diagonal
- V has orthonormal rows
Rank-k approximation: A_k = U_k Σ_k V_k
Equivalently: A_k = argmin_{rank-k matrices B} ||A − B||_F
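A quick NumPy illustration of the rank-k truncation A_k = U_k Σ_k V_k; the matrix sizes here are arbitrary placeholders.

```python
import numpy as np

# Rank-k truncation via the SVD.
np.random.seed(0)
A = np.random.randn(500, 80)
k = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt, Vt has orthonormal rows
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                  # best rank-k approximation in Frobenius norm
err = np.linalg.norm(A - A_k, 'fro')               # = sqrt(sum of squared tail singular values)
```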

Approximate Low Rank Approximation
Instead of A_k = argmin_{rank-k matrices B} ||A − B||_F, output a rank-k matrix A' such that
  ||A − A'||_F ≤ (1+ε) ||A − A_k||_F
- Hopefully more efficient than computing the exact A_k
- Sarlos'06, Clarkson-Woodruff'09, '13, ...; see Woodruff'14 for a survey
- Most of these algorithms use linear sketches SA
- S can be dense (FJLT) or sparse (0/+1/−1); we focus on sparse S

Sarlos-Clarkson-Woodruff
Streaming algorithm (two passes):
- Compute SA (first pass)
- Compute orthonormal V that spans the rowspace of SA
- Compute AV^T (second pass)
- Return SCW(S, A) := [AV^T]_k V
Space: suppose A is n × d and S is m × n; then SA is m × d and AV^T is n × m, so space is proportional to m
Theory: m = O(k/ε)
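A minimal NumPy sketch of this two-pass SCW pipeline. The sizes, the dense ±1 choice of S, and the synthetic A are illustrative assumptions, not the authors' code.

```python
import numpy as np

def scw(S, A, k):
    """Two-pass SCW pipeline: sketch, extract a row basis, project, truncate."""
    SA = S @ A                                         # first pass: m x d sketch
    _, _, Vt = np.linalg.svd(SA, full_matrices=False)  # orthonormal rows spanning rowspace(SA)
    AV = A @ Vt.T                                      # second pass: n x m
    U, s, Wt = np.linalg.svd(AV, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Wt[:k] @ Vt            # [A V^T]_k V, a rank-k matrix

# Illustrative usage with a dense +-1 random S (all sizes are placeholders):
n, d, m, k = 1000, 200, 40, 10
A = np.random.randn(n, k) @ np.random.randn(k, d) + 0.01 * np.random.randn(n, d)
S = np.random.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
A_approx = scw(S, A, k)            # ||A - A_approx||_F should be close to ||A - A_k||_F
```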

Learning-Based Low-Rank Approximation
- Sample matrices A_1 ... A_N
- Find S that minimizes Σ_i ||A_i − SCW(S, A_i)||_F
- Use S happily ever after ... (as long as data come from the same distribution)
"Details":
- Use sparse matrices S: random support, optimize the values
- Optimize using SGD in PyTorch
- Need to differentiate the objective w.r.t. S
- Represent the SVD as a sequence of power-method applications (each is differentiable)
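A rough PyTorch sketch of this training setup, under stated simplifications: the sparse support is a random 0/1 mask rather than exactly one nonzero per column, torch.linalg.svd's autograd is used in place of the power-method SVD described above, and all sizes and the synthetic training matrices are placeholders.

```python
import torch

def scw_torch(S, A, k):
    """Differentiable SCW(S, A); torch.linalg.svd's autograd is a stand-in for
    the power-method SVD (its gradients can be unstable when singular values
    are nearly equal)."""
    SA = S @ A
    _, _, Vh = torch.linalg.svd(SA, full_matrices=False)
    AV = A @ Vh.T
    U, s, Wh = torch.linalg.svd(AV, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Wh[:k] @ Vh

# Placeholder training data; support of S is a fixed random 0/1 mask.
n, d, m, k = 500, 100, 20, 10
train_matrices = [
    torch.randn(n, k) @ torch.randn(k, d) + 0.01 * torch.randn(n, d)
    for _ in range(40)
]
mask = (torch.rand(m, n) < 1.0 / m).float()        # fixed random support
W = torch.randn(m, n, requires_grad=True)          # learnable values of S
opt = torch.optim.SGD([W], lr=1e-3)
for epoch in range(10):
    for A in train_matrices:
        loss = torch.linalg.matrix_norm(A - scw_torch(mask * W, A, k))
        opt.zero_grad()
        loss.backward()
        opt.step()
```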

Evaluation
Datasets:
- Videos: MIT Logo, Friends, Eagle
- Hyperspectral images (HS-SOD)
- TechTC-300
- 200/400 training, 100 testing
Procedure:
- Optimize the matrix S
- Compute the empirical recovery error Σ_i ( ||A_i − SCW(S, A_i)||_F − ||A_i − (A_i)_k||_F )
- Compare to random matrices S
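A small helper for the empirical recovery error above, assuming the scw() function from the NumPy sketch earlier; test_matrices is a placeholder for the held-out set.

```python
import numpy as np

def recovery_error(S, test_matrices, k):
    """Sum over test matrices of ||A - SCW(S, A)||_F - ||A - A_k||_F.
    Reuses scw() from the NumPy sketch above."""
    total = 0.0
    for A in test_matrices:
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        A_k = (U[:, :k] * s[:k]) @ Vt[:k]                  # exact rank-k baseline
        total += (np.linalg.norm(A - scw(S, A, k), 'fro')
                  - np.linalg.norm(A - A_k, 'fro'))
    return total
```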

Results
[Figure: results for k = 10 on the Tech, Hyper, and MIT Logo datasets]

Fallback option
- Learned matrices work (much) better, but give no guarantees per matrix
- Solution: combine S with random rows R
- Lemma ("sketch monotonicity"): augmenting R with an additional (learned) matrix S cannot increase the error of SCW
- The algorithm therefore inherits worst-case guarantees from R
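A toy illustration of the mixed sketch, again reusing the hypothetical scw() from above; S_learned here is a random stand-in for a trained sketch, and all sizes are placeholders.

```python
import numpy as np

# Mixed sketch: stack random rows R on top of learned rows S_learned.
# By sketch monotonicity, SCW with [R; S_learned] is never worse than
# with the random rows R alone.
n, d, k = 1000, 200, 10
A = np.random.randn(n, d)
S_learned = np.random.randn(10, n)                              # placeholder learned rows
R = np.random.choice([-1.0, 1.0], size=(10, n)) / np.sqrt(10)   # random rows
S_mixed = np.vstack([R, S_learned])                             # stacked 20 x n sketch
A_mixed = scw(S_mixed, A, k)
```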

Mixed matrices - results

k    m    Sketch     Logo    Hyper   Tech
10   20   Learned    0.1     0.52    2.95
10   20   Mixed      0.2     0.78    3.73
10   20   Random     2.09    2.92    7.99
10   40   Learned    0.04    0.28    1.16
10   40   Mixed      0.05    0.34    1.31
10   40   Random     0.45    1.12    3.28

Conclusions/Questions
- Learned sketches can improve the accuracy/measurement tradeoff for low-rank approximation (improves space)
Questions:
- Improve time: requires sketching on both sides, more complicated
- Guarantees: sampling complexity; minimizing the loss function (provably)
- Other sketch-monotone algorithms?

Wacky idea
A general approach to learning-based algorithms:
- Take a randomized algorithm
- Learn the random bits from examples
The last two talks can be viewed as instantiations of this approach:
- Random partitions → learning-based partitions
- Random matrices → learning-based matrices
"Super-derandomization": a more efficient algorithm with weaker guarantees (as opposed to a less efficient algorithm with stronger guarantees)