Large-Scale Distributed Non-negative Sparse Coding and Sparse Dictionary Learning
Authors: Vikas Sindhwani and Amol Ghoting
Presenter: Jinze Li

Problem Introduction
We are given a collection of N data points or signals in a high-dimensional space R^D: x_i ∈ R^D, 1 ≤ i ≤ N. Let h_j ∈ R^D, 1 ≤ j ≤ K, denote a dictionary of K "atoms", which we collect as the rows of the matrix H = (h_1 ... h_K)^T ∈ R^{K×D}. Given a suitable dictionary H, the goal of sparse coding is to represent each data point approximately as a sparse linear combination of atoms in the dictionary.
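To make the shapes concrete, a small illustrative sketch in Python (sizes and variable names are ours, not from the slides):

import numpy as np

N, D, K = 1000, 500, 100            # signals, ambient dimension, dictionary atoms
X = np.abs(np.random.randn(N, D))   # data matrix: one signal x_i per row
H = np.abs(np.random.randn(K, D))   # dictionary: one atom h_j per row
H /= np.linalg.norm(H, axis=1, keepdims=True)   # unit-norm atoms

# Sparse coding seeks a non-negative, row-sparse W (N x K) with X ≈ W @ H.
W = np.zeros((N, K))
reconstruction_error = np.linalg.norm(X - W @ H, 'fro') ** 2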

Formulation
Constraints:
- w_i ≥ 0 (non-negative coding)
- ||w_i||_p ≤ γ (sparse coding)
- h_j ≥ 0 (non-negative dictionary)
- ||h_j||_q ≤ ν (sparse dictionary)
- ||h_j||_2^2 = 1 (uniqueness)
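The objective function itself appears on the slide as an image; a plausible reconstruction, consistent with the constraints above and with the term-by-term description on the next slide, is

    min over W, H:   (1/2) ||X − W H||_F^2  +  (λ/2) ||H − H0||_F^2

subject to the constraints listed above, where X is the N×D data matrix and the rows of W are the coding vectors w_i.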

The first term in the objective function measures the squared reconstruction error. The second term is a regularizer that encourages the learnt dictionary to stay close to a prior H0 (if no prior H0 is available, λ is set to 0). Maintaining H as a sparse matrix turns its updates across iterations into very cheap sequential operations.

Optimization Strategy
Basic idea: Block Coordinate Descent (BCD). Each iteration of the algorithm cycles over the variables W and h_1 ... h_K, optimizing a single variable at a time while holding the others fixed.
- Sparse coding phase: the rows of W can be optimized in parallel.
- Dictionary learning phase: we cycle sequentially over h_1 ... h_K, keeping W fixed and solving the resulting subproblem.
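Schematically, one BCD sweep looks as follows (a sketch; code_fn and update_atom_fn stand in for the NOMP/NLASSO and dictionary-update routines described on the later slides):

def bcd_iteration(X, W, H, code_fn, update_atom_fn):
    # Sparse coding phase: each row of W depends only on its own signal x_i,
    # so these solves are embarrassingly parallel.
    for i in range(X.shape[0]):          # conceptually a parallel loop
        W[i] = code_fn(X[i], H)          # NOMP or NLASSO on signal x_i

    # Dictionary learning phase: cycle sequentially over atoms h_1 ... h_K,
    # keeping W fixed and solving the rank-one subproblem for each atom.
    for k in range(H.shape[0]):
        H[k] = update_atom_fn(X, W, H, k)
    return W, H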

Non-negative Sparse Coding
Two solvers are considered:
- OMP (Orthogonal Matching Pursuit): a greedy method that repeatedly adds the atom that most reduces the current reconstruction residual.
- Lasso: a relaxation whose penalty makes the problem amenable to standard numerical optimization procedures.

OMP (Orthogonal Matching Pursuit)
Precompute s = Hx and S = HH^T, where s is the vector of cosine similarities between the signal and the dictionary atoms, and S is the K×K matrix of inter-atom similarities.

OMP
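The OMP slide itself is an algorithm figure. The following is a minimal non-negative greedy pursuit sketch built around the precomputed s and S; the support refit via scipy's nnls is our simplification, not necessarily the paper's exact NOMP:

import numpy as np
from scipy.optimize import nnls

def nomp(x, H, max_atoms, tol=1e-6):
    # x: signal (D,); H: dictionary (K, D) with unit-norm rows.
    s = H @ x            # similarities between the signal and the atoms
    S = H @ H.T          # K x K inter-atom similarities
    w = np.zeros(H.shape[0])
    support = []
    for _ in range(max_atoms):
        corr = s - S @ w                 # correlations with the current residual
        if support:
            corr[support] = -np.inf      # never reselect a chosen atom
        k = int(np.argmax(corr))
        if corr[k] <= tol:               # no atom improves the fit further
            break
        support.append(k)
        # Non-negative least-squares refit of the coefficients on the support.
        coef, _ = nnls(H[support].T, x)
        w[:] = 0.0
        w[support] = coef
    return w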

Non-negative Lasso
The constraints are folded into an indicator function δ(w), defined to be 0 when w is feasible (w ≥ 0 and ||w||_1 ≤ γ) and ∞ otherwise, and the problem is solved with a proximal method. In proximal methods, the idea is to linearize the smooth component R around the current iterate w_t and then apply the proximal operator of δ.
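A minimal proximal-gradient sketch of this linearize-then-prox idea. For simplicity it uses a penalized l1 form rather than the constrained form ||w||_1 ≤ γ, so it is an illustration, not the paper's exact NLASSO; the w0 argument shows where warm starts plug in:

import numpy as np

def nn_lasso_prox(x, H, lam, step=None, w0=None, iters=200):
    # min_w 0.5 * ||x - H.T @ w||^2 + lam * ||w||_1,  subject to w >= 0
    K = H.shape[0]
    w = np.zeros(K) if w0 is None else w0.copy()   # w0 enables warm starts
    S = H @ H.T                                    # Hessian of the smooth part R
    s = H @ x
    if step is None:
        step = 1.0 / np.linalg.norm(S, 2)          # 1 / Lipschitz constant of grad R
    for _ in range(iters):
        grad = S @ w - s                           # gradient of the smooth part R
        z = w - step * grad                        # linearize R around w_t and step
        w = np.maximum(z - step * lam, 0.0)        # prox: soft-threshold, then clip at 0
    return w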

Sparse Dictionary Learning
We now discuss the dictionary learning phase, where we update H. We use the Hierarchical Alternating Least Squares (HALS) algorithm, which solves rank-one minimization problems to sequentially update the rows of H as well as the columns of W.
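A schematic of one rank-one HALS-style update for atom h_k with W fixed (a sketch under our reading; the paper's update also refreshes the k-th column of W and accounts for the prior term, and project_fn is the sparse non-negative projection discussed two slides below):

import numpy as np

def hals_update_atom(X, W, H, k, project_fn):
    # X: (N, D), W: (N, K), H: (K, D).
    wk = W[:, k]                                   # k-th column of W
    # Residual with atom k removed: X - sum_{j != k} w_j h_j
    R_k = X - W @ H + np.outer(wk, H[k])
    # Unconstrained least-squares direction for h_k, then project onto the
    # feasible set (non-negative, nu-sparse, unit norm).
    g = R_k.T @ wk / max(wk @ wk, 1e-12)
    H[k] = project_fn(g)
    return H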

Sparse Dictionary Learning

Sparse Dictionary Learning: Projection Problem
First, let h be any ν-sparse vector in R^D and let I(h) be its support. Let i_1, ..., i_D be a permutation of the index set {1, ..., D} such that q_{i_1}, ..., q_{i_D} is in sorted order, and define I* = {i_1, ..., i_s}, where s is the largest integer such that s ≤ ν and q_{i_1} > ... > q_{i_s} > 0. It is now easy to see that the solution to this problem is supported on I*.
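A hedged sketch of how we read this projection: keep the (at most) ν largest strictly positive entries of q, zero out everything else, and renormalize to unit l2 norm so that the constraints h ≥ 0, ||h||_0 ≤ ν, ||h||_2 = 1 all hold:

import numpy as np

def project_sparse_nonneg(q, nu):
    h = np.zeros_like(q)
    order = np.argsort(q)[::-1]                  # indices of q in decreasing order
    top = [i for i in order[:nu] if q[i] > 0]    # the set I* from the slide
    if top:
        h[top] = q[top]
        h /= np.linalg.norm(h)                   # renormalize to unit norm
    return h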

Implementation Details
The parallelization strategy hinges on two observations:
- The objective function is separable in the sparse coding variables, which can therefore be optimized in an embarrassingly parallel fashion.
- By enforcing hard sparsity on the dictionary elements, we turn H into an object that can be manipulated very efficiently in memory.

Efficient Matrix Computations
A key design question is which matrices (such as W, S, W^TW and X^TW) are materialized in memory in our single-node multicore and Hadoop-based cluster implementations.

For NOMP, the maximum memory requirement for W can be bounded by O(Nγ). Alternatively, matrix-vector products against S may be computed implicitly in O(Kν) time as Sv = H(H^T v).
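A small sketch of the implicit product (scipy CSR storage for the ν-sparse H is an assumption on our part):

from scipy.sparse import csr_matrix

def implicit_S_matvec(H_sparse, v):
    # Compute S v = (H H^T) v as H @ (H.T @ v) without ever forming the
    # K x K matrix S. With nu non-zeros per row of H, both products cost
    # O(K * nu). H_sparse is a scipy CSR matrix of shape (K, D).
    return H_sparse @ (H_sparse.T @ v)

# Usage sketch: H_sparse = csr_matrix(H); Sv = implicit_S_matvec(H_sparse, v)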

Single-Node In-memory Multicore Implementation
Plan 1: The first plan does not materialize the matrix W, but explicitly maintains the dense D×K matrix X^TW. As each invocation of NOMP or NLASSO completes, the associated sparse coding vector w_i is used to update the summary statistics matrices W^TW and X^TW and is then discarded, since everything needed for dictionary learning is contained in these summary statistics. When DK << Nγ, this leaves more room to accommodate larger data sizes in memory. However, not materializing W means that NLASSO cannot exploit warm starts from the sparse coding solutions found with respect to the previous dictionary.
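A sketch of the Plan 1 summary-statistics update (function and variable names are ours): each completed code w_i contributes a rank-one update to W^TW and X^TW and is then discarded:

import numpy as np

def accumulate_summary_stats(x_i, w_i, WtW, XtW):
    # WtW: (K, K) running W^T W;  XtW: (D, K) running X^T W.
    WtW += np.outer(w_i, w_i)
    XtW += np.outer(x_i, w_i)
    return WtW, XtW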

Plan 2: In an alternate plan, we materialize W instead, which consumes less memory if Nγ << DK. We then serve the columns of X^TW, i.e., the vectors X^T v_k in Eqn 14, by performing a sparse matrix-vector product on the fly. However, this requires extracting a column from W, which is stored in a row-major dynamic sparse format. To make column extraction efficient, we use the fact that the indices of the non-zero entries of each row are held in sorted order, and the associated value arrays conform to this order. Hence, we can keep a record of where the i-th column was found for each row, and simply advance this pointer when the (i+1)-th column is required. Thus, all columns can be served efficiently in one pass over W.
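A sketch of the pointer-advancing column extraction described above (the per-row storage format shown is an assumption, not the paper's exact data structure):

def stream_columns(rows):
    # rows: W stored row-major, each row as (sorted_indices, values).
    # Keep one pointer per row and advance it as columns are requested in
    # increasing order, so all columns are served in a single pass over W.
    ptr = [0] * len(rows)
    K = max((idx[-1] for idx, _ in rows if len(idx)), default=-1) + 1
    for col in range(K):
        entries = []                      # (row_id, value) pairs of column `col`
        for r, (idx, vals) in enumerate(rows):
            p = ptr[r]
            if p < len(idx) and idx[p] == col:
                entries.append((r, vals[p]))
                ptr[r] += 1               # advance this row's pointer
        yield col, entries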

Benefits of Plan 2: the matrix-vector products of X^T against the columns of W can be parallelized, and NLASSO can use warm starts.

Cluster Implementation

The execution proceeds in two phases: a preprocessing phase and a learning phase. The preprocessing phase is a one-time step whose objective is to reorganize the data for better parallel execution of the BCD algorithm. The learning phase is iterative and is coordinated by a control node that first spawns NOMP or NLASSO MapReduce (MR) jobs on the worker nodes, then runs sequential dictionary learning, and finally monitors the global mean reconstruction error to decide on convergence.
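Schematically, the control node's loop might look as follows (the function names are hypothetical placeholders for the MR jobs and bookkeeping described above, not an actual API):

def learning_phase(spawn_coding_jobs, dictionary_learning_step, mean_error, tol, max_iters):
    prev = float("inf")
    for _ in range(max_iters):
        spawn_coding_jobs()               # NOMP / NLASSO MR jobs on the workers
        dictionary_learning_step()        # sequential dictionary updates on the control node
        err = mean_error()                # global mean reconstruction error
        if prev - err < tol:              # stop once the error stops improving
            break
        prev = err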

Performance comparison