Multi-view Sparse Co-clustering via Proximal Alternating Linearized Minimization Jinbo Bi Joint work with Jiangwen Sun, Jin Lu, and Tingyang Xu Department of Computer Science and Engineering University of Connecticut

Outline Existing co-clustering methods Motivation of our problem – multi-view sparse co-clustering Our formulation – low-rank matrix approximation Optimization of our formulated problem – proximal alternating linearized minimization Experimental results Conclusion We review what existing co-clustering methods do, then we discuss why we want to solve multi-view sparse co-clustering. Our approach is formulated based on low-rank matrix approximation. We developed a proximal alternating linearized minimization algorithm to solve the resulting optimization problem. We discuss our experimental results and then conclude.

Existing co-clustering methods Bi-clustering – jointly cluster rows and columns of a data matrix Clustering in subspaces – search subspaces, and find different cluster solutions in the different subspaces Existing co-clustering methods fall largely into two lines. One line is called bi-clustering, where we jointly cluster rows and columns of a data matrix, so each cluster of subjects differs from the others on a subset of features rather than on all features. Bi-clustering is similar to subspace clustering, where we cluster subjects in subspaces or, more generally, search for subspaces of the data dimensions and find different subject groupings in the different subspaces. A data matrix

Existing co-clustering methods Multi-view co-clustering – e.g., co-regularized spectral clustering (using all features to compute a similarity matrix for each view) Another line involves multiple data matrices and is called multi-view co-clustering, where the same subjects are observed in different data sources. We could do cluster analysis in each view separately, but we want the resulting subject clusters to be consistent across the views. Data matrix 1 Data matrix 2

Multi-view sparse co-clustering Find subspaces in each view so as to identify clusters in those subspaces that are consistent across the views Now the problem we want to solve is called multi-view sparse co-clustering, where we find subspaces in each view so as to identify clusters that agree across the views. The fundamental assumption here is that the subject clusters may exist in subspaces rather than in all dimensions. Data matrix 1 Data matrix 2

Motivation – example applications Derive subtypes of a complex disease in both clinical symptoms and genetic variants A disease subtype may be characterized by a subset of symptoms (not all symptoms) Usually only very few genetic variants from DNA are associated with a disease subtype Detect conserved gene co-regulation among multiple species Genes are co-regulated (up- or down-regulated) only at certain stages (not all stages) for every species The same subset of genes may be co-regulated at different stages for different species This problem is encountered in many scientific domains. Here, we give two bioinformatics problems as examples. When we derive subtypes of a complex disorder, the subtypes need to be defined in both clinical symptoms and genetic variations. A disease subtype may be characterized by a subset of symptoms rather than all symptoms, and only specific genetic variants, not the entire DNA, may be associated with a disease subtype. Hence, the subtype clusters exist in subspaces of the two views. When we detect gene co-regulation conserved across species, each species gives us a data matrix containing gene expression levels observed at different developmental stages. The genes can be up- or down-regulated at certain stages, which may differ among species, but not at all stages.

Our formulation – single view Based on low-rank matrix approximation – perform a sequence of rank-one approximations to the data matrices u v^T Our solution to multi-view sparse co-clustering is based on low-rank matrix approximation. We propose to perform a sequence of rank-one approximations to the data matrices. For one data matrix, our task is similar to bi-clustering: we approximate the data matrix by the product of vectors u and v, and we require both u and v to be sparse. For instance, if they have non-zero values here, the first cluster is identified. This means we impose 0-norm regularization on u and v when solving the approximation. Note that in this process, we do not care about the actual values of the non-zero entries, but about how many non-zero entries there are and where they are located, so a 0-norm penalty is appropriate. Require sparse vectors to be used in the decomposition
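As a concrete, hedged illustration of this single-view rank-one step, the following minimal Python sketch alternates between u and v and enforces the 0-norm constraints by hard thresholding. It is not the exact algorithm from the paper; the names hard_threshold, sparse_rank_one, k_u, and k_v are illustrative assumptions.

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    if k > 0:
        keep = np.argsort(np.abs(x))[-k:]
        out[keep] = x[keep]
    return out

def sparse_rank_one(X, k_u, k_v, n_iter=100, seed=0):
    """Alternating sparse rank-one approximation X ~= u v^T.

    k_u and k_v cap the number of non-zeros in u and v (an L0-style
    constraint); the non-zero positions of u and v mark the subjects and
    features of one co-cluster.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    u = np.zeros(X.shape[0])
    for _ in range(n_iter):
        u = hard_threshold(X @ v, k_u)      # update u for fixed v, then sparsify
        norm_u = np.linalg.norm(u)
        if norm_u > 0:
            u /= norm_u
        v = hard_threshold(X.T @ u, k_v)    # update v for fixed u, then sparsify
    return u, v
```

The non-zero positions of the returned u and v indicate the subjects and features of one co-cluster; deflating X and repeating gives the sequence of rank-one approximations mentioned on the slide.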

Our formulation – multiple views Given m data matrices, we want the u's to have non-zero entries at the same positions We hence use a binary vector to connect the different views Now, given multiple (say m) data matrices, to find the same subject clusters we need the left vectors u to have their non-zero entries at the same positions. We hence use a binary indicator vector w to connect the different views when we solve the approximation problems. In each view's approximation, we multiply u component-wise by this shared w. We no longer need u to be sparse, because once w is sparse, it enforces the same sparsity pattern on the left vectors of all views. Equivalently, this amounts to a non-convex and non-smooth minimization problem over the u's, v's, and w, where w is restricted to the set of binary vectors of length n and sparsity is imposed through the 0-norm.
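To make the role of the shared indicator w concrete, here is a hedged sketch of one plausible form of the smooth data-fitting term; the exact regularization terms and weights in the paper may differ, and multiview_objective and its argument names are illustrative.

```python
import numpy as np

def multiview_objective(X_views, u_views, v_views, w):
    """Sum over views of ||X_k - (w * u_k) v_k^T||_F^2, where w is the shared
    0/1 indicator over subjects (rows). Because every view uses the same w,
    the non-zero rows of the rank-one terms, i.e., the subject cluster, are
    forced to coincide across views; sparsity of w and of each v_k is imposed
    separately by the 0-norm terms during optimization, not evaluated here.
    """
    total = 0.0
    for X, u, v in zip(X_views, u_views, v_views):
        total += np.linalg.norm(X - np.outer(w * u, v)) ** 2
    return total
```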

Optimization algorithm The framework of proximal alternating linearized minimization (PALM) (Bolte et al., 2014) Can solve an optimization problem with multiple blocks of variables Only requires the term of the objective that couples all variables to be smooth Only requires that the smooth part of the objective have a component-wise Lipschitz continuous gradient for convergence Has been proved to globally converge to a critical point of the problem if the problem satisfies certain conditions To effectively optimize the problem, we developed an algorithm based on the framework of proximal alternating linearized minimization, in short PALM. PALM is a framework for solving optimization problems with multiple blocks of variables. It only requires that the term of the objective coupling all variables be smooth, and that this smooth part have a component-wise Lipschitz continuous gradient for convergence, whereas most proximal operator methods require the objective function to be globally Lipschitz continuous. It has been proved to globally converge to a critical point of the problem if the problem satisfies certain conditions.
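The following generic Python template sketches the PALM scheme described above (a block-coordinate, proximal-linearized loop). It is a sketch of the general framework, not the authors' implementation, and all names (palm, grad_h, prox, lipschitz, gamma) are illustrative.

```python
def palm(blocks, grad_h, prox, lipschitz, gamma=1.1, n_iter=200):
    """Generic PALM template: cycle over the blocks of variables; for each
    block, take a gradient step on the smooth coupling term h with step size
    1 / (gamma * L_block) and then apply that block's proximal map.

    blocks    : dict mapping a block name to its current value (numpy array)
    grad_h    : dict of functions returning the partial gradient of h w.r.t. a block
    prox      : dict of proximal operators, called as prox[name](point, step_constant)
    lipschitz : dict of functions returning the block-wise Lipschitz constant
    """
    for _ in range(n_iter):
        for name in blocks:
            c = gamma * lipschitz[name](blocks)        # step-size constant c > L
            step = blocks[name] - grad_h[name](blocks) / c
            blocks[name] = prox[name](step, c)
    return blocks
```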

Optimization algorithm We derive a PALM algorithm that alternates between optimizing the u's, the v's, and ω. To solve for each block of variables, we use a proximal gradient method; for instance, we solve u_k by a proximal-linearized step of the form u_k ← prox( u_k − (1/(γL)) ∇_{u_k} h ), where h is the smooth part of the objective, γ > 1 is a constant, and L is the Lipschitz modulus of ∇_{u_k} h. We hence alternate between optimizing the u's, the v's, and w. For each group of variables, the proximal gradient method gives a closed-form updating formula; for instance, this is the formula to update u for each view, where γ is a pre-chosen constant and L is the Lipschitz modulus of the gradient.
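As a hedged sketch of what such a closed-form update could look like for one view, the code below assumes the smooth part for u is the squared Frobenius loss ||X − (w∘u)vᵀ||²_F and that the sparsity is carried entirely by w, so the prox for u is taken as the identity; update_u and its arguments are illustrative, and the paper's exact update may include additional constraints on u.

```python
import numpy as np

def update_u(X, u, v, w, gamma=1.1):
    """One proximal-linearized step on u for a single view (sketch).

    Assumed smooth part: h(u) = ||X - (w * u) v^T||_F^2, whose gradient is
    grad_u h = -2 w * ((X - (w * u) v^T) @ v). With w binary, 2 ||v||^2 is a
    block-wise Lipschitz constant of this gradient.
    """
    residual = X - np.outer(w * u, v)
    grad = -2.0 * w * (residual @ v)
    L = 2.0 * float(v @ v) + 1e-12   # Lipschitz bound (tiny offset avoids div by zero)
    return u - grad / (gamma * L)
```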

Optimization algorithm When solving for the v's and ω, we can similarly derive the proximal operator problems, but now with an L0 regularizer. When we solve for the v's and w, we similarly derive the proximal operator problems, but now with the 0-norm regularization.

Optimization algorithm Both of the proximal operator problems have closed-form solutions. These two sub-problems have closed-form solutions as follows. For instance, for the shared vector w, we first compute the solution of the unconstrained problem, then we threshold this vector and keep only those entries whose magnitude is greater than a threshold; this threshold is determined by the hyper-parameter s_w. α and β are the corresponding thresholds, determined by the sparsity hyper-parameters.
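Below is a minimal sketch of such a hard-thresholding update for the shared binary vector w, assuming s_w is interpreted as the number of entries to keep (in the paper it may instead be a penalty weight that defines the threshold); threshold_w and score are illustrative names, and score stands in for the unconstrained per-entry solution, whose exact form is not reproduced here.

```python
import numpy as np

def threshold_w(score, s_w):
    """Closed-form style update for the shared binary vector w (sketch):
    start from an unconstrained per-entry solution 'score', then set to 1
    only the entries whose magnitude exceeds the threshold implied by the
    sparsity hyper-parameter s_w, zeroing the rest.
    """
    w = np.zeros_like(score, dtype=float)
    if s_w <= 0:
        return w
    keep = np.argsort(np.abs(score))[-int(s_w):]
    w[keep] = 1.0
    return w
```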

Optimization algorithm Convergence analysis shows the following result. Theorem: Let z be the vector consisting of all variables of the proposed problem, and let {z^t} be a sequence generated by our PALM algorithm. Then the sequence {z^t} has finite length and converges to a critical point of the problem. The algorithm runs in O(nmd) computation time. Our convergence analysis shows that this PALM-based algorithm globally converges.

Computational results The proposed algorithm was tested on Simulations Benchmark datasets Comparison Single-view sparse low-rank approximation Kernel addition Kernel product Co-regularized spectral (Kumar et al., 2011) Co-trained spectral (Kumar & Daume III, 2011) Multi-view CCA (Chaudhuri et al., 2009) Multi-view feature learning (Wang et al., 2013) We tested our algorithm in simulations and on benchmark data. We compared it with seven other methods: the first three are baseline methods, and the other four are state-of-the-art co-clustering methods.

Simulations We synthesized two views of data Genetic view: 1092 subjects, 100 genetic markers from the 1000 Genomes Project Clinical view: synthesized 9 clinical variables for the 1092 subjects Created two clusters in each view; each cluster is associated with 10 randomly picked genetic markers and 3 randomly picked clinical variables Subjects not in the two clusters form a third cluster We added noise controlled by a parameter e; the larger e is, the more consistent the cluster solutions are across the two views

Results in simulations This table shows the normalized mutual information, which computes the mutual information between the synthesized cluster assignments and the cluster assignments produced by each method, normalized by the cluster entropies. It ranges from 0 to 1, and the higher the better. The proposed method clearly outperformed the other methods. NMI: normalized mutual information computes the mutual information between two cluster assignments normalized by the cluster entropies.
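For reference, the NMI score described on this slide can be computed with scikit-learn, as in the toy snippet below; the label vectors are placeholders, and the library's default normalization may differ slightly from the exact variant used in the paper.

```python
from sklearn.metrics import normalized_mutual_info_score

# NMI between ground-truth (synthesized) cluster assignments and the
# assignments produced by a method; values range from 0 to 1, higher is better.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]
print(normalized_mutual_info_score(true_labels, pred_labels))
```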

Results in simulations Among all of the comparison methods, only our method can identify features for the clusters. This table summarizes the feature selection performance. Our algorithm recovers the true features quite accurately. TF: true features, TPF: true positive features, FPF: false positive features

Benchmark datasets UCI Handwritten Digits dataset: 2000 examples, 6 views The views have different features, e.g., 240 pixel averages over 2-by-3 sub-images in one view and 76 Fourier coefficients in another view Crowd-sourcing dataset: 584 images, 2 views One view has 15,369 image features; the other has 108 labels provided by 27 online labelers We used two benchmark datasets: the UCI handwritten digits data, which has 6 views and 2000 examples, and a crowd-sourcing dataset, which has 2 views and 584 examples.

Results on benchmark data Again, here are the normalized mutual information values. Our method achieved the highest values among the compared methods. NMI: normalized mutual information computes the mutual information between two cluster assignments normalized by the cluster entropies.

Conclusion We believe this is the first method that searches subspaces for multi-view consistent clusters. The proposed PALM-based algorithm is efficient: at each alternating step, it evaluates an analytical (closed-form) formula, and its computation time is linear in the data size. The algorithm globally converges to a critical point of the problem. Our approach directly solves a formulation with L0 regularization (rather than an approximation of it). The take-home message is the following.

Thank you! References
Bolte et al., Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming, 146(1):459-494, 2014.
Chaudhuri et al., Multi-view clustering via canonical correlation analysis, International Conference on Machine Learning, pp. 129-136, 2009.
Kumar & Daume III, A co-training approach for multi-view spectral clustering, International Conference on Machine Learning, pp. 393-400, 2011.
Kumar et al., Co-regularized multi-view spectral clustering, Advances in Neural Information Processing Systems, pp. 1413-1421, 2011.
Wang et al., Multi-view clustering and feature learning via structured sparsity, International Conference on Machine Learning, JMLR 28:352-360, 2013.
Thank you! This work is supported by NSF and NIH grants.