Informatics and Mathematical Modelling / Intelligent Signal Processing 1 EMMDS 2009 July 3rd, 2009 Clustering on the Simplex. Morten Mørup, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark.

Informatics and Mathematical Modelling / Intelligent Signal Processing 2 EMMDS 2009 July 3rd, 2009 Joint work with Lars Kai Hansen, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark, and Christian Walder, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark.

Informatics and Mathematical Modelling / Intelligent Signal Processing 3 EMMDS 2009 July 3rd, 2009 Clustering: Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. (Wikipedia)

Informatics and Mathematical Modelling / Intelligent Signal Processing 4 EMMDS 2009 July 3rd, 2009 Clustering approaches. The K-means iterative refinement algorithm (Lloyd, 1982; Hartigan, 1979): Assignment step (S): assign each data point to the cluster with the closest mean value. Update step (C): calculate the new mean value for each cluster. Guarantee of optimality: no single change in assignment is better than the current assignment (1-spin stability). Drawback: the problem is NP-complete (Megiddo and Supowit, 1984). Relaxations of the hard assignment problem: annealing approaches based on a temperature parameter (as T→0 the original clustering problem is recovered; see for instance Hofmann and Buhmann, 1997), fuzzy clustering (Hathaway and Bezdek, 1988), Expectation Maximization (mixture of Gaussians), and spectral clustering. Previous relaxations are either not exact or depend on some problem-specific annealing parameter in order to recover the original binary combinatorial assignments.
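For reference, here is a minimal NumPy sketch of Lloyd's two-step refinement (assignment step over the hard assignments, update step over the means); the function and variable names are illustrative and not taken from the talk.

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """X: M x N data matrix (observations in columns); K: number of clusters."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    mu = X[:, rng.choice(N, size=K, replace=False)].astype(float)  # initial means
    assign = -np.ones(N, dtype=int)
    for it in range(n_iter):
        # Assignment step (S): assign each point to the cluster with the closest mean.
        dist = ((X[:, :, None] - mu[:, None, :]) ** 2).sum(axis=0)  # N x K squared distances
        new_assign = dist.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # no assignment changed in this sweep: Lloyd's algorithm has converged
        assign = new_assign
        # Update step (C): recompute each cluster mean from its members.
        for k in range(K):
            members = assign == k
            if members.any():
                mu[:, k] = X[:, members].mean(axis=1)
    return assign, mu
```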

Informatics and Mathematical Modelling / Intelligent Signal Processing 5 EMMDS 2009 July 3rd, 2009 From the K-means objective to Pairwise Clustering. Pairwise Clustering (Buhmann and Hofmann, 1994) is defined on a similarity matrix K; with K = X^T X it is equivalent to the K-means objective.
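The equations on this slide are images; the equivalence can be written in a standard form as follows (notation is mine: S is the binary assignment matrix with columns summing to one, M collects the cluster means, and s_k denotes the k-th row of S; the right-hand side follows by inserting the optimal means M = X S^T (S S^T)^{-1}).

```latex
\min_{M,\; S \in \{0,1\}^{K \times N},\; \mathbf{1}^\top S = \mathbf{1}^\top}
  \| X - M S \|_F^2
\;=\;
\operatorname{tr}(K) \;-\; \max_{S} \sum_{k=1}^{K}
  \frac{s_k^\top K\, s_k}{s_k^\top \mathbf{1}},
\qquad K = X^\top X .
```

Minimizing the K-means objective is thus the same as maximizing the size-normalized within-cluster similarity, which involves the data only through K.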

Informatics and Mathematical Modelling / Intelligent Signal Processing 6 EMMDS 2009 July 3rd, 2009 Although clustering is hard, there is room to be simple(x) minded! Binary Combinatorial (BC) vs. Simplicial Relaxation (SR).
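The two constraint sets are shown as figures on the slide; my reading of the contrast is the following (notation assumed, not copied from the slide): the binary combinatorial problem restricts each column of S to a vertex of the standard simplex, whereas the simplicial relaxation allows any point on the simplex.

```latex
% Binary combinatorial (BC): every observation hard-assigned to exactly one cluster
\mathcal{S}_{\mathrm{BC}} = \bigl\{ S \in \{0,1\}^{K \times N} \;:\; \mathbf{1}^\top S = \mathbf{1}^\top \bigr\}

% Simplicial relaxation (SR): each column of S lies on the probability simplex
\mathcal{S}_{\mathrm{SR}} = \bigl\{ S \in \mathbb{R}_{\geq 0}^{K \times N} \;:\; \mathbf{1}^\top S = \mathbf{1}^\top \bigr\}
```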

Informatics and Mathematical Modelling / Intelligent Signal Processing 7 EMMDS 2009 July 3rd, 2009 The simplicial relaxation (SR) admits standard continuous optimization for solving the pairwise clustering problem, for instance by normalization-invariant projected gradient ascent:
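A generic sketch of this idea in NumPy: gradient ascent on the relaxed pairwise clustering objective sum_k s_k^T K s_k / (s_k^T 1), projecting each column of S back onto the simplex after every step. The talk's normalization-invariant update is a refinement of this basic scheme; the step size, initialization and names below are assumptions for illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def sr_pairwise_clustering(K, n_clusters, n_iter=500, step=1e-2, seed=0):
    """K: N x N similarity matrix (e.g. K = X.T @ X). Returns relaxed assignments S."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    S = rng.random((n_clusters, N))
    S /= S.sum(axis=0, keepdims=True)                    # columns start on the simplex
    for _ in range(n_iter):
        SK = S @ K                                       # row k holds K s_k
        size = S.sum(axis=1, keepdims=True)              # s_k^T 1
        within = np.einsum('kn,kn->k', SK, S)[:, None]   # s_k^T K s_k
        grad = 2.0 * SK / size - within / size**2        # gradient of the relaxed objective
        S = np.apply_along_axis(project_to_simplex, 0, S + step * grad)
    return S
```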

Informatics and Mathematical Modelling / Intelligent Signal Processing 8 EMMDS 2009 July 3rd, 2009 Synthetic data example: K-means vs. SR-clustering. The brown and grey clusters each contain 1000 data points in R^2, whereas the remaining clusters each have 250 data points.

Informatics and Mathematical Modelling / Intelligent Signal Processing 9 EMMDS 2009 July 3rd, 2009 The SR-clustering algorithm is driven by high-density regions.

Informatics and Mathematical Modelling / Intelligent Signal Processing 10 EMMDS 2009 July 3rd, 2009 SR-clustering (init = 1), SR-clustering (init = 0.01) and Lloyd's K-means. The SR solutions are in general substantially better than those of Lloyd's algorithm at the same computational complexity.

Informatics and Mathematical Modelling / Intelligent Signal Processing 11 EMMDS 2009 July 3rd, 2009 Comparison across numbers of components (…, 50 components, 100 components): K-means, SR-clustering (init = 1), SR-clustering (init = 0.01).

Informatics and Mathematical Modelling / Intelligent Signal Processing 12 EMMDS 2009 July 3rd, 2009 SR-clustering for kernel-based semi-supervised learning (Basu et al., 2004; Kulis et al., 2005; Kulis et al., 2009). Kernel-based semi-supervised learning based on pairwise clustering.

Informatics and Mathematical Modelling / Intelligent Signal Processing 13 EMMDS 2009 July 3rd, 2009 The simplicial relaxation admits solving the problem as a (non-convex) continuous optimization problem.

Informatics and Mathematical Modelling / Intelligent Signal Processing 14 EMMDS 2009 July 3rd, 2009 Class labels can be handled by explicitly fixing the corresponding assignments. Must-link and cannot-link constraints can be absorbed into the kernel. Hence the problem reduces more or less to a standard SR-clustering problem for the estimation of S.
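A hedged sketch of how pairwise supervision can be absorbed into the kernel, in the spirit of the cited semi-supervised kernel clustering work (Kulis et al.): increase the similarity of must-linked pairs and decrease it for cannot-linked pairs, then run the unchanged SR-clustering routine on the modified kernel. The symmetric additive update and the single weight are my assumptions.

```python
import numpy as np

def constrained_kernel(K, must_links, cannot_links, weight=1.0):
    """K: N x N base kernel; *_links: iterables of index pairs (i, j)."""
    K = K.copy()
    for i, j in must_links:
        K[i, j] += weight   # reward placing i and j in the same cluster
        K[j, i] += weight
    for i, j in cannot_links:
        K[i, j] -= weight   # penalize placing i and j in the same cluster
        K[j, i] -= weight
    return K

# The modified kernel is then handed to the same SR-clustering routine, e.g.
# S = sr_pairwise_clustering(constrained_kernel(K, ml, cl), n_clusters=10)
```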

Informatics and Mathematical Modelling / Intelligent Signal Processing 15 EMMDS 2009 July 3rd, 2009 At stationarity, the gradients of the elements in each column of S that are 1 are larger than those of the elements that are 0. Thus, the impact of the supervision can be evaluated by estimating the minimal Lagrange multipliers that guarantee stationarity of the solution obtained by the SR-clustering algorithm; this is a convex optimization problem. The Lagrange multipliers thereby give a measure of conflict between the data and the supervision.

Informatics and Mathematical Modelling / Intelligent Signal Processing 16 EMMDS 2009 July 3rd, 2009 Digit classification with one mislabeled data observation from each class.

Informatics and Mathematical Modelling / Intelligent Signal Processing 17 EMMDS 2009 July 3rd, 2009 Community Detection in Complex Networks. Communities/modules: natural divisions of network nodes into densely connected subgroups (Newman & Girvan, 2003). A graph G(V,E) with adjacency matrix A is passed to a community detection algorithm; the clustering assignment S yields a permutation P of the graph, giving the permuted adjacency matrix PAP^T.

Informatics and Mathematical Modelling / Intelligent Signal Processing 18 EMMDS 2009 July 3rd, 2009 Common community detection objectives: the Hamiltonian (Fu & Anderson, 1986; Reichardt & Bornholdt, 2004) and modularity (Newman & Girvan, 2004). Both are generic problems of the form sketched below.
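The slide's equations are images; the standard forms of the two cited objectives are the following (notation mine: δ(c_i, c_j) = Σ_k S_ki S_kj indicates whether nodes i and j share a community, m is the number of edges, k_i the degree of node i, and γ and p_ij the resolution parameter and null-model link probability of the Hamiltonian).

```latex
% Modularity (Newman & Girvan, 2004)
Q(S) = \frac{1}{2m} \sum_{ij} \Bigl( A_{ij} - \frac{k_i k_j}{2m} \Bigr)\, \delta(c_i, c_j)

% Hamiltonian (Reichardt & Bornholdt, 2004)
\mathcal{H}(S) = -\sum_{i<j} \bigl( A_{ij} - \gamma\, p_{ij} \bigr)\, \delta(c_i, c_j)

% Both fit a generic quadratic assignment form over the assignment matrix S,
% e.g. B = A - \tfrac{1}{2m} k k^\top for the modularity case:
\max_{S} \ \sum_{k} s_k^\top B\, s_k
```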

Informatics and Mathematical Modelling / Intelligent Signal Processing 19 EMMDS 2009 July 3rd, 2009 Again we can make an exact relaxation to the simplex!

Informatics and Mathematical Modelling / Intelligent Signal Processing 20 EMMDS 2009 July 3rd, 2009

Informatics and Mathematical Modelling / Intelligent Signal Processing 21 EMMDS 2009 July 3rd, 2009

Informatics and Mathematical Modelling / Intelligent Signal Processing 22 EMMDS 2009 July 3rd, 2009 SR-clustering of complex networks: the quality of the solutions is comparable to results obtained by extensive Gibbs sampling.

Informatics and Mathematical Modelling / Intelligent Signal Processing 23 EMMDS 2009 July 3rd, 2009 So far we have demonstrated how binary combinatorial constraints are recovered at stationarity when relaxing the problems to the simplex. However, simplex constraints also hold promising data mining properties of their own!

Informatics and Mathematical Modelling / Intelligent Signal Processing 24 EMMDS 2009 July 3rd, 2009 The Convex Hull. Def: the convex hull/convex envelope of X ∈ R^(M×N) is the minimal convex set containing X. (Informally it can be described as a rubber band wrapped around the data points.) Finding the convex hull is solvable in linear time, O(N) (McCallum and Avis, 1979). However, the size of the convex hull grows exponentially with the dimensionality of the data, O(log^(M-1)(N)) (Dwyer, 1988). The Principal Convex Hull (PCH). Def: the best convex set of size K according to some measure of distortion D(·|·) (Mørup et al., 2009). (Informally it can be described as a less flexible rubber band that wraps most of the data points.)

Informatics and Mathematical Modelling / Intelligent Signal Processing 25 EMMDS 2009 July 3rd, 2009 C: Give the fraction in which observations in X are used to form each feature (distinct aspects/freaks). In general C will be very sparse!! S: Give the fraction each observation resembles each distinct aspects XC. (note when K large enough such that the PCH recover the convex hull) The mathematical formulation of the Principal Convex Hull (PCH) is given by two simplex constraints ”Principal” in terms of the Frobenius norm X  X C S

Informatics and Mathematical Modelling / Intelligent Signal Processing 26 EMMDS 2009 July 3rd, 2009 Relation between the PCH model, low-rank decompositions and clustering approaches: PCH naturally bridges clustering and low-rank approximation!
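One way to make this bridge explicit is to write the familiar models in the same factorization template and compare the constraints on the factors (a summary of standard facts, not copied from the slide):

```latex
\begin{aligned}
\text{SVD/PCA:} &\quad X \approx W S, && W, S \ \text{unconstrained (orthogonal factors)}\\
\text{NMF:}     &\quad X \approx W S, && W \geq 0,\ S \geq 0\\
\text{K-means:} &\quad X \approx M S, && S \in \{0,1\}^{K \times N},\ \mathbf{1}^\top S = \mathbf{1}^\top\\
\text{PCH/AA:}  &\quad X \approx X C S, && C,\ S \ \text{column-wise on the simplex}
\end{aligned}
```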

Informatics and Mathematical Modelling / Intelligent Signal Processing 27 EMMDS 2009 July 3rd, 2009 Two important properties of the PCH model: the PCH model is invariant to affine transformation and scaling, and the PCH model is unique up to permutation of the components.

Informatics and Mathematical Modelling / Intelligent Signal Processing 28 EMMDS 2009 July 3rd, 2009 A feature extraction example: the features have more contrast than those obtained by clustering approaches. As such, PCH aims for distinct aspects/regions in the data; the PCH model strives to attain Platonic "Ideal Forms".

Informatics and Mathematical Modelling / Intelligent Signal Processing 29 EMMDS 2009 July 3rd, 2009 PCH model for PET (Positron Emission Tomography) data. The data contain 3 components: high-binding regions, low-binding regions and non-binding regions. Each voxel is given as a concentration fraction of these regions (the aspects XC weighted by S).

Informatics and Mathematical Modelling / Intelligent Signal Processing 30 EMMDS 2009 July 3rd, 2009 NMF spectroscopy of samples of mixtures of propanol, butanol and pentanol.

Informatics and Mathematical Modelling / Intelligent Signal Processing 31 EMMDS 2009 July 3rd, 2009 Collaborative filtering example: medium-size and large-size MovieLens data. Medium size: 1,000,209 ratings of 3,952 movies by 6,040 users. Large size: 10,000,054 ratings of 10,677 movies given by 71,567 users.

Informatics and Mathematical Modelling / Intelligent Signal Processing 32 EMMDS 2009 July 3rd, 2009 Conclusion. The simplex offers unique data mining properties. Simplicial relaxations (SR) form exact relaxations of common hard assignment clustering problems, i.e. K-means, pairwise clustering and community detection in graphs. SR enables solving binary combinatorial problems using standard solvers from continuous optimization. The proposed SR-clustering algorithm outperforms traditional iterative refinement algorithms: there is no need for an annealing parameter, and hard assignments are guaranteed at stationarity (Theorems 1 and 2). Semi-supervised learning can be posed as a continuous optimization problem whose associated Lagrange multipliers give an evaluation measure of each supervised constraint.

Informatics and Mathematical Modelling / Intelligent Signal Processing 33 EMMDS 2009 July 3rd, 2009 Conclusion cont. The Principal Convex Hull (PCH) is formed by two types of simplex constraints and extracts distinct aspects of the data. It is relevant for data mining in general wherever low-rank approximation and clustering approaches have been invoked.

Informatics and Mathematical Modelling / Intelligent Signal Processing 34 EMMDS 2009 July 3rd, 2009 A reformulation of "Lex Parsimoniae": "Simplicity is the ultimate sophistication" becomes "Simplexity is the ultimate sophistication" (Leonardo da Vinci); "The simplest explanation is usually the best" becomes "The simplex explanation is usually the best" (William of Ockham). The presented work is described in: M. Mørup and L. K. Hansen, "An Exact Relaxation of Clustering", submitted to JMLR, 2009; M. Mørup, C. Walder and L. K. Hansen, "Simplicial Semi-supervised Learning", submitted; M. Mørup and L. K. Hansen, "Platonic Forms Revisited", submitted.