Generating multidimensional embeddings based on fuzzy memberships

Generating multidimensional embeddings based on fuzzy memberships
Stefano Rovetta, Francesco Masulli, Maurizio Filippone
Department of Computer and Information Sciences, University of Genova

Introduction
Current data collection methods are high-throughput (a general trend across all disciplines). Data analysis research is oriented toward similarity-based methods that exploit mutual relationships between data items, for example kernel methods and spectral methods. These offer the potential for more powerful and more compact representations. The advantages depend on:
- the relationship between data cardinality and dimensionality
- the availability of efficient methods to exploit this data representation

The reference problem
We address gene expression analysis with DNA microarray experiments. As is well known, the typical features of this problem are:
- high dimensionality
- low cardinality
- high variability (noise)
We are interested in the explorative phase, typically based on cluster analysis (or simply clustering). The above features have prompted the development of efficient clustering techniques. We aim at improving these methods to obtain better-quality results and more powerful methods.

Outline of the talk
- Some problems in high dimensionality
- Similarity-based representations: a short review
- Fuzzy modeling of data collections
- Embedding in the space of memberships
- Membership Embedding as a spectral problem
- The probe selection problem and strategies
- Experiments and results
- Closing remarks

Some problems in high dimensionality
The two most widely used traditional clustering methods are:
- k-Means
- Hierarchical Agglomerative Clustering (HAC) variants
k-Means looks for data concentrations (approximations of mixture distributions) and may not work well when data are very sparse.

Some problems in high dimensionality
HAC variants use several linkage criteria based on the similarity structure (the data similarity matrix). A common problem is that they make no attempt at directly indicating clusters: data or clusters are progressively joined in pairs regardless of their density, and the actual clustering may be performed only afterward, by additional criteria (dendrogram depth distribution, cophenetic matrix analysis, agglomerative coefficients). This and other drawbacks have prompted the development of more sophisticated methods based on similarity structures.

Similarity-based representations
A similarity matrix has entries wij = similarity between data items i and j. Similarity has a suitable definition depending on the nature of the data; it may be derived from a metric or from other functions, or even given directly as input. Metric similarity matrices are symmetric and positive semidefinite. Data given directly as a similarity matrix may contain any kind of inconsistency (see the works by Buhmann et al.).

Similarity-based representations
Some solutions are:
- Hierarchical methods capable of actual clustering, e.g. the Farthest Neighbour Approach by Rovetta and Masulli
- Generalized similarity-based methods, e.g. the approach by Pekalska and Duin
- Kernel methods, where wij = k(xi, xj) with k( ) a positive semidefinite (Mercer) kernel function
- Spectral methods, where Wij = weight of the link connecting xi and xj on a complete graph built on all data points

Kernel methods
Kernel methods were originally adopted in pattern recognition to apply linear classification methods to nonlinearly separable problems (Support Vector Machines): k(xi, xj) = f(xi) · f(xj) for a suitable nonlinear mapping f( ) (possibly unavailable in explicit form). k( ) measures similarity in the mapped space f(x), since it is an inner product. Subsequently, several related problems have been tackled with the so-called kernel approach:
- Principal Component Analysis
- novelty detection (one-class classification)
- clustering
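To make the inner-product view concrete, here is a minimal Python sketch (not from the talk): a degree-2 polynomial kernel is chosen only because its feature map f( ) can be written out explicitly, unlike the Gaussian similarities used elsewhere in the presentation.

import numpy as np

def poly2_kernel(x, y):
    """k(x, y) = (x . y)^2, evaluated directly in the input space."""
    return float(np.dot(x, y)) ** 2

def poly2_features(x):
    """Explicit feature map f(x): the vector of all pairwise products xi * xj."""
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(poly2_kernel(x, y))                            # 2.25
print(np.dot(poly2_features(x), poly2_features(y)))  # 2.25, the same value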

Spectral methods
Spectral graph theory studies properties of the Laplacian spectrum of a graph, that is, the ordered set of the eigenvalues of the graph's Laplacian matrix. The Laplacian matrix is defined as L = D − W, where:
- W is the weight matrix (the adjacency matrix of the graph, with weights applied to the edges)
- D is the degree matrix, a diagonal matrix such that Dii is the sum of the edge weights incident on vertex i
A data set with a similarity matrix corresponds to a complete graph where:
- vertex i corresponds to data point xi
- edge weight Wij is the similarity between points xi and xj
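A minimal sketch of this construction, assuming a Gaussian similarity derived from the Euclidean metric (one common choice; the talk does not prescribe it here):

import numpy as np

def graph_laplacian(X, beta=1.0):
    """X: (n, d) data matrix. Returns W, D and L = D - W for the complete similarity graph."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-beta * sq_dists)      # Wij = similarity between xi and xj
    np.fill_diagonal(W, 0.0)          # no self-loops
    D = np.diag(W.sum(axis=1))        # Dii = sum of edge weights incident on vertex i
    return W, D, D - W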

Spectral methods
[Figure: the complete similarity graph built on the data points x1, ..., x5]

Spectral methods
L is not full rank and its first eigenvalue is zero. The multiplicity of the zero eigenvalue equals the number of connected components in the graph. For a similarity derived from a metric, the graph is undirected and L is symmetric (Lij = Lji), and the graph is complete (no zeroes in W or L), so there is only one connected component and eigenvalue 0 has multiplicity 1. Therefore, a spectral clustering problem corresponds to analyzing the first few eigenvalues and their associated eigenvectors, excluding the first one, to find the most strongly connected components, which are the clusters in the data. This procedure has computational disadvantages (it requires solving an eigenproblem) but can discover a wide range of cluster shapes.

Notes about spectral methods
The actual clustering is often performed as k-Means on the embedding of the data in the space of the first few eigenvectors. This has been proved to be theoretically sound: clusters are made more evident by this mapping. Spectral methods are a recent field and are currently under active study; several connections between spectral and kernel methods have recently been pointed out, in some cases even equivalences. We can also study various normalized Laplacians such as L = D^-1 (D − W) = I − D^-1 W. This generalized definition has some properties not found in the standard Laplacian.
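A minimal sketch of this recipe, assuming a precomputed weight matrix W and the availability of SciPy and scikit-learn; it uses the normalized Laplacian L = I − D^-1 W from the slide above and runs k-Means on the eigenvectors associated with the smallest eigenvalues.

import numpy as np
from scipy.linalg import eig
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """W: (n, n) similarity/weight matrix; k: number of clusters."""
    d = W.sum(axis=1)
    L = np.eye(len(W)) - W / d[:, None]   # L = I - D^-1 W
    vals, vecs = eig(L)                   # general eigensolver: D^-1 W need not be symmetric
    order = np.argsort(vals.real)
    U = vecs[:, order[:k]].real           # embedding on the first k eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)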

Fuzzy modeling of data collections
Let us come back to the problem of efficiently representing high-dimensional data for the purpose of clustering. Suppose we have a data set X = { xi }, X ⊆ 𝒳. 𝒳 is not required to be a vector space, but we need a similarity measure s on 𝒳. Let Y = { y1, ..., yc } be a set of c points in 𝒳. We can characterize a point x ∈ 𝒳 in terms of how well Y represents (approximates) x. We term the points y1, ..., yc probes.

Fuzzy modeling of data collections
We compute the similarity s(x, yj) for each point in Y using s(x, yj) = exp(−b ||x − yj||^2). These similarities can be organized into a vector u such that uj = s(x, yj).
INTERPRETATION: uj is the fuzzy membership of x to the fuzzy set yj. Membership to each point is a mutually exclusive concept, therefore we normalize these memberships to sum to 1: vj = uj / Σm um, so that Σj vj = 1.
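A minimal sketch of this membership computation; the parameter b follows the slide, while NumPy and the Euclidean norm are assumptions.

import numpy as np

def memberships(x, probes, b=1.0):
    """uj = exp(-b * ||x - yj||^2), normalized so that the resulting vj sum to 1."""
    u = np.exp(-b * np.sum((probes - x) ** 2, axis=1))   # probes: (c, d), x: (d,)
    return u / u.sum()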

Embedding in the space of memberships
We have now defined the matrix V such that Vij = normalized membership of data point xi to probe yj. This is a new representation of the data X, embedded in the space of memberships to the probes Y. It is a similarity-based representation. If the data live in a vector space, V has the advantage of being potentially lower-dimensional (c instead of the original data dimension d, with c ≤ d). Even if the data do not live in a vector space, V is always a vector representation.

Clustering in the Membership Embedding space
- Select a set Y of c probes from the data points in X
- Map X (n × d) into V (n × c)
- Apply your preferred clustering algorithm to the mapped dataset V
Why should this method give better clusters? One reason is simply lower dimensionality, which helps when applying methods like k-Means. There is also a sounder reason, discussed next.
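A minimal sketch of this pipeline under stated assumptions: the probes are drawn at random here purely for illustration (the talk selects them by simulated annealing, described later), and k-Means from scikit-learn stands in for the Fuzzy c-Means used in the experiments.

import numpy as np
from sklearn.cluster import KMeans

def membership_embedding(X, probe_idx, b=1.0):
    """Map X (n x d) into V (n x c): Vij = normalized membership of xi to probe yj."""
    Y = X[probe_idx]                                          # the c probes
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    U = np.exp(-b * sq)
    return U / U.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))                          # n = 100, d = 2000 (microarray-like shape)
probe_idx = rng.choice(len(X), size=10, replace=False)    # c = 10 probes, random only for illustration
V = membership_embedding(X, probe_idx)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(V)   # cluster in the c-dimensional space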

Membership Embedding as a spectral approach
It can be shown that our membership embedding V can be used to define the Laplacian of a data graph. This is not a complete graph, but one where only the probes have edges to every point in the set; points which are not probes are linked only to probes.
[Figure: the reduced graph on x1, ..., x5, with probes y1 and y2 linked to every point]

Membership Embedding as a spectral approach
By adding zero entries in appropriate locations, it is possible to use the entries of V to build a normalized Laplacian such that:
- Lik = Vij for all j such that yj coincides with xk
- Lii = 1 for all i
- Lik = 0 in all other cases
This is the Laplacian of the non-complete graph seen earlier. Therefore, by appropriate selection of the probes, this reduced-graph spectral approach is exactly equivalent to the membership embedding approach.
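A sketch that transcribes the entry-by-entry rule above literally; probe_idx is a hypothetical mapping from probe index j to the index k of the coinciding data point, and the sign convention is exactly as stated on the slide, so treat this as illustrative rather than a verified implementation.

import numpy as np

def reduced_graph_laplacian(V, probe_idx, n):
    """V: (n, c) membership matrix; probe_idx[j] = k such that probe yj coincides with xk."""
    L = np.zeros((n, n))
    for j, k in enumerate(probe_idx):
        L[:, k] = V[:, j]          # Lik = Vij when xk is the probe yj
    np.fill_diagonal(L, 1.0)       # Lii = 1
    return L                       # every other entry stays 0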

The probe selection problem and strategies
Analytic selection of the probes is an open problem. It is a combinatorial problem with a search space of exponential size (exactly 2^n) when the optimal number c of probes is not given, as is usually the case. The cost function is also computationally demanding, since we are asking for a matrix that separates the clusters well; this condition can be stated in terms of eigenvalues, which requires solving an eigenproblem.

Simulated annealing approach
The vector g is the state of a formal physical system; the indicator variable gi selects point xi as a probe.
Energy: E = ε + λc, where ε is a clustering quality measure (obtained by experimental validation) and λc is a complexity penalty with parameter λ.
State transition: some components of g are switched at random, 1 → 0 or 0 → 1; the number of switchings is bounded by parameters.
Probability of transition: governed by the annealing temperature T, which is gradually reduced until convergence.
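A minimal sketch of such an annealing loop. The energy E = ε + λc and the flip moves follow the slide; the Metropolis acceptance rule and the geometric cooling schedule are assumptions, since the slide only says that T is gradually reduced, and clustering_quality is a hypothetical placeholder for the experimentally validated quality measure ε.

import numpy as np

def anneal_probes(X, clustering_quality, lam=0.01, T0=1.0, cooling=0.95, steps=500,
                  max_flips=3, seed=0):
    """Search for a probe subset minimizing E = epsilon + lambda * c."""
    rng = np.random.default_rng(seed)
    n = len(X)
    g = rng.integers(0, 2, size=n)                 # indicator vector: gi = 1 selects xi as a probe
    def energy(state):
        return clustering_quality(X, np.flatnonzero(state)) + lam * state.sum()
    E, T = energy(g), T0
    for _ in range(steps):
        cand = g.copy()
        flips = rng.choice(n, size=rng.integers(1, max_flips + 1), replace=False)
        cand[flips] ^= 1                           # switch some components 0 <-> 1 at random
        if cand.sum() == 0:                        # keep at least one probe
            continue
        E_new = energy(cand)
        if E_new < E or rng.random() < np.exp(-(E_new - E) / T):  # Metropolis acceptance (assumed)
            g, E = cand, E_new
        T *= cooling                               # gradually reduce the temperature
    return np.flatnonzero(g)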

Simulated annealing approach

Experiments and results
Datasets: Leukemia by Golub et al.
Clustering algorithm: Fuzzy c-Means
Experiment setup: compare the clustering results obtained with different embeddings:
- original data space
- distance embedding (Euclidean distance from the probes)
- membership embedding (as described earlier)
Evaluation: ε = Representation Error. We label each cluster with the majority class found among its points and compute the number of mismatches (points in clusters having a different class).
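A minimal sketch of this mismatch count, assuming integer class labels:

import numpy as np

def mismatches(cluster_labels, true_classes):
    """Label each cluster with its majority class; count points whose class differs from it."""
    errors = 0
    for c in np.unique(cluster_labels):
        members = true_classes[cluster_labels == c]     # true classes of the points in cluster c
        majority = np.bincount(members).argmax()        # majority class in this cluster
        errors += int(np.sum(members != majority))
    return errors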

Experiments and results Comparative performance during training

Experiments and results Final results: comparative performance and values of b

Experiments and results Result of the probe selection process

Experiments and results Dependence of RE on the Fuzzy c-Means fuzziness parameter m

Closing remarks
- The Membership Embedding method effectively reduces the dimensionality of the data to be clustered
- It provides good experimental results in clustering tasks
- Equivalence with a spectral approach has been proved, which may account for the measured improvement in performance
- Probe selection is done by simulated annealing
- Further work will address other techniques for probe selection and their algorithmic characterization
- More experiments are planned (interesting problems to solve, anyone?)