Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yu, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.


Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yu, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007

Outline Introduction Methods Experiment Conclusion

Introduction Class discovery consists of two steps: – A clustering algorithm is adopted to partition the samples into K parts. – A cluster validity index is applied to determine the optimal value of K. For the class discovery problem, we focus on discovering the underlying classes of the samples.

Introduction Recently, researchers have been paying increasing attention to class discovery based on consensus clustering approaches. These consist of two major steps: – Generating a cluster ensemble based on a clustering algorithm. – Finding a consensus partition based on this ensemble.

Introduction Consensus clustering approaches fall into five types: 1) Using different clustering algorithms as the basic clustering algorithms to obtain different solutions. 2) Using random initializations of a single clustering algorithm. 3) Sub-sampling, re-sampling or adding noise to the original data. 4) Using selected subsets of features. 5) Using different K values to generate different clustering solutions.

Methods The approach in this paper belongs to type 4: the cluster ensemble is generated using different gene subsets. The method is called graph-based consensus clustering (GCC).

Methods Overview of the framework for the GCC algorithm: Subspace generation – Subspace clustering – Cluster ensemble – Cluster discovery

The framework for the GCC algorithm The framework: [framework diagram]

Subspace generation A constant τ, which represents the number of genes in the subspace, is generated by: τ = ⌈τ_min + δ · (τ_max − τ_min)⌉ where δ is a uniform random variable on [0, 1], and τ_min ≤ τ ≤ τ_max ≤ n, where n is the total number of genes.

Subspace generation Then, it selects the gene one by one until genes are obtained. The index of each randomly selected gene is determined as: where denotes the hth gene, and is a uniform random variable. 11

Subspace generation Finally, the τ randomly selected genes are used to construct a subspace. [Diagram: the genes of one sample, with τ genes randomly selected]
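The random subspace step above can be sketched in Python. This is an illustrative sketch, not the paper's code: the function name and the concrete bounds tau_min and tau_max are assumptions.

```python
import numpy as np

def generate_subspace(X, tau_min, tau_max, rng):
    """Randomly select a gene subspace from X (rows: samples, columns: genes)."""
    n_genes = X.shape[1]
    delta = rng.random()                        # uniform random variable in [0, 1)
    tau = int(np.ceil(tau_min + delta * (tau_max - tau_min)))
    # draw gene indices one by one until tau distinct genes are obtained
    chosen = set()
    while len(chosen) < tau:
        chosen.add(int(np.floor(n_genes * rng.random())))
    idx = sorted(chosen)
    return X[:, idx], idx

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))                  # 20 samples, 100 genes
Xs, idx = generate_subspace(X, tau_min=10, tau_max=50, rng=rng)
```

Rejection of duplicate indices keeps the τ selected genes distinct, matching the "one by one until τ genes are obtained" description.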

Subspace clustering In the selected subspace, GCC performs two clustering approaches: – Correlation clustering (correlation analysis, then graph partition) – K-means

Correlation clustering Correlation analysis: calculate the m × m correlation matrix (CM), where m is the number of samples, with entries CM_ij = Corr(x_i, x_j) where x_i and x_j denote the ith and jth samples and Corr is the Pearson correlation coefficient.
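A minimal sketch of the correlation-analysis step with NumPy; the data and variable names here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 30))    # 6 samples described by 30 genes

# Pearson correlation between every pair of samples:
# np.corrcoef treats each row of X as one variable (one sample here).
CM = np.corrcoef(X)
```

The result is a symmetric 6 × 6 matrix with ones on the diagonal.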

Correlation clustering Graph partition: use the normalized cut algorithm to partition the samples into K classes based on the CM. A graph can be constructed whose vertices correspond to samples and whose edge weights are the correlations between the samples (i.e. the entries of CM).

Correlation clustering “Normalized cuts” was proposed by Shi and Malik (CVPR 1997) as an image segmentation method: – Pixels as vertices. – Similarities between pixels as edge weights.

Correlation clustering As in the normalized cuts method, we can find the label vector by solving the generalized eigenvalue problem: (D − W) y = λ D y where W is the m × m correlation matrix and D is an m × m diagonal matrix whose diagonal entries are d_i = Σ_j W_ij. The label vector is obtained from the eigenvector with the second-smallest eigenvalue.
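A small sketch of a two-way normalized cut built on this eigenproblem. The helper name is made up, and the sketch assumes a nonnegative weight matrix; a correlation matrix with negative entries would need to be shifted or rescaled first, which is an assumption beyond the slide.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    """Two-way normalized cut: solve (D - W) y = lam * D y and split
    the samples on the second-smallest generalized eigenvector."""
    D = np.diag(W.sum(axis=1))
    vals, vecs = eigh(D - W, D)        # generalized symmetric eigenproblem
    y = vecs[:, 1]                     # eigenvector of the 2nd-smallest eigenvalue
    return (y > np.median(y)).astype(int)

# two blocks of mutually similar samples, weakly connected to each other
W = np.full((6, 6), 0.05)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
labels = ncut_bipartition(W)
```

For K > 2 classes, the same idea is applied recursively or with more eigenvectors plus a discretization step.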

K-means K-means minimizes the total intra-cluster variance, i.e. the squared error function: J = Σ_{k=1..K} Σ_{x ∈ C_k} ||x − μ_k||² where μ_k is the center of cluster C_k.
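A compact, self-contained version of Lloyd's algorithm minimizing this objective; a generic sketch, not the paper's implementation.

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Plain Lloyd's algorithm for the within-cluster squared-error objective."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # recompute each occupied cluster's center
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    sse = ((X - centers[labels]) ** 2).sum()   # the objective J
    return labels, sse

X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 10])
labels, sse = kmeans(X, K=2)
```

On these two well-separated point groups the objective is driven to zero.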

Subspace clustering After obtaining the predicted labels, the adjacency matrix M is constructed from the labels, with elements defined as: M_ij = 1 if ℓ(i) = ℓ(j), and M_ij = 0 otherwise, where ℓ(i) and ℓ(j) denote the predicted labels of the ith and jth samples.
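Building this adjacency matrix from a label vector is one line with NumPy broadcasting; the labels below are illustrative.

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2])
# M[i, j] = 1 when samples i and j receive the same predicted label
M = (labels[:, None] == labels[None, :]).astype(float)
```

The matrix is symmetric with ones on the diagonal, since every sample shares a label with itself.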

Cluster ensemble For each K, GCC repeats the above two steps B times and obtains: – B clustering solutions – B adjacency matrices GCC then constructs a consensus matrix M^K by merging the B adjacency matrices: M^K = (1/B) Σ_{b=1..B} M^{K,b} where the entry M^K_ij represents the probability that samples i and j are in the same class.
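Merging the B adjacency matrices into a consensus matrix can be sketched as follows; random stand-in labelings replace the real subspace clusterings here.

```python
import numpy as np

rng = np.random.default_rng(2)
B, m = 10, 5
adjacency = []
for _ in range(B):                       # B clustering solutions
    labels = rng.integers(0, 2, size=m)  # stand-in for one subspace clustering
    adjacency.append((labels[:, None] == labels[None, :]).astype(float))

# consensus matrix: entry (i, j) estimates the probability that
# samples i and j fall in the same class across the ensemble
consensus = np.mean(adjacency, axis=0)
```

Every entry lies in [0, 1], and the diagonal is exactly 1.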

Cluster ensemble Then, GCC constructs a graph from M^K and applies the normalized cuts method to it. This yields the clustering result when the number of clusters is K.

Cluster discovery Define an aggregated consensus matrix M_agg as the average of the consensus matrices over all K: M_agg = (1/(K_max − 1)) Σ_{K=2..K_max} M^K Then, GCC converts it to a binary matrix B_agg by thresholding its entries (e.g. an entry becomes 1 when it is at least 0.5, and 0 otherwise). In the same way, GCC converts each M^K to a binary matrix B^K.
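The aggregation and binarization steps can be sketched as follows; the consensus matrices are random stand-ins, and the 0.5 threshold is an assumption for illustration.

```python
import numpy as np

# consensus matrices for K = 2 .. K_max (random stand-ins here)
rng = np.random.default_rng(3)
consensus_by_K = {K: rng.random((4, 4)) for K in range(2, 6)}

# aggregated consensus matrix: average over all K
agg = np.mean(list(consensus_by_K.values()), axis=0)

# binarize: pairs that co-cluster in most runs become 1
agg_bin = (agg >= 0.5).astype(int)
bin_by_K = {K: (Mk >= 0.5).astype(int) for K, Mk in consensus_by_K.items()}
```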

Cluster discovery To decide the proper value of K, GCC compares each clustering result with the aggregated matrix. It uses a Modified Rand Index of the form: ζ_K = Agreement(B^K, B_agg) − Penalty(K) where the first term is the degree of agreement between B^K and B_agg, and the second is a penalty term for a large set of clusters.

Cluster discovery The optimal number of classes is selected as: K* = argmax_{2 ≤ K ≤ K_max} ζ_K It considers the relationship between each clustering solution and the average clustering solution.

Experiment Experiment setting Relationship between ARI and ζ Experiment results

Experiment setting Four combinations of algorithms are compared: – GCC-corr (GCC with correlation clustering) – GCC-Kmeans (GCC with K-means) – CC-HC (CC with hierarchical clustering with average linkage) – CC-SOM (CC with self-organizing maps) Consensus Clustering (CC), proposed by Monti et al. in 2003, is a type 3 (re-sampling) consensus clustering algorithm.

Experiment setting Parameter settings and datasets: [tables of the parameter values and of the datasets]

Experiment setting Adjusted Rand Index (ARI): ARI = (Index − ExpectedIndex) / (MaxIndex − ExpectedIndex) = (Σ_{k,i} C(n_{ki}, 2) − [Σ_k C(a_k, 2) Σ_i C(b_i, 2)] / C(n, 2)) / (½ [Σ_k C(a_k, 2) + Σ_i C(b_i, 2)] − [Σ_k C(a_k, 2) Σ_i C(b_i, 2)] / C(n, 2)) where C(·, 2) denotes the binomial coefficient, a_k is the number of samples in the kth class of the true partition, b_i is the number of samples in the ith class of the predicted partition, and n_{ki} is the number of samples shared by both.
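The ARI can be computed directly from the contingency table of the two partitions; a self-contained sketch (scipy.special.comb supplies the binomial coefficients, and the function name is illustrative):

```python
import numpy as np
from scipy.special import comb

def adjusted_rand_index(true, pred):
    """ARI = (index - expected index) / (max index - expected index)."""
    true, pred = np.asarray(true), np.asarray(pred)
    n = len(true)
    _, rows = np.unique(true, return_inverse=True)
    _, cols = np.unique(pred, return_inverse=True)
    table = np.zeros((rows.max() + 1, cols.max() + 1), dtype=int)
    for r, c in zip(rows, cols):
        table[r, c] += 1                      # n_ki: samples shared by class k, cluster i
    index = comb(table, 2).sum()              # the "real" index
    a = comb(table.sum(axis=1), 2).sum()      # pairs inside true classes
    b = comb(table.sum(axis=0), 2).sum()      # pairs inside predicted clusters
    expected = a * b / comb(n, 2)
    max_index = (a + b) / 2
    return (index - expected) / (max_index - expected)
```

ARI is 1 for identical partitions (up to label renaming) and near 0 for random agreement.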

Relationship between ARI and ζ The change of ARI with respect to different K: [figure]

Relationship between ARI and ζ The change of ζ with respect to different K: [figure]

Relationship between ARI and ζ The correlation analysis of ARI and ζ: [table] The degree of dependence between ARI and ζ is high.

Experiment results Estimated optimal K values by different approaches: [table, with the ground-truth K and error terms]

Experiment results The corresponding values of ARI: [table] The GCC approaches outperform the CC approaches.

Experiment results The effect of the maximum K value: [figure] When K_max increases, GCC-corr still correctly estimates the number of clusters on the Synthetic2 dataset.

Experiment results The effect of the maximum K value: [figure] When K_max increases, GCC-corr still correctly estimates the number of clusters on the Leukemia dataset.

Experiment results The effect of the maximum K value: [figure] ζ decreases slightly when K_max increases. ARI is not affected when K_max increases.

Conclusion This paper proposes a new framework, known as GCC, to discover the classes of the samples in gene expression data. GCC successfully estimates the true number of classes for the datasets in the experiments.