Thanh Le, Katheleen J. Gardiner University of Colorado Denver

Presentation transcript:

A validation method for fuzzy clustering: a biological problem of gene expression data
Thanh Le, Katheleen J. Gardiner
University of Colorado Denver
July 18th, 2011

Overview
  - Introduction: data clustering approaches and current challenges
  - fzBLE: a novel method for validation of clustering results
  - Datasets: artificial and real datasets for testing fzBLE
  - Experimental results
  - Discussion: advantages and limitations of fzBLE

Clustering problem
  - Genes are clustered based on:
      - Similarity
      - Dissimilarity
  - Clusters are described by:
      - Boundaries and overlaps
      - Number of clusters
      - Compactness within clusters
      - Separation between clusters

Clustering approaches
  - Hierarchical approach
  - Partitioning approach
      - Hard clustering approach
          - Crisp cluster boundaries
          - Crisp cluster membership
      - Soft/fuzzy clustering approach
          - Overlapping cluster boundaries
          - Soft/fuzzy membership
          - Appropriate for many real-world problems

Fuzzy C-Means algorithm
  - The model
  - Features:
      - Fuzzy membership, soft cluster boundaries
      - One gene can belong to multiple clusters and be assigned to multiple biological processes
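The formula for the model did not survive in the transcript. For orientation, the standard Fuzzy C-Means formulation from the general literature is given below; the symbols (n data points x_i, c cluster centers v_k, memberships u_ki, fuzzifier m > 1) are assumed notation, not necessarily the notation used on the slide.

```latex
\begin{aligned}
J_m(U, V) &= \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ki}^{m}\, \lVert x_i - v_k \rVert^{2},
\qquad \text{subject to } \sum_{k=1}^{c} u_{ki} = 1,\; u_{ki} \in [0, 1], \\
v_k &= \frac{\sum_{i=1}^{n} u_{ki}^{m}\, x_i}{\sum_{i=1}^{n} u_{ki}^{m}}, \\
u_{ki} &= \left[ \sum_{j=1}^{c} \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_j \rVert} \right)^{\frac{2}{m-1}} \right]^{-1}.
\end{aligned}
```

The constraint that memberships sum to 1 over clusters is what lets a single gene belong partially to several clusters.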

Fuzzy C-Means (contd.)
  - Possibility-based model
  - Model parameters are estimated using an iterative process
  - Rapid convergence
  - Most appropriate for gene expression data
  - Challenges:
      - Determining the number of clusters
      - Avoiding local optima
      - Assessing goodness-of-fit to validate clustering results
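The iterative estimation mentioned above can be sketched as follows. This is a minimal illustration of the textbook Fuzzy C-Means updates, assuming Euclidean distance, random initial memberships and a fuzzifier m = 2; the function name `fuzzy_c_means` and all parameter defaults are choices made for this sketch, not details taken from the slides.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Textbook FCM: returns memberships U (c x n) and centers V (c x d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships, normalized so each point's memberships sum to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Centers are membership-weighted means of the data.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Squared Euclidean distance from every center to every point, shape (c, n).
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

# Demo on two synthetic, well-separated groups (illustrative data only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
U, V = fuzzy_c_means(X, c=2)
print(U.argmax(axis=0))  # hard labels obtained from the fuzzy memberships
```

Because the result depends on the random start, running the sketch with several seeds and keeping the best objective value is one common way to reduce the risk of a poor local optimum.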

Methods for fuzzy clustering validation
  - Methods based on compactness and separation
      - Problem: over-fitting. The larger the number of clusters, the better the cluster index.
      - No rationale for how to scale the two factors in the model
  - Methods based on goodness of fit
      - Statistics approach
      - Expectation-Maximization (EM) method
          - Converges slowly, particularly at cluster boundaries, because of the exponential function
          - Inappropriate for real datasets because of the model's assumptions about the data distribution (Gaussian, chi-squared, ...)
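To make the compactness-and-separation idea concrete, here is a small sketch of one such index, the Xie and Beni (XB) index, in its commonly cited form: fuzzy within-cluster scatter divided by n times the smallest squared distance between two centers. The function name `xie_beni` and the toy data are assumptions made for illustration.

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Xie-Beni index: compactness / (n * minimum center separation). Lower is better."""
    V = np.asarray(V, dtype=float)
    # Compactness: membership-weighted squared distances of points to centers.
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    # Separation: smallest squared distance between two distinct centers.
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sep, np.inf)
    return float(compactness / (X.shape[0] * sep.min()))

# Toy example: four points, two centers, crisp memberships.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
V = np.array([[0.0, 0.5], [4.0, 0.5]])
U = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
print(xie_beni(X, U, V))  # 0.015625
```

The index is a ratio of two quantities measured on different scales, which is where the question of how to weight compactness against separation comes from.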

The fzBLE method for cluster validation
  - Cluster using the Fuzzy C-Means clustering algorithm
  - Validate using goodness-of-fit (the log-likelihood estimator) and a Bayesian approach

Cluster validation: goodness-of-fit and fuzzy clustering
  - Convert the possibility model into a probability model
  - Use a Bayesian approach to compute the statistics
  - Apply the Central Limit Theorem to represent the data distribution effectively
  - Select the model based on goodness-of-fit
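The slides do not give the formulas behind these steps, so the following is only a rough sketch of the general idea of scoring a fuzzy partition by a log-likelihood. It assumes, purely for illustration, that memberships are treated as cluster posteriors, that cluster priors are the mean memberships, and that each cluster is summarized by a membership-weighted Gaussian. None of these choices, nor the name `fuzzy_log_likelihood`, is confirmed by the source, and this should not be read as the fzBLE estimator itself.

```python
import numpy as np

def fuzzy_log_likelihood(X, U, reg=1e-6):
    """Log-likelihood of X under a mixture built from fuzzy memberships U (c x n).

    Assumption of this sketch: Gaussian components with membership-weighted
    means and covariances, and priors equal to the mean membership per cluster.
    """
    n, d = X.shape
    priors = U.mean(axis=1)
    log_terms = np.empty((U.shape[0], n))
    for k, w in enumerate(U):
        mu = (w @ X) / w.sum()
        diff = X - mu
        # Weighted covariance, slightly regularized to stay invertible.
        cov = (w[:, None] * diff).T @ diff / w.sum() + reg * np.eye(d)
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
        log_terms[k] = np.log(priors[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    # Numerically stable log-sum-exp over clusters, then sum over points.
    mx = log_terms.max(axis=0)
    return float((mx + np.log(np.exp(log_terms - mx).sum(axis=0))).sum())
```

Under these assumptions, a partition that matches the structure of the data should receive a higher score than an uninformative one, which is the property a goodness-of-fit criterion for model selection relies on.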

Datasets
  - Artificial datasets
      - Finite mixture model based datasets
  - Real datasets
      - Iris, Wine and Glass datasets from the UC Irvine Machine Learning Repository
      - Gene datasets, which are more complex:
          - Yeast cell cycle gene expression (Yeast)
          - Yeast gene functional annotations (Yeast-MIPS)
          - Rat Central Nervous System (RCNS) gene expression

Experimental results on artificial datasets
Correctness ratios in determining the number of clusters
Columns: # clusters, fzBLE, PC, PE, FS, XB, CWB, PBMF, BR, CF
Surviving values (row and column alignment not recoverable from the transcript): 3 1.00 0.42 0.83 0.00 4 0.92 5 0.75 6 0.58 7 0.67 8 9 0.33
Abbreviations:
  - PC: partition coefficient
  - PE: partition entropy
  - FS: Fukuyama-Sugeno
  - XB: Xie and Beni
  - CWB: Compose Within and Between scattering
  - PBMF: Pakhira, Bandyopadhyay and Maulik Fuzzy
  - BR: Rezaee B.
  - CF: compactness factor
Settings: loop = 5; number-of-clusters range = [2, 12]
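Two of the compared indices, the partition coefficient (PC) and partition entropy (PE), depend only on the membership matrix, so they are easy to illustrate. The sketch below uses their usual textbook definitions with the natural logarithm; the function names and the example matrix are assumptions for illustration.

```python
import numpy as np

def partition_coefficient(U):
    """PC = (1/n) * sum of squared memberships; 1 for a crisp partition, 1/c when fully fuzzy."""
    return float((U ** 2).sum() / U.shape[1])

def partition_entropy(U):
    """PE = -(1/n) * sum of u * ln(u); 0 for a crisp partition, ln(c) when fully fuzzy."""
    logU = np.log(np.clip(U, 1e-12, 1.0))  # clip so that 0 * log(0) evaluates to 0
    return float(-(U * logU).sum() / U.shape[1])

# Fully fuzzy memberships for two clusters and two points.
U_fuzzy = np.full((2, 2), 0.5)
print(partition_coefficient(U_fuzzy), partition_entropy(U_fuzzy))
```

A higher PC and a lower PE both indicate a crisper partition, so the two indices are read in opposite directions when choosing the number of clusters.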

Experimental results on Glass dataset
Algorithm cluster validity scores and decisions (decisions were highlighted in yellow on the slide)

#clusters       fzBLE      PC      PE       FS      XB         CWB    PBMF      BR      CF
        2  -1135.6886  0.8884  0.1776   0.3700  0.7222   6538.9311  0.3732  1.9817  0.5782
        3  -1127.6854  0.8386  0.2747   0.1081  0.7817   4410.3006  0.4821  1.5004  0.4150
        4  -1119.2457  0.8625  0.2515  -0.0630  0.6917   3266.5876  0.4463  1.0455  0.3354
        5  -1123.2826  0.8577  0.2698  -0.1978  0.6450   2878.8912  0.4610  0.8380  0.2818
        6  -1113.8339  0.8004  0.3865  -0.2050  1.4944   5001.1752  0.3400  0.8371  0.2430
        7  -1116.5724  0.8183  0.3650  -0.2834  1.3802   5109.6082  0.3891  0.6914  0.2214
        8  -1127.2626  0.8190  0.3637  -0.3948  1.4904   7172.2250  0.6065  0.5916  0.2108
        9  -1117.7484  0.8119  0.3925  -0.3583  1.7503   8148.7667  0.3225  0.5634  0.1887
       10  -1122.1585  0.8161  0.3852  -0.4214  1.7821   9439.3785  0.3909  0.4926  0.1758
       11  -1121.9848  0.8259  0.3689  -0.4305  1.6260   9826.4211  0.3265  0.4470  0.1704
       12  -1135.0453  0.8325  0.3555  -0.5183  1.4213  11318.4879  0.5317  0.3949  0.1591
       13  -1138.9462  0.8317  0.3556  -0.5816  1.4918  14316.7592  0.6243  0.3544  0.1472

Experimental results on RCNS (a more complex dataset; two-factor scaling issue)
Algorithm cluster validity scores and decisions (decisions were highlighted in yellow on the slide)

#clusters      fzBLE      PC      PE         FS      XB      CWB    PBMF      BR        CF
        2  -580.0728  0.9942  0.0121  -568.7972  0.0594   5.5107  4.2087  1.1107  177.8094
        3  -564.1986  0.9430  0.0942  -487.6104  0.4877   4.1309  4.2839  1.6634  117.9632
        4  -561.0169  0.9142  0.1470  -430.4863  0.9245   6.1224  3.3723  1.3184   99.1409
        5  -561.7420  0.8900  0.1941  -397.0935  1.3006   9.4770  2.6071  1.1669   88.5963
        6  -552.9153  0.8695  0.2387  -300.6564  2.5231  20.6496  1.9499  1.1026   84.0905
        7  -556.2905  0.8707  0.2386  -468.3121  2.1422  21.0187  2.8692  0.7875   57.5159
        8  -555.3507  0.8925  0.2078  -462.0673  1.7245  20.0113  2.5323  0.5894   52.0348
        9  -558.8686  0.8863  0.2192  -512.4278  1.6208  22.4772  2.6041  0.5019   45.9214
       10  -565.8360  0.8847  0.2241  -644.1451  1.1897  21.9932  3.4949  0.3918   33.1378

Dataset: 112 genes during RCNS development at 9 time points; 6 clusters, 4 of which are functionally annotated (Somogyi et al. 1995, Wen et al. 1998).
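The fzBLE columns of the Glass and RCNS tables can be read as a model-selection exercise. The sketch below assumes the decision rule is to pick the number of clusters with the highest log-likelihood score, which is the usual reading of a goodness-of-fit criterion but is not stated explicitly in the transcript; the values are copied from the two tables and `best_c` is a hypothetical helper name.

```python
# fzBLE scores per number of clusters, copied from the Glass and RCNS tables.
glass = {
    2: -1135.6886, 3: -1127.6854, 4: -1119.2457, 5: -1123.2826,
    6: -1113.8339, 7: -1116.5724, 8: -1127.2626, 9: -1117.7484,
    10: -1122.1585, 11: -1121.9848, 12: -1135.0453, 13: -1138.9462,
}
rcns = {
    2: -580.0728, 3: -564.1986, 4: -561.0169, 5: -561.7420, 6: -552.9153,
    7: -556.2905, 8: -555.3507, 9: -558.8686, 10: -565.8360,
}

def best_c(scores):
    """Return the number of clusters with the highest score (assumed decision rule)."""
    return max(scores, key=scores.get)

print(best_c(glass), best_c(rcns))  # 6 6
```

For RCNS this agrees with the 6 clusters reported for the dataset. Unlike an index that keeps improving as clusters are added, the score peaks and then declines on both datasets.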

Discussion: the advantages of fzBLE
  - Performs better than other approaches on three levels of data
  - Compared with compactness-separation approaches:
      - Solves the over-fit problem by using goodness-of-fit
      - Eliminates the need to scale the two factors
  - Compared with the mixture model with EM approach:
      - Rapid convergence
      - No assumption about the data distribution
  - Scaling the two factors, compactness and separation, is similar to scaling gene expression within each condition before clustering. The problems are:
      - The number of genes on each chip is known, whereas the number of clusters is not
      - The values across multiple experimental conditions are consistent (fc, log of fc, ...), whereas the values of the two factors are not

Discussion: the limitations of fzBLE
  - Depends on internal validity
  - External validities are needed
      - Biological validity: GO terms, pathways, PPI
  - Future work on gene expression:
      - Distance definition based on biological context
      - Combine fzBLE with biological homology and stability indices

Thank you! Questions?
We acknowledge the support from:
  - National Institutes of Health
  - Linda Crnic Institute
  - Vietnamese Ministry of Education and Training