A validation method for fuzzy clustering: a biological problem of gene expression data
Thanh Le, Katheleen J. Gardiner
University of Colorado Denver
July 18th, 2011
Overview
- Introduction: data clustering approaches and current challenges
- fzBLE: a novel method for validation of clustering results
- Datasets: artificial and real datasets for testing fzBLE
- Experimental results
- Discussion: advantages and limitations of fzBLE
Clustering problem
- Genes are clustered based on
  - Similarity
  - Dissimilarity
- Clusters are described by
  - Boundaries and overlaps
  - Number of clusters
  - Compactness within clusters
  - Separation between clusters
Clustering approaches
- Hierarchical approach
- Partitioning approach
- Hard clustering approach
  - Crisp cluster boundaries
  - Crisp cluster membership
- Soft/fuzzy clustering approach
  - Overlapping cluster boundaries
  - Soft/fuzzy membership
  - Appropriate for many real-world problems
Fuzzy C-Means algorithm
- The model
- Features:
  - Fuzzy membership, soft cluster boundaries
  - One gene can belong to multiple clusters and be assigned to multiple biological processes
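For reference, the standard Fuzzy C-Means formulation, which the model on this slide is assumed to follow, can be written as below. The symbols are assumptions introduced here for illustration: $x_i$ is the expression profile of gene $i$, $v_k$ the center of cluster $k$, $u_{ki}$ the membership of gene $i$ in cluster $k$, and $m > 1$ the fuzzifier.

```latex
\begin{aligned}
J_m(U, V) &= \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ki}^{m}\, \lVert x_i - v_k \rVert^{2},
\qquad \text{subject to } \sum_{k=1}^{c} u_{ki} = 1,\quad u_{ki} \in [0, 1], \\
v_k &= \frac{\sum_{i=1}^{n} u_{ki}^{m}\, x_i}{\sum_{i=1}^{n} u_{ki}^{m}}, \qquad
u_{ki} = \left[ \sum_{j=1}^{c} \left( \frac{\lVert x_i - v_k \rVert}{\lVert x_i - v_j \rVert} \right)^{\frac{2}{m-1}} \right]^{-1}.
\end{aligned}
```

The first line is the objective being minimized; the second line gives the alternating updates for the centers and the memberships.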
Fuzzy C-Means (contd.)
- Possibility-based model
- Model parameters estimated using an iterative process
- Rapid convergence
- Most appropriate for gene expression data
- Challenges:
  - Determining the number of clusters
  - Avoiding local optima
  - Assessing goodness of fit to validate clustering results
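The iterative estimation mentioned above can be sketched as follows. This is a minimal sketch of textbook Fuzzy C-Means, not the authors' code; the function name `fuzzy_c_means`, its parameters, and the toy data are assumptions made for illustration. Memberships are stored with one row per data point.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=300, tol=1e-6, seed=0):
    """Textbook FCM sketch. Returns (V, U): centers (c, d), memberships (n, c)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)

    def centers(U):
        # Centers are membership-weighted means of the data.
        Um = U ** m
        return (Um.T @ X) / Um.sum(axis=0)[:, None]

    for _ in range(max_iter):
        V = centers(U)
        # Squared distance from every point to every center, shape (n, c).
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers(U), U

# Toy data (hypothetical): two well-separated blobs in 2-D.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                    rng.normal(5.0, 0.3, (50, 2))])
V_demo, U_demo = fuzzy_c_means(X_demo, c=2)
print(np.round(V_demo[np.argsort(V_demo[:, 0])], 1))
```

Because the result depends on the random initial memberships, rerunning with several `seed` values and keeping the best run is a common way to reduce the risk of a poor local optimum.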
Methods for fuzzy clustering validation
- Methods based on compactness and separation
  - Problem: over-fitting; the larger the number of clusters, the better the cluster index
  - No rationale for how to scale the two factors in the model
- Methods based on goodness of fit
  - Statistical approach
  - Expectation-Maximization (EM) method
    - Converges slowly, particularly at cluster boundaries, because of the exponential function
    - Inappropriate for real datasets because the model assumes a data distribution (Gaussian, chi-squared, …)
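To make the compactness-and-separation family concrete, here is a minimal sketch of three widely used indices from it: the partition coefficient (PC), the partition entropy (PE), and the Xie-Beni index (XB), in their usual textbook forms. The function names and the tiny example are assumptions for illustration; memberships `U` have shape (n, c).

```python
import numpy as np

def partition_coefficient(U):
    """PC: 1.0 for a crisp partition, 1/c when all memberships are equal."""
    return float((U ** 2).sum() / U.shape[0])

def partition_entropy(U, eps=1e-12):
    """PE: 0.0 for a crisp partition, log(c) when all memberships are equal."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[0])

def xie_beni(X, V, U, m=2.0):
    """XB: fuzzy compactness divided by n times the minimum squared center gap."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    compactness = ((U ** m) * d2).sum()
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sep, np.inf)  # ignore the distance of a center to itself
    return float(compactness / (X.shape[0] * sep.min()))

# Tiny hypothetical example: four points on a line, two crisp clusters.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
V_toy = np.array([[0.5, 0.0], [4.5, 0.0]])
U_crisp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
U_flat = np.full((4, 2), 0.5)

pc_crisp = partition_coefficient(U_crisp)
pc_flat = partition_coefficient(U_flat)
pe_flat = partition_entropy(U_flat)
xb_toy = xie_beni(X_toy, V_toy, U_crisp)
```

XB shows the two-factor structure directly: a compactness term in the numerator and a separation term in the denominator, with no principled rule for weighting one against the other.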
The fzBLE method for cluster validation
- Cluster using the Fuzzy C-Means clustering algorithm
- Validate using goodness of fit (the log-likelihood estimator) and a Bayesian approach
Cluster validation: goodness of fit and fuzzy clustering
- Convert the possibility model into a probability model
- Use a Bayesian approach to compute the statistics
- Apply the Central Limit Theorem to effectively represent the data distribution
- Model selection based on goodness of fit
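One way to read the steps above is sketched below. This is an illustrative interpretation under stated assumptions, not the authors' fzBLE estimator: cluster priors are taken as average memberships, each cluster density is approximated by a diagonal Gaussian with membership-weighted mean and variance (a choice motivated here by the Central Limit Theorem), and the score is the mixture log-likelihood. The function name `fuzzy_log_likelihood` and the toy data are hypothetical.

```python
import numpy as np

def fuzzy_log_likelihood(X, U, var_floor=1e-6):
    """Log-likelihood of X under a mixture derived from fuzzy memberships U (n, c)."""
    # Assumption: prior of cluster k is its average membership.
    priors = np.maximum(U.mean(axis=0), 1e-300)          # (c,)
    # Normalize memberships per cluster so they act as weights over points.
    w = U / U.sum(axis=0, keepdims=True)                 # (n, c)
    means = w.T @ X                                      # (c, d)
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2     # (n, c, d)
    # Membership-weighted variance per cluster and dimension.
    variances = np.einsum("ik,ikd->kd", w, diff2) + var_floor
    # Assumption: diagonal Gaussian density for every cluster.
    log_pdf = -0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                      + diff2 / variances[None, :, :]).sum(axis=2)  # (n, c)
    log_joint = np.log(priors)[None, :] + log_pdf
    # Numerically stable log-sum-exp over clusters, then sum over points.
    mx = log_joint.max(axis=1, keepdims=True)
    return float((mx[:, 0] + np.log(np.exp(log_joint - mx).sum(axis=1))).sum())

# Toy comparison (hypothetical data): two blobs scored with 1 and 2 clusters.
rng = np.random.default_rng(0)
X_ll = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                  rng.normal(5.0, 0.3, (50, 2))])
U_two = np.zeros((100, 2))
U_two[:50, 0] = 1.0
U_two[50:, 1] = 1.0
U_one = np.ones((100, 1))
ll_two = fuzzy_log_likelihood(X_ll, U_two)
ll_one = fuzzy_log_likelihood(X_ll, U_one)
```

In this sketch the partition that matches the two blobs receives the higher log-likelihood, which is the behavior a goodness-of-fit criterion is meant to reward.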
Datasets
- Artificial datasets
  - Finite mixture model based datasets
- Real datasets
  - Iris, Wine and Glass datasets from the UC Irvine Machine Learning Repository
  - Gene datasets, which are more complex
    - Yeast cell cycle gene expression (Yeast)
    - Yeast gene functional annotations (Yeast-MIPS)
    - Rat Central Nervous System (RCNS) gene expression
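An artificial dataset based on a finite mixture model could be generated along the following lines. The slide does not say which component distributions or parameters were used, so the Gaussian components, the function name `sample_gaussian_mixture`, and every number below are assumptions for illustration only.

```python
import numpy as np

def sample_gaussian_mixture(n, means, stds, weights, seed=0):
    """Draw n points from a mixture of spherical Gaussians; returns (X, labels)."""
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)   # (c, d) component centers
    stds = np.asarray(stds, dtype=float)     # (c,) one spread per component
    # Pick a component for every point according to the mixing weights.
    labels = rng.choice(len(weights), size=n, p=weights)
    # Add component-specific Gaussian noise around the chosen center.
    noise = rng.normal(size=(n, means.shape[1])) * stds[labels][:, None]
    return means[labels] + noise, labels

# Hypothetical three-component mixture in 2-D.
X_mix, y_mix = sample_gaussian_mixture(
    300,
    means=[[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]],
    stds=[0.5, 0.8, 0.3],
    weights=[0.5, 0.3, 0.2],
)
```

Because the true number of components is known by construction, such data allow a validity index to be scored on how often it recovers that number.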
Experimental results on artificial datasets
Correctness ratios in determining the number of clusters
# clusters fzBLE PC PE FS XB CWB PBMF BR CF
3 1.00 0.42 0.83 0.00 4 0.92 5 0.75 6 0.58 7 0.67 8 9 0.33
Abbreviations: PC, partition coefficient; PE, partition entropy; FS, Fukuyama-Sugeno; XB, Xie and Beni; CWB, Compose Within and Between scattering; PBMF, Pakhira, Bandyopadhyay and Maulik Fuzzy; BR, Rezaee B.; CF, compactness factor
Settings: loop=5, # cluster range=[2,12]
Experimental results on Glass dataset
Cluster validity scores and decisions by algorithm (decisions highlighted in yellow on the slide)
# clusters  fzBLE       PC      PE      FS       XB      CWB         PBMF    BR      CF
2           -1135.6886  0.8884  0.1776  0.3700   0.7222  6538.9311   0.3732  1.9817  0.5782
3           -1127.6854  0.8386  0.2747  0.1081   0.7817  4410.3006   0.4821  1.5004  0.4150
4           -1119.2457  0.8625  0.2515  -0.0630  0.6917  3266.5876   0.4463  1.0455  0.3354
5           -1123.2826  0.8577  0.2698  -0.1978  0.6450  2878.8912   0.4610  0.8380  0.2818
6           -1113.8339  0.8004  0.3865  -0.2050  1.4944  5001.1752   0.3400  0.8371  0.2430
7           -1116.5724  0.8183  0.3650  -0.2834  1.3802  5109.6082   0.3891  0.6914  0.2214
8           -1127.2626  0.8190  0.3637  -0.3948  1.4904  7172.2250   0.6065  0.5916  0.2108
9           -1117.7484  0.8119  0.3925  -0.3583  1.7503  8148.7667   0.3225  0.5634  0.1887
10          -1122.1585  0.8161  0.3852  -0.4214  1.7821  9439.3785   0.3909  0.4926  0.1758
11          -1121.9848  0.8259  0.3689  -0.4305  1.6260  9826.4211   0.3265  0.4470  0.1704
12          -1135.0453  0.8325  0.3555  -0.5183  1.4213  11318.4879  0.5317  0.3949  0.1591
13          -1138.9462  0.8317  0.3556  -0.5816  1.4918  14316.7592  0.6243  0.3544  0.1472
Experimental results on RCNS: a more complex dataset; two-factor scaling issue
Cluster validity scores and decisions by algorithm (decisions highlighted in yellow on the slide)
# clusters  fzBLE      PC      PE      FS         XB      CWB      PBMF    BR      CF
2           -580.0728  0.9942  0.0121  -568.7972  0.0594  5.5107   4.2087  1.1107  177.8094
3           -564.1986  0.9430  0.0942  -487.6104  0.4877  4.1309   4.2839  1.6634  117.9632
4           -561.0169  0.9142  0.1470  -430.4863  0.9245  6.1224   3.3723  1.3184  99.1409
5           -561.7420  0.8900  0.1941  -397.0935  1.3006  9.4770   2.6071  1.1669  88.5963
6           -552.9153  0.8695  0.2387  -300.6564  2.5231  20.6496  1.9499  1.1026  84.0905
7           -556.2905  0.8707  0.2386  -468.3121  2.1422  21.0187  2.8692  0.7875  57.5159
8           -555.3507  0.8925  0.2078  -462.0673  1.7245  20.0113  2.5323  0.5894  52.0348
9           -558.8686  0.8863  0.2192  -512.4278  1.6208  22.4772  2.6041  0.5019  45.9214
10          -565.8360  0.8847  0.2241  -644.1451  1.1897  21.9932  3.4949  0.3918  33.1378
- 112 genes during RCNS development at 9 time points
- 6 clusters, 4 of which are functionally annotated (Somogyi et al. 1995, Wen et al. 1998)
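Assuming the fzBLE decision is the cluster count with the largest log-likelihood score, the selection step can be sketched as follows using the fzBLE columns of the Glass and RCNS tables. The helper name `best_by_log_likelihood` is hypothetical; the scores are copied from the tables.

```python
# fzBLE scores from the Glass table, keyed by number of clusters.
glass_fzble = {
    2: -1135.6886, 3: -1127.6854, 4: -1119.2457, 5: -1123.2826,
    6: -1113.8339, 7: -1116.5724, 8: -1127.2626, 9: -1117.7484,
    10: -1122.1585, 11: -1121.9848, 12: -1135.0453, 13: -1138.9462,
}

# fzBLE scores from the RCNS table, keyed by number of clusters.
rcns_fzble = {
    2: -580.0728, 3: -564.1986, 4: -561.0169, 5: -561.7420,
    6: -552.9153, 7: -556.2905, 8: -555.3507, 9: -558.8686,
    10: -565.8360,
}

def best_by_log_likelihood(scores):
    """Return the cluster count whose score is highest (assumed decision rule)."""
    return max(scores, key=scores.get)

glass_choice = best_by_log_likelihood(glass_fzble)
rcns_choice = best_by_log_likelihood(rcns_fzble)
print(glass_choice, rcns_choice)  # 6 6
```

Under that assumption, the RCNS choice agrees with the 6 clusters reported for this dataset.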
Discussion: The advantages of fzBLE
- Performs better than other approaches on three levels of data
- Compared with compactness-separation approaches
  - Solves the over-fitting problem using goodness of fit
  - Eliminates the need for two scaling factors
- Compared with the mixture model with EM approach
  - Rapid convergence
  - No assumption about the data distribution
- Scaling the two factors, compactness and separation, is similar to scaling gene expression within each condition before clustering. The problems are:
  - The number of genes on each chip is known, whereas the number of clusters is not
  - The values across multiple experimental conditions are consistent (fc, log of fc, …), whereas the values of the two factors are not
Discussion: The limitations of fzBLE
- Depends on internal validity
- External validation is needed
  - Biological validity: GO terms, pathways, PPI
- Future work on gene expression:
  - Distance definition based on biological context
  - Combine fzBLE with biological homology and stability indices
Thank you! Questions?
We acknowledge the support from
- National Institutes of Health
- Linda Crnic Institute
- Vietnamese Ministry of Education and Training