Clustering with FITCH and UPGMA. Bob W. Kooi, David M. Stork and Jorn de Haan, Theoretical Biology
Literature: Pattern Recognition, S. Theodoridis and K. Koutroumbas, Academic Press, 1999; Data Analysis in Community and Landscape Ecology, R. Jongman, C. ter Braak and O. van Tongeren, Pudoc, 1987
Clustering: grouping of similar objects. How? With sequential, hierarchical or optimum-based algorithms.
Algorithms: Fitch-Margoliash (hierarchical, optimum based; input: dissimilarity matrix; output: tree) and UPGMA (hierarchical; input: similarity matrix; output: tree). Cutting the tree at a specific level results in a clustering.
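Distance Matrix: Euclidean distance (dissimilarity), $d(a, b) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$, where N is the number of data points and a, b are profile vectors.
Distance Matrix: inner product (similarity), $s(a, b) = \sum_{i=1}^{N} a_i b_i$, where N is the number of data points and a, b are profile vectors.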
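As a minimal sketch of how both matrices could be computed from a set of profile vectors, assuming the standard unscaled definitions of Euclidean distance and inner product (the original slides may have used a scaled variant). The function names and the example profiles are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance (dissimilarity) between two profile vectors."""
    if len(a) != len(b):
        raise ValueError("profile vectors must have the same length N")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product (similarity) between two profile vectors."""
    if len(a) != len(b):
        raise ValueError("profile vectors must have the same length N")
    return sum(x * y for x, y in zip(a, b))


def pairwise_matrix(profiles, measure):
    """Square matrix holding measure(a, b) for every pair of profiles."""
    n = len(profiles)
    return [[measure(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical example: three genes, each with N = 3 data points.
profiles = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
distances = pairwise_matrix(profiles, euclidean_distance)
similarities = pairwise_matrix(profiles, inner_product)
```

The distance matrix has zeros on its diagonal, while the diagonal of the similarity matrix holds the squared length of each profile.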
Fitch-Margoliash: the objective is to find the tree which minimizes $\sum_{i} \sum_{j} (D_{ij} - d_{ij})^2 / D_{ij}^{P}$, where D is the observed distance between i and j, d is the expected distance between i and j, and P = 0 is the power.
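The criterion can be sketched in a few lines, assuming the sum-of-squares form documented for the PHYLIP FITCH program: the sum over all pairs i, j of (D - d)^2 / D^P. The function name and the example matrices are hypothetical; in practice the expected distances d would be the path lengths along a candidate tree.

```python
def fitch_margoliash_criterion(observed, expected, power=0.0):
    """Sum over all pairs i != j of (D_ij - d_ij)^2 / D_ij^power.

    observed: matrix of observed distances D
    expected: matrix of expected (tree path-length) distances d
    power:    the power P; P = 0 gives an unweighted sum of squares
    """
    n = len(observed)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the diagonal carries no information
            diff = observed[i][j] - expected[i][j]
            # Note: an observed distance of 0 with power > 0 would divide by zero.
            total += diff ** 2 / observed[i][j] ** power
    return total


# Hypothetical example: one pair (counted in both directions) is off by 1.
observed = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
expected = [[0, 3, 5], [3, 0, 5], [5, 5, 0]]
unweighted = fitch_margoliash_criterion(observed, expected, power=0.0)
weighted = fitch_margoliash_criterion(observed, expected, power=2.0)
```

With a larger power, discrepancies on large observed distances count less than the same discrepancies on small ones.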
Fitch-Margoliash: evaluating all trees and picking the best according to the criterion is not possible for more than a small number of genes. Since the number of trees is very large, there is no guarantee that the best tree is found.
Fitch-Margoliash: the first two genes are taken, for which one tree is possible. Then the next gene is taken; there is a limited number of possibilities to add it to the existing tree, and the best tree is kept. This continues until all genes are added.
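To give a feeling for how fast the search space grows, the sketch below counts unrooted, fully bifurcating labeled trees, assuming that is the kind of tree being searched. The count for n leaves is the standard double factorial (2n - 5)!!; the function name is hypothetical.

```python
def count_unrooted_trees(n_genes):
    """Number of unrooted bifurcating trees with n_genes labeled leaves.

    Uses the standard result (2n - 5)!! = 3 * 5 * ... * (2n - 5) for n >= 3;
    for one or two genes there is a single tree.
    """
    if n_genes < 1:
        raise ValueError("need at least one gene")
    count = 1
    for odd in range(3, 2 * n_genes - 4, 2):
        count *= odd
    return count


# Growth of the number of trees with the number of genes.
for n in (3, 4, 5, 10):
    print(n, count_unrooted_trees(n))
```

Already at ten genes there are more than two million candidate trees, which is why a stepwise heuristic is used instead of exhaustive evaluation.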
UPGMA: join the two genes most similar to each other. Calculate the GAP of this hypothetical gene as the weighted average of the merged genes and calculate a new distance matrix. Repeat this step until all genes form one cluster.
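The merge loop can be sketched as follows. This is a minimal illustration of the textbook UPGMA variant, not the PHYLIP implementation: it assumes a dissimilarity matrix as input (the smallest distance marks the most similar pair) and updates distances directly as size-weighted averages instead of recomputing them from averaged profiles. The function name, the nested-tuple tree format and the example data are hypothetical.

```python
def upgma(distances, labels):
    """Cluster by repeatedly joining the closest pair (textbook UPGMA).

    Returns a nested tuple (left, right, height), where height is half the
    distance at which the two subtrees were joined.
    """
    n = len(labels)
    # cluster id -> (subtree, number of genes in the cluster)
    clusters = {i: (labels[i], 1) for i in range(n)}
    # pairwise distances, keyed by (smaller id, larger id)
    dist = {(i, j): distances[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)  # most similar pair
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        height = dist.pop((a, b)) / 2
        # Distance from the merged cluster to every remaining cluster:
        # average of the old distances, weighted by cluster size.
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b, height), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


# Hypothetical example with four genes.
labels = ["A", "B", "C", "D"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(distances, labels)
```

In this example A and B are joined first, then C, then D; cutting the resulting tree at a chosen height yields a clustering.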
K-means: optimum-based algorithm; the lower J is, the better the clustering. The variables j are vectors of length N. The number of iterations needed is large: the Stirling numbers of the second kind give the number of ways of partitioning a set of m elements into k nonempty subsets.
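The size of this search space can be made concrete with a short sketch that evaluates the Stirling numbers of the second kind through the standard recurrence S(m, k) = k S(m-1, k) + S(m-1, k-1). The function name is hypothetical.

```python
def stirling2(m, k):
    """Number of ways to partition m elements into k nonempty subsets."""
    if m < 0 or k < 0:
        raise ValueError("m and k must be non-negative")
    row = [1] + [0] * k  # row for 0 elements: S(0, 0) = 1
    for i in range(1, m + 1):
        new_row = [0] * (k + 1)
        for j in range(1, min(i, k) + 1):
            # Element i either joins one of j existing subsets
            # or forms a new subset on its own.
            new_row[j] = j * row[j] + row[j - 1]
        row = new_row
    return row[k]


print(stirling2(4, 2))   # 7
print(stirling2(10, 3))  # 9330
```

Even ten objects can be split into three clusters in thousands of ways, so checking every partition quickly becomes infeasible.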
Damap (DAta MAnipulation Program): adds noise (uniform and Gaussian); normalizes data; calculates slopes of normalized data; calculates the distance matrix using Euclidean distance; coded in Java (platform independent); tested with data from Somogyi et al. 1997.
Scheme Damap
Normalized data: normalization by setting max(a) = 1. Adding slopes of the normalized data enhances sensitivity. Slope = $(a(t_2) - a(t_1)) / (t_2 - t_1)$, with the assumption that $t_2 - t_1 = 1$.
Input of DAMAP: raw data, mRNA levels during development of neurological cells in rats (Somogyi et al. 1997)
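These two preprocessing steps can be sketched as below. This is not the Damap code itself (which is written in Java); the function names are hypothetical, and appending the slopes to the normalized profile is an assumption about what "adding slopes" means here.

```python
def normalize(profile):
    """Scale a profile so that its maximum value equals 1."""
    peak = max(profile)
    if peak == 0:
        raise ValueError("cannot normalize a profile whose maximum is 0")
    return [value / peak for value in profile]


def slopes(profile, time_step=1.0):
    """Differences between consecutive points, divided by the time step."""
    return [(later - earlier) / time_step
            for earlier, later in zip(profile, profile[1:])]


def extended_profile(profile):
    """Normalized profile followed by its slopes (assumed combination)."""
    normalized = normalize(profile)
    return normalized + slopes(normalized)


# Hypothetical raw expression levels at four equidistant time points.
raw = [2.0, 4.0, 8.0, 4.0]
print(normalize(raw))          # [0.25, 0.5, 1.0, 0.5]
print(slopes(normalize(raw)))  # [0.25, 0.5, -0.5]
```

With a unit time step the slope reduces to the plain difference between consecutive normalized values.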
Intermediate results of DAMAP: normalized data; slopes (with equidistant time steps)
Output of DAMAP Distance matrix
Third-party programs: FITCH and UPGMA (neighbor) from PHYLIP, the PHYLogeny Inference Package, used in phylogenetic studies (evolution etc.); K-means from R, an S-Plus-like package.
Cladogram Tree
Phenogram Tree
Application in Medical Biology: the most widely used clustering technique was published by Eisen et al. (1998). The article by Eisen et al. has 740 citations in Web of Science. Hierarchical clustering is the most used clustering technique; it was used in 48.4 percent of the cluster analyses (n=31).
Cluster method
Cluster software
Distance measure
measure                  percentage
Euclidean                30
Pearson's correlation    55
Bray-Curtis distance     5
not mentioned            10