1
Clustering with FITCH and UPGMA
Bob W. Kooi, David M. Stork and Jorn de Haan
Theoretical Biology
2
Literature
S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999
R. Jongman, C. ter Braak and O. van Tongeren, Data Analysis in Community and Landscape Ecology, Pudoc, 1987
3
Clustering
Grouping of similar objects
How? Sequential, hierarchical, or optimum based
4
Algorithms
Fitch-Margoliash: hierarchical, optimum based. Input: dissimilarity matrix. Output: tree.
UPGMA: hierarchical. Input: similarity matrix. Output: tree.
Cutting the tree at a specific level results in a clustering.
5
Distance Matrix
Euclidean distance (dissimilarity)
N: number of data points; a, b: profile vectors
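The Euclidean distance between two profile vectors can be sketched as follows. This is a minimal illustration, assuming the standard definition (the square root of the summed squared differences over the N data points); the function names and the example profiles are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two profile vectors of equal length N."""
    if len(a) != len(b):
        raise ValueError("profile vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(profiles):
    """Symmetric matrix of pairwise Euclidean distances (zero diagonal)."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = euclidean_distance(profiles[i], profiles[j])
    return d

# Made-up example profiles, not data from the presentation.
profiles = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
for row in distance_matrix(profiles):
    print(row)
```

A larger value means the two profiles are less alike, which is why this measure counts as a dissimilarity.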
6
Distance Matrix
Inner product (similarity)
N: number of data points; a, b: profile vectors
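A minimal sketch of the inner-product similarity, assuming the plain dot product of two profile vectors of length N with no further scaling. The function names are hypothetical.

```python
def inner_product(a, b):
    """Inner (dot) product of two profile vectors of equal length N."""
    if len(a) != len(b):
        raise ValueError("profile vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(profiles):
    """Matrix of pairwise inner products between all profiles."""
    return [[inner_product(p, q) for q in profiles] for p in profiles]

# Made-up example profiles.
print(similarity_matrix([[1, 2, 3], [4, 5, 6]]))
```

Unlike the Euclidean distance, a larger value here means the profiles are more alike, so the matrix is a similarity matrix.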
7
Fitch-Margoliash
Objective: find the tree that minimizes the criterion, in which D is the observed distance between i and j, d is the expected distance between i and j, and P is the power (here P = 0).
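As an illustration, the criterion can be evaluated for a given tree. The sketch assumes the weighted least-squares form documented for PHYLIP's FITCH program, the sum over pairs of (D_ij - d_ij)^2 / D_ij^P, summed here over each pair once. The function name and the example matrices are hypothetical.

```python
def weighted_sum_of_squares(observed, expected, power=0.0):
    """Sum over pairs i < j of (D_ij - d_ij)**2 / D_ij**power.

    observed: matrix of observed distances D
    expected: matrix of distances d implied by the tree
    power:    P; with P = 0 every pair gets the same weight
    Assumes observed off-diagonal distances are nonzero when power > 0.
    """
    n = len(observed)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff = observed[i][j] - expected[i][j]
            total += diff ** 2 / observed[i][j] ** power
    return total

# Made-up two-gene example.
D = [[0.0, 2.0], [2.0, 0.0]]
d = [[0.0, 1.0], [1.0, 0.0]]
print(weighted_sum_of_squares(D, d, power=0.0))
print(weighted_sum_of_squares(D, d, power=2.0))
```

With P = 0 the criterion is an unweighted least-squares fit; a larger P gives less weight to pairs with large observed distances.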
8
Fitch-Margoliash
Evaluating all trees and picking the best according to the criterion is not possible for more than a small number of genes.
Since the number of trees is very large, there is no guarantee that the best tree is found.
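To give a sense of how fast the number of trees grows, the following sketch counts the distinct unrooted, strictly bifurcating trees with n labeled leaves, which is the double factorial (2n - 5)!!. Whether the presentation counts unrooted or rooted trees is not stated, so the unrooted case is an assumption here, and the function name is hypothetical.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees with n labeled leaves: (2n - 5)!!."""
    if n < 3:
        return 1  # with one or two leaves there is only one tree
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count

for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Already for 10 genes there are more than two million candidate trees, which is why an exhaustive evaluation is only feasible for very small inputs.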
9
Fitch-Margoliash
First, two genes are taken: only one tree is possible.
Then the next gene is taken: there is a limited number of ways to add it to the existing tree; take the best tree.
Continue until all genes are added.
10
UPGMA
Join the two genes most similar to each other.
Calculate the GAP of this hypothetical gene as the weighted average of the merged genes, and calculate a new distance matrix.
Repeat this step until all genes form one cluster.
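The steps above can be sketched in a few lines. Note the assumptions: this sketch works on a distance matrix, so "most similar" means smallest distance, and it uses the textbook UPGMA update, where the distance from a merged cluster to any other cluster is the size-weighted average of the two old distances, rather than recomputing a merged profile. The function name and the example data are hypothetical.

```python
def upgma(dist, labels):
    """Cluster with UPGMA; return (nested-tuple tree, list of merge heights)."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    heights = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest distance.
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance to every remaining cluster: size-weighted average.
        new = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        # Drop distances involving the merged clusters, add the new ones.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        heights.append(dab / 2)  # height of the new internal node
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, heights

# Made-up distances between four genes A-D.
labels = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, heights = upgma(dist, labels)
print(tree)
print(heights)
```

Cutting the resulting tree at a chosen height, for example between two consecutive merge heights, gives a clustering.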
11
K-means
Optimum-based algorithm: the lower J is, the better the clustering.
The variables j are vectors of length N.
The number of iterations needed is large.
The Stirling numbers of the second kind give the number of ways of partitioning a set of m elements into k nonempty subsets.
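The Stirling numbers of the second kind can be computed with the standard recurrence S(m, k) = k S(m - 1, k) + S(m - 1, k - 1). The sketch below is only meant to show how quickly the number of possible partitions grows; the function name is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(m, k):
    """Number of ways to partition m elements into k nonempty subsets."""
    if m == k:
        return 1  # each element in its own subset (also covers m = k = 0)
    if k == 0 or k > m:
        return 0
    # The m-th element either joins one of the k existing subsets
    # or forms a new subset on its own.
    return k * stirling2(m - 1, k) + stirling2(m - 1, k - 1)

print(stirling2(4, 2))   # 7
print(stirling2(10, 3))  # 9330
print(stirling2(50, 4))
```

Even for modest m the count is far too large to enumerate, so the optimum has to be searched iteratively instead.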
12
Damap: DAta MAnipulation Program
Adds noise (uniform and Gaussian)
Normalizes data
Calculates slopes of normalized data
Calculates the distance matrix using Euclidean distance
Coded in Java (platform independent)
Tested with data from Somogyi et al. 1997
13
Scheme Damap
14
Normalized data: normalization by setting max(a) = 1.
Adding slopes of the normalized data enhances sensitivity.
The slope is calculated with the assumption that t_2 - t_1 = 1.
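A minimal sketch of these two steps, under stated assumptions: normalization divides a profile by its maximum so that max(a) = 1, the slope between consecutive time points is the difference of the normalized values (since the time step is taken to be 1), and "adding" slopes is read as appending them to the normalized profile. The function names are hypothetical and this is not Damap's actual code.

```python
def normalize(profile):
    """Scale a profile so that its maximum value becomes 1."""
    peak = max(profile)
    if peak == 0:
        raise ValueError("cannot normalize a profile whose maximum is 0")
    return [x / peak for x in profile]

def slopes(profile):
    """Differences between consecutive values (time step assumed to be 1)."""
    return [later - earlier for earlier, later in zip(profile, profile[1:])]

def with_slopes(profile):
    """Normalized profile followed by the slopes of the normalized profile."""
    norm = normalize(profile)
    return norm + slopes(norm)

# Made-up expression levels at three time points.
print(with_slopes([2.0, 4.0, 8.0]))
```

The extended vectors can then be passed to the Euclidean distance calculation, so that two profiles are compared both on level and on change over time.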
15
Input of DAMAP
Raw data: mRNA levels during development of neurological cells in rats (Somogyi et al. 1997)
16
Intermediate results of DAMAP
Normalized data
Slopes (with equidistant time steps)
17
Output of DAMAP
Distance matrix
18
Third-party programs
FITCH and UPGMA (neighbor): PHYLIP, the PHYLogeny Inference Package. Used in phylogenetic studies, evolution, etc.
K-means: R, an S-Plus-like package
19
Cladogram Tree
20
Phenogram Tree
21
Application in Medical Biology
The most widely used clustering technique was published by Eisen et al. (1998). The article by Eisen et al. has 740 citations in Web of Science.
Hierarchical clustering is the most used clustering technique: it was used in 48.4 percent of the cluster analyses (n=31).
22
Cluster method
23
Cluster software
24
Distance measure
Measure                  Percentage
Euclidean                30
Pearson's correlation    55
Bray-Curtis distance     5
Not mentioned            10