Clustering with FITCH and UPGMA
Bob W. Kooi, David M. Stork and Jorn de Haan
Theoretical Biology


Literature
Pattern recognition, S. Theodoridis and K. Koutroumbas, Academic Press, 1999
Data analysis in community and landscape ecology, R. Jongman, C. ter Braak and O. van Tongeren, Pudoc, 1987

Clustering
Grouping of similar objects
How?
  Sequential
  Hierarchical
  Optimum based

Algorithms
Fitch-Margoliash
  Hierarchical, optimum based
  Input: dissimilarity matrix
  Output: tree
UPGMA
  Hierarchical
  Input: similarity matrix
  Output: tree
Cutting the tree at a specific level results in a clustering

Distance Matrix
Euclidean distance (dissimilarity): $d(a, b) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$
N: number of data points
a, b: profile vectors
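As a minimal sketch of this slide, the Euclidean distance matrix can be computed as below. The function names `euclidean_distance` and `distance_matrix` and the sample profiles are illustrative assumptions, not part of the presentation.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two profile vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(profiles):
    """Symmetric matrix of pairwise Euclidean distances (dissimilarities)."""
    n = len(profiles)
    return [[euclidean_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

# Made-up profiles, chosen only so the distances are easy to check by hand.
profiles = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dm = distance_matrix(profiles)
```

The diagonal is zero and the matrix is symmetric, which is the form a dissimilarity-based tree method expects.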

Distance Matrix
Inner product (similarity): $s(a, b) = \sum_{i=1}^{N} a_i b_i$
N: number of data points
a, b: profile vectors
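The inner-product similarity can be sketched in the same way. The names `inner_product` and `similarity_matrix` and the sample profiles are assumptions made for illustration only.

```python
def inner_product(a, b):
    """Inner product of two profile vectors of equal length (a similarity)."""
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(profiles):
    """Symmetric matrix of pairwise inner products."""
    n = len(profiles)
    return [[inner_product(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

# Made-up profiles for illustration.
profiles = [[1, 2], [3, 4]]
sm = similarity_matrix(profiles)
```

Unlike a distance, a larger value here means the two profiles are more alike, so the most similar pair is the one with the largest off-diagonal entry.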

Fitch-Margoliash
Objective is to find the tree that minimizes $\sum_{i<j} (D_{ij} - d_{ij})^2 / D_{ij}^P$
$D_{ij}$: observed distance between i and j
$d_{ij}$: expected distance between i and j
P: power (P = 0)
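A minimal sketch of evaluating such a criterion for one candidate tree, assuming the weighted least-squares form (squared difference between observed and expected distance, divided by the observed distance raised to the power P). The function name `fitch_margoliash_criterion` and the sample matrices are hypothetical; the expected distances would normally come from the branch lengths of the tree being scored.

```python
def fitch_margoliash_criterion(observed, expected, power=0.0):
    """Sum over pairs i < j of (D_ij - d_ij)^2 / D_ij^power.

    observed: matrix of observed distances D (off-diagonal entries > 0)
    expected: matrix of tree (expected) distances d
    power:    the power P; P = 0 gives an unweighted least-squares fit
    """
    n = len(observed)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            diff = observed[i][j] - expected[i][j]
            total += diff ** 2 / observed[i][j] ** power
    return total

# Made-up matrices for illustration.
observed = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
expected = [[0, 1, 4], [1, 0, 5], [4, 5, 0]]
unweighted = fitch_margoliash_criterion(observed, expected, power=0.0)
weighted = fitch_margoliash_criterion(observed, expected, power=2.0)
```

With P = 0 every pair counts equally; a larger P down-weights the misfit on pairs that are far apart.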

Fitch-Margoliash
Evaluating all trees and picking the best according to the criterion is not possible for more than a small number of genes, since the number of trees is very large
There is no guarantee that the best tree is found
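To give a sense of how fast the number of trees grows, the sketch below counts unrooted bifurcating trees with labeled leaves, which is (2n - 5)!! for n leaves. That the slide refers to this particular class of trees is an assumption, and the function name `count_unrooted_trees` is illustrative.

```python
def count_unrooted_trees(n_genes):
    """Number of unrooted bifurcating trees with n_genes labeled leaves.

    Adding the k-th leaf to a tree with k - 1 leaves can be done on any of
    its 2k - 5 branches, so the count is the product of (2k - 5) for k = 3..n.
    """
    if n_genes < 2:
        raise ValueError("need at least two genes")
    count = 1
    for k in range(3, n_genes + 1):
        count *= 2 * k - 5
    return count

trees_for_10 = count_unrooted_trees(10)
```

Ten genes already give more than two million trees, so exhaustive evaluation is only realistic for very small data sets.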

Fitch-Margoliash
First, two genes are taken: one tree is possible
Then the next gene is taken: there is a limited number of possibilities to add it to the existing tree; take the best tree
Continue until all genes are added
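The saving from adding genes one at a time can be sketched by counting the candidate placements. This assumes unrooted bifurcating trees, where a tree with k - 1 leaves offers 2k - 5 branches for the k-th gene, and it ignores any later rearrangement step a real program might perform. The function name `stepwise_candidates` is hypothetical.

```python
def stepwise_candidates(n_genes):
    """Total number of candidate trees scored by plain stepwise addition.

    Starting from the single two-gene tree, the k-th gene can be attached
    to any of the 2k - 5 branches of the current tree (k = 3..n_genes).
    """
    if n_genes < 2:
        raise ValueError("need at least two genes")
    return sum(2 * k - 5 for k in range(3, n_genes + 1))

candidates_for_10 = stepwise_candidates(10)
```

For ten genes this is 64 candidate trees, which shows why the heuristic is fast and also why it cannot guarantee the optimum: most trees are never examined.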

UPGMA
Join the two genes most similar to each other
Calculate the GAP of this hypothetical gene as the weighted average of the merged genes, and calculate the new distance matrix
Repeat this step until all genes form one cluster
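The steps above can be sketched as follows. This is an illustrative version that works directly on a dissimilarity matrix (the most similar pair is the one with the smallest distance) and updates distances with a cluster-size-weighted average; it is not the code of any particular package. The function name `upgma`, the nested-tuple tree, and the sample data are assumptions.

```python
def upgma(dist, labels):
    """UPGMA on a symmetric dissimilarity matrix.

    Returns (tree, heights): tree is a nested tuple of labels, heights is
    the list of merge heights (half the distance between merged clusters).
    """
    n = len(labels)
    # Active clusters: id -> (subtree, number of genes in the cluster)
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    heights = []
    while len(clusters) > 1:
        # Join the two clusters that are closest (most similar)
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster:
        # average of the two old distances, weighted by cluster size
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new_d[k] = (size_i * dik + size_j * djk) / (size_i + size_j)
        # Drop all entries that involve the two merged clusters
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        for k, v in new_d.items():
            d[(k, next_id)] = v  # k is always smaller than next_id
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        heights.append(dij / 2)
        next_id += 1
    tree = next(iter(clusters.values()))[0]
    return tree, heights

# Made-up distances: A and B are close, C is far from both.
dist = [[0, 2, 6],
        [2, 0, 6],
        [6, 6, 0]]
tree, heights = upgma(dist, ["A", "B", "C"])
```

Cutting the resulting tree at a height between two merge heights (here, anywhere between 1 and 3) yields a clustering, in this case {A, B} and {C}.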

K-means
Optimum-based algorithm
The lower J is, the better the clustering
The variables indexed by j are vectors of length N
Number of iterations needed is large
Stirling numbers of the second kind give the number of ways of partitioning a set of m elements into k nonempty subsets
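The size of the search space can be illustrated with a short sketch that computes Stirling numbers of the second kind from the standard recurrence S(m, k) = k S(m-1, k) + S(m-1, k-1). The function name `stirling2` is an illustrative choice.

```python
def stirling2(m, k):
    """Number of ways to partition m elements into k nonempty subsets."""
    if m < 0 or k < 0:
        raise ValueError("m and k must be non-negative")
    # table[i][j] holds S(i, j); S(0, 0) = 1, S(i, 0) = 0 for i > 0
    table = [[0] * (k + 1) for _ in range(m + 1)]
    table[0][0] = 1
    for i in range(1, m + 1):
        for j in range(1, min(i, k) + 1):
            # Element i either joins one of the j existing subsets
            # or forms a new subset on its own
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[m][k]

partitions_10_into_3 = stirling2(10, 3)
```

Even ten objects can be split into three clusters in 9330 ways, so an optimum-based method cannot simply enumerate all partitions and has to iterate toward a good one.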

Damap
DAta MAnipulation Program
Adds noise (uniform and Gaussian)
Normalizes data
Calculates slopes of normalized data
Calculates distance matrix using Euclidean distance
Coded in Java (platform independent)
Tested with data from Somogyi et al. 1997

Scheme Damap

Adding slopes of normalized data enhances sensitivity
Slope $= \frac{a(t_2) - a(t_1)}{t_2 - t_1}$, with the assumption that $t_2 - t_1 = 1$
Normalized data: normalization by setting max(a) = 1
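A minimal sketch of these two preprocessing steps, assuming each profile is divided by its own maximum and that the slope between consecutive time points reduces to a simple difference because the time step is 1. The function names and the sample profile are hypothetical and do not come from Damap itself.

```python
def normalize(profile):
    """Scale a profile so that its maximum value becomes 1."""
    peak = max(profile)
    if peak == 0:
        raise ValueError("cannot normalize a profile whose maximum is 0")
    return [value / peak for value in profile]

def slopes(profile):
    """Slopes between consecutive time points, assuming a time step of 1."""
    return [later - earlier for earlier, later in zip(profile, profile[1:])]

def with_slopes(profile):
    """Normalized profile extended with the slopes of the normalized data."""
    normalized = normalize(profile)
    return normalized + slopes(normalized)

# Made-up expression levels at three time points.
raw = [2.0, 4.0, 8.0]
extended = with_slopes(raw)
```

Appending the slopes makes two profiles with similar levels but opposite trends lie further apart in Euclidean distance than they would on the levels alone.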

Input of DAMAP
Raw data: mRNA levels during development of neurological cells in rats (Somogyi et al. 1997)

Intermediate results of DAMAP
Normalized data
Slopes (with equidistant time steps)

Output of DAMAP Distance matrix

3rd party programs
FITCH and UPGMA (neighbor)
  PHYLIP, PHYLogeny Inference Package
  Used in phylogenetic studies, evolution, etc.
K-means
  R, an S-Plus-like package

Cladogram Tree

Phenogram Tree

Application in Medical Biology
The most widely used clustering technique was published by Eisen et al. (1998). Their article has 740 citations in Web of Science.
Hierarchical clustering is the most frequently used clustering technique: it was used in 48.4 percent of the cluster analyses (n=31).

Cluster method

Cluster software

Distance measure
Measure                  Percentage
Euclidean                30
Pearson's correlation    55
Bray Curtis distance     5
Not mentioned            10
