Clustering with FITCH and UPGMA. Bob W. Kooi, David M. Stork and Jorn de Haan, Theoretical Biology.


1 Clustering with FITCH and UPGMA. Bob W. Kooi, David M. Stork and Jorn de Haan, Theoretical Biology

2 Literature: S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999. R. Jongman, C. ter Braak and O. van Tongeren, Data Analysis in Community and Landscape Ecology, Pudoc, 1987.

3 Clustering: grouping of similar objects. How? Sequential, hierarchical, or optimum-based.

4 Algorithms. Fitch-Margoliash: hierarchical, optimum-based; input: dissimilarity matrix; output: tree. UPGMA: hierarchical; input: similarity matrix; output: tree. Cutting the tree at a specific level results in a clustering.
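How cutting a tree at a level yields a clustering can be sketched as follows. This is only an illustration: the nested-tuple tree representation `(left, right, height)` and the function names `leaves` and `cut` are assumptions made for this example, not the output format of FITCH or UPGMA.

```python
def leaves(tree):
    """Collect all leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    left, right, _height = tree
    return leaves(left) + leaves(right)


def cut(tree, level):
    """Return the clusters obtained by cutting the tree at the given height.

    Every subtree whose merge height is at or below the level stays together
    as one cluster; subtrees merged above the level are split further.
    """
    if isinstance(tree, str):
        return [[tree]]
    left, right, height = tree
    if height <= level:
        return [leaves(tree)]
    return cut(left, level) + cut(right, level)


# Hypothetical tree: A and B merge at height 1.0, C joins them at height 3.0.
example_tree = ("C", ("A", "B", 1.0), 3.0)
clusters = cut(example_tree, 2.0)
```

Cutting lower gives more, smaller clusters; cutting at or above the root height gives a single cluster.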

5 Distance Matrix: Euclidean distance (dissimilarity). N: number of data points; a, b: profile vectors.
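The slide's own formula is not preserved in this transcript. A minimal sketch, assuming the standard unnormalized Euclidean distance d(a, b) = sqrt(sum over i of (a_i - b_i)^2) between profile vectors of length N; the function names are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two profile vectors of equal length N."""
    if len(a) != len(b):
        raise ValueError("profile vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(profiles):
    """Symmetric matrix of pairwise Euclidean distances (zero diagonal)."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = euclidean_distance(profiles[i], profiles[j])
    return matrix


# Three made-up profiles with N = 2 data points each.
profiles = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(profiles)
```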

6 Distance Matrix: inner product (similarity). N: number of data points; a, b: profile vectors.
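A minimal sketch of the inner-product similarity, assuming the plain dot product of two profile vectors without further scaling (the slide's formula is not preserved, so any normalization it used is unknown). Function names are illustrative.

```python
def inner_product(a, b):
    """Dot product of two profile vectors of equal length N."""
    if len(a) != len(b):
        raise ValueError("profile vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(profiles):
    """Symmetric matrix of pairwise inner products; larger means more similar."""
    n = len(profiles)
    return [[inner_product(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


# Made-up profiles: the first two are orthogonal, so their similarity is 0.
S = similarity_matrix([[1, 0], [0, 1], [1, 1]])
```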

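7 Fitch-Margoliash: The objective is to find the tree that minimizes the sum over all pairs (i, j) of (D_ij - d_ij)^2 / D_ij^P, where D_ij is the observed distance between i and j, d_ij is the expected distance between i and j (along the tree), and P is the power (here P = 0).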
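A sketch of how such a criterion could be evaluated for one candidate tree, assuming the weighted least-squares form used by PHYLIP's FITCH program, sum over pairs of (D_ij - d_ij)^2 / D_ij^P. The function name and the toy matrices are hypothetical; `expected` stands for the path lengths read off a candidate tree.

```python
def least_squares_criterion(observed, expected, power=0.0):
    """Sum over pairs i < j of (D_ij - d_ij)^2 / D_ij^power.

    power = 0 gives unweighted least squares (the setting on the slide);
    power = 2 down-weights large observed distances. For power > 0 the
    observed off-diagonal distances must be non-zero.
    """
    n = len(observed)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            residual = observed[i][j] - expected[i][j]
            total += residual ** 2 / observed[i][j] ** power
    return total


# Toy example: observed distances versus distances implied by some tree.
observed = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
expected = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
score_p0 = least_squares_criterion(observed, expected, power=0.0)
score_p2 = least_squares_criterion(observed, expected, power=2.0)
```

The lower the score, the better the tree reproduces the observed distances.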

8 Fitch-Margoliash: Evaluating all trees and picking the best according to the criterion is not possible for more than a small number of genes. Since the number of trees is very large, there is no guarantee that the best tree is found.
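To give a feel for how fast the search space grows, here is a small sketch that counts tree topologies. It assumes unrooted, fully bifurcating trees with labelled leaves, for which the count is the double factorial (2n - 5)!!; the slide itself does not state which tree class it has in mind.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating tree topologies for n labelled leaves.

    Equals (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5), valid for n >= 3.
    """
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(1, 2 * n - 4, 2):
        count *= odd
    return count


for genes in (4, 5, 10, 20):
    print(genes, count_unrooted_trees(genes))
```

Already at 10 genes there are more than two million topologies, which is why exhaustive evaluation is limited to very small inputs.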

9 Fitch-Margoliash: First, two genes are taken: one tree is possible. Then the next gene is taken: there is a limited number of possibilities to add it to the existing tree; take the best tree. Continue until all genes are added.
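The saving from adding genes one at a time can be sketched by counting candidate placements. This assumes an unrooted bifurcating tree, where the k-th gene can be attached to any of the 2k - 5 branches of the tree built from the first k - 1 genes, and it ignores any rearrangement steps a real program may perform afterwards. The function names are illustrative.

```python
def candidate_placements(k):
    """Branches available when the k-th gene (k >= 3) is added to the tree."""
    if k < 3:
        raise ValueError("the first two genes form the single starting tree")
    return 2 * k - 5


def trees_evaluated_stepwise(n):
    """Total candidate trees scored when n genes are added one at a time."""
    return sum(candidate_placements(k) for k in range(3, n + 1))


for genes in (4, 10, 20):
    print(genes, trees_evaluated_stepwise(genes))
```

For 10 genes this greedy procedure scores 64 candidate trees in total, a tiny fraction of all possible topologies, which is also why it cannot guarantee the optimum.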

10 UPGMA: Join the two genes most similar to each other. Calculate the GAP of this hypothetical gene as the weighted average of the merged genes, and calculate the new distance matrix. Repeat this step until all genes form one cluster.
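The loop described above can be sketched in a few lines. This is a minimal illustration, not the PHYLIP implementation: it works on a distance matrix (the smallest distance marks the most similar pair), weights the average by cluster size, and returns a nested tuple `(left, right, height)`, which is a representation chosen for this example.

```python
def upgma(dist, labels):
    """Build a UPGMA tree from a symmetric distance matrix."""
    n = len(labels)
    # Each cluster id maps to (subtree, number of genes in the subtree).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Join the closest (most similar) pair of clusters.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        del d[(a, b)]
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree


# Made-up distances between four genes.
labels = ["A", "B", "C", "D"]
dist = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]
tree = upgma(dist, labels)
```

The height stored at each merge is half the distance between the joined clusters, so the result can be drawn as a tree with all leaves at the same level.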

11 K-means: Optimum-based algorithm; the lower J is, the better the clustering. The cluster representatives (indexed by j) are vectors of length N. The number of iterations needed is large. Stirling numbers of the second kind give the number of ways of partitioning a set of m elements into k non-empty subsets.
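Two small sketches for this slide. The cost function assumes the usual k-means form of J, the sum of squared Euclidean distances from each profile to its assigned cluster centre (the slide's formula is not preserved in the transcript). The Stirling number uses the standard recurrence S(m, k) = k S(m-1, k) + S(m-1, k-1). All names are illustrative.

```python
def kmeans_cost(points, centres, assignment):
    """J: sum of squared distances from each point to its assigned centre."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, centres[cluster]))
        for point, cluster in zip(points, assignment)
    )


def stirling2(m, k):
    """Number of ways to partition m elements into k non-empty subsets."""
    table = [[0] * (k + 1) for _ in range(m + 1)]
    table[0][0] = 1
    for i in range(1, m + 1):
        for j in range(1, min(i, k) + 1):
            # Element i either joins one of j existing subsets or starts a new one.
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[m][k]


# Made-up data: two points near centre 0, one point exactly on centre 1.
cost = kmeans_cost([[0, 0], [2, 0], [10, 0]], [[1, 0], [10, 0]], [0, 0, 1])
partitions = stirling2(10, 3)
```

Even 10 genes can be split into 3 non-empty clusters in thousands of ways, so trying every partition to minimize J quickly becomes infeasible.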

12 Damap (DAta MAnipulation Program): adds noise (uniform and Gaussian); normalizes data; calculates slopes of normalized data; calculates the distance matrix using Euclidean distance. Coded in Java (platform independent). Tested with data from Somogyi et al. 1997.

13 Scheme Damap

14 Adding slopes of normalized data enhances sensitivity. Slope = (a(t2) - a(t1)) / (t2 - t1), with the assumption that t2 - t1 = 1. Normalized data: normalization by setting max(a) = 1.
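A minimal sketch of these two preprocessing steps, not the Damap code itself. It assumes that normalization divides a profile by its maximum, that the slope is the difference between consecutive normalized values (unit time step), and that "adding" slopes means appending them to the normalized profile; the last point in particular is an assumption. Function names are hypothetical.

```python
def normalize(profile):
    """Scale a profile so that its maximum value becomes 1."""
    peak = max(profile)
    if peak == 0:
        raise ValueError("cannot normalize a profile whose maximum is 0")
    return [value / peak for value in profile]


def slopes(profile):
    """Differences between consecutive values, assuming t2 - t1 = 1."""
    return [later - earlier for earlier, later in zip(profile, profile[1:])]


def with_slopes(profile):
    """Normalized profile followed by its slopes (assumed layout)."""
    normalized = normalize(profile)
    return normalized + slopes(normalized)


# Made-up expression levels at three equidistant time points.
extended = with_slopes([2.0, 4.0, 8.0])
```

Two profiles with similar levels but different trends end up further apart once the slopes are included in the distance calculation.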

15 Input of DAMAP: raw data, mRNA levels during development of neurological cells in rats (Somogyi et al. 1997)

16 Intermediate results of DAMAP: normalized data; slopes (with equidistant time steps)

17 Output of DAMAP: distance matrix

18 Third-party programs. FITCH and UPGMA (neighbor): PHYLIP, the PHYLogeny Inference Package, used in phylogenetic studies (evolution etc.). K-means: R, an S-Plus-like package.

19 Cladogram Tree

20 Phenogram Tree

21 Application in Medical Biology: The most widely used clustering technique was published by Eisen et al. (1998). The article by Eisen et al. has 740 citations from other articles in Web of Science. Hierarchical clustering is the most used clustering technique; it was used in 48.4 percent of the cluster analyses (n=31).

22 Cluster method

23 Cluster software

24 Distance measure
Measure                  Percentage
Euclidean                30
Pearson's correlation    55
Bray Curtis distance     5
Not mentioned            10

