Clustering with FITCH and UPGMA. Bob W. Kooi, David M. Stork and Jorn de Haan, Theoretical Biology.



Literature:
Pattern Recognition. S. Theodoridis and K. Koutroumbas, Academic Press, 1999.
Data Analysis in Community and Landscape Ecology. R. Jongman, C. ter Braak and O. van Tongeren, Pudoc, 1987.

Clustering: grouping of similar objects. How? Sequential, hierarchical, or optimum-based.

Algorithms:
Fitch-Margoliash: hierarchical, optimum-based. Input: dissimilarity matrix. Output: tree.
UPGMA: hierarchical. Input: similarity matrix. Output: tree.
Cutting the tree at a specific level results in a clustering.

Distance Matrix: Euclidean distance (dissimilarity), $d(a, b) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$, where N is the number of data points and a, b are profile vectors.
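The Euclidean dissimilarity can be sketched in a few lines of Python. This is a minimal illustration, not the Damap implementation; the function names `euclidean_distance` and `distance_matrix` and the example profiles are assumptions made for this sketch.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length profile vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(profiles):
    """Symmetric matrix of pairwise Euclidean distances (zero diagonal)."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean_distance(profiles[i], profiles[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix

# Hypothetical profiles, chosen only to give round distances.
profiles = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(profiles)
```

Only the upper triangle is computed and then mirrored, since the distance is symmetric.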

Distance Matrix: inner product (similarity), $s(a, b) = \sum_{i=1}^{N} a_i b_i$, where N is the number of data points and a, b are profile vectors.
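A similarity matrix based on the inner product might be computed as follows. This is a sketch under the assumption that the plain (unnormalized) inner product is meant; the names `inner_product` and `similarity_matrix` are illustrative only.

```python
def inner_product(a, b):
    """Inner product of two equal-length profile vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(profiles):
    """Matrix of pairwise inner products; larger values mean more similar."""
    return [[inner_product(a, b) for b in profiles] for a in profiles]

# Hypothetical profiles for illustration.
profiles = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
S = similarity_matrix(profiles)
```

Unlike a distance matrix, the diagonal is not zero: each entry `S[i][i]` is the squared length of profile i.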

Fitch-Margoliash: the objective is to find the tree that minimizes $\sum_{i<j} (D_{ij} - d_{ij})^2 / D_{ij}^P$, where $D_{ij}$ is the observed distance between i and j, $d_{ij}$ is the expected distance between i and j, and P is the power (here P = 0).
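The criterion can be evaluated for a given candidate tree once the tree's path-length distances are known. The sketch below assumes the weighted least-squares form documented for PHYLIP's FITCH program, the sum over pairs of (D - d)^2 / D^P. The function name and the two small matrices are hypothetical.

```python
def fitch_margoliash_criterion(observed, expected, power=0.0):
    """Sum over pairs i < j of (D_ij - d_ij)^2 / D_ij^power.

    observed: matrix of measured distances D
    expected: matrix of tree path-length distances d
    Assumes all off-diagonal observed distances are positive when power > 0.
    """
    n = len(observed)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            residual = observed[i][j] - expected[i][j]
            total += residual ** 2 / observed[i][j] ** power
    return total

# Hypothetical observed distances and distances implied by some tree.
observed = [[0.0, 2.0, 4.0], [2.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
expected = [[0.0, 2.0, 5.0], [2.0, 0.0, 5.0], [5.0, 5.0, 0.0]]

unweighted = fitch_margoliash_criterion(observed, expected, power=0.0)
weighted = fitch_margoliash_criterion(observed, expected, power=2.0)
```

With P = 0 every pair counts equally; a larger P downweights errors on large observed distances.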

Fitch-Margoliash: evaluating all trees and picking the best according to the criterion is not possible for more than a small number of genes. Since the number of trees is very large, there is no guarantee that the best tree is found.
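To give a feel for how quickly the search space grows, the sketch below counts unrooted bifurcating trees with n labelled leaves, which is the standard result (2n - 5)!! for n >= 3. That the slides refer to unrooted bifurcating trees is an assumption; the function name is illustrative.

```python
def count_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating trees with n labelled leaves.

    Adding the k-th leaf to a tree with k - 1 leaves can be done on any of
    its 2k - 5 branches, so the count is the product of (2k - 5) for
    k = 3..n.
    """
    if n_leaves < 2:
        raise ValueError("need at least two leaves")
    count = 1
    for k in range(3, n_leaves + 1):
        count *= 2 * k - 5
    return count

for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Already at 10 leaves there are about two million trees, so exhaustive evaluation is only feasible for very small inputs.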

Fitch-Margoliash: first, two genes are taken: only one tree is possible. Then the next gene is taken: there is a limited number of ways to add it to the existing tree; take the best tree. Continue until all genes are added.

UPGMA: join the two genes most similar to each other. Calculate the GAP of this hypothetical gene as the weighted average of the merged genes and calculate a new distance matrix. Repeat this step until all genes form one cluster.
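The merge loop can be sketched as follows. Note the assumptions: this is the textbook UPGMA variant that works on a distance matrix and updates distances as size-weighted averages, rather than averaging profiles as described above, and it is not the PHYLIP code. The nested-tuple tree format `(left, right, height)` and the example matrix are hypothetical.

```python
def upgma(distances, labels):
    """Build a tree by repeatedly merging the two closest clusters.

    Returns nested tuples (left, right, height); leaves are label strings.
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (tree, size)
    dist = {(i, j): distances[i][j]
            for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Size-weighted average distance from the merged cluster to the rest.
        new_dist = {}
        for c in clusters:
            d_ac = dist[tuple(sorted((a, c)))]
            d_bc = dist[tuple(sorted((b, c)))]
            new_dist[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop all entries that involve the two merged clusters.
        dist = {pair: d for pair, d in dist.items()
                if a not in pair and b not in pair}
        for c, d in new_dist.items():
            dist[(c, next_id)] = d  # c < next_id, so the key stays sorted
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Hypothetical distance matrix for four genes.
labels = ["A", "B", "C", "D"]
distances = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 6.0, 10.0],
    [6.0, 6.0, 0.0, 10.0],
    [10.0, 10.0, 10.0, 0.0],
]
tree = upgma(distances, labels)
```

Cutting such a tree at a chosen height (for example, keeping only merges below 4.0) yields a flat clustering.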

K-means: an optimum-based algorithm; the lower J is, the better the clustering. The variables indexed by j are vectors of length N. The number of iterations needed is large. Stirling numbers of the second kind give the number of ways of partitioning a set of m elements into k non-empty subsets.
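Both quantities on this slide can be illustrated briefly. The sketch assumes J is the usual k-means cost (sum of squared Euclidean distances from each point to its cluster centre), which the transcript does not spell out. The Stirling numbers use the standard recurrence S(m, k) = k S(m-1, k) + S(m-1, k-1). All names and example data are hypothetical.

```python
def kmeans_cost(points, centers, assignment):
    """Assumed cost J: squared distance of each point to its assigned centre."""
    total = 0.0
    for point, cluster in zip(points, assignment):
        center = centers[cluster]
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total

def stirling_second_kind(m, k):
    """Number of ways to partition m elements into k non-empty subsets."""
    if k < 0 or k > m:
        return 0
    row = [1] + [0] * k  # row for zero elements: S(0, 0) = 1
    for n in range(1, m + 1):
        new_row = [0] * (k + 1)
        for j in range(1, min(n, k) + 1):
            new_row[j] = j * row[j] + row[j - 1]
        row = new_row
    return row[k]

# Hypothetical data: three points, two centres.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
centers = [[0.0, 1.0], [10.0, 0.0]]
J = kmeans_cost(points, centers, assignment=[0, 0, 1])

n_partitions = stirling_second_kind(10, 3)
```

Even 10 elements can be split into 3 non-empty clusters in thousands of ways, which is why k-means searches iteratively instead of enumerating partitions.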

Damap (DAta MAnipulation Program):
Adds noise (uniform and Gaussian).
Normalizes data.
Calculates slopes of normalized data.
Calculates the distance matrix using Euclidean distance.
Coded in Java (platform independent).
Tested with data from Somogyi et al. 1997.

Scheme Damap

Normalized data: normalization by setting max(a) = 1. Adding slopes of normalized data enhances sensitivity. Slope = $(a(t_2) - a(t_1)) / (t_2 - t_1)$, with the assumption that $t_2 - t_1 = 1$.
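The two preprocessing steps might look like this in Python. This is a sketch, not the Damap code (which is written in Java). It assumes profiles with a positive maximum and unit time steps, and that "adding slopes" means appending them to the normalized profile. The function names are illustrative.

```python
def normalize(profile):
    """Scale a profile so that its maximum equals 1."""
    peak = max(profile)
    if peak <= 0:
        raise ValueError("profile maximum must be positive")
    return [value / peak for value in profile]

def slopes(profile):
    """Differences between consecutive time points (time step assumed 1)."""
    return [later - earlier for earlier, later in zip(profile, profile[1:])]

def with_slopes(profile):
    """Normalized profile followed by its slopes (assumed combination)."""
    normalized = normalize(profile)
    return normalized + slopes(normalized)

# Hypothetical expression profile over four time points.
raw = [2.0, 4.0, 8.0, 4.0]
features = with_slopes(raw)
```

The extended vectors can then be passed to a Euclidean distance calculation, so that genes with similar levels but different trends end up further apart.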

Input of DAMAP: raw data, mRNA levels during development of neurological cells in rats (Somogyi et al. 1997).

Intermediate results of DAMAP: normalized data and slopes (with equidistant time steps).

Output of DAMAP: distance matrix.

Third-party programs:
FITCH and UPGMA (neighbor): PHYLIP, the PHYLogeny Inference Package, used in phylogenetic studies (evolution etc.).
K-means: R, an S-Plus-like package.

Cladogram Tree

Phenogram Tree

Application in Medical Biology: the most widely used clustering technique is the one published by Eisen et al. (1998). The article by Eisen et al. has 740 citations by other articles in Web of Science. Hierarchical clustering is the most used clustering technique; it was used in 48.4 percent of the cluster analyses (n=31).

Cluster method

Cluster software

Distance measure

Measure                  Percentage
Euclidean                30
Pearson's correlation    55
Bray-Curtis distance     5
Not mentioned            10