Hierarchical clustering approaches for high-throughput data

Hierarchical clustering approaches for high-throughput data
Colin Dewey, BMI/CS 576
www.biostat.wisc.edu/bmi576
colin.dewey@wisc.edu
Fall 2016

The clustering landscape

There are many different clustering algorithms. They differ with respect to several properties:
- hierarchical vs. flat
- hard (no uncertainty about which profiles belong to a cluster) vs. soft clusters
- overlapping (a profile can belong to multiple clusters) vs. non-overlapping
- deterministic (same clusters produced every time for a given data set) vs. stochastic
- distance (similarity) measure used

Hierarchical vs. flat clustering

Flat clustering (e.g., K-means and Gaussian mixture models):
- the number of clusters, K, is pre-specified
- each object is assigned to one of these K clusters

Hierarchical clustering:
- hierarchical relationships are established between all objects
- a threshold on the maximum dissimilarity can be used to convert a hierarchical clustering into a flat clustering
- multiple flat clusterings can be produced by varying the threshold

Hierarchical clustering

[Dendrogram figure: the height of each bar indicates the degree of distance within a cluster (distance scale shown on the side); leaves represent the objects to be clustered (e.g., genes or samples).]

Hierarchical clustering example

Clustering of chromatin marks measured near genes in a particular cell type (induced pluripotent stem (iPS) cells). Columns correspond to chromatin marks, eight in total:
- five activating: H3K14, H3K18, H3K4me3, H3K9ac, H3K79me2
- three repressive: H3K9me2, H3K9me3, H3K27me3
Data from Sridharan et al.

Hierarchical clustering

Input: a set X = {x1, ..., xn} of data points, where each xi is a p-dimensional vector.
Output: a rooted tree with the data points at the leaves of the tree.

Two major strategies:
- top-down (divisive)
- bottom-up (agglomerative)

Both strategies iteratively build a tree, by splitting (top-down) or merging (bottom-up) subsets of data points. We will focus on bottom-up clustering.

Top-down clustering

Basic idea: use a flat clustering method to recursively split a set, X, of data points into K (usually K = 2) disjoint subsets.

topdown_cluster(X):
    if X has only one element x:
        return a tree with a single leaf node labeled by x
    else:
        X1, X2 = flat_cluster(X, K=2)
        T1 = topdown_cluster(X1)
        T2 = topdown_cluster(X2)
        return a tree with children T1 and T2
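The recursion above can be sketched in runnable Python. This is a minimal illustration, not code from the lecture: `two_means_split` is a stand-in for the flat_cluster(X, K=2) step (a bare-bones 2-means), and the tree is represented as nested 2-tuples with the data points at the leaves.

```python
import random

def two_means_split(points, iters=20, seed=0):
    """Stand-in for flat_cluster(X, K=2): a minimal 2-means split."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)  # pick two distinct initial centers
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            # assign each point to its nearest center (squared Euclidean)
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        if not groups[0] or not groups[1]:
            # degenerate split (e.g., duplicate points): fall back to halving
            mid = len(points) // 2
            return points[:mid], points[mid:]
        # recompute each center as the mean of its group
        centers = [tuple(sum(vals) / len(g) for vals in zip(*g))
                   for g in groups]
    return groups

def topdown_cluster(points):
    """Recursively split points; return a nested-tuple tree with point leaves."""
    if len(points) == 1:
        return points[0]
    left, right = two_means_split(points)
    return (topdown_cluster(left), topdown_cluster(right))
```

For well-separated data such as `[(0,0), (0,1), (10,10), (10,11)]`, the first split separates the two groups and the recursion then resolves each pair.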

Bottom-up hierarchical clustering

bottomup_cluster(X):
    C := {{x} : x in X}  // each instance is initially its own cluster, and a leaf in tree
    while |C| > 1:
        (cu, cv) := the pair of clusters in C with minimum distance  // find least distant pair in C
        cw := cu ∪ cv  // create a new cluster for pair
        make cw the parent of cu and cv in the tree
        C := (C \ {cu, cv}) ∪ {cw}  // add new cluster to list of clusters to be joined in the tree
    return the tree rooted at the final cluster
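A minimal Python sketch of this bottom-up loop, using single-link distance between clusters (the naive quadratic scan per merge; names are illustrative). The returned tree is a nested 2-tuple whose leaves are the input points.

```python
def bottomup_cluster(points):
    """Agglomerative (bottom-up) clustering with single-link distance.
    Returns a rooted tree as nested 2-tuples; leaves are the input points."""
    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    # each instance starts as its own cluster: (subtree, list of member points)
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        # find the least distant pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(x, y)
                        for x in clusters[i][1] for y in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # merge the pair into a new cluster (an internal node of the tree)
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

On `[(0,0), (0,1), (10,10), (10,11)]` the two nearby pairs merge first, and the root joins the two resulting clusters.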

Distance between two clusters

The distance between two clusters cu and cv can be determined in several ways:
- single link: distance of the two most similar profiles
- complete link: distance of the two least similar profiles
- average link: average distance between profiles
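The three linkage definitions translate directly into code. A small sketch, assuming Euclidean distance between profiles (any dissimilarity function could be substituted):

```python
from itertools import product

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def single_link(cu, cv, dist=euclidean):
    """Distance of the two most similar (closest) profiles."""
    return min(dist(x, y) for x, y in product(cu, cv))

def complete_link(cu, cv, dist=euclidean):
    """Distance of the two least similar (farthest) profiles."""
    return max(dist(x, y) for x, y in product(cu, cv))

def average_link(cu, cv, dist=euclidean):
    """Average pairwise distance between profiles in the two clusters."""
    return sum(dist(x, y) for x, y in product(cu, cv)) / (len(cu) * len(cv))
```

For cu = [(0,0), (0,1)] and cv = [(0,3), (0,5)], the pairwise distances are {3, 5, 2, 4}, so single link gives 2, complete link gives 5, and average link gives 3.5.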

Haven't we already seen this?

Hierarchical clustering is very similar to distance-based phylogenetic methods. In particular, average link hierarchical clustering is equivalent to UPGMA for phylogenetics.

Differences between general clustering and phylogenetic inference

What a tree represents:
- phylogenetic inference: the tree represents a hypothesized sequence of evolutionary events; internal nodes represent hypothetical ancestors
- clustering: the inferred tree represents the similarity of instances; internal nodes don't represent ancestors

Form of tree:
- UPGMA: rooted tree
- neighbor joining: unrooted tree
- hierarchical clustering: rooted tree

How distances among clusters are calculated:
- UPGMA: average link
- neighbor joining: based on additivity
- hierarchical clustering: various

Complete-link vs. single-link distances

Updating distances efficiently

If we just merged cu and cv into cw, we can determine the distance from cw to each other cluster ck as follows:
- single link: d(cw, ck) = min(d(cu, ck), d(cv, ck))
- complete link: d(cw, ck) = max(d(cu, ck), d(cv, ck))
- average link: d(cw, ck) = (|cu| d(cu, ck) + |cv| d(cv, ck)) / (|cu| + |cv|)
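The payoff of these update rules is that each new distance is a constant-time combination of two already-known distances, rather than a fresh pass over all pairs of profiles. A small sketch (the data and the helper `avg_link` are illustrative, not from the lecture) checks the average-link update against a full recomputation:

```python
def avg_link(cu, cv):
    """Average pairwise Euclidean distance, computed from scratch."""
    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    return sum(dist(x, y) for x in cu for y in cv) / (len(cu) * len(cv))

cu = [(0.0, 0.0)]
cv = [(0.0, 2.0), (2.0, 0.0)]
ck = [(5.0, 5.0), (6.0, 5.0)]

# constant-time update after merging cu and cv into cw
updated = (len(cu) * avg_link(cu, ck)
           + len(cv) * avg_link(cv, ck)) / (len(cu) + len(cv))

# brute-force recomputation over the merged cluster cw = cu + cv
recomputed = avg_link(cu + cv, ck)

assert abs(updated - recomputed) < 1e-12
```

The equality holds because the weighted average simply regroups the same sum of pairwise distances by which original cluster each profile came from.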

Effect of different linkage methods

[Figure: example clusterings produced by single linkage, complete linkage, and average linkage.]
Single linkage might result in a "chaining" effect.

Flat clustering from a hierarchical clustering

We can always generate a flat clustering from a hierarchical clustering by "cutting" the tree at some distance threshold.
[Figure: cutting the dendrogram at a higher threshold results in 2 clusters; cutting at a lower threshold results in 4 clusters.]
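One way to see the cut in code: for single-link clustering, cutting the dendrogram at a threshold is equivalent to running the bottom-up merges and stopping as soon as the closest pair of clusters is farther apart than the threshold. A minimal self-contained sketch (names and example data are illustrative):

```python
def flat_clusters(points, threshold):
    """Single-link agglomerative clustering, stopped at a distance threshold.
    Stopping when the closest pair exceeds the threshold is equivalent to
    cutting the single-link dendrogram at that height."""
    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(x, y)
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # the next merge would cross the cut: stop
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(len(flat_clusters(pts, threshold=2.0)))   # prints 2
print(len(flat_clusters(pts, threshold=0.5)))   # prints 4
```

Varying the threshold between the within-group distance (1.0 here) and the between-group distance yields the different flat clusterings mentioned above.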

Naïve computational complexity

The naïve implementation of hierarchical clustering has O(n^3) time complexity, where n is the number of objects:
- computing the initial distance matrix takes O(n^2) time
- there are O(n) merging steps
- on each step, we have to update the distance matrix and select the next pair of clusters to merge, which takes O(n^2) time

Computational complexity

For single-link clustering, we can update the distances and pick the next pair to merge in O(n) time per merge, resulting in an O(n^2) algorithm.
For complete-link and average-link, we can do these steps in O(n log n) time per merge, resulting in an O(n^2 log n) method.

How to pick the right clustering algorithm?

- If you have a sense of what the right number of clusters is, K-means or Gaussian mixture models might be good choices.
- If you want to control the extent of dissimilarity within clusters, you should use hierarchical clustering.
- Hierarchical clustering is deterministic: it always gives the same solution with the same distance metric. K-means and Gaussian mixture models are non-deterministic.
- We have talked about clustering gene expression profiles, but clustering can also be used to find groupings among more complex objects; all we need is to define the right distance metric.

Summary of clustering

- There are many different methods for clustering: flat clustering and hierarchical clustering.
- The dissimilarity metric among objects can strongly influence clustering results.
- Picking the number of clusters is difficult, but there are some ways to do it.
- Evaluation of clusters is often difficult; comparison with other sources of information can help assess cluster quality.