Hierarchical clustering approaches for high-throughput data
Colin Dewey
BMI/CS 576
Fall 2015

The clustering landscape
There are many different clustering algorithms. They differ with respect to several properties:
– hierarchical vs. flat
– hard (no uncertainty about which profiles belong to a cluster) vs. soft clusters
– overlapping (a profile can belong to multiple clusters) vs. non-overlapping
– deterministic (the same clusters are produced every time for a given data set) vs. stochastic
– the distance (similarity) measure used

Hierarchical vs. flat clustering
Flat clustering (e.g., K-means and Gaussian mixture models):
– The number of clusters, K, is pre-specified.
– Each object is assigned to one of these K clusters.
Hierarchical clustering:
– Hierarchical relationships are established between all objects.
– A threshold on the maximum dissimilarity can be used to convert a hierarchical clustering into a flat clustering; multiple flat clusterings can be produced by varying the threshold.
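To make the contrast concrete, here is a minimal sketch, assuming scikit-learn's KMeans and AgglomerativeClustering APIs and a made-up toy data set: flat clustering fixes K up front, while hierarchical clustering with a distance threshold lets the number of clusters fall out of the data.

    import numpy as np
    from sklearn.cluster import KMeans, AgglomerativeClustering

    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 9.2]])

    # Flat: K must be chosen in advance
    flat_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)

    # Hierarchical: cut the tree wherever dissimilarity exceeds a threshold
    hier = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0,
                                   linkage="average").fit(X)
    print(flat_labels, hier.labels_)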

Hierarchical clustering
[Dendrogram figure: leaves represent the objects to be clustered (e.g., genes or samples); the height of each bar indicates the degree of distance within the cluster it joins, measured against a distance scale starting at 0.]

Hierarchical clustering example
Clustering of chromatin marks measured near genes in a particular cell type (induced pluripotent stem (iPS) cells); data from Sridharan et al. Columns correspond to the eight chromatin marks:
– Five activating marks: H3K14, H3K18, H3K4me3, H3K9ac, H3K79me2
– Three repressive marks: H3K9me2, H3K9me3, H3K27me3

Hierarchical clustering
Input: a set X = {x1, ..., xn} of data points, where each xi is a p-dimensional vector.
Output: a rooted tree with the data points at the leaves of the tree.
Two major strategies:
– top-down (divisive)
– bottom-up (agglomerative)
Both strategies iteratively build a tree by splitting (top-down) or merging (bottom-up) subsets of data points. We will focus on bottom-up clustering.

Top-down clustering
Basic idea: use a flat clustering method to recursively split a set, X, of data points into K (usually K = 2) disjoint subsets.

    topdown_cluster(X):
        if X has only one element x:
            return a tree with a single leaf node labeled by x
        else:
            X1, X2 = flat_cluster(X, K=2)
            T1 = topdown_cluster(X1)
            T2 = topdown_cluster(X2)
            return a tree with children T1 and T2
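As a runnable illustration of this recursion, here is a minimal Python sketch; the use of scikit-learn's KMeans as the flat_cluster subroutine and the dict-based tree representation are choices made for this example, not part of the slides.

    import numpy as np
    from sklearn.cluster import KMeans

    def topdown_cluster(X):
        # Base case: a single point becomes a leaf of the tree
        if len(X) == 1:
            return {"leaf": X[0]}
        # Flat clustering step: split the points into K=2 groups
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
        X1, X2 = X[labels == 0], X[labels == 1]
        if len(X1) == 0 or len(X2) == 0:
            # Degenerate split (e.g., all points identical): halve arbitrarily
            X1, X2 = X[: len(X) // 2], X[len(X) // 2:]
        # Recurse on each half and join the subtrees under a new root
        return {"left": topdown_cluster(X1), "right": topdown_cluster(X2)}

    # Toy usage: cluster five 2-D points
    points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 9.0]])
    tree = topdown_cluster(points)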

Bottom-up hierarchical clustering
given: a set X = {x1, ..., xn} of data points

    // each instance is initially its own cluster, and a leaf in tree
    for each xi: define cluster ci = {xi} and create a leaf node for xi
    C = {c1, ..., cn}
    while |C| > 1:
        // find least distant pair in C
        (cu, cv) = pair of clusters in C with minimal d(cu, cv)
        // create a new cluster for pair
        cw = cu ∪ cv; create a tree node for cw with children cu and cv
        // Add new cluster to list of clusters to be joined in the tree
        remove cu and cv from C; add cw to C
    return the tree rooted at the final cluster
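In practice, bottom-up clustering is usually done with a library routine. Here is a minimal sketch using SciPy's scipy.cluster.hierarchy.linkage; the data and the choice of average linkage are illustrative assumptions.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 9.2]])

    # Each row of Z records one merge: the two clusters joined,
    # the distance at which they merged, and the new cluster's size.
    Z = linkage(X, method="average")
    print(Z)
    # dendrogram(Z)  # visualize the tree (requires matplotlib)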

Distance between two clusters
The distance between two clusters cu and cv can be determined in several ways:
– single link: distance of the two most similar profiles
– complete link: distance of the two least similar profiles
– average link: average distance between profiles
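Written out formally, with d(x, y) denoting the distance between two individual profiles, these three linkage definitions are:

    \begin{aligned}
    d_{\mathrm{single}}(c_u, c_v)   &= \min_{x \in c_u,\; y \in c_v} d(x, y) \\
    d_{\mathrm{complete}}(c_u, c_v) &= \max_{x \in c_u,\; y \in c_v} d(x, y) \\
    d_{\mathrm{average}}(c_u, c_v)  &= \frac{1}{|c_u|\,|c_v|} \sum_{x \in c_u} \sum_{y \in c_v} d(x, y)
    \end{aligned}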

Haven’t We Already Seen This?
Hierarchical clustering is very similar to distance-based phylogenetic methods. Average-link hierarchical clustering is equivalent to UPGMA for phylogenetics.

Differences between general clustering and phylogenetic inference
What a tree represents:
– phylogenetic inference: the tree represents a hypothesized sequence of evolutionary events; internal nodes represent hypothetical ancestors
– clustering: the inferred tree represents the similarity of instances; internal nodes don’t represent ancestors
Form of tree:
– UPGMA: rooted tree
– neighbor joining: unrooted tree
– hierarchical clustering: rooted tree
How distances among clusters are calculated:
– UPGMA: average link
– neighbor joining: based on additivity
– hierarchical clustering: various

Complete-link vs. single-link distances
[Figure: two clusters, with the complete-link distance drawn between their two least similar points and the single-link distance drawn between their two most similar points.]

Updating distances efficiently
If we have just merged cu and cv into cw, we can determine the distance from cw to each other cluster ck as follows:
– single link: d(cw, ck) = min( d(cu, ck), d(cv, ck) )
– complete link: d(cw, ck) = max( d(cu, ck), d(cv, ck) )
– average link: d(cw, ck) = ( |cu| d(cu, ck) + |cv| d(cv, ck) ) / ( |cu| + |cv| )
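A small Python sketch of these update rules; the function name and the way distances and sizes are passed in are assumptions of this example, not from the slides.

    def update_distance(d_uk, d_vk, size_u, size_v, linkage="average"):
        # Distance from the merged cluster cw = cu ∪ cv to another cluster ck,
        # given the old distances d(cu, ck) and d(cv, ck) and the cluster sizes.
        if linkage == "single":
            return min(d_uk, d_vk)
        if linkage == "complete":
            return max(d_uk, d_vk)
        # average link: size-weighted mean of the two old distances
        return (size_u * d_uk + size_v * d_vk) / (size_u + size_v)

    # e.g., merging clusters of sizes 2 and 3 whose distances to ck are 1.0 and 4.0:
    print(update_distance(1.0, 4.0, 2, 3))  # (2*1.0 + 3*4.0) / 5 = 2.8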

Effect of different linkage methods
[Figure: dendrograms of the same data under complete linkage, average linkage, and single linkage.]
Single linkage might result in a “chaining” effect: clusters are merged through chains of nearby points even when their far ends are very dissimilar.
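The chaining effect is easy to reproduce. In this sketch (the data and the cut threshold are illustrative assumptions), single linkage merges a long chain of points into one cluster at a threshold where complete linkage still keeps the ends apart:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Ten points in a line, each 1.0 apart: a "chain"
    X = np.array([[float(i), 0.0] for i in range(10)])

    t = 1.5  # dissimilarity threshold for cutting the tree
    single = fcluster(linkage(X, method="single"), t=t, criterion="distance")
    complete = fcluster(linkage(X, method="complete"), t=t, criterion="distance")
    print(len(set(single)), len(set(complete)))  # single link: 1 cluster; complete link: several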

Flat clustering from a hierarchical clustering
We can always generate a flat clustering from a hierarchical clustering by “cutting” the tree at some distance threshold.
[Figure: a dendrogram with two cut heights marked; cutting at the higher threshold results in 2 clusters, cutting at the lower threshold results in 4 clusters.]
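With SciPy, this cut corresponds to fcluster. Here is a minimal sketch; the data and thresholds are chosen arbitrarily for illustration:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [9.0, 9.2]])
    Z = linkage(X, method="average")

    # Lower threshold -> more, smaller clusters; higher threshold -> fewer, larger ones
    print(fcluster(Z, t=1.0, criterion="distance"))   # fine-grained clustering
    print(fcluster(Z, t=10.0, criterion="distance"))  # coarse clustering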

Naïve computational complexity
The naïve implementation of hierarchical clustering has O(n^3) time complexity, where n is the number of objects:
– computing the initial distance matrix takes O(n^2) time
– there are n − 1 merging steps
– on each step, we have to update the distance matrix and select the next pair of clusters to merge, which takes O(n^2) time

Computational Complexity
For single-link clustering, we can update the distances and pick the next pair to merge in O(n) time per step, resulting in an O(n^2) algorithm. For complete-link and average-link, we can do these steps in O(n log n) time per step, resulting in an O(n^2 log n) method.

How to pick the right clustering algorithm?
– If you have a sense of what the right number of clusters is, K-means or Gaussian mixture models might be good choices.
– If you want to control the extent of dissimilarity within clusters, you should use hierarchical clustering.
– Hierarchical clustering is deterministic: it always gives the same solution with the same distance metric. K-means and Gaussian mixture models are non-deterministic.
We have talked about clustering of gene expression profiles, but clustering can also be used to find groupings among more complex objects; all we need is to define the right distance metric.

Summary of clustering
There are many different methods for clustering:
– flat clustering
– hierarchical clustering
The dissimilarity metric among objects can influence clustering results a lot. Picking the number of clusters is difficult, but there are some ways to do this. Evaluation of clusters is often difficult; comparison with other sources of information can help assess cluster quality.