Clustering Gene Expression Data BMI/CS 576 Colin Dewey Fall 2010.

Gene Expression Profiles we’ll assume we have a 2D matrix of gene expression measurements –rows represent genes –columns represent different experiments, time points, individuals etc. (what we can measure using one* microarray) we’ll refer to individual rows or columns as profiles –a row is a profile for a gene * Depending on the number of genes being considered, we might actually use several arrays per experiment, time point, individual.

Expression Profile Example
rows represent human genes
columns represent people with leukemia

Expression Profile Example
rows represent yeast genes
columns represent time points in a given experiment

Task Definition: Clustering Gene Expression Profiles
given: expression profiles for a set of genes or experiments/individuals/time points (whatever the columns represent)
do: organize the profiles into clusters such that
–instances in the same cluster are highly similar to each other
–instances from different clusters have low similarity to each other

Motivation for Clustering
exploratory data analysis
–understanding general characteristics of the data
–visualizing the data
generalization
–infer something about an instance (e.g. a gene) based on how it relates to other instances
everyone else is doing it

The Clustering Landscape
there are many different clustering algorithms
they differ along several dimensions
–hierarchical vs. partitional (flat)
–hard (no uncertainty about which instances belong to a cluster) vs. soft clusters
–disjunctive (an instance can belong to multiple clusters) vs. non-disjunctive
–deterministic (same clusters produced every time for a given data set) vs. stochastic
–distance (similarity) measure used

Distance/Similarity Measures
many clustering methods employ a distance (similarity) measure to assess the distance between
–a pair of instances
–a cluster and an instance
–a pair of clusters
given a distance value, it is straightforward to convert it into a similarity value; it is not necessarily straightforward to go the other way
we'll describe our algorithms in terms of distances

Distance Metrics
properties of metrics:
–d(x_i, x_j) >= 0 (non-negativity)
–d(x_i, x_j) = 0 if and only if x_i = x_j (identity)
–d(x_i, x_j) = d(x_j, x_i) (symmetry)
–d(x_i, x_k) <= d(x_i, x_j) + d(x_j, x_k) (triangle inequality)
some distance metrics:
–Manhattan: d(x_i, x_j) = sum_e |x_{i,e} - x_{j,e}|
–Euclidean: d(x_i, x_j) = sqrt( sum_e (x_{i,e} - x_{j,e})^2 )
where e ranges over the individual measurements for x_i and x_j
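A minimal sketch of the two metrics applied to a pair of expression profiles (Python/NumPy; the profile values are invented):

import numpy as np

def manhattan(x_i, x_j):
    # sum over measurements e of |x_{i,e} - x_{j,e}|
    return np.sum(np.abs(x_i - x_j))

def euclidean(x_i, x_j):
    # sqrt of the sum over measurements e of (x_{i,e} - x_{j,e})^2
    return np.sqrt(np.sum((x_i - x_j) ** 2))

x_i = np.array([2.1, 0.3, -1.2])
x_j = np.array([1.9, 0.1, -0.8])
print(manhattan(x_i, x_j), euclidean(x_i, x_j))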

Hierarchical Clustering: A Dendrogram
leaves represent instances (e.g. genes)
the height of a bar indicates the degree of distance within the cluster
(figure: dendrogram with a vertical distance scale starting at 0)

Hierarchical Clustering of Expression Data

Hierarchical Clustering
can do top-down (divisive) or bottom-up (agglomerative)
in either case, we maintain a matrix of distance (or similarity) scores for all pairs of
–instances
–clusters (formed so far)
–instances and clusters

Bottom-Up Hierarchical Clustering
// each object is initially its own cluster, and a leaf in the tree
// repeat until only one cluster remains:
//   find the least distant pair in C
//   create a new cluster for the pair; remove the merged pair from C and add the new cluster
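A sketch of this bottom-up procedure in Python, assuming the instances are rows of a NumPy array, Euclidean instance distances, and a selectable linkage; this is the naive version whose cost is discussed on the complexity slides below.

import numpy as np
from itertools import combinations

def cluster_distance(X, ci, cj, linkage="average"):
    # all pairwise instance distances between the two clusters (Euclidean)
    d = np.array([[np.linalg.norm(X[a] - X[b]) for b in cj] for a in ci])
    if linkage == "single":
        return d.min()      # distance of the two most similar instances
    if linkage == "complete":
        return d.max()      # distance of the two least similar instances
    return d.mean()         # average link

def bottom_up_cluster(X, linkage="average"):
    # each object is initially its own cluster, and a leaf in the tree
    clusters = {i: [i] for i in range(len(X))}
    tree = {i: i for i in range(len(X))}
    next_id = len(X)
    while len(clusters) > 1:
        # find the least distant pair of clusters in C
        a, b = min(combinations(clusters, 2),
                   key=lambda p: cluster_distance(X, clusters[p[0]], clusters[p[1]], linkage))
        # create a new cluster for the pair
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        tree[next_id] = (tree.pop(a), tree.pop(b))
        next_id += 1
    return tree[next_id - 1]   # nested tuples encode the rooted tree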

Haven’t We Already Seen This? this algorithm is very similar to UPGMA and neighbor joining; there are some differences what tree represents –phylogenetic inference: tree represents hypothesized sequence of evolutionary events; internal nodes represent hypothetical ancestors –clustering: inferred tree represents similarity of instances; internal nodes don’t represent ancestors form of tree –UPGMA: rooted tree –neighbor joining: unrooted –hierarchical clustering: rooted tree how distances among clusters are calculated –UPGMA: average link –neighbor joining: based on additivity –hierarchical clustering: various

Distance Between Two Clusters
the distance between two clusters can be determined in several ways
–single link: distance of the two most similar instances
–complete link: distance of the two least similar instances
–average link: average distance between instances

Complete-Link vs. Single-Link Distances
(figure contrasting the complete-link distance and the single-link distance between two clusters)

Updating Distances Efficiently
if we just merged c_i and c_j into c_new, we can determine the distance from c_new to each other cluster c_k as follows
–single link: d(c_new, c_k) = min( d(c_i, c_k), d(c_j, c_k) )
–complete link: d(c_new, c_k) = max( d(c_i, c_k), d(c_j, c_k) )
–average link: d(c_new, c_k) = ( |c_i| d(c_i, c_k) + |c_j| d(c_j, c_k) ) / ( |c_i| + |c_j| )
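A small Python helper expressing the same updates; n_i and n_j stand for the sizes |c_i| and |c_j| of the merged clusters.

def updated_distance(d_ik, d_jk, n_i, n_j, linkage="average"):
    # distance from the newly merged cluster to another cluster c_k,
    # computed from the previous distances d(c_i, c_k) and d(c_j, c_k)
    if linkage == "single":
        return min(d_ik, d_jk)
    if linkage == "complete":
        return max(d_ik, d_jk)
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)   # average link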

Computational Complexity
the naïve implementation of hierarchical clustering has O(n^3) time complexity, where n is the number of instances
–computing the initial distance matrix takes O(n^2) time
–there are n - 1 merging steps
–on each step, we have to update the distance matrix and select the next pair of clusters to merge, which takes O(n^2) time in the naïve approach

Computational Complexity
for single-link clustering, we can update and pick the next pair in O(n) time per step, resulting in an O(n^2) algorithm
for complete-link and average-link we can do these steps in O(n log n) time per step, resulting in an O(n^2 log n) method
see the literature for an improved and corrected discussion of the computational complexity of hierarchical clustering

Partitional Clustering
divide instances into disjoint clusters
–flat vs. tree structure
key issues
–how many clusters should there be?
–how should clusters be represented?

Partitional Clustering Example

Partitional Clustering from a Hierarchical Clustering
we can always generate a partitional clustering from a hierarchical clustering by "cutting" the tree at some level
(figure: cutting the tree at one height results in 2 clusters; cutting it at a lower height results in 4 clusters)
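As a sketch of how this looks with SciPy's hierarchical clustering routines (the data below are placeholders), we can build one tree and cut it at two different levels to get 2 and 4 flat clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# hypothetical expression profiles: 6 instances, 3 measurements each
X = np.array([[ 2.0,  0.1, -1.0],
              [ 1.8,  0.2, -0.9],
              [-0.5,  1.5,  2.0],
              [-0.6,  1.4,  2.1],
              [ 0.0,  0.0,  0.0],
              [ 0.1, -0.1,  0.1]])

Z = linkage(X, method="average", metric="euclidean")   # bottom-up tree
labels_2 = fcluster(Z, t=2, criterion="maxclust")       # cut high: 2 clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")       # cut lower: 4 clusters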

K-Means Clustering
assume our instances are represented by vectors of real values
put k cluster centers in the same space as the instances
–each cluster is represented by a vector (its center)
consider an example in which our vectors have 2 dimensions
(figure: instances and cluster centers plotted in the plane)

K-Means Clustering
each iteration involves two steps
–assignment of instances to clusters
–re-computation of the means
(figure: one assignment step followed by one re-computation of the means)

K-Means Clustering: Updating the Means
for the set of instances C_j that have been assigned to a cluster, we re-compute the mean of the cluster as follows:
mu_j = (1 / |C_j|) sum_{x in C_j} x

K-Means Clustering
// repeat until the assignments no longer change:
//   for each cluster, determine which instances are assigned to this cluster
//   for each cluster, update the cluster center (the mean of its assigned instances)
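A sketch of the two-step iteration in Python/NumPy; Euclidean distance and initialization at randomly chosen instances are assumptions made for the sketch, not something prescribed by the slides.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize the k cluster centers at randomly chosen instances
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: each instance goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # update step: re-compute each center as the mean of its instances
        new_centers = np.array([X[assignments == j].mean(axis=0)
                                if np.any(assignments == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break   # the means no longer change, so the procedure has converged
        centers = new_centers
    return centers, assignments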

K-means Clustering Example
given 4 instances and 2 clusters initialized as shown, with an assumed distance function
(figure: instances x_1 ... x_4 plotted on features f_1 and f_2, with the two initial cluster centers)

K-means Clustering Example (Continued)
the assignments remain the same, so the procedure has converged

EM Clustering
in k-means as just described, instances are assigned to one and only one cluster
we can do "soft" k-means clustering via an Expectation Maximization (EM) algorithm
–each cluster is represented by a distribution (e.g. a Gaussian)
–E step: determine how likely it is that each cluster "generated" each instance
–M step: adjust the cluster parameters to maximize the likelihood of the instances

Representation of Clusters
in the EM approach, we'll represent each cluster using an m-dimensional multivariate Gaussian
N_j(x) = 1 / ( (2 pi)^(m/2) |Sigma_j|^(1/2) ) * exp( -(1/2) (x - mu_j)^T Sigma_j^{-1} (x - mu_j) )
where mu_j is the mean of the Gaussian and Sigma_j is the covariance matrix
(the figure shows such a Gaussian in a 2-D space)

Cluster Generation
we model our data points with a generative process
each point is generated by:
–choosing a cluster j from {1, 2, ..., k} by sampling from a probability distribution over the clusters, with Prob(cluster j) = P_j
–sampling a point from the Gaussian distribution N_j
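A sketch of this generative process in Python; the mixture weights P_j, the cluster means, and the shared covariance below are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

P = np.array([0.5, 0.3, 0.2])             # Prob(cluster j) = P_j
means = np.array([[ 0.0, 0.0],
                  [ 4.0, 4.0],
                  [-4.0, 3.0]])            # mean of each Gaussian N_j
cov = np.eye(2)                            # shared covariance, for simplicity

def generate_point():
    j = rng.choice(len(P), p=P)                    # choose a cluster by sampling from P
    x = rng.multivariate_normal(means[j], cov)     # sample a point from N_j
    return j, x

points = [generate_point() for _ in range(5)]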

EM Clustering
the EM algorithm will try to set the parameters of the Gaussians, theta, to maximize the log likelihood of the data X:
log P(X | theta) = sum_{i=1..n} log sum_{j=1..k} P_j N_j(x_i)

EM Clustering
the parameters of the model, theta, include the means, the covariance matrix, and sometimes prior weights for each Gaussian
here, we'll assume that the covariance matrix is fixed; we'll focus just on setting the means and the prior weights

EM Clustering: Hidden Variables
on each iteration of k-means clustering, we had to assign each instance to a cluster
in the EM approach, we'll use hidden variables to represent this idea
for each instance x_i we have a set of hidden variables Z_{i1}, ..., Z_{ik}
we can think of Z_{ij} as being 1 if x_i is a member of cluster j and 0 otherwise

EM Clustering: the E-step
recall that Z_{ij} is a hidden variable which is 1 if N_j generated x_i and 0 otherwise
in the E-step, we compute E[Z_{ij}], the expected value of this hidden variable:
E[Z_{ij}] = P_j N_j(x_i) / sum_{l=1..k} P_l N_l(x_i)
(figure: the resulting soft assignment of an instance to the clusters)

EM Clustering: the M-step
given the expected values E[Z_{ij}], we re-estimate the means of the Gaussians and the cluster probabilities:
mu_j = sum_i E[Z_{ij}] x_i / sum_i E[Z_{ij}]
P_j = sum_i E[Z_{ij}] / n
can also re-estimate the covariance matrix if we're varying it

EM Clustering Example
consider a one-dimensional clustering problem in which the data given are:
x_1 = -4, x_2 = -3, x_3 = -1, x_4 = 3, x_5 = 5
the initial mean of the first Gaussian is 0 and the initial mean of the second is 2
the Gaussians both have variance = 4; their density function is:
f(x) = 1 / sqrt(2 pi * 4) * exp( -(x - mu)^2 / (2 * 4) )
where mu denotes the mean (center) of the Gaussian
initially, we set P_1 = P_2 = 0.5

EM Clustering Example

EM Clustering Example: E Step

EM Clustering Example: M-step
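As a sketch (Python), one E-step and M-step on the data given above (x = -4, -3, -1, 3, 5; initial means 0 and 2; fixed variance 4; P_1 = P_2 = 0.5) can be computed as follows; this reproduces the kind of calculation worked through on the example slides.

import numpy as np

x = np.array([-4.0, -3.0, -1.0, 3.0, 5.0])
means = np.array([0.0, 2.0])      # initial means of the two Gaussians
var = 4.0                         # fixed variance
P = np.array([0.5, 0.5])          # initial cluster probabilities

def density(x, mu):
    # Gaussian density with variance 4 (the formula on the example slide)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# E-step: expected value of each hidden variable Z_ij
joint = np.array([P[j] * density(x, means[j]) for j in range(2)])   # shape (2, n)
E_Z = joint / joint.sum(axis=0)

# M-step: re-estimate the means and the cluster probabilities
means = (E_Z * x).sum(axis=1) / E_Z.sum(axis=1)
P = E_Z.sum(axis=1) / len(x)

print(means, P)   # parameters after one EM iteration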

EM Clustering Example
here we've shown just one step of the EM procedure
we would continue the E- and M-steps until convergence

EM and K-Means Clustering
both will converge to a local maximum
both are sensitive to the initial positions (means) of the clusters
we have to choose the value of k for both

Cross Validation to Select k
with EM clustering, we can run the method with different values of k and use cross validation to evaluate each clustering
–compute the (log) likelihood of the held-aside data to evaluate each clustering
–then run the method on all of the data once we've picked k
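One way to carry this out with an off-the-shelf Gaussian mixture implementation (scikit-learn); the data, the candidate values of k, and the number of folds below are placeholders.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # placeholder expression profiles

best_k, best_score = None, -np.inf
for k in (2, 3, 4, 5):
    fold_scores = []
    for train, val in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        gm = GaussianMixture(n_components=k, random_state=0).fit(X[train])
        fold_scores.append(gm.score(X[val]))   # mean log likelihood of held-aside data
    if np.mean(fold_scores) > best_score:
        best_k, best_score = k, np.mean(fold_scores)

final_model = GaussianMixture(n_components=best_k, random_state=0).fit(X)   # run on all data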

Evaluating Clustering Results
given random data without any "structure", clustering algorithms will still return clusters
the gold standard: do the clusters correspond to natural categories?
do the clusters correspond to categories we care about? (there are lots of ways to partition the world)

Evaluating Clustering Results
some approaches
–external validation: e.g. do genes clustered together have some common function?
–internal validation: how well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
–relative validation: how does it compare to other clusterings using these criteria? e.g. with a probabilistic method (such as EM) we can ask: how probable does held-aside data look as we vary the number of clusters?

Comments on Clustering
there are many different ways to do clustering; we've discussed just a few methods
hierarchical clusterings may be more informative, but they're more expensive to compute
clusterings are hard to evaluate in many cases

The CLICK Algorithm
Sharan & Shamir, ISMB 2000
instances to be clustered (e.g. genes) are represented as vertices in a graph
weighted, undirected edges represent the similarity of instances

CLICK: How Do We Get the Graph?
assume pairwise similarity values are normally distributed
–for mates (instances in the same "true" cluster): similarity values ~ N(mu_mates, sigma_mates^2)
–for non-mates: similarity values ~ N(mu_non, sigma_non^2)
estimate the parameters of these distributions and Pr(mates) (the probability that two randomly chosen instances are mates) from the data

CLICK: How Do We Get the Graph?
let f_mates(S_ij) be the probability density function for similarity values when i and j are mates, and f_non(S_ij) the density when they are not
then set the weight of an edge to the log-odds that i and j are mates:
w_ij = ln( Pr(mates) f_mates(S_ij) / ( (1 - Pr(mates)) f_non(S_ij) ) )
prune edges with weights < a specified non-negative threshold t
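A sketch of the weight computation under this formulation (Python/SciPy); the similarity values, the fitted parameters, and the threshold are placeholders.

import numpy as np
from scipy.stats import norm

# placeholder parameter estimates from the data
mu_mates, sigma_mates = 0.8, 0.1       # similarity distribution for mates
mu_non, sigma_non = 0.2, 0.2           # similarity distribution for non-mates
p_mates = 0.05                         # Pr(two random instances are mates)
t = 0.0                                # non-negative pruning threshold

def edge_weight(s_ij):
    # log-odds that i and j are mates given their similarity value
    num = p_mates * norm.pdf(s_ij, mu_mates, sigma_mates)
    den = (1 - p_mates) * norm.pdf(s_ij, mu_non, sigma_non)
    return np.log(num / den)

similarities = {("g1", "g2"): 0.85, ("g1", "g3"): 0.30, ("g2", "g3"): 0.75}
edges = {}
for pair, s in similarities.items():
    w = edge_weight(s)
    if w >= t:              # prune edges with weights below the threshold
        edges[pair] = w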

The Basic CLICK Algorithm
/* if the graph has just one vertex, report it as a singleton */
/* else if the graph satisfies the stopping criterion, report it as a kernel */
/* else partition the graph along a minimum weight cut and call recursively on the two sides */
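A sketch of the recursive structure in Python with NetworkX's Stoer-Wagner minimum weight cut; is_kernel stands in for the stopping criterion described on the next slides, and the handling of disconnected subgraphs is an addition for robustness after edge pruning.

import networkx as nx

def basic_click(G, is_kernel, kernels, singletons):
    # does the graph have just one vertex?
    if G.number_of_nodes() == 1:
        singletons.extend(G.nodes())
        return
    # a pruned graph may be disconnected; handle each component separately
    if not nx.is_connected(G):
        for comp in nx.connected_components(G):
            basic_click(G.subgraph(comp).copy(), is_kernel, kernels, singletons)
        return
    # does the graph satisfy the stopping criterion?
    if is_kernel(G):
        kernels.append(set(G.nodes()))
        return
    # partition the graph along a minimum weight cut, call recursively on both sides
    cut_value, (part_a, part_b) = nx.stoer_wagner(G)
    basic_click(G.subgraph(part_a).copy(), is_kernel, kernels, singletons)
    basic_click(G.subgraph(part_b).copy(), is_kernel, kernels, singletons)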

Minimum Weight Cuts
a cut of a graph is a subset of edges whose removal disconnects the graph
a minimum weight cut is the cut with the smallest sum of edge weights
it can be found efficiently

Deciding When a Subgraph Represents a Kernel
we can test a cut C against two hypotheses
–H0: C contains only edges between non-mates
–H1: C contains only edges between mates
we can then score C by log( Pr(H1 | C) / Pr(H0 | C) ), which, given the edge weights defined above, is the total weight W(C) of the edges crossing the cut

Deciding When a Subgraph Represents a Kernel
if we assume a complete graph, the minimum weight cut algorithm finds the cut that minimizes this score, i.e. the cut C with the smallest weight W(C)
thus, we accept H1 and call G a kernel iff W(C) > 0 for the minimum weight cut C

Deciding When a Subgraph Represents a Kernel
but we don't have a complete graph (low-weight edges were pruned)
we call G a kernel iff the weight of its minimum cut exceeds a correction term that approximates the contribution of the missing edges

The Full CLICK Algorithm
the basic CLICK algorithm produces kernels of clusters
add two more operations
–adoption: find singletons that are similar, and hence can be adopted by kernels
–merge: merge similar clusters

The Full CLICK Algorithm

CLICK Experiment: Fibroblast Serum Response Data
(table 2 and figure from Sharan & Shamir, ISMB 2000)