L16: Microarray analysis: Dimension reduction and unsupervised clustering

PCA: motivating example
Consider the expression values of 2 genes, $g_1$ and $g_2$, over 6 samples. Clearly, the expression of $g_1$ is not informative, and it suffices to look at the $g_2$ values. Dimensionality can be reduced by discarding gene $g_1$.

PCA: Example 2
Consider the expression values of 2 genes over 6 samples. Clearly, the expression of the two genes is highly correlated. Projecting all of the points onto a single line could explain most of the data.

PCA
Suppose all of the data were to be reduced by projecting onto a single line $\phi$ passing through the mean $m$. How do we select the line $\phi$?

PCA cont'd
Let each point $x_k$ map to $x'_k = m + a_k \phi$. We want to minimize the error $\sum_k \lVert x_k - x'_k \rVert^2$.
Observation 1: Each point $x_k$ maps to $x'_k = m + \phi^{T}(x_k - m)\,\phi$, i.e., $a_k = \phi^{T}(x_k - m)$.

Proof of Observation 1
Differentiating the error $\sum_k \lVert m + a_k \phi - x_k \rVert^2$ with respect to $a_k$ and setting the derivative to zero yields $a_k = \phi^{T}(x_k - m)$, using $\lVert\phi\rVert = 1$.
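Spelled out (the slide names only the operation; the intermediate algebra below is filled in):

```latex
\begin{align*}
E &= \sum_k \lVert m + a_k\phi - x_k \rVert^2 \\
\frac{\partial E}{\partial a_k}
  &= 2\,\phi^{T}\bigl(m + a_k\phi - x_k\bigr)
   = 2\bigl(a_k\,\phi^{T}\phi - \phi^{T}(x_k - m)\bigr) = 0 \\
\Rightarrow\; a_k &= \frac{\phi^{T}(x_k - m)}{\phi^{T}\phi}
   = \phi^{T}(x_k - m) \qquad (\lVert\phi\rVert = 1)
\end{align*}
```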

Minimizing PCA Error
Substituting $a_k$ back into the error shows that minimizing the error is equivalent to maximizing $\phi^{T} S \phi$ subject to $\lVert\phi\rVert = 1$, where $S$ is the scatter (covariance) matrix of the data. At an extremum, $S\phi = \lambda\phi$, and $\lambda = \phi^{T} S \phi$ implies that $\lambda$ is an eigenvalue and $\phi$ the corresponding eigenvector. Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.
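Assuming the standard scatter matrix $S = \sum_k (x_k - m)(x_k - m)^{T}$ (the slides use $S$ without defining it explicitly), the reduction to an eigenproblem can be sketched as:

```latex
\begin{align*}
E &= \sum_k \lVert x'_k - x_k \rVert^2
   = \sum_k \lVert x_k - m \rVert^2 - \sum_k \bigl(\phi^{T}(x_k - m)\bigr)^2
   = \mathrm{const} - \phi^{T} S \phi, \\
&\text{so minimizing } E \text{ means maximizing } \phi^{T} S \phi
  \text{ subject to } \phi^{T}\phi = 1, \\
\nabla_{\phi}&\bigl(\phi^{T} S \phi - \lambda(\phi^{T}\phi - 1)\bigr) = 0
  \;\Longrightarrow\; S\phi = \lambda\phi .
\end{align*}
```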

PCA
The single best dimension is given by the eigenvector corresponding to the largest eigenvalue of $S$. The best $k$ dimensions are obtained from the eigenvectors $\{\phi_1, \phi_2, \ldots, \phi_k\}$ corresponding to the $k$ largest eigenvalues. To obtain the $k$-dimensional projection, stack these eigenvectors as the columns of a matrix $B$ and take $B^{T} M$, where $M$ is the mean-centered data matrix.
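A minimal sketch of the whole procedure in Python with NumPy; the toy data and variable names are illustrative, not from the slides:

```python
import numpy as np

# Toy expression matrix: rows = genes (dimensions), columns = samples.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 20))           # 5 genes, 20 samples

# Center each gene (row) around its mean.
m = M.mean(axis=1, keepdims=True)
Mc = M - m

# Scatter / covariance matrix S and its eigendecomposition.
S = Mc @ Mc.T                          # 5 x 5, symmetric
eigvals, eigvecs = np.linalg.eigh(S)   # eigh: for symmetric matrices

# Eigenvectors of the k largest eigenvalues (eigh returns ascending order).
k = 2
B = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # 5 x k

# Project the centered data onto the top-k principal directions.
projected = B.T @ Mc                   # k x 20
print(projected.shape)                 # (2, 20)
```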

Clustering
Suppose we are not given any classes. Instead, we are asked to partition the samples into clusters that make sense. Alternatively, partition genes into clusters. Clustering is part of unsupervised learning.

Microarray Data
Microarray data are usually transformed into an intensity matrix (below). The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related. This is where clustering comes into play.
[Intensity matrix: rows Gene 1 ... Gene 5, columns Time 1 ... Time N; each entry is the intensity (expression level) of a gene at the measured time.]

Clustering of Microarray Data
Plot each gene as a point in N-dimensional space. Make a distance matrix holding the distance between every two gene points in the N-dimensional space. Genes with a small distance share the same expression characteristics and might be functionally related or similar. Clustering reveals groups of functionally related genes.

Clusters
Graphing the intensity matrix in multi-dimensional space.

The Distance Matrix, d
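As a sketch, the distance matrix $d$ can be computed with SciPy; the gene matrix below is invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows = genes, columns = time points (as in the intensity matrix above).
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 10))            # 5 genes measured at 10 times

# Pairwise Euclidean distances between all gene profiles.
d = squareform(pdist(X, metric="euclidean"))   # 5 x 5 symmetric matrix

print(np.round(d, 2))
```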

Homogeneity and Separation Principles
Homogeneity: elements within a cluster are close to each other.
Separation: elements in different clusters are further apart from each other.
…clustering is not an easy task! Given these points, a clustering algorithm might make two distinct clusters as follows.

Bad Clustering
This clustering violates both the Homogeneity and Separation principles: close distances between points in separate clusters, and far distances between points in the same cluster.

Good Clustering
This clustering satisfies both the Homogeneity and Separation principles.

Clustering Techniques
Agglomerative: start with every element in its own cluster, and iteratively join clusters together.
Divisive: start with one cluster and iteratively divide it into smaller clusters.
Hierarchical: organize elements into a tree; leaves represent genes, and the lengths of the paths between leaves represent the distances between genes. Similar genes lie within the same subtrees.

Hierarchical Clustering
Initially, each element is its own cluster. Merge the two closest clusters, and recurse. Key question: what is "closest"? How do you compute the distance between clusters?

Hierarchical Clustering: Computing Distances
$d_{\min}(C, C^*) = \min_{x \in C,\, y \in C^*} d(x, y)$: the distance between two clusters is the smallest distance between any pair of their elements.
$d_{\mathrm{avg}}(C, C^*) = \frac{1}{|C|\,|C^*|} \sum_{x \in C,\, y \in C^*} d(x, y)$: the distance between two clusters is the average distance between all pairs of their elements.
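These two rules correspond to what SciPy calls single and average linkage. A minimal sketch, with invented data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 10))            # 8 gene profiles, 10 conditions

# d_min corresponds to single linkage, d_avg to average linkage.
Z_single = linkage(X, method="single", metric="euclidean")
Z_avg = linkage(X, method="average", metric="euclidean")

# Cut the average-linkage tree into 3 flat clusters.
labels = fcluster(Z_avg, t=3, criterion="maxclust")
print(labels)
```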

Computing Distances (continued)
However, we still need a base distance metric for pairs of genes: Euclidean distance, Manhattan distance, dot product, mutual information. What are some qualitative differences between these?

Geometrical interpretation of distances
The distance measures are all related. In some cases, the magnitude of the vector is important; in other cases it is not. Euclidean distance $\lVert X - Y \rVert_2$, Manhattan distance $\lVert X - Y \rVert_1$, and, for unit-length vectors, the angle $\theta = \cos^{-1}(X^{T} Y)$.

Comparison between metrics
Euclidean and Manhattan distances tend to perform similarly and emphasize the overall magnitude of expression. The dot product is very useful if the 'shape' of the expression vector is more important than its magnitude. The above metrics are less useful for identifying genes whose expression levels are anti-correlated. One might imagine an instance in which the same transcription factor can cause both enhancement and repression of expression. In this case, the squared correlation ($r^2$) or mutual information is sometimes used.
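To make the contrast concrete, a small illustration with two exactly anti-correlated profiles (the numbers are invented):

```python
import numpy as np

# Two anti-correlated expression profiles.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = -x + 6.0                            # y = [5, 4, 3, 2, 1]

euclidean = np.linalg.norm(x - y)
manhattan = np.abs(x - y).sum()
dot = float(x @ y)
r = np.corrcoef(x, y)[0, 1]             # Pearson correlation

print(f"Euclidean: {euclidean:.2f}")    # large: profiles look far apart
print(f"Manhattan: {manhattan:.2f}")
print(f"dot product: {dot:.2f}")
print(f"r^2: {r**2:.2f}")               # 1.00: anti-correlation is captured
```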

But how many orderings can we have?
[Figure: a dendrogram whose leaves read 1 2 4 5 3 from left to right.]

For $n$ leaves there are $n-1$ internal nodes. Each flip of an internal node creates a new linear ordering of the leaves. There are therefore $2^{n-1}$ orderings. E.g., flip this node.

Bar-Joseph et al. Bioinformatics (2001)

Computing an Optimal Ordering
Define $L_T(u,v)$ as the optimum score over all orderings of the subtree rooted at $T$ in which $u$ is the leftmost leaf and $v$ is the rightmost leaf. Is it sufficient to compute $L_T(u,v)$ for all $T, u, v$?

Let $T_1$ and $T_2$ be the subtrees of $T$, with $u, k$ leaves of $T_1$ and $m, v$ leaves of $T_2$. Then
$L_T(u,v) = \max_{k,m} \{ L_{T_1}(u,k) + L_{T_2}(m,v) + C(k,m) \}$,
where $C(k,m)$ is the similarity between the adjacent boundary leaves $k$ and $m$.
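A compact Python sketch of this dynamic program (the tree encoding, similarity matrix, and all names are invented for illustration; the speedups discussed below are not attempted):

```python
# A tree is either a leaf index (int) or a pair (left_subtree, right_subtree).
def optimal_ordering(tree, C):
    """Return {(u, v): best score} over all orderings of `tree` in which
    u is the leftmost and v the rightmost leaf; C[i][j] is the similarity
    between leaves i and j."""
    if isinstance(tree, int):               # a single leaf
        return {(tree, tree): 0.0}
    left, right = tree
    L1 = optimal_ordering(left, C)
    L2 = optimal_ordering(right, C)
    best = {}
    # Try both orientations of this internal node (the "flip").
    for A, B in ((L1, L2), (L2, L1)):
        for (u, k), s1 in A.items():        # k: rightmost leaf of the left part
            for (m, v), s2 in B.items():    # m: leftmost leaf of the right part
                score = s1 + C[k][m] + s2
                if score > best.get((u, v), float("-inf")):
                    best[(u, v)] = score
    return best

# Tiny example: tree ((0, 1), (2, 3)) with an invented similarity matrix.
C = [[0, 3, 1, 0],
     [3, 0, 2, 1],
     [1, 2, 0, 3],
     [0, 1, 3, 0]]
table = optimal_ordering(((0, 1), (2, 3)), C)
(u, v), score = max(table.items(), key=lambda kv: kv[1])
print(u, v, score)                          # e.g. 0 3 8.0
```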

Time complexity of the algorithm?
The recursion for $L_T(u,v)$ is applied for each $T, u, v$; each application takes $O(n^2)$ time. Each pair of leaves has a unique least common ancestor, so $L_T(u,v)$ only needs to be computed if $\mathrm{LCA}(u,v) = T$. Total time: $O(n^4)$.

Speed Improvements
Iterate over candidates in decreasing order of their partial scores and terminate early once no remaining candidate can beat the current maximum:
For all m in T_1, in decreasing order of L_{T_1}(u,m):
  If L_{T_1}(u,m) + L_{T_2}(k_0,w) + C(T_1,T_2) <= CurrMax: exit loop
  For all k in T_2, in decreasing order of L_{T_2}(k,w):
    If L_{T_1}(u,m) + L_{T_2}(k,w) + C(T_1,T_2) <= CurrMax: exit loop
    Else: recompute CurrMax
In practice, this leads to great speed improvements: for 1500 genes, 7 hrs changes to 7 min.