Learn from chips: Microarray data analysis and clustering CS 374 Yu Bai Nov. 16, 2004
Outline: Background & motivation; Algorithms overview; Fuzzy k-means clustering (1st paper); Independent component analysis (2nd paper)
CHIP-ing away at medical questions Why does cancer occur? Diagnosis Treatment Drug design Molecular level understanding Snapshot of gene expression “(DNA) Microarray”
Spot your genes [Figure: RNA isolated from a cancer cell and from a normal cell is labeled with Cy5 and Cy3 dyes; known gene sequences sit on a glass slide (chip).]
Matrix of expression [Figure: expression matrix with rows Gene 1, Gene 2, …, Gene N and columns Exp 1, Exp 2, Exp 3 (E1, E2, E3).]
Why care about “clustering”? [Figure: expression matrix (Gene 1 … Gene N by E1, E2, E3) shown before and after the genes are reordered.] Discover functional relations: similar expression suggests functionally related genes. Assign function to unknown genes. Find which genes control which other genes.
A review: microarray data analysis. Supervised (classification) vs. unsupervised (clustering). “Heuristic” methods: hierarchical clustering, k-means clustering, self-organizing map, others. Probability-based methods: principal component analysis (PCA), independent component analysis (ICA), others. Say heuristic methods are limited by….
Heuristic methods: distance metrics
1. Euclidean distance: $D(X,Y) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \dots + (x_n-y_n)^2}$
2. (Pearson) correlation coefficient: $R(X,Y) = \frac{1}{n}\sum_{i=1}^{n} \frac{x_i - E(x)}{\sigma_x} \cdot \frac{y_i - E(y)}{\sigma_y}$, where $\sigma_x = \sqrt{E(x^2) - E(x)^2}$ and $E(x)$ is the expected value of $x$. $R = 1$ if $x = y$; $R = 0$ if $E(xy) = E(x)E(y)$.
3. Other choices for distances…
Both measure the similarity between the expression patterns of the genes. Physical explanation of the correlation coefficient: to what extent the behavior of one gene affects the other.
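The two metrics on this slide can be sketched in a few lines. This is a minimal illustration, not code from either paper; the function names and the example profiles are made up for the demonstration.

```python
import math

def euclidean_distance(x, y):
    # D(X, Y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    # R(X, Y) = (1/n) * sum_i [(x_i - E(x)) / sigma_x] * [(y_i - E(y)) / sigma_y]
    # Assumes neither profile is constant (sigma > 0).
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return sum((a - mean_x) / sigma_x * (b - mean_y) / sigma_y
               for a, b in zip(x, y)) / n

# Hypothetical expression profiles of three genes over three experiments.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]   # same trend as gene_a, larger amplitude
gene_c = [3.0, 2.0, 1.0]   # opposite trend to gene_a

r_ab = pearson_correlation(gene_a, gene_b)
r_ac = pearson_correlation(gene_a, gene_c)
d_ab = euclidean_distance(gene_a, gene_b)
```

Here gene_a and gene_b are perfectly correlated even though their Euclidean distance is not zero, while gene_a and gene_c are perfectly anti-correlated, so the choice of metric changes which genes count as "similar".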
Hierarchical clustering: Easy. The result depends on where the grouping starts. The “tree” structure is difficult to interpret: it is hard to read the relation between nodes, e.g. when one group of genes represses another group, the two are anti-correlated and end up far away from each other.
K-means clustering: Overall optimization. Open issues: how to initialize, local minima, how many clusters (k). Generally, heuristic methods have no established means to determine the “correct” number of clusters or to choose the “best” algorithm.
Probability-based methods: principal component analysis (PCA). Pearson 1901; Everitt 1992; Basilevsky 1994. Common use: reduce dimension & filter noise. Goal: find “uncorrelated” (linear) component(s) that account for as much of the variance in the initial variables as possible. “Uncorrelated”: $E[xy] = E[x]E[y]$ for $x \neq y$.
Notes: PCA is a statistical tool introduced by Pearson (1901), used to reduce dimension and filter noise. The principal components account for as much of the variance in the initial variables as possible while remaining mutually uncorrelated and orthogonal. They are linear transformations of the original variables, so they may carry some meaning (example of running?).
PCA algorithm
“Column-centered” matrix: A (genes × experiments Exp 1…n). Covariance matrix: $A^T A$ (experiments × experiments). Eigenvalue decomposition: $A^T A = U \Lambda U^T$, with U the eigenvectors (principal components) and Λ the eigenvalues. Digest the principal components (eigenarrays). Gaussian assumption.
Notes: Why center? What is the covariance matrix? It holds the correlation between each pair of experiments, in the sense of comparing their gene expression patterns. The eigenvalue decomposition acts on X'X (written $A^T A$ on the slide), which is the covariance matrix if X is centered. Finally, get the projection of the principal components onto the experiments. The first two are the most important: the first is the averaged expression weighted by the variance in a particular experiment (how different; correlated or anti-correlated in experiment i), and the second represents the change in expression over the experimental conditions (the trend of expression-pattern changes across the experiment arrays).
Q: Is any information lost by using the covariance matrix instead of the initial array? For Gaussian variables, (1) their relation can always be represented in a linear manner (correlation), so the covariance matrix is sufficient to represent the information in the data; (2) uncorrelated is equivalent to independent.
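The centering, covariance, and eigenvector steps can be sketched without any external library. As an assumption for brevity, this sketch uses power iteration, which recovers only the leading eigenvector of $A^T A$ rather than the full decomposition $U \Lambda U^T$; the function names and the toy matrix are hypothetical.

```python
import math

def column_center(A):
    # Subtract each column (experiment) mean so every column has mean zero.
    n_rows, n_cols = len(A), len(A[0])
    means = [sum(row[c] for row in A) / n_rows for c in range(n_cols)]
    return [[row[c] - means[c] for c in range(n_cols)] for row in A]

def gram_matrix(A):
    # A^T A: an (experiments x experiments) matrix, proportional to the
    # covariance matrix when A is column-centered.
    n_cols = len(A[0])
    return [[sum(row[a] * row[b] for row in A) for b in range(n_cols)]
            for a in range(n_cols)]

def leading_eigenpair(C, iterations=500):
    # Power iteration: repeatedly apply C and renormalize. Converges to the
    # eigenvector with the largest eigenvalue (assumes C is not all zeros).
    n = len(C)
    u = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        v = [sum(C[a][b] * u[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(t * t for t in v))
        u = [t / norm for t in v]
    eigenvalue = sum(u[a] * sum(C[a][b] * u[b] for b in range(n))
                     for a in range(n))
    return eigenvalue, u

# Toy data: 4 genes x 2 experiments, second experiment is twice the first.
A = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = column_center(A)
C = gram_matrix(centered)
top_eigenvalue, top_component = leading_eigenpair(C)
```

In this toy case all the variance lies along the direction (1, 2), so a single principal component captures the whole data set; a real analysis would use a full eigenvalue or singular value decomposition routine.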
Are biologists satisfied? Biological processes are non-Gaussian: super-Gaussian model. “Faithful” vs. “meaningful”. [Figure: expression levels of Gene 1 … Gene 5, with labels Ribosome Biogenesis, Energy Pathway, Biological Regulators.]
Notes: Faithful: uncorrelated/orthogonal bases are the most efficient way to represent the M×N data space. Meaningful: biological processes are independent; if we can find those independent components, they have a good chance of making biological sense. Facing an expression pattern, we want to know the underlying mechanism or combination of mechanisms.
Equal to “source separation” Mixture 1 ?
Independent vs. uncorrelated. The sources are independent: $E[g(x)f(y)] = E[g(x)]\,E[f(y)]$ for $x \neq y$ (for any functions g and f), which is stronger than uncorrelated. Two sources $x_1$, $x_2$ give two mixtures: $y_1 = 2x_1 + 3x_2$, $y_2 = 4x_1 + x_2$. [Figure: $y_1$ vs. $y_2$ axes, showing independent components and principal components.]
Notes: Our observations are two gene expression patterns in two experiments. If we decompose into independent components ….. into uncorrelated components. Can PCA decomposition help us at this point? The answer is no.
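A small numeric check can make the distinction concrete. The sketch below uses tiny discrete distributions chosen for illustration (they are assumptions, not data from the talk): first the two mixtures from this slide, then a pair of variables that are uncorrelated yet clearly dependent.

```python
from itertools import product

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Two independent sources: every combination of x1, x2 in {-1, +1} is equally likely.
sources = list(product([-1.0, 1.0], repeat=2))
# The mixtures from the slide: y1 = 2*x1 + 3*x2, y2 = 4*x1 + x2.
mixtures = [(2 * x1 + 3 * x2, 4 * x1 + x2) for x1, x2 in sources]

cov_sources = (mean(x1 * x2 for x1, x2 in sources)
               - mean(x1 for x1, _ in sources) * mean(x2 for _, x2 in sources))
cov_mixtures = (mean(y1 * y2 for y1, y2 in mixtures)
                - mean(y1 for y1, _ in mixtures) * mean(y2 for _, y2 in mixtures))

# Uncorrelated but dependent: x in {-1, 0, 1} equally likely, y = x^2.
xs = [-1.0, 0.0, 1.0]
ys = [x * x for x in xs]
cov_xy = mean(x * y for x, y in zip(xs, ys)) - mean(xs) * mean(ys)
# Independence would require E[g(x) f(y)] = E[g(x)] E[f(y)] for every g, f.
# With g(x) = x^2 and f(y) = y the two sides differ.
dependence_gap = (mean((x * x) * y for x, y in zip(xs, ys))
                  - mean(x * x for x in xs) * mean(ys))
```

The sources have zero covariance while the mixtures do not, and the second pair has zero covariance but a non-zero gap for a non-linear test function, which is the sense in which independence is the stronger condition.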
Independent component analysis (ICA). “As independent as possible”: in the sense of maximizing some function that measures independence. Simplified notation: find the “unmixing” matrix A which makes $s_1, \dots, s_m$ as independent as possible.
(Linear) ICA algorithm
“Likelihood function” = log(probability of observation). With $Y = WX$: $p(x) = |\det W|\, p(y)$ and $p(y) = \prod_i p_i(y_i)$, so $L(y,W) = \log p(x) = \log|\det W| + \sum_i \log p_i(y_i)$.
What can be called an ICA model: (1) the independent components $s_i$, with the possible exception of one, must be non-Gaussian; (2) the number of observed linear mixtures m (experiments) must be ≥ the number of independent components; (3) A must be of full column rank.
Notes: variable distribution in the observation; the sigmoid function g and the derivative of the logistic function; mold the sigmoid so that its slope fits unimodal distributions of varying kurtosis.
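The log-likelihood on this slide can be evaluated directly for a small case. The sketch below assumes a 2×2 unmixing matrix W and, as one possible super-Gaussian choice, takes each source density $p_i$ to be the derivative of the logistic sigmoid; both choices and all names are illustrative assumptions rather than the exact model used in the papers.

```python
import math

def log_logistic_density(y):
    # log g'(y) for the logistic sigmoid g(y) = 1 / (1 + exp(-y)).
    # g'(y) = exp(-y) / (1 + exp(-y))^2 is symmetric in y, so use |y| for stability.
    a = abs(y)
    return -a - 2.0 * math.log1p(math.exp(-a))

def det2(W):
    # Determinant of a 2x2 matrix.
    return W[0][0] * W[1][1] - W[0][1] * W[1][0]

def log_likelihood(W, observations):
    # Average over observations x of: log|det W| + sum_i log p_i(y_i), with y = W x.
    log_det = math.log(abs(det2(W)))
    total = 0.0
    for x in observations:
        y = [W[0][0] * x[0] + W[0][1] * x[1],
             W[1][0] * x[0] + W[1][1] * x[1]]
        total += log_det + sum(log_logistic_density(yi) for yi in y)
    return total / len(observations)

identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
origin_only = [[0.0, 0.0]]
ll_identity = log_likelihood(identity, origin_only)
ll_doubled = log_likelihood(doubled, origin_only)
```

An ICA fit would search over W (for example by gradient ascent) for the matrix that maximizes this quantity; the search itself is not shown here.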
(Linear) ICA algorithm: find the W that maximizes L(y,W). Super-Gaussian model.
First paper: Gasch, et al. (2002) Genome biology, 3, 1-59 Improve the detection of conditional coregulation in gene expression by fuzzy k-means clustering
Biology is “fuzzy”. Many genes are conditionally co-regulated: genes are co-regulated with different groups of genes under different conditions. k-means clustering vs. fuzzy k-means: $X_i$: expression of the ith gene; $V_j$: jth cluster center.
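FuzzyK flowchart. [Flowchart: 1st cycle, 2nd cycle, 3rd cycle, with a “remove correlated genes (> 0.7)” step.] Initial $V_j$ = PCA eigenvectors. Replicates of centroids are averaged. Centroid update: $V_j' = \dfrac{\sum_i m_{X_i V_j}^2\, W_{X_i}\, X_i}{\sum_i m_{X_i V_j}^2\, W_{X_i}}$, where the weight $W_{X_i}$ evaluates the correlation of $X_i$ with the others.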
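The weighted centroid update on this slide can be sketched as follows. The slide does not define $m$, so this sketch assumes it is the fuzzy membership of gene $X_i$ in cluster $V_j$ and computes it with the standard inverse-squared-distance rule; that rule, the function names, and the toy numbers are assumptions for illustration only.

```python
def fuzzy_memberships(distances):
    # distances[i][j]: distance from gene i to centroid j (assumed > 0).
    # Assumed rule: membership proportional to 1 / d^2, normalized per gene.
    memberships = []
    for row in distances:
        inverse = [1.0 / max(d * d, 1e-12) for d in row]
        total = sum(inverse)
        memberships.append([v / total for v in inverse])
    return memberships

def update_centroid(profiles, memberships, weights, j):
    # V_j' = sum_i m_ij^2 * W_i * X_i / sum_i m_ij^2 * W_i
    n_conditions = len(profiles[0])
    numerator = [0.0] * n_conditions
    denominator = 0.0
    for x, m_row, w in zip(profiles, memberships, weights):
        coefficient = m_row[j] ** 2 * w
        denominator += coefficient
        for c in range(n_conditions):
            numerator[c] += coefficient * x[c]
    return [value / denominator for value in numerator]

# Toy example: two genes, one centroid, full membership for both genes.
profiles = [[0.0, 0.0], [2.0, 4.0]]
full_membership = [[1.0], [1.0]]
equal_weights = update_centroid(profiles, full_membership, [1.0, 1.0], 0)
skewed_weights = update_centroid(profiles, full_membership, [1.0, 3.0], 0)
split = fuzzy_memberships([[1.0, 3.0]])
```

With equal weights the new centroid is the plain average of the two profiles; raising the weight of the second gene pulls the centroid toward it.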
FuzzyK performance. k is more “definitive”: because genes are not forced to belong to only a single cluster, replicated centroids are averaged; e.g. results with initial k = 300 add only 30 more centroids to those initiated with k = 120, and most of the 30 are local minima (not reproduced in bootstrapping simulation). Recovers clusters found by classical methods. Cell wall and secretion factors. Uncovers new gene clusters: many genes have significant membership in more than one cluster, suggesting that condition-specific expression is triggered by different cellular signals. Reveals new promoter sequences.
Second paper: ICA is so new… Lee, et al. (2003) Genome biology, 4, R76 Systematic evaluation of ICA with respect to other clustering methods (PCA, k-mean)
From linear to non-linear. Linear ICA: $X = AS$, where X is the expression matrix (N conditions × K genes) and each $s_i$ is an independent vector of K gene levels; equivalently $x_j = \sum_i a_{ji} s_i$. Non-linear ICA: $X = f(AS)$. In the non-linear case, a general function f acts on top of the linear transform to relate X to the independent source matrix.
How to do non-linear ICA? Construct a feature space F; map X to Ψ in F; perform ICA on Ψ. Input space $\mathbb{R}^n$ → feature space $\mathbb{R}^L$; normally L > n.
Notes: To find the general function f, project x to another space where the mapped data can be linearly decomposed into the source matrix. Define the feature space such that the mapped data and the biological processes have a linear relationship. L is normally greater than n (the number of experiments in X); L is determined by the kernel function and the input data space.
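Kernel trick. Kernel function: $k(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, where $x_i$ (the ith column of X) and $x_j$ in $\mathbb{R}^n$ are mapped to $\Phi(x_i)$, $\Phi(x_j)$ in feature space. Gaussian and polynomial kernels were tested.
Construct F = construct $\Phi_V = \{\Phi(v_1), \Phi(v_2), \dots, \Phi(v_L)\}$ to be a basis of F, i.e. $\mathrm{rank}(\Phi_V^T \Phi_V) = L$, choosing the vectors $\{v_1, \dots, v_L\}$ from $\{x_i\}$:
$\Phi_V^T \Phi_V = \begin{bmatrix} k(v_1,v_1) & \cdots & k(v_1,v_L) \\ \vdots & & \vdots \\ k(v_L,v_1) & \cdots & k(v_L,v_L) \end{bmatrix}$
Mapped points in F:
$\Psi[x_i] = \begin{bmatrix} k(v_1,v_1) & \cdots & k(v_1,v_L) \\ \vdots & & \vdots \\ k(v_L,v_1) & \cdots & k(v_L,v_L) \end{bmatrix}^{-1/2} \begin{bmatrix} k(v_1,x_i) \\ \vdots \\ k(v_L,x_i) \end{bmatrix}$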
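The kernel quantities on this slide can be sketched with plain lists. The kernel parameters (sigma, degree, offset), the function names, and the basis vectors below are assumptions for illustration; the matrix power applied to the kernel matrix when mapping points into F is not computed here.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); sigma is an assumed parameter.
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, degree=2, offset=1.0):
    # k(x, y) = (x . y + offset)^degree; degree and offset are assumed parameters.
    return (sum(a * b for a, b in zip(x, y)) + offset) ** degree

def kernel_matrix(basis, kernel):
    # The L x L matrix with entries k(v_a, v_b) for the chosen basis vectors.
    return [[kernel(u, v) for v in basis] for u in basis]

def kernel_column(basis, x, kernel):
    # The column [k(v_1, x), ..., k(v_L, x)] for one data point x.
    return [kernel(v, x) for v in basis]

# Hypothetical basis vectors picked from the data columns.
basis = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]
K = kernel_matrix(basis, gaussian_kernel)
column = kernel_column(basis, [1.0, 2.0], polynomial_kernel)
```

Only kernel evaluations between data points are needed, which is why the feature map Φ never has to be written out explicitly.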
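ICA-based clustering. Independent component $y_i = (y_{i1}, y_{i2}, \dots, y_{iK})$, $i = 1, \dots, M$. “Load”: the jth entry of $y_i$ is the load of the jth gene. Two clusters per component: $\mathrm{Cluster}_{i,1} = \{\text{gene } j \mid y_{ij} \text{ is among the } (C\% \times K) \text{ largest loads in } y_i\}$ and $\mathrm{Cluster}_{i,2} = \{\text{gene } j \mid y_{ij} \text{ is among the } (C\% \times K) \text{ smallest loads in } y_i\}$. After ICA is done, genes are clustered based on their loads along the independent sources/components.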
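The two-clusters-per-component rule can be sketched directly from its definition. The function name, the rounding of C% × K down to an integer, and the example loads are assumptions made for this illustration.

```python
def clusters_from_component(loads, c_percent):
    # loads[j] is the load of gene j on one independent component.
    # Returns (genes with the largest loads, genes with the smallest loads),
    # each of size C% * K (rounded down, at least one gene).
    k = len(loads)
    size = max(1, int(c_percent * k / 100))
    order = sorted(range(k), key=lambda j: loads[j])  # ascending by load
    smallest = set(order[:size])
    largest = set(order[-size:])
    return largest, smallest

# Hypothetical loads of K = 10 genes on one component, with C = 20.
loads = [0.1, -2.0, 3.0, 0.5, -0.3, 1.2, -1.5, 0.0, 2.2, -0.1]
high_cluster, low_cluster = clusters_from_component(loads, 20)
```

Repeating this for each of the M components yields 2M clusters, and a gene can appear in clusters from several components.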
Evaluate biological significance. [Figure: clusters from ICs (Cluster 1 … Cluster n) matched against functional classes (GO 1 … GO m).] With ICA, genes are clustered to each independent component (biological process) and sorted by their loads within the component. We evaluate the significance of a cluster sharing a common function by the probability that this happens by chance: for each pair (Cluster i, GO j), calculate the p-value, i.e. the probability that they share that many genes by chance.
Evaluate biological significance. Let g be the number of genes in the microarray data, f the size of the functional class, n the size of the cluster, and k the number of genes they share. Probability of sharing i genes by chance: $\binom{f}{i}\binom{g-f}{n-i} \big/ \binom{g}{n}$. “P-value”: $p = 1 - \sum_{i=0}^{k-1} \binom{f}{i}\binom{g-f}{n-i} \big/ \binom{g}{n}$. True positive = $k/n$; sensitivity = $k/f$.
Notes: Basically, this is the probability that one expression cluster shares i genes with one functional class at random: there are a number of ways to choose n out of g genes, and among those choices you calculate the chance that exactly i genes have the particular function f and the others do not. N is so huge. This function f is so specific to this cluster that almost no gene with this function is left unshared.
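The overlap probability and p-value can be computed with the standard library. In this sketch the p-value is taken as the probability of sharing at least k genes by chance (a hypergeometric tail); the function names and the small numbers are illustrative assumptions.

```python
from math import comb

def prob_share_exactly(i, g, f, n):
    # Chance that a random cluster of n genes (out of g) contains exactly i
    # of the f genes in the functional class: C(f,i) * C(g-f,n-i) / C(g,n).
    return comb(f, i) * comb(g - f, n - i) / comb(g, n)

def overlap_p_value(k, g, f, n):
    # Probability of sharing at least k genes by chance.
    return 1.0 - sum(prob_share_exactly(i, g, f, n) for i in range(k))

def true_positive_rate(k, n):
    # Fraction of the cluster that belongs to the functional class: k / n.
    return k / n

def sensitivity(k, f):
    # Fraction of the functional class captured by the cluster: k / f.
    return k / f

# Toy numbers: 10 genes in total, a class of 4, a cluster of 3 sharing 2 genes.
p = overlap_p_value(k=2, g=10, f=4, n=3)
```

A small p-value means the overlap between the cluster and the functional class is unlikely to arise from a random draw of n genes.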
Who is better? Conclusion: ICA-based clustering is generally better.
References
Su-In Lee (2002), group talk: “Microarray data analysis using ICA”.
Altman, et al. (2001), “Whole genome expression analysis: challenges beyond clustering”, Curr. Opin. Struct. Biol. 11, 340.
Hyvarinen, et al. (1999), “Survey on independent component analysis”, Neural Comp. Surv. 2, 94-128.
Alter, et al. (2000), “Singular value decomposition for genome-wide expression data processing and modeling”, PNAS 97, 10101.
Harmeling, et al., “Kernel feature spaces & nonlinear blind source separation”.