Learn from chips: Microarray data analysis and clustering CS 374 Yu Bai Nov. 16, 2004

Outline
- Background & motivation
- Algorithms overview
- Fuzzy k-means clustering (1st paper)
- Independent component analysis (2nd paper)

CHIP-ing away at medical questions: Why does cancer occur? Diagnosis, treatment, and drug design all require molecular-level understanding, and a "(DNA) microarray" provides a snapshot of gene expression.

Spot your genes: [Figure] RNA is isolated from a cancer cell and a normal cell, labeled with Cy5 and Cy3 dyes respectively, and hybridized against known gene sequences spotted on a glass slide (chip).

Matrix of expression: [Figure] rows are genes (Gene 1 … Gene N), columns are experiments (Exp 1, Exp 2, Exp 3, …); each entry Eij is the expression level of a gene in an experiment.

Why care about "clustering"? Genes with similar expression patterns are often functionally related. Clustering therefore lets us discover functional relations, assign functions to unknown genes, and find which genes control which other genes.

A review: microarray data analysis is either supervised (classification) or unsupervised (clustering).
"Heuristic" methods: hierarchical clustering, k-means clustering, self-organizing maps, others.
Probability-based methods: principal component analysis (PCA), independent component analysis (ICA), others.
(Heuristic methods are limited by…)

Heuristic methods: distance metrics. These quantify the similarity between the expression patterns of genes.
1. Euclidean distance: D(X,Y) = sqrt[(x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2]
2. (Pearson) correlation coefficient: R(X,Y) = (1/n) Σi [(xi - E(x))/σx][(yi - E(y))/σy], where σx = sqrt(E(x^2) - E(x)^2) and E(x) is the expected value of x. R = 1 if x = y, and R = 0 if E(xy) = E(x)E(y).
3. Other choices for distances…
A physical reading of the correlation coefficient: to what extent the behavior of one gene tracks that of the other.
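
As a concrete illustration (a minimal NumPy sketch; the two gene profiles are made up), the two metrics can disagree: profiles with similar shape but different absolute levels are far apart in Euclidean distance yet almost perfectly correlated.

```python
import numpy as np

def euclidean_distance(x, y):
    """D(X,Y) = sqrt(sum_i (x_i - y_i)^2)."""
    return np.sqrt(np.sum((x - y) ** 2))

def pearson_correlation(x, y):
    """R(X,Y) = (1/n) sum_i [(x_i - E[x])/sigma_x] * [(y_i - E[y])/sigma_y]."""
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    return np.mean(xs * ys)

# Toy expression profiles of two genes across 5 experiments
g1 = np.array([1.0, 2.0, 3.0, 2.5, 1.5])
g2 = np.array([2.0, 4.1, 5.9, 5.2, 3.1])   # roughly 2 * g1
print(euclidean_distance(g1, g2))   # large: absolute levels differ
print(pearson_correlation(g1, g2))  # near 1: expression patterns agree
```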

Hierarchical clustering: easy to run, but the result depends on where the grouping starts, and the resulting "tree" structure is troublesome to interpret. In particular, the relation between nodes is hard to read off: if one group of genes represses another, the two groups are anti-correlated and end up far away from each other in the tree.

K-means clustering performs an overall optimization, but several questions remain open: how to initialize, how to avoid local minima, and how many clusters (k) to use. Generally, heuristic methods have no established means to determine the "correct" number of clusters or to choose the "best" algorithm.
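
For concreteness, a minimal k-means run on a genes-by-experiments matrix (a sketch using scikit-learn on synthetic data; the value of k is arbitrary). Both pain points above surface directly as parameters: the n_init restarts only partially guard against local minima, and k must still be guessed.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))  # 100 genes x 6 experiments (synthetic)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])  # hard cluster assignment per gene
print(km.inertia_)      # within-cluster sum of squares for this choice of k
```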

Probability-based methods: Principal component analysis (PCA) (Pearson 1901; Everitt 1992; Basilevsky 1994). Common use: reduce dimension and filter noise. Goal: find (linear) "uncorrelated" components that account for as much of the variance of the initial variables as possible, where "uncorrelated" means E[xy] = E[x]*E[y] for x ≠ y. The principal components are mutually uncorrelated and orthogonal, and since they are linear transforms of the original variables, they can potentially carry some meaning.

PCA algorithm:
1. Column-center the genes x experiments matrix to obtain A. (Centering is what makes A^T A a covariance matrix.)
2. Form the covariance matrix A^T A: its entries measure the correlation between pairs of experiments, in the sense of comparing their gene-expression patterns.
3. Eigenvalue decomposition: A^T A = U Λ U^T, where U contains the eigenvectors (the principal components, or "eigenarrays") and Λ the eigenvalues.
4. Interpret the principal components by projecting the data onto them. The first two are usually the most important: the first is roughly an averaged expression weighted by the variance in each experiment (how different, correlated, or anti-correlated the genes are within an experiment); the second represents the change in expression over the experimental conditions (the trend of the expression pattern across arrays).
Q: Is any information lost by using the covariance matrix rather than the initial array? For Gaussian variables, no: their relations can always be represented linearly (by correlations), so the covariance matrix is sufficient, and uncorrelated coincides with independent. This Gaussian assumption is what PCA implicitly relies on.
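
A minimal NumPy sketch of exactly these steps (center, form A^T A, eigendecompose, project) on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))      # 500 genes x 6 experiments (synthetic)

A = X - X.mean(axis=0)             # column-center each experiment
C = A.T @ A / (A.shape[0] - 1)     # covariance matrix A^T A (scaled)

eigvals, U = np.linalg.eigh(C)     # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]  # sort by variance explained, descending
eigvals, U = eigvals[order], U[:, order]

scores = A @ U                     # project genes onto the principal components
print(eigvals / eigvals.sum())     # fraction of variance per component
print(scores[:3, :2])              # first 3 genes in the top-2 component space
```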

Are biologists satisfied? Biological processes are non-Gaussian (better described by a super-Gaussian model), which raises the distinction between "faithful" and "meaningful". Faithful: uncorrelated/orthogonal bases are the most efficient way to represent an M x N data space. Meaningful: biological processes are independent, so if we can find independent components, there is a good chance they make biological sense (e.g., ribosome biogenesis, an energy pathway, biological regulators). Facing an expression pattern, we want to know which underlying mechanism, or combination of mechanisms, produced it.

This is equal to "source separation": each observed expression profile is a mixture of hidden sources, and the task is to recover the sources from the mixtures.

Independent vs. uncorrelated. Independence of sources, E[g(x)f(y)] = E[g(x)]*E[f(y)] for x ≠ y and arbitrary functions g and f, is stronger than uncorrelatedness. Example: sources x1 and x2 observed through two mixtures, y1 = 2*x1 + 3*x2 and y2 = 4*x1 + x2. Our observations are two gene-expression patterns in two experiments; we would like to decompose them into independent components, not merely uncorrelated (principal) components. Can PCA help at this point? The answer is no.

Independent component analysis (ICA): make the recovered sources as independent as possible, in the sense of maximizing some function that measures independence. In simplified notation: given observations mixed by a matrix A, find the "unmixing" matrix that makes s1, …, sm as independent as possible.

(Linear) ICA algorithm. The "likelihood function" is the log probability of the observations: with y = Wx, p(x) = |det W| p(y) and p(y) = Π_i p_i(y_i), so

L(y, W) = log p(x) = log |det W| + Σ_i log p_i(y_i)

What can be called an ICA model: the independent components s_i, with the possible exception of one, must be non-Gaussian; the number of observed linear mixtures m (experiments) must be at least the number of independent components; and A must have full column rank. For the source densities p_i, a sigmoid g (the derivative of the logistic function) is molded so that its slope fits unimodal distributions of varying kurtosis.

(Linear) ICA algorithm: find the W that maximizes L(y, W), under the super-Gaussian source model.
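
A minimal sketch of maximizing this likelihood by natural-gradient ascent (the "infomax" rule of Bell & Sejnowski, 1995; the slides do not name the optimizer, so this is one standard choice). It assumes logistic source densities, so the score function is 1 - 2*sigmoid(y), matching the sigmoid mentioned above; the mixing matrix reuses the 2*x1+3*x2 / 4*x1+x2 example from the earlier slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infomax_ica(X, lr=0.01, n_iter=2000, seed=0):
    """Natural-gradient ascent on L(y, W) = log|det W| + sum_i log p_i(y_i),
    with p_i the (super-Gaussian) logistic density.
    X: (m mixtures, n samples), assumed zero-mean."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(m) + rng.normal(scale=0.1, size=(m, m))
    for _ in range(n_iter):
        Y = W @ X
        # Score of the logistic density: d/dy log p(y) = 1 - 2*sigmoid(y)
        W += lr * (np.eye(m) + (1.0 - 2.0 * sigmoid(Y)) @ Y.T / n) @ W
    return W

rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 5000))            # two super-Gaussian sources
A = np.array([[2.0, 3.0], [4.0, 1.0]])     # mixing matrix from the earlier slide
X = A @ S
W = infomax_ica(X - X.mean(axis=1, keepdims=True))
print(W @ A)  # approaches a scaled permutation matrix if unmixing succeeded
```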

First paper: Gasch et al. (2002), Genome Biology 3, 1-59: improving the detection of conditional coregulation in gene expression by fuzzy k-means clustering.

Biology is "fuzzy": many genes are conditionally co-regulated, i.e., co-regulated with different groups of genes under different conditions. k-means vs. fuzzy k-means (with Xi the expression of the i-th gene and Vj the j-th cluster center): k-means assigns each Xi to exactly one Vj, whereas fuzzy k-means gives each gene a continuous membership in every cluster.

FuzzyK flowchart: three cycles; after each cycle, genes correlated (>0.7) with an identified centroid are removed. The initial Vj are PCA eigenvectors, and replicated centroids are averaged. The centroid update is

Vj' = Σi m(Xi,Vj)^2 W(Xi) Xi / Σi m(Xi,Vj)^2 W(Xi)

where m(Xi,Vj) is the membership of gene Xi in cluster j and the weight W(Xi) evaluates the correlation of Xi with the other genes.
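
A generic fuzzy c-means sketch in NumPy (not the paper's exact FuzzyK: the paper adds the weight W(Xi), a correlation-based distance, PCA initialization, and the three-cycle loop; this sketch uses Euclidean distance and uniform weights), showing how soft memberships replace hard assignments:

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, n_iter=100, seed=0):
    """Alternate membership and centroid updates (Bezdek-style fuzzy c-means).
    X: (n genes, d experiments); m > 1 controls the fuzziness."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    V = X[rng.choice(n, size=k, replace=False)]          # initial centroids
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        # Membership of gene i in cluster j, decreasing with distance
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
        # Centroid update: membership^m-weighted mean of the genes
        V = ((u.T ** m) @ X) / np.sum(u.T ** m, axis=1, keepdims=True)
    return u, V

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
u, V = fuzzy_kmeans(X, k=2)
print(u[:3].round(2))  # soft memberships: a gene can belong to several clusters
```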

FuzzyK performance. k is more "definitive": because replicated centroids are averaged, runs with initial k=300 add only about 30 centroids to those initiated with k=120, and most of those 30 are local minima (not reproduced in bootstrap simulations). FuzzyK recovers the clusters found by classical methods (e.g., cell wall and secretion factors), uncovers new gene clusters, and reveals new promoter sequences. Because genes are not forced to belong to only a single cluster, many genes have significant membership in more than one cluster, suggesting that condition-specific expression is triggered by different cellular signals.

Second paper: ICA is still new… Lee et al. (2003), Genome Biology 4, R76: a systematic evaluation of ICA with respect to other clustering methods (PCA, k-means).

From linear to non-linear. Linear ICA: X = AS, where X is the expression matrix (N conditions x K genes), si is an independent vector of K gene levels, and each observation is xj = Σi aji si. In the non-linear case, a general function f sits on top of the linear mixing matrix: non-linear ICA: X = f(AS).

How to do non-linear ICA? Construct a feature space F, map X to Ψ in F, then run (linear) ICA on Ψ. The input space is IR^n and the feature space is IR^L, with normally L > n (n = number of experiments in X); L is determined by the kernel function and the input data. The idea: rather than finding the general function f directly, project x into another space where the mapped data can be linearly decomposed into a source matrix, i.e., define a feature space in which the mapped data and the biological processes have a linear relationship.

Kernel trick. Kernel function: k(xi, xj) = Φ(xi)·Φ(xj), where xi, xj in IR^n (columns of X) are mapped to Φ(xi), Φ(xj) in the feature space. Constructing F amounts to choosing ΦV = {Φ(v1), Φ(v2), …, Φ(vL)} as a basis of F, i.e. rank(ΦV^T ΦV) = L, with the vectors {v1, …, vL} chosen from the {xi}. The Gram matrix is

ΦV^T ΦV = [ k(v1,v1) … k(v1,vL) ; … ; k(vL,v1) … k(vL,vL) ]

and the mapped points in F are

Ψ[xi] = (ΦV^T ΦV)^(-1/2) [ k(v1,xi), …, k(vL,xi) ]^T

Gaussian and polynomial kernels were tested.
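
A small sketch of this mapping with a Gaussian kernel (synthetic data; here the basis points {v} are drawn at random from the columns of X rather than chosen by the rank criterion above):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_feature_map(X, V, kern):
    """Psi[x_i] = (Phi_V^T Phi_V)^(-1/2) [k(v_1,x_i), ..., k(v_L,x_i)]^T.
    X: (n, N) data points as columns; V: (n, L) basis points as columns."""
    L, N = V.shape[1], X.shape[1]
    G = np.array([[kern(V[:, a], V[:, b]) for b in range(L)] for a in range(L)])
    w, Q = np.linalg.eigh(G)                     # G is symmetric PSD
    G_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(np.clip(w, 1e-10, None))) @ Q.T
    K = np.array([[kern(V[:, a], X[:, i]) for i in range(N)] for a in range(L)])
    return G_inv_sqrt @ K                        # (L, N): input for linear ICA

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 200))                      # 4 experiments x 200 genes
V = X[:, rng.choice(200, size=10, replace=False)]  # L = 10 basis points
Psi = kernel_feature_map(X, V, gaussian_kernel)
print(Psi.shape)                                   # (10, 200)
```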

ICA-based clustering. Each independent component is yi = (yi1, yi2, …, yiK), i = 1, …, M, and the "load" of the j-th gene is the j-th entry of yi. Two clusters per component:
Cluster_i,1 = {gene j | yij is among the (C% x K) largest loads in yi}
Cluster_i,2 = {gene j | yij is among the (C% x K) smallest loads in yi}
That is, after ICA, genes are clustered by their loads along the independent sources/components.
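
A minimal sketch of that selection rule (the value of C is arbitrary here; Y stands for the M x K matrix of independent components):

```python
import numpy as np

def ica_clusters(Y, c_percent=7.5):
    """For each independent component (row of Y), return the genes with the
    C% largest loads and the C% smallest loads as two clusters."""
    M, K = Y.shape
    n = int(round(c_percent / 100 * K))
    clusters = []
    for i in range(M):
        order = np.argsort(Y[i])                  # ascending by load
        clusters.append((order[-n:], order[:n]))  # (largest, smallest)
    return clusters

Y = np.random.default_rng(4).normal(size=(3, 1000))  # 3 components x 1000 genes
for i, (hi, lo) in enumerate(ica_clusters(Y)):
    print(f"component {i}: {len(hi)} high-load and {len(lo)} low-load genes")
```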

Evaluate biological significance: match the clusters derived from the independent components against functional classes (GO categories). In ICA, genes are clustered per independent component (putative biological process) and sorted by their loads within the component. We evaluate the significance of a cluster sharing a common function by asking how often this would happen by chance: for each pair (Cluster i, GO j), calculate the p-value, the probability that they share that many genes by chance.

Evaluate biological significance. Let g be the total number of genes, f the size of a functional class, n the cluster size, and k the number of shared genes. The probability of sharing exactly i genes by chance is hypergeometric:

Prob(share i genes) = C(f,i) * C(g-f, n-i) / C(g,n)

"P-value": p = 1 - Σ_{i=0}^{k-1} C(f,i) C(g-f, n-i) / C(g,n)

True positive rate = k/n; sensitivity = k/f. Intuitively: among all ways to choose n out of the g genes, we count the chance that exactly i of them carry the particular function f and the others do not.
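
The same computation with SciPy's hypergeometric distribution (the counts below are hypothetical; hypergeom.sf(k-1, ...) evaluates exactly the tail sum 1 - Σ_{i=0}^{k-1}):

```python
from scipy.stats import hypergeom

# g: total genes; f: genes in the functional (GO) class;
# n: cluster size; k: genes shared between the cluster and the class.
g, f, n, k = 6000, 40, 100, 8

p_value = hypergeom.sf(k - 1, g, f, n)  # P(share >= k genes by chance)
print(f"p-value: {p_value:.3g}")

# The slide's accompanying measures:
print(f"true positive rate k/n = {k / n:.2f}, sensitivity k/f = {k / f:.2f}")
```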

Who is better? Conclusion: ICA-based clustering is generally better.

References
- Su-In Lee (2002), group talk: "Microarray data analysis using ICA".
- Altman et al. (2001), "Whole genome expression analysis: challenges beyond clustering", Curr. Opin. Struct. Biol. 11, 340.
- Hyvärinen (1999), "Survey on Independent Component Analysis", Neural Computing Surveys 2, 94-128.
- Alter et al. (2000), "Singular value decomposition for genome-wide expression data processing and modeling", PNAS 97, 10101.
- Harmeling et al., "Kernel feature spaces & nonlinear blind source separation".