Dimension reduction: PCA and Clustering
Christopher Workman, Center for Biological Sequence Analysis, DTU

The DNA Array Analysis Pipeline
[Figure: pipeline diagram. Stages shown: Question / Experimental Design; Array design / Probe design; Buy Chip/Array; Sample Preparation / Hybridization; Image analysis; Normalization; Expression Index Calculation; Comparable Gene Expression Data; Statistical Analysis / Fit to Model (time series); Advanced Data Analysis (Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network)]

What is Principal Component Analysis (PCA)?
Numerical method
Dimensionality reduction technique
Primarily for visualization of arrays/samples
“Unsupervised” method used to explore the intrinsic variability of the data w.r.t. the independent variables (factors) in the study
Note: Dependent variables are those that are observed to change in response to independent variables. Independent variables are deliberately manipulated to invoke changes in the dependent variables.

PCA
Performs a rotation of the data that maximizes the variance in the new axes
Projects high-dimensional data into a low-dimensional subspace (visualized in 2-3 dimensions)
Often captures much of the total data variation in a few dimensions (< 5)
Exact solutions require a fully determined system (matrix with full rank)
–i.e. a “square” matrix with independent rows

Principal components
1st principal component (PC1)
–Direction along which there is greatest variation
2nd principal component (PC2)
–Direction with maximum variation left in data, orthogonal to PC1
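The two definitions above can be written as a constrained maximization. This is a sketch under assumptions: the symbols w_1, w_2 (unit direction vectors) and C (the covariance matrix of the centered data) do not appear on the slides and are introduced here for illustration only.

```latex
w_1 = \arg\max_{\|w\| = 1} \; w^{T} C \, w
\qquad
w_2 = \arg\max_{\|w\| = 1,\; w^{T} w_1 = 0} \; w^{T} C \, w
```

Here w^T C w is the variance of the data projected onto w, so PC1 is the direction of greatest variance and PC2 is the direction of greatest remaining variance among those orthogonal to PC1.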

Singular Value Decomposition
An implementation of PCA
Defined in terms of matrices:
X is the expression data matrix
U holds the left singular vectors (eigenvectors of X X^T)
V holds the right singular vectors (eigenvectors of X^T X)
S holds the singular values (S^2 = Λ, the eigenvalues)

Singular Value Decomposition

Singular Value Decomposition: X = U S V^T
SVD of X produces two orthonormal bases:
–Left singular vectors (U)
–Right singular vectors (V)
Right singular vectors span the space of the gene transcriptional responses
Left singular vectors span the space of the assay expression profiles
Following Orly Alter et al., PNAS 2000, we refer to:
–left singular vectors {u_k} as eigenarrays
–right singular vectors {v_k} as eigengenes
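The link between the singular values and the eigenvalues Λ mentioned earlier follows from the orthonormality of U and V. As a short derivation, assuming the thin SVD (so S is square and diagonal) and taking Λ to be the eigenvalues of X^T X without any 1/(n-1) scaling, which the slides do not specify:

```latex
X = U S V^{T}, \qquad U^{T} U = I, \qquad V^{T} V = I
\\
X^{T} X = V S U^{T} U S V^{T} = V S^{2} V^{T} = V \Lambda V^{T}
\\
X X^{T} = U S V^{T} V S U^{T} = U S^{2} U^{T} = U \Lambda U^{T}
```

So the columns of V are eigenvectors of X^T X, the columns of U are eigenvectors of X X^T, and both share the eigenvalues S^2.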

Singular Value Decomposition
Requirements:
–No missing values
–“Centered” observations, i.e. normalize data such that each gene has mean = 0
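A minimal sketch of PCA via SVD that enforces the two requirements above, using NumPy. The function name `pca_svd`, the toy matrix, and the orientation (rows = arrays/samples, columns = genes) are assumptions made for illustration; transpose the input if genes are stored as rows.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA via SVD. Assumes rows are samples (arrays) and columns are genes."""
    X = np.asarray(X, dtype=float)
    # Requirement 1: no missing values
    if np.isnan(X).any():
        raise ValueError("SVD requires a matrix without missing values")
    # Requirement 2: center so that each gene (column) has mean 0
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Coordinates of each sample along the leading components
    scores = U[:, :n_components] * s[:n_components]
    # Fraction of total variance carried by each component
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical toy data: 5 samples x 3 genes, first two genes strongly correlated
X = np.array([[ 2.0,  1.9, 0.5],
              [ 1.0,  1.1, 0.4],
              [ 0.0,  0.1, 0.6],
              [-1.0, -0.9, 0.5],
              [-2.0, -2.1, 0.4]])
scores, components, explained = pca_svd(X)
print(explained)  # fraction of variance for PC1 and PC2
```

Plotting the two columns of `scores` against each other gives the kind of XY projection used to visualize samples; because two of the three toy genes vary together, most of the variance falls on the first component.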

Singular Values: indicate variance by dimension
[Figure: two examples, m=24 and m>180]

Eigenvectors (eigenarrays, rows)

PCA projections (as XY-plot)

Related methods
Factor Analysis*
Multidimensional scaling (MDS)
Generalized multidimensional scaling (GMDS)
Semantic mapping
Isomap
Independent component analysis (ICA)
* Factor analysis is often confused with PCA; the two methods are related but distinct. Factor analysis is equivalent to PCA if the error terms in the factor analysis model are assumed to all have the same variance.

Why do we cluster?
Organize observed data into meaningful structures
Summarize large data sets
Used when we have no a priori hypotheses
Optimization:
–Minimize within-cluster distances
–Maximize between-cluster distances
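The optimization goal above can be checked numerically for any proposed grouping. The following is an illustrative sketch only: the helper name `within_between` and the toy data are hypothetical, and it assumes the labeling produces at least one same-cluster pair and one different-cluster pair.

```python
import math
from itertools import combinations

def within_between(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Same label: within-cluster pair; different label: between-cluster pair
        (within if lp == lq else between).append(math.dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical 1-D data with two obvious groups
points = [(0,), (1,), (5,), (6,)]
good = within_between(points, [0, 0, 1, 1])  # groups nearby points together
bad = within_between(points, [0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

A good clustering has a small first value and a large second value; the mixed labeling reverses that relationship.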

Many types of clustering methods
Method:
–K-class
–Hierarchical, e.g. UPGMA
  Agglomerative (bottom-up)
  Divisive (top-down)
–Graph theoretic
Information used:
–Supervised vs unsupervised
Final description of the items:
–Partitioning vs non-partitioning
–Fuzzy, multi-class

Hierarchical clustering
Representation of all pair-wise distances
Parameters: none (apart from the distance measure)
Results:
–One large cluster
–Hierarchical tree (dendrogram)
Deterministic

Hierarchical clustering – UPGMA Algorithm
Assign each item to its own cluster
Join the nearest clusters
Re-estimate the distance between clusters
Repeat until one cluster remains (n - 1 joins for n items)
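The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `upgma`, the use of Euclidean distance, and the toy points are assumptions. The distance to a newly joined cluster is re-estimated as the size-weighted average of the distances to its two parts, which is the standard UPGMA (average linkage) update.

```python
import math
from itertools import combinations

def upgma(points):
    """Return the joins as (cluster_a, cluster_b, distance) tuples."""
    # Step 1: assign each item to its own cluster (ids 0..n-1)
    clusters = {i: [i] for i in range(len(points))}
    dist = {frozenset((i, j)): math.dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Step 2: join the nearest pair of clusters
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        d_ab = dist[frozenset((a, b))]
        na, nb = len(clusters[a]), len(clusters[b])
        # Step 3: re-estimate distances from the new cluster to the others
        for c in clusters:
            if c in (a, b):
                continue
            dist[frozenset((next_id, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = merged
        merges.append((a, b, d_ab))
        next_id += 1
    return merges

# Hypothetical 1-D data: two tight pairs, far apart
merges = upgma([(0,), (1,), (5,), (6,)])
for a, b, d in merges:
    print(a, b, d)
```

The returned list of joins, with the distance at which each join happened, is the information a dendrogram draws: ids 0 to n-1 are the original items and higher ids are the clusters created by earlier joins.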

Hierarchical clustering

Hierarchical Clustering
Data with clustering order and distances
Dendrogram representation
2D data is a special (simple) case!

Hierarchical Clustering
Original data space
Merging steps define a dendrogram

K-means - Algorithm J. B. MacQueen (1967): "Some Methods for classification and Analysis of Multivariate Observations", Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:

K-means clustering, K=3

K-Means
[Figure: iteration i and iteration i+1]
Circles: “prototypes” (parameters to fit)
Squares: data points
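One iteration of K-means assigns every data point to its nearest prototype and then moves each prototype to the mean of its assigned points. The sketch below is a plain-Python illustration of that loop; the function name `kmeans`, the random initialization from the data, the Euclidean distance, and the toy points are all assumptions rather than details taken from the slides.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Return (prototypes, labels) after iterating until nothing changes."""
    rng = random.Random(seed)
    # Initialize the prototypes with k distinct data points (an assumption)
    prototypes = rng.sample(points, k)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest prototype
        labels = [min(range(k), key=lambda j: math.dist(p, prototypes[j]))
                  for p in points]
        # Update step: each prototype moves to the mean of its points
        new_prototypes = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_prototypes.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_prototypes.append(prototypes[j])  # keep an empty prototype
        if new_prototypes == prototypes:
            break  # converged: iteration i+1 equals iteration i
        prototypes = new_prototypes
    return prototypes, labels

# Hypothetical 2-D data with three visible groups, clustered with K=3
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
prototypes, labels = kmeans(points, k=3)
print(prototypes)
print(labels)
```

The result depends on the initial prototypes, so K-means is usually run from several starting points and the solution with the smallest within-cluster distances is kept.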

K-means clustering Cell Cycle data

Self Organizing Maps (SOM)
Partitioning method (similar to the K-means method)
Clusters are organized in a two-dimensional grid
Size of grid must be specified
–(e.g. 2x2 or 3x3)
SOM algorithm finds the optimal organization of data in the grid

SOM - example

Comparison of clustering methods
Hierarchical clustering
–Distances between all variables
–Time consuming with a large number of genes
–Advantage to cluster on selected genes
K-means clustering
–Faster algorithm
–Shows only cluster membership, not relations between all variables
SOM
–Machine learning algorithm

Distance measures
Euclidean distance
Vector angle distance
Pearson's distance
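The three measures can be compared on small vectors. The slides do not give formulas, so the definitions below are the commonly used ones and should be read as assumptions: vector angle distance as 1 - cos(angle), and Pearson's distance as 1 - correlation (the vector angle distance after centering each vector). The function names and example profiles are hypothetical.

```python
import math

def euclidean(x, y):
    """Straight-line distance; sensitive to absolute expression level."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def vector_angle(x, y):
    """1 - cosine of the angle between x and y; ignores vector length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def pearson(x, y):
    """1 - Pearson correlation; ignores both offset and scale."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return vector_angle([a - mean_x for a in x], [b - mean_y for b in y])

# Hypothetical expression profiles
x = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]   # same shape, twice the amplitude
shifted = [3.0, 4.0, 5.0]  # same shape, constant offset

for name, y in [("scaled", scaled), ("shifted", shifted)]:
    print(name, euclidean(x, y), vector_angle(x, y), pearson(x, y))
```

The scaled profile is far from `x` in Euclidean terms but at (numerically) zero vector angle and Pearson distance, while the shifted profile has a nonzero vector angle distance and a Pearson distance of zero. This is why the choice of distance measure changes which genes end up clustered together.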

Comparison of distance measures

Summary
Dimension reduction is important to visualize data
Methods:
–Principal Component Analysis
–Clustering
  Hierarchical
  K-means
  Self organizing maps
  (distance measure important)

UPGMA Clustering of ANOVA Results

DNA Microarray Analysis Overview/Review
PCA (using SVD)
Cluster analysis
Normalization
[Figure: before and after]