Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis.

Presentation transcript:

Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis DTU

The DNA Array Analysis Pipeline
[Figure: pipeline flow chart. Stages shown: Question / Experimental Design; Array design / Probe design; Buy Chip/Array; Sample Preparation / Hybridization; Image analysis; Normalization; Expression Index Calculation; Comparable Gene Expression Data; Statistical Analysis / Fit to Model (time series); Advanced Data Analysis (Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network)]

Motivation: Multidimensional data
[Table: one column per patient (Pat1 to Pat8 and further), one row per probe set (IDs ending in _at, _s_at or _x_at)]

PCA

Principal Component Analysis (PCA)
Numerical method
Dimensionality reduction technique
Primarily for visualization of arrays/samples
Performs a rotation of the data that maximizes the variance in the new axes
Projects high dimensional data into a low dimensional sub-space (visualized in 2-3 dims)
Often captures much of the total data variation in a few dimensions (< 5)
Exact solutions require a fully determined system (matrix with full rank)
– i.e. a “square” matrix with independent rows

Principal components
1st Principal component (PC1)
– Direction along which there is greatest variation
2nd Principal component (PC2)
– Direction with maximum variation left in data, orthogonal to PC1

Principal components
General about principal components
– summary variables
– linear combinations of the original variables
– uncorrelated with each other
– capture as much of the original variance as possible
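The properties listed above can be checked on a small example. The following is a minimal sketch, assuming rows are samples and columns are variables; the function name `pca_eig` and the toy data are hypothetical and not part of the original slides. It computes the components as eigenvectors of the covariance matrix, so the scores are linear combinations of the centered original variables.

```python
import numpy as np

def pca_eig(X, n_components=2):
    """PCA of X (rows = samples, columns = variables) via the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # variables x variables covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # re-order by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]   # columns are PC1, PC2, ...
    scores = Xc @ components                 # summary variables (projections)
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, components, explained

# Toy data (assumption): 50 samples, 3 variables, most variance along one direction
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([3 * base, base, 0.1 * rng.normal(size=(50, 1))])
X = X + 0.05 * rng.normal(size=(50, 3))

scores, components, explained = pca_eig(X)
print("fraction of variance captured by PC1, PC2:", explained)
print("covariance between PC1 and PC2 scores:", np.cov(scores, rowvar=False)[0, 1])
```

In this sketch the covariance between the PC1 and PC2 scores is zero up to rounding, which is the "uncorrelated with each other" property, and `explained` reports how much of the total variance each component captures.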

Principal components - Variance

Singular Value Decomposition

Requirements:
– No missing values
– “Centered” observations, i.e. normalize data such that each gene has mean = 0
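A minimal sketch of PCA through the singular value decomposition, respecting the two requirements above. It assumes a matrix with samples as rows and genes as columns; the function name `pca_svd` and the random data are hypothetical.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA via SVD of a samples x genes matrix."""
    if np.isnan(X).any():
        # Requirement 1: no missing values
        raise ValueError("PCA via SVD needs a matrix without missing values")
    Xc = X - X.mean(axis=0)                  # Requirement 2: each gene has mean 0
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates on the PCs
    explained = s ** 2 / np.sum(s ** 2)               # variance fraction per PC
    return scores, Vt[:n_components], explained[:n_components]

# Toy data (assumption): 6 samples measured on 100 genes
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 100))
scores, axes, explained = pca_svd(X)
print(scores.shape)   # one row per sample, one column per principal component
```

The rows of `axes` are the principal directions in gene space, and `scores` equals the centered data projected onto them.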

PCA of ALPS patients vs. healthy controls

PCA of leukemia patients

PCA of treated cell lines: 4 conditions, 3 batches

PCA projections (as XY-plot)

Eigenvectors (eigenarrays, rows)

PCA of cell cycle data - based on only 500 cell cycle regulated genes

PCA of cell cycle data

PCA of cell cycle data: broken scanner (samples 12-16)

Clustering

Why do we cluster?
Organize observed data into meaningful structures
Summarize large data sets
Used when we have no a priori hypotheses

Many types of clustering methods
Method:
– K-class
– Hierarchical, e.g. UPGMA
  Agglomerative (bottom-up)
  Divisive (top-down)
– Graph theoretic

Hierarchical clustering
Representation of all pair-wise distances
Parameters: none (apart from the distance measure)
Results:
– One large cluster
– Hierarchical tree (dendrogram)
Deterministic

Hierarchical clustering – UPGMA algorithm
Assign each item to its own cluster
Join the two nearest clusters
Re-estimate the distances between clusters (average of all pairwise item distances)
Repeat until all n items are in one cluster (n - 1 joins)
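The steps above can be sketched as follows. This is a deliberately naive illustration, assuming Euclidean distances between items; the function name `upgma` and the five toy points are hypothetical. Instead of an incremental update formula, it re-estimates each cluster-to-cluster distance as the average over all pairs of member items.

```python
import numpy as np

def upgma(X):
    """Average-linkage agglomerative clustering.

    Returns the list of joins as (cluster_a, cluster_b, distance).
    Items are clusters 0..n-1; each join creates a new cluster id n, n+1, ...
    """
    n = len(X)
    # Pairwise Euclidean distances between all items
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = {i: [i] for i in range(n)}    # each item starts in its own cluster
    merges = []
    next_id = n
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Average distance over all item pairs between the two clusters
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)   # join nearest clusters
        merges.append((a, b, float(d)))
        next_id += 1
    return merges

# Toy data (assumption): two tight pairs and one outlier
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [20.0, 20.0]])
merges = upgma(X)
for a, b, d in merges:
    print(f"join {a} and {b} at distance {d:.2f}")
```

The list of joins with their distances is the information a dendrogram displays. The result is deterministic: the same data always gives the same tree.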

Hierarchical clustering

Hierarchical Clustering
Data with clustering order and distances
Dendrogram representation
2D data is a special (simple) case!

Hierarchical clustering example: leukemia patients (based on all genes)

Hierarchical clustering example: leukemia data, significant genes

K-means clustering
Partition data into K clusters
Parameter: Number of clusters (K) must be chosen
Randomized initialization:
– different clusters each time

K-means – Algorithm
Assign each item a class in 1 to K (randomly)
For each class 1 to K
– Calculate the centroid (one of the K means)
– Calculate the distance from the centroid to each item
Assign each item to the nearest centroid
Repeat until no items are re-assigned (convergence)
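A minimal sketch of the algorithm as listed, assuming Euclidean distance. The function name `kmeans`, the `seed` and `max_iter` parameters, and the handling of an empty class (re-seeding it with a randomly chosen item) are assumptions made for this illustration, not details from the slides.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """K-means with random initial class assignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))     # assign each item a class randomly
    centroids = np.empty((k, X.shape[1]))
    for _ in range(max_iter):
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)      # centroid of class c
            else:
                # Assumption: an empty class is re-seeded with a random item
                centroids[c] = X[rng.integers(len(X))]
        # Distance from every item to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)         # assign to the nearest centroid
        if np.array_equal(new_labels, labels):   # no item re-assigned: converged
            break
        labels = new_labels
    return labels, centroids

# Toy data (assumption): three groups of 10 points in 2D
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(10, 2)) for m in (0.0, 5.0, 10.0)])
labels, centroids = kmeans(X, k=3)
print(labels)
```

Because the initial assignment is random, a different `seed` can give a different partition, which is why K-means is often run several times.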

K-means clustering, K=3

Self Organizing Maps (SOM)
Partitioning method (similar to the K-means method)
Clusters are organized in a two-dimensional grid
Size of grid must be specified
– (e.g. 2x2 or 3x3)
SOM algorithm finds the optimal organization of data in the grid
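A minimal sketch of one common way to train a SOM, for illustration only. The slides do not specify the update rule; the online update with a Gaussian neighborhood, the linear decay schedules, and all names and parameters (`train_som`, `lr0`, `sigma0`, `n_iter`) are assumptions.

```python
import numpy as np

def train_som(X, rows=2, cols=2, n_iter=500, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols self-organizing map on X (rows = items)."""
    rng = np.random.default_rng(seed)
    # Grid position of each node, shape (rows * cols, 2)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize node weights from randomly chosen items
    weights = X[rng.integers(len(X), size=rows * cols)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 1e-3      # neighborhood radius shrinks
        x = X[rng.integers(len(X))]               # pick one item at random
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))   # best-matching node
        grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))           # neighborhood weights
        weights += lr * h[:, None] * (x - weights)             # pull nodes toward x
    return weights, grid

def assign_to_nodes(X, weights):
    """Index of the nearest node for every item (the item's cluster)."""
    d = np.linalg.norm(X[:, None, :] - weights[None, :, :], axis=2)
    return d.argmin(axis=1)

# Toy data (assumption): two groups of 20 points in 2D, mapped onto a 2x2 grid
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)), rng.normal(4.0, 0.5, size=(20, 2))])
weights, grid = train_som(X)
nodes = assign_to_nodes(X, weights)
print(nodes)
```

The difference from K-means in this sketch is the neighborhood term `h`: nodes that are close on the grid are updated together, so neighboring grid cells end up holding similar profiles.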

SOM - example

K-means clustering: cell cycle data

K-means clustering example: Cluster profiles, treated cell lines

Comparison of clustering methods
Hierarchical clustering
– Distances between all variables
– Time consuming with a large number of genes
– Advantage to cluster on selected genes
K-means clustering
– Faster algorithm
– Does not show relations between all variables
SOM
– Machine learning algorithm

Distance measures
Euclidean distance
Vector angle distance
Pearson's distance
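The slide names the three measures without giving formulas. The sketch below uses commonly used definitions, which are assumptions here: vector angle distance as 1 minus the cosine of the angle between two profiles, and Pearson's distance as 1 minus the Pearson correlation. Function names and example profiles are hypothetical.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between two profiles."""
    return float(np.linalg.norm(x - y))

def vector_angle_distance(x, y):
    """1 - cos(angle between x and y); ignores the length of the vectors."""
    return float(1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_distance(x, y):
    """1 - Pearson correlation; vector angle distance after centering each profile."""
    return vector_angle_distance(x - x.mean(), y - y.mean())

# Example profiles (assumption): y is x scaled by 2, z is x shifted by 10
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x
z = x + 10.0

for name, other in (("scaled", y), ("shifted", z)):
    print(name,
          euclidean_distance(x, other),
          vector_angle_distance(x, other),
          pearson_distance(x, other))
```

In this example the scaled profile is far from `x` in Euclidean distance but at zero vector angle and Pearson's distance, while the shifted profile is at zero Pearson's distance only. This illustrates why the choice of distance measure changes which genes cluster together.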

Comparison of distance measures

Summary
Dimension reduction is important for visualizing data
Methods:
– Principal Component Analysis
– Clustering
  Hierarchical
  K-means
  Self organizing maps
  (distance measure important)

Coffee break Next: Exercises in PCA and clustering