September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 1 Graphical.

Slides:

Advertisements

Similar presentations

Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

Advertisements

Pattern Recognition for the Natural Sciences Explorative Data Analysis Principal Component Analysis (PCA) Lutgarde Buydens, IMM, Analytical Chemistry.

Dimension reduction (1)

High-dimensional data analysis: Microarrays and multiple testing Mark van de Wiel 1,2 1. Dep. of Mathematics, VU University Amsterdam 2. Dep. of Biostatistics.

Gene expression analysis summary Where are we now?

Microarrays Dr Peter Smooker,

Unsupervised Learning - PCA The neural approach->PCA; SVD; kernel PCA Hertz chapter 8 Presentation based on Touretzky + various additions.

Factor Analysis There are two main types of factor analysis:

Bio277 Lab 2: Clustering and Classification of Microarray Data Jess Mar Department of Biostatistics Quackenbush Lab DFCI

4 th NETTAB Workshop Camerino, 5 th -7 th September 2004 Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini

‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.

Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.

Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.

Contingency tables and Correspondence analysis Contingency table Pearson’s chi-squared test for association Correspondence analysis using SVD Plots References.

Exploring Microarray data Javier Cabrera. Outline 1.Exploratory Analysis Steps. 2.Microarray Data as Multivariate Data. 3.Dimension Reduction 4.Correlation.

1 Cluster Analysis EPP 245 Statistical Analysis of Laboratory Data.

PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Prof.Dr.Cevdet Demir

Generate Affy.dat file Hyb. cRNA Hybridize to Affy arrays Output as Affy.chp file Text Self Organized Maps (SOMs) Functional annotation Pathway assignment.

Cluster Analysis Hierarchical and k-means. Expression data Expression data are typically analyzed in matrix form with each row representing a gene and.

Multivariate Data and Matrix Algebra Review BMTRY 726 Spring 2012.

Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.

Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.

Gene expression profiling identifies molecular subtypes of gliomas

Statistics for Marketing & Consumer Research Copyright © Mario Mazzocchi 1 Correspondence Analysis Chapter 14.

Copyright 2000, Media Cybernetics, L.P. Array-Pro ® Analyzer Software.

The Tutorial of Principal Component Analysis, Hierarchical Clustering, and Multidimensional Scaling Wenshan Wang.

Chapter 2 Dimensionality Reduction. Linear Methods

Whole Genome Expression Analysis

Classification (Supervised Clustering) Naomi Altman Nov '06.

DNA microarray technology allows an individual to rapidly and quantitatively measure the expression levels of thousands of genes in a biological sample.

Clustering of DNA Microarray Data Michael Slifker CIS 526.

Basic concepts in ordination

ArrayCluster: an analytic tool for clustering, data visualization and module ﬁnder on gene expression proﬁles 組員：李祥豪謝紹陽江建霖.

Element 2: Discuss basic computational intelligence methods.

1 Dimension Reduction Examples: 1. DNA MICROARRAYS: Khan et al (2001): 4 types of small round blue cell tumors (SRBCT) Neuroblastoma (NB) Rhabdomyosarcoma.

Exploratory multivariate analysis of genome scale data... Aedín Culhane Dana-Farber Cancer Institute/Harvard School of Public Health.

Microarray - Leukemia vs. normal GeneChip System.

Scenario 6 Distinguishing different types of leukemia to target treatment.

SINGULAR VALUE DECOMPOSITION (SVD)

Introduction to Statistical Analysis of Gene Expression Data Feng Hong Beespace meeting April 20, 2005.

Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.

Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics

Support Vector Machines and Gene Function Prediction Brown et al PNAS. CS 466 Saurabh Sinha.

Cluster validation Integration ICES Bioinformatics.

Analyzing Expression Data: Clustering and Stats Chapter 16.

Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.

PATTERN RECOGNITION : PRINCIPAL COMPONENTS ANALYSIS Richard Brereton

Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring T.R. Golub et al., Science 286, 531 (1999)

Principal Component Analysis (PCA)

CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.

Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.

Multivariate statistical methods. Multivariate methods multivariate dataset – group of n objects, m variables (as a rule n>m, if possible). confirmation.

Methods of multivariate analysis Ing. Jozef Palkovič, PhD.

Unsupervised Learning

PREDICT 422: Practical Machine Learning

Exploring Microarray data

School of Computer Science & Engineering

Microarray - Leukemia vs. normal GeneChip System.

Lab 4.1 From Database to Data mining

Molecular Classification of Cancer

Claudio Lottaz and Rainer Spang

Volume 1, Issue 2, Pages (March 2002)

Genome-wide epigenetic analysis delineates a biologically distinct immature acute leukemia with myeloid/T-lymphoid features by Maria E. Figueroa, Bas J.

Arjun Pennathur, MD, Liqiang Xi, MD, Virginia R. Litle, MD, William E

Cluster Analysis in Bioinformatics

Volume 7, Issue 4, Pages (April 2005)

Dimension reduction : PCA and Clustering

A, unsupervised hierarchical clustering of the expression of probe sets differentially expressed in the oral mucosa of smokers versus never smokers. A,

Claudio Lottaz and Rainer Spang

Unsupervised Learning

Presentation transcript:

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 1 Graphical Exploration of Gene Expression Data by Spectral Map Analysis L. Wouters, H. Göhlmann, L. Bijnens, P. Lewi

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 2 Outline Short introduction to microarray technology Exploratory multivariate data analysis Multivariate projection methods Principal component analysis Correspondence factor analysis Spectral map analysis Case data: Leukemia data (Golub, 1999) Comparison of multivariate projection methods

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 3 Introduction to DNA Microarray Experiments Hybridization Labelling mRNA Cell Chip Gene Probe 18µm Scan

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 4 Structure of Microarray Data n genes p samples Y = p  n (e.g. p = 38, n = 5327)

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 5 Exploratory Multivariate Analysis & Gene Expression Data Supervised learning: Which genes discriminate known groups of subjects ? Unsupervised learning: 1.Do gene expression profiles reveal groups of patients (class discovery) ? 2.Which genes correlate with these groups ? clustering methods projection methods

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 6 Multivariate Projection Methods & Gene Expression Data Principal Component Analysis Lefkovits, I., et al. (1988) Hilsenbeck S.G., et al. (1999) Correspondence Factor Analysis Fellenberg, K., et al. (2001) Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 7 Key Steps in Multivariate Projection Methods Lewi, P.J. (1993) Weighting by marginal means or other Re-expression replace each table element by its logarithm Closure (row, column, or double) divide each table element by marginal totals Centering (row, column, or double) subtract from each table element marginal mean Normalization (row, column, or global) divide each table element by standard deviation Factorization generalized SVD Projection of scores and loadings with proper scaling biplot

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 8 Principal Component Analysis Pearson (1901), Hotelling (1933) Optional log-transformation Constant weighting of row and columns Column centering Column normalization Centering and normalization handle tables asymmetrically: R-mode - Q-mode analysis

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 9 Correspondence Factor Analysis Benzécri (1973), Greenacre (1984) Weighting of row and columns by marginal row and column totals Double closure (rank reduction) Double centering Global normalization Distances are chi-square related

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 10 Spectral Map Analysis Lewi (1976) Constant weighting of row and columns, or weighting by marginal row and column totals Logarithmic re-expression Double centering (rank reduction) Global normalization Distances are contrasts (log-ratio) Areas of symbols are proportional to marginal row-and column- totals or to a selected column

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 11 Spectral Map Analysis Double Centering On log-basis related to double closure Treats row- and column-space equivalently Geometrically results in a projection of the data on a hyperplane orthogonal to the line of identiy Number of dimensions is reduced by one Effectively removes a “size” component from the data

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 12 Properties of Gene Expression Data Presence of extreme high values: logarithmic re-expression Importance of level of gene expression, differences at lower levels are less reliable: weighting for marginal totals Extreme rectangular shape of matrices, e.g rows, 20 columns: assymetric factor scaling (downplay variability among genes) label only most extreme genes

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 13 Case Study MIT Leukemia Data (Golub, 1999) Training sample: 38 bone marrow samples 27 acute lymphoblastic leukemia ALL (T-lineage, B-lineage) 11 acute myeloid leukemia AML 6817 genes (5327 used) arbitrary choice of 50 informative genes to discriminate between ALL and AML Validation sample: 34 bone marrow samples 20 ALL (T-lineage, B-lineage), 14 AML

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 14 Class Discovery & Gene Detection Use case study to evaluate & validate three multivariate projection methods: Ability to discover (known) classes Identify genes related to these classes Quantify relationships

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 15 Principal Component Analysis AML ALL-T ALL-B

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 16 Principal Component Analysis Conclusions Results of PCA obscured by presence of size component, i.e. overall level of expression of genes Poor performance in class discovery Useless for detecting genes related to classes

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 17 Correspondence Factor Analysis AML ALL-T ALL-B

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 18 Correspondence Factor Analysis Conclusions Excellent performance in class discovery Difficult to detect genes related to classes Difficult to interpret distances between genes, samples (chi- square values)

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 19 Spectral Map Analysis Marginal Means Weighted AML ALL-T ALL-B

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 20 Spectral Map Analysis Conclusions Excellent performance in class discovery Allows to detect genes related to classes Distances between objects are contrasts (ratios) Suggests data reduction

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 21 Weighted Spectral Map Analysis 0.5 % Most Extreme Genes AML ALL-T ALL-B

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 22 Positioning of Validation Set AML ALL-T ALL-B

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 23 Golub Data Set Conclusion Weighted SMA allows to reduce data from 5327 genes to 27 genes without loss of important information Positioning validation data indicates 27 genes are also predictive for new cases Further data reduction possible ? Quantify samples for gene specificity ?

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 24 Spectral Mapping as a Tool for Quantitative Differential Gene Expression AML ALL-T ALL-B

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 25 Conclusions Weighted SMA outperforms the other projection methods in: Visualizing data Class detection of biological samples Identifying important genes, confirmed by literature search Reducing the data Identifying & interpreting relations between genes and subjects Quantifying differential gene expression

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 26 SPM Library Principal component analysis PCA<-spectralmap(DATA,center="column",normal="column") plot(PCA,scale="uvc",label.tol=c(1,1),col.group=GRP) Correspondence analysis CFA<spectralmap(DATA,logtrans=False,row.weight="mean”, col.weight="mean",closure="double") plot(CFA,scale="uvc",label.tol=c(1,1),col.group=GRP) Spectral map analysis, weighted by marginal totals SPM<-spectralmap(DATA,row.weight="mean“,col.weight=“mean”) plot(SPM,scale="uvc",col.group=GRP)

September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 27 References Wouters, L., Göhlmann, H.W., Bijnens, L., Kass, S.U., Molenberghs, G., Lewi, P.J. (2002). Graphical Exploration of Gene Expression Data: A Comparative Study of Three Multivariate Methods. submitted Biometrics. SPM-library availability: