
More On Preprocessing
Javier Cabrera

Outline
1. Transform the data into a scale suitable for analysis.
2. Remove the effects of systematic and obfuscating sources of variation.
3. Identify discrepant observations.

Outline
Preprocessing => quality of downstream analyses.
1. Log transformation, $X \to \log(X)$
   - The variation of logged intensities may be less dependent on magnitude.
   - Taking logs reduces the skewness of highly skewed distributions.
   - Taking logs improves variance estimation.
2. Other transformations: power transformations ($X \to X^{\lambda}$ for some $\lambda$ = 1/2, 1/3 or other). Amaratunga and Cabrera (2000), Tusher et al (2001).
3. Variance stabilizing transformations, $X \to \log(X + c)$: symmetrizing the spot intensity data and stabilizing their variances.
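The claim that logged intensities vary less with magnitude can be checked on simulated data. The sketch below is illustrative only: the gene levels, the purely multiplicative error model, the base-2 logarithm and the helper name `log_transform` are assumptions, not part of the slides.

```python
import numpy as np

def log_transform(x, c=0.0):
    """Return log2(x + c); c = 0 is the plain log, c > 0 a shifted log."""
    x = np.asarray(x, dtype=float)
    return np.log2(x + c)

rng = np.random.default_rng(0)

# Three hypothetical genes with very different true expression levels.
true_level = np.array([100.0, 1000.0, 10000.0])

# Assumed multiplicative error: intensity = level * exp(eta), eta ~ N(0, 0.2^2).
replicates = true_level[:, None] * np.exp(rng.normal(0.0, 0.2, size=(3, 500)))

# Standard deviation per gene, before and after taking logs.
sd_raw = replicates.std(axis=1, ddof=1)
sd_log = log_transform(replicates).std(axis=1, ddof=1)

print("SD of raw intensities:   ", sd_raw)
print("SD of logged intensities:", sd_log)
```

On the raw scale the spread grows with the expression level, while on the log scale it is roughly the same for all three genes.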

Transformations
4. Rocke and Durbin (2001): arrays with replicate spots. Analogy: models used for estimating the concentration of an analyte:
   $X = \alpha + \mu e^{\eta} + \varepsilon$
   where $\alpha$ is the mean background, $\mu$ is the true expression level, and $\eta$ and $\varepsilon$ are normally distributed errors with variances $\sigma_\eta^2$ and $\sigma_\varepsilon^2$.
5. Durbin et al (2002): generalized log transformation. $\alpha$, $\sigma_\eta^2$ and $\sigma_\varepsilon^2$ must be estimated.
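The slide names the generalized log transformation without showing its formula. A minimal sketch follows, assuming the commonly cited form $g(x) = \ln\left((x-\alpha) + \sqrt{(x-\alpha)^2 + c}\right)$ with $c = \sigma_\varepsilon^2 / S_\eta^2$ and $S_\eta^2 = e^{\sigma_\eta^2}(e^{\sigma_\eta^2} - 1)$. The function names and the parameter values are hypothetical; in practice the three parameters would be estimated from the data.

```python
import numpy as np

def glog_constant(sigma_eta, sigma_eps):
    """Constant c = sigma_eps^2 / S_eta^2 (assumed form, see lead-in)."""
    s_eta_sq = np.exp(sigma_eta ** 2) * (np.exp(sigma_eta ** 2) - 1.0)
    return sigma_eps ** 2 / s_eta_sq

def glog(x, alpha, c):
    """Generalized log: ln(z + sqrt(z^2 + c)) with z = x - alpha."""
    z = np.asarray(x, dtype=float) - alpha
    return np.log(z + np.sqrt(z * z + c))

# Hypothetical parameter values, chosen only for illustration.
alpha, sigma_eta, sigma_eps = 50.0, 0.2, 20.0
c = glog_constant(sigma_eta, sigma_eps)

# Unlike the plain log, glog is defined for intensities below background.
below_background = glog(0.0, alpha, c)

# For large intensities glog behaves like ln(2 * (x - alpha)).
x_big = alpha + 1e6
gap = float(glog(x_big, alpha, c) - np.log(2.0 * (x_big - alpha)))
```

The design point is that the transformation is close to linear near the background and close to logarithmic at high intensities, which is what lets it stabilize the variance over the whole range.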

Power Transformations
$\lambda$ must be estimated. Three criteria:
- Equal variances: CV(gene variances)
- Low skewness: mean(skewness)
- No mean-variance correlation: correlation between mean and variance
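The three criteria can be evaluated over a grid of candidate powers. The sketch below rests on several assumptions: genes are rows and arrays are columns, skewness is computed per gene and averaged in absolute value, and the simulated Poisson-like data only provide a case where the square root is known to help. The slides do not say how the criteria are combined, so the function `power_criteria` (a hypothetical name) only reports them.

```python
import numpy as np

def power_criteria(X, lam):
    """Evaluate the three criteria for the transformation X -> X**lam."""
    Y = np.asarray(X, dtype=float) ** lam
    gene_mean = Y.mean(axis=1)
    gene_var = Y.var(axis=1, ddof=1)
    centered = Y - gene_mean[:, None]
    m2 = (centered ** 2).mean(axis=1)
    m3 = (centered ** 3).mean(axis=1)
    skew = m3 / m2 ** 1.5  # moment-based skewness per gene
    return {
        "cv_of_variances": gene_var.std(ddof=1) / gene_var.mean(),
        "mean_abs_skewness": np.abs(skew).mean(),
        "mean_variance_corr": np.corrcoef(gene_mean, gene_var)[0, 1],
    }

rng = np.random.default_rng(1)

# Simulated data: 200 genes x 20 arrays, variance proportional to the mean.
mu = np.linspace(50.0, 5000.0, 200)
X = rng.poisson(mu[:, None], size=(200, 20)).astype(float)

results = {lam: power_criteria(X, lam) for lam in (1.0, 0.5, 1.0 / 3.0)}
for lam, crit in results.items():
    print(f"lambda = {lam:.3f}", crit)
```

Smaller values are better for the first two criteria, and a mean-variance correlation near zero is better for the third.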

Example 1: Tissue Data
Tissue data: 3 treatments (A, B, C) applied to mouse tissue.
Arrays: Treatment A: 11; Treatment B: 11; Treatment C: 19.
Genes: 3487 genes.
Gene expression matrix X: Dim(X) = 100 x 41
Columns: treatA.1, treatA.2, ..., treatA.11, treatB.12, ...

[Figure panels: Power Trans $(X-3.60)^{-0.4}$; Quantile Normalized; Raw Data; Equal 75pctl; Log Transformed]

Gene selection for classification
- Left panel: PC2 vs PC1 plot, log transformation
- Right panel: PC2 vs PC1 plot, power transformation

Example 2: Khan et al (2001): 4 types of small round blue cell tumors (SRBCT)
- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Ewing family of tumors (EWS)
- Burkitt lymphomas (BL)
Training set = 63 (23 EWS, 20 RMS, 12 NB, 8 BL)
Testing set = 25 (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other)
Genes: of 6567 initial genes, 2308 were selected because they showed a minimal level of expression.
Subset A: cells: 23 EWS and 20 RMS from the training set; the 100 most significant genes after performing a t-test.
Gene expression matrix X: Dim(X) = 100 x 43
Columns: EWS.T1, EWS.T2, EWS.T3, EWS.T4, EWS.T6, EWS.T7, EWS.T9, EWS.T11, EWS.T12, EWS.T13, EWS.T14, EWS.T15, EWS.T19, EWS.C8, EWS.C3, EWS.C2, EWS.C4, EWS.C6, ...

[Figure panels: Power Trans $-(X-0.66)^{(\ldots)}$; Quantile Normalized; Raw Data; Equal 75pctl; Log Transformed]

Judging the success of a normalization
Arrays $\{Y_{g1}\}$ and $\{Y_{g2}\}$. Successful workflow => arrays are monotonically related to each other.
- Pearson's correlation coefficient: measures linearity rather than agreement.
- Concordance correlation coefficient: measures agreement between the two arrays.
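The difference between linearity and agreement is easiest to see on a toy example. The sketch below assumes Lin's definition of the concordance correlation coefficient, $\rho_c = 2 s_{12} / \left(s_1^2 + s_2^2 + (\bar{Y}_1 - \bar{Y}_2)^2\right)$, with variances and covariance computed with divisor $G$. The function name and the five-value arrays are illustrative.

```python
import numpy as np

def concordance_correlation(y1, y2):
    """Concordance correlation coefficient (Lin's form, assumed here)."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    m1, m2 = y1.mean(), y2.mean()
    s12 = ((y1 - m1) * (y2 - m2)).mean()  # covariance with divisor G
    return 2.0 * s12 / (y1.var() + y2.var() + (m1 - m2) ** 2)

# Toy arrays: the second is the first shifted by a constant offset.
y1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y2 = y1 + 1.0

pearson = np.corrcoef(y1, y2)[0, 1]
concordance = concordance_correlation(y1, y2)

print("Pearson:    ", pearson)
print("Concordance:", concordance)
```

The two arrays are perfectly linearly related, so Pearson's coefficient is 1, but they do not agree because of the offset, and the concordance coefficient drops below 1.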

Judging the success of a normalization
Arrays $\{Y_{g1}\}$ and $\{Y_{g2}\}$. Successful workflow => arrays are monotonically related to each other.
- Spearman's rank correlation coefficient, where $R_{gi}$ is the rank of $Y_{gi}$ when the $\{Y_{gi}\}$ are ranked from 1 to $G$.
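Spearman's coefficient can be sketched as Pearson's correlation applied to the ranks $R_{gi}$. The example assumes there are no ties (the double `argsort` ranking below does not average tied ranks), and the helper names and toy data are hypothetical.

```python
import numpy as np

def ranks(y):
    """Ranks from 1 to G, assuming no tied values."""
    y = np.asarray(y, dtype=float)
    return np.argsort(np.argsort(y)) + 1

def spearman(y1, y2):
    """Spearman's rank correlation as Pearson's correlation of the ranks."""
    return np.corrcoef(ranks(y1), ranks(y2))[0, 1]

# Toy arrays related monotonically but not linearly.
y1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y2 = np.exp(y1)

pearson = np.corrcoef(y1, y2)[0, 1]
rank_corr = spearman(y1, y2)

print("Pearson: ", pearson)
print("Spearman:", rank_corr)
```

Because the relation is monotone, the ranks of the two arrays coincide and Spearman's coefficient is 1, while Pearson's coefficient is lower because the relation is not linear.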

[Figure: Concordance Map, an image plot of concordance correlations among arrays X44, X45, X46, X47, X48, X49, X50, ...]


[Figure: Linear correlation. Panels: Standard Normal; t dist, df=6; t dist, df=2]

Correlation
1. If the distributional properties of the values change substantially during a normalization (e.g., the skewness is decreased), it is possible that the concordance correlation coefficients might increase, but this may only be an artificial improvement.
2. For microarrays that have been normalized by equating all the quantiles, the concordance correlation coefficient will be equal to Pearson's correlation coefficient. This is because, after such a normalization, the quantiles of both samples are identical and, therefore, both means are equal and both variances are equal too.
3. Spearman's rank correlation coefficient is equal to (a) Pearson's correlation coefficient calculated on the ranks of the data, and (b) the concordance correlation coefficient calculated on the ranks of the data.
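Points 2 and 3 can be verified numerically. The sketch below makes several assumptions: quantile normalization replaces each sorted value by the mean of the sorted values across arrays, the concordance coefficient takes Lin's form, the data are simulated, and there are no ties. All function names are hypothetical.

```python
import numpy as np

def concordance_correlation(y1, y2):
    """Concordance correlation coefficient (Lin's form, assumed here)."""
    m1, m2 = y1.mean(), y2.mean()
    s12 = ((y1 - m1) * (y2 - m2)).mean()
    return 2.0 * s12 / (y1.var() + y2.var() + (m1 - m2) ** 2)

def quantile_normalize(Y):
    """Give every column the same set of values (mean of sorted columns)."""
    Y = np.asarray(Y, dtype=float)
    rank_idx = np.argsort(np.argsort(Y, axis=0), axis=0)  # 0-based ranks
    reference = np.sort(Y, axis=0).mean(axis=1)           # common quantiles
    return reference[rank_idx]

rng = np.random.default_rng(2)
base = rng.normal(size=1000)

# Two simulated arrays with different location and scale.
Y = np.column_stack([
    np.exp(base + 0.3 * rng.normal(size=1000)),
    2.0 * np.exp(base + 0.3 * rng.normal(size=1000)) + 1.0,
])

pearson_raw = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]
ccc_raw = concordance_correlation(Y[:, 0], Y[:, 1])

# Point 2: after quantile normalization, concordance equals Pearson.
Q = quantile_normalize(Y)
pearson_qn = np.corrcoef(Q[:, 0], Q[:, 1])[0, 1]
ccc_qn = concordance_correlation(Q[:, 0], Q[:, 1])

# Point 3: on the ranks, Pearson and concordance both give Spearman.
R = (np.argsort(np.argsort(Y, axis=0), axis=0) + 1).astype(float)
pearson_ranks = np.corrcoef(R[:, 0], R[:, 1])[0, 1]
ccc_ranks = concordance_correlation(R[:, 0], R[:, 1])
```

Before normalization the concordance coefficient is smaller than Pearson's because the arrays differ in mean and variance. After equating the quantiles, or after replacing values by ranks, both arrays share the same mean and variance, so the two coefficients coincide.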