1
More On Preprocessing Javier Cabrera
2
Outline
1. Transform the data into a scale suitable for analysis.
2. Remove the effects of systematic and obfuscating sources of variation.
3. Identify discrepant observations.
3
Outline
Preprocessing => quality of downstream analyses
1. Log transformation, X → log(X)
- The variation of logged intensities may be less dependent on magnitude.
- Taking logs reduces the skewness of highly skewed distributions.
- Taking logs improves variance estimation.
2. Other transformations: power transformations (X → X^λ for some λ = 1/2, 1/3 or other); Amaratunga and Cabrera (2000), Tusher et al (2001).
3. Variance stabilizing transformations, X → log(X + c): symmetrizing the spot intensity data and stabilizing their variances.
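The effect of these transformations on skewness can be checked numerically. The sketch below is only an illustration: the intensity values are made up, and `skewness`, `log_transform` and `power_transform` are hypothetical helper names, not functions from the presentation.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log_transform(values, c=0.0):
    """X -> log(X + c); c = 0 gives the plain log transformation."""
    return [math.log(v + c) for v in values]

def power_transform(values, lam):
    """X -> X**lam, for example lam = 1/2 or 1/3."""
    return [v ** lam for v in values]

# Hypothetical right-skewed spot intensities (illustrative numbers only).
intensities = [35, 42, 58, 77, 120, 260, 610, 1500, 4200, 16000]

raw_skew = skewness(intensities)
cube_root_skew = skewness(power_transform(intensities, 1 / 3))
log_skew = skewness(log_transform(intensities))
print(raw_skew, cube_root_skew, log_skew)
```

For this toy sample the skewness drops from the raw scale to the cube-root scale and again to the log scale; real array data need not behave so cleanly.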
4
Transformations
4. Rocke and Durbin (2001): arrays with replicate spots. Analogy: models used for estimating the concentration of an analyte: X = α + μ e^η + ε, where α is the mean background, μ is the true expression level, and η and ε are normally distributed errors with variances σ_η² and σ_ε².
5. Durbin et al (2002): generalized log transformation. α, σ_η² and σ_ε² must be estimated.
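A minimal sketch of the two ideas on this slide, assuming the usual two-component form X = α + μ·exp(η) + ε for the measurement model and the commonly quoted form log(z + sqrt(z² + c)) with z = X − α for the generalized log. The constant `c` depends on the two error variances; here it is simply passed in, and every numeric value below is hypothetical.

```python
import math
import random

def simulate_intensity(mu, alpha, sigma_eta, sigma_eps, rng):
    """One draw from X = alpha + mu * exp(eta) + eps with normal eta and eps."""
    eta = rng.gauss(0.0, sigma_eta)
    eps = rng.gauss(0.0, sigma_eps)
    return alpha + mu * math.exp(eta) + eps

def glog(x, alpha, c):
    """Generalized log: log(z + sqrt(z**2 + c)) with z = x - alpha and c > 0."""
    z = x - alpha
    return math.log(z + math.sqrt(z * z + c))

# Hypothetical parameter values, chosen only for illustration.
alpha, sigma_eta, sigma_eps, c = 50.0, 0.2, 10.0, 100.0
rng = random.Random(0)

unexpressed = [simulate_intensity(0.0, alpha, sigma_eta, sigma_eps, rng) for _ in range(5)]
expressed = [simulate_intensity(5000.0, alpha, sigma_eta, sigma_eps, rng) for _ in range(5)]

# Unlike log(x - alpha), glog stays defined when x falls below the background.
print([round(glog(x, alpha, c), 3) for x in unexpressed])
print([round(glog(x, alpha, c), 3) for x in expressed])
```

For intensities far above the background the transformation behaves like log(2(X − α)), while near the background it is close to linear.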
5
Power Transformations
The transformation parameters must be estimated. Three criteria:
- Equal variances: CV(gene variances)
- Low skewness: mean(skewness)
- No mean-variance correlation: correlation between mean and variance
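The three criteria can be computed for a grid of candidate powers. The sketch below assumes a genes-by-arrays matrix and a transformation of the form (X − shift)^λ; the matrix, the function name `criteria` and the grid of powers are hypothetical, and the slide does not say how the three numbers are combined into a single choice.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def skewness(values):
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def criteria(matrix, lam, shift=0.0):
    """Rows are genes, columns are arrays; requires every x > shift."""
    transformed = [[(x - shift) ** lam for x in row] for row in matrix]
    gene_means = [mean(row) for row in transformed]
    gene_vars = [variance(row) for row in transformed]
    cv_of_variances = math.sqrt(variance(gene_vars)) / mean(gene_vars)
    mean_skewness = mean([skewness(row) for row in transformed])
    mean_var_corr = pearson(gene_means, gene_vars)
    return cv_of_variances, mean_skewness, mean_var_corr

# Hypothetical data with purely multiplicative error: spread grows with level.
levels = [50, 100, 200, 400, 800, 1600]
factors = [0.8, 0.9, 1.0, 1.1, 1.3]
matrix = [[mu * f for f in factors] for mu in levels]

for lam in (1.0, 0.5, 1 / 3, 0.1):
    print(lam, [round(v, 3) for v in criteria(matrix, lam)])
```

In this toy matrix the coefficient of variation of the gene variances shrinks as the power moves toward zero, which is the behaviour the equal-variance criterion rewards.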
6
Example 1: Tissue Data
Tissue data: 3 treatments (A, B, C) applied to mouse tissue.
Arrays: Treatment A: 11, Treatment B: 11, Treatment C: 19.
Genes: 3487 genes.
Gene expression matrix X: Dim(X)=100x41
    treatA.1  treatA.2  treatA.3  treatA.4  treatA.5  treatA.6  treatA.7  treatA.8  treatA.9 treatA.10 treatA.11 treatB.12 treatB.13
 1     3.706     3.900     3.877     3.769     3.654     3.805     3.661     3.878     4.213     3.989     3.877     3.797     3.743
 2     3.762     4.034     4.402     3.912     3.889     3.988     4.280     3.901     4.385     3.835     4.051     4.583     4.973
 3     4.140     4.114     4.182     4.200     4.117     4.029     4.200     4.137     4.344     4.122     3.989     4.273     4.368
 4     3.555     3.555     3.555     3.555     3.555     3.555     3.555     3.621     4.181     3.555     3.555     3.555     3.571
 5     4.228     4.152     3.828     4.216     3.889     3.923     3.912     4.102     4.273     3.858     4.031     4.144     3.976
 6     6.622     6.749     6.625     6.883     6.865     6.335     6.241     6.201     5.895     6.548     6.577     6.298     6.546
 7     7.322     7.437     7.523     7.267     7.586     7.562     7.238     7.294     6.812     7.557     7.370     7.497     6.834
 8     3.555     3.555     3.555     3.555     3.555     3.555     3.555     3.591     4.165     3.555     3.555     3.555     3.571
 9     4.756     4.605     4.935     4.295     4.510     4.571     4.396     4.804     4.639     5.239     4.402     4.502     4.248
10     4.468     4.306     4.483     4.396     4.432     4.008     4.475     4.357     4.344     4.208     4.147     4.227     4.436
...
7
[Figure: panels titled Raw Data, Log Transformed, Power Trans (X − 3.60)^(−0.4), Equal 75pctl, Quantile Normalized]
8
Gene selection for classification
- Left panel: PC2 vs PC1 plot, log transformation
- Right panel: PC2 vs PC1 plot, power transformation
9
Example 2: Khan et al (2001)
4 types of small round blue cell tumors (SRBC):
- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Ewing family of tumors (EWS)
- Burkitt lymphomas (BL)
Training set = 63 (23 EWS, 20 RMS, 12 NB, 8 BL)
Testing set = 25 (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other)
Genes: of 6567 initial genes, 2308 were selected because they showed a minimal level of expression.
Subset A: cells: 23 EWS and 20 RMS from the training set; the 100 most significant genes after performing a t-test.
Gene expression matrix X: Dim(X)=100x43
   EWS.T1  EWS.T2  EWS.T3  EWS.T4  EWS.T6  EWS.T7  EWS.T9 EWS.T11 EWS.T12 EWS.T13 EWS.T14 EWS.T15 EWS.T19  EWS.C8  EWS.C3  EWS.C2  EWS.C4  EWS.C6  EWS.C9
1   3.203   1.655   3.278   1.006   2.710   2.059   1.848   2.714   2.356   1.929   3.616   2.151   2.312   1.069   0.919   0.925   2.626   1.079   1.099
2   0.068   0.071   0.116   0.191   0.237   0.082   0.123   0.180   0.079   0.252   0.106   0.097   0.160   0.197   0.192   0.089   0.092   0.178   0.166
3   1.046   1.041   0.893   0.430   0.369   0.902   0.998   0.496   0.761   0.574   0.583   0.499   0.579   1.681   0.786   1.511   1.869   2.346   2.019
...
10
[Figure: panels titled Raw Data, Log Transformed, Power Trans −(X − 0.66)^(−0.04), Equal 75pctl, Quantile Normalized]
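Two of the panel labels refer to normalizations that are easy to sketch. The code below is one possible reading, not the presenter's implementation: quantile normalization maps each array onto the mean of the sorted arrays (a common choice of reference, assumed here), and "Equal 75pctl" is interpreted as rescaling each array so that its 75th percentile matches the average 75th percentile. The function names and the toy arrays are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same quantiles (reference: mean of the sorted arrays)."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda g: a[g])  # ties keep their input order
        out = [0.0] * n
        for rank, g in enumerate(order):
            out[g] = reference[rank]
        normalized.append(out)
    return normalized

def percentile(values, q):
    """Percentile with linear interpolation between order statistics (0 <= q <= 1)."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def equalize_75th_percentile(arrays):
    """Rescale each array so that all arrays share the same 75th percentile."""
    p75 = [percentile(a, 0.75) for a in arrays]
    target = sum(p75) / len(p75)
    return [[v * target / p for v in a] for a, p in zip(arrays, p75)]

# Two hypothetical arrays measured on the same four genes.
arrays = [[5.0, 2.0, 3.0, 9.0], [4.0, 1.0, 8.0, 6.0]]
qn = quantile_normalize(arrays)
scaled = equalize_75th_percentile(arrays)
print(qn)
print([percentile(a, 0.75) for a in scaled])
```

If the data are already on a log scale, subtracting a constant per array would be the analogous adjustment to the multiplicative rescaling used here.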
11
Judging the success of a normalization
Two arrays: {Y_g1} and {Y_g2}. Successful workflow => arrays are monotonically related to each other.
- Pearson's correlation coefficient: measures linearity rather than agreement.
- Concordance correlation coefficient: measures agreement, that is, closeness to the line Y_g2 = Y_g1.
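The difference between linearity and agreement shows up in a small example. The sketch assumes Lin's form of the concordance correlation coefficient, 2·s12 / (s1² + s2² + (mean1 − mean2)²), with variances and covariance computed with divisor G; the function names and the data are hypothetical.

```python
import math

def moments(y1, y2):
    """Means, variances and covariance (divisor G) of two arrays."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    s1 = sum((a - m1) ** 2 for a in y1) / n
    s2 = sum((b - m2) ** 2 for b in y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    return m1, m2, s1, s2, s12

def pearson(y1, y2):
    _, _, s1, s2, s12 = moments(y1, y2)
    return s12 / math.sqrt(s1 * s2)

def concordance(y1, y2):
    m1, m2, s1, s2, s12 = moments(y1, y2)
    return 2 * s12 / (s1 + s2 + (m1 - m2) ** 2)

# The second array is an exact linear function of the first, but not equal to it.
y1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y2 = [2 * v + 1 for v in y1]

print(pearson(y1, y2))      # perfectly linear relationship
print(concordance(y1, y2))  # penalized for the shift and the change of scale
```

Pearson's coefficient is 1 for this pair, while the concordance coefficient is 8/26, because the arrays differ in both location and scale.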
12
Judging the success of a normalization
Two arrays: {Y_g1} and {Y_g2}. Successful workflow => arrays are monotonically related to each other.
- Spearman's rank correlation coefficient: R_gi is the rank of Y_gi when the {Y_gi} are ranked from 1 to G.
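A short sketch of Spearman's coefficient, using the textbook formula 1 − 6·Σ(R_g1 − R_g2)² / (G(G² − 1)), which is valid when there are no ties. The function names and the data are hypothetical.

```python
import math

def simple_ranks(values):
    """Ranks from 1 to G, assuming all values are distinct."""
    order = sorted(range(len(values)), key=lambda g: values[g])
    ranks = [0] * len(values)
    for rank, g in enumerate(order, start=1):
        ranks[g] = rank
    return ranks

def spearman(y1, y2):
    """Spearman's rank correlation via the sum of squared rank differences."""
    G = len(y1)
    r1, r2 = simple_ranks(y1), simple_ranks(y2)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (G * (G * G - 1))

# A monotone but strongly nonlinear relationship between two arrays.
y1 = [0.5, 1.0, 2.0, 3.0, 5.0]
y2 = [math.exp(v) for v in y1]

print(spearman(y1, y2))                   # 1.0: the ranks agree exactly
print(spearman(y1, list(reversed(y1))))   # -1.0: the ranks are reversed
```

Because only the ranks enter, any monotone increasing relationship between the two arrays gives a coefficient of 1.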
13
Concordance Map
Image plot of concordance correlations:
      X44   X45   X46   X47   X48   X49   X50
X44 1.000 0.703 0.622 0.706 0.674 0.746 0.694
X45 0.703 1.000 0.702 0.679 0.784 0.710 0.788
X46 0.622 0.702 1.000 0.791 0.683 0.562 0.776
X47 0.706 0.679 0.791 1.000 0.691 0.607 0.760
X48 0.674 0.784 0.683 0.691 1.000 0.770 0.832
X49 0.746 0.710 0.562 0.607 0.770 1.000 0.727
X50 0.694 0.788 0.776 0.760 0.832 0.727 1.000
14
Concordance Map
Image plot of concordance correlations:
      X44   X45   X46   X47   X48   X49   X50
X44 1.000 0.756 0.622 0.700 0.695 0.813 0.698
X45 0.756 1.000 0.813 0.722 0.793 0.710 0.803
X46 0.622 0.813 1.000 0.789 0.753 0.655 0.826
X47 0.700 0.722 0.789 1.000 0.714 0.663 0.763
X48 0.695 0.793 0.753 0.714 1.000 0.779 0.834
X49 0.813 0.710 0.655 0.663 0.779 1.000 0.742
X50 0.698 0.803 0.826 0.763 0.834 0.742 1.000
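A matrix like the ones behind these concordance maps can be built by applying the concordance correlation coefficient to every pair of arrays. The sketch below assumes Lin's form of the coefficient with divisor-G moments; the array names and values are hypothetical and unrelated to X44 through X50.

```python
def concordance(y1, y2):
    """Concordance correlation: 2*s12 / (s1 + s2 + (m1 - m2)**2), divisor G."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    s1 = sum((a - m1) ** 2 for a in y1) / n
    s2 = sum((b - m2) ** 2 for b in y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    return 2 * s12 / (s1 + s2 + (m1 - m2) ** 2)

def concordance_matrix(arrays):
    """arrays maps an array name to its list of gene values."""
    names = list(arrays)
    return {a: {b: concordance(arrays[a], arrays[b]) for b in names} for a in names}

# Hypothetical arrays measured on the same five genes.
arrays = {
    "A1": [3.7, 4.0, 6.6, 7.3, 4.8],
    "A2": [3.9, 4.1, 6.7, 7.4, 4.6],
    "A3": [4.4, 3.8, 6.2, 7.9, 5.3],
}
matrix = concordance_matrix(arrays)

print("     " + "  ".join(f"{name:>5}" for name in arrays))
for a in arrays:
    print(f"{a:>5}" + "  ".join(f"{matrix[a][b]:5.3f}" for b in arrays))
```

The result is symmetric with ones on the diagonal, so an image plot only needs one triangle of it.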
15
Linear correlation
[Figure: panels titled Standard Normal; t dist, df=6; t dist, df=2]
16
Correlation
1. If the distributional properties of the values change substantially during a normalization (e.g., the skewness is decreased), it is possible that the concordance correlation coefficients might increase, but this may only be an artificial improvement.
2. For microarrays that have been normalized by equating all the quantiles, the concordance correlation coefficient will be equal to Pearson's correlation coefficient. This is because, after such a normalization, the quantiles of both samples are identical and, therefore, both means are equal and both variances are equal too.
3. Spearman's rank correlation coefficient is equal to (a) Pearson's correlation coefficient calculated on the ranks of the data and (b) the concordance correlation coefficient calculated on the ranks of the data.
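Points 2 and 3 can be checked numerically on a toy pair of arrays. The sketch assumes Lin's form of the concordance coefficient, quantile normalization to the mean of the two sorted arrays, and data without ties; all names and values are hypothetical.

```python
import math

def moments(y1, y2):
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    s1 = sum((a - m1) ** 2 for a in y1) / n
    s2 = sum((b - m2) ** 2 for b in y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    return m1, m2, s1, s2, s12

def pearson(y1, y2):
    _, _, s1, s2, s12 = moments(y1, y2)
    return s12 / math.sqrt(s1 * s2)

def concordance(y1, y2):
    m1, m2, s1, s2, s12 = moments(y1, y2)
    return 2 * s12 / (s1 + s2 + (m1 - m2) ** 2)

def ranks(values):
    """Ranks from 1 to G as floats, assuming no ties."""
    order = sorted(range(len(values)), key=lambda g: values[g])
    r = [0.0] * len(values)
    for rank, g in enumerate(order, start=1):
        r[g] = float(rank)
    return r

def quantile_normalize_pair(y1, y2):
    """Replace each value by the mean of the two arrays' matching order statistics."""
    reference = [(a + b) / 2 for a, b in zip(sorted(y1), sorted(y2))]
    q1 = [reference[int(r) - 1] for r in ranks(y1)]
    q2 = [reference[int(r) - 1] for r in ranks(y2)]
    return q1, q2

# Hypothetical arrays with different location and spread.
y1 = [2.1, 3.7, 1.2, 8.4, 5.5, 4.9]
y2 = [3.0, 9.9, 2.5, 7.1, 12.4, 6.2]

q1, q2 = quantile_normalize_pair(y1, y2)
r1, r2 = ranks(y1), ranks(y2)
G = len(y1)
spearman_value = 1 - 6 * sum((a - b) ** 2 for a, b in zip(r1, r2)) / (G * (G * G - 1))

print(pearson(y1, y2), concordance(y1, y2))   # differ before normalization
print(pearson(q1, q2), concordance(q1, q2))   # coincide after quantile normalization
print(spearman_value, pearson(r1, r2), concordance(r1, r2))  # all three coincide
```

The equalities hold because, after quantile normalization and likewise after ranking, both arrays are permutations of the same set of values, so their means and variances match.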