Microarray Data Analysis
Xuming He
Department of Statistics, University of Illinois at Urbana-Champaign

Websites R Software: MAANOVA package: SMA package:
Data Normalization
- Global lowess normalization (within slide): often useful.
- Median normalization (across slides)

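Global Lowess Normalization
- Assume the changes are roughly symmetric for most genes.
- $\log_2(R/G) \to \log_2(R/G) - c(A) = \log_2\big(R/(k(A)\,G)\big)$
- Here $c(A)$ is the lowess fit to the M vs A plot, with $M = \log_2(R/G)$ and $A = \tfrac{1}{2}\log_2(RG)$, and $k(A) = 2^{c(A)}$.
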
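The intensity-dependent correction above can be sketched in a few lines. This is a simplified stand-in, not the lowess routine of any package mentioned in these slides: it fits a tricube-weighted straight line around each point and omits the robustness iterations of full lowess. The function names and the `frac` span parameter are assumptions made for illustration.

```python
import math

def tricube(u):
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_fit(a_values, m_values, frac=0.5):
    """Estimate c(A) at each A with a tricube-weighted straight-line fit."""
    n = len(a_values)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbours used per local fit
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        # Bandwidth: slightly beyond the k-th nearest point (tiny value if all coincide).
        h = dists[k - 1] * 1.0001 or 1e-12
        w = [tricube(abs(a - a0) / h) for a in a_values]
        sw = sum(w)
        swa = sum(wi * a for wi, a in zip(w, a_values))
        swm = sum(wi * m for wi, m in zip(w, m_values))
        swaa = sum(wi * a * a for wi, a in zip(w, a_values))
        swam = sum(wi * a * m for wi, a, m in zip(w, a_values, m_values))
        denom = sw * swaa - swa * swa
        if abs(denom) < 1e-12:
            fitted.append(swm / sw)  # degenerate case: fall back to a weighted mean
        else:
            slope = (sw * swam - swa * swm) / denom
            fitted.append((swm - slope * swa) / sw + slope * a0)
    return fitted

def lowess_normalize(red, green, frac=0.5):
    """Return normalized log ratios M - c(A) for paired red/green intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    c = local_linear_fit(a, m, frac)
    return [mi - ci for mi, ci in zip(m, c)]
```

A smaller `frac` follows the curvature of the M vs A plot more closely at the cost of a noisier fit. For real data, an established lowess implementation is the safer choice.
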
Median Normalization (Optional)
- Normalize the median log ratio of each gene across all slides to 0.
- Formula of normalized data: $M'_{gi} = M_{gi} - \operatorname{median}_{i'}\,(M_{gi'})$, where $M_{gi}$ is the log ratio of gene g from slide i.

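A minimal sketch of this step, reading the slide as per-gene centering: each gene's median log ratio across slides is subtracted from that gene's values. The nested-list layout `log_ratios[g][i]` and the function name are assumptions for illustration.

```python
from statistics import median

def median_center_genes(log_ratios):
    """Subtract each gene's median across slides; log_ratios[g][i] is gene g on slide i."""
    centered = []
    for gene_row in log_ratios:
        gene_median = median(gene_row)
        centered.append([value - gene_median for value in gene_row])
    return centered

# Two genes measured on three slides (made-up numbers).
example = [[1.0, 1.5, 2.0], [-0.5, 0.0, 0.25]]
normalized = median_center_genes(example)
```

After centering, every gene has a median log ratio of 0 across slides.
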
Transformations
- Shift-log transformation (Kerr et al. 2002)
- Curve fitting transformation (Yang et al b)
- Variance stabilizing transformation: linlog transformation

Shift-log Transformation (Newton et al. 2001)
- Move the origin along the line by adding the same positive constant to both channels: $\log_2\big((R_k + C)/(G_k + C)\big)$, where k indicates genes.
- The main effect is to shrink the variance of the log ratios at the low-intensity end.
- By expanding the range of C to include negative values, we can also increase the variance at the low-intensity end.

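The effect of the shift is easy to see numerically. The sketch below assumes the form described on the slide, the same constant added to both channels before taking the log ratio; the function name and the example intensities are made up.

```python
import math

def shift_log_ratio(red, green, shift):
    """Log2 ratio after adding the same constant to both channels."""
    if red + shift <= 0 or green + shift <= 0:
        raise ValueError("shifted intensities must be positive")
    return math.log2((red + shift) / (green + shift))

# A threefold ratio at low and at high intensity.
low_raw = shift_log_ratio(60.0, 20.0, 0.0)
low_shifted = shift_log_ratio(60.0, 20.0, 100.0)
high_raw = shift_log_ratio(6000.0, 2000.0, 0.0)
high_shifted = shift_log_ratio(6000.0, 2000.0, 100.0)
```

A positive shift pulls the low-intensity ratio strongly toward 0 while leaving the high-intensity ratio almost unchanged, which is why the variance shrinks mainly at the low end.
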
Curve Fitting Transformation (Yang et al b)
- Add one spot-specific constant to the signal values of one channel and subtract the same constant from the signals in the other channel prior to the log transformation: $\log_2\big((R_k + C_k)/(G_k - C_k)\big)$.
- Here $C_k$ is the spot-specific constant determined by the local regression line.

Linlog Transformation
- Assume that additive error is dominant at low intensity and multiplicative error is dominant at high intensity, so the transformation is linear below a cutoff $d_i$ and logarithmic above it: $f(z_i) = z_i/d_i + \ln d_i - 1$ for $z_i < d_i$, and $f(z_i) = \ln z_i$ for $z_i \ge d_i$, where i indicates channels (g/r).
- In practice, $d_i$ is usually estimated by the 25% quantile of the intensities.
- The linlog transformation does not correct the curvature in MA plots.
- It can be combined with either shift-log or lowess.

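A minimal sketch of a linlog transform, under the assumption that it takes the usual piecewise form: linear below the cutoff d and natural log above it, with the two pieces matched in value and slope at d. The function names are illustrative, and the 25% quantile comes from Python's `statistics.quantiles`, whose interpolation rule may differ from other software.

```python
import math
from statistics import quantiles

def linlog(z, d):
    """Linear below the cutoff d, natural log at or above it (continuous at d)."""
    if d <= 0:
        raise ValueError("cutoff d must be positive")
    if z < d:
        return z / d + math.log(d) - 1.0
    return math.log(z)

def linlog_channel(intensities):
    """Transform one channel, using its 25% quantile as the cutoff d."""
    d = quantiles(intensities, n=4)[0]
    return [linlog(z, d) for z in intensities]

channel = [12.0, 40.0, 95.0, 300.0, 1200.0, 5000.0, 20000.0]
transformed = linlog_channel(channel)
```

Because the lower branch is linear, zero or negative background-corrected intensities still get a finite value, unlike a plain log transformation.
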
Normalization Comparison
GeneSpring | MAANOVA | SMA
- Data Input: Raw data | Adjusted intensities | Raw data
- Negative Measurements: Set to 0 or 0.01 | Can not handle | Missing
- Dye Swap: Yes | No
- Intensity Dependent Normalization: Yes
- Normalize to a Percentile (within): Yes | No | Median
- Normalize to Positive Control Genes: Yes | No
- Normalize to a Constant Value: Yes | No

Normalization Comparison (continued)
GeneSpring | MAANOVA | SMA
- Divide by Specific Samples: Yes | No
- Normalize to Median: Yes | No
- Median Polishing: Yes | No
- Print-tip Group Lowess: No | Sort of | Yes
- Scaled Print-tip Group Lowess: No | Yes
- Shift Transformation: No | Yes | No
- Linear-log Transformation: No | Yes | No
- Linear-log Shift Transformation: No | Yes | No

Statistical Analysis
- One-sample t-test
- Two-sample t-test
- Nonparametric test (Wilcoxon-Mann-Whitney test)
- Global error model
- Multiple group comparisons (ANOVA)

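One-Sample t-Test
- $H_0$: log ratio = 0 versus $H_A$: log ratio is not 0.
- Reject $H_0$ if $|t| > t_{\alpha/2,\,n-1}$, where $t = \bar{M}/(s/\sqrt{n})$ is computed from the n replicate log ratios of the gene and $\alpha$ is the significance level.
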
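The test for a single gene can be sketched as follows. The replicate log ratios are made up, and the critical value 2.776 is the standard two-sided 5% value of the t distribution with 4 degrees of freedom, hard-coded here because the Python standard library has no t distribution.

```python
import math
from statistics import mean, stdev

def one_sample_t(log_ratios):
    """Return (t statistic, degrees of freedom) for H0: mean log ratio = 0."""
    n = len(log_ratios)
    if n < 2:
        raise ValueError("need at least two replicates")
    s = stdev(log_ratios)
    if s == 0:
        raise ValueError("zero variance: t statistic undefined")
    return mean(log_ratios) / (s / math.sqrt(n)), n - 1

T_CRIT_DF4 = 2.776  # two-sided 5% critical value for 4 degrees of freedom

ratios = [0.9, 1.1, 1.3, 0.8, 1.2]  # five replicate log ratios of one gene
t_stat, df = one_sample_t(ratios)
reject_h0 = abs(t_stat) > T_CRIT_DF4
```

With thousands of genes tested this way, the per-gene significance level needs a multiple-testing adjustment.
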
Two-Sample t-Test (equal variance)
- $H_0$: log ratios of the two groups are equal
- $H_A$: log ratios of the two groups are not equal
- Exact p-value for normal data, even for small samples.

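Two-Sample t-Test (unequal variance)
- T statistic: $t = (\bar{x}_1 - \bar{x}_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$
- Approximate its null distribution with a t distribution whose degrees of freedom are estimated from the data (Welch-Satterthwaite).
- The p-value is not exact.
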
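A minimal sketch of the unequal-variance statistic, assuming the standard Welch form with Satterthwaite's approximate degrees of freedom. The function name and the data are illustrative.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (t statistic, approximate degrees of freedom) without assuming equal variances."""
    n1, n2 = len(x), len(y)
    v1 = variance(x) / n1  # squared standard error of the first group mean
    v2 = variance(y) / n2  # squared standard error of the second group mean
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, approx_df = welch_t(group_a, group_b)
```

The degrees of freedom are generally not an integer, which reflects that the t reference distribution is only an approximation here.
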
Nonparametric Test (Wilcoxon-Mann-Whitney test)
- Uses the ranks of the data
- Applicable when each group has more than 5 replicates
- Works for non-normal data
- Alternative: permutation test

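Both ideas on this slide can be sketched with the standard library: the rank sum that the Wilcoxon-Mann-Whitney test is built on, and an exact permutation test on the difference in group means. The function names are assumptions, and full enumeration of the group splits is only practical for small groups.

```python
from itertools import combinations

def rank_sum(x, y):
    """Sum of the ranks of x in the pooled sample, using midranks for ties."""
    pooled = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        i = j
    return sum(rank_of[value] for value in x)

def permutation_p_value(x, y):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = list(x) + list(y)
    n, n1 = len(pooled), len(x)
    total = sum(pooled)
    observed = abs(sum(x) / n1 - (total - sum(x)) / (n - n1))
    extreme = 0
    splits = 0
    for idx in combinations(range(n), n1):
        s1 = sum(pooled[i] for i in idx)
        diff = abs(s1 / n1 - (total - s1) / (n - n1))
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        splits += 1
    return extreme / splits

p_exact = permutation_p_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

With three replicates per group there are only 20 possible splits, so the smallest attainable two-sided p-value is 0.1. This illustrates why rank and permutation tests need more than a handful of replicates.
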
Global Error Model
- Used when there are few or no replicates
- Requires some assumption about the variance

Multiple Group Tests (One-way ANOVA)
- Parametric test, assuming equal variance
- Parametric test, not assuming equal variance
- Nonparametric test (Kruskal-Wallis test)

Multiple Group Tests (Multi-way ANOVA)
- Control for several factors
- Example: a model for $y_{gijk}$, the log intensity of gene g on array i for dye j and condition k
- No need to do certain normalizations
- Equal variance is assumed

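The model on the original slide did not survive extraction. As an illustration only, one widely used form in the style of Kerr and Churchill's ANOVA models for two-color arrays is shown below; the particular set of terms is an assumption, not a reconstruction of the slide.

```latex
y_{gijk} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \varepsilon_{gijk}
```

Here $A_i$, $D_j$, $V_k$ and $G_g$ are overall array, dye, condition and gene effects, $(AG)_{ig}$ is a spot effect, and $(VG)_{kg}$ is the condition-by-gene interaction that captures differential expression. Because array and dye effects are estimated inside the model, some normalization steps become unnecessary.
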
Test Comparison
GeneSpring | MAANOVA | SMA
- One-sample t-test: Yes | No | Indirect
- Two-sample t-test (equal variance): Yes
- Two-sample t-test (unequal variance): Yes | No | Yes
- Nonparametric test: Yes | No
- Global error model: Yes | No
- One-way ANOVA: Yes | No
- Multiple group test (multi-way ANOVA): No | Yes | No

Test Comparison (continued)
GeneSpring | MAANOVA | SMA
- Residual plot: No | Yes | No
- Normalization: Need | Not need | Need
- Random effect: No | Yes | No
- Permutation: No | Yes | No
- Multiple test adjustment: Bonferroni, FDR, Step-Down, Westfall and Young permutation | FDR | Use package: multtest
- Average within-slide replicates: Automatic | Optional | No

Multiple Test Adjustment
- Bonferroni: adjusted p-value = p-value × N, where N is the number of tests.
- Step-down (Holm): controls the family-wise Type I error rate (FWER).
- Westfall and Young permutation: controls the FWER using permutation; more time consuming.
- FDR: controls the false discovery rate (FDR), the proportion of genes expected to be "significant" by chance relative to the number of identified genes.

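Three of these adjustments can be sketched directly from a list of raw p-values. The slide does not name a specific FDR procedure, so the Benjamini-Hochberg step-up method is assumed here. The Westfall and Young method is omitted because it needs the permuted data, not just the p-values. Function names are illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def holm(p_values):
    """Holm step-down adjusted p-values (controls the FWER)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # smallest p first
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (n - rank) * p_values[i]))
        adjusted[i] = running_max  # enforce monotone adjusted values
    return adjusted

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjusted p-values (controls the FDR)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)  # largest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for k, i in enumerate(order):
        rank = n - k  # rank of p_values[i] in ascending order
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
adj_bonferroni = bonferroni(raw)
adj_holm = holm(raw)
adj_bh = benjamini_hochberg(raw)
```

On the same input, the Holm values are never larger than the Bonferroni ones, and the Benjamini-Hochberg values are never larger than the Holm ones. This matches the ordering from most to least conservative.
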
References
- Y. H. Yang, S. Dudoit et al. (2002). Normalization for cDNA Microarray Data. Nucleic Acids Research, Vol. 30, No. 4, e15.
- Cui, Kerr and Churchill (2002). Data Transformation for cDNA Microarray Data. Submitted.

R
- How to get help? help(t.test), ?t.test, or Help -> R language (html)
- Why R? Free, convenient, flexible

Data and Software
- Cattle experiment
- Two tissues: liver and spleen
- Dye swap
- Two replicates within slide
- Software: MAANOVA (R), SMA (R), GeneSpring

MAANOVA--Data Format
- Combine the intensities of all arrays.
- One example of an intensity file:
  metarow | metacol | row | col | ID | R1 | G1 | flag1 | R2 | …
  1299AW …
  3399AW …
  …
- Column groups: grid info (metarow, metacol, row, col), intensities of array 1 (R1, G1), flag for array 1 (flag1), intensities of array 2 (R2, …).

MAANOVA--Data Format (continued)
- Design (parameter) file
- For example: SampleID | Condition

MAANOVA--Output
- Residual plot
- F-values, p-values, permutation p-values, adjusted p-values
- IDs of differentially expressed genes
- Volcano plot

SMA--Data Format
- Import the intensities of each array separately.
- One example of an intensity file:
  ID | F532 | B532 | F635 | B635
  H3001A …
  H3001A …
  H3001A …
  …

SMA--Output
- t-values, p-values*, adjusted p-values
- IDs of differentially expressed genes
* SMA does not provide p-values directly.