Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments
Presented by Nan Lin
13 October 2002
Introduction to cDNA Microarray Experiments
  Single-slide Design
    – Two mRNA samples (red/green) on the same slide
  Multiple-slide Design
    – Two or more types of mRNA on different slides
    – Excluded here: time-course experiments
Examples of Multiple-slide Design
  Apo AI
    – Treatment group: 8 mice with the apo AI gene knocked out
    – Control group: 8 C57Bl/6 mice
    – Cy5: cDNA from each of the 16 mice
    – Cy3: cDNA pooled from the 8 control mice
  SR-BI
    – Treatment group: 8 SR-BI transgenic mice
    – Control group: 8 “normal” FVB mice
  Microarray Setup
    – 6384 spots: 4×4 grids with 19×21 spots in each
Single-slide Methods
  Two types
    – Based solely on the intensity ratio R/G
    – Take into account overall transcript abundance, measured by R*G
  Historical Review
    – Fold increase/decrease cut-offs ( )
    – Probabilistic modeling based on distributional assumptions ( )
    – Consider R*G ( ), e.g. Gamma-Gamma-Bernoulli
Summary of Single-slide Methods
  Each method produces a model-dependent rule, which amounts to drawing two curves in the (R,G) plane
    – Power (1 - Type II error rate)
    – False positive rate (Type I error rate)
  Multiple testing
  Replication is needed because gene expression data are too noisy
Image Analysis
  “Raw” data: 16-bit TIFF files
  Addressing
    – Within a batch, important characteristics are similar
  Segmentation
    – Seeded region growing algorithm
  Background adjustment
    – Morphological opening (a nonlinear filter)
  Software package: Spot, in the R environment
Single-slide Data Display
  Plot $\log_2 R$ vs. $\log_2 G$
    – variation less dependent on absolute magnitude
    – normalization is additive for logged intensities
    – evens out highly skewed distributions
    – a more realistic sense of variation
  Plot $M = \log_2(R/G)$ vs. $A = \frac{1}{2}\log_2(RG)$
    – More revealing in terms of identifying spot artifacts and for normalization purposes
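The M and A coordinates defined on this slide can be computed per spot as in the minimal sketch below. It assumes background-corrected, strictly positive intensities; the function name `ma_values` and the three example spots are hypothetical and not part of the original analysis.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    if red <= 0 or green <= 0:
        # Logs are undefined for non-positive intensities.
        raise ValueError("intensities must be positive")
    m = math.log2(red / green)        # M: log-ratio of the two channels
    a = 0.5 * math.log2(red * green)  # A: average log-intensity of the spot
    return m, a

# Hypothetical (R, G) pairs, for illustration only.
spots = [(1024.0, 256.0), (500.0, 500.0), (64.0, 256.0)]
ma = [ma_values(r, g) for r, g in spots]
```

A spot with equal red and green intensity lands at M = 0 regardless of how bright it is, which is why the M-A view separates the ratio from overall abundance.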
Normalization
  Identify and remove sources of systematic variation other than differential expression
    – Different labeling efficiencies and scanning properties for Cy3 and Cy5
    – Different scanning parameters
    – Print-tip, spatial or plate effects
  Red intensity is often lower than green intensity
  The imbalance between R and G varies
    – across spots and between arrays
    – with overall spot intensity A
    – with location on the array, plate origin, etc.
An Example: Self-Self Experiment
Normalization (Cont.)
  Global normalization
    – Subtract the mean or median from all intensity log-ratios
  More complex normalization
    – Robust locally weighted regression: M modeled as a function of spot intensity A, location and plate origin
    – Use print-tip group to represent the spot locations
    – $\log_2(R/G) \rightarrow \log_2(R/G) - l(A, j)$
    – $l(A, j)$: lowess fit in R for print-tip group $j$ (smoother span $0.2 < f < 0.4$)
  Control sequences
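The global normalization step can be sketched as follows. This covers only the constant (median) shift; the intensity- and print-tip-dependent version replaces the constant with the lowess fit l(A, j) and is not implemented here. The function name `normalize_global` and the example log-ratios are hypothetical.

```python
from statistics import median

def normalize_global(log_ratios):
    """Center the log-ratios of one array by subtracting their median."""
    center = median(log_ratios)
    return [m - center for m in log_ratios]

# Hypothetical log-ratios from one array: most spots sit near -0.5,
# as happens when the red channel is systematically dimmer than green.
m_raw = [-0.6, -0.4, -0.5, 1.5, -0.5]
m_norm = normalize_global(m_raw)
```

The median is used rather than the mean so that a few strongly differentially expressed spots, such as the fourth value here, do not pull the center.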
Apo AI: Normalization
Graphical Display for Test Statistics (I)
  Test statistics
    – $H_j$: no association between treatment and the expression level of gene $j$, $j = 1, \ldots, m$
    – Two-sided alternative
    – Two-sample Welch t-statistics
    – Replication is essential to assess the variability in the treatment and control groups
    – The joint distribution is estimated by a permutation procedure because the actual distribution is not a t-distribution
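For a single gene, the Welch statistic and a permutation p-value can be sketched as below. This is an illustration under stated assumptions, not the presented procedure: it draws random relabelings instead of enumerating all of them, handles one gene rather than the joint distribution over all genes, and uses hypothetical names (`welch_t`, `permutation_p_value`) and made-up log-ratios.

```python
import math
import random
from statistics import mean, variance

def welch_t(x, y):
    """Two-sample Welch t-statistic (variances not assumed equal)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value from random relabelings of the pooled samples.

    Assumes the values are not all identical, so the standard error
    is never zero for any relabeling.
    """
    rng = random.Random(seed)
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign treatment/control labels at random
        t = welch_t(pooled[:len(x)], pooled[len(x):])
        if abs(t) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical normalized log-ratios for one gene, 8 mice per group.
treatment = [-3.1, -2.9, -3.3, -2.8, -3.0, -3.2, -2.7, -3.4]
control = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1]
t_obs = welch_t(treatment, control)
p_perm = permutation_p_value(treatment, control)
```

With 8 slides per group there are only 12870 distinct relabelings, so full enumeration is feasible in practice; random sampling is used here only to keep the sketch short.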
Graphical Display for Test Statistics (II) Quantile-Quantile plots
Graphical Display for Test Statistics (III) Plots vs. absolute expression levels
Multiple Hypothesis Testing: Adjusted p-values (I)
  P-value: $P_j = \Pr(|T_j| \ge |t_j| \mid H_j)$, $j = 1, \ldots, m$
  Family-wise Type I Error Rate (FWER)
    – The probability of at least one Type I error in the family
  Strong Control of the FWER
    – Control the FWER for any combination of true and false hypotheses
  Weak Control of the FWER
    – Control the FWER only under the complete null hypothesis that all hypotheses in the family are true
Multiple Hypothesis Testing: Adjusted p-values (II)
  Adjusted p-value for $H_j$
    – $\tilde{P}_j = \inf\{\alpha : H_j \text{ is rejected at FWER} = \alpha\}$
    – $H_j$ is rejected at FWER $\alpha$ if $\tilde{P}_j \le \alpha$
  P-value adjustment approaches
    – Bonferroni
    – Sidak single-step
    – Holm step-down
    – Westfall and Young step-down minP
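The first three adjustments listed above can be sketched from their standard textbook definitions, as below. The Westfall and Young step-down minP procedure needs the permutation distribution of the p-values and is not implemented here. Function names and the raw p-values are hypothetical.

```python
def bonferroni(p):
    """Single-step Bonferroni: multiply each p-value by m, cap at 1."""
    m = len(p)
    return [min(1.0, m * pj) for pj in p]

def sidak(p):
    """Single-step Sidak: 1 - (1 - p_j)^m."""
    m = len(p)
    return [1.0 - (1.0 - pj) ** m for pj in p]

def holm(p):
    """Holm step-down: scale the k-th smallest p-value by (m - k + 1),
    then enforce monotonicity with a running maximum."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, j in enumerate(order):
        candidate = min(1.0, (m - rank) * p[j])
        running_max = max(running_max, candidate)
        adjusted[j] = running_max
    return adjusted

# Hypothetical raw p-values for m = 4 genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj_bonf = bonferroni(raw)
adj_sidak = sidak(raw)
adj_holm = holm(raw)
```

Rejecting every hypothesis whose adjusted p-value is at most the chosen level then controls the FWER at that level; Holm never gives a larger adjusted p-value than Bonferroni, so it rejects at least as many hypotheses.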
Multiple Hypothesis Testing: Estimation of adjusted p-values (I)
Multiple Hypothesis Testing: Estimation of adjusted p-values (II)
Apo AI: Adjusted p-values (I)
Apo AI: Adjusted p-values (II)
Apo AI: Comparison with Single-slide Methods
Discussion
  M-A plots
  Normalization
    – Robust local regression, e.g. lowess
  Q-Q plots & Plots vs. absolute expression level
  False discovery rate (FDR)
  Replication is necessary
  Design issues
  Factorial experiments
  Joint behavior of genes
  R package SMA