Normalization of Microarray Data - how to do it! Henrik Bengtsson Terry Speed
Outline
- The X Data Set
- (R,G) → (M,A) Transformation
- Background correction or not?
- Within slide normalization
- Across slide normalization
- Identifying differentially expressed genes
- The X2 Data Set
The X Data Set
All slides are replicates and contain 5184 spots/genes. Three identical RNA preparations were done; (a) was hybridized to slides 1-3, (b) to slides 4-6, and (c) to slides 7-9. All data were collected with the GenePix™ scanner and software. The following analysis was done using [R] and the sma library from Terry Speed's group.
Slide  Title                          Name
1      Mutant (a) vs. Reference (a)   dUDG558
2      Mutant (a) vs. Reference (a)   dUDG409
3      Mutant (a) vs. Reference (a)   dUDG405
4      Mutant (b) vs. Reference (b)   dUDG411
5      Mutant (b) vs. Reference (b)   dUDG412
6      Mutant (b) vs. Reference (b)   dUDG414
7      Mutant (c) vs. Reference (c)   dUDG413
8      Mutant (c) vs. Reference (c)   dUDG415
9      Mutant (c) vs. Reference (c)   dUDG813
(R,G) → (M,A) Transformation
"Observed" data {(R,G)}, one pair per spot: R = red channel signal, G = green channel signal (background corrected or not).
Transformed data {(M,A)}, one pair per spot: $M = \log_2(R/G)$ (log ratio), $A = \log_2\sqrt{R \cdot G} = \tfrac{1}{2}\log_2(R \cdot G)$ (log intensity).
Inverse: $R = (2^{2A+M})^{1/2}$, $G = (2^{2A-M})^{1/2}$.
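The transformation and its inverse can be sketched in a few lines of Python. The function names `rg_to_ma` and `ma_to_rg` are hypothetical and chosen only for illustration; the formulas are the ones stated on the slide.

```python
import math

def rg_to_ma(r, g):
    """Map channel signals (R, G) to log ratio M and log intensity A."""
    m = math.log2(r / g)
    a = 0.5 * math.log2(r * g)
    return m, a

def ma_to_rg(m, a):
    """Invert the transformation: recover (R, G) from (M, A)."""
    r = math.sqrt(2.0 ** (2 * a + m))
    g = math.sqrt(2.0 ** (2 * a - m))
    return r, g

# Round trip for one made-up spot (values are illustrative, not from the data set).
m, a = rg_to_ma(1024.0, 256.0)
r, g = ma_to_rg(m, a)
print(m, a, r, g)
```

A spot with equal signal in both channels gives M = 0, so M measures differential expression while A measures overall brightness.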
Background correction or not? Decision 1: No background correction
Within Slide Normalization
Question: What kind of normalization should be applied?
1. No normalization, or
2. Global (lowess) normalization, or
3. Print-tip normalization, or
4. Scaled print-tip normalization
No Normalization
Non-normalized data {(M,A)}: $M = \log_2(R/G)$
Global (lowess) Normalization
Globally normalized data {(M,A)}: $M_{\text{norm}} = M - c(A)$, where $c(A)$ is an intensity-dependent function, here a lowess fit of M against A.
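As a rough illustration of subtracting an intensity-dependent curve c(A), the sketch below uses a simplified lowess-style fit: local linear regression with tricube weights and no robustness iterations. It is not the sma library's implementation; the function names, the `frac` parameter, and the synthetic data are assumptions made for this example.

```python
import math

def local_linear_fit(a_vals, m_vals, frac=0.3):
    """Estimate c(A) at each spot by tricube-weighted local linear regression.

    Simplified lowess: no robustness iterations, O(n^2 log n) time.
    """
    n = len(a_vals)
    k = max(2, math.ceil(frac * n))  # number of neighbours in each local window
    fitted = []
    for a0 in a_vals:
        # Bandwidth: distance to the k-th nearest spot along the A axis.
        h = sorted(abs(a - a0) for a in a_vals)[k - 1] or 1e-12
        sw = swx = swy = swxx = swxy = 0.0
        for a, m in zip(a_vals, m_vals):
            u = abs(a - a0) / h
            if u >= 1.0:
                continue  # outside the local window
            w = (1.0 - u ** 3) ** 3  # tricube weight
            x = a - a0               # centre the regressor at a0
            sw += w
            swx += w * x
            swy += w * m
            swxx += w * x * x
            swxy += w * x * m
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # fall back to a weighted mean
        else:
            # Intercept of the weighted fit, i.e. the fitted value at a0.
            fitted.append((swy * swxx - swx * swxy) / denom)
    return fitted

def global_normalize(a_vals, m_vals, frac=0.3):
    """Return M_norm = M - c(A) for all spots on one slide."""
    curve = local_linear_fit(a_vals, m_vals, frac)
    return [m - c for m, c in zip(m_vals, curve)]

# Synthetic slide with a purely intensity-dependent (linear) bias in M.
a_vals = [6.0 + 0.1 * i for i in range(50)]
m_vals = [0.1 * a - 0.5 for a in a_vals]
m_norm = global_normalize(a_vals, m_vals)
print(max(abs(v) for v in m_norm))  # bias is removed, residuals are near zero
```

Print-tip normalization would apply the same fit separately to the spots of each print tip instead of to the whole slide.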
Print-tip Normalization
Print-tip normalized data {(M,A)}: $M_{p,\text{norm}} = M_p - c_p(A)$, $p$ = print tip (1-16), where $c_p(A)$ is an intensity-dependent function for print tip $p$.
[Figure: print-tip layout]
Scaled Print-tip Normalization
Scaled print-tip normalized data {(M,A)}: $M_{p,\text{norm}} = s_p \cdot (M_p - c_p(A))$, $p$ = print tip (1-16), where $s_p$ is a scale factor for print tip $p$ (based on the Median Absolute Deviation).
[Figure panels: after print-tip normalization; after scaled print-tip normalization]
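The slide does not spell out how $s_p$ is computed from the MAD. One plausible choice, assumed here, is $s_p = \text{geometric mean of all MADs} / \text{MAD}_p$, so that every print-tip group ends up with the same spread. The sketch below starts from values that are already centered per print tip (that is, $M_p - c_p(A)$); all names and numbers are illustrative.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def scale_factors(groups):
    """Assumed definition: s_p = geometric mean of all MADs / MAD of group p.

    `groups` maps a print-tip id to its centred M values.
    Every group must have a non-zero MAD.
    """
    mads = {p: mad(vals) for p, vals in groups.items()}
    geo_mean = math.exp(sum(math.log(x) for x in mads.values()) / len(mads))
    return {p: geo_mean / mads[p] for p in mads}

def scale_groups(groups):
    """Apply s_p to every centred M value of print tip p."""
    s = scale_factors(groups)
    return {p: [s[p] * m for m in vals] for p, vals in groups.items()}

# Two made-up print-tip groups with different spreads.
groups = {
    1: [-0.2, -0.1, 0.0, 0.1, 0.2],
    2: [-0.8, -0.4, 0.0, 0.4, 0.8],
}
factors = scale_factors(groups)
scaled = scale_groups(groups)
print(factors)
print({p: mad(vals) for p, vals in scaled.items()})
```

The same scaling idea can be reused across slides, with one group per slide instead of one group per print tip.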
Spatial Effects
[Figure panels: no normalization; global normalization; print-tip normalization; scaled print-tip normalization]
Another Quick Example Scaled print-tip normalization:
Within Slide Normalization Summary
Question: What kind of normalization should be applied?
1. No normalization, or
2. Global (lowess) normalization, or
3. Print-tip normalization, or
4. Scaled print-tip normalization
Decision 2: Scaled print-tip normalization.
Across Slides Normalization
1. Scaled print-tip normalization
2. Median Absolute Deviation (MAD) scaling
3. Averaging
Average Over All Slides The “average” slide:
Cutoff by M values Top 5% of the absolute M values (|M| > 0.56):
Cutoff by T values Top 5% of the absolute T values (|T|>8.6) s.t. SE(M) > 0.03:
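The slides do not define T and SE(M) explicitly. A common reading, assumed in this sketch, is that for each gene SE(M) is the standard error of its normalized M values across the replicate slides and T is the average M divided by that standard error. The function name `gene_stats` and the example values are hypothetical.

```python
import math
import statistics

def gene_stats(m_values):
    """Return (M_avg, SE, T) for one gene measured on several replicate slides.

    Assumed definitions: SE = sample standard deviation / sqrt(n), T = M_avg / SE.
    """
    n = len(m_values)
    m_avg = statistics.mean(m_values)
    se = statistics.stdev(m_values) / math.sqrt(n)
    t = m_avg / se if se > 0 else math.nan  # T is undefined when SE is zero
    return m_avg, se, t

# One made-up gene with consistent M values on three slides.
m_avg, se, t = gene_stats([0.9, 1.0, 1.1])
print(m_avg, se, t)
```

Because T divides by SE, genes with a very small SE can get a large |T| even when M is close to zero, which is one reason to require a minimum SE.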
SE Cutoff Level
In this data set, the number of genes found is insensitive to the SE cutoff level. About 1000 of the genes with the smallest SE can be cut off before the final results are affected.
103 Differentially Expressed Genes Top 5% of the absolute T values (|T|>8.6) s.t. SE(M) > 0.03, and top 5% of the absolute M values (|M|>0.56):
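The combined selection rule can be sketched as follows. Two details are assumptions of this example, since the slide leaves them open: the |T| cutoff is computed among the genes that pass the SE filter, and a gene is kept when its value is at or above the cutoff. The function names and the toy data are hypothetical.

```python
def top_fraction_cutoff(values, frac):
    """Smallest absolute value that still belongs to the top `frac` of |values|."""
    ranked = sorted((abs(v) for v in values), reverse=True)
    k = max(1, int(len(ranked) * frac))
    return ranked[k - 1]

def select_genes(m_avg, t, se, frac=0.05, min_se=0.03):
    """Indices of genes in the top `frac` by |T| (with SE > min_se) and by |M|."""
    m_cut = top_fraction_cutoff(m_avg, frac)
    # Assumption: the |T| cutoff is taken among genes that pass the SE filter.
    t_cut = top_fraction_cutoff([tv for tv, s in zip(t, se) if s > min_se], frac)
    return [
        i for i, (m, tv, s) in enumerate(zip(m_avg, t, se))
        if s > min_se and abs(tv) >= t_cut and abs(m) >= m_cut
    ]

# Toy data: 20 genes, top 10% instead of 5% so that the cutoffs are non-trivial.
m_avg = [0.1 * i for i in range(20)]
t = [float(i) for i in range(20)]
se = [0.05] * 19 + [0.01]  # the last gene fails the SE filter
selected = select_genes(m_avg, t, se, frac=0.1)
print(selected)
```

In the toy data the gene with the largest |M| and |T| is dropped because its SE is below the minimum, which mirrors the role of the SE(M) > 0.03 condition.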
Location of Differentially Expressed Genes
Locations on the microarray, which is printed as a 4x4 grid.
25 Differentially Expressed Genes
Top 2% of the absolute T values (|T|>11) s.t. SE(M) > 0.03 and top 2% of the absolute M values (|M|>0.9):
[Table columns: Gene, M avg, A avg, T, SE]
The X2 Data Set
All slides are replicates and contain 5184 spots/genes. Three identical RNA preparations were done; (a) was hybridized to slides 1 & 2, (b) to slides 3 & 4, and (c) to slides 5 & 6.
Slide  Title                          Name
1      Mutant (a) vs. Reference (a)   dUDG816
2      Mutant (a) vs. Reference (a)   dUDG817
3      Mutant (b) vs. Reference (b)   dUDG818
4      Mutant (b) vs. Reference (b)   dUDG820
5      Mutant (c) vs. Reference (c)   dUDG821
6      Mutant (c) vs. Reference (c)   dUDG822
93 Differentially Expressed Genes
Top 5% of the absolute T values (|T|>5.6) s.t. SE(M) > 0.03, and top 5% of the absolute M values (|M|>0.38):
25 Differentially Expressed Genes
Top 2% of the absolute T values (|T|>7.1) s.t. SE(M) > 0.03 and top 2% of the absolute M values (|M|>0.53):
[Table columns: Gene, M avg, A avg, T, SE]
Acknowledgement
Thanks to: Jean Yee Hwa Yang
[R] Software (free)
The Statistical Microarray Analysis (sma) library (free)