A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray Data: Part I. Sources of Bias and Normalisation
MICROARRAY ANALYSIS: My (Educated?) View
1. Data included in GEXEX
   a. Whole data stored and “securely” available
   b. GP3xCLI on each hybridisation
2. Relaxed data acquisition criteria
   a. Signal to Noise > 1.00 (more relaxed criteria exist)
   b. Mean to Median > 0.85 (Tran et al. 2002)
3. Data Normalisation
4. Mixed-Model Equations
   a. Check Residuals (plot Residuals vs Predicted)
   b. Check REML estimates of Variance Components
   c. Proportion of Total Variance due to Gene x Variety
5. Process Gene x Treatment BLUPs -> Differentially Expressed Genes
   a. t-statistics -> Z-score -> P-value
   b. Mixtures of Distributions -> Posterior Probabilities
6. Process Differentially Expressed genes
   a. Hierarchical clustering
   b. Gene ontology analysis
MICROARRAY ANALYSIS: BASIC PIECES FOR SIGNAL DETECTION
- Foreground, RED and GREEN: R_f, G_f
- Background, RED and GREEN: R_b, G_b
- Background-corrected: RED R = R_f - R_b; GREEN G = G_f - G_b
- Log-transformed: Log2(R), Log2(G)
- Difference (“Minus”): M = Log2(R) - Log2(G) = Log2(R/G)
- Mean (“Average”): A = 0.5 * (Log2(R) + Log2(G)) = 0.5 * Log2(R*G)
- MA-Plots …to come
True Signals!
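As a minimal sketch of the definitions on this slide, the function below computes M and A for one spot from its foreground and background intensities. The function name `ma_values` and the example intensities are hypothetical, and returning `None` when a background-corrected signal is not positive is only one possible convention.

```python
import math


def ma_values(r_fg, r_bg, g_fg, g_bg):
    """Return (M, A) for one spot, or None if R or G is not positive."""
    r = r_fg - r_bg  # background-corrected RED:   R = R_f - R_b
    g = g_fg - g_bg  # background-corrected GREEN: G = G_f - G_b
    if r <= 0 or g <= 0:
        return None  # log2 is undefined here (assumed convention)
    log_r = math.log2(r)
    log_g = math.log2(g)
    m = log_r - log_g          # "Minus":   M = Log2(R/G)
    a = 0.5 * (log_r + log_g)  # "Average": A = 0.5 * Log2(R*G)
    return m, a


# Hypothetical spot: R = 1124 - 100 = 1024, G = 356 - 100 = 256
print(ma_values(1124.0, 100.0, 356.0, 100.0))  # (2.0, 9.0)
```

Here M = 2 means RED is four times GREEN, and A = 9 places the spot on the intensity axis of the MA-plot.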
Data Acquisition Criteria: The Red/Green intensities can be spatially biased
Data Acquisition Criteria: The Red/Green intensities can be intensity-biased. MA-Plot: values should scatter around zero
Data Acquisition Criteria: Background Correction: Why bother?
Data Acquisition Criteria: Background Correction: Why bother? (continued)
Data Acquisition Criteria: Log-transformation: Why bother? RED versus GREEN
Data Acquisition Criteria: MA-Plots: All versus only valid signals
Data Acquisition Criteria: Signal to Noise Ratio; Mean to Median Correlation
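The two acquisition criteria can be sketched as a per-spot filter. The thresholds (Signal to Noise > 1.00, Mean to Median > 0.85) come from the outline slide, but the slides do not define the two quantities, so the definitions below are assumptions made only for illustration: signal to noise is taken as foreground mean over background mean, and mean to median as the smaller of the two foreground summaries divided by the larger. The names `Channel`, `channel_passes` and `spot_is_valid` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    fg_mean: float    # mean foreground pixel intensity
    fg_median: float  # median foreground pixel intensity
    bg_mean: float    # mean background pixel intensity


def channel_passes(ch, snr_min=1.00, mm_min=0.85):
    """Apply both criteria to one channel (assumed definitions, see lead-in)."""
    if ch.fg_mean <= 0 or ch.fg_median <= 0 or ch.bg_mean <= 0:
        return False
    snr = ch.fg_mean / ch.bg_mean  # ASSUMPTION: signal to noise
    mean_to_median = (min(ch.fg_mean, ch.fg_median)
                      / max(ch.fg_mean, ch.fg_median))  # ASSUMPTION
    return snr > snr_min and mean_to_median > mm_min


def spot_is_valid(red, green):
    """A spot is kept only if both channels pass."""
    return channel_passes(red) and channel_passes(green)


good = Channel(fg_mean=1000.0, fg_median=950.0, bg_mean=100.0)
skewed = Channel(fg_mean=1000.0, fg_median=500.0, bg_mean=100.0)
print(spot_is_valid(good, good), spot_is_valid(good, skewed))
```

A spot whose mean and median disagree strongly is usually dominated by a few bright pixels, which is the kind of signal this filter is meant to remove.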
Data Normalisation
Normalisation is an attempt to correct for systematic bias. It allows you to compare data from one array to another.
Systematic bias can be introduced into microarray experiments at all stages. We need to:
- Avoid it (as much as possible)
- Recognize it
- Correct for it
- Discard unrecoverable data
In practice we do not always understand the data, so inevitably some biology will be removed too (or at least not revealed).
Data Normalisation
Samples: Tumor; Pool of Cell Lines
- Differential labeling efficiency of dyes
- Different amounts of starting material
- Different amounts of RNA in each channel
- Differential efficiency of hybridization over slide surface
- Differential efficiency of scanning in each channel
Source: Catherine Ball (Stanford)
Systematic Bias
Sources…
- Different labeling efficiencies or dye effects
- Scanner malfunction
- Differences in concentration of DNA on arrays (plate effects)
- Printing or tip problems
- Uneven hybridization
- Batch bias
- Experimenter issues
…and Dealing with it
- Detect and recognize the effect: note something odd
- Determine magnitude and effect on data: try a few methods
- Identify source of bias: think big!
- Eliminate or reduce contributing factors
- Correct data
- Discard uncorrectable data
Systematic Bias: Labeling Efficiencies Cause Bias
- One channel of a two-channel array has higher intensity than the other (usually GREEN).
- Most common source of recognizable bias.
- Solution: the easiest bias to address (e.g. dye-swaps, balanced loops).
Systematic Bias: Scanning (operator?) Bias
- Mis-aligned lasers can cause big problems
- In this case, the two channels are slightly out of register
- Solution: fix the scanner and repeat

Systematic Bias: Printing (operator?) Bias
- Irregularly shaped spots are often observed (printing error)
- Slides from the same printing batch cluster together
- Solution: probably limited to better printing technique and image analysis, rather than normalization

Systematic Bias: Probe Bias
- Different concentrations of probes might produce patterns in arrays
- The biological role of probes can produce patterns in arrays
- These patterns can create a spatial bias that is not artificial, but biological

Systematic Bias: Probe Bias
- Probes arranged on the array based on biological function cause spatial bias
- Figure labels: Coding regions; Intergenic regions
- Solution: avoid arranging reporters based on function, know your experimental design

Systematic Bias: Hybridisation (operator?) Bias
- Poor technique during hybridisation can cause a spatial bias
- Operator is one of the largest sources of systematic bias
- Experiments done by the same operator often cluster together more tightly than warranted by the biology
- Solution: consistent methods, successful techniques
Data Normalisation …and other beautifying techniques

Technique | Choices | Aim (Real) | Aim (Ideal)
Transformation “To Near Normality” | Log2; Lin-Log | Numerically tractable | Gaussian
Normalisation “Location” | Location Parameter: 1. Mean 2. Median 3. Regression(s) (LOWESS) | Account for systematic effects | Gaussian
Standardisation “Scale” | Scale Parameter | Stabilise variance | Gaussian
Data Normalisation: Transformation …to near normality
Solution: Explore the entire Box-Cox family of power transformations:
y(λ) = (x^λ - 1) / λ for λ ≠ 0, and y(λ) = log(x) for λ = 0
Maximum at λ ≈ 0, hence use the log-transformation
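One way to explore the Box-Cox family is to evaluate the profile log-likelihood on a grid of λ values and pick the maximiser. The sketch below does this with the standard library only; the simulated intensities are hypothetical (built so that their logarithms are close to normal), and the grid from -2 to 2 in steps of 0.1 is an arbitrary choice.

```python
import math
from statistics import NormalDist


def box_cox(x, lam):
    """Box-Cox power transformation of one positive value."""
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def profile_loglik(xs, lam):
    """Profile log-likelihood of lambda under a normal model (up to a constant)."""
    n = len(xs)
    ys = [box_cox(x, lam) for x in xs]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n  # maximum-likelihood variance
    jacobian = (lam - 1.0) * sum(math.log(x) for x in xs)
    return -0.5 * n * math.log(var) + jacobian


def best_lambda(xs, grid):
    """Grid value of lambda with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(xs, lam))


# Hypothetical intensities whose log2 values follow normal quantiles
n = 200
normal = NormalDist()
xs = [2.0 ** (10.0 + 2.0 * normal.inv_cdf((i + 0.5) / n)) for i in range(n)]

grid = [k / 10 for k in range(-20, 21)]
lam_hat = best_lambda(xs, grid)
print(lam_hat)
```

For data of this shape the maximiser sits at or very near zero, which is the argument the slide makes for the log-transformation; real array data would need to be checked slide by slide.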
Data Normalisation: Transformation …to near normality. Raw Data: …exponential-like. Log2 Transformed: …normal-like
Data Normalisation: Transformation …to near normality. Lin-Log Transformation, where x = background-corrected signal = Fg - Bg
Data Normalisation: Transformation …to near normality
The Edwards transformation and the Lin-Log transformation are both attempts to use the entire data, not only those spots for which foreground is greater than background. The reasoning is that errors are additive for small signals and multiplicative for large signals. The search for, and choice of, the transformation parameter could be rather unconvincing (e.g. different for different array slides).
Solution:
- Use Log2 if Foreground > Background
- Otherwise, use a small arbitrary value (say 0), or simply disregard
- Alternatively: use only Foreground and Log2 it
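The rule on this slide can be sketched as a single function with three policies. The function name `log2_signal` and the policy labels are hypothetical, and it is an assumption here that the "small arbitrary value (say 0)" replaces the log-transformed signal rather than the raw one.

```python
import math


def log2_signal(fg, bg, policy="floor", floor_value=0.0):
    """Log2 of one channel's signal under one of three policies.

    "floor":           log2(fg - bg) if fg > bg, else floor_value
    "drop":            log2(fg - bg) if fg > bg, else None (disregard the spot)
    "foreground_only": log2(fg), ignoring the background altogether
    """
    if policy == "foreground_only":
        return math.log2(fg)
    if fg > bg:
        return math.log2(fg - bg)
    if policy == "floor":
        return floor_value  # ASSUMPTION: floor applies on the log scale
    if policy == "drop":
        return None
    raise ValueError(f"unknown policy: {policy}")


print(log2_signal(1124.0, 100.0))                  # 10.0
print(log2_signal(80.0, 100.0))                    # 0.0
print(log2_signal(80.0, 100.0, policy="drop"))     # None
print(log2_signal(64.0, 100.0, "foreground_only")) # 6.0
```

Whichever policy is chosen, it should be applied identically to both channels and to every slide, otherwise the choice itself becomes a source of systematic bias.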
Location Normalisation
Log2(R/G) - c = M - c, where c is the Location Parameter
GLOBAL:
- Mean: c = Mean of M’s
- Median: c = Median of M’s
  Assumption: Changes roughly symmetric around Mean or Median
- LOWESS: c = Weighted Regression of M on A
  Assumption: Changes roughly symmetric at all intensities
LOCAL:
- LOWESS: c = c(i) = Weighted Regression of M on A within print-tip-group i
LOWESS = Locally WEighted Regression and Smoothing Scatterplots
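The three choices of c can be sketched as follows. This is a simplified illustration rather than a full LOWESS: it fits a local straight line of M on A with tricube weights and omits the robustness iterations of the original procedure. All function names, the `span` default of 0.4 and the toy data are hypothetical.

```python
import math
from statistics import median


def global_median_normalise(m_values):
    """GLOBAL: subtract c = median of all M's."""
    c = median(m_values)
    return [m - c for m in m_values]


def lowess_fit(a_values, m_values, span=0.4):
    """Local linear fit of M on A with tricube weights (no robustness steps)."""
    n = len(a_values)
    k = min(n, max(2, math.ceil(span * n)))  # neighbours used per local fit
    fitted = []
    for a0 in a_values:
        dists = [abs(a - a0) for a in a_values]
        h = sorted(dists)[k - 1]  # bandwidth: distance to k-th nearest spot
        sw = swx = swm = swxx = swxm = 0.0
        for a, m, d in zip(a_values, m_values, dists):
            if h > 0:
                w = (1.0 - (d / h) ** 3) ** 3 if d < h else 0.0
            else:
                w = 1.0 if d == 0 else 0.0
            x = a - a0  # centre on the target so the intercept is the fit
            sw += w
            swx += w * x
            swm += w * m
            swxx += w * x * x
            swxm += w * x * m
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swm / sw)  # degenerate case: weighted mean
        else:
            fitted.append((swm * swxx - swx * swxm) / denom)
    return fitted


def lowess_normalise(a_values, m_values, span=0.4):
    """GLOBAL LOWESS: subtract the intensity-dependent c(A) from each M."""
    fit = lowess_fit(a_values, m_values, span)
    return [m - c for m, c in zip(m_values, fit)]


def print_tip_lowess_normalise(a_values, m_values, tip_ids, span=0.4):
    """LOCAL LOWESS: the same fit, run separately within each print-tip group."""
    out = [0.0] * len(m_values)
    for tip in set(tip_ids):
        idx = [i for i, t in enumerate(tip_ids) if t == tip]
        res = lowess_normalise([a_values[i] for i in idx],
                               [m_values[i] for i in idx], span)
        for i, r in zip(idx, res):
            out[i] = r
    return out


# Toy data: a purely intensity-dependent dye bias, M = 0.5 * A - 3
a_demo = [6.0 + 0.5 * i for i in range(20)]
m_demo = [0.5 * a - 3.0 for a in a_demo]
print(max(abs(v) for v in lowess_normalise(a_demo, m_demo)))
```

On the toy data the local fit absorbs the whole trend, so the normalised M's are numerically zero; a global median shift alone would leave the slope in place, which is the point of the intensity-dependent choices of c.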
Location Normalisation: LOWESS = Locally WEighted Regression and Smoothing Scatterplots. Source: G Rosa 2003.
Location Normalisation: LOWESS, SAS Code. Source: G Rosa. Genetic analysis of complex traits using SAS ISBN
Location Normalisation: LOWESS, Normalised Intensities. Source: G Rosa
Location Normalisation: LOWESS (continued). Source: G Rosa 2003.
Location Normalisation: None (before normalisation). Source: Yang et al 2002
Location Normalisation: After Global Median. Source: Yang et al 2002
Location Normalisation: Global Lowess. Source: Yang et al 2002
Location Normalisation: Print-tip-group Lowess. Source: Yang et al 2002
Location Normalisation: After Print-tip-group Lowess. Source: Yang et al 2002
Location Normalisation
Additional Assumption (other than symmetry of changes): The proportion of genes that are Differentially Expressed (DE) is minimal
Question: Which genes to use?
Answer: Only the ones (housekeeping) that we know are not DE
Comment: “Boutique” arrays become a nuisance
Scale Normalisation (Standardisation)
[Log2(R/G) - c(i)] / a(i)
Notes:
1. The scaling a(i) is such that Var(M) = a(i)^2 * σ^2
2. The estimation requires an approximation (“robust”) to the geometric mean: a(i) = MAD(i) / [MAD(1) * MAD(2) * … * MAD(I)]^(1/I), where MAD is the Median Absolute Deviation and I is the number of groups.
3. It doesn’t get any more heuristic (funnier?) than this: “Some scale adjustments may be required so that the relative expression levels from one particular experiment (slide) do not dominate the average relative expression levels across replicate experiments.” Yang et al 2002
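A minimal sketch of the scale step, assuming the M's in each group (print-tip group or slide) have already been location-normalised, and assuming the MAD-over-geometric-mean estimator of Yang et al 2002 for a(i). The function names and the two toy groups are hypothetical.

```python
import math
from statistics import median


def mad(values):
    """Median Absolute Deviation about the median."""
    med = median(values)
    return median([abs(v - med) for v in values])


def scale_factors(groups):
    """a(i) = MAD(i) divided by the geometric mean of all MADs."""
    mads = [mad(g) for g in groups]
    if any(x <= 0 for x in mads):
        raise ValueError("every group needs a positive MAD")
    geo_mean = math.exp(sum(math.log(x) for x in mads) / len(mads))
    return [x / geo_mean for x in mads]


def scale_normalise(groups):
    """Divide each location-normalised M by its group's a(i)."""
    factors = scale_factors(groups)
    return [[m / a for m in g] for g, a in zip(groups, factors)]


# Two toy groups with the same centre but a four-fold difference in spread
tight = [-1.0, 0.0, 1.0]  # MAD = 1
wide = [-4.0, 0.0, 4.0]   # MAD = 4
print(scale_factors([tight, wide]))
print(scale_normalise([tight, wide]))
```

Because the factors are ratios to a geometric mean, their product is 1, so the adjustment equalises spread across groups without changing the overall scale of the log-ratios.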
Data Normalisation …and other beautifying techniques
Notes:
1. Except Log2, everything else applies only to Ratios: M = Log2(R/G)
2. Except Log2, everything else applies only within slide
3. Everything is beautified to identify DE genes straight from the MA-plot, either from a single slide or from a function of M’s across slides
4. The uncertainty in measurements increases as intensity decreases
5. Measurements close to the detection limit are the most uncertain (cf. Sensitivity)
6. Fold-change measurements ignore these effects
7. We can calculate an intensity-dependent Z-score that measures the ratio relative to the standard deviation in the data
Data Normalisation …and other beautifying techniques
Figure panels (axes: Corrected Log10(Ratio) or Local Log10(Ratio) Z-Score versus Mean(Log10(Intensity))):
- Corrected ratios with 2-fold lines and the locally estimated standard deviation of positive and of negative ratios (Z = 1, Z = -1)
- Local Z-scores (Z = 5, Z = -5)
- Corrected ratios with 2-fold lines and contours at Z = ±1, ±2, ±5
Z > 2 is at the ~95% confidence level
Source: J Pevsner 2004
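An intensity-dependent Z-score can be sketched with a sliding window along the intensity axis. This is a simplification of what the figure shows: it uses a single local standard deviation rather than separate ones for positive and negative ratios, and it assumes the log-ratios are already location-normalised so that the spread is measured about zero. The function name, the window size and the toy data are hypothetical.

```python
import math


def local_z_scores(a_values, m_values, window=5):
    """Z = M / (local spread of M among spots of similar intensity A)."""
    n = len(a_values)
    order = sorted(range(n), key=lambda i: a_values[i])  # spots ranked by A
    half = window // 2
    z = [0.0] * n
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(n, rank + half + 1)
        neighbours = [m_values[order[j]] for j in range(lo, hi)]
        # ASSUMPTION: M is centred at zero, so use the root mean square
        sd = math.sqrt(sum(m * m for m in neighbours) / len(neighbours))
        z[i] = m_values[i] / sd if sd > 0 else 0.0
    return z


# Toy data: noisy low-intensity spots, tight high-intensity spots
a_toy = [6.0, 6.1, 6.2, 6.3, 6.4, 14.0, 14.1, 14.2, 14.3, 14.4]
m_toy = [1.0, -1.0, 0.5, -1.0, 1.0, 0.1, -0.1, 0.5, -0.1, 0.1]
z_toy = local_z_scores(a_toy, m_toy)
print(round(z_toy[2], 2), round(z_toy[7], 2))
```

Spots 2 and 7 have the same log-ratio (0.5), yet only the high-intensity one stands out from its neighbours, which is the effect that a fixed fold-change cut-off ignores.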
Normalisation: References
Bilban M, Buehler LK, Head S, Desoye G, Quaranta V. Normalizing DNA microarray data. Curr Issues Mol Biol Apr;4(2):57-64.
Durbin BP, Hardin JS, Hawkins DM, Rocke DM. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics Jul;18 Suppl 1:S
Kepler TB, Crosby L, Morgan KT. Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biol Jun 28;3(7):RESEARCH0037.
Schuchhardt, J., D. Beule, et al. Normalization Strategies for cDNA Microarrays. NAR (10): E47-e47.
Tran PH, Peiffer DA, Shin Y, Meek LM, Brody JP, Cho KW. Microarray optimizations: increasing spot accuracy and automated identification of true microarray signals. Nucleic Acids Res Jun 15;30(12):e54.
Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res Jun 15;29(12):
Tsodikov A, Szabo A, Jones D. Adjustments and measures of differential expression for microarray data. Bioinformatics Feb;18(2):
Yang MC, Ruan QG, Yang JJ, Eckenrode S, Wu S, McIndoe RA, She JX. A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays. Physiol Genomics Oct 10;7(1):45-53.
Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res Feb 15;30(4):e15.