Lecture 2 – Pre-processing and Normalization José Luis Mosquera Computational Lab on Microarrays Data Analysis Special Topics in Computer Science Institute of Bioinformatics – Johannes Kepler University June 2010
Outline 1. Microarray Raw Data 2. Image Analysis 3. Diagnostic Plots 4. Non-specific Filtering 5. Normalization
Microarray Data Analysis Pipeline Flowchart of a experiment with microarrays To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. Ronald Fisher
Microarray Raw Data ● Raw data differs considerably between ● cDNA ● Spotted arrays ● The spirit of the process is similar, but ● specific procedures or steps differ To keep in mind...
Microarray Raw Data (1) ● One.GPR file per chip containing ● One row per gene but many columns with – Intensitiy values for each channel (R, G) – Summary values for intensities – Quality controls, such as FLAG ● Intensity values are converted into a single expression matrix containing ● One column per chip with log(R/G) values ● One row per gene (same rows as.GRP files) ● Gene information stored in the.GAL file ● Both.GPR and.GAL are ASCI files ● An accurate description of these files is available herehere... from cDNA arrays (1)
● One.CEL file per chip, containing ● PM and MM values for each probe in the chip ● Presence/Absence calls (one per probeset) – They can be interpreted as a statistical test of the spot foreground intensity in the experimental sample respect to the background intensity distribution ● Separate PM/MM values are converted into a single expression matrix containing ● One column per chip with absolute intensity values ● One row per probeset ● Gene information stored in the.CDF file ●.CEL file is a binary file...from Affymetrix arrays (1) Microarray Raw Data (2)
Image Analysis (1) ● Co-registration and overlay offers a quick visualization, providing information on ● color balance, ● uniformity of hybridization, ● spot uniformity, ● background, and ● artifacts such as – dust or – scratches Bab array (high background, little DE) Good (low background, detectable DE) Red/Green overlay
Image Analysis (2) ● Image analysis is a crucial pre-processing step ● Association of a location and corresponding annotation with signal intensities ● Several non-trivial technical choices – Scanner, – image analysis software –... ● can affect the quality of the signal ● Background correction is sometimes not desirable (low B g arrays) ● Multiple pixel values on image reduced to a single gene-specific value for each channel Some hints
Diagnostic Plots ● Plots are useful to... ● Check microarray data quality ● Give hints on how to pre-process the data ● Verify how the pre-processing has worked Look at the data!
MA-plot (1) ● The data obtained from cDNA arrays come in the form of fluorescent Red (Cy5 or R) and Green (Cy3 or G) dye intensities ● To determine whether normalization is needed, one can plot R vs G intensities and see whether the slope of the line is around 1 ● But a better representation of genes with “medium” expression is to take logs... Scatterplots
MA-plot (2) ● Biologically, a unit change in log 2 represents a 2-fold change ● Increase and decrease are symmetric under log Log 2 transformation Linear scaleLog scale
MA-plot (3) ● An improved method of the R vs G plot is the MA-plot, which is basically a scaled 45 degree rotation ● It is a plot of the distribution of ● M-value which is the log 2 of the R/G intensity ratio ● A-value which is the log 2 of average intensity M and A values
MA-plot (4) ● The general assumption is that most of the genes would not see any change in their expression ● The majority of the points on the M would be located at 0, since log 2 (1) = 0 Relationship Intensity and M vs A
● The five-number summary is a descriptive statistic that provides information about a set of observations ● It consists of the five most important sample percentiles ● the sample minimum (min) is the smallest observation ● the lower or first quartile (Q 1 ) is the 25 th quantile ● the median, middle value or second quartile (Q 2 ) is the 50 th quantile ● the upper or third quartile (Q 3 ) is the 75 th called quantile ● the sample maximum (max) is the largest observation Five-Number Summary Boxplots (1)
Boxplots (2) ● Liver samples from 16 mice ( 8 WT and 8 ApoAI KO) Example from limma package
Signal/Noise Histograms ● Images with high background tend to have lower ratios Example
Spatial Plots (1) ● They are useful when coordinates are available Slide backgrounds
Spatial Plots (2) ● If there are no spatial effects high intensity spots should be uniformly distributed ● Top (black) and bottom (green) 5% of log ratios Highlighting extreme log ratios
Pin-group Effects (1) Example Lowess lines through points from pin groupsBoxplots of log ratios by pin-group
Pin-group Effects (2) Example Boxplots show a clear example of spatial bias
Plate Effects Quality between slides
Filtering (1) ● There may be errors during Hybridization and/or Scanning which yield bad spots ● These are automatically flagged ● Many spots may show very low signals ● Problems with spotting ● No hybridization in this spot ● Bad spots may be removed from the analysis to avoid unnecessary noise Why?
Filtering (2) ● We may filter the data on intensity ● by excluding values where both the red and green channels are less than 100 ● by setting the value of an intensity to the minimum in the event only one of the two channel intensities is below the minimum of 100 ● We may use the flag column imported with the data, and exclude intensities with a flag value not equal to 1 ●......spots and ajust signal
● Filtering is intended to remove spots whose images or signals were wrong due to different possible reasons 1) Small quantity of cDNA in the array 2) Errors during the scanning process ● But, some people prefer not to filter to avoid eliminating “good spots” unintentionally ● So Must we filter data? Filtering (3)
Normalization (1) ● Normalization describes techniques used to transform the data to correct for systematic variation (or differences) ● among arrays and ● within arrays ● But not biological variation among samples What is the normalization?
Normalization (2) ● By looking at diagnostic plots for biases that vary ● spot intensities, ● location on the array ● plate origin ● pins ● scanning parameters,.. ● By performing self-self hybridization How to know if normalization is necessary?
Normalization (3) ● In dual cDNA, two samples are each one labeled with a different fluorescent dye ● In most studies, the samples are from different sources (e.g. cancer vs normal) ● However, it is also possible to co-hybridize two samples from the same source (but differently labeled) ● If we hybridize a sample with itself intensities should be the same in both channels ● All deviations from this equality means there is systematic bias that needs correction What is self-self normalization? (1)
Normalization (4) What is self-self normalization? (2)
Normalization (5) What is self-self normalization? (3) ● In a self-self hybridization, we would expect all ratios to be equal to one ● But they may not be! ● Why not? ● Unequal labeling efficiency ● Noise in the system ● Differential expression ● Normalization brings (appropriate) ratios back to one
Normalization (6) Examples of self-self hybridization False Color OverlayBoxplots within pin-groupsScatter MA-plots
Normalization (7) Examples of non self-self hybridizations Early Goodman (UC Berkeley) From the NCI60 data set Early Ngai la (UC Berkeley) Early PMCRI (Melbourne, Australia)
Normalization (8) ● Methods ● Global adjustment ● Intensity dependent normalization ● Within print-tip group normalization ● And many other... ● Selection of spots for normalization Methods and Issues
Normalization (9) ● Normalization based on a global adjustment log 2 R/G -> log 2 R/G - c = log 2 R/(kG) ● Choices for k or c = log 2 k are ● c = median or mean of log ratios for a particular gene set ● e.g. housekeeping genes) ● total intensity normalization, where k = ∑R i / ∑G i Global adjustment
● Here we run a line through the middle of the MA-plot, shifting the M value of the pair (A, M) by c=c(A) log 2 R/G -> log 2 R/G - c (A) = log 2 R / (k(A)G) ● One estimate of c(A) is made using the LOWESS function of Cleveland (1979): LOcally Weighted Scatterplot Smoothing. Intensity-dependent Normalization (10)
● The LOWESS lines can be run through many different sets of points, and each strategy has its own implicit set of assumptions justifying its applicability ● e.g. global LOWESS justified by supposing that, when stratified by mRNA abundance a)only a minority of genes are expected to be differentially expressed, or b)any differential expression is as likely to be up-regulation as down-regulation ● Pin-group LOWESS requires stronger assumptions: that one of the above applies within each pin-group ● The use of other sets of genes, e.g. control or housekeeping genes, involve similar assumptions. Which spots to use for normalization? Normalization (11)
Which spots to use for normalization? Normalization (12) Unnormalized Print-tip normalization Print-tip and scale normalization