Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden 2004-10-01 Low-level Analysis of Microarray Data.

Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden 2004-10-01 Low-level Analysis of Microarray Data

2  Why are microarrays important? Genome projects have identified lots of new genes, but for many we do not know what they are doing. "Functional Genomics" Application areas: Cancer, Inflammation, Infectious diseases Toxicology - drug (side) effects Developmental biology

3 Papers A)H. Bengtsson, B. Calder, I. S. Mian, M. Callow, E. Rubin, and T. P. Speed. Identifying Differentially Expressed Genes in cDNA Microarray Experiments: making aging visible. 2001. B)H. Bengtsson. Identification and normalization of plate effect in cDNA microarray data. 2002. C)H. Bengtsson. The R.oo package - object-oriented programming with references using standard R code. 2003. D)H. Bengtsson and O. Hössjer. Methodological study of affine transformations of gene expression data with proposed normalization method. 2003. E)H. Bengtsson, G. Jönsson, and J. Vallon-Christersson. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. 2003. F)H. Bengtsson. aroma - An R Object-oriented Microarray Analysis environment. 2004. G)H. Bengtsson and O. Hössjer. Affine calibration for microarrays with dilution series or spike-ins. 2004.

4 Idea of microarrays (and other gene expression techniques): Measure the amount of mRNA to find genes that are expressed (active). (measuring protein might be better, but is currently much harder) Central Dogma of Molecular Biology

5 Gene-expression analysis x c,i : gene expression for gene i=1,...,I in sample c=1,...C. I=5...-30,000 and C>=2. Consider two samples from cancer and healthy tissue. We are interested in genes that have different expression: Non-differentially expr. genes: I x 1 =x 2 = {i  I: x 1,i =x 2,i } Differentially expressed genes: I x 1  x 2 = I \ I x 1 =x 2 ={i  I: x 1,i  x 2,i } differentially expressed genes non-diff. expr. genes

6 Hypothesis tests For each gene i=1,...,I we test the hypotheses: null hypothesis H 0 : x 1,i = x 2,i Non-differentially expressed genes alternative hypothesis H 1 : x 1,i  x 2,i Differentially expressed genes

7  Which genes are differentially transcribed? same-sametumor-normal log-ratio

8 Statistics 101:  bias accuracy   precision variance 

9 Central dogma of data analysis Can always increase sensitivity on the cost of specificity, or vice versa, the art is to find the best trade-off! X X X X X X X X X

10 Printing / spotting

11 J. Vallon-Christersson, Dept Oncology, Lund Univ. Microarray slide preparation

12 RNA extraction & hybridization Hybridize RNA Tumor sample cDNA RNA Reference sample cDNA Extract mRNA from samples Reverse transcription of mRNA to cDNA Label with Cy3 or Cy5 fluorescent dye Hybridize labeled cDNA to slide Wash slide Figure: Hybridization chamber. (probes) (targets)

13 Two-channel scanning excitation red laser green laser emission overlay images  higher frequency, more energy  lower frequency, less energy

14 Combined color image for visualization

15 Signal quantification 1.Addressing Locate spot centers. 2.Segmentation Classification of pixels either as signal or background (using circles, seeded region growing or other). 3.Signal quantification a) foreground estimates : R fg, G fg b) background estimates : R bg, G bg c)... (shape, size etc) Terry Speed et al.

16 Summary scanning data: (R fg,G fg,R bg,G bg,...) cDNA clones PCR product amplification purification printing Hybridize RNA Test sample cDNA RNA Reference sample cDNA excitation red laser green laser emission overlay images Production

17  Sources of variation amount of RNA in the biopsy efficiencies of -RNA extraction -reverse transcription -labeling -fluorescent detection probe purity and length distribution spotting efficiency, spot size cross-/unspecific hybridization stray signal NormalizationError model Systematic o similar effect on many measurements o corrections can be estimated from data Stochastic o too random to be ex- plicitely accounted for o remain as “noise”

18 M = log 2 (R/G) (log-ratio), A = ½ ·log 2 (R·G) (log-intensity) Exploratory data analysis: (R,G)  (log 2 R,log 2 G)  (M,A) R = red channel signal G = green channel signal R vs G log(R) vs log(G)M vs A

19 In self-self comparisons, find stochastic as well as systematic deviations from M=0. Idea: Do various exploratory plots to understand the nature of these deviations: for example, M vs A, spatial plots, density & boxplots plots, print- order plots etc. Exploratory data analysis: the need for normalization

24 Within-slide normalization Classical methods: No normalization Median normalization Shift of log-ratios. Curve-fit normalization methods. A.k.a. lo(w)ess normalization. etc.

25 Global median normalization M’ = M – median(M) Assumption: Changes roughly symmetric. aroma: normalizeWithinSlide(ma, method=“m”)

26 Global curve-fit normalization cont. (Biased towards the green channel & intensity dependent artifacts) aroma: plot(ma); normalizeWithinSlide(ma, method=“l”); plot(ma) curve-fit normalization done for all spots together M i ’ = M i – c(A i )

27 Print-tip curve-fit normalization Print-tip curve-fit normalized data {(A,M)} n=1..N : M’ i,p = M i,p - c p (A i );p = print tip 1-16 where c p is an intensity dependent function for print tip p. aroma: plot(ma); lowessCurve(ma, groupBy=“printtip”); boxplot(ma, groupBy=“printtip”)

28 Print-tip curve-fit normalization cont. aroma: normalizeWithinSlide(ma, “p”); boxplot(ma, groupBy=“printtip”); plotSpatial(ma) curve-fit normalization done for each print-tip group separately

29 Across-slides normalization The within-slide normalization makes the red and the green signals comparable. What about signals from different arrays? Across-slide normalization attacks this problem. Idea: Log-ratios (M values) from different slides should have similar variance. Example: If the standard deviation of the log-ratios from one array is 0.50 and from another array 0.78, the log-ratios have to be rescaled to get the same standard deviation.

30 Paper A H. Bengtsson, B. Calder, I. S. Mian, M. Callow, E. Rubin, and T. P. Speed. Identifying Differentially Expressed Genes in cDNA Microarray Experiments: making aging visible. Online report accompanying the discussion forum with the same name. Science SAGE KE, 2001 (12), vp8. Illustrates typical normalization of two-color microarray data Short introduction on how to identify differentially- expressed genes. Basic study on how many replicated arrays are needed. Comparison between two image-analysis methods.

31 Paper B H. Bengtsson. Identification and normalization of plate effect in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2002.

32 Printing of spots 1 16 6384 spots printed onto 9 slides in total 399 print turns using 4x4 print-tips... Hmm... why the horizontal stripes? 

33 Print-order plot of log-ratios The spots are order according to when they were spotted/dipped onto the glass slide(s). Note that it takes hours/days to print all spots on all arrays. plotPrintorder(ma)

34 Remove (constant) plate biases? Will remove some of the intensity dependent effects......and some of the spatial artifacts.

35 Intensity normalization plate by plate?...and most of the spatial artifacts....plus a print-tip normalization? Removes the plate effects...

36 Comparison of normalization methods Median absolute deviation (MAD) for gene i with replicates j=1,2,...,J: d i = 1.4826 · median | r ij | where r ij = M ij – median M ij is residual j for gene i. The measure of reproducibility (small is good) is a scalar defined as the mean of all genewise MADs: M.O.R. =  d i / N where N is the number of genes. Remember that minimizing variance alone is not useful - we also need to consider bias Ex: two different genes: d a < d b

37 Pl – Constant platewise norm., Pl(A) – Intensity dep. platewise norm., Sl(A) – Intensity dep. slidewise norm., Pr(A) – Intensity dep. print-tip-wise norm., sPr(A) – Scaled intensity dep. print-tip-wise norm., bg – background corrected data. Doing platewise intensity dependent normalization lowers the gene variability by another ~10% from print-tip norm. Background correction increases variability (but might reduce bias) Using measure of reproducibility is helpful in deciding what to do. Results

38 Visual comparison Scaled print-tip intensity normalization: (M.O.R.=0.123; 46%) Scaled print-tip follow by plate intensity normalization: (M.O.R.=0.110; 41%) No normalization: (M.O.R.=0.270; 100%)

39 Major reason for print-order effects +  intensity dependent effects different intensities for different plates

40 Paper D H. Bengtsson and O. Hössjer. Methodological study of affine transformations of gene expression data with proposed normalization method. Preprints in Mathematical Sciences 2003:38, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2003.

41 A general model y c,i = f c (x c,i ) +  c,i genes i = 1, 2,..., I RNA extracts c = 1, 2,..., C x c,i the true gene expression level y c,i the observed gene expression level  c,i measurement noise, E[  c,i ] = 0 f c (·) measurement function is (unknown)

42 Measurement functions model the biochemical and physical measurement procedure

43 Recall the M vs A transform Especially in two-channel microarray analysis the M vs A transform is useful: where x R,i and x G,i are the true gene expression levels. Since these are unknown, it is a common practice to plug in the observed expression levels

44 Linear models and M vs A In the literature, it common to assume a linear measurement function, either explicitly or implicitly: where b c is the scale factor for channel c = R,G. The observed log-ratios and log- intensities become where  =b R /b G is the relative scale factor. Systematic effects: The overall shift in the log-ratios is log 2 , which is constant.

45 Examples of bias with linear measurement functions Balance channels:  = b R /b G = 1 ) log 2  = 0 Red too strong:  = b R /b G = 4/1 ) log 2  = +2 Green too strong:  = b R /b G = 1/4 ) log 2  = -2

46 Affine measurement functions Assume that the measurement function is affine: where a c and b c are channel-specific bias and scale parameters. The observed log-ratios become where  i = a R -  a G (x R,i / x G,i ),  =b R /b G, and A i is the observed log intensity for gene i. Systematic effects: i) Constant shift of log2 . ii) intensity-dependent bias. Note that the second bias term is fold-change dependent! Moreover, if a R = a G = 0 (linear), the second bias term is zero as before.

47 Examples of bias with affine measurement functions Assume different channel biases, i.e. a R ≠a G and  =b R /b G =1. No bias: (a R,a G )=(0,0) Small bias in red: (a R,a G )=(20,0) Small bias in red, large bias in green: (a R,a G )=(20,200)

48 Theoretical examples of affine transformations Assume different channel biases, i.e. a R ≠a G, but same  =b R /b G =0.57. The blue dash-dot curves are non-differentially expressed genes. Colored curves above and below are genes with fold-change log 2 (x R,i /x G,i ) = ±1,±2,... (a R,a G )=(0,0)  = 0.57 (a R,a G )=(20,200)  0.57 (a R,a G )=(80,140)  0.57

49  … and with data:

50 Consider a two-channel (C={R,G}) microarray with genes i=1,...,I for which we observe With  = a R -  a G and  = b R /b G, we have for non-differentially expressed genes (no noise) that The red and green signals, y i = (y R,i,y G,i ), then scatter around the line with intercept  and slope . Next, robust estimates of  and . Robust affine normalization

51 IWPCA – iterative reweighted PCA Define the objective function w i is a weight, where r i ( ,  ;y i ) > 0 is the Euclidean distance between y i and the line L( ,  ). The estimate of  and  is then With weights w i = 1 we obtain standard PCA (minimization in L 2 ). With weights we minimize in L 1, if we let  ! 0. Algorithm: Estimate  and . Update weights. Reestimate  and  and so on.

52 Additional constraints With estimates of  = a R -  a G and  = b R /b G, we impose the additional constraint that after normalization no negative signal may exists (less conservative constraint may also be used). In addition, let b R =1. The robust affine normalization is then Example: Estimates (a R,a G,  ) = (21.5,29.3,0.163).

53 Generalization to several channels/arrays The robust affine normalization generalizes directly to microarrays with multilple channels, i.e. C=3,4,.... to multiple arrays, if the experimental design allows, i.e. K=2,3,.... In addition: No between-array normalization is needed (”it’s included”). Normalization to a common reference is straightforward. That is, if all arrays share the same reference, first all reference channels is normalized toward each other to form a ”baseline channel”, then test channels are normalized toward the baseline channel. Usage (in aroma): normalizeAffine(rg)

54 Paper E H. Bengtsson, G. Jönsson, and J. Vallon-Christersson. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. Preprints in Mathematical Sciences 2003:37, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2003.

55 The microarray technology (again)

56 Scan protocol By scanning the same array at various PMT gains (and constant laser power) we can learn about h c. When we scan at gains v and w, we obtain two different measurements of the unknown, but fixed amount of hybridized DNA (x c,i ): for both channels c = {R,G} and all genes i = 1,...,I.

57 Observed signals at various gains Channel: c = G. PMT voltage pairs (v,w): (800,500)(600,500) (700,500)(700,600) (800,600)(800,700) All PMT pairs converge at the same point (e c,e c ) and not at the origin (0,0).  Affine measurement function: zoom 

58 The K observed signals in channel c for each gene i: scatter around the line with The line L(a c,b c ) can be estimate robustly using IWPCA as before. Next, define Bias parameters a c are uniquely identified by the additional constraint that || d c || ¼ 0, i.e. General model and estimation

59 Multiscan calibration Backtransform: Within-channel M vs A:

60 Results By scanning the same microarray at various sensitivity levels one obtains: calibrated signals with scanner biases removed an extended dynamical range improved signal-to-noise levels better understanding of the noise structure (of the scanner)

61 Paper G H. Bengtsson and O. Hössjer. Affine calibration for microarrays with dilution series or spike- ins. Preprints in Mathematical Sciences 2004:19, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2004.

62 gene i=1,...,I replicated spots j=1,...,J at various concentrations (dilution series). channel c. First, assume that the concentration of fluorophores b i,j of all replicated spots of a gene i is the same: z c,i,j = z c,i Model the total amount of light x c,i,j from spot (i,j) as x c,i,j = b i,j z c,i Model

63 We observe y c,i,j = a c + b c x c,i,j +  c,i,j = a c + b c b i,j z c,i +  c,i,j  c,i,j independent zero-mean noise with variance  c,i,j 2. a c and b c offset and scale parameters for channel c In the paper, the variance function  c,i,j 2 =  c,i 2 = k c  i 2 is used. The offset parameter a c can now be uniquely estimated using maximum likelihood techniques utilizing numerical optimization and PCA. b c are estimated as before. Model cont.

64 Probe calibration Before After C=2, I=432, J=4 ten dilution series are highlighted

65 In addition the method gives... Estimates of the relative expression level of the dilution series follow directly from the maximum likelihood procedure.

66 Arrays should contain multiple spots for the same gene that have different dilutions (instead of identical replicates) Systematic treatment of bias (optical background), expression ratios for the dilution series ("genes"), and noise. Result

67 Good scientific software is like a good scientific publication  reproducible  peer-reviewed  easily accessible by other researchers, society  builds on the work of others  other will build their work on top of it  commercialize those developments that are successful

68 Why Open Source? so that you can find out what algorithm is being used, and how it is being used so that you can modify these algorithms to try out new ideas or to accommodate local conditions or needs so that they can be used as components  Transparency  Pursuit of reproducibilty  Efficiency of development  Training

69 Paper C H. Bengtsson. The R.oo package - object-oriented programming with references using standard R code. In Kurt Hornik, Friedrich Leisch, and Achim Zeileis, editors, Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria, March 2003

70 R.oo package Objectives: Better and more robust object-oriented design and implementation More memory efficient and faster implementations Easy to understand and to extend User- and developer-friendly interface Core of the microarray package (aroma) Methods: Single root class Object all classes inherits from. Provide support for reference variables. R Coding Conventions (RCC), e.g. naming conventions ( MyClassName,myObject, myFunction() ) Free and open-source (5500 code lines). Proof of concept, rich documentation with many examples (~100 pages).

71 Paper F H. Bengtsson. aroma - An R Object-oriented Microarray Analysis environment. Preprints in Mathematical Sciences 2004:18, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, 2004.

72 aroma package Objectives: Provide simple methods for low-level analysis of microarray data Wide support for file formats User- and developer-friendly interface Easy to understand and easy to extend Easy to maintain Methods: Make use of R.oo Make use of R.classes pkgs (20000 lines of code & > 400 pages of docs) Follow R Coding Conventions Free and open-source (18000 code lines) Easy to install anywhere. Rich documentation with examples (>280 pages)

73 A short example > gpr <- MicroarrayData$read("Slide1.gpr") > gpr [1] "GenePixData: Number of slides: 3. Number of fields: 43. Layout: Grids: 4x4 (=16), spots in grids: 18x18 (=324), total numberof spots: 5184. Spot names are specified. Spot ids are specified." > plotSpatial(gpr) > raw <- getRawData(gpr) > ma <- getSignal(raw) > idx 1.5) > plot(ma) > highlight(ma, idx, col="blue") > normalizeAffine(ma) >...

Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden 2004-10-01 Low-level Analysis of Microarray Data.

Similar presentations

Presentation on theme: "Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden 2004-10-01 Low-level Analysis of Microarray Data."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden 2004-10-01 Low-level Analysis of Microarray Data.

Similar presentations

Presentation on theme: "Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden 2004-10-01 Low-level Analysis of Microarray Data."— Presentation transcript:

Similar presentations

About project

Feedback