Presentation is loading. Please wait.

Presentation is loading. Please wait.

Statistical analysis of microarray data

Similar presentations


Presentation on theme: "Statistical analysis of microarray data"— Presentation transcript:

1 Statistical analysis of microarray data
Henrik Bengtsson Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden Bioinformatics 2, Bioinformatics Centre, University of Copenhagen. February 19, 2004

2 Outline Part I – (Very short) Background
Central Dogma of Biology Idea behind the microarray technology Experimental design Part II - Printing, Hybridization, Scanning & Image Analysis From clone to slide From samples to hybridization From scanning to raw data Part III - Preprocessing of data (IMPORTANT) Exploratory data analysis and log transforms Background correction Normalization Part IV - Identifying differentially expressed genes Cut-off by log-ratios values, the t-statistics and cut-off by T values Multiple testing, adjusting the p-values Validation

3 The cDNA microarray technology - PART I: (Very short) Background
The Central Dogma of Biology Idea behind the microarray technology Experimental design

4 The Central Dogma of Biology
Idea of microarrays (and other gene expression techniques): Measure the amount of mRNA to find genes that are expressed (active). (measuring protein might be better, but is currently much harder)

5 History cDNA microarrays have evolved from Southern & Northern blots, with clone libraries gridded out on nylon membrane filters being an important and still widely used intermediate. Things took off about 1995 with the introduction of non-porous solid supports, such as glass - these permitted miniaturization - and fluorescence based detection. Terry Speed et al.

6 The cDNA Microarray Technique
Put a large number of DNA sequences or synthetic DNA oligomers onto a glass slide gene expressions at the same time. Measure amounts of cDNA bound to each spot Identify genes that behave differently in different cell populations - tumor cells vs healthy cells - brain cells vs liver cells - same tissue different organisms Time series experiments - gene expressions over time after treatment, e.g. T=5,10,30,... min vs T=0. ...

7 Experimental Design Experimental design is very important. Example: Assume we have 8 control (C1, ... C8) and 8 reference (R1, ..., R8) samples, should we compare a) C1-R1, ..., C8-R8, or b) C1-R’, ..., C8-R’ where R’ is a pool of all R, or c) something else? Wrong design can be very wrong and the correct can be very efficient and require less arrays (= less lab work and cheaper experiments!). Statistical theory exists and is helpful. Recently, statisticans has started to look at the microarray design problem in more detail. Progress will for be made sure! Experience from previous setups is also of great value. We will not discuss this problem anymore in this talk.

8 The cDNA microarray technology - PART II: Printing, Hybridization, Scanning & Image Analysis
From clone to slide From samples to hybridization From scans to raw data

9 Overview cDNA clones printing Hybridize RNA Test sample cDNA
PCR product amplification purification printing excitation red laser green laser emission scanning Hybridize RNA Test sample cDNA Reference sample overlay images Production data: (Rfg,Gfg,Rbg,Gbg, ...)

10 Printing / spotting Arrayer (approx 100,000 EUR)

11 Microarray slide preparation
Glas Amino-grupper UV kross-link - tvätt J. Vallon-Christersson, Dept Oncology, Lund Univ.

12 Terminology: probe and target
As defined in Nature 1999: The probes are the immobilized DNA sequences spotted on the array, i.e. spot, oligo, immobile substrate. The targets are the labeled cDNA sequences to be hybridized to the array, i.e. mobile substrate. The opposite usage can also be seen in some references. However, think of probes as the measuring device (which you can buy), and the targets (that you provide) as what you want to measure.

13 RNA extraction & hybridization
(targets) Hybridize RNA Tumor sample cDNA Reference sample Extract mRNA from samples Reverse transcription of mRNA to cDNA in presence of aminoallyl-dUTP (aa-dUTP) Label with Cy3 or Cy5 fluorescent dye (links to aa-dUTP) Hybridize labeled cDNA to slide Wash slide Preparation incl. array (approx EUR/array) (probes) Figure: Hybridization chamber.

14 References Original cDNA microarray paper: General:
Mark Schena, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, October 1995. General: Mark Schena. Microarrays Analysis. John Wiley & Sons, Inc., Hoboken, New Jersey, 2003. David J. Duggan, Michael Bittner, Yidong Chen, and Paul Meltzer & Jeffrey M. Trent. Expression profiling using cDNA microarrays. Nature Genetics, 21(1 Supplement):10–14, January 1999.

15 Scanning Terry Speed et al.

16 Software (approx 0-5,000 EUR)
Scanner (approx 50, ,000 EUR)

17 Two-channel scanning   overlay images higher frequency, more energy
excitation red laser green laser emission overlay images higher frequency, more energy lower frequency, less energy

18 Combined color image for visualization

19 Image Analysis Terry Speed et al.

20 Signal quantification
Addressing Locate spot centers. Segmentation Classification of pixels either as signal or background (using circles, seeded region growing or other). Signal quantification a) foreground estimates b) background estimates c) ... (shape, size etc) Terry Speed et al.

21 Does one size fit all? Terry Speed et al.

22 Segmentation and quantification of spot regions
Seeded region growing Fixed-sized or adaptive-sized circles Inside the boundaries is foreground (the spots) and outside is the background. We estimate both... Terry Speed et al.

23 Robust signal estimates: mean vs median
When a statistician talks about a robust measure he/she normally means a measure that is not sensitive to outliers, e.g. x = (8, 85, 7, 9, 5, 4, 13, 6, 8) The mean of all x’s, i.e. (x1+x2+...+xK)/K, will be affected if one of the measures, say x2, is wrong; mean(x) = 16.11 The median of all x’s, i.e. the middle value of (x1+x2+...+xK), will not be affected if a “few” (< 50%) are wrong; median(x) = 8.0 For this reason we prefer to use median instead of the mean in cases were we expect artifacts. (If there are a lot of measurements and the errors are symmetrically distributed the median will give the same result as the mean without outliers.)

24 Typical data output

25 Summary cDNA clones printing Hybridize RNA Test sample cDNA
PCR product amplification purification printing excitation red laser green laser emission scanning Hybridize RNA Test sample cDNA Reference sample overlay images Production data: (Rfg,Gfg,Rbg,Gbg, ...)

26 References Image analysis:
Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000. Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December :E40.

27 The cDNA microarray technology - PART III: Preprocessing of data (IMPORTANT)
Exploratory Data Analysis and log-transforms Background correction Normalization

28 Exploratory Data Analysis
Terry Speed et al.

29 Scatter plot: R vs G “Observed” data {(R,G)}n=1..5184:
non-differentially expressed genes along the diagonal: R = G up-regulated genes Most genes have low gene expression levels. What happens here? down-regulated genes “Observed” data {(R,G)}n= : R = signal in red channel, G = signal in green channel

30 Scatter plot: log2R vs log2G
non-differentially expressed genes are still along the diagonal: log2R = log2G up-regulated genes Low gene expression levels are “blown up”. down-regulated genes “Observed” data {(log2R,log2G)}n= : R = signal in red channel, G = signal in green channel

31 Scatter plot: M vs A (recommended)
Note: M vs A is basically a rotation of the log2R vs log2G scatter plot. up-regulated genes non-differentially expressed genes are now along the horizontal line: M = 0  log2R - log2G = 0 Why: Now the quantity of interest, i.e. the fold change, is contained in one variable, namely M! If M > 0, up-regulated. If M < 0, down-regulated. down-regulated genes Transformed data {(M,A)}n= : M = log2(R) - log2(G) (minus) A = ½·[log2(R) + log2(G)] (add)

32 A = ½·[log2(R) + log2(G)] = [logarithmic rules] = ½·log2(R·G)
Details on M vs A Log-ratios: M = log2(R) – log2(G) = [logarithmic rules] = log2(R/G) Average log-intensities: A = ½·[log2(R) + log2(G)] = [logarithmic rules] = ½·log2(R·G) There is a one-to-one relationship between (M,A) and (R,G): R=(22A+M)1/2, G=(22A-M)1/2

33 More on why log and why M vs A?
It makes the distribution symmetric around zero  R:G M=log(R/G) comment 1:1 20=1 2:1 +1 1:2 -1 21=2, 2-1=1/2 4:1 +2 1:4 -2 22=4, 2-2=1/4 8:1 +3 1:8 -3 23=8, 2-3=1/8 16:1 +4 1:16 -4 24=16, 2-4=1/16 32:1 +5 1:32 -5 25=32, 2-5=1/32 Logs stretch out region we are most interested in and makes the distribution more normal.  Easier to see artifacts of the data, .e.g. as intensity dependent variation and dye-bias.  Before: After: Log base 2 because the raw data is binary data (max intensity is = 65535). It is also naturally to think of 2-, 4-, 8-fold etc up and down regulated genes. For the actual analysis, any log-base will do.

34 Summary of (R,G)  (log2R,log2G)  (M,A)
R vs G log(R) vs log(G) M vs A R = red channel signal G = green channel signal M = log2(R/G) (log-ratio), A = ½·log2(R·G) (log-intensity) plot(rg, log=FALSE); plot(rg); ma <- as.MAData(rg); plot(ma)

35 Density plots of all signals
Example: Signals in the red channel seem to be slightly weaker than the signals in the green channel, compare with M vs A plot: ...more green. log2R = red channel signal log2G = green channel signal plotDensity(rg)

36 Spatial plot of log-ratios (M values)
Print-tip 1 Print-tip 8 Print-tip 9 Print-tip 16 plotSpatial(ma)

37 Print-tip box plot of log-ratios
1 16 boxplot(ma, groupBy=“printtip”)

38 Printing order of spots
1 Hmm... why the horizontal stripes?  16 6384 spots printed onto 9 slides in total 399 print turns using 4x4 print-tips...

39 Print-order plot of log-ratios
The spots are order according to when they were spotted/dipped onto the glass slide(s). Note that it takes hours/days to print all spots on all arrays. plotPrintorder(ma)

40 Print-dip plot of log-ratios
Average values of the 16 log-ratios at each dip from each of the 399 print turns. plotDiporder(ma)

41 References Exploratory data analysis for microarrays:
Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA microarray data. In Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June The International Society for Optical Engineering. Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002. Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003

42 Background correction and background estimates revisited...
Terry Speed et al.

43 Alt. 1: Local background estimates
One for the red and one for the green channel. Axon GenePix Pro - size of bg region depends on size of the spot. Give too much variation! Fixed-size regions - same number of background pixels give same variation for all estimate. A. Bengtsson, Lund University, 2003.

44 Alt. 2: Filtered background estimates
One for the red and one for the green channel. In general, filter by moving a window (mask) over the image: 1) At each position replace the value in the center (red dot) by, say, the median pixel intensity of all spots in the window. 2) Move to next pixel and so on. 3) Result: (more sophisticated filters than the median filter are used in practice) Idea: With filters we do not have to know the where the spots are. More robust. Also known as: Morphological filters A. Bengtsson, Lund University, 2003.

45 Background subtraction
The foreground region of the spots are assumed to be contaminated with background noise/signal. Sources of background: probe unspecifically sticking on slide, irregular / dirty slide surface, dust, noise in the scanner measurement. From the image analysis we have foreground and background estimates for each spot in both channels: (Rfg, Rbg, Gfg, Gbg) For each spot on the slide we do background subtraction Red intensity: R = Rfg – Rbg Green intensity: G = Gfg - Gbg Not considered: real cross-hybridisation and unspecific hybridisation to the probe. Still not known if this is a real problem. 4 Example of an extremely bad array. Terry Speed et al.

46 Which background estimate should we use?
Axon GenePix Pro - size of bg region depends on size of the spot. Give too much variation! Filtered estimates - no spot segmentation. Fixed-size regions - same number of background pixels give same variation for all estimate. Fixed-sized are better than variable-sized region methods, but filtered estimates may be even better. Should we do background correction at all? (A. Bengtsson, 2003) is a good start, but more research is needed.

47 Background correction or not?
non-background subtracted background subtracted M = log2(Rfg/Gfg) Mbg = log2({Rfg-Rbg} / {Gfg-Gbg}) -seems better, but...

48 Background problem still not solved!
Foreground & background Background subtracted GenePix background: -Curvature in the other direction now! Too much back- ground correction?! Spot background: (morphological opening) -Still curvature left. Too little back- ground correction?!

49 References Background estimation and correction:
Yee Hwa Yang, Michael Buckley, Sandrine Dudoit, and Terry Speed. Comparison of methods for image analysis on cDNA microarray data. Technical Report 584, Department of Statistics, University of California at Berkeley, Nov 2000. Anders Bengtsson. Microarray image analysis: Background estimation using region and filtering techniques. Master’s Theses in Mathematical Sciences, Mathematical Statistics, Centre for Mathematical Sciences, Lund Institute of Technology, Sweden, December :E40. Henrik Bengtsson, Göran Jönsson, and Johan Vallon-Christersson. Calibration and assessment of channel-specific biases in microarray data with extended dynamical range. Preprints in Mathematical Sciences 2003:37, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003. Charles Kooperberg, Thomas G. Fazzio, Jeffrey J. Delrow, and Thoshio Tsukiyama. Improved background correction for spotted DNA microarrays. Journal of Computational Biology, 9:55–66, 2002.

50 Normalization Terry Speed et al.

51 Normalization Expectation: Most genes are non-differentially expressed, i.e. most of the data points should be around M=0. Idea: Do various exploratory plots to see if this assumption is met. For example, M vs A, spatial plots, density & boxplots plots, print-order plots etc. Result: We commonly observe something else: Measured value = real value + systematic errors + noise Correction: If so, normalize the data such that the expectations are met: Corrected value = real value + systematic errors + noise

52 Sources of systematic effects, but also noise and natural variability
Biological variability RNA extraction Probe labeling (dye differences) Printing (print-order, plate-order, clone variation, etc.) Hybridization (temperature, time, mixing, etc.) Human (variation between lab researchers) Scanning (laser and detector, chemistry of the fluorescent label) Image analysis (identification, quantification, background methods etc.)

53 Examples of practical problems that can lead to artifacts in data
Surface chemistry: uneven surface may lead to high background. Dipping the pin into large volume. It helps to pre-printing on a dummy array to drain off excess sample. Spot variation can be due to mechanical difference between pins. Pins could be clogged during the printing process. Spot size and density depends on surface and solution properties. Pins need good washing between samples to prevent sample carryover. Dust while printing and while hybridizing the arrays. ... and much more!

54 Within-slide normalization
Current methods widely used today: No normalization Median normalization Shift of log-ratios. Curve-fit normalization methods. A.k.a. lo(w)ess normalization. Dye-swap normalization Actually an experimental design.

55 Within-slide normalization cont.
The median, the curve-fit and the dye-swap normalization and others, can be applied to different groups of spots: all spots print-tip groups plate groups ... but also spatially on spots ordered in print order

56 No normalization – raw data
Non-normalized data {(M,A)}n=1..N: M = log2(R/G) aroma: ´plot(ma, slide=2)

57 Global median normalization
Assumption: Changes roughly symmetric. M’ = M – median(M) aroma: normalizeWithinSlide(ma, method=“m”)

58 Global curve-fit normalization
Global curve-fit normalized data {(M,A)}n=1..N: Mnorm = M - c(A) where c(A) is an intensity dependent function (loess, splines, ...). aroma: plot(ma); lowessCurve(ma)

59 Global curve-fit normalization cont.
curve-fit normalization done for all spots together (Biased towards the green channel & intensity dependent artifacts) aroma: plot(ma); normalizeWithinSlide(ma, method=“l”); plot(ma)

60 Print-tip curve-fit normalization
Print-tip curve-fit normalized data {(M,A)}n=1..N: Mp,norm = Mp - cp(A); p = print tip 1-16 where cp(A) is an intensity dependent function for print tip p. aroma: plot(ma); lowessCurve(ma, groupBy=“printtip”); boxplot(ma, groupBy=“printtip”)

61 Print-tip curve-fit normalization cont.
curve-fit normalization done for each print-tip group separately aroma: normalizeWithinSlide(ma, “p”); boxplot(ma, groupBy=“printtip”); plotSpatial(ma)

62 Scaled print-tip curve-fit normalization
Scaled print-tip curve-fit normalized data {(M,A)}n=1..N: Mp,norm = sp·(Mp - cp(A)); p=print tip 1-16 where sp is a scale factor for print tip p (Median Absolute Deviation). After print-tip normalization After scaled print-tip normalization curve-fit normalization + rescaling done for each print-tip group separately aroma: normalizeWithinSlide(ma, “p”); normalizeWithinSlide(ma, “s”)

63 Spatial artifacts after normalization
No normalization Global curve-fit normalization Print-tip curve-fit normalization Scaled print-tip curve-fit normalization

64 Another example Scaled print-tip curve-fit normalization:

65 Dye-swap normalization
Idea: We have observed dye-dependent biases. What if we do the same hybridization a second time, but with reversed labeling of the dyes? Any dye-effects will hopefully cancel out, i.e. self-normalize, if the dyes are swapped and the signals are averaged. Requires: Paired arrays, i.e. 2, 4, 6, ... arrays. Details: Let {(M1,A1)} and {(M2,A2)} be the log-ratios and log-intensities for array 1 and 2, respectively. Dye-swap normalization is then: M = (M1-M2)/2, A = (A1+A2)/2 4

66 Dye-swap normalization cont.
Details: Let {(M1,A1)} and {(M2,A2)} be the log-ratios and log-intensities for array 1 and 2, respectively. Dye-swap normalization is then: M = (M1-M2)/2, A = (A1+A2)/2 What is done? Let {(R1,G1)} and {(R2,G2)} be the corresponding red and green signals from array 1 and 2. Then M = (M1-M2)/2 = [log2(R1/G1)-log2(R2/G2)]/2 = [log2R1-log2G1-(log2R2-log2G2)]/2 = [(log2R1-log2R2)+(log2G2-log2R1)]/2 = [log2(R1/R2) + log2(G2/G1)]/2 = (M1’+M2’)/2 Similarly for A = (A1+A2)/2 = ... = (A1’+A2’)/2. Now, if the red signals are equally biased, then the effects are hopefully cancelled-out. Same for the green channel. Thus, dye-swap is averaging two virtual (hopefully) self-normalized slides. 4

67 Median plate normalization
Will remove some of the intensity dependent effects... ...and some of the spatial artifacts.

68 Within-slide normalization summary
Alternatives: No normalization Median normalization a) all spots b) print-tip c) plate-group Curve-fit normalization a) Global curve-fit normalization b) Print-tip curve-fit normalization c) Scaled print-tip curve-fit normalization Dye-swap normalization

69 Across-slides normalization
The within-slide normalization makes the red and the green signals comparable. What about signals from different arrays? Across-slide normalization attacks this problem. Idea: Log-ratios (M values) from different slides should have equal variation. Example: If the standard deviation of the log-ratios from one array is 0.50 and from another array 0.78, the log-ratios have to be rescaled to get the same standard deviation.

70 Across-slides normalization cont.
Idea: Log-ratios (M values) from different slides should have equal variation. True ratio is M’ij where j represents different spots and i represents different arrays. Observed is Mij, where Mij = ai * M’ij. Robust estimate of ai is Corrected values are calculated as: 0 < ai < 1: how much spread slide i has compared to the average of the other slides. (Conceptually we could have used the standard deviation instead, but it is not robust!) aroma: normalizeAcrossSlides(ma)

71 Combining data from several slides
1. Within-slide normalization 2. Across-slides normalization 3. Averaging + more information aroma: normalizeWithinSlide(ma, “p”); normalizeAcrossSlides(ma); tma <- as.TMAData(ma)

72 The average of all normalized slides
The “average” slide: aroma: plot(tma)

73 References Normalization:
Yee Hwa Yang, Sandrine Dudoit, Percy Luu, and Terence P Speed. Normalization for cDNA microarray data. In Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Proceedings of SPiE, volume 4266 of Microarrays: Optical Technologies and Informatics, pages 141–152, San Jose, California, June The International Society for Optical Engineering. Henrik Bengtsson and Ola Hössjer. Methodological study of affine transformations of gene expression data with proposed normalization method. Preprints in Mathematical Sciences 2003:38, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2003. Wolfgang Huber, Anja von Heydebreck, Holger Sültmann, Annemarie Poustka, and Martin Vingron. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 1:1–9, 2002. Henrik Bengtsson. Identification and normalization of plate effects in cDNA microarray data. Preprints in Mathematical Sciences 2002:28, Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, 2002. Gordon Smyth and Terry Speed, METHODS: Selecting Candidate Genes from DNA Array Screens, Dec 2003 Henrik Bengtsson. aroma - An R Object-oriented Microarray Analysis environment. Preprint in Mathematical Sciences (manuscript in progress), Mathematical Statistics, Centre for Mathematical Sciences, Lund University, Sweden, URL:

74 The cDNA microarray technology - PART IV: Identifying differentially expressed genes
Cut-off by M values The t-statistics and cut-off by T values Multiple testing and adjusted the p-values Validation

75 Average of all normalized slides
The “average” slide: aroma: plot(tma)

76 Cut-off by log-ratios (naive)
Top 5% of the absolute M values: aroma: plot(tma); highlight(tma, top(abs(tma$M), 0.05), col=“red”)

77 The t-statistics Idea: For replicated data, i.e. multiple measurements of the same thing, we trust the estimate of the average (mean or median) more if the deviation (std.dev. or MAD) is small. If the deviation is large, we do not trust it that much. Example: The blue and the red genes have almost the same average log-ratio, but we are more confident with the measure of the blue gene since its variability across replicates is smaller. The T statistics down-weight the importance of the average if the deviation is large and vice versa; T = mean(x) / SE(x) where SE(x)=std.dev(x)/N (standard error of the mean) aroma: tma <- as.TMAData(ma); summary(ma$T)

78 Cut-off by T values (better)
Top 5% of the absolute T values: aroma: plot(tma); highlight(tma, top(abs(tma$T), 0.05), col=“blue”)

79 Where are the M values compared to the T values?
Top 5% of the absolute M values in red: aroma: plot(tma); plot(tma, “TvsSE”); ...

80 References Identification of differentially expressed genes:
Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, Department of Statistics, University of California at Berkeley, 2000. M. Callow, S. Dudoit, E. Gong, T. Speed, and E. Rubin. Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research, 10(12):2022–9, December 2000. Ingrid Lönnstedt and Terence P. Speed. Replicated microarray data. Statistical Sinica, 12(1), 2002.

81 Multiple testing In a microarray experiment, each gene (each probe or probe set) is really a separate experiment. Yet, if you treat each gene as an independent comparison, you will always find some with significant differences.

82 False positive and false negative
False Positive and False Negative: Statistical test vs. truth If truth is T ≠ 0: If truth is T=0: Statistical test decision: Reject H0: T=0. Correctly reject H0 False Positive (Type I Error) Statistical test decision: Do not reject H0: T=0. False Negative (Type II Error) Correctly not reject H0 In cDNA microarray experiments we commonly test the hypothesis H0 that T=0 against T≠0 (non-DE or not) for every gene separately. For the genes for which we reject H0, we say they are differentially expressed. However, by chance we will reject H0 for some genes that are not DE. We call these findings false positive (Type I). Genes that are DE, but for which we do not reject H0 are called false negative (Type II).

83 False Discovery Rate False Discovery Rate (FDR) is equal to the p-value of the test, p = P(|T| > tα/2), times the number of genes in the array For a p-value of 0.01·10,000 genes = 100 false DEs. You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p=0.001  10 false positives) The FDR must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and variability of the measured expression values. It’s tricky, but at least you should be aware of it.

84 Multiple testing cont. Multiple testing pitfalls
thousands of tests, i.e. each gene is tested against H0: T=0. By chance some will “fail”. false positives problems more serious. need to adjust p-values. Different adjustment procedures Bonferroni, Sidak, Duncan, Holm, etc. Not discussed here, but available automatically in the better microarray analysis software.

85 Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics.
Terry Speed et al.

86 References Multiple testing and adjusted p-values:
Sandrine Dudoit, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578, Department of Statistics, University of California at Berkeley, 2000. Y. Ge, S. Dudoit & T. P. Speed, Resampling-based multiple testing for microarray data hypothesis (submitted to Test, Spain). Technical Report #633 of UCB Statistics, 2003.

87 Validation of identified genes

88 Exploratory validation plots
Were are the differentially expressed genes located? Does it make sense? If there is a pattern, you might need to redo the normalization.

89 Biological verification
Having a top 20 list of interesting genes, what biological relevance do they have in this experiment? Have you found interesting genes, but with very noisy data? Do more experiments and see if you can replicate the results. Otherwise, you might just have found noise! Verify interesting genes by other means such as Southern or Northern blots and quantative real-time PCR (QRT-PCR) or other microarray technologies etc.

90 Summary: Identification of DEs
You need replication and statistics to find real differences. Cutoff by log ratios is not enough/correct. Cutoff by t-statistics is much better. Multiple testing => must adjust the p values. Validate your results

91 Take-home messages Good image analysis is essential. Some software are obsolete and not that good. Commercial software like GenePix could be better (and hopefully will be soon). Background correction or not is not solved. Progress has been done, but more research is needed. Normalization is needed. We understand more now than two years ago. Use at least the t-statistics to identify differentially expressed genes. Do not rely exclusively on log-ratios. Multiple testing must be considered; adjust your p-values. Talk to a statistician before doing the experiments! We do think about these kind of problems for a living.


Download ppt "Statistical analysis of microarray data"

Similar presentations


Ads by Google