Experimental Design and Differential Expression Class web site: Statistics for Microarrays.


1 Experimental Design and Differential Expression Class web site: http://statwww.epfl.ch/davison/teaching/Microarrays/ETHZ/ Statistics for Microarrays

2 Workflow: Biological question → Experimental design → Microarray experiment → 16-bit TIFF files → Image analysis → (Rfg, Rbg), (Gfg, Gbg) → Normalization → R, G → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation. Example goals: differentially expressed genes, sample class prediction, etc.

3 Some Considerations for cDNA Microarray Experiments (I) Scientific (Aims of the experiment) Specific questions and priorities How will the experiments answer the questions Practical (Logistic) Types of mRNA samples: reference, control, treatment, mutant, etc Source and Amount of material (tissues, cell lines) Number of slides available

4 Some Considerations for cDNA Microarray Experiments (II) Other Information Experimental process prior to hybridization: sample isolation, mRNA extraction, amplification, labelling, … Controls planned: positive, negative, ratio, etc. Verification method: Northern, RT-PCR, in situ hybridization, etc.

5 Aspects of Experimental Design Applied to Microarrays (I) Array Layout Which cDNA sequences are printed Spatial position Allocation of samples to slides Design layouts A vs B: Treatment vs control Multiple treatments Factorial Time series

6 Aspects of Experimental Design Applied to Microarrays (II) Other considerations Replication Physical limitations: the number of slides and the amount of material Sample Size Extensibility - linking

7 Layout options The main issue is the use of reference samples, typically labelled green. Standard statistical design principles can lead to more efficient layouts; use of dye-swaps can also help. Sample size determination is more than usually difficult, as there are 1,000s of possible changes, each with its own SD.

8 Natural design choice. Case 1: Meaningful biological control (C). Samples: liver tissue from four mice treated with cholesterol-modifying drugs. Question 1: Genes that respond differently between the T and the C. Question 2: Genes that respond similarly across two or more treatments relative to control. (Diagram: C linked to T1, T2, T3, T4.) Case 2: Use of a universal reference. Samples: different tumor samples. Question: To discover tumor subtypes. (Diagram: Ref linked to T1, T2, ..., Tn-1, Tn.)

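9 Treatment vs Control. Two samples, e.g. KO vs. WT or mutant vs. WT. Direct design (T hybridized against C): estimate average(log(T/C)), variance σ²/2. Indirect design (T and C each hybridized against Ref): estimate log(T/Ref) – log(C/Ref), variance 2σ².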
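The two variances compared on this slide can be derived in two lines. The sketch below assumes that each slide yields one log ratio with variance σ², that log ratios from different slides are independent, and that both designs use two slides; none of these assumptions is stated explicitly in the transcript.

```latex
\begin{align*}
\text{Direct (2 slides):}\quad
  & \operatorname{Var}\!\Big(\tfrac{1}{2}\big[\log(T/C)_1 + \log(T/C)_2\big]\Big)
    = \tfrac{1}{4}\,(\sigma^2 + \sigma^2) = \sigma^2/2 \\
\text{Indirect (2 slides):}\quad
  & \operatorname{Var}\!\big(\log(T/\mathrm{Ref}) - \log(C/\mathrm{Ref})\big)
    = \sigma^2 + \sigma^2 = 2\sigma^2
\end{align*}
```

Under these assumptions the direct design is four times as precise as the indirect one for the same number of slides.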

10 One-way layout: one factor, k levels. Designs: I) Common reference, II) Common reference, III) Direct comparison. Quantities compared: number of slides, average variance, units of material (A = B = C = 1 or A = B = C = 2).

11 One-way layout: one factor, k levels. Designs: I) Common reference, II) Common reference, III) Direct comparison.
Number of slides: I) N = 3; II) N = 6; III) N = 3
Ave. variance (equal number of slides): I) 2; III) 0.67
Units of material: I) A = B = C = 1; II) and III) A = B = C = 2
Ave. variance (equal units of material): II) 1; III) 0.67
For k = 3, efficiency ratio (Design I / Design III) = 3. In general, efficiency ratio = 2k / (k-1). (But may not be achievable due to lack of independence.)
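The average variances quoted for the one-way layout can be checked numerically. The sketch below is not from the slides: it assumes every slide gives one log ratio with unit variance, that slides are independent, and that estimates come from ordinary least squares. The function names and the dye orientation chosen for each slide are illustrative only.

```python
from itertools import combinations

import numpy as np


def pairwise_variances(samples, slides, targets):
    """Variance (in units of sigma^2) of each least-squares contrast a - b.

    samples: names of all mRNA samples, including any reference
    slides:  (red, green) pairs, one per slide; each measures log(red/green)
    targets: the samples whose pairwise differences are of interest
    """
    index = {name: i for i, name in enumerate(samples)}
    X = np.zeros((len(slides), len(samples)))
    for row, (red, green) in enumerate(slides):
        X[row, index[red]] = 1.0
        X[row, index[green]] = -1.0
    # X'X is singular (only differences are estimable), so use a pseudo-inverse.
    G = np.linalg.pinv(X.T @ X)
    result = {}
    for a, b in combinations(targets, 2):
        c = np.zeros(len(samples))
        c[index[a]], c[index[b]] = 1.0, -1.0
        result[(a, b)] = float(c @ G @ c)
    return result


def average_variance(samples, slides, targets):
    v = pairwise_variances(samples, slides, targets)
    return sum(v.values()) / len(v)


targets = ["A", "B", "C"]
design_I = [("A", "Ref"), ("B", "Ref"), ("C", "Ref")]                # 3 slides
design_II = design_I + [("Ref", "A"), ("Ref", "B"), ("Ref", "C")]    # 6 slides (assumed dye-swap pairs)
design_III = [("A", "B"), ("B", "C"), ("C", "A")]                    # 3 slides, loop

avg_I = average_variance(targets + ["Ref"], design_I, targets)
avg_II = average_variance(targets + ["Ref"], design_II, targets)
avg_III = average_variance(targets, design_III, targets)
print(round(avg_I, 2), round(avg_II, 2), round(avg_III, 2))
```

The same function accepts any other set of (red, green) pairs, so alternative layouts can be compared before slides are committed.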

12 Illustration from one experiment: box plots of log ratios for Design I (A, B, C each against Ref) and Design III (A, B, C compared directly). Direct still ahead.

13 Factorial experiments. Treated cell lines: CTL, OSM, EGF, OSM & EGF. Possible experiments: here interest is not in genes for which there is an O or an E (main) effect, but in those for which there is an O × E interaction, i.e. genes for which log(O&E/O) - log(E/C) is large or small.

14 2 x 2 factorial: some design options. Table entry: variance (assuming all log ratios uncorrelated). # Slides: N = 6. Design I) is indirect; designs II) to IV) are a balance of direct and indirect comparisons.
Design           I)     II)    III)   IV)
Main effect A    0.5    0.67   0.5    NA
Main effect B    0.5    0.43   0.5    0.3
Int A.B          1.5    0.67   1

15 Some Design Possibilities for Detecting Interaction. Samples: treated tumor cell lines at 4 time points (30 minutes, 1 hour, 4 hours, 24 hours). Question: Which genes contribute to the enhanced inhibitory effect of OSM when it is combined with EGF? Role of time? (Diagrams: Design A and Design B, each linking ctl, OSM, EGF and OSM & EGF.)

16 Combining Estimates. Different ways of estimating the same contrast, e.g. A compared to P: Direct = A-P; Indirect = (A-M) + (M-P), or (A-D) + (D-P), or -(L-A) - (P-L). How do we combine these? (Diagram nodes: L, P, V, D, M, A.)
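One common answer to "How do we combine these?" is to fit all slides jointly by least squares. The slide itself does not name a method, so the sketch below is only an illustration. It assumes three slides (A vs P, A vs M, M vs P) with equal, independent errors, and the log-ratio values are made up.

```python
import numpy as np

# Unknown parameters: (A - P) and (M - P), both on the log scale.
# Each row of X says how one slide's log ratio depends on them.
X = np.array([
    [1.0, 0.0],    # slide 1, A vs P: measures (A - P)
    [1.0, -1.0],   # slide 2, A vs M: measures (A - P) - (M - P)
    [0.0, 1.0],    # slide 3, M vs P: measures (M - P)
])
y = np.array([1.0, 0.4, 0.9])  # hypothetical observed log ratios

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a_minus_p = float(coef[0])

# The least-squares answer is a weighted average of the direct and indirect
# estimates, with weights inversely proportional to their variances
# (sigma^2 for the direct path, 2 * sigma^2 for the two-slide indirect path).
direct = y[0]
indirect = y[1] + y[2]
weighted = (2.0 * direct + indirect) / 3.0
print(a_minus_p, weighted)
```

With more samples and slides the design matrix simply gains rows and columns; the weighting of direct against indirect paths then falls out of the fit rather than being chosen by hand.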

17

18 Time Course Experiments Number of time points Which differences are of highest interest (e.g. between initial time and later times, between adjacent times) Number of slides available

19 Design choices in time series. Entry: variance.
                                       t vs t+1            t vs t+2      t vs t+3
                                       T1T2  T2T3  T3T4    T1T3  T2T4    T1T4      Ave
N=3  A) T1 as common reference         1     2     2       1     2       1         1.5
     B) Direct hybridization           1     1     1       2     2       3         1.67
N=4  C) Common reference               2     2     2       2     2       2         2
     D) T1 as common ref + more        .67   .67   1.67    .67   1.67    1         1.06
     E) Direct hybridization choice 1: 1.75 11.83
     F) Direct hybridization choice 2: 1.751.83
(Diagrams: one layout over T1, T2, T3, T4 for each design; design C also includes Ref.)

20 Replication Why? To reduce variability To increase generalizability What is it? Duplicate spots Duplicate slides Technical replicates Biological replicates

21 Technical Replicates: Labeling. 3 sets of self-self hybridizations. Data 1 and Data 2 were labeled together and hybridized on two slides separately; Data 3 were labeled separately. (Panels: Data 1, Data 2, Data 3.)

22 Sample Size. Variance of individual measurements (X). Effect size(s) to be detected (X). Acceptable false positive rate. Desired power (probability of detecting an effect of at least the specified size).
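For a single gene, the four ingredients listed on this slide combine in the textbook normal-approximation formula n = 2 (z_{1-α/2} + z_{power})² σ² / δ² per group. The sketch below applies that formula. It is a swapped-in standard approximation rather than anything taken from the slides, and the function name and example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def replicates_per_group(sd, effect, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-sided two-sample comparison.

    sd:     standard deviation of an individual measurement (log-ratio scale)
    effect: smallest difference in means that should be detectable
    alpha:  acceptable false positive rate for this one test
    power:  desired probability of detecting a difference of size `effect`
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) * sd / effect) ** 2)


# One gene tested at the 5% level, effect equal to one SD.
n_single = replicates_per_group(sd=1.0, effect=1.0)

# The same gene when the false positive rate is split over 6,000 genes
# (a Bonferroni-style threshold, used here only to show the trend).
n_many = replicates_per_group(sd=1.0, effect=1.0, alpha=0.05 / 6000)
print(n_single, n_many)
```

Because every gene has its own SD and effect size, a calculation like this gives at best a rough guide for a whole array.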

23 Extensibility “Universal” common reference for arbitrary undetermined number of (future) experiments Provides extensibility of the series of experiments (within and between labs) Linking experiments necessary if common reference source diminished/depleted

24 Summary Balance of direct and indirect comparisons Optimize precision of the estimates among comparisons of interest Must satisfy scientific and physical constraints of the experiment

25 Identifying Differentially Expressed Genes Goal: Identify genes associated with covariate or response of interest Examples: –Qualitative covariates or factors: treatment, cell type, tumor class –Quantitative covariate: dose, time –Responses: survival, cholesterol level –Any combination of these!

26 Workflow: Biological question → Experimental design → Microarray experiment → 16-bit TIFF files → Image analysis → (Rfg, Rbg), (Gfg, Gbg) → Normalization → R, G → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation. Example goals: differentially expressed genes, sample class prediction, etc.

27 Differentially Expressed Genes. Simultaneously test m null hypotheses, one for each gene j: H_j: no association between expression level of gene j and covariate/response. Combine expression data from different slides and estimate effects of interest. Compute test statistic T_j for each gene j. Adjust for multiple hypothesis testing.
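This slide does not say which multiple-testing adjustment is meant. As one widely used option, the sketch below applies Holm's step-down adjustment to a list of unadjusted p-values. It is an illustrative substitute, the p-values are invented, and the function name is hypothetical.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the original gene order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Enforce monotonicity: adjusted values never decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


example = holm_adjust([0.01, 0.04, 0.03])
print(example)
```

A gene is then called differentially expressed when its adjusted p-value falls below the chosen family-wise error rate.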

28 Test statistics. Qualitative covariates: e.g. two-sample t-statistic, Mann-Whitney statistic, F-statistic. Quantitative covariates: e.g. standardized regression coefficient. Survival response: e.g. score statistic for Cox model.

29 Example: Apo AI experiment (Callow et al., Genome Research, 2000) GOAL: Identify genes with altered expression in the livers of one line of mice with very low HDL cholesterol levels compared to inbred control mice Experiment: Apo AI knock-out mouse model 8 knockout (ko) mice and 8 control (ctl) mice (C57Bl/6) 16 hybridisations: mRNA from each of the 16 mice is labelled with Cy5, pooled mRNA from control mice is labelled with Cy3 Probes: ~6,000 cDNAs, including 200 related to lipid metabolism

30 Which genes have changed? This method can be used with replicated data: 1. For each gene and each hybridisation (8 ko + 8 ctl) use M = log2(R/G). 2. For each gene form the t-statistic $t = (\bar{M}_{ko} - \bar{M}_{ctl}) / \sqrt{s_{ko}^2/8 + s_{ctl}^2/8}$, where $\bar{M}_{ko}$ and $\bar{M}_{ctl}$ are the averages of the 8 ko and 8 ctl Ms, and $s_{ko}$ and $s_{ctl}$ are the corresponding SDs. 3. Form a histogram of the 6,000 t values. 4. Make a normal Q-Q plot; look for values “off the line”. 5. Adjust for multiple testing.
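Steps 1 to 3 can be sketched as follows. The data here are simulated, not the Apo AI measurements. The array shapes (genes in rows, hybridisations in columns), the function name and the single artificially shifted gene are all assumptions made for illustration.

```python
import numpy as np


def two_sample_t(M_ko, M_ctl):
    """Per-gene t-statistic with unpooled variances.

    M_ko, M_ctl: arrays of shape (genes, hybridisations) holding M = log2(R/G).
    """
    n_ko, n_ctl = M_ko.shape[1], M_ctl.shape[1]
    diff = M_ko.mean(axis=1) - M_ctl.mean(axis=1)
    # ddof=1 gives the sample variance, i.e. the squared SD used on the slide.
    se = np.sqrt(M_ko.var(axis=1, ddof=1) / n_ko + M_ctl.var(axis=1, ddof=1) / n_ctl)
    return diff / se


rng = np.random.default_rng(0)
M_ko = rng.normal(0.0, 0.3, size=(6000, 8))   # simulated: 6,000 genes, 8 ko mice
M_ctl = rng.normal(0.0, 0.3, size=(6000, 8))  # simulated: 6,000 genes, 8 ctl mice
M_ko[0] -= 3.0                                # hypothetical gene strongly reduced in ko

t = two_sample_t(M_ko, M_ctl)
counts, bin_edges = np.histogram(t, bins=50)  # step 3: histogram of the t values
print(int(np.argmin(t)), float(t.min()))
```

A normal Q-Q plot of `t` (step 4) would show the shifted gene far off the line, while the remaining simulated genes follow it closely.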

31 Histogram & Q-Q plot ApoA1

32 Plots of t-statistics

33 Assigning p-values to measures of change. Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics. For each possible permutation of the trt / ctl labels, compute the two-sample t-statistic t* for each gene. The unadjusted p-value for a particular gene is estimated by the proportion of t*’s whose absolute value exceeds that of the observed t.
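The permutation scheme for a single gene can be sketched as below. With 8 treated and 8 control mice there are 12,870 distinct label assignments; the toy example uses 4 + 4 values (70 assignments) so that it runs instantly. Two details are assumptions rather than statements from the slide: labelings are counted when |t*| is at least as large as the observed |t| (so the observed labeling counts itself), and the data are invented.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def t_stat(x, y):
    """Two-sample t-statistic with unpooled sample variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def permutation_pvalue(trt, ctl):
    """Proportion of relabelings whose |t*| is at least the observed |t|."""
    pooled = list(trt) + list(ctl)
    t_obs = abs(t_stat(trt, ctl))
    hits = total = 0
    # Enumerate every way of choosing which observations get the trt label.
    for chosen in combinations(range(len(pooled)), len(trt)):
        chosen_set = set(chosen)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        total += 1
        if abs(t_stat(g1, g2)) >= t_obs - 1e-12:  # small tolerance for float ties
            hits += 1
    return hits / total


# Hypothetical M values for one gene: the two groups are completely separated.
p = permutation_pvalue([5.0, 6.0, 7.0, 8.0], [1.0, 2.0, 3.0, 4.0])
print(p)
```

Here only the observed labeling and its mirror image reach the observed |t|, which shows that the smallest attainable p-value is limited by the number of possible relabelings.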

34 Apo AI: Adjusted and unadjusted p-values for the 50 genes with the largest absolute t-statistics

35 Genes with adjusted p-value ≤ 0.01

36 Single-slide methods Model-dependent rules for deciding whether (R,G) corresponds to a differentially expressed gene Amounts to drawing two curves in the (R,G)-plane; call a gene differentially expressed if it falls outside the region between the two curves At this time, not enough known about the systematic and random variation within a microarray experiment to justify these strong modeling assumptions n = 1 slide may not be enough (!)

37 Single-slide methods Chen et al: Each (R,G) is assumed to be normally and independently distributed with constant CV; decision based on R/G only (purple) Newton et al: Gamma-Gamma-Bernoulli hierarchical model for each (R,G) (yellow) Roberts et al: Each (R,G) is assumed to be normally and independently distributed with variance depending linearly on the mean Sapir & Churchill: Each log R/G assumed to be distributed according to a mixture of normal and uniform distributions; decision based on R/G only (turquoise)

38 Matt Callow’s Srb1 dataset (#8): Newton’s, Sapir & Churchill’s and Chen’s single-slide methods. Difficulty in assigning valid p-values based on a single slide.

39 Another example: Survival analysis with expression data Bittner et al. looked at differences in survival between the two groups (the ‘cluster’ and the ‘unclustered’ samples) ‘Cluster’ seemed to have longer survival

40 Kaplan-Meier Survival Curves, Bittner et al.

41 unclustered cluster Average Linkage Hierarchical Clustering, survival only

42 Kaplan-Meier Survival Curves, reduced grouping

43 Identification of genes associated with survival. For each gene j, j = 1, …, 3613, model the instantaneous failure rate, or hazard function, h(t) with the Cox proportional hazards model: $h(t) = h_0(t) \exp(\beta_j x_{ij})$, and look for genes with both a large effect size $\hat{\beta}_j$ and a large standardized effect size $\hat{\beta}_j / \mathrm{SE}(\hat{\beta}_j)$.
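For one gene, the effect size and its standardized version can be obtained by maximizing the Cox partial likelihood. The sketch below uses Newton-Raphson for a single covariate. It assumes no tied event times and uses made-up survival times and expression values; it is a minimal illustration, not the software used for the analysis on these slides. The function names are hypothetical.

```python
import math


def score_and_information(beta, times, events, x):
    """Score and observed information of the Cox partial log-likelihood."""
    score = info = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored observations only contribute through risk sets
        risk = [j for j in range(len(times)) if times[j] >= times[i]]
        w = [math.exp(beta * x[j]) for j in risk]
        s0 = sum(w)
        s1 = sum(wj * x[j] for wj, j in zip(w, risk))
        s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk))
        score += x[i] - s1 / s0                  # observed minus expected covariate
        info += s2 / s0 - (s1 / s0) ** 2         # weighted variance in the risk set
    return score, info


def fit_cox_single_gene(times, events, x, iterations=25):
    """Return (beta_hat, SE, beta_hat / SE) for one gene's expression values x."""
    beta = 0.0
    for _ in range(iterations):
        score, info = score_and_information(beta, times, events, x)
        beta += score / info                     # Newton-Raphson update
    _, info = score_and_information(beta, times, events, x)
    se = 1.0 / math.sqrt(info)
    return beta, se, beta / se


# Hypothetical data: 8 patients, one censored (event = 0), one gene.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 0, 1, 1, 1]
x = [2.0, 1.5, 0.5, 1.8, 0.2, 1.0, -0.5, 0.3]

beta_hat, se, z = fit_cox_single_gene(times, events, x)
print(beta_hat, se, z)
```

Running this for each of the genes in turn and ranking by both `beta_hat` and `z` mirrors the selection rule stated on the slide.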

44

45 Findings. Top 5 genes by this method are not in the Bittner et al. ‘weighted gene list’. Why? The weighted gene list is based on the entire sample; our method only used half. The weighting relies on the Bittner et al. cluster assignment. Other possibilities?

46 Limitations of Single Gene Tests May be too noisy in general to show much Do not reveal coordinated effects of positively correlated genes Hard to relate to pathways

47 Some ideas for further work Expand models to include more genes and possibly two-way interactions Nonparametric tree-based subset selection – would require much larger sample sizes

48 Acknowledgements Sandrine Dudoit Jane Fridlyand Yee Hwa (Jean) Yang Debashis Ghosh Erin Conlon Ingrid Lonnstedt Terry Speed

