1
Theoretical and experimental comparisons of gene expression indexes for oligonucleotide microarrays
William J. Lemon, Jeffrey J.T. Palatini, Ralf Krahe, Fred A. Wright
Department of Biostatistics, University of North Carolina, Chapel Hill
Division of Human Cancer Genetics, Ohio State University
2
Measuring gene expression with the Affymetrix GeneChip
Perfect Match (PM): 25 bases complementary to a region of the gene
Mismatch (MM): the middle base is different
(Diagram labels: coding portion of gene X, polyA)
cRNA from sample mRNA is put on the chip
Intensity of binding reflects gene expression
3
Reproducibility of Probe Sensitivities
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
6
The Li-Wong Model
Li, C and Wong, WH, Proc. Natl. Acad. Sci. USA, 98:31-36, 2001.
Two forms: Li-Wong Full (LWF) and Li-Wong Reduced (LWR), with an identifiability constraint
Indices: ith array, jth probe pair; total number of probe pairs
Parameters: expression (per array), sensitivities (per probe pair)
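For reference, the model can be written out as below. This is a sketch reconstructed from the cited Li and Wong (2001) paper rather than from the slide itself, so the notation is an assumption: theta_i is the expression on array i, phi_j the sensitivity of probe pair j, nu_j and alpha_j the baseline and nonspecific response of probe pair j, and J the total number of probe pairs.

```latex
% Li-Wong Full (LWF): PM and MM intensities are modeled separately.
\begin{align*}
MM_{ij} &= \nu_j + \theta_i \alpha_j + \epsilon_{ij}, \\
PM_{ij} &= \nu_j + \theta_i \alpha_j + \theta_i \phi_j + \epsilon'_{ij}.
\end{align*}

% Li-Wong Reduced (LWR): only the PM - MM differences are modeled.
\[
PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij},
\qquad \epsilon_{ij} \sim N(0, \sigma^2).
\]

% Identifiability constraint: fixes the scale shared by theta and phi.
\[
\sum_{j=1}^{J} \phi_j^2 = J.
\]
```

Without the constraint, multiplying every theta_i by a constant and dividing every phi_j by the same constant would leave the fit unchanged.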
7
How to compare gene expression indexes?
We get maximum likelihood estimates of expression using either the full data (LWF) or the reduced data (LWR)
The Affymetrix software computes: Average Difference (AD), Log-Average (LA)
The log-average might perform particularly poorly. Note that if terms are small and error variance is small,
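To make the comparison concrete, here is a minimal sketch of an LWR fit next to an untrimmed average difference. It assumes the reduced model y[i, j] = theta[i] * phi[j] + error with i.i.d. normal errors, in which case the maximum likelihood fit is the least-squares rank-1 fit and can be read off the leading singular vectors. The function names `fit_lwr` and `average_difference` are hypothetical, and the outlier handling used by the Li-Wong procedure and by the Affymetrix software is deliberately left out.

```python
import numpy as np

def fit_lwr(y):
    """Least-squares fit of y[i, j] ~ theta[i] * phi[j] with sum(phi**2) == J.

    y is an (arrays x probe pairs) matrix of PM - MM differences.
    The best rank-1 approximation of y comes from its leading singular
    triplet; the result is rescaled to meet the identifiability constraint.
    """
    y = np.asarray(y, dtype=float)
    n_probes = y.shape[1]
    u, s, vt = np.linalg.svd(y, full_matrices=False)
    phi = vt[0] * np.sqrt(n_probes)             # now sum(phi**2) == J
    theta = u[:, 0] * s[0] / np.sqrt(n_probes)  # absorbs the remaining scale
    if phi.sum() < 0:                           # resolve the sign ambiguity
        phi, theta = -phi, -theta
    return theta, phi

def average_difference(y):
    """Untrimmed average of PM - MM over probe pairs, one value per array."""
    return np.asarray(y, dtype=float).mean(axis=1)
```

On noiseless data generated from the model, `fit_lwr` returns the generating theta and phi, while `average_difference` returns theta scaled by the mean sensitivity.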
8
We gain insight by assuming the Li-Wong model is true. Then what are the consequences?
For large sample sizes, the probe-specific parameters will be well estimated
9
Compare the LW estimators directly.
Comparing to AD is tricky, but with a correction factor AD is also an unbiased estimate of expression.
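The comparison with AD can be sketched numerically. The correction factor on the slide was not preserved, so this example assumes the reduced model with known sensitivities phi, under which the average difference has expectation theta times the mean of phi, and dividing by that mean gives an unbiased estimate. The function names are hypothetical, and `sigma2` stands for the error variance of a single PM - MM difference.

```python
import numpy as np

def var_lwr(phi, sigma2):
    """Variance of the least-squares estimate of theta when phi is known."""
    phi = np.asarray(phi, dtype=float)
    return sigma2 / np.sum(phi ** 2)

def var_corrected_ad(phi, sigma2):
    """Variance of AD / mean(phi), which is unbiased for theta under LWR."""
    phi = np.asarray(phi, dtype=float)
    return sigma2 / (phi.size * phi.mean() ** 2)

def re_lwr_vs_ad(phi):
    """Relative efficiency of LWR over corrected AD (values > 1 favor LWR)."""
    return var_corrected_ad(phi, 1.0) / var_lwr(phi, 1.0)
```

Under these assumptions the ratio equals mean(phi**2) / mean(phi)**2, so the two estimators tie only when all probe pairs are equally sensitive, and LWR gains more the more the sensitivities vary.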
10
This also gives insight into “perfect match only” analyses through the relative efficiency RE(full, PM-only).
Furthermore, PM-only is always at least twice as efficient as LWR.
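One way to arrive at these statements is sketched below. The slide's own formula was not preserved, so this derivation rests on assumptions that are not in the source: the probe parameters nu_j, alpha_j and phi_j of the Li-Wong full model are treated as known, PM and MM errors are independent with common variance sigma^2, and alpha_j and phi_j are non-negative.

```latex
% Variance of the least-squares estimate of \theta for one array,
% under the assumptions stated in the lead-in.
\begin{align*}
\operatorname{Var}(\hat\theta_{\mathrm{LWR}})
  &= \frac{2\sigma^2}{\sum_j \phi_j^2}
  && \text{(each difference } PM - MM \text{ has error variance } 2\sigma^2\text{)} \\
\operatorname{Var}(\hat\theta_{\mathrm{PM}})
  &= \frac{\sigma^2}{\sum_j (\alpha_j + \phi_j)^2}
  && \text{(regress } PM_j - \nu_j \text{ on } \alpha_j + \phi_j\text{)} \\
\operatorname{Var}(\hat\theta_{\mathrm{LWF}})
  &= \frac{\sigma^2}{\sum_j \alpha_j^2 + \sum_j (\alpha_j + \phi_j)^2}
  && \text{(use both } MM_j - \nu_j \text{ and } PM_j - \nu_j\text{)}
\end{align*}

% Relative efficiencies as ratios of variances.
\begin{align*}
RE(\text{full}, \text{PM-only})
  &= \frac{\operatorname{Var}(\hat\theta_{\mathrm{PM}})}{\operatorname{Var}(\hat\theta_{\mathrm{LWF}})}
   = 1 + \frac{\sum_j \alpha_j^2}{\sum_j (\alpha_j + \phi_j)^2}, \\
RE(\text{PM-only}, \text{LWR})
  &= \frac{\operatorname{Var}(\hat\theta_{\mathrm{LWR}})}{\operatorname{Var}(\hat\theta_{\mathrm{PM}})}
   = \frac{2 \sum_j (\alpha_j + \phi_j)^2}{\sum_j \phi_j^2} \;\ge\; 2.
\end{align*}
```

The last inequality holds because (alpha_j + phi_j)^2 is at least phi_j^2 when both terms are non-negative, which matches the factor of two stated on the slide.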
11
Empirical Comparisons
We propose that an expression index is “good” if it has a high correlation with the underlying true expression (which is usually unknown).
This correlation can be estimated using a specially designed mixing experiment.
If r is the correlation coefficient between the measured index and true expression, the “relative efficiency” of two indexes can be estimated as the ratio of their r^2/(1 - r^2) values.
12
Suppose a given gene has a true underlying expression level.
Consider two indices of gene expression, each an unbiased estimate of that true expression.
And we have
13
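Can we estimate this relative efficiency?
Suppose we could do a regression of an index on the true expression.
The ratio of explained to residual variance in the model can be shown to be r^2/(1 - r^2), and similarly for the other index, so the relative efficiency can be estimated from the two correlations.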
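The arithmetic can be sketched in a few lines. In a simple linear regression the explained share of variance is r^2 and the residual share is 1 - r^2, so their ratio is r^2/(1 - r^2). The function names and the argument names `r_a` and `r_b` are hypothetical.

```python
def explained_to_residual(r):
    """Ratio of explained to residual variance for a correlation r."""
    r_squared = r * r
    if r_squared >= 1.0:
        raise ValueError("ratio is undefined for a perfect correlation")
    return r_squared / (1.0 - r_squared)

def relative_efficiency(r_a, r_b):
    """Estimated efficiency of index A relative to index B.

    r_a and r_b are the correlations of each index with true expression.
    Values above 1 favor index A.
    """
    return explained_to_residual(r_a) / explained_to_residual(r_b)
```

Because the ratio grows quickly as r approaches 1, a modest gain in correlation can correspond to a large gain in relative efficiency.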
14
Can we estimate r without ever knowing the true expressions?
Yes, with a specially designed mixing experiment.
We seek two contrasting conditions in which many genes will be differentially expressed.
15
Experimental Design
Cells: human fibroblasts (GM 08330); 6 replicates for each condition
Cell culture: 20% FBS, 5 passages
Serum starvation (0.1% FBS) and serum stimulation (20% FBS); durations shown on the diagram: 48h, 24h
Harvest total RNA (Stimulated, Starved); produce the 50:50 group
Add bacterial control genes: Dap, Thr, Lys, Phe (diagram labels: "Lys, Phe", "Dap, Thr", "50:50")
Produce duplicates each day for 3 d
Synthesize cDNA, cRNA; fragment
Add hybridization control genes: BioB, BioC, BioD, Cre
Hybridize to HuGeneFL
Data reduction; gene expression indexes
16
BIN1 expression (groups: Stim, 50:50, Starved)
True expression in the 50:50 group = average of Stim and Starved
17
BIN1 expression (groups coded on the axis: Stim = 1, 50:50 = 2, Starved = 3)
18
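Note that true expression is linear in X, where X = 1, 2, 3 (say) for Stim, 50:50 and Starved, respectively, because the true expression of the 50:50 group is the average of the other two. The correlation of an index with X therefore equals, up to sign, its correlation with true expression.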
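A minimal sketch of how this coding can be used to estimate r for every gene without knowing the true expression. It assumes a genes x arrays matrix of index values with the 18 arrays ordered as 6 Stim, 6 50:50 and 6 Starved; that ordering, the matrix layout and the function names are assumptions made for illustration.

```python
import numpy as np

# Group code for each array: 6 Stim (1), 6 50:50 (2), 6 Starved (3).
X = np.repeat([1.0, 2.0, 3.0], 6)

def gene_correlations(index_values, x=X):
    """Pearson r between each gene's index values (one row per gene) and x.

    A gene whose index is constant across arrays yields nan.
    """
    v = np.asarray(index_values, dtype=float)
    v_centered = v - v.mean(axis=1, keepdims=True)
    x_centered = x - x.mean()
    denom = np.sqrt((v_centered ** 2).sum(axis=1) * (x_centered ** 2).sum())
    return (v_centered @ x_centered) / denom

def median_signal_to_noise(index_values, x=X):
    """Median over genes of r**2 / (1 - r**2).

    Genes with |r| == 1 contribute inf; nan rows should be filtered first.
    """
    r = gene_correlations(index_values, x)
    return np.median(r ** 2 / (1.0 - r ** 2))
```

Computing `median_signal_to_noise` once per expression index and taking ratios gives one estimate of relative efficiency across genes. The sign of r only records whether a gene is higher in Stim or in Starved, which is why r is squared.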
19
Mean probe intensity per array (groups: Stim, 50:50, Starved)
Overall intensity is higher in Stimulated
20
Coefficients of variation for assay (individual probes) and gene expression indexes
21
Correlation matrix of 18 arrays as a colorized image for each expression index (panels: LWF, AD, LWR, LA; arrays grouped as Stim, 50:50, Starved).
22
Comparing Models: Cluster Analysis
Panels: Affymetrix Log Ave, Full Model, Reduced Model, Affymetrix Ave Diff
Leaf orders of the four dendrograms:
Strv 1, Strv 4, Strv 2, Strv 5, Strv 3, Strv 6, 50:50 3, 50:50 5, 50:50 4, 50:50 2, 50:50 1, 50:50 6, Stim 4, Stim 6, Stim 5, Stim 3, Stim 1, Stim 2
Strv 1, Strv 3, Strv 2, Strv 6, Strv 5, Strv 4, Stim 1, Stim 6, Stim 3, Stim 5, Stim 4, 50:50 5, 50:50 4, 50:50 3, 50:50 2, 50:50 1, 50:50 6
Strv 3, Strv 4, Strv 6, Strv 5, Strv 2, Strv 1, Stim 2, Stim 1, Stim 4, Stim 5, Stim 6, Stim 3, 50:50 5, 50:50 4, 50:50 2, 50:50 1, 50:50 6, 50:50 3
Strv 2, Strv 3, Strv 1, Strv 6, Strv 5, Strv 4, Stim 2, Stim 4, 50:50 1, Stim 1, Stim 6, Stim 3, Stim 5, 50:50 3, 50:50 5, 50:50 4, 50:50 2, 50:50 6
23
Relative Efficiency: Median(r^2/(1 - r^2)) for LWF, LWR, AD, LA (panels: Unscaled, Scaled)
24
Correlation of duplicate measurements of 149 genes
LWF: median r = .74
LWR: median r = .43
AD: median r = .08
LA: median r = .17
25
Number of unexpressed genes
Only 0.2% of the LW estimates are negative
The 50:50 group has the fewest negative estimates
Could this indicate very few unexpressed genes?
(Figure panels: Stim, 50:50, Starved)
26
A conservative approach to estimating the number of unexpressed genes
Let U denote the number of unexpressed genes
Genes are ranked according to expression index
This is useful if we can get a random sample of unexpressed genes
(Figure labels: Unexpressed population; Gene expression index)
27
We use the spiked-out bacterial control genes as a sample of “unexpressed” genes
The 4 genes are represented 3 times each (different portions of mRNA), for a total of 12 probe sets
Based on this reasoning, we estimate that greater than 88% of the genes are expressed, even in the Starved samples
28
Rank of expression index variance across the 6 Stimulated arrays versus rank of index mean (panels: AD, LWF; highlighted genes: truly absent in the Stim group)
Very low estimated expression for truly absent genes when using LWF
29
Present/absent calls
We use the statistic to declare genes present/absent (absolute call)
We find the vast majority of genes on the array appear to be present
For the spiked in/out genes, we find vastly improved present/absent calling using LW estimates
30
Absolute Call ROC curve: spiked in/out genes
Curves: LWF-Z, LWR-Z, Untrimmed AD, Untrimmed LA, LA, AD
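An ROC curve of this kind can be traced from any per-gene score and a known present/absent label, as in the sketch below. The score is left generic here: the "-Z" labels suggest an estimate divided by its standard error, but that reading is an assumption, as are the function names. Tied scores are not grouped, so each gene is treated as its own threshold.

```python
import numpy as np

def roc_points(scores, is_present):
    """False and true positive rates as the calling threshold is lowered.

    scores: larger values mean stronger evidence that the gene is present.
    is_present: known truth, e.g. spiked-in (True) or spiked-out (False).
    Both classes must occur at least once.
    """
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(is_present, dtype=bool)
    order = np.argsort(-scores, kind="stable")   # highest score first
    truth = truth[order]
    true_pos = np.cumsum(truth)
    false_pos = np.cumsum(~truth)
    tpr = np.concatenate([[0.0], true_pos / truth.sum()])
    fpr = np.concatenate([[0.0], false_pos / (~truth).sum()])
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

An area near 1 means the score ranks nearly all present genes above the absent ones, while an area near 0.5 means the score is close to uninformative.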
31
Variability in estimates: log(variance) versus log(mean) for the Full Model and the Reduced Model (groups: Stim, 50:50, Starved)
32
Conclusions
Model-based estimators are superior to simple averaging
The full model is superior to the reduced model
This does not necessarily mean that the mismatch probes are a good idea, but if they are present we should use them
We have demonstrated this using both analytic considerations and experimental data
A carefully designed experiment can be used to address many issues
Many more genes may be expressed than previously thought
33
Other issues / future work
Spiking genes might be used to calibrate and normalize arrays
The relationship between variance and mean of expression indexes may be useful in planning experiments
Our data may be useful for future work, especially in producing indexes that are resistant to probe saturation
All primary data, this PowerPoint presentation and a preprint are available at http://thinker.med.ohio-state.edu