Summarization of Oligonucleotide Expression Arrays BIOS 691-803 Winter 2010.

1 Summarization of Oligonucleotide Expression Arrays BIOS 691-803 Winter 2010

2 What is Summarization? Some expression arrays (Affymetrix, NimbleGen) use multiple probes to target a single transcript: a ‘probe set’. Typically the probes show different fold changes between any two samples. How can the information in a probe set be summarized effectively?

3 Many Probes for One Gene. [Figure: gene sequence, 5´ to 3´, with multiple oligo probes; Perfect Match and Mismatch probes.] How to combine signals from multiple probes into a single gene abundance estimate?

4 Probe Variation. Individual probes don’t agree on fold changes. Probes vary by two orders of magnitude on each chip; GC content is the most important factor in signal strength. [Figure: signal from 16 probes along one gene on one chip.]

5 Probe Measure Variation. Typical probes differ by two orders of magnitude! GC content is the most important factor. RNA target folding also affects hybridization. [Figure: axis range 0 to 3x10^4.]

6 Bioinformatics Issues. Probes may not map accurately. Probes may contain SNPs. Affymetrix places most probes in the 3’ UTR of genes; alternate poly-A sites mean that some probe targets may really be less common than others.

7 Probe Mapping. Early builds of the genome often confused regions or genes and their complements. The probe sets at right (figure) are those for an rRNA gene and its complement.

8 Alternate Poly-Adenylation Sites. Poly-A marks the mRNA ‘tail’. Many genes have alternative sites. The 3’ UTR may be longer or shorter.

9 Alternate Polyadenylation of MID1

10 Many Approaches to Summarization. Affymetrix Microarray Suite; PLIER. dChip (Li and Wong, HSPH). Bioconductor: RMA (Bolstad, Irizarry, Speed, et al.); affyPLM (Bolstad); gcRMA (Wu). Physical chemistry models (Zhang et al.). Factor model. Probe-weighting.

11 Critique of Averaging (MAS5). It is not clear what an average of different probes should mean. The Tukey biweight can be unstable when data cluster at either end, which is frequently the case here. There is no ‘learning’ based on the cross-chip performance of individual probes.
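
To make the instability point concrete, here is a minimal sketch of a one-step Tukey biweight location estimate in plain Python. The function name, the tuning constant c = 5 and the small eps term are assumptions made for illustration, not a transcription of the MAS5 code. The example at the end shows how the estimate can swing between the two clusters when a single value changes sides.

```python
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight location estimate (illustrative sketch)."""
    values = list(values)
    center = median(values)
    # Median absolute deviation from the median, used as the scale.
    mad = median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + eps)
        # Points far from the median (|u| >= 1) get zero weight.
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical log-scale probe values clustered at two ends.
low_heavy = tukey_biweight([1.0, 1.0, 1.0, 9.0, 9.0])
high_heavy = tukey_biweight([1.0, 1.0, 9.0, 9.0, 9.0])
print(low_heavy, high_heavy)
```

With these made-up values, moving one probe from the low cluster to the high cluster moves the summary from one cluster to the other, which is the kind of behaviour the slide warns about.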

12 Motivation for multi-chip models: probe-level data from a spike-in study (log scale); note the parallel trend of all probes. Courtesy of Terry Speed.

13 Model for Probe Signal. Each probe signal is proportional to (i) the amount of target sample, a, and (ii) the affinity of the specific probe sequence to the target, f. NB: High affinity is not the same as specificity; a probe can give high signal to its intended target and also to other transcripts. [Figure: probes 1, 2, 3 with affinities f1, f2, f3, measured on chip 1 and chip 2 with amounts a1 and a2.]

14 Multiplicative Model. For each gene, there is a set of probes $p_1, \ldots, p_k$. Each probe $p_j$ binds the gene with efficiency $f_j$. In each sample $i$ there is an amount $a_i$. Probe intensity should be proportional to $f_j \times a_i$. Always some noise!
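
The multiplicative model can be sketched with a small simulation. Everything named here (the function `simulate_probe_set`, the amounts, the affinities, the log-normal noise) is a made-up illustration rather than real array data. With the noise switched off, every probe reports the same log fold change between two chips even though the raw intensities differ widely between probes.

```python
import math
import random

def simulate_probe_set(amounts, affinities, noise_sd=0.0, seed=0):
    """Return intensities[i][j] = a_i * f_j * multiplicative noise."""
    rng = random.Random(seed)
    table = []
    for a in amounts:
        # Log-normal noise is an assumption; noise_sd=0 gives the exact model.
        row = [a * f * math.exp(rng.gauss(0.0, noise_sd)) for f in affinities]
        table.append(row)
    return table

amounts = [100.0, 400.0]        # hypothetical target amounts on chips 1 and 2
affinities = [0.5, 2.0, 8.0]    # hypothetical probe affinities
pm = simulate_probe_set(amounts, affinities)

# Log2 fold change between chip 2 and chip 1, probe by probe.
log_fold = [math.log2(pm[1][j] / pm[0][j]) for j in range(len(affinities))]
print(log_fold)
```

Raising `noise_sd` above zero makes the per-probe fold changes disagree, which is the situation the summarization methods have to handle.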

15 Robust Linear Models. Criterion of fit: least median of squares; sum of weighted squares; least squares after throwing out outliers. Method for finding the fit: high-dimensional search; iteratively reweighted least squares; median polish.

16 For each probe set, take the log of $PM_{ij} = a_i f_j$, then fit the model $\log(\widehat{PM}_{ij}) = \log a_i + \log f_j + \varepsilon_{ij}$, where the caret represents “after pre-processing”. Fit this additive model by iteratively reweighted least squares or median polish (Bolstad, Irizarry, Speed: RMA). Critique: the model assumes probe noise is constant (homoscedastic) on the log scale.
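
A minimal sketch of median polish on a log-intensity table may help here. It assumes rows are chips and columns are probes, and that background correction and normalization have already been done. The function name and the toy numbers are illustrative and are not taken from the RMA implementation.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-8):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by median polish."""
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [x - row_med[i] for x in resid[i]]
        shift = median(col_eff)
        overall += shift
        col_eff = [c - shift for c in col_eff]
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median([resid[i][j] for i in range(n_rows)])
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        overall += shift
        row_eff = [r - shift for r in row_eff]
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    # Per-chip expression estimate: overall level plus chip (row) effect.
    chip_expr = [overall + r for r in row_eff]
    return chip_expr, col_eff, resid

# Toy log intensities: 3 chips x 3 probes, with one outlying probe on chip 1.
log_pm = [[4.0, 5.0, 20.0],
          [5.0, 6.0, 8.0],
          [7.0, 8.0, 10.0]]
chip_expr, probe_eff, resid = median_polish(log_pm)
print(chip_expr, probe_eff)
```

In this toy table the outlying value ends up in the residuals rather than shifting the chip estimate, which is the reason for preferring a robust fit over plain least squares.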

17 Comparing Measures. 20 replicate arrays: variance should be small. Standard deviations of expression estimates on arrays, arranged in four groups of genes by increasing mean expression level. Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA. Courtesy of Terry Speed.

18 Background. 25-mers are prone to cross-hybridization. MM > PM for about 1/3 of all probes. Cross-hybridization varies with GC content. Signal intensity varies with cross-hybridization.

19 The gcRMA Approach. Estimate non-specific binding using either a true null assay (non-homologous RNA) or estimates from MM. Subtract background before normalization and fitting the model.

20 Evaluating gcRMA. On AffyComp data sets (replicates with 14 spike-ins done by Affy), gcRMA wins. Many investigators get crappy results (and don’t write them up). gcRMA does very well on highly expressed genes, not nearly so well on less expressed genes. Gharaibeh et al., BMC Bioinformatics 2008, 9:452.

21 Factor Model. Assume a relation between the p observations x and the true value z: $x = z + \varepsilon$, where the $\varepsilon_i$ are independent. Use factor analytic methods for estimation. This depends on assuming z ~ Normal. It differs from RMA in relaxing the assumption of IID errors: some probes can have more random error than others.

22 Weighting Probes. It is clear that some probes are more reliable than others. How to assess this in a simple fashion? If a gene really changes across arrays, then a responsive probe will change more than a noisy probe. Weight by relative ranges. Best performance on AffyComp!
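
One way to read "weight by relative ranges" is sketched below: each probe's weight is its range across arrays divided by the sum of all probe ranges in the set, and each array's summary is the weighted mean of its log intensities. This is only an illustration of the idea under that assumption, not the published DFW procedure. The function name and data are hypothetical, and probe-specific offsets are deliberately left uncorrected.

```python
def range_weighted_summary(log_table):
    """log_table[i][j]: log intensity of probe j on array i."""
    n_probes = len(log_table[0])
    # Range of each probe (column) across all arrays.
    ranges = [max(col) - min(col) for col in zip(*log_table)]
    total = sum(ranges)
    if total == 0:
        # No probe changes at all: fall back to equal weights.
        weights = [1.0 / n_probes] * n_probes
    else:
        weights = [r / total for r in ranges]
    summaries = [sum(w * x for w, x in zip(weights, row)) for row in log_table]
    return summaries, weights

# Hypothetical probe set: probe 1 responds across arrays, probe 2 is flat.
log_table = [[1.0, 5.0],
             [3.0, 5.0]]
summaries, weights = range_weighted_summary(log_table)
print(summaries, weights)
```

Here the flat probe receives zero weight, so the summary follows the responsive probe alone.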

23 Summary and Evaluation. No one best solution for all situations. gcRMA and DFW seem to do very well on AffyComp data; DFW may need weights by tissue. Leading methods seem to rely on probe weighting.

