1
Statistical Methods in Microarray Data Analysis Mark Reimers, Genomics and Bioinformatics, Karolinska Institute
2
Four Recent Contributions Exploratory graphics Multiple comparisons corrections –Randomization-based significance tests Normalization –loess normalization for cDNA microarray Models for probe-level Affymetrix data –Robust estimation
3
Multiple comparisons: Under the null hypothesis, each gene has a 5% chance of exceeding the threshold at a p-value of 0.05 –Type I error. With 10,000 genes on a chip, about 500 genes should exceed the 0.05 threshold by chance alone.
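The arithmetic on this slide can be checked with a short sketch. The function names are illustrative, and the second function assumes the tests are independent, which the slide does not state.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes passing the threshold when no gene has changed."""
    return n_genes * alpha


def prob_at_least_one(n_genes, alpha):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_genes


n_false = expected_false_positives(10_000, 0.05)   # about 500 genes
fwer = prob_at_least_one(10_000, 0.05)             # essentially 1
print(n_false, fwer)
```

With 10,000 genes a false positive somewhere on the chip is practically certain, which is why a per-gene threshold of 0.05 needs correcting.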
4
Corrections to p-Value Bonferroni correction –$p_i^* = N p_i$ if $N p_i < 1$, otherwise 1 –Too conservative! Sidak –$p_i^* = 1 - (1 - p_i)^N$ –Still conservative if genes are co-regulated (correlated)
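A minimal sketch of the two single-step corrections, applied to a made-up list of p-values (the data and function names are illustrative only):

```python
def bonferroni(pvals):
    """p* = N p, capped at 1."""
    n = len(pvals)
    return [min(1.0, n * p) for p in pvals]


def sidak(pvals):
    """p* = 1 - (1 - p)^N."""
    n = len(pvals)
    return [1.0 - (1.0 - p) ** n for p in pvals]


pvals = [0.0001, 0.004, 0.03, 0.5]   # hypothetical raw p-values for four genes
bonf = bonferroni(pvals)
sid = sidak(pvals)
for p, b, s in zip(pvals, bonf, sid):
    print(f"raw={p:<7} bonferroni={b:.6f} sidak={s:.6f}")
```

The Sidak value is never larger than the Bonferroni value, but for small p-values the two are nearly identical.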
5
Step-Down p-Values p-values for many genes: $p_1, \ldots, p_N$ Order the smallest k as $p_{(1)}, \ldots, p_{(k)}$ How likely are we to get k p-values this small by chance? An improvement in power over single-step procedures
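The slide does not say which step-down procedure is meant. As one well-known instance, here is a sketch of Holm's step-down adjustment; randomization-based step-down methods follow the same ordering idea but estimate the null distribution by permutation. The data are made up.

```python
def holm_step_down(pvals):
    """Holm's step-down adjusted p-values, returned in the original gene order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])   # indices, smallest p first
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (N - k + 1), not by N.
        candidate = min(1.0, (n - rank) * pvals[i])
        # Enforce monotonicity: adjusted values never decrease down the list.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


pvals = [0.03, 0.0001, 0.5, 0.004]   # hypothetical raw p-values
adj = holm_step_down(pvals)
print(adj)
```

Only the smallest p-value is multiplied by the full N, so every adjusted value is at most its Bonferroni counterpart; that is where the gain in power comes from.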
6
Quantile Plot Plot sample t-scores against t-scores under random hypothesis Statistically significant genes stand out [Figure: sample t-scores against corresponding quantiles of the t-distribution; changed genes marked]
7
Volcano Plot Displays both biological importance and statistical significance [Figure axes: log2(fold change) against log2(p-value) or t-score]
8
Normalization: Comparing Chips Measures differ consistently between chips due to: –Different amounts of RNA –Hybridization conditions –Scanner settings –Murphy’s Law Normalization: compensate for systematic technical differences in measurement process Re-scaling to mean or median leaves strong evidence of systematic technical variation
9
Normalization: Signal Distributions Distributions of log intensity of all probes among a set of 21 replicate chips Each color represents probe density on one chip Re-scaling would shift distribution shape to right or left on this plot
10
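Quantile Normalization Formula: $x_{norm} = F_2^{-1}(F_1(x))$, where $F_1$ is the distribution function of the raw data and $F_2$ is the reference distribution function Assumes: gene distribution changes little [Figure: density function and distribution function of the raw data and of the reference distribution]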
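A minimal sketch of quantile normalization: each value is replaced by the reference value at the same rank, which is the slide's formula with empirical distribution functions. The slide does not say how the reference distribution is chosen; this sketch assumes the common choice of averaging the sorted intensities across chips, and it ignores ties. The data are made up.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists of intensities, one list per chip."""
    n = len(chips[0])
    sorted_chips = [sorted(c) for c in chips]
    # Assumed reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(sc[k] for sc in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda i: chip[i])   # probe indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]   # x_norm = F2^{-1}(F1(x))
        normalized.append(out)
    return normalized, reference


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [3.0, 4.0, 8.0]]   # hypothetical data
norm, reference = quantile_normalize(chips)
print(norm)
```

After normalization every chip has exactly the same set of values; only the assignment of values to probes differs between chips.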
11
Visible Effect of Quantile Norm. Ratio-Intensity plots are straightened as a by-product
12
Current Work Hybridization reaction varies across some chips –Very common on cDNA –10%-20% of well-done Affy chips Synthetic image of ratio of individual probes to their median across chips: Yellow areas show ratios more than twice those of red areas
13
Models: Many Probes for One Gene [Figure: gene sequence, 5´ to 3´, with multiple oligo probes, each a Perfect Match / Mismatch pair] How to combine signals from multiple probes into a single gene abundance estimate?
14
Probe Variation Individual probes don’t agree on fold changes Probes vary by two orders of magnitude on each chip –CG content is most important factor in signal strength Signal from 16 probes along one gene on one chip
15
Models for Multiple Probes Issues: –Accuracy – does the model give accurate estimates of relative gene expression, when this is known? –Noise – what is the variance of replicates? –Theoretical basis – do we understand why we are doing what we do? Statistical experience with methodology Theory of hybridization process underlying observations
16
Three Competing Models Affymetrix MicroArray Suite –versions 4 and 5 dChip –Li and Wong, HSPH Bioconductor: affy package (RMA) –Bolstad, Irizarry, Speed, et al.
17
Model 1: MicroArray Suite – Version 4 GeneChip® older software uses Avg.diff $= \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$, with A a set of suitable pairs chosen by software –30%-40% of probe differences can be negative
18
Model 2: MicroArray Suite – Version 5 MicroArray Suite version 5 uses signal = Tukey Biweight{$\log(PM_j - MM_j^*)$} MM* is an adjusted MM that is never bigger than PM Tukey biweight is a robust average procedure with weights: $f(x) = \frac{c^2}{6}\left[1 - \left(1 - x^2/s^2\right)^3\right]$, $|x| < c$ For this (typical) example, it is not clear what the average would mean [Figure: PM-MM values for probe pairs]
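A sketch of a one-step Tukey biweight average, to show how such a robust average discounts an aberrant probe pair. The tuning constant c = 5 and the small eps guarding against a zero scale are assumed values, the scale is taken to be the median absolute deviation, and the input values are made up. This is not the MAS 5 code.

```python
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step biweight location estimate around the median."""
    m = median(values)
    mad = median([abs(v - m) for v in values])     # robust scale estimate
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)              # scaled distance from the median
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0   # far points get weight 0
        num += w * v
        den += w
    return num / den


# Hypothetical log(PM - MM*) values for one probe set, with one aberrant pair.
values = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]
robust = tukey_biweight(values)
plain = sum(values) / len(values)
print(robust, plain)
```

The aberrant value receives weight zero, so the robust average stays near 1 while the plain mean is pulled up to 2.5.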
19
Linear Models Extension of linear regression Essential features: –variance constant –errors independent –Small number of factors combine in algebraic form to give levels frequently additive
20
Model for Probe Signal Each probe signal is proportional to –i) the amount of target sample –ii) the hybridization efficiency of the specific probe sequence to the target –Each probe has a specific affinity to its gene target NB: Sensitivity need not imply Specificity [Figure: signals of probes 1, 2, 3 on chip 1 and chip 2]
21
Robust Statistics Outlier: a measure that is far beyond the typical random variation –common in biological measures –10-15% in Affy probe sets Robust methods try to fit the majority of data points –Issue is to identify which points to down-weight or ignore Median is very robust – but inefficient –Trimmed means are almost as robust and much more efficient
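A small illustration of the robustness claim, using ten made-up measurements of which one (10%, in line with the outlier rate quoted above) is an outlier. The 20% trimming proportion is an arbitrary choice for the sketch.

```python
from statistics import mean, median


def trimmed_mean(values, proportion=0.2):
    """Mean after dropping the given proportion of points from each end."""
    xs = sorted(values)
    k = int(len(xs) * proportion)
    return mean(xs[k:len(xs) - k])


# Hypothetical replicate measurements with a single outlier (55.0).
values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 55.0]
plain = mean(values)
trimmed = trimmed_mean(values)
mid = median(values)
print(plain, trimmed, mid)
```

The plain mean is dragged to about 14.5 by the single outlier, while the trimmed mean and the median both stay at about 10.05. The trimmed mean averages six points rather than the two central ones, which is the source of its better efficiency.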
22
Robust Linear Models Criterion of fit –Least median squares –Sum of weighted squares –Least squares and throw out outliers Method for finding fit –High-dimensional search –Iteratively re-weighted least squares –Median Polish
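One of the fitting methods listed, iteratively re-weighted least squares, can be sketched for a straight-line fit. The Huber weight function, its constant k = 1.345, and the MAD-based scale are assumptions made for the sketch; the slide names the method but not these choices. The data are made up.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def irls_line(x, y, k=1.345, n_iter=20):
    """Refit repeatedly, down-weighting points with large residuals (Huber weights)."""
    w = [1.0] * len(x)
    for _ in range(n_iter):
        a, b = weighted_line_fit(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        scale = median([abs(r) for r in resid]) / 0.6745   # MAD-based scale
        scale = max(scale, 1e-12)                          # guard against zero scale
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return weighted_line_fit(x, y, w)


x = [float(i) for i in range(10)]
y = [2.0 + 3.0 * xi for xi in x]
y[5] += 50.0                                   # one gross outlier
ols = weighted_line_fit(x, y, [1.0] * len(x))  # ordinary least squares
robust = irls_line(x, y)
print(ols, robust)
```

Ordinary least squares is pulled off the line by the outlier; the re-weighted fit returns to an intercept near 2 and a slope near 3.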
23
Why Robust Models for GeneChips? 10% - 15% of individual signals in a probe set deviate greatly from pattern Often outliers lie close together Causes: –Scratches –Proximity to heating elements –Uneven fluid flow
25
Li & Wong (dChip) Model: $PM_{ij} = \theta_i \phi_j + \epsilon_{ij}$ –Original model (dChip 1.0) used $PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$ by analogy with Affy MAS 4 Outlier removal: –Identify extreme residuals –Remove –Re-fit –Iterate Distribution of errors $\epsilon_{ij}$ assumed independent of signal strength
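A sketch of fitting a multiplicative model of this form by alternating least squares, writing the chip factor as theta_i and the probe factor as phi_j. These symbol names, the constraint that the squared phi_j sum to the number of probes, and the data are assumptions made for the sketch; the outlier-removal loop is omitted. This is not dChip's code.

```python
def fit_multiplicative(pm, n_iter=50):
    """Fit pm[i][j] ~ theta[i] * phi[j] by alternating least squares."""
    n_chips, n_probes = len(pm), len(pm[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_chips
    for _ in range(n_iter):
        # Given phi, each theta_i is a least-squares slope through the origin.
        denom = sum(p * p for p in phi)
        theta = [sum(pm[i][j] * phi[j] for j in range(n_probes)) / denom
                 for i in range(n_chips)]
        # Given theta, each phi_j is fitted the same way.
        denom = sum(t * t for t in theta)
        phi = [sum(pm[i][j] * theta[i] for i in range(n_chips)) / denom
               for j in range(n_probes)]
        # Assumed identifiability constraint: sum of phi_j^2 equals n_probes.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return theta, phi


true_theta = [100.0, 200.0, 400.0]            # hypothetical expression levels
true_phi = [0.5, 1.0, 2.0, 4.0]               # hypothetical probe affinities
pm = [[t * p for p in true_phi] for t in true_theta]
theta, phi = fit_multiplicative(pm)
print([t / theta[0] for t in theta])          # fold changes relative to chip 1
```

Only the products and the ratios between chips are identifiable; the constraint merely fixes how the overall scale is split between the two sets of parameters.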
26
Robust Multi-chip Analysis Each probe responds roughly linearly –over a moderate range –some probes are outliers Linear Model: –signal = $\theta_i \phi_j + \epsilon$; $\theta_i$ amount of transcript in sample i; $\phi_j$ amplification of probe j Robust Fit: –identify outliers by heuristic – remove –standard robust method – iteratively re-weighted least squares
27
For each probe set, re-write $PM_{ij} = \theta_i \phi_j$ as: $\log(PM_{ij}) = \log(\theta_i) + \log(\phi_j)$ Fit this additive model by iteratively re-weighted least-squares or median polish In practice, RMA (Bolstad, Irizarry, Speed) fits the additive model to nlog($PM_{ij}$), where nlog() stands for logarithm after normalization NB. Now homoscedastic on log scale
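A sketch of median polish applied to a log2 table with rows as chips and columns as probes. The data are made up, background correction and normalization are skipped, and the function is an illustration, not the Bioconductor implementation.

```python
from math import log2
from statistics import median


def median_polish(table, n_iter=10):
    """Decompose table[i][j] into overall + row[i] + col[j] + residual[i][j]."""
    rows, cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * rows, [0.0] * cols
    for _ in range(n_iter):
        for i in range(rows):                     # sweep out row medians
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        for j in range(cols):                     # sweep out column medians
            m = median([resid[i][j] for i in range(rows)])
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


amounts = [100.0, 200.0, 400.0]                   # hypothetical transcript amounts
affinities = [0.5, 1.0, 2.0, 4.0, 8.0]            # hypothetical probe affinities
pm = [[a * f for f in affinities] for a in amounts]
pm[1][4] *= 16.0                                  # one outlying probe signal
log_pm = [[log2(v) for v in row] for row in pm]
overall, chip_eff, probe_eff, resid = median_polish(log_pm)
print([overall + c for c in chip_eff])            # log2 expression per chip
```

The outlying cell ends up in the residuals instead of distorting the chip effects, so the estimated difference between chips still reflects the twofold steps in the made-up amounts.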
28
It Makes a Difference Two fairly consistent genes in each of 71 samples MAS 5 values dChip values
29
Models Compared on Gene Variance Std Dev of gene measures from 20 replicate arrays Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA Courtesy of Terry Speed [Figure x-axis: abundance, low to high]
30
Improvement in Models Affymetrix Suite gets better every year –MAS 7 is expected to be a multi-chip model MAS 5.0 estimation does a reasonable job on probe sets that are bright –Metabolic and structural genes –These are most often reported in papers dChip and RMA do better on genes that are less abundant –Signalling proteins –transcription factors
31
Expression Comparison 1 – MAS 4 Courtesy of Terry Speed Ratio-Intensity Plot comparing two chips from spike-in experiment White dots represent unchanged genes Red numbers flag spike-in genes
32
Expression Comparison 2 – MAS 5 Courtesy of Terry Speed [Figure: t-scores against theoretical t-distribution; changed genes marked]
33
Expression Comparison 3 – Li-Wong Courtesy of Terry Speed
34
Expression Comparison 4 - RMA Courtesy of Terry Speed
35
Current Work: Improving the Model How to use the MM information profitably –Combine estimates from PM and MM probes? Assessments of probe quality Accurate estimates of probe background Normalization method based on 2-d loess to correct spatial inhomogeneity
36
Relation Between PM and MM Across One Experiment Set Colored symbols are one probe
37
Probe Specific Background Horizontal lines represent probes; colored symbols correspond to arrays After subtracting individual backgrounds, ratios between corresponding arrays are more consistent between probes Fitted Data Probe BG subtracted
38
Where Are We? Affymetrix almost finished? –Probe variation ~40% => gene variation ~ 10% –RMA gives ~20% Work to be done: –Systematic biases for cDNA arrays –Platform reconciliation –Using QC and variation measures for individual probes in combined expression measures Frontiers: –Image analysis
39
Near Term Work to be Done New hybridization technologies for measuring gene expression Protein chips –More complex cross-hybridization Other high-throughput technologies –e.g. RNAi chips –Cell arrays Using sequence information to understand cross-hybridization
40
Integrated Analysis Integrating statistical measures of data uncertainty in machine-learning techniques for network analysis Statistical inference for pathways and gene ontology categories Robust data analysis to mine for genome-scale patterns in expression
41
Acknowledgements KI –Karin Dahlman –Yudi Pawitan –Arief Gusnanto –Lennie Fredriksson Berkeley –Terry Speed –Ben Bolstad Johns Hopkins –Rafael Irizarry
44
Affymetrix Arrays Over 400,000 different probes complementary to genetic information of interest Each probe cell or feature contains millions of copies of a specific oligonucleotide probe Single stranded, fluorescently labeled DNA target [Figure: GeneChip Probe Array, image of hybridized probe array, and hybridized probe cell with oligonucleotide probes; dimension labels 1.28 cm and 20 µm]
45
Evidence for Spatial Variation Synthetic Image of Affy chip
46
Loess Normalization for Areas Fit two-parameter loess smoother with 5-10 df