Published by Jessie Franklin. Modified over 8 years ago.
1
Transformations: what for? Which one?
Wolfgang Huber, Div. Molecular Genome Analysis, DKFZ Heidelberg
2
Microarray intensities $x_1, \dots, x_n$
o Log-ratio with/without background correction
o Shrunken log-ratio (BHM)
o Variance stabilized log-ratio (= generalized log-ratio, “glog”)
How do you like to think about (interpret) it? How do you estimate the parameters? What comes out (the “bottom line”)?
3
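ratios and fold changes
Fold changes are useful to describe continuous changes in expression. Example: intensities of 1000, 1500, 3000 in conditions A, B, C correspond to fold changes of x1.5 and x3 relative to A.
But what if the gene is “off” (below detection limit) in one condition? Example: intensities of 0, 200, 3000 in conditions A, B, C; fold changes: ?, ?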
4
ratios and fold changes
Many interesting genes will be off in some of the conditions of interest.
1. If you want the expression measure (“net normalized spot intensity”) to be an unbiased estimator of abundance: many values ≤ 0; need something more than the (log-)ratio.
2. If you let the expression measure be biased (always > 0): can keep ratios. How do you choose the bias?
5
ratios and fold changes
Ratios are scale-free. But there is (at least) one absolute scale in the data. Can we use this to construct useful functions?
6
In the following:
o How to compare microarray intensities with each other?
o How to incorporate measurement uncertainty (“variance”)?
o How to simultaneously and consistently deal with calibration (“normalization”)?
7
Sources of variation
o amount of RNA in the biopsy
o efficiencies of
- RNA extraction
- reverse transcription
- labeling
- fluorescent detection
o probe purity and length distribution
o spotting efficiency, spot size
o cross-/unspecific hybridization
o stray signal
Systematic (calibration)
o similar effect on many measurements
o corrections can be estimated from data
Stochastic (error model)
o too random to be explicitly accounted for
o remain as “noise”
8
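modeling ansatz
measured intensity = offset + gain × true abundance: $y_{ik} = a_{ik} + b_{ik} x_{ik}$
offset: $a_{ik} = a_i + \varepsilon_{ik}$
o $a_i$: per-sample offset
o $\varepsilon_{ik} \sim N(0, b_i^2 s_1^2)$: “additive noise”
gain: $b_{ik} = b_i b_k \exp(\eta_{ik})$
o $b_i$: per-sample normalization factor
o $b_k$: sequence-wise probe efficiency
o $\eta_{ik} \sim N(0, s_2^2)$: “multiplicative noise”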
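A minimal simulation sketch of the ansatz "measured intensity = offset + gain × true abundance" with additive and multiplicative noise, for a single sample and a single probe. The function names and all parameter values (`a`, `b`, `s1`, `s2`) are illustrative assumptions, not values from the talk; the probe efficiency is taken as 1.

```python
import math
import random

def simulate_intensity(x, a=50.0, b=1.0, s1=20.0, s2=0.2, rng=random):
    """Draw one intensity y = a + eps + b * exp(eta) * x.

    eps ~ N(0, (b * s1)^2) plays the role of the additive noise,
    eta ~ N(0, s2^2) the role of the multiplicative noise.
    All default parameter values are assumptions for illustration.
    """
    eps = rng.gauss(0.0, b * s1)
    eta = rng.gauss(0.0, s2)
    return a + eps + b * math.exp(eta) * x

def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

rng = random.Random(1)  # fixed seed so the run is reproducible
sd_by_abundance = {}
for x in (0.0, 100.0, 10000.0):
    ys = [simulate_intensity(x, rng=rng) for _ in range(20000)]
    sd_by_abundance[x] = sample_sd(ys)
    print(x, round(sd_by_abundance[x], 1))
```

At zero abundance the spread is set by the additive term alone; at high abundance it grows roughly in proportion to the abundance, which is the behaviour the two noise terms are meant to capture.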
9
The two-component model
Figure panels: raw scale, log scale. Annotations: “additive” noise, “multiplicative” noise.
B. Durbin, D. Rocke, JCB 2001
10
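variance stabilizing transformations
$X_u$ a family of random variables with $\mathrm{E}X_u = u$, $\mathrm{Var}X_u = v(u)$.
Define $f(x) = \int^x \frac{du}{\sqrt{v(u)}}$; then $\mathrm{Var} f(X_u)$ is approximately independent of $u$.
Derivation: linear approximation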
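The linear-approximation argument can be written out as follows. This is the standard delta-method sketch, assuming $f$ is differentiable and $X_u$ is concentrated near its mean $u$; it is not copied from the slide.

```latex
\begin{align*}
f(X_u) &\approx f(u) + f'(u)\,(X_u - u)
  && \text{(first-order expansion around } u\text{)} \\
\operatorname{Var} f(X_u) &\approx f'(u)^2 \operatorname{Var} X_u = f'(u)^2\, v(u) \\
f'(u) &= \frac{1}{\sqrt{v(u)}}
  \;\Longrightarrow\; \operatorname{Var} f(X_u) \approx 1
  \;\Longrightarrow\; f(x) = \int^x \frac{du}{\sqrt{v(u)}}
\end{align*}
```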
11
variance stabilizing transformations
Figure: $f(x)$ plotted against $x$.
12
variance stabilizing transformations
1.) constant variance (‘additive’)
2.) constant CV (‘multiplicative’)
3.) offset
4.) additive and multiplicative
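For each of the four cases, a transformation follows from the rule $f(x) = \int^x du / \sqrt{v(u)}$ once a variance function is chosen. The variance functions below, and the symbols $s$, $c$, $u_0$, are assumptions made for this sketch; the formulas on the original slide did not survive extraction.

```latex
\begin{align*}
\text{1.) } v(u) &= s^2
  &&\Rightarrow\; f(u) = \frac{u}{s} \\
\text{2.) } v(u) &= c^2 u^2
  &&\Rightarrow\; f(u) = \frac{1}{c}\,\log u \\
\text{3.) } v(u) &= c^2 (u + u_0)^2
  &&\Rightarrow\; f(u) = \frac{1}{c}\,\log(u + u_0) \\
\text{4.) } v(u) &= c^2 (u + u_0)^2 + s^2
  &&\Rightarrow\; f(u) = \frac{1}{c}\,\operatorname{arsinh}\!\left(\frac{c\,(u + u_0)}{s}\right)
\end{align*}
```

Each line can be checked by differentiating $f$ and confirming that $f'(u) = 1/\sqrt{v(u)}$.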
13
the “glog” transformation
dashed line: $f(x) = \log(x)$
solid line: $h_s(x) = \mathrm{asinh}(x/s)$
P. Munson, 2001; D. Rocke & B. Durbin, ISMB 2002
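A small sketch comparing differences of asinh(x/s) with the ordinary log-ratio. The function names and the value s = 10 are assumptions chosen for illustration; in practice s would be estimated from the data.

```python
import math

def glog(x, s):
    """Generalized log h_s(x) = asinh(x / s); defined for every real x."""
    return math.asinh(x / s)

def glog_ratio(x1, x2, s):
    """Difference of glog values, the analogue of a log-ratio."""
    return glog(x1, s) - glog(x2, s)

s = 10.0  # assumed scale parameter, for illustration only

# Far above s, the glog difference is close to the natural log-ratio.
high = glog_ratio(3000.0, 1000.0, s)
print(high, math.log(3000.0 / 1000.0))

# At zero or negative intensities the log-ratio is undefined,
# but the glog difference stays finite.
low = glog_ratio(200.0, 0.0, s)
negative = glog(-4.8, s)
print(low, negative)
```

For x much larger than s, asinh(x/s) behaves like log(2x/s), so the constant cancels in differences; near zero the function is approximately linear, which is why zero and negative net intensities cause no trouble.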
14
parameter estimation
o maximum likelihood estimator: straightforward, but sensitive to deviations from normality
o model holds for genes that are unchanged; differentially transcribed genes act as outliers
o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression
o works as long as <50% of genes are differentially transcribed
15
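Least trimmed sum of squares regression
minimize
- least sum of squares: $\sum_{i=1}^{n} r_i^2$
- least trimmed sum of squares: $\sum_{i=1}^{h} r_{(i)}^2$, where $r_{(1)}^2 \le \dots \le r_{(n)}^2$ are the ordered squared residuals and $h < n$
P. Rousseeuw, 1980s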
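A minimal sketch of least trimmed squares for a straight-line fit, to illustrate the idea of fitting only the h observations with the smallest squared residuals. It uses random two-point starts followed by repeated "refit on the h best points" steps; this is an illustrative heuristic and not the estimator used in the talk. All names, the toy data, and the settings (`keep`, `n_starts`, `n_steps`) are assumptions, and the x values are assumed to be distinct.

```python
import random

def ols_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def trimmed_objective(xs, ys, a, b, h):
    """Sum of the h smallest squared residuals."""
    sq = sorted((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return sum(sq[:h])

def lts_line(xs, ys, keep=0.75, n_starts=50, n_steps=20, seed=0):
    """Heuristic LTS line fit: random starts plus refitting on the best h points."""
    rng = random.Random(seed)
    n = len(xs)
    h = max(2, int(keep * n))
    best = None
    for _ in range(n_starts):
        idx = rng.sample(range(n), 2)
        if xs[idx[0]] == xs[idx[1]]:
            continue  # a vertical pair cannot define a line
        for _ in range(n_steps):
            a, b = ols_line([xs[i] for i in idx], [ys[i] for i in idx])
            order = sorted(range(n), key=lambda i: (ys[i] - a - b * xs[i]) ** 2)
            idx = order[:h]  # keep the h points closest to the current line
        a, b = ols_line([xs[i] for i in idx], [ys[i] for i in idx])
        obj = trimmed_objective(xs, ys, a, b, h)
        if best is None or obj < best[0]:
            best = (obj, a, b)
    return best[1], best[2]

# Toy data: 30 points on y = 2 + 0.5 x, plus 10 outliers at y = 100.
xs = [float(i) for i in range(40)]
ys = [2.0 + 0.5 * x if x < 30 else 100.0 for x in xs]

ols_a, ols_b = ols_line(xs, ys)
lts_a, lts_b = lts_line(xs, ys)
print("OLS:", ols_a, ols_b)
print("LTS:", lts_a, lts_b)
```

In this toy case the outliers pull the ordinary least-squares line away from the bulk of the data, while the trimmed fit follows the majority, which mirrors the role of differentially transcribed genes as outliers.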
16
evaluation: effects of different data transformations
Figure axes: difference red-green, rank(average)
17
Normality: QQ-plot
18
evaluation: sensitivity / specificity in detecting differential abundance
o Data: paired tumor/normal tissue from 19 kidney cancers, in color flip duplicates on 38 cDNA slides of 4000 genes each.
o 6 different strategies for normalization and quantification of differential abundance
o Calculate for each gene & each method: t-statistics, permutation-p
o For threshold $\alpha$, compare the number of genes the different methods find, $\#\{p_i \mid p_i \le \alpha\}$
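A sketch of how a per-gene t-statistic and permutation p-value could be computed for paired data, followed by counting genes under a threshold. The talk does not specify the permutation scheme; random sign-flipping of the paired differences is an assumption here, as are the function names and the toy numbers.

```python
import math
import random

def t_statistic(d):
    """One-sample t-statistic of paired differences d (mean over standard error)."""
    n = len(d)
    m = sum(d) / n
    var = sum((v - m) ** 2 for v in d) / (n - 1)
    return m / math.sqrt(var / n)

def sign_flip_p(d, n_perm=2000, seed=0):
    """Two-sided permutation p-value obtained by random sign flips of d."""
    rng = random.Random(seed)
    t_obs = abs(t_statistic(d))
    hits = 0
    for _ in range(n_perm):
        flipped = [v if rng.random() < 0.5 else -v for v in d]
        if abs(t_statistic(flipped)) >= t_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Toy paired differences (tumor minus normal) for two hypothetical genes.
d_changed = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0]
d_unchanged = [0.3, -0.4, 0.1, -0.2, 0.5, -0.3, 0.2, -0.1]

p_values = [sign_flip_p(d_changed), sign_flip_p(d_unchanged)]
alpha = 0.05  # assumed threshold
n_found = sum(p <= alpha for p in p_values)
print(p_values, n_found)
```

Repeating the count for each normalization strategy at the same threshold gives the kind of comparison the slide describes.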
19
evaluation: comparison of methods
more accurate quantification of differential expression; higher sensitivity / specificity
Figure panels: one-sided test for up, one-sided test for down
20
evaluation: a benchmark for Affymetrix genechip expression measures
o Data:
- Spike-in series: from Affymetrix, 59 x HGU95A, 16 genes, 14 concentrations, complex background
- Dilution series: from GeneLogic, 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts
o Benchmark: 15 quality measures regarding
- reproducibility
- sensitivity
- specificity
Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu
21
affycomp results (28 Sep 2003)
Figure legend: good, bad
22
ROC curves
23
Stratification (in collaboration with R. Irizarry)
position- and sequence-specific effects $w_i(s)$: Naef et al., Phys Rev E 68 (2003)
24
glog versus "sliding z-score"
25
Availability
o implementation in R
o open source package vsn on www.bioconductor.org
o Bioconductor is an international collaboration on open source software for bioinformatics and statistical omics
26
What to do with the gene lists: the functional genomics pipeline @ DKFZ
o High-throughput transcriptome sequencing: clones with unannotated full length ORFs
o functional characterization
o Neoplastic diseases: association of mRNA profiles with
- genetic aberrations
- histopathology
- clinical behavior
27
HT functional assays (S. Wiemann, D. Arlt)
Library of "unknown" transcripts; expression clone; GFP-ORF protein; BrdU incorporation
DAPI: identification; CFP: expression; BrdU: proliferation
Image segmentation and quantification
proliferation: + activator, - inhibitor
automated microscope: Rainer Pepperkok, EMBL
28
Detection of modulators of cell proliferation (Dorit Arlt)
Image panels: DAPI, ORF-YFP, Anti-BrdU/Cy5, overlay YFP - Cy5
Measurement of fluorescence intensities by automated image analysis (values shown on the images: 68.5, 231.6, 80.9, -4.8)
YFP channel   Cy5 channel
72.0          761.0
71.0          684.1
119.7         779.0
87.3          820.2
149.5         645.6
70.2          536.1
84.7          799.5
103.1         912.8
81.0          916.7
2621.8        267.6
74.1          766.2
156.8         866.6
169.0         819.8
105.5         757.7
156.0         367.8
76.5          746.2
135.2         731.2
86.2          567.3
77.7          896.3
92.6          1095.4
104.6         633.3
481.2         567.7
539.0         663.9
95.0          726.2
156.7         842.1
SMP Cell
29
Statistical analysis of cellular assay data
Figure labels: activation, inhibition; transfected cells, control cells
detect transfection effect: inh ($p = 10^{-8}$)
Plate summary plot
30
Cellular assays: challenges for statisticians
o Image analysis: pattern recognition, classification
o Low-level analysis: what are good models for calibration, “normalization”, data transformation?
o High-level analysis: models for the dependence of cellular processes on over-/underexpression of genes; connect results from different assays, microarray data
31
Summary
o log-ratio (log): what about genes that are not expressed in some of the conditions of interest?
o generalized log-ratio (h): a useful extrapolation
- interpretability
- sensitivity
- specificity
- computational convenience
o what to do with the gene lists? systematic (high throughput) functional assays
32
Acknowledgements
DKFZ Heidelberg, Molecular Genome Analysis: Annemarie Poustka, Holger Sültmann, Andreas Buneß, Markus Ruschhaupt, Katharina Finis, Jörg Schneider, Klaus Steiner, Stefan Wiemann, Dorit Arlt
MPI Molekulare Genetik: Anja von Heydebreck, Martin Vingron
Uni Heidelberg: Günther Sawitzki
DFCI Harvard: Robert Gentleman
UMC Leiden: Judith Boer
RZPD: Anke Schroth, Bernd Korn
EMBL: Urban Liebel
...and many more!
33
Models are never correct, but some are useful
Figure panels: true relationship; model: linear dependence; model: quadratic dependence
34
variance stabilization
Scales: raw scale, log, glog
Corresponding contrasts: difference, log-ratio, generalized log-ratio
Variance: constant part, proportional part
35
ratio compression
Yue et al. (Incyte Genomics), NAR 29 (2001), e41