Bioinformatics Expression profiling and functional genomics Part I: Preprocessing Ad 29/10/2006
Course material: course notes + powerpoint files Exercises
Overview: MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
– Exercises
Gene expression
[Figure: DNA is transcribed into mRNA (start labelled +1); mRNA is translated into protein.]
Gene expression: adaptation of a cell to its environment
[Figure: bacterial cell receiving Signal 1 and Signal 2; FNR box and the genes cytN, cytO, cytQ, cytP.]
– Adaptation of a cell: response to environmental signals, e.g. to hormones (cell differentiation)
– The cellular response is determined by the genes that are switched on upon a signal
Gene expression: the action of genetic networks underlies the observed phenotypic behavior.
Functional genomics Structural Genomics Comparative Genomics
Omics era
Traditional molecular biology:
– Directed toward understanding the role of a particular gene or protein in a molecular biological process
– Northern analysis
– Mutational analysis
– Expression by reporter fusions
Omics era:
– Measurement of the expression of thousands of genes or proteins simultaneously
– The function or the expression of a gene is studied in the global context of the cell
– Holistic approaches allow a better understanding of fundamental molecular biological processes, because a gene does not act on its own: it is always embedded in a larger network (systems biology)
Omics era: transcriptomics
[Figure: reference sample and test sample -> RNA -> cDNA -> detection of the reference and test signals.]
proteomics Omics era
metabolomics Omics era
SYSTEMS BIOLOGY Consider the cell as a system Omics era
SYSTEMS BIOLOGY Mechanistic insight in the biological system at molecular biological level High throughput data Omics era
Omics era: bioinformatics
– Analysis of such large-scale data is no longer trivial => computational challenges:
  – Low signal-to-noise ratio
  – High dimensionality
– Simple spreadsheet analysis (e.g. in Excel) is no longer sufficient; more advanced data mining procedures become necessary
– Another urgent problem is how to store and organize all the information
Overview: MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
  – Principle of microarray
  – Applications
– Experiment design
– Preprocessing
– Exercises
Transcript profiling
– Previously: measure the expression level of one gene at a time (Northern blot analysis)
– Novel techniques: measure the expression level of all genes simultaneously => EXPRESSION PROFILING
– Principle: hybridisation. Complementary strands hybridize, i.e. stick together:
  mRNA: 5’ -UGACCUGACG- 3’
  cDNA: 3’ -ACTGGACTGC- 5’
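The pairing rule behind hybridisation can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the strands are treated as plain strings, with the cDNA written 3'->5' so that it lines up base by base with the mRNA read 5'->3'.

```python
# Watson-Crick pairing between an mRNA base and the DNA base on the cDNA strand.
PAIRING = {"A": "T", "U": "A", "G": "C", "C": "G"}


def complementary_cdna(mrna_5_to_3: str) -> str:
    """Return the complementary cDNA, written 3'->5' (aligned with the mRNA)."""
    return "".join(PAIRING[base] for base in mrna_5_to_3.upper())


def hybridizes(mrna_5_to_3: str, cdna_3_to_5: str) -> bool:
    """True when every aligned position forms a Watson-Crick pair."""
    return complementary_cdna(mrna_5_to_3) == cdna_3_to_5.upper()


print(complementary_cdna("UGACCUGACG"))  # ACTGGACTGC
```

Real hybridisation tolerates some mismatches depending on the washing conditions; the exact-match test above ignores that.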
Transcript profiling
Monitor molecular activities on a global level:
– Protein levels (proteomics)
– Enzyme activities
– Metabolites
– Gene expression (mRNA): transcriptomics = transcript profiling
This gives a general insight into the global behavior of the cell (holistic).
Molecular biological methods:
– RT-PCR
– SAGE
– Protein arrays
– Microarray analysis
Transcript profiling: cDNA array
– Spotted cDNA on a glass slide
– Upscaled Northern hybridisation
[Figure: gene (DNA) -> transcript (mRNA) -> cDNA.]
Transcript profiling: preparation of probes
– Collect cDNA clones
– Amplify the target cDNA insert by PCR
– Check yield and specificity by electrophoresis
– Spot the PCR products on glass slides
Transcript profiling: experimental workflow (for Signal 1 and Signal 2)
1. Cell culture
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis => numerical value
Transcript profiling
Transcript profiling: superimposed color image
– Transform the two channel scans into color images
– Superimpose the color images from the R and G channels
[Figure: examples of good alignment and bad alignment.]
Transcript profiling: superimposed color image
– Black spot: gene expressed neither in the test nor in the control sample
– Green spot: gene expressed only in the control sample
– Red spot: gene expressed only in the test sample
– Yellow spot: gene expressed in both the test and the control sample
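The color code can be written as a small decision rule. The sketch below assumes that the test sample is read in the red channel and the control in the green channel, as on the slide, and that a single hypothetical `threshold` separates "expressed" from "not expressed"; in practice that cutoff depends on the scanner and the background.

```python
def spot_color(test_red: float, control_green: float, threshold: float) -> str:
    """Classify a spot by which channels exceed the (assumed) expression threshold."""
    red_on = test_red > threshold
    green_on = control_green > threshold
    if red_on and green_on:
        return "yellow"  # expressed in test and control
    if red_on:
        return "red"     # expressed in test only
    if green_on:
        return "green"   # expressed in control only
    return "black"       # expressed in neither sample


for red, green in [(900, 50), (50, 900), (900, 900), (50, 50)]:
    print(red, green, spot_color(red, green, threshold=100))
```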
Transcript profiling: image analysis
– The signal intensity is proportional to the amount of cDNA present in the sample
– Signal Cy3 -> numerical value; signal Cy5 -> numerical value
– The numerical values are the input for data analysis
Data representation Gene profile Experiment profile
Transcript profiling: spotted DNA microarray; high density oligonucleotide array
Experiment design
The mathematical approach depends on the experimental design:
– Comparison of 2 samples (black/white)
– Comparison of multiple arrays
  – Global dynamic profiling
  – Static experiment: comparison of samples (mutants, patients)
Experiment design: 2 sample design
Type 1: comparison of 2 samples (control sample versus induced sample)
– Statistical testing
– Retrieve the statistically over- or underexpressed genes
Experiment design: black/white experiment (Latin square)
Description (array V, mouse genes):
– Condition 1: pygmy mouse, 10 days old (test)
– Condition 2: normal mouse, 10 days old (ref)
– Goal: detect differentially expressed genes
Layout:
– Array 1: Condition 1 with dye 1 (replica L and replica R); Condition 2 with dye 2 (replica L and replica R)
– Array 2: Condition 2 with dye 1 (replica L and replica R); Condition 1 with dye 2 (replica L and replica R)
Per gene and per condition, 4 measurements are available.
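With 4 measurements per gene and per condition, a per-gene test statistic can be computed directly. The sketch below uses Welch's t statistic on log-transformed intensities; this is one possible choice, not necessarily the test used in the course, and the numbers are made up for illustration. Converting t into a p value needs the t distribution, which is left out to keep the example in the standard library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(test, ref):
    """Welch's t statistic for two groups of (log) intensities.

    Raises ZeroDivisionError when both groups have zero variance.
    """
    standard_error = sqrt(variance(test) / len(test) + variance(ref) / len(ref))
    return (mean(test) - mean(ref)) / standard_error


# Hypothetical log2 intensities of one gene: 4 measurements per condition.
test_values = [2.0, 2.2, 1.8, 2.0]
ref_values = [1.0, 1.1, 0.9, 1.0]
print(welch_t(test_values, ref_values))
```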
Experiment design: multiple array design
– Measure the expression of all genes over time (dynamic profile) or in different conditions
– Identify coexpressed genes (clustering)
– Identify the mechanism of coregulation (motif finding)
Experiment design: multiple array design
Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al., 1998):
– 15 time points (E=18); time points 90 and 100 min deleted (Zhang et al., 1999; Tavazoie et al., 1999)
– Original dataset: 6178 genes
– Preprocessing: select the 4634 most variable genes (the 75 % most variable), then variance normalize
– Adaptive quality based clustering (32 clusters) (95%)
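The two preprocessing steps named here, selecting the most variable genes and variance normalizing each profile, can be sketched as follows. The function names, the use of the population variance as the ranking criterion, and the toy profiles are assumptions for illustration.

```python
from statistics import mean, pstdev, pvariance


def most_variable(profiles: dict, keep_fraction: float) -> dict:
    """Keep the given fraction of genes with the highest variance over the experiments."""
    ranked = sorted(profiles, key=lambda gene: pvariance(profiles[gene]), reverse=True)
    n_keep = round(len(ranked) * keep_fraction)
    return {gene: profiles[gene] for gene in ranked[:n_keep]}


def variance_normalize(profile):
    """Rescale one gene profile to mean 0 and variance 1 (profile must not be constant)."""
    m, s = mean(profile), pstdev(profile)
    return [(x - m) / s for x in profile]


toy = {"flat": [1, 1.1, 1, 0.9], "cyc": [0, 2, 0, 2], "mid": [1, 2, 1, 2], "low": [1, 1, 1, 1.01]}
selected = most_variable(toy, keep_fraction=0.5)
print({gene: variance_normalize(p) for gene, p in selected.items()})
```

After variance normalization, genes with the same profile shape but different amplitude become identical, which is what a clustering on coexpression needs.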
Experiment design: reference design (e.g. Spellman dataset)
– Reference: unsynchronized cells
– Condition: synchronized cells sampled at distinct time intervals during the cell cycle
– Each array pairs one condition (dye 1) with the same reference (dye 2):
  Array 1: Condition 1, dye 1, replica L | Condition 19, dye 2, replica L
  Condition 2, dye 1, replica L | Condition 19, dye 2, replica L
  Condition 3, dye 1, replica L | Condition 19, dye 2, replica L
  Condition 4, dye 1, replica L | Condition 19, dye 2, replica L
  …
Loop design Experiment Design
Overview: MICROARRAY PREPROCESSING
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
  – Sources of variation
  – General normalization steps
  – Slide by slide normalization
  – ANOVA normalization
Preprocessing: sources of variation
– Overshine effects
– Dye effect
– Spot effects
– Array effect
These are consistent errors:
– Consistent errors complicate the direct comparison of measurements of the same gene/condition
– Consistent errors need to be removed by preprocessing/normalization
Preprocessing is tedious and influences downstream measurements.
Preprocessing: dye effect
[Figure: experimental workflow: 1. cell culture, 2. mRNA isolation, 3. labeling, 4. hybridization + washing, 5. scanning, 6. image analysis => numerical value.]
Preprocessing: dye and condition effects (within slide variation)
– Measurement error arises during mRNA preparation and during labeling and reverse transcription
– Result: the overall signal in one channel is more pronounced than in the other channel
– Remedy: normalization, based on the global normalization assumption
Preprocessing: array effect
[Figure: experimental workflow: 1. cell culture, 2. mRNA isolation, 3. labeling, 4. hybridization + washing, 5. scanning, 6. image analysis => numerical value.]
Preprocessing: array effects (between slide variation)
– Hybridization differences cause differences in global intensity between slides
– Without correction, comparison between slides is impossible
– Remedy: normalization within each slide, and use of the ratio
Array effects: Between slide variation Preprocessing
Preprocessing: spot effects (pin main effects)
– Measurement error: different quantities of DNA in the spots; differences between duplicate spots
– Consequence: absolute levels are not comparable between genes
– Ratio: makes differential expression comparable between genes
  Gene 1: test 4, ref 2, R/G = 2
  Gene 2: test 8, ref 4, R/G = 2
Preprocessing: overshine effects (within slide variation)
– Non-specific Cy5 or Cy3 signal resulting from overshining = emission from neighboring spots
– The background intensity increases with the intensity of the neighboring spots
Preprocessing
Removing the sources of variation is an obligatory step:
– To make comparisons within a slide possible, e.g. to find differentially expressed genes
– To allow comparisons between slides, e.g. to combine the replicas of the original experiment and the color flip
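Preprocessing: general steps
[Figure: flowchart of the two preprocessing routes.]
– Array by array approach: background correction, log transformation, filtering, normalization, ratio, test statistic (T-test)
– ANOVA based approach: background correction, log transformation, filtering, linearisation, ANOVA, bootstrapping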
Preprocessing: background correction
– Background correction compensates for overshining
– The background is considered additive, so it is subtracted from the spot signal
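Because the background is treated as additive, the correction is a subtraction per spot. The sketch below is a minimal version under two assumptions that are not on the slide: each spot has its own local background estimate, and negative results are clipped to a `floor` of zero so that they can be handled later as zero values.

```python
def background_correct(foreground, background, floor=0.0):
    """Subtract the local background from each spot signal; clip at `floor`."""
    return [max(f - b, floor) for f, b in zip(foreground, background)]


spot_signal_values = [500, 120, 80]
local_background = [100, 100, 100]
print(background_correct(spot_signal_values, local_background))
```

Clipping is a design choice: without it, negative intensities would make the later log transformation undefined.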
Preprocessing: log transformation
– Additive error: independent of the measured intensity; the absolute level of the error remains the same (high relative error at low expression levels, low relative error at high expression levels)
– Multiplicative error: the error increases with the measured intensity; the relative error stays the same, so the absolute error is high at high levels
Preprocessing: log transformation
Log2-transformed intensity values:
– Multiplicative effects are removed: the residuals are constant at high intensities
– Additive effects become more pronounced: the error increases as the signal gets lower (intuitively plausible)
Preprocessing: log transformation. Why log2?
– Ratio (test/ref): test > ref gives upregulation with range 1…+infinity; test < ref gives downregulation with range 0…1, so the range of downregulation is squashed
– Log2 ratio: log2(test/ref) = log2(test) - log2(ref); upregulation has range 0…+infinity, downregulation has range 0…-infinity
– 2 fold overexpression: ratio = 2, log2(ratio) = 1
– 2 fold underexpression: ratio = 0.5, log2(ratio) = -1
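The symmetry argument is easy to verify numerically. The sketch below prints the raw ratio next to the log2 ratio for a few made-up intensity pairs; `log_ratio` is a hypothetical helper name.

```python
from math import log2


def log_ratio(test: float, ref: float) -> float:
    """log2(test/ref), computed as log2(test) - log2(ref)."""
    return log2(test) - log2(ref)


# Up- and downregulation of the same fold size give log ratios of equal magnitude.
for test, ref in [(4, 2), (8, 4), (2, 4), (1, 8), (8, 1)]:
    print(f"test={test} ref={ref} ratio={test / ref} log2 ratio={log_ratio(test, ref)}")
```

An 8 fold decrease has ratio 0.125 and an 8 fold increase has ratio 8, yet their log2 ratios are -3 and +3.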
Preprocessing: filtering
– Spots are identified by image analysis software: Array Vision, ImaGene, Matarray
– Spot detection and signal acquisition: e.g. the signal is defined as the mean intensity of all pixels in a spot whose intensity is higher than the local background + 2 SD
– Spots can have different qualities; artifacts include irregular spots, spots with an excessively large diameter, and extremely small spots
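The example signal definition can be sketched as below. It assumes that "local background + 2 SD" means the mean of the local background pixels plus twice their standard deviation; the commercial packages listed above may define it differently, and the function name is hypothetical.

```python
from statistics import mean, stdev


def spot_signal(spot_pixels, background_pixels, n_sd=2.0):
    """Mean of the spot pixels brighter than background mean + n_sd * background SD.

    Returns None when no pixel passes the threshold (spot not detected).
    """
    threshold = mean(background_pixels) + n_sd * stdev(background_pixels)
    bright = [p for p in spot_pixels if p > threshold]
    return mean(bright) if bright else None


print(spot_signal([12, 13, 50, 60, 70], [10, 12, 8, 10]))
```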
Preprocessing: filtering
[Figure legend: red > 0.1 stdev; green > 1 stdev; blue > 2 stdev.]
Preprocessing: filtering
– Zero values: treat these separately, because the ratio and the log transformation are undefined for them
– In a black/white experiment, zero values can mark interesting genes: off in condition 1 versus on in condition 2
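One way to treat zero values separately is to split the spots into those with a defined log ratio and those that are on in one channel only. The sketch assumes background-corrected intensities in which "not detected" is stored as 0; the function name and labels are hypothetical.

```python
from math import log2


def split_zero_spots(spots: dict):
    """spots maps gene -> (test, ref). Returns (log ratios, on/off genes)."""
    log_ratios, on_off = {}, {}
    for gene, (test, ref) in spots.items():
        if test > 0 and ref > 0:
            log_ratios[gene] = log2(test / ref)
        elif test > 0:
            on_off[gene] = "on in test only"
        elif ref > 0:
            on_off[gene] = "on in reference only"
        # both channels zero: no signal at all, the spot is dropped
    return log_ratios, on_off


print(split_zero_spots({"a": (8, 2), "b": (0, 5), "c": (7, 0), "d": (0, 0)}))
```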
Preprocessing: filtering
– Some genes are only labeled with the green dye, not with the red dye
– If no mRNA of a gene is present, does the green dye bind aspecifically to the spot?
– Such genes seem underexpressed; a color flip is essential to eliminate these false positives
Preprocessing: normalization
Assumption: on average, the ratio red/green should be 1. Options for rescaling:
– Rescale based on the average of housekeeping genes
– Rescale based on spikes
– Rescale based on the average expression value of the full array (global normalization)
Methods used for normalization:
– Linear normalization
– Intensity dependent normalization
Preprocessing: linear normalization
[Figure: scatter plots with axes G and R.]
Preprocessing: linear normalization
– Red and green are related by a constant factor
– The linear normalization factor is determined by linear regression
– Filtering removes outliers in the non-linear range (green values)
[Figure: plot of Log2(ratio) with reference lines at 0.]
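If red and green differ by one constant factor k (R = k * G), k can be estimated by least squares. The sketch below uses a regression through the origin, which is an assumption; the slide does not say which regression variant is meant, and the intensities are made up.

```python
def linear_factor(red, green):
    """Least-squares slope k of R = k * G (regression through the origin)."""
    return sum(r * g for r, g in zip(red, green)) / sum(g * g for g in green)


def normalize_red(red, green):
    """Divide the red channel by the fitted factor so that R/G is 1 on average."""
    k = linear_factor(red, green)
    return [r / k for r in red]


green = [100, 200, 400]
red = [150, 300, 600]   # hypothetical data: red is 1.5 times too bright
print(linear_factor(red, green), normalize_red(red, green))
```

High-intensity spots dominate this fit, which is one reason why outliers in the non-linear range are filtered out first.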
Preprocessing: normalization
Linear normalization is not straightforward.
[Figure: Log2(R/G) plotted against (Log2(R) + Log2(G))/2, with a linear fit and a lowess fit.]
Preprocessing: non-linear, intensity dependent normalization
– Lowess (Dudoit et al., 2000): genes that seem underexpressed because of a specific dye effect are compensated for
– Log R and log G are recalculated based on the lowess fit
– Lowess both linearizes and normalizes the data
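The idea of intensity dependent normalization can be sketched in plain Python. The code fits a locally weighted linear regression (tricube weights) of M = log2(R/G) on A = (log2(R) + log2(G))/2 and subtracts the fitted trend. It is a simplified stand-in for lowess: the robustness iterations of the full procedure are omitted, and the function names and the `frac` default are assumptions.

```python
from math import log2


def tricube(u: float) -> float:
    return (1 - abs(u) ** 3) ** 3 if abs(u) < 1 else 0.0


def local_linear_fit(x, y, x0, frac=0.5):
    """Weighted linear regression around x0 using the nearest frac * n points."""
    k = max(2, round(frac * len(x)))
    bandwidth = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12
    w = [tricube((xi - x0) / bandwidth) for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0  # fall back to the weighted mean
    return my + slope * (x0 - mx)


def lowess_normalize(red, green, frac=0.5):
    """Return log2 ratios with the intensity dependent trend removed."""
    m = [log2(r) - log2(g) for r, g in zip(red, green)]
    a = [(log2(r) + log2(g)) / 2 for r, g in zip(red, green)]
    trend = [local_linear_fit(a, m, a0, frac) for a0 in a]
    return [mi - ti for mi, ti in zip(m, trend)]
```

Corrected channel values follow from the corrected ratio M' as log2(R) = A + M'/2 and log2(G) = A - M'/2.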
Intensity dependent normalization Preprocessing: normalization
Result of the normalization Preprocessing: normalization
Preprocessing: ratio
– The ratio compensates for spot effects
– The choice of the reference is important:
  – Intuitive reference (first time point, uninduced sample): intuitive interpretation possible, but the ratio is often undefined
  – Independent reference (reference design, e.g. tissue mixture): the ratio is defined, but the interpretation is complicated
Preprocessing: ratio
– Ratio (R/G): R > G gives upregulation with range 1…+infinity; R < G gives downregulation with range 0…1, so the range of downregulation is squashed
– Log ratio: upregulation has range 0…+infinity, downregulation has range 0…-infinity
– 2 fold overexpression: ratio = 2, log2(ratio) = 1
– 2 fold underexpression: ratio = 0.5, log2(ratio) = -1
Overview of further analysis
– Raw data -> (preprocessing) -> preprocessed data
– Preprocessed data -> (test statistic) -> differentially expressed genes
– Preprocessed data -> (clustering) -> clusters of coexpressed genes
Preprocessing: ANOVA
Multifactor, linear model with fixed levels: model the expression level of each gene as a combination of the different factors.
I. MAIN EFFECTS + EFFECT OF INTEREST
– Overall mean
– Array effect (hybridisation efficiency)
– Condition effect (mRNA isolation efficiency)
– Gene effect: constitutive level of the gene
– GC effect: differential expression due to the altered variety (the effect of interest)
– Dye effect (labeling efficiency)
Least squares fit, subject to restrictions. Contrast of interest: estimate (GC)i1 – (GC)i2.
Preprocessing: ANOVA
– Assumption: independent, additive error ~ F, where F is a distribution with mean 0 and variance σ²
– Check: plot the residuals (y estimated - y measured) against the estimated intensity
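For a balanced design such as the two-array dye swap, the least squares estimates under sum-to-zero restrictions reduce to differences of group means. The sketch below assumes the model: log intensity = overall mean + array + dye + condition + gene + GC + error, with no spot effects. That form is an assumption pieced together from the listed effects, and the function names are hypothetical. Each observation is a tuple (array, dye, condition, gene, log intensity).

```python
from collections import defaultdict
from statistics import mean


def group_means(observations, key):
    """Mean log intensity per group, where key(obs) names the group."""
    groups = defaultdict(list)
    for obs in observations:
        groups[key(obs)].append(obs[4])
    return {k: mean(v) for k, v in groups.items()}


def fit_gc_effects(observations):
    """Estimate the GC (gene x condition) effects for a balanced design."""
    mu = mean(obs[4] for obs in observations)
    cond = {k: m - mu for k, m in group_means(observations, lambda o: o[2]).items()}
    gene = {g: m - mu for g, m in group_means(observations, lambda o: o[3]).items()}
    cell = group_means(observations, lambda o: (o[3], o[2]))
    return {(g, k): m - mu - gene[g] - cond[k] for (g, k), m in cell.items()}


def contrast(gc, gene, cond1, cond2):
    """Contrast of interest: (GC) for cond1 minus (GC) for cond2, for one gene."""
    return gc[(gene, cond1)] - gc[(gene, cond2)]
```

In the dye swap every gene and condition combination is seen once on each array and once with each dye, so array and dye effects cancel in the cell means. For unbalanced designs a general least squares solver is needed instead.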
Preprocessing: ANOVA
I. MAIN EFFECTS + EFFECT OF INTEREST
The analysis of variance shows the relative contribution of each of the effects.
Preprocessing: ANOVA
Advantages:
– Gains more information from fewer observations: the variation is derived from all measurements made, so fewer replicas are required (e.g. the array effect is based on N-1 gene measurements)
– Statistical testing: the estimated error can be used for bootstrapping to estimate confidence levels
– No ratios required
Requirements:
– Requires knowledge about the experimental effects
– The model used implies that all effects and combinations of effects are linear
– Bootstrapping: the residuals should be normally distributed around zero with constant variance
Preprocessing: ANOVA bootstrap analysis
1. Estimate the error
2. Simulate new datasets based on the estimated error (3000 times)
3. Calculate the factor of interest (GC effect) for each bootstrapped dataset (recalculate the ANOVA)
4. Calculate the CI on (GC1 - GC2) for each of the N genes based on the 3000 bootstraps
5. Use this interval to test for significant genes: a gene is significant when the interval on GC1 - GC2 excludes 0
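The bootstrap steps can be sketched for a single gene. To stay short, this version replaces the full ANOVA refit by the difference of the two condition means for that gene, which is a simplification; the residuals are assumed to be pooled over all genes. The function names, the percentile interval, and the fixed `seed` are illustrative choices.

```python
import random
from statistics import mean


def bootstrap_ci(cond1, cond2, pooled_residuals, n_boot=3000, level=0.99, seed=0):
    """Percentile confidence interval on mean(cond1) - mean(cond2) by residual bootstrap."""
    rng = random.Random(seed)
    fit1, fit2 = mean(cond1), mean(cond2)  # fitted values of the simplified model
    diffs = []
    for _ in range(n_boot):
        # Simulated dataset: fitted value plus a residual drawn with replacement.
        sim1 = [fit1 + rng.choice(pooled_residuals) for _ in cond1]
        sim2 = [fit2 + rng.choice(pooled_residuals) for _ in cond2]
        diffs.append(mean(sim1) - mean(sim2))
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - tail) * n_boot))]
    return lower, upper


def is_significant(interval) -> bool:
    """A gene is called significant when the interval excludes 0."""
    lower, upper = interval
    return lower > 0 or upper < 0
```

A typical call would be `is_significant(bootstrap_ci(test_logs, ref_logs, residuals))`, where all three arguments are lists of log intensities or residuals from the fit.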
Preprocessing: more arrays simultaneously
DATA
– Filtered for zero values
– Set 1: unnormalised data
MODELS (Kerr et al. 2000, 2001)
– Model 1 (no spot effects)
– Model 2 (spot effects independent)
– Model 3 (spot effects dependent)
Observations:
– GC effects are not confounded with the spot effects
– The type of model does influence the residual error, and therefore the bootstrap interval
Preprocessing: more arrays simultaneously
I. MAIN EFFECTS + EFFECT OF INTEREST
– Overall mean
– Array effect (hybridisation efficiency)
– Condition effect (mRNA isolation efficiency)
– Gene effect: constitutive level of the gene
– GC effect: differential expression due to the altered variety
– Dye effect (labeling efficiency)
Preprocessing: more arrays simultaneously
– Least squares fit, subject to restrictions; contrast of interest: estimate (VG)k1g – (VG)k2g
– Assumption: independent, additive error ~ F, where F is a distribution with mean 0 and variance σ²
– The usual confidence intervals based on normal theory are not appropriate
– Bootstrap analysis of the residuals avoids making distributional assumptions about the error
Preprocessing: more arrays simultaneously
[Figure: four panels (TEST, ARRAY 1; REFERENCE, ARRAY 1; REFERENCE, ARRAY 2; TEST, ARRAY 2), each plotted against ŷ.]
Additive error and non-linear effects undermine the application of ANOVA.
[Figure: four panels (TEST, ARRAY 1; REFERENCE, ARRAY 1; REFERENCE, ARRAY 2; TEST, ARRAY 2), each plotted against ŷ.]
Preprocessing: bootstrap analysis
– Lowess
– 99 % confidence interval based on 100 genes, 3000 bootstraps
– Retained 370 genes (62 with T-test p value < 0.01)
Comparison of methods
Methods tested on the pygmy dataset (3750 genes):
1. ANOVA 99 % CI
2. ANOVA 95 % CI
3. SAM
4. T-test
5. Fold test
– Retained 360 genes
– Construct for each gene a binary profile (retained or not by each method)
– Hierarchically cluster the genes based on this profile
– Only 8 genes are retained by all methods
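The binary profile per gene is simple to build once each method has produced its list of retained genes. The sketch below uses hypothetical method labels and gene names; it also gives the Hamming distance, one natural (assumed) distance for hierarchical clustering of such profiles. The clustering itself is not shown.

```python
def binary_profiles(selected_by_method: dict) -> dict:
    """Map each gene to a 0/1 vector: 1 if the method (in sorted order) retained it."""
    methods = sorted(selected_by_method)
    genes = sorted(set().union(*selected_by_method.values()))
    return {g: [int(g in selected_by_method[m]) for m in methods] for g in genes}


def retained_by_all(profiles: dict) -> list:
    """Genes retained by every method."""
    return [gene for gene, bits in profiles.items() if all(bits)]


def hamming(profile_a, profile_b) -> int:
    """Number of methods that disagree about the two genes."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


selections = {"anova99": {"g1", "g2"}, "sam": {"g1", "g3"}, "ttest": {"g1"}}
profiles = binary_profiles(selections)
print(profiles, retained_by_all(profiles))
```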
methods Comparison
Experiment design: Latin square (mouse data set)
– Reference: normal mouse
– Condition: pygmy mouse
– Two experiments: C=1 and C=2 reflect two sample time points
– 2 batches (B1, B2): not all genes of the genome are on one array

Array  Condition  Batch  Test  Ref
A1     C1         B1     R     G
A2     C1         B1     G     R
A3     C1         B2     R     G
A4     C1         B2     R     G
A5     C2         B1     R     G
A6     C2         B1     G     R
A7     C2         B2     R     G
A8     C2         B2     G     R