Published by Tobias Matthews. Modified over 9 years ago.
1
Bioinformatics Expression profiling and functional genomics Part I: Preprocessing Ad 29/10/2006
2
http://www.esat.kuleuven.ac.be/~kmarchal/ Course material: course notes + powerpoint files Exercises
3
Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing Exercises
4
Gene expression
[Figure: DNA is transcribed from the +1 start site into mRNA, which is translated into protein]
5
Gene expression: adaptation of a cell to its environment
[Figure: bacterial cell receiving Signal 1 and Signal 2; an FNR box precedes the genes cytN, cytO, cytQ and cytP]
– A cell adapts by responding to environmental signals, e.g. to hormones (cell differentiation).
– The cellular response is determined by the genes that are switched on upon a signal.
6
Gene expression: the action of genetic networks underlies the observed phenotypic behavior.
7
Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing Exercises
8
Functional genomics Structural Genomics Comparative Genomics
9
Omics era
Traditional molecular biology:
– Directed toward understanding the role of a particular gene or protein in a molecular biological process
– Northern analysis
– Mutational analysis
– Expression by reporter fusions
Omics era:
– Measurement of the expression of thousands of genes or proteins simultaneously
– The function or the expression of a gene is studied in the global context of the cell
– Holistic approaches allow a better understanding of fundamental molecular biological processes, because a gene does not act on its own: it is always embedded in a larger network (systems biology)
10
Omics era: transcriptomics
[Figure labels: reference sample and test sample, RNA, cDNA, detection of reference and test]
11
proteomics Omics era
12
metabolomics Omics era
13
SYSTEMS BIOLOGY Consider the cell as a system Omics era
14
SYSTEMS BIOLOGY Mechanistic insight in the biological system at molecular biological level High throughput data Omics era
15
Omics era: bioinformatics
– Analysis of such large-scale data is no longer trivial and poses computational challenges: low signal-to-noise ratio and high dimensionality.
– Simple spreadsheet analysis (e.g. in Excel) is no longer sufficient; more advanced data mining procedures become necessary.
– Another urgent problem is how to store and organize all the information.
16
Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling –Principle of microarray –Applications Experiment design Preprocessing Exercises
17
Transcript profiling: transcriptomics
[Figure labels: reference sample and test sample, RNA, cDNA, detection of reference and test]
18
Transcript profiling
– Previously: measure the expression level of one gene (Northern blot analysis)
– Novel techniques: measure the expression level of all genes simultaneously => EXPRESSION PROFILING
– Principle: hybridisation (complementary strands stick together)
mRNA: 5' -UGACCUGACG- 3'
cDNA: 3' -ACTGGACTGC- 5'
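The base pairing behind hybridisation can be checked with a few lines of Python. This is only an illustrative sketch; the function name `complementary_cdna` is an assumption made here, and the sequence is the one shown on the slide.

```python
# Watson-Crick partner of each RNA base, written as DNA (cDNA uses T, not U).
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def complementary_cdna(mrna_5_to_3: str) -> str:
    """Return the cDNA strand that pairs with the mRNA, read 3' to 5'.

    The result is written position by position under the mRNA, so no
    reversal is applied.
    """
    return "".join(RNA_TO_CDNA[base] for base in mrna_5_to_3)

# Sequence taken from the slide (hypothetical helper name).
print(complementary_cdna("UGACCUGACG"))
```

Because the output is aligned under the mRNA, it reads 3' to 5'; reversing it would give the conventional 5' to 3' notation.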
19
Transcript profiling
Monitor molecular activities on a global level:
– protein levels (proteomics)
– enzyme activities
– metabolites
– gene expression (mRNA): transcriptomics = transcript profiling
This gives a general insight into the global behavior of the cell (holistic).
Molecular biological methods:
– RT-PCR
– SAGE
– Protein arrays
– Microarray analysis
21
Transcript profiling: cDNA array
– Spotted cDNA on a glass slide
– Upscaled Northern hybridisation
[Figure: genes (DNA) with +1 transcription start sites, transcript (mRNA), cDNA]
22
Transcript profiling: preparation of probes
– Collect cDNA clones
– Amplify the target cDNA insert by PCR
– Check yield and specificity by electrophoresis
– Spot the PCR products on glass slides
23
Transcript profiling
[Figure labels: reference sample and test sample, RNA, cDNA, detection of reference and test]
24
Transcript profiling
[Figure: workflow for two samples (Signal 1, Signal 2)]
1. Cell culture
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis (numerical value)
25
http://www.bio.davidson.edu/courses/genomics/chip/chip.html Transcript profiling
26
Transcript profiling: superimposed color image
* Transform into color images
* Superimpose the color images from the R and G channels
[Figure: examples of good alignment and bad alignment]
27
Transcript profiling: superimposed color image
– black spots: gene was expressed neither in the test nor in the control sample
– green: gene was only expressed in the control sample
– red: gene was only expressed in the test sample
– yellow: gene was expressed both in the test and in the control sample
28
Transcript profiling: image analysis
– Signal intensity is proportional to the amount of cDNA present in the sample
– Signal Cy3 -> numerical value; signal Cy5 -> numerical value
– The numerical values from image analysis are the input for data analysis
29
Data representation Gene profile Experiment profile
30
Transcript profiling: spotted DNA microarray versus high density oligonucleotide array
31
Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing Exercises
32
Experiment design
The mathematical approach depends on the experimental design:
– Comparison of 2 samples (black/white)
– Comparison of multiple arrays
– Global dynamic profiling
– Static experiment: comparison of samples (mutants, patients)
33
Experiment design: 2 sample design
Type 1: comparison of 2 samples
– Control sample versus induced sample
– Statistical testing: retrieve the genes that are statistically over- or underexpressed
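As a rough illustration of the statistical testing step, the sketch below computes a one-sample t statistic on the replicate log2(test/ref) values of a single gene. The function name and the example values are assumptions for illustration; turning the statistic into a p-value requires the t distribution, which is not in the Python standard library and is left out here.

```python
import math
import statistics

def one_sample_t(log_ratios):
    """t statistic for H0: mean log2(test/ref) == 0 (no differential expression).

    Raises ZeroDivisionError if all replicates are identical (zero variance).
    """
    n = len(log_ratios)
    mean = statistics.mean(log_ratios)
    standard_error = statistics.stdev(log_ratios) / math.sqrt(n)
    return mean / standard_error

# Hypothetical replicate log2 ratios for one gene.
replicates = [1.0, 1.2, 0.8, 1.0]
print(round(one_sample_t(replicates), 3))
```

A large absolute value indicates a gene whose log ratio is consistently away from zero across the replicates.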
34
Experiment design: black/white experiment
Description (array V mice genes):
– Condition 1: pygmy mouse, 10 days old (test)
– Condition 2: normal mouse, 10 days old (ref)
– Goal: detect differentially expressed genes
Experiment design (Latin square):
Array 1 | Condition 1, dye 1, replica L | Condition 1, dye 1, replica R | Condition 2, dye 2, replica L | Condition 2, dye 2, replica R
Array 2 | Condition 2, dye 1, replica L | Condition 2, dye 1, replica R | Condition 1, dye 2, replica L | Condition 1, dye 2, replica R
Per gene and per condition, 4 measurements are available.
35
Experiment design: multiple array design
– Measure the expression of all genes over time (dynamic profile) or in different conditions
– Identify coexpressed genes (clustering)
– Identify the mechanism of coregulation (motif finding)
36
Experiment design: multiple array design
Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al. 1999)
– 15 time points (E=18); time points 90 & 100 min deleted (Zhang et al. 1999; Tavazoie et al., 1999)
– Original dataset: 6178 genes
– Preprocessing: select 4634 most variable (25 % most variable), variance normalized
– Adaptive quality-based clustering (32 clusters) (95%)
37
Experiment design: reference design (e.g. Spellman dataset)
– Reference: unsynchronized cells
– Condition: synchronized cells sampled at distinct time intervals during the cell cycle
[Figure: Conditions 1, 2, 3, 4, … each labeled with Dye1 (replica L); Condition 19 labeled with Dye2 (replica L) is paired with each of them; the first pair is marked Array 1]
38
Loop design Experiment Design
39
Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing –Sources of Variation –General normalization steps –Slide by slide normalization –ANOVA normalization
40
Preprocessing: sources of variation
– Overshine effects
– Dye effect
– Spot effects
– Array effect
These are consistent errors:
– They complicate the direct comparison of measurements of the same gene/condition
– They need to be removed by preprocessing/normalization
Preprocessing is tedious and influences downstream measurements.
41
Preprocessing: dye effect
[Figure: workflow for two samples (Signal 1, Signal 2)]
1. Cell culture
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis (numerical value)
42
Preprocessing: dye and condition effect (within-slide variation)
– Measurement error arises during mRNA preparation and during labeling & reverse transcription
– Result: the overall signal in one channel is more pronounced than in the other channel
– Remedy: normalization, under the global normalization assumption
43
Preprocessing: array effect
[Figure: workflow for two samples (Signal 1, Signal 2)]
1. Cell culture
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis (numerical value)
44
Preprocessing: array effects (between-slide variation)
– Caused by hybridization differences
– Differences in global intensity between slides make comparison between slides impossible
– Remedy: within-slide normalization, ratio
45
Array effects: Between slide variation Preprocessing
46
Preprocessing: spot effect (pin main effects)
– Measurement error: different quantity of DNA in a spot; differences between duplicate spots
– Absolute levels are not comparable between genes
– Ratio: allows differential expression to be compared between genes
Gene 1: test: 4, ref: 2, R/G: 2
Gene 2: test: 8, ref: 4, R/G: 2
47
Preprocessing: overshine effects (within-slide variation)
– Non-specific Cy5 or Cy3 signal resulting from overshining, i.e. emission from neighboring spots
– Background intensity increases with the intensity of the neighboring spots
48
Preprocessing
Removing the sources of variation is an obligatory step:
– To make comparisons within a slide possible, e.g. to find differentially expressed genes
– To allow comparisons between slides, e.g. to combine the replicates of the original experiment and the color flip
49
Overview MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing –Sources of Variation –General normalization steps –Slide by slide normalization –ANOVA normalization ANOVA
50
Preprocessing
[Figure: preprocessing flowchart. Shared step: background correction. Array-by-array approach: log transformation, filtering, normalization, ratio, test statistic (T-test). ANOVA-based approach: log transformation, filtering, linearisation, bootstrapping.]
51
Preprocessing: background correction
– Background correction compensates for overshining
– The background is considered additive, so it is subtracted from the measured signal
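Treating the background as additive means the correction is a plain subtraction per spot and per channel. The sketch below assumes that a local background estimate is available for each spot; the function name and the numbers are illustrative only.

```python
def background_correct(foreground, background):
    """Subtract the local background estimate from each spot's foreground signal."""
    return [f - b for f, b in zip(foreground, background)]

# Hypothetical intensities for two spots in one channel.
corrected = background_correct([500, 120], [100, 130])
print(corrected)
```

A spot whose corrected value is zero or negative cannot be log-transformed and would have to be handled in the filtering step.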
53
Preprocessing: log transformation
– Additive error: independent of the measured intensity. The absolute level of the error remains the same, so the relative error is high at low expression levels and low at high expression levels.
– Multiplicative error: the error increases with the measured intensity, so the absolute error is large at high expression levels.
54
Preprocessing: log transformation
Log2-transformed intensity values:
– Multiplicative effects are removed: residuals are constant at high intensities
– Additive effects become more pronounced: the error increases as the signal gets lower (intuitively plausible)
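Why the log transformation removes a multiplicative effect can be seen numerically: a fixed relative error gives the same spread on the log2 scale whatever the intensity. The sketch below assumes a purely multiplicative error of plus or minus 10 %; the function name and values are illustrative.

```python
import math

def log2_spread(intensity, rel_error=0.10):
    """Width of the log2 interval spanned by intensity * (1 +/- rel_error)."""
    high = intensity * (1 + rel_error)
    low = intensity * (1 - rel_error)
    return math.log2(high) - math.log2(low)

# Same relative error at a low and at a high intensity (hypothetical values).
print(log2_spread(100), log2_spread(10000))
```

On the raw scale the two intervals are 20 and 2000 units wide; on the log2 scale they have the same width.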
56
Preprocessing: log transformation. Why log2?
Ratio (test/ref):
– test > ref: upregulation, range 1…+infinity
– test < ref: downregulation, range 0…1 (the range of downregulation is squashed)
Log2(test/ref) = log2(test) - log2(ref):
– upregulation, range 0…+infinity
– downregulation, range 0…-infinity
– 2-fold overexpression: Ratio = 2, Log2(Ratio) = 1
– 2-fold underexpression: Ratio = 0.5, Log2(Ratio) = -1
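The symmetry of the log2 ratio is easy to verify in code. A minimal sketch, with an illustrative function name and made-up intensities:

```python
import math

def log2_ratio(test, ref):
    """log2(test/ref), computed as log2(test) - log2(ref)."""
    return math.log2(test) - math.log2(ref)

# Hypothetical intensities: a 2-fold increase and a 2-fold decrease.
print(log2_ratio(8, 4))
print(log2_ratio(2, 4))
```

Both intensities must be strictly positive; zero values make the log ratio undefined.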
58
Preprocessing: filtering
Spots are identified by image analysis software:
– Array Vision
– ImaGene
– Matarray
Spot detection and signal acquisition: e.g. the signal is defined as the mean pixel intensity of all pixels in a spot whose intensity is higher than the local background + 2 SD.
Spots can differ in quality (artifacts):
– Irregular spots
– Spots with an excessively large diameter
– Spots that are extremely small
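The example signal definition above can be written down directly. This sketch assumes that the pixels of the spot and of its local background are available as two lists, and that "2 SD" refers to the standard deviation of the background pixels; it does not reproduce the behavior of any of the packages named on the slide.

```python
import statistics

def spot_signal(spot_pixels, background_pixels, n_sd=2.0):
    """Mean of the spot pixels brighter than local background mean + n_sd * SD.

    Returns None when no pixel exceeds the threshold (spot not detected).
    """
    threshold = (statistics.mean(background_pixels)
                 + n_sd * statistics.stdev(background_pixels))
    bright = [p for p in spot_pixels if p > threshold]
    return statistics.mean(bright) if bright else None

# Hypothetical pixel values: only 50 and 60 exceed the threshold.
print(spot_signal([12, 50, 60, 13], [10, 12, 8, 10]))
```

Spots that return `None` are candidates for filtering, like the irregular or extremely small spots listed above.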
59
Red >0.1 stdev Green >1 stdev Blue >2 stdev Preprocessing: filtering
60
Preprocessing: filtering
Zero values: treat these separately
– The ratio and the log transformation are undefined for zero values
– In a black/white experiment zero values mark interesting genes: off in condition 1 versus on in condition 2
61
Preprocessing: filtering
– Some genes are only labeled with the green dye, not with the red dye, and therefore seem underexpressed
– If no mRNA of a gene is present, does the green dye bind aspecifically to the spot?
– A color flip is essential to eliminate such false positives
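One way to see why the color flip helps is a sketch under a strong simplifying assumption: the dye bias of a gene is the same additive constant on the log2 scale on both arrays. With the test sample in the red channel on array 1 and in the green channel on array 2, half the difference of the two log ratios estimates the true log ratio, and half the sum estimates the dye bias. All names and values below are illustrative.

```python
import math

def dye_swap_estimate(r1, g1, r2, g2):
    """Return (log2 test/ref, dye bias) from a dye-swapped pair of arrays.

    Array 1: test = red, ref = green. Array 2: test = green, ref = red.
    Assumes an identical additive dye bias on the log2 scale on both arrays.
    """
    m1 = math.log2(r1 / g1)   # true log ratio + bias
    m2 = math.log2(r2 / g2)   # -true log ratio + bias
    return (m1 - m2) / 2, (m1 + m2) / 2

# Hypothetical gene: true log2 ratio 1.0, dye bias 0.5.
log_ratio, bias = dye_swap_estimate(100 * 2 ** 1.5, 100, 100 * 2 ** -0.5, 100)
print(round(log_ratio, 6), round(bias, 6))
```

A gene that looks underexpressed only because of the dye shows a large bias term and a log ratio near zero.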
62
MICROARRAY PREPROCESSING Gene expression Omics era Transcript profiling Experiment design Preprocessing –Sources of Variation –General normalization steps –Slide by slide normalization –ANOVA normalization Overview
64
Preprocessing: normalization
On average the red/green ratio should be 1:
– Rescale based on the average of housekeeping genes
– Rescale based on spikes
– Rescale based on the average expression value of the full array (global normalization)
Methods used for normalization:
– Linear normalization
– Intensity-dependent normalization
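Global normalization can be sketched as centering the log ratios of one array, so that the average log2(R/G) becomes 0 and the average ratio becomes 1 in the geometric sense. Using the mean of all spots is an assumption; the same code applies to a subset of housekeeping genes or spikes.

```python
import math
import statistics

def global_normalize(red, green):
    """Center the log2(R/G) values of one array on zero."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = statistics.mean(log_ratios)
    return [m - shift for m in log_ratios]

# Hypothetical array with three spots and a red channel that is too bright.
print(global_normalize([200, 400, 800], [100, 100, 100]))
```

The median is a common alternative to the mean when a few strongly regulated genes would otherwise dominate the shift.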
65
Preprocessing: normalization
Linear normalization [Figure: scatter plots of R against G]
66
Preprocessing: normalization
Linear normalization:
– Red and green are related by a constant factor
– The factor is calculated by linear regression
– Filtering removes outliers in the non-linear range (green values)
[Figure: Log2(ratio) plot centered on 0; linear normalization factor determined by linear regression]
http://afgc.stanford.edu/~finkel/talk.htm
67
Preprocessing: normalization
Linear normalization is not straightforward …
[Figure: Log2(R/G) plotted against (Log2(R) + Log2(G))/2, with a linear fit and a lowess fit]
68
Preprocessing: normalization
Non-linear, intensity-dependent normalization: lowess (Dudoit et al., 2000)
– Genes that seem underexpressed due to a specific dye effect are compensated for
– Log R and log G are recalculated based on the lowess fit
– Lowess both linearizes and normalizes the data
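The idea of intensity-dependent normalization can be sketched with a simplified local regression: a tricube-weighted local linear fit of M = log2(R/G) on A = (log2 R + log2 G)/2, without the robustness iterations of a full lowess implementation. The function names, the `frac` window parameter and the example data are assumptions for illustration; the sketch only corrects M and does not recompute log R and log G separately.

```python
import math

def local_linear_fit(x, y, frac=0.5):
    """Tricube-weighted local linear regression, evaluated at every x."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # points per local window
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1] or 1e-12     # window half-width
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def intensity_dependent_normalize(red, green, frac=0.5):
    """Subtract the fitted intensity-dependent trend from M = log2(R/G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

# Hypothetical array whose dye bias grows linearly with intensity: M = 0.5*A - 3.
a_vals = [6 + i for i in range(10)]
m_vals = [0.5 * a - 3 for a in a_vals]
red = [2 ** (a + m / 2) for a, m in zip(a_vals, m_vals)]
green = [2 ** (a - m / 2) for a, m in zip(a_vals, m_vals)]
print([round(v, 6) for v in intensity_dependent_normalize(red, green)])
```

In this constructed example every gene has only dye bias, so the normalized log ratios come out at essentially zero. On real data a robust lowess from a statistics package would be the usual choice.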
69
Intensity dependent normalization Preprocessing: normalization
70
Result of the normalization Preprocessing: normalization
72
Preprocessing: ratio
– The ratio compensates for spot effects
– The choice of the reference is important
Intuitive reference (first time point, uninduced sample):
– Intuitive interpretation possible
– Ratio often undefined
Independent reference (reference design; e.g. tissue mixture):
– Interpretation complicated
– Ratio defined
73
Preprocessing: ratio
Ratio (R/G):
– R > G: upregulation, range 1…+infinity
– R < G: downregulation, range 0…1 (the range of downregulation is squashed)
Log ratio:
– upregulation, range 0…+infinity
– downregulation, range 0…-infinity
– 2-fold overexpression: Ratio = 2, Log2(Ratio) = 1
– 2-fold underexpression: Ratio = 0.5, Log2(Ratio) = -1
75
Overview of further analysis
– Raw data -> (preprocessing) -> preprocessed data
– Preprocessed data -> (test statistic) -> differentially expressed genes
– Preprocessed data -> (clustering) -> clusters of coexpressed genes
77
Preprocessing: ANOVA
Multifactor, linear model with fixed levels: model the expression level of each gene as a combination of the different factors.
I. MAIN EFFECTS + EFFECT OF INTEREST
– Overall mean
– Array effect (hybridisation efficiency)
– Condition effect (mRNA isolation efficiency)
– Gene effect: constitutive level of the gene
– Dye effect (labeling efficiency)
– GC effect: differential expression due to the altered variety (effect of interest)
Least squares fit, subject to restrictions. Contrast of interest: estimate (GC)i1 - (GC)i2
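Written out, the factors listed above suggest a model of the following form. The index convention is an assumption chosen to match the contrast on the slide (gene i, condition j, array k, dye l), and y is taken to be the log-transformed intensity; the slide itself does not spell out the equation.

```latex
y_{ijkl} = \mu + G_i + C_j + A_k + D_l + (GC)_{ij} + \varepsilon_{ijkl}
\qquad
\text{contrast of interest for gene } i:\quad (GC)_{i1} - (GC)_{i2}
```

The "restrictions" of the least squares fit are typically sum-to-zero constraints on each set of effects, which make the parameters identifiable; that reading is also an assumption here.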
78
Preprocessing: ANOVA
Assumption: independent, additive error ε ~ F, where F is a distribution with mean 0 and variance σ²
[Figure: residuals (y estimated - y measured) plotted against the estimated intensity]
79
Preprocessing: ANOVA
I. MAIN EFFECTS + EFFECT OF INTEREST
Analysis of variance shows the relative contribution of each of the effects.
80
Preprocessing: ANOVA
Advantages:
– Gains more information from fewer observations: variation is derived from all measurements made, so fewer replicates are required (e.g. the array effect is based on N-1 gene measurements)
– Statistical testing: the estimated error can be used for bootstrapping to estimate confidence levels
– No ratios required
Requirements:
– Knowledge about the experimental effects
– The model implies that all effects and combinations of effects are linear
– Bootstrapping: residuals should be normally distributed around zero with constant variance
81
Preprocessing: ANOVA bootstrap analysis
1. Estimate the error
2. Simulate new datasets based on the estimated error (3000 times)
3. Calculate the factor of interest (GC effect) for each bootstrapped dataset (recalculate the ANOVA)
4. Calculate a CI on (GC1-GC2) for each of the N genes based on the 3000 bootstraps
5. Use this interval to test for significant genes
[Figure: axis GC1-GC2 with 0 marked]
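The bootstrap steps can be illustrated on a heavily simplified case: one gene, two conditions, and condition means in place of the full ANOVA refit. Residuals are pooled, resampled with replacement to simulate new datasets, and a percentile interval is taken on the difference. The function name, the data and the fixed seed are assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(cond1, cond2, n_boot=3000, level=0.99, seed=0):
    """Percentile CI on mean(cond1) - mean(cond2) from a residual bootstrap."""
    m1, m2 = statistics.mean(cond1), statistics.mean(cond2)
    # Step 1: estimate the error as residuals around the fitted means.
    residuals = [v - m1 for v in cond1] + [v - m2 for v in cond2]
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Step 2: simulate a new dataset from the fit plus resampled residuals.
        sim1 = [m1 + rng.choice(residuals) for _ in cond1]
        sim2 = [m2 + rng.choice(residuals) for _ in cond2]
        # Step 3: recalculate the effect of interest.
        diffs.append(statistics.mean(sim1) - statistics.mean(sim2))
    # Step 4: percentile confidence interval.
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot))
    return diffs[lo_idx], diffs[hi_idx]

# Hypothetical log intensities of one gene in two conditions.
low, high = bootstrap_ci([2.1, 1.9, 2.0, 2.2], [0.1, -0.1, 0.0, 0.2])
# Step 5: the gene is called significant if the interval excludes 0.
print(low, high, not (low <= 0 <= high))
```

In the full procedure the residuals come from the ANOVA fit over all genes and the model is refitted on every simulated dataset.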
83
Preprocessing: more arrays simultaneously
DATA
– Filtered for zero values
– Set 1: unnormalised data
MODELS (Kerr et al. 2000, 2001)
– Model 1 (no spot effects)
– Model 2 (spot effects independent)
– Model 3 (spot effects dependent)
Observations:
– GC effects are not confounded with the spot effects
– The type of model does influence the residual error, and therefore the bootstrap interval
85
Preprocessing: more arrays simultaneously
I. MAIN EFFECTS + EFFECT OF INTEREST
– Overall mean
– Array effect (hybridisation efficiency)
– Condition effect (mRNA isolation efficiency)
– Gene effect: constitutive level of the gene
– Dye effect (labeling efficiency)
– GC effect: differential expression due to the altered variety
86
Preprocessing: more arrays simultaneously
– Least squares fit, subject to restrictions
– Contrast of interest: estimate (VG)k1g - (VG)k2g
– Assumption: independent, additive error ε ~ F, where F is a distribution with mean 0 and variance σ²
– Usual confidence intervals based on normal theory are not appropriate
– Bootstrap analysis of the residuals avoids making distributional assumptions about the error
87
Preprocessing: more arrays simultaneously
88
Preprocessing: more arrays simultaneously
[Figure: four panels with ŷ on the axis: Test, array 1; Reference, array 1; Reference, array 2; Test, array 2]
89
Preprocessing: more arrays simultaneously
Additive error and non-linear effects undermine the application of ANOVA
91
Preprocessing: bootstrap analysis
Lowess, 99 % confidence interval based on 100 genes, 3000 bootstraps: retained 370 genes (62 T-test p value < 0.01)
92
Methods comparison
Methods tested on the pygmy dataset (3750 genes):
1. ANOVA 99 % CI
2. ANOVA 95 % CI
3. SAM
4. T-test
5. Fold test
– Retained 360 genes
– Construct for each gene a binary profile (e.g. 1 1 1 1 1)
– Hierarchically cluster the genes based on this profile
– Only 8 genes were retained by all methods
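The binary profile is straightforward to build once each method has produced its list of retained genes. In the sketch below the gene identifiers and selections are invented for illustration, and the hierarchical clustering step itself is left out.

```python
def binary_profiles(selected_by_method):
    """Map each gene to a tuple of 0/1 flags, one flag per method."""
    methods = list(selected_by_method)
    genes = sorted(set().union(*selected_by_method.values()))
    return {
        gene: tuple(int(gene in selected_by_method[m]) for m in methods)
        for gene in genes
    }

# Hypothetical selections (gene names are placeholders).
selections = {
    "ANOVA 99% CI": {"g1", "g2"},
    "T-test": {"g1"},
    "Fold test": {"g1", "g3"},
}
profiles = binary_profiles(selections)
retained_by_all = [g for g, p in profiles.items() if all(p)]
print(profiles)
print(retained_by_all)
```

Genes with identical profiles were selected by the same set of methods, which is what clustering on these profiles groups together.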
93
methods Comparison
94
methods Comparison
95
Experiment design: Latin square (mouse data set)
– Reference: normal mouse
– Condition: pygmy mouse
– Two experiments, C=1 and C=2, reflect two sample time points
– 2 batches (B1, B2): not all genes of the genome are on one array
Array | Experiment | Batch | Test | Ref
A1 | C1 | B1 | R | G
A2 | C1 | B1 | G | R
A3 | C1 | B2 | R | G
A4 | C1 | B2 | R | G
A5 | C2 | B1 | R | G
A6 | C2 | B1 | G | R
A7 | C2 | B2 | R | G
A8 | C2 | B2 | G | R