Bioinformatics Expression profiling and functional genomics Part I: Preprocessing Ad 29/10/2006.

Course material: course notes + powerpoint files Exercises

Overview: microarray preprocessing
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
– Exercises

Gene expression: DNA is transcribed (start site +1) into mRNA, which is translated into protein.

Gene expression: adaptation of a cell to its environment
A cell adapts by responding to environmental signals, e.g. to hormones (cell differentiation). The cellular response is determined by the genes that are switched on upon a signal.
(Figure: bacterial cell receiving Signal 1 and Signal 2; an FNR box precedes cytN, cytO, cytQ and cytP.)

Gene expression: the action of genetic networks underlies the observed phenotypic behavior.

Functional genomics Structural Genomics Comparative Genomics

Omics era
Traditional molecular biology:
– Directed toward understanding the role of a particular gene or protein in a molecular biological process
– Northern analysis
– Mutational analysis
– Expression by reporter fusions
Omics era:
– Simultaneous measurement of the expression of thousands of genes or proteins
– The function or the expression of a gene is studied in the global context of the cell
– Holistic approaches allow a better understanding of fundamental molecular biological processes, because a gene does not act on its own: it is always embedded in a larger network (systems biology)

Omics era: transcriptomics. (Figure: RNA from a reference sample and a test sample is converted to cDNA for detection.)

proteomics Omics era

metabolomics Omics era

SYSTEMS BIOLOGY Consider the cell as a system Omics era

SYSTEMS BIOLOGY Mechanistic insight in the biological system at molecular biological level High throughput data Omics era

Omics era: bioinformatics
Analysis of such large-scale data is no longer trivial and poses computational challenges:
– Low signal-to-noise ratio
– High dimensionality
Simple spreadsheet analysis (e.g. in Excel) is no longer sufficient; more advanced data mining procedures become necessary. Another urgent problem is how to store and organize all the information.

Overview: microarray preprocessing
– Gene expression
– Omics era
– Transcript profiling
  – Principle of microarray
  – Applications
– Experiment design
– Preprocessing
– Exercises

Transcript profiling: transcriptomics. (Figure: RNA from a reference sample and a test sample is converted to cDNA for detection.)

Transcript profiling
Previously: measure the expression level of one gene (Northern blot analysis).
Novel techniques: measure the expression level of all genes simultaneously => EXPRESSION PROFILING
Principle: hybridisation (complementary strands stick together)
mRNA: 5’-UGACCUGACG-3’
cDNA: 3’-ACTGGACTGC-5’
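The base pairing behind the hybridisation principle can be sketched in a few lines. This is only an illustration: the function name `complementary_cdna` and the lookup table are assumptions made for the example, not part of the course material.

```python
# Watson-Crick partner in the cDNA strand for each mRNA base.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def complementary_cdna(mrna_5_to_3: str) -> str:
    """Return the pairing cDNA strand, written 3'->5' so that every
    base sits directly under its partner in the mRNA."""
    return "".join(RNA_TO_DNA[base] for base in mrna_5_to_3)

# The example sequence from the slide.
print(complementary_cdna("UGACCUGACG"))  # ACTGGACTGC
```

The output reproduces the cDNA strand shown on the slide, read in the 3’ to 5’ direction.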

Transcript profiling
Monitor molecular activities on a global level:
– Protein levels (proteomics)
– Enzyme activities
– Metabolites
– Gene expression (mRNA): transcriptomics = transcript profiling
This gives a general insight into the global behavior of the cell (holistic).
Molecular biological methods:
– RT-PCR
– SAGE
– Protein arrays
– Microarray analysis

Transcript profiling: cDNA array. cDNA is spotted on a glass slide; the array works as an upscaled Northern hybridisation. (Figure: gene (DNA), transcript (mRNA), cDNA.)

Transcript profiling: preparation of probes
– Collect cDNA clones
– Amplify the target cDNA insert by PCR
– Check yield and specificity by electrophoresis
– Spot the PCR products on glass slides

Transcript profiling: experimental workflow
1. Cell culture (Signal 1, Signal 2)
2. mRNA isolation
3. Labeling
4. Hybridization + washing
5. Scanning
6. Image analysis (numerical value)

Transcript profiling

Transcript profiling: superimposed color image. The scans of the R and G channels are transformed into color images, which are then superimposed. (Figure: good alignment versus bad alignment.)

Transcript profiling: superimposed color image
– Black spots: gene expressed in neither the test nor the control sample
– Green: gene expressed only in the control sample
– Red: gene expressed only in the test sample
– Yellow: gene expressed in both the test and the control sample

Transcript profiling: image analysis. Signal intensity is proportional to the amount of cDNA present in the sample. Image analysis converts the Cy3 signal and the Cy5 signal each into a numerical value, which is passed on to data analysis.

Data representation Gene profile Experiment profile

Transcript profiling: spotted DNA microarray and high-density oligonucleotide array (figure).

Experiment design
The mathematical approach depends on the experimental design:
– Comparison of 2 samples (black/white)
– Comparison of multiple arrays: global dynamic profiling, or a static experiment (comparison of samples such as mutants or patients)

Experiment design: type 1, comparison of 2 samples (2 sample design). A control sample is compared with an induced sample; statistical testing retrieves the genes that are statistically over- or underexpressed.

Experiment design: black/white experiment (array V, mouse genes)
– Condition 1: pygmy mouse, 10 days old (test)
– Condition 2: normal mouse, 10 days old (ref)
– Goal: detect differentially expressed genes
Design (Latin square):
– Array 1: Condition 1 with dye 1 (replica L, replica R); Condition 2 with dye 2 (replica L, replica R)
– Array 2: Condition 2 with dye 1 (replica L, replica R); Condition 1 with dye 2 (replica L, replica R)
Per gene and per condition, 4 measurements are available.
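Why the second, dye-swapped array is useful can be shown with a small sketch. It assumes, purely for illustration, that the dye bias is the same multiplicative factor on both arrays; the function name and the numbers are hypothetical.

```python
import math

def dye_swap_log_ratio(test_a1, ref_a1, ref_a2, test_a2):
    """Combine one array with its colour flip.

    Array 1: test labelled with dye 1, reference with dye 2.
    Array 2: reference labelled with dye 1, test with dye 2.
    """
    m1 = math.log2(test_a1 / ref_a1)  # dye1/dye2 = +true effect + dye bias
    m2 = math.log2(ref_a2 / test_a2)  # dye1/dye2 = -true effect + dye bias
    return (m1 - m2) / 2              # the common dye bias cancels

# Hypothetical gene: true 2-fold induction, dye 1 reads 1.5x too bright.
bias = 1.5
estimate = dye_swap_log_ratio(test_a1=200 * bias, ref_a1=100,
                              ref_a2=100 * bias, test_a2=200)
print(estimate)
```

Under that assumption the estimate recovers the true log2 ratio of 1 even though each array on its own is distorted by the dye.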

Experiment design: multiple array design. Measure the expression of all genes over time (dynamic profile) or in different conditions. Clustering identifies coexpressed genes; motif finding helps to identify the mechanism of coregulation.

Experiment design: multiple array design (example)
Study of the mitotic cell cycle of Saccharomyces cerevisiae with oligonucleotide arrays (Cho et al. 1999)
– Original dataset: 6178 genes
– 15 time points (E=18); time points 90 and 100 min deleted (Zhang et al. 1999; Tavazoie et al. 1999)
– Preprocessing: select the 4634 most variable genes (about 75 % of the 6178 genes, i.e. the 25 % least variable are dropped), then variance normalize
– Adaptive quality-based clustering (32 clusters) (95 %)
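The two preprocessing steps named here, selecting the most variable genes and variance normalizing each profile, might look as follows. This is a minimal sketch on invented toy profiles; the function names, the `fraction` parameter and the use of the population standard deviation are assumptions, not the procedure used in the cited studies.

```python
import statistics

def most_variable(profiles, fraction):
    """Keep the given fraction of genes with the largest variance."""
    ranked = sorted(profiles,
                    key=lambda gene: statistics.pvariance(profiles[gene]),
                    reverse=True)
    keep = max(1, round(fraction * len(ranked)))
    return {gene: profiles[gene] for gene in ranked[:keep]}

def variance_normalize(profile):
    """Rescale one gene profile to mean 0 and standard deviation 1."""
    mean = statistics.mean(profile)
    sd = statistics.pstdev(profile)
    return [(x - mean) / sd for x in profile]

# Hypothetical expression profiles over four time points.
profiles = {
    "flat":   [5.0, 5.1, 4.9, 5.0],
    "cyclic": [1.0, 4.0, 1.0, 4.0],
    "weak":   [2.0, 2.5, 2.0, 2.5],
}
selected = most_variable(profiles, fraction=0.67)
normalized = {g: variance_normalize(p) for g, p in selected.items()}
print(list(selected), normalized["cyclic"])
```

After variance normalization, genes with the same shape of profile become comparable regardless of their absolute expression level, which is what a clustering step needs.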

Reference: unsynchronized cells Condition: synchronized cells during cell cycle at distinct time intervals Condition 1 Dye1 Replica L Condition 2 Dye1 Replica L Condition 3 Dye1 Replica L Condition 4 Dye1 Replica L. … Condition 19 Dye2 Replica L Condition 19 Dye2 Replica L Condition 19 Dye2 Replica L Condition 19 Dye2 Replica L Array 1 Reference design: e.g. Spellman dataset Experiment Design

Loop design Experiment Design

Overview: microarray preprocessing
– Gene expression
– Omics era
– Transcript profiling
– Experiment design
– Preprocessing
  – Sources of variation
  – General normalization steps
  – Slide by slide normalization
  – ANOVA normalization

Preprocessing: sources of variation
– Overshine effects
– Dye effect
– Spot effects
– Array effect
These are consistent errors. They complicate direct comparison of measurements of the same gene/condition and need to be removed by preprocessing/normalization. Preprocessing is tedious, and it influences downstream measurements.

Preprocessing: dye effect (workflow figure: 1. cell culture, 2. mRNA isolation, 3. labeling, 4. hybridization + washing, 5. scanning, 6. image analysis giving a numerical value).

Preprocessing: dye and condition effect (within-slide variation)
Measurement error arises in:
– Preparation of mRNA
– Labeling and reverse transcription
Result: the overall signal in one channel is more pronounced than in the other channel.
Remedy: normalization, under the global normalization assumption.

Preprocessing: array effect (workflow figure: 1. cell culture, 2. mRNA isolation, 3. labeling, 4. hybridization + washing, 5. scanning, 6. image analysis giving a numerical value).

Preprocessing: array effects (between-slide variation)
Hybridization differences cause differences in global intensity between slides, which make direct comparison between slides impossible. Remedies: normalization, or use of the within-slide ratio.

Array effects: Between slide variation Preprocessing

Preprocessing: spot effects (pin main effects)
Measurement error: different quantities of DNA in a spot, and differences between duplicate spots. Absolute levels are therefore incomparable between genes. The ratio makes differential expression comparable between genes:
– Gene 1: test 4, ref 2, R/G = 2
– Gene 2: test 8, ref 4, R/G = 2

Preprocessing: overshine effects (within-slide variation). Non-specific Cy5 or Cy3 signal results from overshining, i.e. emission from neighboring spots. The background intensity increases with the intensity of the neighboring spots.

Preprocessing
Removing sources of variation is an obligatory step:
– To make comparisons within a slide possible, e.g. to find differentially expressed genes
– To allow comparisons between slides, e.g. to combine the replicas of the original experiment and the color flip

Preprocessing workflow (figure): background correction and log transformation, followed by either the array-by-array approach (filtering, normalization, ratio, test statistic (t-test)) or the ANOVA-based approach (filtering, linearisation, bootstrapping).

Preprocessing: background correction. Background correction compensates for overshining. The background is considered additive, so it is subtracted from the measured signal.
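Because the background is treated as additive, correcting a channel comes down to a per-spot subtraction. The sketch below is illustrative only: the function name, the example intensities and the choice to floor negative results at zero are assumptions.

```python
def background_correct(foreground, background):
    """Subtract the local background from each spot of one channel.

    Negative results are floored at zero (an assumption of this sketch);
    such zero values need separate treatment in the filtering step.
    """
    return [max(f - b, 0.0) for f, b in zip(foreground, background)]

# Hypothetical Cy5 foreground and local background intensities.
cy5 = background_correct([1200.0, 150.0, 640.0], [200.0, 200.0, 140.0])
print(cy5)  # [1000.0, 0.0, 500.0]
```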

Preprocessing: log transformation
– Additive error: independent of the measured intensity. The absolute level of the error stays the same, so the relative error is high at low expression levels and low at high expression levels.
– Multiplicative error: the error increases with the measured intensity, so the absolute error is large at high levels while the relative error stays roughly constant.

Preprocessing: log transformation. In log2-transformed intensity values, multiplicative effects are removed and additive effects become more pronounced: residuals are constant at high intensities, while the impact of the additive error increases as the signal gets lower (intuitively plausible).
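One way to make this statement concrete is a two-component error model. The model and its symbols (true signal x, multiplicative error η, additive error ε) are assumptions introduced for illustration; the slides do not spell them out.

```latex
y = x\,e^{\eta} + \varepsilon
\qquad\Rightarrow\qquad
\log_2 y \approx
\begin{cases}
\log_2 x + \eta \log_2 e, & x \gg |\varepsilon| \quad \text{(constant spread at high intensity)}\\[4pt]
\log_2\!\left(x + \varepsilon\right), & x \lesssim |\varepsilon| \quad \text{(spread grows as the signal drops)}
\end{cases}
```

At high intensity the additive term is negligible and the log turns the multiplicative error into a constant offset; at low intensity the additive term dominates and its relative weight grows.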

Preprocessing: log transformation (why log2)
Log2(test/ref) = log2(test) - log2(ref)
Ratio (test/ref):
– test > ref: upregulation, range 1…+infinity
– test < ref: downregulation, range 0…1 (range of downregulation squashed)
Log2 ratio:
– Upregulation: range 0…+infinity
– Downregulation: range 0…-infinity
– 2-fold overexpression: ratio = 2, log2(ratio) = 1
– 2-fold underexpression: ratio = 0.5, log2(ratio) = -1
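The symmetry argument is easy to check numerically. The helper name `log2_ratio` and the intensities are made up for this example.

```python
import math

def log2_ratio(test, ref):
    """log2(test/ref), computed as a difference of log2 intensities."""
    return math.log2(test) - math.log2(ref)

# 2-fold up, 2-fold down, unchanged.
for test, ref in [(8, 4), (4, 8), (4, 4)]:
    print(f"ratio = {test / ref:<4} log2(ratio) = {log2_ratio(test, ref):+.1f}")
```

The plain ratio maps the two 2-fold changes to 2 and 0.5, whereas the log2 ratio maps them to +1 and -1, equally far from the no-change value 0.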

Preprocessing: filtering
Spots are identified by image analysis software:
– Array Vision
– ImaGene
– Matarray
Spot detection and signal acquisition: e.g. the signal is defined as the mean pixel intensity of all pixels in a spot whose intensity is higher than the local background + 2 SD.
Spots can differ in quality:
– Irregular spots
– Spots with an excessively large diameter
– Extremely small spots (artifacts)
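The example signal definition can be written down directly. This is a sketch of that one rule, not of any of the packages listed: the function name, the pixel values, and the use of the sample standard deviation of the background pixels are assumptions.

```python
import statistics

def spot_signal(spot_pixels, background_pixels, n_sd=2.0):
    """Mean intensity of the spot pixels that exceed
    local background mean + n_sd * background SD.

    Returns None when no pixel passes the threshold.
    """
    threshold = (statistics.mean(background_pixels)
                 + n_sd * statistics.stdev(background_pixels))
    bright = [p for p in spot_pixels if p > threshold]
    return statistics.mean(bright) if bright else None

# Hypothetical pixel intensities.
background = [100, 110, 90, 100]
print(spot_signal([105, 300, 320, 115, 340], background))  # 320
print(spot_signal([100, 105], background))                 # None
```

A spot that returns None has no pixel above the threshold and would be a candidate for removal in the filtering step.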

Red >0.1 stdev Green >1 stdev Blue >2 stdev Preprocessing: filtering

Preprocessing: filtering of zero values. Zero values leave the ratio and the log transformation undefined, so treat them separately. In a black/white experiment they can mark interesting genes: off in condition 1 versus on in condition 2.

Preprocessing: filtering. Some genes are labeled only with the green dye, not with the red dye, and therefore seem underexpressed. Possibly the green dye binds nonspecifically to a spot when no mRNA of the gene is present. A color flip is essential to eliminate such false positives.

Preprocessing: normalization
On average, the red/green ratio should be 1. Options for rescaling:
– Based on the average of housekeeping genes
– Based on spikes
– Based on the average expression value of the full array (global normalization)
Methods used for normalization:
– Linear normalization
– Intensity-dependent normalization
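Global normalization can be sketched as centering the log ratios of one array. The sketch assumes the average is taken on the log2 scale (so the geometric mean of the ratios becomes 1) and that all spots on the array are used; restricting the average to housekeeping genes or spikes would only change which spots enter `shift`. Names and data are hypothetical.

```python
import math
import statistics

def global_normalize(red, green):
    """Shift log2(R/G) so that its mean over the array is 0."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    shift = statistics.mean(m)
    return [value - shift for value in m]

# Hypothetical array on which the red channel is systematically brighter.
red = [200.0, 400.0, 800.0]
green = [100.0, 100.0, 100.0]
print(global_normalize(red, green))  # [-1.0, 0.0, 1.0]
```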

Linear Normalization G R G R Preprocessing: normalization

Preprocessing: linear normalization
– Red and green are related by a constant factor
– The factor is calculated by linear regression
– Filtering removes outliers in the non-linear range (green values)
(Figure: log2(ratio) axis with 0 marked.)
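If red and green differ by a constant factor, that factor can be estimated with a least-squares line. The sketch assumes a regression through the origin (R = k·G), which is one reasonable reading of the slide rather than a documented choice; function names and intensities are invented.

```python
def linear_factor(red, green):
    """Least-squares slope k of the line R = k * G through the origin."""
    return (sum(r * g for r, g in zip(red, green))
            / sum(g * g for g in green))

def linear_normalize(red, green):
    """Rescale the red channel so that red and green agree on average."""
    k = linear_factor(red, green)
    return [r / k for r in red]

# Hypothetical intensities: red reads 1.5x higher than green throughout.
green = [100.0, 200.0, 400.0]
red = [150.0, 300.0, 600.0]
print(linear_factor(red, green))     # 1.5
print(linear_normalize(red, green))  # [100.0, 200.0, 400.0]
```

A single factor only helps when the relation between the channels really is linear over the whole intensity range.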

Preprocessing: normalization. Linear normalization is not straightforward. (Figure: log2(R/G) plotted against (log2(R) + log2(G))/2, with a linear fit and a lowess fit.)

Preprocessing: non-linear, intensity-dependent normalization. Lowess (Dudoit et al., 2000): genes that seem underexpressed because of a specific dye effect are compensated for. Log R and log G are recalculated based on the lowess fit. Lowess both linearizes and normalizes the data.
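The idea of intensity-dependent normalization can be sketched with a simplified local regression. This is not the implementation of Dudoit et al.: it is a plain local linear fit with tricube weights and no robustness iterations, and the function names, the `frac` parameter and the test data are assumptions.

```python
import math

def lowess_fit(a, m, frac=0.5):
    """Local linear fit of m on a with tricube weights."""
    n = len(a)
    k = min(n, max(3, round(frac * n)))  # points per neighbourhood
    fitted = []
    for x0 in a:
        # Bandwidth: distance to the k-th nearest point.
        h = sorted(abs(x - x0) for x in a)[k - 1] or 1e-12
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, a)) / sw
        my = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def lowess_normalize(red, green, frac=0.5):
    """Return A = mean log intensity and the corrected log ratio M - fit(A)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    fit = lowess_fit(a, m, frac)
    return a, [mi - fi for mi, fi in zip(m, fit)]

# Hypothetical array: green reads twice as bright as red at every intensity.
red = [100.0 * 2 ** i for i in range(8)]
green = [2 * r for r in red]
a_values, m_corrected = lowess_normalize(red, green)
print([round(v, 6) for v in m_corrected])
```

The corrected channels follow from log2 R = A + M/2 and log2 G = A - M/2, which is the recalculation of log R and log G mentioned on the slide.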

Intensity dependent normalization Preprocessing: normalization

Result of the normalization Preprocessing: normalization

Preprocessing: ratio
The ratio compensates for spot effects. The choice of the reference is important:
– Intuitive reference (first time point, uninduced sample): intuitive interpretation possible, but the ratio is often undefined
– Independent reference (reference design, e.g. a tissue mixture): the ratio is defined, but interpretation is complicated

Preprocessing: ratio
Ratio (R/G):
– R > G: upregulation, range 1…+infinity
– R < G: downregulation, range 0…1 (range of downregulation squashed)
Log ratio:
– Upregulation: range 0…+infinity
– Downregulation: range 0…-infinity
– 2-fold overexpression: ratio = 2, log2(ratio) = 1
– 2-fold underexpression: ratio = 0.5, log2(ratio) = -1

Overview of further analysis: raw data are preprocessed into preprocessed data; a test statistic then yields the differentially expressed genes, and clustering yields clusters of coexpressed genes.

Preprocessing: ANOVA
I. MAIN EFFECTS + EFFECT OF INTEREST
Model the expression level of each gene as a combination of the different factors:
– Overall mean
– Array effect (hybridisation efficiency)
– Condition effect (mRNA isolation efficiency)
– Dye effect (labeling efficiency)
– Gene effect (constitutive level of the gene)
– GC effect (differential expression due to the altered variety): the effect of interest
The model is multifactor and linear, with fixed levels. Least squares fit, subject to restrictions. Contrast of interest: estimate (GC)i1 – (GC)i2.
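Written out, a model with exactly these terms could take the following form. The index letters (a for array, d for dye, c for condition, i for gene), the log2 scale of the response and the sum-to-zero restrictions are assumptions chosen for illustration, in the spirit of the Kerr et al. models cited later in the deck.

```latex
\log_2 y_{adci} = \mu + A_a + D_d + C_c + G_i + (GC)_{ic} + \varepsilon_{adci},
\qquad
\sum_a A_a = \sum_d D_d = \sum_c C_c = \sum_i G_i = 0,
\quad
\sum_i (GC)_{ic} = \sum_c (GC)_{ic} = 0
```

In this notation the contrast of interest for gene i is (GC)_{i1} - (GC)_{i2}: the part of the difference between condition 1 and condition 2 that is specific to that gene.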

Preprocessing: ANOVA. Assumption: independent, additive error ~ F, where F is a distribution with mean 0 and variance σ². Check: plot the residuals (y estimated - y measured) against the estimated intensity.

Preprocessing: ANOVA. I. MAIN EFFECTS + EFFECT OF INTEREST. Analysis of variance shows the relative contribution of each of the effects.

Preprocessing: ANOVA
Advantages:
– Gains more information from fewer observations: variation is derived from all measurements made, so fewer replicas are required (e.g. the array effect is based on N-1 gene measurements)
– Statistical testing: the estimated error can be used for bootstrapping to estimate confidence levels
– No ratios required
Requirements:
– Requires knowledge about the experimental effects
– The model used implies that all effects and combinations of effects are linear
– Bootstrapping: residuals should be normally distributed around zero with constant variance

Preprocessing: ANOVA bootstrap analysis
1. Estimate the error
2. Simulate new datasets based on the estimated error (3000 times)
3. Calculate the factor of interest (GC effect) for each bootstrapped dataset (recalculate the ANOVA)
4. Calculate a confidence interval (CI) on (GC1-GC2) for each of the N genes, based on the 3000 bootstraps
5. Use this interval to test for significant genes (figure: the interval on the GC1-GC2 axis relative to 0)
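The bootstrap steps can be sketched on a deliberately reduced model: one gene, two conditions, and condition means as the fit, instead of the full ANOVA. The function name, the percentile interval, the fixed random seed and the data are all assumptions made for this illustration.

```python
import random
import statistics

def bootstrap_ci(cond1, cond2, n_boot=3000, level=0.99, seed=0):
    """Residual bootstrap CI for the difference between two condition means."""
    rng = random.Random(seed)
    fit1, fit2 = statistics.mean(cond1), statistics.mean(cond2)
    # Step 1: estimate the error as residuals around the fitted values.
    residuals = [y - fit1 for y in cond1] + [y - fit2 for y in cond2]
    diffs = []
    for _ in range(n_boot):
        # Step 2: simulate a new dataset from fit + resampled residuals.
        sim1 = [fit1 + rng.choice(residuals) for _ in cond1]
        sim2 = [fit2 + rng.choice(residuals) for _ in cond2]
        # Step 3: recompute the effect of interest.
        diffs.append(statistics.mean(sim1) - statistics.mean(sim2))
    # Step 4: percentile confidence interval.
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * n_boot)]
    upper = diffs[int((1 - tail) * n_boot) - 1]
    return lower, upper

# Hypothetical log2 intensities of one gene, 4 measurements per condition.
lower, upper = bootstrap_ci([1.1, 0.9, 1.0, 1.2], [0.0, 0.1, -0.1, 0.05])
# Step 5: the gene is called significant if 0 lies outside the interval.
print(lower, upper, not (lower <= 0 <= upper))
```

In the full procedure the fit and the residuals would come from the ANOVA model over all genes and arrays, so that every gene borrows its error estimate from all measurements.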

Preprocessing: more arrays simultaneously
DATA: filtered for zero values; set 1: unnormalised data
MODELS (Kerr et al. 2000, 2001):
– Model 1 (no spot effects)
– Model 2 (spot effects independent)
– Model 3 (spot effects dependent)
GC effects are not confounded with the spot effects, but the type of model does influence the residual error and therefore the bootstrap interval.

Preprocessing: more arrays simultaneously
I. MAIN EFFECTS + EFFECT OF INTEREST
– Overall mean
– Array effect (hybridisation efficiency)
– Condition effect (mRNA isolation efficiency)
– Dye effect (labeling efficiency)
– Gene effect (constitutive level of the gene)
– GC effect (differential expression due to the altered variety)

Preprocessing: more arrays simultaneously. Least squares fit, subject to restrictions. Contrast of interest: estimate (VG)k1g – (VG)k2g. The usual confidence intervals based on normal theory are not appropriate; bootstrap analysis of the residuals avoids making distributional assumptions about the error. Assumption: independent, additive error ~ F, where F is a distribution with mean 0 and variance σ².

Preprocessing: more arrays simultaneously. (Figure with ŷ axes, panels: TEST, ARRAY 1; REFERENCE, ARRAY 1; REFERENCE, ARRAY 2; TEST, ARRAY 2.)

Preprocessing: more arrays simultaneously. Additive error and non-linear effects undermine the application of ANOVA.

Preprocessing: bootstrap analysis. After lowess, the 99 % confidence interval (based on 100 genes, 3000 bootstraps) retained 370 genes (62 with a t-test p value < 0.01).

Comparison of methods
Methods tested on the pygmy dataset (3750 genes):
1. ANOVA 99 % CI
2. ANOVA 95 % CI
3. SAM
4. T-test
5. Fold test
Retained 360 genes. For each gene, construct a binary profile, then hierarchically cluster the genes based on this profile. Only 8 genes were retained by all methods.

methods Comparison

methods Comparison

Experiment design: Latin square (mouse data set)
Reference: normal mouse. Condition: pygmy mouse.
Two experiments, C=1 and C=2, reflect two sample time points.
2 batches (B1, B2): not all genes of the genome fit on one array.

Array  Experiment  Batch  Test  Ref
A1     C1          B1     R     G
A2     C1          B1     G     R
A3     C1          B2     R     G
A4     C1          B2     R     G
A5     C2          B1     R     G
A6     C2          B1     G     R
A7     C2          B2     R     G
A8     C2          B2     G     R