Fishing expeditions in gloomy waters: Detecting differential expression in microarray data Matthias E. Futschik Institute for Theoretical Biology Humboldt-University,

Slides:



Advertisements
Similar presentations
Improved normalisation of microarray data by optimised iterative local regression Matthias E. Futschik Department of Information Science University of.
Advertisements

Lecture 9 Microarray experiments MA plots
Microarray data analysis – Gold-mining in a minefield Matthias E. Futschik Institute for Theoretical Biology Medical Devision (Charité), Humboldt-University,
Statistical tests for differential expression in cDNA microarray experiments (2): ANOVA Xiangqin Cui and Gary A. Churchill Genome Biology 2003, 4:210 Presented.
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Pre-processing in DNA microarray experiments Sandrine Dudoit PH 296, Section 33 13/09/2001.
Filtering and Normalization of Microarray Gene Expression Data Waclaw Kusnierczyk Norwegian University of Science and Technology Trondheim, Norway.
Normalization of microarray data
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Getting the numbers comparable
DNA Microarray Bioinformatics - #27612 Normalization and Statistical Analysis.
Microarray Data Preprocessing and Clustering Analysis
MARE 250 Dr. Jason Turner Hypothesis Testing II. To ASSUME is to make an… Four assumptions for t-test hypothesis testing:
Differentially expressed genes
Statistical Analysis of Microarray Data
Lecture 19: Tues., Nov. 11th R-squared (8.6.1) Review
Analysis of Differential Expression T-test ANOVA Non-parametric methods Correlation Regression.
Normalization of 2 color arrays Alex Sánchez. Dept. Estadística Universitat de Barcelona.
Making Sense of Complicated Microarray Data
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
Filtering and Normalization of Microarray Gene Expression Data Waclaw Kusnierczyk Norwegian University of Science and Technology Trondheim, Norway.
1 Normalization Methods for Two-Color Microarray Data 1/13/2009 Copyright © 2009 Dan Nettleton.
(4) Within-Array Normalization PNAS, vol. 101, no. 5, Feb Jianqing Fan, Paul Tam, George Vande Woude, and Yi Ren.
General Linear Model & Classical Inference
Practical statistics for Neuroscience miniprojects Steven Kiddle Slides & data :
The following slides have been adapted from to be presented at the Follow-up course on Microarray Data Analysis.
CDNA Microarrays Neil Lawrence. Schedule Today: Introduction and Background 18 th AprilIntroduction and Background 25 th AprilcDNA Mircoarrays 2 nd MayNo.
Probability Distributions and Test of Hypothesis Ka-Lok Ng Dept. of Bioinformatics Asia University.
Multiple testing in high- throughput biology Petter Mostad.
Practical Issues in Microarray Data Analysis Mark Reimers National Cancer Institute Bethesda Maryland.
DNA microarray technology allows an individual to rapidly and quantitatively measure the expression levels of thousands of genes in a biological sample.
Essential Statistics in Biology: Getting the Numbers Right
Statistics & Biology Shelly’s Super Happy Fun Times February 7, 2012 Will Herrick.
DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4.
CDNA Microarrays MB206.
Microarrays & Expression profiling Matthias E. Futschik Institute for Theoretical Biology Humboldt-University, Berlin, Germany Bioinformatics Summer School,
Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.
WORKSHOP SPOTTED 2-channel ARRAYS DATA PROCESSING AND QUALITY CONTROL Eugenia Migliavacca and Mauro Delorenzi, ISREC, December 11, 2003.
Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
MGS3100_04.ppt/Sep 29, 2015/Page 1 Georgia State University - Confidential MGS 3100 Business Analysis Regression Sep 29 and 30, 2015.
Lecture Topic 5 Pre-processing AFFY data. Probe Level Analysis The Purpose –Calculate an expression value for each probe set (gene) from the PM.
Statistical Principles of Experimental Design Chris Holmes Thanks to Dov Stekel.
Introduction to Statistical Analysis of Gene Expression Data Feng Hong Beespace meeting April 20, 2005.
Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments Presented by Nan Lin 13 October 2002.
1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:
Statistical Inference for the Mean Objectives: (Chapter 9, DeCoursey) -To understand the terms: Null Hypothesis, Rejection Region, and Type I and II errors.
Experimental Research Methods in Language Learning Chapter 10 Inferential Statistics.
Statistics for Differential Expression Naomi Altman Oct. 06.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
Two-Way (Independent) ANOVA. PSYC 6130A, PROF. J. ELDER 2 Two-Way ANOVA “Two-Way” means groups are defined by 2 independent variables. These IVs are typically.
For a specific gene x ij = i th measurement under condition j, i=1,…,6; j=1,2 Is a Specific Gene Differentially Expressed Differential expression.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
BIOL 582 Lecture Set 2 Inferential Statistics, Hypotheses, and Resampling.
Statistical Inference for the Mean Objectives: (Chapter 8&9, DeCoursey) -To understand the terms variance and standard error of a sample mean, Null Hypothesis,
Microarray Data Analysis Xuming He Department of Statistics University of Illinois at Urbana-Champaign.
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
Differential Gene Expression
Getting the numbers comparable
Normalization for cDNA Microarray Data
Presentation transcript:

Fishing expeditions in gloomy waters: Detecting differential expression in microarray data Matthias E. Futschik Institute for Theoretical Biology Humboldt-University, Berlin, Germany Hvar Summer School, 2004

Overview Starting points: Where are we? Gene expression matrix Data pre-processing Background subtraction Data transformation Normalisation Hybridisation model Within slide normalisation Local regression Detection of differential expression Hypothesis testing Statistical tests

Roadmap: Where are we? Good news: We are almost ready for ‘higher` data analysis !

Data-Preprocessing  Background subtraction:  May reduce spatial artefacts  May increase variance as both foreground and background intensities are estimates (  “arrow- like” plots MA-plots)  Preprocessing:  Thresholding: exclusion of low intensity spots or spots that show saturation  Transformation: A common transformation is log-transformation for stabilitation of variance across intensity scale and detection of dye related bias. Log-transformation

The problem: Are all low intensity genes down-regulated?? Are all genes spotted on the left side up-regulated ??

Hybridisation model Microarrays do not assess gene activities directly, but indirectly by measuring the fluorescence intensities of labelled target cDNA hybridised to probes on the array. So how do we get what we are interested in? Answer: Find the relation between flourescance spot intensities and mRNA abundance! Explicitly modelling the relation between signal intensities and changes in gene expression can separate the measured error into systematic and random errors. Systematic errors are errors which are reproducible and might be corrected in the normalisation procedure, whereas random errors cannot be corrected, but have to be assessed by replicate experiments.

Hybridisation model for two-colour arrays I = N(θ) A + ε A first attempt: For two-colour microarrays, the fundamental variables are the fluorescence intensities of spots in the red (I r ) and the green channel (I g ). These intensities are functions of the abundance of labelled transcripts A r/g. Under ideal circumstances, this relation of I and A is linear up to an additional experimental error ε: N : normalisation factor determined by experimental parameters θ such as the laser power amplification of the scanned signal. Frequently, however, this simple relation does not hold for microarrays due to effects such as intensity background, and saturation.

Hybridisation model for two-colour arrays M - κ (θ) = D + ε Let`s try a more flexible approach based on ratio R (pairing of intensities reduces variablity due to spot morphology) After some calculus (homework! I will check it tomorrow) we get How do we get κ (θ)? κ: non-linear normalisation factors (functions) dependent on experimental parameters. D = log 2 (A r /A g ) M = log 2 (I r /I g )

Normalization – bending data to make it look nicer... Normalization describes a variety of data transformations aiming to correct for experimental variation

Within – array normalization  Normalization based on 'householding genes' assumed to be equally expressed in different samples of interest  Normalization using 'spiked in' genes: Ajustment of intensities so that control spots show equal intensities across channels and arrays  Global linear normalisation assumes that overall expression in samples is constant. Thus, overall intensitiy of both channels is linearly scaled to have value.  Non-linear normalisation assumes symmetry of differential expression across intensity scale and spatial dimension of array

Normalization by local regression Regression of local intensity >> residuals are 'normalized' log-fold changes Common presentation: MA-plots: A = 0.5* log 2 (Cy3*Cy5) M = log 2 (Cy5/Cy3) >> Detection of intensity-dependent bias! Similarly, MXY-plots for detection of spatial bias. M, and thus κ, is function of A, X and Y Normalized expression changes show symmetry across intensity scale and slide dimension

Normalisation by local regression and problem of model selection Example: Correction of intensity-dependent bias in data by loess (MA-regression: A=0.5*(log 2 (Cy5)+log 2 (Cy3)); M = log 2 (Cy5/Cy3); Raw data Local regression Corrected data However, local regression and thus correction depends on choice of parameters. Correction: M- M reg ? ? ? Different choices of paramters lead to different normalisations.

Optimising by cross-validation and iteration Iterative local regression by locfit (C.Loader): 1) GCV of MA-regression 2) Optimised MA-regression 3) GCV of MXY-regression 4) Optimised MXY-regression 2 iterations generally sufficient GCV of MA

Optimised local scaling Iterative regression of M and spatial dependent scaling of M: 1) GCV of MA-regression 2) Optimised MA-regression 3) GCV of MXY-regression 4) Optimised MXY-regression 5) GCV of abs(M)XY-regression 6) Scaling of abs(M)

Comparison of normalisation procedures MA-plots: 1) Raw data 2) Global lowess (Dudoit et al.) 3) Print-tip lowess (Dudoit et al.) 4) Scaled print-tip lowess (Dudoit et al.) 5) Optimised MA/MXY regression by locfit 6) Optimised MA/MXY regression wit1h scaling => Optimised regression leads to a reduction of variance (bias)

Comparison II: Spatial distribution => Not optimally normalised data show spatial bias MXY-plots can indicate spatial bias MXY-plots:

Averaging by sliding window reveals un- corrected bias Distribution of median M within a window of 5x5 spots: => Spatial regression requires optimal adjustment to data

Statistical significance testing by permutation test M Original distribution What is the probabilty to observe a median M within a window by chance? Comparison with empirical distribution => Calculation of probability (p-value) using Fisher’s method M r1 M r2 M r3 Randomised distributions

Statistical significance testing by permutation test Histogram of p-values for a window size of 5x5 Number of permutation: 10 6 p-values for negative M p-values for positive M

Statistical significance testing by permutation test MXY of p-values for a window size of 5x5 Number of permutation: 10 6 Red: significant positive M Green: significant negative M M. Futschik and T. Crompton, Genome Biology, to appear

Normalization makes results of different microarrays comparable  Between-array normalization  scaling of arrays linearly or e.g. by quantile-quantile normalization  Usage of linear model e.g. ANOVA or mixed-models:  y ijg = µ + A i + D j + AD ig + G g + VG kg + DG jg + AG ig + ε ijg

Classical hypothesis testing: 1) Setting up of null hypothesis H 0 (e.g. gene X is not differentially expressed) and alternative hypothesis H a (e.g. Gene X is differentially expressed) 2) Using a test statistic to compare observed values with values predicted for H 0. 3) Define region for the test statistic for which H 0 is rejected in favour of H a. Going fishing: What is differentially expressed

Significance of differential gene expression Typical test statistics 1) Parametric tests e.g. t-test, F-test assume a certain type of underlying distribution 2) Non-parametric tests (i.e. Sign test, Wilcoxon rank test) have less stringent assumptions t = (       P-value: probability of occurrence by chance Two kinds of errors in hypothesis testing: 1) Type I error: detection of false positive 2) Type II error: detection of false negative Level of significance :α = P(Type I error) Power of test : 1- P(Type II error) = 1 – β

Detection of differential expression What makes differential expression differential expression? What is noise? Foldchanges are commonly used to quantify differenitial expression but can be misleading (intensity- dependent). Basic challange: Large number of (dependent/correlated) variables compared to small number of replicates (if any). Can you spot the interesting spots?

Criteria for gene selection  Accuracy: how closely are the results to the true values  Precision: how variable are the results compared to the true value  Sensitivity: how many true posítive are detected  Specificity: how many of the selected genes are true positives.

>> Multiple testing required with large number of tests but small number of replicates. >> Adjustment of significance of tests necessary Example: Probability to find a true H 0 rejected for α=0.01 in 100 independent tests: P = 1- (1-α) 100 ~ 0.63 Multiple testing poses challanges

Compound error measures: Per comparison error rate: PCER= E[V]/N Familiywise error rate: FWER=P(V≥1) False discovery rate: FDR= E[V/R] N: total number of tests V: number of reject true H 0 (FP) R: number of rejected H (TP+FP) Aim to control the error rate: 1) by p-value adjustment (step-down procedures: Bonferroni, Holm, Westfall-Young,...) 2) by direct comparison with a background distribution (commonly generated by random permuation)

Alternative approach: Treat spots as replicates For direct comparison: Gene X is significantly differentially expressed if corresponding fold change falls in chosen rejection region. The parameters of the underlying distribution are derived from all or a subset of genes. Since gene expression is usually heteroscedastic with respect to abundance, variance has to be stabalised by local variance estimation. Alternatively, local estimates of z- score can be derived.

Constistency of replications Case study : SW480/620 cell line comparison SW480: derived from primary tumour SW 620: derived from lymphnode metastisis of same patient  Model for cancer progression Experimental design: 4 independent hybridisations, 4000 genes cDNA of SW620 Cy5-labelled, cDNA of SW480 Cy3-labelled.  This design poses a problem! Can you spot it?

Usage of paired t-test d: average differences of paired intensities s d : standard deviation of d p-value < 0.01 Bonferroni adjusted p-value < 0.01

Robust t-test Adjust estimation of variance: Compound error model: Gene-specific error Experiment-specific error This model avoids selection of control spots M. Futschik et al, Genome Letters, 2002

Another look at the results Significant genes as red spots: 3 σ-error bars do not overlap with M=0 axis. That‘s good!

Take-home messages Don‘t download and analyse array data blindly Visualise distributions: the eye is astonishing good in finding interesting spots Use different statistics and try to understand the differences Remember: Statistical significance is not necessary biological significance! Ready to go fishing in Hvar... ?