Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.


Preprocessing (diagram): background correction and log transformation, followed by either the array-by-array approach (filtering, normalization, ratio, test statistic such as the T-test) or the ANOVA-based approach (filtering, linearisation, bootstrapping).

Overview of further analysis (diagram): raw data are preprocessed into preprocessed data; a test statistic then identifies differentially expressed genes, and clustering identifies clusters of coexpressed genes.

Test statistic
Comparison of 2 experiments: fold test, T-test, SAM, …
- A plethora of different methods is available. Which one performs best?
- The methods rest on different underlying statistical assumptions, which has implications for the final result.
- It is difficult to define the best method.

Type 1: comparison of 2 samples
- Statistical testing of a control sample against an induced sample
- Retrieve the statistically over- or underexpressed genes
Diff Expr Genes: test statistic

Black/white experiment
Description (array V mice genes):
- Condition 1: pygmy mouse, 10 days old (test)
- Condition 2: normal mouse, 10 days old (ref)
- Goal: detect differentially expressed genes
Experiment design (Latin square):
Array   | Dye 1 (replicas L and R) | Dye 2 (replicas L and R)
Array 1 | Condition 1              | Condition 2
Array 2 | Condition 2              | Condition 1
Per gene and per condition, 4 measurements are available.

Fold change (ratio test)
- 4 measurements per gene and condition: calculate the average, then sort the averages
- A gene is called differentially expressed when log(sample/control) exceeds a threshold (usually a 2-fold change)
- The threshold is arbitrary
- Discards all information obtained from the replicates
- Implicitly assumes constant variance, but the variance depends on the expression value
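The fold test above can be sketched in a few lines. This is only an illustration under stated assumptions: the gene names and intensities are hypothetical, the helper names `log2_fold_change` and `fold_change_calls` are invented here, and a base-2 logarithm with a 2-fold threshold is assumed.

```python
import math
from statistics import mean

def log2_fold_change(test, ref):
    """Log2 of the ratio between the average test and reference intensities."""
    return math.log2(mean(test) / mean(ref))

def fold_change_calls(data, threshold=2.0):
    """Return the genes whose absolute fold change reaches the threshold."""
    cutoff = math.log2(threshold)
    return [gene for gene, (test, ref) in data.items()
            if abs(log2_fold_change(test, ref)) >= cutoff]

# Hypothetical intensities: 4 measurements per gene and condition (test, ref)
example = {
    "gene_a": ([800, 820, 790, 810], [400, 390, 410, 405]),          # consistent change
    "gene_b": ([30, 170, 20, 220], [40, 60, 45, 55]),                # noisy, low level
    "gene_c": ([5000, 5100, 4900, 5050], [4000, 4100, 3950, 4050]),  # small but consistent
}

print(fold_change_calls(example))  # → ['gene_a', 'gene_b']
```

Because only the averages are compared, the noisy low-level gene_b passes the threshold while the consistent but smaller change in gene_c is missed, which is the weakness the slide points out.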

Why does fold change fail?
- The majority of genes are expressed at low levels, where signal/noise is low => not sufficiently conservative
  - a 2-fold change occurs at random for a large number of genes
  - high number of false positives
- At higher levels of expression, smaller changes in gene expression may be real => too conservative
  - high number of false negatives
Improvements:
- T-test
- Pairwise fold change: genes are significantly differentially expressed if an R-fold change is observed consistently between paired samples
- SAM

T-test: hypothesis test
- Possible if replicates of reference and test are available
- Tests the significance of the difference between the reference and test data (level of expression) relative to the observed level of within-class variation (consistency)
Assumptions:
- Normal distribution of the variables
- Population mean and variance are estimated from the data => Student t distribution under the H0 hypothesis
- Not all genes need to have the same variance
- Under the null hypothesis the sample means should be equal (rescaling obligatory)

Paired t-test (microarray data are paired)
- Consider the paired data as a new variable (the ratio)
- Calculate the average ratio
- Calculate the standard deviation of the 4 ratio measurements
- Determine the t-value
- From the degrees of freedom (df), the Student t distribution and the t-value, obtain the p-value
- The p-value is the probability of obtaining a test statistic at least as extreme as the observed one if the null hypothesis is true
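The paired t-test steps above can be sketched as follows. This is a minimal illustration, assuming four hypothetical log ratios for one gene; the function name `paired_t` is invented here. Instead of computing a p-value, the sketch compares the t-value with the two-sided 5% critical value of the Student t distribution for df = 3 (3.182).

```python
import math
from statistics import mean, stdev

def paired_t(log_ratios):
    """t-value and degrees of freedom for H0: the mean log ratio is 0."""
    n = len(log_ratios)
    sd = stdev(log_ratios)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all ratios identical: t-value undefined")
    t_value = mean(log_ratios) / (sd / math.sqrt(n))
    return t_value, n - 1

# Hypothetical log ratios (test/ref) from the 4 paired measurements of one gene
log_ratios = [1.1, 0.9, 1.2, 1.0]
t_value, df = paired_t(log_ratios)

T_CRIT_DF3 = 3.182  # two-sided 5% critical value of Student t, df = 3
print(f"t = {t_value:.2f}, df = {df}, reject H0: {abs(t_value) > T_CRIT_DF3}")
```

With only 4 measurements the critical value is large, so a gene needs a change that is big relative to its own replicate scatter before H0 is rejected.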

Classical hypothesis tests (t-test, Wilcoxon rank-sum test, ...):
- A test statistic is calculated (t-value)
- The p-value is calculated: the probability that an equally extreme or more extreme test statistic is generated if a certain null hypothesis is true
- The null hypothesis: the gene has no difference in mean expression levels between the 2 conditions
- Low p-value (below the rejection level α): the null hypothesis is not likely, so it is rejected: there is a difference in (mean) expression between the two classes
t-test for gene x: H0: D = 0, H1: D ≠ 0 (figure: distributions under H0 and H1 with the type I and type II error regions)

Comparison of fold test with paired t-test
- Gene expression levels are measured under two different conditions
- Rejection level α
  - p_j < α: null hypothesis rejected (result positive)
  - p_j > α: null hypothesis not rejected (result negative)
- But with multiple testing, type I and type II errors occur: false positives and false negatives

SAM
- Each gene is assigned a score on the basis of its change in expression relative to the standard deviation of repeated measurements for that gene
- H0 (expected relative difference) is estimated by permutation analysis
  - Permute the samples
  - Calculate d(i) values for both the experimental samples and the permuted control samples
  - Rank genes by the magnitude of their d(i) values for both the experimental and the permuted control samples

SAM: procedure
Observed values:
- Calculate the d(i) value for each gene
- Rank genes according to their d(i) value
Simulated values:
- Permute the dataset
- Calculate the d(i) value for each gene in each permuted dataset
- Calculate the average d(i) value for each gene
- Rank the d(i) values
Make a scatterplot
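The SAM procedure above can be sketched as follows. This is a simplified illustration, not the published SAM implementation. The statistic d(i) = (mean difference) / (s(i) + s0) follows the usual SAM definition, but the fixed s0, the cutoff `delta`, the exhaustive enumeration of label assignments (feasible only for small designs), the rank-wise averaging of the permuted values and the simple calling rule are assumptions made here. All data and names are hypothetical.

```python
import itertools
import math
import random
from statistics import mean

def d_stat(test, ref, s0):
    """Relative difference: mean difference over (gene-specific scatter + s0)."""
    n1, n2 = len(test), len(ref)
    m1, m2 = mean(test), mean(ref)
    ss = sum((v - m1) ** 2 for v in test) + sum((v - m2) ** 2 for v in ref)
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)

def sam_calls(genes, n_test, s0=0.1, delta=1.0):
    """genes maps a name to its values; the first n_test values are test samples."""
    names = list(genes)
    n = len(genes[names[0]])
    observed = {g: d_stat(genes[g][:n_test], genes[g][n_test:], s0) for g in names}
    ranked = sorted(names, key=lambda g: observed[g])

    # Expected d at each rank: average of the sorted d values over all relabelings
    expected = [0.0] * len(names)
    splits = list(itertools.combinations(range(n), n_test))
    for test_idx in splits:
        ref_idx = [i for i in range(n) if i not in test_idx]
        permuted = sorted(
            d_stat([genes[g][i] for i in test_idx], [genes[g][i] for i in ref_idx], s0)
            for g in names
        )
        for rank, d in enumerate(permuted):
            expected[rank] += d / len(splits)

    # Call genes whose observed d deviates from the expected d by more than delta
    return [(g, observed[g], expected[rank])
            for rank, g in enumerate(ranked)
            if abs(observed[g] - expected[rank]) > delta]

# Hypothetical data: 19 noise genes plus one clearly changed gene, 4 + 4 samples
rng = random.Random(1)
genes = {f"null_{k}": [rng.gauss(0, 1) for _ in range(8)] for k in range(19)}
genes["gene_de"] = [10.1, 9.9, 10.2, 9.8, 0.1, -0.1, 0.2, -0.2]

called = sam_calls(genes, n_test=4)
for name, obs, exp in called:
    print(f"{name}: observed d = {obs:.2f}, expected d = {exp:.2f}")
```

Plotting the observed against the expected values for all ranks gives the scatterplot mentioned in the procedure; genes far from the diagonal are the candidates for differential expression.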


Comparison of the test statistics
Test statistic | Assumptions | Distribution under H0
T-test | Errors normally distributed, errors of equal variance (iid) | Parametrized: Student t-distribution
Paired t-test | Errors normally distributed, less stringent assumption | Parametrized: Student t-distribution
SAM | No explicit assumption | Order statistics
With a restricted number of repeat measurements it is impossible to evaluate the assumptions.

Multiple testing: problem (adapted from De Smet et al)
- P-value: a measure of significance in terms of the false positive rate, i.e. the rate at which truly null features are called significant
- A significance level of 5% means that on average 5% of the truly null features will be called significant (type I error)
- Type I error: the null hypothesis is rejected when it is true. An 'accidental' low p-value leads to a gene being falsely declared differentially expressed (false positive)
- Multiple testing example: for a set of genes with random expression profiles and α = 5%, one would find ≈ 500 genes with a p-value lower than 5%, all of them false positives
- Type II error: the null hypothesis is not rejected when it is not true (false negative). A gene that is actually differentially expressed is not declared differentially expressed
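The multiple testing example can be checked with a small simulation. This is a sketch under stated assumptions: p-values of truly null genes are drawn from a uniform distribution, and the number of genes (10,000) is a hypothetical value chosen here because it gives about 500 false positives at α = 5%.

```python
import random

def count_false_positives(n_genes, alpha, seed=0):
    """Count null p-values below alpha; under H0 p-values are uniform on [0, 1)."""
    rng = random.Random(seed)
    pvalues = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in pvalues)

N_GENES = 10_000  # hypothetical number of genes, none differentially expressed
ALPHA = 0.05

false_positives = count_false_positives(N_GENES, ALPHA)
print(f"{false_positives} of {N_GENES} null genes have p < {ALPHA}")
print(f"expected on average: {ALPHA * N_GENES:.0f}")
```

Every gene counted here is a false positive, since no gene in the simulated set is differentially expressed.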

Multiple testing: solutions (adapted from De Smet et al)
- Control of the familywise error rate (FWE): P(FP ≥ 1), protection against type I errors
- Bonferroni correction: reject the null hypothesis at rejection level α/N, which guarantees that FWE = P(FP ≥ 1) ≤ α
- This is OK when very few genes are expected to be actually differentially expressed (i.e., affected by the difference in conditions, so that the null hypothesis is false for them): every false positive is 'costly'
- The rejection level becomes very conservative
- But in microarray data a considerable number of genes is usually differentially expressed: control of the FWE results in a severe loss of statistical power (the FN or type II error is large)
- In practice we do not have to protect against every possible FP
- Better solution: the FDR (false discovery rate)
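The Bonferroni correction amounts to dividing the rejection level by the number of tests. A minimal sketch, with hypothetical p-values and an invented helper name `bonferroni_reject`:

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Reject H0 for p-values below alpha / N, where N is the number of tests."""
    threshold = alpha / len(pvalues)
    return [p < threshold for p in pvalues]

# Hypothetical p-values for 5 genes: the per-gene rejection level is 0.05 / 5 = 0.01
pvalues = [0.00001, 0.004, 0.03, 0.2, 0.6]
print(bonferroni_reject(pvalues))  # → [True, True, False, False, False]
```

With thousands of genes the per-gene level α/N becomes extremely small, which is the loss of power described above.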

FDR (adapted from De Smet et al)
- We need a sensible balance between the number of true positives and the number of false positives. Therefore it is better to control the 'False Discovery Rate' (FDR) instead of the FWE
- The false positive rate: the rate at which truly null features are called significant
- The FDR: the % of false positives among all the genes that are declared positive, i.e. the % of true null hypotheses erroneously rejected among all the null hypotheses rejected

Difference between p-value and FDR
- 5% FDR: 5% false positives among the features called significant
- 5% p-value cutoff: 5% false positives among all the null features in the dataset; this says little about the content of the features actually called significant

Estimating the FDR (adapted from De Smet et al)
- An estimate of E[S(t)] is the observed S(t): i, the number of observed p-values ≤ p_i
- E[F(t)] = N_0 p_i
- N_0 has to be estimated
Figure (distribution of p-values): a superposition of two distributions, a uniform distribution for genes with no real differential expression (randomised data set) and a distribution for non-accidental differential expression; the rejection level α separates TP and FP from FN and TN.
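The quantities above combine into an FDR estimate at each observed p-value: the expected number of false positives N_0 p_i divided by the number of genes called significant, i. The sketch below assumes N_0 = N (every gene treated as truly null), which is a conservative choice made here rather than the estimate of N_0 the slide refers to. It also makes the estimates monotone in the p-value, a common convention that is likewise an assumption. The p-values are hypothetical.

```python
def estimate_fdr(pvalues, n0=None):
    """Estimated FDR per p-value: n0 * p_i / i, with i the rank of p_i."""
    n = len(pvalues)
    if n0 is None:
        n0 = n  # conservative assumption: all genes are truly null
    order = sorted(range(n), key=lambda k: pvalues[k])
    fdr = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the estimates never increase
    # when moving to a smaller p-value
    for rank in range(n, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, n0 * pvalues[k] / rank)
        fdr[k] = running_min
    return fdr

# Hypothetical p-values for 5 genes
pvalues_fdr = [0.001, 0.008, 0.039, 0.041, 0.6]
for p, q in zip(pvalues_fdr, estimate_fdr(pvalues_fdr)):
    print(f"p = {p:<6} estimated FDR = {q:.5f}")
```

Calling the four smallest p-values significant corresponds here to an estimated FDR of about 5%, whereas a Bonferroni level of 0.05/5 = 0.01 would keep only the first two.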

Overview: microarray preprocessing
- Gene expression
- Omics era
- Transcript profiling
- Experiment design
- Preprocessing
- Slide by slide normalisation
- ANOVA
- Exercises