Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.


Background (Microarray) Cells Extract RNA

Background Cells Extract RNA genes

Background
Biological sample → RNA extraction (total RNA or mRNA) → Amplification (in vitro transcription) → Label samples → Hybridization → Washing and staining → Scanning
Microarrays are highly noisy.
Use replicated experiments to make inferences about differential expression for the population from which the biological samples originate.
Sources of variability: biological variability, technical variability

Background Normalization Calculate Gene Expression Index

An Example
5 normal samples and 9 myeloma (MM) samples; genes (rows)

Genes of Interest
Statistical significance: the observed differential expression is unlikely to be due to chance.
Scientific significance: the observed level of differential expression is of sufficient magnitude to be of biological relevance.

Statistical significance in the two group problem
Group 1 (N samples): $X_1, X_2, \ldots, X_N$
Group 2 (M samples): $Y_1, Y_2, \ldots, Y_M$
Assume $X_i \sim \mathrm{Normal}(\mu_1, \sigma^2)$ and $Y_j \sim \mathrm{Normal}(\mu_2, \sigma^2)$.
Null hypothesis: Group 1 is the same as Group 2 (i.e., $\mu_1 = \mu_2$).
Parametric test: t-test

Statistical significance in the two group problem
$X_i \sim \mathrm{Normal}(\mu_1, \sigma^2)$, $Y_j \sim \mathrm{Normal}(\mu_2, \sigma^2)$
Null hypothesis: $\mu_1 = \mu_2$
Test the null hypothesis with a test statistic.
Parametric test: t-test
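The formula for the test statistic did not survive in the transcript. A minimal sketch follows, assuming the standard pooled-variance two-sample t statistic, which matches the equal-variance model stated on this slide. The function name `pooled_t_statistic` and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Two-sample t statistic assuming both groups share one variance."""
    n, m = len(x), len(y)
    # Pooled estimate of the common variance sigma^2.
    s2 = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    t = (mean(x) - mean(y)) / math.sqrt(s2 * (1 / n + 1 / m))
    # Under the null hypothesis, t follows a t distribution with n + m - 2 df.
    return t, n + m - 2


# Hypothetical expression values for one gene in the two groups.
t, df = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0])
```

Applied to one row of the expression matrix at a time, this gives one statistic per gene.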

If variances are unequal
$X_i \sim \mathrm{Normal}(\mu_1, \sigma_1^2)$, $Y_j \sim \mathrm{Normal}(\mu_2, \sigma_2^2)$, $\sigma_1 \neq \sigma_2$
(1) When $N + M > 30$, the test statistic is approximately normal.
(2) When $\sigma_1 \gg \sigma_2$, it is approximately $t(df = N - 1)$.
(3) In general, use the Welch approximation: $t' \sim t(df')$.
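The expressions for the unequal-variance statistic and its degrees of freedom are missing from the transcript. The sketch below assumes the usual Welch statistic with the Welch-Satterthwaite degrees of freedom; `welch_t` and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch t statistic and approximate degrees of freedom (unequal variances)."""
    n, m = len(x), len(y)
    # Estimated variance of each group mean.
    vx, vy = variance(x) / n, variance(y) / m
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df


t_prime, df_prime = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0])
```

The approximate degrees of freedom always fall between min(N, M) - 1 and N + M - 2, so they are generally not an integer.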

Wilcoxon rank sum test
Consider row 7 of the MM study: rank sum = 23.
This test is more appropriate than the t-tests when the underlying distribution is far from normal (but it requires large group sizes).
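As an illustration of how a rank sum is obtained, the sketch below pools the two groups, ranks the pooled values, and adds up the ranks belonging to the first group. Giving tied values their average rank is an assumption, and the data are hypothetical, not row 7 of the MM study.

```python
def rank_sum(x, y):
    """Sum of the ranks of x within the pooled sample (ties share the average rank)."""
    pooled = sorted(list(x) + list(y))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        # Advance j past the run of values equal to pooled[i].
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # The run occupies 1-based ranks i+1 .. j; store their mean.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(ranks[v] for v in x)


w = rank_sum([1.2, 3.4, 0.5], [2.2, 5.0])
```

Only the ordering of the values matters here, which is why the test does not depend on normality.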

P-value
The p-value, $p = P(|T| > |t|)$, is calculated based on the distribution of $T$ under the null hypothesis.
The p-value is a function of the test statistic and can be viewed as a random variable.
– e.g. $p = 2(1 - F(|t^*|))$, where $F$ is the cdf of $t(N + M - 2)$.
A small p-value represents evidence against the null hypothesis → differentially expressed in our case.

Permutation test
A non-parametric way of computing a p-value for any test statistic.
– In the MM study, each gene has (14 choose 5) = 2002 different test values obtainable from permuting the group labels.
Under the null hypothesis that the distributions for the two groups are identical, all these test values are equally probable.
What is the probability of getting a test value at least as extreme as the observed one? This is the permutation p-value.
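The procedure can be sketched by enumerating every relabeling exactly. The choice of test statistic (absolute difference of group means), the function name, and the data below are assumptions for illustration; any other statistic can be passed in.

```python
from itertools import combinations
from math import comb
from statistics import mean


def permutation_p_value(x, y, stat=lambda a, b: abs(mean(a) - mean(b))):
    """Exact permutation p-value: share of relabelings at least as extreme as observed."""
    pooled = list(x) + list(y)
    observed = stat(list(x), list(y))
    idx = range(len(pooled))
    count = total = 0
    # Each choice of len(x) positions is one way to relabel the samples.
    for group1 in combinations(idx, len(x)):
        chosen = set(group1)
        a = [pooled[i] for i in group1]
        b = [pooled[i] for i in idx if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point noise in ties.
        if stat(a, b) >= observed - 1e-12:
            count += 1
    return count / total


# Number of relabelings for 5 normal and 9 MM samples, as in the MM study.
n_relabelings = comb(14, 5)

# Hypothetical, well separated groups of 3 and 3 samples.
p = permutation_p_value([1.0, 2.0, 3.0], [10.0, 11.0, 12.0])
```

With small groups the p-value cannot be smaller than one over the number of relabelings, so with 2002 relabelings the smallest attainable value is about 0.0005.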

Permutation technique
Condition 0: Patient 4, Patient 2, Patient 3 | Condition 1: Patient 1, Patient 5, Patient 6
Condition 0: Patient 1, Patient 2, Patient 5 | Condition 1: Patient 4, Patient 3, Patient 6
Condition 0: Patient 1, Patient 6, Patient 3 | Condition 1: Patient 4, Patient 5, Patient 2
Condition 0: Patient 1, Patient 2, Patient 3 | Condition 1: Patient 4, Patient 5, Patient 6
For each arrangement of the labels, compute the test statistic: TS 0, TS 1, TS 2, TS 3.
The set of TS i forms the empirical distribution of the test statistic TS.

Scientific Significance
Fold change (FC)
– May not be high when statistical significance is high.
– Not an appropriate measure if the dispersion is not taken into consideration.

Conservative fold change
Conservative fold change (CFC) = max(25th percentile of sample 1 / 75th percentile of sample 2, 25th percentile of sample 2 / 75th percentile of sample 1)
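The definition translates directly into code. This sketch assumes positive expression values and uses the "inclusive" quartile method from Python's `statistics` module; the slides do not say which percentile convention was used, so results on small samples may differ slightly. The function name and data are hypothetical.

```python
from statistics import quantiles


def conservative_fold_change(sample1, sample2):
    """CFC: the larger of the two 25th-percentile / 75th-percentile ratios."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q25_1, _, q75_1 = quantiles(sample1, n=4, method="inclusive")
    q25_2, _, q75_2 = quantiles(sample2, n=4, method="inclusive")
    return max(q25_1 / q75_2, q25_2 / q75_1)


# Hypothetical expression values for one gene in two groups.
cfc = conservative_fold_change([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])
```

Because the ratio compares the low end of one group with the high end of the other, overlapping groups give a CFC near or below 1 even when their means differ.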

Sample 1: Normal (100, 1) Sample 2: Normal (103, 1) CFC =

[Figure: four example panels with CFC = 3.53, CFC = 1.07, CFC = 2.89, and CFC = 1.45]

P-values and FC contain different information

Gene Selection and Ranking
A high threshold of statistical significance → select genes with p-values smaller than a threshold.
The selected genes are ordered according to their scientific significance (i.e., ranked by fold change).
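The two-step procedure can be sketched as a filter followed by a sort. The tuple layout `(name, p_value, fold_change)`, the gene names, and the default threshold of 0.01 are hypothetical choices for illustration.

```python
def select_and_rank(genes, p_threshold=0.01):
    """Keep genes with p-value below the threshold, then rank by fold change."""
    # Step 1: statistical significance filter.
    selected = [g for g in genes if g[1] < p_threshold]
    # Step 2: order the survivors by scientific significance, largest first.
    return sorted(selected, key=lambda g: g[2], reverse=True)


# Hypothetical (name, p_value, fold_change) records.
genes = [
    ("geneA", 0.001, 1.8),
    ("geneB", 0.200, 5.0),
    ("geneC", 0.004, 3.1),
]
ranked = select_and_rank(genes)
```

Here geneB has the largest fold change but is dropped because its p-value does not pass the threshold.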

The False Positive Rate (FPR)
If we select genes with p-value < 0.01, then the probability of making a positive call when the gene is in fact not differential is less than 0.01.
Thus selection by p-value controls the FPR.
However, if we have 12,000 genes in a microarray, then an FPR of 0.01 still allows up to 120 false positives.
To make a sensible decision, we must take multiple comparisons into consideration.
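The arithmetic behind the 120 figure is the number of genes times the per-gene false positive rate, which is the expected count if no gene is truly differential. A small sketch, with a hypothetical function name:

```python
def expected_false_positives(n_genes, fpr):
    """Expected number of false calls if every gene is truly non-differential."""
    return n_genes * fpr


# 12,000 genes tested at a p-value threshold of 0.01.
worst_case = expected_false_positives(12_000, 0.01)
```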

Dealing with Multiple Comparisons
Bonferroni inequality: to control the family-wise error rate for testing m hypotheses at level α, we need to control the FPR for each individual test at α/m.
Then P(at least one false rejection) ≤ α, or equivalently P(no false rejection) ≥ 1 − α.
This is appropriate for some applications (e.g. testing a new drug versus several existing ones), but is too conservative for our task of gene selection.
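A minimal sketch of Bonferroni selection, assuming one p-value per gene in a plain list; the function name and the example p-values are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Indices of hypotheses rejected at family-wise level alpha."""
    m = len(p_values)
    # Each individual test is held to the stricter level alpha / m.
    cutoff = alpha / m
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Four hypothetical tests: the per-test cutoff is 0.05 / 4 = 0.0125.
rejected = bonferroni_select([0.001, 0.02, 0.04, 0.0001])
```

With 12,000 genes and α = 0.05 the per-gene cutoff is roughly 4.2e-6, which illustrates why the correction is too conservative for gene selection.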