Published by Denis Lewis. Modified over 9 years ago.
1
Comp. Genomics, Recitation 10 (4/7/09): Differential expression detection
2
Outline: Clustering vs. differential expression; Fold change; T-test; Multiple testing; FDR/SAM; Mann-Whitney; Examples
3
Microarray preliminaries. General input: a matrix of probes (sequences) and intensities. We assume the hard work is over: probes are assigned to genes and the data is properly (?) normalized. We have an expression matrix: rows correspond to genes, columns correspond to conditions.
4
Microarray analysis. Common scenarios: (a) We tested the behavior of genes across several time points, or we tested a large number of different conditions: clustering is the solution. (b) We compared a small number of conditions (2) and have multiple replicates for each condition, e.g., we measured blood expression in 10 sick and 10 healthy individuals: differential expression analysis is the solution.
5
Identification of differential genes. The most basic experimental design: comparison between 2 conditions, ‘treatment’ vs. control. More complex: sick/treatment/control. The goal: identify genes that are differentially expressed between the examined conditions. The number of replicates is usually low (n=2-4), so statistics are important. Slides: Rani Elkon
6
Approaches for identification of differential genes: 1. Fold Change 2. T-test 3. SAM
7
1. Fold Change. Consider as differential those genes whose mean expression level changed by at least 1.75-2 fold. Pros: very simple! Cons: usually no estimate of the false positive rate is provided; biased toward genes with low expression levels; ignores the variability of gene levels over replicates.
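The fold-change rule can be sketched in a few lines. This is a minimal illustration, assuming the fold change is the ratio of mean treatment level to mean control level and a 2-fold cut-off; the gene names and intensities are hypothetical.

```python
from statistics import mean

def fold_change(control, treatment):
    """Ratio of the mean treatment level to the mean control level."""
    return mean(treatment) / mean(control)

def is_differential(control, treatment, threshold=2.0):
    """True if the mean level changed by at least `threshold`-fold, up or down."""
    fc = fold_change(control, treatment)
    return fc >= threshold or fc <= 1.0 / threshold

# Hypothetical intensities: (control replicates, treatment replicates).
genes = {
    "geneA": ([100.0, 110.0, 90.0], [210.0, 190.0, 230.0]),   # about 2.1-fold up
    "geneB": ([500.0, 520.0, 480.0], [510.0, 490.0, 530.0]),  # essentially unchanged
    "geneC": ([400.0, 380.0, 420.0], [150.0, 170.0, 130.0]),  # about 2.7-fold down
}

selected = [name for name, (c, t) in genes.items() if is_differential(c, t)]
print(selected)
```

Note that the rule looks only at the two means: nothing in it reflects how much the replicates disagree with each other.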
8
Fold Change limitation: biased toward low expression levels. Remedy: determine a ‘floor’ cut-off and set all expression levels below it to this floor level.
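A small sketch of the floor correction, assuming a hypothetical floor of 20 intensity units and made-up values. It shows a gene near the noise level losing its spurious 3-fold change while a well-expressed gene is unaffected.

```python
from statistics import mean

def apply_floor(levels, floor):
    """Raise every expression level below `floor` up to `floor`."""
    return [max(x, floor) for x in levels]

def fold_change(control, treatment):
    """Ratio of the mean treatment level to the mean control level."""
    return mean(treatment) / mean(control)

FLOOR = 20.0  # hypothetical cut-off; in practice chosen from the array's noise level

# A gene near the noise level: tiny absolute change, large ratio.
low_c, low_t = [4.0, 6.0, 5.0], [14.0, 16.0, 15.0]
raw_fc = fold_change(low_c, low_t)
floored_fc = fold_change(apply_floor(low_c, FLOOR), apply_floor(low_t, FLOOR))

# A well-expressed gene: the floor does not touch it.
high_c, high_t = [1000.0, 1100.0, 900.0], [2100.0, 1900.0, 2300.0]
high_fc = fold_change(apply_floor(high_c, FLOOR), apply_floor(high_t, FLOOR))

print(raw_fc, floored_fc, high_fc)
```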
9
Fold Change limitation: ignores variability over replicates. We need a score that ‘punishes’ genes with high variability over replicates.
10
Approaches for identification of differential genes: 1. Fold Change 2. T-test 3. SAM
11
2. T-test. Compute a t-score for each gene: $t = \frac{m_t - m_c}{\sqrt{S_c^2/n_c + S_t^2/n_t}}$, where m_c, m_t are the mean levels in Control and Treatment; S_c^2, S_t^2 are the variance estimates in Control and Treatment; n_c, n_t are the numbers of replicates in Control and Treatment.
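A minimal sketch of a per-gene t-score built from the quantities defined above. It assumes the unequal-variance (Welch) form, with the difference taken as treatment minus control; the two example genes are hypothetical and share the same 2-fold change.

```python
from math import sqrt
from statistics import mean, variance

def t_score(control, treatment):
    """Difference of means divided by its estimated standard error (Welch form)."""
    m_c, m_t = mean(control), mean(treatment)
    s2_c, s2_t = variance(control), variance(treatment)  # sample variances
    n_c, n_t = len(control), len(treatment)
    return (m_t - m_c) / sqrt(s2_c / n_c + s2_t / n_t)

# Both genes double their mean level (100 -> 200), but with different spread.
t_stable = t_score([100.0, 102.0, 98.0], [200.0, 204.0, 196.0])
t_noisy = t_score([100.0, 40.0, 160.0], [200.0, 80.0, 320.0])

print(round(t_stable, 2), round(t_noisy, 2))
```

The stable gene scores roughly 38.7 and the noisy one roughly 1.3, so the score penalizes variability over replicates in a way the fold change cannot.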
12
T-test. The t-score is useful because it is the result of a well-known statistical hypothesis test. Assume the samples are normally distributed (with unknown variance) and compare two hypotheses: H_0: all the measurements come from the same distribution; H_1: the control and treatment measurements come from different normal distributions. In this case a p-value can be derived for every t-score.
13
T-test. Set a cut-off for the p-value (e.g. α=0.01) and consider all genes with p-value < α as differential genes.
14
Multiple Testing. The p-value P_g associated with the t-score t_g is the probability of obtaining by chance a t-score at least as extreme as t_g. Multiplicity problem: thousands of genes are tested simultaneously (all the genes on the array!). Simple example: 10,000 genes on a chip, not a single one is differentially expressed (everything is random), α=0.01. Then 10,000 × 0.01 = 100 genes are expected to have a p-value < 0.01 just by chance.
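The arithmetic of this example can be checked with a short simulation. The sketch assumes that under the null hypothesis each gene's p-value is uniformly distributed on [0, 1] and that genes are independent; the seed is arbitrary.

```python
import random

N_GENES = 10_000
ALPHA = 0.01

# Expected number of genes passing the cut-off when nothing is differential.
expected_false_positives = N_GENES * ALPHA

# Simulate one experiment: null p-values are uniform on [0, 1].
rng = random.Random(0)
p_values = [rng.random() for _ in range(N_GENES)]
observed_false_positives = sum(p < ALPHA for p in p_values)

print(expected_false_positives, observed_false_positives)
```

The simulated count fluctuates around 100 from run to run, and every one of those genes is a false positive.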
15
Multiple testing. Individual p-values of e.g. 0.01 no longer correspond to significant findings. We need to adjust for multiple testing when assessing the statistical significance of findings. Actually, this is a fairly common problem in statistics.
16
Multiple Testing. Simple solution (Bonferroni): consider as differential only those genes with p-value < α/N, where N is the number of tests. For α=0.01 and N=10,000 the cut-off is 0.000001. This ensures that the probability of having any false positive gene is very low (at most α). Advantage: a very clean list of differential genes. Limitation: the list usually contains very few genes, i.e. an unacceptably high rate of false negatives.
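The Bonferroni rule is a one-line cut-off. The sketch below reproduces the slide's numbers and applies the rule to a short, hypothetical list of p-values.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cut-off that keeps the family-wise error rate at most alpha."""
    return alpha / n_tests

def bonferroni_select(p_values, alpha=0.01):
    """Indices of the tests whose p-value falls below alpha / N."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < cutoff]

# The example from the slide: alpha = 0.01, N = 10,000.
array_cutoff = bonferroni_cutoff(0.01, 10_000)

# Hypothetical p-values for four genes: the cut-off is 0.01 / 4 = 0.0025.
selected = bonferroni_select([1e-8, 0.0005, 0.02, 0.004], alpha=0.01)

print(array_cutoff, selected)
```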
17
FDR correction (Benjamini & Hochberg): False Discovery Rate. In high-throughput studies a certain proportion of false positives is tolerable. Control the expected proportion of false positives among the genes declared as differential (e.g. q=10%). Scheme: rank the genes according to their p-values, p_(1) ≤ p_(2) ≤ … ≤ p_(N), and consider as differential the top k genes, where k = max{i : p_(i) ≤ i·(q/N)}.
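A minimal sketch of the Benjamini-Hochberg step-up scheme, assuming the usual non-strict comparison p_(i) ≤ i·q/N. The ten p-values are hypothetical and chosen so that ranks 3 and 4 fail their own thresholds but are still selected because rank 5 passes.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return the indices of the genes declared differential at FDR level q."""
    n = len(p_values)
    # Gene indices ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its own threshold.
        if p_values[idx] <= rank * q / n:
            k = rank
    return sorted(order[:k])

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.061, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p_values, q=0.10)
print(selected)
```

With the same p-values, a Bonferroni cut-off of 0.10/10 = 0.01 would keep only the first two genes.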
18
Approaches for identification of differential genes: 1. Fold Change 2. T-test 3. SAM
19
3. SAM (Tusher, Tibshirani & Chu): ‘Significance Analysis of Microarrays’. Limitation of the analytical FDR approach: it assumes that the tests are independent. In the microarray context, the expression levels of some genes are highly correlated → unreliable FDR estimate. SAM uses permutations to get an ‘empirical’ estimate of the FDR of the reported differential genes.
20
SAM Scheme: (1) Compute for each gene a statistic that measures its relative expression difference in control vs. ‘treatment’ (a t-score or a variant). (2) Rank the genes according to their ‘difference score’. (3) Set a cut-off (d_0) and consider all genes above it as differential (N_d genes). (4) Permute the condition labels and count how many genes got a score above d_0 (N_p). (5) Repeat on many (or all possible) permutations, counting N_pj for permutation j. (6) Estimate the FDR as the proportion Average(N_pj)/N_d.
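The scheme above can be sketched as follows. This is a simplified illustration, not the published SAM procedure: it assumes the absolute Welch t-score as the difference score, enumerates every relabeling that keeps the group sizes (excluding the observed labeling), and uses a hypothetical 4-gene matrix with 3 control and 3 treatment samples and a hypothetical cut-off d_0 = 5.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def abs_t_score(control, treatment):
    """Absolute Welch t-score, used here as the per-gene difference score."""
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    return abs(mean(treatment) - mean(control)) / se

def count_above(matrix, control_idx, d0):
    """Number of genes whose score exceeds d0 when `control_idx` columns are controls."""
    n_samples = len(matrix[0])
    treat_idx = [j for j in range(n_samples) if j not in control_idx]
    count = 0
    for row in matrix:
        control = [row[j] for j in control_idx]
        treatment = [row[j] for j in treat_idx]
        if abs_t_score(control, treatment) > d0:
            count += 1
    return count

def permutation_fdr(matrix, n_control, d0):
    """Return (N_d, estimated FDR) with FDR = Average(N_pj) / N_d."""
    n_samples = len(matrix[0])
    observed = tuple(range(n_control))  # assumes the first columns are the controls
    n_d = count_above(matrix, observed, d0)
    perm_counts = [
        count_above(matrix, idx, d0)
        for idx in combinations(range(n_samples), n_control)
        if idx != observed
    ]
    fdr = mean(perm_counts) / n_d if n_d else 0.0
    return n_d, fdr

# Hypothetical expression matrix: rows are genes, columns are 3 controls then 3 treatments.
matrix = [
    [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
    [5.0, 9.0, 7.0, 6.0, 8.0, 10.0],
    [30.0, 32.0, 31.0, 15.0, 14.0, 16.0],
    [8.0, 3.0, 6.0, 4.0, 7.0, 5.0],
]

n_d, fdr = permutation_fdr(matrix, n_control=3, d0=5.0)
print(n_d, round(fdr, 3))
```

Because whole columns are relabeled together, correlations between genes are preserved in every permutation, which is the point of estimating the FDR empirically.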
21
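Permutation on condition labels (figure): an expression matrix with one row per gene and one column per sample, plus the observed difference score (D score) of each gene: G1: e11 … e18, d1; G2: e21 … e28, d2; G3: e31 … e38, d3. Each permutation of the condition labels yields a new set of scores: permutation 1 gives d1p1, d2p1, d3p1; permutation 2 gives d1p2, d2p2, d3p2.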
22
SAM example: ionizing radiation response experiment. After setting the threshold, 46 genes are found significant. Over 36 permutations, 8.4 genes on average pass the threshold. The false discovery rate is 8.4/46 ≈ 18%.
23
Mann-Whitney/Wilcoxon. In general, the normality assumption of the t-test is problematic. Nonparametric statistics are very useful in many bioinformatics-related problems: they assume nothing about the distribution of the samples. They are less powerful (more false negatives, but fewer false positives).
24
Mann-Whitney/Wilcoxon. MW/Wilcoxon test for two samples: H_0: the medians of both distributions are the same; H_1: the medians of the distributions are different. Assumes: the two samples are independent; the observations can be ordered (ordinal).
25
Mann-Whitney/Wilcoxon. Computes a U-score whose distribution is known under H_0 (and can be approximated by a normal distribution in large samples). Arrange all the observations into a single ranked series. Add up the ranks in sample 1. The sum of ranks in sample 2 follows by calculation, since the sum of all the ranks equals N(N+1)/2, where N is the total number of observations. U-score: $U_1 = R_1 - \frac{n_1(n_1+1)}{2}$, where R_1 is the sum of ranks in sample 1 and n_1 is its size.
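The ranking steps above can be sketched as follows. The code assumes the common definition of the U-score as the rank sum of a sample minus its minimum possible value, n(n+1)/2, and gives tied observations the average of their ranks; the two samples are hypothetical.

```python
def rank_with_ties(values):
    """Assign 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def mann_whitney_u(sample1, sample2):
    """Return (U_1, U_2) computed from the rank sums of the pooled observations."""
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    ranks = rank_with_ties(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])
    r2 = n * (n + 1) / 2 - r1  # follows from the total rank sum N(N+1)/2
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    return u1, u2

# Hypothetical expression levels of one gene in two conditions.
u1, u2 = mann_whitney_u([1.1, 2.3, 0.7, 3.0], [4.2, 3.9, 5.1, 2.8])
print(u1, u2)
```

Here the ranks of sample 1 are 2, 3, 1 and 5, so R_1 = 11 and U_1 = 1, while U_2 = 15; the two always sum to n_1·n_2 = 16. A U-score near 0 or near n_1·n_2 indicates that one sample is ranked almost entirely below the other.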