Differential Gene Expression


Differential Gene Expression
Xiaole Shirley Liu, STAT115 / STAT215

Identification of Diagnostic Genes
Classical study of cancer subtypes: Golub et al. (1999)

Differential Expression
Naïve method: fold change, Avg(X) / Avg(Y)
Note on scale:
Natural scale: MAS4, MAS5, dChip
Log scale: RMA; values are log2, so convert back with 2^x before taking the ratio
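The scale note can be made concrete with a small sketch. The function names and the toy intensities below are illustrative assumptions, not part of the lecture. Note that differencing mean log2 values gives a ratio of geometric means, which matches Avg(X) / Avg(Y) only approximately on real data.

```python
import math
from statistics import mean

def fold_change_natural(x, y):
    """Fold change Avg(X) / Avg(Y) for natural-scale values (e.g. MAS5, dChip)."""
    return mean(x) / mean(y)

def fold_change_from_log2(x_log2, y_log2):
    """Fold change for log2-scale values (e.g. RMA): difference of means, then 2**."""
    return 2 ** (mean(x_log2) - mean(y_log2))

# Toy data (hypothetical): one gene, three replicates per condition.
tumor = [800.0, 800.0, 800.0]
normal = [200.0, 200.0, 200.0]

fc_natural = fold_change_natural(tumor, normal)
fc_log = fold_change_from_log2(
    [math.log2(v) for v in tumor],
    [math.log2(v) for v in normal],
)
print(fc_natural, fc_log)
```

Both routes agree here only because the replicates are identical within each group.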

Fold Change Problems
Does not give a confidence level for differential expression
Is there a better statistical test?

Test Normality
Normal distribution: check with a QQ plot
Normal: t-test
Non-normal: non-parametric test

Wilcoxon Rank Sum Test
Rank all data in the gene's row and sum the ranks within a group: T_T or T_C
Significance can also be calculated by permutation
E.g. 10 normal and 10 cancer samples: Min(T) = 55, Max(T) = 155; how significant is T = 150?
Check the U table (U is a transformation of T) for statistical significance
Non-parametric; less power with fewer samples
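As a rough illustration of the rank-sum idea, the sketch below computes T for one group and an exact one-sided p-value by enumerating all group assignments. The helper names and the toy expression values are assumptions for this example. Full enumeration is only practical for small samples.

```python
from itertools import combinations

def average_ranks(values):
    """Ranks 1..N of the pooled values, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            ranks[order[k]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return ranks

def rank_sum_test(group_a, group_b):
    """Return (T, p) where T is the rank sum of group_a and p = P(T' >= T)."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n = len(group_a)
    observed = sum(ranks[:n])
    sums = [sum(c) for c in combinations(ranks, n)]
    p_upper = sum(s >= observed for s in sums) / len(sums)
    return observed, p_upper

def rank_sum_bounds(n, total):
    """Smallest and largest possible rank sum for a group of n out of total."""
    return n * (n + 1) // 2, n * (2 * total - n + 1) // 2

# Hypothetical expression values for one gene.
cancer = [5.1, 6.2, 7.3]
normal = [1.0, 2.0, 3.0]
T, p = rank_sum_test(cancer, normal)
print(T, p, rank_sum_bounds(10, 20))
```

With 10 samples per group the bounds reproduce the Min(T) = 55 and Max(T) = 155 quoted on the slide.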

Linear Model for Differential Expression
$Y_{ijk} = m_j + a_{ij} + \mathrm{error}_{ijk}$
A separate model is fitted for each gene j.
$m_j$ is the mean expression of gene j over the entire experiment (RMA expression index).
$a_{ij}$ is the deviation of the mean of the ith condition from the overall mean, with $\sum_i a_{ij} = 0$.
k indexes a specific sample.
For 3 replicates (mutant) versus 3 replicates (wildtype control), we care whether $a_{\mathrm{mu},j} - a_{\mathrm{wt},j} = 0$ (null hypothesis H0).

Ordinary t-tests
c is based on the sample sizes in the two conditions
How to determine s_g?

Variance Estimates
Same variance across treatments: standard t-test
Different variances across treatments: Welch t-test
Large |t| gives a small p-value: reject H0
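The two variance choices can be sketched as follows. This is a minimal illustration, assuming the usual textbook forms: a pooled standard deviation with c = sqrt(1/n1 + 1/n2) for the standard t-test, and the Welch-Satterthwaite degrees of freedom for the Welch test. The function names and data are hypothetical, and converting t to a p-value (which needs the t distribution) is left out.

```python
import math
from statistics import mean, variance

def student_t(x, y):
    """Standard two-sample t statistic, assuming equal variances."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    c = math.sqrt(1 / n1 + 1 / n2)  # depends only on the sample sizes
    return (mean(x) - mean(y)) / (c * math.sqrt(pooled_var))

def welch_t(x, y):
    """Welch t statistic and approximate degrees of freedom (unequal variances)."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical log2 expression values for one gene, 3 replicates each.
mutant = [8.0, 9.0, 10.0]
wildtype = [5.0, 6.0, 7.0]
t_student = student_t(mutant, wildtype)
t_welch, df_welch = welch_t(mutant, wildtype)
print(t_student, t_welch, df_welch)
```

When group sizes and variances are equal, as here, the two statistics coincide; they diverge as the variances differ.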

Variance Stabilization
Problem: the variance estimate is unreliable when the sample size is small (e.g. 2-3 replicates in each condition)
Significance Analysis of Microarrays (SAM): modified t-statistic that increases s_g based on the s_g of other genes on the array (e.g. the lowest 5th percentile of s_g)
LIMMA (Smyth 2004): empirical Bayes, borrows information from all genes

LIMMA: Design Matrix
Specifies the RNA samples used on the arrays
> Mat
         Treat1 Treat2 Control
Sample1       1      0       0
Sample2       1      0       0
Sample3       1      0       0
Sample4       0      1       0
Sample5       0      1       0
Sample6       0      1       0
Sample7       0      0       1
Sample8       0      0       1
Sample9       0      0       1

LIMMA: Contrast Matrix
Specifies which comparisons are of interest
> contrast
         Treat1-Control Treat2-Control
Treat1                1              0
Treat2                0              1
Control              -1             -1
The flexibility of the linear model allows many different conditions to be considered
$Y_{ijk} = m_j + a_{ij} + \mathrm{error}_{ijk}$
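To show how the design and contrast matrices work together, here is a plain-Python sketch for a single gene. It is not the limma API (limma is an R package). With a one-hot design like the one above, the least-squares coefficients are simply the group means, and each contrast is a weighted sum of them. The expression values are made up for illustration.

```python
# Design matrix from the slide: rows are samples, columns are Treat1, Treat2, Control.
design = [[1, 0, 0]] * 3 + [[0, 1, 0]] * 3 + [[0, 0, 1]] * 3

# Contrast matrix from the slide, stored as one weight vector per comparison.
contrasts = {
    "Treat1-Control": [1, 0, -1],
    "Treat2-Control": [0, 1, -1],
}

def fit_group_means(design, y):
    """Least-squares coefficients for a one-hot design: the mean of each group."""
    coefs = []
    for col in range(len(design[0])):
        values = [yi for row, yi in zip(design, y) if row[col] == 1]
        coefs.append(sum(values) / len(values))
    return coefs

def apply_contrasts(coefs, contrasts):
    """Estimated effect for each contrast: weighted sum of the coefficients."""
    return {
        name: sum(w * b for w, b in zip(weights, coefs))
        for name, weights in contrasts.items()
    }

# Hypothetical log2 expression of one gene across Sample1..Sample9.
y = [9.0, 9.5, 8.5, 7.0, 7.2, 6.8, 6.0, 6.1, 5.9]
coefs = fit_group_means(design, y)
effects = apply_contrasts(coefs, contrasts)
print(coefs, effects)
```

Keeping the contrasts separate from the design means new comparisons can be added without refitting the model.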

LIMMA: Empirical Bayes
Smooth the gene-wise variances towards a common (typical) value in a graduated way by borrowing information from all the genes, while allowing flexibility for individual genes
Assume each gene's variance follows an inverse gamma distribution: large variances are shrunk down, small variances are shrunk up
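The shrinkage can be sketched with the moderated variance from Smyth (2004), a degrees-of-freedom-weighted average of the gene's own variance and a prior variance. In limma the prior variance and prior degrees of freedom are estimated from all genes. Here they are simply assumed values, and the function name is hypothetical.

```python
def moderated_variance(s2_gene, df_gene, s2_prior, df_prior):
    """Shrink a gene-wise variance towards the prior (typical) variance.

    The result is a weighted average; a larger df_prior means stronger shrinkage.
    """
    return (df_prior * s2_prior + df_gene * s2_gene) / (df_prior + df_gene)

# Assumed prior: typical variance 1.0, worth 4 degrees of freedom.
S2_PRIOR, DF_PRIOR = 1.0, 4.0

# Two hypothetical genes, each with 4 residual degrees of freedom.
small = moderated_variance(0.1, 4.0, S2_PRIOR, DF_PRIOR)  # shrunk up
large = moderated_variance(5.0, 4.0, S2_PRIOR, DF_PRIOR)  # shrunk down
print(small, large)
```

The small variance moves up and the large one moves down, which protects against genes that look significant only because their variance was underestimated from a few replicates.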

Multiple Hypotheses Testing
How many differential genes to report?

Multiple Hypotheses Testing
We test every gene for differential expression at a p-value cutoff, e.g. 0.01
For ~20K genes on the array, potentially 0.01 x 20K = 200 genes are wrongly called
H0: no differential expression; H1: differential expression
Reject H0: call the gene differentially expressed
Should control the family-wise error rate or the false discovery rate

Family-Wise Error Rate
P(at least one false rejection) < α
P(no false rejection) > 1 - α
Bonferroni correction: to control the family-wise error rate at level α when testing m hypotheses, control the false rejection rate of each individual test at α/m
If α is 0.05, for 20K genes the p-value cutoff is 0.05 / 20K = 2.5E-6
Too conservative for selecting differentially expressed genes
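The arithmetic behind the two cutoffs is short enough to check directly. The sketch below uses hypothetical helper names and a toy list of p-values.

```python
def expected_false_calls(p_cutoff, m):
    """Expected number of wrongly called genes if all m genes are truly null."""
    return p_cutoff * m

def bonferroni_cutoff(alpha, m):
    """Per-test p-value cutoff that controls the family-wise error rate at alpha."""
    return alpha / m

def bonferroni_reject(p_values, alpha=0.05):
    """True for each hypothesis rejected under the Bonferroni correction."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]

false_calls = expected_false_calls(0.01, 20_000)
cutoff = bonferroni_cutoff(0.05, 20_000)

# Toy example: 4 tests, so the per-test cutoff is 0.05 / 4 = 0.0125.
rejected = bonferroni_reject([0.001, 0.02, 0.04, 0.3])
print(false_calls, cutoff, rejected)
```

In the toy example only the first p-value survives, even though three of the four are below 0.05, which is the conservativeness the slide refers to.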

False Discovery Rate
                              # not rejected   # rejected   Total
                              (not called)     (called)
# H0 (two groups similar)     U                V            m0
# H1 (two groups different)   T                S            m1
Total                         m - R            R            m
V: type I errors, false positives
T: type II errors, false negatives
FDR = V / R (false positives / all called)
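The table entries can be counted directly when the truth is known, which in practice only happens in simulations. The sketch below assumes made-up truth labels and calls for 10 genes, and the function name is hypothetical.

```python
def fdr_table(is_null, called):
    """Count U, V, T, S from truth labels (H0 true?) and calls (H0 rejected?)."""
    pairs = list(zip(is_null, called))
    u = sum(1 for null, call in pairs if null and not call)      # true negatives
    v = sum(1 for null, call in pairs if null and call)          # false positives
    t = sum(1 for null, call in pairs if not null and not call)  # false negatives
    s = sum(1 for null, call in pairs if not null and call)      # true positives
    r = v + s
    return {"U": u, "V": v, "T": t, "S": s, "R": r, "FDR": v / r if r else 0.0}

# Simulated truth: 6 genes with H0 true, 4 genes with H1 true.
is_null = [True] * 6 + [False] * 4
called = [False, False, False, False, True, True, True, True, True, False]
table = fdr_table(is_null, called)
print(table)
```

Here 2 of the 5 called genes are false positives, so the observed proportion V / R is 0.4.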

False Discovery Rate
Less conservative than the family-wise error rate
Benjamini and Hochberg (1995) method for FDR control, e.g. FDR ≤ α
Assume the p-values from the different tests are independent
Plot all m genes by rank (x) against p-value (y), ranked by p-value
Draw the line y = x α / m, x = 1…m
Call all genes up to the last one whose p-value falls below the line
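The procedure on this slide can be sketched in a few lines. The function name and the p-values are illustrative. The point of the toy data is that the method is step-up: a gene sitting slightly above the line is still called if a later-ranked gene falls below it.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the gene is called at FDR <= alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    last = 0  # largest rank x with p_(x) <= x * alpha / m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            last = rank
    called = [False] * m
    for i in order[:last]:
        called[i] = True
    return called

# Hypothetical p-values for 5 genes; thresholds are 0.01, 0.02, 0.03, 0.04, 0.05.
p_values = [0.6, 0.001, 0.025, 0.021, 0.045]
called = benjamini_hochberg(p_values, alpha=0.05)
print(called)
```

The gene with p = 0.021 is ranked second and misses its own threshold of 0.02, but it is still called because the third-ranked gene (p = 0.025) passes its threshold of 0.03.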

FDR Threshold
[Figure: p-values (y) of genes ranked by p-value (x, index / m), with the line y = x α / m]

Q-value
Teaser: what is the p-value distribution if there are no differential genes?
Storey & Tibshirani, PNAS, 2003
Empirically derived q-value
Every p-value has its corresponding q-value (FDR)
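A simplified version of the p-value to q-value mapping is sketched below. It assumes the proportion of true nulls is 1, in which case the q-values reduce to Benjamini-Hochberg adjusted p-values. Storey and Tibshirani instead estimate that proportion from the flat part of the p-value histogram, since p-values are uniform on [0, 1] when no gene is differential. The function name and p-values are hypothetical.

```python
def bh_q_values(p_values):
    """q-values assuming pi0 = 1 (equivalent to BH-adjusted p-values)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the smallest
    # FDR at which that gene would still be called.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

p_values = [0.01, 0.04, 0.03, 0.5]
q_values = bh_q_values(p_values)
print(q_values)
```

The running minimum is what keeps q-values monotone in the p-values: the genes with p = 0.03 and p = 0.04 end up with the same q-value.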

Practical Use of FDR
A very useful concept in most genomics or high-throughput studies
P-value and FDR are monotonically related
Common FDR cutoffs: 1%, 5%, 10%; genes are often also filtered by fold change
Gives a rough estimate of signal / noise and experimental quality
For expression, most people are comfortable with ~500-2000 differentially expressed genes

Summary
Differential expression:
Fold change
t-test on normally distributed data
LIMMA uses a hierarchical model to stabilize gene-wise variance
FDR: adjust for multiple hypothesis testing
FWER: conservative
Benjamini-Hochberg
q-value