The Broad Institute of MIT and Harvard Differential Analysis.



Differential Analysis
Given phenotypically distinct classes (e.g., Tumor vs. Normal), find “markers” that distinguish these classes from one another.
[Figure: marker selection between Tumor and Normal samples]

Gene Marker Selection: Hierarchy of Difficulty (adapted from P. Tamayo)
Problems in order of increasing degree of difficulty:
I. Tissue or Cell Type. Example: Normal vs. Renal carcinoma. Gene markers: ~. Error: ~0%.
II. Morphological Type. Example: Leukemia ALL vs. AML. Gene markers: ~. Error: ~0-5%.
III. Morphological Subtype. Example: ALL B- vs. T-Cell. Gene markers: ~. Error: ~0-15%.
IV. Treatment Outcome. Example: AML Treatment Outcome. Gene markers: ~1-20. Error: ~5-50%.
Related problem labels on the slide: Normal vs. Abnormal, Multiclass Classification, Drug Sensitivity.

Gene Marker Selection
Compute a score for each gene:
Dataset + Phenotype/class labels -> Compute score (t-test, SNR, etc.) -> Score -> Ranked gene list
Scores shown: T-test; Signal-to-Noise Ratio (SNR).
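The scoring step can be sketched in a few lines. The formulas are not spelled out in the transcript, so this sketch assumes the commonly used definitions: t = (mean_A - mean_B) / sqrt(sd_A^2/n_A + sd_B^2/n_B) and SNR = (mean_A - mean_B) / (sd_A + sd_B). The gene names, the toy values, and the helper `rank_genes` are made up for illustration.

```python
import math
from statistics import mean, stdev

def t_score(a, b):
    """Two-sample t-score with an unequal-variance standard error (assumed definition)."""
    return (mean(a) - mean(b)) / math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))

def snr(a, b):
    """Signal-to-noise ratio: difference of class means over the sum of class SDs (assumed definition)."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def rank_genes(expr, labels, score=snr):
    """Score every gene and return (gene, score) pairs, highest score first.

    expr:   dict mapping gene name -> list of expression values, one per sample
    labels: list of 0/1 class labels, one per sample
    """
    scores = {}
    for gene, values in expr.items():
        class_a = [v for v, lab in zip(values, labels) if lab == 0]
        class_b = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = score(class_a, class_b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy dataset: 3 genes, 6 samples (3 per class).
expr = {
    "geneA": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # higher in class 0
    "geneB": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],  # higher in class 1
    "geneC": [4.0, 5.0, 6.0, 4.0, 5.0, 6.0],  # no difference
}
labels = [0, 0, 0, 1, 1, 1]

ranked = rank_genes(expr, labels)
for gene, value in ranked:
    print(f"{gene}: SNR = {value:+.2f}")
```

Genes at the top of the ranked list are candidate markers for class 0, and genes at the bottom are candidate markers for class 1.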

Gene Marker Selection: Challenges
- Small sample size.
- Each gene tested is a separate hypothesis -> increased likelihood of false positives.
- Gene interaction is not taken into account.

Gene Marker Selection: Small Sample Size
With a small sample size it is easy to find genes correlated with phenotype.
- Generate a 10,000x100 matrix from a Gaussian (mean=0, SD=0.5)
- Pick n columns (6, 14, 30, 100)
- Assign sample labels yellow and green
- Select top 25 markers for yellow, top 25 markers for green
[Figure: top markers for Yellow vs. Green with 6, 14, 30, and 100 samples]
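The experiment above is easy to reproduce. The following is a minimal sketch using only the Python standard library; it assumes SNR as the marker score and an even yellow/green split, neither of which is stated on the slide, and the function name `simulate` is hypothetical.

```python
import math
import random

def mean(xs):
    return sum(xs) / len(xs)

def sample_sd(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def snr(a, b):
    # Signal-to-noise ratio (assumed score): mean difference over summed SDs.
    return (mean(a) - mean(b)) / (sample_sd(a) + sample_sd(b))

def simulate(n_genes=10_000, n_total=100, sizes=(6, 14, 30, 100), n_top=25, seed=0):
    """Return {n: mean |SNR| of the top markers} for pure-noise data."""
    rng = random.Random(seed)
    # Random matrix: no gene has any real association with the labels.
    matrix = [[rng.gauss(0.0, 0.5) for _ in range(n_total)] for _ in range(n_genes)]
    summary = {}
    for n in sizes:
        cols = rng.sample(range(n_total), n)                   # pick n columns
        yellow_cols, green_cols = cols[: n // 2], cols[n // 2:]  # assign labels
        scores = sorted(
            snr([row[c] for c in yellow_cols], [row[c] for c in green_cols])
            for row in matrix
        )
        top_green, top_yellow = scores[:n_top], scores[-n_top:]  # 25 markers per class
        summary[n] = mean([abs(s) for s in top_green + top_yellow])
    return summary

summary = simulate()
for n, strength in summary.items():
    print(f"{n:3d} samples: mean |SNR| of the top 50 markers = {strength:.2f}")
```

Because the data are pure noise, every selected "marker" is a false positive; the apparent strength of the top markers shrinks as the number of samples grows.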

P-value calculation
- If a gene's expression is normally distributed, the t-score follows the t-distribution.
  - What if the values aren't normally distributed?
- Permutation test:
  - shuffle labels (class membership)
  - compute score for each gene (t-score, SNR, ...)
  - repeat many times -> empirical null distribution of scores for each gene
- Compare the observed score to the empirical distribution.
- No distributional assumptions are made; gene-specific p-values are computed.
[Figure: distribution of permuted scores for a given gene, with the observed score of the gene marked]
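A minimal sketch of the permutation test for a single gene follows. It assumes a two-sided test and the common "+1" correction so that the p-value is never exactly zero; both are illustrative choices, not details from the slides. The difference of class means stands in for the score here, but any score function (t-score, SNR) can be passed instead. The toy values are made up.

```python
import random

def mean_difference(a, b):
    return sum(a) / len(a) - sum(b) / len(b)

def permutation_p_value(values, labels, score=mean_difference, n_perm=1000, seed=0):
    """Empirical two-sided p-value for one gene.

    values: expression values of the gene, one per sample
    labels: 0/1 class label per sample
    """
    rng = random.Random(seed)

    def score_for(lbls):
        a = [v for v, lab in zip(values, lbls) if lab == 0]
        b = [v for v, lab in zip(values, lbls) if lab == 1]
        return score(a, b)

    observed = abs(score_for(labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)              # shuffle class membership
        if abs(score_for(shuffled)) >= observed:
            hits += 1                      # permuted score at least as extreme
    # +1 in numerator and denominator counts the observed labeling itself.
    return (hits + 1) / (n_perm + 1)

labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
marker_gene = [5.1, 6.2, 5.8, 6.5, 5.9, 1.2, 2.1, 1.8, 2.5, 1.1]  # clear class difference
null_gene = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 5.0]    # identical in both classes

p_marker = permutation_p_value(marker_gene, labels)
p_null = permutation_p_value(null_gene, labels)
print(f"marker gene p = {p_marker:.4f}, null gene p = {p_null:.4f}")
```

In a full analysis the same shuffled labels would be applied to every gene, giving one empirical null distribution and one p-value per gene.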

Permutation test and P-value
To determine how significant a gene's statistical score is:
- Compute the "real" score for this gene from the "true" classes (known class A samples vs. known class B samples).
- Permute the class labels (Permutation 1, Permutation 2, ..., Permutation n) and recompute the score each time from the "called" class A and "called" class B samples.
- This generates a "null distribution" of values for this gene.
- Compare the null distribution with the "real" score for this gene.

Marker Selection Process
1. Input: dataset and phenotype/class labels.
2. Compute score: t-test, SNR, etc. Output: score for each gene, giving a ranked gene list.
3. Measure significance: permutation test. Output: measure of significance.
4. Correct for multiple hypotheses: FDR, FWER, etc.
5. Output: markers.

Multiple Hypotheses: What to Control
- Bonferroni correction:
  - Most conservative correction
  - Divides the significance threshold by the number of hypotheses (equivalently, multiplies each p-value by the number of hypotheses)
- FWER (Family-Wise Error Rate): probability of calling one or more true null hypotheses significant, i.e., of making at least one false positive call
- FDR (False Discovery Rate): expected proportion of true null hypotheses among the results called significant
- Try to reduce the number of hypotheses tested in the first place (e.g., by filtering)
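To make the difference between the corrections concrete, here is a small sketch of adjusted p-values. The slides name FDR without specifying a procedure; the Benjamini-Hochberg step-up procedure is used here as an assumed, widely used choice. The p-values are made up.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of hypotheses, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)

for p, b, q in zip(p_values, bonf, bh):
    print(f"p = {p:.3f}  Bonferroni = {b:.3f}  BH = {q:.5f}")
```

With a 0.05 cutoff both corrections keep the first two hypotheses in this toy example, but the Benjamini-Hochberg values are never larger than the Bonferroni ones, which is why FDR control typically retains more genes on real data.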

Exercise: ComparativeMarkerSelection Module
1. Choose module: Gene List Selection -> ComparativeMarkerSelection
2. Choose input file:
   - Next to “input file”, choose “Specify URL”
   - View the datasets window in the Web browser
   - Click and drag all_aml_train.preprocessed.gct
3. Choose class file:
   - Next to “cls file”, choose “Specify URL”
   - View the datasets window in the Web browser
   - Click and drag all_aml_train.cls
4. Click Run

Viewing Analysis Results

Differential Analysis Cookbook
- Reduce the number of hypotheses/genes by variation filtering (an attempt at reducing false negatives).
- Choose a test statistic (e.g., SNR, t-score, ...).
- If there are enough samples, compute p-values by permutation test (otherwise, compute an asymptotic test using the standard t-distribution).
- Control for multiple hypothesis testing by using the FDR correction.
  - Remember: if you choose FDR ≤ 0.05, you are willing to accept that about 5% of the genes called significant are false positives.
  - If the number of significant hypotheses/genes is “too large” even for very small threshold values, either:
    - use the maxT correction (possible with empirical p-values only), or
    - use additional criteria (e.g., min fold-change, min expression value, etc.).
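As an illustration of the first cookbook step, the sketch below keeps only genes whose expression varies enough across samples. The two criteria (max/min fold change and max-min difference), the default thresholds, and the function name `variation_filter` are assumptions made for this example, not settings taken from the slides; it also assumes expression values are positive.

```python
def variation_filter(expr, min_fold=3.0, min_delta=100.0):
    """Keep genes with max/min >= min_fold and max - min >= min_delta.

    expr: dict mapping gene name -> list of positive expression values.
    Thresholds are illustrative assumptions only.
    """
    kept = {}
    for gene, values in expr.items():
        lo, hi = min(values), max(values)
        if lo <= 0:
            continue  # fold change undefined; such values need thresholding first
        if hi / lo >= min_fold and hi - lo >= min_delta:
            kept[gene] = values
    return kept

# Hypothetical toy data.
expr = {
    "flat":     [100.0, 110.0, 105.0, 120.0],  # fold 1.2: removed
    "variable": [50.0, 400.0, 60.0, 300.0],    # fold 8, delta 350: kept
    "low":      [1.0, 5.0, 2.0, 4.0],          # fold 5 but delta 4: removed
}

filtered = variation_filter(expr)
print(f"kept {len(filtered)} of {len(expr)} genes: {sorted(filtered)}")
```

Fewer genes going into the test means fewer hypotheses to correct for, so true markers are less likely to be lost after the multiple-testing correction.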

Differential Analysis: GenePattern modules
- Create expression data set: ExpressionFileCreator
- Reduce the number of hypotheses/genes by variation filtering: PreprocessDataset
- Make class file
- Run Differential Analysis: ComparativeMarkerSelection
  - Choose test statistic (say, t-score)
  - If enough samples, compute p-values by permutation test (otherwise, use asymptotic test)
  - Control for MHT (multiple hypothesis testing) by using the FDR correction
- View results with ComparativeMarkerSelectionViewer
  - Use HeatMapViewer to view results for top genes
- Use GSEA to find gene sets (or pathways) that are enriched in your dataset.
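For the "Make class file" step, a class file can also be written by hand or by script. The sketch below assumes the categorical CLS layout: a first line with the number of samples, the number of classes, and the constant 1; a second line starting with "#" followed by the class names; and a third line with one numeric label per sample. The helper name `format_cls` and the sample labels are hypothetical.

```python
def format_cls(sample_classes):
    """Return CLS file text for a list of class names, one per sample (in column order)."""
    names = []
    for cls_name in sample_classes:
        if cls_name not in names:
            names.append(cls_name)  # classes are numbered in order of first appearance
    header = f"{len(sample_classes)} {len(names)} 1"
    name_line = "# " + " ".join(names)
    label_line = " ".join(str(names.index(cls_name)) for cls_name in sample_classes)
    return "\n".join([header, name_line, label_line]) + "\n"

# Hypothetical example: five samples from two classes.
cls_text = format_cls(["ALL", "ALL", "ALL", "AML", "AML"])
print(cls_text, end="")
```

The order of labels must match the order of the sample columns in the expression dataset.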

Working with Samples and Features

Overview
- Extracting a set of samples
- Computing co-expressed genes
- Converting probe set ids to gene names
- Computing overlap between gene sets

Working with Samples and Features
1. From a combined dataset of cancer and normal samples, select the normal samples.
2. Within the normal samples, find the genes coexpressed with LRPPRC (Affymetrix probe M92439_at), a gene with mitochondrial function.
3. Compare these genes and those coexpressed with LRPPRC in another expression dataset to determine the coexpressed genes common to both datasets.
Modules: SelectFeaturesColumns, GeneNeighbors, GeneListSignificanceViewer, CollapseDataset, ExtractRowNames, VennDiagram
Files: GCM_Total.res, GCM_Normals.res, GCM_Normals.markerlist.odf, GCM_Normals.markerdata.gct, GCM_Total_Normals.markerdata.collapsed.gct, GCM_Total_Normals.markerdata.collapsed.row.names.txt
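Steps 2 and 3 can be sketched outside GenePattern as follows. This is a minimal sketch that assumes Pearson correlation as the similarity measure (the metric used in the workflow is not stated in the transcript). Except for LRPPRC, the gene names, the expression values, and the helper `neighbors` are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def neighbors(expr, query, k):
    """Names of the k genes most correlated with `query`, best first."""
    target = expr[query]
    others = [g for g in expr if g != query]
    others.sort(key=lambda g: pearson(expr[g], target), reverse=True)
    return others[:k]

# Two hypothetical datasets measuring the same genes in different samples.
dataset1 = {
    "LRPPRC": [1.0, 2.0, 3.0, 4.0],
    "G1": [2.0, 4.0, 6.0, 8.0],
    "G2": [4.0, 3.0, 2.0, 1.0],
    "G3": [1.0, 3.0, 2.0, 4.0],
    "G4": [1.1, 2.0, 3.2, 3.9],
}
dataset2 = {
    "LRPPRC": [2.0, 1.0, 4.0, 3.0],
    "G1": [4.0, 2.0, 8.0, 6.0],
    "G2": [2.0, 1.0, 4.0, 3.5],
    "G3": [3.0, 4.0, 1.0, 2.0],
    "G4": [1.0, 4.0, 2.0, 3.0],
}

near1 = neighbors(dataset1, "LRPPRC", k=2)
near2 = neighbors(dataset2, "LRPPRC", k=2)
common = set(near1) & set(near2)  # the overlap a Venn diagram would show
print(f"dataset 1: {near1}, dataset 2: {near2}, common: {sorted(common)}")
```

The set intersection plays the role of the Venn diagram in the workflow: only genes that are neighbors of the query in both datasets survive.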

Exercise