The Broad Institute of MIT and Harvard Differential Analysis.


Differential Analysis
Given phenotypically distinct classes (for example, Tumor vs. Normal), find “markers” that distinguish these classes from one another. This task is called marker selection.

Gene Marker Selection: hierarchy of difficulty (adapted from P. Tamayo)
Problems are listed in increasing degree of difficulty.

Problem | Gene markers | Error | Example
I. Tissue or Cell Type | ~1000-2000 | ~0% | Normal vs. Renal carcinoma
II. Morphological Type | ~200-500 | ~0-5% | Leukemia ALL vs. AML
III. Morphological Subtype | ~50-100 | ~0-15% | ALL B- vs. T-Cell
IV. Treatment Outcome | ~1-20 | ~5-50% | AML Treatment Outcome

Other problems shown on the difficulty scale: Normal vs. Abnormal, Multiclass Classification, Drug Sensitivity.

Gene Marker Selection: compute a score for each gene
- Inputs: dataset and phenotype/class labels.
- Compute a score for each gene: t-test, Signal-to-Noise Ratio (SNR), etc.
- Output: ranked gene list.
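The formulas for the two scores did not survive in this transcript. The sketch below uses the commonly cited definitions, which may differ in detail from what the slide showed: a two-sample t-score with unequal variances, t = (mean_A - mean_B) / sqrt(var_A/n_A + var_B/n_B), and SNR = (mean_A - mean_B) / (sd_A + sd_B). The function names and the toy matrix are hypothetical.

```python
import numpy as np

def split_classes(x, labels):
    """Split a genes-by-samples matrix into class 0 and class 1 columns."""
    labels = np.asarray(labels)
    return x[:, labels == 0], x[:, labels == 1]

def t_score(x, labels):
    """Per-gene two-sample t-score (unequal-variance form; an assumption)."""
    a, b = split_classes(x, labels)
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def snr_score(x, labels):
    """Per-gene signal-to-noise ratio: mean difference over summed SDs."""
    a, b = split_classes(x, labels)
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

# Toy data: 3 genes (rows) x 6 samples (columns), 3 samples per class.
expr = np.array([[1, 2, 3, 7, 8, 9],
                 [5, 5, 6, 5, 6, 5],
                 [9, 8, 7, 1, 2, 3]], dtype=float)
labels = [0, 0, 0, 1, 1, 1]

snr = snr_score(expr, labels)
ranked = np.argsort(-snr)  # gene indices, highest score first
print(snr, ranked)
```

Positive scores mark genes that are higher in class 0, negative scores genes that are higher in class 1, so markers for both classes come from the two ends of the ranked list.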

Gene Marker Selection: challenges
- Small sample size.
- Each gene tested is a separate hypothesis, which raises the likelihood of false positives.
- Gene interaction is not taken into account.

Gene Marker Selection: small sample size
With a small sample size it is easy to find genes that appear correlated with phenotype, even in pure noise:
- Generate a 10,000 x 100 matrix from a Gaussian (mean = 0, SD = 0.5).
- Pick n columns (n = 6, 14, 30, 100).
- Assign sample labels yellow and green.
- Select the top 25 markers for yellow and the top 25 markers for green.
Figure panels: yellow vs. green marker heat maps for 6, 14, 30, and 100 samples.
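The experiment above can be sketched in a few lines. This is a minimal illustration, assuming SNR as the score and an even split of the picked columns into yellow and green; the slide does not state either choice, and the function names are hypothetical.

```python
import numpy as np

def snr(x, labels):
    """Per-gene signal-to-noise ratio for labels coded 0 (yellow) / 1 (green)."""
    a, b = x[:, labels == 0], x[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

def top_marker_scores(n_samples, n_genes=10_000, n_top=25, seed=0):
    """Score pure-noise data and return the top scores for each class."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 0.5, size=(n_genes, 100))
    cols = rng.choice(100, size=n_samples, replace=False)
    x = data[:, cols]
    half = n_samples // 2
    labels = np.array([0] * half + [1] * (n_samples - half))
    order = np.argsort(snr(x, labels))
    scores = snr(x, labels)
    yellow = scores[order[-n_top:]]  # most positive: "higher in yellow"
    green = scores[order[:n_top]]    # most negative: "higher in green"
    return yellow, green

for n in (6, 14, 30, 100):
    yellow, green = top_marker_scores(n)
    print(f"n={n:3d}  weakest yellow marker SNR={yellow.min():.2f}  "
          f"weakest green marker SNR={green.max():.2f}")
```

No gene in this matrix carries any real signal, yet the selected "markers" look much stronger at n = 6 than at n = 100, which is the point of the slide.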

P-value calculation
If a gene's expression is normally distributed, the t-score follows the t-distribution. What if it isn't normally distributed?
Permutation test:
- Shuffle the labels (class membership).
- Compute the score for each gene (t-score, SNR, ...).
- Repeat many times to obtain an empirical null distribution of scores for each gene.
- Compare the observed score to the empirical distribution.
No distributional assumptions are made, and the resulting p-values are gene-specific.
Figure: distribution of permuted scores for a given gene, with the observed score of the gene marked.

Permutation test and p-value
Purpose: to determine how significant a gene's statistical score is.
- Score the gene on the “true” classes (known class A samples vs. known class B samples).
- Permute the labels (Permutation 1, Permutation 2, ..., Permutation n) and rescore the gene on each “called” class A vs. “called” class B split.
- The permuted scores generate a “null distribution” of values for this gene.
- Compare with the “real” score for this gene.
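A minimal sketch of the permutation test follows. It assumes SNR as the score, a two-sided comparison on absolute scores, and the common (count + 1) / (permutations + 1) estimate so that no p-value is exactly zero; none of these choices is stated in the slides, and the function names are hypothetical.

```python
import numpy as np

def snr(x, labels):
    """Per-gene signal-to-noise ratio for labels coded 0 / 1."""
    a, b = x[:, labels == 0], x[:, labels == 1]
    return (a.mean(axis=1) - b.mean(axis=1)) / (
        a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

def permutation_pvalues(x, labels, n_perm=500, seed=0):
    """Gene-specific empirical p-values from label permutations."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = snr(x, labels)
    exceed = np.zeros(x.shape[0])
    for _ in range(n_perm):
        shuffled = rng.permutation(labels)  # shuffle class membership
        exceed += np.abs(snr(x, shuffled)) >= np.abs(observed)
    # Fraction of permuted scores at least as extreme as the observed one.
    return observed, (exceed + 1.0) / (n_perm + 1.0)

# Toy data: 50 noise genes, with gene 0 shifted upward in class 0.
rng = np.random.default_rng(1)
expr = rng.normal(size=(50, 20))
labels = np.array([0] * 10 + [1] * 10)
expr[0, :10] += 3.0

scores, pvals = permutation_pvalues(expr, labels)
print(scores[0], pvals[0])
```

Because each gene is compared only with its own permuted scores, the p-values are gene-specific, and their resolution is limited by the number of permutations: with 500 permutations the smallest attainable value is 1/501.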

Marker Selection Process
1. Inputs: dataset and phenotype/class labels.
2. Compute score (t-test, SNR, etc.) to obtain a ranked gene list.
3. Measure significance: permutation test.
4. Correct for multiple hypotheses: FDR, FWER, etc.
5. Output: markers, each with a score and a measure of significance.

Multiple Hypotheses: what to control
- Bonferroni correction: the most conservative method. It divides the significance threshold by the number of hypotheses (equivalently, it multiplies each p-value by the number of hypotheses).
- FWER (Family-Wise Error Rate): the probability of calling one or more hypotheses significant when they are in fact null, i.e. of making at least one false positive.
- FDR (False Discovery Rate): the expected proportion of true null hypotheses among the results called significant.
- Try to reduce the number of hypotheses tested in the first place (i.e., filtering).
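The two kinds of correction can be sketched as follows. The slides name FDR without naming a procedure, so the Benjamini-Hochberg step-up procedure is used here as an assumption; the function names are hypothetical.

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values) for FDR control."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Make the adjusted values non-decreasing in the sorted order.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(scaled, 1.0)
    return q

pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

On this toy input all four tests pass an FDR threshold of 0.05, while only two pass the Bonferroni-adjusted threshold of 0.05, which illustrates why Bonferroni is described as the most conservative option.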

Exercise: ComparativeMarkerSelection module
1. Choose module: Gene List Selection > ComparativeMarkerSelection.
2. Choose input file: next to “input file”, choose “Specify URL”. View the datasets window in a Web browser, then click and drag all_aml_train.preprocessed.gct.
3. Choose class file: next to “cls file”, choose “Specify URL”. View the datasets window in a Web browser, then click and drag all_aml_train.cls.
4. Click Run.

Viewing Analysis Results

Differential Analysis Cookbook
- Reduce the number of hypotheses/genes by variation filtering (an attempt at reducing false negatives).
- Choose a test statistic (e.g., SNR, t-score, ...).
- If there are enough samples, compute p-values by permutation test (otherwise, compute an asymptotic test using the standard t-distribution).
- Control for multiple hypothesis testing by using the FDR correction.
  - Remember: if you choose FDR ≤ 0.05, you are willing to accept that about 5% of the genes called significant are false positives.
  - If the number of significant hypotheses/genes is “too large” even for very small threshold values, either use the maxT correction (possible with empirical p-values only) or use additional criteria (e.g., min fold-change, min expression value, etc.).
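The first step of the cookbook, variation filtering, can be sketched as below. This is an illustration only: the clipping bounds and the fold-change and difference thresholds are assumed example values, not settings taken from the slides, and the function is hypothetical rather than a reimplementation of any GenePattern module.

```python
import numpy as np

def variation_filter(x, floor=20.0, ceiling=20000.0,
                     min_fold=3.0, min_delta=100.0):
    """Keep genes whose expression varies enough across samples.

    x is a genes-by-samples matrix of non-log expression values.
    All threshold defaults are illustrative assumptions.
    """
    clipped = np.clip(x, floor, ceiling)       # tame very low and very high values
    hi = clipped.max(axis=1)
    lo = clipped.min(axis=1)
    keep = (hi / lo >= min_fold) & (hi - lo >= min_delta)
    return clipped[keep], keep

expr = np.array([[100.0, 100.0, 110.0],   # nearly flat: dropped
                 [50.0, 500.0, 200.0],    # 10-fold range: kept
                 [10.0, 30.0, 25.0]])     # low and flat after clipping: dropped
filtered, keep = variation_filter(expr)
print(keep, filtered.shape)
```

Genes removed here are never tested, so the later multiple-hypothesis correction is applied to fewer hypotheses.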

Differential Analysis: GenePattern modules
- Create expression data set: ExpressionFileCreator.
- Reduce number of hypotheses/genes by variation filtering: PreprocessDataset.
- Make class file.
- Run differential analysis: ComparativeMarkerSelection.
  - Choose test statistic (say, t-score).
  - If enough samples, compute p-values by permutation test (otherwise, use asymptotic test).
  - Control for MHT by using the FDR correction.
- View results with ComparativeMarkerSelectionViewer.
  - Use HeatMapViewer to view results for top genes.
- Use GSEA to find gene sets (or pathways) that are enriched in your dataset.

Working with Samples and Features

Overview
- Extracting a set of samples
- Computing co-expressed genes
- Converting probe set ids to gene names
- Computing overlap between gene sets

Working with Samples and Features: workflow
1. From a combined dataset of cancer and normal samples, select the normal samples.
2. Within the normal samples, find the genes coexpressed with LRPPRC (Affymetrix probe M92439_at), a gene with mitochondrial function.
3. Compare these genes and those coexpressed with LRPPRC in another expression dataset to determine the coexpressed genes common to both datasets.
Modules used: SelectFeaturesColumns, GeneNeighbors, GeneListSignificanceViewer, CollapseDataset, ExtractRowNames, VennDiagram.
Files in the workflow: GCM_Total.res, GCM_Normals.res, GCM_Normals.markerlist.odf, GCM_Normals.markerdata.gct, GCM_Total_Normals.markerdata.collapsed.gct, GCM_Total_Normals.markerdata.collapsed.row.names.txt.
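Steps 2 and 3 can be sketched outside GenePattern as follows. The sketch assumes Pearson correlation as the similarity measure (the slides do not say which measure the GeneNeighbors module uses) and that no gene is constant across samples; the function, gene names, and second neighbor list are hypothetical toy values.

```python
import numpy as np

def coexpressed_genes(x, names, target, k=50):
    """Return the k genes most Pearson-correlated with the target gene."""
    idx = names.index(target)
    centered = x - x.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)   # zero for constant genes
    corr = centered @ centered[idx] / (norms * norms[idx])
    order = np.argsort(-corr)                  # most similar first
    return [names[i] for i in order if i != idx][:k]

# Toy data: 4 genes (rows) x 4 samples (columns).
names = ["target", "g1", "g2", "g3"]
expr = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.0, 6.0, 8.0],   # correlation  1.0 with target
                 [4.0, 3.0, 2.0, 1.0],   # correlation -1.0 with target
                 [1.0, 3.0, 2.0, 4.0]])  # correlation  0.8 with target

neighbors_a = coexpressed_genes(expr, names, "target", k=2)
neighbors_b = ["g3", "g7"]               # stand-in list from a second dataset
common = sorted(set(neighbors_a) & set(neighbors_b))
print(neighbors_a, common)
```

The set intersection at the end plays the role of the overlap region in the Venn diagram. Comparing across datasets only makes sense once both lists use the same identifiers, which is why the workflow converts probe set ids to gene names first.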

Exercise
