Suppose we have analyzed a total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified; call them significant)


First step of making a story: Statistical significance of a particular "Functional cluster"

Suppose we have analyzed a total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified; call them significant). These form Cluster1.
Suppose that y out of the N total genes were classified into a specific "functional group", FCluster1.
Suppose that x genes are in both Cluster1 and FCluster1.

Q1: Is FCluster1 significantly associated with Cluster1?
Q2: Are significant genes overrepresented in this functional group when compared to their overall frequency among all analyzed genes?
Q3: If we randomly draw n out of N genes to be put in Cluster1, what is the chance that x or more of them fall into FCluster1?
Q4: If we randomly draw y out of N genes to be in FCluster1, what is the chance that x or more of them fall into Cluster1?

Testing correlations between gene groups and functional clusters

Deriving Fisher's exact test (hypergeometric distribution): a typical approach to deriving probability distribution functions for discrete data.
Simulation-based Fisher's test: data are simulated according to the "null" distribution many times, and the sampling distribution of the test statistic under the null hypothesis is calculated empirically.
Simulation-based testing is extremely common in computational biology.
Recreating the "null" distribution by simulation is often much easier than deriving it analytically.

External Validation

Correlate clusters, or just groups of differentially expressed genes, with "functional clusters".

                         Genes in FCluster1   Genes NOT in FCluster1   Total
Genes in Cluster1        x                    n-x                      n
Genes NOT in Cluster1    y-x                  N-y-(n-x)                N-n
Total                    y                    N-y                      N

Rows: expression clusters. Columns: functional clusters.
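As a minimal sketch, the 2x2 layout on this slide can be built in R from the four counts x, n, y and N. The helper name `make_two_by_two` is hypothetical (it is not part of the original slides); the example counts are the ones used in the muscle contraction example later in the deck.

```r
# Hypothetical helper (not from the slides): build the 2x2 membership table
# from x (overlap), n (Cluster1 size), y (FCluster1 size), N (all genes).
make_two_by_two <- function(x, n, y, N) {
  stopifnot(x <= n, x <= y, n <= N, y <= N, N - y - (n - x) >= 0)
  matrix(c(x,     n - x,
           y - x, N - y - (n - x)),
         nrow = 2, byrow = TRUE,
         dimnames = list(c("In Cluster1", "Not in Cluster1"),
                         c("In FCluster1", "Not in FCluster1")))
}

tab <- make_two_by_two(x = 3, n = 15, y = 36, N = 9316)
print(tab)
print(rowSums(tab))  # row margins: n and N - n
print(colSums(tab))  # column margins: y and N - y
```

The row and column sums recover the margins n, N-n, y and N-y, which is a quick sanity check before passing the matrix to a test.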

Statistical significance of a particular "Functional cluster" - cont

[Figure: the N genes g1, ..., gN drawn as boxes. The y genes in FCluster1 are red boxes; the n observed differentially expressed genes have blue borders.]

Before finding differentially expressed genes, we know that y out of N genes belong to FCluster1 (red boxes).
If there is no association between Cluster1 and FCluster1, the number of genes in FCluster1 that turn out to be differentially expressed is not significantly greater than the number we would get by randomly drawing n out of N genes (boxes with blue border) and counting how many belong to FCluster1.
Outcomes (o1,...,oT): a set of n genes is randomly selected from the list of N genes, and the borders of their boxes are painted blue.
Event of interest (E): the set of all outcomes for which the number of red boxes with a blue border is equal to x.
Since the drawing is random, all outcomes are equally probable.

Statistical significance of a particular "Functional cluster" - cont

Outcomes (o1,...,oT): a set of y genes is randomly selected from the list of N genes.
Event of interest (E): the set of all outcomes for which the number of red boxes among the y boxes drawn is equal to x.
All we have to do is calculate M and T, where:
T = number of different sets of y genes that can be drawn out of the total of N genes. This comes from the fact that the order in which we pick genes does not matter.
M = number of different ways to obtain x red boxes (significant genes) when drawing y boxes (genes) out of the total of N boxes (genes), n of which are red (significant). First pick x red boxes. For each such set of x red boxes, pick a set of y-x non-red boxes.
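The formulas themselves did not survive in the transcript. Under the setup described on this slide (y genes drawn from N, of which n are red), the counting argument leads to the standard hypergeometric probability. The block below is a reconstruction from that description, not a copy of the original slide:

```latex
T = \binom{N}{y}, \qquad
M = \binom{n}{x}\binom{N-n}{y-x}, \qquad
P(E) = \frac{M}{T} = \frac{\binom{n}{x}\binom{N-n}{y-x}}{\binom{N}{y}}
```

The same value is obtained if the roles are swapped, that is, if n genes are drawn and y of the N genes are red: \binom{y}{x}\binom{N-y}{n-x} / \binom{N}{n}. This is why questions Q3 and Q4 have the same answer.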

Statistical significance of a particular "Functional cluster" - p-value

Fisher's exact test, or the "hypergeometric" test.
P-value: the probability of observing x or more significant genes under the null hypothesis.

                         Genes in FCluster1   Genes NOT in FCluster1   Total
Genes in Cluster1        x                    n-x                      n
Genes NOT in Cluster1    y-x                  N-y-(n-x)                N-n
Total                    y                    N-y                      N

Rows: expression clusters. Columns: functional clusters.
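To illustrate the p-value definition on this slide, the sketch below computes P(X >= x) three ways in base R: with `phyper`, with an explicit sum of hypergeometric terms, and with `fisher.test`. The counts are the ones from the muscle contraction example in this deck; the variable names are chosen here for illustration only.

```r
# Counts taken from the muscle contraction example in this deck
x <- 3; n <- 15; y <- 36; N <- 9316

# Upper tail of the hypergeometric distribution: P(X >= x) when n genes are
# drawn at random from N genes, y of which belong to FCluster1
p_tail <- phyper(x - 1, m = y, n = N - y, k = n, lower.tail = FALSE)

# The same tail written as an explicit sum of hypergeometric probabilities
k <- x:min(n, y)
p_sum <- sum(choose(y, k) * choose(N - y, n - k)) / choose(N, n)

# One-sided Fisher's exact test on the corresponding 2x2 table
two_by_two <- matrix(c(x, n - x, y - x, N - y - (n - x)), nrow = 2, byrow = TRUE)
p_fisher <- fisher.test(two_by_two, alternative = "greater")$p.value

print(c(phyper = p_tail, sum = p_sum, fisher = p_fisher))
```

All three numbers should agree up to floating-point error, which shows that the one-sided Fisher p-value is just the upper tail of the hypergeometric distribution.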

Statistically Significant GO Categories

Top 2 GO Categories for genes with FDR< 0.01

GO Term 1
FDR for the category=
GOID = GO:
Term = muscle contraction
Definition = A process leading to shortening and/or development of tension in muscle tissue. Muscle contraction occurs by a sliding filament mechanism whereby actin filaments slide inward among the myosin filaments.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    3   12
[2,]

GO Term 2
FDR for the category=
GOID = GO:
Term = regulation of muscle contraction
Definition = Any process that modulates the frequency, rate or extent of muscle contraction.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    2   13
[2,]

Statistically Significant GO Categories

Top 2 GO Categories for genes with FDR< 0.05

GO Term 1
FDR for the category=
GOID = GO:
Term = extracellular region
Synonym = extracellular
Definition = The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite.
Ontology = CC
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

GO Term 2
FDR for the category=
GOID = GO:
Term = extracellular space
Synonym = intercellular space
Definition = That part of a multicellular organism outside the cells proper, usually taken to be outside the plasma membranes, and occupied by fluid.
Ontology = CC
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

Statistically Significant GO Categories

> fisher.test(matrix(c(3,12,33,9268),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(3, 12, 33, 9268), byrow = T, ncol = 2)
p-value = 2.336e-05
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval: Inf
sample estimates:
odds ratio 69

> fisher.test(matrix(c(160,544,1381,7231),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(160, 544, 1381, 7231), byrow = T, ncol = 2)
p-value = 6.089e-06
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval: Inf
sample estimates:
odds ratio

Statistically Significant GO Categories

> fisher.test(matrix(c(4,1752,0,9268),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(4, 1752, 0, 9268), byrow = T, ncol = 2)
p-value =
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval: Inf
sample estimates:
odds ratio Inf

> fisher.test(matrix(c(5,1751,0,9268),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(5, 1751, 0, 9268), byrow = T, ncol = 2)
p-value =
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval: Inf
sample estimates:
odds ratio Inf

Re-sampling based testing

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    3   12
[2,]
p-value = 2.336e-05

Randomly drawing 15 "significant" genes out of the total of 9316 genes:
> sample(1:9316,15, replace = FALSE, prob = NULL)
[1]
> sample(1:9316,15, replace = FALSE, prob = NULL)
[1]
> sample(1:9316,15, replace = FALSE, prob = NULL)
[1]

Without loss of generality, we can assume that the first 36 genes belong to FCluster1.

Overall strategy:
Generate many random samples.
For each random sample, count how many genes belong to the first 36 (x).
Calculate the proportion of samples for which x ≥ 3.

Re-sampling based testing

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    3   12
[2,]
p-value = 2.336e-05

Randomly drawing 15 "significant" genes out of the total of 9316 genes:
> for(i in 1: ){
+ x<-sum(sample(1:9316,15, replace = FALSE, prob = NULL)<=36)
+ ngreater3<-ngreater3+(x>=3)
+ }
> ngreater3/
[1] 1.7e-05

The approximate, resampling-based p-value is close to the p-value based on Fisher's test and the hypergeometric distribution.
Approximating very small p-values is difficult with resampling methods: a huge number of replicates is needed to be precise.
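The remark about needing a huge number of replicates can be made concrete with a short calculation. This is a sketch under the assumption that the B replicates are independent, so that the count of replicates with x ≥ 3 is binomial; the symbol B is introduced here and does not appear on the slide.

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p\,(1-p)}{B}}, \qquad
\frac{\mathrm{SE}(\hat{p})}{p} \approx \frac{1}{\sqrt{B\,p}} \quad \text{for small } p

% Worked example with p = 2.336 \times 10^{-5} and a target relative error of 10%:
B \approx \frac{1}{0.1^{2}\, p} = \frac{100}{2.336 \times 10^{-5}} \approx 4.3 \times 10^{6}
```

So roughly four million random draws would be needed just to pin down this p-value to within about 10% of its size, whereas the exact hypergeometric calculation is immediate.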

Re-sampling based testing

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    1   14
[2,]

> fisher.test(matrix(c(1,14,35,9266),byrow=T,ncol=2), alternative = "greater")
Fisher's Exact Test for Count Data
data: matrix(c(1, 14, 35, 9266), byrow = T, ncol = 2)
p-value =
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval: Inf
sample estimates:
odds ratio

> ngreater3<-0
> for(i in 1: ){
+ x<-sum(sample(1:9316,15, replace = FALSE, prob = NULL)<=36)
+ ngreater3<-ngreater3+(x>=1)
+ }
> ngreater3/
[1]
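The resampling loop on these slides can be wrapped into a small function and compared with the exact hypergeometric tail. This is a minimal sketch: the function name `resample_pvalue`, the seed, and the replicate count of 20000 are assumptions made here (the replicate count used on the slides is missing from the transcript). The x ≥ 1 case is used because its p-value is large enough to be estimated well with a modest number of replicates.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

# Hypothetical helper (not from the slides): estimate P(X >= x_obs) by
# repeatedly drawing n_draw genes out of n_total and counting how many fall
# among the first n_group genes (taken to be FCluster1, as on the slides).
resample_pvalue <- function(x_obs, n_draw, n_group, n_total, n_reps) {
  hits <- replicate(n_reps, sum(sample.int(n_total, n_draw) <= n_group))
  mean(hits >= x_obs)
}

p_hat <- resample_pvalue(x_obs = 1, n_draw = 15, n_group = 36,
                         n_total = 9316, n_reps = 20000)

# Exact value from the hypergeometric distribution, for comparison
p_exact <- phyper(0, m = 36, n = 9316 - 36, k = 15, lower.tail = FALSE)

print(c(resampled = p_hat, exact = p_exact))
```

With a p-value of this size the two numbers should agree to about two decimal places. For the x ≥ 3 case the same function works, but far more replicates are needed before the estimate stabilizes.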

GO Significance

#Loading Annotation Libraries
library(annotate)
library(mouseLLMappings)
library(GO)

#Forming lists of all GO terms
GO<-as.list(GOTERM)
AllGOTerms<-names(GO)

#Getting LocusLink (i.e. Entrez Gene IDs) genes on the microarray
ACC2LL <- as.list(mouseLLMappingsACCNUM2LL)
AllGenesLL<-ACC2LL[as.character(LimmaDataNickel$genes[,"Name"])]
AllGenesLL<-unique(unlist(AllGenesLL[!is.na(names(AllGenesLL))]))
nGenesLL<-length(AllGenesLL)

#Getting frequencies of genes on the microarray in each GO category
GenesInGO<-sapply(AllGOTerms, function(x) {
  llsGO<-intersect(get(x,GOALLLOCUSID),AllGenesLL)
  nlls<-length(llsGO)
  nlls
})

#Selecting categories with at least 5 genes
AllGE5<-AllGOTerms[GenesInGO>=5]

GO Significance

#Significant genes (FDR below the chosen threshold) mapped to LocusLink IDs
sig.genes<-(EBayesFDRD<FDR)
sum(sig.genes,na.rm=T)
sig.genes.LL<-ACC2LL[as.character(LimmaDataNickel$genes[sig.genes,"Name"])]
sig.genes.LL<-unique(unlist(sig.genes.LL[!is.na(names(sig.genes.LL))]))
nSigGenes<-length(sig.genes.LL)
nNonSigGenes<-nGenesLL-nSigGenes

#Two by two table for the first GO category
nSigGenesInGO<-length(intersect(sig.genes.LL,get(AllGE5[1],GOALLLOCUSID)))
nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),get(AllGE5[1],GOALLLOCUSID)))
nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
twoBytwo<-matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2)

#One-sided Fisher's exact test for every GO category with at least 5 genes
FisherPValues<-sapply(AllGE5,function(x) {
  currentLL<-get(x,GOALLLOCUSID)
  nSigGenesInGO<-length(intersect(sig.genes.LL,currentLL))
  nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
  nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),currentLL))
  nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
  twoBytwo<-matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2)
  fisher.test(twoBytwo, alternative = "greater")$p.value
})

#Adjusting the p-values for multiple testing
FisherFDR<-p.adjust(FisherPValues,method="fdr")

GO Significance

#Reporting the most significant GO categories
SigGOs<-(FisherFDR<0.05)
MostSigGOs<-order(FisherFDR)
nSigGOs<-2
cat("\n\nTop ",nSigGOs," GO Categories for genes with FDR<",FDR,"\n")
for(TopGO in 1:nSigGOs){
  GO<-AllGE5[MostSigGOs[TopGO]]
  currentLL<-get(GO,GOALLLOCUSID)
  nSigGenesInGO<-length(intersect(sig.genes.LL,currentLL))
  nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
  nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),currentLL))
  nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
  twoBytwo<-matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2)
  cat("GO Term ",TopGO,"\n")
  cat("FDR for the category= ",FisherFDR[MostSigGOs[TopGO]],"\n")
  print(get(GO,GOTERM))
  cat("\nTwo By Two matrix of gene memberships in this category\n")
  print(twoBytwo)
}
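The per-category logic in the code on these slides can be tried without any annotation packages. The sketch below uses only base R and a made-up toy data set: the gene names, the three categories (`catA`, `catB`, `catC`) and the helper `enrichment_p` are all hypothetical and stand in for the LocusLink IDs and GO categories used on the slides.

```r
# Toy data (hypothetical): 200 genes, the first 20 are "significant"
all_genes <- paste0("g", 1:200)
sig_genes <- paste0("g", 1:20)

# Toy "functional categories" given as plain lists of gene names
categories <- list(
  catA = paste0("g", c(1:10, 101:105)),   # 10 of its 15 genes are significant
  catB = paste0("g", c(15:16, 110:140)),  # 2 of its 33 genes are significant
  catC = paste0("g", 150:170)             # no significant genes
)

# One-sided Fisher's exact test for a single category
enrichment_p <- function(members, sig, universe) {
  members <- intersect(members, universe)
  sig_in  <- length(intersect(sig, members))
  sig_out <- length(sig) - sig_in
  non_in  <- length(members) - sig_in
  non_out <- length(universe) - length(sig) - non_in
  two_by_two <- matrix(c(sig_in, sig_out, non_in, non_out),
                       nrow = 2, byrow = TRUE)
  fisher.test(two_by_two, alternative = "greater")$p.value
}

# Test every category, then adjust for multiple testing
p_values <- vapply(categories, enrichment_p, numeric(1),
                   sig = sig_genes, universe = all_genes)
fdr <- p.adjust(p_values, method = "fdr")
print(sort(fdr))
```

Only `catA` should come out as enriched here. A category with no significant genes gets a one-sided p-value of 1, since observing zero or more significant genes is certain.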

GO Significance

This is a bit complicated for our microarray data.
For microarrays with an "annotation package", everything is much easier.
For example, all Affymetrix arrays have appropriate annotation packages.
The GOstats package can be used to calculate significances directly.
For a nice example that you can follow by simply copying and pasting commands into your R session: