First step of making a story: Statistical significance of a particular "Functional cluster"

Suppose we have analyzed a total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified; call them significant). These n genes form Cluster1.
Suppose that y of the N genes were classified into a specific "functional group", FCluster1.
Suppose that x genes are in both Cluster1 and FCluster1.

Q1: Is FCluster1 significantly associated with Cluster1?
Q2: Are significant genes overrepresented in this functional group compared to their overall frequency among all analyzed genes?
Q3: If we randomly draw n out of N genes to put in Cluster1, what is the chance that x or more of them fall into FCluster1?
Q4: If we randomly draw y out of N genes to put in FCluster1, what is the chance that x or more of them fall into Cluster1?
Testing correlations between gene groups and functional clusters

Deriving Fisher's exact test (hypergeometric distribution): a typical approach to deriving probability distribution functions for discrete data.
Simulation-based Fisher's test: data are simulated according to the "null" distribution many times, and the sampling distribution of the test statistic under the null hypothesis is calculated empirically.
Simulation-based testing is extremely common in computational biology.
Recreating the "null" distribution by simulation is often much easier than deriving it analytically.
External Validation

Correlate clusters, or just groups of differentially expressed genes, with "functional clusters".

                         Genes in FCluster1   Genes NOT in FCluster1   Total
Genes in Cluster1        x                    n-x                      n
Genes NOT in Cluster1    y-x                  N-y-(n-x)                N-n
Total                    y                    N-y                      N

Rows: expression clusters. Columns: functional clusters.
Statistical significance of a particular "Functional cluster" - cont

[Figure: genes g1, ..., gN drawn as boxes, with the FCluster1 genes g1, ..., gy in red; in the "Observed" row, n boxes have blue borders, x of them red.]

Before finding differentially expressed genes, we know that y out of N genes belong to FCluster1 (red boxes).
If there is no association between Cluster1 and FCluster1, the number of genes in FCluster1 that turn out to be differentially expressed is not significantly greater than the number we would get by randomly drawing n out of N genes (boxes with blue borders) and counting how many belong to FCluster1.
Outcomes (o_1, ..., o_T): a set of n genes is randomly selected from the list of N genes, and the borders of their boxes are painted blue.
Event of interest (E): the set of all outcomes for which the number of red boxes with a blue border is equal to x.
Since the drawing is random, all outcomes are equally probable.
Statistical significance of a particular "Functional cluster" - cont

Outcomes (o_1, ..., o_T): a set of y genes selected from the list of N genes.
Event of interest (E): the set of all outcomes for which the number of red boxes among the y boxes drawn is equal to x.
Because all T outcomes are equally probable, P(E) = M/T, so all we have to do is calculate M and T, where:
T = number of different sets of y genes that can be drawn from a total of N genes. This count comes from the fact that the order in which we pick genes does not matter.
M = number of different ways to obtain x red boxes (significant genes) when drawing y boxes (genes) out of a total of N boxes (genes), n of which are red (significant). First pick x red boxes; then, for each such set of x red boxes, pick a set of y-x non-red boxes.
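Written out with binomial coefficients, the two counts described above take the following form. This is a reconstruction from the verbal description (sets chosen without regard to order; pick x red boxes, then y-x non-red boxes), using the symbols N, n, y and x as defined earlier; it is not copied from the original slide.

```latex
\[
T = \binom{N}{y}, \qquad
M = \binom{n}{x}\binom{N-n}{y-x}, \qquad
P(E) = \frac{M}{T} = \frac{\binom{n}{x}\binom{N-n}{y-x}}{\binom{N}{y}}
\]
```

Swapping the roles of the two groups (drawing n genes and counting how many of the y FCluster1 genes are hit) gives the same probability, \(\binom{y}{x}\binom{N-y}{n-x} / \binom{N}{n}\), which is why Q3 and Q4 have the same answer.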
Statistical significance of a particular "Functional cluster" - p-value

Fisher's exact test, or the "hypergeometric" test.
P-value: the probability of observing x or more significant genes in FCluster1 under the null hypothesis.

                         Genes in FCluster1   Genes NOT in FCluster1   Total
Genes in Cluster1        x                    n-x                      n
Genes NOT in Cluster1    y-x                  N-y-(n-x)                N-n
Total                    y                    N-y                      N

Rows: expression clusters. Columns: functional clusters.
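As a minimal sketch of this p-value, the R code below computes P(X >= x) three ways: by summing hypergeometric probabilities directly, with `phyper`, and with a one-sided `fisher.test`. The counts are the ones used in the `fisher.test(matrix(c(3,12,33,9268), ...))` call elsewhere in these slides; the variable names are illustrative assumptions.

```r
# Counts taken from the slides' example: N = 3 + 12 + 33 + 9268
N <- 9316   # all analyzed genes
n <- 15     # significant genes (Cluster1)
y <- 36     # genes in the functional category (FCluster1)
x <- 3      # genes in both Cluster1 and FCluster1

# 2x2 table: rows = in / not in Cluster1, columns = in / not in FCluster1
two_by_two <- matrix(c(x,     n - x,
                       y - x, N - y - (n - x)),
                     byrow = TRUE, ncol = 2)

# (a) Direct sum of hypergeometric probabilities for k = x, ..., min(n, y)
k <- x:min(n, y)
p_sum <- sum(choose(y, k) * choose(N - y, n - k)) / choose(N, n)

# (b) Upper tail of the hypergeometric distribution: P(X >= x) = P(X > x - 1)
p_hyper <- phyper(x - 1, m = y, n = N - y, k = n, lower.tail = FALSE)

# (c) One-sided Fisher's exact test on the 2x2 table
p_fisher <- fisher.test(two_by_two, alternative = "greater")$p.value

print(c(sum = p_sum, phyper = p_hyper, fisher = p_fisher))
```

All three values should agree up to floating-point error, since the one-sided Fisher's exact test is the upper tail of the hypergeometric distribution.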
Statistically Significant GO Categories

Top 2 GO Categories for genes with FDR< 0.01

GO Term 1
FDR for the category=
GOID = GO:
Term = muscle contraction
Definition = A process leading to shortening and/or development of tension in muscle tissue. Muscle contraction occurs by a sliding filament mechanism whereby actin filaments slide inward among the myosin filaments.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    3   12
[2,]

GO Term 2
FDR for the category=
GOID = GO:
Term = regulation of muscle contraction
Definition = Any process that modulates the frequency, rate or extent of muscle contraction.
Ontology = BP
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    2   13
[2,]
Statistically Significant GO Categories

Top 2 GO Categories for genes with FDR< 0.05

GO Term 1
FDR for the category=
GOID = GO:
Term = extracellular region
Synonym = extracellular
Definition = The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite.
Ontology = CC
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]

GO Term 2
FDR for the category=
GOID = GO:
Term = extracellular space
Synonym = intercellular space
Definition = That part of a multicellular organism outside the cells proper, usually taken to be outside the plasma membranes, and occupied by fluid.
Ontology = CC
Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]
[2,]
Statistically Significant GO Categories

> fisher.test(matrix(c(3,12,33,9268),byrow=T,ncol=2), alternative = "greater")

        Fisher's Exact Test for Count Data

data:  matrix(c(3, 12, 33, 9268), byrow = T, ncol = 2)
p-value = 2.336e-05
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 Inf
sample estimates:
odds ratio
        69

> fisher.test(matrix(c(160,544,1381,7231),byrow=T,ncol=2), alternative = "greater")

        Fisher's Exact Test for Count Data

data:  matrix(c(160, 544, 1381, 7231), byrow = T, ncol = 2)
p-value = 6.089e-06
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 Inf
sample estimates:
odds ratio
Statistically Significant GO Categories

> fisher.test(matrix(c(4,1752,0,9268),byrow=T,ncol=2), alternative = "greater")

        Fisher's Exact Test for Count Data

data:  matrix(c(4, 1752, 0, 9268), byrow = T, ncol = 2)
p-value =
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 Inf
sample estimates:
odds ratio
       Inf

> fisher.test(matrix(c(5,1751,0,9268),byrow=T,ncol=2), alternative = "greater")

        Fisher's Exact Test for Count Data

data:  matrix(c(5, 1751, 0, 9268), byrow = T, ncol = 2)
p-value =
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 Inf
sample estimates:
odds ratio
       Inf

>
Re-sampling based testing

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    3   12
[2,]
p-value = 2.336e-05

Randomly drawing 15 "significant" genes out of the total of 9316 genes:
> sample(1:9316,15, replace = FALSE, prob = NULL)
[1]
> sample(1:9316,15, replace = FALSE, prob = NULL)
[1]
> sample(1:9316,15, replace = FALSE, prob = NULL)
[1]

Without loss of generality, we can assume that the first 36 genes belong to FCluster1.

Overall strategy:
Generate many random samples.
For each random sample, count how many genes belong to the first 36 (x).
Calculate the proportion of samples for which x >= 3.
Re-sampling based testing

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    3   12
[2,]
p-value = 2.336e-05

Randomly drawing 15 "significant" genes out of the total of 9316 genes:
> for(i in 1: ){
+ x<-sum(sample(1:9316,15, replace = FALSE, prob = NULL)<=36)
+ ngreater3<-ngreater3+(x>=3)
+ }
> ngreater3/
[1] 1.7e-05

The approximate, resampling-based p-value is close to the p-value based on Fisher's test and the hypergeometric distribution.
Approximating very small p-values with resampling methods is difficult: a huge number of replicates is needed for a precise estimate.
Re-sampling based testing

Two By Two matrix of gene memberships in this category
     [,1] [,2]
[1,]    1   14
[2,]

> fisher.test(matrix(c(1,14,35,9266),byrow=T,ncol=2), alternative = "greater")

        Fisher's Exact Test for Count Data

data:  matrix(c(1, 14, 35, 9266), byrow = T, ncol = 2)
p-value =
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 Inf
sample estimates:
odds ratio

> ngreater3<-0
> for(i in 1: ){
+ x<-sum(sample(1:9316,15, replace = FALSE, prob = NULL)<=36)
+ ngreater3<-ngreater3+(x>=1)
+ }
> ngreater3/
[1]
>
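The resampling strategy can be sketched as a small self-contained function. The function name `resample_pvalue`, the seed, and the replicate count of 20000 are assumptions made for illustration (the replicate count used on the slides is not shown); the example uses the threshold x >= 1 because that p-value is large enough to estimate with a modest number of replicates.

```r
# Resampling estimate of P(X >= x_obs) when n_draw genes are drawn at random
# (without replacement) from n_total genes, of which the first n_group are
# taken, without loss of generality, to be the FCluster1 genes.
resample_pvalue <- function(x_obs, n_draw, n_group, n_total, n_rep) {
  hits <- replicate(n_rep, sum(sample(n_total, n_draw) <= n_group))
  mean(hits >= x_obs)
}

set.seed(1)      # assumed seed, only so the run is reproducible
n_rep <- 20000   # assumed replicate count, chosen to keep the run fast

p_sim <- resample_pvalue(x_obs = 1, n_draw = 15, n_group = 36,
                         n_total = 9316, n_rep = n_rep)

# Exact hypergeometric tail for comparison: P(X >= 1) = P(X > 0)
p_exact <- phyper(0, m = 36, n = 9316 - 36, k = 15, lower.tail = FALSE)

# Monte Carlo standard error of the estimated proportion
se_sim <- sqrt(p_exact * (1 - p_exact) / n_rep)

print(c(simulated = p_sim, exact = p_exact, se = se_sim))
```

The standard error formula also shows why very small p-values are hard to approximate: for a true p-value near 2.3e-05, a relative standard error of about 10% requires roughly 100/p replicates, which is on the order of four million.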
GO Significance

#Loading Annotation Libraries
library(annotate)
library(mouseLLMappings)
library(GO)
#Forming lists of all GO terms
GO<-as.list(GOTERM)
AllGOTerms<-names(GO)
#Getting LocusLink (i.e. Entrez Gene IDs) genes on the microarray
ACC2LL <- as.list(mouseLLMappingsACCNUM2LL)
AllGenesLL<-ACC2LL[as.character(LimmaDataNickel$genes[,"Name"])]
AllGenesLL<-unique(unlist(AllGenesLL[!is.na(names(AllGenesLL))]))
nGenesLL<-length(AllGenesLL)
#Getting frequencies of genes on the microarray in each GO category
GenesInGO<-sapply(AllGOTerms, function(x) {
  llsGO<-intersect(get(x,GOALLLOCUSID),AllGenesLL)
  length(llsGO)
})
#Selecting categories with at least 5 genes
AllGE5<-AllGOTerms[GenesInGO>=5]
GO Significance

#Identifying significant genes and mapping them to LocusLink IDs
sig.genes<-(EBayesFDRD<FDR)
sum(sig.genes,na.rm=TRUE)
sig.genes.LL<-ACC2LL[as.character(LimmaDataNickel$genes[sig.genes,"Name"])]
sig.genes.LL<-unique(unlist(sig.genes.LL[!is.na(names(sig.genes.LL))]))
nSigGenes<-length(sig.genes.LL)
nNonSigGenes<-nGenesLL-nSigGenes
#Two by two table for the first GO category
nSigGenesInGO<-length(intersect(sig.genes.LL,get(AllGE5[1],GOALLLOCUSID)))
nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),get(AllGE5[1],GOALLLOCUSID)))
nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
twoBytwo<-matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=TRUE,ncol=2)
#One-sided Fisher's exact test p-value for every GO category
FisherPValues<-sapply(AllGE5,function(x) {
  currentLL<-get(x,GOALLLOCUSID)
  nSigGenesInGO<-length(intersect(sig.genes.LL,currentLL))
  nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
  nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),currentLL))
  nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
  twoBytwo<-matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=TRUE,ncol=2)
  fisher.test(twoBytwo, alternative = "greater")$p.value
})
#FDR adjustment across all tested categories
FisherFDR<-p.adjust(FisherPValues,method="fdr")
GO Significance

SigGOs<-(FisherFDR<0.05)
MostSigGOs<-order(FisherFDR)
nSigGOs<-2
cat("\n\nTop ",nSigGOs," GO Categories for genes with FDR<",FDR,"\n")
#Printing the term description and the two by two table for the top categories
for(TopGO in 1:nSigGOs){
  GO<-AllGE5[MostSigGOs[TopGO]]
  currentLL<-get(GO,GOALLLOCUSID)
  nSigGenesInGO<-length(intersect(sig.genes.LL,currentLL))
  nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO
  nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),currentLL))
  nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO
  twoBytwo<-matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=TRUE,ncol=2)
  cat("GO Term ",TopGO,"\n")
  cat("FDR for the category= ",FisherFDR[MostSigGOs[TopGO]],"\n")
  print(get(GO,GOTERM))
  cat("\nTwo By Two matrix of gene memberships in this category\n")
  print(twoBytwo)
}
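The per-category testing logic can be tried without any annotation packages. The sketch below uses entirely hypothetical toy data (made-up gene IDs and three made-up categories, `catA`, `catB`, `catC`) and an illustrative helper `enrichment_pvalue`; it mirrors the steps of building a 2x2 table per category, running a one-sided Fisher's exact test, and adjusting the p-values with `p.adjust`.

```r
# Hypothetical toy data: 200 gene IDs, the first 20 of which are "significant"
all_genes <- paste0("g", 1:200)
sig_genes <- all_genes[1:20]
categories <- list(
  catA = all_genes[c(1:8, 101:104)],     # 8 of 12 genes are significant
  catB = all_genes[c(15:16, 120:137)],   # 2 of 20 genes are significant
  catC = all_genes[150:160]              # no significant genes
)

# One-sided Fisher's exact test for one category (illustrative helper)
enrichment_pvalue <- function(category_genes, sig_genes, all_genes) {
  category_genes <- intersect(category_genes, all_genes)
  non_sig <- setdiff(all_genes, sig_genes)
  n_sig_in <- length(intersect(sig_genes, category_genes))
  n_non_in <- length(intersect(non_sig, category_genes))
  two_by_two <- matrix(c(n_sig_in, length(sig_genes) - n_sig_in,
                         n_non_in, length(non_sig) - n_non_in),
                       byrow = TRUE, ncol = 2)
  fisher.test(two_by_two, alternative = "greater")$p.value
}

# Test every category, then adjust for multiple testing
p_values <- sapply(categories, enrichment_pvalue,
                   sig_genes = sig_genes, all_genes = all_genes)
fdr <- p.adjust(p_values, method = "fdr")

# Categories ordered from most to least significant
print(fdr[order(fdr)])
```

With real data, the category gene lists would come from an annotation source rather than being typed in by hand; only the counting and testing steps carry over from this toy version.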
GO Significance

This is a bit complicated for our microarray data.
For microarrays with an "annotation package", everything is much easier.
For example, all Affymetrix arrays have appropriate annotation packages.
The GOstats package can be used to calculate significance directly.
For a nice example that you can follow by simply copying and pasting commands into your R session: