Presentation is loading. Please wait.

Presentation is loading. Please wait.

Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant)

Similar presentations


Presentation on theme: "Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant)"— Presentation transcript:

1 Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant) - form the Cluster 1 Suppose that y out of N total genes were classified into a specific "Functional group" - FCluster 1 Suppose that x are in both Cluster1 and FCluster1 Q1: Is this FCluster1 significantly associated with Cluster1? Q2: Are significant genes overrepresented in this functional group when compared to their overall frequency among all analyzed genes? Q3: If we randomly draw n out of N genes to be put in Cluster1, what is the chance of getting x or more of them to fall into FCluster1? Q4: If we randomly draw y out of N genes to be in FCluster1, what is the chance of getting x of them or more to fall into Cluster1 First step of making a story: Statistical significance of a particular "Functional cluster"

2 Deriving the Fisher's exact test (hypergeometric distribution) - a typical approach to deriving probability distribution functions for discrete data Simulation-based Fisher's test - data is simulated according the "null" distribution many times and the sampling distribution of the test hypothesis under the null hypothesis is calculated empirically Simulation-based testing in computational biology is extremely common Recreating the true "null" distribution by simulations is often much easier than deriving the true null distribution Testing correlations between genes groups and functional clusters

3 External Validation Correlate clusters or just groups of differentially expressed genes with "functional clusters" Genes in FCluster 1 Genes NOT in FCluster 1 Genes in Cluster 1 xn-xn Genes NOT in Cluster 1 y-xN-y-(n-x)N-n yN-yN Expression Clusters Functional Clusters f

4 Statistical significance of a particular "Functional cluster" - cont g y+1 g1g1 gygy gNgN... g1g1 gxgx g x+1 gygy g n+y+1 g y+1 g n+(y-x) gNgN... Observed Before finding differentially expressed genes we know that y out of N genes belong to FCluster1(red boxes) If there is no association between Cluster1 and FCluster1, number of genes in FCluster1 that turn out to be differentially expressed is not significantly greater than number we can get by randomly drawing n out of N genes (boxes with blue border) and counting how many belong to FCluster Outcomes (o 1,...,o T ): A set of n genes was randomly selected from the list of N genes and borders of their boxes are painted blue. Event of interest (E): Set of all outcomes for which the number of red boxes with blue border is equal to x Since drawing is random all outcomes are equally probable

5 Statistical significance of a particular "Functional cluster" - cont Outcome (o 1,...,o T ): A set of y genes with selected from the list of N genes Event of interest (E): Set of all outcomes for which the number of red boxes among the y boxes drawn is equal to x All we have to do is calculating M and N where: T=number of different sets we can draw a set of y genes out of total of N genes M=number of different ways to obtain x red boxes (significant genes) when drawing y boxes (genes) out of total of N boxes (genes), x of which are red (significant) Comes from the fact that order in which we pick genes does not matter First pick x red boxes. For each such set of x red boxes pick a set of y-x non-red boxes

6 Statistical significance of a particular "Functional cluster" - p-value Fisher's exact test or the "hypergeometric" test P-value: Probability of observing x or more significant genes under the null hypothesis Genes in FCluster 1 Genes NOT in FCluster 1 Genes in Cluster 1 xn-xn Genes NOT in Cluster 1 y-xN-y-(n-x)N-n yN-yN Expression Clusters Functional Clusters

7 Top 2 GO Categories for genes with FDR< 0.01 GO Term 1 FDR for the category= 0.0442416 GOID = GO:0006936 Term = muscle contraction Definition = A process leading to shortening and/or development of tension in muscle tissue. Muscle contraction occurs by a sliding filament mechanism whereby actin filaments slide inward among the myosin filaments. Ontology = BP Two By Two matrix of gene memberships in this category [,1] [,2] [1,] 3 12 [2,] 33 9268 GO Term 2 FDR for the category= 0.1769315 GOID = GO:0006937 Term = regulation of muscle contraction Definition = Any process that modulates the frequency, rate or extent of muscle contraction. Ontology = BP Two By Two matrix of gene memberships in this category [,1] [,2] [1,] 2 13 [2,] 11 9290 Statistically Significant GO Categories

8 Top 2 GO Categories for genes with FDR< 0.05 GO Term 1 FDR for the category= 0.006130206 GOID = GO:0005576 Term = extracellular region Synonym = extracellular Definition = The space external to the outermost structure of a cell. For cells without external protective or external encapsulating structures this refers to space outside of the plasma membrane. This term covers the host cell environment outside an intracellular parasite. Ontology = CC Two By Two matrix of gene memberships in this category [,1] [,2] [1,] 160 544 [2,] 1381 7231 GO Term 2 FDR for the category= 0.006130206 GOID = GO:0005615 Term = extracellular space Synonym = intercellular space Definition = That part of a multicellular organism outside the cells proper, usually taken to be outside the plasma membranes, and occupied by fluid. Ontology = CC Two By Two matrix of gene memberships in this category [,1] [,2] [1,] 149 555 [2,] 1266 7346 Statistically Significant GO Categories

9 > fisher.test(matrix(c(3,12,33,9268),byrow=T,ncol=2), alternative = "greater") Fisher's Exact Test for Count Data data: matrix(c(3, 12, 33, 9268), byrow = T, ncol = 2) p-value = 2.336e-05 alternative hypothesis: true odds ratio is greater than 1 95 percent confidence interval: 16.1434 Inf sample estimates: odds ratio 69 > fisher.test(matrix(c(160,544,1381,7231),byrow=T,ncol=2), alternative = "greater") Fisher's Exact Test for Count Data data: matrix(c(160, 544, 1381, 7231), byrow = T, ncol = 2) p-value = 6.089e-06 alternative hypothesis: true odds ratio is greater than 1 95 percent confidence interval: 1.310533 Inf sample estimates: odds ratio 1.539936 Statistically Significant GO Categories

10 > fisher.test(matrix(c(4,1752,0,9268),byrow=T,ncol=2), alternative = "greater") Fisher's Exact Test for Count Data data: matrix(c(4, 1752, 0, 9268), byrow = T, ncol = 2) p-value = 0.000642 alternative hypothesis: true odds ratio is greater than 1 95 percent confidence interval: 4.74044 Inf sample estimates: odds ratio Inf > fisher.test(matrix(c(5,1751,0,9268),byrow=T,ncol=2), alternative = "greater") Fisher's Exact Test for Count Data data: matrix(c(5, 1751, 0, 9268), byrow = T, ncol = 2) p-value = 0.0001021 alternative hypothesis: true odds ratio is greater than 1 95 percent confidence interval: 6.443299 Inf sample estimates: odds ratio Inf > Statistically Significant GO Categories

11 Two By Two matrix of gene memberships in this category [,1] [,2] [1,] 3 12 [2,] 33 9268 p-value = 2.336e-05 Randomly Drawing 15 "significant" genes out of the total of 9316 genes > sample(1:9316,15, replace = FALSE, prob = NULL) [1] 3387 5063 7111 6228 108 5507 710 6341 6145 3930 8396 939 813 1578 7629 > sample(1:9316,15, replace = FALSE, prob = NULL) [1] 4042 1783 2416 860 3338 654 1980 3382 8375 7508 1378 5844 5404 7550 2881 > sample(1:9316,15, replace = FALSE, prob = NULL) [1] 8592 4307 7419 4042 6643 7142 485 1318 1578 1643 5763 1384 2997 7299 1127 Without a loss of generality, can assume that first 36 genes belong the FCluster1 Overall Strategy: Generate Many Random Samples For Random Sample, Count How Many Genes Belong to first 36 (x) Calculate the proportion of samples for which x  3 Re-sampling based testing

12 Two By Two matrix of gene memberships in this category [,1] [,2] [1,] 3 12 [2,] 33 9268 p-value = 2.336e-05 Randomly Drawing 15 "significant" genes out of the total of 9316 genes > for(i in 1:1000000){ + x<-sum(sample(1:9316,15, replace = FALSE, prob = NULL)<=36) + ngreater3 =3) + } > ngreater3/1000000 [1] 1.7e-05 The approximate, resampling-based, p-value is close to the p- value based on the Fisher's test and the hypergeometric distribution Approximating very small p-values is difficult using resampling methods-need huge number of replicates to be precise Re-sampling based testing

13 Two By Two matrix of gene memberships in this category [,1] [,2] [1,] 1 14 [2,] 35 9266 > fisher.test(matrix(c(1,14,35,9266),byrow=T,ncol=2), alternative = "greater") Fisher's Exact Test for Count Data data: matrix(c(1, 14, 35, 9266), byrow = T, ncol = 2) p-value = 0.05646 alternative hypothesis: true odds ratio is greater than 1 95 percent confidence interval: 0.8822488 Inf sample estimates: odds ratio 18.87886 > ngreater3<-0 > for(i in 1:1000000){ + x<-sum(sample(1:9316,15, replace = FALSE, prob = NULL)<=36) + ngreater3 =1) + } > ngreater3/1000000 [1] 0.056758 > Re-sampling based testing

14 #Loading Annotation Libraries library(annotate) library(mouseLLMappings) library(GO) #Forming lists of all GO terms GO<-as.list(GOTERM) AllGOTerms<-names(GO) #Getting LocusLink (i.e. Entrez Gene IDs) genes on the microarray ACC2LL <- as.list(mouseLLMappingsACCNUM2LL) AllGenesLL<-ACC2LL[as.character(LimmaDataNickel$genes[,"Name"])] AllGenesLL<-unique(unlist(AllGenesLL[!is.na(names(AllGenesLL))])) nGenesLL<-length(AllGenesLL) #Getting frequencies of genes on the microarray in each GO category GenesInGO<-sapply(AllGOTerms, function(x) {llsGO<-intersect(get(x,GOALLLOCUSID),AllGenesLL);nlls<-length(llsGO);nlls}) #Selecting categories with at least 5 genes AllGE5 =5] GO Significance

15 sig.genes<-(EBayesFDRD<FDR) sum(sig.genes,na.rm=T) sig.genes.LL<-ACC2LL[as.character(LimmaDataNickel$genes[sig.genes,"Name"])] sig.genes.LL<-unique(unlist(sig.genes.LL[!is.na(names(sig.genes.LL))])) nSigGenes<-length(sig.genes.LL) nNonSigGenes<-nGenesLL-nSigGenes nSigGenesInGO<-length(intersect(sig.genes.LL,get(AllGE5[1],GOALLLOCUSID))) nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO nNonSigGenesInGO<- length(intersect(setdiff(AllGenesLL,sig.genes.LL),get(AllGE5[[1]][1],GOALLLOCUSID))) nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO twoBytwo<- matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2) FisherPValues<-sapply(AllGE5,function(x) { currentLL<-get(x,GOALLLOCUSID) nSigGenesInGO<-length(intersect(sig.genes.LL,currentLL)) nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),currentLL)) nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO twoBytwo<- matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2) fisher.test(twoBytwo, alternative = "greater")$p.value }) FisherFDR<-p.adjust(FisherPValues,method="fdr")}) GO Significance

16 SigGOs<-(FisherFDR<0.05) MostSigGOs<-order(FisherFDR) nSigGOs<-2 cat("\n\nTop ",nSigGOs," GO Categories for genes with FDR<",FDR,"\n") for(TopGO in 1:nSigGOs){ GO<-AllGE5[MostSigGOs[TopGO]] currentLL<-get(GO,GOALLLOCUSID) nSigGenesInGO<-length(intersect(sig.genes.LL,currentLL)) nSigGenesNOTinGO<-nSigGenes-nSigGenesInGO nNonSigGenesInGO<-length(intersect(setdiff(AllGenesLL,sig.genes.LL),currentLL)) nNonSigGenesNOTinGO<-nNonSigGenes-nNonSigGenesInGO twoBytwo<- matrix(c(nSigGenesInGO,nSigGenesNOTinGO,nNonSigGenesInGO,nNonSigGenesNOTinGO),byrow=T,ncol=2) cat("GO Term ",TopGO,"\n") cat("FDR for the category= ",FisherFDR[MostSigGOs[TopGO]],"\n") print(get(GO,GOTERM)) cat("\nTwo By Two matrix of gene memberships in this category\n") print(twoBytwo) GO Significance

17 This is a bit complicated for our microarray data For microarrays with an "annotation package" everything is much easier For example, all Affymetrix arrays have appropriate annotation packages Can use GOstats package to calculate significances directly For a nice example that you can follow by simply copying and pasting commands into your R session: http://genomebiology.com/2004/5/10/R80 GO Significance


Download ppt "Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant)"

Similar presentations


Ads by Google