Published by Leslie Whitaker. Modified over 11 years ago.
1
Improving Interpretation in Gene Set Enrichment Analysis
Alex Lewin (Imperial College Centre for Biostatistics)
Ian Grieve (IC Microarray Centre)
Elena Kulinskaya (IC Statistical Advisory Service)
2
Introduction
Microarray experiment → list of differentially expressed (DE) genes
Genes belong to categories of the Gene Ontology (GO)
Are some GO categories (groups of genes) over-represented amongst the DE genes?
3
Contents
Grouping Gene Ontology categories can improve interpretation of gene set enrichment analysis
Fuzzy decision rules for multiple testing with discrete data
4
Gene Ontology (GO)
Database of biological terms
Arranged in a graph connecting related terms: links run from more general to more specific terms
For each node, ancestor and descendant terms can be defined
Directed Acyclic Graph
~16,000 terms from QuickGO website (EBI)
5
Gene Annotations
Genes/proteins are annotated to relevant GO terms
– A gene may be annotated to several GO terms
– A GO term may have 1000s of genes annotated to it (or none)
A gene annotated to term A is also annotated to all ancestors of A
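The propagation rule on this slide (a gene annotated to term A counts for every ancestor of A) can be sketched as follows. This is a minimal illustration, not the authors' code: the parent links in `PARENTS`, the gene names, and the function names `ancestors` and `propagate` are all hypothetical, and the toy graph borrows a few term names only for readability.

```python
from collections import defaultdict

# Hypothetical toy ontology: each term maps to its direct parents (more general terms).
# The edges are illustrative assumptions, not taken from the real GO graph.
PARENTS = {
    "response to wounding": ["response to stress", "response to external stimulus"],
    "response to stress": ["response to stimulus"],
    "response to external stimulus": ["response to stimulus"],
    "response to stimulus": ["biological process"],
    "biological process": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def propagate(direct_annotations, parents=PARENTS):
    """Map each term to the genes annotated to it directly or through a descendant."""
    genes_by_term = defaultdict(set)
    for gene, terms in direct_annotations.items():
        for term in terms:
            genes_by_term[term].add(gene)
            for anc in ancestors(term, parents):
                genes_by_term[anc].add(gene)
    return dict(genes_by_term)

# Hypothetical direct annotations.
direct = {"geneA": ["response to wounding"], "geneB": ["response to stress"]}
full = propagate(direct)
print(sorted(full["response to stimulus"]))
```

Because the graph is a DAG rather than a tree, a term can be reached along several paths; the `seen` set ensures each ancestor is visited once.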
6
Find GO terms over-represented amongst differentially expressed genes
For each GO term, compare: proportion of differentially expressed genes annotated to that term v. proportion of non-differentially expressed genes annotated to that term
Fisher's test → p-value for each GO term.
Multiple testing considerations → threshold below which p-values are declared significant.
Many websites do this type of analysis, e.g. the FatiGO website http://fatigo.bioinfo.cnio.es/
[2x2 table of GO annotation against DE status: observed cell count 22, margins 173 and 467, total 7847 genes]
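A one-sided Fisher test for a single GO term reduces to a hypergeometric tail probability. The sketch below uses the counts from the 2x2 table on this slide (22, 173, 467, 7847) and assumes that 22 is the number of genes that are both DE and annotated to the term, and that the upper tail P(X >= 22) is the quantity of interest; the slides do not state either point. The function names are hypothetical. The hypergeometric distribution is symmetric in the two margins, so it does not matter which of 173 and 467 is the DE count.

```python
from math import comb

def hypergeom_pmf(x, total, margin_a, margin_b):
    """P(X = x) for the overlap of two gene sets of sizes margin_a and margin_b
    when one of them is drawn at random from `total` genes."""
    lo = max(0, margin_a + margin_b - total)
    hi = min(margin_a, margin_b)
    if x < lo or x > hi:
        return 0.0
    # Exact integer arithmetic, converted to float only in the final division.
    return comb(margin_a, x) * comb(total - margin_a, margin_b - x) / comb(total, margin_b)

def upper_tail_p(x, total, margin_a, margin_b):
    """One-sided p-value P(X >= x) under the hypergeometric null (assumed direction)."""
    hi = min(margin_a, margin_b)
    return sum(hypergeom_pmf(k, total, margin_a, margin_b) for k in range(x, hi + 1))

p = upper_tail_p(22, total=7847, margin_a=173, margin_b=467)
print(f"P(X >= 22 | null) = {p:.3g}")
```

Under the null the expected overlap is 173 * 467 / 7847, roughly 10, so an observed count of 22 sits well into the upper tail.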
7
Difficulties in Testing GO terms
Interpretation: many terms close in the graph may be found significant, or none significant but with many low p-values close together in the graph
Statistical power: many terms have few genes annotated
Discrete statistics: p-values are not Uniform under the null
8
Grouping GO terms
Use the Poset Ontology Categorizer (POSOC), Joslyn et al. 2004
Software which groups terms based on
– pseudo-distance between terms
– coverage of genes
Example: for the data used here, reduces ~16,000 terms to 76 groups
9
Example: genes associated with the insulin-resistance gene Cd36
Knock-out and wildtype mice
Bayesian hierarchical model gives posterior probabilities ($p_g$) of being differentially expressed
Most differentially expressed: $p_g > 0.5$ (280 genes)
Least differentially expressed: $p_g < 0.2$ (11171 genes)
10
Example Results
Individual term tests
– Used the FatiGO website
– Multiple testing corrections (Benjamini and Hochberg FDR) done separately for each level
– Found no GO terms significant when FDR controlled at 5%
Group tests
– POSOC on all genes on the U74A chip gives 76 groups
– 3 groups found significant when controlling FDR at 5%
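The Benjamini and Hochberg correction referred to on this slide is a step-up procedure: sort the p-values, find the largest rank k with p_(k) <= k * alpha / m, and reject everything up to that rank. A minimal sketch follows; the function name and the example p-values are hypothetical and are not the Cd36 results.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the BH step-up procedure rejects."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest 1-based rank k such that the k-th smallest p-value is <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# Hypothetical p-values chosen to show the step-up behaviour.
example = [0.02, 0.03, 0.035, 0.9]
print(benjamini_hochberg(example, alpha=0.05))
```

In this example the two smallest p-values miss their own thresholds (0.0125 and 0.025) but are still rejected, because the third smallest meets its threshold of 0.0375.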
11
Comparison of Individual and Group Tests
Columns: Rank in FatiGO (smallest p-values) | Membership of POSOC group | Significant
Rank in FatiGO:
1: response to external stimulus
2: resp. to pest, pathogen or parasite
3: response to wounding
4: organismal movement
5: response to biotic stimulus
6: neurophysiological process
7: response to stress
8: inflammatory response
9: transmission of nerve impulse
10: neuromuscular physiological proc.
11: defense response
12: immune response
13: chemotaxis
14: nucleobase, nucleoside, nuc …
15: cell-cell signalling
Remaining cell entries, in extraction order (row alignment not recoverable): IA response to p.p.p. response to wounding IA - IA immune resp, resp. to ppp, resp to wound - IA immune resp, resp. to ppp, resp to wound chemotaxis, cell-migration - IA yes IA - IA yes - IA yes no (at 5%) no -
IA = immediate ancestor of significant POSOC group
12
Comparison of Individual and Group Tests
[Figure: fragment of the GO graph. Nodes: Biological process; Physiological process; Organismal movement; Response to stimulus; Response to external stimulus; Response to biotic stimulus; Response to stress; Response to wounding; Defense response; Response to other organism; Response to pest, pathogen or parasite; Immune response; Inflammatory response. Legend: ranks high individually (smallest p-values); significant in group tests (and ranks high individually).]
13
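Discrete test statistics
Null hypothesis determined by margins of the 2x2 table
Often a very small no. of possible values for the cells → small no. of possible p-values
[2x2 table of GO annotation against DE status: cell count X, margins 173 and 467, total 7847 genes]
Null hypothesis: $X \sim \mathrm{HyperGeom}(173,\ 7847 - 173,\ 467)$, $X = 0, \ldots, 173$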
14
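Discrete test statistics
[2x2 table as on the previous slide: cell count X, margins 173 and 467, total 7847 genes]
p-value: $p(x) = P(X \le x \mid \text{null})$
$P(p \le \alpha \mid \text{null}) < \alpha$ for most $\alpha$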
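The conservativeness of a discrete p-value can be checked by enumerating every attainable value. The sketch below assumes the lower-tail convention p(x) = P(X <= x | null), which is the reading that matches the randomised p-value formula on the next slide, and uses a small hypothetical 2x2 table (total 20, margins 5 and 6) so that the handful of attainable p-values is easy to inspect. All function names are illustrative.

```python
from math import comb

def null_pmf_and_cdf(total, margin_a, margin_b):
    """Return (pmf, cdf) lists over the support of the hypergeometric null.
    cdf[k] is the attainable lower-tail p-value for the k-th support point."""
    lo = max(0, margin_a + margin_b - total)
    hi = min(margin_a, margin_b)
    denom = comb(total, margin_b)
    pmf = [comb(margin_a, x) * comb(total - margin_a, margin_b - x) / denom
           for x in range(lo, hi + 1)]
    cdf, running = [], 0.0
    for mass in pmf:
        running += mass
        cdf.append(min(running, 1.0))  # guard against rounding slightly above 1
    return pmf, cdf

def size_at(alpha, pmf, cdf):
    """Actual P(p(X) <= alpha | null): mass of the outcomes whose p-value is <= alpha."""
    return sum(mass for mass, p in zip(pmf, cdf) if p <= alpha)

# Hypothetical small table: only six attainable p-values.
pmf, cdf = null_pmf_and_cdf(total=20, margin_a=5, margin_b=6)
for alpha in (0.05, 0.2, 0.6):
    print(alpha, round(size_at(alpha, pmf, cdf), 4))
```

For this toy table no outcome has a p-value below 0.05, so a test at the 5% level never rejects; the actual size equals alpha only when alpha coincides with an attainable p-value.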
15
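Randomised Test
Observe $X = x_0$
$p_{obs}$ = observed p-value = $P(X \le x_0 \mid \text{null})$
$p_{prev}$ = next smallest possible p-value = $P(X \le x_0 - 1 \mid \text{null})$
Randomised p-value: $P(x_0) = P(X < x_0 \mid \text{null}) + u \, P(X = x_0 \mid \text{null}) = p_{prev} + u \, (p_{obs} - p_{prev})$, where $u \sim \mathrm{Unif}(0, 1)$
Conditionally, $P \mid x_0 \sim \mathrm{Unif}(p_{prev}, p_{obs})$; unconditionally, $P \sim \mathrm{Unif}(0, 1)$
[Figure: the interval from 0 to 1 with $p_{prev}$ and $p_{obs}$ marked]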
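The claim that the randomised p-value is unconditionally Unif(0,1) can be checked by simulation. The sketch below draws X from a small hypothetical hypergeometric null (total 20, margins 5 and 6), draws the randomised p-value uniformly between p_prev and p_obs, and estimates P(P <= 0.05). The table sizes, seed, number of draws, and function names are all assumptions made for illustration.

```python
import random
from math import comb

def lower_tail_table(total, margin_a, margin_b):
    """Return ({x: (p_prev, p_obs)}, pmf) with p_obs = P(X <= x), p_prev = P(X <= x - 1)."""
    lo = max(0, margin_a + margin_b - total)
    hi = min(margin_a, margin_b)
    denom = comb(total, margin_b)
    table, pmf, running = {}, [], 0.0
    for x in range(lo, hi + 1):
        mass = comb(margin_a, x) * comb(total - margin_a, margin_b - x) / denom
        table[x] = (running, running + mass)
        pmf.append(mass)
        running += mass
    return table, pmf

def randomised_p(p_prev, p_obs, u):
    """Randomised p-value p_prev + u * (p_obs - p_prev) for a given u in [0, 1)."""
    return p_prev + u * (p_obs - p_prev)

rng = random.Random(0)  # fixed seed so the run is repeatable
table, pmf = lower_tail_table(total=20, margin_a=5, margin_b=6)
xs = list(table)
draws = []
for _ in range(20000):
    x0 = rng.choices(xs, weights=pmf)[0]  # simulate X under the null
    draws.append(randomised_p(*table[x0], rng.random()))
frac_below = sum(p <= 0.05 for p in draws) / len(draws)
print(f"estimated P(P <= 0.05 | null) = {frac_below:.3f}")
```

The non-randomised test for the same toy table never rejects at the 5% level, whereas the randomised version should reject in roughly 5% of null draws.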
16
Fuzzy Decision Rule
Idea is to use all possible realisations of the randomised test.
Summarise the evidence by the critical function of the randomised test:
$$\tau_\alpha(p_{prev}, p_{obs}) = \begin{cases} 1 & p_{obs} < \alpha \\ (\alpha - p_{prev}) / (p_{obs} - p_{prev}) & p_{prev} < \alpha < p_{obs} \\ 0 & p_{prev} > \alpha \end{cases}$$
[Figure: the interval from 0 to 1 with $p_{prev}$ and $p_{obs}$ marked]
Use $\tau_\alpha$ as a fuzzy measure of evidence against the null hypothesis.
(Fuzzy decision rule considered by Cox & Hinkley, 1974 and developed by Geyer and Meeden, 2005)
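The critical function is the probability that a draw from Unif(p_prev, p_obs) falls below alpha, which gives the three cases on the slide. A minimal sketch follows; the function name `tau` and the example values are hypothetical. The slide uses strict inequalities throughout, so the treatment of the boundary cases alpha = p_obs and alpha = p_prev here is an assumption.

```python
def tau(alpha, p_prev, p_obs):
    """Critical function of the randomised test for a single hypothesis.

    Assumes p_prev < p_obs, which holds whenever the observed outcome has
    positive null probability.
    """
    if p_obs < alpha:
        return 1.0  # every realisation of the randomised p-value is below alpha
    if p_prev > alpha:
        return 0.0  # no realisation is below alpha
    # alpha lies inside [p_prev, p_obs]: fraction of the interval below alpha.
    # Boundary values alpha == p_obs (gives 1) and alpha == p_prev (gives 0)
    # are handled here by assumption; the slide leaves them unspecified.
    return (alpha - p_prev) / (p_obs - p_prev)

# Hypothetical (p_prev, p_obs) pairs.
for p_prev, p_obs in [(0.001, 0.01), (0.03, 0.07), (0.06, 0.2)]:
    print(p_prev, p_obs, tau(0.05, p_prev, p_obs))
```

A value of 0.5 for the middle pair says that half of the possible randomised p-values would lead to rejection at alpha = 0.05.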
17
Fuzzy Decision Rules for Multiple Testing
We have developed fuzzy decision rules for multiple tests ($i = 1, \ldots, m$)
Use the Benjamini and Hochberg false discovery rate (BH FDR)
$\tau^{BH}_\alpha(p_i^{prev}, p_i^{obs}) = P(\text{randomised p-value } i \text{ is rejected} \mid \text{null})$ using the BH FDR procedure
For a small no. of tests we can calculate these exactly.
18
Fuzzy Decision Rules for Multiple Testing
$\tau^{BH}_\alpha(p_i^{prev}, p_i^{obs}) = P(\text{randomised p-value } i \text{ is rejected} \mid \text{null})$
For a large no. of tests use simulations:
for j = 1, …, n {
    generate randomised p-values (i = 1, …, m): $P_{ij} \sim \mathrm{Unif}(p_i^{prev}, p_i^{obs})$
    perform BH FDR procedure
    $I_{ij}$ = 1 if $P_{ij}$ rejected, 0 else
}
$\hat{\tau}^{BH}_\alpha(p_i^{prev}, p_i^{obs}) = \frac{1}{n} \sum_j I_{ij}$
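The simulation loop on this slide can be sketched as below: for each of n replicates, draw one randomised p-value per test from Unif(p_prev, p_obs), run the BH procedure, and average the rejection indicators. This is an illustration under stated assumptions, not the authors' implementation: the function names, the default number of replicates, the seed, and the example intervals are hypothetical, and the intervals are not the Cd36 results.

```python
import random

def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up procedure; True where a test is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

def fuzzy_bh(intervals, alpha, n_sim=2000, seed=0):
    """Monte Carlo estimate of tau^BH for each test.

    intervals: list of (p_prev, p_obs) pairs, one per test.
    """
    rng = random.Random(seed)
    counts = [0] * len(intervals)
    for _ in range(n_sim):
        # One set of randomised p-values, P_ij ~ Unif(p_prev_i, p_obs_i).
        draws = [rng.uniform(lo, hi) for lo, hi in intervals]
        for i, rej in enumerate(bh_reject(draws, alpha)):
            counts[i] += rej  # I_ij = 1 if rejected, 0 otherwise
    return [c / n_sim for c in counts]

# Hypothetical (p_prev, p_obs) pairs for four tests.
intervals = [(0.0001, 0.0003), (0.001, 0.04), (0.02, 0.09), (0.3, 0.6)]
tau_hat = fuzzy_bh(intervals, alpha=0.05)
print([round(t, 3) for t in tau_hat])
```

Tests whose whole interval lies far below or far above the BH thresholds get an estimate of exactly 1 or 0; only tests whose interval straddles a threshold get a fractional value, and that value depends on the other tests through the BH ranking.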
19
Results for Cd36 Example
[1] "alpha = 0.05"
  pprev  pval   i.bonf i.bh tau   POSOC group
1 1e-04  3e-04  1      1    1     response to pest, pathogen or parasite
2 1e-04  4e-04  1      1    1     response to wounding
3 2e-04  6e-04  1      1    1     immune response
4 7e-04  0.0079 0      0    0.297 digestion
5 0.003  0.0122 0      0    0.021 chemotaxis
6 0.0039 0.0209 0      0    0.002 organic acid biosynthesis
7 0.0092 0.0306 0      0    0     synaptic transmission
8 5e-04  0.0436 0      0    0.059 response to fungi
[1] "alpha = 0.15"
  pprev  pval   i.bonf i.bh tau   POSOC group
1 1e-04  3e-04  1      1    1     response to pest, pathogen or parasite
2 1e-04  4e-04  1      1    1     response to wounding
3 2e-04  6e-04  1      1    1     immune response
4 7e-04  0.0079 0      1    1     digestion
5 0.003  0.0122 0      0    0.943 chemotaxis
6 0.0039 0.0209 0      0    0.661 organic acid biosynthesis
7 0.0092 0.0306 0      0    0.375 synaptic transmission
8 5e-04  0.0436 0      0    0.391 response to fungi
20
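Results for Cd36 Example
Order of fuzzy decisions is not the same as order of observed p-values
For example, at alpha = 0.05 response to fungi (pprev 5e-04, pval 0.0436, tau 0.059) ranks above chemotaxis (pprev 0.003, pval 0.0122, tau 0.021)
Depends on the amount of discreteness of the null
[Figure: $p_{prev}$ and $p_{obs}$ marked on an interval]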
21
Conclusions
Grouping Gene Ontology categories can help find significant regions of the GO graph
Fuzzy decision rules for multiple testing with discrete data can provide more candidates for rejection
22
Acknowledgements
Cliff Joslyn (Los Alamos National Laboratory)
Tim Aitman (IC Microarray Centre)
Sylvia Richardson (IC Centre for Biostatistics)
BBSRC Exploiting Genomics grant (AL)
Wellcome Trust grant (IG)
References
Joslyn CA, Mniszewski SM, Fulmer A and Heaton G (2004), The Gene Ontology Categorizer, Bioinformatics 20, 169-177.
Geyer and Meeden (2005), Fuzzy Confidence Intervals and P-values, Statistical Science, to appear.