Alex Lewin (Imperial College Centre for Biostatistics) Ian Grieve (IC Microarray Centre) Elena Kulinskaya (IC Statistical Advisory Service) Improving Interpretation.

Alex Lewin (Imperial College Centre for Biostatistics) Ian Grieve (IC Microarray Centre) Elena Kulinskaya (IC Statistical Advisory Service) Improving Interpretation in Gene Set Enrichment Analysis

Introduction
- A microarray experiment yields a list of differentially expressed (DE) genes.
- Genes belong to categories of the Gene Ontology (GO).
- Are some GO categories (groups of genes) over-represented amongst the DE genes?

Contents
- Grouping Gene Ontology categories can improve interpretation of gene set enrichment analysis.
- Fuzzy decision rules for multiple testing with discrete data.

Gene Ontology (GO)
- Database of biological terms.
- Arranged in a graph connecting related terms: links run from more general to more specific terms.
- For each node, ancestor and descendant terms can be defined.
- Directed acyclic graph, ~16,000 terms (from the QuickGO website, EBI).

Gene Annotations
- Genes/proteins are annotated to relevant GO terms.
  - A gene may be annotated to several GO terms.
  - A GO term may have 1000s of genes annotated to it (or none).
- A gene annotated to term A is also annotated to all ancestors of A.
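The propagation rule (a gene annotated to term A counts as annotated to every ancestor of A) can be sketched in a few lines. This is a minimal illustration, not the annotation pipeline used in the talk: the function names `ancestors` and `propagate`, the toy edges and the gene names are all assumptions made for the example.

```python
from collections import defaultdict

def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given as a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def propagate(direct, parents):
    """Map each term to the genes annotated to it directly or through a descendant."""
    full = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            full[term].add(gene)
            for anc in ancestors(term, parents):
                full[anc].add(gene)
    return dict(full)

# Hypothetical toy fragment: term names appear in the slides, edges are simplified.
parents = {
    "response to wounding": ["response to stress"],
    "response to stress": ["response to stimulus"],
    "immune response": ["defense response"],
    "defense response": ["response to stimulus"],
    "response to stimulus": ["biological process"],
}
direct = {"geneA": ["response to wounding"], "geneB": ["immune response"]}  # made-up genes
full = propagate(direct, parents)
print(sorted(full["response to stimulus"]))
```

Because annotations accumulate towards the root, general terms collect many genes while specific terms may have very few.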

Find GO terms over-represented amongst differentially expressed genes
- For each GO term, compare the proportion of differentially expressed genes annotated to that term with the proportion of non-differentially expressed genes annotated to that term.
- Fisher's test gives a p-value for each GO term.
- Multiple testing considerations determine the threshold below which p-values are declared significant.
- Many websites do this type of analysis, e.g. the FatiGO website.
[2x2 table: GO / not GO against DE / not DE]
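The per-term comparison can be written as a one-sided Fisher test on the 2x2 table. The sketch below assumes that X counts the DE genes annotated to the term and that over-representation corresponds to the upper tail P(X >= k); the slides do not say which cell they use, and all counts here are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): n genes drawn without replacement from N, of which K are annotated."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def fisher_over_representation(k, N, K, n):
    """One-sided p-value P(X >= k): at least k annotated genes among the n DE genes."""
    upper = min(K, n)
    return sum(hypergeom_pmf(x, N, K, n) for x in range(k, upper + 1))

# Hypothetical counts: 20 genes in total, 5 annotated to the term,
# 6 DE genes, 4 of which are annotated to the term.
p = fisher_over_representation(4, N=20, K=5, n=6)
print(p)
```

Only the standard library is used, so the tail sum is exact up to floating-point rounding; for chip-sized gene lists a statistics library would be the more practical choice.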

Difficulties in Testing GO terms
- Interpretation: many terms close together in the graph may be found significant, or none may be significant while many low p-values lie close together in the graph.
- Statistical power: many terms have few genes annotated.
- Discrete statistics: p-values are not Uniform under the null.

Grouping GO terms
- Use the Poset Ontology Categorizer (POSOC), Joslyn et al. (2004).
- Software which groups terms based on
  - pseudo-distance between terms
  - coverage of genes
- Example: for the data used here, it reduces ~16,000 terms to 76 groups.

Example: genes associated with the insulin-resistance gene Cd36
- Knock-out and wildtype mice.
- A Bayesian hierarchical model gives posterior probabilities ($p_g$) of being differentially expressed.
- Most differentially expressed: $p_g > 0.5$ (280 genes).
- Least differentially expressed: $p_g < 0.2$ (11171 genes).

Example Results
Individual term tests
- Used the FatiGO website.
- Multiple testing corrections (Benjamini and Hochberg FDR) done separately for each level.
- Found no GO terms significant when FDR controlled at 5%.
Group tests
- POSOC on all genes on the U74A chip gives 76 groups.
- 3 groups found significant when controlling FDR at 5%.
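For reference, the Benjamini and Hochberg step-up procedure used for the FDR control can be sketched as follows. The function name `benjamini_hochberg` and the p-values are illustrative assumptions, not output from the Cd36 analysis.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the BH step-up procedure rejects."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

# Hypothetical p-values, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
decisions = benjamini_hochberg(pvals, alpha=0.05)
print(decisions)
```

Applying the correction separately for each level of the ontology, as done in the individual term tests, amounts to calling the function once per level.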

Comparison of Individual and Group Tests
Columns: rank in FatiGO (smallest p-values); membership of POSOC group; significant.
Ranked terms:
1: response to external stimulus
2: resp. to pest, pathogen or parasite
3: response to wounding
4: organismal movement
5: response to biotic stimulus
6: neurophysiological process
7: response to stress
8: inflammatory response
9: transmission of nerve impulse
10: neuromuscular physiological proc.
11: defense response
12: immune response
13: chemotaxis
14: nucleobase, nucleoside, nuc …
15: cell-cell signalling
Entries of the remaining columns, in extraction order: IA response to p.p.p. response to wounding IA - IA immune resp, resp. to ppp, resp to wound - IA immune resp, resp. to ppp, resp to wound chemotaxis, cell-migration - IA yes IA - IA yes - IA yes no (at 5%) no -
IA = immediate ancestor of significant POSOC group

Comparison of Individual and Group Tests
[Figure: part of the GO graph. Nodes: Biological process; Physiological process; Organismal movement; Inflammatory response; Response to stimulus; Response to external stimulus; Response to biotic stimulus; Response to stress; Response to wounding; Defense response; Response to pest, pathogen or parasite; Immune response; Response to other organism. Legend: ranks high individually (smallest p-values); significant in group tests (and ranks high individually).]

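Discrete test statistics
- The null hypothesis is determined by the margins of the 2x2 table.
- Often there is a very small number of possible values for the cells, and hence a small number of possible p-values.
- Null hypothesis: X ~ HyperGeom(173, , 467), X = 0,…,173
[2x2 table with cell X: GO / not GO against DE / not DE]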

Discrete test statistics
- p-value: $p(x) = P(X \le x \mid \text{null})$
- $P(p \le \alpha \mid \text{null}) < \alpha$ for most $\alpha$
[2x2 table with cell X: GO / not GO against DE / not DE]
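The effect of discreteness can be seen by listing every attainable p-value for a small table. This sketch uses the lower-tail form P(X <= x | null), which matches the randomised test defined on the next slide; the margins are hypothetical and far smaller than in the Cd36 data.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) for a hypergeometric count: n draws from N items, K of them marked."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def attainable_pvalues(N, K, n):
    """Return [(x, P(X = x), p(x) = P(X <= x))] for every possible x under the null."""
    lowest, highest = max(0, n - (N - K)), min(K, n)
    rows, cdf = [], 0.0
    for x in range(lowest, highest + 1):
        pmf = hypergeom_pmf(x, N, K, n)
        cdf += pmf
        rows.append((x, pmf, cdf))
    return rows

def actual_size(rows, alpha):
    """P(p(X) <= alpha | null): total mass of outcomes whose p-value is at most alpha."""
    return sum(pmf for _, pmf, p in rows if p <= alpha)

rows = attainable_pvalues(N=20, K=5, n=6)   # hypothetical margins
size = actual_size(rows, alpha=0.15)
print(len(rows), size)
```

With these margins only six p-values are possible, and the test at nominal level 0.15 has actual size of about 0.13, which is the conservativeness that the randomised test removes.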

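Randomised Test
- Observe $X = x_0$.
- $p_{\text{obs}}$ = observed p-value = $P(X \le x_0 \mid \text{null})$
- $p_{\text{prev}}$ = next smallest possible p-value = $P(X \le x_0 - 1 \mid \text{null})$
- Randomised p-value: $P(x_0) = P(X < x_0 \mid \text{null}) + u\,P(X = x_0 \mid \text{null}) = p_{\text{prev}} + u\,(p_{\text{obs}} - p_{\text{prev}})$, where $u \sim \text{Unif}(0,1)$
- Conditionally, $P \mid x_0 \sim \text{Unif}(p_{\text{prev}}, p_{\text{obs}})$; unconditionally, $P \sim \text{Unif}(0,1)$.
[Figure: the interval from 0 to 1 with $p_{\text{prev}}$ and $p_{\text{obs}}$ marked]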
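A randomised p-value as defined on this slide can be drawn directly from the null distribution. The sketch assumes a hypergeometric null with small hypothetical margins; the function name `randomised_pvalue` is illustrative.

```python
import random
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) for a hypergeometric count: n draws from N items, K of them marked."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def randomised_pvalue(x0, N, K, n, rng):
    """Return (P, p_prev, p_obs) with P = p_prev + u * (p_obs - p_prev), u ~ Unif(0, 1)."""
    lowest = max(0, n - (N - K))                                        # smallest value X can take
    p_prev = sum(hypergeom_pmf(x, N, K, n) for x in range(lowest, x0))  # P(X < x0 | null)
    p_obs = p_prev + hypergeom_pmf(x0, N, K, n)                         # P(X <= x0 | null)
    u = rng.random()
    return p_prev + u * (p_obs - p_prev), p_prev, p_obs

rng = random.Random(1)   # seeded only so that the sketch is repeatable
P, p_prev, p_obs = randomised_pvalue(1, N=20, K=5, n=6, rng=rng)   # hypothetical margins
print(p_prev, P, p_obs)
```

Each call gives a different P inside the interval from p_prev to p_obs, which is why a single randomised decision is unattractive in practice and the fuzzy rule summarises all realisations instead.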

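Fuzzy Decision Rule
- The idea is to use all possible realisations of the randomised test.
- Summarise the evidence by the critical function of the randomised test:
\[
\tau_\alpha(p_{\text{prev}}, p_{\text{obs}}) =
\begin{cases}
1 & p_{\text{obs}} < \alpha \\
(\alpha - p_{\text{prev}})/(p_{\text{obs}} - p_{\text{prev}}) & p_{\text{prev}} < \alpha < p_{\text{obs}} \\
0 & p_{\text{prev}} > \alpha
\end{cases}
\]
- Use $\tau_\alpha$ as a fuzzy measure of evidence against the null hypothesis.
- (Fuzzy decision rule considered by Cox & Hinkley, 1974, and developed by Geyer and Meeden, 2005.)
[Figure: the interval from 0 to 1 with $p_{\text{prev}}$ and $p_{\text{obs}}$ marked]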
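The critical function is the probability that a randomised p-value, uniform between p_prev and p_obs, falls below alpha. A direct transcription, with the boundary cases alpha = p_obs and alpha = p_prev assigned to 1 and 0 respectively (an assumption, since the slide only gives strict inequalities):

```python
def tau(alpha, p_prev, p_obs):
    """Critical function of the randomised test for a single hypothesis."""
    if p_obs <= alpha:
        return 1.0    # every realisation of the randomised p-value is rejected
    if p_prev >= alpha:
        return 0.0    # no realisation is rejected
    # alpha lies strictly inside (p_prev, p_obs): fraction of the interval below alpha
    return (alpha - p_prev) / (p_obs - p_prev)

# Hypothetical values for illustration.
print(tau(0.05, 0.02, 0.08))
print(tau(0.05, 0.001, 0.003))
print(tau(0.05, 0.07, 0.2))
```

A value strictly between 0 and 1 says that the data would lead to rejection for some realisations of the randomisation but not others.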

Fuzzy Decision Rules for Multiple Testing
- We have developed fuzzy decision rules for multiple tests ($i = 1,\dots,m$).
- Use the Benjamini and Hochberg false discovery rate (BH FDR).
- $\tau^{BH}_\alpha(p^i_{\text{prev}}, p^i_{\text{obs}}) = P(\text{randomised p-value } i \text{ is rejected} \mid \text{null})$ using the BH FDR procedure.
- For a small number of tests we can calculate these exactly.

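Fuzzy Decision Rules for Multiple Testing
$\tau^{BH}_\alpha(p^i_{\text{prev}}, p^i_{\text{obs}}) = P(\text{randomised p-value } i \text{ is rejected} \mid \text{null})$
For a large number of tests use simulations:
for $j = 1,\dots,n$ {
  generate randomised p-values ($i = 1,\dots,m$): $P_{ij} \sim \text{Unif}(p^i_{\text{prev}}, p^i_{\text{obs}})$
  perform the BH FDR procedure
  $I_{ij} = 1$ if $P_{ij}$ is rejected, $0$ otherwise
}
$\hat{\tau}^{BH}_\alpha(p^i_{\text{prev}}, p^i_{\text{obs}}) = \frac{1}{n} \sum_j I_{ij}$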
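The simulation loop can be sketched as follows. The function names `bh_reject` and `fuzzy_bh`, the seed, the number of simulations and the three (p_prev, p_obs) intervals are assumptions chosen for illustration; they are not the authors' code or data.

```python
import random

def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up procedure: True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

def fuzzy_bh(intervals, alpha=0.05, n_sim=2000, seed=0):
    """Estimate tau^BH for each test as the fraction of simulations in which it is rejected."""
    rng = random.Random(seed)
    counts = [0] * len(intervals)
    for _ in range(n_sim):
        # One set of randomised p-values: P_ij ~ Unif(p_prev_i, p_obs_i).
        draws = [rng.uniform(p_prev, p_obs) for p_prev, p_obs in intervals]
        for i, rej in enumerate(bh_reject(draws, alpha)):
            counts[i] += rej
    return [c / n_sim for c in counts]

# Hypothetical (p_prev, p_obs) pairs for three tests.
intervals = [(0.0001, 0.0003), (0.01, 0.06), (0.3, 0.9)]
taus = fuzzy_bh(intervals)
print(taus)
```

In this toy case the first test is rejected in every simulation and the third in none, while the second gets an intermediate value because its interval straddles the BH threshold for rank 2.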

Results for Cd36 Example
[1] "alpha = 0.05"
  pprev pval i.bonf i.bh tau POSOC group
1 1e-04 3e response to pest, pathogen or parasite
2 1e-04 4e response to wounding
3 2e-04 6e immune response
4 7e digestion
5 chemotaxis
6 organic acid biosynthesis
7 synaptic transmission
8 5e response to fungi
[1] "alpha = 0.15"
  pprev pval i.bonf i.bh tau POSOC group
1 1e-04 3e response to pest, pathogen or parasite
2 1e-04 4e response to wounding
3 2e-04 6e immune response
4 7e digestion
5 chemotaxis
6 organic acid biosynthesis
7 synaptic transmission
8 5e response to fungi

Results for Cd36 Example
- The order of the fuzzy decisions is not the same as the order of the observed p-values.
- It depends on the amount of discreteness of the null.
[Figure: $p_{\text{obs}}$ and $p_{\text{prev}}$]

Conclusions
- Grouping Gene Ontology categories can help find significant regions of the GO graph.
- Fuzzy decision rules for multiple testing with discrete data can provide more candidates for rejection.

Acknowledgements
- Cliff Joslyn (Los Alamos National Laboratory)
- Tim Aitman (IC Microarray Centre)
- Sylvia Richardson (IC Centre for Biostatistics)
- BBSRC Exploiting Genomics grant (AL)
- Wellcome Trust grant (IG)
References
- Joslyn CA, Mniszewski SM, Fulmer A and Heaton G (2004), The Gene Ontology Categorizer, Bioinformatics 20,
- Geyer and Meeden (2005), Fuzzy Confidence Intervals and P-values, Statistical Science, to appear.