Gene Ontology (and Pathway) Analysis
Bioinformatics for Biomedical Practitioners
Alex Sánchez
Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona
Statistics and Bioinformatics Unit, Vall d’Hebron Institut de Recerca

Outline
- Presentation
- Background (1): biology & biotechnology
- Background (2): R & Bioconductor
- The problem: interpreting gene lists
- Annotations and annotation databases
- The Gene Ontology resource
- Gene list analysis using the GO and relatives
- Existing tools for pathway analysis

(Bio)Statistics and Bioinformatics Research groups

The Statistics and Bioinformatics Unit

Disclaimer/Acknowledgement
Most presentations nowadays are based on other presentations, which are in turn based on other presentations. In this case I have relied heavily on the free materials from the course “Pathway and Network Analysis of Omics Data” organized by the Canadian Bioinformatics Workshops. I wish to thank them for sharing their materials, and I attribute the work to them as specified in the Creative Commons Share Alike 2.5 Canada license.

Background

Biology

Biotechnology

Focusing on gene expression

The R project for Statistical Computing

What is R
- The S language was developed in 1976 at Bell Laboratories by John Chambers to
  – facilitate interactive exploration and visualization of data of varying complexity
  – allow users to perform all types of statistical analyses
- The S language was (and is) commercial.
- R ("GNU S") was born as a free alternative to S.

Why R?
- Free (as in free beer)
- High-quality methods implemented
- Platform independent: Linux, Mac, other
- Constantly evolving: a new version every 6 months
- Statistical tool
  – Modern
  – Most existing methods (new method → R)
  – Great graphics
- Programming language
  – Powerful & flexible
  – Open source
  – Great for repetitive tasks

Why not R
- Console-based interface
  – But GUI projects are available: R-commander, DeduceR
- Community-based quality control
  – No company behind it (no money back)
  – But thousands of users for most packages
- Constantly evolving
  – One new version every 6 months

R interfaces - RStudio

Bioconductor
- An open source and open development software project for the analysis and comprehension of genomic data.
- Started in …
- The core team is based primarily at the Fred Hutchinson Cancer Research Center.
- Primarily based on the R programming language.
- There are two releases of Bioconductor every year.
- Started with 15 packages; now there are more than 1000.

Using Bioconductor
- Essentially, Bioconductor is a set of R packages.
- A Bioconductor package
  – implements a different, new functionality, e.g. to manipulate or run tests on omics data, or to use annotations
  – can also be an annotation database
  – or even an experimental dataset
- BioC also provides training materials.

Managing Gene Lists

The (in)famous “where to now?” question
You have obtained a list of features. What’s next?
  – Select some genes for validation?
  – Run follow-up experiments on some genes?
  – Publish a huge table with all the results?
  – Try to learn about all the genes in the list?

From gene lists to Pathway Analysis
- Gene lists contain useful information
  – This can be extracted from databases
  – Generically described as Gene Annotation
- Besides, we may obtain information from the analysis of gene sets
  – Genes don’t act individually but in groups → a more realistic approach
  – There are fewer gene sets than individual genes → relatively simpler to manage
  – Generically described as Pathway Analysis

Case study 1
Lists AvsB, AvsL and BvsL contain the IDs of genes selected as differentially expressed between three types of breast cancer tumours.
  – Farmer P, Bonnefoi H, Becette V, Tubiana-Hulin M et al. Identification of molecular apocrine breast tumours by microarray analysis. Oncogene 2005 Jul 7;24(29).

Case study 2
One hundred sixty-four genes found to be upregulated in CD4+/CD62L- T cells relative to CD4+/CD62L+ T cells.
  – Hengel RL et al. Cutting edge: L-selectin (CD62L) expression distinguishes small resting memory CD4+ T cells that preferentially respond to recall antigen. J Immunol 2003 Jan 1;170(1):28-32.

Gene Lists and Annotations

Gene and Protein Identifiers
- Identifiers (IDs) are ideally unique, stable names or numbers that help track database records
  – E.g. Social Insurance Number, Entrez Gene ID
- But information on features is stored in many databases
  – Genes have many IDs
- Records exist for: gene, DNA, RNA, protein
  – It is important to recognize the correct record type
  – E.g. Entrez Gene records don’t store sequence. They link to DNA regions, RNA transcripts and proteins, e.g. in RefSeq, which stores sequence.

Common Identifiers
- Species-specific
  – HUGO HGNC: BRCA2
  – MGI: MGI:…
  – RGD: 2219
  – ZFIN: ZDB-GENE…
  – FlyBase: CG9097
  – WormBase: WBGene… or ZK…
  – SGD: S… or YDL029W
- Annotations
  – InterPro: IPR…
  – OMIM: …
  – Pfam: PF09104
  – Gene Ontology: GO:…
  – SNPs: rs…
- Experimental platform
  – Affymetrix: …_3p_s_at
  – Agilent: A_23_P99452
  – CodeLink: GE60169
  – Illumina: GI_…S
- Gene
  – Ensembl: ENSG…
  – Entrez Gene: 675
  – Unigene: Hs…
- RNA transcript
  – GenBank: BC…
  – RefSeq: NM_…
  – Ensembl: ENST…
- Protein
  – Ensembl: ENSP…
  – RefSeq: NP_…
  – UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
  – IPI: IPI…
  – EMBL: AF…
  – PDB: 1MIU
Red = Recommended

Identifier Mapping
- There are many IDs!
  – Software tools recognize only a handful
  – You may need to map from your gene list IDs to standard IDs
- Four main uses
  – Searching for a favorite gene name
  – Linking to related resources
  – Identifier translation, e.g. proteins to genes, Affy ID to Entrez Gene
  – Merging data from different sources: finding equivalent records

ID Challenges
- Avoid errors: map IDs correctly
  – Beware of 1-to-many mappings
- Gene name ambiguity: a gene name is not a good ID
  – e.g. FLJ92943, LFS1, TRP53, p53
  – Better to use the standard gene symbol: TP53
- Excel error-introduction
  – OCT4 is changed to October-4 (avoid this by pasting as text)
- Problems reaching 100% coverage
  – E.g. due to version issues
  – Use multiple sources to increase coverage
Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics Jun 23;5:80
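A minimal sketch of how 1-to-many mappings might be detected before an ID translation is trusted. The probe and gene names are invented for illustration, and the function name `find_ambiguous` is an assumption, not part of any mapping tool mentioned in these slides.

```python
from collections import defaultdict


def find_ambiguous(pairs):
    """Return source IDs that map to more than one distinct target ID."""
    targets = defaultdict(set)
    for source_id, target_id in pairs:
        targets[source_id].add(target_id)
    return {s: sorted(t) for s, t in targets.items() if len(t) > 1}


# Hypothetical (source ID, target ID) pairs, for illustration only.
pairs = [
    ("probe_A", "GENE1"),
    ("probe_B", "GENE2"),
    ("probe_B", "GENE3"),  # probe_B maps to two genes: ambiguous
    ("probe_C", "GENE1"),
]

ambiguous = find_ambiguous(pairs)
print(ambiguous)
```

Ambiguous entries can then be resolved by hand or dropped, depending on how conservative the analysis needs to be.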

Use ID converters to prepare list

ID Mapping Services
- g:Convert
- Ensembl BioMart
- Input: gene/protein/transcript IDs (mixed)
- Output: type of output ID

Beware of ambiguous ID mappings

Recommendations
- For proteins and genes (this doesn’t consider splice forms)
  – Map everything to Entrez Gene IDs or official gene symbols using an appropriate tool, such as R/Bioc, or a spreadsheet if there is no other option
- If 100% coverage is desired, manually curate missing mappings using multiple resources
- Be careful of Excel auto-conversions, especially when pasting large gene lists!
  – Remember to format cells as ‘text’ before pasting
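As a rough safeguard against the Excel auto-conversions mentioned above, a gene list can be scanned for entries that look like dates. This is only a heuristic sketch: the regular expression and the example list are assumptions made for illustration, and a flagged entry still needs manual checking.

```python
import re

# Month abbreviation, optionally followed by the rest of the month name.
MONTH = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*"
# Matches strings such as "4-Oct", "Mar-01" or "October-4".
DATE_LIKE = re.compile(rf"^(?:\d{{1,2}}-{MONTH}|{MONTH}-\d{{1,2}})$", re.IGNORECASE)


def flag_date_like(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


# Hypothetical gene list as it might look after a spreadsheet round trip.
symbols = ["TP53", "4-Oct", "BRCA2", "1-Sep", "Mar-01", "October-4"]
flagged = flag_date_like(symbols)
print(flagged)
```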

From Gene Lists to Pathway Analysis

Pathway and Network Analysis
- Any type of analysis that involves pathway or network information
- Helps gain mechanistic insight into ‘omics’ data
  – Identifying a master regulator or drug targets, characterizing pathways active in a sample
- Most commonly applied to help interpret lists of genes
- The most popular type is pathway enrichment analysis, but many others are useful

Benefits of Pathway Analysis
- Easier to interpret
  – Familiar concepts, e.g. cell cycle
- Identifies possible causal mechanisms
- Predicts new roles for genes
- Improves statistical power
  – Fewer tests; aggregates data from multiple genes into one pathway
- More reproducible
  – E.g. gene expression signatures
- Facilitates integration of multiple data types

Pathways vs. Networks
Pathways
- Detailed, high-confidence consensus
- Biochemical reactions
- Small-scale, fewer genes
- Concentrated from decades of literature
Networks
- Simplified cellular logic, noisy
- Abstractions: directed, undirected
- Large-scale, genome-wide
- Constructed from omics data integration

Types of Pathway/Network Analysis

Types of Pathway/Network Analysis
- What biological processes are altered in this cancer?
- Are new pathways altered in this cancer?
- Are there clinically-relevant tumour subtypes?
- How are pathway activities altered in a particular patient?
- Are there targetable pathways in this patient?

Pathway Analysis Workflow Overview

Where is pathway information?
- Pathways
  – Gene Ontology biological process, pathway databases e.g. Reactome
- Other annotations
  – Gene Ontology molecular function, cell location
  – Chromosome position
  – Disease association
  – DNA properties (TF binding sites, gene structure (intron/exon), SNPs, …)
  – Transcript properties (splicing, 3’ UTR, microRNA binding sites, …)
  – Protein properties (domains, secondary and tertiary structure, PTM sites)
  – Interactions with other genes

The Gene Ontology (at last)

What is the Gene Ontology (GO)?
- A set of biological phrases (terms) which are applied to genes:
  – protein kinase
  – apoptosis
  – membrane
- Dictionary: term definitions
- Ontology: a formal system for describing knowledge

GO Structure
- Terms are related within a hierarchy
  – is-a
  – part-of
- Describes multiple levels of detail of gene function
- Terms can have more than one parent or child

What is covered by the GO?
- GO terms are divided into three aspects:
  – cellular component
  – molecular function
  – biological process
- Examples shown: glucose-6-phosphate isomerase activity (a molecular function), cell division (a biological process)

Part 1/2: Terms
Where do GO terms come from?
  – GO terms are added by editors at EBI and gene annotation database groups
  – Terms are added by request
  – Experts help with major development

                      Jun 2012   Apr 2015   Increase
Biological process      23,074     28,158        22%
Molecular function       9,392     10,835        15%
Cellular component       2,994      3,903        30%
Total                   37,104     42,896        16%

Part 2/2: Annotations
- Genes are linked, or associated, with GO terms by trained curators at genome databases
  – Known as ‘gene associations’ or GO annotations
  – Multiple annotations per gene
- Some GO annotations are created automatically (without human review)

Hierarchical annotation
- Genes annotated to a specific term in GO are automatically added to all parents of that term
- Example shown: AURKB
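The propagation rule above can be sketched on a toy hierarchy. The term and gene names below are placeholders rather than real GO terms or real annotations, and the sketch assumes that every parent link (is-a or part-of) propagates annotations in the same way.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term, given a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(direct, parents):
    """Extend each gene's direct annotations with all ancestor terms."""
    full = {}
    for gene, terms in direct.items():
        closure = set(terms)
        for term in terms:
            closure |= ancestors(term, parents)
        full[gene] = closure
    return full


# Toy hierarchy (hypothetical): term_C has two parents, both children of root.
parents = {"term_C": ["term_A", "term_B"], "term_A": ["root"], "term_B": ["root"]}
direct = {"geneX": ["term_C"]}

full = propagate(direct, parents)
print(sorted(full["geneX"]))
```

Because a term can have more than one parent, the traversal keeps a `seen` set so that shared ancestors such as `root` are visited only once.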

Annotation Sources
- Manual annotation
  – Curated by scientists: high quality, small number (time-consuming to create)
  – Reviewed computational analysis
- Electronic annotation
  – Annotation derived without human validation: computational predictions (accuracy varies)
  – Lower ‘quality’ than manual codes
- Key point: be aware of annotation origin

Evidence Types
- Experimental Evidence Codes
  – EXP: Inferred from Experiment
  – IDA: Inferred from Direct Assay
  – IPI: Inferred from Physical Interaction
  – IMP: Inferred from Mutant Phenotype
  – IGI: Inferred from Genetic Interaction
  – IEP: Inferred from Expression Pattern
- Electronic annotation (not an experimental code; assigned without human review)
  – IEA: Inferred from Electronic Annotation
- Author Statement Evidence Codes
  – TAS: Traceable Author Statement
  – NAS: Non-traceable Author Statement
- Curator Statement Evidence Codes
  – IC: Inferred by Curator
  – ND: No biological Data available
- Computational Analysis Evidence Codes
  – ISS: Inferred from Sequence or Structural Similarity
  – ISO: Inferred from Sequence Orthology
  – ISA: Inferred from Sequence Alignment
  – ISM: Inferred from Sequence Model
  – IGC: Inferred from Genomic Context
  – RCA: Inferred from Reviewed Computational Analysis

Species Coverage
- All major eukaryotic model organism species and human
- Several bacterial and parasite species through TIGR and GeneDB at Sanger
- New species annotations in development
- Current list:
  – otations.shtml

Variable Coverage (Apr 2015)
[Figure: annotation coverage, experimental vs. non-experimental]

Contributing Databases
- Berkeley Drosophila Genome Project (BDGP)
- dictyBase (Dictyostelium discoideum)
- FlyBase (Drosophila melanogaster)
- GeneDB (Schizosaccharomyces pombe, Plasmodium falciparum, Leishmania major and Trypanosoma brucei)
- UniProt Knowledgebase (Swiss-Prot/TrEMBL/PIR-PSD) and InterPro databases
- Gramene (grains, including rice, Oryza)
- Mouse Genome Database (MGD) and Gene Expression Database (GXD) (Mus musculus)
- Rat Genome Database (RGD) (Rattus norvegicus)
- Reactome
- Saccharomyces Genome Database (SGD) (Saccharomyces cerevisiae)
- The Arabidopsis Information Resource (TAIR) (Arabidopsis thaliana)
- The Institute for Genomic Research (TIGR): databases on several bacterial species
- WormBase (Caenorhabditis elegans)
- Zebrafish Information Network (ZFIN) (Danio rerio)

GO Slim Sets
- GO has too many terms for some uses
  – Summaries (e.g. pie charts)
- GO Slim is an official reduced set of GO terms
  – Generic, plant, yeast
Crockett DK et al. Lab Invest. Nov. 2005; 85(11)

GO Software Tools
- GO resources are freely available to anyone without restriction
  – Ontologies, gene associations and tools developed by GO
- Others have used GO to create versatile tools

Accessing GO: QuickGO

Pathway Databases
- … lists ~550 pathway-related databases
- MSigDB: collects the major ones

Sources of Gene Attributes
- Ensembl BioMart (general)
- Entrez Gene (general)
- Model organism databases
  – E.g. SGD

Ensembl BioMart
- Convenient access to gene list annotation
- Select genome
- Select attributes to download
- Select filters

Pathway Analysis
- Overrepresentation Analysis
- Gene Set Enrichment Analysis

Pathway enrichment analysis
- Gene list from experiment: genes down-regulated in drug-sensitive brain cancer cell lines
- Pathway information: all genes known to be involved in neurotransmitter signaling
- Statistical test: are there more annotations in the gene list than expected? (p < 0.05?)
- Hypothesis: drug sensitivity in brain cancer is related to reduced neurotransmitter signaling
- Test many pathways

Pathway Enrichment Analysis
Combines
- Gene (feature) lists ← (gen)omic experiment
- Pathways and other gene annotations
  – Gene Ontology: ontology structure, annotation
  – BioMart
  – Other resources

Types of Enrichment Analysis
- Gene list (e.g. expression change > 2-fold)
  – Answers the question: are any gene sets surprisingly enriched (or depleted) in my gene list?
  – Statistical test: Fisher’s Exact Test (aka hypergeometric test)
- Ranked list (e.g. by differential expression)
  – Answers the question: are any gene sets ranked surprisingly high or low in my ranked list of genes?
  – Statistical test: minimum hypergeometric test (+ others we won’t discuss)

Enrichment Test
[Diagram: microarray experiment (gene expression table) + gene-set databases → ENRICHMENT TEST → enrichment table (gene sets shown: Spindle, Apoptosis)]

Gene list enrichment analysis
Given:
  1. Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast)
  2. Gene sets or annotations: e.g. Gene Ontology, transcription factor binding sites in promoter
Question: are any of the gene annotations surprisingly enriched in the gene list?
Details:
  – Where do the gene lists come from?
  – How to assess “surprisingly” (statistics)
  – How to correct for repeating the tests

Two-class design for gene lists
- Expression matrix: Class-1 vs. Class-2
- Genes ranked by a differential statistic, e.g.:
  – Fold change
  – Log (ratio)
  – t-test
  – Significance analysis of microarrays
- Selection by threshold: UP and DOWN gene lists
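The two-class design can be sketched with the simplest of the statistics listed, the log2 fold change. The expression values, gene names and the 2-fold threshold (|log2 ratio| ≥ 1) are assumptions chosen only for illustration.

```python
from math import log2
from statistics import mean


def log2_fold_change(class1, class2):
    """log2 of the ratio of mean expression, class 2 over class 1."""
    return log2(mean(class2) / mean(class1))


# Hypothetical expression values: (class-1 samples, class-2 samples).
expression = {
    "geneA": ([10.0, 12.0, 11.0], [40.0, 44.0, 48.0]),
    "geneB": ([20.0, 22.0, 21.0], [19.0, 21.0, 23.0]),
    "geneC": ([80.0, 76.0, 84.0], [20.0, 18.0, 22.0]),
}

scores = {g: log2_fold_change(c1, c2) for g, (c1, c2) in expression.items()}

# Rank all genes from most up-regulated to most down-regulated.
ranked = sorted(scores, key=scores.get, reverse=True)

# Selection by threshold: at least a 2-fold change in either direction.
up = [g for g in ranked if scores[g] >= 1.0]
down = [g for g in ranked if scores[g] <= -1.0]
print(ranked, up, down)
```

The `up` and `down` lists feed a gene list enrichment test, while the full `ranked` list is what a ranked list test needs.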

Example gene list enrichment test
[Diagram: microarray experiment (gene expression table) → gene list (e.g. UP-regulated) and background (all genes on the array); gene-set databases]

Example gene list enrichment test
[Diagram: microarray experiment (gene expression table) → gene list (e.g. UP-regulated) and background (all genes on the array); gene-set databases → gene-set]

Enrichment Test
- The output of an enrichment test is a P-value.
- The P-value assesses the probability that an overlap at least as large as the one observed arises when randomly sampling the array genes.
[Diagram: random samples of array genes]
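The P-value described here is the upper tail of a hypergeometric distribution, which is also the one-sided Fisher's exact test. Below is a minimal standard-library sketch; the function name, argument names and example numbers are assumptions for illustration.

```python
from math import comb


def enrichment_p(overlap, list_size, set_size, background_size):
    """P(overlap at least as large as observed) under random sampling.

    overlap:          genes in both the gene list and the gene set
    list_size:        number of genes in the gene list
    set_size:         number of background genes annotated to the gene set
    background_size:  number of genes in the background (e.g. on the array)
    """
    largest = min(list_size, set_size)
    total = comb(background_size, list_size)
    favourable = sum(
        comb(set_size, i) * comb(background_size - set_size, list_size - i)
        for i in range(overlap, largest + 1)
    )
    return favourable / total


# Hypothetical numbers: 5000 array genes, 500 in the gene set,
# a gene list of 100 genes of which 20 fall in the gene set.
p_value = enrichment_p(20, 100, 500, 5000)
print(p_value)
```

With these numbers about 10 of the 100 listed genes would be expected in the set by chance, so an overlap of 20 gives a small P-value.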

Recipe for gene list enrichment test
- Step 1: Define the gene list (e.g. by thresholding the analyzed list) and the background list
- Step 2: Select gene sets to test for enrichment
- Step 3: Run enrichment tests and correct for multiple testing, if necessary
- Step 4: Interpret your enrichments
- Step 5: Publish! ;)

Why test enrichment in ranked lists?
- Possible problems with the gene list test
  – No “natural” value for the threshold
  – Different results at different threshold settings
  – Possible loss of statistical power due to thresholding: no resolution between significant signals of different strengths, and weak signals are neglected

Example ranked list enrichment test
[Diagram: ranked gene list + gene-set databases → GSEA or minimum hypergeometric test → enrichment table]
Gene-set     p-value
Spindle      …
Apoptosis    0.025
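To give an intuition for ranked list tests, here is a sketch of an unweighted running-sum score in the spirit of the Kolmogorov-Smirnov statistic. It is not the full GSEA method (no weighting by the ranking statistic and no permutation-based P-value) and it is not the minimum hypergeometric test; the gene names are placeholders.

```python
def running_sum_score(ranked_genes, gene_set):
    """Largest deviation from zero of a hit/miss running sum.

    Walking down the ranked list, the sum goes up by 1/(number of hits) at
    each gene in the set and down by 1/(number of misses) otherwise.
    Positive scores mean the set is concentrated near the top of the list,
    negative scores near the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("the list needs both members and non-members of the set")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list, most up-regulated gene first.
ranked_list = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = running_sum_score(ranked_list, {"g1", "g2"})
bottom_score = running_sum_score(ranked_list, {"g5", "g6"})
print(top_score, bottom_score)
```

No threshold is needed: every gene contributes through its position, which is the point of the ranked list approach.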

Recipe for ranked list enrichment test
- Step 1: Rank ALL your genes
- Step 2: Select gene sets to test for enrichment
- Step 3: Run enrichment tests and correct for multiple testing, if necessary
- Step 4: Interpret your enrichments
- Step 5: Publish! ;)

Multiple test corrections

How to win the P-value lottery
- Background population: 500 black genes, 4500 red genes
- Random draws … 7,834 draws later …
- Expect a random draw with the observed enrichment once every “1 / P-value” draws
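The arithmetic behind the lottery can be written down directly, assuming independent draws (or independent tests): a result with a given P-value turns up on average once every 1/P draws, and the chance of at least one such result grows quickly with the number of draws. The values 0.001, 0.05 and 20 are illustrative assumptions.

```python
def expected_draws(p_value):
    """Average number of random draws until one is at least this extreme."""
    return 1.0 / p_value


def prob_at_least_one(p_value, n_draws):
    """Chance that at least one of n independent draws is this extreme."""
    return 1.0 - (1.0 - p_value) ** n_draws


waiting = expected_draws(0.001)           # about 1000 draws on average
chance = prob_at_least_one(0.05, 20)      # about 0.64 for 20 tests at P < 0.05
print(waiting, round(chance, 4))
```

This is why a nominal P < 0.05 is weak evidence when many gene sets are tested.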

Simple P-value correction: Bonferroni
- If M = # of annotations tested: corrected P-value = M x original P-value
- The corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws.
- The jargon for this correction is “controlling the Family-Wise Error Rate (FWER)”.
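A minimal sketch of the correction. Capping the result at 1 is a common convention that is not stated on the slide, and the example P-values are invented.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical nominal P-values from M = 3 tests.
corrected = bonferroni([0.01, 0.04, 0.5])
print(corrected)
```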

Bonferroni correction caveats
- The Bonferroni correction is very stringent and can “wash away” real enrichments, leading to false negatives.
- Often one is willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.

False discovery rate (FDR)
- FDR is the expected proportion of the observed enrichments that are due to random chance.
- Compare this to the Bonferroni correction, which is a bound on the probability that any one of the observed enrichments could be due to random chance.
- Typically, FDR corrections are calculated using the Benjamini-Hochberg procedure.
- The FDR threshold is often called the “q-value”.

Benjamini-Hochberg example I
Sort the P-values of all tests in increasing order (rank 1 = smallest P-value).
Rank  Category                      (Nominal) P-value
1     Transcriptional regulation    …
2     Transcription factor          …
3     Initiation of transcription   …
4     Nuclear localization          …
5     Chromatin modification        …
…
52    Cytoplasmic localization      …
53    Translation                   …

Benjamini-Hochberg example II
The adjusted P-value is the “nominal” P-value times the # of tests divided by the rank of the P-value in the sorted list:
Adjusted P-value = P-value x [# of tests] / Rank
Rank  Category                      Adjusted P-value
1     Transcriptional regulation    … x 53/1 = …
2     Transcription factor          … x 53/2 = …
3     Initiation of transcription   … x 53/3 = …
4     Nuclear localization          … x 53/4 = …
5     Chromatin modification        … x 53/5 = …
…
52    Cytoplasmic localization      … x 53/52 = …
53    Translation                   … x 53/53 = 0.99

Benjamini-Hochberg example III
The Q-value (or FDR) corresponding to a nominal P-value is the smallest adjusted P-value assigned to P-values with the same or larger ranks.
Rank  Category                      Adjusted P-value      FDR / Q-value
1     Transcriptional regulation    … x 53/1 = …          …
2     Transcription factor          … x 53/2 = …          …
3     Initiation of transcription   … x 53/3 = …          …
4     Nuclear localization          … x 53/4 = …          …
5     Chromatin modification        … x 53/5 = …          …
…
52    Cytoplasmic localization      … x 53/52 = …         …
53    Translation                   … x 53/53 = 0.99      …

Benjamini-Hochberg example III
The P-value threshold is the highest-ranking P-value for which the corresponding Q-value is below the desired significance threshold.
Rank  Category                      Adjusted P-value      FDR / Q-value
1     Transcriptional regulation    … x 53/1 = …          …
2     Transcription factor          … x 53/2 = …          …
3     Initiation of transcription   … x 53/3 = …          …
4     Nuclear localization          … x 53/4 = …          …
5     Chromatin modification        … x 53/5 = …          …
…
52    Cytoplasmic localization      … x 53/52 = …         …
53    Translation                   … x 53/53 = 0.99      0.99
P-value threshold for FDR < …
Red: non-significant. Green: significant at FDR < 0.05.
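The three steps of the example (sort, adjust by # of tests / rank, take the smallest adjusted value at the same or larger ranks) can be sketched as follows. The function name and the four example P-values are assumptions; the 53-test table from the slides is not reproduced.

```python
def benjamini_hochberg(p_values):
    """Return a Q-value for each P-value, in the original input order."""
    n_tests = len(p_values)
    # Indices of the P-values from smallest (rank 1) to largest (rank n).
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    q_values = [0.0] * n_tests
    smallest_so_far = 1.0
    # Walk from the largest rank down, so that each Q-value is the smallest
    # adjusted P-value among the same or larger ranks.
    for rank in range(n_tests, 0, -1):
        index = order[rank - 1]
        adjusted = p_values[index] * n_tests / rank
        smallest_so_far = min(smallest_so_far, adjusted)
        q_values[index] = smallest_so_far
    return q_values


# Hypothetical nominal P-values from four tests.
nominal = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(nominal)
significant = [p for p, q in zip(nominal, q_values) if q < 0.05]
print(q_values, significant)
```

In this toy case all four tests pass FDR < 0.05, whereas Bonferroni at 0.05 would reject only the two smallest (0.005 x 4 = 0.02 and 0.01 x 4 = 0.04).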

Reducing correction stringency
- The correction to the P-value threshold depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be
- You can control the stringency by reducing the number of tests:
  – e.g. use GO slim;
  – restrict testing to the appropriate GO annotations;
  – or filter gene sets by size.
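Filtering gene sets by size before testing can be sketched as below, together with its effect on a Bonferroni-style per-test threshold. The gene sets, the size limits (3 to 10 genes) and the significance level are illustrative assumptions, not recommended values.

```python
def filter_by_size(gene_sets, min_size, max_size):
    """Keep only gene sets with between min_size and max_size genes."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }


def per_test_threshold(alpha, n_tests):
    """Bonferroni-style P-value threshold for each individual test."""
    return alpha / n_tests


# Hypothetical gene sets of different sizes.
gene_sets = {
    "set_tiny": {"g1"},
    "set_small": {"g1", "g2", "g3"},
    "set_medium": {"g2", "g3", "g4", "g5"},
    "set_huge": {f"g{i}" for i in range(1, 51)},
}

kept = filter_by_size(gene_sets, min_size=3, max_size=10)
before = per_test_threshold(0.05, len(gene_sets))  # 0.05 / 4
after = per_test_threshold(0.05, len(kept))        # 0.05 / 2
print(sorted(kept), before, after)
```

The filter has to be chosen before looking at the results; otherwise the reduced number of tests no longer reflects what was actually tried.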

Summary
Enrichment analysis:
- Statistical tests
  – Gene list: Fisher’s Exact Test
  – Ranked list: mHG, GSEA; see also the Wilcoxon rank-sum (Mann-Whitney U) test and the Kolmogorov-Smirnov test
- Multiple test correction
  – Bonferroni: stringent, controls the probability of at least one false positive*
  – FDR: more forgiving, controls the expected proportion of false positives*; typically uses Benjamini-Hochberg
* False positive = Type 1 error: an enrichment is observed although there is no association