Published by Dulcie Martin. Modified over 8 years ago.
1
Canadian Bioinformatics Workshops www.bioinformatics.ca
3
Module 8 – Variants to Networks Part 1 – How to annotate variants and prioritize potentially relevant ones Informatics Facility, The Centre for Applied Genomics (TCAG) The Hospital for Sick Children
4
Module 8 bioinformatics.ca Learning Objectives of Module I have detected somatic variants in a cancer sample. What information can I use to interpret them? What variant annotations can I use? How do impact prediction models work? How to use an annotation tool: Annovar (LAB)
5
Module 8 bioinformatics.ca Introduction
6
Module 8 bioinformatics.ca Variant vs Gene Information We have to consider information at two levels: Gene – Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation) – Is the gene sensitive to perturbation? (e.g. haploinsufficiency) Variant – What is the variant effect on the gene product?
7
Module 8 bioinformatics.ca Variant Recurrence Variant Gene Product Effect Gene Product Function / Pathway Integrating Different Evidences
8
Module 8 bioinformatics.ca On Variant Size
Small: 1-50 bp
– SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect
– Small in/dels: a bit more challenging to detect
– Most are available in databases; can be mapped by exact coordinates
Medium: 100-1,000 bp
– Insertions, deletions, translocations, complex re-arrangements
– Most challenging to detect
– More tolerant mapping (e.g. partners of gene fusion)
Large: > 5 kbp
– Copy number variants: relatively easy to detect using arrays, more challenging using next generation sequencing
– More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
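The "50% reciprocal overlap" criterion for large variants can be sketched as follows. This is a minimal illustration, not the method of any specific tool; the function names and the half-open coordinate convention are assumptions made for the example.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Return the smaller of the two overlap fractions (0.0 if disjoint).

    Intervals are assumed to be half-open [start, end) and on the
    same chromosome.
    """
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def matches_known_cnv(called, known, threshold=0.5):
    """True if each interval covers at least `threshold` of the other."""
    return reciprocal_overlap(*called, *known) >= threshold


# Hypothetical coordinates, for illustration only.
called = (10_000, 20_000)
known = (14_000, 22_000)
print(reciprocal_overlap(*called, *known), matches_known_cnv(called, known))
```

Taking the minimum of the two fractions means a small variant sitting inside a much larger one does not count as a match, which is the point of requiring the overlap to be reciprocal.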
9
Module 8 bioinformatics.ca Variant Annotation Components
A. Variant database mapping
– Allele frequencies from reference data-sets (1000G, NHLBI-ESP, ExAC)
– dbSNP (sequence variation database)
– COSMIC (somatic variant database)
B. Gene mapping (coding/splicing, UTR, intergenic)
C. Gene product effect type (e.g. loss of function, missense)
D. Coding Missense Effect Scoring
– SIFT
– PolyPhen2
– MutationAssessor
E. Other Effect Scoring
– PhyloP (conservation)
– CADD
– Splicing-regulatory predictions
10
Module 8 bioinformatics.ca Variant databases and allele frequencies
11
Module 8 bioinformatics.ca 1000 Genomes (Phase 3) Goal: – Identify all variants at > 1% frequency in represented human populations Subjects: 2,504 – Apparently healthy – Ethnicities: caucasian European, admixed Latin Americans, African, South Asians, East Asians Platform: Illumina – Low coverage (2-4x) whole genome – Exon (50x coverage)
12
Module 8 bioinformatics.ca NHLBI-ESP Goal: – discover heart, lung and blood disorder variants at frequency < 1% Subjects: 6,503 (ESP 6500 release) – Not necessarily healthy (includes individuals with extreme subclinical traits and diseased) – Ethnicities: 2,203 African-Americans, 4,300 European-Americans Platform: Illumina, exome sequencing (average 110x)
13
Module 8 bioinformatics.ca ExAC (Exome Aggregation Consortium) Goal: – Compile the largest set of exomes ever Subjects: 60,706 (unrelated) – Not necessarily healthy: includes cardiovascular, autoimmune, schizophrenia and cancer, but removed individuals with severe pediatric disease – Ethnicities: non-Finnish European, Finnish, Latin Americans, African, South Asians, East Asians, Other Platform: Illumina, exome Variant calling: – GATK
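One common use of these allele frequencies is to flag variants that are frequent in reference populations. The sketch below assumes a hypothetical variant record with a `pop_af` dictionary; the field names and the 1% cut-off are illustrative assumptions, not the output format of any annotation tool.

```python
def is_rare(variant, max_af=0.01):
    """True if the variant is below `max_af` in every reference data-set.

    `variant["pop_af"]` is a hypothetical mapping from data-set name to
    allele frequency; None means the variant was not observed there.
    """
    freqs = [af for af in variant["pop_af"].values() if af is not None]
    return max(freqs, default=0.0) < max_af


# Hypothetical records, for illustration only.
rare = {"id": "var1", "pop_af": {"1000G": None, "ESP6500": 0.0002, "ExAC": 0.0001}}
common = {"id": "var2", "pop_af": {"1000G": 0.12, "ESP6500": 0.10, "ExAC": 0.11}}
print([v["id"] for v in (rare, common) if is_rare(v)])
```

Using the maximum frequency across data-sets is a conservative choice: a variant common in any one reference population is treated as common.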
14
Module 8 bioinformatics.ca dbSNP Broad scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar) – Submissions before and after NGS era – Includes polymorphisms found in the general population – Includes rare germline variants that are disease-associated (or suspected to be) – Includes somatic variants (also in COSMIC) Good to look up variants If you want to use it as a filter, make sure you remove “clinically flagged” variants (somatic, germline)
15
Module 8 bioinformatics.ca COSMIC “Catalogue Of Somatic Mutations In Cancer” Reference database for somatic variation in cancer Worth following up variants matching COSMIC entries – How many studies/samples was it found in? One, or many? – Does the variant overlap a hotspot? – Is the gene frequently mutated?
16
Module 8 bioinformatics.ca Gene mapping
17
Module 8 bioinformatics.ca Gene Mapping: Types of Genes Types of genes: Protein-coding genes Non-protein-coding RNA genes (e.g. miRNA) Different functional relevance Different knowledge of variant effects
18
Module 8 bioinformatics.ca Gene Mapping: Parts of Genes Protein-coding genes have these parts: – UTR (transcribed, not translated) – Coding exons (translated) – Introns (spliced out, not translated) – Splice sites Also: Upstream, downstream transcribed gene Inter-genic
19
Module 8 bioinformatics.ca Gene Mapping: Annovar’s priority system Gene types and parts: what if they overlap..? Whenever more than one mapping is possible, Annovar will follow this priority system You can also ask Annovar to report all possible effects
20
Module 8 bioinformatics.ca Gene Mapping: Annovar’s priority system [Figure: gene model showing Protein Coding Gene G1, the TSS (Transcription Start Site) of G1, and Non-coding RNA ncR1 (e.g. miRNA)]
21
Module 8 bioinformatics.ca Gene Mapping: Annovar’s priority system [Figure: the same gene model with each region labelled by its Annovar annotation: G1 Upstream, G1 UTR 5’, G1 Exonic, G1 Splicing **, G1 Intronic, ncR1, G1 UTR 3’, G1 Downstream, ncR1 Downstream] ** Splice sites after the first were omitted to avoid clutter
22
Module 8 bioinformatics.ca Gene Mapping: Database Goal: map our variants to (coding and non-coding) genes RefSeq is the suggested database for transcribed gene and coding sequence definition – In the lab we will use Annovar with RefSeq database Other databases available: UCSC known genes, Ensembl
23
Module 8 bioinformatics.ca Gene product effect type
24
Module 8 bioinformatics.ca Gene Product Effect Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed) Protein-coding sequences: how is protein sequence affected? Definitely easier to chase after protein effects But don’t forget that other gene products exist…
25
Module 8 bioinformatics.ca Gene Product Effect: Protein-coding
– Stop-gain SNV: adds a STOP codon, giving a truncated protein
– Frameshift In/Del: shifts the reading frame, so the protein is translated incorrectly from that point
– Splicing: alters key sites guiding splicing
– In-frame In/Del: removes/adds one or more amino acids
– Stop-loss: loss of the STOP codon, giving an extra piece in the protein
– Missense SNV: modifies one amino acid
– Synonymous: no amino acid change
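The SNV effect types above can be derived by translating the reference and altered codon with the standard genetic code. The following is a minimal sketch; the function name is hypothetical, codons are assumed to be given on the coding strand, and splicing and in/del effects are outside its scope.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks STOP.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snv_effect(ref_codon, alt_codon):
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


# GTG (Val) -> GAG (Glu) is a missense change; TGG (Trp) -> TGA is a stop-gain.
for ref, alt in [("GTG", "GAG"), ("GTG", "GTA"), ("TGG", "TGA"), ("TAA", "CAA")]:
    print(ref, alt, snv_effect(ref, alt))
```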
26
Module 8 bioinformatics.ca Loss of Function (LOF) Variants Definition: Stop-gain, Frameshift, Splicing These are the more disruptive, BUT: What percentage of the protein is affected? Are there multiple transcript isoforms? Splicing effect difficult to predict – Cryptic splice sites Frameshift can be rescued by another frameshift or bypassed by splicing
27
Module 8 bioinformatics.ca Missense Variants: Tell Me More.. How do we tell if a missense alters protein function? Type of amino acid change (amino acid groups) Conservation across species Conserved protein domain Secondary protein structure Tertiary (3D) protein structure + simulation Other functional features (e.g. phosphosite) Machine learning model tying all of these together – What training set?
28
Module 8 bioinformatics.ca Missense Example: Back to BRAF BRAF V600E (T>A): somatic, pathogenic. BRAF V600A (T>C): somatic / germline, pathogenicity untested.
29
Module 8 bioinformatics.ca Conservation and Missense Variant Scoring Models
30
Module 8 bioinformatics.ca Conservation Conservation is a powerful and broadly used idea How conserved is a given nucleotide or genomic interval, comparing different species to human? How conserved is an amino acid in a protein sequence? Available from UCSC (nucleotide conservation): – PhyloP score – useful to assess single variants – PhastCons score/element – useful to assess putative regulatory regions and genes not coding proteins – Multi-species alignment – generally useful
31
Module 8 bioinformatics.ca Look for coding exons, UTRs and third nucleotide within codons
32
Module 8 bioinformatics.ca PhyloP PhyloP: test to detect if nucleotide substitution rates are faster or slower than expected under neutral drift – Only where aligned sequence available! PhyloP score – Positive: conserved (e.g. PhyloP > 2) – Zero: neutral – Negative: more diverged than neutral Species group: – All vertebrates – Only placental mammals – Only primates
33
Module 8 bioinformatics.ca Conservation Main caveat: – if you use conservation for a given position, this will not tell you directly what is the effect of your variants, but only if the position is important!
34
Module 8 bioinformatics.ca Missense Variant Effect: Scoring Models Overview Criteria to keep in mind: What features are used? – Nucleotide / amino acid conservation – Amino acid physicochemical properties Direct scoring versus Machine learning – Machine learning models are heavily dependent on the training-set used What data-set used for assessment / learning / optimization? – E.g. Activating / gain-of-function versus inactivating / loss-of-function mutations – E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some are unique to cancer, e.g. drug resistance)
35
Module 8 bioinformatics.ca SIFT Broadly used, relatively old (first published: 2001) Designed for deleterious mutation (i.e. disruptive of protein function) Based uniquely on protein sequence (amino acid) conservation 1.Start from query protein sequence 2.Identify similar protein sequences (PSI-BLAST) 3.Multiple alignment of protein sequences (orthologs and paralogs) 4.Amino acid x residue probability matrix (PSSM) 5.For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of frequency rank * frequency) Score: probability of observing amino acid normalized by residue conservation cut-off: 0.05 (based on case studies) Predicting deleterious amino acid substitutions. Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.
36
Module 8 bioinformatics.ca PolyPhen2 Integrates multiple features – 8 sequence-based, 3 structure-based (nucleotide and amino acid level) (e.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics) Machine learning method (Naïve Bayes) Requires training set – Set 1: HumDiv – Positive: damaging alleles for known Mendelian disorders (Uniprot) – Negative: nondamaging differences between human proteins and related mammalian homologs – Performance in 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%) – Set 2: HumVar – Positive: all human disease-causing mutations (Uniprot) – Negative: non-synonymous SNPs without disease association Richer model than SIFT More biased towards training set(s) than SIFT A method and server for predicting damaging missense mutations. Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.
37
Module 8 bioinformatics.ca MutationAssessor Direct / theoretical model (no machine learning) Based on amino acid conservation also specifically modeling conservation unique to protein subfamilies (can be regarded as an enhanced SIFT) Entropy-based score based on protein sequence alignment Performs well for (recurrent) somatic variants Predicting the functional impact of protein mutations: application to cancer genomics. Reva B, Antipin Y, Sander C. Nucleic Acids Res. 2011 Sep 1;39(17):e118
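To illustrate what an entropy-based conservation measure looks like, the sketch below computes the Shannon entropy of one column of a protein multiple alignment. This is only the generic idea (low entropy means a conserved position); it is not the MutationAssessor score, and the function name and toy columns are assumptions.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column.

    Gap characters ("-") are ignored. Returns 0.0 for a fully conserved
    (or empty) column; higher values mean more residue diversity.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    return sum(
        (n / total) * math.log2(total / n) for n in Counter(residues).values()
    )


# Toy alignment columns, one character per aligned sequence.
print(column_entropy("VVVVVVVV"))  # fully conserved position
print(column_entropy("VVVVAAAA"))  # two residues, equally frequent
print(column_entropy("VAILMFWY"))  # highly variable position
```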
38
Module 8 bioinformatics.ca CADD Intended as a measure of “deleteriousness” for coding and non-coding sequence, not biased to known disease variation – However, it does not model gene-specific constraint in detail Machine learning model (Linear SVM) – Negative training set: nearly fixed human alleles that are variant when compared to the inferred human-chimp ancestral genome – Positive training set: simulated variants based on a mutation model aware of sequence context and primate substitution rates – Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, Encode tracks; includes missense predictions and nucleotide-level conservation – Performance assessment: using pathogenic variants from ClinVar, performs a bit better than PhyloP for all sites and than PolyPhen/SIFT for missense coding variants A general framework for estimating the relative pathogenicity of human genetic variants. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.
39
Module 8 bioinformatics.ca CADD Pathogenic ClinVar vs NHLBI-ESP > 5%
40
Module 8 bioinformatics.ca Splicing Regulatory Predictions Goal: predict how SNVs affect exon inclusion / exclusion Strategy: 1.Learn “Wild Type” splicing code based on reference genome sequence motifs and experimentally-measured splicing patterns in human tissues 2.“Mutant” code: predicts splicing change when variant alters splicing- guiding sequence motif Does not learn based on known disease splicing alterations Science 2015
41
Module 8 bioinformatics.ca Effect Scoring: Concluding Remarks 1. Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can be additionally inspected 2. Missense scoring models are powerful, but their strengths and weaknesses need to be understood 3. Variants should always be reviewed putting all information in context – Consider conservation and effect scores using different models – Review the amino acid change and sequence context – Look for clusters of somatic variants and protein domains – Don’t forget gene-level information!
42
Module 8 bioinformatics.ca We are on a Coffee Break & Networking Session
43
Module 8 – Variants to Networks Part 2 – From Genes to Pathways Informatics Facility, The Centre for Applied Genomics (TCAG) The Hospital for Sick Children
44
Module 8 bioinformatics.ca Learning Objectives of Module What identifiers can be used for genes? What gene annotations are available for genes? What is a gene-set enrichment test and how does it work? Why do I need multiple test correction for gene-set enrichment and how does it work? How to visualize gene-set enrichment results using Enrichment Map General principles of network visualization in Cytoscape
45
Module 8 bioinformatics.ca Introduction
46
Module 8 bioinformatics.ca [Figure: Activity Profiles / Somatic Mutations and Prior Knowledge about genes (GENE SETS, NETWORKS, PATHWAYS; e.g. Spindle, Apoptosis, Ca++ Channels, MAPK) are combined by Informatics (scoring models, search algorithms) into Activity Maps]
47
Module 8 bioinformatics.ca MCF7 breast cancer cell line’s transcriptional response to estradiol at 12 and 24 hours, summarized using GSEA and Gene Ontology gene-sets (Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation, PlosOne, 2010)
48
Module 8 bioinformatics.ca Gene Identification
49
Module 8 bioinformatics.ca Gene Identification Getting gene IDs right is important – Identify the right entity – Stable and traceable Issues to keep in mind: – What is the output of the experiment? – Are the annotations used to analyze the experimental data in a compatible ID system? – Is the statistical test appropriate? (most used tests assume a random uniform distribution over genes)
50
Module 8 bioinformatics.ca Gene Identification In this lecture we will focus on genes as key entities Suggested for human (and mouse) genes: – Gene: NCBI EntrezGene ID (e.g. 7157) – Transcript: RefSeq (e.g. NM_001276760) – Official symbols (e.g. TP53) are not stable, and should be used only to “describe” genes, not to identify them
51
Module 8 bioinformatics.ca Gene Annotations and Gene-sets
52
Module 8 bioinformatics.ca From Cell Biology to Gene-sets [Figure: example gene-sets (Nuclear Pore, Cell Cycle, Ribosome, P53 signaling) with their member genes] Where can I get these gene-sets? How were the gene-sets compiled? How are they structured?
53
Module 8 bioinformatics.ca Gene-set Types Functions (Gene Ontology) Pathways Genotype-phenotype/disease association Protein Families / Domains Genomic position Gene Expression – Up/down after treatment or in relation to disease Predicted Targets of Regulators – Transcription Factors – miRNA Protein-protein Interaction Modules
55
Module 8 bioinformatics.ca Gene Ontology (GO) Effort to standardize functional description of eukaryotic gene products (launched 1998) – > 460,000 species – Normal function, not disorder / disease – Team of curators, receive input from domain experts Gene annotations are provided by other databases (e.g. human: UniProtKB-GOA; mouse: Mouse Genome Informatics) – Based on expert curation of the literature, review of high-throughput data, or annotations in existing databases Ontology, as controlled structured vocabulary – Terms = functional concept (e.g. cell cycle, proteasome) – Relations between terms = is-a, part-of, regulates, occurs-in Logical inference from more specific to broader terms
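The "logical inference from more specific to broader terms" can be sketched as propagating each gene's direct annotations to all ancestor terms. The term hierarchy and gene names below are toy examples, and the sketch assumes only relations that are safe to propagate (such as is-a and part-of) are present in `parents`.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct, parents):
    """Extend each gene's direct terms with every ancestor term."""
    full = {}
    for gene, terms in direct.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


# Toy hierarchy (child -> parents) and annotations, for illustration only.
parents = {
    "DNA replication initiation": ["DNA replication"],
    "DNA replication": ["DNA metabolic process"],
}
direct = {"Gene.A": {"DNA replication initiation"}, "Gene.B": {"DNA replication"}}
print(propagate(direct, parents))
```

After propagation, the gene-set for a broad term contains every gene annotated to any of its more specific descendants, which is what enrichment tools typically expect.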
56
Module 8 bioinformatics.ca GO Ontologies Molecular Function – Biochemical activities, in-vitro binding specificities, etc. – Example: Ligase Activity, Kinase Activity, DNA Binding Cellular Component – Parts of the cell – Example: Mitochondrion, Spindle Microtubule Biological Process – Processes at the intra-cellular and organism level – Example: DNA Replication, Apoptosis, Development
57
Module 8 bioinformatics.ca [Figure: GO term hierarchy with CHILD and PARENT terms and example annotated genes (ABB1, ACAP3, TRAC1, LUC2, POF5, ZUMM, C5A75, DUCZ)]
58
Module 8 bioinformatics.ca GO Qualifiers and Evidence Codes Qualifiers Apply specific restrictions, or alter the logical meaning, of a gene product – GO term annotation Evidence Codes Document the evidence supporting a gene product – GO term annotation Evidence codes are not meant as statements of the quality of the annotation Depending on the application, it may be worth restricting to more stringent evidences
59
Module 8 bioinformatics.ca GO Qualifiers NOT NOT may be used with terms from any of the three ontologies. NOT is used to make an explicit note that the gene product is not associated with the GO term. NOT is used when a GO term might otherwise be expected to apply to a gene product, but an experiment, sequence analysis, etc. proves otherwise. It is not generally used for negative or inconclusive experimental results. colocalizes_with colocalizes_with may be used only with cellular component terms. Gene products that are transiently or peripherally associated with an organelle or complex may be annotated to the relevant cellular component term, using the colocalizes_with qualifier. This qualifier may also be used in cases where the resolution of an assay is not accurate enough to say that the gene product is a bona fide component member. contributes_to contributes_to may be used only with molecular function terms. As noted above, an individual gene product that is part of a complex can be annotated to terms that describe the function of the complex. http://geneontology.org/page/go-annotation-conventions#qual
60
Module 8 bioinformatics.ca GO Evidence Codes Use of an experimental evidence code in a GO annotation indicates that the cited paper displayed results from a physical characterization of a gene or gene product that has supported the association of a GO term. Inferred from Experiment (EXP) Inferred from Direct Assay (IDA) Inferred from Physical Interaction (IPI) Inferred from Mutant Phenotype (IMP) Inferred from Genetic Interaction (IGI) Inferred from Expression Pattern (IEP) Use of the computational analysis evidence codes indicates that the annotation is based on an in silico analysis of the gene sequence and/or other data as described in the cited reference. The evidence codes in this category also indicate a varying degree of curatorial input. The Computational Analysis evidence codes are: Inferred from Sequence or structural Similarity (ISS) Inferred from Sequence Orthology (ISO) Inferred from Sequence Alignment (ISA) Inferred from Sequence Model (ISM) Inferred from Genomic Context (IGC) Inferred from Biological aspect of Ancestor (IBA) Inferred from Biological aspect of Descendant (IBD) Inferred from Key Residues (IKR) Inferred from Rapid Divergence(IRD) Inferred from Reviewed Computational Analysis (RCA)
61
Module 8 bioinformatics.ca GO Evidence Codes Author statement codes indicate that the annotation was made on the basis of a statement made by the author(s) in the reference cited; in particular, while the original reference did not have experimental data, in case of TAS it referenced another publication with such data. The Author Statement evidence codes used by GO are: Traceable Author Statement (TAS) Non-traceable Author Statement (NAS) Use of the curatorial statement evidence codes indicates an annotation made on the basis of a curatorial judgement that does not fit into one of the other evidence code classifications. ND is used only for the root terms (Molecular function, Biological process, Cellular Component). The Curatorial Statement codes are: Inferred by Curator (IC) No biological Data available (ND) All of the above evidence codes are assigned by curators. However, GO also uses one evidence code that is assigned by automated methods, without curatorial judgement. The Automatically-Assigned evidence code is: Inferred from Electronic Annotation (IEA)
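Restricting annotations to more stringent evidence, and handling the NOT qualifier, can be sketched as a simple filter. The record layout `(gene, term, evidence, qualifier)` and the gene names are assumptions for illustration; real annotation files carry more fields.

```python
def filter_annotations(annotations, excluded_codes=("IEA",), drop_negated=True):
    """Keep annotations whose evidence code is allowed.

    Each annotation is a hypothetical (gene, term, evidence, qualifier)
    tuple. Annotations with the NOT qualifier state that the gene product
    is not associated with the term, so they are dropped by default.
    """
    kept = []
    for gene, term, evidence, qualifier in annotations:
        if evidence in excluded_codes:
            continue
        if drop_negated and qualifier == "NOT":
            continue
        kept.append((gene, term, evidence, qualifier))
    return kept


# Toy records, for illustration only.
annotations = [
    ("Gene.A", "apoptosis", "IDA", ""),
    ("Gene.B", "apoptosis", "IEA", ""),
    ("Gene.C", "apoptosis", "IMP", "NOT"),
    ("Gene.D", "spindle", "TAS", "colocalizes_with"),
]
print(filter_annotations(annotations))
```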
62
Module 8 bioinformatics.ca GO: Judith Blake’s 10 tips 1. Know the source of the GO annotations you are using – Are they up-to-date? – Did they filter out certain evidence codes? – Did they process qualifiers appropriately? 2. Understand the scope of annotations – You should be up to speed with the present material 3. Be aware of evidence codes 4. Probe completeness of GO annotations – The curators’ reach is limited, and the community needs to actively participate in identifying poor coverage of specific functions 5. Understand the GO structure – Remember that ontological relations can and should be used to infer annotations on top of primary annotations
63
Module 8 bioinformatics.ca GO: Judith Blake’s 10 tips 6. Choose analysis tools carefully – You should be up to speed with the present material 7. Carefully report GO annotations source and version in your research manuscript 8. Use the Gene Ontology consortium resources… – …website, FAQ, mailing lists and contact information 9. Help build a better GO… – …by submitting requests to add or modify terms 10. Cite The Gene Ontology Consortium papers
64
Module 8 bioinformatics.ca GO Statistics
Based on: R/bioconductor package org.Hs.eg.db 3.0.0 (Packaged: 2014-09-26)
GO Term N# | With IEA | Without IEA
Terms >= 1 g | 18,826 | 16,524
Terms >= 10 g | 7,156 | 5,692
Terms >= 50 g | 3,000 | 2,311
Gene N# | With IEA | Without IEA
Any term | 18,229 | 14,690
Terms <= 500 g | 15,194 | 11,923
Terms <= 100 g | 12,944 | 10,717
65
Module 8 bioinformatics.ca Gene-set Resources
Resource | Types | Species | Website | Updates
g:Profiler | Gene Ontology, pathways (KEGG, Reactome), phenotypes (HPO), Transcription Factor targets, miRNA Targets, Protein interaction modules | Many | http://biit.cs.ut.ee/gprofiler/ | frequent
BaderLab | Gene Ontology, Pathways (KEGG, Reactome, NCI, NetPath, other), Drug targets, Transcription Factor Targets, miRNA Targets, phenotypes (HPO) | Human, mouse, rat | http://download.baderlab.org/EM_Genesets/current_release/ | frequent
MSigDB | Gene Ontology, Pathways (KEGG, Reactome, Biocarta, others), Position, Transcription Factor targets, miRNA Targets, Gene Expression | Human, mouse, rat, Danio rerio, Macaca mulatta | http://www.broadinstitute.org/gsea/msigdb/index.jsp | Last updated May 2013
66
Module 8 bioinformatics.ca Gene-set Resources
Resource | Types | Species | Website | Updates
R/Bioconductor org.Xx.eg.db (e.g. org.Hs.eg.db) | Gene Ontology, Pathways (KEGG), Position, Protein Domains (PFAM) | Many major species (including yeast, human, mouse, rat) | http://www.bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html | frequent
67
Module 8 bioinformatics.ca Pathways
68
Module 8 bioinformatics.ca What are Pathways? Depict mechanistic details of metabolic, signaling and other biological processes Advantages: – Curated, accurate – Cause and effect captured – Human-interpretable visualizations Disadvantages: – More sparse coverage of genome than functional sets – More complex models are required to score pathways – Static model of dynamic systems
69
Module 8 bioinformatics.ca Global metabolic map (nodes: metabolites, links: reactions) http://www.genome.jp/kegg-bin/show_pathway?map01100
70
Module 8 bioinformatics.ca Signaling
72
Module 8 bioinformatics.ca Gene-set Enrichment Tests
73
Module 8 bioinformatics.ca [Figure: Activity Profiles / Somatic Mutations and Prior Knowledge about genes (GENE SETS, NETWORKS, PATHWAYS; e.g. Spindle, Apoptosis, Ca++ Channels, MAPK) are combined by Informatics (scoring models, search algorithms) into Activity Maps]
74
Module 8 bioinformatics.ca Typical Enrichment Test
[Figure: the Experiment provides the experimentally “positive” genes (e.g. UP-regulated) and the experimentally “detectable” genes (aka background set); together with the Gene-set Databases these feed the ENRICHMENT TEST, which outputs the Enrichment Table]
Enrichment Table:
Set | p-value
Spindle | 0.00001
Apoptosis | 0.00025
75
Module 8 bioinformatics.ca Don’t forget about gene-sets
[Figure: each row of the Enrichment Table (Spindle 0.00001, Apoptosis 0.00025) links back to the genes of its gene-set (e.g. FADD, TRADD, CYTC1, BAX, BAXL, CASP9, CASP10, …; SPP1, SPP2, CCCP, MTC1, …) and to the experimental data (e.g. gene expression table)]
76
Module 8 bioinformatics.ca Typical Enrichment Test The output of an enrichment test is a P-value. The P-value assesses the probability that, by randomly sampling the “detectable” genes, the overlap is at least as large as observed. [Figure: random samples of array genes] Most used statistical model: Fisher’s Exact Test. Fisher’s Exact Test does not require actually performing the random sampling; it is based on a theoretical null-hypothesis distribution (the Hypergeometric Distribution).
77
Module 8 bioinformatics.ca Fisher’s Exact Test
2 x 2 Contingency Table:
 | Exp_positive=yes | Exp_positive=no
Gene-Set=yes | a | b
Gene-Set=no | c | d
Probability of one table to occur by random sampling (Hypergeometric distribution formula), with n = a + b + c + d:
$p = \binom{a+b}{a} \binom{c+d}{c} / \binom{n}{a+c}$
Test p-value: sum of random sampling probabilities for tables as extreme or more extreme than the real table
http://en.wikipedia.org/wiki/Fisher's_exact_test
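The enrichment p-value can be computed directly from the hypergeometric distribution. The sketch below uses only the Python standard library and a one-sided definition of "more extreme" (overlap at least as large as observed), which is an assumption suited to testing over-representation; the function names are illustrative.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2 x 2 table with fixed margins.

    a: in gene-set and positive      b: in gene-set, not positive
    c: not in gene-set, positive     d: neither
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def enrichment_p_value(a, b, c, d):
    """One-sided p-value: probability of an overlap of at least `a`."""
    max_overlap = min(a + b, a + c)  # limited by gene-set size and positives
    total = 0.0
    for k in range(a, max_overlap + 1):
        shift = k - a  # move `shift` genes into the overlap, margins fixed
        total += table_probability(k, b - shift, c - shift, d + shift)
    return total


# Toy table: 4 genes in the set, 4 positive genes, 8 detectable genes.
print(enrichment_p_value(3, 1, 1, 3))
```

Here `a + b + c + d` is the size of the background ("detectable") set, so the choice of background directly changes the result.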
78
Module 8 bioinformatics.ca Importance of the Background Inappropriate modeling of the background will lead to incorrectly biased results – E.g.: kinase phosphorylation assay: only kinases can be detected Depending on the experiment, the background may be easy or difficult to define
79
Module 8 bioinformatics.ca Multiple Test Correction
80
Module 8 bioinformatics.ca From One to Many Tests Mental experiment: Perform N random draws and test enrichment Set a p-value significance threshold alpha (e.g. alpha = 0.01) N * alpha random draws will pass it
81
Module 8 bioinformatics.ca From One to Many Tests SET 1 SET 2 SET N Test N real gene-sets and test enrichment Set a p-value significance threshold alpha (e.g. alpha = 0.01) N * alpha gene-sets would pass it even if they were random draws I cannot rely only on a significance threshold on the nominal p-value
82
Module 8 bioinformatics.ca Benjamini-Hochberg (BH) FDR FDR (false discovery rate) is the expected proportion, among the tests passing the significance threshold, of tests that pass only due to random sampling. Benjamini-Hochberg (BH) FDR: for a given FDR q-value threshold alpha and m total tests ranked by increasing P-value, find the largest rank k such that: P-value(k) <= (k / m) * alpha, which is equivalent to alpha >= P-value(k) * m / k
83
Module 8 bioinformatics.ca Benjamini-Hochberg (BH) FDR
Rank | Category | P-value | P-value * m / k | FDR q-value
1 | Transcriptional regulation | 0.001 | 0.001 x 53/1 = 0.053 | 0.041
2 | Transcription factor | 0.002 | 0.002 x 53/2 = 0.053 | 0.041
3 | Initiation of transcription | 0.003 | 0.003 x 53/3 = 0.053 | 0.041
4 | Nuclear localization | 0.0031 | 0.0031 x 53/4 = 0.041 | 0.041
5 | Chromatin modification | 0.005 | 0.005 x 53/5 = 0.053 | 0.053
… | … | … | … | …
52 | Cytoplasmic localization | 0.985 | 0.985 x 53/52 = 1.004 | 0.99
53 | Translation | 0.99 | 0.99 x 53/53 = 0.99 | 0.99
In other words: (1) walk the list of tests from the most significant, (2) estimate how many tests would pass at each p-value if they were random draws, (3) compute the fraction of false positives, then transform to a monotonic q-value with 0 <= q-value <= 1 (each q-value is the minimum of P-value * m / k over its own and all larger ranks).
P-value threshold for FDR < 0.05: 0.0031. Ranks 1 to 4 are significant at FDR < 0.05; the remaining ranks are non-significant.
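The walk through the ranked list can be sketched in a few lines. This is a minimal standard-library implementation of the BH q-value computation described above; the function name is an assumption.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg q-values, returned in the input order.

    For the test at rank k (1 = smallest p-value) out of m tests, the raw
    value is p * m / k. Walking from the largest rank down and keeping a
    running minimum makes the q-values monotonic and caps them at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


# Toy p-values, for illustration only.
p_values = [0.01, 0.04, 0.03, 0.5]
for p, q in zip(p_values, bh_q_values(p_values)):
    print(p, round(q, 4), "significant" if q < 0.05 else "non-significant")
```

Note how the test with p = 0.03 inherits the smaller q-value computed at the next rank: that is the monotonic transformation in step (3).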
84
Module 8 bioinformatics.ca Tests for Specific Data Types
85
Module 8 bioinformatics.ca Challenges with Specific Data Types General problem: Genes do not have a uniform probability of displaying some genomic signal General solution: Modify the enrichment statistic and / or the construction of the null hypothesis distribution (e.g. by resampling)
86
Module 8 bioinformatics.ca RNA-seq Problem: genes with longer transcripts have more statistical power to be detected as significantly differentially expressed (given the same absolute difference in expression) Solution: modify Fisher’s Exact Test (using the Wallenius non-central hypergeometric distribution) to account for this bias
87
Module 8 bioinformatics.ca Binding Sites: GREAT Problem: genes do not have a uniform probability of receiving the mapping of a binding site Solution: use binomial test to assess deviation of fraction of binding sites within gene window from fraction of the genome represented by windows size Nature Biotechnology, May 2010
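The binomial test described here can be sketched with the standard library: given n binding sites in total, k of them falling inside the windows of a gene-set, and p the fraction of the genome covered by those windows, the p-value is the upper tail of the binomial distribution. This is an illustration of the statistic only, not GREAT's implementation, and the numbers are made up.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: 200 binding sites, 12 fall in the gene-set's
# windows, and those windows cover 2% of the genome (4 expected by chance).
print(binomial_upper_tail(12, 200, 0.02))
```

Because p reflects the genomic space assigned to the gene-set, genes with large windows are not favoured simply for being easier to hit.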
88
Module 8 bioinformatics.ca Somatic Mutations Problem: many mutations are passengers, probability of somatic mutation function of gene length and nucleotide context Solution: correct for these factors (e.g. testing genes using MutSigCV and then testing enrichment only for MutSigCV- significant genes)
89
Module 8 bioinformatics.ca Enrichment Results Visualization: Enrichment Map
90
Module 8 bioinformatics.ca Gene-set Enrichment Redundancy Problem Many redundant gene-sets – Gene Ontology has a very large number of gene-sets, often with slight differences – Different pathway databases have different yet overlapping definitions of pathways – Globally, it is useful to grasp the overlap relations between enriched gene-sets --> we need a visualization framework going beyond the enrichment table
91
Module 8 bioinformatics.ca Enrichment Map Visualization A B Overlap
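The overlap between two gene-sets A and B can be quantified, for example, with the Jaccard coefficient or the overlap coefficient, and gene-set pairs above a cut-off become edges of the network. The sketch below assumes these two measures and a cut-off of 0.5; the gene-set names and members are toy examples.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared genes divided by all genes in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a, b):
    """Shared genes divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def overlap_edges(gene_sets, cutoff=0.5, score=jaccard):
    """Pairs of gene-set names whose overlap score reaches the cut-off."""
    return [
        (name_a, name_b, score(gene_sets[name_a], gene_sets[name_b]))
        for name_a, name_b in combinations(sorted(gene_sets), 2)
        if score(gene_sets[name_a], gene_sets[name_b]) >= cutoff
    ]


# Toy enriched gene-sets, for illustration only.
gene_sets = {
    "Apoptosis": {"FADD", "TRADD", "BAX", "CASP9"},
    "Cell death": {"FADD", "TRADD", "BAX", "CASP9", "CASP10"},
    "Spindle": {"SPP1", "SPP2", "MTC1"},
}
print(overlap_edges(gene_sets))
```

The overlap coefficient is useful when one gene-set is nested in a much larger one, a situation that is common in Gene Ontology.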
93
Module 8 bioinformatics.ca Cytoscape Network Visualization
94
Module 8 bioinformatics.ca Network Representations How to visually interpret biological data using networks Merico D, Gfeller D, Bader GD Nature Biotechnology 2009 Oct 27
95
Module 8 bioinformatics.ca Cytoscape Cytoscape is a freely available, open-source, Java-based application Very popular in the community, provides key functionalities, extended by plugins (now called “apps” to be cool)
96
Module 8 bioinformatics.ca Key Ideas in Network Visualization Layout – Automatic layout algorithms are necessary to arrange a network in a way that suggests the existence of patterns to the human eye Node and Edge visual attributes – Can be used to map a number of information items relating to gene / proteins and their interactions / similarity
97
Module 8 bioinformatics.ca Layout: Before and After
98
Module 8 bioinformatics.ca Visual Attributes
99
Module 8 bioinformatics.ca We are on a Coffee Break & Networking Session