Canadian Bioinformatics Workshops www.bioinformatics.ca.

Presentation transcript:

Canadian Bioinformatics Workshops


Module 7 - Part 1: Annotation of Somatic Coding Variants (Annovar)
Informatics Facility, The Centre for Applied Genomics (TCAG), The Hospital for Sick Children

Learning Objectives of Module
I have detected somatic variants in a cancer sample.
What information can I use to interpret them?
What do different annotations mean?
How do missense effect prediction models work?
How to use an annotation tool: Annovar (LAB)

Introduction

Cancer Driver Discovery: Biological Knowledge vs Frequency
Small data-sets (1-10 subjects)
– Variant previously reported
– Gene function/disease phenotype + variant effect
Large data-sets (> 100 subjects)
– Variant recurrence
– Over-represented pathways/networks

Variant vs Gene Information
We have to consider information at two levels:
Gene
– Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation)
– Is the gene sensitive to perturbation? (e.g. haploinsufficiency)
Variant
– What is the variant effect on the gene product?

Passengers and Drivers: A Very Simplistic Model
Important Activator + Activating Variant → Cancer driver
Important Repressor + Loss-of-function Variant → Cancer driver
Redundant Gene (or controlling unrelated process) → No effect

Passengers and Drivers: A Very Simplistic Model (continued)
Important Repressor + Silent Variant → No effect
Important Repressor + Loss-of-function Variant → Cancer driver

Integrating Different Evidence
Variant Recurrence
Variant Gene Product Effect
Gene Product Function / Pathway

On Variant Size
Small: 1-50 bp
– SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect
– Small In/dels: a bit more challenging to detect
– Most available in databases; can be mapped by exact coordinates
Medium: 100-1,000 bp
– Insertions, Deletions, Translocations, Complex re-arrangements
– Most challenging to detect
– More tolerant mapping (e.g. partners of gene fusion)
Large: > 5 kbp
– Copy number variants: relatively easy to detect using arrays, more challenging using next generation sequencing
– More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
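
The "50% reciprocal overlap" rule mentioned for large variants can be sketched as follows. This is a minimal illustration, not code from the workshop: the function names, the half-open coordinate convention and the default threshold parameter are assumptions made for the example.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smaller of the two overlap fractions for half-open intervals [start, end)."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0  # intervals do not intersect
    # Fraction of A covered by B, and of B covered by A; report the smaller one
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def same_cnv(a, b, min_fraction=0.5):
    """Treat two (start, end) calls as the same event under a reciprocal-overlap rule."""
    return reciprocal_overlap(a[0], a[1], b[0], b[1]) >= min_fraction
```

Requiring the overlap to be large relative to both intervals prevents a small call from matching a much larger one that merely contains it.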

SNV and In/Del Annotation

Variant Annotation Components
A. Variant database mapping
   A. Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CGI-46)
   B. dbSNP (sequence variation database)
   C. COSMIC (somatic variant database)
B. Gene mapping (coding/splicing, UTR, intergenic)
C. Gene product effect type (e.g. loss of function, missense)
D. Effect Scoring
   A. PhyloP (conservation)
   B. CADD
E. Coding Missense Effect Scoring
   A. SIFT
   B. PolyPhen2
   C. MutationAssessor


Allele Frequency Databases
Use: identify “suspect” somatic variants that show up as germline variants in these databases
1000 Genomes
NHLBI-ESP
CGI-46 / CGI-69
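
One way to apply this idea is to flag somatic calls whose population allele frequency exceeds a chosen threshold. The sketch below is illustrative only: the variant keys, the dictionary layout, the 1% default threshold and all example values are invented for the illustration and do not come from any of the databases.

```python
def flag_suspect_somatic(calls, population_af, max_af=0.01):
    """Split somatic calls into (kept, suspect) using the highest population frequency.

    calls: list of hashable variant keys, e.g. (chrom, pos, ref, alt)
    population_af: dict mapping variant key -> {dataset name: allele frequency}
    """
    kept, suspect = [], []
    for variant in calls:
        # Variants absent from every reference data-set get a frequency of 0
        af = max(population_af.get(variant, {}).values(), default=0.0)
        (suspect if af > max_af else kept).append(variant)
    return kept, suspect


# Invented example data, for illustration only
calls = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")]
population_af = {
    calls[0]: {"1000G": 0.12, "ESP": 0.10},  # common germline variant
    calls[1]: {"1000G": 0.0005},             # rare in the population
}
kept, suspect = flag_suspect_somatic(calls, population_af)
```

Taking the maximum across data-sets is a conservative choice: a variant common in any one reference population is treated as suspect.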

1000 Genomes
Goal:
– Identify all variants at > 1% frequency in represented human populations
Subjects:
– 1092 with available variants
– 2500 at project completion
Launch date: 2007
– Many revisions (e.g. increase coverage)

1000 Genomes
Phase 1: variants available (*)
– 1092 apparently healthy subjects
– Ethnicities:
  Represented: European, Black African, East Asian, Mixed Americans
  Missing: South-east Asians, Indians, Middle-east, North Africans
– 38.2 M SNPs, 3.9 M In/Dels
– Platform: Illumina + SOLiD
  Low coverage (2-4x) whole genome
  Exon (50x coverage)
– Variant calling: multiple methods including GATK Unified Genotyper
Phase 2: variant calling on-going
Phase 3: alignments just made available
* version: 30 April 2012, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/…/README.phase1_integrated_release_version3_…

NHLBI-ESP
Goal:
– Discover heart, lung and blood disorder variants at frequency < 1%
Subjects: 6503 (ESP 6500 release)
– Not necessarily healthy (includes individuals with extreme subclinical traits and diseased)
– Ethnicities: 2203 African-Americans, 4300 European-Americans
Platform: Illumina, exome sequencing (average 110x)
Variant calling:
– SNV: glfMultiples + ad-hoc quality filtering
– In/Dels: GATK Unified Genotyper

CG-46 and CG-69
Goal: variation in controls
Subjects:
– CG-46: 46 unrelated (recommended for allele frequencies)
– CG-69: CG-46 + two trios + extended CEU pedigree
Ethnicities: European, Black African, East Asian, Indian, Mexican
Platform: Complete Genomics (whole genome, 80x)
Variant calling: Complete Genomics pipeline

Allele Frequency Databases: Take Home Messages
Different ethnic compositions
Whole genome / exome
Different platforms and (diploid) variant callers
Different sequencing depth
Different power for variant detection at different frequencies
Different number of subjects
Different capability to generalize across populations
→ Data-sets are complementary
Constant updates, keep yourself updated!
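
The point about "different power for variant detection at different frequencies" can be made concrete with a simple sampling calculation. As a rough sketch, assume every subject contributes two independently drawn chromosomes and that any allele present in the sample is detected; this ignores sequencing depth and variant-caller sensitivity, so it is an upper bound. Under that assumption, the probability of seeing an allele of population frequency $f$ at least once among $N$ subjects is $1 - (1 - f)^{2N}$. The function name is an assumption made for the example.

```python
def prob_allele_sampled(allele_freq, n_subjects):
    """P(at least one copy among 2*n_subjects chromosomes), assuming independent draws.

    Ignores sequencing depth and variant-caller sensitivity (upper bound on detection).
    """
    return 1.0 - (1.0 - allele_freq) ** (2 * n_subjects)


# Cohort sizes quoted in the slides: CG-46, 1000 Genomes phase 1, ESP 6500
for n in (46, 1092, 6503):
    print(n, round(prob_allele_sampled(0.01, n), 3), round(prob_allele_sampled(0.0001, n), 3))
```

Under this model a 1% allele is almost certain to be sampled in the larger cohorts but is missed fairly often among 46 subjects, which is one reason the data-sets are complementary.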


dbSNP
Broad scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar)
– Submissions before and after NGS era
– Includes polymorphisms found in general population
– Includes rare germline variants that are disease-associated (or suspected to be)
– Includes somatic variants (also in COSMIC)
→ Good to look up variants
→ If you want to use it as a filter, make sure you first remove “clinically flagged” variants (somatic, germline)
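
The filtering advice above can be read as: build the filter set only from dbSNP records without a clinical flag, so that known disease-associated or somatic entries are not discarded. The sketch below follows that reading. The record layout, the field names (`id`, `clinically_flagged`, `dbsnp_id`) and the identifiers are hypothetical; real dbSNP exports are structured differently.

```python
def dbsnp_filter_set(dbsnp_records):
    """IDs safe to use as a 'known polymorphism' filter: records without a clinical flag."""
    return {rec["id"] for rec in dbsnp_records if not rec["clinically_flagged"]}


def apply_filter(candidates, filter_ids):
    """Keep candidates that are novel or that match a clinically flagged dbSNP record."""
    return [v for v in candidates if v["dbsnp_id"] not in filter_ids]


# Hypothetical records and identifiers, for illustration only
dbsnp_records = [
    {"id": "rsCOMMON", "clinically_flagged": False},
    {"id": "rsDRIVER", "clinically_flagged": True},
]
candidates = [
    {"name": "v1", "dbsnp_id": "rsCOMMON"},
    {"name": "v2", "dbsnp_id": "rsDRIVER"},
    {"name": "v3", "dbsnp_id": None},
]
retained = apply_filter(candidates, dbsnp_filter_set(dbsnp_records))
```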

COSMIC
“Catalogue Of Somatic Mutations In Cancer”
Reference database for somatic variation in cancer
Worth following up variants matching COSMIC entries
– How many studies/samples was it found in? One, or many?
– Does the variant overlap a hotspot?
– Is the gene frequently mutated?

Looking up a well-established driver mutation in dbSNP and COSMIC (BRAF V600E)

BRAF V600E: dbSNP record (screenshots; rs ID missing in source)
dbSNP Clinical view:
– V600E (T>A): somatic, pathogenic
– V600A (T>C): somatic / germline, pathogenicity untested

BRAF V600E: from dbSNP to OMIM (screenshots)


Gene Mapping: Types of Genes
Types of genes:
– Protein-coding genes
– Non-protein-coding RNA genes (e.g. miRNA)
Different functional relevance
Different knowledge of variant effects

Gene Mapping: Parts of Genes
Protein-coding genes have these parts:
– UTR (transcribed, not translated)
– Coding exons (translated)
– Introns (spliced out, not translated)
– Splice sites
Also:
– Upstream, downstream of the transcribed gene
– Inter-genic

Gene Mapping: Annovar’s priority system
Gene types and parts: what if they overlap?
Whenever more than one mapping is possible, Annovar will follow this priority system.

[Figure: protein-coding gene G1 with its TSS (Transcription Start Site) and a non-coding RNA ncR1 (e.g. miRNA). Region labels: G1 Upstream, G1 UTR 5’, G1 Exonic, G1 Intronic, G1 Exonic, G1 Intronic, G1 Exonic, G1 Intronic, ncR1, G1 Exonic, G1 UTR 3’, G1 Downstream, ncR1 Downstream, G1 Splicing. Splice sites after the first were omitted to avoid clutter.]
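
A priority system of this kind can be sketched as a rank lookup. The ranks below follow the precedence that, to my understanding, the ANNOVAR documentation describes (exonic = splicing > ncRNA > UTR5 = UTR3 > intronic > upstream = downstream > intergenic); treat the exact order and the tie handling as an assumption to verify against the documentation for your version. The function and variable names are invented for the example.

```python
# Lower rank wins; equal ranks are reported together (assumed precedence, see above)
RANK = {
    "exonic": 0, "splicing": 0,
    "ncRNA": 1,
    "UTR5": 2, "UTR3": 2,
    "intronic": 3,
    "upstream": 4, "downstream": 4,
    "intergenic": 5,
}


def resolve(annotations):
    """Return the highest-priority annotation(s) among all possible mappings of a variant."""
    best = min(RANK[a] for a in annotations)  # raises ValueError on empty input
    winners = [a for a in RANK if a in annotations and RANK[a] == best]
    return ";".join(winners)
```

For a variant that is in the 5' UTR of one isoform and intronic in another, `resolve({"UTR5", "intronic"})` gives `"UTR5"`.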

Example of Annovar Output
(Columns: Chr, Start, End, Ref, Alt, Func.refGene, Gene.refGene; the Chr, Start and End values are missing in the source.)

Ref  Alt  Func.refGene  Gene.refGene
G    A    intergenic    FAM41C(dist=35435),LOC (dist=5336)
C    G    intronic      NOC2L
A    C    UTR3          AGRN
G    C    upstream      CCDC
G    A    UTR5          KCNAB
C    T    exonic        ESPN
G    A    splicing      GORASP2(NM_ :exon7:c.496-1G>A,NM_015530:exon7:c.700-1G>A)


Example of Annovar Output: the UTR5 row
More than one KCNAB2 isoform is present.
Annovar reported the UTR5 and not the intron, following the priority rules.


Example of Annovar Output: the splicing row
The AG splicing acceptor intronic sequence becomes AA.
This happens for both GORASP2 transcript isoforms.
What will happen at the functional level? Frameshift splicing?
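
To work with a Gene.refGene value like the GORASP2 one programmatically, the gene name and the per-transcript changes can be pulled apart with a small parser. This is a sketch, not part of Annovar: the function names and regular expressions are assumptions based only on the string format shown in the slide, and the example uses only the transcript entry that is complete in the source (NM_015530).

```python
import re

# transcript:exonN:c.change entries, separated by commas inside the parentheses
ENTRY = re.compile(r"(?P<transcript>[^:,()]+):(?P<exon>exon\d+):(?P<change>c\.[^,()]+)")
# A substitution 1 or 2 bp upstream of an exon start, i.e. in the AG acceptor dinucleotide
ACCEPTOR = re.compile(r"c\.\d+-([12])[ACGT]>[ACGT]")


def parse_gene_field(field):
    """Split 'GENE(tx:exon:change,...)' into (gene, list of entry dicts)."""
    gene, _, rest = field.partition("(")
    return gene, [m.groupdict() for m in ENTRY.finditer(rest)]


def hits_canonical_acceptor(change):
    """True if the coding change is a substitution at intronic position -1 or -2."""
    return ACCEPTOR.fullmatch(change) is not None


gene, entries = parse_gene_field("GORASP2(NM_015530:exon7:c.700-1G>A)")
```

Here `c.700-1G>A` denotes the last intronic base before the exon that starts at coding position 700, which is why the AG acceptor becomes AA.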

Splice Sites and Annovar
Annovar considers a +/- 2 bp window around the intron/exon junction and reports the following splicing categories:
– Splicing: 2 bp intronic
– Splicing;exonic: 2 bp exonic
General things to keep in mind:
– The intronic site is much more biologically relevant
– Other sequence features outside the +/- 2 bp splice site window may be important for guiding splicing
→ Splicing variants always need to be manually reviewed
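
The +/- 2 bp window described above can be illustrated with a toy classifier for a single internal exon. This is a sketch following the slide's description, not Annovar's implementation: the function name, the 1-based inclusive coordinates and the returned labels are assumptions, and both exon ends are treated as splice junctions (which is not true for the outer ends of first and last exons).

```python
def splice_category(pos, exon_start, exon_end, window=2):
    """Classify a position relative to one internal exon (1-based, inclusive coordinates)."""
    if exon_start <= pos <= exon_end:
        # Exonic: within `window` bases of either exon boundary?
        near_junction = min(pos - exon_start, exon_end - pos) < window
        return "splicing;exonic" if near_junction else "exonic"
    # Intronic: distance to the nearest exon boundary
    distance = exon_start - pos if pos < exon_start else pos - exon_end
    return "splicing" if distance <= window else "intronic"
```

With an exon spanning 100 to 200, positions 98 and 99 fall in the intronic splicing window, 100 and 101 in the exonic one, and 97 or 102 outside it.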

[Figure: the AG splicing acceptor (intronic) is very well conserved across 46 vertebrates in UCSC]

Gene Mapping: Database
Goal: map our variants to (coding and non-coding) genes
RefSeq is the suggested database for transcribed gene and coding sequence definition
– In the lab we will use Annovar with the RefSeq database
Other databases available: UCSC known genes, Ensembl

Other Annotation Features
Additional annotation features can be used, for instance:
Protein Domains
ENCODE Profiles
– Histone marks
– DNA methylation
– DNase hypersensitivity (proxy of binding sites)
– Etc.


Gene Product Effect
Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed)
Protein-coding sequences: how is the protein sequence affected?
→ Definitely easier to chase after protein effects
→ But don’t forget that other gene products exist…

Gene Product Effect: Protein-coding
Stop-gain SNV: adds a STOP codon → truncated protein
Frameshift In/Del: shifts the reading frame → protein translated incorrectly from that point
Splicing: alters key sites guiding splicing
In-frame In/Del: removes/adds one or more amino acids
Stop-loss: loss of STOP codon → extra piece in the protein
Missense SNV: modifies one amino acid
Synonymous: no amino acid change
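
The codon-level categories above can be illustrated with the standard genetic code. The sketch below classifies a single-codon substitution and tells frameshift from in-frame indels by length; the function names are assumptions, splicing effects are not modeled, and real annotators work on full transcripts rather than isolated codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snv_effect(ref_codon, alt_codon):
    """Classify a substitution from the reference and altered codon (coding strand)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


def indel_effect(ref, alt):
    """Classify an insertion/deletion by whether the length change is a multiple of 3."""
    return "in-frame indel" if (len(alt) - len(ref)) % 3 == 0 else "frameshift indel"
```

For instance, changing the valine codon GTG to GAG (glutamate), as in BRAF V600E, is classified as missense.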

Loss of Function (LOF) Variants
Definition: Stop-gain, Frameshift, Splicing
These are the more disruptive, BUT:
– What percentage of the protein is affected?
– Are there multiple transcript isoforms?
– Splicing effect difficult to predict (cryptic splice sites)
– Frameshift can be rescued by another frameshift or bypassed by splicing
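
The first question, how much of the protein is affected, reduces to simple arithmetic once the position of the first altered residue is known. A minimal sketch, with a hypothetical function name and made-up protein lengths in the examples:

```python
def fraction_affected(first_affected_residue, protein_length):
    """Fraction of the protein from the first affected residue (1-based) to the C-terminus."""
    if not 1 <= first_affected_residue <= protein_length:
        raise ValueError("residue position must lie within the protein")
    return (protein_length - first_affected_residue + 1) / protein_length
```

A stop-gain at residue 101 of a 200-residue protein removes half of it (`fraction_affected(101, 200)` is 0.5), while one at the final residue removes 0.5%, which is one reason not all LOF calls are equally disruptive.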

Missense Variants: Tell Me More…
How do we tell if a missense variant alters protein function?
– Type of amino acid change (amino acid groups)
– Conservation across species
– Conserved protein domain
– Secondary protein structure
– Tertiary (3D) protein structure + simulation
– Other functional features (e.g. phosphosite)
– Machine learning model tying all of these together (what training set?)

Missense Example: Back to BRAF
– BRAF V600E (T>A): somatic, pathogenic
– BRAF V600A (T>C): somatic / germline, pathogenicity untested

Synonymous Variants Are Not Always Silent
Does no amino acid change equal no functional effect? Often, but not always!
– Codon usage
– Cryptic regulatory sequences (e.g. splicing enhancers)
– Strong conservation can be suggestive
– No broadly used tool to handle these (but stay tuned)

Conservation and Missense Variant Scoring Models

Conservation
Conservation is a powerful and broadly used idea
How conserved is a given nucleotide or genomic interval, comparing different species to human?
How conserved is an amino acid in a protein sequence?
Available from UCSC (nucleotide conservation):
– PhyloP score: useful to assess single variants
– PhastCons score/element: useful to assess putative regulatory regions and genes not coding proteins
– Multi-species alignment: generally useful

[Figure] Look for coding exons, UTRs and the third nucleotide within codons

PhyloP
PhyloP: test to detect if nucleotide substitution rates are faster or slower than expected under neutral drift
– Only where aligned sequence is available!
PhyloP score
– Positive: conserved (e.g. PhyloP > 2)
– Zero: neutral
– Negative: more diverged than neutral
Species group:
– All vertebrates
– Only placental mammals
– Only primates
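
The sign convention of the PhyloP score can be captured in a small helper. This is only an illustrative sketch: the function name and the labels are invented, the cut-off of 2 is the example value from the slide rather than an official threshold, and `None` stands for positions without aligned sequence.

```python
def interpret_phylop(score, conserved_cutoff=2.0):
    """Map a PhyloP score to a coarse label (cut-off is the slide's example value)."""
    if score is None:
        return "no alignment"  # PhyloP is only defined where sequence aligns
    if score > conserved_cutoff:
        return "conserved"
    if score < 0:
        return "more diverged than neutral"
    return "near neutral"
```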

Conservation
Main caveat:
– Conservation at a given position does not directly tell you what the effect of your variant is, only whether the position is important!

Missense Variant Effect: Scoring Models Overview
Criteria to keep in mind:
What features are used?
– Nucleotide / amino acid conservation
– Amino acid physicochemical properties
Direct scoring versus machine learning
– Machine learning models are heavily dependent on the training set used
What data-set was used for assessment / learning / optimization?
– E.g. activating / gain-of-function versus inactivating / loss-of-function mutations
– E.g. Mendelian disorders (predominantly loss-of-function) versus cancer (some effects are unique to cancer, e.g. drug resistance)

SIFT
Broadly used, relatively old (first published: 2001)
Designed for deleterious mutations (i.e. disruptive of protein function)
Based solely on protein sequence (amino acid) conservation
1. Start from query protein sequence
2. Identify similar protein sequences (PSI-BLAST)
3. Multiple alignment of protein sequences (orthologs and paralogs)
4. Amino acid x residue probability matrix (PSSM)
5. For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of frequency rank * frequency)
→ Score: probability of observing the amino acid, normalized by residue conservation
→ Cut-off: 0.05 (based on case studies)
Reference: Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001 May;11(5).
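
The idea behind the score can be illustrated on a single alignment column. The toy below is not SIFT: it skips PSI-BLAST, sequence weighting and the diversity-based reweighting of step 5, and simply divides each amino acid's observed frequency by that of the most frequent amino acid, then applies the 0.05 cut-off. All names and the example column are assumptions made for the illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_probabilities(column, pseudocount=0.0):
    """Per-amino-acid frequency in one alignment column, scaled so the most frequent is 1."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)  # gaps are ignored
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    if total == 0:
        raise ValueError("column contains no amino acids")
    probs = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}


def predict(column, alt_aa, cutoff=0.05):
    """Toy call: substitutions rarely or never seen at the position are 'deleterious'."""
    return "deleterious" if normalized_probabilities(column)[alt_aa] < cutoff else "tolerated"


column = "VVVVVVVVVI"  # made-up column: valine in 9 homologs, isoleucine in 1
```

In this column a change to isoleucine scores about 0.11 and is tolerated, while a change to glutamate, never observed at the position, scores 0 and falls below the cut-off.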

PolyPhen2
Integrates multiple features
– 8 sequence-based, 3 structure-based (nucleotide and amino acid level) (e.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics)
Machine learning method (Naïve Bayes) → requires training set
– Set 1: HumDiv
  – Positive: damaging alleles for known Mendelian disorders (UniProt)
  – Negative: nondamaging differences between human proteins and related mammalian homologs
  – Performance, 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%)
– Set 2: HumVar
  – Positive: all human disease-causing mutations (UniProt)
  – Negative: non-synonymous SNPs without disease association
→ Richer model than SIFT
→ More biased towards training set(s) than SIFT
Reference: Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat Methods. Apr;7(4):248-9.

MutationAssessor
Direct / theoretical model (no machine learning)
Based on amino acid conservation, also specifically modeling conservation unique to protein subfamilies (can be regarded as an enhanced SIFT)
Entropy-based score based on protein sequence alignment
Performs well for (recurrent) somatic variants
Reference: Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. Sep 1;39(17):e118.
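
As background for "entropy-based score", the Shannon entropy of an alignment column measures how variable a position is: 0 bits for a fully conserved position, higher for more diverse ones. The sketch below computes only this plain column entropy; it is not the MutationAssessor score, which additionally models subfamily-specific conservation. The function name and example columns are assumptions.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues observed in one alignment column."""
    counts = Counter(column)
    n = len(column)
    # Sum of p * log2(1/p) over observed residues; empty columns give 0.0
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

A column such as "VVVV" has entropy 0, "AVAV" has 1 bit, and a column in which all 20 amino acids are equally frequent reaches the maximum of log2(20), about 4.32 bits.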


CADD
Intended as a measure of “deleteriousness” for coding and non-coding sequence, not biased to known disease variation
– However, does not model gene-specific constraint in detail
Machine learning model (Linear SVM)
– Negative training set: nearly fixed human alleles, variant if compared to inferred human-chimp ancestral genome
– Positive training set: simulated variants based on a mutation model aware of sequence context and primate substitution rates
– Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, ENCODE tracks → includes missense predictions and nucleotide-level conservation
– Performance assessment: using pathogenic variants from ClinVar, performs a bit better than PhyloP for all sites and than PolyPhen/SIFT for missense coding variants
Reference: Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. Mar;46(3):310-5.

[Figure: CADD, pathogenic ClinVar variants vs NHLBI-ESP variants at > 5% frequency]

Effect Scoring: Concluding Remarks
1. Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can be additionally inspected
2. Missense scoring models are powerful, but their strengths and weaknesses need to be understood
3. Variants should always be reviewed putting all information in context
– Consider conservation and effect scores using different models
– Review the amino acid change and sequence context
– Look for clusters of somatic variants and protein domains
– Don’t forget gene-level information!

We are on a Coffee Break & Networking Session