Canadian Bioinformatics Workshops www.bioinformatics.ca.


Module 7 - Part 1: Annotation of Somatic Coding Variants
Annovar
Informatics Facility, The Centre for Applied Genomics (TCAG), The Hospital for Sick Children

Module bioinformatics.ca Learning Objectives of Module
I have detected somatic variants in a cancer sample.
- What information can I use to interpret them?
- What do different annotations mean?
- How do missense effect prediction models work?
- How to use an annotation tool: Annovar (LAB)

Module bioinformatics.ca Introduction

Module bioinformatics.ca Cancer Driver Discovery: Biological Knowledge vs Frequency
Small data-sets (1-10 subjects)
- Variant previously reported
- Gene function/disease phenotype + variant effect
Large data-sets (> 100 subjects)
- Variant recurrence
- Over-represented pathways/networks

Module bioinformatics.ca Variant vs Gene Information
We have to consider information at two levels:
Gene
- Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation)
- Is the gene sensitive to perturbation? (e.g. haploinsufficiency)
Variant
- What is the variant effect on the gene product?

Module bioinformatics.ca Passengers and Drivers: A Very Simplistic Model
- Important activator + activating variant → cancer driver
- Important repressor + loss-of-function variant → cancer driver
- Redundant gene (or gene controlling an unrelated process) → no effect

Module bioinformatics.ca Passengers and Drivers: A Very Simplistic Model
- Important repressor + silent variant → no effect
- Important repressor + loss-of-function variant → cancer driver

Module bioinformatics.ca Integrating Different Evidence
- Variant recurrence
- Variant gene product effect
- Gene product function / pathway

Module bioinformatics.ca On Variant Size
Small: 1-50 bp
- SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect
- Small In/dels: a bit more challenging to detect
- Most available in databases; can be mapped by exact coordinates
Medium: 100-1,000 bp
- Insertions, Deletions, Translocations, Complex re-arrangements
- Most challenging to detect
- More tolerant mapping (e.g. partners of gene fusion)
Large: > 5 kbp
- Copy number variants: relatively easy to detect using arrays, more challenging using next generation sequencing
- More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
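The "50% reciprocal overlap" rule for matching large variants can be sketched as follows. This is a minimal illustration, not the matching code of any particular tool; the function names, the (chrom, start, end) tuple layout, and the 1-based inclusive coordinates are assumptions made for the example.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Return the smaller of the two overlap fractions for two intervals.

    Coordinates are assumed to be 1-based and inclusive.
    """
    overlap = min(end_a, end_b) - max(start_a, start_b) + 1
    if overlap <= 0:
        return 0.0
    len_a = end_a - start_a + 1
    len_b = end_b - start_b + 1
    # Both intervals must be covered, so the weaker fraction decides.
    return min(overlap / len_a, overlap / len_b)


def same_cnv(cnv_a, cnv_b, min_fraction=0.5):
    """Treat two (chrom, start, end) CNVs as matching under reciprocal overlap."""
    if cnv_a[0] != cnv_b[0]:
        return False
    return reciprocal_overlap(cnv_a[1], cnv_a[2], cnv_b[1], cnv_b[2]) >= min_fraction
```

Taking the minimum of the two fractions is what makes the rule reciprocal: a small CNV fully contained in a much larger one does not count as a match.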

Module bioinformatics.ca SNV and In/Del Annotation

Module bioinformatics.ca Variant Annotation Components
A. Variant database mapping
   1. Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CGI-46)
   2. dbSNP (sequence variation database)
   3. COSMIC (somatic variant database)
B. Gene mapping (coding/splicing, UTR, intergenic)
C. Gene product effect type (e.g. loss of function, missense)
D. Effect Scoring
   1. PhyloP (conservation)
   2. CADD
E. Coding Missense Effect Scoring
   1. SIFT
   2. PolyPhen2
   3. MutationAssessor

Module bioinformatics.ca Allele Frequency Databases
Use: identify “suspect” somatic variants that show up as germline variants in these databases
- 1000 Genomes
- NHLBI-ESP
- CGI-46 / CGI-69
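A somatic call that also appears as a germline variant in a reference data-set can be flagged with a simple frequency check. The sketch below is only illustrative: the dictionary keys ("1000g", "esp6500", "cg46") and the 1% threshold are assumptions chosen for the example, not field names or defaults from any tool.

```python
def max_population_frequency(variant, frequency_fields=("1000g", "esp6500", "cg46")):
    """Return the highest allele frequency reported across the reference data-sets.

    `variant` is a dict; a missing key or a None value means "not observed".
    """
    observed = [variant.get(field) for field in frequency_fields]
    observed = [freq for freq in observed if freq is not None]
    return max(observed) if observed else 0.0


def is_suspect_somatic(variant, threshold=0.01):
    """Flag a somatic call that looks like a germline polymorphism."""
    return max_population_frequency(variant) >= threshold
```

Using the maximum across data-sets reflects the point that the databases are complementary: a variant common in one population may be absent from a data-set that does not sample it.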

Module bioinformatics.ca 1000 Genomes
Goal:
- Identify all variants at > 1% frequency in represented human populations
Subjects:
- 1092 with available variants
- 2500 at project completion
Launch date: 2007
- Many revisions (e.g. increase coverage)

Module bioinformatics.ca 1000 Genomes
Phase 1: variants available (*)
- 1092 apparently healthy subjects
- Ethnicities:
  - Represented: European, Black African, East Asian, Mixed Americans
  - Missing: South-east Asians, Indians, Middle-east, North Africans
- 38.2 M SNPs, 3.9 M In/Dels
- Platform: Illumina + SOLiD
  - Low coverage (2-4x) whole genome
  - Exon (50x coverage)
- Variant calling: multiple methods including GATK Unified Genotyper
Phase 2: variant calling on-going
Phase 3: alignments just made available
* version: 30 April 2012, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/README.phase1_integrated_release_version3_20120430

Module bioinformatics.ca NHLBI-ESP
Goal:
- Discover heart, lung and blood disorder variants at frequency < 1%
Subjects: 6503 (ESP 6500 release)
- Not necessarily healthy (includes individuals with extreme subclinical traits and diseased)
- Ethnicities: 2203 African-Americans, 4300 European-Americans
Platform: Illumina, exome sequencing (average 110x)
Variant calling:
- SNV: glfMultiples + ad-hoc quality filtering
- In/Dels: GATK Unified Genotyper

Module bioinformatics.ca CG-46 and CG-69
Goal: variation in controls
Subjects:
- CG-46: 46 unrelated (recommended for allele frequencies)
- CG-69: CG-46 + two trios + extended CEU pedigree
Ethnicities: European, Black African, East Asian, Indian, Mexican
Platform: Complete Genomics (whole genome, 80x)
Variant calling: Complete Genomics pipeline

Module bioinformatics.ca CG-46

Module bioinformatics.ca Allele Frequency Databases: Take Home Messages
- Different ethnic compositions
- Whole genome / exome
- Different platforms and (diploid) variant callers
- Different sequencing depth
  - Different power for variant detection at different frequencies
- Different number of subjects
  - Different capability to generalize across population
→ Data-sets are complementary
- Constant updates, keep yourself updated!

Module bioinformatics.ca Variant Annotation Components
A. Variant database mapping
   1. Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CGI-46)
   2. dbSNP (sequence variation database)
   3. COSMIC (somatic variant database)
B. Gene mapping (coding/splicing, UTR, intergenic)
C. Gene product effect type (e.g. loss of function, missense)
D. Conservation
   1. PhyloP
E. Missense Effect Scoring
   1. SIFT
   2. PolyPhen2
   3. MutationAssessor

Module bioinformatics.ca dbSNP
Broad-scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar)
- Submissions before and after NGS era
- Includes polymorphisms found in the general population
- Includes rare germline variants that are disease-associated (or suspected to be)
- Includes somatic variants (also in COSMIC)
→ Good to look up variants
→ If you want to use it as a filter, make sure you first remove “clinically flagged” variants (somatic, germline) from the filter set
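The filtering advice above can be sketched as a small function. This is a hedged illustration only: the "rsid" and "clinically_flagged" keys are hypothetical field names for the example, and "rs0000001" in the usage note is a made-up placeholder identifier.

```python
def filter_by_dbsnp(variants):
    """Drop variants found in dbSNP unless they are clinically flagged.

    Each variant is a dict with an optional "rsid" (None if not in dbSNP)
    and an optional boolean "clinically_flagged".
    """
    kept = []
    for variant in variants:
        in_dbsnp = variant.get("rsid") is not None
        flagged = variant.get("clinically_flagged", False)
        if in_dbsnp and not flagged:
            continue  # likely a known polymorphism: filter it out
        kept.append(variant)
    return kept
```

For instance, a call annotated with rs113488022 (BRAF V600E) and marked as clinically flagged would be kept, while a call annotated with the placeholder "rs0000001" and no flag would be removed.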

Module bioinformatics.ca COSMIC
“Catalogue Of Somatic Mutations In Cancer”
Reference database for somatic variation in cancer
Worth following up variants matching COSMIC entries
- How many studies/samples was it found in? One, or many?
- Does the variant overlap a hotspot?
- Is the gene frequently mutated?

Module bioinformatics.ca Looking up a well-established driver mutation in dbSNP and COSMIC (BRAF V600E)

Module bioinformatics.ca BRAF V600E: rs113488022 dbSNP

Module bioinformatics.ca BRAF V600E: rs113488022 dbSNP Clinical view
- V600E (T>A): somatic, pathogenic
- V600A (T>C): somatic / germline, pathogenicity untested

Module bioinformatics.ca BRAF V600E: rs113488022 From dbSNP to OMIM

Module bioinformatics.ca BRAF V600E

Module bioinformatics.ca Gene Mapping: Types of Genes
Types of genes:
- Protein-coding genes
- Non-protein-coding RNA genes (e.g. miRNA)
These differ in:
- Functional relevance
- Knowledge of variant effects

Module bioinformatics.ca Gene Mapping: Parts of Genes
Protein-coding genes have these parts:
- UTR (transcribed, not translated)
- Coding exons (translated)
- Introns (spliced out, not translated)
- Splice sites
Also:
- Upstream / downstream of the transcribed gene
- Inter-genic

Module bioinformatics.ca Gene Mapping: Annovar’s priority system
Gene types and parts: what if they overlap?
Whenever more than one mapping is possible, Annovar will follow this priority system

Module bioinformatics.ca Gene Mapping: Annovar’s priority system
[Figure: gene model showing Protein Coding Gene G1, Non-coding RNA ncR1 (e.g. miRNA), and the TSS (Transcription Start Site) of G1]

Module bioinformatics.ca Gene Mapping: Annovar’s priority system
[Figure: the same gene model with each region labelled by the mapping Annovar reports. Region labels shown: G1 Upstream; G1 UTR 5’; G1 Exonic; G1 Intronic; G1 Exonic; G1 Intronic; G1 Exonic; G1 Intronic; ncR1; G1 Exonic; G1 UTR 3’; G1 Downstream; ncR1 Downstream; G1 Splicing **]
** Splice sites after the first were omitted to avoid clutter
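A priority system of this kind can be sketched as an ordered list of tiers. The slides do not spell out the order; the tiers below follow the precedence described in the Annovar documentation as best recalled (exonic = splicing > ncRNA > UTR5 = UTR3 > intronic > upstream = downstream > intergenic), so treat the exact order as an assumption and check it against the version you run. The function name is hypothetical.

```python
# Tiers are listed from highest to lowest priority; labels in the same
# tuple share a priority and are reported together.
PRECEDENCE = (
    ("exonic", "splicing"),
    ("ncRNA",),
    ("UTR5", "UTR3"),
    ("intronic",),
    ("upstream", "downstream"),
    ("intergenic",),
)


def resolve_function(candidates):
    """Pick the label(s) to report when a variant maps to several gene parts."""
    found = set(candidates)
    for tier in PRECEDENCE:
        hits = [label for label in tier if label in found]
        if hits:
            return ";".join(hits)
    raise ValueError("no recognised functional label in %r" % (candidates,))
```

With this order, a variant that is intronic in one isoform and in the 5' UTR of another resolves to "UTR5".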

Module bioinformatics.ca Example of Annovar Output
Chr  Start      End        Ref  Alt  Func.refGene  Gene.refGene
1    847617     847617     G    A    intergenic    FAM41C(dist=35435),LOC100130417(dist=5336)
1    882050     882050     C    G    intronic      NOC2L
1    990394     990394     A    C    UTR3          AGRN
1    3668186    3668186    G    C    upstream      CCDC27
1    6094426    6094426    G    A    UTR5          KCNAB2
1    6512048    6512048    C    T    exonic        ESPN
2    171812945  171812945  G    A    splicing      GORASP2(NM_001201428:exon7:c.496-1G>A,NM_015530:exon7:c.700-1G>A)

Module bioinformatics.ca Example of Annovar Output: KCNAB2 (chr1:6094426 G>A, reported as UTR5)
- More than one KCNAB2 isoform is present
- Annovar reported the UTR5 and not the intron, following the priority rules

Module bioinformatics.ca Example of Annovar Output: GORASP2 (chr2:171812945 G>A, reported as splicing; NM_001201428:exon7:c.496-1G>A, NM_015530:exon7:c.700-1G>A)
- The intronic AG splice acceptor sequence becomes AA
- This happens for both GORASP2 transcript isoforms
- What will happen at the functional level? Frameshift splicing?
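For splicing rows, the Gene.refGene value packs the gene name and per-transcript details into one string, as in the GORASP2 example. A minimal sketch for unpacking that string is shown here; the function name and the returned dictionary keys are assumptions for illustration, and the parser only targets the "GENE(transcript:exon:change,...)" shape seen in the example output.

```python
import re

# Gene name followed by one parenthesised, comma-separated detail list.
GENE_WITH_DETAIL = re.compile(r"^(?P<gene>[^()]+)\((?P<detail>[^()]*)\)$")


def parse_splicing_gene_field(field):
    """Split 'GENE(NM_x:exonN:c.change,...)' into (gene, list of records).

    Fields without that exact shape (plain gene names, intergenic
    'dist=' annotations) are returned unchanged with an empty list.
    """
    match = GENE_WITH_DETAIL.match(field)
    if match is None:
        return field, []
    records = []
    for item in match.group("detail").split(","):
        transcript, exon, change = item.split(":")
        records.append({"transcript": transcript, "exon": exon, "cdna_change": change})
    return match.group("gene"), records
```

Applied to the GORASP2 value, this yields the gene name plus two records, one per RefSeq transcript, which makes it easy to see that the same acceptor site is hit in both isoforms.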

Module bioinformatics.ca Splice Sites and Annovar
Annovar considers a +/-2 bp window around the intron/exon junction and reports the following splicing categories:
- splicing: 2 bp intronic
- exonic;splicing: 2 bp exonic
General things to keep in mind:
- The intronic site is much more biologically relevant
- Other sequence features outside the +/- 2 bp splice site window may be important for guiding splicing
→ Splicing variants always need to be manually reviewed
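The +/-2 bp window can be illustrated with a small classifier for a single internal exon. This is a sketch under stated assumptions, not Annovar's code: the exon is assumed to have an intron on both sides, coordinates are 1-based and inclusive, the function name is hypothetical, and the combined label is written here as "exonic;splicing".

```python
def splice_category(pos, exon_start, exon_end, window=2):
    """Classify a position relative to one internal exon.

    Returns "exonic", "exonic;splicing", "splicing" or "intronic".
    """
    if exon_start <= pos <= exon_end:
        # Within `window` bases of either exon boundary, on the exon side.
        near_boundary = (pos - exon_start) < window or (exon_end - pos) < window
        return "exonic;splicing" if near_boundary else "exonic"
    # Outside the exon: distance in bases to the nearest exon boundary.
    distance = exon_start - pos if pos < exon_start else pos - exon_end
    return "splicing" if distance <= window else "intronic"
```

For an exon spanning 100-200, positions 98-99 and 201-202 fall in the intronic splice window, while 100-101 and 199-200 are the exonic bases next to the junction.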

Module bioinformatics.ca AG splicing acceptor (intronic) is very well conserved across 46 vertebrates in UCSC

Module bioinformatics.ca Gene Mapping: Database
Goal: map our variants to (coding and non-coding) genes
RefSeq is the suggested database for transcribed gene and coding sequence definition
- In the lab we will use Annovar with the RefSeq database
Other databases available: UCSC known genes, Ensembl

Module bioinformatics.ca Other Annotation Features
Additional annotation features can be used, for instance:
- Protein domains
- ENCODE profiles
  - Histone marks
  - DNA methylation
  - DNase hypersensitivity (proxy of binding sites)
  - Etc.

Module bioinformatics.ca Gene Product Effect
- Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed)
- Protein-coding sequences: how is the protein sequence affected?
→ Definitely easier to chase after protein effects
→ But don’t forget that other gene products exist

Module bioinformatics.ca Gene Product Effect: Protein-coding
- Stop-gain SNV: adds a STOP codon → truncated protein
- Frameshift In/Del: shifts the reading frame → protein translated incorrectly from that point
- Splicing: alters key sites guiding splicing
- In-frame In/Del: removes/adds one or more amino acids
- Stop-loss: loss of STOP codon → extra piece in the protein
- Missense SNV: modifies one amino acid
- Synonymous: no amino acid change
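For single nucleotide variants inside a codon, the effect types above follow directly from the standard genetic code. The sketch below is illustrative only: it handles one in-frame codon on the coding strand, ignores splicing and In/Dels, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def snv_effect(ref_codon, alt_codon):
    """Classify a codon change as synonymous, stop-gain, stop-loss or missense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"
```

As a usage example, a valine codon GTG changed to GAG (glutamate) is classified as missense, whereas GTG to GTA is synonymous.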

Module bioinformatics.ca Loss of Function (LOF) Variants
Definition: Stop-gain, Frameshift, Splicing
These are the most disruptive, BUT:
- What percentage of the protein is affected?
- Are there multiple transcript isoforms?
- Splicing effect is difficult to predict
  - Cryptic splice sites
- A frameshift can be rescued by another frameshift or bypassed by splicing

Module bioinformatics.ca Missense Variants: Tell Me More
How do we tell if a missense variant alters protein function?
- Type of amino acid change (amino acid groups)
- Conservation across species
- Conserved protein domain
- Secondary protein structure
- Tertiary (3D) protein structure + simulation
- Other functional features (e.g. phosphosite)
- Machine learning model tying all of these together
  - What training set?

Module bioinformatics.ca Missense Example: Back to BRAF
- BRAF V600E (T>A): somatic, pathogenic
- BRAF V600A (T>C): somatic / germline, pathogenicity untested

Module bioinformatics.ca Synonymous Variants are not always Silent
Does no amino acid change equal no functional effect? Often, but not always!
- Codon usage
- Cryptic regulatory sequences (e.g. splicing enhancers)
- Strong conservation can be suggestive
- No broadly used tool handles these yet (but stay tuned)

Module bioinformatics.ca Conservation and Missense Variant Scoring Models

Module bioinformatics.ca Conservation
Conservation is a powerful and broadly used idea
- How conserved is a given nucleotide or genomic interval, comparing different species to human?
- How conserved is an amino acid in a protein sequence?
Available from UCSC (nucleotide conservation):
- PhyloP score: useful to assess single variants
- PhastCons score/element: useful to assess putative regulatory regions and genes not coding for proteins
- Multi-species alignment: generally useful

Module bioinformatics.ca Look for coding exons, UTRs and third nucleotide within codons

Module bioinformatics.ca PhyloP
PhyloP: test to detect if nucleotide substitution rates are faster or slower than expected under neutral drift
- Only where aligned sequence is available!
PhyloP score
- Positive: conserved (e.g. PhyloP > 2)
- Zero: neutral
- Negative: more diverged than neutral
Species group:
- All vertebrates
- Only placental mammals
- Only primates

Module bioinformatics.ca Conservation
Main caveat:
- Conservation at a given position does not directly tell you the effect of your variant, only whether the position is important!

Module bioinformatics.ca Missense Variant Effect: Scoring Models Overview
Criteria to keep in mind:
- What features are used?
  - Nucleotide / amino acid conservation
  - Amino acid physicochemical properties
- Direct scoring versus machine learning
  - Machine learning models are heavily dependent on the training-set used
- What data-set was used for assessment / learning / optimization?
  - E.g. activating / gain-of-function versus inactivating / loss-of-function mutations
  - E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some are unique to cancer, e.g. drug resistance)

Module bioinformatics.ca SIFT
- Broadly used, relatively old (first published: 2001)
- Designed for deleterious mutations (i.e. disruptive of protein function)
- Based solely on protein sequence (amino acid) conservation
1. Start from query protein sequence
2. Identify similar protein sequences (PSI-BLAST)
3. Multiple alignment of protein sequences (orthologs and paralogs)
4. Amino acid x residue probability matrix (PSSM)
5. For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of frequency rank * frequency)
→ Score: probability of observing the amino acid, normalized by residue conservation; cut-off: 0.05 (based on case studies)
Predicting deleterious amino acid substitutions. Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.
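The idea of a conservation-normalized probability with a 0.05 cut-off can be illustrated on a single alignment column. This is a deliberately simplified toy, not the SIFT algorithm: it uses raw observed frequencies with no pseudocounts or sequence weighting, and it normalizes by the most frequent amino acid in the column. The function names are hypothetical.

```python
from collections import Counter


def scaled_probability(column, amino_acid):
    """Frequency of `amino_acid` in an alignment column, relative to the
    most frequent amino acid in that column (gaps "-" are ignored)."""
    counts = Counter(aa for aa in column.upper() if aa != "-")
    if not counts:
        raise ValueError("alignment column has no residues")
    most_common = max(counts.values())
    return counts.get(amino_acid.upper(), 0) / most_common


def predict(column, amino_acid, cutoff=0.05):
    """Call a substitution deleterious when its scaled probability is below the cut-off."""
    if scaled_probability(column, amino_acid) < cutoff:
        return "deleterious"
    return "tolerated"
```

In a column observed as nine valines and one isoleucine, a change to isoleucine would be tolerated under this toy rule, while a change to glutamate, never observed at the position, would be called deleterious.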

Module bioinformatics.ca PolyPhen2
- Integrates multiple features: 8 sequence-based, 3 structure-based (nucleotide and amino acid level), e.g. side chain volume change, overlap with Pfam domain, multiple alignment metrics
- Machine learning method (Naïve Bayes) → requires training set
  - Set 1: HumDiv
    - Positive: damaging alleles for known Mendelian disorders (UniProt)
    - Negative: non-damaging differences between human proteins and related mammalian homologs
    - Performance, 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%)
  - Set 2: HumVar
    - Positive: all human disease-causing mutations (UniProt)
    - Negative: non-synonymous SNPs without disease association
→ Richer model than SIFT
→ More biased towards training set(s) than SIFT
A method and server for predicting damaging missense mutations. Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.

Module bioinformatics.ca MutationAssessor
- Direct / theoretical model (no machine learning)
- Based on amino acid conservation, also specifically modeling conservation unique to protein subfamilies (can be regarded as an enhanced SIFT)
- Entropy-based score based on protein sequence alignment
- Performs well for (recurrent) somatic variants
Predicting the functional impact of protein mutations: application to cancer genomics. Reva B, Antipin Y, Sander C. Nucleic Acids Res. 2011 Sep 1;39(17):e118
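The building block behind an entropy-based conservation score is the Shannon entropy of an alignment column. The sketch below shows only that building block; the published MutationAssessor score combines family-wide and subfamily-specific terms in a more elaborate way that is not reproduced here, and the function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the amino acids in one alignment column.

    Gaps ("-") are ignored. 0 bits means a perfectly conserved position;
    higher values mean a more variable position.
    """
    counts = Counter(aa for aa in column.upper() if aa != "-")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("alignment column has no residues")
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A fully conserved column such as "VVVV" has entropy 0, a column split evenly between two amino acids has 1 bit, and one split evenly among four has 2 bits. Computing the same quantity within a subfamily of sequences, rather than across the whole alignment, is one way to picture what "conservation unique to protein subfamilies" refers to.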

Module bioinformatics.ca MutationAssessor

Module bioinformatics.ca CADD
- Intended as a measure of “deleteriousness” for coding and non-coding sequence, not biased to known disease variation
  - However, it does not model gene-specific constraint in detail
- Machine learning model (Linear SVM)
  - Negative training set: nearly fixed human alleles that differ from the inferred human-chimp ancestral genome
  - Positive training set: simulated variants based on a mutation model aware of sequence context and primate substitution rates
  - Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, ENCODE tracks → includes missense predictions and nucleotide-level conservation
  - Performance assessment: using pathogenic variants from ClinVar, CADD performs a bit better than PhyloP for all sites and than PolyPhen/SIFT for missense coding variants
A general framework for estimating the relative pathogenicity of human genetic variants. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.

Module bioinformatics.ca CADD Pathogenic ClinVar vs NHLBI-ESP > 5%

Module bioinformatics.ca Effect Scoring: Concluding Remarks
1. Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can be additionally inspected
2. Missense scoring models are powerful, but their strengths and weaknesses need to be understood
3. Variants should always be reviewed putting all information in context
   - Consider conservation and effect scores using different models
   - Review the amino acid change and sequence context
   - Look for clusters of somatic variants and protein domains
   - Don’t forget gene-level information!

Module bioinformatics.ca We are on a Coffee Break & Networking Session
