Canadian Bioinformatics Workshops www.bioinformatics.ca.


1 Canadian Bioinformatics Workshops www.bioinformatics.ca


3 Module 7 – Part I: Annotation of Somatic Coding Variants (Annovar). Informatics Facility, The Centre for Applied Genomics (TCAG), The Hospital for Sick Children

4 Module 7 – Part I bioinformatics.ca
Learning Objectives of Module
– I have detected somatic variants in a cancer sample. What information can I use to interpret them?
– What do different annotations mean?
– How do missense effect prediction models work?
– How to use an annotation tool: Annovar (LAB)

5 Introduction

6 Cancer Driver Discovery: Biological Knowledge vs Frequency
Small data-sets (1-10 subjects):
– Variant previously reported
– Gene function/phenotype + variant effect
Large data-sets (> 100 subjects):
– Variant frequency
– Over-represented pathways/networks

7 Variant vs Gene Information
We have to consider information at two levels:
Gene
– Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation)
– Is the gene sensitive to perturbation? (e.g. haploinsufficiency)
Variant
– What is the variant effect on the gene product?

8 Passengers and Drivers: A Very Simplistic Model
– Important activator + activating variant → cancer driver
– Important repressor + loss-of-function variant → cancer driver
– Redundant gene (or gene controlling an unrelated process) → no effect

9 Passengers and Drivers: A Very Simplistic Model
– Important repressor + silent variant → no effect
– Important repressor + loss-of-function variant → cancer driver

10 Integrating Different Evidences
– Variant frequency
– Variant gene product effect
– Gene product function / pathway

11 Integrating Different Evidences
– Variant gene product effect: this module
– Gene product function / pathway: next module
– Variant frequency: not in this module

12 On Variant Size
Small: 1-50 bp
– SNV (Single Nucleotide Variants): 1 bp substitution, relatively easy to detect
– Small in/dels: a bit more challenging to detect
– Most are available in databases; can be mapped by exact coordinates
Medium: 100-1,000 bp
– Insertions, deletions, translocations, complex re-arrangements
– Most challenging to detect
– More tolerant mapping (e.g. partners of a gene fusion)
Large: > 5 kbp
– Copy number variants: relatively easy to detect using arrays, more challenging using next-generation sequencing
– More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
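
The "50% reciprocal overlap" rule for large variants can be sketched as follows. This is a minimal illustration, not a tool's implementation: the function names are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions of intervals A and B.

    Coordinates are assumed to be 1-based and inclusive (an assumption of
    this sketch, not something stated on the slide).
    """
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    len_a = a_end - a_start + 1
    len_b = b_end - b_start + 1
    return min(overlap / len_a, overlap / len_b)


def same_cnv(a, b, threshold=0.5):
    """Treat two (start, end) calls as the same CNV at >= 50% reciprocal overlap."""
    return reciprocal_overlap(a[0], a[1], b[0], b[1]) >= threshold
```

Taking the minimum of the two fractions is what makes the overlap "reciprocal": a small call nested inside a much larger one covers 100% of itself but only a sliver of the large call, so it is not counted as a match.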

13 SNV In/Del Annotation

14 Variant Annotation Components
A. Variant database mapping
   A. Allele frequencies from reference data-sets (1000G, NHLBI-ESP, CGI-46)
   B. dbSNP (sequence variation database)
   C. COSMIC (somatic variant database)
B. Gene mapping (coding/splicing, UTR, intergenic)
C. Gene product effect type (e.g. loss of function, missense)
D. Conservation
   A. PhyloP
E. Missense Effect Scoring
   A. SIFT
   B. PolyPhen2
   C. MutationAssessor

15 Allele Frequency Databases
Use: identify “suspect” somatic variants that show up as germline variants in these databases
– 1000 Genomes
– NHLBI-ESP
– CGI-46 / CGI-69

16 1000 Genomes
Goal:
– Identify all variants at > 1% frequency in the represented human populations
Subjects:
– 1092 with available variants
– 2500 at project completion
Launch date: 2007
– Many revisions (e.g. increased coverage)

17 1000 Genomes
Phase 1: variants available (*)
– 1092 apparently healthy subjects
– Ethnicities:
   Represented: European, Black African, East Asian, Mixed American
   Missing: South-east Asian, Indian, Middle Eastern, North African
– 38.2 M SNPs, 3.9 M in/dels
– Platform: Illumina + SOLiD
   Low-coverage (2-4x) whole genome
   Exon (50x coverage)
– Variant calling: multiple methods, including GATK Unified Genotyper
Phase 2: variant calling on-going
Phase 3: alignments just made available
* version: 30 April 2012, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/README.phase1_integrated_release_version3_20120430


19 NHLBI-ESP
Goal:
– Discover heart, lung and blood disorder variants at frequency < 1%
Subjects: 6503 (ESP 6500 release)
– Not necessarily healthy (includes individuals with extreme subclinical traits and diseased individuals)
– Ethnicities: 2203 African-Americans, 4300 European-Americans
Platform: Illumina, exome sequencing (average 110x)
Variant calling:
– SNV: glfMultiples + ad-hoc quality filtering
– In/dels: GATK Unified Genotyper

20 CG-46 and CG-69
Goal: variation in controls
Subjects:
– CG-46: 46 unrelated (recommended for allele frequencies)
– CG-69: CG-46 + two trios + extended CEU pedigree
Ethnicities: European, Black African, East Asian, Indian, Mexican
Platform: Complete Genomics (whole genome, 80x)
Variant calling: Complete Genomics pipeline

21 CG-46

22 Allele Frequency Databases: Take Home Messages
– Different ethnic compositions
– Whole genome / exome
– Different platforms and (diploid) variant callers
– Different sequencing depth
– Different power for variant detection at different frequencies
– Different number of subjects
– Different capability to generalize across populations
→ Data-sets are complementary
→ Constant updates, keep yourself updated!
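
Because the data-sets are complementary, one way to flag “suspect” somatic calls is to take the highest frequency reported by any of them. The sketch below assumes each variant is a plain dictionary; the keys `1000G`, `ESP6500` and `CG46` and the 1% threshold are illustrative choices, not names or values taken from a specific tool.

```python
def max_population_af(variant, datasets=("1000G", "ESP6500", "CG46")):
    """Highest alternate-allele frequency reported by any reference data-set.

    A data-set that does not list the variant (missing key or None) counts
    as frequency 0.0. The key names are hypothetical.
    """
    return max((variant.get(name) or 0.0) for name in datasets)


def suspect_germline(variants, max_af=0.01):
    """Return the somatic calls that also show up as population variants."""
    return [v for v in variants if max_population_af(v) >= max_af]
```

Using the maximum rather than an average keeps a variant flagged when it is common in only one population or only one data-set, which is the situation the take-home messages warn about.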


24 dbSNP
Broad-scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar)
– Submissions before and after the NGS era
– Includes polymorphisms found in the general population
– Includes rare germline variants that are disease-associated (or suspected to be)
– Includes somatic variants (also in COSMIC)
→ Good to look up variants
→ If you want to use it as a filter, make sure you remove the “clinically flagged” variants (somatic, germline) from the filter set
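
The dbSNP filtering caveat can be expressed as a small predicate. This is a sketch under stated assumptions: the fields `rsid` and `clinically_flagged` are hypothetical names for "has a dbSNP entry" and "the entry carries a clinical (somatic or germline disease) flag".

```python
def passes_dbsnp_filter(variant):
    """Keep a somatic candidate unless it matches a non-flagged dbSNP entry.

    `rsid` and `clinically_flagged` are hypothetical field names.
    """
    in_dbsnp = bool(variant.get("rsid"))
    flagged = bool(variant.get("clinically_flagged"))
    # Clinically flagged entries are exempt from the filter and are kept.
    return (not in_dbsnp) or flagged
```

With this rule a known driver such as BRAF V600E (rs113488022, clinically flagged) survives the filter, while an ordinary polymorphism with an rs identifier does not.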

25 COSMIC
“Catalogue Of Somatic Mutations In Cancer”
Reference database for somatic variation in cancer
Worth following up variants matching COSMIC entries:
– How many studies/samples was it found in? One, or many?
– Does the variant overlap a hotspot?
– Is the gene frequently mutated?
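
The "how many samples?" and "does it overlap a hotspot?" questions amount to counting distinct samples per protein position. The sketch below works on a hypothetical list of records with `gene`, `aa_pos` and `sample` fields; it does not use the COSMIC file format or API, and the `min_samples` threshold is an arbitrary illustration.

```python
from collections import defaultdict


def samples_per_position(records):
    """Count distinct samples mutated at each (gene, amino acid position)."""
    seen = defaultdict(set)
    for rec in records:
        seen[(rec["gene"], rec["aa_pos"])].add(rec["sample"])
    return {key: len(samples) for key, samples in seen.items()}


def hotspots(records, min_samples=3):
    """Positions mutated in at least `min_samples` distinct samples."""
    counts = samples_per_position(records)
    return sorted(key for key, n in counts.items() if n >= min_samples)
```

Counting distinct samples, rather than rows, avoids inflating a position that is reported several times for the same sample.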

26 Looking up a well-established driver mutation in dbSNP and COSMIC (BRAF V600E)

27 BRAF V600E: rs113488022 in dbSNP

28 BRAF V600E: rs113488022 in dbSNP
– Clinical association
– Somatic cluster

30 BRAF V600E: rs113488022, dbSNP clinical view
– V600E (T>A): somatic, pathogenic
– V600A (T>C): somatic / germline, pathogenicity untested

32 BRAF V600E: rs113488022, from dbSNP to OMIM

34 BRAF

36 BRAF V600E: COSMIC

38 BRAF V600E: COSMIC
– V600E overlaps a hotspot


40 Gene Mapping: Only Genes?
Goal: map our variants to (coding and non-coding) genes
What about other features (e.g. regulatory sequences)?
– If a consensus catalogue is achieved, as for genes, these can also be used as reference genomic entities for annotation

41 Gene Mapping: Types of Genes
Types of genes:
– Protein-coding genes
– Non-protein-coding RNA genes (e.g. miRNA)
Different functional relevance
Different knowledge of variant effects

42 Gene Mapping: Parts of Genes
Protein-coding genes have these parts:
– UTR (transcribed, not translated)
– Coding exons (translated)
– Introns (spliced out, not translated)
– Splice sites
Also:
– Upstream, downstream of the transcribed gene
– Intergenic

43 Gene Mapping: Annovar’s priority system
Gene types and parts: what if they overlap?
Whenever more than one mapping is possible, Annovar follows a priority system

44 Gene Mapping: Annovar’s priority system
Diagram elements: protein-coding gene G1; non-coding RNA ncR1 (e.g. miRNA); TSS of G1 (Transcription Start Site)

45 Gene Mapping: Annovar’s priority system
Region labels in the diagram: G1 Upstream; G1 UTR 5’; G1 Exonic; G1 Intronic; G1 Exonic; G1 Intronic; G1 Exonic; G1 Intronic; ncR1; G1 Exonic; G1 UTR 3’; G1 Downstream; ncR1 Downstream; G1 Exonic-splicing
* Splice sites after the first were omitted to avoid clutter
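
A priority system of this kind can be sketched as a lookup table plus a minimum. The ranking below is the precedence recalled from the ANNOVAR documentation (exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic); treat it as an assumption and check it against the version in use. The function name is hypothetical.

```python
# Lower rank = higher priority. Ranking recalled from the ANNOVAR
# documentation; verify against the version you actually run.
PRIORITY = {
    "exonic": 0, "splicing": 0,
    "ncRNA": 1,
    "UTR5": 2, "UTR3": 2,
    "intronic": 3,
    "upstream": 4, "downstream": 4,
    "intergenic": 5,
}


def pick_annotation(candidates):
    """Return the candidate label(s) with the highest priority.

    Labels that tie (e.g. exonic and splicing) are all returned,
    in their original order.
    """
    best = min(PRIORITY[label] for label in candidates)
    return [label for label in candidates if PRIORITY[label] == best]
```

For a position that is intronic in one isoform and in the 5' UTR of another, `pick_annotation(["intronic", "UTR5"])` keeps only `UTR5`.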

46 Example of Annovar Output
Chr  Start      End  Ref  Alt  Func.refGene  Gene.refGene
1    847617          G    A    intergenic    FAM41C(dist=35435),LOC100130417(dist=5336)
1    882050          C    G    intronic      NOC2L
1    990394          A    C    UTR3          AGRN
1    3668186         G    C    upstream      CCDC27
1    6094426         G    A    UTR5          KCNAB2
1    6512048         C    T    exonic        ESPN
2    171812945       G    A    splicing      GORASP2(NM_001201428:exon7:c.496-1G>A,NM_015530:exon7:c.700-1G>A)
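
The Gene.refGene column packs extra details in parentheses (distances for intergenic variants, per-transcript changes for splicing variants). A minimal sketch for unpacking it is shown below; `parse_gene_field` is a hypothetical helper written for the strings in this example, not part of Annovar.

```python
import re


def parse_gene_field(field):
    """Split a Gene.refGene value into (name, [details]) pairs.

    'NOC2L'                  -> [('NOC2L', [])]
    'GENE(detail1,detail2)'  -> [('GENE', ['detail1', 'detail2'])]
    'G1(dist=1),G2(dist=2)'  -> [('G1', ['dist=1']), ('G2', ['dist=2'])]
    """
    # A name is any run without parentheses or commas, followed by "(...)".
    pairs = re.findall(r"([^(),]+)\(([^()]*)\)", field)
    if not pairs:
        return [(field, [])]
    return [(name, inner.split(",")) for name, inner in pairs]


def split_transcript_change(detail):
    """Split 'NM_015530:exon7:c.700-1G>A' into (transcript, exon, change)."""
    transcript, exon, change = detail.split(":")
    return transcript, exon, change
```

Applied to the GORASP2 row, this yields one gene with two transcript-level changes, which makes it easy to see that both isoforms are affected.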


49 Example of Annovar Output (KCNAB2, UTR5 row)
– More than one KCNAB2 isoform is present
– Annovar reported the UTR5 and not the intron, following the priority rules


52 Example of Annovar Output (GORASP2, splicing row)
– The AG splice acceptor (intronic) sequence becomes AA
– This happens for both GORASP2 transcript isoforms
– What will happen at the functional level? Frameshift splicing?

53 Splice Sites and Annovar
Annovar considers a +/-2 bp window around the intron/exon junction and reports the following splicing categories:
– Splicing: 2 bp intronic
– Splicing;exonic: 2 bp exonic
General things to keep in mind:
– The intronic site is biologically more relevant
– Other sequence features outside the +/-2 bp splice site window may be important for guiding splicing
→ Splicing variants always need to be manually reviewed
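
The +/-2 bp window can be illustrated for a single internal exon. This is a simplified sketch, not Annovar's code: it assumes 1-based inclusive exon coordinates, ignores strand and transcript ends, and uses the category names as written on the slide (actual tool output may differ by version).

```python
def splice_category(pos, exon_start, exon_end, window=2):
    """Classify a position relative to one internal exon.

    Assumes 1-based inclusive coordinates and an exon flanked by introns
    on both sides (a simplification of this sketch).
    """
    # Intronic side of either junction: the last/first `window` intron bases.
    if exon_start - window <= pos < exon_start or exon_end < pos <= exon_end + window:
        return "splicing"
    # Exonic side of either junction: the first/last `window` exon bases.
    if exon_start <= pos < exon_start + window or exon_end - window < pos <= exon_end:
        return "splicing;exonic"
    if exon_start <= pos <= exon_end:
        return "exonic"
    return "intronic"
```

For an exon spanning 100-200, positions 98-99 and 201-202 fall in the intronic part of the window, and positions 100-101 and 199-200 in the exonic part.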

54 The AG splice acceptor (intronic) is very well conserved across 46 vertebrates in UCSC

55 Gene Mapping: Database
Goal: map our variants to (coding and non-coding) genes
RefSeq is the suggested database for transcribed gene and coding sequence definition
– In the lab we will use Annovar with the RefSeq database
Other databases available: UCSC Known Genes, Ensembl

56 Other Annotation Features
Additional annotation features can be used, for instance:
– Protein domains
– ENCODE profiles: histone marks, DNA methylation, DNase hypersensitivity (proxy of binding sites), etc.


58 Gene Product Effect
Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed)
Protein-coding sequences: how is the protein sequence affected?
→ Definitely easier to chase after protein effects
→ But don’t forget that other gene products exist…

59 Gene Product Effect: Protein-coding
– Stop-gain SNV: adds a STOP codon → truncated protein
– Frameshift in/del: shifts the reading frame → protein translated incorrectly from that point
– Splicing: alters key sites guiding splicing
– In-frame in/del: removes/adds one or more amino acids
– Stop-loss: loss of the STOP codon → extra piece in the protein
– Missense SNV: modifies one amino acid
– Synonymous: no amino acid change
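
The effect types above follow mechanically from the genetic code, which the sketch below makes explicit. It assumes the standard genetic code and codons given on the coding strand; the function names are hypothetical, and splicing effects are outside its scope.

```python
# Standard genetic code, built in TCAG order ("*" marks a STOP codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def snv_effect(ref_codon, alt_codon):
    """Classify a single-codon substitution given on the coding strand."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


def indel_effect(ref, alt):
    """Classify a coding in/del (alleles of different length) by frame."""
    delta = abs(len(alt) - len(ref))
    return "in-frame in/del" if delta % 3 == 0 else "frameshift in/del"
```

For example, `snv_effect("GTG", "GAG")` is a missense change (valine to glutamate, the substitution behind BRAF V600E), while `snv_effect("TGG", "TGA")` introduces a STOP codon.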

60 Loss of Function (LOF) Variants
Definition: stop-gain, frameshift, splicing
These are the most disruptive, BUT:
– What percentage of the protein is affected?
– Are there multiple transcript isoforms?
– Splicing effect is difficult to predict (cryptic splice sites)
– A frameshift can be rescued by another frameshift or bypassed by splicing

61 Missense Variants: Tell Me More...
How do we tell if a missense variant alters protein function?
– Type of amino acid change (amino acid groups)
– Conservation across species
– Conserved protein domain
– Secondary protein structure
– Tertiary (3D) protein structure + simulation
– Other functional features (e.g. phosphosite)
– Machine learning model tying all of these together (what training set?)

62 Missense Example: Back to BRAF
– BRAF V600E (T>A): somatic, pathogenic
– BRAF V600A (T>C): somatic / germline, pathogenicity untested

63 Synonymous are not always Silent
Does no amino acid change equal no functional effect? Often, but not always!
– Codon usage
– Cryptic regulatory sequences (e.g. splicing enhancers)
– Strong conservation can be suggestive
– No broadly used tool handles these yet (but stay tuned)

64 Variant Zygosity
Not to be forgotten!
– Heterozygous
– Homozygous
– Haploid: X in males; somatic loss of heterozygosity (LOH)

65 Conservation and Missense Variant Scoring Models


67 Conservation
Conservation is a powerful and broadly used idea
– How conserved is a given nucleotide or genomic interval, comparing different species to human?
– How conserved is an amino acid in a protein sequence?
Available from UCSC (nucleotide conservation):
– PhyloP score: useful to assess single variants
– PhastCons score/element: useful to assess putative regulatory regions and non-protein-coding genes
– Multi-species alignment: generally useful

68 Look for coding exons, UTRs and the third nucleotide within codons

69 PhyloP
PhyloP: test to detect whether nucleotide substitution rates are faster or slower than expected under neutral drift
– Only where an aligned sequence is available!
PhyloP score:
– Positive: conserved (e.g. PhyloP > 2)
– Zero: neutral
– Negative: more diverged than neutral
Species group:
– All vertebrates
– Only placental mammals
– Only primates
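
The sign convention of the PhyloP score can be turned into a small labeling helper. This is only a sketch: the function name and the label for scores between 0 and the cut-off are illustrative, and the cut-off of 2 is the example value from the slide, not a universal threshold.

```python
def phylop_label(score, conserved_cutoff=2.0):
    """Translate a PhyloP score into a coarse label.

    `score` is None where no aligned sequence is available.
    """
    if score is None:
        return "no aligned sequence"
    if score > conserved_cutoff:
        return "conserved"
    if score < 0:
        return "more diverged than neutral"
    return "close to neutral"
```

Keeping the missing-alignment case separate matters: an absent score is not evidence of low conservation.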

70 Missense Variant Effect: Scoring Models Overview
Criteria to keep in mind:
What features are used?
– Nucleotide / amino acid conservation
– Amino acid physicochemical properties
Direct scoring versus machine learning
– Machine learning models are heavily dependent on the training set used
What data-set is used for assessment / learning / optimization?
– E.g. activating / gain-of-function versus inactivating / loss-of-function mutations
– E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some effects are unique to cancer, e.g. drug resistance)

71 SIFT
Broadly used, relatively old (first published: 2001)
Designed for deleterious mutations (i.e. disruptive of protein function)
Based uniquely on protein sequence (amino acid) conservation
1. Start from the query protein sequence
2. Identify similar protein sequences (PSI-BLAST)
3. Multiple alignment of protein sequences
4. Amino acid x residue probability matrix (PSSM)
5. For every residue, probabilities are normalized across amino acids by the maximum amino acid probability
→ Score: probability of observing the amino acid, normalized by residue conservation
→ Cut-off: 0.05 (based on case studies)
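
Steps 4 and 5 can be illustrated on a single alignment column. This is a toy sketch of the normalization idea only: it uses raw amino acid frequencies, whereas the published method also applies pseudocounts and sequence weighting, so the numbers will not match real SIFT scores. Function names are hypothetical.

```python
from collections import Counter


def normalized_probabilities(column):
    """Per-amino-acid probabilities at one aligned position, divided by the
    probability of the most frequent amino acid (gaps '-' are ignored)."""
    counts = Counter(aa for aa in column if aa != "-")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("alignment column contains only gaps")
    top = max(counts.values()) / total
    return {aa: (n / total) / top for aa, n in counts.items()}


def sift_like_call(column, alt_aa, cutoff=0.05):
    """Return (score, call) for substituting `alt_aa` at this position."""
    score = normalized_probabilities(column).get(alt_aa, 0.0)
    return score, ("deleterious" if score < cutoff else "tolerated")
```

In a column such as `"VVVI"`, isoleucine gets a normalized score of 1/3 and is called tolerated, while an amino acid never observed at that position scores 0 and falls below the 0.05 cut-off.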

72 PolyPhen2
Integrates multiple features
– 8 sequence-based, 3 structure-based (nucleotide and amino acid level)
Machine learning method → requires a training set
– Set 1: HumDiv
   Positive: damaging alleles for known Mendelian disorders
   Negative: differences between human proteins and related mammalian homologs
– Set 2: HumVar
   Positive: all human disease-causing mutations (UniProt)
   Negative: non-synonymous SNPs without disease association
→ Richer model than SIFT
→ More biased towards training set(s) than SIFT
→ Training set appropriate for somatic variants?

73 MutationAssessor
– Based on amino acid conservation, also specifically modeling conservation unique to protein subfamilies (unlike SIFT)
– Performs well for (recurrent) somatic variants

74 Missense Scoring: Concluding Remarks
1. Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can be additionally inspected
2. Missense scoring models are powerful, but their strengths and weaknesses need to be understood
3. Missense variants should always be reviewed putting all information in context
– Consider PhyloP and missense scores using different models
– Review the amino acid change and sequence context
– Look for clusters of somatic variants and protein domains
– Don’t forget gene-level information!

75 We are on a Coffee Break & Networking Session

