1
Canadian Bioinformatics Workshops
3
Module 8 – Variants to Networks
Part 1 – How to annotate variants and prioritize potentially relevant ones
Jüri Reimand
Bioinformatics for Cancer Genomics, May 25-29, 2015
Informatics and Biocomputing, Ontario Institute for Cancer Research
4
Learning Objectives of Module
I have detected somatic variants in a cancer sample. What information can I use to interpret them?
- What variant annotations can I use?
- How do impact prediction models work?
- How to use an annotation tool: Annovar (LAB)
5
Introduction
6
Variant vs Gene Information
We have to consider information at two levels:
- Gene
  - Is the gene central to processes related to cancer? (e.g. proliferation, apoptosis, matrix degradation)
  - Is the gene sensitive to perturbation? (e.g. haploinsufficiency)
- Variant
  - What is the variant effect on the gene product?
7
Integrating Different Lines of Evidence
- Variant recurrence
- Gene product function / pathway
- Variant effect on the gene product
8
On Variant Size
- Small: 1-50 bp
  - SNVs (single nucleotide variants): 1 bp substitutions, relatively easy to detect
  - Small in/dels: a bit more challenging to detect
  - Most are available in databases; can be mapped by exact coordinates
- Medium: 100-1,000 bp
  - Insertions, deletions, translocations, complex rearrangements
  - Most challenging to detect
  - More tolerant mapping (e.g. partners of a gene fusion)
- Large: > 5 kbp
  - Copy number variants: relatively easy to detect using arrays, more challenging using next-generation sequencing
  - More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
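The "50% reciprocal overlap" rule for large variants can be sketched as follows. This is a minimal illustration, assuming 0-based half-open coordinates; the function names and the `(chrom, start, end)` tuple layout are hypothetical and not taken from any particular tool.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions of intervals A and B.

    Coordinates are assumed to be 0-based, half-open: [start, end).
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    # The overlap must cover a large share of BOTH intervals, so take the minimum.
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def same_cnv(a, b, min_fraction=0.5):
    """Treat two (chrom, start, end) calls as the same event when they are on
    the same chromosome and reach the reciprocal overlap threshold."""
    if a[0] != b[0]:
        return False
    return reciprocal_overlap(a[1], a[2], b[1], b[2]) >= min_fraction
```

A small call nested inside a much larger one fails the test, because the overlap covers only a small share of the larger interval. That is the point of making the criterion reciprocal.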
9
Variant Annotation Components
- Variant database mapping
  - Allele frequencies from reference data sets (1000G, NHLBI-ESP, ExAC)
  - dbSNP (sequence variation database)
  - COSMIC (somatic variant database)
- Gene mapping (coding/splicing, UTR, intergenic)
- Gene product effect type (e.g. loss of function, missense)
- Coding missense effect scoring
  - SIFT
  - PolyPhen2
  - MutationAssessor
- Other effect scoring
  - PhyloP (conservation)
  - CADD
  - Splicing-regulatory predictions
10
Variant databases and allele frequencies
11
1000 Genomes (Phase 3)
- Goal: identify all variants at > 1% frequency in the represented human populations
- Subjects: 2,504
  - Apparently healthy
  - Ethnicities: European, admixed Latin American, African, South Asian, East Asian
- Platform: Illumina
  - Low-coverage (2-4x) whole genome
  - Exome (50x coverage)
12
NHLBI-ESP
- Goal: discover heart, lung and blood disorder variants at frequency < 1%
- Subjects: 6,503 (ESP 6500 release)
  - Not necessarily healthy (includes diseased individuals and individuals with extreme subclinical traits)
  - Ethnicities: 2,203 African-Americans, 4,300 European-Americans
- Platform: Illumina, exome sequencing (average 110x)
13
ExAC (Exome Aggregation Consortium)
- Goal: compile the largest set of exomes ever
- Subjects: 60,706 (unrelated)
  - Not necessarily healthy: includes cardiovascular, autoimmune, schizophrenia and cancer cohorts, but individuals with severe pediatric disease were removed
  - Ethnicities: non-Finnish European, Finnish, Latin American, African, South Asian, East Asian, Other
- Platform: Illumina, exome
- Variant calling: GATK
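One common use of these reference data sets is to flag candidate variants that are frequent in the general population. The sketch below is only an illustration: the dictionary keys, the function names and the 1% threshold are assumptions (1% is borrowed from the 1000 Genomes goal stated earlier, not a recommended cut-off).

```python
COMMON_AF = 0.01  # assumed threshold for illustration only


def max_population_af(variant_afs, datasets=("1000G", "ESP6500", "ExAC")):
    """Highest allele frequency seen across the reference data sets.

    variant_afs: dict mapping a data set name to an allele frequency.
    A missing or None entry is treated as "not observed" (frequency 0).
    """
    return max((variant_afs.get(name) or 0.0) for name in datasets)


def is_common_polymorphism(variant_afs, threshold=COMMON_AF):
    """True if the variant reaches the threshold in at least one data set."""
    return max_population_af(variant_afs) >= threshold
```

Taking the maximum across data sets is a conservative choice: a variant that is common in one population is flagged even if it is rare or absent elsewhere.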
14
dbSNP
- Broad-scope repository of “small” genetic variation (NCBI counterpart for structural variants: dbVar)
- Submissions from before and after the NGS era
- Includes polymorphisms found in the general population
- Includes rare germline variants that are disease-associated (or suspected to be)
- Includes somatic variants (also in COSMIC)
- Good for looking up variants
- If you want to use it as a filter, first remove “clinically flagged” variants (somatic, germline) from the filter set, so that they are not discarded
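The filtering caveat can be made concrete with a small sketch. The record layout (a dict with a `clinically_flagged` boolean) and the function name are hypothetical; real dbSNP exports encode this information differently.

```python
def filter_with_dbsnp(candidates, dbsnp):
    """Drop candidates that are in dbSNP, unless they are clinically flagged.

    candidates: list of variant keys, e.g. (chrom, pos, ref, alt)
    dbsnp: dict mapping variant key -> {"clinically_flagged": bool}
           (hypothetical layout, for illustration)
    """
    kept = []
    for key in candidates:
        record = dbsnp.get(key)
        # Keep novel variants, and known variants carrying a clinical flag.
        if record is None or record["clinically_flagged"]:
            kept.append(key)
    return kept
```

With this rule a known somatic or disease-associated variant survives the filter, while an unflagged population polymorphism is removed.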
15
COSMIC: “Catalogue Of Somatic Mutations In Cancer”
- Reference database for somatic variation in cancer
- Worth following up variants matching COSMIC entries
  - In how many studies/samples was the variant found? One, or many?
  - Does the variant overlap a hotspot?
  - Is the gene frequently mutated?
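The recurrence questions above amount to counting distinct samples per site. A minimal sketch, assuming observations are available as `(sample_id, gene, protein_position)` tuples; this layout, the function names and the `min_samples` default are illustrative assumptions, not COSMIC's own format or definition of a hotspot.

```python
from collections import defaultdict


def recurrence(observations):
    """Count distinct samples per (gene, protein_position).

    observations: iterable of (sample_id, gene, protein_position) tuples.
    """
    samples = defaultdict(set)
    for sample_id, gene, position in observations:
        samples[(gene, position)].add(sample_id)
    return {site: len(ids) for site, ids in samples.items()}


def hotspots(observations, min_samples=3):
    """Sites mutated in at least min_samples distinct samples."""
    counts = recurrence(observations)
    return sorted(site for site, n in counts.items() if n >= min_samples)
```

Counting distinct samples rather than raw records avoids inflating recurrence when the same sample is reported more than once.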
16
Gene mapping
17
Gene Mapping: Types of Genes
- Protein-coding genes
- Non-protein-coding RNA genes (e.g. miRNA)
The two types differ in functional relevance and in how much is known about variant effects.
18
Gene Mapping: Parts of Genes
Protein-coding genes have these parts:
- UTR (transcribed, not translated)
- Coding exons (translated)
- Introns (spliced out, not translated)
- Splice sites
Also:
- Upstream, downstream of the transcribed gene
- Intergenic
19
Gene Mapping: Annovar’s priority system
- Gene types and parts: what if they overlap?
- Whenever more than one mapping is possible, Annovar follows this priority system
- You can also ask Annovar to report all possible effects
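A priority system of this kind can be sketched as a rank table. The ordering below is an assumption modelled on the precedence described in the Annovar documentation (exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic); check it against the version you use. The function name and the `(gene, region)` tuple layout are hypothetical.

```python
# Lower rank = higher priority. Regions sharing a rank are reported together.
RANK = {
    "exonic": 0, "splicing": 0,
    "ncRNA": 1,
    "UTR5": 2, "UTR3": 2,
    "intronic": 3,
    "upstream": 4, "downstream": 4,
    "intergenic": 5,
}


def prioritized(mappings):
    """Keep only the highest-priority mappings for one variant.

    mappings: list of (gene, region) tuples for all overlapping features.
    """
    best = min(RANK[region] for _, region in mappings)
    return [(gene, region) for gene, region in mappings if RANK[region] == best]
```

For a variant that is intronic in G1 but falls inside ncR1, this returns only the ncR1 mapping. Reporting all possible effects corresponds to skipping this step.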
20
Gene Mapping: Annovar’s priority system
[Figure: gene model showing protein-coding gene G1, the TSS (transcription start site) of G1, and non-coding RNA ncR1 (e.g. miRNA)]
21
Gene Mapping: Annovar’s priority system
[Figure: the same gene model with regions labelled by mapping. Labels include: G1 Upstream, G1 UTR 5’, G1 Exonic, G1 Splicing**, G1 Intronic, ncR1, G1 UTR 3’, G1 Downstream, ncR1 Downstream]
** Splice sites after the first were omitted to avoid clutter
22
Gene Mapping: Database
- Goal: map our variants to (coding and non-coding) genes
- RefSeq is the suggested database for transcribed gene and coding sequence definitions
- In the lab we will use Annovar with the RefSeq database
- Other databases available: UCSC Known Genes, Ensembl
23
Gene product effect type
24
Gene Product Effect
- Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed)
- Protein-coding sequences: how is the protein sequence affected?
- It is definitely easier to chase after protein effects
- But don’t forget that other gene products exist…
25
Gene Product Effect: Protein-coding
- Stop-gain SNV: adds a STOP codon → truncated protein
- Frameshift in/del: shifts the reading frame → protein translated incorrectly from that point
- Splicing: alters key sites guiding splicing
- In-frame in/del: removes/adds one or more amino acids
- Stop-loss: loss of the STOP codon → extra piece in the protein
- Missense SNV: modifies one amino acid
- Synonymous: no amino acid change
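For single-nucleotide changes inside a codon, the effect types listed above follow directly from the standard genetic code. The sketch below is a simplified illustration: it takes the reference and alternate codons as given and ignores strand, transcript choice and splicing. The function name is hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order ("*" = STOP).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(ref_codon, alt_codon):
    """Classify the protein-level effect of a codon change."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"
```

For example, `classify_snv("GTG", "GAG")` reports a missense change (valine to glutamate), while `classify_snv("TGG", "TGA")` reports a stop-gain.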
26
Loss of Function (LOF) Variants
- Definition: stop-gain, frameshift, splicing
- These are the more disruptive variants, BUT:
  - What percentage of the protein is affected?
  - Are there multiple transcript isoforms?
  - Splicing effects are difficult to predict
  - Cryptic splice sites
  - A frameshift can be rescued by another frameshift or bypassed by splicing
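The question "what percentage of the protein is affected?" is simple arithmetic for a stop-gain. A minimal sketch, assuming 1-based amino acid positions on a single transcript; the function name is hypothetical, and isoforms and downstream rescue are ignored.

```python
def fraction_truncated(stop_position, protein_length):
    """Fraction of the protein lost when a STOP replaces residue stop_position.

    stop_position: 1-based position of the residue replaced by the STOP codon.
    protein_length: length of the normal protein in amino acids.
    """
    if not 1 <= stop_position <= protein_length:
        raise ValueError("stop_position must lie within the protein")
    # The residue at stop_position and everything after it is lost.
    return (protein_length - stop_position + 1) / protein_length
```

A stop-gain at residue 51 of a 100-residue protein removes half of it, whereas one at the last residue removes 1%. This is why two variants of the same type can differ greatly in impact.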
27
Missense Variants: Tell Me More..
How do we tell if a missense variant alters protein function?
- Type of amino acid change (amino acid groups)
- Conservation across species
- Conserved protein domain
- Secondary protein structure
- Tertiary (3D) protein structure + simulation
- Other functional features (e.g. phosphosite)
- Machine learning model tying all of these together
  - What training set?
28
Missense Example: Back to BRAF
- BRAF V600E (T>A): somatic, pathogenic
- BRAF V600A (T>C): somatic / germline, pathogenicity untested
29
Conservation and Missense Variant Scoring Models
30
Conservation
- Conservation is a powerful and broadly used idea
  - How conserved is a given nucleotide or genomic interval, comparing different species to human?
  - How conserved is an amino acid in a protein sequence?
- Available from UCSC (nucleotide conservation):
  - PhyloP score: useful to assess single variants
  - PhastCons score/element: useful to assess putative regulatory regions and non-protein-coding genes
  - Multi-species alignment: generally useful
31
Look for coding exons, UTRs and third nucleotide within codons
32
PhyloP
- PhyloP: test to detect whether nucleotide substitution rates are faster or slower than expected under neutral drift
- Only where an aligned sequence is available!
- PhyloP score
  - Positive: conserved (e.g. PhyloP > 2)
  - Zero: neutral
  - Negative: more diverged than neutral
- Species group:
  - All vertebrates
  - Only placental mammals
  - Only primates
33
Conservation
Main caveat: conservation at a given position does not directly tell you the effect of your variant, only whether the position is important!
34
Missense Variant Effect: Scoring Models Overview
Criteria to keep in mind:
- What features are used?
  - Nucleotide / amino acid conservation
  - Amino acid physicochemical properties
- Direct scoring versus machine learning
  - Machine learning models are heavily dependent on the training set used
- What data set was used for assessment / learning / optimization?
  - E.g. activating / gain-of-function versus inactivating / loss-of-function mutations
  - E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some effects are unique to cancer, e.g. drug resistance)
35
SIFT
- Broadly used, relatively old (first published: 2001)
- Designed for deleterious mutations (i.e. disruptive of protein function)
- Based solely on protein sequence (amino acid) conservation
  - Start from the query protein sequence
  - Identify similar protein sequences (PSI-BLAST)
  - Multiple alignment of protein sequences (orthologs and paralogs)
  - Amino acid x residue probability matrix (PSSM)
  - For every residue, amino acid probability reweighted by amino acid diversity at the position (sum of frequency rank * frequency)
- Score: probability of observing the amino acid, normalized by residue conservation
  - Cut-off: 0.05 (based on case studies)
Predicting deleterious amino acid substitutions. Ng PC, Henikoff S. Genome Res. 2001 May;11(5).
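The core idea of a conservation-based score with a 0.05 cut-off can be illustrated on a single alignment column. This is a deliberately simplified sketch and not the SIFT algorithm: it uses plain frequencies with a flat pseudocount (an assumption) instead of SIFT's sequence selection and diversity-based reweighting. Function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_probabilities(column, pseudocount=0.01):
    """Amino acid probabilities at one alignment position, scaled so that
    the most frequent amino acid gets a score of 1.

    column: string of residues observed at this position across the
    aligned sequences; gaps and unknown symbols are ignored.
    """
    counts = Counter(aa for aa in column.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    probs = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
    top = max(probs.values())
    return {aa: p / top for aa, p in probs.items()}


def predict(column, alt_aa, cutoff=0.05):
    """Return (score, label) for substituting alt_aa at this position."""
    score = normalized_probabilities(column)[alt_aa]
    return score, ("deleterious" if score < cutoff else "tolerated")
```

At a position where every aligned sequence has valine, a change to glutamate scores close to zero and is labelled deleterious. At a position that alternates between valine and isoleucine, either residue scores 1 and is labelled tolerated.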
36
PolyPhen2
- Integrates multiple features
  - 8 sequence-based, 3 structure-based (nucleotide and amino acid level)
  - E.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics
- Machine learning method (Naïve Bayes)
- Requires a training set
  - Set 1: HumDiv
    - Positive: damaging alleles for known Mendelian disorders (UniProt)
    - Negative: non-damaging differences between human proteins and related mammalian homologs
    - Performance, 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%)
  - Set 2: HumVar
    - Positive: all human disease-causing mutations (UniProt)
    - Negative: non-synonymous SNPs without disease association
- Richer model than SIFT
- More biased towards its training set(s) than SIFT
A method and server for predicting damaging missense mutations. Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods Apr;7(4):248-9.
37
MutationAssessor
- Direct / theoretical model (no machine learning)
- Based on amino acid conservation, also specifically modeling conservation unique to protein subfamilies (can be regarded as an enhanced SIFT)
- Entropy-based score derived from protein sequence alignment
- Performs well for (recurrent) somatic variants
Predicting the functional impact of protein mutations: application to cancer genomics. Reva B, Antipin Y, Sander C. Nucleic Acids Res Sep 1;39(17):e118
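To give a feel for what "entropy-based" means, the sketch below computes the Shannon entropy of one alignment column. It is only the basic building block: MutationAssessor's published score is more involved and also uses subfamily structure, which is not modelled here. The function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues at one alignment position.

    0 bits means the position is perfectly conserved; higher values mean
    more residue diversity across the aligned sequences.
    """
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

A fully conserved column gives 0 bits, a column split evenly between two residues gives 1 bit, and one split evenly among four residues gives 2 bits. Low entropy marks positions where a substitution is more likely to matter.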
38
CADD
- Intended as a measure of “deleteriousness” for coding and non-coding sequence, not biased to known disease variation
- However, it does not model gene-specific constraint in detail
- Machine learning model (linear SVM)
  - Negative training set: nearly fixed human alleles that differ from the inferred human-chimp ancestral genome
  - Positive training set: simulated variants based on a mutation model aware of sequence context and primate substitution rates
- Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, ENCODE tracks; includes missense predictions and nucleotide-level conservation
- Performance assessment using pathogenic variants from ClinVar: performs a bit better than PhyloP for all sites, and a bit better than PolyPhen/SIFT for missense coding variants
A general framework for estimating the relative pathogenicity of human genetic variants. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet Mar;46(3):310-5.
39
CADD: pathogenic ClinVar variants vs NHLBI-ESP variants with frequency > 5%
40
Splicing Regulatory Predictions
- Goal: predict how SNVs affect exon inclusion / exclusion
- Strategy:
  - Learn the “wild type” splicing code based on reference genome sequence motifs and experimentally measured splicing patterns in human tissues
  - “Mutant” code: predicts the splicing change when a variant alters a splicing-guiding sequence motif
  - Does not learn from known disease splicing alterations
Science 2015
41
Phosphorylation and other protein modifications
- Post-translational modifications (PTMs) extend protein function
- Human: >130,000 PTM sites, % of protein sequence
- Enriched in inherited disease and somatic cancer mutations
- Negatively selected in the population
- Often not detected with mutation assessment tools
Reimand et al., 2013 Mol Sys Bio; 2015 PLOS Genet
42
Effect Scoring: Concluding Remarks
- Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can be additionally inspected
- Missense scoring models are powerful, but their strengths and weaknesses need to be understood
- Variants should always be reviewed putting all information in context
  - Consider conservation and effect scores using different models
  - Review the amino acid change and sequence context
  - Look for clusters of somatic variants and protein domains
- Don’t forget gene-level information!
43
We are on a Coffee Break & Networking Session