Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca

Module 8 – Variants to Networks
Part 1 – How to annotate variants and prioritize potentially relevant ones
Jüri Reimand
Bioinformatics for Cancer Genomics, May 25-29, 2015
Informatics and Biocomputing, Ontario Institute for Cancer Research

Learning Objectives of Module
- I have detected somatic variants in a cancer sample. What information can I use to interpret them?
- What variant annotations can I use?
- How do impact prediction models work?
- How to use an annotation tool: Annovar (LAB)

Introduction

Variant vs Gene Information
We have to consider information at two levels:
- Gene
  - Is the gene central to processes related to cancer (e.g. proliferation, apoptosis, matrix degradation)?
  - Is the gene sensitive to perturbation (e.g. haploinsufficiency)?
- Variant
  - What is the variant effect on the gene product?

Integrating Different Lines of Evidence
- Variant recurrence
- Gene product function / pathway
- Variant effect on the gene product

On Variant Size
- Small: 1-50 bp
  - SNV (Single Nucleotide Variant): 1 bp substitution, relatively easy to detect
  - Small in/dels: a bit more challenging to detect
  - Most are available in databases; can be mapped by exact coordinates
- Medium: 100-1,000 bp
  - Insertions, deletions, translocations, complex rearrangements
  - Most challenging to detect
  - More tolerant mapping (e.g. partners of a gene fusion)
- Large: > 5 kbp
  - Copy number variants: relatively easy to detect using arrays, more challenging using next-generation sequencing
  - More tolerant mapping (e.g. 50% reciprocal overlap, cytoband(s))
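The "50% reciprocal overlap" rule for large variants can be sketched as follows. This is a minimal illustration, assuming half-open (start, end) coordinates on the same chromosome; the function names and the default threshold argument are hypothetical and not taken from any particular tool.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions of intervals A and B.

    Intervals are half-open [start, end) and assumed to be on the same
    chromosome (an assumption of this sketch).
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    # Fraction of A covered by the overlap, and fraction of B covered.
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def same_cnv(a, b, threshold=0.5):
    """Treat two CNV calls as matching if each covers >= threshold of the other."""
    return reciprocal_overlap(a[0], a[1], b[0], b[1]) >= threshold


# Two calls shifted by half their length match; a small call nested in a
# much larger one does not, because it covers only 10% of the larger call.
print(same_cnv((100, 200), (150, 250)))
print(same_cnv((0, 1000), (0, 100)))
```

Taking the minimum of the two fractions is what makes the criterion reciprocal: a tiny call fully contained in a large one is not counted as the same event.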

Variant Annotation Components
- Variant database mapping
  - Allele frequencies from reference data-sets (1000G, NHLBI-ESP, ExAC)
  - dbSNP (sequence variation database)
  - COSMIC (somatic variant database)
- Gene mapping (coding/splicing, UTR, intergenic)
- Gene product effect type (e.g. loss of function, missense)
- Coding missense effect scoring
  - SIFT
  - PolyPhen2
  - MutationAssessor
- Other effect scoring
  - PhyloP (conservation)
  - CADD
  - Splicing-regulatory predictions

Variant databases and allele frequencies

1000 Genomes (Phase 3)
- Goal: identify all variants at > 1% frequency in the represented human populations
- Subjects: 2,504
  - Apparently healthy
  - Ethnicities: Caucasian European, admixed Latin American, African, South Asian, East Asian
- Platform: Illumina
  - Low-coverage (2-4x) whole genome
  - Exome (50x coverage)

NHLBI-ESP
- Goal: discover heart, lung and blood disorder variants at frequency < 1%
- Subjects: 6,503 (ESP 6500 release)
  - Not necessarily healthy (includes individuals with extreme subclinical traits and diseased individuals)
  - Ethnicities: 2,203 African-Americans, 4,300 European-Americans
- Platform: Illumina, exome sequencing (average 110x)

ExAC (Exome Aggregation Consortium)
- Goal: compile the largest set of exomes ever
- Subjects: 60,706 (unrelated)
  - Not necessarily healthy: includes cardiovascular, autoimmune, schizophrenia and cancer cohorts, but individuals with severe pediatric disease were removed
  - Ethnicities: non-Finnish European, Finnish, Latin American, African, South Asian, East Asian, Other
- Platform: Illumina, exome
- Variant calling: GATK

dbSNP
- Broad-scope repository of “small” genetic variation (the NCBI counterpart for structural variants is dbVar)
- Submissions from before and after the NGS era
- Includes polymorphisms found in the general population
- Includes rare germline variants that are disease-associated (or suspected to be)
- Includes somatic variants (also in COSMIC)
- Good for looking up variants
- If you want to use it as a filter, make sure you first remove “clinically flagged” variants (somatic, germline) from the filter set
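A population-frequency filter that respects the caveat about clinically flagged variants can be sketched as below. The record layout (a dict with `population_af` and `clinically_flagged` keys) and the 1% default are assumptions made for illustration; they are not the output format of Annovar or of any database.

```python
def is_candidate(variant, max_af=0.01):
    """Decide whether a variant survives a common-polymorphism filter.

    `variant` is a hypothetical record with:
      - "population_af": dict of data-set name -> allele frequency (or None)
      - "clinically_flagged": True if the database entry is clinically flagged
    """
    # Never discard a variant only because a clinically flagged entry exists.
    if variant.get("clinically_flagged", False):
        return True
    freqs = [af for af in variant.get("population_af", {}).values()
             if af is not None]
    # A variant absent from all reference data-sets is treated as rare.
    return max(freqs, default=0.0) < max_af


variants = [
    {"id": "v1", "population_af": {"1000G": 0.20, "ExAC": 0.18}},
    {"id": "v2", "population_af": {"1000G": 0.0005, "ExAC": None}},
    {"id": "v3", "population_af": {"ESP": 0.05}, "clinically_flagged": True},
]
print([v["id"] for v in variants if is_candidate(v)])
```

Using the maximum frequency across data-sets is a deliberately conservative choice: a variant common in any one reference population is removed.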

COSMIC
- “Catalogue Of Somatic Mutations In Cancer”
- Reference database for somatic variation in cancer
- Worth following up variants matching COSMIC entries
  - In how many studies/samples was the variant found: one, or many?
  - Does the variant overlap a hotspot?
  - Is the gene frequently mutated?

Gene mapping

Gene Mapping: Types of Genes
- Protein-coding genes
- Non-protein-coding RNA genes (e.g. miRNA)
- Different functional relevance
- Different knowledge of variant effects

Gene Mapping: Parts of Genes
- Protein-coding genes have these parts:
  - UTR (transcribed, not translated)
  - Coding exons (translated)
  - Introns (spliced out, not translated)
  - Splice sites
- Also:
  - Upstream, downstream of the transcribed gene
  - Intergenic

Gene Mapping: Annovar’s priority system
- Gene types and parts: what if they overlap?
- Whenever more than one mapping is possible, Annovar follows a priority system
- You can also ask Annovar to report all possible effects

Gene Mapping: Annovar’s priority system
[Figure: protein-coding gene G1 with its TSS (Transcription Start Site) marked, and a non-coding RNA ncR1 (e.g. miRNA)]

Gene Mapping: Annovar’s priority system
[Figure: the same locus labelled with the annotation reported at each position: G1 Upstream, G1 UTR 5’, G1 Exonic, G1 Splicing **, G1 Intronic, ncR1, G1 UTR 3’, G1 Downstream, ncR1 Downstream]
** Splice sites after the first were omitted to avoid clutter
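A priority system of this kind can be sketched as a tiered lookup. The tier order below is an assumption: it follows the precedence as described in the ANNOVAR documentation (exonic = splicing > ncRNA > UTR5/UTR3 > intronic > upstream/downstream > intergenic), which the slides themselves do not spell out. The function name and label spellings are illustrative only.

```python
# Assumed precedence tiers, highest priority first. Labels in the same
# set are treated as ties and reported together.
PRECEDENCE = [
    {"exonic", "splicing"},
    {"ncRNA"},
    {"UTR5", "UTR3"},
    {"intronic"},
    {"upstream", "downstream"},
    {"intergenic"},
]


def resolve(labels):
    """Return the highest-priority label(s) among all possible mappings."""
    labels = set(labels)
    for tier in PRECEDENCE:
        hit = sorted(labels & tier)
        if hit:
            return hit
    raise ValueError(f"no known mapping label in {sorted(labels)}")


# A variant inside an intron of G1 that also falls within ncR1:
print(resolve(["intronic", "ncRNA"]))
# A variant in the 3' UTR of G1 that is also downstream of ncR1:
print(resolve(["UTR3", "downstream"]))
```

Reporting every mapping instead of only the top tier corresponds to asking the tool for all possible effects.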

Gene Mapping: Database
- Goal: map our variants to (coding and non-coding) genes
- RefSeq is the suggested database for transcribed gene and coding sequence definitions
- In the lab we will use Annovar with the RefSeq database
- Other databases available: UCSC known genes, Ensembl

Gene product effect type

Gene Product Effect
- Regulatory / other non-protein-coding sequences: difficult to establish what a change “means” (certain cases are easier, e.g. miRNA seed)
- Protein-coding sequences: how is the protein sequence affected?
- Definitely easier to chase after protein effects
- But don’t forget that other gene products exist…

Gene Product Effect: Protein-coding
- Stop-gain SNV: adds a STOP codon → truncated protein
- Frameshift in/del: shifts the reading frame → protein translated incorrectly from that point
- Splicing: alters key sites guiding splicing
- In-frame in/del: removes/adds one or more amino acids
- Stop-loss: loss of the STOP codon → extra piece in the protein
- Missense SNV: modifies one amino acid
- Synonymous: no amino acid change
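The coding effect types listed above can be illustrated with a small classifier. This is a simplified sketch that looks at a single codon or a single in/del in isolation, using the standard genetic code; it ignores splicing and transcript structure, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' = STOP).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snv_effect(ref_codon, alt_codon):
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


def indel_effect(ref, alt):
    """Classify an in/del by whether it preserves the reading frame."""
    diff = len(alt) - len(ref)
    if diff == 0:
        raise ValueError("not an in/del: ref and alt have the same length")
    return "in-frame in/del" if diff % 3 == 0 else "frameshift in/del"


print(snv_effect("GTG", "GAG"))   # valine -> glutamate
print(snv_effect("TGG", "TGA"))   # tryptophan -> STOP
print(indel_effect("ACGT", "A"))  # 3 bp deletion
```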

Loss of Function (LOF) Variants
- Definition: stop-gain, frameshift, splicing
- These are the more disruptive, BUT:
  - What percentage of the protein is affected?
  - Are there multiple transcript isoforms?
  - The splicing effect is difficult to predict
  - Cryptic splice sites
  - A frameshift can be rescued by another frameshift or bypassed by splicing

Missense Variants: Tell Me More..
How do we tell if a missense variant alters protein function?
- Type of amino acid change (amino acid groups)
- Conservation across species
- Conserved protein domain
- Secondary protein structure
- Tertiary (3D) protein structure + simulation
- Other functional features (e.g. phosphosite)
- Machine learning model tying all of these together
  - What training set?

Missense Example: Back to BRAF
- BRAF V600E (T>A): somatic, pathogenic
- BRAF V600A (T>C): somatic / germline, pathogenicity untested

Conservation and Missense Variant Scoring Models

Conservation
- Conservation is a powerful and broadly used idea
  - How conserved is a given nucleotide or genomic interval, comparing different species to human?
  - How conserved is an amino acid in a protein sequence?
- Available from UCSC (nucleotide conservation):
  - PhyloP score – useful to assess single variants
  - PhastCons score/element – useful to assess putative regulatory regions and non-protein-coding genes
  - Multi-species alignment – generally useful

Look for coding exons, UTRs and third nucleotide within codons

PhyloP
- PhyloP: test to detect whether nucleotide substitution rates are faster or slower than expected under neutral drift
- Only available where aligned sequence exists!
- PhyloP score
  - Positive: conserved (e.g. PhyloP > 2)
  - Zero: neutral
  - Negative: more diverged than neutral
- Species group
  - All vertebrates
  - Only placental mammals
  - Only primates

Conservation
Main caveat: conservation at a given position does not tell you directly what the effect of your variant is, only whether the position is important!

Missense Variant Effect: Scoring Models Overview
Criteria to keep in mind:
- What features are used?
  - Nucleotide / amino acid conservation
  - Amino acid physicochemical properties
- Direct scoring versus machine learning
  - Machine learning models are heavily dependent on the training set used
- What data-set is used for assessment / learning / optimization?
  - E.g. activating / gain-of-function versus inactivating / loss-of-function mutations
  - E.g. Mendelian disorders (prevailingly loss-of-function) versus cancer (some mutations are unique to cancer, e.g. drug resistance)

SIFT
- Broadly used, relatively old (first published: 2001)
- Designed for deleterious mutations (i.e. disruptive of protein function)
- Based solely on protein sequence (amino acid) conservation
- Steps:
  - Start from the query protein sequence
  - Identify similar protein sequences (PSI-BLAST)
  - Build a multiple alignment of the protein sequences (orthologs and paralogs)
  - Compute an amino acid x residue probability matrix (PSSM)
  - For every residue, reweight the amino acid probability by the amino acid diversity at the position (sum of frequency rank * frequency)
- Score: probability of observing the amino acid, normalized by residue conservation
- Cut-off: 0.05 (based on case studies)
Predicting deleterious amino acid substitutions. Ng PC, Henikoff S. Genome Res. 2001 May;11(5):863-74.
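The idea of a normalized amino acid probability with a 0.05 cut-off can be illustrated with a toy example. This is not the SIFT algorithm: it skips the PSI-BLAST search, pseudocounts and diversity reweighting, and simply divides each observed frequency in one alignment column by the frequency of the most common amino acid. The function names, and the choice to call scores strictly below the cut-off "deleterious", are assumptions of the sketch.

```python
from collections import Counter


def normalized_probabilities(column):
    """Toy normalization: frequency of each amino acid in an alignment
    column divided by the frequency of the most common one (gaps ignored)."""
    counts = Counter(aa for aa in column if aa != "-")
    if not counts:
        return {}
    top = max(counts.values())
    return {aa: n / top for aa, n in counts.items()}


def predict(column, alt_aa, cutoff=0.05):
    """Score a substitution to `alt_aa`; unobserved amino acids score 0."""
    score = normalized_probabilities(column).get(alt_aa, 0.0)
    return score, ("deleterious" if score < cutoff else "tolerated")


# One alignment column: 40 sequences with valine, 1 with isoleucine, 3 gaps.
column = "V" * 40 + "I" + "---"
print(predict(column, "V"))
print(predict(column, "I"))
print(predict(column, "E"))
```

In this toy column, a change to an amino acid never seen at the position scores 0, which is why deep and diverse alignments matter for this family of methods.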

PolyPhen2
- Integrates multiple features
  - 8 sequence-based, 3 structure-based (nucleotide and amino acid level)
  - E.g. side chain volume change, overlap with PFAM domain, multiple alignment metrics
- Machine learning method (Naïve Bayes) → requires a training set
- Set 1: HumDiv
  - Positive: damaging alleles for known Mendelian disorders (UniProt)
  - Negative: non-damaging differences between human proteins and related mammalian homologs
  - Performance, 5-fold cross-validation: (TP ~ 80%, FP ~ 10%), (TP ~ 90%, FP ~ 20%)
- Set 2: HumVar
  - Positive: all human disease-causing mutations (UniProt)
  - Negative: non-synonymous SNPs without disease association
- Richer model than SIFT
- More biased towards training set(s) than SIFT
A method and server for predicting damaging missense mutations. Adzhubei IA, Schmidt S, Peshkin L, […], Bork P, Kondrashov AS, Sunyaev SR. Nat Methods. 2010 Apr;7(4):248-9.

MutationAssessor
- Direct / theoretical model (no machine learning)
- Based on amino acid conservation, also specifically modeling conservation unique to protein subfamilies (can be regarded as an enhanced SIFT)
- Entropy-based score based on protein sequence alignment
- Performs well for (recurrent) somatic variants
Predicting the functional impact of protein mutations: application to cancer genomics. Reva B, Antipin Y, Sander C. Nucleic Acids Res. 2011 Sep 1;39(17):e118
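To give a feel for what an entropy-based conservation measure looks at, the sketch below computes the plain Shannon entropy of one alignment column. It is only an illustration of the underlying quantity: MutationAssessor's actual score also accounts for subfamily-specific conservation, which is not modeled here, and the function name is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the amino acids in one alignment column.

    Gaps ('-') are ignored. 0 means perfectly conserved; higher values
    mean more variable positions.
    """
    counts = Counter(aa for aa in column if aa != "-")
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(column_entropy("VVVVVVVV"))  # fully conserved column
print(column_entropy("VVVVIIII"))  # two amino acids, equally frequent
print(column_entropy("VILMFAGS"))  # eight different amino acids
```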

CADD
- Intended as a measure of “deleteriousness” for coding and non-coding sequence, not biased towards known disease variation
- However, does not model gene-specific constraint in detail
- Machine learning model (linear SVM)
  - Negative training set: nearly fixed human alleles that differ from the inferred human-chimp ancestral genome
  - Positive training set: simulated variants based on a mutation model aware of sequence context and primate substitution rates
- Predictive features (63): VEP (Variant Effect Predictor) output, UCSC tracks, ENCODE tracks → includes missense predictions and nucleotide-level conservation
- Performance assessment using pathogenic variants from ClinVar: performs a bit better than PhyloP for all sites and than PolyPhen/SIFT for missense coding variants
A general framework for estimating the relative pathogenicity of human genetic variants. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. Nat Genet. 2014 Mar;46(3):310-5.

CADD
[Figure: pathogenic ClinVar variants vs NHLBI-ESP variants at > 5% frequency]

Splicing Regulatory Predictions
- Goal: predict how SNVs affect exon inclusion / exclusion
- Strategy:
  - Learn the “wild type” splicing code from reference genome sequence motifs and experimentally measured splicing patterns in human tissues
  - “Mutant” code: predicts the splicing change when a variant alters a splicing-guiding sequence motif
- Does not learn from known disease splicing alterations
Science 2015

Phosphorylation and other protein modifications
- Post-translational modifications (PTMs) extend protein function
- Human: >130,000 PTM sites, 12% of protein sequence
- Enriched in inherited disease and somatic cancer mutations
- Negatively selected in the population
- Often not detected with mutation assessment tools
Reimand et al, 2013 Mol Sys Bio; 2015 PLOS Genet

Effect Scoring: Concluding Remarks
- Nucleotide-level conservation (PhyloP) is simple yet powerful, and multiple alignments can additionally be inspected
- Missense scoring models are powerful, but their strengths and weaknesses need to be understood
- Variants should always be reviewed by putting all information in context
  - Consider conservation and effect scores using different models
  - Review the amino acid change and sequence context
  - Look for clusters of somatic variants and protein domains
- Don’t forget gene-level information!
