Using SuSPect to Predict the Phenotypic Effects of Missense Variants Chris Yates UCL Cancer Institute

Slides:



Advertisements
Similar presentations
WKinMut An integrated tool for the analysis and interpretation of mutations in human protein kinases José MG Izarzugaza 1 Spanish National Cancer Research.
Advertisements

MitoInteractome : Mitochondrial Protein Interactome Database Rohit Reja Korean Bioinformation Center, Daejeon, Korea.
1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
Protein Structure Database Introduction Database of Comparative Protein Structure Models ModBase 生資所 g 詹濠先.
Improved prediction of protein-protein binding sites using a support vector machine ( James Bradford, et al (2004)) Tapan Patel CISC841 Trypsin (and inhibitor.
Homology 3D modeling and effect of mutations. X-ray crystallography (70,714 in PDB) need crystals Nuclear Magnetic Resonance (NMR) (9,312) proteins in.
Protein Sectors: Evolutionary Units of Three-Dimensional Structure Najeeb Halabi, Olivier Rivoire, Stanislas Leibler, and Rama Ranganthan Cell 138, ,
1 Detecting selection using phylogeny. 2 Evaluation of prediction methods  Comparing our results to experimentally verified sites Positive (hit)Negative.
Expect value Expect value (E-value) Expected number of hits, of equivalent or better score, found by random chance in a database of the size.
Herpes Jeff Brown Dante Kappotis Robert Vanderley Anthony Biasella.
1 Functional prediction in proteins (purifying and positive selection)
PolyPhen and SIFT: Tools for predicting functional effects of SNPs Epi 244 Spring 2009 Sam S. Oh.
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Protein Sequence Analysis - Overview Raja Mazumder Senior Protein Scientist, PIR Assistant Professor, Department of Biochemistry and Molecular Biology.
A Statistical Geometry Approach to the Study of Protein Structure Majid Masso Bioinformatics and Computational Biology George Mason University.
Identifying deleterious Single Nucleotide Polymorphisms using multiple sequence alignments CMSC858P Project by Maya Zuhl.
Analysis of HIV Evolution Bobak Seddighzadeh and Kristoffer Chin Department of Biology Loyola Marymount University Bio February 23, 2010.
Overcoming the Curse of Dimensionality in a Statistical Geometry Based Computational Protein Mutagenesis Majid Masso Bioinformatics and Computational Biology.
MES Genome Informatics I - Lecture VIII. Interpreting variants Sangwoo Kim, Ph.D. Assistant Professor, Severance Biomedical Research Institute,
Friday 17 rd December 2004Stuart Young Capstone Project Presentation Predicting Deleterious Mutations Young SP, Radivojac P, Mooney SD.
PhenCode Linking Human Mutations to Phenotype. PhenCode Brings the deep information on genotypes and phenotypes in locus specific databases (LSDBs) into.
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Identification and evaluation of causative genetic variants corresponding to a certain phenotype Xidan Li.
Protein Secondary Structure Prediction Based on Position-specific Scoring Matrices Yan Liu Sep 29, 2003.
© Wiley Publishing All Rights Reserved. Protein 3D Structures.
Supplemental data 1. Correspondence between Ensembl genes and proteins accession numbers.
Computational prediction of protein-protein interactions Rong Liu
Aims and objectives of the workshop David Moore. Aims Classification of variants is subjective and NEQAS results suggest this is not a major problem To.
Biological Networks. Can a biologist fix a radio? Lazebnik, Cancer Cell, 2002.
Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
SCRIPPS GENOME ADVISER Galina Erikson Senior Bioinformatics Programmer The Scripps Translational Science Institute Scripps Translational Science Institute.
Tutorial 4 Substitution matrices and PSI-BLAST 1.
Study of Protein Prediction Related Problems Ph.D. candidate Le-Yi WEI 1.
LARVA: An integrative framework for Large-scale Analysis of Recurrent Variants in noncoding Annotations M Gerstein, Yale Slides freely downloadable from.
Protein Sequence Analysis - Overview - NIH Proteomics Workshop 2007 Raja Mazumder Scientific Coordinator, PIR Research Assistant Professor, Department.
Identification of amino acid residues in protein-protein interaction interfaces using machine learning and a comparative analysis of the generalized sequence-
Motif discovery and Protein Databases Tutorial 5.
1 Improve Protein Disorder Prediction Using Homology Instructor: Dr. Slobodan Vucetic Student: Kang Peng.
3DM: Protein engineering Super-family platforms Bio-Prodict DM super-family systems Henk-Jan Joosten Remko Kuipers Tom v/d Bergh Bas Vroling.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
Blast 2.0 Details The Filter Option: –process of hiding regions of (nucleic acid or amino acid) sequence having characteristics.
Rita Casadio BIOCOMPUTING GROUP University of Bologna, Italy Prediction of protein function from sequence analysis.
Personalized genomics
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Homology 3D modeling and effect of mutations. X-ray crystallography (70,714 in PDB) need crystals Nuclear Magnetic Resonance (NMR) (9,312) proteins in.
Canadian Bioinformatics Workshops
Identifying disease causal variants Mendelian disorders A. Mesut Erzurumluoglu 1.
Canadian Bioinformatics Workshops
Modeling Cell Proliferation Activity of Human Interleukin-3 (IL-3) Upon Single Residue Replacements Majid Masso Bioinformatics and Computational Biology.
CCRC Cancer Conference November 8, 2015.
Adva Yeheskel The Bioinformatics Unit Tel Aviv University April 10 th 2016.
Sequence: PFAM Used example: Database of protein domain families. It is based on manually curated alignments.
Canadian Bioinformatics Workshops
CSE 182 Project.
Figure 2. VIPUR training ROC and PR performance
Interpretation Next Generation Sequencing (Bench Clinic)
Secondary structure prediction
Have (y)Our Protein Explained
Deep Phenotyping for Deep Learning (DPDL): Progress Report
Relationship between Genotype and Phenotype
By Stitziel, Tseng, Pervouchine, Goddeau, Kasif, Liang
Daniel C. Koboldt, David E. Larson, Lori S. Sullivan, Sara J
Pan-Cancer Analysis of Mutation Hotspots in Protein Domains
Crystal Structure of a Phosphoinositide Phosphatase, MTMR2
Chloe Jones, Isabel Gonzaga, and Nicole Anguiano
Selectivity-determining regions
Predicted pathogenic BRCA1 amino acid substitutions are confined to the evolutionarily conserved N- and C-terminal domains. Predicted pathogenic BRCA1.
Presentation transcript:

Using SuSPect to Predict the Phenotypic Effects of Missense Variants Chris Yates UCL Cancer Institute

Outline SAVs and Disease Development of SuSPect Features included Feature selection Performance Web-Server & Availability Usage Example results

Outline SAVs and Disease Development of SuSPect Features included Feature selection Performance Web-Server & Availability Usage Example results

Background 10-15,000 single amino acid variants (SAVs) per exome. Many variants are tolerated, but some SAVs cause disease. Glu6Val in HBB causes sickle cell anæmia. Many mechanisms by which SAVs can impair function. Decrease stability, Change active site, Protein-protein interaction. Need methods for predicting SAV effects Sequence- and structure-based.

Hexokinase

Transthyretin

Outline SAVs and Disease Development of SuSPect Features included Feature selection Performance Web-Server & Availability Usage Example results

Features Sequence conservation Position-specific scoring matrix (PSI-BLAST) Pfam domain Jensen-Shannon divergence Structural features From PDB or Phyre2 homology models where available. Secondary structure Solvent accessibility Network features Protein-protein interaction (PPI) Domain-domain interaction (DDI) Domain bigram

Features Sequence conservation Position-specific scoring matrix (PSI-BLAST) Pfam domain Jensen-Shannon divergence Structural features From PDB or Phyre2 homology models where available. Secondary structure Solvent accessibility Network features Protein-protein interaction (PPI) Domain-domain interaction (DDI) Domain bigram

Features Sequence conservation Position-specific scoring matrix (PSI-BLAST) Pfam domain Jensen-Shannon divergence Structural features From PDB or Phyre2 homology models where available. Secondary structure Solvent accessibility Network features Protein-protein interaction (PPI) Domain-domain interaction (DDI)

Network Features Change in protein function is not the same as causing disease. More ‘important’ proteins are more likely to be involved in disease. Centrality of a protein within a protein-protein interaction network can be used to measure ‘importance’.

VariBench Neutral and Pathogenic datasets obtained from VariBench (Thusberg et al. 2011). Neutral SAVs from dbSNP version 131, filtered by allele frequency (>0.01) and chromosome count (>49). SAVs present in OMIM removed. Pathogenic SAVs from PhenCode (2009). VariBench datasets were filtered to remove any SAVs present in training data. 13,236Neutral 5,397Pathogenic

VariBench MethodAUCBalanced Accuracy SuSPect MutPred MutationAssessor SIFT FATHMM0.63 Condel PANTHER PolyPhen

Results – Take home messages Feature selection improves performance Top 9 features selected. Predicted relative solvent accessibility; WT and Variant scores in PSSM, and their difference; Number of UniProt annotations; Difference in Pfam scores; PPI network degree centrality; Jensen-Shannon divergence; Sequence identity with best-matching sequence to lack WT amino acid. Network features are important Removal of network features drops AUC from 0.88 to Removal of PPI centrality from SuSPect-FS gives drop from 0.90 to Network centrality helps show the difference between variants affecting protein function and leading to disease.

Results – Feature Selection

Results – Take home messages Feature selection improves performance Top 9 features selected. Predicted relative solvent accessibility; WT and Variant scores in PSSM, and their difference; Number of UniProt annotations; Difference in Pfam scores; PPI network degree centrality; Jensen-Shannon divergence; Sequence identity with best-matching sequence to lack WT amino acid Network features are important Removal of network features drops AUC from 0.88 to Removal of PPI centrality from SuSPect-FS gives drop from 0.90 to Network centrality helps show the difference between variants affecting protein function and leading to disease.

Results – No Network Features

Results – Take home messages Feature selection improves performance Top 9 features selected. Predicted relative solvent accessibility; WT and Variant scores in PSSM, and their difference; Number of UniProt annotations; Difference in Pfam scores; PPI network degree centrality; Jensen-Shannon divergence; Sequence identity with best-matching sequence to lack WT amino acid Network features are important Removal of network features drops AUC from 0.88 to Removal of PPI centrality from SuSPect-FS gives drop from 0.90 to Network centrality helps show the difference between variants affecting protein function and leading to disease.

Results - Prokaryotic Mutations HIV-1 protease – Loeb et al. (1989) 225 deleterious 111 neutral LacI repressor – Suckow et al. (1996) 1,774 deleterious 2,267 neutral T4 lysozyme – Rennel et al. (1991) 638 deleterious 1,377 neutral

Results - Prokaryotic Mutations HIV-1 ProteaseE. coli LacI repressorT4 Lysozyme

Outline SAVs and Disease Development of SuSPect Features included Feature selection Performance Web-Server & Availability Usage Example results

Web-Server & Download Available at Upload list of SAVs or VCF file to obtain scores for human missense variants In addition to score, gives easily interpretable descriptions. Sequence conservation, structure, active site, and much more. Useful for interpretation of how variants can have their effects. SuSPect Package – downloadable database of pre- calculated scores for all possible human missense variants.

Web-Server & Download

Human Proteins Scores have been pre-calculated for the Mar-2013 release of UniProt. If human variants or proteins are uploaded (either as sequence, structure or ID), these pre-calculated scores are used. These scores are calculated using SuSPect-FS, which is quicker and shows better performance than the full version. Other Organisms For non-human proteins, scores are calculated on-the-fly, using a version of SuSPect including all features except the PPI network information and UniProt annotations.

SuSPectP Disease-specific scores associating SAVs with disease

SuSPectP

Acknowledgements & References Prof. Michael Sternberg Dr Ioannis Filippis Dr Lawrence Kelley Dr Suhail Islam Yates CM & Sternberg MJE (2013) Proteins and domains vary in their tolerance of non- synonymous single nucleotide polymorphisms. J. Mol. Biol., 425: Yates CM et al. (2014) SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features. J. Mol. Biol., 426:

Cross-Validation PrecisionRecallMCCBalanced Accuracy SAV Protein Feature Selection

Results – No Structural Features

Results – No Network Features