Protein Function Analysis using Computational Mutagenesis


Protein Function Analysis using Computational Mutagenesis. CASB workshop, 9/23/10. Iosif Vaisman, Laboratory for Structural Bioinformatics (proteins.gmu.edu), Department of Bioinformatics and Computational Biology.

Delaunay simplices classification

Protein representation (Crambin)

Neighbor identification in proteins: Voronoi/Delaunay Tessellation in 2D. A Delaunay simplex is defined by the points whose Voronoi polyhedra have a common vertex. It is always a triangle in a 2D space and a tetrahedron in a 3D space. [Figure panels: Delaunay Tessellation; Voronoi Tessellation]

Neighbor identification in proteins: Voronoi/Delaunay Tessellation in 2D. [Figure panels: Voronoi Tessellation; Delaunay Tessellation]

Delaunay tessellation of Crambin

Delaunay Tessellation of Protein Structure. Abstract each amino acid to a point (Cα or center of mass), using atomic coordinates from the Protein Data Bank (PDB). Delaunay tessellation: 3D “tiling” of space into non-overlapping, irregular tetrahedral simplices. Each simplex objectively defines a quadruplet of nearest-neighbor amino acids at its vertices. [Figure labels: D (Asp); residues D3, K4, R5, L6, F7, A22, G62, C63, S64]

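Compositional propensities of Delaunay simplices. $q_{ijkl} = \log \frac{f_{ijkl}}{p_{ijkl}}$, where $f_{ijkl}$ is the observed quadruplet frequency, $p_{ijkl} = C\, a_i a_j a_k a_l$, and $a$ is the residue frequency. The permutation factor is $C = \frac{4!}{\prod_{i}^{n} (t_i!)}$, where $n$ is the number of distinct residue types in the quadruplet and $t_i$ is the number of residues of type $i$. Examples: AAAA: C = 4! / 4! = 1; AAAV: C = 4! / (3! x 1!) = 4; AAVV: C = 4! / (2! x 2!) = 6; AAVR: C = 4! / (2! x 1! x 1!) = 12; AVRS: C = 4! / (1! x 1! x 1! x 1!) = 24.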
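The permutation factor and the log-likelihood score can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the base-10 logarithm is an assumption, since the slide writes "log" without stating a base.

```python
from collections import Counter
from math import factorial, log10


def permutation_factor(quadruplet):
    """C = 4! / prod(t_i!), where t_i is the count of each distinct residue type."""
    denominator = 1
    for count in Counter(quadruplet).values():
        denominator *= factorial(count)
    return factorial(len(quadruplet)) // denominator


def log_likelihood(quadruplet, observed_freq, residue_freq):
    """q = log(f / p) with p = C * a_i * a_j * a_k * a_l (base 10 is an assumption)."""
    expected = permutation_factor(quadruplet)
    for residue in quadruplet:
        expected *= residue_freq[residue]
    return log10(observed_freq / expected)


# Hypothetical uniform residue frequencies, used only to exercise the functions.
uniform = {residue: 0.05 for residue in "ACDEFGHIKLMNPQRSTVWY"}
q_neutral = log_likelihood("AVRS", 24 * 0.05 ** 4, uniform)
```

A quadruplet observed exactly as often as its composition predicts scores zero; over-represented quadruplets score positive and under-represented ones negative.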

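Counting Quadruplets. Assuming order independence among residues comprising Delaunay simplices, the maximum number of all possible combinations of quadruplets forming such simplices is 8855: 4845 with four distinct residue types, 3420 with one repeated pair, 190 with two pairs, 380 with three identical residues, and 20 with four identical residues (4845 + 3420 + 190 + 380 + 20 = 8855).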
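The count of 8855 can be checked by brute-force enumeration of unordered quadruplets over the 20 standard amino acids. The sketch below is an illustration, not part of the original slides; the function name and the pattern keys are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes


def count_quadruplet_classes(alphabet=AMINO_ACIDS):
    """Count unordered quadruplets, grouped by their repetition pattern."""
    classes = Counter()
    for quadruplet in combinations_with_replacement(alphabet, 4):
        # e.g. AAVR -> (2, 1, 1): one residue type twice, two types once each
        pattern = tuple(sorted(Counter(quadruplet).values(), reverse=True))
        classes[pattern] += 1
    return classes


classes = count_quadruplet_classes()
total = sum(classes.values())
```

Each pattern key corresponds to one of the permutation factors C (for example, `(1, 1, 1, 1)` is the C = 24 case and `(4,)` is the C = 1 case).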

Log-likelihood of amino acid quadruplets with different compositions

Log-likelihood of amino acid quadruplets

Log-likelihood of amino acid quadruplets

Computational Mutagenesis Methodology. Observations: Relatively few mutant and wt structures of the same protein have been solved. Tessellations of mutant and wt protein structures are very similar or identical. Approach: Obtain the topological score (TSmut) and 3D-1D potential profile vector (Qmut) for any mutant protein by using the wt structure tessellation as a template. Simply change the residue label at the given point(s) and re-compute. [Figure: mutation R5 → I5. In the wt structure (TSwt, Qwt), the simplices at position 5 score s(R,D,A,L), s(R,G,F,L), s(R,D,K,S), s(R,S,C,G); in the mutant (TSmut, Qmut) they score s(I,D,A,L), s(I,G,F,L), s(I,D,K,S), s(I,S,C,G). Neighboring residues: D3, K4, L6, F7, A22, G62, C63, S64.]

Computational Mutagenesis Methodology. Scalar “Residual Score” of a mutant: (mutant – wt) topological score difference = TSmut – TSwt (empirical measure of relative structural change due to mutation). Vector “Residual Profile” of a mutant: R = Qmut – Qwt = (mutant – wt) 3D-1D potential profile difference (environmental perturbation score at every position in structure). Denote R = < EC1, EC2, EC3,…, ECN >, where ECi = qi,mut – qi,wt is the relative Environmental Change at position i. Geometric property: if the mutant is due to a single substitution at position j, then ECj ≡ mutant residual score (“epicenter” of impact), and the only other nonzero EC components correspond to neighboring positions that participate in simplices with j.

Approach 1: Protein Topological Score (TS). Obtained by summing the log-likelihood scores of all simplicial quadruplets defined by the protein tessellation. It is a global measure of protein sequence-structure compatibility and the total (empirical or statistical) potential of the protein. TS = ∑î s(î), with the sum taken over all simplex quadruplets î in the entire tessellation. [Figure: close-up view of only the four simplices that use R at position 5 as a vertex (hypothetical): s(R,D,A,L), s(R,G,F,L), s(R,D,K,S), s(R,S,C,G); neighboring residues D3, K4, L6, F7, A22, G62, C63, S64.]

Approach 2: Residue Environment Scores. For each amino acid position, locally sum the log-likelihood scores s(i,j,k,l) of only those simplex quadruplets that include it as a vertex. Example: q5 = q(R5) = ∑(i,j,k,l) s(i,j,k,l), with the sum taken over all simplex quadruplets (i,j,k,l) that include amino acid R5; in the figure these are s(R,D,A,L), s(R,G,F,L), s(R,D,K,S), and s(R,S,C,G). The scores of all amino acid positions in the protein structure form a 3D-1D Potential Profile vector Q = < q1, q2, q3,…,qN > (N = length of primary sequence in solved structure).
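The two approaches, together with the residual score and residual profile defined earlier, can be sketched as follows. Everything concrete here is an assumption made for illustration: the seven-residue sequence, the three simplices (given as 0-based position indices), and the `toy_score` function, which stands in for the real four-body log-likelihood table and carries no physical meaning.

```python
def simplex_score(sequence, simplex, score):
    """Score one simplex from the residue labels at its four vertices."""
    return score(tuple(sorted(sequence[i] for i in simplex)))


def topological_score(sequence, simplices, score):
    """TS: sum of simplex scores over the entire tessellation."""
    return sum(simplex_score(sequence, s, score) for s in simplices)


def potential_profile(sequence, simplices, score):
    """Q = <q_1..q_N>: q_i sums the scores of simplices that use position i."""
    profile = [0.0] * len(sequence)
    for simplex in simplices:
        s = simplex_score(sequence, simplex, score)
        for position in simplex:
            profile[position] += s
    return profile


def residuals(sequence, simplices, score, position, new_residue):
    """Relabel one vertex on the wt tessellation; return (TSmut - TSwt, Qmut - Qwt)."""
    mutant = sequence[:position] + new_residue + sequence[position + 1:]
    residual_score = (topological_score(mutant, simplices, score)
                      - topological_score(sequence, simplices, score))
    q_mut = potential_profile(mutant, simplices, score)
    q_wt = potential_profile(sequence, simplices, score)
    return residual_score, [m - w for m, w in zip(q_mut, q_wt)]


# Hypothetical per-residue values; NOT the real statistical potential.
TOY = {"R": -2, "I": 3, "D": -1, "A": 1, "L": 2, "K": -2, "S": 0, "G": 0}


def toy_score(quadruplet):
    return float(sum(TOY[residue] for residue in quadruplet))


SEQUENCE = "RDALKSG"                                   # hypothetical mini-protein
SIMPLICES = [(0, 1, 2, 3), (0, 1, 4, 5), (3, 4, 5, 6)]  # hypothetical tessellation

ts_wt = topological_score(SEQUENCE, SIMPLICES, toy_score)
q_wt = potential_profile(SEQUENCE, SIMPLICES, toy_score)
residual_score, residual_profile = residuals(SEQUENCE, SIMPLICES, toy_score, 0, "I")
```

In this toy case the residual profile component at the mutated position equals the residual score, and the last position, which shares no simplex with the mutated one, has a component of zero, matching the geometric property stated for single substitutions. Because every simplex contributes to four positions, the profile components also sum to four times the topological score.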

Reversibility Analysis. Forward mutation: (S1,E1) ‘reference’ PDB → (S1,E2) calculated mutant. Reverse mutation: (S2,E2) mutant PDB → (S2,E1) calculated ‘reference’.

Reversibility of mutations (T4 lysozyme)
Protein   Mutation   Score change
1l63      T26E       -2.49
180l      E26T        2.01
1l63      A82S        1.49
123l      S82A       -1.49
1l63      V87M       -0.28
1cu3      M87V        0.22
1l63      A93C       -1.98
138l      C93A        1.78
1l63      T152S      -1.08
1goj      S152T       1.12

Reversibility Analysis

Functional Effects of Amino Acid Substitutions. Change in protein stability: effect on melting temperature, ΔTm = Tm (mutant) – Tm (wt); effect on thermal denaturation, ΔΔG = ΔG (mutant) – ΔG (wt); effect on denaturant denaturation, ΔΔGH2O = ΔGH2O (mutant) – ΔGH2O (wt). Change in protein activity: mutant enzymatic activity relative to wt; mutant strength of DNA binding relative to wt. Disease potential of human coding nsSNPs: neutral polymorphism or disease-associated mutation? For protein targets of inhibitor drugs: continued susceptibility or (degree of) resistance that patients with the mutant protein have to the inhibitor; inhibitor binding energy to mutant target relative to wt.

Examples of Experimental Mutagenesis Data

Example: HIV-1 Protease (PR)

HIV-1 PR Dataset Example: Residual Profiles of 536 Experimental Mutants … …

Experimental Mutants: Residual Scores Elucidate the Structure-Function Relationship. Datasets: 536 HIV-1 protease mutants; 4041 lac repressor mutants; 630 hIL-3 mutants; 371 gene V protein mutants.

Universal Model Approach: 8635 Experimental Mutants from 7 Proteins

Universal Model Approach: 980 Experimental Mutants from 20 Proteins

Structure-Function Correlation Based on Residual Scores: nsSNPs. 1790 nsSNPs corresponding to single amino acid substitutions in several hundred proteins with tessellatable structures. Function: 1332 nsSNPs associated with disease; 458 neutral. Data obtained from Swiss-Prot and HPI.

Structure-Function Correlation Based on Residual Scores: Drug Susceptibility

Algorithm Performance: 2015 T4 Lysozyme Mutants

Learning Curves for HIV-1 protease and T4 lysozyme mutants

Real-World Application: T4 Lysozyme Predictions. Experimental data (not part of the training set) obtained from the ProTherm database. Result: predictions match experiments for 30/35 (~86%) of the mutants.

T4 Lysozyme Mutational Array. [Figure panels: training set mutants (n = 2015) and predicted test set mutants (n = 1101), each colored as Active or Inactive]

GVP Mutational Array

Support Vector Regression Capriotti et al. SVM regression (for comparison): r = 0.71, Standard Error = 1.3 kcal/mol, y = 0.5223x – 0.4705

Conclusions. Computational mutagenesis derived from a four-body, knowledge-based statistical potential uniquely characterizes each protein mutant using both sequential and structural features. Attributes correlate well with mutant function, which makes them valuable for developing accurate machine learning based predictive models.

Acknowledgements. Structural Bioinformatics Laboratory (GMU): Tariq Alsheddi (structure alignment), David Bostick (topological similarity), Andrew Carr (functional sites, visualization), Sunita Kumari (structural genomics), Yong Luo (evolutionary structure analysis), Majid Masso (mutagenesis, HIV-1 protease, LAC repressor, T4 lysozyme, SNP), Ewy Mathe (mutagenesis, p53), Olivia Peters (protein-protein interfaces), Vadim Ravich (HIV RT mutagenesis), Greg Reck (hydration potentials, amyloids), Todd Taylor (statistical potentials, secondary structure, topology, protein stability), Bill Zhang (mutagenesis, BRCA1). Collaborators: John Grefenstette (GMU), Curt Jamison (GMU), Dmitri Klimov (GMU), Dan Carr (GMU), Estela Blaisten (GMU), Vladimir Karginov (IB). Unpublished data: Clyde Hutchison (UNC), Ron Swanstrom (UNC). Funding: NSF, NIH-Innovative Biologics, GMU-INOVA Research Fund.

Evaluating Algorithm Performance. Overall goal: develop a model with known examples to accurately predict the class (or value) of instances that have not yet been assayed experimentally (potentially great savings of time and money). Ideal situation: split a large original dataset into 3 subsets: a training set (learn model), a validation set (optimize model by tweaking model parameters), and a test set (evaluate model on new data not used to develop model). Errors are measured at each step (resubstitution, validation, generalization). Approaches: tenfold cross-validation (10-fold CV); leave-one-out CV (i.e., jackknife or N-fold CV, N = dataset size); % split (e.g., use only 2/3 for training, 1/3 held out for testing).

Evaluating Algorithm Performance. 10-fold CV: Randomly split the dataset instances into 10 equally-sized subsets. Hold out subset 1; combine subsets 2-10 into one training set for learning a model; use the trained model to predict classes of instances in subset 1. Repeat the previous step 9 more times (e.g., hold out subset 2, combine subsets 1 and 3-10 together to train a model, use the model to predict subset 2, etc.). We end up with 10 models, each trained using 90% of the original dataset, and each used to predict the held-out 10% subset. In the end, each instance has one class prediction, which is compared to its actual class. LOOCV (leave-one-out CV, jackknife, or N-fold CV): similar to above, but each subset contains only 1 instance. Deterministic: no randomness as to which instances are grouped as subsets. Overall prediction accuracy provides a rough idea of how a model trained with the full dataset will perform. % split (self-explanatory).
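The hold-out loop can be sketched with the standard library alone. The nearest-class-mean learner and the well-separated toy dataset below are hypothetical stand-ins chosen to keep the example short; they are not the classifiers or data used in the presentation. Setting `k` to the dataset size gives LOOCV.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Randomly partition instance indices into k folds of near-equal size."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train, predict, k=10, seed=0):
    """Return one held-out prediction per instance."""
    predictions = [None] * len(features)
    for held_out in k_fold_indices(len(features), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(features)) if i not in held]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in held_out:
            predictions[i] = predict(model, features[i])
    return predictions


def train_nearest_mean(features, labels):
    """Toy learner (assumption): store the mean feature value of each class."""
    sums, counts = {}, {}
    for x, label in zip(features, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_nearest_mean(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


# Hypothetical scalar features with two well-separated classes.
features = [float(x) for x in list(range(-15, -5)) + list(range(6, 16))]
labels = ["inactive"] * 10 + ["active"] * 10

cv10 = cross_validate(features, labels, train_nearest_mean, predict_nearest_mean, k=10)
loocv = cross_validate(features, labels, train_nearest_mean, predict_nearest_mean,
                       k=len(features))
accuracy = sum(p == a for p, a in zip(cv10, labels)) / len(labels)
```

Each instance is predicted exactly once by a model that never saw it during training, so the pooled predictions can be compared directly with the actual classes.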

Evaluating Algorithm Performance. Assume instances belong to two generic classes (Pos/Neg). Results of comparing predictions with actual classes, based on the approaches described (10-fold CV, LOOCV, % split), can be summarized in a confusion matrix:
                     Predicted as Pos   Predicted as Neg
Actual class Pos     TP                 FN
Actual class Neg     FP                 TN
Classification performance measures: accuracy = (TP+TN) / (TP+FP+TN+FN); sensitivity = TP / (TP+FN); specificity = TN / (TN+FP); precision = TP / (TP+FP); BER = 0.5 × [FP / (FP+TN) + FN / (FN+TP)]; MCC = (TP×TN – FP×FN) / √[(TP+FN)(TP+FP)(TN+FN)(TN+FP)]; AUC = area under ROC curve (plot of sensitivity vs. 1 – specificity). For regression models: correlation coefficient, standard error.
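The confusion-matrix measures can be computed as in the sketch below. The function name and the example counts are hypothetical. MCC is computed with the standard square-root denominator, and returning 0.0 when a denominator is zero is a convention assumed here, not something stated on the slide.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Safe division; returning 0.0 for an empty denominator is an assumed convention."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fn, fp, tn):
    """Compute the listed performance measures from confusion matrix counts."""
    mcc_denominator = sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
        "BER": 0.5 * (_ratio(fp, fp + tn) + _ratio(fn, fn + tp)),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts: 50 actual positives and 50 actual negatives.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

BER and MCC are the more informative measures when the classes are imbalanced, because accuracy alone can be high for a model that always predicts the majority class.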

ROC Curve. Plot of true positive rate (sensitivity) versus false positive rate (1 – specificity) in the unit square. AUC = probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. AUC ~ 0.5 (ROC close to the diagonal line joining points (0,0) and (1,1)) suggests no signal in the dataset and that the trained model is not likely to perform any better than random guessing. AUC = 1 (piecewise linear ROC joining (0,0) to (0,1) and (0,1) to (1,1)) indicates a perfect classifier.
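The ranking interpretation of AUC gives a direct way to compute it: compare every positive instance with every negative one. The sketch below is illustrative; the function name and scores are hypothetical, and counting ties as half a win is a common convention assumed here.

```python
def auc_by_ranking(scores, labels, positive="Pos"):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for s, label in zip(scores, labels) if label == positive]
    negatives = [s for s, label in zip(scores, labels) if label != positive]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tie counted as half a win (assumed convention)
    return wins / (len(positives) * len(negatives))


labels = ["Pos", "Pos", "Neg", "Neg"]
auc_perfect = auc_by_ranking([0.9, 0.8, 0.3, 0.1], labels)  # positives always ranked first
auc_mixed = auc_by_ranking([0.9, 0.4, 0.6, 0.1], labels)    # one pair ranked wrongly
auc_no_signal = auc_by_ranking([0.5, 0.5, 0.5, 0.5], labels)
```

The pairwise loop is quadratic in the number of instances, which is fine for small datasets; a sort-based rank computation would be the usual choice for large ones.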