Overcoming the Curse of Dimensionality in a Statistical Geometry Based Computational Protein Mutagenesis Majid Masso Bioinformatics and Computational Biology.

Presentation transcript:

Overcoming the Curse of Dimensionality in a Statistical Geometry Based Computational Protein Mutagenesis
Majid Masso
Bioinformatics and Computational Biology, George Mason University, Manassas, Virginia, USA
BioDM Workshop, IEEE ICDM 2010

Delaunay Tessellation of Protein Structure
Abstract every amino acid residue to a point: its center of mass (CM), computed from atomic coordinates in the Protein Data Bank (PDB).
[Figure labels: Aspartic Acid (Asp or D); D3, A22, S64, L6, F7, G62, C63, K4, R5]
Delaunay tessellation: 3D “tiling” of space into non-overlapping, irregular tetrahedral simplices. Each simplex objectively identifies a quadruplet of nearest-neighbor amino acids at its vertices.
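The reduction of a residue to its CM point can be sketched as follows. This is a minimal illustration, not the author's code: the mass table, the function name `center_of_mass`, and the toy coordinates are assumptions, and which atoms of a residue are included is left to the caller because the slide does not say.

```python
# Approximate standard atomic masses (u) for heavy atoms common in proteins.
ATOMIC_MASS = {"C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def center_of_mass(atoms):
    """Return the mass-weighted mean position of one residue.

    atoms: list of (element, x, y, z) tuples, coordinates in angstroms.
    """
    total_mass = sum(ATOMIC_MASS[element] for element, *_ in atoms)
    return tuple(
        sum(ATOMIC_MASS[element] * xyz[axis] for element, *xyz in atoms) / total_mass
        for axis in range(3)
    )


# Made-up coordinates for illustration only (not taken from any PDB entry).
toy_residue = [
    ("N", 0.0, 0.0, 0.0),
    ("C", 1.5, 0.0, 0.0),
    ("C", 2.0, 1.4, 0.0),
    ("O", 3.2, 1.6, 0.0),
]
cm_point = center_of_mass(toy_residue)
```

Applying the function to every residue of a structure would give the list of CM points that is then tessellated.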

Delaunay Tessellation of T4 Lysozyme
Ribbon diagram (left) based on PDB file 3lzm (164 residues).
Each amino acid residue represented as a CM point in 3D space.
Tessellation of the 164 CM points (right) performed using a 12 Å edge-length cutoff, for “true” residue quadruplet interactions.
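One plausible reading of the 12 Å edge-length cutoff is that a tetrahedron is kept only if none of its six edges exceeds the cutoff. The sketch below assumes that reading; `filter_simplices` and the toy points are hypothetical. The tessellation itself is taken as given here (in practice it could come from a library such as `scipy.spatial.Delaunay`, whose `simplices` attribute holds vertex-index quadruplets for 3D input).

```python
from itertools import combinations
from math import dist


def filter_simplices(points, simplices, cutoff=12.0):
    """Keep tetrahedra whose six edges are all at most `cutoff` angstroms long.

    points: list of (x, y, z) CM coordinates.
    simplices: list of 4-tuples of indices into `points`.
    """
    kept = []
    for simplex in simplices:
        edges = combinations(simplex, 2)  # the 6 vertex pairs of a tetrahedron
        if all(dist(points[i], points[j]) <= cutoff for i, j in edges):
            kept.append(simplex)
    return kept


# Toy data: the second tetrahedron includes a distant point and is dropped.
toy_points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (20, 0, 0)]
toy_simplices = [(0, 1, 2, 3), (1, 2, 3, 4)]
kept_simplices = filter_simplices(toy_points, toy_simplices)
```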

Four-Body Statistical Potential
Training set: 1,375 diverse high-resolution x-ray structures from the PDB (… 1efaB lac repressor, 1bniA barnase, 1rtjA HIV-1 RT, 1jli IL-3).
Tessellate each structure, pool together all simplices from the tessellations, and compute observed frequencies of simplicial quadruplets.
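The pooling and counting step described above can be sketched with a counter keyed on the unordered quadruplet of residue types. The function name and toy simplices are assumptions; the sketch stops at observed relative frequencies, since the way those frequencies are turned into potential scores is not given in the transcript.

```python
from collections import Counter
from math import comb


def quadruplet_frequencies(simplices):
    """Observed relative frequency of each unordered residue quadruplet.

    simplices: iterable of 4-tuples of one-letter amino acid codes,
    pooled over all tessellated training structures.
    """
    # Sorting the letters makes the key independent of vertex order.
    counts = Counter("".join(sorted(simplex)) for simplex in simplices)
    total = sum(counts.values())
    return {quad: n / total for quad, n in counts.items()}


# Number of distinct unordered quadruplets over a 20-letter alphabet
# (multisets of size 4): C(23, 4).
N_QUADRUPLET_TYPES = comb(20 + 4 - 1, 4)

# Made-up simplices for illustration only.
toy_pool = [("D", "K", "R", "L"), ("K", "D", "L", "R"), ("A", "S", "G", "C")]
toy_freqs = quadruplet_frequencies(toy_pool)
```

Keying on the sorted letters treats a quadruplet as a multiset, so the first two toy simplices count as the same type.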

Four-Body Statistical Potential

Computational Mutagenesis: Residual Profiles
[Figure panels: ribbon, CM trace, tessellation]
10 simplices share the N163 vertex, with 10 total vertices among them; in the structure, N163 therefore has 9 neighbors.
The nonzero components of the residual profile, the environmental change (EC) scores, identify the mutated position 163 and its 9 neighbors.

Computational Mutagenesis: Residual Profiles
Nonzero ECs identify the mutated position 163 and its 9 neighbors.
So, the 19 mutants (N163A, N163C, etc.) at 163 will have nonzero ECs at the same 10 positions only, but the nonzero values will differ.
Each position has a different number of structural neighbors (min of 6, max of 19), which can be located throughout the sequence.
The number of neighbors and their locations (position numbers) depend on the position being mutated.
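The neighbor relationship described above follows directly from the tessellation: the neighbors of a position are the other vertices of every simplex that contains it. A minimal sketch, with hypothetical function names and made-up simplices expressed as quadruplets of sequence positions:

```python
def structural_neighbors(position, simplices):
    """Return (simplices sharing `position`, sorted list of its neighbors)."""
    shared = [s for s in simplices if position in s]
    vertices = set().union(*shared) if shared else set()
    return shared, sorted(vertices - {position})


def nonzero_positions(position, simplices):
    """Positions expected to carry nonzero EC scores for any mutant at `position`."""
    _, neighbors = structural_neighbors(position, simplices)
    return sorted([position] + neighbors)


# Made-up simplices for illustration only (not the real 3lzm tessellation).
toy_tessellation = [
    (160, 161, 162, 163),
    (159, 160, 162, 163),
    (10, 11, 12, 13),
]
shared_163, neighbors_163 = structural_neighbors(163, toy_tessellation)
```

Because the set of positions depends only on the structure and the mutated position, every substitution at that position shares the same nonzero pattern, as the slide states.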

Experimental Data: Mutant T4 Lysozyme Activity
2015 mutants synthesized by introducing the same 13 amino acids as replacements at 163 positions (all except the first).
Rennell, D., Bouvier, S.E., Hardy, L.W. & Poteete, A.R. (1991) J. Mol. Biol. 222, 67-88.
Each position yields either 12 or 13 mutants, depending on whether or not the native amino acid there is also one of the 13 replacements.
Mutant activity is based on plaque sizes on Petri dishes, 2 classes:
“unaffected” = large plaques (same as native T4 lysozyme)
“affected” = medium, small, or no plaques
1377 “unaffected” and 638 “affected” T4 lysozyme mutants.
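The counts above pin down how many positions fall into each case. The short check below only does arithmetic on the numbers quoted in the slide; the variable names and the majority-class baseline are added for illustration.

```python
positions = 163
total_mutants = 2015

# If x positions yield 13 mutants and the rest yield 12:
#   13*x + 12*(positions - x) = total_mutants  =>  x = total_mutants - 12*positions
with_13 = total_mutants - 12 * positions  # native residue is not one of the 13
with_12 = positions - with_13             # native residue is one of the 13
assert 13 * with_13 + 12 * with_12 == total_mutants

unaffected, affected = 1377, 638
assert unaffected + affected == total_mutants

# Accuracy of always predicting the larger class, a reference point for Q.
majority_baseline = unaffected / total_mutants
```

So 59 positions contribute 13 mutants each and 104 contribute 12, and a classifier that always answers "unaffected" would already be right for roughly 68% of the mutants, which is why class-specific measures matter for this dataset.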

Computational Mutagenesis: Feature Vectors
Approach 1: represent mutants by 164D residual profile vectors; training set consists of all 2015 T4 lysozyme mutants.
Approach 2 (dimensionality reduction): select and order the 6 closest neighbors to the mutated position; create a 7D vector of nonzero EC scores for the mutated position and its 6 closest neighbors.
Approach 3 (subspace modeling): segregate mutants by position number, consider each subset as a separate training set for classification, and combine the results; can be applied to 164D or 7D feature vectors.
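Approaches 2 and 3 can be sketched as two small helpers. This is an illustration under assumptions: ordering neighbors by increasing distance, passing distances as a dictionary, and representing mutants as dictionaries with a `position` key are all choices made here, not details taken from the slides.

```python
from collections import defaultdict


def reduce_to_7d(profile, position, distances):
    """Approach 2: EC at the mutated position plus its 6 closest neighbors.

    profile: list of 164 EC scores (index 0 holds position 1).
    position: 1-based mutated position.
    distances: {neighbor position: distance to the mutated position}.
    """
    closest = sorted(distances, key=distances.get)[:6]
    return [profile[position - 1]] + [profile[p - 1] for p in closest]


def group_by_position(mutants):
    """Approach 3: split mutants into one training subset per position."""
    subsets = defaultdict(list)
    for mutant in mutants:
        subsets[mutant["position"]].append(mutant)
    return dict(subsets)


# Made-up profile and distances for illustration only.
toy_distances = {160: 5.1, 161: 4.0, 162: 3.8, 159: 6.5, 100: 7.0, 101: 7.5, 102: 9.9}
toy_profile = [0.0] * 164
toy_profile[163 - 1] = -1.2
for p in toy_distances:
    toy_profile[p - 1] = p * 0.01
toy_vector = reduce_to_7d(toy_profile, 163, toy_distances)

toy_mutants = [
    {"name": "N163A", "position": 163},
    {"name": "N163C", "position": 163},
    {"name": "K162A", "position": 162},
]
toy_subsets = group_by_position(toy_mutants)
```

Note that the 7D vector discards where in the sequence the neighbors sit, whereas the 164D vector encodes that location in the indices of its nonzero entries.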

Supervised Classification
Algorithms: decision tree (DT), neural network (NN), support vector machine (SVM), and random forest (RF)
Testing: leave-one-out cross-validation (LOOCV)
Evaluation of performance:
Overall accuracy, or proportion of correct predictions: Q
Sensitivity and precision for both classes: S(U), P(U), S(A), and P(A)
Balanced error rate: BER
Matthews correlation coefficient: MCC
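The transcript names these measures without giving formulas. The sketch below uses their standard textbook definitions for a two-class problem, which is an assumption about what the slides intended; the function name and the example counts are hypothetical, and the code does not guard against empty classes.

```python
from math import sqrt


def evaluate(uu, ua, au, aa):
    """Compute the listed measures from a 2x2 confusion matrix.

    uu: true U predicted U    ua: true U predicted A
    au: true A predicted U    aa: true A predicted A
    """
    n = uu + ua + au + aa
    s_u = uu / (uu + ua)  # sensitivity for "unaffected"
    p_u = uu / (uu + au)  # precision for "unaffected"
    s_a = aa / (aa + au)  # sensitivity for "affected"
    p_a = aa / (aa + ua)  # precision for "affected"
    ber = 0.5 * ((1 - s_u) + (1 - s_a))  # mean of the two per-class error rates
    mcc = (uu * aa - ua * au) / sqrt(
        (uu + ua) * (uu + au) * (aa + au) * (aa + ua)
    )
    return {
        "Q": (uu + aa) / n,
        "S(U)": s_u, "P(U)": p_u,
        "S(A)": s_a, "P(A)": p_a,
        "BER": ber, "MCC": mcc,
    }


# Made-up counts for illustration only (not results from the presentation).
toy_scores = evaluate(uu=50, ua=10, au=5, aa=35)
```

Unlike Q, both BER and MCC weight the two classes evenly, so they are not inflated by the larger "unaffected" class.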

Results
With the full training set, 164D surpasses 7D because the 7D vectors lose implicit structural information (i.e., the location of nonzeros in the 164D vector).
Subspace modeling (SM) improves performance due to a dramatic increase in S(A); 164D and 7D SM results are equal.
SM with 164D vectors amounts to dimensionality reduction that uses the entire neighborhood of the mutated position (unlike 7D, which uses only the 6 closest neighbors).
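The last point can be made concrete: within a single-position subset, all mutants share the same nonzero components, so the columns that are zero for every mutant carry no information and the effective dimension equals the neighborhood size plus one. A small sketch with hypothetical helper names and made-up EC values:

```python
def nonzero_columns(vectors):
    """Indices of components that are nonzero for at least one vector."""
    return [j for j in range(len(vectors[0])) if any(v[j] != 0 for v in vectors)]


def project(vectors, columns):
    """Keep only the selected components of each vector."""
    return [[v[j] for j in columns] for v in vectors]


def make_vector(values_by_position, length=164):
    """Build a 164D profile from {1-based position: EC score}."""
    vector = [0.0] * length
    for position, value in values_by_position.items():
        vector[position - 1] = value
    return vector


# Made-up example: three mutants at one position, nonzero at the same
# 10 positions (the mutated position and 9 neighbors), with different values.
toy_positions = [4, 5, 6, 7, 22, 62, 63, 64, 162, 163]
toy_subset = [
    make_vector({p: scale * (i + 1) for i, p in enumerate(toy_positions)})
    for scale in (0.1, -0.2, 0.3)
]
toy_columns = nonzero_columns(toy_subset)
toy_reduced = project(toy_subset, toy_columns)
```

Here the three 164D vectors collapse to 10D without discarding any nonzero score, whereas the 7D representation would keep only 7 of them.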

Conclusion and Future Directions
Residual profile vectors provide a natural way to introduce subspace modeling and achieve improved performance.
Current work focused on inductive learning; a future project could apply transductive learning to the dataset.
Transduction allows us to also use vectors of all remaining mutants not classified experimentally; wet-lab collaborations can then validate our predictions.
These techniques could be applied to a similarly comprehensive experimental dataset: 4041 mutants of lac repressor protein.
Contact: mmasso@gmu.edu
Slides available at: http://binf.gmu.edu/mmasso