John Mitchell; James McDonagh; Neetika Nath; Rob Lowe; Richard Marchese Robinson



RF-Score: a Machine Learning Scoring Function for Protein-Ligand Binding Affinities
Ballester, P.J. & Mitchell, J.B.O. (2010) Bioinformatics 26,

Calculating the affinities of protein-ligand complexes:
● For docking
● For post-processing docking hits
● For virtual screening
● For lead optimisation
● For 3D QSAR
● Within series of related complexes
● For any general complex
● Absolute (hard!)
● Relative
A difficult, unsolved problem.

Three existing approaches … 1. Force fields

Three existing approaches … 2. Empirical Functions


Three existing approaches … 3. Knowledge based

How knowledge-based scoring functions have worked …
● Protein-ligand (P-L) complexes from PDB
● Assign atoms to types
● Find histograms of type-type distances
● Convert to an ‘energy’
● Add up the energies from all P-L atom pairs

● This conversion of the histogram into an energy function uses a “reverse Boltzmann” methodology.
● Thus it “assumes” that the atoms of protein and ligand are independent particles in equilibrium at temperature T.
● For a variety of reasons, these are poor assumptions …

● Molecular connectivity: atom-atom distances are miles from being independent.
● Excluded volume effects.
● No physical basis for assuming such an equilibrium.
● Changes in structure with T are small and not like those implied by the Boltzmann distribution.
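The "reverse Boltzmann" conversion discussed above can be sketched for a single pair of atom types as follows. This is a minimal illustration, not the scoring function of any particular program: the function names, the choice of reference distribution, the treatment of empty bins, and the kcal/mol units are all assumptions made for the example.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K); unit choice is an assumption

def histogram(distances, bin_width, r_max):
    """Count distances into equal-width bins covering [0, r_max)."""
    n_bins = int(round(r_max / bin_width))
    counts = [0] * n_bins
    for d in distances:
        if 0.0 <= d < r_max:
            idx = min(int(d / bin_width), n_bins - 1)
            counts[idx] += 1
    return counts

def reverse_boltzmann(observed, reference, temperature=300.0):
    """Convert observed vs. reference bin counts into pseudo-energies.

    E_i = -kT * ln(p_obs_i / p_ref_i). Bins with no observed or no
    reference counts are assigned 0.0 (a simplification for this sketch).
    """
    total_obs = sum(observed)
    total_ref = sum(reference)
    energies = []
    for n_obs, n_ref in zip(observed, reference):
        if n_obs == 0 or n_ref == 0:
            energies.append(0.0)
            continue
        ratio = (n_obs / total_obs) / (n_ref / total_ref)
        energies.append(-K_B * temperature * math.log(ratio))
    return energies

def score_complex(pair_distances, energies, bin_width):
    """Add up the pseudo-energies over all atom-pair distances of one type pair."""
    total = 0.0
    for d in pair_distances:
        idx = int(d / bin_width)
        if idx < len(energies):
            total += energies[idx]
    return total
```

Distance bins that are over-represented relative to the reference come out with negative (favourable) pseudo-energies, and under-represented bins with positive ones. The temperature T enters only as a scale factor, which is one way to see why the equilibrium assumption carries little physical weight here.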

We thought about this … … and wrote a paper saying “It’s not true, but it sort of works”


Then we had a better idea – could we dispense with the reverse Boltzmann formalism?

● Instead of assuming a formula that relates the distance distribution to the binding free energy … use machine learning to learn the relationship from known structures and binding affinities.
● And persuade someone to pay for it!
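To learn from structures, each complex first has to be turned into a fixed-length set of descriptors. The slides do not spell out the descriptors, so the following is only a sketch of one simple option: counting protein-ligand element pairs that lie within a distance cutoff. The element lists, the 12 Å cutoff, and the function name are assumptions for illustration.

```python
import math
from itertools import product

PROTEIN_ELEMENTS = ("C", "N", "O", "S")  # illustrative subset (assumption)
LIGAND_ELEMENTS = ("C", "N", "O")        # illustrative subset (assumption)

def contact_features(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand element pairs closer than `cutoff` (angstroms).

    Each atom is a tuple (element, (x, y, z)). Returns a dict keyed by
    (protein_element, ligand_element); pairs involving elements outside
    the two lists above are ignored.
    """
    counts = {pair: 0 for pair in product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS)}
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            key = (p_elem, l_elem)
            if key in counts and math.dist(p_xyz, l_xyz) < cutoff:
                counts[key] += 1
    return counts
```

The resulting counts, paired with the measured binding affinity of each complex, form one row of training data. No functional form linking distances to energy is assumed at this stage.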

Random Forest Predicted binding affinity

Random Forest
● Introduced by Breiman and Cutler (2001)
● Development of Decision Trees (Recursive Partitioning):
● Dataset is partitioned into consecutively smaller subsets
● Each partition is based upon the value of one descriptor
● The descriptor used at each split is selected so as to optimise splitting
● Bootstrap sample of N objects chosen from the N available objects with replacement
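Two of the ingredients listed above, bootstrap sampling and choosing a split on one descriptor, can be sketched in a few lines. This is a minimal illustration for a regression target. Using squared error as the quantity to optimise is an assumption, and the function names are hypothetical.

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x_column, y):
    """Find the threshold on one descriptor that minimises the total
    squared error of the two resulting subsets.

    Returns (threshold, error), or (None, error_of_unsplit_data) if no
    split reduces the error.
    """
    best_threshold, best_error = None, sum_squared_error(y)
    levels = sorted(set(x_column))
    for lo, hi in zip(levels, levels[1:]):
        threshold = (lo + hi) / 2.0  # midpoint between adjacent values
        left = [t for v, t in zip(x_column, y) if v <= threshold]
        right = [t for v, t in zip(x_column, y) if v > threshold]
        error = sum_squared_error(left) + sum_squared_error(right)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error
```

A full tree would apply `best_split` recursively to each subset, comparing candidate descriptors at every node. Because the bootstrap draws with replacement, each tree is trained on a sample in which some objects appear more than once and others not at all.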

● The Random Forest is just a forest of randomly generated decision trees … whose outputs are averaged to give the final prediction
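The averaging step can be illustrated with a deliberately simplified forest. In this sketch each "tree" is a single-split stump trained on a bootstrap sample, with one randomly chosen descriptor per tree. Real Random Forests grow deeper trees and sample candidate descriptors at every split, so this shows the idea only and is not the RF-Score implementation.

```python
import random

def mean(values):
    return sum(values) / len(values)

def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree on descriptor index `feature`.

    Returns a predictor mapping a row to the mean training target on
    the matching side of the split.
    """
    values = sorted(set(r[feature] for r in rows))
    overall = mean(targets)
    if len(values) < 2:
        return lambda row: overall  # descriptor is constant: no split possible
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for r, t in zip(rows, targets) if r[feature] <= thr]
        right = [t for r, t in zip(rows, targets) if r[feature] > thr]
        ml, mr = mean(left), mean(right)
        err = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, thr, ml, mr)
    _, thr, ml, mr = best
    return lambda row: ml if row[feature] <= thr else mr

def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)         # random descriptor
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], feature))
    return trees

def predict(trees, row):
    """Average the individual tree outputs to give the final prediction."""
    return sum(tree(row) for tree in trees) / len(trees)
```

Any single stump is a crude predictor, but the average over many differently randomised trees is smoother and less sensitive to the particular training sample.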

Building RF-Score PDBbind 2007


Validation results: PDBbind set
● Following the method of Cheng et al., JCIM 49, 1079 (2009)
● Independent test set: PDBbind core 2007, 195 complexes from 65 clusters
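The slide does not state which statistic was used to compare scoring functions on the test set. As an assumption for illustration, the sketch below uses Pearson's correlation coefficient between predicted and measured affinities, a common choice for this kind of benchmark.

```python
import math

def pearson_r(predicted, measured):
    """Pearson correlation between predicted and measured values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)
```

For the comparison to be meaningful, the predictions must come from complexes that were held out of training, as with the independent test set described above.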

Validation results: PDBbind set
● RF-Score outperforms competitor scoring functions, at least on our test
● RF-Score is available for free from our group website
