Informed by Informatics? Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.


Modelling in Chemistry: methods span two axes, PHYSICS-BASED ↔ EMPIRICAL and ATOMISTIC ↔ NON-ATOMISTIC. They include Density Functional Theory, ab initio methods, semi-empirical methods (AM1, PM3 etc.), Car-Parrinello, Molecular Dynamics, Monte Carlo, Docking, DPD, Fluid Dynamics, CoMFA, 2-D QSAR/QSPR and Machine Learning.

The same methods - Density Functional Theory, ab initio, Molecular Dynamics, Monte Carlo, Docking, Car-Parrinello, DPD, CoMFA, 2-D QSAR/QSPR, Machine Learning, AM1, PM3 etc., Fluid Dynamics - arranged along a HIGH THROUGHPUT ↔ LOW THROUGHPUT axis.

The same methods divided between INFORMATICS and THEORETICAL CHEMISTRY - with NO FIRM BOUNDARIES!

Informatics and Empirical Models: In general, informatics methods represent phenomena mathematically, but not in a physics-based way. Inputs and outputs are related by an empirically parameterised equation or a more elaborate mathematical model. Such methods do not attempt to simulate reality, and are usually high throughput.

QSPR: Quantitative Structure-Property Relationship. A physical property is related to one or more other variables. Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s); property-property relationships go back to the 1860s. General form (covering non-linear relationships): y = f(descriptors).

QSPR: Y = f(X1, X2, ..., XN). Optimisation of Y = f(X1, X2, ..., XN) is called regression. The model is optimised on N “training” molecules and then tested on M “test” molecules.

QSPR: The quality of the model is judged by three parameters: RMSE, r² and bias.

QSPR Different methods for carrying out regression: LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc. NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
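To make the linear case concrete, a multi-linear regression (MLR) can be fitted by ordinary least squares. A minimal NumPy sketch with invented toy data (two hypothetical descriptors, six molecules - not the descriptors used in the talk):

```python
import numpy as np

# Toy data: 6 "molecules", 2 hypothetical descriptors each
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + 1.0   # known linear rule, no noise

# Add an intercept column and solve the least-squares problem
A = np.hstack([X, np.ones((len(X), 1))])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)

y_pred = A @ coefs
print(coefs)   # recovers [0.5, -0.2, 1.0] on this noiseless toy data
```

With real, noisy descriptor data the recovered coefficients would only approximate the underlying relationship, which is why the model must then be tested on held-out molecules.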

QSPR However, this does not guarantee a good predictive model….

QSPR: Problems with experimental error. A QSPR is only as accurate as the data it is trained upon; therefore, we need accurate experimental data.

QSPR: Problems with “chemical space”. The “sample” molecules must be representative of the “population”. Prediction results will be most accurate for molecules similar to the training set. Global or local models?

1. Solubility Dave Palmer Pfizer Institute for Pharmaceutical Materials Science D.S. Palmer, et al., J. Chem. Inf. Model., 47, (2007) D.S. Palmer, et al., Molecular Pharmaceutics, ASAP article (2008)

Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry. A good model for predicting the solubility of druglike molecules would be very valuable.

Datasets. Phase 1 – Literature data: compiled from the Huuskonen dataset and the AquaSol database; pharmaceutically relevant molecules, all solid at room temperature; n = 1000 molecules. Phase 2 – Our own experimental data: measured by Toni Llinàs using the CheqSol machine; pharmaceutically relevant molecules; n = 135 molecules. Aqueous solubility here means the thermodynamic solubility in unbuffered water at 25 °C.

Diversity-Conserving Partitioning ● MACCS Structural Key fingerprints ● Tanimoto coefficient ● MaxMin Algorithm Full dataset n = 1000 molecules Training n = 670 molecules Test n = 330 molecules
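The MaxMin step can be sketched in a few lines. This toy version uses sets of "on" bits standing in for MACCS keys; the seed choice and first-index tie-breaking are illustrative assumptions, not details from the talk:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def maxmin_pick(fps, n_pick, seed_index=0):
    """Greedy MaxMin: repeatedly add the molecule whose nearest
    already-picked neighbour is most distant (distance = 1 - Tanimoto)."""
    picked = [seed_index]
    while len(picked) < n_pick:
        best, best_dist = None, -1.0
        for i in range(len(fps)):
            if i in picked:
                continue
            d = min(1.0 - tanimoto(fps[i], fps[j]) for j in picked)
            if d > best_dist:          # ties broken by first index (assumption)
                best, best_dist = i, d
        picked.append(best)
    return picked

# Toy "fingerprints" as sets of on-bit indices
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 3, 4}, {7, 8, 10}]
print(maxmin_pick(fps, 3))   # picks the dissimilar molecule 2 early: [0, 2, 1]
```

The point of MaxMin is visible even on this toy set: molecule 2, which shares no bits with the seed, is selected before the near-duplicates of the seed.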

Diversity Analysis

Structures & Descriptors ● 3D structures from Concord ● Minimised with MMFF94 ● MOE descriptors 2D/ 3D ● Separate analysis of 2D and 3D descriptors ● QuaSAR Contingency Module (MOE) ● 52 descriptors selected

Machine Learning Method Random Forest

Random Forest: Introduced by Breiman and Cutler (2001) as a development of decision trees (recursive partitioning): the dataset is partitioned into consecutively smaller subsets; each partition is based upon the value of one descriptor; the descriptor used at each split is selected so as to optimise splitting. Each tree is grown on a bootstrap sample of N objects chosen, with replacement, from the N available objects.

Random Forest: a collection of decision or regression trees grown with the CART algorithm. Standard parameters: 500 decision trees; no pruning back (minimum node size = 5); “mtry” descriptors (the square root of the total number of descriptors) tried at each split. Important features: incorporates descriptor selection; incorporates “out-of-bag” validation, using those molecules not in the bootstrap samples.
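These standard parameters map directly onto scikit-learn's `RandomForestRegressor`, used here as a stand-in for the original implementation; the toy data and random seed are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented toy data: 200 "molecules" with 9 descriptors and a known rule
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(
    n_estimators=500,       # 500 decision trees
    min_samples_leaf=5,     # minimum node size = 5, no pruning back
    max_features="sqrt",    # "mtry" = sqrt(total number of descriptors)
    oob_score=True,         # "out-of-bag" validation comes for free
    random_state=0,
)
rf.fit(X, y)
print(f"out-of-bag r2 = {rf.oob_score_:.2f}")
```

Because each tree sees only a bootstrap sample, the molecules left out of that sample give an internal validation set without touching the external test molecules.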

Random Forest for Solubility Prediction: a forest of regression trees. The dataset is partitioned into consecutively smaller subsets (of similar solubility). Each partition is based upon the value of one descriptor; the descriptor used at each split is selected so as to minimise the MSE. Leo Breiman, “Random Forests”, Machine Learning 45, 5-32 (2001).

Random Forest for Predicting Solubility A Forest of Regression Trees Each tree grown until terminal nodes contain specified number of molecules No need to prune back High predictive accuracy Includes method for descriptor selection No training problems – largely immune from overfitting.

Random Forest: Solubility Results. Test set: RMSE(te) = 0.69, r²(te) = 0.89, bias(te) = −0.04. Training set: RMSE(tr) = 0.27, r²(tr) = 0.98, bias(tr) = 0.005. Out-of-bag: RMSE(oob) = 0.68, r²(oob) = 0.90, bias(oob) = 0.01. DS Palmer et al., J. Chem. Inf. Model., 47, (2007).
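The three statistics reported above are standard and easy to compute; a small sketch with invented log S values (not the paper's data):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def bias(y_true, y_pred):
    """Mean signed error; negative values mean systematic underprediction."""
    return float(np.mean(y_pred - y_true))

# Hypothetical log S values for five molecules
y_true = np.array([-3.1, -2.4, -5.0, -1.8, -4.2])
y_pred = np.array([-3.0, -2.6, -4.7, -2.0, -4.0])
print(rmse(y_true, y_pred), r_squared(y_true, y_pred), bias(y_true, y_pred))
```

A near-zero bias alongside a non-zero RMSE, as in the results above, indicates scatter rather than a systematic shift.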

Drug Disc. Today, 10 (4), 289 (2005)

Can we use theoretical chemistry to calculate solubility via a thermodynamic cycle?

ΔG_sub from lattice energy and an entropy term; ΔG_hydr from a semi-empirical solvation model (i.e., different kinds of theoretical/computational methods).
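Written out, the cycle combines these two free energies; this is a sketch of the standard relation, with standard-state corrections omitted and the 2.303RT conversion to log units added here for clarity rather than taken from the slide:

```latex
\Delta G_{\mathrm{solution}} = \Delta G_{\mathrm{sub}} + \Delta G_{\mathrm{hydr}},
\qquad
\log S \approx -\,\frac{\Delta G_{\mathrm{solution}}}{2.303\,RT} + \text{const.}
```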

… or this alternative cycle?

ΔG_sub from lattice energy (DMAREL) plus entropy; ΔG_solv from SCRF using the B3LYP DFT functional; ΔG_tr from ClogP (i.e., different kinds of theoretical/computational methods).

● Nice idea, but didn’t quite work – errors larger than QSPR ● “Why not add a correction factor to account for the difference between the theoretical methods?” …

● Within a week this had become a hybrid method, essentially a QSPR with the theoretical energies as descriptors

This regression equation gave r² = 0.77 and RMSE = 0.71.

Solubility by TD Cycle: Conclusions ● We have a hybrid part-theoretical, part-empirical method. ● An interesting idea, but relatively low throughput as a crystal structure is needed. ● Slightly more accurate than pure QSPR for a druglike set. ● Instructive to compare with literature of theoretical solubility studies.

2. Bioactivity Florian Nigsch F. Nigsch, et al., J. Chem. Inf. Model., 48, (2008)

Feature Space - Chemical Space: a molecule is represented as a vector m = (f1, f2, …, fn) in a feature space of high dimensionality (axes f1, f2, f3 in the illustration); protein targets such as COX2, DHFR, CDK1 and CDK2 occupy regions of this space.

Properties of Drugs High affinity to protein target Soluble Permeable Absorbable High bioavailability Specific rate of metabolism Renal/hepatic clearance? Volume of distribution? Low toxicity Plasma protein binding? Blood-Brain-Barrier penetration? Dosage (once/twice daily?) Synthetic accessibility Formulation (important in development)

Multiobjective Optimisation Bioactivity Synthetic accessibility Permeability Toxicity Metabolism Solubility Huge number of candidates …

Multiobjective Optimisation: bioactivity, synthetic accessibility, permeability, toxicity, metabolism, solubility. A huge number of candidates, most of which are useless - only a few are drugs!

Spam: unsolicited (commercial) e-mail. Approx. 90% of all e-mail traffic is spam. Where are the legitimate messages? Filtering.

Analogy to Drug Discovery Huge number of possible candidates Virtual screening to help in selection process

Winnow Algorithm: invented in the late 1980s by Nick Littlestone to learn Boolean functions; the name comes from the verb “to winnow”. Suited to high-dimensional input data; used in Natural Language Processing (NLP), text classification and bioinformatics. Different varieties exist (regularised, Sparse Network Of Winnow - SNOW, …). It is an error-driven, linear-threshold, online algorithm.
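A minimal sketch of Littlestone's basic Winnow (multiplicative promotion/demotion against a linear threshold), not the regularised SNOW variant used in the paper; the toy data and target concept are invented:

```python
def winnow_train(examples, n_features, alpha=2.0, epochs=10):
    """Littlestone's Winnow: error-driven multiplicative weight updates.
    examples: list of (active_feature_indices, label) with label in {0, 1}."""
    w = [1.0] * n_features
    theta = float(n_features)          # a standard threshold choice
    for _ in range(epochs):
        mistakes = 0
        for active, label in examples:
            pred = 1 if sum(w[i] for i in active) >= theta else 0
            if pred == label:
                continue               # online: update only on mistakes
            mistakes += 1
            factor = alpha if label == 1 else 1.0 / alpha
            for i in active:           # promote or demote active features
                w[i] *= factor
        if mistakes == 0:
            break
    return w, theta

# Invented target concept: feature 0 OR feature 1 (features 2..5 irrelevant)
data = [({0, 2}, 1), ({1, 3}, 1), ({2, 3}, 0), ({4, 5}, 0), ({0, 5}, 1)]
w, theta = winnow_train(data, 6)
predict = lambda active: 1 if sum(w[i] for i in active) >= theta else 0
print([predict(a) for a, _ in data])   # [1, 1, 0, 0, 1]
```

The multiplicative updates are what make Winnow robust to many irrelevant features: weights on the informative features grow geometrically while the rest stay small.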

Machine Learning Methods Winnow (“Molecular Spam Filter”)

Training Example

Features of Molecules Based on circular fingerprints

Combinations of Features Combinations of molecular features to account for synergies.

Pairing Molecular Features Gives Bigrams

Orthogonal Sparse Bigrams: a technique used in text classification/spam filtering. A sliding-window process; sparse - not all combinations are kept; orthogonal - non-redundancy.
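A sketch of the sliding-window pairing in the spirit of this technique; the window size and the (token, gap, token) encoding are illustrative assumptions, since the slide does not give them:

```python
def osb_features(tokens, window=4):
    """Orthogonal sparse bigrams: slide a window over the token stream and
    pair the leading token with each later token in the window, recording
    how many tokens were skipped.  Sparse: not all combinations are kept.
    Orthogonal: the gap marker makes each pair non-redundant."""
    feats = []
    for i, left in enumerate(tokens):
        for gap in range(1, window):
            j = i + gap
            if j >= len(tokens):
                break
            feats.append((left, gap - 1, tokens[j]))  # (token, skipped, token)
    return feats

print(osb_features(["a", "b", "c", "d"], window=3))
# [('a', 0, 'b'), ('a', 1, 'c'), ('b', 0, 'c'), ('b', 1, 'd'), ('c', 0, 'd')]
```

For molecules the "tokens" would be circular-fingerprint features rather than words, giving the feature combinations described two slides earlier.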

Workflow

Protein Target Prediction Which proteins does a given molecule bind to? Virtual Screening Multiple endpoint drugs - polypharmacology New targets for existing drugs Prediction of adverse drug reactions (ADR) –Computational toxicology

Predicted Protein Targets Selection of ~230 classes from the MDL Drug Data Report ~90,000 molecules 15 independent 50%/50% splits into training/test set

Predicted Protein Targets Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)

3. Melting Points Florian Nigsch F. Nigsch, et al., J. Chem. Inf. Model., 46, (2006)

Interest in Modelling Interest in modelling properties of molecules - solubility, toxicity, drug-drug interactions, … Virtual screening has become a necessity in terms of time and cost efficiency ADMET integration into drug development process - many fewer late stage failures - continuous refinement with new experimental data Absorption, Distribution, Metabolism, Excretion and Toxicity

Melting Points are Important Melting point facilitates prediction of solubility - General Solubility Equation (Yalkowsky) Notorious challenge (unknown crystal packing, interactions in liquid state) Goal: Stepwise model for bioavailability, from solid compound to systemic availability
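Yalkowsky's General Solubility Equation makes the melting-point link explicit; a one-line sketch (the compound in the example is hypothetical):

```python
def gse_log_s(melting_point_c, log_p):
    """General Solubility Equation (Yalkowsky):
        log S = 0.5 - 0.01 * (MP - 25) - log P
    with MP in degrees Celsius (for solids melting above 25 C)
    and S the aqueous solubility in mol/L."""
    return 0.5 - 0.01 * (melting_point_c - 25.0) - log_p

# Hypothetical compound: melts at 125 C, logP = 2.0
print(gse_log_s(125.0, 2.0))   # 0.5 - 1.0 - 2.0 = -2.5
```

Each 100 °C of melting point costs one log unit of solubility, which is why a good melting-point model feeds directly into solubility prediction.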

Can Current Models be Improved? Bergström: 277 total compounds, PLS, RMSE = 49.8 °C, q² = 0.54; n_test:n_train = 185:92. Karthikeyan: 4173 total compounds, ANN, RMSE = 49.3 °C, q² = 0.66; n_test:n_train = 1042:2089. Jain: 1215 total compounds, MLR, AAE = 33.2 °C, no q² reported; n_test:n_train = 119:1094. RMSE slightly below 50 °C! q² just above 0.50!

Machine Learning Method k-Nearest Neighbours

Used for classification and regression. Suitable for large and diverse datasets. Local (tuneable via k) and non-linear. Calculation of a distance matrix takes the place of a training step; the distance matrix depends on the descriptors used.

Molecules and Numbers. Datasets: 4119 organic molecules of diverse complexity, MPs from 14 to … °C (dataset 1); 277 drug-like molecules, MPs from 40 to 345 °C (dataset 2); both datasets available in the recent literature. Descriptors: QuaSAR descriptors from MOE (Molecular Operating Environment), 146 2D and 57 3D descriptors; Sybyl atom type counts (53 different types, including information about hybridisation state).

Finding the Neighbours. Distances: Euclidean distance between point i and point j in r dimensions. Set of k nearest neighbours N = {(T_i, d_i)}, with k = 1, 5, 10, 15.
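The neighbour search can be sketched directly from these definitions; descriptor vectors and melting points below are invented toy values:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points in r descriptor dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_nearest(query, points, temps, k):
    """Return the k nearest neighbours as (T_i, d_i) pairs."""
    dists = sorted((euclidean(query, p), t) for p, t in zip(points, temps))
    return [(t, d) for d, t in dists[:k]]

# Toy descriptor vectors and melting points (hypothetical)
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
temps = [100.0, 120.0, 90.0, 200.0]
print(k_nearest((0.5, 0.0), points, temps, k=2))
# [(100.0, 0.5), (120.0, 0.5)]
```

In practice the full pairwise distance matrix over 4119 molecules would be computed once and reused for every query.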

Calculation of Predictions: T_pred = f(N). Unweighted: arithmetic (1) and geometric (2) average. Weighted: inverse distance (3) and exponential decrease (4).
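The four averaging schemes can be sketched as below; the exact functional forms (especially the exponential decay constant) are not given on the slide, so these are illustrative assumptions:

```python
import math

def arithmetic(temps, dists):
    return sum(temps) / len(temps)                      # (1) unweighted

def geometric(temps, dists):
    # (2) unweighted; assumes temperatures on a positive scale
    return math.prod(temps) ** (1.0 / len(temps))

def inverse_distance(temps, dists, eps=1e-9):
    # (3) nearer neighbours count more; eps guards against zero distance
    w = [1.0 / (d + eps) for d in dists]
    return sum(wi * t for wi, t in zip(w, temps)) / sum(w)

def exponential(temps, dists, alpha=1.0):
    # (4) exponential decrease; alpha is an assumed decay constant
    w = [math.exp(-alpha * d) for d in dists]
    return sum(wi * t for wi, t in zip(w, temps)) / sum(w)

temps = [300.0, 320.0, 350.0]    # hypothetical neighbour melting points (K)
dists = [0.1, 0.5, 2.0]          # hypothetical descriptor-space distances
for f in (arithmetic, geometric, inverse_distance, exponential):
    print(f.__name__, round(f(temps, dists), 1))
```

Both weighted schemes pull the prediction toward the closest neighbour, which is exactly the behaviour compared across datasets in the results slides.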

Assessment by Cross-validation. Select m random molecules for the test set; the N − m remaining molecules form the training set. Organics: N = 4119, m = 1000, m/N = 24.3%. Drugs: N = 277, m = 80, m/N = 28.9%. Workflow: data → random split into training/test → predict the test set (from the training set) → RMSEP; repeat 25 times → cross-validated root-mean-square error.

Assessment by Cross-validation (continued). This is not “25-fold CV”, where each 1/25th is predicted based on the rest: it is prediction of 25 random sets of 1000 molecules - Monte Carlo cross-validation.
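The Monte Carlo CV loop is simple to sketch. Here a trivial mean predictor stands in for the kNN model, and the data are invented, so only the resampling scheme itself is faithful to the talk:

```python
import random
import statistics

def monte_carlo_cv(y, m, repeats=25, seed=0):
    """Monte Carlo cross-validation: repeatedly draw a random test set of
    size m, 'train' on the remaining N - m values, and average the test
    RMSE over the repeats.  The model here is a mean predictor."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rmseps = []
    for _ in range(repeats):
        rng.shuffle(idx)
        test, train = idx[:m], idx[m:]
        pred = statistics.fmean(y[i] for i in train)        # "training" step
        rmse = (sum((y[i] - pred) ** 2 for i in test) / m) ** 0.5
        rmseps.append(rmse)
    return statistics.fmean(rmseps)

# Invented "melting points": mean 100, sd 15
data_rng = random.Random(1)
y = [data_rng.gauss(100.0, 15.0) for _ in range(200)]
print(round(monte_carlo_cv(y, m=50), 1))
```

Because each repeat draws a fresh random test set, a molecule can appear in several test sets - the key difference from k-fold CV noted above.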

Results, using a combination of 2D and 3D descriptors. Training set: 3119 molecules (organic); test set: 1000 molecules (organic). [Table of RMSE, r² and q² for the four weighting methods (A = arithmetic, G = geometric, I = inverse distance, E = exponential); numeric values not recovered.]

Results: Organic Dataset. [Table of RMSE, r² and q² for the four weighting methods; numeric values not recovered.] Using 15 nearest neighbours and exponential weighting. 2D descriptors only: RMSE = 47.2 °C, r² = 0.47; 3D descriptors only: RMSE = 50.7 °C, r² = 0.39 (q² values not recovered). Exponential weighting performs best!

Results: Drugs Dataset. [Table of RMSE, r² and q² for the four weighting methods; numeric values not recovered.] Using 10 nearest neighbours and inverse distance weighting. 2D descriptors only: RMSE = 46.5 °C, r² = 0.30; 3D descriptors only: RMSE = 48.4 °C, r² = 0.24 (q² values not recovered). Inverse distance weighting performs best!

Summary of Results. Organics: RMSE = 46.2 °C, r² = 0.49, n_test:n_train = 1000:3119. Drugs: RMSE = 42.2 °C, r² = 0.42, n_test:n_train = 80:197. Difficulties at the extremities of the temperature scale, due to distances in both descriptor space and temperature.

Summary of Results. Nonetheless, comparison with other models in the literature shows that the presented model performs better in terms of RMSE under more stringent conditions (even unoptimised), and is essentially parameterless - even the optimised model has many fewer parameters than, for example, an ANN.

Conclusions: Information content of the descriptors matters. Exponential weighting performs best. A genetic algorithm is able to improve performance, especially for atom type counts as descriptors. Remaining error: interactions in the liquid and/or solid state are not captured by single-molecule descriptors, or by the kNN method in this setting (datasets + descriptors).

Thanks Pfizer & PIPMS Unilever Dave Palmer Florian Nigsch

All slides after here are for information only

The Two Faces of Computational Chemistry Theory Empiricism

Informatics and Empirical Models: In general, informatics methods represent phenomena mathematically, but not in a physics-based way. Inputs and outputs are related by an empirically parameterised equation or a more elaborate mathematical model. Such methods do not attempt to simulate reality, and are usually high throughput.

Theoretical Chemistry Calculations and simulations based on real physics. Calculations are either quantum mechanical or use parameters derived from quantum mechanics. Attempt to model or simulate reality. Usually Low Throughput.

Optimisation with Genetic Algorithm: transformation of the data to yield a lower global RMSEP (global means: for the whole data set). Three operations transform the columns of the data matrix for optimal weighting. One bit per descriptor on the chromosomes: (op; param), consisting of an operation and a parameter. Operations: multiplication, power, log_e. Parameter ranges: (a, b) = (0, 30) for multiplication; (a, b) = (0, 3) for power; none for log_e.

Iterative Genetic Optimisation: initial data + initial chromosomes → transformation of data → calculation of distance matrix → evaluation of models → survival of the fittest → genetic operations → (stop if criterion reached) → set of optimal transformations.

Genetic Algorithm Able to Improve Results. [Table of RMSE (°C), r² and q² for datasets 1 and 2 with 2D, 3D and SAC descriptor sets; numeric values not recovered. SAC: Sybyl Atom Counts.] Application of the optimal transformations to the initial data and reassessment of the models; averages over 25 runs!

Errors at Low/High-Melting Ends… Signed RMSE analysis (with a definition of signed RMSE). Details: in 30 °C steps across the whole range; all 4119 molecules of dataset 1.

… Result From Large Distances. Distance in descriptor space: the NNs of high-melting compounds are typically further away. Distance in temperature: T_test − T_NN < 0 gives a negative neighbour, > 0 a positive neighbour.

Over- & Underpredictions at Extremities. As before: distance in descriptor space (NNs of high-melting compounds typically further away) and distance in temperature (T_test − T_NN; < 0 negative neighbour, > 0 positive neighbour). The plots show overprediction at one extremity of the temperature scale and underprediction at the other.