Master Chemoinfo: Virtual Screening. Alexandre Varnek, Faculté de Chimie, ULP, Strasbourg, France
From large libraries of molecules to hits against a target protein, two routes each produce a small library of selected hits: experimental (High-Throughput Screening) and computational (Virtual Screening: filtering, QSAR, docking).
Virtual screening must be fast and reliable. Chemical universe: 10^200 molecules; 10^60 drug-like molecules. Molecules are considered as vectors in a multidimensional chemical space defined by the descriptors.
Drug discovery pipeline (diagram labels): genomics; target; high-throughput screening (HTS); hits; data analysis; lead; optimization; development candidate.
Drug discovery and ADME/Tox studies should be performed in parallel. Pipeline: idea → target → combichem/HTS → hit → lead → candidate → drug, with ADME/Tox studies running alongside.
Methodologies of virtual screening (from A. R. Leach, V. J. Gillet, “An Introduction to Chemoinformatics”, Kluwer Academic Publishers, 2003)
Platform for ligand-based virtual screening: ~10^6 – 10^9 molecules → filters → similarity search → ~10^3 – 10^4 molecules → QSAR models → candidates for docking or experimental tests
High-throughput screening (HTS). Keywords: combinatorial chemistry; high-throughput screening (HTS); virtual screening; drug-likeness; training sets of up to 1,000,000 compounds
Virtual screening: molecules available for screening. (1) Real molecules: 1–2 million in the in-house archives of large pharma and agrochemical companies; 3–4 million samples available commercially. (2) Hypothetical molecules: virtual combinatorial libraries (up to 10^60 molecules).
Methods of virtual high-throughput screening: filters; similarity search; classification and regression structure–property models; docking
Filters to estimate “drug-likeness”
Lipinski rules for intestinal absorption (« Rule of 5 »): H-bond donors ≤ 5 (the sum of OH and NH groups); MWT ≤ 500; LogP ≤ 5; H-bond acceptors ≤ 10 (the sum of N and O atoms).
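The four cut-offs can be applied as a simple filter once the descriptors are known. The sketch below is illustrative only: it assumes the descriptor values have already been computed by some other tool, treats a value strictly above a cut-off as a violation, and uses hypothetical names (`Descriptors`, `rule_of_five_violations`). The paclitaxel values are the ones quoted later in these slides.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float
    log_p: float
    h_donors: int     # sum of OH and NH groups
    h_acceptors: int  # sum of N and O atoms


def rule_of_five_violations(d: Descriptors) -> list[str]:
    """Return the list of Lipinski cut-offs that the molecule exceeds."""
    violations = []
    if d.h_donors > 5:
        violations.append("H-bond donors > 5")
    if d.mol_weight > 500:
        violations.append("MWT > 500")
    if d.log_p > 5:
        violations.append("LogP > 5")
    if d.h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


# Paclitaxel, with the descriptor values quoted on the Taxol slide.
paclitaxel = Descriptors(mol_weight=837, log_p=4.49, h_donors=3, h_acceptors=15)
print(rule_of_five_violations(paclitaxel))
```

With these inputs the filter reports the molecular-weight and acceptor cut-offs, matching the two violations stated for Taxol.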
Lipinski rules for drug-like molecules (« Rules of 5 »)
Example of different filters: rules for absorbable compounds
It is instructive to compare our permeability model with Lipinski’s and Veber’s rules. All three models are described by similar parameters. The following table shows the maximum cut-off values for absorbable compounds in our data set; the cut-off values of Lipinski’s and Veber’s rules are shown in bold. Comparing the three columns shows that in most cases the cut-off values of Lipinski’s and Veber’s rules are exceeded by 100 percent. This observation has two explanations. First, the three models deal with quite different biological phenomena: Lipinski analyzed compounds that reached the second phase of clinical trials, Veber analyzed oral bioavailability in rats (which is affected by metabolism to a much greater extent than HIA), whereas we analyzed HIA. Second, the models were derived using quite different analytical tools: we used C-SAR analysis, which automatically considers a large variety of possible causes of poor permeability, while Lipinski and Veber used conventional data mining techniques.
Remove compounds containing too many rings
Remove compounds with toxic groups
Remove compounds with reactive groups
Remove False-Positive Hits
Remove poorly soluble compounds
Filter on inorganic and heteroatom compounds
Remove compounds with multiple chiral centers
Paclitaxel (Taxol): violation of 2 rules. MW = 837; logP = 4.49; HD = 3; HA = 15
logD vs logP. 95% of all drugs are ionizable: 75% are bases and 20% are acids. Using the pH-dependent log D as a descriptor for lipophilicity in place of log P significantly increases the number of compounds correctly identified as drug-like by the drug-likeness filter: log D (pH 5.5) < 5. S. K. Bhal, K. Kassam, I. G. Peirson, and G. M. Pearl, “The Rule of Five Revisited: Applying Log D in Place of Log P in Drug-Likeness Filters”, Molecular Pharmaceutics, 4, 556-560 (2007)
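To make the difference between log P and log D concrete, here is a minimal sketch using the textbook relation for a monoprotic compound, under the assumption that only the neutral form partitions into the organic phase. The function names and the example compound are hypothetical; the cited paper may compute log D differently.

```python
import math


def log_d(log_p: float, pka: float, ph: float, acid: bool) -> float:
    """log D of a monoprotic acid or base.

    Assumes only the neutral species partitions, so
    log D = log P - log10(1 + 10**(pH - pKa)) for an acid and
    log D = log P - log10(1 + 10**(pKa - pH)) for a base.
    """
    exponent = (ph - pka) if acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)


def passes_log_d_filter(log_p: float, pka: float, acid: bool,
                        ph: float = 5.5, cutoff: float = 5.0) -> bool:
    """Drug-likeness filter from the slide: log D at pH 5.5 must be below 5."""
    return log_d(log_p, pka, ph, acid) < cutoff


# Hypothetical acid: log P = 5.6 fails the log P < 5 test,
# but the ionized fraction at pH 5.5 brings log D below the cut-off.
print(round(log_d(5.6, 4.5, 5.5, acid=True), 2))
print(passes_log_d_filter(5.6, 4.5, acid=True))
```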
Synthetic accessibility is estimated from how frequently a molecule’s fragments occur in the PubChem database. Ertl and Schuffenhauer, Journal of Cheminformatics 2009, 1:8
Synthetic accessibility: frequency distribution of fragments. Altogether, 605,864 different fragment types were obtained by fragmenting the PubChem structures. Most of them (51%), however, are singletons (present only once in the whole set). Only a relatively small number of fragments, namely 3759 (0.62%), are frequent (i.e. present more than 1000 times in the database). Ertl and Schuffenhauer, Journal of Cheminformatics 2009, 1:8
Synthetic accessibility: the most common fragments present in the one million PubChem molecules. "A" represents any non-hydrogen atom, a dashed double bond indicates an aromatic bond, and the yellow circle marks the central atom of the fragment. Ertl and Schuffenhauer, Journal of Cheminformatics 2009, 1:8
Synthetic accessibility: distribution of (-SAscore) for natural products, bioactive molecules and molecules from catalogues. Correlation of calculated (-SAscore) with the average chemist estimation for 40 molecules (r^2 = 0.890). Ertl and Schuffenhauer, Journal of Cheminformatics 2009, 1:8
Similarity Search: unsupervised and supervised approaches
2D (unsupervised) similarity search: comparison of structural keys (molecular fingerprints) using the Tanimoto coefficient. Example fingerprints: 1010001001110110101 and 0010001001110110101
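As an illustration of the fingerprint comparison, the sketch below computes the Tanimoto coefficient, Tc = c / (a + b - c), where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both. It assumes that the 38 bits printed on the slide are two 19-bit fingerprints; that split, and the function name, are assumptions.

```python
def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient of two equal-length bit strings of '0'/'1'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" and b == "1")
    either = sum(1 for a, b in zip(fp_a, fp_b) if a == "1" or b == "1")
    # Convention chosen here: two all-zero fingerprints get similarity 0.
    return both / either if either else 0.0


# The slide's bits, read as two 19-bit fingerprints (assumed split).
fp_1 = "1010001001110110101"
fp_2 = "0010001001110110101"
print(tanimoto(fp_1, fp_2))  # 9 common bits out of 10 set in either -> 0.9
```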
Continuous and Discontinuous SAR
Structural spectrum of thrombin inhibitors: structural similarity to the reference compounds is “fading away” (similarity values shown in the figure: 0.56, 0.72, 0.53, 0.84, 0.67, 0.52, 0.82, 0.64, 0.39)
Continuous SARs: gradual changes in structure result in moderate changes in activity (“rolling hills”, G. Maggiora). Discontinuous SARs: small changes in structure have dramatic effects on activity (“cliffs” in activity landscapes). Structure-Activity Landscape Index: SALI_ij = ΔA_ij / ΔS_ij, where ΔA_ij = |A_i - A_j| is the difference between the activities of molecules i and j, and ΔS_ij = 1 - sim(i, j) is their structural dissimilarity. R. Guha et al., J. Chem. Inf. Model., 2008, 48, 646
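A short sketch of the SALI calculation may help. It takes ΔA as the absolute activity difference and ΔS as 1 minus the similarity; expressing activities as pIC50 and returning infinity for a pair with similarity 1.0 are choices made here, not something the slide specifies. The example pair is the VEGFR-2 analog pair quoted on the following slide.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    return 9.0 - math.log10(ic50_nm)


def sali(activity_i: float, activity_j: float, similarity: float) -> float:
    """SALI_ij = |A_i - A_j| / (1 - sim_ij)."""
    delta_activity = abs(activity_i - activity_j)
    dissimilarity = 1.0 - similarity
    if dissimilarity == 0.0:
        # Identical fingerprints: an activity difference is an infinitely
        # steep cliff; no difference is reported as 0 (a convention).
        return math.inf if delta_activity > 0 else 0.0
    return delta_activity / dissimilarity


# VEGFR-2 pair: 6 nM vs 2390 nM at MACCS Tc = 1.00.
a_i, a_j = pic50_from_nm(6.0), pic50_from_nm(2390.0)
print(round(a_i - a_j, 2))   # activity difference in log units
print(sali(a_i, a_j, 1.00))  # inf: the fingerprint cannot tell the pair apart
```

The infinite value shows why such pairs are "bad news" for similarity analysis: the fingerprint sees no structural difference at all.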
Discontinuous SARs: VEGFR-2 tyrosine kinase inhibitors. An analog pair with MACCS Tc = 1.00 has potencies of 6 nM and 2390 nM. Small changes in structure have dramatic effects on activity (“cliffs” in activity landscapes): relevant to lead optimization and QSAR, but bad news for molecular similarity analysis.
Example of a “classical” discontinuous SAR: adenosine deaminase inhibitors. Any similarity method must recognize these compounds as being “similar” ... (MACCS Tanimoto similarity)
Supervised Molecular Similarity Analysis
Dynamic Mapping of Consensus Positions (DMC). Prototypic “mapping algorithm” for simplified, binary-transformed* descriptor spaces. Uses known active compounds to create activity-dependent consensus positions in chemical space. Operates in descriptor spaces of stepwise increasing dimensionality (“dimension extension”). Selects preferred descriptors from large pools. * Median-based: assign “1” to a descriptor if its value is greater than or equal to its screening-database median, and “0” if it is smaller. Godden et al. & Bajorath, J. Chem. Inf. Comput. Sci. 44, 21 (2004)
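The median-based binary transformation in the footnote can be sketched as follows. The data layout (one list of descriptor values per compound) and the function names are assumptions made for illustration.

```python
from statistics import median


def database_medians(database: list[list[float]]) -> list[float]:
    """Median of each descriptor (column) over the screening database."""
    return [median(column) for column in zip(*database)]


def binarize(descriptors: list[float], medians: list[float]) -> list[int]:
    """Assign 1 if the value is >= the database median, otherwise 0."""
    return [1 if value >= m else 0 for value, m in zip(descriptors, medians)]


# Toy database: three compounds, two descriptors each.
database = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
medians = database_medians(database)
print(medians)
print([binarize(row, medians) for row in database])
```

Reference molecules are transformed with the same database medians, so that their bit strings live in the same binary space as the compounds being screened.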
DMC algorithm
1. Calculate and binary-transform the descriptors; generate descriptor bit strings for the reference molecules.
2. Compare the descriptor bit strings of the reference molecules and determine the consensus bits.
3. Select database (DB) compounds matching the consensus bits.
4. Re-generate the bit strings, permitting bit variability.
5. Select DB compounds matching the extended bit strings.
6. Repeat until a small selection set is obtained.
Consensus bit string: bit frequency = 1.0 or = 0.0 (no variability). 1st dimension extension: ≥ 0.9 or ≤ 0.1 (10% variability). 2nd dimension extension: ≥ 0.8 or ≤ 0.2 (20% variability). In the figure, white is “0”, black is “1”, and gray marks variably set bits. With, e.g., 0%, 10%, 20% permitted bit variability, the bit strings become longer and fewer DB compounds match.
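The selection loop can be sketched in a few lines. This is a simplified reading of the slide, not the published implementation: a bit position joins the consensus string when the fraction of reference molecules with that bit set is at least 1 - v or at most v, for a permitted variability v; database compounds must match every consensus position. All names, the toy data and the stopping size are assumptions.

```python
def consensus_bits(reference_bits, variability, eps=1e-9):
    """Map bit position -> required value for the given permitted variability."""
    n_refs = len(reference_bits)
    consensus = {}
    for position, column in enumerate(zip(*reference_bits)):
        frequency = sum(column) / n_refs  # fraction of references with bit = 1
        if frequency >= 1.0 - variability - eps:
            consensus[position] = 1
        elif frequency <= variability + eps:
            consensus[position] = 0
    return consensus


def matches(bits, consensus):
    """True if the compound agrees with every consensus position."""
    return all(bits[position] == value for position, value in consensus.items())


def dmc_select(reference_bits, database_bits, levels=(0.0, 0.1, 0.2), max_hits=1):
    """Extend the consensus string level by level until few compounds remain."""
    selected = list(range(len(database_bits)))
    for variability in levels:
        consensus = consensus_bits(reference_bits, variability)
        selected = [i for i in selected if matches(database_bits[i], consensus)]
        if len(selected) <= max_hits:
            break
    return selected


# Ten toy reference molecules: bit 0 always set, bit 1 set in 9 of 10,
# bit 2 set in 5 of 10 (never part of the consensus at these levels).
references = [[1, 1, 1 if i < 5 else 0] for i in range(10)]
references[9][1] = 0
database = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]

print(consensus_bits(references, 0.0))  # only bit 0 is invariant
print(consensus_bits(references, 0.1))  # bit 1 is added at 10% variability
print(dmc_select(references, database))
```

Each extension adds positions to the consensus string, which is why fewer database compounds match at every step.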
QSAR/QSPR models
Screening and hit selection: a database undergoes virtual screening with a QSPR model, which separates hits (sent to experimental tests) from useless compounds.
Library profiling: indexing a database by simultaneous assessment of various activities. Example: PASS software (Prediction of Activity Spectra for Substances)
For each fragment i
PASS: naïve Bayes estimator. Calculation of « P(act) » and « P(inact) ». A molecule is considered active if P(act) > P(inact) and/or P(act) > 0.7.
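For readers unfamiliar with naïve Bayes, here is a generic fragment-based sketch. It is not the PASS estimator: it uses plain Laplace-smoothed fragment frequencies, considers only the fragments present in the query, and normalizes so that P(act) + P(inact) = 1. The fragment labels, the training data and the function names are hypothetical; only the 0.7 threshold comes from the slide.

```python
import math
from collections import Counter


def train(molecules):
    """molecules: list of (set of fragment labels, is_active) pairs."""
    n_act = sum(1 for _, active in molecules if active)
    n_inact = len(molecules) - n_act
    act_counts, inact_counts = Counter(), Counter()
    for fragments, active in molecules:
        (act_counts if active else inact_counts).update(fragments)
    return n_act, n_inact, act_counts, inact_counts


def p_active(fragments, model):
    """Posterior P(act) from the class priors and per-fragment likelihoods."""
    n_act, n_inact, act_counts, inact_counts = model
    total = n_act + n_inact
    log_act = math.log(n_act / total)
    log_inact = math.log(n_inact / total)
    for fragment in fragments:
        # Laplace smoothing avoids zero probabilities for unseen fragments.
        log_act += math.log((act_counts[fragment] + 1) / (n_act + 2))
        log_inact += math.log((inact_counts[fragment] + 1) / (n_inact + 2))
    top = max(log_act, log_inact)  # subtract the max for numerical stability
    e_act, e_inact = math.exp(log_act - top), math.exp(log_inact - top)
    return e_act / (e_act + e_inact)


def is_active(p_act, threshold=0.7):
    """Stricter reading of the slide's rule: P(act) > P(inact) and P(act) > 0.7."""
    return p_act > 1.0 - p_act and p_act > threshold


training = [({"A", "B"}, True), ({"A"}, True), ({"C"}, False), ({"B", "C"}, False)]
model = train(training)
print(round(p_active({"A"}, model), 2), is_active(p_active({"A"}, model)))
print(round(p_active({"C"}, model), 2), is_active(p_active({"C"}, model)))
```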
Quantitative Structure-Property Relationships (QSPR): Y = f(Structure) = f(descriptors). QSPR models give reliable predictions only for compounds that are similar to those used to build the models. Similarity and pharmacophore search approaches therefore remain indispensable as complementary tools.
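The restriction to compounds similar to the training set is usually enforced with an applicability-domain check. The sketch below uses the simplest possible definition, a bounding box over the training descriptors; this choice and all names are assumptions, and other definitions (for example, nearest-neighbor similarity) are equally common.

```python
def descriptor_ranges(training_set):
    """(min, max) of each descriptor over the training compounds."""
    return [(min(column), max(column)) for column in zip(*training_set)]


def in_domain(descriptors, ranges):
    """True if every descriptor lies inside the range seen during training."""
    return all(low <= value <= high
               for value, (low, high) in zip(descriptors, ranges))


# Toy training set: (molecular weight, logP) for three compounds.
training_set = [[250.0, 1.2], [310.0, 2.8], [420.0, 3.5]]
ranges = descriptor_ranges(training_set)

print(in_domain([300.0, 2.0], ranges))  # inside the box: prediction is usable
print(in_domain([837.0, 4.5], ranges))  # outside: prediction should be discarded
```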
Combinatorial Library Design
Virtual screening when the target structure is unknown (diagram labels): virtual library; diverse subset; design of focused library; screening library; parallel synthesis or synthesis of single compounds; screening (HTS); hits
Generation of virtual combinatorial libraries: fragment marking approach. Markush structure: if R1, R2, R3 = (substituents shown in the figure), then (enumerated structures shown in the figure)
The types of variation in Markush structures: substituent variation (R1 = Me, Et, Pr); position variation (R2 = NH2); frequency variation (n = 1 – 3); homology variation (R3 = alkyl or heterocycle; only for patent search)
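Enumerating a Markush structure amounts to taking the Cartesian product of its variable parts. The sketch below combines the substituent variation (R1 = Me, Et, Pr) with the frequency variation (n = 1 to 3) from the slide; the scaffold template, an R1-O-(CH2)n-phenyl ether written as a SMILES string, is purely hypothetical and stands in for the structure shown in the figure.

```python
from itertools import product

# Substituent variation from the slide, written as SMILES fragments.
R1_GROUPS = {"Me": "C", "Et": "CC", "Pr": "CCC"}
# Frequency variation from the slide: n = 1 to 3 repeated CH2 units.
CHAIN_LENGTHS = (1, 2, 3)


def enumerate_library(template="{r1}O{chain}c1ccccc1"):
    """Enumerate every R1 / n combination of a hypothetical scaffold."""
    library = {}
    for (name, r1_smiles), n in product(R1_GROUPS.items(), CHAIN_LENGTHS):
        library[f"{name}, n={n}"] = template.format(r1=r1_smiles, chain="C" * n)
    return library


library = enumerate_library()
print(len(library))         # 3 substituents x 3 chain lengths
print(library["Et, n=2"])
```

The library size grows as the product of the number of options at each variable site, which is how virtual combinatorial libraries reach very large sizes.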
Generation of virtual combinatorial libraries: reaction transform approach (from A. R. Leach, V. J. Gillet, “An Introduction to Chemoinformatics”, Kluwer Academic Publishers, 2003)
Issues and concepts in combinatorial library design: size of the library; coverage of properties (“chemical space”); diversity, similarity, redundancy; descriptor validation; subset selection from virtual libraries
Hot topics in chemoinformatics: predictions vs interpretation; new approaches in structure-property modeling (descriptors, applicability domain, machine-learning methods such as inductive learning transfer, semi-supervised learning, ...); new techniques to mine chemical reactions; QSAR of complex systems (multi-component synergistic mixtures, new materials, metabolic pathways, ...); public availability of chemoinformatics tools
Predictions vs interpretation Nathan BROWN “Chemoinformatics—An Introduction for Computer Scientists” ACM Computing Surveys, Vol. 41, No. 2, Article 8, February 2009
Predictions vs interpretation. Problems: ensemble modeling; non-linear machine-learning methods (SVM, NN, …); descriptor correlations. What do end users expect from QSAR models? Reliable estimation (prediction) of the given property.
Public accessibility of models: web-based platform for virtual screening
Some Screen Shots: Welcome Page…
ISIDA property prediction WEB server infochim.u-strasbg.fr/webserv/VSEngine.html
ISIDA ScreenDB tools (http://infochim.u-strasbg.fr/webserv/VSEngine.html): only an Internet browser is required; different descriptors (ISIDA fragments, FPT, ChemAxon); similarity search with different metrics (Tanimoto, Dice, …); ensemble modeling approach (simultaneous application of several models); models’ applicability domain (automatic detection of useless models)
“The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties.” George S. Hammond, Norris Award Lecture, 1968