Combining Statistical and Physical Considerations in Deriving Targeted QSPRs Using Very Large Molecular Descriptor Databases Inga Paster and Mordechai.

Presentation transcript:

Combining Statistical and Physical Considerations in Deriving Targeted QSPRs Using Very Large Molecular Descriptor Databases
Inga Paster and Mordechai Shacham, Dept. Chem. Eng., Ben-Gurion University of the Negev, Beer-Sheva, Israel
Greta Tovarovski and Neima Brauner, School of Engineering, Tel-Aviv University, Tel-Aviv, Israel

The Targeted QSPR Method
OBJECTIVE: Predict physical properties of a target compound using structural information on that compound together with structural and property-related information on similar, predictive compounds (the training set). The structural information is presented in the form of molecular descriptors (calculated properties of the molecule).
ALGORITHM – STEP 1:
Similarity group: Select a group of compounds similar to the target based on similarity measures (e.g., the correlation between the molecular-descriptor vectors of the target and of potential predictive compounds).
Training set: Select the most similar predictive compounds for which data for the target property are available.
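Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_training_set`, the array layout, and the cutoff `k` are hypothetical and not taken from the slides, and the Pearson correlation coefficient stands in for whichever similarity measure is actually used. In practice the descriptors would probably need to be scaled first, since raw descriptor values span very different ranges.

```python
import numpy as np

def select_training_set(target, candidates, has_property, k=10):
    """Rank candidate compounds by correlation with the target.

    target       : (n_desc,) descriptor vector of the target compound
    candidates   : (n_comp, n_desc) descriptor vectors of potential
                   predictive compounds
    has_property : (n_comp,) booleans, True where the target property
                   has been measured for that compound
    Returns the indices of the k most similar compounds that have
    property data, plus the correlation of every candidate.
    Assumes no descriptor vector is constant (non-zero variance).
    """
    t = target - target.mean()
    c = candidates - candidates.mean(axis=1, keepdims=True)
    # Pearson correlation between the target and each candidate row.
    r = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))
    order = np.argsort(-r)  # most similar first
    chosen = [int(i) for i in order if has_property[i]][:k]
    return chosen, r
```

The whole ranked list corresponds to the similarity group; the filtered top `k` entries correspond to the training set.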

The Targeted QSPR Method
ALGORITHM – STEP 2:
QSPR model: Use a stepwise regression program to identify a (linear) QSPR (Quantitative Structure Property Relationship) model that best represents the property data of the training set in terms of molecular descriptors.
ALGORITHM – STEP 3:
Prediction: Use the QSPR model and the descriptor values of the target compound (and of the other compounds in the similarity group, which form the validation set) to predict the property value.
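Steps 2 and 3 can be illustrated with a plain forward-selection loop. This is a minimal sketch, not the stepwise regression program used by the authors: the helper names (`fit_linear`, `forward_stepwise`, `predict`), the stopping rule (a fixed `max_terms` and a small tolerance on the drop in the sum of squared errors), and the absence of any statistical significance test are all assumptions made for brevity.

```python
import numpy as np

def fit_linear(X, y, cols):
    """Least-squares fit of y on an intercept plus the chosen columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return beta, float(resid @ resid)

def forward_stepwise(X, y, max_terms=2, tol=1e-12):
    """Add one descriptor at a time, always the one that lowers SSE most."""
    selected = []
    beta, sse = fit_linear(X, y, selected)
    while len(selected) < max_terms:
        best = None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            b, s = fit_linear(X, y, selected + [j])
            if best is None or s < best[2]:
                best = (j, b, s)
        # Stop when no descriptor is left or the improvement is negligible.
        if best is None or sse - best[2] <= tol:
            break
        selected.append(best[0])
        beta, sse = best[1], best[2]
    return selected, beta

def predict(x, selected, beta):
    """Apply the fitted QSPR to the descriptor vector x of one compound."""
    idx = np.array(selected, dtype=int)
    return float(beta[0] + np.dot(beta[1:], x[idx]))
```

Here `X` holds the training-set descriptors (one row per compound), `y` holds the measured property values, and `predict` is applied to the target compound and to the validation-set compounds.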

Similarity group of 1-tridecanol – an example

Predicting NBP for 1-tridecanol
The QSPR: NBP = *EEig10r *E3m
Descriptor EEig10r: eigenvalue 10 from the edge adjacency matrix weighted by resonance integrals (a 2-D descriptor)

Predicting NBP for 1-tridecanol
The QSPR: NBP = *EEig10r *E3m
Descriptor E3m: 3rd component accessibility directional WHIM (Weighted Holistic Invariant Molecular descriptor) index, weighted by atomic masses (a 3-D descriptor)

Descriptor Types
Computer programs that can calculate several thousand molecular descriptors are available.
Examples: Molecular Weight, Number of aromatic C, Wiener Index, 3D Wiener Index, MlogP

Algorithms Used by NIST* for Minimization of the Molecular Structure
Initial structures are the 2-D MOL files.
3-D structures are generated using the Alchemy 2000 desktop software package and its native molecular-mechanics force field.
The structures are re-optimized using the MM3 force field and the simulated annealing algorithm included in the Tinker software package.
Final optimization is at the PM3 level (using the version of MOPAC6 bundled with the Alchemy 2000 package or, in some cases, the Gaussian 94 software package).
*National Institute of Standards and Technology (NIST). In: Linstrom PJ, Mallard WG, eds. Chemistry WebBook, NIST Standard Reference Database Number 69. Gaithersburg, MD: NIST; June 2005.

The Importance of the Reliability of the Descriptors
It is practically impossible to check the accuracy and consistency of the individual descriptor values for the large number of descriptors and compounds involved.
The 3-D descriptors can be particularly unreliable because of the uncertainty associated with the minimization of the 3-D structure.
Reliable and consistent descriptor values are particularly important in the selection of the training set.
If different software packages are used for calculating the descriptors of the predictive compounds (in the database) and of the target compound (not in the database), inconsistency in the descriptors included in the QSPR may cause poor property prediction.
Descriptor “noise level”: The effect of the 3-D minimization technique on the descriptor value should be considered in establishing a reliable estimate of the “noise level”.

Generation of Molecular Structure Files and Molecular Descriptors
For the first part of this study a database containing 326 compounds (hydrocarbons, 1-alcohols and n-aliphatic acids) was used.
The molecular geometries were optimized using the CNDO (Complete Neglect of Differential Overlap) semi-empirical method implemented in the HyperChem package*.
The Dragon+ program was used to calculate 1664 descriptors for the compounds in the database from minimized-energy molecular models.
*HyperChem program, version 7.01. HyperChem is copyrighted by Hypercube Inc.
+Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. DRAGON user manual, Talete srl, Milano, Italy, ©TALETE srl,

Test 1 – Plotting Descriptors of Same-Family Neighboring Compounds
Outlying, unreliable descriptors

Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series
Monotonic change: reliable descriptors

Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series
Separate curves for odd and even n_c: consistent with some solid properties

Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series
Inconsistent (random) variation of the descriptor value with n_c: unreliable descriptor (Gm, a 3-D WHIM descriptor)
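The three outcomes of Test 2 (monotonic, separate odd/even curves, irregular) are judged visually from plots in the slides. A rough programmatic stand-in might look like the sketch below. The function names, the three labels, and the use of strict monotonicity with no tolerance are all assumptions of this sketch; real descriptor data would likely need a tolerance tied to the descriptor noise level.

```python
import numpy as np

def is_monotonic(values):
    """True if the sequence is strictly increasing or strictly decreasing."""
    d = np.diff(np.asarray(values, dtype=float))
    return bool(np.all(d > 0) or np.all(d < 0))

def classify_descriptor(n_c, values):
    """Classify how a descriptor varies along a homologous series.

    n_c    : number of carbon atoms of each member of the series
    values : descriptor value of each member
    Returns "monotonic", "odd-even" (separate monotonic curves for odd
    and even n_c) or "irregular".
    """
    n_c = np.asarray(n_c)
    values = np.asarray(values, dtype=float)
    order = np.argsort(n_c)
    n_c, values = n_c[order], values[order]
    if is_monotonic(values):
        return "monotonic"
    odd = values[n_c % 2 == 1]
    even = values[n_c % 2 == 0]
    # Require at least three points per branch before calling it a curve.
    if len(odd) >= 3 and len(even) >= 3 and is_monotonic(odd) and is_monotonic(even):
        return "odd-even"
    return "irregular"
```

A descriptor labeled "irregular" by such a check would be a candidate for exclusion, but the plot should still be inspected before discarding it.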

Test 3 – Comparing 3-D Descriptors Obtained from 3-D Structure Files Minimized by Different Algorithms
Compounds for which 3-D MOL files are available from NIST and Dragon

Visual comparison of Dragon and NIST 3-D structure files using Gaussian 3

Groups of 3-D Descriptors Calculated by Dragon

Percent differences between 3-D descriptors based on NIST and Dragon Library MOL files (28 compounds) n-hexane, 2-methylpentane and 1-propanol
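The comparison behind this slide can be expressed as a simple element-wise calculation. The sketch below assumes that one source (for example the NIST-based values) serves as the reference in the denominator; the slides do not state which source was used as the reference, and the function name is hypothetical.

```python
import numpy as np

def percent_difference(d_ref, d_other):
    """Absolute percent difference of each descriptor value.

    d_ref   : descriptor values from the reference structure files
    d_other : the same descriptors from the other structure files
    Entries whose reference value is zero come out as inf or nan.
    """
    d_ref = np.asarray(d_ref, dtype=float)
    d_other = np.asarray(d_other, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return 100.0 * np.abs(d_other - d_ref) / np.abs(d_ref)
```

For example, reference values of 10 and 2 compared with 11 and 1.5 give differences of 10% and 25%.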

Data for an Extensive 709 Compounds Study
For this study 709 compounds from the DIPPR database were used.
For these compounds 3-D MOL files are available from the NIST database and from molecular structures minimized by the DIPPR staff.
For the latter, most of the compounds were minimized in Gaussian 03 using B3LYP/6-311+G(3df,2p), which is a density functional method.
Most of the other compounds were optimized using HF/6-31G*, which is a Hartree-Fock ab initio method with a medium-sized basis set.
The Dragon 5.5 program was used to calculate 3224 descriptors.

Results for the 709 Compounds Study

Conclusions and Future Work
1. It has been shown that the 3-D descriptors may have various levels of inconsistency, depending on the algorithm used for minimization of the 3-D structure.
2. To determine the effects of descriptor inconsistency on training-set selection for various families of compounds, comparative studies involving descriptors of various levels of consistency must be carried out.
3. To determine whether inconsistent descriptors can be excluded from TQSPRs prepared for particular properties and particular families of target compounds, a comparative study to this effect must be carried out.
4. It is always preferable to use the same molecular structure minimization algorithm for the members of the training set and the target compound.

Selection of the Database and the Target Property Using the Property Prediction GUI

Similarity Group Identification for 1-methyl-3-isopropylbenzene

Derivation of the “Target QSPR Model BP = ALOGP