Combining Statistical and Physical Considerations in Deriving Targeted QSPRs Using Very Large Molecular Descriptor Databases Inga Paster and Mordechai.

1 Combining Statistical and Physical Considerations in Deriving Targeted QSPRs Using Very Large Molecular Descriptor Databases Inga Paster and Mordechai Shacham Dept. Chem. Eng. Ben-Gurion University of the Negev Beer-Sheva, Israel Greta Tovarovski and Neima Brauner School of Engineering Tel-Aviv University Tel-Aviv, Israel

2 The Targeted QSPR Method OBJECTIVE: Predicting physical properties of a Target compound using structural information of this compound and structural and property-related information of similar, predictive compounds (Training set). The structural information is presented in the form of molecular descriptors (calculated properties of the molecule). ALGORITHM – STEP 1: Similarity group: Select a group of compounds similar to the target based on similarity measures (e.g., correlation between the vectors of molecular descriptors of the target and potential predictive compounds). Training set: Select the most similar predictive compounds for which target property data are available.
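
Step 1 can be illustrated with a minimal sketch, assuming the similarity measure is the Pearson correlation between descriptor vectors (the slide gives correlation only as an example). The function names, compound labels, and numbers below are hypothetical toy values, not data from the study.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def select_training_set(target, candidates, known_property, size):
    """Rank candidates by similarity to the target, then keep the most
    similar ones for which a property value is available."""
    ranked = sorted(
        candidates,
        key=lambda name: pearson(target, candidates[name]),
        reverse=True,
    )
    return [name for name in ranked if name in known_property][:size]

# Hypothetical descriptor vectors (toy numbers for illustration only).
target = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "A": [1.1, 2.1, 2.9, 4.2],
    "B": [4.0, 3.0, 2.0, 1.0],
    "C": [1.0, 2.5, 2.6, 4.1],
    "D": [2.0, 4.0, 6.0, 8.0],  # most similar, but has no property data
}
known_property = {"A": 400.0, "B": 350.0, "C": 410.0}

training = select_training_set(target, candidates, known_property, size=2)
print(training)
```

Compound D correlates best with the target but is skipped because it has no property value, which mirrors the distinction between the similarity group and the training set.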

3 The Targeted QSPR Method ALGORITHM – STEP 2: QSPR Model: Use a stepwise regression program to identify a (linear) QSPR (Quantitative Structure Property Relationship) model that best represents the property data of the training set in terms of molecular descriptors. ALGORITHM – STEP 3: Prediction: Use the QSPR model and the descriptor values of the target compound (and of the other compounds in the similarity group, the Validation set) to predict its property value.
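
Steps 2 and 3 can be sketched as follows. The slides do not say which stepwise regression program or stopping rule is used, so this is a simple forward-selection variant under assumed settings (`max_terms`, `min_gain`), not the authors' procedure. All names and numbers are hypothetical, and the small solver assumes the selected descriptors are not collinear.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit(columns, y):
    """Least-squares fit y ~ b0 + sum(b_k * column_k) via the normal
    equations. Returns (coefficients, residual sum of squares)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    rss = sum(
        (yi - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, yi in zip(rows, y)
    )
    return beta, rss

def forward_stepwise(descriptors, y, max_terms=2, min_gain=0.05):
    """Add, one at a time, the descriptor that most reduces the residual
    sum of squares; stop when the relative gain falls below min_gain."""
    chosen = []
    beta, rss = fit([], y)
    while len(chosen) < max_terms:
        trials = {
            name: fit([descriptors[c] for c in chosen + [name]], y)
            for name in descriptors
            if name not in chosen
        }
        if not trials:
            break
        best = min(trials, key=lambda name: trials[name][1])
        if rss - trials[best][1] <= min_gain * rss:
            break
        chosen.append(best)
        beta, rss = trials[best]
    return chosen, beta

def predict(beta, values):
    """Evaluate the linear model for one compound's descriptor values."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], values))

# Hypothetical training set: y was generated as 10 + 2*d1 - 3*d2.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "d2": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "d3": [3.0, 1.0, 2.0, 2.0, 1.0, 3.0],
}
y = [9.0, 17.0, 13.0, 21.0, 17.0, 25.0]

chosen, beta = forward_stepwise(descriptors, y)
print(chosen, beta)
print(predict(beta, [7.0, 1.0]))  # target described by (d1, d2) values
```

With very large descriptor databases, a production implementation would also need to guard against collinear descriptors and against selecting descriptors whose values are dominated by noise.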

4 Similarity group of 1-tridecanol – an example

5 Predicting NBP for 1-tridecanol The QSPR: NBP = 601.0456 + 65.76087*EEig10r - 1061.998*E3m Descriptor EEig10r - Eigenvalue 10 from edge adjacency matrix weighted by resonance integrals (a 2-D descriptor)

6 Predicting NBP for 1-tridecanol The QSPR: NBP = 601.0456 + 65.76087*EEig10r - 1061.998*E3m Descriptor E3m - 3rd component accessibility directional WHIM (Weighted Holistic Invariant Molecular descriptor) index, weighted by atomic masses (a 3-D descriptor)
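
The two-descriptor QSPR quoted on these slides can be evaluated directly. In the sketch below the coefficients are taken from the slide, while the function name and the descriptor values passed in are placeholders (the transcript gives neither the descriptor values for 1-tridecanol nor the units of NBP).

```python
def nbp_qspr(eeig10r, e3m):
    """NBP from the linear QSPR on the slide:
    NBP = 601.0456 + 65.76087*EEig10r - 1061.998*E3m
    """
    return 601.0456 + 65.76087 * eeig10r - 1061.998 * e3m

# Placeholder descriptor values, NOT the real values for 1-tridecanol.
example_nbp = nbp_qspr(eeig10r=1.0, e3m=0.1)
print(example_nbp)
```

Because the E3m coefficient is large in magnitude, a small change in this 3-D descriptor shifts the prediction noticeably, which is why the consistency of 3-D descriptor values matters for the model.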

7 Descriptor Types Computer programs that can calculate several thousands of molecular descriptors are available. Examples: Molecular Weight, Number of aromatic C, Wiener Index, 3D Wiener Index, MlogP

8 Algorithms Used by NIST* for Minimization of the Molecular Structure *National Institute of Standards and Technology (NIST). In: Linstrom PJ, Mallard WG, eds. Chemistry WebBook, NIST Standard Reference Database Number 69. Gaithersburg, MD: NIST; June 2005 (http://webbook.nist.gov). Initial structures are the 2-D MOL files. 3-D structures are generated using the Alchemy 2000 desktop software package and its native molecular-mechanics force field. The structures are re-optimized using the MM3 force field and the simulated annealing algorithm included in the Tinker software package. The final optimization is done at the PM3 level (using the version of MOPAC6 bundled with the Alchemy 2000 package or, in some cases, the Gaussian 94 software package).

9 The Importance of the Reliability of the Descriptors It is practically impossible to check the accuracy and consistency of the individual descriptor values for the large number of descriptors and compounds involved. The 3-D descriptors can be particularly unreliable because of the uncertainty associated with the minimization of the 3-D structure. Reliable and consistent descriptor values are especially important in the selection of the training set. If different software packages are used for calculating the descriptors of the predictive compounds (in the database) and of the target compound (not in the database), inconsistency in the descriptors included in the QSPR may cause poor property prediction. Descriptor “noise level”: The effect of the 3-D minimization technique on the descriptor value should be considered in establishing a reliable estimate for the “noise level”.

10 Generation of Molecular Structure Files and Molecular Descriptors For the first part of this study a database containing 326 compounds (hydrocarbons, 1-alcohols and n-aliphatic acids) was used. The molecular geometries were optimized using the CNDO (Complete Neglect of Differential Overlap) semi-empirical method implemented in the HyperChem package*. The Dragon+ program was used to calculate 1664 descriptors for the compounds in the database from energy-minimized molecular models. *HyperChem program, version 7.01; HyperChem is copyrighted by Hypercube Inc. (http://www.hyper.com/). +Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. DRAGON user manual, Talete srl, Milano, Italy, 2006. ©TALETE srl, http://www.talete.mi.it.

11 Test 1 – Plotting Descriptors of Same Family Neighboring Compounds Outlying, unreliable descriptors

12 Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series Monotonic change – Reliable descriptors

13 Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series Separate curves for odd and even n_C – Consistent with some solid properties

14 Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series Inconsistent (random) variation of the descriptor value with n_C – Unreliable descriptor (Gm – 3-D WHIM descriptor)
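
Test 2 is presented graphically, but the three outcomes (monotonic change, separate odd/even curves, irregular variation) can also be screened automatically. The sketch below is one possible way to do that under a strict-monotonicity assumption; the function names, labels, and toy values are hypothetical and not taken from the study.

```python
def is_monotonic(values):
    """True if the sequence is strictly increasing or strictly decreasing."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)

def classify_descriptor(n_carbons, values):
    """Classify how a descriptor varies along a homologous series."""
    pairs = sorted(zip(n_carbons, values))
    if is_monotonic([v for _, v in pairs]):
        return "monotonic"
    odd = [v for n, v in pairs if n % 2 == 1]
    even = [v for n, v in pairs if n % 2 == 0]
    if is_monotonic(odd) and is_monotonic(even):
        return "odd/even"
    return "irregular"

# Hypothetical descriptor values for members with 4 to 11 carbon atoms.
n_c = [4, 5, 6, 7, 8, 9, 10, 11]
smooth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
alternating = [1.0, 2.5, 2.0, 3.5, 3.0, 4.5, 4.0, 5.5]
noisy = [0.3, 0.9, 0.1, 0.5, 0.2, 0.8, 0.4, 0.6]

print(classify_descriptor(n_c, smooth))
print(classify_descriptor(n_c, alternating))
print(classify_descriptor(n_c, noisy))
```

A strict test like this would flag a descriptor as irregular after a single small reversal, so in practice a tolerance tied to the descriptor's estimated noise level would probably be needed.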

15 Test 3 – Comparing 3-D descriptors obtained from 3-D structure files minimized by different algorithms Compounds for which 3-D MOL files are available from NIST and Dragon

16 Visual comparison of Dragon and NIST 3-D structure files using Gaussian 3

17 Groups of 3-D Descriptors Calculated by Dragon

18 Percent differences between 3-D descriptors based on NIST and Dragon Library MOL files (28 compounds) n-hexane, 2-methylpentane and 1-propanol
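
The transcript does not state how the percent differences are defined. The sketch below assumes the difference is taken relative to the value from the first (reference) set of structure files; the function names, threshold, and descriptor values are hypothetical placeholders.

```python
def percent_differences(reference, other):
    """Percent difference of each shared descriptor relative to the
    reference value (assumed definition, not taken from the slides)."""
    result = {}
    for name in reference.keys() & other.keys():
        ref = reference[name]
        if ref == 0.0:
            # Relative difference is undefined for a zero reference value.
            result[name] = None
        else:
            result[name] = 100.0 * abs(other[name] - ref) / abs(ref)
    return result

def flag_inconsistent(differences, threshold_percent):
    """Names of descriptors whose percent difference exceeds the threshold."""
    return sorted(
        name
        for name, diff in differences.items()
        if diff is not None and diff > threshold_percent
    )

# Placeholder descriptor values for one compound, computed from two
# differently minimized 3-D structure files (illustrative numbers only).
from_nist_file = {"E3m": 0.20, "Gm": 0.50, "MW": 86.18}
from_dragon_file = {"E3m": 0.25, "Gm": 0.51, "MW": 86.18}

diffs = percent_differences(from_nist_file, from_dragon_file)
print(flag_inconsistent(diffs, threshold_percent=10.0))
```

A descriptor that does not depend on the 3-D geometry, such as molecular weight, shows no difference between the two files, whereas geometry-dependent descriptors can differ by an amount that depends on the minimization algorithm.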

19 Data for an Extensive 709-Compound Study For this study 709 compounds from the DIPPR database were used. For these compounds 3-D MOL files are available from the NIST database and from molecular structures minimized by the DIPPR staff. For the latter, the minimization was done for most of the compounds in Gaussian 03 using B3LYP/6-311+G(3df,2p), which is a density functional method. Most of the other compounds were optimized using HF/6-31G*, which is a Hartree-Fock ab initio method with a medium-sized basis set. The Dragon 5.5 program was used to calculate 3224 descriptors.

20 Results for the 709-Compound Study

21 Conclusions and Future Work 1. It has been shown that the 3-D descriptors may have various levels of inconsistency, depending on the algorithms used for minimization of the 3-D structure. 2. To determine the effects of descriptor inconsistency on the training set selection for various families of compounds, comparative studies involving descriptors of various levels of consistency must be carried out. 3. A comparative study is also needed to determine whether inconsistent descriptors can be excluded from TQSPRs prepared for particular properties and particular families of target compounds. 4. It is always preferable to use the same molecular structure minimization algorithm for the members of the training set and the target compound.

22 Selection of the Database and the Target Property Using the Property Prediction GUI

23 Similarity Group Identification for 1-methyl-3-iso-propylbenzene

24 Derivation of the Targeted QSPR Model: BP = 285.8355 + 46.66*ALOGP

