Published by Ami Powell. Modified over 9 years ago.
Predicting a Variety of Constant Pure Compound Properties by the Targeted QSPR Method

Mordechai Shacham, Dept. of Chem. Engng, Ben Gurion University of the Negev, Beer-Sheva, Israel
Neima Brauner, School of Engineering, Tel-Aviv University, Tel-Aviv, Israel

Abstract
The possibility of obtaining a reliable prediction of a wide variety of constant properties is examined. To this aim, a modified version of the Targeted QSPR method (Brauner et al., Ind. Eng. Chem. Res., 45, 8430, 2006) is applied. The prediction of a particular property of a target compound is carried out in two stages. The first stage involves the identification of a similarity group and a small training set whose members are structurally similar to the target compound. This stage is based on a robust subset of the descriptor database (no 2D or 3D descriptors) that reflects the diversity in the chemical structures. In the second stage, the full database of molecular descriptors is used to develop a single-descriptor linear QSPR (TQSPR1) based on the available property data for the training set. Statistical indicators are introduced which enable a reliable estimation of the prediction uncertainty for the (unknown) property of the target compound based on the training set data. It is shown that while increasing the number of descriptors in the QSPR enables better representation of the training set data, it may significantly deteriorate the prediction of the target compound property value. If necessary, improved prediction is achievable by using the statistical information to refine the training set, rather than by increasing the number of descriptors used. It is demonstrated that by proper adjustment of the training set, the great majority of the constant properties can be predicted within the experimental error level.

The TQSPR1 method was able to predict 32 properties of the target compound within the experimental error level.
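The two-stage procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the similarity measure (Pearson correlation between standardized descriptor vectors), the descriptor selection rule (smallest training-set sum of squared errors) and the function names select_training_set and fit_single_descriptor are all assumptions made for illustration.

```python
import numpy as np

def select_training_set(descriptors, target_idx, p):
    """Stage 1 (assumed similarity measure): return the indices of the p
    compounds whose descriptor vectors correlate best with the target's."""
    d = np.asarray(descriptors, dtype=float)
    std = d.std(axis=0)
    keep = std > 0                       # drop descriptors that never vary
    z = (d[:, keep] - d[:, keep].mean(axis=0)) / std[keep]
    zc = z - z.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(zc, axis=1)
    r = zc @ zc[target_idx] / (norms * norms[target_idx])
    r[target_idx] = -np.inf              # the target is never its own neighbor
    return np.argsort(r)[::-1][:p]

def fit_single_descriptor(descriptors, y, train_idx):
    """Stage 2 (assumed selection rule): fit y = b0 + b1 * descriptor_j on the
    training set for every j and keep the descriptor with the smallest SSE."""
    d = np.asarray(descriptors, dtype=float)
    y_train = np.asarray(y, dtype=float)[train_idx]
    best = None
    for j in range(d.shape[1]):
        x = d[train_idx, j]
        if x.max() == x.min():           # constant on the training set
            continue
        slope, intercept = np.polyfit(x, y_train, 1)
        sse = float(np.sum((y_train - (slope * x + intercept)) ** 2))
        if best is None or sse < best[0]:
            best = (sse, j, slope, intercept)
    _, j, slope, intercept = best
    return j, slope, intercept
```

The predicted property of the target is then intercept + slope * descriptors[target_idx, j]. In the method as described, stage 1 would be run on the reduced descriptor subset and stage 2 on the full descriptor database; the sketch leaves that choice to the caller.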
Property and Descriptor Databases
A property and molecular descriptor database containing 1798 compounds, for which 34 constant properties (source: DIPPR database, http://dippr.byu.edu) and 3224 descriptors (source: Dragon 5.5, http://www.talete.mi.it) are available.
Table: Constant Properties Included in the DIPPR Database

Training Set Identification and Refinement
Several variations of training sets of compounds were used:
1. A “basic” training set identified using the full set of the available descriptors;
2. A “refined” training set identified using only “constitutional” and “functional group count” descriptors;
3. Use of only odd (or even) carbon number compounds in the training set;
4. Removal of compounds with outlying property values.
Table: “Basic” and “Refined” Training Sets of n-hexyl mercaptan
Table annotations: oxygen atom instead of sulfur; range of the numbers of the carbon atoms; immediate neighbors of the target in the homologous series.

Statistical indicators:
1. Outlying (high leverage) descriptor values can be detected based on excessive values of the diagonal hat matrix elements, h_ii.
2. Outlying property values can be detected by a high value of the studentized deleted residual t_i of component i.

Attainable accuracy measures (independent of the target compound property value):
1. DIPPR uncertainty values for the properties of the training set members;
2. Average (U_avg) and maximal (U_max) DIPPR uncertainty values;
3. Training Set Average Error (TSAE).

Table: Prediction of properties of n-hexyl mercaptan, basic training set
Legend: property value (from DIPPR); p: number of compounds in the training set; ζ: descriptor.
Descriptors: Mv: mean atomic van der Waals volume (scaled on carbon atom); 3D-MoRSE signal 29, weighted by atomic masses.
Plot annotations: TSAE = 3.4 %, prediction error = 10 %; h_99 = 1, TSAE = 60 %, prediction error = 37 %; TSAE = 0.65 %, prediction error = 0.45 %.

Table: Summary of Results for 32 Properties, Optimal Training Sets

Conclusions
The appropriate training set (similarity group) is dependent on the target property.
The TSAE (Training Set Average Error) has proven to be a good indicator for the appropriateness of the training set and the prediction accuracy of TQSPR1. This criterion is independent of the target-compound properties.
The prediction accuracy can be enhanced by refinement of the training set and not by increasing the number of descriptors in the TQSPR.
The descriptor subset used here for identifying a refined training set has proven to be appropriate for some homologous series. Work is currently underway to identify descriptor subsets that are appropriate for other groups of compounds.
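The statistical indicators named on the poster (hat-matrix diagonal h_ii, studentized deleted residual t_i, and TSAE) can be computed for a single-descriptor linear fit roughly as follows. This is a sketch under stated assumptions: the poster does not give a formula for TSAE, so it is taken here as the mean absolute relative residual over the training set, in percent, and the function name regression_diagnostics is hypothetical. The leverage and studentized deleted residual use the standard textbook expressions for ordinary least squares.

```python
import numpy as np

def regression_diagnostics(x, y):
    """Diagnostics for the fit y = b0 + b1 * x on a training set.

    Returns (h, t, tsae):
      h    - diagonal hat-matrix elements h_ii (leverage of each compound)
      t    - studentized deleted residuals t_i
      tsae - ASSUMED definition: 100 * mean(|residual / y|) over the training set
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    X = np.column_stack([np.ones(n), x])   # intercept + one descriptor
    k = X.shape[1]                         # number of fitted parameters (2)
    xtx_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    beta = xtx_inv @ X.T @ y
    e = y - X @ beta                       # ordinary residuals
    sse = float(e @ e)
    # t_i is undefined when h_ii = 1, or when removing compound i leaves a
    # perfect fit (the denominator is then zero).
    t = e * np.sqrt((n - k - 1) / (sse * (1.0 - h) - e ** 2))
    tsae = 100.0 * float(np.mean(np.abs(e / y)))
    return h, t, tsae
```

A compound whose h_ii approaches 1 dominates the fit through its descriptor value, while a large |t_i| flags a property value that disagrees with the rest of the training set; either is a candidate for removal when refining the training set.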