Selection of Molecular Descriptor Subsets for Property Prediction
Inga Paster (a), Neima Brauner (b) and Mordechai Shacham (a)
(a) Department of Chemical Engineering, Ben-Gurion University, Beer-Sheva, Israel
(b) School of Engineering, Tel-Aviv University, Tel-Aviv, Israel

The Needs
• Physicochemical and biological properties are needed for risk assessment, environmental impact assessment, and process design, analysis and optimization.
• About 100,000 compounds are currently used by industry or are of immediate interest to it. The number of compounds that are theoretically possible and may be of future interest is several tens of millions.
• The Toxic Substances Control Act (TSCA) inventory lists 80,000 chemicals. Only 50% have some physicochemical property data, and only 15% have data from genotoxicity bioassays.
• The DIPPR 801 database contains 2101 compounds (33 constant properties, 15 temperature-dependent properties).

Property Prediction Methods
• "Group contribution" methods
• Methods based on the "corresponding-states principle"
• "Asymptotic behavior" correlations (ABCs)
• "Quantitative Structure Property Relationships" (QSPRs), based on the use of molecular descriptors
The existing methods cannot provide satisfactory predictions for certain properties (such as normal melting temperature) or for certain groups of compounds. Thus, research and development of new prediction techniques are essential.

Collinearity Between Vectors of Descriptors of Similar Compounds
[Figure: 99 normalized molecular descriptors of n-heptane versus those of n-hexane.] The descriptors of the two compounds are linearly related.
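One way to quantify the linear relationship between two descriptor vectors is the Pearson correlation coefficient. The sketch below is only an illustration: the slide does not say which measure the authors used, and the numbers are made-up placeholders, not Dragon output for n-hexane or n-heptane.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant vector has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical normalized descriptor values (NOT real data).
compound_a = [0.10, 0.35, 0.52, 0.80, 0.95]
compound_b = [0.12, 0.40, 0.60, 0.91, 1.08]

r = pearson_r(compound_a, compound_b)
print(f"correlation between the two descriptor vectors: {r:.4f}")
```

A value of r close to 1 indicates that the two descriptor vectors are nearly collinear.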

Collinearity Between Vectors of Properties of Similar Compounds
[Figure: selected properties of n-heptane versus those of n-hexane.] The vectors of properties of the two compounds are linearly related. This is the basis of the QS2PR method (Shacham et al., AIChE J. 50(10), 2481-2492, 2004).

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds
Descriptor: VRD2, the average Randic-type eigenvector-based index from the distance matrix (eigenvalue-based indices). Figure annotations: measured value for 3,3-dimethylhexane; prediction error 0.68%.
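When a property is collinear with a single descriptor over a group of similar compounds, a straight-line fit over that group can be used to predict the property of the target compound. The following is a minimal sketch under that assumption; the descriptor and property values are invented for illustration and are not the VRD2 data behind the 0.68% figure.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length vectors with >= 2 points")
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("descriptor values are all identical")
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return slope, intercept

def percent_error(predicted, measured):
    """Absolute prediction error as a percentage of the measured value."""
    return abs(predicted - measured) / abs(measured) * 100.0

# Hypothetical training set: descriptor values and a property (NOT real data).
descriptor = [1.0, 2.0, 3.0, 4.0]
prop = [300.0, 320.0, 340.0, 360.0]

slope, intercept = fit_line(descriptor, prop)

# Hypothetical target compound: its descriptor value and measured property.
target_descriptor = 2.5
measured = 332.0
predicted = slope * target_descriptor + intercept
err = percent_error(predicted, measured)
print(f"predicted {predicted:.1f}, error {err:.2f}%")
```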

Similarity Group (Training Set) of 3,3-dimethylhexane
A similarity group of 10 predictive compounds has been found to be sufficient in most cases. Annotation: a measure of the level of group similarity. This is the basis of the Targeted QSPR method (Brauner et al., I&EC Research 45, 8430-8437, 2006).
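A similarity group can be sketched as the k candidate compounds whose descriptor vectors are most similar to that of the target. The slide does not state the similarity measure, so the example below assumes the Pearson correlation between descriptor vectors; the compound names and values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def similarity_group(target, candidates, k=10):
    """Return the names of the k candidates most correlated with the target.

    target: descriptor vector of the target compound
    candidates: dict mapping compound name -> descriptor vector
    """
    scored = [(pearson_r(target, vec), name) for name, vec in candidates.items()]
    scored.sort(reverse=True)  # highest correlation first
    return [name for _, name in scored[:k]]

# Hypothetical descriptor vectors (NOT real data).
target = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "A": [1.1, 2.0, 3.2, 3.9],
    "B": [4.0, 3.0, 2.0, 1.0],
    "C": [1.0, 2.0, 3.0, 6.0],
    "D": [2.0, 1.0, 4.0, 3.0],
}

group = similarity_group(target, candidates, k=2)
print(group)
```

The default k=10 mirrors the group size quoted on the slide; the example uses k=2 only because the toy candidate set is small.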

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds
[Figure: collinearity between the descriptor VEv1 and the normal boiling temperature for the n-alkanoic acid homologous series.]

Sources of Molecular Descriptors and Thermo-Physical Properties
• The molecular geometries were optimized using the CNDO (Complete Neglect of Differential Overlap) semi-empirical method implemented in the HyperChem package.
• The Dragon program (http://www.talete.mi.it) was used to calculate 1664 descriptors for the 340 compounds in the database from the minimized-energy molecular models.
• Property data (measured and predicted) were taken from the DIPPR (http://dippr.byu.edu) and NIST (National Institute of Standards and Technology, http://webbook.nist.gov/chemistry) databases.

Descriptor Types Generated by the Dragon Program
The 3-D descriptors are very sensitive to the molecular structure minimization.

Identifying Inaccuracy and Inconsistency Among 1600 Molecular Descriptors
Sources of inaccuracy and inconsistency:
• The descriptor cannot be calculated by DRAGON (-999)
• The descriptor value is set at zero for certain compounds
• Sensitivity of 3-D descriptors to the structure minimization method
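The first two sources of inconsistency can be screened for mechanically. The sketch below assumes the descriptors are held as a mapping from descriptor name to a list of values (one per compound), and it interprets "set at zero for certain compounds" as zero for some compounds but not for others; both are assumptions, and the descriptor names are hypothetical.

```python
MISSING = -999  # value reported when the descriptor cannot be calculated

def screen_descriptors(table):
    """Split descriptors into kept and rejected (with a reason).

    table: dict mapping descriptor name -> list of values, one per compound
    """
    kept, rejected = {}, {}
    for name, values in table.items():
        if any(v == MISSING for v in values):
            rejected[name] = "not calculated (-999)"
        elif any(v == 0 for v in values) and any(v != 0 for v in values):
            rejected[name] = "zero for some compounds"
        else:
            kept[name] = values
    return kept, rejected

# Hypothetical descriptor table for three compounds (NOT real data).
table = {
    "d_ok": [1.2, 3.4, 5.6],
    "d_missing": [0.5, -999, 0.7],
    "d_partial_zero": [0.0, 2.1, 0.0],
}

kept, rejected = screen_descriptors(table)
print(sorted(kept), rejected)
```

The third source, sensitivity of 3-D descriptors to the minimization method, cannot be detected from a single table and is not covered by this check.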

Presentation Outline
• Categorizing the molecular descriptors according to the trend of their change with n_C for homologous series
• Identifying training sets from compounds belonging to the target compound's homologous series
• Predicting critical properties, normal boiling and melting temperatures, liquid molar volume and refractive index for five homologous series with and without the use of 3-D descriptors
• Comparison of the results and conclusions

Checking Consistency of Molecular Descriptors – Consistent Change with n_C for Homologous Series
The descriptor ADDD changes with n_C for the 1-alkene series in a trend similar to the change of liquid molar volume.

Checking Consistency of Molecular Descriptors – Consistent Change with n_C for Homologous Series
[Figure: normalized values of the descriptors AGDD, ASP and H4m versus n_C for the 1-alkene homologous series.] The trend is similar to that of T_c.

Checking Consistency of Molecular Descriptors – Consistent Change with n_C for Homologous Series
The descriptor ICR changes with n_C for the 1-alkene series in a trend similar to the change of normal melting temperature.

Checking Consistency of Molecular Descriptors – Inconsistent Change with n_C for Homologous Series
The descriptor Gm changes with n_C for the 1-alkene series in an apparently random manner.
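The consistency checks above amount to classifying how a descriptor varies with the number of carbon atoms along a homologous series. The sketch below distinguishes only three illustrative classes (constant, linear increase, other); it does not reproduce the authors' category definitions, and the tolerance and R² threshold are arbitrary assumptions.

```python
def classify_trend(n_c, values, const_tol=1e-6, r2_min=0.999):
    """Classify a descriptor's trend along a homologous series.

    n_c: carbon numbers of the series members
    values: descriptor values for the same members
    Returns "constant", "linear increase" or "other".
    """
    n = len(n_c)
    if n != len(values) or n < 3:
        raise ValueError("need equal-length vectors with >= 3 points")
    mean_v = sum(values) / n
    spread = max(values) - min(values)
    if spread <= const_tol * max(1.0, abs(mean_v)):
        return "constant"
    mean_n = sum(n_c) / n
    sxx = sum((a - mean_n) ** 2 for a in n_c)
    if sxx == 0:
        raise ValueError("carbon numbers are all identical")
    sxy = sum((a - mean_n) * (b - mean_v) for a, b in zip(n_c, values))
    syy = sum((b - mean_v) ** 2 for b in values)
    r2 = sxy * sxy / (sxx * syy)  # squared correlation with n_c
    if sxy > 0 and r2 >= r2_min:
        return "linear increase"
    return "other"

# Hypothetical series members and descriptor values (NOT real data).
carbons = [4, 5, 6, 7, 8]
constant_like = [2.0, 2.0, 2.0, 2.0, 2.0]
linear_like = [1.5 * n + 0.2 for n in carbons]
irregular_like = [0.3, 0.9, 0.1, 0.7, 0.2]

for label, vals in [("constant_like", constant_like),
                    ("linear_like", linear_like),
                    ("irregular_like", irregular_like)]:
    print(label, "->", classify_trend(carbons, vals))
```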

Trend of Change of Descriptors with n_C for Homologous Series
Constant descriptors identify the compounds that belong to the homologous series (HS) of the target compound, and linearly increasing descriptors are used to rank those compounds according to their distance from the target.
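This two-step selection can be sketched as a filter followed by a sort. The example assumes one constant descriptor and one linearly increasing descriptor per compound; the field names `const_d` and `lin_d`, the tolerance, and all values are hypothetical.

```python
def select_training_set(target, candidates, k=10, tol=1e-9):
    """Pick up to k training compounds from the target's homologous series.

    Step 1: keep candidates whose constant descriptor matches the target's.
    Step 2: rank them by distance from the target in the linear descriptor.
    """
    same_series = [
        c for c in candidates
        if abs(c["const_d"] - target["const_d"]) <= tol
    ]
    same_series.sort(key=lambda c: abs(c["lin_d"] - target["lin_d"]))
    return [c["name"] for c in same_series[:k]]

# Hypothetical compounds (NOT real data).
target = {"name": "target", "const_d": 1.0, "lin_d": 7.0}
candidates = [
    {"name": "a", "const_d": 1.0, "lin_d": 6.0},
    {"name": "b", "const_d": 1.0, "lin_d": 9.0},
    {"name": "c", "const_d": 2.0, "lin_d": 7.1},  # different series
    {"name": "d", "const_d": 1.0, "lin_d": 7.5},
    {"name": "e", "const_d": 1.0, "lin_d": 4.0},
]

training = select_training_set(target, candidates, k=3)
print(training)
```

Compound "c" is closest to the target in the linear descriptor but is excluded because its constant descriptor places it in a different series.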

Prediction of T_c, T_b and RI (Refractive Index) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
In about 93% of the cases, a descriptor of category IIIA was the dominant descriptor (the first to enter, out of one or two). Exception: 3-D descriptors for 1-alcohols (category IV).
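The phrase "first to enter" suggests a stepwise selection in which the descriptor most strongly correlated with the property enters the model first. The sketch below shows only that first step, under the assumption that squared correlation is the entry criterion; the slides do not state the criterion, and the descriptor names and values are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def first_to_enter(descriptors, prop):
    """Name of the descriptor with the highest squared correlation to prop."""
    return max(descriptors, key=lambda name: r_squared(descriptors[name], prop))

# Hypothetical training-set data (NOT real data).
prop = [10.0, 20.0, 30.0, 40.0]
descriptors = {
    "dA": [1.0, 2.0, 3.0, 4.0],
    "dB": [1.0, 3.0, 2.0, 4.0],
    "dC": [5.0, 5.0, 5.0, 5.0],  # constant, so it carries no information
}

dominant = first_to_enter(descriptors, prop)
print(dominant)
```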

Prediction of V_c and V_m (Liquid Molar Volume) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
In 90% of the cases, descriptors of category II were used. Exception: 3-D descriptors for 1-alkenes and 1-alcohols (category IV).

Prediction of P_c and T_m (Melting Point) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
Descriptors of category IIIA were used in 40% of the cases, category IV in 35%, category V in 20% and category II in 5%.

Uncertainty (%) in Predicting Various Properties Without 3-D Descriptors
The large prediction errors in V_c (and P_c) are due to the uncertainty of the DIPPR data. The errors in the melting point are caused by the irregular shape of the melting point curve (3-D descriptors are needed).

Conclusions
1. The Dragon descriptors were divided into seven categories according to the trend of their change as a function of n_C in homologous series.
2. It was observed that 3-D descriptors may exhibit very irregular (or even random) behavior.
3. The exclusive use of descriptors of two categories, "Constant" and "Linear Increase", enabled selection of training sets belonging to the target compound's homologous series.
4. The use of the proposed method for predicting 7 properties for 5 homologous series has shown that most properties can be predicted at the level of experimental uncertainty without using 3-D descriptors. This extends the method's applicability, increases its reliability and reduces the probability of "chance correlations".
