Identifying Applicability Domains for Quantitative Structure Property Relationships Mordechai Shacham a, Neima Brauner b Georgi St. Cholakov c and Roumiana.


Identifying Applicability Domains for Quantitative Structure Property Relationships
Mordechai Shacham a, Neima Brauner b, Georgi St. Cholakov c and Roumiana P. Stateva d
a Dept. Chem. Eng., Ben-Gurion University, Beer-Sheva, Israel
b School of Engineering, Tel-Aviv University, Tel-Aviv, Israel
c Dept. Org. Synth. and Fuels, University of Chemical Technology and Metallurgy, Sofia, Bulgaria
d Institute of Chemical Engineering, Bulgarian Academy of Sciences, Sofia 1113, Bulgaria

The Needs
- Physicochemical and biological properties are needed for risk assessment, environmental impact assessment, and process design, analysis and optimization.
- About 100,000 compounds are currently used by industry or are of immediate interest to it. The number of compounds that are theoretically possible and may be of future interest is several tens of millions.
- The DIPPR 801 database contains 2101 compounds (33 constant properties, 15 temperature-dependent properties).

Presentation Outline
- Review of Structure-Property Relationships (QSPR) based on Molecular Descriptors
- The "Targeted" and "Homologous Series" QSPR Methods
- Representation of Liquid and Gas Properties by Molecular Descriptors
- Representation of Normal Melting Temperature by Molecular Descriptors
- Long Range Extrapolation from Small Training Sets

References for the New Techniques
- "A Structurally "Targeted" QSPR Method for Property Prediction", Ind. Eng. Chem. Res., 45 (2006).
- Molecular descriptors database: non-constant descriptors for 324 compounds (hydrocarbons and oxygen-containing organic compounds). The descriptors are calculated using the Dragon program (version 5.4).
- Physical properties databases: DIPPR and NIST.

Row-wise Representation of a Molecular Descriptors Database. The database subset contains 324 compounds and 1280 descriptors.

Dragon Molecular Descriptor Categories

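Structure-Property Relationships (QSPR) Based on Molecular Descriptors
Example QSPRs are given for the normal boiling point and for the relative liquid density at 20 °C (equations not preserved in the transcript).
Descriptors used include, e.g.: Chi0 (connectivity topological index), J (average distance sum index), MI (cyclomatic number).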

Descriptors and Model Parameters for a Linear QSPR for Predicting Melting Point (480 compounds)* *Godavarthy et al., Ind. Eng. Chem. Res. 45, 5117 (2006)

Predicted vs. Experimental Melting Point Using a Linear QSPR with 16 Descriptors (480 compounds)* *Godavarthy et al., Ind. Eng. Chem. Res. 45, 5117 (2006)

Limitations of the QSPR Techniques with Unrestricted Applicability Domains
- Complex, often nonlinear QSPRs are needed in order to match the great variability of property values caused by the many structural differences between the various compounds.
- Prediction errors are very large, especially for properties that are highly sensitive to structural differences (i.e., solid properties).
- The accuracy of the property prediction will be much higher for compounds that are well represented in the "training set" than for compounds that are sparsely represented. No systematic way is offered to categorize a particular target compound.
- For a target compound with unmeasured properties it is impossible to assess the prediction accuracy.

The "Targeted" and "Homologous Series" QSPR Methods
- In the TQSPR method, a similarity group of compounds for a target compound is first identified, using correlation coefficients between vectors of descriptors as measures of "similarity".
- In the HS-QSPR method, the members of the homologous series are assigned to the "similarity group".
- In the second step, a linear QSPR is tailored to a particular property of the target compound.
- A row-wise representation (a row of descriptors for each compound) of a subset of the database, containing only the members of the similarity group, is used to derive the QSPR.
- Only the HS-QSPR method is discussed here.
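The first TQSPR step (ranking compounds by the correlation between their descriptor vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `similarity_group`, the toy descriptor matrix, and the column standardization applied before correlating rows are all assumptions made for illustration.

```python
import numpy as np

def similarity_group(D, target, size):
    """Rank compounds by correlation of their descriptor rows with the target row.

    D      : (compounds x descriptors) array, one row per compound
    target : row index of the target compound
    size   : number of compounds to keep in the similarity group
    """
    D = np.asarray(D, dtype=float)
    # Drop constant descriptors, then standardize each column so that
    # descriptors with large magnitudes do not dominate (an assumption).
    keep = D.std(axis=0) > 0
    Z = (D[:, keep] - D[:, keep].mean(axis=0)) / D[:, keep].std(axis=0)
    # np.corrcoef treats each row as one variable, so this gives
    # compound-to-compound correlation coefficients.
    r = np.corrcoef(Z)[target]
    order = np.argsort(-r)              # most similar first
    order = order[order != target]      # exclude the target itself
    return order[:size], r

# Hypothetical 4-compound, 4-descriptor database (values are made up).
D = [[1.0, 2.0, 3.0, 4.0],
     [1.1, 2.1, 3.1, 4.2],
     [4.0, 3.0, 2.0, 1.0],
     [2.0, 9.0, 1.0, 5.0]]
group, r = similarity_group(D, target=0, size=1)
```

In this toy case the compound in row 1, whose descriptors nearly duplicate those of the target, is the one selected.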

Row-wise Representation of a Molecular Descriptors Database (associated with QSPR derivation) Similarity Group contains compounds

Derivation of the HS-QSPR Model
- To tailor an HS-QSPR for a particular property of the homologous series, only members of the series with experimental data available are used (the training set).
- Considering the limited variability of the property values within the similarity group, a linear structure-property relation is assumed, of the form:
  y = β0 + β1 ζ1 + β2 ζ2 + … + βm ζm
  where
  y - a p vector of the target property values
  p - number of compounds included in the similarity group
  ζ1, ζ2 … ζm - p vectors of the predictive molecular descriptors (to be identified)
  β0, β1 … βm - the corresponding model parameters (to be estimated).
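Once the descriptors are chosen, estimating the parameters of such a linear relation is an ordinary least-squares problem. The sketch below assumes a model with an intercept and one coefficient per descriptor; the function names `fit_linear_qspr` and `predict` and the numerical values are hypothetical and are not data from the presentation.

```python
import numpy as np

def fit_linear_qspr(Z, y):
    """Least-squares estimate of [intercept, coefficient_1, ..., coefficient_m].

    Z : (p x m) array, one row of selected descriptors per training compound
    y : length-p array of measured property values
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), Z])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, Z):
    """Evaluate the fitted linear model for one or more descriptor rows."""
    Z = np.atleast_2d(np.asarray(Z, dtype=float))
    return beta[0] + Z @ beta[1:]

# Made-up training set: 4 compounds, 2 descriptors, exact linear property.
Z_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
y_train = 10.0 + 2.0 * Z_train[:, 0] - 3.0 * Z_train[:, 1]

beta = fit_linear_qspr(Z_train, y_train)
y_new = predict(beta, [5.0, 0.0])   # prediction for an unmeasured compound
```

With only a handful of training compounds, the number of descriptors m has to stay well below p, which is consistent with the one-to-four-descriptor models discussed in the talk.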

The SROV Algorithm
- Stepwise Regression using Orthogonalized Variables (C&ChE, 27, 2003) is used to derive the property-structure correlation.
- At each step (step k) of the algorithm, a new descriptor is entered into the model according to the value of the partial correlation coefficient, |ρyj|, between the vector of the target property values, y, and that of a potential predictive descriptor, ζj.
- The column vectors y and ζj are centered and normalized to unit length.
- Absolute ρyj values close to one (|ρyj| ≈ 1) indicate high correlation.
- The signal-to-noise ratio of the partial correlation coefficient is used as a stopping criterion for determining the number of descriptors that should be included in the model (m).
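The general idea of entering descriptors one at a time and orthogonalizing the remaining candidates can be sketched as follows. This is a simplified greedy forward selection with Gram-Schmidt updates, not the published SROV algorithm: in particular, the signal-to-noise stopping criterion is replaced here by a plain correlation threshold (`min_corr`), and all names and data are assumptions.

```python
import numpy as np

def center_normalize(v):
    """Center a vector and scale it to unit length."""
    v = v - v.mean()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def forward_select(X, y, max_terms=2, min_corr=0.1):
    """Greedy forward selection on orthogonalized descriptor columns.

    Returns the column indices of X in the order they were entered.
    """
    X = np.asarray(X, dtype=float)
    r = center_normalize(np.asarray(y, dtype=float))   # current residual of y
    Z = np.column_stack([center_normalize(X[:, j]) for j in range(X.shape[1])])
    selected = []
    for _ in range(max_terms):
        r_norm = np.linalg.norm(r)
        if r_norm < 1e-12:          # nothing left to explain
            break
        norms = np.linalg.norm(Z, axis=0)
        ok = norms > 1e-8           # skip columns already spanned by the model
        corr = np.zeros(Z.shape[1])
        corr[ok] = np.abs(Z[:, ok].T @ r) / (norms[ok] * r_norm)
        if selected:
            corr[selected] = 0.0
        j = int(np.argmax(corr))
        if corr[j] < min_corr:      # simplified stopping rule (assumption)
            break
        selected.append(j)
        q = Z[:, j] / norms[j]
        r = r - (q @ r) * q             # remove the explained part of y
        Z = Z - np.outer(q, q @ Z)      # orthogonalize remaining candidates
    return selected

# Synthetic check: y depends on columns 2 and 0 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 2] - 1.0 * X[:, 0]
order = forward_select(X, y, max_terms=2)
```

Because every candidate is orthogonalized against the descriptors already in the model, a descriptor that is collinear with a selected one ends up with near-zero norm and is never entered a second time.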

Normal Boiling Temperature Data for the 1-alcohol Homologous Series (table annotations: estimated upper error bound; training set)

The 10 Descriptors with the Highest Correlation with Tb for the 1-alcohol Homologous Series (table annotations: selected descriptor; descriptors collinear with each other for the training set)

One-Descriptor QSPR: Tb Prediction Error for the 1-alcohol Homologous Series

Two-Descriptor QSPR for the 1-alcohol Series: Tb as a linear function of the descriptors H3v and HTe (numerical coefficients not preserved in the transcript)

Two-Descriptor QSPR: Tb Prediction Error for the 1-alcohol Homologous Series

Aliphatic Acids: Normal Boiling Temperature vs. Number of C Atoms. Note the nonlinear (asymptotic) change of the property as a function of the number of C atoms. (Plot legend: DIPPR values; predicted values.)

Aliphatic Acids: Normal Boiling Temperature vs. the Descriptor vEv1. Note the collinearity between Tb and the descriptor. (Plot legend: DIPPR values; predicted values.)

Aliphatic Monocarboxylic Acids: Normal Melting Temperature vs. Number of C Atoms. For Tm, the first descriptor captures only the general trend (average value) of the property.

Aliphatic Monocarboxylic Acids: Normal Melting Temperature vs. the Descriptor EEig06x. Note that the first descriptor captures the general trend (average value) of the property.

Prediction of Tm for Aliphatic Acids Using a Linear QSPR in the Descriptors PJI, IVDE, EEig06x and Mor16v (numerical coefficients not preserved in the transcript)

Tm Prediction Error for Aliphatic Acids. The prediction error exceeds the reliability for one compound.
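The comparison of prediction error against data reliability amounts to an element-wise check. The sketch below assumes that "reliability" is available as an absolute uncertainty in the same units as the property; the function name `exceeds_reliability` and the numbers are hypothetical.

```python
import numpy as np

def exceeds_reliability(predicted, experimental, uncertainty):
    """Flag compounds whose absolute prediction error is larger than the
    stated uncertainty of the experimental value (all in the same units)."""
    error = np.abs(np.asarray(predicted, dtype=float)
                   - np.asarray(experimental, dtype=float))
    return error > np.asarray(uncertainty, dtype=float)

# Made-up values in K: the second compound is off by 10 K against a 3 K bound.
flags = exceeds_reliability([300.0, 310.0], [301.0, 320.0], [3.0, 3.0])
```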

Prediction of the Critical Pressure for 1-alkenes: Pc (MPa) as a linear function of the descriptor H3e (numerical coefficients not preserved in the transcript). The "training set" includes only five measured values. Note the highly nonlinear relationship between Pc and the number of C atoms.

Prediction of the Critical Pressure for 1-alkenes: Pc (MPa) vs. the Descriptor H3e. Note the straight-line representation when Pc is plotted versus the descriptor H3e.

Prediction Error of the Critical Pressure for 1-alkenes. The prediction error exceeds the reliability for only one compound, in spite of the long range extrapolation.

Conclusions
Selecting the molecular descriptors that exhibit the highest level of collinearity with a particular property from a very large pool of descriptors enables developing simple linear QSPRs for prediction of properties of homologous series, with the following characteristics:
1. Prediction of constant properties (including solid properties) within the experimental error (reliability) level.
2. Long range extrapolation from small training sets of 3-5 compounds for which experimental data are available.
3. Use of linear QSPRs that include one to four descriptors.
4. The maximal prediction error of the melting point temperature is 3 K. This is smaller by at least an order of magnitude than the errors reported in the literature.