
Selection of Molecular Descriptor Subsets for Property Prediction
Inga Paster (a), Neima Brauner (b) and Mordechai Shacham (a)
(a) Department of Chemical Engineering, Ben-Gurion University, Beer-Sheva, Israel
(b) School of Engineering, Tel-Aviv University, Tel-Aviv, Israel

The Needs
- Physicochemical and biological properties are needed for risk assessment, environmental impact assessment, and process design, analysis and optimization.
- About 100,000 compounds are currently used by industry or are of immediate interest to it. Several tens of millions are theoretically possible and may be of future interest.
- The Toxic Substances Control Act (TSCA) inventory lists 80,000 chemicals. Only 50% have some physicochemical property data, and only 15% have data from genotoxicity bioassays.
- The DIPPR 801 database contains 2101 compounds (33 constant properties, 15 temperature-dependent properties).

Property Prediction Methods
- "Group contribution" methods
- Methods based on the "corresponding-states principle"
- "Asymptotic behavior" correlations (ABCs)
- "Quantitative Structure Property Relationships" (QSPRs), based on the use of molecular descriptors
The existing methods cannot provide satisfactory predictions for certain properties (such as normal melting temperature) and for certain groups of compounds. Thus, research and development of new prediction techniques are essential.

Collinearity Between Vectors of Descriptors of Similar Compounds
Figure: 99 normalized molecular descriptors of n-heptane versus those of n-hexane. The relationship between the descriptors is linear.

Collinearity Between Vectors of Properties of Similar Compounds
Figure: selected properties of n-heptane versus those of n-hexane. The relationship between the vectors of properties is linear.
This is the basis of the QS2PR method (Shacham et al., AIChE J. 50(10), 2004).

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds
VRD2: average Randic-type eigenvector-based index from distance matrix (eigenvalue-based indices).
Figure annotations: measured value for 3,3-dimethylhexane; prediction error 0.68%.
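The single-descriptor prediction on this slide can be illustrated with an ordinary least-squares line. This is a minimal sketch, not the authors' implementation: the function names and the numbers are hypothetical, and only the idea (fit property against one descriptor over similar compounds, then evaluate at the target and report a percent error) comes from the slide.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("descriptor values are all identical")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def percent_error(predicted, measured):
    """Absolute prediction error as a percentage of the measured value."""
    return abs(predicted - measured) / abs(measured) * 100.0


# Hypothetical descriptor and property values for a small training set
# (illustration only; these are not data from the presentation).
descriptor = [1.0, 2.0, 3.0, 4.0]
prop = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(descriptor, prop)
predicted = slope * 2.5 + intercept  # descriptor value of a hypothetical target
```

With a measured value available for the target, `percent_error(predicted, measured)` gives a figure comparable in kind to the 0.68% quoted on the slide.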

Similarity Group (Training Set) of 3,3-dimethylhexane
A similarity group of 10 predictive compounds has been found to be sufficient in most cases.
Figure annotation: a measure of the level of group similarity.
This is the basis of the Targeted QSPR method (Brauner et al., I&EC Research 45, 2006).
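One way to picture the selection of a similarity group is to rank candidate compounds by how closely their descriptor vectors track the target's. The sketch below assumes the Pearson correlation coefficient as the similarity score; the slide does not state the measure actually used, so treat that choice, the function names and the toy vectors as assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant vector has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def similarity_group(target, candidates, size=10):
    """Return the `size` candidates most correlated with the target.

    target     -- descriptor vector of the target compound
    candidates -- dict mapping compound name to its descriptor vector
    size       -- group size; 10 follows the slide, other values are allowed
    """
    scored = [(name, pearson_r(target, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:size]
```

Calling `similarity_group(target_vector, candidate_vectors)` returns `(name, score)` pairs in descending order of similarity, so the lowest score in the list can serve as a rough indicator of how similar the group is overall.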

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds
Figure: collinearity between the descriptor VEv1 and normal boiling temperature for the n-alkanoic acid homologous series.

Sources of Molecular Descriptors and Thermo-Physical Properties
- The molecular geometries were optimized using the CNDO (Complete Neglect of Differential Overlap) semi-empirical method implemented in the HyperChem package.
- The Dragon program was used to calculate 1664 descriptors for the 340 compounds in the database from minimized-energy molecular models.
- Property data (measured and predicted) were taken from the DIPPR and NIST (National Institute of Standards and Technology) databases.

Descriptor Types Generated by the Dragon Program
3-D descriptors are very sensitive to molecular structure minimization.

Identifying Inaccuracy and Inconsistency Among 1600 Molecular Descriptors
Sources of inaccuracy and inconsistency:
- The descriptor cannot be calculated by DRAGON (-999).
- The descriptor value is set at zero for certain compounds.
- 3-D descriptors are sensitive to the structure minimization method.
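The first two checks in the list above lend themselves to a simple screening pass. The following sketch assumes the descriptors are held as a dictionary of value lists, one value per compound; that layout, the function name and the rule of keeping all-zero descriptors are illustrative assumptions. Only the -999 marker comes from the slide.

```python
MISSING = -999  # value the slide reports when DRAGON cannot calculate a descriptor


def screen_descriptors(table):
    """Split descriptors into usable ones and ones flagged as suspect.

    table -- dict mapping descriptor name to a list of values, one per compound
    Returns (kept, flagged), where flagged maps each name to a reason.
    """
    kept = {}
    flagged = {}
    for name, values in table.items():
        has_zero = any(v == 0 for v in values)
        all_zero = all(v == 0 for v in values)
        if any(v == MISSING for v in values):
            flagged[name] = "not calculated for some compounds"
        elif has_zero and not all_zero:
            # Assumption: a zero for only some compounds is treated as suspect,
            # while a descriptor that is zero everywhere is kept as constant.
            flagged[name] = "zero for some compounds"
        else:
            kept[name] = values
    return kept, flagged
```

Sensitivity of 3-D descriptors to the minimization method cannot be detected from the values alone, so it is left out of this check.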

Presentation Outline
- Categorizing the Molecular Descriptors According to the Trend of Their Change with n_C for Homologous Series
- Identifying Training Sets from Compounds Belonging to the Target Compound's Homologous Series
- Predicting Critical Properties, Normal Boiling and Melting Temperatures, Liquid Molar Volume and Refractive Index for Five Homologous Series with and without the Use of 3-D Descriptors
- Comparison of the Results and Conclusions

Checking Consistency of Molecular Descriptors: Consistent Change with n_C for Homologous Series
The descriptor ADDD changes with n_C for the 1-alkene series in a trend similar to the change of liquid molar volume.

Checking Consistency of Molecular Descriptors: Consistent Change with n_C for Homologous Series
Figure: normalized values of the descriptors AGDD, ASP and H4m versus n_C for the 1-alkene homologous series. The trend is similar to that of T_C.

Checking Consistency of Molecular Descriptors: Consistent Change with n_C for Homologous Series
The descriptor ICR changes with n_C for the 1-alkene series in a trend similar to the change of normal melting temperature.

Checking Consistency of Molecular Descriptors: Inconsistent Change with n_C for Homologous Series
The descriptor Gm changes with n_C for the 1-alkene series in an apparently random manner.

Trend of Change of Descriptors with n_C for Homologous Series
Constant descriptors identify the compounds that belong to the homologous series (HS) of the target compound, and linearly increasing descriptors are used to rank those compounds according to their distance from the target.
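The two categories named on this slide can be recognized numerically. The sketch below labels a descriptor as "constant" or "linear increase" from its values along a homologous series and lumps everything else into "other"; the tolerance, the R-squared threshold and the function name are assumptions, and the remaining categories of the presentation are not modeled.

```python
def classify_trend(n_c, values, tol=1e-9, r2_min=0.999):
    """Label the trend of a descriptor versus carbon number n_c.

    n_c    -- carbon numbers of the series members
    values -- descriptor value for each member, in the same order
    tol    -- spread below which the descriptor counts as constant (assumed)
    r2_min -- minimum R^2 of a straight-line fit to count as linear (assumed)
    """
    if len(n_c) != len(values) or len(values) < 3:
        raise ValueError("need at least three matching points")
    if max(values) - min(values) <= tol:
        return "constant"

    n = len(values)
    mean_x = sum(n_c) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in n_c)
    if sxx == 0:
        raise ValueError("carbon numbers are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(n_c, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Coefficient of determination of the straight-line fit.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(n_c, values))
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    r2 = 1.0 - ss_res / ss_tot

    if slope > 0 and r2 >= r2_min:
        return "linear increase"
    return "other"
```

Descriptors labeled "constant" would then be compared between target and candidate to test series membership, while the "linear increase" ones supply a distance such as the absolute difference in descriptor value.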

Prediction of T_C, T_b and RI (Refractive Index) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
In about 93% of the cases, descriptors of category IIIA were used as the dominant descriptor (the first to enter, out of one or two). Exception: 3-D descriptors for 1-alcohols (category IV).

Prediction of V_C and V_m (Liquid Molar Volume) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
In 90% of the cases, descriptors of category II were used. Exception: 3-D descriptors for 1-alkenes and 1-alcohols (category IV).

Prediction of P_C and T_m (Melting Point) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
Descriptors of category IIIA were used in 40% of the cases, category IV in 35%, category V in 20% and category II in 5%.

Uncertainty (%) in Predicting Various Properties Without 3-D Descriptors
The large prediction errors in V_C (and P_C) are due to the uncertainty of the DIPPR data. The errors in the melting point are caused by the irregular shape of the melting point curve (3-D descriptors are needed).

Conclusions
1. The Dragon descriptors were divided into seven categories according to the trend of their change as a function of n_C in homologous series.
2. It was observed that 3-D descriptors may exhibit very irregular (or even random) behavior.
3. The exclusive use of descriptors of two categories, "Constant" and "Linear Increase", enabled selection of training sets belonging to the target compound's homologous series.
4. The use of the proposed method for predicting 7 properties for 5 homologous series has shown that most properties can be predicted at the level of experimental uncertainty without using 3-D descriptors. This extends the method's applicability, increases its reliability and reduces the probability of "chance correlations".