Hierarchical Classification of Calculated Molecular Descriptors


Prediction of Biological Partition Coefficients: Calculated Molecular Descriptors vs Experimentally Determined Properties

Denise Mills¹, Subhash C. Basak¹, Brian D. Gute¹, and Moiz M. Mumtaz²
¹ Natural Resources Research Institute, University of Minnesota Duluth, Duluth, MN, USA
² Computational Toxicology Laboratory, Division of Toxicology, Agency for Toxic Substances and Disease Registry, Atlanta, GA 30333, USA

Abstract

Biological partition coefficients are routinely used as input parameters in physiologically based pharmacokinetic (PBPK) models, which are developed for the assessment of chemical toxicity. In this study, predictive quantitative structure-activity relationship (QSAR) models for rat and human biological partition coefficients, namely blood:air, fat:air, brain:air, liver:air, muscle:air, and kidney:air, were developed utilizing experimentally determined partition coefficients for 131 chemicals obtained from the literature and calculated molecular descriptors based solely on chemical structure. The descriptors were partitioned into four hierarchical classes: topostructural, topochemical, 3-dimensional, and ab initio quantum chemical. Three types of regression methodologies (ridge regression, principal components regression, and partial least squares) were used comparatively in the development of these models. In addition to the structure-based models, ordinary least squares regression was used to develop comparative models based on experimentally determined properties, including saline:air and olive oil:air partition coefficients. The results of the study indicate that many of the structure-based models are comparable or superior to their respective property-based models. This is an important result considering that structural descriptors can be calculated quickly and inexpensively for both existing chemicals and those not yet synthesized.
With respect to the structure-based models, it was also found that ridge regression outperformed principal components regression and partial least squares regression, and that generally the topochemical descriptors alone produced models of good predictive ability. The descriptors found to be most influential in biological partitioning of chemicals include those which encode information regarding hydrogen bonding, polarity, and molecular size and shape.

Hierarchical Classification of Calculated Molecular Descriptors

[Figure: representations of 3-methylcyclohexanone, starting from the chemist's representation of structure, in order of increasing complexity]
- Topostructural Model (TS): simple graph; purely structural representation
- Topochemical Model (TC): chemical graph; contains chemical and valency information
- Geometrical Model (3D): 3-dimensional; based on chemical graph
- Quantum Chemical Model (QC): HΨ = EΨ; based on quantum mechanics

Results

[Results table not recovered. Footnotes: * Topochemical descriptors; ** Saline:air + oil:air partition coefficients]
[Figure: Actual vs Predicted Values: Rat Brain:Air PC]

Current Study

[Flow diagram: structural descriptors (QSAR) or experimental properties (QPAR) → predicted biological partition coefficients → PBPK model → predicted chemical toxicity]

Statistical Analysis

Regression Methodologies
Conventional ordinary least squares (OLS) regression was used to create the QPAR models. However, OLS is not appropriate when the number of descriptors exceeds the number of chemicals in the data set; ridge regression (RR) was therefore used to develop the QSAR models.* RR is an alternative linear method that:
- Makes use of all descriptors, as opposed to subset regression
- Is useful when the number of descriptors exceeds the number of observations
- Is useful when the descriptors are highly intercorrelated

Cross Validation
The cross-validated R² (R²c.v.) is based on the leave-one-out approach. Unlike a fitted R², R²c.v. does not increase upon the addition of irrelevant descriptors but rather tends to decrease, providing a reliable measure of model predictive ability.
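The combination of ridge regression and leave-one-out cross validation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the poster does not state which software was used, how the ridge constant was chosen, or how descriptors were scaled. The function names `ridge_fit` and `loo_cv_r2`, the ridge constant `k`, the standardization of descriptors within each training fold, and the definition R²c.v. = 1 - PRESS / SS_total are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def ridge_fit(X, y, k):
    """Ridge coefficients for standardized X and centered y (no intercept).

    Solves (X'X + k I) b = X'y. Unlike OLS, this system is solvable for
    k > 0 even when there are more descriptors than observations.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)


def loo_cv_r2(X, y, k):
    """Leave-one-out cross-validated R^2, taken here as 1 - PRESS / SS_total."""
    n = len(y)
    press = 0.0
    for i in range(n):
        train = np.arange(n) != i          # hold out observation i
        X_tr, y_tr = X[train], y[train]
        # Scaling is re-estimated on the training fold only (an assumption).
        mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
        sd[sd == 0] = 1.0                  # guard against constant descriptors
        y_mean = y_tr.mean()
        beta = ridge_fit((X_tr - mu) / sd, y_tr - y_mean, k)
        pred = y_mean + ((X[i] - mu) / sd) @ beta
        press += (y[i] - pred) ** 2        # squared prediction error
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_total


if __name__ == "__main__":
    # Synthetic example: 30 "chemicals", 50 "descriptors" (more than chemicals),
    # with the response depending on only the first three descriptors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 50))
    y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=30)
    for k in (0.1, 1.0, 10.0, 100.0):      # hypothetical grid of ridge constants
        print(f"k = {k:6.1f}  R2cv = {loo_cv_r2(X, y, k):.3f}")
```

Because each held-out chemical is predicted by a model that never saw it, an irrelevant descriptor cannot raise this statistic merely by improving the fit, which is the property the poster relies on when comparing descriptor classes.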
Identification of Important Descriptors
Important descriptors were identified according to high |t| value, where t is the model coefficient divided by its standard error.

[Table: Experimentally Determined Biological Partition Coefficient Data Used in QSAR and QPAR Model Development. Source: C.J.W. Meulenberg, H.P.M. Vijverberg. Toxicol. Appl. Pharmacol. 165, 206 (2000).]

Conclusions
- The structure-based models are comparable to the property-based models with respect to predictive ability, an important result considering that structural descriptors can be calculated quickly and inexpensively for any chemical, real or hypothetical.
- The most predictive structure-based models were those based on the easily calculated topological indices. Addition of the 3-dimensional and/or quantum chemical descriptors did not result in model improvement.
- The structural descriptors most important in the prediction of biological partitioning are those which encode information regarding hydrogen bonding, polarity, and molecular size and shape.
- This and other studies have shown that a large pool of structural descriptors representing diverse molecular and submolecular features is capable of predicting a wide range of properties.

* Although PLS and PCR were also used to develop QSAR models, RR provided the best results. Only RR results are reported here.

Disclaimer: The opinions expressed are those of the authors and do not necessarily represent the opinion or policy of the agency ATSDR.