AN APPROACH TO DETERMINE THE APPLICATION DOMAIN OF GROUP CONTRIBUTION MODELS

Nina Jeliazkova (1), Joanna Jaworska (2)
(1) IPP, Bulgarian Academy of Sciences, Sofia, Bulgaria
(2) Central Product Safety, Procter & Gamble, Brussels, Belgium

Abstract

There is a practical need for an automatic (computerized) procedure to determine the application domain of a QSAR model. In this paper we attempt to address this need and focus on defining the application domain of group contribution methods. These methods are characterized by a high number of descriptors, i.e. high dimensionality. For feasibility reasons we propose to estimate the application domain as the parameter space bounded by the training set parameter ranges. We then demonstrate how to apply this approach in practice, using the Syracuse Research Corporation KOWWIN model as an example.

Approach

- Approximate the application domain by ranges determined from the training set:
  - fragment and correction factor ranges;
  - the log Kow range, because the combination of fragments may be out of range.
- Analyse the KOWWIN training set and obtain fragment and correction factor statistics for the training and validation sets.
- Compare the training and validation sets of the KOWWIN model.

Methods

Atom Fragment Contribution (AFC) method:

- Uses counts of fragments as descriptors.
- Uses very simple fragments (each non-hydrogen atom is the core of a fragment; this minimizes the possibility of missing fragments).
- In addition to simple fragments, uses correction factors (these are complex fragments, always larger than a single atom).
- Is fitted by two-stage multivariate regression.

The property is modelled as

\log K_{ow} = \sum_i f_i n_i + \sum_j c_j n_j + b

where
f_i - the coefficient for each fragment;
n_i - the number of times the fragment occurs in the structure;
c_j - the coefficient for each correction factor;
n_j - the number of times the correction factor occurs or is applied in the structure;
b - the equation constant (the "Const" row in the KOWWIN output).

SRC KOWWIN full text output:

    SMILES : Oc(c(cc(c1)Cc(cc(c(O)c2C(C)(C)C)C(C)(C)C)c2)C(C)(C)C)c1C(C)(C)C
    CHEM   : Phenol, 4,4'-methylenebis 2,6-bis(1,1-dimethylethyl)-
    MOL FOR: C29 H44 O2
    MOL WT :
    TYPE   | NUM | LOGKOW FRAGMENT DESCRIPTION                  | COEFF | VALUE
    Frag   | 12  | -CH3 [aliphatic carbon]                      |       |
    Frag   | 1   | -CH2- [aliphatic carbon]                     |       |
    Frag   | 12  | Aromatic Carbon                              |       |
    Frag   | 2   | -OH [hydroxy, aromatic attach]               |       |
    Frag   | 4   | -tert Carbon [3 or more carbon attach]       |       |
    Factor | 1   | -CH2- (aliphatic), 2 phenyl attach correc    |       |
    Factor | 2   | Ring rx: -OH / di-ortho;sec- or t- carbon    |       |
    Const  |     | Equation Constant                            |       |
    Log Kow =

Data

The KOWWIN training set and validation set were provided by Syracuse Research Corp. Software was developed to read the full text output of the SRC KOWWIN program and extract the fragment and correction factor statistics of the training and validation sets.

Analyzed data set | No. compounds | No. fragments | No. correction factors | Experimental log Kow range
KOWWIN training set | | | | max 8.19
KOWWIN validation set | | | |

An excerpt from the 508 fragment list for KOWWIN and its representation in the training and validation sets:

Fragment | Training frequency | Training min | Training max | Validation frequency | Validation min | Validation max
Aromatic Carbon | 1786 (73%) | | | (80%) | 1 | 30
CH3 [aliphatic carbon] | 1388 (57%) | | | (67%) | 1 | 20
CH2 [aliphatic carbon] | 1076 (44%) | | | (64%) | 1 | 28
CH [aliphatic carbon] | 457 (18%) | | | (35%) | 1 | 23
C [aliphatic carbon-No H not tert] | 229 (9%) | | | (12%) | 1 | 11
O [oxygen aliphatic attach] | 108 (4%) | | | (11%) | 1 | 12
F [fluorine aliphatic attach] | 103 (4%) | 1 | 6 | 540 (5%) | 1 | 23
Cl [chlorine aliphatic attach] | 100 (4%) | 1 | 6 | 354 (3%) | 1 | 12

Figure: Overlay between training and validation sets.

Application domain and prediction error

Number of compounds | Training set | Validation set
All | |
In-domain | |
Out-of-domain | 0 | 651

The average prediction error outside the application domain defined by the training set ranges is twice as large as the prediction error inside the domain. This holds only on average: many individual compounds outside the domain have low error, and some compounds inside the domain have high error.

Discussion

The AFC method is representative of group contribution methods, which rest on two fundamental assumptions:

- Additivity: each structural component of a compound makes a separate and additive contribution to the property of interest. Additivity is a widely accepted hypothesis, with supporting evidence from empirical studies and contemporary quantum theories.
- Transferability: these contributions are the same across a wide variety of compounds.

The property of a single compound is therefore modelled as a sum of the contributions associated with each atom or fragment (additivity), assuming that the contribution of an identical atom or fragment is the same as in the compounds originally used to derive the contributions (transferability).

Examples where the assumptions fail:

- Molecules in which the same fragment occurs many times (e.g. a long aliphatic chain): additivity is extrapolated beyond the training set.
- Molecules with "uncommon" functional groups: transferability is difficult to establish because of poor statistics.
- Complex structures are not always sufficiently represented, because the AFC method uses very simple fragments (e.g. compounds with large aliphatic rings are treated like aliphatic chains).

The training space, as defined by the fragment and correction factor ranges, consists of 5.44E+41 unique points. Of this enormous space the training set uses only 2113 unique points (some of the 2434 points coincide). This means that only 3.88E-37 % of the training space is covered by the training set points. Given the good practical experience with the model, this suggests that the additivity assumption holds within the training set space.

These observations support the view that, to determine the applicability of a (QSAR) model, it is essential to evaluate the model assumptions.
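
The range-based domain check and the coverage figure discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the software used for the poster: the function names, the dictionary representation of a compound (fragment or correction factor name mapped to its count), and the toy data are all assumptions. In particular, the sketch treats an absent fragment as a count of 0 and counts the integer points of the bounding box as the product of (max - min + 1) over all descriptors; the poster does not state the exact counting rule behind its 5.44E+41 figure.

```python
from math import prod


def fit_ranges(training_set):
    """Return {descriptor: (min_count, max_count)} over the training set.

    Each compound is a dict of descriptor name -> count (hypothetical
    representation). A descriptor absent from a compound counts as 0.
    """
    names = sorted(set().union(*training_set))
    return {
        name: (
            min(c.get(name, 0) for c in training_set),
            max(c.get(name, 0) for c in training_set),
        )
        for name in names
    }


def out_of_range(compound, ranges):
    """List the descriptors whose count falls outside the training range.

    Descriptors never seen in training get the range (0, 0), so any
    occurrence of an unknown fragment is flagged.
    """
    violations = []
    for name in sorted(set(compound) | set(ranges)):
        low, high = ranges.get(name, (0, 0))
        if not low <= compound.get(name, 0) <= high:
            violations.append(name)
    return violations


def in_domain(compound, ranges):
    """A compound is in the domain when no descriptor is out of range."""
    return not out_of_range(compound, ranges)


def box_size(ranges):
    """Number of integer points in the box spanned by the ranges."""
    return prod(high - low + 1 for low, high in ranges.values())


def coverage_percent(training_set, ranges):
    """Percentage of the box occupied by distinct training points."""
    names = sorted(ranges)
    unique = {tuple(c.get(n, 0) for n in names) for c in training_set}
    return 100.0 * len(unique) / box_size(ranges)


# Toy data, invented for illustration only.
training = [
    {"CH3": 2, "CH2": 4},
    {"CH3": 1, "Aromatic Carbon": 6, "OH": 1},
    {"CH3": 2, "CH2": 4},  # coincides with the first compound
]
ranges = fit_ranges(training)

print(in_domain({"CH3": 2, "CH2": 3}, ranges))
print(out_of_range({"CH3": 2, "CH2": 12}, ranges))
print(out_of_range({"CH3": 1, "F": 3}, ranges))
print(box_size(ranges), coverage_percent(training, ranges))
```

The same arithmetic reproduces the poster's coverage figure: `100 * 2113 / 5.44e41` is about 3.88E-37 %. A check of the experimental log Kow range, which the approach also lists, could be added as one more `(min, max)` entry compared against the predicted value.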