(Q)SAR and (Q)AAR analysis of ToxCast Dataset Using PASS

Slides:



Advertisements
Similar presentations
6th lecture Modern Methods in Drug Discovery WS10/11 1 More QSAR Problems: Which descriptors to use How to test/validate QSAR equations (continued from.
Advertisements

Notes Sample vs distribution “m” vs “µ” and “s” vs “σ” Bias/Variance Bias: Measures how much the learnt model is wrong disregarding noise Variance: Measures.
Learning Algorithm Evaluation
Supervised classification performance (prediction) assessment Dr. Huiru Zheng Dr. Franscisco Azuaje School of Computing and Mathematics Faculty of Engineering.
Quantitative Structure-Activity Relationships (QSAR) Comparative Molecular Field Analysis (CoMFA) Gijs Schaftenaar.
Evaluating Hypotheses
QSAR Modelling of Carcinogenicity for Regulatory Use in Europe Natalja Fjodorova, Marjana Novič, Marjan Vračko, Marjan Tušar, National institute of Chemistry,
Quantitative Genetics
Statistical Methods For Engineers ChE 477 (UO Lab) Larry Baxter & Stan Harding Brigham Young University.
ANCOVA Lecture 9 Andrew Ainsworth. What is ANCOVA?
Institute of Biomedical Chemistry of Rus. Acad. Med. Sci.; *A.N. Sysin Institute of Human Ecology and Environmental Health of Rus. Acad. Med. Sci., Moscow,
David Kim Allergan Inc. SoCalBSI California State University, Los Angeles.
Dr. Asawer A. Alwasiti.  Chapter one: Introduction  Chapter two: Frequency Distribution  Chapter Three: Measures of Central Tendency  Chapter Four:
Use of Machine Learning in Chemoinformatics Irene Kouskoumvekaki Associate Professor December 12th, 2012 Biological Sequence Analysis course.
Identifying Applicability Domains for Quantitative Structure Property Relationships Mordechai Shacham a, Neima Brauner b Georgi St. Cholakov c and Roumiana.
By: Amani Albraikan.  Pearson r  Spearman rho  Linearity  Range restrictions  Outliers  Beware of spurious correlations….take care in interpretation.
PCB 3043L - General Ecology Data Analysis. OUTLINE Organizing an ecological study Basic sampling terminology Statistical analysis of data –Why use statistics?
Paola Gramatica, Elena Bonfanti, Manuela Pavan and Federica Consolaro QSAR Research Unit, Department of Structural and Functional Biology, University of.
Chemical Reactions in Ideal Gases. Non-reacting ideal gas mixture Consider a binary mixture of molecules of types A and B. The canonical partition function.
P. Gramatica and F. Consolaro QSAR Research Unit, Dept. of Structural and Functional Biology, University of Insubria, Varese, Italy.
Evaluating Results of Learning Blaž Zupan
PCB 3043L - General Ecology Data Analysis. PCB 3043L - General Ecology Data Analysis.
STATISTICS AND OPTIMIZATION Dr. Asawer A. Alwasiti.
O PTIMAL NANO - DESCRIPTORS AS TRANSLATORS OF ECLECTIC DATA INTO PREDICTION OF THE CELL MEMBRANE DAMAGE BY MEANS OF NANO METAL - OXIDES A LLA P. T OROPOVA.
F.Consolaro 1, P.Gramatica 1, H.Walter 2 and R.Altenburger 2 1 QSAR Research Unit - DBSF - University of Insubria - VARESE - ITALY 2 UFZ Centre for Environmental.
Feature Extraction Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and.
MUTAGENICITY OF AROMATIC AMINES: MODELLING, PREDICTION AND CLASSIFICATION BY MOLECULAR DESCRIPTORS M.Pavan and P.Gramatica QSAR Research Unit, Dept. of.
P. Gramatica 1, H. Walter 2 and R. Altenburger 2 1 QSAR Research Unit - DBSF - University of Insubria - VARESE - ITALY 2 UFZ Centre for Environmental Research.
Use of Machine Learning in Chemoinformatics
Discriminating between Drugs and Nondrugs by Prediction of Activity Spectra for Substances (PASS) Soheila Anzali, Gerhard Barnickel, Bertram Cezanne, Michael.
(Quantitative) Structure- Activity Relationships (Q)SAR.
이 장 우. 1. Introduction  HPLC-MS/MS methodology achieved its preferred status -Highly selective and effectively eliminated interference -Without.
So that k k E 5 = - E 2 = = x J = x J Therefore = E 5 - E 2 = x J Now so 631.
Lipinski’s rule of five
General Concepts in QSAR for Using the QSAR Application Toolbox
7. Performance Measurement
Matteo Reggente Giulia Ruggeri Satoshi Takahama
How to forecast solar flares?
PHYSICO-CHEMICAL PROPERTIES MODELLING FOR ENVIRONMENTAL POLLUTANTS
Performance Evaluation 02/15/17
Inference for Regression
Chapter 11: Simple Linear Regression
Evaluating Results of Learning
PCB 3043L - General Ecology Data Analysis.
Hierarchical Classification of Calculated Molecular Descriptors
Hypothesis Testing and Confidence Intervals (Part 1): Using the Standard Normal Lecture 8 Justin Kern October 10 and 12, 2017.
ADME/Tox PredictionTox Prediction. The characterization of Absorption, Distribution, Metabolism, and Excretion (also known as ADME) and Toxicity are essential.
Introduction to Inferential Statistics
Analytical Method Validation
Virtual Screening.
Remedial Kinetics and Mechanism
Evaluation and Its Methods
10701 / Machine Learning Today: - Cross validation,
What is Regression Analysis?
P. Gramatica1, F. Consolaro1, M. Vighi2, A. Finizio2 and M. Faust3
Learning Algorithm Evaluation
What are BLUP? and why they are useful?
Ovanes Mekenyan, Milen Todorov, Ksenia Gerova
Product moment correlation
An Introduction to Correlational Research
DSS-ESTIMATING COSTS Cost estimation is the process of estimating the relationship between costs and cost driver activities. We estimate costs for three.
Inferential Statistics
Lecture # 2 MATHEMATICAL STATISTICS
Parametric Methods Berlin Chen, 2005 References:
SKTN 2393 Numerical Methods for Nuclear Engineers
M.Pavan, P.Gramatica, F.Consolaro, V.Consonni, R.Todeschini
Evaluation and Its Methods
Introduction to Machine learning
More on Maxent Env. Variable importance:
ECE – Pattern Recognition Lecture 8 – Performance Evaluation
Presentation transcript:

(Q)SAR and (Q)AAR analysis of ToxCast Dataset Using PASS and GUSAR approaches Vladimir Poroikov, Dmitry Filimonov, Alexey Zakharov, Alexey Lagunin, Sergey Novikov* Institute of Biomedical Chemistry of Rus. Acad. Med. Sci.; *A.N. Sysin Institute of Human Ecology and Environmental Health of Rus. Acad. Med. Sci., Moscow, Russia. Analysis of Chemical Space Introduction The aim of the study: (1) To estimate the possibility of prediction of ToxCast Phase 1 (TC1) in vivo data on the basis of structural formulae, physical-chemical properties and in vitro data from TC1 dataset. (2) To estimate the possibility of prioritization of molecules from the TC1 dataset for the toxicological testing using the integral parameter. ToxDose We calculated the integral parameter ToxDose for twenty end-points of carcinogenicity for mouse and rat. ToxDose values are correlated with other carcinogenicity end-points and may be use for prioritization of molecules from the TC1 dataset for toxicological testing. The data for cholinesterase inhibition differs significantly from all other end-points; thus they were excluded from the further analysis. Materials The data on in vivo and in vitro assays of chemical compounds were used for (quantitative) structure-activity relationships ((Q)SAR) and (quantitative) activity-activity relationships ((Q)AAR) analysis from the ToxCast Phase 1 dataset. The data from CPDB dataset (CPDB, 25 October 2007) was used as the training set for the carcinogenicity prediction of in vivo ToxCast assays. The data were extracted from EPA Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network [1]. We used 1397 compounds that were tested in the standard two-year rodent carcinogenicity bioassay. Small inorganic compounds (e.g. NO2), oils, paraffins and mixtures of compounds were excluded from the set. Methods PASS program. (Prediction of Activity Spectra for Substances) is a computer program for evaluation of general biological potential in a molecule on the basis of its structural formulae [2]. MNA ("Multilevel Neighbourhoods of Atoms") descriptors are used for presentation of a compound’s structure. The list of predictable biological activities contains 3750 types (PASS 2009.1 version) including main and side pharmacological effects (antihypertensive, hepatoprotective, anti-inflammatory etc.), mechanisms of action (5-hydroxytryptamine agonist, cyclooxygenase inhibitor, adenosine uptake inhibitor, etc.), specific toxicities (mutagenicity, carcinogenicity, teratogenicity, etc.) and metabolic terms (CYP1A substrate, CYP3A4 inhibitor, CYP2C9 inducer, etc.). The mean accuracy calculated by leave-one-out cross-validation procedure is 95%. PASS predictions for TC1 molecules are presented at the ToxCast web-site, and can be used as parameters characterizing these compounds in biological space. QNA descriptors. A molecular structure is described as a set of QNA (Quantitative Neighborhoods of Atoms) descriptors [3]. QNA descriptors are based on values of ionization potential (IP) and electron affinity (EA) of each atom in the molecule. QNA descriptors are calculated as following: Pi = Σk Bi-½(Exp(–½C))ikBk-½, Qi = Σk Bi-½(Exp(–½C))ikBk-½Ak, where Ak = ½(IPk + EAk), Bk = IPk – EAk, C is a molecular connectivity matrix. Thus, each atom of molecule is described by two values, P and Q. Since any molecule has different number of atoms, P and Q are proportional to the number of atoms in molecule, but for regression analysis it is necessary to describe the molecular structure as a vector with the fixed length. Therefore, Chebyshev polynomial’s are used for vector’s presentation of a molecular structure: where Tn is nth degree of Chebyshev polynomial, P` and Q` are the orthonormalized representation of P and Q values (zero mean values of P` and Q`, unit variance and absence of correlation of P` and Q`). The Tn(P,Q) values are calculated for each atom of a molecule. A whole molecule is presented as an average value of Chebyshev polynomials for all atoms; therefore, the length of the vector is defined by the numbers of Chebyshev polynomials - m. On one hand the large number of Chebyshev polynomials may describe complex structure-activity relationships; on the other hand the large length of the vector that represents the structure may provide overtraining in regression analysis. Therefore, the initial value of m is determined as a half of number of molecules in the training set. Self-Consistent Regression (SCR). Self-consistent regression can obtain the best QSAR/QSPR model for the training set with a large number of descriptors. SCR is based on least-squares regularized method. The main feature of SCR method is a removal of variables, which are worse for description of an appropriate value [4]. Integral parameter ToxDose. Using dosage characteristics from all 75 end-points, experimental data for which were obtained in vivo, we calculated the integral parameter - ToxDose for 283 compounds: We compared the distribution of molecules from ToxCast Phase 1 dataset with compounds from the PASS Training set (10_MF), MDDR 2003, and RoadMap 2008. The compounds from ToxCast Phase 1 dataset contain less non-hydrogen atoms than typical drug-like molecules, and RoadMap dataset. In PASS training set average MW=416 Dalton; The average molecular weight of compounds from ToxCast Phase 1 database is 302 Dalton that smaller than those for drug-like compounds. Results of PASS Training for Predicting Different Categories of ToxDose No Activity Type Number IEP, % Data1 IEP, % Data2 1 ToxCast 301 5.151 5.267 2 CHR_ToxDose < 1.0 mkM/kg 9 25.689 5.706 3 CHR_ToxDose < 3.16 mkM/kg 19 28.623 9.155 4 CHR_ToxDose < 10.0 mkM/kg 51 28.052 6.487 5 CHR_ToxDose < 31.6 mkM/kg 89 25.008 5.24 6 CHR_ToxDose < 100.0 mkM/kg 145 27.272 6.457 7 CHR_ToxDose < 316.0 mkM/kg 176 24.13 6.011 8 CHR_ToxDose < 1000.0 mkM/kg 196 23.499 5.727 CHR_ToxDose 0.316-3.16 mkM/kg 15 33.091 14.14 10 CHR_ToxDose 1.0-10.0 mkM/kg 42 26.576 7.669 11 CHR_ToxDose 3.16-31.6 mkM/kg 70 28.502 7.024 12 CHR_ToxDose 10.0-100.0 mkM/kg 94 38.76 10.417 13 CHR_ToxDose 31.6-316.0 mkM/kg 87 33.645 9.508 14 CHR_ToxDose 100.0-1000.0 mkM/kg 40.901 15.01 CHR_ToxDose 316.0-3160.0 mkM/kg 26 34.029 14.915 Analysis of Biological Parameters Biological activities predicted by PASS can be directly compared to TC1 in vivo data only in a few cases (carcinogenicity and cholinesterase inhibitors). Comparison of TC1 in vivo data for the same species and between the species lead to the following conclusions: 1) Correlation coefficients between the in vivo data for same species varies from 0.75 to 0.95. 2) Correlation coefficients for the same tissues between the species less than 0.10. Comparison of in vivo data presented as integral parameter (ToxDose) with in vitro data demonstrated that the maximal value of correlation coefficient is 0.26 (ToxDose vs. ToxCast Novascreen data). Thus, no significant correlation between in vivo and in vitro data is found. PASS prediction of the carcinogenicity To predict the carcinogenicity with PASS, we used compounds from CPDB as a training set. After the training procedure average accuracy of prediction (LOO CV) equals to 74%. With the trained version of PASS we predicted carcinogenicity for 306 compounds from ToxRefDB. Four compounds that have two components were excluded from the prediction. Accuracy of prediction for ToxRefDB is given below.   NA TP TN FP FN Sensitivity Specificity Accuracy CHR_Mouse_LiverTumors 71 31 108 41 64 0.33 0.72 0.57 CHR_Mouse_LungTumors 11 138 84 0.12 0.93 0.61 CHR_Mouse_Tumorigen 25 129 49 0.34 0.76 0.63 CHR_Rat_LiverTumors 62 8 159 28 58 0.85 0.66 CHR_Rat_TesticularTumors 12 133 101 7 CHR_Rat_ThyroidTumors 3 212 20 18 0.14 0.91 CHR_Rat_Tumorigen 184 46 15 0.35 0.8 Here: Data1 is 301 chemicals from ToxCast in vivo dataset with 100 drug-like chemical compounds; Data2 is 301 chemicals from ToxCast in vivo dataset with 60620 chemical compounds from PASS training set. The ninety five percent of ToxCast compounds are discriminated from drug-like molecules. For different toxicity grouping PASS accuracy of recognition varies from 75.0% to 59.1%; and the most toxic compounds are predicted better. Thus, PASS prediction could be applied for selection of priorities in testing of the most probable toxic compounds. NA - data not available; TP - true positive; TN - true negative; FP - false positive; FN - false negative; Sensitivity – TP/(TP+FN); Specificity – TN/(TN+FP); Accuracy – (TP+TN)/(TP+TN+FP+FN) Conclusions Accuracy of prediction varies from 0.57 (CHR_Mouse_LiverTumors and CHR_Rat_TesticularTumors) to 0.85 (CHR_Rat_ThyroidTumors). Sensetivity varies from 0.12 (CHR_Mouse_LungTumors) to 0.63 (CHR_Rat_TesticularTumors). Specificity varies from 0.57 (CHR_Rat_TesticularTumors) to 0.93 (CHR_Mouse_LungTumors). Rat activities were predicted more accurately than mouse activities. 1) Compounds from TC1 dataset are smaller than typical drug-like structures and molecules from RoadMap set. 2) No significant correlation between the in vivo and in vitro data from TC1 set was observed. 3) Despite the chemical dissimilarity between the TC1 compounds and drug-like molecules, PASS-based prediction of carcinogenicity could be obtain with reasonable accuracy. 4) It is shown that integral parameter characterizing general toxicity ToxDose can be predicted by PASS with reasonable accuracy. Thus, such approach could be recommended for prioritization in chemicals testing. QSAR Models for Rat’s Cholinesterase Inhibitors We collected 45 inhibitors of Rat’s Cholinesterase (CHR_Rat_CholinesteraseInhibition) from TC1. The toxicity end-point based on the EC50 (mg/kg) values was used for building of QSAR models, using QNA descriptors and SCR. The eighteen QSAR models were created by QNA/SCR approach. Only four QSAR models have Q2 > 0.50 and R2 > 0.60. These models were additionally validated by leave-10%-out cross validation. Leave-10%-out cross validation procedure was repeated 20 times and average R2 of prediction was calculated. Results of validation are presented below. Acknowledgements. We gratefully acknowledge Prof. Alex Tropsha for kindly assistance in presentation of the results at the ToxCast Poster Session. The work was supported in part by the FP7 project 200787 (OpenTox) and ISTC project # 3777. Apologies. I am sorry for not obtaining the US visa in time and, therefore, inability to take part in the ToxCast Workshop on May 14-15, 2009. In case, if you will have any questions/suggestions, please, do not hesitate to contact me: vladimir.poroikov@ibmc.msk.ru; tel: 7 499 246-0920; fax: 7 499 245-0857. Name Number R2 Q2 Fisher SD Variables L10%OCV model 1 45 0.697 0.567 10.706 0.699 8 0.71 model 2 0.684 0.559 10.073 0.706 0.70 model 3 0.676 0.553 11.366 0.709 7 0.65 model 4 0.686 0.520 7.960 0.744 10 0.52 where: D is the LEL value; m is the number of end-points for a particular compound; n is the total number of tests. Three from four QSAR models have average R2 of prediction more then 0.60. Therefore, the obtained models are robust and predictive. References 1) http://www.epa.gov/NCCT/dsstox/ 2) Poroikov V, Filimonov D. 2005. PASS: Prediction of Biological Activity Spectra for Substances. In: Predictive Toxicology (Christoph Helma, eds). LLC, Boca Raton, Taylor & Francis Group, 459-478. 3) Lagunin, A.; Zakharov, A.; Filimonov, D.; Poroikov, V. A new approach to QSAR modelling of acute toxicity. SAR and QSAR in Environmental Research 2007, 18, 285-298. 4) Filimonov, D.; Akimov, D.; Poroikov, V. The Method of Self-Consistent Regression for the Quantitative Analysis of Relationships Between Structure and Properties of Chemicals. Pharm.Chem. J. 2004, 1, 21-24.