QSAR Modelling of Carcinogenicity for Regulatory Use in Europe Natalja Fjodorova, Marjana Novič, Marjan Vračko, Marjan Tušar, National Institute of Chemistry, Ljubljana, Slovenia

CAESAR MEETING, , BERLIN, GERMANY

Overview
 -Carcinogenic potency prediction: state of the art
 -Data and methods used for modelling by NIC_LJU
 -Statistical performance of the obtained models and their evaluation
 -Some findings about structural alerts
 -Conclusion

Carcinogenic potency prediction: state of the art
QSAR models for carcinogenicity fall into two families:
 -congeneric models (for certain classes of chemicals): external prediction accuracy for rodent carcinogenicity is 58 to 71%
 -noncongeneric models (for different classes of chemicals): accuracy is around 65%
Further studies are required to improve the predictive reliability for noncongeneric chemicals.
Ref.: Romualdo Benigni, Cecilia Bossa, Tatiana Netzeva, Andrew Worth. Collection and Evaluation of (Q)SAR Models for Mutagenicity and Carcinogenicity. EUR 22772 EN, 2007.

The chemicals involved in this study belong to different chemical classes (noncongeneric substances). The work addresses industrial chemicals, with reference to the REACH initiative. The aim is to cover the chemical space as widely as possible.

Carcinogenicity prediction within the scope of the CAESAR project Present state:
 -compilation of the dataset for carcinogenicity
 -cross-checking of structures
 -calculation of descriptors
 -selection of descriptors
 -development of carcinogenicity models
 -investigation of structural alerts (SA): ongoing

Dataset: 805 chemicals were extracted from the rodent carcinogenicity study findings for 1481 chemicals taken from the Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network, derived from the Lois Gold Carcinogenic Potency Database (CPDBAS)

Response:
 -for quantitative models: TD50_Rat, the carcinogenic potency in rat (expressed in mmol/kg body wt/day)
 -for qualitative models: a yes/no principle, P (positive, active) vs. NP (not positive, inactive)

Training and test sets: the 805 chemicals were split into a training set (644 chemicals) and a test set (161 chemicals); the split was done at the Helmholtz Centre for Environmental Research – UFZ (Germany)
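A split of this kind can be reproduced with a simple random partition. This is a generic sketch, not the actual procedure used at UFZ, and the seed value is an arbitrary placeholder:

```python
import random

def split_dataset(ids, n_test, seed=42):
    """Randomly partition chemical identifiers into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (training, test)

# 805 chemicals -> 644 for training, 161 for testing
train_ids, test_ids = split_dataset(range(805), n_test=161)
```

Fixing the seed makes the partition reproducible, which matters when several groups model the same training and test sets.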

Distribution of active (P) and inactive (NP) chemicals in the total, training and test sets

Descriptors:
 -254 MDL descriptors calculated by the MDL QSAR software (file 254MDLdes_806carcinogenicity.rar)
 -835 DRAGON descriptors calculated by the DRAGON software (file Dragon_Carc.xls)
 -88 CODESSA descriptors calculated using the CODESSA software (file 88_CODESSA_descr_Cancer.xls)

Descriptors used for modelling
Model CARC_NIC_CPANN_01: 27 MDL descriptors provided by NIC_LJU (variable selection: Kohonen network and PCA).
Model CARC_NIC_CPANN_02: 18 DRAGON and MDL descriptors taken from one of the best models (CARC_CSL_KNN_05) developed by CSL. The goal was to compare the results obtained for carcinogenicity prediction using different methods.
Model CARC_NIC_CPANN_03: 34 CODESSA descriptors taken from one of the best models (CARC_CSL_KNN_02) developed by CSL.
(Variable selection for models 2 and 3: cross-correlation matrix, multicollinearity technique, Fisher ratio and a genetic algorithm.)
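One common form of the cross-correlation-matrix filter mentioned above can be sketched as follows. This is an illustrative NumPy sketch, not the CSL implementation, and the 0.95 cutoff is an assumed value:

```python
import numpy as np

def drop_correlated(X, names, threshold=0.95):
    """Greedy filter: keep a descriptor only if its absolute correlation
    with every already-kept descriptor is at or below the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]
```

Note that the greedy pass is order dependent: of two highly correlated descriptors, the one listed first survives.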

Counter-Propagation Artificial Neural Network
Step 1: mapping of molecule Xs (the vector representing the structure) into the Kohonen layer
Step 2: correction of the weights in both the Kohonen and the output layer
Step 3: prediction of the four-dimensional target (toxicity), Ts = carcinogenicity
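The three steps above can be sketched with a minimal counter-propagation network. This is an illustrative NumPy sketch, not the CAESAR implementation: it uses a square Kohonen grid, winner-takes-all updates without a neighbourhood function, a one-dimensional target, and placeholder parameter values:

```python
import numpy as np

def train_cpann(X, T, grid=5, epochs=50, a_max=0.5, a_min=0.01, seed=0):
    """Counter-propagation ANN: a Kohonen input layer whose winning neuron
    also carries output-layer weights pulled toward the target values."""
    rng = np.random.default_rng(seed)
    n = grid * grid
    W = rng.random((n, X.shape[1]))   # Kohonen (input) weights
    U = rng.random((n, T.shape[1]))   # output-layer weights
    for epoch in range(epochs):
        # correction factor shrinks linearly from a_max to a_min
        a = a_max - (a_max - a_min) * epoch / max(epochs - 1, 1)
        for x, t in zip(X, T):
            c = np.argmin(((W - x) ** 2).sum(axis=1))  # step 1: winning neuron
            W[c] += a * (x - W[c])                     # step 2: correct both layers
            U[c] += a * (t - U[c])
    return W, U

def predict_cpann(W, U, x):
    """Step 3: the prediction is the output weight of the winning neuron."""
    return U[np.argmin(((W - x) ** 2).sum(axis=1))]
```

A full implementation would also update the neighbours of the winning neuron with a distance-dependent factor, which is what produces the smooth self-organised maps used for visual inspection.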

Model input parameters
 -minimal correction factor
 -maximum correction factor: 0.5
 -number of neurons in the x direction: 35
 -number of neurons in the y direction: 35
 -number of learning epochs: 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800

Statistical evaluation of the models
Confusion matrix for two classes: true positives (TP), true negatives (TN), false positives (FP), false negatives (FN).
Accuracy (AC) = (TN + TP) / (TN + TP + FN + FP)
Sensitivity (SE) = TP / (TP + FN)
Specificity (SP) = TN / (TN + FP)
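These three formulas translate directly into code. The counts in the example call are made-up numbers for illustration, not the actual model results:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from a two-class confusion matrix."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    sensitivity = tp / (tp + fn)   # fraction of carcinogens correctly flagged
    specificity = tn / (tn + fp)   # fraction of non-carcinogens correctly cleared
    return accuracy, sensitivity, specificity

# hypothetical counts for a 161-chemical test set
ac, se, sp = classification_metrics(tp=50, tn=54, fp=29, fn=28)
```

Reporting sensitivity and specificity alongside accuracy matters here because the positive and negative classes are not equally costly: a missed carcinogen (FN) is regulatorily worse than a false alarm (FP).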

Statistical performance of models

Raising the threshold from 0 to 1 decreases the number of false positives and increases the number of false negatives. This tendency is common to all three of our models (1, 2 and 3).

In the figure we have marked the maximum accuracy and the corresponding thresholds. For model 1, at the optimal threshold the accuracy reaches its maximal value of 0.68, sensitivity is 0.71 and specificity is 0.65.

For model 2 the optimal threshold for the test set is 0.6; at this point sensitivity is 0.69 and specificity is 0.72.

For model 3 the optimal threshold is 0.5, the maximum accuracy is 0.68 and sensitivity is 0.70. Changing the threshold trades sensitivity against specificity; it may be used to increase the number of correctly predicted carcinogens or non-carcinogens.

The closer the ROC curve approaches the point (0, 1), the more accurate the predictions. A model with no predictive ability yields the diagonal line.
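The area under such a ROC curve can be computed directly from the rank statistic it equals: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. An illustrative sketch with made-up scores:

```python
def roc_auc(scores, labels):
    """AUC via the Mann-Whitney formulation: the probability that a random
    positive is scored above a random negative (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a model that ranks every carcinogen above every non-carcinogen gives 1.0;
# a model with no predictive ability hovers around 0.5 (the diagonal line)
```

Unlike accuracy, this summary does not depend on any single threshold, which is why AUC is reported alongside the threshold-specific figures above.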

Accuracy of prediction and area under the curve (AUC) (models 1,2,3)

Structural alerts for our dataset were collected with Benigni's Toxtree program. For our dataset of 805 compounds we extracted the following alerts:
 -GA: genotoxic alerts
 -nGA: non-genotoxic alerts
 -NA: no carcinogenic alerts
We then calculated how many chemicals carrying each kind of alert fall into the NP (not positive) and P (positive) classes.

 -For substances with GA, about 2/3 belong to the positive (P) class and about 1/3 to the not-positive (NP) class.
 -For substances with nGA, about half belong to P and half to NP.
 -For substances with NA (no carcinogenic alerts), about 2/3 belong to NP and 1/3 to P.
P (positive) and NP (not positive) here relate only to results for rats. Further investigation is needed.

Conclusion
The quantitative models, with the tumorigenic dose TD50 for rats as the dependent variable, showed low predictive power, with a correlation coefficient for the test set below 0.5. Conversely, the qualitative models demonstrated excellent internal performance (accuracy on the training set of 91-93%) and good external performance (accuracy on the test set of 68-70%, sensitivity of 69-73% and specificity of 63-72%). Changing the threshold trades sensitivity against specificity and may be used to increase the number of correctly predicted carcinogens or non-carcinogens.