1 New Lazar Developments 10/2008 A. Maunz 1) C. Helma 1), 2) 1) FDM Freiburg Univ. 2) in silico toxicology.

Slides:



Advertisements
Similar presentations
JKlustor clustering chemical libraries presented by … maintained by Miklós Vargyas Last update: 25 March 2010.
Advertisements

CONFIDENTIAL INFORMATION KnowTox, a Franco- Hungarian collaborative project relative to toxicity.
Richard M. Jacobs, OSA, Ph.D.
Analysis of High-Throughput Screening Data C371 Fall 2004.
Design of Experiments Lecture I
gSpan: Graph-based substructure pattern mining
Visual Data Mining: Concepts, Frameworks and Algorithm Development Student: Fasheng Qiu Instructor: Dr. Yingshu Li.
K-NEAREST NEIGHBORS AND DECISION TREE Nonparametric Supervised Learning.
Comparison of Data Mining Algorithms on Bioinformatics Dataset Melissa K. Carroll Advisor: Sung-Hyuk Cha March 4, 2003.
Regulatory Toxicology James Swenberg, D.V.M., Ph.D.
PharmaMiner: Geometric Mining of Pharmacophores 1.
1 Development & Evaluation of Ecotoxicity Predictive Tools EPA Development Team Regional Stakeholder Meetings January 11-22, 2010.
GraphSig: Mining Significant Substructures in Compound Libraries 1.
Collaborative Information Management: Advanced Information Processing in Bioinformatics Joost N. Kok LIACS - Leiden Institute of Advanced Computer Science.
1 Graph Mining Applications to Machine Learning Problems Max Planck Institute for Biological Cybernetics Koji Tsuda.
QUANTITATIVE DATA ANALYSIS
Discovering Substructures in Chemical Toxicity Domain Masters Project Defense by Ravindra Nath Chittimoori Committee: DR. Lawrence B. Holder, DR. Diane.
Comparison of Instance-Based Techniques for Learning to Predict Changes in Stock Prices iCML Conference December 10, 2003 Presented by: David LeRoux.
Active Learning Strategies for Drug Screening 1. Introduction At the intersection of drug discovery and experimental design, active learning algorithms.
Active Learning Strategies for Compound Screening Megon Walker 1 and Simon Kasif 1,2 1 Bioinformatics Program, Boston University 2 Department of Biomedical.
Causal Modeling for Anomaly Detection Andrew Arnold Machine Learning Department, Carnegie Mellon University Summer Project with Naoki Abe Predictive Modeling.
SW388R7 Data Analysis & Computers II Slide 1 Multiple Regression – Basic Relationships Purpose of multiple regression Different types of multiple regression.
Decision Tree Models in Data Mining
Predicting Highly Connected Proteins in PIN using QSAR Art Cherkasov Apr 14, 2011 UBC / VGH THE UNIVERSITY OF BRITISH COLUMBIA.
Application of Toxicology Databases in Drug Development (Estimating potential toxicity) Joseph F. Contrera, Ph.D. Director, Regulatory Research and Analysis.
Development and Application of Computational Toxicology and Informatics Resources at the FDA CDER Office of Pharmaceutical Science The Informatics and.
Multiple Choice Questions for discussion
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Topological Summaries: Using Graphs for Chemical Searching and Mining Graphs are a flexible & unifying model Scalable similarity searches through novel.
Mining Optimal Decision Trees from Itemset Lattices Dr, Siegfried Nijssen Dr. Elisa Fromont KDD 2007.
Sung Kyu (Andrew) Maeng. Contents  QSAR Introduction  QSBR Introduction  Results and discussion  Current QSAR project in UNESCO-IHE.
Predicting a Variety of Constant Pure Compound Properties by the Targeted QSPR Method Abstract The possibility of obtaining a reliable prediction a wide.
Counseling Research: Quantitative, Qualitative, and Mixed Methods, 1e © 2010 Pearson Education, Inc. All rights reserved. Basic Statistical Concepts Sang.
Use of Machine Learning in Chemoinformatics Irene Kouskoumvekaki Associate Professor December 12th, 2012 Biological Sequence Analysis course.
Identifying Applicability Domains for Quantitative Structure Property Relationships Mordechai Shacham a, Neima Brauner b Georgi St. Cholakov c and Roumiana.
Discriminant Analysis Discriminant analysis is a technique for analyzing data when the criterion or dependent variable is categorical and the predictor.
Xiangnan Kong,Philip S. Yu Multi-Label Feature Selection for Graph Classification Department of Computer Science University of Illinois at Chicago.
Stefan Mutter, Mark Hall, Eibe Frank University of Freiburg, Germany University of Waikato, New Zealand The 17th Australian Joint Conference on Artificial.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
Paola Gramatica, Elena Bonfanti, Manuela Pavan and Federica Consolaro QSAR Research Unit, Department of Structural and Functional Biology, University of.
University at BuffaloThe State University of New York Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Frequent.
McKim Conference on Predictive Toxicology
PharmaMiner: Geometric Mining of Pharmacophores 1.
An Overview of the Objectives, Approach, and Components of ComET™ Mr. Paul Price The LifeLine Group All slides and material Copyright protected.
Data Mining and Decision Support
McKim Conference on Predictive Toxicology The Inn of Lake Superior Duluth, Minnesota September 25-27, 2007 Narcosis as a Reference Gilman Veith.
Use of Machine Learning in Chemoinformatics
Educational Research: Data analysis and interpretation – 1 Descriptive statistics EDU 8603 Educational Research Richard M. Jacobs, OSA, Ph.D.
Identification of structurally diverse Growth Hormone Secretagogue (GHS) agonists by virtual screening and structure-activity relationship analysis of.
Acute Toxicity Studies Single dose - rat, mouse (5/sex/dose), dog, monkey (1/sex/dose) 14 day observation In-life observations (body wt., food consumption,
NURS 306, Nursing Research Lisa Broughton, MSN, RN, CCRN RESEARCH STATISTICS.
5. Evaluation of measuring tools: reliability Psychometrics. 2011/12. Group A (English)
DOSE-RESPONSE ASSESSMENT
General Concepts in QSAR for Using the QSAR Application Toolbox
Mining Utility Functions based on user ratings
k-Nearest neighbors and decision tree
Mining in Graphs and Complex Structures
DATA MINING © Prentice Hall.
Large-Scale Graph Mining Using Backbone Refine-ment Classes
Instance Based Learning
Fast Kernel-Density-Based Classification and Clustering Using P-Trees
General Concepts in QSAR for Using the QSAR Application Toolbox
Background This is a step-by-step presentation designed to take the first time user of the Toolbox through the workflow of a data filling exercise.
Decision Contexts in a Changing Toxicology Paradigm
Advisory Committee for Pharmaceutical Science (ACPS)
Chap 8. Instance Based Learning
TUTORIAL Similar Compounds Searching
Machine Learning – a Probabilistic Perspective
Physics-guided machine learning for milling stability:
EFSA’s Chemical Hazards Database
Presentation transcript:

1 New Lazar Developments 10/2008 A. Maunz 1) C. Helma 1), 2) 1) FDM Freiburg Univ. 2) in silico toxicology

INTRODUCTION What is Lazar?

Lazar is a fully automated SAR system. 2D fragment-based (linear, tree-shaped under development) Nearest-neighbor predictions (local models) Confidence-weighting for single predictions Applications include highlighting, screening, and ranking of pharma- ceuticals. In use by industrial corporations. Regulatory acceptance request as alternative test method has been submitted. Part of EU research project OpenTox Public web-based prototype and source code available at: DEFAULT STYLES 3 Introduction

DEFAULT STYLES 4 Introduction

DEFAULT STYLES 5 Introduction...more neighbors

6 Introduction...more neighbors

Lazar is completely data-driven, no expert knowledge is needed. DEFAULT STYLES 7 Introduction

Similarity for a specific endpoint. DEFAULT STYLES 8 Introduction Every fragment f has an assigned p -value p f indicating its significance. Similarity of compounds x, y is the p -weighted ratio of shared fragments: Note that this equals the standard Tanimoto similarity if all p -values are set to 1.0.  {gauss(p f ) | f x f y } sim(x,y, ) =.

Confidence index 04 Introduction 9 conf = gauss(s) For every prediction, calculate the confidence as: s : median similarity of neighbors (to the query structure)

Features and p -values DEFAULT STYLES 10 Introduction Lazar 1) 1) C. Helma (2006): “Lazy Structure-Activity Relationships (lazar) for the Prediction of Rodent Carcinogenicity and Salmonella Mutagenicity”, Molecular Diversity, 10(2), 147–158 p -values for quantitative activities Regression Similarity concept is vital for nearest-neighbor approaches. Weight of neighbor contribution to the prediction Confidence for individual predictions Higher Accuracy tree-shaped fragments

To date, Lazar has been a classifier for binary endpoints only. Published results include: 11 Introduction DatasetWeighted accuracy Kazius Salmonella Mutagenicity 2) 90% CPDB Multicell Call81% 2) Kazius, J., Nijssen, S., Kok, J., Back, T., & IJzerman, A.P. (2006): “Substructure Mining Using Elaborate Chemical Representation”, J. Chem. Inf. Model., 46(2),

Kazius: Salmonella Mutagenicity Left: Confidence vs. true prediction rate Right: ROC analysis 12

CPDB: Multicell Call Left: Confidence vs. true prediction rate Right: ROC analysis 13

QUANTITATIVE PREDICTIONS Extension by

Enable prediction of quantitative values (regression) –p -values as determined by KS test : Support vector regression on neighbors –Activity-specific similarity as a kernel function : –Superior to Tanimoto index Standard deviation of neighbor activities influence confidence –Applicability Domain estimation : based on new confidence values considering dependent and independent variables –Gaussian smoothed 04 Quantitative Predictions 2) 15 2) Maunz, A. & Helma, C. (2008): “Prediction of chemical toxicity with local support vector regression and activity-specific kernels”, SAR and QSAR in Environmental Research, 19 (5),

Confidence index 04 Quantitative Predictions 16 conf = gauss(s) e -  a s : median similarity of neighbors (to the query structure) For every prediction, calculate the confidence as:  a : standard deviation of neighbor‘s activities

Validation results on DSSTox project data include: EPAFHM Fathead Minnow Acute Toxicity (lc50 mmol, 573 compounds) FDAMDD Maximum Recommended Therapeutic Dose based on clinical trial data (dose mrdd mmol, 1215 pharmaceutical compounds) IRIS Upper-bound excess lifetime cancer risk from continuous exposure to 1 μg/L in drinking water (drinking water unit risk micromol per L, 68 compounds) 04 Quantitative Predictions 17 Previously, FDAMDD and IRIS were not included in (Q)SAR studies.

Validation (FDAMDD) Effect of confidence levels 0.0 to 0.6 Step: PredictivityNumber of PredictionsRMSE / Weighted accuracy Quantitative Predictions

Applicability Domain (FDAMDD) Effect of confidence levels 0.0 to 0.6 Step:

BACKBONE MINING Mechanistic descriptors

04 Better descriptors 21 Ideal: Few highly descriptive patterns that are easy to mine and allow for mechanistical reasoning in toxicity predictions. We have been using linear fragments as descriptors. Better descriptors would consider stereochemistry and include branched substructures be less intercorrelated and fewer in numbers be better correlated to target classes Tree-shaped fragments Problem: ~80% of subgraphs in typical databases are trees! A method to cut down on correlated fragments is needed.

04 Backbones and classes 22 The backbone of a tree is defined as its longest path with the lexicographically lowest sequence. Each backbone identifies a (disjunct) set of tree-shaped fragments that grow from this backbone. A Backbone Refinement Class (BBRC) consists of tree refinements with identical backbones. Definition: Example on next slide

04 BBR classes (1 step Ex.) 23 C-C(-O-C)(=C-c:c:c) C-C(=C(-O-C)(-C))(-c:c:c) C-C(=C-O-C)(-c:c:c) Backbone: c:c:c-C=C-O-C Refinement Class 1 Class 2

04 BBR classes 24 Idea: Represent each BBRC by a single feature 3) Nijssen S. & Kok J.N.: “A quickstart in frequent structure mining can make a difference”, KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, New York, NY, USA: ACM 2004: 647–652. 4) Bringmann B., Zimmermann A., de Raedt L., Nijssen S.: “Dont be afraid of simpler patterns”, Proceedings 10th PKDD, Springer-Verlag 2006: 55–66. We use a modified version of the graph miner Gaston 3). Double-free enumeration through embedding lists and canonical depth sequences Our extension: Efficient mining of most significant BBRC representatives Supervised refinement based on  2 values by statistical metrical pruning 4) and dynamic upper bound adjustment

0425 CPDB Multicell call Comparison to linear fragments Filled: BBRC representatives Hollow: linear fragments BBRC validation

0426 CPDB Salmonella mutagenicity Comparison to linear fragments Filled: BBRC representatives Hollow: linear fragments BBRC validation

04 BBRC validation 27 Remarks: Linear fragments prev. used in Lazar Minimum frequency: lin. frag. 1, trees 6 Minimum correlation: lin. frag. none, trees p  2 >0.95 BBRC representatives time as measured on a lab workstation

SUMMARY

04 Summary 29 Reliable, automatic Applicability Domain estimation for individual predictions Includes both dependent and independent variables Significance-weighted kernel function Lazar for quantitative predictions

04 Summary 30 Highly heterogeneous Suitable for identification of structural alerts 94.5 % less fragments compared to linear fragments Mining time reduced by 84.3 % Descriptive power equal or superior to that of linear fragments Backbone refinement classes (Salmonella mutagenicity)

04 Acknowledgements 31 Thank you! Ann M. Richards (EPA) Jeroen Kazius (Leiden Univ.)