Empirical Validation of the Effectiveness of Chemical Descriptors in Data Mining Kirk Simmons DuPont Crop Protection Stine-Haskell Research Center 1090.

Presentation transcript:

Empirical Validation of the Effectiveness of Chemical Descriptors in Data Mining
Kirk Simmons
DuPont Crop Protection
Stine-Haskell Research Center
1090 Elkton Road
Newark, DE

The Study
Purpose
Strategy
  – Methods
  – Metrics
Results
Practical Application
Conclusions

Purpose
Chemical Structure Conference (1996) – Holland
  – Data mining/similarity methodologies reported
  – Used numerous descriptor sets
  – No standard datasets
  – Comparisons difficult
Comparative study of chemical descriptors across varied biology

Strategy
Systematically evaluate descriptors within a compound dataset across multiple biological endpoints
All compounds have experimentally measured endpoints
Diversity of biological endpoints
  – In vitro (receptor affinity, enzyme inhibition)
  – In vivo (insect mortality)
Explored nine common descriptor sets
Train a model, then use it to forecast a validation set

Methods
Four in vitro assays
  – 48K compound dataset for training
  – Corporate database for validation
Two in vivo assays
  – 75-100K compound datasets
  – Randomly divided into training and validation subsets
Recursive partitioning as the analytic method
  – Appropriate method for HTS data
  – Selected statistically conservative inputs (p-tail < 0.01)
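The train-then-forecast procedure on this slide can be sketched in a few lines. This is a minimal illustration, not the software used in the study: it assumes binary descriptors, uses a 2x2 chi-squared test as the split criterion, and reads "p-tail < 0.01" as the threshold a split must pass. The names `chi2_tail`, `grow`, `predict`, and `min_size`, and the toy data, are all hypothetical.

```python
import math
import random


def chi2_tail(a, b, c, d):
    """Tail probability of a 2x2 chi-squared test (1 degree of freedom).

    a, b: actives / inactives that have the feature
    c, d: actives / inactives that lack it
    """
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        return 1.0  # degenerate table, no evidence for a split
    chi2 = n * (a * d - b * c) ** 2 / math.prod(margins)
    return math.erfc(math.sqrt(chi2 / 2))


def grow(data, features, p_max=0.01, min_size=5):
    """Recursively split (row, label) pairs on the most significant feature."""
    node = {"n": len(data), "hit_rate": sum(y for _, y in data) / len(data)}
    best = None
    for f in features:
        a = sum(1 for r, y in data if r[f] and y)
        b = sum(1 for r, y in data if r[f] and not y)
        c = sum(1 for r, y in data if not r[f] and y)
        d = len(data) - a - b - c
        if a + b < min_size or c + d < min_size:
            continue  # assumed guard: skip splits that leave a tiny node
        p = chi2_tail(a, b, c, d)
        if p < p_max and (best is None or p < best[0]):
            best = (p, f)
    if best is None:
        return node  # terminal node: no split passes the threshold
    f = best[1]
    rest = [g for g in features if g != f]
    node["feature"] = f
    node["yes"] = grow([(r, y) for r, y in data if r[f]], rest, p_max, min_size)
    node["no"] = grow([(r, y) for r, y in data if not r[f]], rest, p_max, min_size)
    return node


def predict(node, row):
    """Return the training hit rate of the terminal node a row falls into."""
    while "feature" in node:
        node = node["yes"] if row[node["feature"]] else node["no"]
    return node["hit_rate"]


# Toy data (hypothetical): 3 binary descriptors, activity driven by bit 0.
rows = [(i & 1, (i >> 1) & 1, (i >> 2) & 1) for i in range(160)]
data = [(r, r[0]) for r in rows]

# Random division into training and validation subsets.
random.Random(0).shuffle(data)
training, validation = data[:80], data[80:]

tree = grow(training, features=[0, 1, 2])
forecast = [predict(tree, r) for r, _ in validation]
```

Here the tree is grown on the training half only, and `forecast` holds the predicted hit rate for each validation compound. The toy labels are perfectly separable, which real screening data are not.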

Metrics
4-way Interaction
  – Analytic Method, Compound Set, Biology, and Descriptors
Efficiency of analysis (Lift Chart)
  – Fraction of Actives found / Fraction of Dataset tested
  – Rewards efficiency only
Effectiveness of analysis (Composite Score)
  – Fraction of Actives found x Efficiency
  – Rewards efficiency as well as completeness
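The two metrics follow directly from the ratios on the slide. A minimal sketch, with hypothetical function names and made-up counts that do not come from the talk:

```python
def lift(actives_found, total_actives, n_tested, n_total):
    """Efficiency: fraction of actives found / fraction of dataset tested."""
    fraction_found = actives_found / total_actives
    fraction_tested = n_tested / n_total
    return fraction_found / fraction_tested


def composite_score(actives_found, total_actives, n_tested, n_total):
    """Effectiveness: fraction of actives found x efficiency."""
    fraction_found = actives_found / total_actives
    return fraction_found * lift(actives_found, total_actives, n_tested, n_total)


# Hypothetical dataset: 1000 compounds, 50 of them active.
# Model A flags 125 compounds and recovers 25 actives.
lift_a = lift(25, 50, 125, 1000)
score_a = composite_score(25, 50, 125, 1000)

# Model B flags only 25 compounds and recovers 10 actives.
lift_b = lift(10, 50, 25, 1000)
score_b = composite_score(10, 50, 25, 1000)

print(lift_a, score_a)
print(lift_b, score_b)
```

In this example model B has the higher lift but the lower composite score, because it finds a smaller share of the actives. That is the sense in which the composite score rewards completeness as well as efficiency.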

Results - Training

Results - Forecasting

Averaged Results - Training

Averaged Results - Forecasting

Practical Application
RP-based models using screening data on 3 targets
  – Activity treated as active/inactive
  – DiverseSolutions® BCUT descriptors
RP-models used to forecast vendor compounds (1M)
Selected compounds purchased/screened
  – Hit-rates improved 530% over training sets
  – New structures and improved activity
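The slide does not say how the 530% figure was computed. A minimal sketch, assuming it is a relative increase over the training-set hit rate; the counts are invented for illustration and are not the study's data:

```python
def hit_rate(hits, screened):
    """Fraction of screened compounds that were active."""
    return hits / screened


def percent_improvement(new_rate, old_rate):
    """Relative increase of new_rate over old_rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100.0


def fold_change(new_rate, old_rate):
    """Ratio of the new hit rate to the old one."""
    return new_rate / old_rate


# Hypothetical counts: 10 hits per 1000 in the training screen,
# 63 hits per 1000 among the model-selected compounds.
baseline = hit_rate(10, 1000)
focused = hit_rate(63, 1000)

improvement = percent_improvement(focused, baseline)
ratio = fold_change(focused, baseline)
print(round(improvement), round(ratio, 1))
```

Under this reading a 530% improvement is a 6.3-fold hit rate. If the figure was instead meant as a ratio, it would be a 5.3-fold hit rate.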

Historical Screening Results

RP-based Screening Results

Results Comparison

Conclusions
Not all chemical descriptors equally effective
  – Whole molecule property-based less effective
  – Chemical feature-based appear more effective
Training models effectiveness
  – Averaged 28% of theory
  – Room for 4-fold improvement
Validation models effectiveness
  – Averaged 16% of theory
  – Room for 6-fold improvement
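The "room for improvement" figures appear to be the reciprocal of the fraction of theory achieved. A short check under that assumption (the name `headroom` is hypothetical):

```python
def headroom(fraction_of_theory):
    """Fold improvement still available if the theoretical maximum is 1.0."""
    return 1.0 / fraction_of_theory


training_headroom = headroom(0.28)    # training models averaged 28% of theory
validation_headroom = headroom(0.16)  # validation models averaged 16% of theory
print(round(training_headroom, 2), round(validation_headroom, 2))
```

The results, about 3.6 and 6.25, round to the 4-fold and 6-fold figures quoted on the slide.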

Acknowledgements
Dr. Linrong Yang, FMC Corporation
  – Completed the work
FMC Corporation
  – Release of the results
Prof. Peter Willett, University of Sheffield
Prof. Alex Tropsha, University of North Carolina
Prof. Doug Hawkins, University of Minnesota
DuPont Corporation