Christian Kramer, Peter Gedeck Novartis Institutes for Biomedical Research, Basel, Switzerland Leave-cluster-out crossvalidation is appropriate for scoring.

Presentation transcript:

Leave-cluster-out crossvalidation is appropriate for scoring functions derived on diverse protein datasets

Christian Kramer, Peter Gedeck
Novartis Institutes for Biomedical Research, Basel, Switzerland

Introduction

Empirical rescoring functions for predicting protein-ligand interaction energies can be trained on large, diverse collections of crystal structure geometries augmented with binding data, such as the PDBbind or the BindingMOAD database. A recent publication demonstrated remarkable success in predicting the free energy of interaction from atom counts within a 12 Å radius around the ligand [1]. However, the quality of prediction depends strongly on the composition of the training and validation sets. We suggest a generally applicable validation strategy that is not prone to protein-family recognition pitfalls.

Dataset & methods

Dataset: PDBbind07 [2] was used for reproducing the RF-score results. The PDBbind09 refined set was used to demonstrate leave-cluster-out crossvalidation.

Descriptors: The RFscore descriptors as published by Ballester and Mitchell [1] were used for all models. For every ligand atom type [C, N, O, F, P, S, Cl, Br, I], all protein atoms of each type [C, N, O, S] within a distance of 12 Å are counted and summed, giving 4 x 9 = 36 atom pair descriptors.

Learning algorithm: The Random Forest as implemented in R was used with default settings.

PDBbind core set & RFscore performance

The PDBbind07 core set can be predicted with RMSE = 1.58, R² = 0.59 and R = [value not preserved]. The core set was assembled by clustering the PDBbind07 database according to BLAST similarities. From each cluster with at least four members, the most active complex, the least active complex and the complex closest to the average activity were extracted. This means that for every validation set entry there is at least one entry from the same protein family in the training set.

Leave-cluster-out crossvalidation

The PDBbind09 refined set consists of 1741 complexes in 561 clusters (90% BLAST similarity).

[Figure: Composition of the PDBbind09 database (distribution of cluster population)]

For the leave-cluster-out crossvalidation we suggest the following clustering scheme: all clusters with more than nine members are kept as individual clusters (A-W); clusters with four to nine members are united (X); clusters with two or three members are united (Y); and all singletons are united (Z).

[Figure: Leave-cluster-out scheme. The complete set is split repeatedly; in split i, cluster i is left out as Validation i and all remaining clusters form Train i (shown for clusters 1 to 5).]

Multidimensional scaling of the RFscore space shows that complexes from the same protein family do indeed cluster together. A flexible learning algorithm should therefore be well able to recognize protein family membership.

[Figure: Cluster proximities (multidimensional scaling of the RFscore space)]

The range of activities within protein families is smaller than the total range of activities. To avoid predictions that benefit from protein-family recognition, we suggest leave-cluster-out crossvalidation.

Table 1: Leave-cluster-out crossvalidation results on the PDBbind09 refined set (performance for each cluster after leave-cluster-out crossvalidation). Average R² = 0.21, average RMSE = 1.60. The original columns # samples, R, R² and RMSE are listed without their numeric values, which are not preserved in this transcript.

The PDBbind09 cluster alphabet

Cluster  Biological target
A        HIV Protease
B        Trypsin
C        Carbonic Anhydrase
D        Thrombin
E        PTP1B (Protein Tyrosine Phosphatase)
F        Factor Xa
G        Urokinase
H        Different similar Transporters
I        c-AMP Dependent Kinase (PKA)
J        Beta-Glucosidase
K        Antibodies
L        Casein Kinase II
M        Ribonuclease
N        Thermolysin
O        CDK2 Kinase
P        Glutamate receptor 2
Q        P38 Kinase
R        Beta-secretase 1
S        tRNA-guanine transglycosylase
T        Endothiapepsin
U        Alpha-mannosidase 2
V        Carboxypeptidase A
W        Penicillopepsin
X        All clusters with 4-9 complexes
Y        All clusters with 2-3 complexes
Z        Singletons

Target specific scoring functions

If crystal structures with corresponding activities are available, target specific scoring functions can be generated. For the four largest clusters, we generated scoring functions within each cluster and validated them with standard out-of-bag crossvalidation.

[Table: Target specific results for HIV Protease (A), Trypsin (B), Carbonic Anhydrase (C) and Thrombin (D). Columns: # samples, R, R², RMSE, R, R², grouped under the validation-set headings "Out-of-bag within cluster" and "Cluster left out". Numeric values are not preserved in this transcript.]

Conclusion

The advent of large, diverse datasets of protein-ligand complexes allows scoring functions to be generated with a QSAR-type fitting procedure.

Global scoring functions must be validated with protein-ligand complexes from protein families that are not present in the training set. Otherwise the validation looks overoptimistic (R² = 0.59 vs. R² = 0.21).

Target specific scoring functions can be much more predictive than global scoring functions, even when trained with the same descriptors.

Acknowledgments

CK thanks the Novartis Education Office for a Presidential Postdoc Fellowship.

References

[1] Ballester, P.J. & Mitchell, J.B.O. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 26 (2010).
[2] Cheng, T., Li, X., Li, Y., Liu, Z. & Wang, R. Comparative Assessment of Scoring Functions on a Diverse Test Set. Journal of Chemical Information and Modeling 49 (2009).
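
Illustrative sketch: atom pair descriptors

The atom pair descriptors described in the methods can be illustrated with a minimal sketch in Python. This is not the authors' code (the poster used R for the learning step); the function name `atom_pair_counts`, the input format of (element, coordinates) tuples and the strict `<` comparison at the 12 Å cutoff are assumptions made for illustration only.

```python
from itertools import product
from math import dist

# Element types named on the poster: 9 ligand types x 4 protein types.
LIGAND_ELEMENTS = ("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")
PROTEIN_ELEMENTS = ("C", "N", "O", "S")


def atom_pair_counts(ligand_atoms, protein_atoms, cutoff=12.0):
    """Count protein-ligand atom pairs closer than `cutoff` (in Å).

    Both arguments are sequences of (element, (x, y, z)) tuples
    (hypothetical input format). Returns a list of 36 counts ordered
    protein element first, then ligand element. Atoms of any other
    element (e.g. hydrogen) are ignored.
    """
    protein_atoms = list(protein_atoms)  # iterated once per ligand atom
    pairs = list(product(PROTEIN_ELEMENTS, LIGAND_ELEMENTS))
    counts = dict.fromkeys(pairs, 0)
    for lig_elem, lig_xyz in ligand_atoms:
        for prot_elem, prot_xyz in protein_atoms:
            key = (prot_elem, lig_elem)
            # Assumption: "within 12 Å" is read as strictly below the cutoff.
            if key in counts and dist(lig_xyz, prot_xyz) < cutoff:
                counts[key] += 1
    return [counts[pair] for pair in pairs]
```

Each complex then yields one 36-element feature vector, which is the kind of input a Random Forest regression on binding affinities would take.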
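
Illustrative sketch: cluster alphabet and leave-cluster-out splits

The clustering scheme and the leave-cluster-out splits can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the authors' implementation: the names `cluster_alphabet` and `leave_cluster_out` are hypothetical, and assigning the letters A, B, C, ... in order of decreasing cluster size is an assumption (the poster only states that clusters with more than nine members are kept as A-W).

```python
from collections import Counter, defaultdict
from string import ascii_uppercase


def cluster_alphabet(cluster_ids):
    """Map each raw cluster id to a fold label.

    Clusters with more than nine members keep their own letter,
    clusters with 4-9 members become "X", clusters with 2-3 members
    become "Y" and singletons become "Z".
    """
    sizes = Counter(cluster_ids)
    # Assumption: large clusters are lettered by decreasing size.
    large = sorted((c for c, n in sizes.items() if n > 9),
                   key=lambda c: (-sizes[c], str(c)))
    if len(large) > 23:
        raise ValueError("more large clusters than letters A-W")
    labels = {c: ascii_uppercase[i] for i, c in enumerate(large)}
    for c, n in sizes.items():
        if n <= 9:
            labels[c] = "X" if n >= 4 else "Y" if n >= 2 else "Z"
    return labels


def leave_cluster_out(cluster_ids):
    """Yield (label, train_indices, validation_indices) for every fold.

    `cluster_ids` is a list with one raw cluster id per complex.
    """
    labels = cluster_alphabet(cluster_ids)
    folds = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        folds[labels[cluster]].append(index)
    for name in sorted(folds):
        validation = folds[name]
        train = [i for i, c in enumerate(cluster_ids) if labels[c] != name]
        yield name, train, validation
```

A model would be fitted on `train_indices` and evaluated on `validation_indices` for each fold, so that no complex from the left-out protein family contributes to training; the per-fold R² and RMSE values can then be averaged.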