…ask more of your data

Bayesian Learning
Build a model that estimates the likelihood that a given data sample belongs to a "good" subset of a larger set of samples (classification learning)
SciTegic uses modified Naïve Bayesian statistics:
– Efficient: scales linearly with large data sets
– Robust: works for a few as well as many ‘good’ examples
– Unsupervised: no tuning parameters needed
– Multimodal: can model broad classes of compounds; multiple modes of action are represented in a single model

Learn Good from Bad
“Learn Good from Bad” examines what distinguishes “good” from “baseline” compounds:
– Molecular properties (molecular weight, AlogP, etc.)
– Molecular fingerprints
[Figure: “Baseline” compounds compared with “Good” compounds]

Learning: “Learn Good From Bad”
User provides a name for the new component and a “Test for good”, e.g.:
– Activity > 0.5
– Conclusion EQ ‘CA’
User specifies properties:
– Typical: fingerprints, AlogP, donors/acceptors, number of rotatable bonds, etc.
The model is a new component
The component calculates a number:
– The larger the number, the more likely a sample is “good”
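A “test for good” is simply a yes/no rule applied to each record. The two example tests above can be sketched in Python as follows, assuming each sample is a dictionary whose `Activity` and `Conclusion` keys mirror the property names on the slide. The dictionary layout, the function names, and the sample values are illustrative assumptions, not the product's expression syntax.

```python
# Hypothetical record layout: one dict per sample (made-up values).
def activity_test(record, threshold=0.5):
    """Mirror of the example test 'Activity > 0.5'."""
    return record["Activity"] > threshold

def conclusion_test(record, wanted="CA"):
    """Mirror of the example test "Conclusion EQ 'CA'"."""
    return record["Conclusion"] == wanted

samples = [
    {"Activity": 0.9, "Conclusion": "CA"},
    {"Activity": 0.2, "Conclusion": "CI"},
]

# One True/False label per sample; True means the sample counts as "good".
labels = [activity_test(s) for s in samples]
```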

Using the model
The model can be used to prioritize samples for screening, or to search vendor libraries for new candidates for testing
The quality of the model can be evaluated:
– Split data into training and test sets
– Build model using training set
– Sort test set using model value
– Plot how rapidly hits are found in sorted list
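The split-and-sort part of this evaluation can be sketched in a few lines of Python. Model building itself is out of scope here: the `value` field is a stand-in for whatever number a trained model would produce, and the record layout and helper names are assumptions made for illustration.

```python
import random

def split_train_test(records, test_fraction=0.5, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def hits_in_top(test_records, score, is_good, top_n):
    """Sort the test set by model value (best first) and count hits in the top_n."""
    ranked = sorted(test_records, key=score, reverse=True)
    return sum(1 for r in ranked[:top_n] if is_good(r))

# Made-up data: 'value' stands in for the model output, 'good' for the known result.
records = [{"value": i / 10, "good": i >= 8} for i in range(10)]
train, test = split_train_test(records)
found = hits_in_top(
    test,
    score=lambda r: r["value"],
    is_good=lambda r: r["good"],
    top_n=2,
)
```

Counting hits for increasing values of `top_n` gives the "how rapidly hits are found" curve.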

Using a Learned Model
Model appears on your tab in LearnedProperties:
– Drag it into a protocol to use it “by value”
– Refer to it by name to use it “by reference”

Fingerprints

ECFP: Extended Connectivity Fingerprints
New class of fingerprints for molecular characterization:
– Each bit represents the presence of a structural (not substructural) feature
– 4 billion different bits
– Multiple levels of abstraction contained in a single FP
– Different starting atom codes lead to different fingerprints (ECFP, FCFP, ...)
– Typical molecule generates 100s to 1000s of bits
– Typical library generates 100K to 10M different bits

Advantages
Fast to calculate
Represents a much larger number of features
Features are not "pre-selected"
Represents tertiary/quaternary information
– As opposed to path-based fingerprints
Bits can be “interpreted”

FCFP: Initial Atom Codes

ECFP: Generating the Fingerprint
Iteration is repeated the desired number of times
– Each iteration extends the diameter by two bonds
Codes from all iterations are collected
Duplicate bits may be removed

ECFP: Extending the Initial Atom Codes
Fingerprint bits indicate the presence or absence of certain structural features
Fingerprints do not depend on a predefined set of substructural features
Each iteration adds bits that represent larger and larger structures
[Figure: example structures at Iteration 0, Iteration 1, and Iteration 2]
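The iterative scheme can be illustrated with a toy Python sketch. This is not SciTegic's algorithm: the atom codes here are plain atomic numbers, bond orders are ignored, and CRC32 is an assumed hash chosen only because it yields 32-bit values (about 4 billion possible bits). It shows just the core idea, in which each round folds every atom's neighbor codes into a new code, so each code describes a neighborhood one bond larger in radius.

```python
import zlib

def toy_ecfp(atom_codes, neighbors, iterations=2):
    """Toy extended-connectivity fingerprint (illustrative only).

    atom_codes: initial integer code per atom (iteration 0).
    neighbors:  for each atom, the list of indices of its bonded atoms.
    Returns the set of codes collected over all iterations.
    """
    current = list(atom_codes)
    features = set(current)                      # iteration 0 codes
    for _ in range(iterations):
        updated = []
        for atom, code in enumerate(current):
            # Sort neighbor codes so the result does not depend on atom order.
            environment = sorted(current[n] for n in neighbors[atom])
            payload = repr((code, environment)).encode("ascii")
            updated.append(zlib.crc32(payload))  # unsigned 32-bit value
        current = updated
        features.update(current)                 # the set drops duplicate codes
    return features

# Hypothetical three-atom chain C-C-O, coded by atomic number.
codes = [6, 6, 8]
bonds = [[1], [0, 2], [1]]
fingerprint = toy_ecfp(codes, bonds, iterations=2)
```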

The Statistics Table: Features
A feature is a binary attribute of a data record
– For molecules, it may be derived from a property range or a fingerprint bit
A molecule typically contains a few hundred features
A count of each feature is kept:
– Over all the samples
– Over all samples that pass the test for good
The Normalized Probability is log(Laplacian-corrected probability)
The normalized probabilities are summed over all features to give the relative score
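The two counts kept per feature can be sketched as follows, assuming each sample arrives as a pair of (set of features, passed-the-test flag). The function name and data layout are assumptions for illustration.

```python
from collections import Counter

def build_statistics_table(samples):
    """Count each feature over all samples and over the good samples.

    samples: iterable of (features, is_good) pairs, where features is an
    iterable of hashable feature identifiers.
    Returns (n_total, n_good, seen, seen_good).
    """
    n_total = 0
    n_good = 0
    seen = Counter()        # feature -> number of samples containing it
    seen_good = Counter()   # feature -> number of good samples containing it
    for features, is_good in samples:
        n_total += 1
        n_good += int(is_good)
        for feature in set(features):   # features are binary: count once per sample
            seen[feature] += 1
            if is_good:
                seen_good[feature] += 1
    return n_total, n_good, seen, seen_good

# Made-up samples with three features "a", "b", "c".
table = build_statistics_table([
    ({"a", "b"}, True),
    ({"a"}, False),
    ({"b", "c"}, True),
])
```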

Normalized Probability
Given a set of N samples
Given that some subset A of them are good (‘active’)
– Then we estimate for a new compound: P(good) ~ A / N
Given a set of binary features F_i
– For a given feature F:
  It appears in N_F samples
  It appears in A_F good samples
– Can we estimate: P(good | F) ~ A_F / N_F
  (Problem: the error gets worse as N_F becomes small)

Quiz Time
Have an HTS screen with 1% actives
Have two new samples X and Y to test
For each sample, we are given the results from one feature (F_X and F_Y)
Which one is most likely to be active?

Question 1
Sample X: A_Fx = 0, N_Fx = 100
Sample Y: A_Fy = 100, N_Fy = 100

Question 2
Sample X: A_Fx = 0, N_Fx = 100
Sample Y: A_Fy = 1, N_Fy = 100

Question 3
Sample X: A_Fx = 0, N_Fx = 100
Sample Y: A_Fy = 0, N_Fy = 0

Question 4
Sample X: A_Fx = 2, N_Fx = 100
Sample Y: A_Fy = 0, N_Fy = 0

Question 5
Sample X: A_Fx = 2, N_Fx = 4
Sample Y: A_Fy = 200, N_Fy = 400

Question 6
Sample X: A_Fx = 0, N_Fx = 100
Sample Y: A_Fy = 0, N_Fy = 1,000,000
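Running the naive estimate A_F / N_F over the six questions shows where it falls short. The sketch below is only an illustration; the function name and the `QUIZ` table are assumptions that restate the numbers from the questions above.

```python
def naive_estimate(a_f, n_f):
    """Return A_F / N_F, or None when the feature was never seen (N_F = 0)."""
    if n_f == 0:
        return None
    return a_f / n_f

# (A_F, N_F) for sample X and sample Y in each question.
QUIZ = {
    1: ((0, 100), (100, 100)),
    2: ((0, 100), (1, 100)),
    3: ((0, 100), (0, 0)),
    4: ((2, 100), (0, 0)),
    5: ((2, 4), (200, 400)),
    6: ((0, 100), (0, 1_000_000)),
}

for question, (x, y) in QUIZ.items():
    print(question, naive_estimate(*x), naive_estimate(*y))
```

The naive estimate is undefined for the unseen features in Questions 3 and 4, and it cannot separate the samples in Questions 5 and 6 even though the amount of evidence behind them differs greatly.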

Normalized Probability
Thought experiment:
– What is P(good | F) for a feature that we have seen in NO samples? (i.e., a novel feature)
– Hint: assume most features have no connection to the reason for “goodness”…

Normalized Probability
Thought experiment:
– What is P(good | F) for a feature that we have seen in NO samples? (i.e., a novel feature)
– The best guess would be P(good)
Conclusion:
– Want an estimator for which P(good | F) → P(good) as N_F becomes small
– Add some “virtual” samples (with probability P(good)) to every bin

Normalized Probability
Our new estimate (after adding K virtual samples):
P’(good | F) = (A_F + P(good)·K) / (N_F + K)
– P’(good | F) → P(good) as N_F → 0
– P’(good | F) → A_F / N_F as N_F becomes large
(If K = 1/P(good), this is the Laplacian correction)
K is the duplication factor in our data
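The corrected estimate and its two limits can be checked numerically. The sketch below assumes the 1% hit rate from the quiz and defaults K to 1/P(good), the Laplacian-correction case named above; the function and variable names are illustrative.

```python
def corrected_estimate(a_f, n_f, p_good, k=None):
    """P'(good | F) = (A_F + P(good) * K) / (N_F + K)."""
    if k is None:
        k = 1.0 / p_good          # Laplacian-correction choice of K
    return (a_f + p_good * k) / (n_f + k)

P_GOOD = 0.01                      # 1% actives, as in the quiz (K defaults to 100)

novel = corrected_estimate(0, 0, P_GOOD)            # unseen feature
few = corrected_estimate(2, 4, P_GOOD)              # 2 actives out of 4
many = corrected_estimate(200, 400, P_GOOD)         # 200 actives out of 400
huge = corrected_estimate(5000, 1_000_000, P_GOOD)  # large N_F: close to A_F / N_F
```

With few observations the estimate stays close to P(good); with many it approaches A_F / N_F, so 200 out of 400 counts for far more than 2 out of 4.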

Normalized Probability
Final issue: how do I combine multiple features?
– Assumption: the number of features doesn’t matter
– Want to limit the contribution from random features
P’’’(good | F) = ((A_F + P(good)·K) / (N_F + K)) / P(good)
P_final = P’’’(good | F_1) * P’’’(good | F_2) * …
Phew!
(The good news: for most real-world data, the default value of K is quite satisfactory…)
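Taking logarithms turns the product P_final into a sum, which matches the earlier statement that normalized probabilities are summed into a relative score. A minimal sketch, assuming K defaults to 1/P(good) and that each feature is given as an (A_F, N_F) pair; the names are illustrative.

```python
import math

def feature_weight(a_f, n_f, p_good, k):
    """log of P'''(good | F): the corrected estimate divided by P(good)."""
    corrected = (a_f + p_good * k) / (n_f + k)
    return math.log(corrected / p_good)

def relative_score(feature_counts, p_good, k=None):
    """Sum the per-feature log weights; this equals log(P_final)."""
    if k is None:
        k = 1.0 / p_good
    return sum(feature_weight(a, n, p_good, k) for a, n in feature_counts)

p_good = 0.01
novel_only = relative_score([(0, 0)], p_good)            # unseen feature
enriched = relative_score([(20, 100), (0, 0)], p_good)   # feature common among actives
depleted = relative_score([(0, 100)], p_good)            # feature never seen in actives
```

Dividing by P(good) is what limits the contribution of uninformative features: a feature with no evidence has a ratio of 1 and therefore adds 0 to the score, so the number of such features does not matter.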

Validation of the Model

Generating Enrichment Plots
“If I prioritized my testing using this model, how well would I do?”
Graph shows % actives (“good”) found vs % tested
Use it on a test dataset:
– That was not part of the training data
– That you already have results for

Modeling Known Activity Classes from the World Drug Index
Training set: 25,000 randomly selected compounds from WDI
Test set: 25,000 remaining compounds from WDI + 25,000 compounds from Maybridge
Descriptors: fingerprints, AlogP, molecular properties
Build models for each activity class: progestogen, estrogen, etc.
[Figure: WDI (50K) split into a 25K training set and a 25K test set; Maybridge (25K) added to the test set]

Enrichment Plots
Apply activity model to compounds in test set
Order compounds from ‘best’ to ‘worst’
Plot cumulative distribution of known actives
Do this for each activity class
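The points of such a plot can be computed directly from model scores and known results. A minimal sketch, assuming one score and one known active/inactive flag per test compound; the function name and the four-compound example are made up for illustration.

```python
def enrichment_curve(scores, is_active):
    """Return (% tested, % actives found) points for a ranked test set."""
    # Order compounds from 'best' (highest score) to 'worst'.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_active = sum(1 for a in is_active if a)
    if total_active == 0:
        raise ValueError("need at least one known active")
    points = [(0.0, 0.0)]
    found = 0
    for rank, i in enumerate(order, start=1):
        if is_active[i]:
            found += 1
        points.append((100.0 * rank / len(scores), 100.0 * found / total_active))
    return points

curve = enrichment_curve([0.9, 0.1, 0.8, 0.3], [True, False, False, True])
```

A curve that rises well above the diagonal (% found equal to % tested) means the model finds actives faster than random testing would.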

Enrichment Plot for High Actives

Choosing a Cutoff Value
Models are relative predictors
– Suggest which to test first
– Not a classifier (threshold independent)
To make it a classifier, need to choose a cutoff
– Balance between sensitivity (true positive rate) and specificity (1 - false positive rate)
– Requires human judgment
Two useful views
– Histogram plots
– ROC (Receiver Operating Characteristic) plots
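The trade-off can be made concrete by computing both rates at several candidate cutoffs. The sketch below assumes a compound is predicted active when its score is at or above the cutoff, and that the test set contains at least one active and one inactive; the names and example values are illustrative.

```python
def sensitivity_specificity(scores, is_active, cutoff):
    """Return (sensitivity, specificity) when 'score >= cutoff' means predicted active."""
    tp = fn = tn = fp = 0
    for score, active in zip(scores, is_active):
        predicted = score >= cutoff
        if active and predicted:
            tp += 1
        elif active:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # 1 - false positive rate
    return sensitivity, specificity

scores = [0.9, 0.8, 0.3, 0.1]
actives = [True, False, True, False]
loose = sensitivity_specificity(scores, actives, cutoff=0.2)
strict = sensitivity_specificity(scores, actives, cutoff=0.85)
```

Lowering the cutoff catches more actives at the price of more false positives; raising it does the opposite. Which balance is right depends on the cost of each kind of error, which is why human judgment is required.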

Choosing a Cutoff Value: Histograms
A histogram can visually show how well a model separates actives from nonactives

Choosing a Cutoff Value: ROC Plots
Adopted from clinical medicine (originally from signal detection theory)
Shows the balance between the cost of missing a true positive and the cost of falsely accepting a negative
Area under the curve is a measure of quality:
– 0.90-1.00 = excellent (A)
– 0.80-0.90 = good (B)
– 0.70-0.80 = fair (C)
– 0.60-0.70 = poor (D)
– 0.50-0.60 = fail (F)
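The area under the ROC curve equals the probability that a randomly chosen active scores higher than a randomly chosen inactive, which gives a direct way to compute it. The sketch below uses that pairwise definition (ties count as half) and maps the result onto the grades above. Treating each lower bound as inclusive, and anything below 0.50 as a fail, are assumptions not stated on the slide.

```python
def roc_auc(scores, is_active):
    """Area under the ROC curve via pairwise comparison of actives and inactives."""
    pos = [s for s, a in zip(scores, is_active) if a]
    neg = [s for s, a in zip(scores, is_active) if not a]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))

def grade(auc):
    """Map an AUC value onto the quality bands listed above."""
    if auc >= 0.90:
        return "excellent (A)"
    if auc >= 0.80:
        return "good (B)"
    if auc >= 0.70:
        return "fair (C)"
    if auc >= 0.60:
        return "poor (D)"
    return "fail (F)"            # assumption: values below 0.50 also fail

auc = roc_auc([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```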

ROC Plot for MAO

Postscript: non-FP Descriptors
AlogP
– An estimate of the log of the octanol/water partition coefficient
– High value means molecule "prefers" to be in octanol rather than water, i.e., is nonpolar
– A real number
Molecular Weight
– Total mass of all of the atoms making up the molecule
– Units are atomic mass units (a.m.u.), in which the mass of each proton or neutron is approximately 1
– A positive real number
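As a small worked example of the molecular weight definition, the sketch below sums rounded standard atomic weights over the atoms of a molecule. The four-element table and the atom-count input format are assumptions for illustration; a real toolkit would derive the counts from the structure.

```python
# Rounded standard atomic weights for a few common elements (a.m.u.).
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

def molecular_weight(atom_counts):
    """Sum the atomic weights of all atoms; atom_counts maps element -> count."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in atom_counts.items())

ethanol = {"C": 2, "H": 6, "O": 1}      # C2H6O
mw = molecular_weight(ethanol)          # about 46.07 a.m.u.
```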

Postscript: non-FP Descriptors
Num H Acceptors, Num H Donors
– Molecules may link to each other via hydrogen bonds
– H-bonds are weaker than true chemical bonds
– H-bonds play a role in drug activity
– H donors are polar atoms such as N and O with an attached H (can "donate" a hydrogen to form an H-bond)
– H acceptors are polar atoms lacking an attached H (can "accept" a hydrogen to form an H-bond)
– Num H Acceptors, Num H Donors are counts of atoms meeting the above criteria
– Non-negative integers

Postscript: non-FP Descriptors
Num Rotatable Bonds
– Certain bonds between atoms are rigid:
  Bonds within rings
  Double and triple bonds
– Others are rotatable:
  Attached parts of the molecule can freely pivot around the bond
– Num Rotatable Bonds is the count of rotatable bonds in the molecule
– A non-negative integer

