Active Learning Strategies for Drug Screening

Megon Walker (1) and Simon Kasif (1,2)
(1) Bioinformatics Program, Boston University, Boston, MA
(2) Department of Biomedical Engineering, Boston University, Boston, MA

1. Introduction

At the intersection of drug discovery and experimental design, active learning algorithms guide the selection of successive compound batches for biological assays when screening a chemical library, with the goal of identifying many target-binding compounds in as few screening iterations as possible [1-3]. The active learning paradigm refers to the learner's ability to modify its sampling strategy for training data based on previously seen data. During each round of screening, the active learning algorithm selects a batch of unlabeled compounds to be tested for target binding activity and added to the training set. Once the labels for this batch are known, the model of activity is recomputed on all examples labeled so far, and a new chemical set is selected for screening (Figure 1). The drug screening pipeline proposed here combines committee-based active learning with bagging and boosting techniques and several options for sample selection. Our best strategy retrieves up to 87% of the active compounds after screening only 30% of the compounds in the datasets analyzed.

Figure 1: The Drug Discovery Cycle (compounds, descriptors, selection, screening; the descriptors are shown as a drugs-by-features binary matrix)

2. Objectives

- exploitation: optimize the number of target binding (active) drugs retrieved with each batch
- exploration: optimize the prediction accuracy of the committee during each iteration of querying

3. Methods

Datasets
- a binary feature vector for each compound indicates the presence or absence of structural fragments
- the 200 features with the highest feature-activity mutual information (MI) were selected for each dataset
- retrospective data: labels provided with the features
- labels: target binding or active (A); non-binding or inactive (I)
- 632 DuPont thrombin-targeting compounds [4] (149 A, 483 I, mean MI = 0.126)
- 1346 Abbott monoamine oxidase inhibitors [5] (221 A, 1125 I, mean MI = 0.006)

Pipeline
- 5X cross validation
- 5% batch size
- 5 classifiers in the committee (Figure 2)
- perceptron classifier data shown

Classifier committees
- bagging: samples from the labeled training data with a uniform distribution
- boosting: samples from the labeled training data with a varied sampling distribution, such that compounds misclassified by the previously obtained hypothesis are more likely to be sampled again

Sample selection strategies
- random
- uncertainty: compounds on which the committee disagrees most strongly are selected
- density with respect to actives: compounds most similar to previously labeled or predicted actives are selected (Tanimoto similarity metric)
- P(active): compounds predicted active with the highest probability by the committee are selected

Figure 2: Pipeline Flowchart
- Start
- Input data files with compound descriptors
- Designate training and testing sets for this round of cross validation
- Decision (yes/no): 1st batch of drugs whose labels are queried? The 1st batch is selected by chemist's domain knowledge; the flowchart's Sample Selection box lists random, uncertainty, density, and P(active)
- Labels for a batch from the unlabeled training set queried
- Committee of classifiers trained on sub-samples from the labeled training set drugs (classifiers: naïve Bayes, perceptron; committees: bagging, boosting)
- Unlabeled testing set & training set drugs classified by committee (weighted majority vote)
- Decision (yes/no): All training set labels queried?
- Decision (yes/no): Cross validation completed?
- Accuracy and performance statistics
- End

Figure 3: Querying for labels & training classifiers on sub-samples (panels: after 1st query, after 2nd query; each panel shows the drugs-by-features table with an A/I label column, with rows assigned to train classifier #1, train classifier #2, NOT labeled, or test, and "?" marking labels not yet queried)

4. Results

Figure 4: Hit Performance and Sensitivity (panels a-d)

5. Discussion

exploitation: number of active drugs retrieved with each batch queried
- P(active) sample selection shows the best hit performance when feature information content is higher (Figure 4a)
  - after 30% of drugs are labeled (cross validation averages):
    1. P(active) retrieves 84% of actives
    2. density retrieves 77% of actives
    3. uncertainty retrieves 65% of actives
    4. random retrieves 42% of actives
- density sample selection shows the best initial hit performance when feature information content is lower (Figure 4b)
  - classifier sensitivity is compromised
  - linear hit performance for all strategies after 20% of drugs are labeled

exploration: the prediction accuracy of the committee on the testing data set during each iteration of querying
- uncertainty sample selection shows the best testing set sensitivity
- increases in the labeled training set size during progressive rounds of querying result in no significant increase in testing set sensitivity (Figure 4c)
  - does the ratio of actives to inactives in the labeled training set bias the classifier?
  - are multiple modes of drug activity present in the datasets?

tradeoff: the sample selection methods with the best hit performance display the lowest testing set sensitivity (Figure 4c)

- bagging and boosting do not result in significantly different hit performance for any sample selection strategy on these datasets
- bagging and boosting significantly enhance the testing set sensitivity of the component learning algorithm (Figure 4d)

Future work will involve ROC and precision-recall analysis, along with comparison of various classifiers and feature descriptors.

6. References

1. N. Abe and H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging. ICML 1998, 1-9.
2. G. Forman. Incremental Machine Learning to Reduce Biochemistry Lab Costs in the Search for Drug Discovery. BIOKDD 2002, 33-36.
3. M. Warmuth, G. Ratsch, M. Mathieson, J. Liao, C. Lemmen. Active Learning in the Drug Discovery Process. NIPS 2001, 1449-1456.
4. KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
5. R. Brown and Y. Martin. Use of Structure-Activity Data To Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection. Journal of Chemical Information and Computer Sciences 1996, 36, 572-584.
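
The committee and sample selection steps described in Methods can be illustrated with a minimal sketch. This is not the authors' implementation: every function name below is hypothetical, the committee uses bagging only (boosting is omitted), the vote is unweighted rather than the weighted majority vote in the flowchart, and the density strategy compares against labeled actives only, not predicted ones.

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity between two binary feature vectors (lists of 0/1)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def train_perceptron(X, y, epochs=20):
    """Plain perceptron; y is 1 (active) or 0 (inactive). Returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            if pred != yi:
                step = yi - pred  # +1 for a missed active, -1 for a false active
                w = [wj + step * xj for wj, xj in zip(w, xi)]
                b += step
    return w, b

def predict(model, x):
    """Return 1 (active) or 0 (inactive) for one compound."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

def bagged_committee(X, y, n_members=5, rng=random):
    """Bagging: each member trains on a uniform bootstrap sample of the labeled set."""
    committee = []
    for _ in range(n_members):
        idx = [rng.randrange(len(X)) for _ in X]
        committee.append(train_perceptron([X[i] for i in idx], [y[i] for i in idx]))
    return committee

def p_active(committee, x):
    """Fraction of committee members voting 'active' (unweighted, an assumption)."""
    return sum(predict(m, x) for m in committee) / len(committee)

def select_batch(strategy, committee, X_pool, known_actives, batch_size, rng=random):
    """Return indices into X_pool of the next compounds whose labels are queried."""
    pool = list(range(len(X_pool)))
    if strategy == "random":
        return rng.sample(pool, batch_size)
    if strategy == "uncertainty":    # votes closest to an even split come first
        score = lambda i: -abs(p_active(committee, X_pool[i]) - 0.5)
    elif strategy == "p_active":     # most strongly predicted actives come first
        score = lambda i: p_active(committee, X_pool[i])
    elif strategy == "density":      # most similar to any labeled active comes first
        score = lambda i: max((tanimoto(X_pool[i], a) for a in known_actives),
                              default=0.0)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(pool, key=score, reverse=True)[:batch_size]
```

In a full querying loop, each round would retrain the committee on all compounds labeled so far, call `select_batch` with a batch size of 5% of the training set, look up the labels of the selected compounds, and move them from the unlabeled pool to the labeled set until the pool is empty.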