Active Learning Strategies for Drug Screening

1. Introduction

At the intersection of drug discovery and experimental design, active learning algorithms guide the selection of successive compound batches for biological assays when screening a chemical library, with the goal of identifying as many target-binding compounds as possible in a minimal number of screening iterations [1-3]. The active learning paradigm refers to the ability of the learner to modify its sampling strategy, i.e. which data are chosen for training, based on previously seen data. During each round of screening, the active learning algorithm selects a batch of unlabeled compounds to be tested for target-binding activity and added to the training set. Once the labels for this batch are known, the model of activity is recomputed on all examples labeled so far, and a new chemical set for screening is selected (Figure 1). The drug screening pipeline proposed here combines committee-based active learning with bagging and boosting techniques and several options for sample selection. Our best strategy retrieves up to 87% of the active compounds after screening only 30% of the chemical datasets analyzed.

[Figure 2: Pipeline Flowchart. Input data files with compound descriptors; training and testing sets are designated for each round of cross validation. The 1st batch of drugs is selected by the chemist's domain knowledge; thereafter, labels for each batch from the unlabeled training set are queried by a committee of classifiers (naïve Bayes or perceptron; bagging or boosting) trained on sub-samples of the labeled training set. Unlabeled testing and training set drugs are classified by the committee by weighted majority vote. The loop repeats until all training set labels have been queried and cross validation is complete, after which accuracy and performance statistics are reported.]
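The query loop described above can be sketched in a few lines of Python. Everything below is an illustrative stand-in, not the authors' code: the toy fingerprint library, the synthetic "assay", the perceptron committee, and the batch size are all assumptions chosen to make the sketch self-contained.

```python
# Sketch of committee-based batch active learning with bagging and
# P(active) sample selection. Toy data throughout; not the poster's pipeline.
import random

random.seed(0)

def train_perceptron(examples, epochs=20):
    """Train a simple perceptron on (features, label) pairs; labels are +1/-1."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def vote(model, x):
    """A committee member's +1 (active) / -1 (inactive) vote on compound x."""
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy chemical library: 8-bit structural fingerprints; in this toy assay a
# compound is "active" exactly when its first bit is set.
library = [tuple(random.randint(0, 1) for _ in range(8)) for _ in range(200)]
assay = {x: (1 if x[0] == 1 else -1) for x in library}

labeled = random.sample(library, 10)          # 1st batch: chemist's choice
unlabeled = [x for x in library if x not in labeled]
BATCH, COMMITTEE = 10, 5

for _ in range(5):                            # five rounds of screening
    # Bagging: each committee member trains on a bootstrap sub-sample
    # drawn uniformly with replacement from the labeled pool.
    committee = [
        train_perceptron([(x, assay[x])
                          for x in random.choices(labeled, k=len(labeled))])
        for _ in range(COMMITTEE)
    ]
    # P(active) selection: query the compounds the committee most
    # confidently predicts active, then reveal their labels.
    unlabeled.sort(key=lambda x: -sum(vote(m, x) for m in committee))
    labeled, unlabeled = labeled + unlabeled[:BATCH], unlabeled[BATCH:]

hits = sum(1 for x in labeled if assay[x] == 1)
print(f"actives retrieved after screening {len(labeled)} compounds: {hits}")
```

Swapping the sort key changes the strategy: committee disagreement gives uncertainty sampling, and a random shuffle gives the random baseline.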
[Figure 3: Querying for labels and training classifiers on sub-samples.]

Megon Walker 1 and Simon Kasif 1,2
1 Bioinformatics Program, Boston University, Boston, MA
2 Department of Biomedical Engineering, Boston University, Boston, MA

2. Objectives

exploitation: maximize the number of target-binding (active) drugs retrieved with each batch
exploration: maximize the prediction accuracy of the committee during each iteration of querying

4. Results

[Figure 4: Hit Performance and Sensitivity.]

6. References

1. N. Abe and H. Mamitsuka. Query Learning Strategies Using Boosting and Bagging. ICML 1998.
2. G. Forman. Incremental Machine Learning to Reduce Biochemistry Lab Costs in the Search for Drug Discovery. BIOKDD 2002.
3. M. Warmuth, G. Ratsch, M. Mathieson, J. Liao, C. Lemmen. Active Learning in the Drug Discovery Process. NIPS 2001.
4. KDD Cup 2001.
5. R. Brown and Y. Martin. Use of Structure-Activity Data To Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection. Journal of Chemical Information and Computer Science.
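These two objectives correspond to two simple quantities: the fraction of all actives recovered so far (exploitation) and the committee's sensitivity on the held-out testing set (exploration). A minimal sketch, with function names that are ours rather than the poster's:

```python
# Exploitation vs. exploration as two scalar metrics (illustrative naming).

def hit_performance(screened_labels, total_actives):
    """Exploitation: fraction of all actives in the library retrieved so far."""
    return sum(1 for y in screened_labels if y == "A") / total_actives

def sensitivity(predictions, truths):
    """Exploration: true-positive rate of the committee on the testing set."""
    tp = sum(1 for p, t in zip(predictions, truths) if p == t == "A")
    actives = sum(1 for t in truths if t == "A")
    return tp / actives if actives else 0.0

screened = ["A", "I", "A", "A", "I"]                 # labels revealed so far
print(hit_performance(screened, total_actives=10))   # 3 of 10 actives -> 0.3
print(sensitivity(["A", "A", "I", "I"],
                  ["A", "I", "A", "I"]))             # 1 of 2 actives -> 0.5
```

Hit performance is what Figure 4a/4b plot as a function of the fraction of the library screened; sensitivity is what Figures 4c/4d track across querying rounds.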
3. Methods

Datasets
- A binary feature vector for each compound indicates the presence or absence of structural fragments.
- The 200 features with the highest feature-activity mutual information (MI) were selected for each dataset.
- Retrospective data: labels are provided with the features. Labels: target binding, or active (A); non-binding, or inactive (I).
- 632 DuPont thrombin-targeting compounds [4] (149 A, 483 I, mean MI = 0.126).
- 1346 Abbott monoamine oxidase inhibitors [5] (221 A, 1125 I, mean MI = 0.006).

Pipeline
- 5X cross validation, 5% batch size, 5 classifiers in the committee (Figure 2). Perceptron classifier data shown.

Classifier committees
- bagging: samples from the labeled training data with a uniform distribution
- boosting: samples from the labeled training data with a varied sampling distribution, such that compounds misclassified by the previously obtained hypothesis are more likely to be sampled again

Sample selection strategies
- random
- uncertainty: compounds on which the committee disagrees most strongly are selected
- density with respect to actives: compounds most similar to previously labeled or predicted actives are selected (Tanimoto similarity metric)
- P(active): compounds predicted active with the highest probability by the committee are selected

5. Discussion

Exploitation (number of active drugs retrieved with each batch queried):
- P(active) sample selection shows the best hit performance when feature information content is higher (Figure 4a). After 30% of drugs are labeled (cross-validation averages):
  1. P(active) retrieves 84% of actives
  2. density retrieves 77% of actives
  3. uncertainty retrieves 65% of actives
  4. random retrieves 42% of actives
- The density sample selection strategy shows the best initial hit performance when feature information content is lower (Figure 4b), although classifier sensitivity is compromised; hit performance is linear for all strategies after 20% of drugs are labeled.

Exploration (prediction accuracy of the committee on the testing data set during each iteration of querying):
- Uncertainty sample selection shows the best testing set sensitivity.
- Increases in the labeled training set size during progressive rounds of querying result in no significant increase in testing set sensitivity (Figure 4c). Does the actives:inactives ratio of the labeled training set bias the classifier? Are multiple modes of drug activity present in the datasets?
- Tradeoff: the sample selection methods with the best hit performance display the lowest testing set sensitivity (Figure 4c).
- Bagging and boosting do not produce significantly different hit performance for any sample selection strategy on these datasets, but both significantly enhance the testing set sensitivity of the component learning algorithm (Figure 4d).

Future work will involve ROC and precision-recall analysis, along with comparison of various classifiers and feature descriptors.

[Figure 1: The Drug Discovery Cycle. Compounds and their descriptors feed feature selection, which feeds screening; screening results feed back into compound selection.]

[Figure 3 (detail): after the 1st and 2nd queries, classifiers #1 and #2 are each trained on different sub-samples of the compounds labeled A/I so far; compounds not yet labeled remain as unlabeled training or testing examples.]
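The three informed sample selection scores from the Methods section can each be written as a one-line function over the committee's votes or the compound's fingerprint. A hedged sketch, with our own naming; the committee is represented only by its per-member 0/1 "active" votes, and fingerprints are bit tuples:

```python
# Scoring functions for the uncertainty, P(active), and density
# selection strategies (illustrative implementations, not the poster's code).

def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints (bit tuples)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def uncertainty_score(votes):
    """Committee disagreement: highest when the vote splits evenly."""
    active = sum(votes)
    return min(active, len(votes) - active)

def p_active_score(votes):
    """Committee estimate of P(active): the fraction voting active."""
    return sum(votes) / len(votes)

def density_score(fp, known_actives):
    """Similarity to previously labeled or predicted actives."""
    return max((tanimoto(fp, a) for a in known_actives), default=0.0)

votes = [1, 1, 0, 1, 0]                       # 5-member committee, 3 vote active
print(uncertainty_score(votes))               # -> 2 (near-even split)
print(p_active_score(votes))                  # -> 0.6
print(tanimoto((1, 1, 0, 0), (1, 0, 0, 0)))  # 1 shared bit of 2 set -> 0.5
```

Each round, the batch is simply the top-scoring unlabeled compounds under the chosen function, which is what makes the four strategies directly interchangeable in the pipeline.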