Prediction of Molecular Bioactivity for Drug Design: Experiences from the KDD Cup 2001 competition. Sunita Sarawagi, IITB


Prediction of Molecular Bioactivity for Drug Design
Experiences from the KDD Cup 2001 competition
Sunita Sarawagi, IITB
Joint work with B. Anuradha (IITB), Anand Janakiraman (IITB), and Jayant Haritsa (IISc)

The dataset
- Dataset provided by DuPont Pharmaceuticals
- Task: predict the activity of compounds binding to thrombin
- The compound library included:
  - 1,909 known molecules (42 of which actively bind thrombin)
  - 139,351 binary features describing the 3-D structure of each compound
  - 636 new compounds with unknown capacity to bind to thrombin

Sample data (one compound per row; the last field is the label: A = active, I = inactive, ? = unknown test compound)
0,1,0,0,0,0,…,0,0,0,0,0,0,I
0,0,0,0,0,0,…,0,0,0,0,0,1,I
0,0,0,0,0,0,…,0,0,0,0,0,0,I
0,1,0,0,0,1,…,0,1,0,0,0,1,A
0,1,0,0,0,1,…,0,1,0,0,1,1,?
0,1,1,0,0,1,…,0,1,1,0,0,1,?

Challenges
- Large number of binary features, far fewer training instances: ~140,000 features vs. ~2,000 examples!
- Highly skewed class distribution: 1,867 inactives vs. 42 actives
- Varying degrees of correlation among features
- Differences between the training and test distributions

Steps
- Familiarization with the data:
  - The data is noisy: four identical all-0 records carry different labels
  - 0s vastly outnumber 1s
  - The number of 1s is significantly higher for actives than for inactives
- Feature selection
- Build classifiers
- Combine classifiers
- Incorporate unlabeled test instances

First step: feature selection
- Most commercial classifiers cannot handle 139,351 features even with 1 GB of memory
- Entropy-based selection of individual features: does not handle redundant attributes
- Step-wise feature selection: too brittle
- Chosen approach: for each active compound, keep the top-entropy attribute with a "1" in that compound (sketched below)
  - Exploits the small count of actives
  - Retains all the important groups of redundant attributes
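
A minimal sketch of the chosen rule, assuming a dense 0/1 NumPy matrix X with labels y (1 = active); the slides do not spell out the exact entropy score, so standard information gain is assumed here:

    import numpy as np

    def information_gain(X, y):
        # Information gain of each binary feature with respect to the labels.
        def H(p):
            p = np.clip(p, 1e-12, 1 - 1e-12)
            return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
        base = H(y.mean())
        gains = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            on = X[:, j] == 1
            h = 0.0
            if on.any():
                h += on.mean() * H(y[on].mean())
            if (~on).any():
                h += (~on).mean() * H(y[~on].mean())
            gains[j] = base - h
        return gains

    def per_active_selection(X, y, gains):
        # For every active compound, keep its highest-gain feature that is 1,
        # so each active contributes at least one informative attribute.
        keep = set()
        for i in np.where(y == 1)[0]:
            ones = np.where(X[i] == 1)[0]
            if ones.size:
                keep.add(int(ones[np.argmax(gains[ones])]))
        return sorted(keep)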

Building classifiers
- Partition the training data using stratified sampling:
  - Two-thirds training data
  - One-third validation data
- Classification methods attempted:
  - Decision tree classifiers
  - Naïve Bayes
  - SVMs
  - A hand-crafted clustering / nearest-neighbor hybrid
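
Stratification preserves the 42 : 1,867 class ratio in both partitions; with scikit-learn (a sketch, reusing X and y from above):

    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=0)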

Decision Tree (C4.5)
[Tree figure: a chain of splits on binary features f25144 = 1, f80106 = 1, f26913 = 1, …, f88235 = 1, with leaves labeled I (338/6), A (2), A (3), A (4), A (5), A (10)]
Confusion matrix on validation data:
               predicted A   predicted I
  actual A          3              7
  actual I          1            459
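
C4.5 itself is a standalone program; to replay this step, an entropy-criterion scikit-learn tree is a close stand-in (a sketch, using the split above):

    from sklearn.metrics import confusion_matrix
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X_train, y_train)
    print(confusion_matrix(y_val, tree.predict(X_val)))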

Naïve Bayes
- Data characteristics are very similar to text: lots of features, sparse data, few 1s
- Naïve Bayes is known to be very effective for text classification
- Accuracy here: all actives misclassified!
Confusion matrix on validation data:
               predicted A   predicted I
  actual A          0             10
  actual I          1            459

Support vector machines
- Have received lots of attention recently
- Require tuning: which kernel, what parameters?
- Several freely available packages, e.g. SVMTorch
- Accuracy: slightly worse than the decision trees
[Figure: a separating hyperplane between the two classes in the (f_i, f_j) feature plane]
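
The slide names SVMTorch; for a quick reproduction, scikit-learn's SVC exposes the same knobs (the linear kernel and balanced class weights below are illustrative choices, not the team's):

    from sklearn.svm import SVC

    svm = SVC(kernel="linear", C=1.0, class_weight="balanced")
    svm.fit(X_train, y_train)
    print((svm.predict(X_val) == y_val).mean())  # validation accuracy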

Hand-crafted hybrid
- Find features such that the actives cluster together under an appropriate distance measure
[Figure: training actives, training inactives, and a test record plotted in the (f_i, f_j) feature plane]

Incremental Feature Selection
- Pick features one by one that yield maximum clustering of the actives and maximum separation from the inactives (a sketch follows below)
- Objective function: maximum separation between the centroids of the actives and the inactives
- Distance function: number of matching ones
- Careful selection of the training actives
- Accuracy: 100%, using 493 features
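
A sketch of the greedy loop; the slide names only the objective (centroid separation) and the distance (matching ones), so the concrete L1 centroid gap below is an assumption:

    import numpy as np

    def incremental_selection(X, y, k):
        # Greedily add the feature that most widens the gap between the
        # centroids of the actives and the inactives.
        mu_a = X[y == 1].mean(axis=0)
        mu_i = X[y == 0].mean(axis=0)
        gap = np.abs(mu_a - mu_i)        # per-feature centroid separation
        selected, remaining = [], set(range(X.shape[1]))
        for _ in range(k):
            j = max(remaining, key=lambda f: gap[f])
            selected.append(j)
            remaining.remove(j)
        return selected

Because this L1 gap is additive over features, the loop reduces to taking the top-k per-feature gaps; an objective computed on whole records (e.g., matching-ones similarity) would make the incremental order genuinely matter.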

Final approach
- Test data is significantly denser than the training data
- Methods like SVM, NB, and the clustering-based hybrid will not generalize
- Preferred a distribution-independent method:
  - Ensemble of decision trees on disjoint attribute sets (unconventional)
  - Semi-supervised training: introduce feedback from the test data in multiple rounds

Building tree ensembles
- Initially picked ~20,000 features based on entropy
- More than one tree, to take care of the large feature space
- Repeat until accuracy on the validation data drops: build a tree, remove its features from the pool, and build the next tree on the remaining features (sketched below)
- This way, all groups of redundant features are exploited
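
A sketch of the loop; the entropy-picked pool and the validation stopping rule are as stated on the slide, while the learner and the vote are illustrative:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def disjoint_tree_ensemble(X_tr, y_tr, X_val, y_val, pool):
        # Grow trees on disjoint feature subsets until validation accuracy drops.
        trees, best_acc = [], 0.0
        pool = list(pool)
        while pool:
            t = DecisionTreeClassifier(criterion="entropy").fit(X_tr[:, pool], y_tr)
            candidate = trees + [(t, list(pool))]
            votes = np.mean([c.predict(X_val[:, p]) for c, p in candidate], axis=0)
            acc = ((votes >= 0.5).astype(int) == y_val).mean()
            if acc < best_acc:           # stop once validation accuracy drops
                break
            best_acc, trees = acc, candidate
            used = {pool[i] for i in np.unique(t.tree_.feature) if i >= 0}
            if not used:                 # the tree made no splits; nothing to remove
                break
            pool = [f for f in pool if f not in used]
        return trees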

Incorporating unlabeled instances
- Augment the training data with "sure" test instances
- Re-train another ensemble of trees using the same method
- Include more unlabeled instances with sure predictions
- Repeat a few more times...
- But how do we capture drift?
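
This is a self-training loop; a skeleton below, where build_ensemble is a hypothetical wrapper around the disjoint-tree construction above and the hi/lo cutoffs come from the next slide:

    import numpy as np

    def ensemble_score(trees, X):
        # Mean P(active) across the ensemble; a stand-in for the slide's
        # confidence-weighted vote.
        return np.mean([t.predict_proba(X[:, p])[:, 1] for t, p in trees], axis=0)

    def self_train(build_ensemble, X_tr, y_tr, X_unlab, hi, lo, rounds=3):
        # Fold confidently labeled test records back into the training data.
        unl = np.ones(len(X_unlab), dtype=bool)    # records still unlabeled
        trees = build_ensemble(X_tr, y_tr)
        for _ in range(rounds):
            s = ensemble_score(trees, X_unlab[unl])
            sure = (s >= hi) | (s <= lo)
            if not sure.any():                     # nothing sure left to add
                break
            idx = np.where(unl)[0][sure]
            X_tr = np.vstack([X_tr, X_unlab[idx]])
            y_tr = np.concatenate([y_tr, (s[sure] >= hi).astype(int)])
            unl[idx] = False
            trees = build_ensemble(X_tr, y_tr)     # re-train a fresh ensemble
        return trees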

Capturing drift
- Solution: validate with independent data; be sure to include only correctly labeled data
- First approach: require the same prediction from all trees
  - On the validation data, this scheme was found to make errors; pruning was not a solution
- Better approach: weighted prediction by each tree
  - Weight: fraction of actives
  - Pick the right threshold using the validation data (sketched below)
- Stop when no more unlabeled data can be added
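
One way to realize "include only correctly labeled data" is to set the cutoffs so that the validation set itself is classified without error; a sketch, assuming ensemble_score from above:

    def pick_thresholds(trees, X_val, y_val):
        # hi: above every inactive's score, so anything scoring higher is
        # "surely active"; lo: below every active's score, symmetrically.
        s = ensemble_score(trees, X_val)
        hi = s[y_val == 0].max() + 1e-9
        lo = s[y_val == 1].min() - 1e-9
        return hi, lo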

Final state
- Three rounds, each with about 6 trees
- Unlabeled data included: 126 actives and 311 inactives; the remaining 200 stayed in confusion
- A meta-learner on the validation data picked the final criterion: sum of scores times the number of trees claiming active
- Several other last-minute hacks
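
The final criterion, read directly off the slide (the array layout below is an assumption):

    import numpy as np

    def final_score(tree_scores):
        # tree_scores: (n_trees, n_records) array of per-tree active scores.
        claiming = (tree_scores >= 0.5).sum(axis=0)  # trees claiming "active"
        return tree_scores.sum(axis=0) * claiming    # sum of scores x #claimants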

Outcome
[Results chart: the winning entry, at 68.4% weighted accuracy and 70.03% accuracy, alongside the home team's entry]

Winner's method
- Pre-processing: feature subset selection using mutual information (200 of the 139,351 features)
- Learning: Bayesian network models of different complexity (2 to 12 features)
- Model choice: ROC area and model complexity

Postmortem: was all this necessary?
- Without semi-supervised learning:
  - Single decision tree: 49%
  - 6-tree ensemble on the training data alone: majority vote 57%, confidence-weighted 63%
- With unlabeled data: 64.3%

Lessons learnt
- Products: need tools that scale in the number of features
- Research problems:
  - Classifiers that are not tied to distribution similarity with the training data
  - A more principled way of including unlabeled instances