Prediction of Molecular Bioactivity for Drug Design
  Experiences from the KDD Cup 2001 competition
  Sunita Sarawagi, IITB
  Joint work with:
    B. Anuradha, IITB
    Anand Janakiraman, IITB
    Jayant Haritsa, IISc
The dataset
  Provided by DuPont Pharmaceuticals
  Activity of compounds binding to thrombin
  Library of compounds included:
    1909 known molecules (42 actively binding thrombin)
    139,351 binary features describing the 3-D structure of each compound
    636 new compounds with unknown capacity to bind to thrombin
Sample data
  0,1,0,0,0,0,… …,0,0,0,0,0,0,I
  0,0,0,0,0,0,… …,0,0,0,0,0,1,I
  0,0,0,0,0,0,… …,0,0,0,0,0,0,I
  0,1,0,0,0,1,… …,0,1,0,0,0,1,A
  0,1,0,0,0,1,… …,0,1,0,0,1,1,?
  0,1,1,0,0,1,… …,0,1,1,0,0,1,?
  (Last column is the label: A = active, I = inactive, ? = unlabelled test compound)
Challenges
  Far more binary features than training instances: about 140,000 features vs. about 2000 instances
  Highly skewed classes: 1867 inactives, 42 actives
  Varying degrees of correlation among features
  Differences between the training and test distributions
Steps
  Familiarization with the data
    The data is noisy: four identical records (all 0s) carry different labels
    Far more 0s than 1s
    Number of 1s is significantly higher for actives (A) than for inactives (I)
  Feature selection
  Build classifiers
  Combine classifiers
  Incorporate unlabeled test instances
First step: feature selection
  Most commercial classifiers cannot handle 140,000 features, even with 1 GB of memory
  Entropy-based individual feature selection
    Does not handle redundant attributes
  Step-wise feature selection
    Too brittle
  Top entropy attribute with a “1” in each active compound
    Exploits the small count of actives
    Want all important groups of redundant attributes
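The entropy-based scoring of individual features mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes "entropy-based" means information gain with respect to the class label, and the names `entropy`, `information_gain`, and `rank_features` are hypothetical. Because each feature is scored alone, two identical (redundant) features receive the same score, which is the weakness the slide points out.

```python
import math

def entropy(pos, neg):
    """Binary entropy in bits of a set with `pos` actives and `neg` inactives."""
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(column, labels):
    """Entropy reduction from splitting on one binary feature (labels: 1 = active)."""
    n = len(labels)
    pos = sum(labels)
    on = sum(column)                      # records where the feature is 1
    on_pos = sum(1 for x, y in zip(column, labels) if x == 1 and y == 1)
    off, off_pos = n - on, pos - on_pos   # records where the feature is 0
    before = entropy(pos, n - pos)
    after = (on / n) * entropy(on_pos, on - on_pos) \
          + (off / n) * entropy(off_pos, off - off_pos)
    return before - after

def rank_features(rows, labels):
    """Feature indices sorted by decreasing gain (ties broken by index)."""
    gains = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        gains.append((information_gain(column, labels), j))
    return [j for gain, j in sorted(gains, key=lambda t: (-t[0], t[1]))]

# Toy data (made up): feature 0 separates the classes, feature 2 is uninformative.
rows = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]
labels = [1, 1, 0, 0]
ranking = rank_features(rows, labels)
```

On data of the real size, the same per-feature counts could be accumulated in one pass over a sparse representation rather than by materializing each column.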
Building classifiers
  Partition training data using stratified sampling
    Two-thirds training data
    One-third validation data
  Classification methods attempted
    Decision tree classifiers
    Naïve Bayes
    SVMs
    Hand-crafted clustering/nearest-neighbor hybrid
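A stratified two-thirds/one-third split can be sketched as below. The function name `stratified_split`, the fixed seed, and the rounding rule are assumptions made for illustration; the point is only that each class is split separately, so the few actives are not left out of either part by chance.

```python
import random

def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Return (train_indices, validation_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return sorted(train), sorted(validation)

# Class counts taken from the slides: 42 actives and 1867 inactives.
labels = ["A"] * 42 + ["I"] * 1867
train_idx, valid_idx = stratified_split(labels)
```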
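Decision Tree C4.5
  [Figure: C4.5 decision tree. Tests: f25144 = 1, f80106 = 1, f26913 = 1, f = 1, f = 1, f88235 = 1 (two feature numbers missing). Leaves: I (338/6), A (2), A (3), A (4), A (5), A (10)]
  [Table: confusion matrix over classes A and I; cell boundaries lost: "A 37", "I 1459"]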
Naïve Bayes
  Data characteristics very similar to text: many features, sparse data, few ones
  Naïve Bayes has been found very effective for text classification
  Accuracy: all actives misclassified!
  [Table: confusion matrix over classes A and I; cell boundaries lost: "A 010", "I 1459"]
Support vector machines
  Have received a lot of attention recently
  Require tuning: which kernel, what parameters?
  Several freely available packages: SVMTorch
  Accuracy: slightly worse than decision trees
  [Figure: plot over two features f_i and f_j]
Hand-crafted hybrid
  Find features such that the actives cluster together under an appropriate distance measure
  [Figure: scatter plot over two features f_i and f_j. Legend: training active, training inactive, test record]
Incremental Feature Selection
  Pick features one by one that result in maximum clustering of the actives and maximum separation from the inactives
  Objective function: maximum separation between the centroids of the actives and inactives
  Distance function: matching ones
  Careful selection of training actives
  Accuracy: 100%, 493 features
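The greedy, one-feature-at-a-time procedure can be sketched as below. The slides do not give the exact objective, so this sketch substitutes an assumed one: the fraction of actives that match the active centroid better than the inactive centroid, minus the same fraction for inactives. Similarity counts only shared 1s ("matching ones"). All function names and the toy data are hypothetical.

```python
def centroid(rows, features):
    """Mean value of each selected feature over the given rows."""
    return {j: sum(row[j] for row in rows) / len(rows) for j in features}

def matching_ones(row, center):
    """Similarity that rewards only positions where the row has a 1."""
    return sum(weight for j, weight in center.items() if row[j] == 1)

def separation(actives, inactives, features):
    """Assumed objective: hit rate on actives minus false-alarm rate on inactives."""
    c_act = centroid(actives, features)
    c_ina = centroid(inactives, features)

    def nearer_active(row):
        return matching_ones(row, c_act) > matching_ones(row, c_ina)

    hits = sum(nearer_active(r) for r in actives) / len(actives)
    false_alarms = sum(nearer_active(r) for r in inactives) / len(inactives)
    return hits - false_alarms

def incremental_selection(actives, inactives, max_features):
    """Add the feature that most improves the objective; stop when none does."""
    n = len(actives[0])
    selected, best = [], 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(n) if j not in selected]
        if not candidates:
            break
        # Prefer the higher score; on ties prefer the lower feature index.
        pick = max(candidates,
                   key=lambda j: (separation(actives, inactives, selected + [j]), -j))
        score = separation(actives, inactives, selected + [pick])
        if score <= best:
            break
        selected.append(pick)
        best = score
    return selected, best

# Toy data (made up): no single feature covers all actives, but features 0 and 1 do.
actives = [[1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
inactives = [[0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
selected, score = incremental_selection(actives, inactives, max_features=4)
```

Each round re-evaluates every remaining candidate, so with tens of thousands of candidates the per-feature counts would need to be cached for this to be practical.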
Final approach
  Test data: significantly denser
    Methods like SVM, NB, and clustering-based ones will not generalize
    Preferred a distribution-independent method
  Ensemble of decision trees
    On disjoint attributes (unconventional)
  Semi-supervised training
    Introduce feedback from the test data in multiple rounds
Building tree ensembles
  Initially picked ~20,000 features based on entropy
  More than one tree, to take care of the large feature space
  Build a tree, remove the features it uses, and repeat as long as accuracy on validation data does not drop
  All groups of redundant features are exploited
  [Figure: tree-building diagram; recovered label between trees: "Remove features"]
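The loop on this slide can be sketched as below: train a tree, remove the features it used, and train the next tree on what remains. Two things here are assumptions rather than details from the slides: the stopping rule compares each new tree's validation accuracy with the previous tree's, and the learner is a one-test stand-in (`train_stump`) used in place of C4.5 to keep the example self-contained.

```python
def train_stump(rows, labels, features):
    """Stand-in learner (an assumption): a one-test tree on the single best feature."""
    def errors(j):
        # Predict active (1) when feature j is 1, inactive (0) otherwise.
        return sum((row[j] == 1) != (y == 1) for row, y in zip(rows, labels))
    best = min(features, key=lambda j: (errors(j), j))
    return {"used": {best}, "predict": lambda row, j=best: int(row[j] == 1)}

def accuracy(model, rows, labels):
    return sum(model["predict"](r) == y for r, y in zip(rows, labels)) / len(labels)

def build_disjoint_ensemble(train, valid, features, learner, max_trees=10):
    """Grow trees on disjoint feature sets until validation accuracy drops."""
    (train_rows, train_labels), (valid_rows, valid_labels) = train, valid
    remaining = set(features)
    ensemble, previous = [], None
    while remaining and len(ensemble) < max_trees:
        model = learner(train_rows, train_labels, sorted(remaining))
        acc = accuracy(model, valid_rows, valid_labels)
        if previous is not None and acc < previous:
            break  # the new tree is worse than the one before it
        ensemble.append(model)
        previous = acc
        remaining -= model["used"]  # later trees must use other features
    return ensemble

# Toy data (made up): features 0 and 1 are redundant copies, 2 and 3 are noise.
train = ([[1, 1, 1, 0], [1, 1, 0, 1], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0]],
         [1, 1, 0, 0, 0])
valid = ([[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
         [1, 0, 0])
ensemble = build_disjoint_ensemble(train, valid, range(4), train_stump)
used = [sorted(model["used"]) for model in ensemble]
```

In the toy run the second tree picks up the redundant copy of the first tree's feature, which is the effect the slide describes as exploiting all groups of redundant features.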
Incorporating unlabeled instances
  Augment the training data with test instances whose predictions are sure
  Re-train another ensemble of trees using the same method
  Include more unlabelled instances with sure predictions
  Repeat a few more times...
  How to capture drift?
Capturing drift
  Solution: validate with independent data
  Be sure to include only correctly labeled data
  First approach: same prediction by all trees
    On validation data, found errors in this scheme
    Pruning is not a solution
  Weighted prediction by each tree
    Weight: fraction of actives
    Pick the right threshold using validation data
  Stop when no more unlabelled data can be added
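The weighted vote and the stopping rule can be sketched together as a self-training loop. Several details are assumed, not taken from the slides: each tree's weight for an instance is read as the fraction of training actives in the leaf the instance reaches, the scores are averaged, and the cut-offs `low` and `high` are hypothetical values that would be picked on validation data. `fit_stumps` is a stand-in for the tree ensemble.

```python
def fit_stumps(rows, labels):
    """Stand-in ensemble (an assumption): one single-test tree per feature.
    Each leaf stores the fraction of training actives that reached it."""
    stumps = []
    for j in range(len(rows[0])):
        leaf = {}
        for value in (0, 1):
            reached = [y for row, y in zip(rows, labels) if row[j] == value]
            leaf[value] = sum(reached) / len(reached) if reached else 0.5
        stumps.append((j, leaf))
    return stumps

def active_score(stumps, row):
    """Average over trees of the active fraction in the leaf the row reaches."""
    return sum(leaf[row[j]] for j, leaf in stumps) / len(stumps)

def self_train(rows, labels, unlabeled, low, high, max_rounds=5):
    """Move confidently scored instances into the training set, round by round."""
    rows, labels, pending = list(rows), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        stumps = fit_stumps(rows, labels)
        still_pending, added = [], 0
        for row in pending:
            score = active_score(stumps, row)
            if score >= high or score <= low:
                rows.append(row)
                labels.append(1 if score >= high else 0)
                added += 1
            else:
                still_pending.append(row)
        pending = still_pending
        if added == 0:
            break  # no more unlabelled data can be added
    return rows, labels, pending

# Toy data (made up); labels: 1 = active, 0 = inactive.
rows = [[1, 1], [1, 0], [0, 0], [0, 0]]
labels = [1, 1, 0, 0]
unlabeled = [[1, 1], [0, 0], [0, 1]]
new_rows, new_labels, pending = self_train(rows, labels, unlabeled, low=0.2, high=0.8)
```

Instances whose score stays between the two cut-offs are never added, which is how the loop tries to admit only labels that are likely correct.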
Final state
  Three rounds, each with about 6 trees
  Unlabelled data included: 126 actives and 311 inactives
  Remaining 200 or so instances still uncertain
  Use a meta-learner on validation data to pick the final criterion
    Sum of scores times the number of trees claiming active
  Several other last-minute hacks
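The final criterion named on this slide, the sum of scores times the number of trees claiming active, can be written out in a few lines. The slide does not say when a tree "claims" active, so the `vote_threshold` of 0.5 is an assumption, as are the function name and the example scores.

```python
def final_score(tree_scores, vote_threshold=0.5):
    """Sum of per-tree active scores times the number of trees voting active."""
    votes = sum(1 for score in tree_scores if score > vote_threshold)
    return sum(tree_scores) * votes

# Two of three trees lean active, so the summed score is doubled.
example = final_score([0.9, 0.8, 0.2])
# No tree votes active, so the criterion is zero regardless of the sum.
no_votes = final_score([0.4, 0.3, 0.2])
```

Multiplying by the vote count rewards agreement among trees: the same summed score counts for more when it is spread over several trees than when it comes from one confident tree.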
Outcome
  Winning Entry: Weighted: 68.4%, Accuracy: 70.03%
  Home Team
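The two figures reported here differ because of the class skew. The sketch below assumes "weighted" means class-balanced accuracy, that is, the mean of the per-class accuracies; the slides do not define it. The toy labels are made up.

```python
def accuracy(truth, predicted):
    """Plain fraction of correct predictions."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def weighted_accuracy(truth, predicted):
    """Mean of per-class accuracies, so each class counts equally."""
    per_class = []
    for cls in sorted(set(truth)):
        indices = [i for i, t in enumerate(truth) if t == cls]
        per_class.append(sum(predicted[i] == cls for i in indices) / len(indices))
    return sum(per_class) / len(per_class)

# A classifier that calls everything inactive looks good on plain accuracy only.
truth = ["A"] * 2 + ["I"] * 8
predicted = ["I"] * 10
plain = accuracy(truth, predicted)
balanced = weighted_accuracy(truth, predicted)
```

Under this reading, a classifier that misses every active cannot score above 50% however many inactives it gets right.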
Winner’s method
  Pre-processing: feature subset selection using mutual information (200 of 139,351 features)
  Learning Bayesian network models of different complexity (2 to 12 features)
  Choosing a model (ROC area, model complexity)
Postmortem: Was all this necessary?
  Without semi-supervised learning:
    Single decision tree: 49%
    6-tree ensemble on training data alone:
      Majority vote: 57%
      Confidence-weighted: 63%
  With unlabelled data: 64.3%
Lessons learnt
  Products: need tools that scale in the number of features
  Research problems:
    Classifiers that are not tied to distribution similarity with the training data
    A more principled way of including unlabelled instances