Feature selection and transduction for prediction of molecular bioactivity for drug design Reporter: Yu Lun Kuo (D95922037)


Feature selection and transduction for prediction of molecular bioactivity for drug design Reporter: Yu Lun Kuo (D ) Date: April 17, 2008 Bioinformatics Vol. 19 no (Pages )

Abstract
Drug discovery
– Identify characteristics that separate active (binding) compounds from inactive ones
Two methods for prediction of bioactivity
– Feature selection method
– Transductive method
Combining both gives an improvement over using only one of the techniques

Introduction (1/4)
Discovery of a new drug
– Testing many small molecules for their ability to bind to the target site
– The task is to determine what separates the active (binding) compounds from the inactive ones

Introduction (2/4)
Design new compounds
– That not only bind
– But also possess certain other properties required for a drug
Determining what separates active from inactive compounds can be seen, in a machine learning context, as a feature selection task

Introduction (3/4)
Challenging
– Few positive examples
  Little information is given indicating positive correlation between features and the labels
– Large number of features
  A subset must be selected from a huge collection of potentially useful features
  Some features are in reality uncorrelated with the labels
– Different distributions
  Cannot expect the data to come from a fixed distribution

Introduction (4/4)
Many conventional machine learning algorithms are ill-equipped to deal with these challenges
Many algorithms generalize poorly
– Because of the high dimensionality of the problem
– At this problem size, many methods are no longer computationally feasible
– Most cannot deal with training and testing data coming from different distributions

Overcome
Feature selection criterion
– Called the unbalanced correlation score
  Takes into account the unbalanced nature of the data
  Simple enough to avoid overfitting
Classifier
– Takes into account the different distributions in the test data compared to the training data
  Induction
  Transduction

Overcome
Induction
– Builds a model based only on the distribution of the training data
Transduction
– Also takes into account the test data inputs
Combining these two techniques, we obtained improved prediction accuracy

KDD Cup Competition (1/2)
We focused on a well-studied data set
– KDD Cup 2001 competition
Knowledge Discovery and Data Mining (KDD)
– One of the premier meetings of the data mining community

KDD Cup Competition (2/2)
KDD Cup 2006
– Data mining for medical diagnosis, specifically identifying pulmonary embolisms from three-dimensional computed tomography data
KDD Cup 2004
– Featured tasks in particle physics and bioinformatics, evaluated on a variety of different measures
KDD Cup 2002
– Focus: bioinformatics and text mining
KDD Cup 2001
– Focus: bioinformatics and drug discovery

KDD Cup 2001 (1/2)
Objective
– Prediction of molecular bioactivity for drug design: binding to thrombin
Data
– Training: 1909 cases (42 positive), 139,351 binary features
– Test: 634 cases

KDD Cup 2001 (2/2)
Challenge
– Highly imbalanced, high-dimensional, different distributions
Approach
– Bayesian network predictive model
– Data PreProcessor system
– BN PowerPredictor system
– BN PowerConstructor system

Data Set (1/3)
Provided by DuPont Pharmaceuticals
– Drug binds to a target site on thrombin, a key receptor in blood clotting
Each example is a fixed-length vector of 139,351 binary features in {0, 1}
– These describe three-dimensional properties of the molecule

Data Set (2/3)
Positive examples are labeled +1
Negative examples are labeled -1
In the training set
– 1909 examples, 42 of which bind (rather unbalanced: 2.2% positive)
In the test set
– 634 additional compounds

Data Set (3/3)
An important characteristic of the data
– Very few of the feature entries are non-zero (0.68% of the 1,909 × 139,351 training matrix)

System Assessment
Performance is evaluated according to a weighted accuracy criterion
– The score of an estimate y' of the labels y
– Complete success is a score of 1
Multiply this score by 100 to get the percentage weighted success rate
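The formula for the weighted accuracy did not survive in the transcript. A minimal sketch follows, assuming the criterion is the average of the fraction of positives predicted correctly and the fraction of negatives predicted correctly (the usual "balanced accuracy"); the function name `weighted_accuracy` and the toy labels are illustrative only.

```python
def weighted_accuracy(y_true, y_pred):
    """Average of the per-class success rates for labels in {+1, -1}.

    Assumed form of the criterion: 0.5 * (TP / #positives + TN / #negatives).
    """
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    true_positive_rate = sum(1 for p in pos if p == 1) / len(pos)
    true_negative_rate = sum(1 for p in neg if p == -1) / len(neg)
    return 0.5 * (true_positive_rate + true_negative_rate)


# Toy labels (hypothetical): 2 positives, 4 negatives.
y_true = [1, 1, -1, -1, -1, -1]
all_negative = [-1] * len(y_true)

# Predicting "inactive" for everything gets every negative right and every
# positive wrong, so the percentage weighted success rate is 50.0.
print(100 * weighted_accuracy(y_true, all_negative))
```

Under this assumed definition, a classifier that ignores the rare positive class cannot score above 50%, which is why plain accuracy is not used on such unbalanced data.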

Methodology
Predict the labels on the test set by using a machine learning algorithm
The positively and negatively labeled training examples are split randomly into n groups
– For n-fold cross validation, such that as close to 1/n of the positively labeled examples as possible are present in each group
Called balanced cross validation
– Needed because there are so few positive examples

Methodology
The method is
– Trained on n-1 of the groups
– Tested on the remaining group
– Repeated n times (a different group for testing each time)
– Final score: mean of the n scores
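The balanced cross validation procedure can be sketched as below. This is an illustration under assumptions, not the presenters' code: the names `balanced_folds`, `cross_validate` and the callback `train_and_score` are hypothetical, and n = 3 is chosen only so the example divides evenly.

```python
import random


def balanced_folds(labels, n, seed=0):
    """Split example indices into n groups with positives spread evenly."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [[] for _ in range(n)]
    # Deal each class round-robin so every group gets about 1/n of it.
    for k, i in enumerate(pos):
        folds[k % n].append(i)
    for k, i in enumerate(neg):
        folds[k % n].append(i)
    return folds


def cross_validate(labels, n, train_and_score):
    """Train on n-1 groups, test on the remaining one, average the n scores."""
    folds = balanced_folds(labels, n)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / n


# Same class counts as the training set: 42 positives out of 1909 examples.
labels = [1] * 42 + [-1] * 1867
folds = balanced_folds(labels, 3)
print([sum(1 for i in fold if labels[i] == 1) for fold in folds])  # [14, 14, 14]
```

Splitting each class separately is what keeps a fold from ending up with almost no positives, which a purely random split of 1909 examples could easily do.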

Feature Selection (1/2)
Called the unbalanced correlation score
– f_j: the score of feature j
– X: the training data as a matrix, where columns are features and rows are examples
Take λ very large (λ ≥ 3) in order to select features which have non-zero entries only for positive examples
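The score's formula is missing from the transcript. The sketch below assumes the form f_j = Σ_{i: y_i=+1} X_ij − λ Σ_{i: y_i=−1} X_ij, which matches the definitions of f_j, X and λ on this slide; the function names and the toy matrix are illustrative.

```python
def unbalanced_correlation_scores(X, y, lam=3.0):
    """Assumed form: f_j = sum of X[i][j] over positives
    minus lam * sum of X[i][j] over negatives."""
    scores = [0.0] * len(X[0])
    for row, label in zip(X, y):
        weight = 1.0 if label == 1 else -lam
        for j, value in enumerate(row):
            scores[j] += weight * value
    return scores


def top_features(scores, d):
    """Indices of the d highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:d]


# Toy data (hypothetical): rows are examples, columns are binary features.
X = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1],
     [0, 0, 1]]
y = [1, 1, -1, -1]

scores = unbalanced_correlation_scores(X, y, lam=3.0)
print(scores)                   # [2.0, -2.0, -6.0]
print(top_features(scores, 1))  # [0]
```

Feature 1 is on for one positive and one negative example, yet it scores below zero: with λ = 3 a single negative hit outweighs up to three positive hits, which is how the score favors features that fire only on active compounds.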

Feature Selection (2/2)
This score is an attempt to encode prior information that
– The data is unbalanced
– There is a large number of features
– Only positive correlations are likely to be useful

Justification
Justify the unbalanced correlation score using methods of information theory
– Entropy: higher → less regular
– P_i: the probability of appearance of event i
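The entropy formula itself is not in the transcript. As a reminder of the standard definition, H = −Σ_i P_i log P_i, here is a small sketch; the base-2 logarithm is an assumption, since the slide does not state the base.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; events with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))  # 1.0  (two equally likely events: least regular)
print(entropy([1.0]))       # 0.0  (a certain event: fully regular)
```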

Entropy
The probability of random appearance of a feature with an unbalanced score of N = N_p − N_n
– N_p = number of one entries associated with +1
– N_n = number of one entries associated with -1
– T_p = total number of positive labels in the training set
– T_n = total number of negative labels in the training set

Entropy
Need to compute the probability that a certain N might occur randomly
Finally, compute the entropy for each feature

Entropy and unbalanced score
The entropy and the unbalanced score will not produce the same feature ranking
– Because the unbalanced correlation score will not select negatively correlated features
In this particular problem
– They reach a similar ranking of the features
  Due to the unbalanced nature of the data

Entropy and unbalanced score
The first 6 features for both scores
– 5 out of 6 are the same
– For 16 features, 12 coincide
– Pay more attention to positive correlations

Multivariate unbalanced correlation
The feature selection algorithm described so far is univariate
– Reduces the chance of overfitting
– If the relationship between the inputs and targets is too complex, this assumption may be too restrictive
We extend our criterion to assign a rank to a subset of features
– Rather than just a single feature

Multivariate unbalanced correlation
By computing the logical OR of the subset of features S (as they are binary)
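One way to read this slide is that the subset S is collapsed into a single binary feature (the OR of its members) and that feature is scored with the same unbalanced correlation score. The sketch below assumes exactly that; `subset_score` and the toy data are hypothetical.

```python
def subset_score(X, y, subset, lam=3.0):
    """Unbalanced correlation score of the logical OR of the features in `subset`.

    Assumed form: +1 for each positive example where any feature in the subset
    is on, -lam for each negative example where any feature in the subset is on.
    """
    score = 0.0
    for row, label in zip(X, y):
        fired = any(row[j] for j in subset)  # logical OR over the subset
        if fired:
            score += 1.0 if label == 1 else -lam
    return score


# Toy data (hypothetical): each positive example is covered by a different feature.
X = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
y = [1, 1, -1, -1]

print(subset_score(X, y, [0]))     # 1.0
print(subset_score(X, y, [0, 1]))  # 2.0  (the pair covers both positives)
print(subset_score(X, y, [0, 2]))  # -2.0 (feature 2 fires on a negative)
```

The pair [0, 1] scores higher than either feature alone, which is the kind of complementary coverage a univariate ranking cannot see.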

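Fisher Score
– μ(+): the mean of the feature values for positive examples
– μ(-): the mean of the feature values for negative examples
– σ(+): the standard deviation of the feature values for positive examples
– σ(-): the standard deviation of the feature values for negative examples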
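The Fisher score formula is missing from the transcript. The sketch below uses the common form f_j = (μ_j(+) − μ_j(−))² / (σ_j(+)² + σ_j(−)²), which is an assumption about what the slide showed. The small constant `eps` is also an addition, needed here because sparse binary features often have zero variance within a class.

```python
def fisher_scores(X, y, eps=1e-12):
    """Assumed form: (mean_pos - mean_neg)^2 / (var_pos + var_neg), per feature."""
    def mean(values):
        return sum(values) / len(values)

    def variance(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        numerator = (mean(pos) - mean(neg)) ** 2
        denominator = variance(pos) + variance(neg) + eps  # eps avoids division by zero
        scores.append(numerator / denominator)
    return scores


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[1, 1],
     [1, 0],
     [0, 1],
     [0, 0]]
y = [1, 1, -1, -1]

scores = fisher_scores(X, y)
print(scores[0] > scores[1])  # True
```

Unlike the unbalanced correlation score, this criterion is symmetric in the two classes: a feature that fires only on negatives scores as highly as one that fires only on positives.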

In each case, the algorithms are evaluated for different numbers of features d
– The range d = 1, …, 40
Choose a small number of features in order to keep the decision function interpretable
It is anticipated that a large number of features are noisy and should not be selected

Classification algorithms (Inductive)
The task may not simply be to identify relevant characteristics via feature selection
– But also to provide a prediction system
Simplest of classifiers
– We call this a logical OR classifier
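The classifier's definition is not spelled out in the transcript. A natural reading, assumed here, is that an example is predicted active (+1) when at least one of the selected binary features is on, and inactive (-1) otherwise. The name `logical_or_predict` and the toy data are illustrative.

```python
def logical_or_predict(X, selected):
    """Predict +1 if any selected feature is non-zero, otherwise -1."""
    return [1 if any(row[j] for j in selected) else -1 for row in X]


# Toy test examples (hypothetical), with features 0 and 1 selected beforehand.
X_test = [[1, 0, 0],
          [0, 1, 1],
          [0, 0, 1],
          [0, 0, 0]]

print(logical_or_predict(X_test, selected=[0, 1]))  # [1, 1, -1, -1]
```

The decision function has no trained parameters beyond the choice of features, so with a small d it stays directly readable as "active if any of these properties is present".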

Comparison Techniques
We compared a number of rather more sophisticated classifiers
– Support vector machines (SVM)
– SVM*
  Makes a search over all possible values of the threshold parameter in the linear model after training
– K-nearest neighbors (K-NN)
– K-NN* (parameter γ)
– C4.5 (decision tree learner)
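The threshold search used by SVM* can be sketched as follows. The slide does not say which data the candidate thresholds are scored on or which score is used, so this sketch takes the real-valued outputs of an already trained linear model plus labels as arguments and assumes the weighted accuracy is the selection criterion. `best_threshold` and the toy values are hypothetical.

```python
def best_threshold(decision_values, labels):
    """Return the threshold b maximizing weighted accuracy of 'value > b => +1'."""
    def score(b):
        pos = [v for v, t in zip(decision_values, labels) if t == 1]
        neg = [v for v, t in zip(decision_values, labels) if t == -1]
        true_positive_rate = sum(1 for v in pos if v > b) / len(pos)
        true_negative_rate = sum(1 for v in neg if v <= b) / len(neg)
        return 0.5 * (true_positive_rate + true_negative_rate)

    values = sorted(set(decision_values))
    # Every distinct labeling is reached by one threshold below all values,
    # one between each pair of neighbors, and one above all values.
    candidates = [values[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(values, values[1:])]
    candidates += [values[-1] + 1.0]
    best = max(candidates, key=score)
    return best, score(best)


# Toy outputs (hypothetical): with the default threshold 0 everything is
# predicted negative, although the positives do rank above the negatives.
decision_values = [-2.0, -1.5, -1.2, -0.4]
labels = [-1, -1, 1, 1]

threshold, accuracy = best_threshold(decision_values, labels)
print(accuracy)  # 1.0
```

On unbalanced data a linear model can rank examples well and still place its default threshold so that almost nothing is predicted positive; searching the threshold separates these two effects.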

Transductive Inference
Induction: one is given labeled data from which one builds a general model
– Then applies this model to classify previously unseen (test) data
Transduction: takes into account not only the given (labeled) training set but also the unlabeled data
– That one wishes to classify

Transductive Inference
Different models can be built
– When trying to classify different test sets
– Even if the training set is the same in all cases
It is this characteristic which helps to solve problem 3
– The data we are given has different distributions in the training and test sets

Transductive Inference
Transduction is not useful in all tasks
– In drug discovery in particular we believe it is useful
Developers often have access to huge databases of compounds
– Compounds are often generated using virtual combinatorial chemistry
– Compound descriptors can be computed even though the compounds have not been synthesized yet

Transductive Inference
Drug discovery is an iterative process
– The machine learning method is there to help choose the next test set
– It is one step in a two-step candidate selection procedure
  After a candidate test set has been produced
  Its result is the final test set

Transductive algorithm

Results (with unbalanced correlation score)
C4.5 gave only a 50% success rate in all cases
The transductive algorithm consistently selects more relevant features than the inductive one
The Fisher score

Further Results
We also tested some more sophisticated multivariate feature selection methods
– Not as good as using the unbalanced correlation score
Using non-linear SVMs
– Did not improve results (50% success)
SVMs as a base classifier for our transduction
– Improvement over using SVMs alone

Further Results
Also tried training the classifiers with larger numbers of features
– Inductive methods failed to learn anything after 200 features
– Transductive methods
  Exhibit generalization behavior up to 1000 features (TRANS-Orcub: 58% success with d=1000, 77% with d=200)
– KDD champion
  Success rate 68.4% (7% of entrants scored higher than 60%)

Thanks for your attention