Protein Fold Recognition as a Data Mining Coursework Project Badri Adhikari Department of Computer Science University of Missouri-Columbia
Overview
– Problem Description
– Data Description and Specification
– Data Preprocessing
– Methodology
– Discussion of Results
The Problem
Identify whether two proteins belong to the same fold [Binary Classification]
[Diagram: proteins from the protein structure database are combined into protein pairs; each protein pair is described by a set of features and classified as Same Fold? Y/N]
The Problem (analogy)
– Protein fold recognition: a protein pair is described by features and classified as Same Fold? Y/N
– Recognizing potential new customers: a customer is described by features and classified as Potential Customer? Y/N
Data Specification
– File size: 1.5 GB
– Examples count: [value missing]
– Positive (+1) labels: 7438
– Negative (-1) labels: [value missing]
– Number of features: 84
Data Description
#119l-d119l 1alo-d1alo_1 -1 1:1.62 2:1.13 3: : : : :
#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38 3: : : : :
#119l-d119l 1fcdc-d1fcdc1 -1 1:1.62 2:0.8 3: : : : :
#119l-d119l 1gal-d1gal_1 -1 1:1.62 2:3.22 3: : : : :
#119l-d119l 1gbs-d1gbs +1 1:1.62 2:1.85 3: : : : :
#119l-d119l 1iov-d2dln_2 -1 1:1.62 2:2.1 3: : : : :
#119l-d119l 1kte-d1kte -1 1:1.62 2:1.05 3: : : : :
#119l-d119l 1sly-d1sly_2 +1 1:1.62 2:1.68 3: : : : :
#119l-d119l 2baa-d2baa +1 1:1.62 2:2.43 3: : : : :
#119l-d119l 6lyt-d193l +1 1:1.62 2:1.29 3: : : : :
Data Description (reading a row)
– Example id: the protein query-target pair, e.g. #119l-d119l 1alo-d1alo_1
– Label: +1 (positive) or -1 (negative)
– Feature values for each example, given as index:value pairs, e.g. 1:1.62 2:1.13
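The row layout described above can be read with a few lines of Python. This is only a sketch: it assumes each example is a single whitespace-separated record in the order shown on the slide (query id prefixed with #, target id, label, then index:value features). The names Example and parse_example are hypothetical, and the sample row is shortened to two features.

```python
from typing import Dict, NamedTuple


class Example(NamedTuple):
    query: str                  # query protein id (without the leading '#')
    target: str                 # target protein id
    label: int                  # +1 or -1
    features: Dict[int, float]  # feature index -> feature value


def parse_example(line: str) -> Example:
    """Parse one record of the assumed form '#query target label idx:val ...'."""
    tokens = line.split()
    if len(tokens) < 3 or not tokens[0].startswith("#"):
        raise ValueError(f"unexpected record layout: {line!r}")
    label = int(tokens[2])  # int() accepts both '+1' and '-1'
    if label not in (1, -1):
        raise ValueError(f"label must be +1 or -1, got {tokens[2]!r}")
    features: Dict[int, float] = {}
    for token in tokens[3:]:
        index, sep, value = token.partition(":")
        if not sep or not index or not value:
            raise ValueError(f"malformed feature token: {token!r}")
        features[int(index)] = float(value)
    return Example(tokens[0].lstrip("#"), tokens[1], label, features)


example = parse_example("#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38")
print(example.query, example.target, example.label, example.features)
```

Keeping the query id as a separate field makes it straightforward to group rows by query later.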
Preprocessing Task 1: Group related data rows
Problem: The records are not all independent of each other.
Solution: Group records with the same query protein together, so that they all fall either in the test data set or in the training data set.
#119l-d119l 1alo-d1alo_1 -1 1:1.62 2:1.13 3: : : : :
#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38 3: : : : :
#1aab-d1aab 1cyx-d1cyx -1 1:0.83 2:1.58 3: : : :
#119l-d119l 1fcdc-d1fcdc1 -1 1:1.62 2:0.8 3: : : : :
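One way to keep all rows of a query on the same side of the split is to shuffle the query ids rather than the rows. The sketch below illustrates that idea and is not necessarily the procedure used in the project; split_by_query, test_fraction, and seed are hypothetical names, and each record is assumed to be a (query_id, row) pair.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, str]  # (query id, raw row text)


def split_by_query(
    records: Sequence[Record], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Split records so that every query id lands entirely in train or in test."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)

    # Shuffle the group keys, not the rows, so groups are never broken up.
    queries = sorted(groups)
    random.Random(seed).shuffle(queries)
    n_test = max(1, round(len(queries) * test_fraction))

    test = [r for q in queries[:n_test] for r in groups[q]]
    train = [r for q in queries[n_test:] for r in groups[q]]
    return train, test


records = [
    ("119l-d119l", "row a"),
    ("119l-d119l", "row b"),
    ("1aab-d1aab", "row c"),
    ("119l-d119l", "row d"),
]
train, test = split_by_query(records, test_fraction=0.5)
print(len(train), len(test))
```

The same grouping can be reused for cross-validation by assigning whole query groups to folds.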
Preprocessing Task 2: Balance the positive and negative data
Problem: The dataset has just 0.78% positive examples.
Balancing the number of positive and negative examples:
– Balanced examples: all positive examples plus an equal number of randomly selected negative examples
– Remaining negative examples: used only for testing
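The balancing step can be sketched as random undersampling of the negative class. The code below is an illustration under assumptions: examples are (label, payload) tuples with labels +1 and -1, and balance_examples and seed are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple

LabeledExample = Tuple[int, str]  # (label, payload); label is +1 or -1


def balance_examples(
    examples: Sequence[LabeledExample], seed: int = 0
) -> Tuple[List[LabeledExample], List[LabeledExample]]:
    """Return (balanced set, remaining negatives).

    The balanced set holds every positive example plus an equally sized
    random sample of negatives; the leftover negatives are returned
    separately so they can be used for testing only.
    """
    positives = [e for e in examples if e[0] == 1]
    negatives = [e for e in examples if e[0] == -1]

    rng = random.Random(seed)
    rng.shuffle(negatives)
    k = min(len(positives), len(negatives))

    balanced = positives + negatives[:k]
    rng.shuffle(balanced)  # avoid all positives coming first
    return balanced, negatives[k:]


data = [(1, "p1"), (1, "p2")] + [(-1, f"n{i}") for i in range(10)]
balanced, remaining = balance_examples(data)
print(len(balanced), len(remaining))
```

Fixing the seed keeps the sampled negatives reproducible between runs.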
Methodology
SVM light as the mining tool
SVM light is an implementation of Support Vector Machines (SVMs) in C.
$ svm_learn example1/train.dat example1/model
$ svm_classify example1/test.dat example1/model example1/prediction
Methodology
Process for deciding the Kernel Function
Many different kernels: linear, polynomial, radial basis function (RBF), or user defined.
Consider the RBF kernel $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$
Parameters to consider:
– -m: memory size of cache for kernel evaluations
– -g: gamma value for the RBF kernel
– -c: trade-off between training error and margin
Use cross-validation to find the best parameters C and γ.
Use the best parameters C and γ to train on the whole training set.
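To make the kernel and its parameters concrete, the sketch below evaluates the standard RBF kernel exp(-gamma * ||x - y||^2) in plain Python and assembles an svm_learn argument list. The helper names rbf_kernel and svm_learn_command are hypothetical; the flags -t 2 (RBF kernel), -g, -c, and -m are the SVM light options for kernel type, gamma, C, and kernel cache size. The command is only built here, not executed.

```python
import math
from typing import List, Optional, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def svm_learn_command(
    train_file: str,
    model_file: str,
    gamma: float,
    c: Optional[float] = None,
    cache_mb: Optional[int] = None,
) -> List[str]:
    """Build an svm_learn argument list for an RBF kernel (-t 2)."""
    command = ["svm_learn", "-t", "2", "-g", str(gamma)]
    if c is not None:
        command += ["-c", str(c)]       # omit to keep the tool's default C
    if cache_mb is not None:
        command += ["-m", str(cache_mb)]
    return command + [train_file, model_file]


print(rbf_kernel([1.62, 1.13], [1.62, 2.38], gamma=0.1))
print(" ".join(svm_learn_command("example1/train.dat", "example1/model", 0.1)))
```

The returned list can be passed to subprocess.run when svm_learn is installed and on the PATH.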
Methodology
Parameter Determination
Determining the gamma parameter:
– Ran training and testing for 100 gamma values between 0 and 1
– Found gamma = 0.15 as the best value
– Ran again to find a more precise gamma, using 120 values from 0 to 0.3
– Found the best value of gamma to be 0.1
Used default C value of 0
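The two-pass gamma search can be sketched as a coarse grid followed by a finer grid over a narrower range. This is an illustration only: the exact grid values used in the project are not given, so evenly spaced values excluding 0 are assumed, and score stands in for whatever train-and-test accuracy was measured for each gamma. The names grid, best_gamma, and two_pass_search are hypothetical.

```python
from typing import Callable, List, Tuple


def grid(low: float, high: float, count: int) -> List[float]:
    """Return `count` evenly spaced values in (low, high], excluding `low`."""
    step = (high - low) / count
    return [low + step * (i + 1) for i in range(count)]


def best_gamma(candidates: List[float], score: Callable[[float], float]) -> float:
    """Pick the candidate gamma with the highest score."""
    return max(candidates, key=score)


def two_pass_search(score: Callable[[float], float]) -> Tuple[float, float]:
    """Coarse pass over (0, 1], then a finer pass over (0, 0.3]."""
    coarse = best_gamma(grid(0.0, 1.0, 100), score)
    fine = best_gamma(grid(0.0, 0.3, 120), score)
    return coarse, fine


# Toy stand-in for the accuracy obtained with a given gamma (assumption).
def toy_score(gamma: float) -> float:
    return -(gamma - 0.1) ** 2


coarse, fine = two_pass_search(toy_score)
print(round(coarse, 4), round(fine, 4))
```

In practice score would train and evaluate a model for each gamma, which makes the number of grid points the main cost.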
Evaluation with 10-fold cross-validation
ROC curve: for different values of the threshold, the average sensitivity and specificity were computed from the values in each fold.
[Figure: ROC curve; axes: sensitivity, specificity; points correspond to threshold values]
[Table for a selected threshold (value missing); columns: Threshold, Sensitivity, Specificity, FPR, Accuracy, Precision]
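The quantities in the table can be computed from a confusion matrix at a given threshold and then averaged across folds. The sketch below assumes that an example is predicted positive when its classifier score is at or above the threshold; metrics_at_threshold and average_over_folds are hypothetical names, and the scores are made-up toy values.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Dict[str, float]:
    """Compute the table's metrics for one fold at one threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "fpr": _ratio(fp, fp + tn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": _ratio(tp, tp + fp),
    }


def average_over_folds(per_fold: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over the folds (assumes at least one fold)."""
    return {
        key: sum(fold[key] for fold in per_fold) / len(per_fold)
        for key in per_fold[0]
    }


fold_1 = metrics_at_threshold([0.9, 0.4, -0.2, -0.7], [1, -1, 1, -1], 0.0)
fold_2 = metrics_at_threshold([1.0, -1.0], [1, -1], 0.0)
print(average_over_folds([fold_1, fold_2]))
```

Repeating the computation over a range of thresholds and plotting sensitivity against FPR gives the points of an ROC curve.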
References
– Jianlin Cheng and Pierre Baldi, A machine learning information retrieval approach to protein fold recognition
– Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector Classification
– Payam Refaeilzadeh, Lei Tang, and Huan Liu, Cross-Validation
– Classroom slides
Thank you for your time Questions and comments are welcome.