Published by Russell Haynes. Modified over 9 years ago.
Assessing the Performance of Macromolecular Sequence Classifiers

Cornelia Caragea, Jivko Sinapov, Michael Terribilini, Drena Dobbs and Vasant Honavar

Artificial Intelligence Research Laboratory; Bioinformatics and Computational Biology Program; Computational Intelligence, Learning, and Discovery Program; Department of Computer Science

Introduction

Machine learning offers some of the most cost-effective approaches to building predictive models (e.g., classifiers) in a broad range of applications in computational biology, e.g., given an amino acid sequence, identifying the amino acid residues that are likely to bind to RNA. Comparing the effectiveness of different algorithms requires reliable procedures for accurately assessing the performance (e.g., accuracy, sensitivity, and specificity) of the resulting predictive classifiers.

Evaluating the performance of classifiers

K-Fold Cross-Validation: the data are split into k subsets S1, S2, …, Sk-1, Sk. A classifier C is learned on k-1 of the subsets and evaluated on the remaining one; this is repeated k times.

Window-based Cross-Validation: the training and test data typically correspond to disjoint sets of sequence windows. Similar or identical instances are removed from the dataset to avoid overestimation of performance measures.

Drawbacks:
Eliminating similar or identical sequence windows from the dataset perturbs the “natural” distribution of the data extracted from the original sequence dataset. Ideally, the performance of the classifier must be estimated using the “natural” data distribution.
Train and test sets are likely to contain some instances that originate from the same sequence. This violates the independence assumption between train and test sets.

Sequence-based Cross-Validation: the training and test data typically correspond to disjoint sets of sequences. All instances belonging to the same sequence end up in the same set, preserving the natural distribution of the original sequence dataset.

Datasets

O-GlycBase dataset: contains experimentally verified glycosylation sites compiled from protein databases and literature. (http://www.cbs.dtu.dk/databases/OGLYCBASE/)

RNA-Protein Interface dataset, RP147: consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank. (http://bindr.gdcb.iastate.edu/RNABindR/)

Protein-Protein Interface dataset: consists of protein-binding protein sequences.

Table 1. Number of positive (+) and negative (-) instances used in our experiments for O-GlycBase, RNA-Protein, and Protein-Protein Interface datasets.

Dataset            Number of Sequences   Number of + Instances   Number of - Instances
O-GlycBase         216                   2168                    12147
RNA-Protein        147                   4336                    27988
Protein-Protein    42                    2350                    9204

Feature Extraction

Local window of length 2n+1: x = x_{-n} x_{-n+1} … x_{-1} x_0 x_1 … x_{n-1} x_n, with each target residue x_0 in the middle and its n neighbor residues, x_i, i = -n,…,n, i ≠ 0, on each side as input to the classifier. Here x_i ∈ ∑, i = -n,…,n, and x ∈ ∑*, where ∑ represents the 20 amino acid alphabet.

For the glycosylation dataset: a local window is extracted for each S/T glycosylation or non-glycosylation site, x_0 ∈ {S,T}.

For RNA-Protein and Protein-Protein Interface datasets: a local window is extracted for every residue in a protein sequence, x_0 ∈ ∑, using the “sliding window” approach.

Machine Learning Classifiers

Support Vector Machine: 0/1 String Kernel
Naïve Bayes: Identity windows

Results

Fig 1. Comparison of Area Under the ROC Curve (AUC) (upper plots) and Matthews Correlation Coefficient (lower plots) between window-based and sequence-based cross-validation with varying dataset size. a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface.

Conclusion

We compared two variants of k-fold cross-validation: window-based and sequence-based. The results suggest that window-based cross-validation can yield overly optimistic estimates of classifier performance relative to the estimates obtained using sequence-based cross-validation. Because predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence, we believe that sequence-based cross-validation provides more realistic estimates of performance than window-based cross-validation.

Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs.
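The difference between the two cross-validation variants can be illustrated with a small sketch. It is not the authors' code: the function names, the toy sequences and labels, the "-" padding at sequence ends, and the round-robin fold assignment are all assumptions made for illustration, and the removal of similar or identical windows is left out. The sketch extracts a window of length 2n+1 around every residue, builds folds both ways, and reports which sequences contribute windows to both the training and the test side.

```python
import random
from collections import defaultdict


def extract_windows(seq_id, sequence, labels, n, pad="-"):
    """Return one (seq_id, window, label) instance per residue.

    Each window has length 2n+1 with the target residue in the middle.
    Padding the sequence ends with `pad` is an assumption of this sketch.
    """
    padded = pad * n + sequence + pad * n
    return [(seq_id, padded[i:i + 2 * n + 1], labels[i])
            for i in range(len(sequence))]


def window_based_folds(instances, k, seed=0):
    """Assign individual windows to k folds, ignoring their source sequence."""
    order = list(range(len(instances)))
    random.Random(seed).shuffle(order)
    folds = [[] for _ in range(k)]
    for pos, idx in enumerate(order):
        folds[pos % k].append(instances[idx])
    return folds


def sequence_based_folds(instances, k, seed=0):
    """Assign whole sequences to k folds; windows of a sequence stay together."""
    by_seq = defaultdict(list)
    for inst in instances:
        by_seq[inst[0]].append(inst)
    seq_ids = sorted(by_seq)
    random.Random(seed).shuffle(seq_ids)
    folds = [[] for _ in range(k)]
    for pos, sid in enumerate(seq_ids):
        folds[pos % k].extend(by_seq[sid])
    return folds


def shared_sequences(folds, test_index):
    """Sequence ids present in both the test fold and the training folds."""
    test_ids = {inst[0] for inst in folds[test_index]}
    train_ids = {inst[0]
                 for j, fold in enumerate(folds) if j != test_index
                 for inst in fold}
    return test_ids & train_ids


# Hypothetical toy data: sequence id -> (residues, per-residue 0/1 labels).
toy = {
    "seqA": ("MKTSSTLA", "00011000"),
    "seqB": ("GSTPLKRV", "01100000"),
    "seqC": ("AATSKLMN", "00110000"),
    "seqD": ("TTSGHKLP", "11000000"),
}
instances = []
for sid, (residues, labels) in toy.items():
    instances.extend(extract_windows(sid, residues, labels, n=2))

for name, make_folds in [("window-based", window_based_folds),
                         ("sequence-based", sequence_based_folds)]:
    folds = make_folds(instances, k=4)
    overlap = shared_sequences(folds, test_index=0)
    print(name, "sequences shared by train and test:", sorted(overlap))
```

With sequence-based folds the overlap is empty by construction, which is the independence property the poster argues for; with window-based folds the same sequence can supply windows to both sides.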