Published by Amber Cook. Modified over 9 years ago.
1
Iowa State University, Department of Computer Science, Artificial Intelligence Research Laboratory
Research supported in part by a grant from the National Institutes of Health (GM066387).
Assessing the Performance of Macromolecular Sequence Classifiers
Cornelia Caragea (cornelia@cs.iastate.edu), Iowa State University
Joint work with Jivko Sinapov, Drena Dobbs, and Vasant Honavar
October 15, 2007
2
Background and Motivation
Machine learning methods offer some of the most cost-effective approaches to building predictive models.
One problem: multiple approaches.
Needed: a way to compare the effectiveness of different predictive classifiers.
Difficulty: different data selection and evaluation procedures.
3
Outline
Macromolecular Sequence Classification
Performance Evaluation
Window-Based Cross-Validation
Sequence-Based Cross-Validation
Experiments
Conclusions
4
Macromolecular Sequence Classification
Predict a label for each element in a given sequence.
Example: Identify post-translational modification residues.
[Figure: a protein sequence shown from the N-terminus (H3N+) to the C-terminus (COO-), with individual residues marked "Glycosylated?" and "Phosphorylated?"]
5
Macromolecular Sequence Classification
Example: Identify RNA-binding residues.
1T0K_B
SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA
0000000000000000111110010000000000000001100100000000000000000000010000000001111100000000000000000
6
Macromolecular Sequence Classification
[Diagram: All Data is split into Training Data and Test Data. The Learning System uses the Training Data to produce the Resulting Classifier. Validation on the Test Data gives the Performance on test set.]
7
Macromolecular Sequence Classification
Sliding Window Approach:
Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Class:    1111110011111110011111001011111100000001111101000000
Each window is centered on a target residue and carries that residue's class label:
...
VKKFGGEVVKAGNIL,0
KKFGGEVVKAGNILV,0
KFGGEVVKAGNILVR,1
FGGEVVKAGNILVRQ,1
...
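The sliding window approach above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name extract_windows is hypothetical, the window size of 15 is read off the example windows on the slide, and skipping residues that are too close to either end of the sequence is an assumption (the slides do not say how sequence ends are handled).

```python
from typing import List, Tuple


def extract_windows(sequence: str, labels: str, size: int = 15) -> List[Tuple[str, int]]:
    """Return (window, label) pairs, one per residue that has a full window.

    The label of a window is the class of its central (target) residue.
    Residues too close to either end are skipped (an assumption of this sketch).
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if size % 2 == 0:
        raise ValueError("window size must be odd so that there is a central residue")
    half = size // 2
    windows = []
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        windows.append((window, int(labels[center])))
    return windows


# Sequence and class string taken from the slide.
sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
labels = "1111110011111110011111001011111100000001111101000000"
windows = extract_windows(sequence, labels)

# Print the four windows shown on the slide.
for window, label in windows[8:12]:
    print(f"{window},{label}")
```

Under these assumptions, the four printed lines match the windows listed on the slide.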
9
Performance Evaluation
K-Fold Cross-Validation:
Partition the data into k disjoint subsets S_1, S_2, ..., S_{k-1}, S_k.
Learn classifier C on k-1 of the subsets.
Evaluate classifier C on the remaining subset.
Repeat k times.
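As a rough sketch of the k-fold procedure, the generator below partitions a list of items into k disjoint subsets and uses each subset once as the test set. The name k_fold_splits, the shuffling step, and the seed parameter are illustrative choices, not details taken from the slides.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int, seed: int = 0) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; each of the k disjoint subsets is the test set once."""
    if not 2 <= k <= len(items):
        raise ValueError("k must be between 2 and the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # the subsets S_1, ..., S_k
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


for train, test in k_fold_splits(list(range(10)), k=5):
    # A real experiment would learn classifier C on `train` and evaluate it on `test`.
    print(len(train), len(test))
```

What the items are (windows or whole sequences) is exactly the choice that distinguishes the two cross-validation variants discussed next.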
10
Window-Based Cross-Validation
Procedure:
1. Extract windows from all sequences in the dataset.
2. Partition the set of windows into k disjoint subsets.
3. Perform standard cross-validation.
[Diagram: k-fold cross-validation over subsets S_1, ..., S_k of windows. Learn classifier C, evaluate classifier C, repeat k times.]
11
Sequence-Based Cross-Validation
Procedure:
1. Partition the set of sequences into k disjoint subsets.
2. Extract windows from sequences in each subset.
3. Perform standard cross-validation.
[Diagram: k-fold cross-validation over subsets S_1, ..., S_k of sequences. Learn classifier C, evaluate classifier C, repeat k times.]
12
Window-Based vs. Sequence-Based Cross-Validation
Window-Based Cross-Validation: Train and test sets are likely to contain some windows that originate from the same sequence. This violates the independence assumption between train and test sets.
Sequence-Based Cross-Validation: Windows belonging to the same sequence end up in the same set.
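The difference between the two procedures can be made concrete with a small sketch. Everything here is illustrative: the toy sequences are made up, the window size of 3 is chosen only to keep the example small, and the helper names (make_windows, window_based_folds, sequence_based_folds, shared_sequences) are hypothetical. The sketch counts which sequences contribute windows to both the train and the test set under each scheme.

```python
import random
from typing import Dict, List, Set, Tuple

Window = Tuple[str, str]  # (sequence id, window text)


def make_windows(sequences: Dict[str, str], size: int = 3) -> List[Window]:
    """Extract every full window of `size` residues, remembering its source sequence."""
    return [(sid, seq[i:i + size])
            for sid, seq in sequences.items()
            for i in range(len(seq) - size + 1)]


def window_based_folds(sequences: Dict[str, str], k: int, seed: int = 0) -> List[List[Window]]:
    """Extract windows first, then partition the windows into k folds."""
    windows = make_windows(sequences)
    random.Random(seed).shuffle(windows)
    return [windows[i::k] for i in range(k)]


def sequence_based_folds(sequences: Dict[str, str], k: int, seed: int = 0) -> List[List[Window]]:
    """Partition the sequences into k folds first, then extract windows per fold."""
    ids = sorted(sequences)
    random.Random(seed).shuffle(ids)
    return [make_windows({sid: sequences[sid] for sid in ids[i::k]}) for i in range(k)]


def shared_sequences(folds: List[List[Window]], test_index: int) -> Set[str]:
    """Sequence ids that contribute windows to both the train and the test set."""
    test_ids = {sid for sid, _ in folds[test_index]}
    train_ids = {sid for j, fold in enumerate(folds) if j != test_index for sid, _ in fold}
    return train_ids & test_ids


# Hypothetical toy sequences, made up for illustration only.
toy = {"s1": "ACDEFGHIKL", "s2": "MNPQRSTVWY", "s3": "LKIHGFEDCA", "s4": "YWVTSRQPNM"}

# Each toy sequence yields 8 windows, and the window-based folds hold 11, 11, and 10
# windows, so every test fold must split at least one sequence between train and test.
print(shared_sequences(window_based_folds(toy, k=3), 0))    # non-empty
print(shared_sequences(sequence_based_folds(toy, k=3), 0))  # empty set
```

In the sequence-based variant the overlap is empty by construction, which is the property the slide describes.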
13
Machine Learning Classifiers
Support Vector Machine: 0/1 String Kernel
Example:
x = VKKFGGEVVKAGNIL
y = KKFGGEVVKAGNILV
I[x_i = y_i] = 010010010000000
Naïve Bayes: Identity Window
Example: the window VKKFGGEVVKAGNIL is represented as x = V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
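The indicator in the 0/1 string kernel example can be checked with a short sketch. The slide shows only the position-wise indicator I[x_i = y_i]; taking the kernel value to be the number of matching positions is an assumption made here for illustration, and the function names are hypothetical.

```python
from typing import List


def match_indicator(x: str, y: str) -> str:
    """Position-wise indicator I[x_i = y_i] as a string of 0s and 1s."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    return "".join("1" if a == b else "0" for a, b in zip(x, y))


def zero_one_kernel(x: str, y: str) -> int:
    """Assumed kernel value: the number of positions where x and y agree."""
    return match_indicator(x, y).count("1")


def identity_window(x: str) -> List[str]:
    """Identity window features: one attribute per residue position."""
    return list(x)


x = "VKKFGGEVVKAGNIL"
y = "KKFGGEVVKAGNILV"
print(match_indicator(x, y))          # 010010010000000, as on the slide
print(zero_one_kernel(x, y))          # 3
print(",".join(identity_window(x)))   # V,K,K,F,G,G,E,V,V,K,A,G,N,I,L
```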
14
Datasets
O-GlycBase dataset: contains experimentally verified glycosylation sites. http://www.cbs.dtu.dk/databases/OGLYCBASE/
RNA-Protein Interface dataset, RB147: consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank. http://bindr.gdcb.iastate.edu/RNABindR/
Protein-Protein Interface dataset: consists of protein-binding protein sequences.
15
Datasets
Number of positive and negative instances used in our experiments:
Dataset            Number of Sequences   Number of + Instances   Number of - Instances
O-GlycBase         216                   2168                    12147
RNA-Protein        147                   4336                    27988
Protein-Protein    42                    2350                    9204
17
Experimental Design
Questions:
How does Sequence-Based Cross-Validation compare with Window-Based Cross-Validation?
How do the results vary when we vary the size of the dataset?
18
Results
[Figure: Receiver Operating Characteristic (ROC) curves for window-based and sequence-based 10-fold cross-validation using SVM on the O-GlycBase dataset]
19
Results
[Figure: AUC and CC for a) O-GlycBase, b) RNA-Protein Interface, c) Protein-Protein Interface]
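For readers unfamiliar with the two measures reported in the figure, here is a minimal sketch of how they can be computed. It assumes that AUC is the area under the ROC curve and that CC is the Matthews correlation coefficient; the slides do not define CC, so that reading is an assumption. The function names are hypothetical, and the pairwise AUC computation is quadratic in the number of instances, which is fine for illustration but slow for large datasets.

```python
import math
from typing import Sequence


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve: the fraction of (positive, negative) pairs in
    which the positive instance gets the higher score (ties count as 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def correlation_coefficient(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient (assumed meaning of CC), from 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels and scores, for illustration only.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(correlation_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
```

To compare the two cross-validation variants, both measures would be computed on the held-out predictions of each scheme.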
21
Conclusions
We compared two variants of k-fold cross-validation: window-based and sequence-based.
The comparison shows that Window-Based CV overestimates the performance of the classifiers relative to Sequence-Based CV.
Sequence-Based CV provides more realistic estimates of performance, because predictors trained on labeled sequence data have to predict the labels for residues in a novel sequence.
22
Jivko Sinapov, Drena Dobbs, Vasant Honavar