Assessing the Performance of Macromolecular Sequence Classifiers

Cornelia Caragea, Jivko Sinapov, Michael Terribilini, Drena Dobbs and Vasant Honavar
Artificial Intelligence Research Laboratory, Bioinformatics and Computational Biology Program, Computational Intelligence, Learning, and Discovery Program, Department of Computer Science, Iowa State University

Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM ) to Vasant Honavar and Drena Dobbs.

Introduction

Machine learning offers some of the most cost-effective approaches to building predictive models (e.g., classifiers) for a broad range of applications in computational biology, for example, identifying, given an amino acid sequence, the residues that are likely to bind RNA. Comparing the effectiveness of different algorithms requires reliable procedures for accurately assessing the performance (e.g., accuracy, sensitivity, and specificity) of the resulting classifiers.

Feature Extraction

- Local window of length 2n+1: x = x_{-n} x_{-n+1} ... x_{-1} x_0 x_1 ... x_{n-1} x_n, with the target residue x_0 in the middle and its n neighboring residues x_i (i = -n, ..., n, i ≠ 0) on each side given as input to the classifier. Each x_i ∈ Σ and x ∈ Σ*, where Σ is the 20-letter amino acid alphabet.
- For the glycosylation dataset, a local window is extracted for each S/T glycosylation or non-glycosylation site (x_0 ∈ {S, T}).
- For the RNA-Protein and Protein-Protein Interface datasets, a local window is extracted for every residue in a protein sequence (x_0 ∈ Σ) using the "sliding window" approach.
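The window construction described above can be made concrete with a short sketch. This is illustrative Python, not code from the poster: the helper name extract_windows, the padding character "X" for residues near the sequence ends, and the default half-width n are assumptions introduced here.

```python
from typing import Iterator, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter amino acid alphabet (Sigma)


def extract_windows(sequence: str,
                    labels: List[int],
                    n: int = 10,
                    targets: str = AMINO_ACIDS,
                    pad: str = "X") -> Iterator[Tuple[str, int]]:
    """Yield (window, label) pairs of length 2n+1 centred on each target residue.

    labels[i] is the class of residue i (1 = positive, 0 = negative).
    Windows that extend past either end of the sequence are padded with
    `pad` so that every window has the same length.
    """
    padded = pad * n + sequence + pad * n
    for i, residue in enumerate(sequence):
        if residue not in targets:          # e.g. restrict targets to S/T for O-GlycBase
            continue
        window = padded[i:i + 2 * n + 1]    # x_{-n} ... x_0 ... x_n
        yield window, labels[i]


# Sliding-window approach: a window around every residue (interface datasets)
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
labs = [0] * len(seq)
all_windows = list(extract_windows(seq, labs, n=3))

# Windows around S/T residues only, as for the glycosylation dataset
st_windows = list(extract_windows(seq, labs, n=3, targets="ST"))
```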
Machine Learning Classifiers

- Naïve Bayes: identity windows
- Support Vector Machine: 0/1 string kernel

Datasets

- O-GlycBase dataset: contains experimentally verified glycosylation sites compiled from protein databases and the literature.
- RNA-Protein Interface dataset (RP147): consists of RNA-binding protein sequences extracted from structures of known RNA-protein complexes solved by X-ray crystallography in the Protein Data Bank.
- Protein-Protein Interface dataset: consists of protein-binding protein sequences.

Table 1. Number of sequences and number of positive (+) and negative (-) instances used in our experiments for the O-GlycBase, RNA-Protein, and Protein-Protein Interface datasets.

Evaluating the Performance of Classifiers

- K-fold cross-validation: the data are partitioned into k subsets S_1, ..., S_k; in each of k rounds, a classifier C is learned on k-1 of the subsets and evaluated on the held-out subset.
- Window-based cross-validation: the training and test data correspond to disjoint sets of sequence windows. Similar or identical instances are removed from the dataset to avoid overestimating the performance measures. Drawbacks:
  - Eliminating similar or identical sequence windows perturbs the "natural" distribution of the data extracted from the original sequence dataset; ideally, the performance of the classifier should be estimated using the natural data distribution.
  - The training and test sets are likely to contain instances that originate from the same sequence, which violates the assumption that the training and test sets are independent.
- Sequence-based cross-validation: the training and test data correspond to disjoint sets of sequences. All instances belonging to the same sequence end up in the same set, preserving the natural distribution of the original sequence dataset (see the sketch following the Conclusion).

Results

Fig 1. Comparison of the Area Under the ROC Curve (AUC, upper plots) and the Matthews Correlation Coefficient (lower plots) between window-based and sequence-based cross-validation with varying dataset size: (a) O-GlycBase, (b) RNA-Protein Interface, (c) Protein-Protein Interface.

Conclusion

- We compared two variants of k-fold cross-validation: window-based and sequence-based.
- The results suggest that window-based cross-validation can yield overly optimistic estimates of classifier performance relative to the estimates obtained using sequence-based cross-validation.
- Because predictors trained on labeled sequence data must predict the labels of residues in novel sequences, we believe that the estimates obtained using sequence-based cross-validation provide more realistic estimates of performance than those obtained using window-based cross-validation.
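To make the contrast between the two protocols concrete, here is a minimal sketch assuming scikit-learn and NumPy of how windows extracted as above could be evaluated under both schemes. The one-hot "identity" encoding and the Bernoulli Naïve Bayes classifier stand in for the poster's Naïve Bayes on identity windows; the SVM with a 0/1 string kernel is not reproduced, and the helper names (one_hot, cross_validate) are illustrative.

```python
# A minimal sketch (not the poster's code), assuming scikit-learn and NumPy.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import roc_auc_score, matthews_corrcoef

ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"  # 20 amino acids plus a padding symbol


def one_hot(windows):
    """Encode each fixed-length window as a flat 0/1 vector (identity encoding)."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    X = np.zeros((len(windows), len(windows[0]) * len(ALPHABET)))
    for r, window in enumerate(windows):
        for p, aa in enumerate(window):
            X[r, p * len(ALPHABET) + index[aa]] = 1.0
    return X


def cross_validate(windows, labels, seq_ids, k=5, sequence_based=True):
    """Return per-fold (AUC, MCC) under window-based or sequence-based k-fold CV.

    Assumes every fold contains both positive and negative instances.
    """
    X = one_hot(windows)
    y = np.asarray(labels)
    groups = np.asarray(seq_ids)
    if sequence_based:
        # All windows from a given sequence end up in the same fold.
        splits = GroupKFold(n_splits=k).split(X, y, groups)
    else:
        # Windows are shuffled freely, so one sequence can span train and test.
        splits = KFold(n_splits=k, shuffle=True, random_state=0).split(X)
    scores = []
    for train, test in splits:
        clf = BernoulliNB().fit(X[train], y[train])
        prob = clf.predict_proba(X[test])[:, 1]
        scores.append((roc_auc_score(y[test], prob),
                       matthews_corrcoef(y[test], (prob >= 0.5).astype(int))))
    return scores
```

The only difference between the two schemes is the splitter: GroupKFold keeps all windows from a sequence in the same fold (sequence-based cross-validation), while plain KFold lets windows from the same sequence fall into both the training and the test folds (window-based cross-validation).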