A Naive Bayesian Classifier To Assign Protein Sequences to Protein Subfamilies


David Kieffer *§, Nicolas Wicker §, Olivier Poch §
* Genclis, 15 rue du bois de la Champelle, Vandoeuvre les Nancy
§ IGBMC, Laboratoire de Bioinformatique et Génomique Intégratives, 1 rue Laurent Fries, Illkirch (France)

The development of high-throughput sequencing technologies has generated a huge quantity of genomic data. Today, a single protein family may contain hundreds of proteins, and the family can often be divided into functional subfamilies. Knowing the subfamily of a sequence can give hints about its function, phylogeny or structure. The aim of this work is to develop a naive Bayesian classifier that assigns a new protein sequence to its subfamily. Bayesian classifiers have been used to predict protein-protein interactions, structural conformation and drug resistance, and for proteome annotation on public databases (1). Here we propose a Bayesian classifier that uses a distance matrix based on percent identities. This new approach requires a strategy to convert the distances to coordinates, which involves solving a least-squares minimization problem.

Algorithm for Assigning New Sequences to Subfamilies Using a Multiple Alignment

Sequence set: an alignment of k protein subfamilies, divided into a learning set and a test set.

Learning set
1. Computation of the percent identities between all sequence pairs: s_ij = 100 × n / l, where s_ij is the percent identity between sequences i and j, n is the number of identical residues, and l is the length of the shorter of sequences i and j.
2. Conversion to distances between subfamilies: the similarity matrix between the k subfamilies is converted to a distance matrix.
3. Conversion to the subfamily coordinates: the multidimensional scaling method (2) is used to obtain the coordinates y.

Test set
1. Computation of percent identities with each subfamily: for each sequence i, its mean percent identity with each subfamily j is computed. [equation not preserved]
2. Conversion to distances between sequences and subfamilies: the similarities S'_ij are converted to distances D'_ij. [equation not preserved]
3. Conversion to coordinates: the Newton-Raphson algorithm is used to search for the x_l coordinates of each sequence i. The function to minimize is: [equation not preserved]. The subfamily coordinates serve as the starting points of Newton-Raphson, and the best solution found is kept.
4. Compute classifications: sequence i is assigned to the subfamily j that maximizes: [equation not preserved], with f_j the density function of a multivariate normal distribution.

Test case: the ARP Families

Actin-related proteins (ARPs) are very important for cytoskeleton activities (intracellular locomotion, cellular division) and nuclear functions (chromatin modulation, regulation of transcription and DNA repair). For studies of ARP families, a high-quality Multiple Alignment of Complete Sequences has been built (available on strasbg.fr/ARPAnno/ARPMACS.html). The ARPAnno web server (3), at http://bips.u-strasbg.fr/ARPAnno/, uses this alignment to classify and annotate newly sequenced actin-like proteins.

Subfamilies in the alignment: Actin, ARP1, ARP2, ARP3, ARP4, ARP5, ARP6, ARP7, ARP8, ARP9, ARP10.

[Figure: ARP alignment representation] The human actin reference sequence is in green. Amino acid insertions are in red and deletions in blue. Discriminating residues are marked with black dots, and “blocks” are shown as red boxes highlighted in yellow. This representation shows the potential importance of “blocks” of local conservation for discriminating subfamilies.

Our naive Bayesian classifier has been tested on this ARP subfamily alignment. One third of the sequences of each subfamily is randomly selected for the test set and the remaining two thirds for the learning set. More than 98% of all 273 tested sequences are classified correctly (last column of the histogram).

[Figure: histogram of the classification results]

Conclusion and Perspectives

We have shown that it is possible to predict the subfamily of a sequence from a multiple alignment of subfamilies after converting percent identities to coordinates. However, percent identity is a global parameter, whereas local parameters (insertions/deletions, specific conserved residues, etc.) are often what discriminates between subfamilies. As a consequence, the presented method should incorporate other descriptors of multiple alignments. In particular, the “blocks” of the Rascal program (4), which are locally conserved regions inside a multiple alignment, could be introduced. Another improvement could be a refinement of the optimization method through the introduction of simulated annealing, genetic algorithms, etc.

References:
1- D. Szafron, P. Lu, R. Greiner, D.S. Wishart, B. Poulin, R. Eisner, Z. Lu, J. Anvik, C. Macdonell, A. Fyshe and D. Meeuwis (2004) Proteome Analyst: custom predictions in a web-based tool for high-throughput proteome annotations. Nucleic Acids Research vol 32, W365-W371.
2- K. V. Mardia, J. T. Kent, J. M. Bibby (1980) Multivariate Analysis (Probability and Mathematical Statistics). Academic Press.
3- J. Muller, Y. Oma, L. Vallard, E. Friederich, O. Poch and B. Winsor (2005) Sequence and Comparative Genomic Analysis of Actin-related Proteins. Molecular Biology of the Cell vol 16.
4- J.D. Thompson, J.C. Thierry and O. Poch (2003) Rascal: rapid scanning and correction of multiple sequence alignments. Bioinformatics vol 19.
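The percent identity computation described in the algorithm can be illustrated with a short Python sketch. It assumes that both sequences are rows of the same multiple alignment, that gaps are written as "-", and that l is counted without gap characters. The function names and the conversion `100 - similarity` are illustrative assumptions: the poster's own similarity-to-distance formula is not preserved.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent identity 100 * n / l between two rows of one alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # n: aligned columns where both rows carry the same residue (gaps excluded)
    n = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    # l: length of the shorter sequence, counted without gap characters
    l = min(len(a) - a.count(gap), len(b) - b.count(gap))
    return 100.0 * n / l if l else 0.0


def mean_identity_to_subfamilies(seq, subfamilies):
    """Mean percent identity of `seq` with the members of each subfamily.

    `subfamilies` maps a subfamily name to a non-empty list of aligned rows.
    """
    return {
        name: sum(percent_identity(seq, m) for m in members) / len(members)
        for name, members in subfamilies.items()
    }


def to_distance(similarity: float) -> float:
    # Assumed conversion only; the poster's formula is not available.
    return 100.0 - similarity
```

For example, `percent_identity("ACDE-G", "ACDEFG")` counts five identical columns over a shorter ungapped length of five, so the two rows are scored as fully identical despite the gap.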
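The conversion of a test sequence's distances into coordinates can be sketched as a small least-squares problem. The objective below, the sum over subfamilies of (||x - y_j|| - D_j)^2, is an assumption, since the poster's function is not preserved. Plain gradient descent is swapped in for Newton-Raphson to keep the example short. What the sketch does retain from the poster is the multi-start strategy: the search is started from each subfamily's coordinates and the best solution is kept. All names are hypothetical.

```python
import math


def stress(x, anchors, dists):
    """Assumed objective: squared mismatch between embedded and target distances."""
    return sum((math.dist(x, y) - d) ** 2 for y, d in zip(anchors, dists))


def descend(start, anchors, dists, step=0.05, iters=2000):
    """Gradient descent on `stress` (a stand-in for Newton-Raphson)."""
    x = list(start)
    for _ in range(iters):
        grad = [0.0] * len(x)
        for y, d in zip(anchors, dists):
            r = math.dist(x, y)
            if r == 0.0:
                continue  # gradient undefined exactly on an anchor; skip this term
            coef = 2.0 * (r - d) / r
            for k in range(len(x)):
                grad[k] += coef * (x[k] - y[k])
        x = [xk - step * gk for xk, gk in zip(x, grad)]
    return x


def place_sequence(anchors, dists):
    """Start from every subfamily coordinate and keep the best solution."""
    candidates = [descend(a, anchors, dists) for a in anchors]
    return min(candidates, key=lambda x: stress(x, anchors, dists))
```

With three non-collinear subfamily coordinates in the plane and exact distances, the objective has a single zero, so the best of the three runs recovers the point that generated the distances.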
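The final classification step, assigning a sequence to the subfamily whose normal density scores highest, might look like the following sketch. It makes assumptions the poster does not confirm: a diagonal covariance per subfamily (the usual "naive" independence assumption), priors proportional to subfamily size in the learning set, and a variance floor `min_var` to avoid division by zero. Function names are hypothetical.

```python
import math
from collections import defaultdict


def fit(points, labels, min_var=1e-6):
    """Estimate a log prior, per-coordinate means and variances per subfamily."""
    grouped = defaultdict(list)
    for p, lab in zip(points, labels):
        grouped[lab].append(p)
    total = len(points)
    model = {}
    for lab, rows in grouped.items():
        dims = list(zip(*rows))  # one tuple of values per coordinate
        means = [sum(d) / len(d) for d in dims]
        variances = [
            max(sum((v - m) ** 2 for v in d) / len(d), min_var)
            for d, m in zip(dims, means)
        ]
        model[lab] = (math.log(len(rows) / total), means, variances)
    return model


def log_score(x, params):
    """Log prior plus log density of a normal with diagonal covariance."""
    log_prior, means, variances = params
    score = log_prior
    for v, m, var in zip(x, means, variances):
        score += -0.5 * math.log(2.0 * math.pi * var) - (v - m) ** 2 / (2.0 * var)
    return score


def classify(x, model):
    """Return the subfamily label with the highest score for coordinates x."""
    return max(model, key=lambda lab: log_score(x, model[lab]))
```

Working with log scores rather than raw densities avoids numerical underflow when the coordinates have many dimensions.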