MGM workshop. 19 Oct 2010 Functional annotation Datasources Konstantinos Mavrommatis

Presentation transcript:


Let's get started…
Information from databases is used to predict the function of a protein (functional annotation):
- Product name
- Enzyme Commission (EC) number
- Domain architecture
- …

But what is function?
Example: cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
- Molecular/enzymatic (methyltransferase)
  - Reaction (methylation)
  - Substrate (cobalt-precorrin-4)
  - Ligand (S-adenosyl-L-methionine)
- Metabolic (cobalamin biosynthesis)
- Physiological (maintenance of healthy nerve and red blood cells, through B12)

Functional annotation
- Predict the biochemistry and physiology of an organism based on its genome sequence.
- Explain known biochemical and physiological properties.

Homologs/Orthologs/Paralogs

Function prediction
- Function transfer by homology
- Homology
  - implies a common evolutionary origin,
  - not retention of similarity in any of their properties.
- Homology ≠ similarity of function.
(Punta & Ofran. PLOS Comp Biol. 2008)

Trust transfer of annotation?
(Punta & Ofran. PLOS Comp Biol. 2008)

Dos and Don'ts
Type                     | Don't         | Do
Homology                 | Same function | Probability for same function
Orthology                | Same function | Probability for same function
Paralogy                 | Same function | Probability for same function
Sequence similarity      | Same function | Probability for same function
High sequence similarity | Same function | Probability for same function
Same sequence            | Same function | Probability for same function

What if nothing is similar?
- Subcellular localization
- Gene context
- Special features
- Prediction of binding residues (DISIS, bindN)
[Figure labels: Cytoplasm, S~S, Periplasm]

Genome annotation
Annotation should make sense.
[Figure: model pathway. Labels: Substrate A, Substrate B, Substrate C, Substrate D; Enzyme 1, Enzyme 2, Enzyme 3; panels marked "?" and "✓"]
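
One way to read "annotation should make sense" is as a consistency check: if a pathway is expected in the organism, every step should have an annotated enzyme. The following is a minimal sketch of that idea, not the workshop's method. The function name `pathway_gaps` is hypothetical, and the assignment of each enzyme to a particular step is an assumption, since the figure layout did not survive.

```python
def pathway_gaps(pathway_steps, annotated_enzymes):
    """Return the pathway steps whose enzyme is absent from the annotation."""
    return [
        (substrate, product, enzyme)
        for substrate, product, enzyme in pathway_steps
        if enzyme not in annotated_enzymes
    ]


# Hypothetical linear pathway using the labels from the slide.
# Which enzyme catalyzes which step is an assumption.
model_pathway = [
    ("Substrate A", "Substrate B", "Enzyme 1"),
    ("Substrate B", "Substrate C", "Enzyme 2"),
    ("Substrate C", "Substrate D", "Enzyme 3"),
]

complete = pathway_gaps(model_pathway, {"Enzyme 1", "Enzyme 2", "Enzyme 3"})
incomplete = pathway_gaps(model_pathway, {"Enzyme 1", "Enzyme 3"})
print("gaps (all enzymes annotated):", complete)
print("gaps (Enzyme 2 missing):", incomplete)
```

A reported gap does not prove the pathway is absent. It flags a step where the annotation deserves a second look, for example a missed gene or a misassigned function.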

Annotation should make sense

Databases
- Databases used for the analysis of biological molecules.
- Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
- Why bother?
  - Store information.
  - Organize data.
  - Predict features (genes, functions...).
  - Understand relationships (metabolic reconstruction).

Primary nucleotide databases: EMBL/GenBank/DDBJ
- Archive containing all sequences from:
  - genome projects
  - sequencing centers
  - individual scientists
  - patent offices
- The sequences are exchanged between the three centers on a daily basis.
- The database doubles in size every 10 months.
- Sequences from >140,000 different species.
- 1400 new species added every month.
Year | Base pairs       | Sequences
[…]  | […],575,745,176  | 40,604,[…]
[…]  | […],037,734,462  | 52,016,[…]
[…]  | […],019,290,705  | 64,893,[…]
[…]  | […],874,179,730  | 80,388,[…]
[…]  | […],116,431,942  | 98,868,465
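
To make the "doubling every 10 months" figure concrete, the sketch below projects database size under steady exponential growth. It assumes the doubling time stays constant, which the slide does not claim. The function name `projected_size` is hypothetical.

```python
def projected_size(current_size, months_ahead, doubling_months=10.0):
    """Project a size forward assuming a constant doubling time."""
    return current_size * 2 ** (months_ahead / doubling_months)


# Growth factor over 12 months with a 10-month doubling time: 2 ** 1.2
yearly_factor = projected_size(1.0, 12)
print(f"growth per year: {yearly_factor:.2f}x")
```

Under that assumption the archive grows about 2.3-fold per year.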

Primary protein sequence databases
- Contain protein sequences derived from the translation of nucleotide coding sequences (CDS).
- GenBank
  - Valid translations (CDS) from nt GenBank entries.
- UniProtKB/TrEMBL (1996)
  - Automatic CDS translations from EMBL.
  - TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.
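
As an illustration of what "translation of a CDS" means, here is a minimal sketch using the standard genetic code. It is not how GenBank or TrEMBL produce their entries: it ignores alternative start codons, organism-specific translation tables, and ambiguity codes (unknown codons become "X"). The names `CODON_TABLE` and `translate_cds` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" for unknown codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))
```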

Errors in databases
There are many errors in the primary sequence databases:
- In the sequences themselves:
  - Sequencing errors.
  - Cloning vector sequences.
- In the annotations:
  - Inaccuracies, omissions, and even mistakes.
  - Inconsistencies between some fields.

Redundancy
Redundancy is a major problem. Entries are partially or entirely duplicated:
- e.g. 20% of vertebrate sequences in GenBank.
[Figure: partial and complete sequence duplications]
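
The difference between complete and partial duplication can be sketched with plain substring matching. This is a deliberate simplification: real redundancy reduction relies on alignment-based clustering and tolerates mismatches, which this sketch does not. The function name `find_redundant` and the example entries are hypothetical.

```python
def find_redundant(entries):
    """Split entries into kept records and duplicates of a kept record.

    entries maps an identifier to its sequence. A sequence is flagged as a
    'complete' duplicate if it is identical to a kept sequence, and as a
    'partial' duplicate if it is fully contained in a longer kept sequence.
    """
    # Visit longest sequences first so shorter ones are tested against them.
    ordered = sorted(entries, key=lambda k: (-len(entries[k]), k))
    kept, duplicates = [], []
    for entry_id in ordered:
        seq = entries[entry_id]
        match = next((k for k in kept if seq in entries[k]), None)
        if match is None:
            kept.append(entry_id)
        else:
            kind = "complete" if len(seq) == len(entries[match]) else "partial"
            duplicates.append((entry_id, match, kind))
    return kept, duplicates


# Hypothetical toy entries.
example_entries = {
    "a": "ATTGACTATATAGCCG",
    "b": "ATTGACTATATAGCCG",
    "c": "TATAGCCG",
    "d": "CGTGA",
}
kept, duplicates = find_redundant(example_entries)
print("kept:", kept)
print("duplicates:", duplicates)
```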

NCBI Derivative Sequence Data
[Figure labels: Labs, Curators, Algorithms; GenBank, UniGene, RefSeq, Genome Assembly]

RefSeq
- Curated transcripts and proteins.
  - Reviewed by NCBI staff.
- Model transcripts and proteins.
  - Generated by computer algorithms.
- Assembled genomic regions (contigs).
- Chromosome records.

Secondary protein databases
- UniProt/SWISS-PROT (1986)
  - a curated protein sequence database
  - high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.)
  - a minimal level of redundancy
  - high level of integration with other databases

Classification databases
- Groups (families/clusters) of proteins based on…
  - Overall sequence similarity.
  - Local sequence similarity.
  - Presence / absence of specific features (active site, signal peptides…).
  - Structural similarity.
  - ...
- These groups contain proteins with similar properties.
  - Specific function, enzymatic activity.
  - General function.
  - Evolutionary relationship.
  - …

Overall sequence similarity

Clusters of orthologous groups (COGs)
- COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages.
- Each cluster has representatives of at least 3 lineages.
- A function (specific or broad) has been assigned to each COG.

How it works
[Figure labels: Blast best hit, Reciprocal best hit, Bidirectional best hit, Unidirectional best hit, COG1, COG2]
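
The distinction between unidirectional and bidirectional (reciprocal) best hits can be sketched for a single pair of genomes as follows. This covers only the pairwise step; building COGs additionally requires representatives from at least three lineages, which is not implemented here. The function names, protein identifiers, and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores maps (query, subject) pairs to a similarity score
    (for example a BLAST bit score).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores between proteins of genome A and genome B.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b1"): 120, ("a2", "b2"): 80}
b_vs_a = {("b1", "a1"): 245, ("b1", "a2"): 118, ("b2", "a2"): 85, ("b2", "a1"): 88}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy data a1 and b1 are each other's best hit, while a2 has b1 as its best hit without being b1's best hit, so that relationship is only unidirectional.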

Profiles & Pfam
- One way to classify proteins into groups is to exploit similarities in local regions (domains/profiles), which carry valuable information.
- These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.
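
To show why a profile can pick up a region where only a few residues are conserved, the sketch below builds a simple log-odds position-specific scoring matrix from ungapped aligned segments and scans a sequence with it. This is a swapped-in, simpler technique than the profile HMMs Pfam uses: it has no insertions or deletions and assumes a uniform background. The function names and the toy segments are hypothetical.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_segments, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length aligned segments."""
    background = 1.0 / len(AMINO)  # assumed uniform background frequency
    total = len(aligned_segments) + pseudocount * len(AMINO)
    profile = []
    for column in range(len(aligned_segments[0])):
        counts = Counter(segment[column] for segment in aligned_segments)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO
        })
    return profile


def best_window(profile, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (sum(profile[i].get(sequence[start + i], 0.0) for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    )


# Hypothetical aligned segments of a conserved region.
profile = build_profile(["GDSL", "GDSL", "GDSI", "GESL"])
score, start = best_window(profile, "MKTAGDSLVV")
print(f"best window starts at position {start} with score {score:.2f}")
```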

Region similarity

Pfam
HMMs of protein alignments: local (for domains) or global (covering the whole protein).

TIGRfam
- Full-length alignments.
- Domain alignments.
- Equivalogs: families of proteins with a specific function.
- Superfamilies: families of homologous genes.
- HMMs

KEGG orthology

Composite pattern databases
- To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
- Release 30.0 (Dec10) contains […] entries
- Central annotation resource, with pointers to its satellite dbs

* It is up to the user to decide if the annotation is correct *

ENZYME

ENZYME

KEGG
- Contains information about biochemical pathways and protein interactions.

Functional annotation
[Figure: flowchart of the functional annotation of a gene in IMG. Recoverable labels:
- Assignments: COGs, KO terms, Pfam, TIGRfam, IMG term
- Searches and cutoffs: PSI BLAST 1e-2; BLASTp <1e-10, >45% id, >70% length; Hmmsearch (BLAST preprocessing); BLASTp evalue<10, 20 best hits
- Product name (based on translation tables): IMG term, TIGRfam, COG, COG + pfam, Pfam, with YES/NO decision branches ending in "hypothetical"]
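
The cutoffs and the product-name fallback named on this slide can be sketched as below. The original flowchart did not survive extraction, so two things here are assumptions: that the three BLASTp cutoffs (<1e-10, >45% identity, >70% length) are applied together to one hit, and that the name sources are tried in the order in which their labels appear. The function names and the placeholder names in the example are hypothetical.

```python
def passes_blastp_cutoffs(evalue, percent_identity, aligned_fraction,
                          max_evalue=1e-10, min_identity=45.0, min_fraction=0.70):
    """Check a hit against the BLASTp cutoffs listed on the slide.

    Assumption: all three cutoffs must hold for the same hit.
    """
    return (evalue < max_evalue
            and percent_identity > min_identity
            and aligned_fraction > min_fraction)


# Assumed precedence, taken from the order of the labels on the slide.
NAME_SOURCES = ("IMG term", "TIGRfam", "COG", "COG + pfam", "Pfam")


def choose_product_name(assignments):
    """Return (name, source) from the first source that offers a name."""
    for source in NAME_SOURCES:
        name = assignments.get(source)
        if name:
            return name, source
    return "hypothetical", None


print(passes_blastp_cutoffs(1e-30, 52.0, 0.90))
print(choose_product_name({"Pfam": "name from Pfam", "COG": "name from COG"}))
print(choose_product_name({}))
```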

Sequencing projects & Metadata

Literature search
- PubMed

Specialized databases
- There is a large number of databases devoted to specific organisms.
- For some model organisms there are often concurrent systems.
- These databases are typically associated with sequencing or mapping projects.

Other specialized databases
- Signal transduction, regulation, protein-protein interactions
  - TRANSFAC (Transcription Factor database)
  - BRITE (Biomolecular Relations in Information Transmission and Expression database)
  - DIP (Database of Interacting Proteins)
  - BIND (Biomolecular Interaction Network database)
  - BioCarta
- Biochemical pathways
  - KLOTHO (Biochemical Compounds Declarative database)
  - BRENDA (enzyme information system)
  - LIGAND (similar to ENZYME but with more information for substrates)
- Gene order and co-occurrence
  - STRING
- 3D structures
  - PDB (Protein Data Bank)
  - MMDB (Molecular Modelling Data Base)
  - NRL_3D (Non-Redundant Library of 3D Structures)
  - SCOP (Structural Classification of Proteins)
- Polymorphism
  - ALFRED (Allele Frequency Database)
- Molecular interactions
  - DIP (Database of Interacting Proteins)
  - BIND (Biomolecular Interaction Network Database)
- Gene expression
  - GXD (Mouse Gene Expression Database)
  - The Stanford Microarray Database
- Mapping
  - GDB (Genome Data Base)
  - EMG (Encyclopedia of Mouse Genome)
  - MGD (Mouse Genome Database)
  - INE (Integrated Rice Genome Explorer)
- Protein quantification
  - SWISS-2DPAGE
  - PDD (Protein Disease Database)
  - Sub2D (B. subtilis 2D Protein Index)

List of databases

Databanks interconnection
- Not all databases are updated regularly.
- Changes of annotation in one database are not reflected in others.

Summary
- Gene annotation should make sense in the context of the organism.
- We have main archives (GenBank), curated databases (RefSeq, SwissProt), protein classification databases (COG, Pfam), and many, many more…
- They help predict the function, or the network of functions.
- Systems that integrate the information from several databases, visualize it, and allow handling of data in an intuitive way are required.
QUESTIONS?