1 EMBL Outstation — The European Bioinformatics Institute Removing redundancy in SWISS-PROT and TrEMBL.

Presentation transcript:

1 EMBL Outstation — The European Bioinformatics Institute Removing redundancy in SWISS-PROT and TrEMBL

2 SWISS-PROT
- is a curated protein sequence data bank established in 1986 by Amos Bairoch in Geneva and maintained collaboratively with EMBL since 1987
- currently contains protein sequence entries

3 Essential criteria for a sequence data bank
- it must be complete with minimal redundancy
- it must contain as much up-to-date information as possible on each sequence
- all the information items must be retrievable by computer programs in a consistent manner
- it should be integrated (cross-referenced) with other sequence-related data banks

4 The Bottleneck: Annotation

5 Annotation consists of the description of:
- Function(s) of the protein
- Post-translational modification(s)
- Domains and sites
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Disease(s) associated with deficiency(ies) in the protein
- Sequence conflicts, variants, etc.

6 TrEMBL
- is a computer-annotated supplement to SWISS-PROT
- consists of entries in SWISS-PROT format
- contains translations of CDS in the Nucleotide Sequence Database that are not in SWISS-PROT
- the translation tools used are based on the program trembl, written by Thure Etzold at the EMBL in Heidelberg
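
The idea on this slide (translate every CDS, keep only the proteins not already in SWISS-PROT) can be sketched as follows. This is not the trembl program: the function names are hypothetical, only the standard genetic code is handled, and "already in SWISS-PROT" is approximated here by exact sequence identity, which is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3].upper(), "X")  # X = unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def new_translations(cds_records: dict, swissprot_sequences) -> dict:
    """Return {cds_id: protein} for CDS whose translation is not yet known.

    Assumption: membership is tested by exact sequence identity.
    """
    known = set(swissprot_sequences)
    result = {}
    for cds_id, cds in cds_records.items():
        protein = translate_cds(cds)
        if protein not in known:
            result[cds_id] = protein
    return result
```

For example, `new_translations({"cds1": "ATGGCCTAA", "cds2": "ATGAAATAG"}, ["MA"])` keeps only `cds2`, because the translation of `cds1` is already in the known set.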

7 TrEMBLNEW
- Weekly update of TrEMBL, which contains protein coding sequences derived from EMBLNEW
- TrEMBLNEW entries are moved into TrEMBL during the quarterly release building procedure

8 The Production of TrEMBL
- Translation and entry creation
- Sorting the entries
- Automated post-processing of the SP-TrEMBL entries

9 Automated post-processing of TrEMBL entries
- Redundancy removal: currently affects >10% of the entries
- Improvements to annotation: currently affects >20% of the entries

10 Removing Redundancy
- Causes of redundancy and the detection of redundancy
- Removing redundancy

11 Causes of redundancy
- Different literature and sequence reports for the same protein
- Subfragments of longer sequences
- Mutations, polymorphisms, variations and conflicts of a sequence are often given as separate entries in EMBL

12 Redundancy detection
- The Cyclic Redundancy Check (CRC32) calculates a nearly unique and very compact checksum for each sequence
- The Boyer-Moore sequence comparison algorithm for fast string searching
- An algorithm that finds strings with errors (Landau-Vishkin)
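
A minimal sketch of the checksum step, assuming the CRC-32 variant provided by Python's `zlib` module (whether SWISS-PROT used the same polynomial is not stated on the slide). Because the checksum is only "nearly unique", entries that share a checksum are compared again by full sequence before being reported as identical. The function names and entry identifiers are hypothetical.

```python
import zlib
from collections import defaultdict


def sequence_crc32(sequence: str) -> int:
    """Compact checksum of a protein sequence (case-insensitive)."""
    return zlib.crc32(sequence.upper().encode("ascii"))


def group_identical(entries: dict) -> list:
    """Return lists of entry ids whose sequences are identical.

    entries maps entry id -> sequence. Bucketing by CRC32 is the cheap
    first pass; the second pass confirms identity, since two different
    sequences can share a checksum.
    """
    by_crc = defaultdict(list)
    for entry_id, seq in entries.items():
        by_crc[sequence_crc32(seq)].append(entry_id)

    groups = []
    for ids in by_crc.values():
        by_seq = defaultdict(list)
        for entry_id in ids:
            by_seq[entries[entry_id].upper()].append(entry_id)
        groups.extend(g for g in by_seq.values() if len(g) > 1)
    return groups
```

The checksum keeps the comparison cheap: only entries that fall into the same bucket ever need a full string comparison.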

13 Removing Redundancy
- Identical full-length proteins are merged into one entry
- Identical fragment proteins and subfragments of longer sequences from the same organism are merged
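
The subfragment rule can be illustrated with a short sketch. It assumes exact containment only, and Python's built-in substring search stands in for the Boyer-Moore search named on the previous slide; error-tolerant matching (Landau-Vishkin) is not attempted. The tuple layout and function name are hypothetical.

```python
def find_subfragments(entries: list) -> dict:
    """Map each redundant entry id to the id of the entry that contains it.

    entries is a list of (entry_id, organism, sequence) tuples. An entry is
    redundant when its sequence occurs inside an equally long or longer
    sequence from the same organism.
    """
    absorbed = {}
    # Longest first, so every candidate host precedes its fragments.
    ordered = sorted(entries, key=lambda e: len(e[2]), reverse=True)
    for i, (frag_id, frag_org, frag_seq) in enumerate(ordered):
        for host_id, host_org, host_seq in ordered[:i]:
            if (host_org == frag_org
                    and host_id not in absorbed
                    and frag_seq in host_seq):
                absorbed[frag_id] = host_id
                break
    return absorbed
```

The organism check matters: the same short peptide found in two species is kept as two entries, which matches the "from the same organism" condition on the slide.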

14 Removing Redundancy
- The 'MERGE' procedure
  - Match CRC32
    - match TrEMBLNEW vs TrEMBLNEW (automatic merge)
    - match TrEMBLNEW vs TrEMBL (automatic merge)
    - match TrEMBLNEW vs SWISS-PROT (manual merge)
  - Subfragment assembly (LASSAP)
    - match TrEMBLNEW vs TrEMBLNEW (automatic merge and manual check)
    - match TrEMBLNEW vs TrEMBL (automatic merge and manual check)
    - match TrEMBLNEW vs SWISS-PROT (manual merge)
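
The decision table on this slide can be written down as a small routing function. This is only a sketch of the rules as listed; the names `Action` and `merge_action` and the string labels for match types are hypothetical.

```python
from enum import Enum


class Action(Enum):
    AUTOMATIC_MERGE = "automatic merge"
    AUTOMATIC_MERGE_MANUAL_CHECK = "automatic merge and manual check"
    MANUAL_MERGE = "manual merge"


def merge_action(match_type: str, target_db: str) -> Action:
    """Decide how a TrEMBLNEW match is handled.

    match_type: "crc32" or "subfragment" (hypothetical labels).
    target_db: the database the TrEMBLNEW entry matched against.
    """
    if match_type not in ("crc32", "subfragment"):
        raise ValueError(f"unknown match type: {match_type}")
    if target_db == "SWISS-PROT":
        # Curated entries are never changed automatically.
        return Action.MANUAL_MERGE
    if target_db not in ("TrEMBLNEW", "TrEMBL"):
        raise ValueError(f"unknown database: {target_db}")
    if match_type == "crc32":
        return Action.AUTOMATIC_MERGE
    return Action.AUTOMATIC_MERGE_MANUAL_CHECK
```

Read this way, the table has one invariant: any match against SWISS-PROT goes to a curator, while matches within the computer-annotated sets are merged automatically, with a manual check added for the less certain subfragment matches.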

15 PID Check
[Flow diagram; labels: EMBLNEW; trembl; SP + TREMBL PIDs (Work Release); Day 1, Day 2, Day n; TREMBLNEW; Week 1, Week 2, Week n; TREMBLNEW Updates; Replace PIDs in SP+TREMBL; SP; TREMBL; Merge; Between releases; Building Release]

16 Results
EMBL Nucleotide Sequence Database (rel 55) has 326,000 CDS
SWISS-PROT (rel 36) has 74,019 entries
TrEMBL (rel 7) has 193,860 entries
- 110,000 CDS were already in 74,000 SWISS-PROT entries
- 207,000 CDS were in 194,000 TrEMBL entries
- 9,000 are currently being processed due to redundancy procedures
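
The three rounded figures on this slide account for all 326,000 CDS, and the CDS-per-entry ratios show how much more merging has happened in SWISS-PROT than in TrEMBL. A quick check, using only the numbers from the slide (the variable names are illustrative):

```python
total_cds = 326_000        # EMBL rel 55
in_swissprot = 110_000     # CDS already in SWISS-PROT entries
in_trembl = 207_000        # CDS in TrEMBL entries
in_processing = 9_000      # held back by redundancy procedures

accounted = in_swissprot + in_trembl + in_processing
assert accounted == total_cds

# Average number of CDS represented by one protein entry.
cds_per_swissprot_entry = in_swissprot / 74_019   # SWISS-PROT rel 36
cds_per_trembl_entry = in_trembl / 193_860        # TrEMBL rel 7

print(f"CDS per SWISS-PROT entry: {cds_per_swissprot_entry:.2f}")
print(f"CDS per TrEMBL entry:     {cds_per_trembl_entry:.2f}")
```

A ratio above 1 means several nucleotide-level CDS reports have been collapsed into a single protein entry.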

17 Results
- Results of redundancy removal within TrEMBL 7 production:
  - were already in SWISS-PROT
  - were merged due to CRC32 matches
  - were removed by subfragment matches
- 8,859 entries were removed

18 Credits
SWISS-PROT at EBI: Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Henning Hermjakob, Viv Junker, Fiona Lang, Claire O'Donovan, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Steffen Moeller, Youla Karavidopoulou, Gill Fraser, Evguenia Kriventseva
Collaborators: Amos Bairoch, Eric Glemet, Jean-Jacques Codani