1 EMBL Outstation — The European Bioinformatics Institute Large-Scale Characterization of Protein Sequence Data.

2 The Challenge
- Rapidly growing amounts of data that lack experimental determination of biological function increase the need for computational analysis of the data

3 Databases are essential tools in bioinformatics for computational analysis and data mining, with SWISS-PROT being the gold standard

4 SWISS-PROT
- is a curated protein sequence data bank, established in 1986 by Amos Bairoch in Geneva and maintained collaboratively with EMBL since 1987
- currently contains [...] protein sequence entries

5 Essential criteria for a sequence data bank
1. it must be complete with minimal redundancy
2. it must contain as much up-to-date information as possible on each sequence
3. all the information items must be retrievable by computer programs in a consistent manner
4. it should be integrated (cross-referenced) with other sequence-related data banks
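The third and fourth criteria can be illustrated with a minimal sketch of reading a flat-file entry by its two-letter line codes (ID, AC, DE, DR, SQ) and the `//` terminator. The sample entry, its accession and its cross-references are made up for illustration, and the function names `parse_entry` and `cross_references` are hypothetical; the SQ line is simplified compared with a real entry.

```python
from collections import defaultdict

# A made-up entry in the SWISS-PROT line-code layout. The accession,
# description and cross-references are illustrative only.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   12 AA.
AC   P00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
DR   EMBL; X00000; -.
DR   PROSITE; PS00001; ASN_GLYCOSYLATION; 1.
SQ   SEQUENCE   12 AA;
     MKNGSAAPLV TR
//
"""


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if not line.strip():
            continue                       # ignore blank lines
        if line.startswith("//"):          # end-of-entry marker
            break
        code, data = line[:2], line[5:].strip()
        if code == "  ":                   # sequence lines carry no line code
            sequence.append(data.replace(" ", ""))
        else:
            fields[code].append(data)
    fields["sequence"] = ["".join(sequence)]
    return dict(fields)


def cross_references(entry):
    """Return (database, primary identifier) pairs taken from the DR lines."""
    refs = []
    for line in entry.get("DR", []):
        parts = [part.strip() for part in line.rstrip(".").split(";")]
        refs.append((parts[0], parts[1]))
    return refs


entry = parse_entry(SAMPLE_ENTRY)
print(entry["AC"], cross_references(entry))
```

Because every item sits behind a fixed line code, the same few lines of code retrieve the accession, the description or the database pointers from any entry.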

6 Integration with other databases
- SWISS-PROT entries
- abstracted from > [...] references
- linked by > [...] direct pointers to
- 28 related or specialized data collections

7 Integration with other databases
- EMBL Nucleotide Sequence Database
- PDB
- Genomic databases (FlyBase, SubtiList, MaizeDB, EcoGene, LISTA, SGD, StyGene)
- 2D-gel databases (ECO2DBASE, SWISS-2DPAGE, Aarhus/Ghent, YEPD, Harefield)
- Specialized collections (OMIM, PROSITE, ENZYME, GCRDB, Transfac, HSSP)

8 Connections between databases

9 SWISS-PROT Growth

10 Nucleotide sequence database growth

11 The Bottleneck: Annotation

12 Annotation consists of the description of:
- Function(s) of the protein
- Post-translational modification(s)
- Domains and sites
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Disease(s) associated with deficiencies in the protein
- Sequence conflicts, variants, etc.

13 Annotation sources:
- publications that report new sequence data
- review articles, used to periodically update the annotation of families or groups of proteins
- external experts

14 TrEMBL
- is a computer-annotated supplement to SWISS-PROT
- consists of entries in SWISS-PROT format
- contains translations of CDS in the EMBL Nucleotide Sequence Database that are not in SWISS-PROT
- the translation tools used are based on the program trembl, written by Thure Etzold at the EMBL in Heidelberg

15 August 1998: SWISS-PROT 36 + TrEMBL 7
- CDS in corresponding EMBL release
- SWISS-PROT entries
- CDS integrated in SWISS-PROT
- the remaining CDS were merged whenever possible to reduce redundancy

16 TrEMBL release 7
- TrEMBL entries
- amino acids
- linked by > [...] direct pointers to
- 14 related or specialized data collections

17 The Production of TrEMBL
1. translation and entry creation
2. sorting the entries
3. post-processing the SP-TrEMBL entries

18 Translation and entry creation
1. translation of every CDS not yet cross-referenced to SWISS-PROT
2. parsing of information in EMBL entries into TrEMBL entries
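The first step can be sketched as follows. This is not the trembl program itself, only a minimal illustration that assumes the standard genetic code, CDS features supplied as (identifier, nucleotide sequence) pairs, and a set of identifiers already cross-referenced to SWISS-PROT. The function names `translate` and `new_entries` are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def new_entries(cds_features, already_in_swissprot):
    """Translate only the CDS whose identifier is not yet cross-referenced."""
    return {
        identifier: translate(sequence)
        for identifier, sequence in cds_features
        if identifier not in already_in_swissprot
    }


print(new_entries([("a", "ATGGCC"), ("b", "ATGTGGTAA")], {"a"}))
```

Filtering on the cross-reference before translating keeps the supplement from duplicating proteins that already have a curated entry.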

19 Sorting the entries
- into SP-TrEMBL and REM-TrEMBL
- SP-TrEMBL is split into taxonomic divisions

20 Post-processing
1. reducing redundancy
2. enhancing the information content
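Redundancy reduction can be pictured with a small sketch. The merge criterion used here (identical sequence from the same organism) is an assumption made for illustration, since the slides do not state the actual rules; the dictionary layout and the name `merge_redundant` are likewise hypothetical.

```python
def merge_redundant(entries):
    """Merge entries that share organism and sequence, pooling their cross-references.

    Each entry is a dict with the keys 'id', 'organism', 'sequence' and 'xrefs'.
    """
    merged = {}
    for entry in entries:
        key = (entry["organism"], entry["sequence"])
        if key not in merged:
            merged[key] = {
                "ids": [entry["id"]],
                "organism": entry["organism"],
                "sequence": entry["sequence"],
                "xrefs": set(entry["xrefs"]),
            }
        else:
            merged[key]["ids"].append(entry["id"])
            merged[key]["xrefs"] |= set(entry["xrefs"])
    return list(merged.values())


entries = [
    {"id": "T1", "organism": "Mus musculus", "sequence": "MKV", "xrefs": ["EMBL:X1"]},
    {"id": "T2", "organism": "Mus musculus", "sequence": "MKV", "xrefs": ["EMBL:X2"]},
    {"id": "T3", "organism": "Homo sapiens", "sequence": "MKV", "xrefs": ["EMBL:X3"]},
]
for record in merge_redundant(entries):
    print(record["ids"], sorted(record["xrefs"]))
```

Pooling the cross-references during the merge means that no database pointer is lost when two reports of the same protein collapse into one entry.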

21 Improving Automatic Annotation
- will streamline flow into TrEMBL
- will bring TrEMBL nearer to SWISS-PROT quality
- will make the transition from TrEMBL to SWISS-PROT easier

22 Demands on a system for automated data analysis and annotation
- Correctness
- Scalability
- Updatability
- Low level of redundant information
- Completeness
- Standardized vocabulary

23 Components of a system for automated data analysis and annotation
- sequence analysis tools (PROSITE, TM, Coiled Coils, Signal, etc.)
- sequence similarity searching (FASTA, SW, BLAST)
- database scanning/parsing (MGD, FlyBase, ENZYME, etc.)
- information transfer decided by a rule-based system
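As an example of the first component, a PROSITE-style pattern can be turned into a regular expression and scanned against a sequence. The sketch below covers only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` anchors) and is not a full PROSITE scanner; the function names are hypothetical. The pattern used is the N-glycosylation site pattern PS00001, `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""   # N-terminal anchor
        suffix = "$" if element.endswith(">") else ""     # C-terminal anchor
        element = element.strip("<>")
        # Split off an optional repeat count: x(3) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                    # any amino acid
            piece = "."
        elif core.startswith("{"):         # any amino acid except those listed
            piece = "[^" + core[1:-1] + "]"
        else:                              # a literal residue or a [..] choice
            piece = core
        if low:
            piece += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + piece + suffix)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (1-based position, matched text) for every hit, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(match.start() + 1, match.group(1)) for match in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_matches("N-{P}-[ST]-{P}", "MKNGSAAPLV"))
```

The lookahead wrapper is a design choice that reports overlapping hits, which a plain left-to-right regex scan would skip.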

24 Environment for Distributed Information Transfer to TrEMBL (EDITtoTrEMBL)
- RuleBase
- Analyzers
- Dispatchers

25 EDITtoTrEMBL

26 EDITtoTrEMBL: RuleBase
- SWISS-PROT as source of annotation: correctness and controlled vocabulary
- Rules can be created automatically and/or manually
- Rules can be updated

27 EDITtoTrEMBL: Analyzers
- Directly implement an algorithm or communicate with external programs
- Query other databases
- Use rules to add information to TrEMBL entries

28 EDITtoTrEMBL: Dispatchers
- Control of annotation flow
- Error checking
- Removal of redundant information
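The division of labour between the three components might be sketched as below. This is a toy model and not the EDITtoTrEMBL implementation: the class names, the evidence labels and both analyzers are invented for illustration. Analyzers produce evidence, rules map evidence to annotation, and a dispatcher controls the flow, checks for errors and avoids adding the same annotation twice.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    accession: str
    sequence: str
    annotation: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Rule:
    evidence: str     # evidence label that triggers the rule (hypothetical labels)
    annotation: str   # annotation text added when the rule fires


def glycosylation_analyzer(entry):
    """Toy analyzer: report evidence if the N-{P}-[ST]-{P} motif occurs."""
    return {"N_GLYC_MOTIF"} if re.search(r"N[^P][ST][^P]", entry.sequence) else set()


def cysteine_analyzer(entry):
    """Toy analyzer: report evidence if more than 10% of the residues are cysteine."""
    share = entry.sequence.count("C") / len(entry.sequence)
    return {"CYS_RICH"} if share > 0.1 else set()


class Dispatcher:
    """Runs every analyzer, applies the rules, and keeps the annotation free of repeats."""

    def __init__(self, analyzers, rules):
        self.analyzers = list(analyzers)
        self.rules = list(rules)

    def annotate(self, entry):
        if not entry.sequence:                      # error checking
            raise ValueError("entry %s has no sequence" % entry.accession)
        evidence = set()
        for analyzer in self.analyzers:             # control of annotation flow
            evidence |= analyzer(entry)
        for rule in self.rules:
            already_present = rule.annotation in entry.annotation
            if rule.evidence in evidence and not already_present:
                entry.annotation.append(rule.annotation)
        return entry


rules = [
    Rule("N_GLYC_MOTIF", "potential N-glycosylation site"),
    Rule("CYS_RICH", "cysteine-rich"),
]
dispatcher = Dispatcher([glycosylation_analyzer, cysteine_analyzer], rules)
print(dispatcher.annotate(Entry("T0001", "MKNGSAAPLV")).annotation)
```

Keeping the rules as data, separate from the analyzers, reflects the point made on the RuleBase slide that rules can be created and updated without touching the analysis code.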

29 Standardized transfer of annotation from characterized proteins in SWISS-PROT to TrEMBL entries
- TrEMBL entry is reliably recognized by a given method as a member of a certain group of proteins
- corresponding group of proteins in SWISS-PROT shares certain annotation
- common annotation is transferred to the TrEMBL entry and flagged as annotated by similarity
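The three steps above can be sketched as a set intersection followed by a flagged transfer. This is a simplified illustration under stated assumptions: annotation items are plain strings, group membership is passed in as a boolean decided elsewhere, and the function names and sample annotation are hypothetical.

```python
def common_annotation(swissprot_group):
    """Annotation items shared by every SWISS-PROT member of the group."""
    sets = [set(member) for member in swissprot_group]
    return set.intersection(*sets) if sets else set()


def transfer(trembl_annotation, is_member, swissprot_group):
    """Add the group's common annotation to a TrEMBL entry, flagged as by similarity."""
    result = list(trembl_annotation)
    if not is_member:                     # no reliable group membership, no transfer
        return result
    for item in sorted(common_annotation(swissprot_group)):
        flagged = item + " (BY SIMILARITY)"
        if item not in result and flagged not in result:
            result.append(flagged)
    return result


group = [
    {"FUNCTION: kinase", "SUBCELLULAR LOCATION: cytoplasm"},
    {"FUNCTION: kinase", "SUBCELLULAR LOCATION: nucleus"},
]
print(transfer(["SIMILARITY: belongs to family X"], True, group))
```

Only annotation shared by all characterized members is transferred, so a property that differs within the group (the location in this example) is never propagated.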

30 Automated post-processing of TrEMBL entries
- redundancy removal: currently affects >10% of the entries
- improvement of annotation: currently affects >20% of the entries

31 Integrated resource of Protein domain and functional sites (InterPro)
- Integration of different pattern recognition methods (PROSITE, PRINTS and PFAM)
- Incorporation of new families and domains into InterPro
- Enhancing the functional annotation of TrEMBL entries
- Enhancing genome annotation

32 The InterPro project participants
- Co-ordinated by EBI (R. Apweiler)
- PROSITE (A. Bairoch, P. Bucher)
- PRINTS (T. Attwood)
- PFAM (R. Durbin, E. Birney, A. Bateman, E. Sonnhammer)
- PRODOM (D. Kahn)
- PRATT (I. Jonassen)
- GENE-IT
- LION bioscience AG

33 SWISS-PROT + TrEMBL
- complete and up-to-date protein sequence collection
- minimal redundancy: SP_TR_NRDB
- linked by > [...] direct pointers to
- 28 related or specialized data collections
- deeper integration between the EMBL Nucleotide Sequence Database and SWISS-PROT + TrEMBL by using PID numbers

34 Credits
SWISS-PROT at EBI
- Rolf Apweiler
- Sergio Contrino
- Christian Desaintes
- Wolfgang Fleischmann
- Henning Hermjakob
- Viv Junker
- Fiona Lang
- Claire O'Donovan
- Michele Magrane
- Maria Jesus Martin
- Nicoletta Mitaritonna
- Steffen Moeller
- Stephanie Kappus
- Sheila Rose
Collaborators
- Amos Bairoch
- Jean-Jacques Codani
- Keith Tipton
- Marvin Edelman
- Compugen
- Sue Povey and Julia White
- MGD
- FlyBase
- Neil Rawlings
- Network of > 200 external experts