PROTEIN DATABASES. The ideal sequence database for computational analyses and data-mining: I t must be complete with minimal redundancy It must contain.

Slides:



Advertisements
Similar presentations
Genome Annotation: A Protein-centric Perspective.
Advertisements

Bioinformatics Ayesha M. Khan Spring 2013.
Databanks (A) NCBINCBI (National Center for Biotechnology Information) is a home for many public biological databases (see an older diagram below). All.
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
On line (DNA and amino acid) Sequence Information Lecture 7.
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
Bioinformatics for biomedicine Summary and conclusions. Further analysis of a favorite gene Lecture 8, Per Kraulis
Swiss-Prot Protein Database Daniel Amoruso December 2, 2004 BI 420.
Protein databases Morten Nielsen. Background- Nucleotide databases GenBank, National Center for Biotechnology Information.
Archives and Information Retrieval
Lecture 2.21 Retrieving Information: Using Entrez.
Protein Databases EBI – European Bioinformatics Institute
Biological Databases Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center
The Protein Data Bank (PDB)
Protein databases Henrik Nielsen. Background- Nucleotide databases GenBank, National Center for Biotechnology Information.
Class European Resources Protein Focused. Protein Databases EBI – European Bioinformatics Institute
EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall, Ph.D. Senior InterPro Curator Protein Sequence Database:
Chapter 2 Sequence databases A list of the databases’ uniform resource locators (URLs) discussed in this section is in Box 2.1.
UniProt - The Universal Protein Resource
Claire O’Donovan EMBL-EBI. In UniProtKB, we aim to provide… o A high quality protein sequence database A non redundant protein database, with maximal.
Predicting Function (& location & post-tln modifications) from Protein Sequences June 15, 2015.
1 iProLINK: An integrated protein resource for literature mining and literature-based curation 1. Bibliography mapping - UniProt mapped citations 2. Annotation.
Pattern databasesPattern databasesPattern databasesPattern databases Gopalan Vivek.
Databases in Bioinformatics and Systems Biology Carsten O. Daub Omics Science Center RIKEN, Japan May 2008.
Bioinformatics for biomedicine
Archives and Information Retrieval
Introduction to databases Tuomas Hätinen. Topics File Formats Databases -Primary structure: UniProt -Tertiary structure: PDB Database integration system.
© Wiley Publishing All Rights Reserved. Protein and Specialized Sequence Databases.
Secondary Databases Ansuman sahoo Roll: Y Bioinformatics Class Presentation 30 Jan 2013.
Biological databases Nicky Mulder:
Biological Databases By : Lim Yun Ping E mail :
1 Orthology and paralogy A practical approach Searching the primaries Searching the secondaries Significance of database matches DB Web addresses Software.
Fortaleza 31.VII.2006 UniProtKB: Questions and answers UniProtKB/Swiss-Prot: Questions, Answers and a few Tips.
Part I: Identifying sequences with … Speaker : S. Gaj Date
EBI is an Outstation of the European Molecular Biology Laboratory. Annotation Procedures for Structural Data Deposited in the PDBe at EBI.
Biological Databases Biology outside the lab. Why do we need Bioinfomatics? Over the past few decades, major advances in the field of molecular biology,
1 EMBL Outstation — The European Bioinformatics Institute Added-Value Proteome Databases: SWISS-PROT, TrEMBL, InterPro.
BLOCKS Multiply aligned ungapped segments corresponding to most highly conserved regions of proteins- represented in profile.
1 EMBL Outstation — The European Bioinformatics Institute Automatic and Reliable Functional Annotation of Proteins.
PIRSF Classification System PIRSF: Evolutionary relationships of proteins from super- to sub-families Homeomorphic Family: Homologous proteins sharing.
Function preserves sequences
Sequencing the World of Possibilities for Energy & Environment MGM workshop. 19 Oct 2010 Information Sources for Genomics Konstantinos Mavrommatis Genome.
Rice Proteins Data acquisition Curation Resources Development and integration of controlled vocabulary Gene Ontology Trait Ontology Plant Ontology
1 EMBL Outstation — The European Bioinformatics Institute Removing redundancy in SWISS-PROT and TrEMBL.
Protein Sequence Databases for Proteomics The good, the bad & the ugly US HUPO: Bioinformatics for Proteomics Nathan Edwards – March 12, 2005.
EMBL – EBI European Bioinformatics Institute UniProt - The Universal Protein Resource Claire O’Donovan.
MGM workshop. 19 Oct 2010 Functional annotation Datasources Konstantinos Mavrommatis
Bioinformatics and Computational Biology
PROTEIN PATTERN DATABASES. PROTEIN SEQUENCES SUPERFAMILY FAMILY DOMAIN MOTIF SITE RESIDUE.
Computer Storage of Sequences
1 Discussion Practical 1. Features of major databases (PubMed and NCBI Protein Db) 2.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
Biological Information and Biological Databases Meena K Sakharkar Bioinformatics Centre National University of Singapore.
Protein Sequence Databases for Proteomics The good, the bad & the ugly US HUPO: Bioinformatics for Proteomics Nathan Edwards – March 12, 2006.
Central hub for biological data UniProtKB/Swiss-Prot is a central hub for biological data: over 120 databases are cross-referenced (EMBL/DDBJ/GenBank,
1 EMBL Outstation — The European Bioinformatics Institute Mus musculus - a model organism in SWISS-PROT.
Protein sequence databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen This also includes old material from my thesis
1 of 28 Evaluating Genes and Transcripts (“Genebuild”)
Protein databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen and from CSC bio-opas
 What is MSA (Multiple Sequence Alignment)? What is it good for? How do I use it?  Software and algorithms The programs How they work? Which to use?
1 EMBL Outstation — The European Bioinformatics Institute Large-Scale Characterization of Protein Sequence Data.
Protein databases Henrik Nielsen
Bio/Chem-informatics
Archives and Information Retrieval
UniProt: Universal Protein Resource
There are four levels of structure in proteins
BLAST.
Introduction to Bioinformatics
Introduction to Databases
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

PROTEIN DATABASES

The ideal sequence database for computational analyses and data-mining: I t must be complete with minimal redundancy It must contain as much up-to-date information (annotation) as possible on each sequence All the information items must be retrievable by computer programs in a consistent manner It must be highly interoperable with other databases

PROTEIN DATABASES SWISS-PROT - Manually curated (EBI/SIB) TrEMBL - Translation of EMBL (EBI) PIR - annotated sequences (NCBI) GenPept -GenBank translations NRL_3D - Sequences from PDB OWL - Non-redundant sequences RefSeq - Non-redundant sequence set Kabat & IMGT - Immunological proteins

PIR (Protein Information Resource) Sources: GenBank/EMBL/DDBJ translations, literature, direct submissions -PIR-PSD (merging, annotation, classification) -PIR-Archive (original sequences) Total ~ non-redundant sequences

Annotation in PIR Annotation is from literature and available databases Uses controlled vocabulary and std nomenclature (Enzyme nomenclature) Includes status tags “validated, expt’l, similarity, predicted, absent” Classification into superfamilies and homology domain superfamilies Classification is used for applying common annotation to similar sequences and integrity checks

Example of a PIR entry (1) Link to list of entries for this species Acc no.s of sequences merged with this entry Link to other entries with same citation Links to EMBL/GenBank/DDBJ etc Link creates sequence reported for this reference

Example of a PIR entry (2) Link of entries classified into this superfamily or with this domain List of other PIR entries with this feature List of entries with these keywords Link to PDB entry for this sequence Alignments involving this protein

Example of a PIR entry (3) Link from top of entry page to Composition Table

Searching PIR for superfamily annotation Automated classification of full-length sequences >99% -families >70% -superfamilies -Use 50% identity for clustering of proteins into families -Also cluster into homology domain superfamilies

GenPept

NRL_3D Database html Protein database of sequences with 3D structure in PDB

NRL_3D Example entry (1)

NRL_3D Example entry (2)

OWL Non-redundant protein database derived from SWISS- PROT, PIR, GenBank (translations) and NRL_3D 279,796 entries, small because of strict redundancy criteria All identical and trivially-different sequences (i.e. those having a single amino acid change) are removed SWISS-PROT is highest priority, NRL_3D lowest

RefSeq Reference sequence standards for genomes, transcripts and proteins for human, mouse and rat Manually curated, non-redundant, status (genome annotation, predicted, provisional, reviewed) Includes data from NCBI Human Genome Annotation Project

SWISS-PROT A curated protein sequence data bank established in July 1986 by Amos Bairoch in Geneva and now maintained collaboratively with EMBL Contains manually annotated protein sequence entries (but >60% of all seq with some basic biochemical characterisation) Distinguishes between expt’l and comput’l derived annotation

SWISS-PROT STATISTICS SWISS-PROT entries amino acids abstracted from > references linked by > direct pointers to 35 related or specialized data collections

Example of a SWISS-PROT entry

The annotation is mainly found in: Comment (CC) lines Feature table (FT) Keyword (KW) lines Description (DE) lines

The topics of the CC lines are: ALTERNATIVE PRODUCTS CATALYTIC CAUTION COFACTOR DEVELOPMENTAL STAGE DISEASE DOMAIN ENZYME REGULATION FUNCTION INDUCTION MASS SPECTROMETRY PATHWAY PHARMACEUTICALS POLYMORPHISM PTM SIMILARITY SUBCELLULAR LOCATION SUBUNIT TISSUE SPECIFICITY

The FT keys are handling: Change indicators Amino-acid modifications Regions Secondary structure Other features

Change indicators are: CONFLICT - Different papers report differing sequences VARIANT - Authors report that sequence variants exist VARSPLIC - Description of sequence variants produced by alternative splicing MUTAGEN - Site which has been experimentally altered

Amino-acid modifications are: MOD_RES - Post-translational modification of a residue LIPID - Covalent binding of a lipidic moiety DISULFID - Disulfide bond THIOLEST - Thiolester bond THIOETH - Thioether bond CARBOHYD - Glycosylation site METAL - Binding site for a metal ion BINDING - Binding site for any chemical group (co- enzyme, prosthetic group, etc.)

Regions: SIGNAL TRANSIT PROPEP CHAIN PEPTIDE DOMAIN CA_BIND DNA_BIND NP_BIND TRANSMEM ZN_FING SIMILAR REPEAT

Other features are: ACT_SITE - Amino acid(s) involved in the activity of an enzyme SITE - Any other interesting site on the sequence INIT_MET - The sequence is known to start with an initiator methionine NON_TER - The residue at an extremity of the sequence is not the terminal residue NON_CONS - Non consecutive residues UNSURE - Uncertainties in the sequence

The KW lines : around 800 different keywords keyword dictionary available Controlled use of the keywords has cross- references DBXREFS – crossreferences to about 30 databases including pattern dbs, specialised genome dbs, other sequence dbs

Annotation sources: publications that report new sequence data review articles to periodically update the annotation of families or groups of proteins external experts

: SWISS-PROT ceased to be in the public domain

What has changed No changes for academic users Almost no restrictions on the redistribution of SWISS-PROT by academic servers or software companies Commercial users are required to pay yearly subscription fees. These fees will be used to complement the existing grants in order to provide stable long-term funding

SWISS-PROT Growth

DNA sequence database growth

The Bottleneck: Manual annotation

TrEMBL We cannot cope with the speed with which new data is coming out We do not want to dilute the quality of SWISS-PROT Solution: TrEMBL (TRanslation of EMBL): contains all translations of CDS in the Nucleotide Sequence Database not in SWISS-PROT TrEMBL is automatically generated and annotated using software tools

TrEMBL production EMBLNEW flatfile CDS scanning, translation and SWISS-PROT formatting protein_id in SP+TrEMBL TrEMBLnew REM-TrEMBL Smalls.dat Synth.dat Pseudo.dat Immuno.dat Patent.dat Truncated.dat SP-TrEMBL Redundancy checks Identical matches Sub-fragment matches Variants,conflicts... Automatic annotation (Prosite,PFAM, Rulebase, ENZYME, MGD, Flybase…) TrEMBL SWISS-PROT

SWISS-PROT + TrEMBL ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/ SWISS-PROT entries TrEMBL entries weekly production of a non-redundant and comprehensive protein sequence database consisting of SWISS-PROT, TrEMBL, and TrEMBLnew: