UniProt - The Universal Protein Resource

1 UniProt - The Universal Protein Resource
Claire O’Donovan

2 Pre-UniProt Swiss-Prot: created in 1986; since 1987, a collaboration of the SIB and the EMBL/EBI; TrEMBL: created at the EBI in 1996 as a computer-annotated protein sequence database supplementing Swiss-Prot. It was introduced to deal with the increased data flow from genome projects.

3 UniProt Consortium

4 UniProt Consortium activities

5 The three-layered approach
The UniProt Archive (UniParc): UniProtKB + all other publicly available protein sequences (completeness).
The UniProt Reference Clusters (UniRef): non-redundant views of UniProtKB + selected UniParc sets (speed).
The UniProt Knowledgebase (UniProtKB): central database of annotated protein sequences and functional information; UniProtKB/Swiss-Prot + UniProtKB/TrEMBL.

6 The three layer approach
Interrelationship between the UniProt databases. This is a graphical representation of the relationships between the UniProt databases, as a summary of the previous slide:
- The UniProt Archive (UniParc) provides comprehensive coverage of all protein sequences, including new and revised protein sequences, from many sources.
- UniProtKB is a subset of the protein sequences from UniParc, based on quality annotation and sequence stability.
- The UniProt Reference Clusters (UniRef), derived by automatic procedures from the UniProt Knowledgebase and the UniProt Archive, provide complete coverage of sequence space while hiding redundant sequences from view.
For each organism with a completely sequenced genome we construct a non-redundant proteome reference set. The availability of complete non-redundant proteome sets for all organisms whose genomes have been fully sequenced is a prerequisite for many types of proteome-wide analysis. For example, it is essential for high-throughput mass-spectrometric identification experiments and, in general, for any type of comparative genomic study. As described in B , UniProt already constructs complete non-redundant proteome sets using a number of approaches. Each set and its analysis are made available shortly after the newly completed genome sequence appears in the nucleotide sequence databases. A standard procedure is used to create, from the UniProt Knowledgebase, proteome sets for bacterial, archaeal and non-metazoan eukaryotic genomes. For most microbial and fungal genomes these sets are obtained via the list of predicted coding sequences submitted by the genome centers as bona fide protein-coding “CDS

7 UniProt Archive UniParc is a non-redundant archive of protein sequences from the public databases. It contains only protein sequences (no annotations). It provides cross-references to the source databases. With these database goals, the summary of database tasks is:

8 UniProt Archive: Principles
UniParc is non-redundant. Each unique protein sequence is stored only once and is assigned a unique, stable UniParc identifier (e.g. UPI ). UniParc provides cross-references to the original source: active or retired. UniParc provides sequence versions.
UniParc stores each unique sequence only once and assigns it a unique UniParc identifier. UniParc handles all sequences simply as strings: all sequences that are 100% identical over their whole length are merged, regardless of whether they come from the same or from different species. Furthermore, UniParc cross-references the accession numbers of the source databases and provides sequence versions that are incremented each time the underlying sequence changes, thus making it possible to observe sequence changes in all source databases. The status flag “active” means that the sequence still exists in the source database, while the status flag “obsolete” indicates that the sequence no longer exists in the source database. UniParc records carry no annotation, since annotation is only true in the real context of the sequence: proteins with the same sequence may have different functions depending on species, tissue, developmental stage, etc. All this context-dependent information (if known at all) cannot be present in UniParc; providing it is the purpose of the UniProt Knowledgebase.
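The storage rules described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not UniParc's implementation: the class names, the `load` method and the `ARCH000001`-style identifiers are all hypothetical, and the sketch simply marks a cross-reference as inactive when its accession moves to a different sequence.

```python
from dataclasses import dataclass, field


@dataclass
class CrossReference:
    database: str
    accession: str
    version: int = 1     # incremented each time the source sequence changes
    active: bool = True  # False once the source no longer holds this sequence


@dataclass
class ArchiveEntry:
    identifier: str      # hypothetical format, not the real UPI scheme
    sequence: str
    xrefs: list = field(default_factory=list)


class SequenceArchive:
    """Toy archive that stores each unique sequence string exactly once."""

    def __init__(self):
        self._by_sequence = {}  # sequence -> ArchiveEntry
        self._latest = {}       # (database, accession) -> (entry, xref)

    def load(self, database, accession, sequence):
        # Sequences are compared purely as strings; species is ignored.
        entry = self._by_sequence.get(sequence)
        if entry is None:
            identifier = f"ARCH{len(self._by_sequence) + 1:06d}"
            entry = ArchiveEntry(identifier, sequence)
            self._by_sequence[sequence] = entry

        key = (database, accession)
        previous = self._latest.get(key)
        version = 1
        if previous is not None:
            old_entry, old_xref = previous
            if old_entry is entry:
                return entry            # unchanged in the source: nothing to do
            old_xref.active = False     # the old sequence is no longer in the source
            version = old_xref.version + 1

        xref = CrossReference(database, accession, version)
        entry.xrefs.append(xref)
        self._latest[key] = (entry, xref)
        return entry
```

Loading the same sequence from two source databases returns the same entry with two cross-references; reloading an accession with a changed sequence creates a new entry, bumps the version, and leaves an inactive cross-reference on the old entry.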

9 UniProt Reference Clusters Principles
UniRef provides non-redundant reference data collections. It allows faster and more informative sequence similarity searches. It includes UniProtKB and some data from UniParc. It merges across different species.
The concept of non-redundancy for the UniProt Reference Clusters is similar to that discussed before for the UniProt Knowledgebase. However, there are some differences. UniRef merges sequences automatically across different species and includes some data from UniParc (such as translations from highly unstable gene predictions), while merging in the Knowledgebase is restricted to curator-assisted merging of reliable and stable sequence data derived from a given gene in a given species. UniRef100 is based on all UniProt Knowledgebase records, as well as on the UniParc records that represent Ensembl protein translations, RefSeq data, and some other smaller data sets. The production of UniRef100 starts with the clustering of all these records by sequence identity. Identical sequences and subfragments are presented as a single UniRef100 entry containing the accession numbers of all the merged entries, the protein sequence, and links to the corresponding UniProt Knowledgebase and archive records. UniRef90 and UniRef50 are built from UniRef100 (see section C.2.3) to provide non-redundant sequence collections that let the scientific user community perform faster homology searches. They yield a size reduction of approximately 40% and 65%, respectively.
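As a rough illustration of the UniRef100 step described above (identical sequences and subfragments collapsed into one entry that lists all merged accessions), here is a minimal Python sketch. It assumes a naive substring test and an in-memory list, which is not how the production clustering works; the function names and record layout are hypothetical.

```python
def cluster_identical_and_subfragments(records):
    """Group (accession, sequence) pairs into clusters.

    A sequence joins the first cluster whose representative contains it as an
    exact substring, so identical sequences and subfragments are merged.
    Species is deliberately ignored, as the slide says for UniRef.
    """
    clusters = []  # each cluster: {"representative": str, "members": [accessions]}
    # Longest sequences first, so a fragment always meets its container earlier.
    ordered = sorted(records, key=lambda r: (-len(r[1]), r[0]))
    for accession, sequence in ordered:
        for cluster in clusters:
            if sequence in cluster["representative"]:
                cluster["members"].append(accession)
                break
        else:
            clusters.append({"representative": sequence, "members": [accession]})
    return clusters


def size_reduction(n_before, n_after):
    """Fraction by which the collection shrank, e.g. 0.4 for a 40% reduction."""
    return 1 - n_after / n_before
```

With four input records that collapse into two clusters, `size_reduction(4, 2)` gives 0.5; the 40% and 65% figures quoted for UniRef90 and UniRef50 are the same kind of ratio, taken relative to UniRef100.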

10 UniProt Reference Clusters Principles
UniRef100: merges identical sequences and subfragments. UniRef90: size reduction of 40%. UniRef50: size reduction of 65%.

11 UniProtKB/Swiss-Prot UniProtKB/TrEMBL
2007
UniProtKB/Swiss-Prot: non-redundant; high level of integration; high level of manual curation; contains 241,242 entries.
UniProtKB/TrEMBL: translations of CDS in EMBL/GenBank/DDBJ; automatic annotation; contains 3,313,265 entries.
2010
UniProtKB/Swiss-Prot: 515,000 entries. UniProtKB/TrEMBL: 11,000,000 entries.
The consortium is involved in various activities, including UniParc and UniRef, but the major curation effort is directed towards the UniProt Knowledgebase. So what is the UniProt Knowledgebase? It consists of two sections called Swiss-Prot and TrEMBL. The core distinguishing features of each of these sections are listed above.

12 UniProtKB/TrEMBL Automatically generated in a biweekly cycle from the data present in EMBL/GenBank/DDBJ and some other sources such as TAIR/SGD. Exclusions: pseudogenes, synthetic, immunoglobulins, patents, small sequences <8. Qualifiers: /product, /gene, /locus_tag. RefSeq and Ensembl.
Basically, UniProtKB/TrEMBL represents the great majority of the to-do list of the UniProt curators, so it is worth taking a few minutes to review what this section is. The main source of this database is the CDS features in the nucleotide sequence databases, to which the Sanger Centre is of course a major contributor. We parse various data from the CDS feature into the appropriate line types in TrEMBL, the most obvious being the /product, /gene and /locus_tag qualifiers. Please contact us for more information about this if you would like to optimise the data transfer. Some sequences are not available in the nucleotide sequence databases for various reasons, so we also try to capture this data from available sources such as TAIR/SGD. We are also working with RefSeq and Ensembl on a pipeline to establish what data is missing or different between the resources, in order to build consensus sets across the databases. Exclusions: sequences are excluded at the point of entry from the nucleotide sequence databases, and are also deleted from the database after discovery by curators. We welcome feedback on this.
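The exclusion rules and qualifier parsing on this slide could be sketched roughly as follows. This is a hypothetical illustration, not the TrEMBL pipeline: the dictionary layout, the category labels and the output field names are assumptions, and the slide's "<8" threshold is assumed here to count residues.

```python
# Threshold from the slide ("small sequences <8"); the unit is assumed to be residues.
MIN_LENGTH = 8

# Category labels are hypothetical stand-ins for the exclusions named on the slide.
EXCLUDED_CATEGORIES = {"pseudogene", "synthetic", "immunoglobulin", "patent"}


def keep_cds(cds):
    """Return True if a CDS feature passes the exclusion rules."""
    if set(cds.get("categories", ())) & EXCLUDED_CATEGORIES:
        return False
    return len(cds["translation"]) >= MIN_LENGTH


def to_record(cds):
    """Copy the /product, /gene and /locus_tag qualifiers into a protein record."""
    return {
        "protein_name": cds.get("product"),
        "gene_name": cds.get("gene"),
        "locus_tag": cds.get("locus_tag"),
        "sequence": cds["translation"],
    }


def build_records(cds_features):
    """Apply the exclusions, then convert the surviving CDS features."""
    return [to_record(cds) for cds in cds_features if keep_cds(cds)]
```

Filtering at the point of entry, as here, corresponds to the first half of the exclusion note; entries removed later by curators would need a separate deletion step that this sketch does not model.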

13 UniProtKB/TrEMBL Proteome annotation
Cross-references to other databases. Addition of relevant publications (e.g. PDB). Redundancy. Automatic annotation. Future plans for manual annotation, e.g. human proteome project.
After the TrEMBL entries are created from the underlying nucleotide sequence database entries, various post-processing steps are carried out to enhance the entries prior to manual annotation. Annotation is standardised according to the manual rules in the proteome database. Cross-references are added to 72 other databases. Where the cross-references to other databases facilitate the addition of relevant citations, this is also done, e.g. for PDB. Redundancy procedures merge records from the same species that are 100% identical. Automatic annotation procedures are applied based on the RuleBase and Spearmint systems. These are used very conservatively, leading to the enhancement of 25% of the database. There are future plans for partial manual annotation in TrEMBL to facilitate enhancement of protein and gene names, e.g. for the human proteome project.
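The redundancy step mentioned above (merging records from the same species that are 100% identical) differs from UniParc and UniRef in that species is part of the merge key. A minimal Python sketch, assuming records are plain dictionaries with hypothetical `species`, `sequence` and `accession` fields:

```python
from collections import defaultdict


def merge_identical_within_species(records):
    """Merge records whose sequences are 100% identical AND share a species."""
    groups = defaultdict(list)
    for record in records:
        # Species is part of the key, so identical sequences from
        # different species are kept as separate entries.
        groups[(record["species"], record["sequence"])].append(record["accession"])
    return [
        {"species": species, "sequence": sequence, "accessions": accessions}
        for (species, sequence), accessions in groups.items()
    ]
```

Two human records with the same sequence end up in one merged entry carrying both accessions, while a mouse record with that same sequence stays on its own.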

14 Literature Other databases Analysis tools External expertise
So, after the initial work by the TrEMBL team, the UniProt curators proceed with the manual annotation. This involves combining information from many sources: the literature, of course, but also other databases such as Ensembl and HAVANA, as well as using analysis tools and drawing on external expertise. The UniProt curators also actively search for proteins in sources other than the nucleotide sequence databases, e.g. submissions and journal scanning.

15 Capturing the correct sequence
The first step is to capture the correct sequence. The nucleotide sequence databases are archives by nature, and they store each sequence report in its own entry. The 100% identical reports are already merged at the TrEMBL level, as mentioned previously, but there will still be redundancy. Archive collections. Each sequence report stored in its own entry. Merging at 100% identity. Still some redundancy.

16 Sequence similarity searches
Identify potential merge candidates. Identify similar already curated entries. So the curators begin with a sequence similarity search in order to identify potential merge candidates, and also to identify similar entries already curated in Swiss-Prot, in order to speed up the annotation process.

17 Sequence comparison Sequence alignments
Identification of sequence differences. Helps in identifying underlying causes. Sequence alignments allow the identification of sequence differences and also help in identifying the underlying causes.
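To make the idea of reading differences off an alignment concrete, here is a small Python sketch. It uses the standard library's `difflib.SequenceMatcher`, which is a generic string-matching tool and not a biological alignment program, so it is only a stand-in for whatever alignment software curators actually use; the function name and output tuple layout are assumptions.

```python
from difflib import SequenceMatcher


def sequence_differences(reference, other):
    """List the differences between two protein sequences.

    Returns tuples of (kind, position, reference_segment, other_segment),
    where kind is "replace", "delete" or "insert" and position is the
    1-based position in the reference sequence.
    """
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, i1 + 1, reference[i1:i2], other[j1:j2]))
    return differences
```

A single substituted residue shows up as one "replace" entry, which a curator would then still have to interpret: the list of differences says where two reports disagree, not why.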

18 Causes of sequence differences
Polymorphisms, disease variants. Splice variants. Sequencing errors. Incorrect predictions. The usual causes of sequence differences cover both naturally occurring events, such as polymorphisms and splice variants, and events due to the sequencing process itself, such as sequencing errors and incorrect predictions, which have of course become less frequent in recent years. However, due to the archival nature of the nucleotide sequence databases, these errors still exist in the databases.

19

20 Sequence analysis Range of sequence analysis tools used to predict important sequence features. Use of most appropriate programs. Development of new predictive methods.

21 Evidence attribution System which allows linking of all information in an entry to its original source. Allows users: to trace the origin of all data; to differentiate easily between literature-derived and computational data; to assess data reliability.
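One way to picture evidence attribution is as a data structure in which every piece of annotation carries a pointer to its source. The Python sketch below is purely illustrative: the class names, the two evidence labels and the helper functions are assumptions, not the UniProt evidence model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    kind: str    # "literature" or "computational" (labels assumed for this sketch)
    source: str  # e.g. a citation, or the name of a prediction program


@dataclass(frozen=True)
class Annotation:
    topic: str
    value: str
    evidence: Evidence


def split_by_evidence(annotations):
    """Separate literature-derived annotations from computational ones."""
    literature = [a for a in annotations if a.evidence.kind == "literature"]
    computational = [a for a in annotations if a.evidence.kind == "computational"]
    return literature, computational


def sources(annotations):
    """Trace the origin of the data: the distinct sources behind an entry."""
    return sorted({a.evidence.source for a in annotations})
```

Because each annotation keeps its own evidence, a user can judge reliability item by item rather than for the entry as a whole.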

22 EBI curation projects Submissions Journal scanning
Species-specific curation: human, mouse, rat, C. elegans, Drosophila, Xenopus, zebrafish, S. cerevisiae, S. pombe. Protein family curation: kinases, keratins. UniProtKB-MSD collaboration. PTM standardisation.

23 UniProt distribution Biweekly distribution
Website access. FTP access. DVD of UniProtKB.

24 UniProt Web

