Protein Sequence Database: UniProt. Jennifer McDowall. EBI is an Outstation of the European Molecular Biology Laboratory.
2 Overview: 1) The UniProt databases 2) UniProt/SwissProt annotation 3) UniProt/TrEMBL automatic annotation 4) Using the uniprot.org website 5) Computational access
1) The UniProt databases
4 Source of protein sequence data. Individual scientists, large-scale sequencing projects and patent offices submit nucleotide sequencing data to the nucleotide sequence database; protein sequence is then derived from it for the protein sequence database. Protein sequencing is rare: most protein sequence is derived from nucleotide data.
5 Protein sequence is mainly derived data. DNA sequence (submit): ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT -> transcribe -> derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC -> translate -> derived protein sequence: MRSDECCCAMSC
6 Protein sequence is mainly derived data. DNA sequence (submit): ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT -> transcribe -> derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC -> translate -> derived protein sequence: MRSDECCCAMSC. The predicted start, predicted splice sites and predicted stop may not have direct evidence.
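The transcribe and translate steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not how submission pipelines actually derive protein sequences: it assumes the standard genetic code, a single reading frame starting at the first base, and no splicing, and the function names are made up for this example. Under the standard code the fourth codon of the slide's mRNA, GAU, encodes Asp (D).

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# at each of the three positions (UUU, UUC, UUA, UUG, UCU, ...).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Turn coding-strand DNA into mRNA by replacing T with U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate reading frame 1, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"  # mRNA from the slide
print(translate(mrna))
```

Real gene models also have to predict the start, splice sites and stop, which is where most of the uncertainty mentioned on the slide comes from.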
7 How to find the information you need? [Figure: scattered raw nucleotide sequence fragments.] High quality protein sequence: non-redundant data; splice isoforms, disease variants, PTMs; sequence archiving essential. Protein identification: stable identifiers; consistent nomenclature. Protein annotation: information on protein function, biological processes, molecular interactions, pathways.
8 UniProt: since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD. Funded mainly by the NIH (US) to be the highest quality, most thoroughly annotated protein sequence database.
9 UniProt Consortium
10 Where does the data come from? Sequence sources UniParc ENA exchange data daily
11 Where does the data come from? Sequence sources (ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…) feed UniParc (history of sequences). Sequences with known taxonomy go to UniProtKB/TrEMBL; metagenomic and environmental sequences go to UniMES. Removing redundancy and adding manual annotation gives UniProtKB/SwissProt (high quality annotation).
12 Where does the data come from? Same flow as the previous slide (sequence sources -> UniParc -> UniProtKB/TrEMBL and UniMES -> UniProtKB/SwissProt), with clustering added: UniRef Clusters and UniMES Clusters.
13 4 components of UniProt. UniProtKB: Swiss-Prot (non-redundant, manual annotation) and TrEMBL (redundant, automatic annotation). UniRef: combines sequences to speed searching (UniRef100, UniRef90, UniRef50). UniParc: complete history of sequences (no annotation); cross-links to external sequence sources. UniMES: sequences from metagenomic projects.
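The UniParc idea of storing each distinct sequence once, with cross-links back to every source database that contains it, can be sketched as below. This is a toy model only: the `UPI_TOY_` identifiers, the use of SHA-256 as the grouping key and the source record names are assumptions for illustration, not UniParc's actual scheme.

```python
import hashlib
from collections import defaultdict


def build_archive(records):
    """Group (source_db, source_id, sequence) records by exact sequence.

    Returns {toy_id: {"sequence": str, "xrefs": [(source_db, source_id), ...]}}.
    """
    xrefs_by_digest = defaultdict(list)
    sequence_by_digest = {}
    for source_db, source_id, sequence in records:
        seq = sequence.upper()
        digest = hashlib.sha256(seq.encode("ascii")).hexdigest()
        xrefs_by_digest[digest].append((source_db, source_id))
        sequence_by_digest[digest] = seq

    archive = {}
    for n, (digest, xrefs) in enumerate(xrefs_by_digest.items(), start=1):
        archive[f"UPI_TOY_{n:04d}"] = {  # hypothetical identifier format
            "sequence": sequence_by_digest[digest],
            "xrefs": xrefs,
        }
    return archive


# Hypothetical source records: two databases report the same sequence.
records = [
    ("ENA", "src-1", "MRSDECCCAMSC"),
    ("RefSeq", "src-2", "MRSDECCCAMSC"),
    ("Ensembl", "src-3", "MKTAYIAKQR"),
]
archive = build_archive(records)
```

Three input records collapse to two archive entries, and the first entry keeps cross-links to both of its sources.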
14 Browsing a UniParc entry: accession; sequence; list of databases containing the sequence; navigate to individual entries; deleted entries identified (greyed out); download data.
15 Browsing a UniProtKB/SwissProt entry: names (synonyms) and taxonomy; protein attributes; general information; annotation; ontologies; protein interactions; splice variants; sequence features; sequence; references; navigate to external data sources (e.g. Ensembl); download data.
16 Browsing a UniRef90 entry: cluster name; list of entries in cluster; taxonomy of each entry; % identity of sequences in cluster; status (SwissProt and/or TrEMBL). Faster and more sensitive sequence search with no loss of information.
17 Taxonomic distribution of species. All kingdoms: Bacteria (61%), Eukaryota (32%), Archaea (4%), Viruses (3%). Within Eukaryota: Other mammals (27%), Fungi (18%), Viridiplantae (18%), Homo (12%), Other Vertebrata (10%), Other (8%), Insecta (5%), Nematoda (2%).
18 SwissProt – most represented species Mainly model organisms
19 Protein Existence tag. Protein existence level (Total): Evidence at protein level 13%; Evidence at transcript level 12%; Inferred from homology 70%; Predicted 5%; Uncertain (mainly TrEMBL) -. !! Not sequence validation !!
20 Protein existence categories. Protein existence level (Human): Evidence at protein level 59%; Evidence at transcript level 37.5%; Inferred from homology 1%; Predicted 0.5%; Uncertain (mainly TrEMBL) 2%. !! Not sequence validation !!
2) UniProtKB/SwissProt annotation
22 Annotation sources for UniProtKB. Manual curation: literature-based annotation; sequence analysis (transmembrane prediction, InterPro classification, signal prediction, other predictions). Automated annotation: protein classification. Some data sources for annotation: PRIDE (protein identification data), GO (functional info), InterPro (protein families and domains), IntAct (molecular interactions), IntEnz (enzymes), HAMAP (microbial protein families), RESID (post-translational modifications).
23 Features of UniProtKB: sequence; annotations; nomenclature; references; ontologies; splice variants; sequence features.
24 A wealth of external links (125 links!)
2D gel DBs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Family and domain DBs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Organism-specific DBs: AGD, ArachnoServer, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
Protein family/group DBs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Genome annotation DBs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Enzyme & pathway DBs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene
3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR
PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite
Proteomic DBs: PeptideAtlas, PRIDE, ProMEX
Protein-protein interaction DBs: DIP, IntAct, STRING
Phylogenomic DBs: HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB
Polymorphism DBs: dbSNP
Gene expression DBs: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
Ontologies: GO
Others: BindingDB, DrugBank, NextBio, PMAP-CutDB
25 SwissProt manual annotation. 1. Protein sequence: merge available CDS (coding sequence); annotate sequence discrepancies; report sequencing errors... 2. Biological information: extract literature information; orthologue data propagation; protein sequence analysis...
26 Problem #1: sequence correction. ~20% of Swiss-Prot entries required correction. Typical problems: unsolved conflicts (sequencing errors); erroneous gene model predictions; wrong initiation sites; frameshifts...
27 Sequence quality from genome projects. Drosophila (well-curated): 1.8% of gene models incorrect. Arabidopsis (annotated when sequenced, but no update): 19.5% of gene models incorrect. Tetraodon nigroviridis (automatic run-through, no manual intervention): >90% of gene models incorrect.
28 Sequence curation: sequencing errors. Other examples of sequencing errors include premature stop codons, read-throughs and erroneous initiator methionines.
29 Problem #2: proteome complexity. 1 SwissProt entry = 1 gene (1 species). Genome: ~20,000 human protein-coding genes. Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...). Proteome: >1,000,000 human proteins (post-translational modification). Annotation of sequence differences.
30 Merging entries. Multiple entries for the same protein exist in TrEMBL (redundancy) because of: 1) Errors: erroneous gene model predictions; sequence errors. 2) Natural variation: polymorphisms; alternative start sites; alternative splicing. Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.
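One way to picture the curator's starting point when merging is a pairwise comparison that lists where two reports of the same protein disagree. The sketch below uses `SequenceMatcher` from Python's standard `difflib`, which is a generic text-diff tool and not a biological aligner; the two sequences and the function name are invented for illustration.

```python
from difflib import SequenceMatcher


def sequence_differences(reference: str, other: str):
    """List differences as (operation, 1-based reference position,
    reference residues, other residues).

    operation is one of "replace", "delete" or "insert".
    """
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append((tag, i1 + 1, reference[i1:i2], other[j1:j2]))
    return differences


# Hypothetical pair: the second report differs at one residue.
report_a = "MKTAYIAKQR"
report_b = "MKTAYLAKQR"
print(sequence_differences(report_a, report_b))
```

Identical sequences give an empty list, which corresponds to the case the slide says needs no curator analysis. Anything else would be handed to a curator to decide whether it is an error or natural variation.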
31 Example Multiple alignment of the end of the available GCR sequences: Annotation of the sequence differences (protein diversity):
32 Merging entries
33 Sequence curation Alternative Splicing
38 Sequence curation: identification of amino acid variants... and of PTMs... and also
39 Sequence curation Domain annotation Binding sites
40 SwissProt manual annotation. 1. Protein sequence: merge available CDS (coding sequence); annotate sequence discrepancies; report sequencing errors... 2. Biological information: extract literature information; orthologue data propagation; protein sequence analysis...
41 Sources of annotated information. UniProtKB/SwissProt gathers information from multiple sources: publications (literature/PubMed); prediction programs (Prosite, Anabelle); contact with experts; other databases; nomenclature committees.
42 Nomenclature Synonyms useful for literature searching
43 Nomenclature Provides synonyms and cleavage products of bifunctional proteins
44 Annotation comments Controlled vocabularies used whenever possible… >30 comment fields
45 Disease association Mendelian Inheritance in Man provides information on genetic disease associations Pharmacogenomics database
46 Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein…
47 Sequence annotation (Features) Feature (e.g. domain) highlighted on sequence
48 Gene Ontology. 1. Biological Process: a commonly recognized series of events (cell division, mitosis, organelle fission). 2. Molecular Function: an elemental activity, task or job (protein kinase activity, insulin binding, insulin receptor activity). 3. Cellular Component: where a gene product is located (mitochondrion, mitochondrial matrix, mitochondrial membrane).
49 Gene Ontology Annotation for human Rhodopsin:
50 Imported annotation. Binary interactions are taken from the IntAct database. Example: interactors of human p53.
51 Evidence for annotation. UniProtKB/Swiss-Prot distinguishes between experimental and predicted data. Type of evidence -> evidence tag: 1st, experimental evidence -> reference provided; 2nd, light experimental evidence -> "Probable"; 3rd, inferred by similarity with homologous protein -> "By similarity"; 4th, inferred by sequence prediction -> "Potential".
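The four evidence levels can be treated as an ordered scale, for example to keep only annotations at or above a chosen level. The sketch below assumes annotations arrive as plain strings whose non-experimental status is marked by a trailing "(Probable)", "(By similarity)" or "(Potential)" qualifier. That string layout and the function names are assumptions for illustration, not a description of the UniProt file format.

```python
import re

# Rank 1 is the strongest evidence, following the ordering on the slide.
QUALIFIER_RANK = {"probable": 2, "by similarity": 3, "potential": 4}
_QUALIFIER_RE = re.compile(
    r"\s*\((probable|by similarity|potential)\)\s*\.?\s*$", re.IGNORECASE
)


def evidence_rank(annotation: str):
    """Return (rank, text without qualifier).

    An annotation with no qualifier is treated as rank 1 (assumption).
    """
    match = _QUALIFIER_RE.search(annotation)
    if match is None:
        return 1, annotation.strip()
    rank = QUALIFIER_RANK[match.group(1).lower()]
    return rank, annotation[:match.start()].strip()


def at_least(annotations, weakest_rank):
    """Keep annotations whose evidence rank is weakest_rank or stronger."""
    return [a for a in annotations if evidence_rank(a)[0] <= weakest_rank]


# Hypothetical annotation strings.
examples = [
    "Binds ATP",
    "Protein kinase (By similarity)",
    "Transmembrane helix (Potential).",
]
print(at_least(examples, 3))
```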
52 Evidence for annotation Proven Potential Proven By similarity
53 Sources references included
54 Versioning and archiving
55 Versioning and archiving Able to compare versions directly
56 Versioning and archiving
3) UniProtKB/TrEMBL automatic annotation
58 UniProtKB/TrEMBL !! Caution !! Quality of UniProtKB/TrEMBL entries depends upon quality of submissions in original EMBL/GenBank/DDBJ entry.
59 Annotated proteins guide TrEMBL entries. Don’t want un-annotated TrEMBL entries to be skeleton entries with no information, so automatic annotation is added using Swiss-Prot and InterPro (function prediction database). Example for rhodopsin: 379 annotated UniProtKB/Swiss-Prot entries; 9,186 un-annotated UniProtKB/TrEMBL entries.
60 Automatic annotation. UniProtKB uses 2 prediction programs, both drawing on InterPro and Swiss-Prot. UniRule: maintains a set of manual annotation rules. SAAS: generates a set of decision trees using data mining (new set every UniProtKB release).
61 Automatic annotation - InterPro: automatic annotation pipeline. Swiss-Prot supplies manually annotated sequences; TrEMBL supplies uncharacterised sequences. InterPro protein signatures identify groups of related proteins (same family or share domains).
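A rule-based annotator in the spirit of UniRule can be pictured as a list of conditions on protein signatures and taxonomy, each carrying an annotation to transfer. The rule contents, the signature identifiers (`SIG_A`, `SIG_B`) and the field names below are hypothetical; real rules are curated and considerably richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    required_signatures: frozenset  # signatures the entry must all match
    taxon: str                      # rule applies only within this taxon
    annotation: str                 # annotation transferred when the rule fires


def annotate(entry_signatures, entry_lineage, rules):
    """Return the annotations of every rule whose conditions the entry meets."""
    signatures = set(entry_signatures)
    return [
        rule.annotation
        for rule in rules
        if rule.required_signatures <= signatures and rule.taxon in entry_lineage
    ]


# Hypothetical rules for illustration only.
RULES = [
    Rule(frozenset({"SIG_A"}), "Eukaryota", "Belongs to example family A"),
    Rule(frozenset({"SIG_A", "SIG_B"}), "Metazoa", "Example light-sensitive receptor"),
]

print(annotate(["SIG_A", "SIG_B"], ["Eukaryota", "Metazoa"], RULES))
```

An un-annotated entry that matches no rule simply gets no annotation, which is the "skeleton entry" case the pipeline tries to minimise.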
62 Browsing a UniProtKB/TrEMBL entry: name (could be clone name); taxonomy; automatic annotation (derived from InterPro); ontologies (both automatic and manual curation).
4) Using the website
64 Useful Features Integrated BLAST and Alignments Batch retrieval in a variety of formats Simple and modular advanced searching
65 uniprot.org: anatomy of an entry Entry Info Link to UniSave Link to UniRef Variety of formats Navigation bar Customize order
67 Searching UniProt Search tools include: Text Search Blast sequence search Additional search engines through EBI (e.g. SSearch and FASTA)
68 Search: powerful text search tool with autocompletion and refinement options. Look for UniProt entries and documentation using biological information.
69 Search Search sequence database, literature, taxonomy… More search options
70 Search Refine search
71 Search results
72 Search results Define type and order of search results
73 Search results: each result is linked to the UniProt entry and marked as SwissProt or TrEMBL. Select specific entries.
74 Search results Can retrieve or BLAST sequence Keeps selected entries throughout session
75 Search results Can retrieve or align >2 sequences
76 BLAST: a tool with standard options to search the UniProt databases by sequence. Search refinement (change parameters).
77 BLAST Can query using protein or nucleotide sequences
78 BLAST: can query using an identifier: UniProtKB accession (P00750); specific version (P00750:2); splice variant (P ); name (A4_HUMAN); UniParc accession (UPI ); UniRef accession (UniRef100_P00750)
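The identifier types accepted by the BLAST form can be told apart by their shape. The sketch below is a hedged illustration: the accession pattern follows the format UniProt documents for accession numbers, while the patterns for versions, splice variants, entry names, UniParc and UniRef identifiers are assumptions inferred from the examples on the slide and from common usage, and `classify_identifier` is a made-up helper, not part of any UniProt tool.

```python
import re

# UniProtKB accession number pattern (6 or 10 characters).
ACCESSION = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Checked in order; the more specific shapes come first.
PATTERNS = [
    ("UniRef accession", re.compile(r"UniRef(?:100|90|50)_\w+")),
    ("UniParc accession", re.compile(r"UPI[0-9A-F]{10}")),           # assumed shape
    ("UniProtKB specific version", re.compile(ACCESSION + r":\d+")),
    ("UniProtKB splice variant", re.compile(ACCESSION + r"-\d+")),   # assumed shape
    ("UniProtKB accession", re.compile(ACCESSION)),
    ("UniProtKB name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),  # assumed shape
]


def classify_identifier(identifier: str) -> str:
    """Return a label for the identifier type, or "unrecognised"."""
    text = identifier.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "unrecognised"


for example in ["P00750", "P00750:2", "A4_HUMAN", "UniRef100_P00750"]:
    print(example, "->", classify_identifier(example))
```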
79 BLAST: Threshold = expectation (E) value. Provides the cut-off between good and poor hits. Result legend: best; should verify; biological significance less likely.
80 BLAST Matrix = assigns probability score for each position Controls sensitivity of search
81 BLAST: Filtering = masks low complexity regions. Stretches of cysteines or hydrophobic regions can cause spurious matches; filtering replaces them with X’s.
82 BLAST: Gapped = allows gaps in sequence. Yes = to find more distant homologues. No = to find closest matches (strict).
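The effect of filtering can be illustrated with a deliberately crude masker that replaces long runs of a single residue with X's. BLAST's real filter is more sophisticated (it assesses local compositional complexity rather than looking for literal repeats), so the function name, the run-length rule and the `min_run` threshold below are assumptions for illustration only.

```python
import re


def mask_homopolymer_runs(sequence: str, min_run: int = 5) -> str:
    """Replace runs of at least min_run identical residues with X's.

    The masked sequence keeps its original length, so alignment
    coordinates are unchanged.
    """
    if min_run < 2:
        raise ValueError("min_run must be at least 2")
    # One residue followed by at least (min_run - 1) copies of itself.
    run = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return run.sub(lambda m: "X" * len(m.group(0)), sequence)


# Hypothetical query with a stretch of six cysteines.
print(mask_homopolymer_runs("MKCCCCCCAL"))
```

Masked positions can no longer seed or extend a match, which is why a query dominated by such stretches returns fewer spurious hits once filtering is switched on.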
83 BLAST Hits = limits number of results
84 BLAST results Can filter or customize results
85 BLAST results Shows length of query sequence aligned Select match to see alignment
86 BLAST results – pairwise alignment Alignment of selected sequence
87 BLAST results – pairwise alignment Colour alignment by annotation or properties
88 BLAST results: further down the results page… details about matching protein sequences
89 BLAST results Can align checked sequences
90 BLAST results – multiple alignment Alignment of selected sequence Can add additional sequences to alignment
91 BLAST results – multiple alignment Colour alignment by annotation or properties
92 Align: ClustalW multiple alignment tool with amino acid highlighting options and a feature annotation highlighting option
93 Retrieve (UniProt-specific tool): retrieve a list of entries in several standard formats, then query the retrieved sequences with the UniProt search tool.
94 ID Mapping Allows mapping between different databases for a given protein
95 Other tools Sequence Similarity & Analysis
96 Other tools BLAST FASTA specialized searches
5) Computational access
98 Computational access to UniProt
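As one example of computational access, an entry can be fetched over HTTP and parsed with a few lines of code. The sketch below assumes the REST URL layout `https://rest.uniprot.org/uniprotkb/<accession>.fasta` and the `>db|accession|name description` FASTA header convention; both should be checked against the current UniProt documentation, and the helper names are made up for this example. The network call is kept in its own function so that the parser can be used on its own.

```python
import urllib.request

BASE_URL = "https://rest.uniprot.org/uniprotkb"  # assumed endpoint layout


def fasta_url(accession: str) -> str:
    """Build the download URL for one entry in FASTA format."""
    return f"{BASE_URL}/{accession}.fasta"


def fetch_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download one entry as FASTA text (requires network access)."""
    with urllib.request.urlopen(fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_fasta(text: str) -> dict:
    """Parse single-record FASTA text with a db|accession|name header."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("text does not start with a FASTA header")
    identifier, _, description = lines[0][1:].partition(" ")
    parts = identifier.split("|")
    if len(parts) != 3:
        raise ValueError("header is not in db|accession|name form")
    db, accession, name = parts
    return {
        # "sp" marks Swiss-Prot entries and "tr" marks TrEMBL entries.
        "section": {"sp": "Swiss-Prot", "tr": "TrEMBL"}.get(db, db),
        "accession": accession,
        "name": name,
        "description": description,
        "sequence": "".join(lines[1:]),
    }
```

A typical use would be `parse_fasta(fetch_fasta("P00750"))`; for many entries, batch retrieval is preferable to one request per accession.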
100 Acknowledgements: Rolf Apweiler, Ioannis Xenarios, Cathy H Wu, +100 annotators