Published by Hector Hodges. Modified over 8 years ago.
1
Protein Sequence Database: UniProt
Jennifer McDowall
EBI is an Outstation of the European Molecular Biology Laboratory.
2
Overview
1) The UniProt databases
2) UniProt/SwissProt annotation
3) UniProt/TrEMBL automatic annotation
4) Using the uniprot.org website
5) Computational access
3
1) The UniProt databases
4
Source of protein sequence data
Submitters: individual scientists, large-scale sequencing projects, patent offices
Nucleotide sequencing -> submit -> nucleotide sequence database -> derive protein sequence -> protein sequence database
Protein sequencing -> submit -> protein sequence database
Protein sequencing is rare. Most protein sequence is derived from nucleotide data.
5
Protein sequence is mainly derived data
DNA sequence (submitted): ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT
Derived mRNA sequence (transcribe): AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
Derived protein sequence (translate): MRSDECCCAMSC
6
Protein sequence is mainly derived data
DNA sequence (submitted): ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT
Derived mRNA sequence (transcribe): AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC
Derived protein sequence (translate): MRSDECCCAMSC
Caveat: the predicted start, predicted stop and predicted splice sites may not have direct evidence.
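The transcribe and translate steps can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code, not part of any UniProt tool: the function names are made up for this example, and `transcribe` assumes the DNA given is the coding (sense) strand. Note that the fourth codon of the example mRNA, GAU, encodes aspartate (D).

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G
# (third base varies fastest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a DNA coding strand (assumption: sense strand given)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# The derived mRNA sequence shown on the slide.
print(translate("AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"))  # MRSDECCCAMSC
```

A real gene model also needs the start codon, stop codon and splice sites to be located first, which is where the predicted features can go wrong.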
7
How to find the information you need?
- Sequence archiving essential
- High quality protein sequence: non-redundant data; splice isoforms, disease variants, PTMs
- Protein identification: stable identifiers; consistent nomenclature
- Protein annotation: information on protein function, biological processes, molecular interactions, pathways
8
UniProt
Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD.
Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database.
http://www.uniprot.org/
9
99 UniProt Consortium
10
10 Where does the data come from? Sequence sources UniParc ENA exchange data daily
11
Where does the data come from?
Sequence sources: ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…
UniParc: history of sequences
UniMES: metagenomic & environmental
UniProtKB/TrEMBL: taxonomy known
UniProtKB/SwissProt: remove redundancy, manual annotation; high quality annotation
12
Where does the data come from?
Sequence sources: ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…
UniParc
UniMES (metagenomic & environmental) -> UniMES Clusters
UniProtKB/TrEMBL and UniProtKB/SwissProt (taxonomy known) -> UniRef Clusters
13
4 components of UniProt
- UniProtKB: Swiss-Prot (non-redundant, manual annotation) and TrEMBL (redundant, automatic annotation)
- UniRef: combines sequences (speed searching); UniRef100, UniRef90, UniRef50
- UniParc: complete history of sequences (no annotation); cross-links to external sequence sources
- UniMES: sequences from metagenomic projects
14
Browsing a UniParc entry
Callouts: accession; sequence; list of databases containing sequence; navigate to individual entries; deleted entries identified (greyed out); download data
15
Browsing a UniProtKB/SwissProt entry
Callouts: names (synonyms) and taxonomy; protein attributes; general information; annotation; ontologies; protein interactions; splice variants; sequence features; sequence; references; navigate to external data sources (e.g. Ensembl); download data
16
Browsing a UniRef90 entry
Callouts: cluster name; list of entries in cluster; status (SwissProt and/or TrEMBL); taxonomy of each entry; % identity of sequences in cluster
Faster and more sensitive sequence search with no loss of information
17
Taxonomic distribution of species
All kingdoms: Bacteria (61%), Eukaryota (32%), Archaea (4%), Viruses (3%)
Within Eukaryota: Other mammals (27%), Fungi (18%), Viridiplantae (18%), Homo (12%), Other Vertebrata (10%), Other (8%), Insecta (5%), Nematoda (2%)
18
18 SwissProt – most represented species Mainly model organisms
19
Protein Existence tag
Protein existence level (share of total entries):
- Evidence at protein level: 13%
- Evidence at transcript level: 12%
- Inferred from homology: 70%
- Predicted: 5%
- Uncertain (mainly TrEMBL): -
!! Not sequence validation !!
20
Protein existence categories
Protein existence level (share of human entries):
- Evidence at protein level: 59%
- Evidence at transcript level: 37.5%
- Inferred from homology: 1%
- Predicted: 0.5%
- Uncertain (mainly TrEMBL): 2%
!! Not sequence validation !!
21
2) UniProtKB/SwissProt annotation
22
Annotation sources for UniProtKB
Data sources feeding UniProtKB:
* Manual curation
* Literature-based annotation
* Sequence analysis: transmembrane prediction, InterPro classification, signal prediction, other predictions, protein classification
* Automated annotation
Some data sources for annotation:
* GO: functional info
* PRIDE: protein identification data
* InterPro: protein families and domains
* IntAct: molecular interactions
* IntEnz: enzymes
* HAMAP: microbial protein families
* RESID: post-translational modifications
23
Features of UniProtKB
Sequence; Annotations; Nomenclature; References; Ontologies; Splice variants; Sequence features
24
A wealth of external links (125 links!)
2D gel DBs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Family and domain DBs: Gene3D, PIRSF, HAMAP, PRINTS, InterPro, ProDom, PANTHER, PROSITE, Pfam, TIGRFAMs, SMART
Organism-specific DBs: DictyBase, AGD, EchoBASE, CGD, EcoGene, CTD, euHCVdb, CYGD, FlyBase, HGNC, GeneCards, HPA, GeneFarm, MGI, Gramene, MIM, H-InvDB, RGD, LegioList, SGD, Leproma, TAIR, ListiList, ZFIN, MaizeGDB, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, SagaList, SubtiList, TubercuList, WormBase, WormPep, Xenbase, GeneDB_Spombe, ArachnoServer, BuruList
Protein family/group DBs: CAZy, MEROPS, PeroxiBase, REBASE, PptaseDB, TCDB
Genome annotation DBs: Ensembl, KEGG, GeneID, NMPDR, VectorBase, UCSC, GenomeReviews, TIGR
Enzyme & pathway DBs: BioCyc, BRENDA, Reactome, Pathway_Interaction_DB
Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene
3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR
PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite
Proteomic DBs: PeptideAtlas, PRIDE, ProMEX
Protein-protein interaction DBs: DIP, IntAct, STRING
Phylogenomic DBs: HOGENOM, OMA, HOVERGEN, PhylomeDB, InParanoid, OrthoDB
Polymorphism DBs: dbSNP
Gene expression DBs: ArrayExpress, Bgee, GermOnline, CleanEx, Genevestigator
Ontologies: GO
Others: BindingDB, PMAP-CutDB, DrugBank, NextBio
25
SwissProt manual annotation
1. Protein sequence: merge available CDS (coding sequence); annotate sequence discrepancies; report sequencing errors...
2. Biological information: extract literature information; orthologue data propagation; protein sequence analysis...
26
Problem #1: sequence correction
~20% of Swiss-Prot entries required correction
Typical problems:
- Unsolved conflicts (sequencing errors)
- Erroneous gene model predictions
- Wrong initiation sites
- Frameshifts...
27
Sequence quality from genome projects
- Drosophila: well-curated; 1.8% of gene models incorrect
- Arabidopsis: annotated when sequenced, but no update; 19.5% of gene models incorrect
- Tetraodon nigroviridis: automatic run through (no manual intervention); >90% of gene models incorrect
28
28 Sequence curation Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines Sequencing errors
29
Problem #2: proteome complexity
1 SwissProt entry = 1 gene (1 species)
- Genome: ~20,000 human protein-coding genes
- Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...)
- Proteome: >1,000,000 human proteins (post-translational modification)
Annotation of sequence differences
30
Merging entries
Multiple entries for the same protein exist in TrEMBL (redundancy) because of:
1) Errors: erroneous gene model predictions; sequence errors
2) Natural variation: polymorphisms; alternative start sites; alternative splicing
Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.
31
31 Example Multiple alignment of the end of the available GCR sequences: Annotation of the sequence differences (protein diversity):
32
32 Merging entries
33
33 Sequence curation Alternative Splicing
34
34 Sequence curation Alternative Splicing
35
35 Sequence curation Alternative Splicing
36
36 Sequence curation Alternative Splicing
37
37 Sequence curation Alternative Splicing
38
38 Sequence curation Identification of amino acid variants....and of PTMs....and also
39
39 Sequence curation Domain annotation Binding sites
40
SwissProt manual annotation
1. Protein sequence: merge available CDS (coding sequence); annotate sequence discrepancies; report sequencing errors...
2. Biological information: extract literature information; orthologue data propagation; protein sequence analysis...
41
Sources of annotated information
UniProtKB/SwissProt gathers information from multiple sources:
- Publications (literature/PubMed)
- Prediction programs (Prosite, Anabelle)
- Contact with experts
- Other databases
- Nomenclature committees
42
42 Nomenclature Synonyms useful for literature searching
43
43 Nomenclature Provides synonyms and cleavage products of bifunctional proteins
44
44 Annotation comments Controlled vocabularies used whenever possible… >30 comment fields
45
45 Disease association Mendelian Inheritance in Man provides information on genetic disease associations Pharmacogenomics database
46
46 Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein…
47
47 Sequence annotation (Features) Feature (e.g. domain) highlighted on sequence
48
Gene Ontology
1. Biological Process: a commonly recognized series of events (e.g. cell division, mitosis, organelle fission)
2. Molecular Function: an elemental activity or task or job (e.g. protein kinase activity, insulin binding, insulin receptor activity)
3. Cellular Component: where a gene product is located (e.g. mitochondrion, mitochondrial matrix, mitochondrial membrane)
49
49 Gene Ontology Annotation for human Rhodopsin:
50
50 Imported annotation Binary interactions are taken from the database Interactors of human p53
51
Evidence for annotation
UniProtKB/Swiss-Prot distinguishes between experimental and predicted data.
Type of evidence, with its evidence tag:
- 1st, experimental evidence: Reference provided
- 2nd, light experimental evidence: Probable
- 3rd, inferred by similarity with homologous protein: By similarity
- 4th, inferred by sequence prediction: Potential
52
52 Evidence for annotation Proven Potential Proven By similarity
53
53 Sources references included
54
54 Versioning and archiving
55
55 Versioning and archiving Able to compare versions directly
56
56 Versioning and archiving
57
3) UniProtKB/TrEMBL automatic annotation
58
58 UniProtKB/TrEMBL !! Caution !! Quality of UniProtKB/TrEMBL entries depends upon quality of submissions in original EMBL/GenBank/DDBJ entry.
59
Annotated proteins guide TrEMBL entries
Example for rhodopsin: 379 annotated UniProtKB/Swiss-Prot entries; 9,186 un-annotated UniProtKB/TrEMBL entries.
Don't want un-annotated TrEMBL to be skeleton entries with no information.
Automatic annotation added using Swiss-Prot and InterPro (function prediction database).
60
Automatic annotation
UniProtKB uses 2 prediction programs (inputs: InterPro, Swiss-Prot):
- UniRule: maintains a set of manual annotation rules.
- SAAS: generates a set of decision trees using data mining (new set every UniProtKB release).
61
Automatic annotation - InterPro
- Swiss-Prot: manually annotated sequence
- InterPro: protein signatures for groups of related proteins (same family or share domains)
- TrEMBL: uncharacterised sequence
- Automatic annotation pipeline
62
62 Browsing a UniProtKB/TrEMBL entry Name (could be clone name) Automatic annotation. (derived from InterPro) Ontologies (both automatic and manual curation) Taxonomy
63
4) Using the www.uniprot.org website
64
64 www.uniprot.org Useful Features Integrated BLAST and Alignments Batch retrieval in a variety of formats Simple and modular advanced searching
65
65 uniprot.org: anatomy of an entry Entry Info Link to UniSave Link to UniRef Variety of formats Navigation bar Customize order
66
66 uniprot.org: anatomy of an entry Entry Info Link to UniSave Link to UniRef Variety of formats Navigation bar Customize order
67
67 Searching UniProt Search tools include: Text Search Blast sequence search Additional search engines through EBI (e.g. SSearch and FASTA) http://www.uniprot.org/
68
Search
Powerful text search tool with autocompletion and refinement options. Look for UniProt entries and documentation using biological information.
69
69 Search Search sequence database, literature, taxonomy… More search options
70
70 Search Refine search
71
71 Search results
72
72 Search results Define type and order of search results
73
73 Search results Each result linked to the UniProt entry SwissProt TrEMBL SwissProt TrEMBL Select specific entries
74
74 Search results Can retrieve or BLAST sequence Keeps selected entries throughout session
75
75 Search results Can retrieve or align >2 sequences
76
BLAST
A tool with standard options to search sequences in UniProt databases by sequence.
Search refinement (change parameters)
77
77 BLAST Can query using protein or nucleotide sequences
78
BLAST
Can query using identifier (example: P00750):
- UniProtKB accession (P00750)
- Specific version (P00750:2)
- Splice variant (P00750-2)
- Name (A4_HUMAN)
- UniParc accession (UPI0000000001)
- UniRef accession (UniRef100_P00750)
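The identifier forms accepted by the BLAST form can be told apart with a few regular expressions. The sketch below is illustrative only: the accession pattern follows the published UniProtKB accession format, while the other patterns and the `classify` helper are simplified assumptions made for this example, not the website's own parser.

```python
import re

# UniProtKB accession format (6 or 10 characters).
ACCESSION = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Order matters: more specific forms are tested before the bare accession.
# All patterns except ACCESSION are approximations for illustration.
PATTERNS = [
    ("UniRef accession", re.compile(r"UniRef(?:100|90|50)_\w+")),
    ("UniParc accession", re.compile(r"UPI[0-9A-F]{10}")),
    ("UniProtKB specific version", re.compile(rf"{ACCESSION}:\d+")),
    ("UniProtKB splice variant", re.compile(rf"{ACCESSION}-\d+")),
    ("UniProtKB accession", re.compile(ACCESSION)),
    ("UniProtKB name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def classify(identifier):
    """Return a label for the identifier form, or 'unknown' if none matches."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


for example in ["P00750", "P00750:2", "P00750-2", "A4_HUMAN",
                "UPI0000000001", "UniRef100_P00750"]:
    print(example, "->", classify(example))
```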
79
BLAST
Threshold = expectation (E) value. Provides cut-off between good and poor hits.
Hit categories: best; should verify; biological significance less likely.
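To make the threshold concrete: the E-value is the number of hits with at least a given score expected by chance, commonly estimated with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw alignment score. The sketch below applies that formula and a cut-off. The parameters `lam` and `k` depend on the scoring matrix and gap penalties, and the values and hit names used here are hypothetical placeholders, not BLAST's own.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def hits_below_threshold(hits, query_len, db_len, lam, k, threshold):
    """Keep (name, E-value) pairs with E <= threshold, best hits first."""
    scored = [
        (name, expect_value(score, query_len, db_len, lam, k))
        for name, score in hits
    ]
    return sorted(
        (pair for pair in scored if pair[1] <= threshold),
        key=lambda pair: pair[1],
    )


# Hypothetical hits as (name, raw score); all parameter values are illustrative.
example_hits = [("hit_a", 250), ("hit_b", 60), ("hit_c", 25)]
for name, e_value in hits_below_threshold(
    example_hits, query_len=300, db_len=5_000_000, lam=0.267, k=0.041,
    threshold=10.0,
):
    print(f"{name}: E = {e_value:.3g}")
```

A lower threshold keeps only hits that are unlikely to arise by chance; a higher one lets through weaker matches that should be verified by hand.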
80
BLAST
Matrix = substitution matrix that assigns a score to each pair of aligned residues.
Controls sensitivity of search.
81
BLAST
Filtering = masks low complexity regions.
Stretches of cysteines or hydrophobic regions can cause spurious matches.
Replaces them with X's.
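The idea of masking can be illustrated with a simplified filter: slide a window along the sequence, measure its Shannon entropy, and replace low-entropy windows with X's. This is a sketch of the general technique only; it is not the algorithm BLAST uses, and the window size and entropy cut-off are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition."""
    total = len(segment)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(segment).values()
    )


def mask_low_complexity(sequence, window=12, min_entropy=2.2):
    """Replace every residue inside a low-entropy window with 'X'.

    window and min_entropy are illustrative defaults, not BLAST settings.
    """
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# A made-up sequence with a run of cysteines in the middle.
print(mask_low_complexity("ACDEFGHIKLMN" + "C" * 12 + "PQRSTVWYADEF"))
```

Because windows overlap, a few residues flanking the repetitive stretch may also be masked.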
82
BLAST
Gapped = allows gaps in sequence
- Yes = to find more distant homologues
- No = to find closest matches (strict)
83
83 BLAST Hits = limits number of results
84
84 BLAST results Can filter or customize results
85
85 BLAST results Shows length of query sequence aligned Select match to see alignment
86
86 BLAST results – pairwise alignment Alignment of selected sequence
87
87 BLAST results – pairwise alignment Colour alignment by annotation or properties
88
BLAST results
Further down the results page… details about matching protein sequences
89
89 BLAST results...... Can align checked sequences
90
90 BLAST results – multiple alignment Alignment of selected sequence Can add additional sequences to alignment
91
91 BLAST results – multiple alignment Colour alignment by annotation or properties
92
Align
ClustalW multiple alignment tool with options to highlight amino acid properties and feature annotation.
93
Retrieve
UniProt-specific tool:
- retrieve a list of entries in several standard formats.
- then query retrieved sequences with UniProt search tool.
94
94 ID Mapping Allows mapping between different databases for a given protein
95
95 Other tools http://www.ebi.ac.uk/ Sequence Similarity & Analysis
96
96 Other tools BLAST FASTA specialized searches http://www.ebi.ac.uk/Tools/sss/
97
5) Computational access
98
98 Computational access to UniProt http://www.uniprot.org/
99
99 Computational access to UniProt http://www.ebi.ac.uk/uniprot/
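As one possible form of computational access, an entry can be fetched over HTTP and parsed with the Python standard library. This is a sketch under stated assumptions: the base URL `https://rest.uniprot.org/uniprotkb` is the REST endpoint in use at the time of writing and is not taken from the slides, and the function names are made up for this example.

```python
from urllib.request import urlopen

# Assumption: current UniProt REST endpoint (not shown in the slides).
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession, fmt="fasta"):
    """Build the URL for one UniProtKB entry in the requested format."""
    return f"{BASE_URL}/{accession}.{fmt}"


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_fasta(accession):
    """Download one entry as FASTA and parse it (requires network access)."""
    with urlopen(entry_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

For example, `fetch_fasta("P00750")` would return the header and sequence for the accession used in the BLAST slides, provided the endpoint is reachable.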
100
Acknowledgements
Rolf Apweiler, Ioannis Xenarios, Cathy H Wu, +100 annotators