EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall Protein Sequence Database:


1 EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall Protein Sequence Database:

2 22 Overview 1) The UniProt databases 2) UniProt/SwissProt annotation 3) UniProt/TrEMBL automatic annotation 4) Using the uniprot.org website 5) Computational access

3 1) The UniProt databases

4 44 Source of protein sequence data Submitters: individual scientists, large-scale sequencing projects, patent offices. Nucleotide sequencing -> submit -> nucleotide sequence database -> derive protein sequence -> protein sequence database. Protein sequencing -> submit -> protein sequence database. Protein sequencing is rare; most protein sequence is derived from nucleotide data.

5 55 Protein sequence is mainly derived data DNA sequence (submitted): ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT -> transcribe -> Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC -> translate -> Derived protein sequence: MRSDECCCAMSC

6 66 Protein sequence is mainly derived data DNA sequence (submitted): ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT -> transcribe -> Derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC -> translate -> Derived protein sequence: MRSDECCCAMSC. Caveats: the predicted start may not have direct evidence; splice sites are predicted; the stop is predicted.
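
The transcribe and translate steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a reading frame that begins at the first AUG; the function names are hypothetical and not part of any UniProt tool. Under the standard code the fourth codon of the mRNA shown, GAU, encodes aspartate (D).

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Turn a DNA coding-strand sequence into mRNA by replacing T with U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no predicted start codon
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # predicted stop
            break
        protein.append(aa)
    return "".join(protein)


mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"  # mRNA shown on the slide
print(translate(mrna))  # MRSDECCCAMSC
```

The start and stop used here are purely computational predictions, which is exactly the caveat the slide raises about derived protein sequences.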

7 77 How to find the information you need? (Background figure: raw nucleotide sequence fragments.) High quality protein sequence: non-redundant data; splice isoforms, disease variants, PTMs; sequence archiving essential. Protein identification: stable identifiers; consistent nomenclature. Protein annotation: information on protein function, biological processes, molecular interactions, pathways.

8 88 UniProt Since 2002 a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD. Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database. http://www.uniprot.org/

9 99 UniProt Consortium

10 10 Where does the data come from? Sequence sources UniParc ENA exchange data daily

11 11 Where does the data come from? Sequence sources (ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…) -> UniParc (history of sequences) -> UniProtKB/TrEMBL (taxonomy known) or UniMES (metagenomic & environmental). UniProtKB/TrEMBL -> remove redundancy, manual annotation -> UniProtKB/SwissProt (high quality annotation).

12 12 Where does the data come from? Sequence sources (ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more…) -> UniParc -> UniProtKB/TrEMBL (taxonomy known) or UniMES (metagenomic & environmental). UniProtKB/TrEMBL -> UniProtKB/SwissProt. UniProtKB -> UniRef clusters. UniMES -> UniMES clusters.

13 13 4 components of UniProt UniProtKB: Swiss-Prot (non-redundant, manual annotation) and TrEMBL (redundant, automatic annotation). UniRef: combines sequences (speeds up searching); UniRef100, UniRef90, UniRef50. UniParc: complete history of sequences (no annotation); cross-links to external sequence sources. UniMES: sequences from metagenomic projects.

14 14 Browsing a UniParc entry Sequence Navigate to individual entries Download data Deleted entries identified (greyed out) Accession List of databases containing sequence

15 15 Browsing a UniProtKB/SwissProt entry References Navigate to external data sources e.g. Ensembl Download data Names (synonyms) and taxonomy Ontologies Protein attributes Annotation Protein interactions Splice variants Sequence features General information Sequence

16 16 Browsing a UniRef90 entry Cluster name List of entries in cluster Taxonomy of each entry % identity of sequences in cluster Status (SwissProt and/or TrEMBL) Faster and more sensitive sequence search with no loss of information
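
The idea behind clustering at 100%, 90% and 50% identity can be illustrated with a toy greedy clustering. This is only a sketch under strong assumptions: identity is computed position by position without gaps, the sequences are non-empty, and the sequence names are made up. The real UniRef clusters are built with dedicated clustering software and alignment-based criteria that this example does not reproduce.

```python
def identity(a: str, b: str) -> float:
    """Ungapped, position-by-position identity relative to the longer sequence."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters: dict[str, list[str]] = {}
    # Longest sequences are considered first, so they become representatives.
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[name], seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]  # no representative was close enough
    return clusters


# Hypothetical sequences: A and C are identical, B differs from A at one position.
seqs = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQK",
    "C": "MKTAYIAKQR",
    "D": "GGGGGGGGGG",
}
for threshold in (1.0, 0.9, 0.5):
    print(threshold, greedy_cluster(seqs, threshold))
```

Lower thresholds give fewer, larger clusters, which is why searching a clustered set is faster than searching every individual entry.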

17 17 Taxonomic distribution of species All kingdoms: Bacteria (61%), Eukaryota (32%), Archaea (4%), Viruses (3%). Within Eukaryota: Other mammals (27%), Fungi (18%), Viridiplantae (18%), Homo (12%), Other Vertebrata (10%), Other (8%), Insecta (5%), Nematoda (2%).

18 18 SwissProt – most represented species Mainly model organisms

19 19 Protein Existence tag Protein existence level (total): Evidence at protein level 13%; Evidence at transcript level 12%; Inferred from homology 70%; Predicted 5%; Uncertain (mainly TrEMBL) -. !! Not sequence validation !!

20 20 Protein existence categories Protein existence level (human): Evidence at protein level 59%; Evidence at transcript level 37.5%; Inferred from homology 1%; Predicted 0.5%; Uncertain (mainly TrEMBL) 2%. !! Not sequence validation !!

21 2) UniProtKB/SwissProt annotation

22 22 Annotation sources for UniProtKB UniProtKB draws on: manual curation; literature-based annotation; sequence analysis (transmembrane prediction, InterPro classification, signal prediction, other predictions); protein classification; automated annotation. Some data sources for annotation: PRIDE (protein identification data), GO (functional info), InterPro (protein families and domains), IntAct (molecular interactions), IntEnz (enzymes), HAMAP (microbial protein families), RESID (post-translational modifications).

23 23 Features of UniProtKB Sequence, Annotations, Nomenclature, References, Ontologies, Splice variants, Sequence features

24 24 A wealth of external links (125 links!) 2D gel DBs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE. Family and domain DBs: Gene3D, PIRSF, HAMAP, PRINTS, InterPro, ProDom, PANTHER, PROSITE, Pfam, TIGRFAMs, SMART. Organism-specific DBs: DictyBase, AGD, EchoBASE, CGD, EcoGene, CTD, euHCVdb, CYGD, FlyBase, HGNC, GeneCards, HPA, GeneFarm, MGI, Gramene, MIM, H-InvDB, RGD, LegioList, SGD, Leproma, TAIR, ListiList, ZFIN, MaizeGDB, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, SagaList, SubtiList, TubercuList, WormBase, WormPep, Xenbase, GeneDB_Spombe, ArachnoServer, BuruList. Protein family/group DBs: CAZy, MEROPS, PeroxiBase, REBASE, PptaseDB, TCDB. Genome annotation DBs: Ensembl, KEGG, GeneID, NMPDR, VectorBase, UCSC, GenomeReviews, TIGR. Enzyme & pathway DBs: BioCyc, BRENDA, Reactome, Pathway_Interaction_DB. Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene. 3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR. PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite. Proteomic DBs: PeptideAtlas, PRIDE, ProMEX. Protein-protein interaction DBs: DIP, IntAct, STRING. Phylogenomic DBs: HOGENOM, OMA, HOVERGEN, PhylomeDB, InParanoid, OrthoDB. Polymorphism DBs: dbSNP. Gene expression DBs: ArrayExpress, Bgee, GermOnline, CleanEx, Genevestigator. Ontologies: GO. Others: BindingDB, PMAP-CutDB, DrugBank, NextBio.

25 25 SwissProt manual annotation 1. Protein sequence: merge available CDS (coding sequence); annotate sequence discrepancies; report sequencing errors... 2. Biological information: extract literature information; orthologue data propagation; protein sequence analysis...

26 26 Problem #1: sequence correction ~20% of Swiss-Prot entries required correction Typical problems: –Unsolved conflicts (sequencing errors) –Erroneous gene model predictions –Wrong initiation sites –Frameshifts...

27 27 Sequence quality from genome projects Drosophila: Well-curated 1.8% of gene models incorrect Arabidopsis: Annotated when sequenced, but no update 19.5% of gene models incorrect Tetraodon nigroviridis: Automatic run through (no manual intervention) >90% of gene models incorrect

28 28 Sequence curation Other examples of sequencing errors include: premature stop codons, read-throughs, erroneous initiator methionines Sequencing errors

29 29 Problem #2: proteome complexity Genome: ~20,000 human protein-coding genes. Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...). Proteome: >1,000,000 human proteins (post-translational modification). 1 SwissProt entry = 1 gene (1 species), with annotation of sequence differences.

30 30 Merging entries Multiple entries for the same protein exist in TrEMBL (redundancy) because of: 1) Errors: erroneous gene model predictions; sequence errors. 2) Natural variation: polymorphisms; alternative start sites; alternative splicing. Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.

31 31 Example Multiple alignment of the end of the available GCR sequences: Annotation of the sequence differences (protein diversity):

32 32 Merging entries

33 33 Sequence curation Alternative Splicing

34 34 Sequence curation Alternative Splicing

35 35 Sequence curation Alternative Splicing

36 36 Sequence curation Alternative Splicing

37 37 Sequence curation Alternative Splicing

38 38 Sequence curation Identification of amino acid variants....and of PTMs....and also

39 39 Sequence curation Domain annotation Binding sites

40 40 SwissProt manual annotation 1. Protein sequence: merge available CDS (coding sequence); annotate sequence discrepancies; report sequencing errors... 2. Biological information: extract literature information; orthologue data propagation; protein sequence analysis...

41 41 Sources of annotated information UniProtKB/SwissProt gathers information from multiple sources: Publications (literature/PubMed) Prediction proteins (Prosite, Anabelle) Contact with experts Other databases Nomenclature committees

42 42 Nomenclature Synonyms useful for literature searching

43 43 Nomenclature Provides synonyms and cleavage products of bifunctional proteins

44 44 Annotation comments Controlled vocabularies used whenever possible… >30 comment fields

45 45 Disease association Mendelian Inheritance in Man provides information on genetic disease associations Pharmacogenomics database

46 46 Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein…

47 47 Sequence annotation (Features) Feature (e.g. domain) highlighted on sequence

48 48 Gene Ontology 1. Biological Process: a commonly recognized series of events (cell division, mitosis, organelle fission). 2. Molecular Function: an elemental activity or task or job (protein kinase activity, insulin binding, insulin receptor activity). 3. Cellular Component: where a gene product is located (mitochondrion, mitochondrial matrix, mitochondrial membrane).

49 49 Gene Ontology Annotation for human Rhodopsin:

50 50 Imported annotation Binary interactions are taken from the IntAct database. Example: interactors of human p53

51 51 Evidence for annotation UniProtKB/Swiss-Prot distinguishes between experimental and predicted data. Type of evidence (evidence tag): 1st: Experimental evidence (Reference provided); 2nd: Light experimental evidence (Probable); 3rd: Inferred by similarity with homologous protein (By similarity); 4th: Inferred by sequence prediction (Potential).

52 52 Evidence for annotation Proven Potential Proven By similarity

53 53 Sources references included

54 54 Versioning and archiving

55 55 Versioning and archiving Able to compare versions directly

56 56 Versioning and archiving

57 3) UniProtKB/TrEMBL automatic annotation

58 58 UniProtKB/TrEMBL !! Caution !! Quality of UniProtKB/TrEMBL entries depends upon quality of submissions in original EMBL/GenBank/DDBJ entry.

59 59 Annotated proteins guide TrEMBL entries Example for rhodopsin: 379 annotated UniProtKB/Swiss-Prot entries; 9,186 un-annotated UniProtKB/TrEMBL entries. Automatic annotation is added using Swiss-Prot and InterPro (function prediction database), so that un-annotated TrEMBL entries are not skeleton entries with no information.

60 60 Automatic annotation UniProtKB uses 2 prediction programs: UniRule: maintains a set of manual annotation rules. SAAS: generates a set of decision trees using data mining (new set every UniProtKB release). Inputs: InterPro, Swiss-Prot.
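
To make the idea of a manual annotation rule concrete, here is a minimal sketch of how a rule might be applied to an un-annotated entry. Everything in it is hypothetical: the class names, the signature identifiers (`SIG_A`, `SIG_B`), the accessions and the annotation text are placeholders, and the actual UniRule rule format is not described in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A hypothetical annotation rule: conditions plus the annotation to add."""
    name: str
    required_signatures: frozenset[str]  # protein signatures that must all be present
    taxon: str                           # taxon that must appear in the lineage
    annotation: dict[str, str]


@dataclass
class Entry:
    accession: str
    signatures: set[str]
    lineage: list[str]
    annotation: dict[str, str] = field(default_factory=dict)


def apply_rules(entry: Entry, rules: list[Rule]) -> Entry:
    """Add a rule's annotation when all of its conditions hold for the entry."""
    for rule in rules:
        has_signatures = rule.required_signatures <= entry.signatures
        if has_signatures and rule.taxon in entry.lineage:
            for key, value in rule.annotation.items():
                entry.annotation.setdefault(key, value)  # never overwrite existing data
    return entry


rules = [
    Rule(
        name="example_rule",
        required_signatures=frozenset({"SIG_A"}),
        taxon="Metazoa",
        annotation={"function": "Example function text"},
    )
]
entry = Entry("EXAMPLE_1", {"SIG_A", "SIG_B"}, ["Eukaryota", "Metazoa"])
print(apply_rules(entry, rules).annotation)
```

Keeping the conditions explicit makes every automatic annotation traceable back to the rule that produced it.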

61 61 Automatic annotation - InterPro Swiss-Prot (manually annotated sequence): groups of related proteins (same family or share domains) -> protein signatures -> InterPro -> automatic annotation pipeline -> TrEMBL (uncharacterised sequence, e.g. CGCGCCTGTACGC TGAACGCTCGTGA CGTGTAGTGCGCG)

62 62 Browsing a UniProtKB/TrEMBL entry Name (could be clone name) Automatic annotation. (derived from InterPro) Ontologies (both automatic and manual curation) Taxonomy

63 4) Using the www.uniprot.org website

64 64 www.uniprot.org Useful Features Integrated BLAST and Alignments Batch retrieval in a variety of formats Simple and modular advanced searching

65 65 uniprot.org: anatomy of an entry Entry Info Link to UniSave Link to UniRef Variety of formats Navigation bar Customize order

66 66 uniprot.org: anatomy of an entry Entry Info Link to UniSave Link to UniRef Variety of formats Navigation bar Customize order

67 67 Searching UniProt Search tools include: Text Search Blast sequence search Additional search engines through EBI (e.g. SSearch and FASTA) http://www.uniprot.org/

68 68 Search Powerful text search tool with autocompletion and refinement options look for UniProt entries and documentation using biological information

69 69 Search Search sequence database, literature, taxonomy… More search options

70 70 Search Refine search

71 71 Search results

72 72 Search results Define type and order of search results

73 73 Search results Each result linked to the UniProt entry SwissProt TrEMBL SwissProt TrEMBL Select specific entries

74 74 Search results Can retrieve or BLAST sequence Keeps selected entries throughout session

75 75 Search results Can retrieve or align >2 sequences

76 76 BLAST A tool with standard options to search sequences in UniProt databases by sequence. Search refinement (change parameters)

77 77 BLAST Can query using protein or nucleotide sequences

78 78 BLAST Can query using identifier: UniProtKB accession (P00750); Specific version (P00750:2); Splice variant (P00750-2); Name (A4_HUMAN); UniParc accession (UPI0000000001); UniRef accession (UniRef100_P00750)
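
The identifier forms accepted here can be told apart with a few regular expressions. The sketch below is an illustration, not the website's own logic: the accession pattern follows the format UniProt documents for accession numbers, while the patterns for versions, splice variants, entry names, UniParc and UniRef identifiers are simplified assumptions derived from the examples on this slide.

```python
import re

# UniProtKB accession pattern; the remaining patterns are simplified assumptions.
ACC = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"

PATTERNS = [
    ("UniRef accession", re.compile(r"UniRef(?:100|90|50)_\S+")),
    ("UniParc accession", re.compile(r"UPI[0-9A-F]{10}")),
    ("UniProtKB accession, specific version", re.compile(rf"{ACC}:\d+")),
    ("UniProtKB accession, splice variant", re.compile(rf"{ACC}-\d+")),
    ("UniProtKB accession", re.compile(ACC)),
    ("UniProtKB entry name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def classify_identifier(identifier: str) -> str:
    """Return the first identifier type whose pattern matches the whole string."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


for example in ["P00750", "P00750:2", "P00750-2", "A4_HUMAN",
                "UPI0000000001", "UniRef100_P00750"]:
    print(example, "->", classify_identifier(example))
```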

79 79 BLAST Threshold = expectation (E) value. Provides cut-off between good and poor hits. Result colour key (colours not preserved in this transcript): best; should verify; biological significance less likely.
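
How an E-value threshold separates hits can be sketched as follows. The relation E = m * n * 2^(-S'), with S' the bit score and m and n the query and database lengths, is the standard BLAST approximation; this sketch ignores effective-length corrections, and the hit names, scores and lengths are invented for illustration.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2**(-S'), ignoring effective-length corrections."""
    return query_len * db_len * 2.0 ** (-bit_score)


def keep_hits(hits: list[tuple[str, float]], threshold: float) -> list[str]:
    """Keep the names of hits whose E-value is at or below the threshold."""
    return [name for name, evalue in hits if evalue <= threshold]


# Hypothetical search: a 300-residue query against 50 million database residues.
QUERY_LEN, DB_LEN = 300, 50_000_000
bit_scores = {"hit_a": 120.0, "hit_b": 45.0, "hit_c": 20.0}
hits = [(name, expect_value(score, QUERY_LEN, DB_LEN))
        for name, score in bit_scores.items()]

print(keep_hits(hits, threshold=10))    # permissive cut-off
print(keep_hits(hits, threshold=1e-5))  # strict cut-off
```

A lower threshold keeps only hits that are unlikely to arise by chance in a database of that size.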

80 80 BLAST Matrix = assigns probability score for each position Controls sensitivity of search

81 81 BLAST Filtering = masks low complexity regions. Stretches of cysteines or hydrophobic regions can cause spurious matches; filtering replaces them with X's.
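
Masking can be illustrated with a simple sliding-window entropy filter. This is a simplified stand-in rather than the filter BLAST actually uses: the window size and entropy cut-off are assumed values, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2) -> str:
    """Replace every window whose entropy falls below min_entropy with X's."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = "X" * window
    return "".join(masked)


# Hypothetical sequence with a run of 15 cysteines in the middle.
seq = "ACDEFGHIKL" + "C" * 15 + "MNPQRSTVWY"
print(mask_low_complexity(seq))
```

Because whole windows are masked, a few residues flanking the repetitive stretch can be replaced as well.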

82 82 BLAST Gapped = allows gaps in sequence. Yes = to find more distant homologues. No = to find closest matches (strict).

83 83 BLAST Hits = limits number of results

84 84 BLAST results Can filter or customize results

85 85 BLAST results Shows length of query sequence aligned Select match to see alignment

86 86 BLAST results – pairwise alignment Alignment of selected sequence

87 87 BLAST results – pairwise alignment Colour alignment by annotation or properties

88 88 BLAST results...... Further down the results page… details about matching protein sequences

89 89 BLAST results...... Can align checked sequences

90 90 BLAST results – multiple alignment Alignment of selected sequence Can add additional sequences to alignment

91 91 BLAST results – multiple alignment Colour alignment by annotation or properties

92 92 Align ClustalW multiple alignment tool with amino-acids highlighting options and feature annotation highlighting option

93 93 Retrieve UniProt-specific tool: retrieve a list of entries in several standard formats, then query the retrieved sequences with the UniProt search tool.

94 94 ID Mapping Allows mapping between different databases for a given protein

95 95 Other tools http://www.ebi.ac.uk/ Sequence Similarity & Analysis

96 96 Other tools BLAST FASTA specialized searches http://www.ebi.ac.uk/Tools/sss/

97 5) Computational access

98 98 Computational access to UniProt http://www.uniprot.org/

99 99 Computational access to UniProt http://www.ebi.ac.uk/uniprot/
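
As an illustration of computational access, the sketch below parses the header line of a UniProtKB FASTA record and reports whether the entry is Swiss-Prot or TrEMBL and what its protein existence level is. The header layout follows UniProt's documented FASTA format. The URL pattern in `fasta_url` is an assumption based on the current UniProt REST service, which may differ from what was available when these slides were written, and no network request is made. The sample header is illustrative.

```python
import re

PROTEIN_EXISTENCE = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
    5: "Uncertain",
}

# >db|accession|entry_name protein name OS=organism OX=taxon [GN=gene] PE=n SV=n
HEADER = re.compile(
    r">(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<taxon_id>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?\s+PE=(?P<pe>[1-5])\s+SV=(?P<sv>\d+)"
)


def fasta_url(accession: str) -> str:
    """Assumed URL pattern for fetching one entry in FASTA format."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_header(line: str) -> dict:
    """Split a UniProtKB FASTA header into named fields."""
    match = HEADER.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {line!r}")
    info = match.groupdict()
    info["database"] = "Swiss-Prot" if info.pop("db") == "sp" else "TrEMBL"
    info["protein_existence"] = PROTEIN_EXISTENCE[int(info.pop("pe"))]
    return info


sample = (">sp|P00750|TPA_HUMAN Tissue-type plasminogen activator "
          "OS=Homo sapiens OX=9606 GN=PLAT PE=1 SV=1")
print(fasta_url("P00750"))
print(parse_header(sample))
```

The `sp` or `tr` prefix is a quick way to tell manually annotated entries from automatically annotated ones when processing downloads in bulk.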

100 100 Acknowledgements Rolf Apweiler Ioanis Xenarios Cathy H Wu +100 annotators

