EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall Protein Sequence Database:

Presentation transcript:


2 Overview 1) The UniProt databases 2) UniProtKB/Swiss-Prot annotation 3) UniProtKB/TrEMBL automatic annotation 4) Using the uniprot.org website 5) Computational access

1) The UniProt databases

4 Source of protein sequence data. Individual scientists, large-scale sequencing projects and patent offices submit nucleotide sequences to the nucleotide sequence database; protein sequences in the protein sequence database are then derived from them. Protein sequencing is rare: most protein sequences are derived from nucleotide data.

5 Protein sequence is mainly derived data. DNA sequence (submitted): ACGCTCGTACGCATCGTCACTACTAGCTACGACGACGACACGCTACTACTCGACGATTCT. Transcribe to the derived mRNA sequence: AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC. Translate to the derived protein sequence: MRSDECCCAMSC.

6 Protein sequence is mainly derived data. Same DNA, mRNA and protein derivation as the previous slide, annotated with caveats: the predicted start, predicted stop and predicted splice sites may not have direct evidence.
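The derivation on these two slides can be sketched in a few lines of Python. This is only an illustration, not how any sequence database derives its entries: the function names `transcribe` and `translate` are made up for this example, and the translation simply starts at the first AUG and stops at the first stop codon. Note that under the standard genetic code the fourth codon of the slide's mRNA, GAU, encodes aspartate (D).

```python
# Standard genetic code, built from the conventional UCAG ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Turn a coding-strand DNA sequence into mRNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no predicted protein
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# mRNA sequence taken from the slide.
slide_mrna = "AUGCGUAGUGAUGAAUGCUGCUGUGCGAUGAGCUGC"
print(translate(slide_mrna))
```

Because the start codon is chosen by a simple rule here, the sketch also shows why a predicted start may lack direct evidence: a different AUG would give a different protein.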

7 How to find the information you need? High-quality protein sequence: non-redundant data; splice isoforms, disease variants, PTMs; sequence archiving essential. Protein identification: stable identifiers; consistent nomenclature. Protein annotation: information on protein function, biological processes, molecular interactions, pathways.

8 UniProt. Since 2002, a merger and collaboration of three databases: Swiss-Prot, TrEMBL and PIR-PSD. Funded mainly by NIH (US) to be the highest-quality, most thoroughly annotated protein sequence database.

9 UniProt Consortium

10 Where does the data come from? Sequence sources (ENA; data exchanged daily) feed into UniParc.

11 Where does the data come from? Sequence sources: ENA, model organisms, PDB, RefSeq, Ensembl, VEGA, patents, more… All feed into UniParc (history of sequences). Sequences with known taxonomy go to UniProtKB/TrEMBL; after redundancy removal and manual annotation they enter UniProtKB/Swiss-Prot (high quality annotation). Metagenomic & environmental sequences go to UniMES.

12 Where does the data come from? Same flow from the sequence sources through UniParc, UniProtKB/TrEMBL, UniProtKB/Swiss-Prot and UniMES, with two derived layers added: UniRef clusters and UniMES clusters.

13 4 components of UniProt
UniProtKB: Swiss-Prot (non-redundant, manual annotation); TrEMBL (redundant, automatic annotation).
UniRef: combines sequences (speeds up searching); UniRef100, UniRef90, UniRef50.
UniParc: complete history of sequences (no annotation); cross-links to external sequence sources.
UniMES: sequences from metagenomic projects.

14 Browsing a UniParc entry. Page elements: sequence; navigate to individual entries; download data; deleted entries identified (greyed out); accession; list of databases containing the sequence.
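The idea behind a sequence archive like this (one record per unique sequence, with cross-links to every source that contains it) can be sketched as follows. This is a toy model under stated assumptions, not UniParc's implementation: `build_archive` is a hypothetical helper, and the source identifiers in the usage example are invented placeholders.

```python
from collections import defaultdict


def build_archive(records):
    """Group source records by sequence.

    records: iterable of (source_db, source_id, sequence) tuples.
    Returns a dict mapping each unique sequence to the set of
    (source_db, source_id) pairs in which it was seen.
    """
    archive = defaultdict(set)
    for source_db, source_id, sequence in records:
        archive[sequence.upper()].add((source_db, source_id))
    return dict(archive)


# Placeholder records: the IDs below are made up for illustration.
records = [
    ("ENA", "ena-0001", "MRSDECC"),
    ("RefSeq", "refseq-0001", "MRSDECC"),
    ("PDB", "pdb-0001", "MKTAYIA"),
]
archive = build_archive(records)
for sequence, sources in archive.items():
    print(sequence, sorted(sources))
```

Storing the sequence once and attaching cross-links is what lets an archive report the list of databases containing a given sequence.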

15 Browsing a UniProtKB/Swiss-Prot entry. Page elements: references; navigate to external data sources (e.g. Ensembl); download data; names (synonyms) and taxonomy; ontologies; protein attributes; annotation; protein interactions; splice variants; sequence features; general information; sequence.

16 Browsing a UniRef90 entry. Page elements: cluster name; list of entries in cluster; taxonomy of each entry; % identity of sequences in cluster; status (Swiss-Prot and/or TrEMBL). Faster and more sensitive sequence search with no loss of information.
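To make the notion of clustering at an identity threshold concrete, here is a deliberately simplified sketch. It is not the algorithm used to build UniRef: the `identity` function compares residues position by position without any alignment, and `greedy_cluster` is a hypothetical helper that assigns each sequence to the first cluster whose representative it matches at or above the threshold.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter


def greedy_cluster(sequences, threshold):
    """Greedy clustering: the longest sequences become representatives first."""
    clusters = []  # each cluster is a list whose first item is the representative
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no match: start a new cluster
    return clusters


toy = ["AAAAAAAAAA", "AAAAAAAAAT", "GGGGGGGGGG"]
print(greedy_cluster(toy, 1.0))  # analogous to a 100% threshold
print(greedy_cluster(toy, 0.9))  # analogous to a 90% threshold
```

Lowering the threshold merges more sequences into each cluster, which is why searching the cluster representatives is faster than searching every entry.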

17 Taxonomic distribution of species. All kingdoms: Bacteria (61%), Eukaryota (32%), Archaea (4%), Viruses (3%). Within Eukaryota: Other mammals (27%), Homo (12%), Other (8%), Nematoda (2%), Insecta (5%), Fungi (18%), Viridiplantae (18%), Other Vertebrata (10%).

18 Swiss-Prot: most represented species. Mainly model organisms.

19 Protein Existence tag. Protein existence level (total): evidence at protein level, 13%; evidence at transcript level, 12%; inferred from homology, 70%; predicted, 5%; uncertain (mainly TrEMBL), -. Note: this is not sequence validation.

20 Protein existence categories. Protein existence level (human): evidence at protein level, 59%; evidence at transcript level, 37.5%; inferred from homology, 1%; predicted, 0.5%; uncertain (mainly TrEMBL), 2%. Note: this is not sequence validation.
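When processing entries programmatically it can help to treat the five protein existence levels as an ordered scale. The sketch below assumes the numeric codes 1 to 5 in the order listed on the slides; the class and helper names are illustrative only.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """Protein existence levels, strongest evidence first (codes 1-5 assumed)."""
    PROTEIN_LEVEL = 1
    TRANSCRIPT_LEVEL = 2
    HOMOLOGY = 3
    PREDICTED = 4
    UNCERTAIN = 5


LABELS = {
    ProteinExistence.PROTEIN_LEVEL: "Evidence at protein level",
    ProteinExistence.TRANSCRIPT_LEVEL: "Evidence at transcript level",
    ProteinExistence.HOMOLOGY: "Inferred from homology",
    ProteinExistence.PREDICTED: "Predicted",
    ProteinExistence.UNCERTAIN: "Uncertain",
}


def describe(level):
    """Return the label for a numeric protein existence level."""
    return LABELS[ProteinExistence(level)]


def strongest(levels):
    """Return the level with the strongest evidence among several values."""
    return ProteinExistence(min(levels))


print(describe(3))
print(strongest([4, 2, 5]).name)
```

The level describes evidence that the protein exists; as the slides stress, it says nothing about whether the sequence itself has been validated.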

2) UniProtKB/SwissProt annotation

22 Annotation sources for UniProtKB
Annotation methods: manual curation; literature-based annotation; sequence analysis (transmembrane prediction, InterPro classification, signal prediction, other predictions); protein classification; automated annotation.
Some data sources for annotation: GO (functional info); PRIDE (protein identification data); InterPro (protein families and domains); IntAct (molecular interactions); IntEnz (enzymes); HAMAP (microbial protein families); RESID (post-translational modifications).

23 Features of UniProtKB: sequence; annotations; nomenclature; references; ontologies; splice variants; sequence features.

24 A wealth of external links (125 links!)
2D gel DBs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
Family and domain DBs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
Organism-specific DBs: AGD, ArachnoServer, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, HGNC, H-InvDB, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
Protein family/group DBs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
Genome annotation DBs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
Enzyme & pathway DBs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
Sequence DBs: EMBL, IPI, PIR, RefSeq, UniGene
3D structure DBs: DisProt, HSSP, PDB, PDBsum, SMR
PTM DBs: GlycoSuiteDB, PhosphoSite, PhosSite
Proteomic DBs: PeptideAtlas, PRIDE, ProMEX
Protein-protein interaction DBs: DIP, IntAct, STRING
Phylogenomic DBs: HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB
Polymorphism DBs: dbSNP
Gene expression DBs: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
Ontologies: GO
Others: BindingDB, DrugBank, NextBio, PMAP-CutDB

25 Swiss-Prot manual annotation. 1. Protein sequence: merge available CDS (coding sequences); annotate sequence discrepancies; report sequencing errors... 2. Biological information: extract literature information; orthologue data propagation; protein sequence analysis...

26 Problem #1: sequence correction. ~20% of Swiss-Prot entries required correction. Typical problems: unsolved conflicts (sequencing errors); erroneous gene model predictions; wrong initiation sites; frameshifts...

27 Sequence quality from genome projects. Drosophila: well curated; 1.8% of gene models incorrect. Arabidopsis: annotated when sequenced, but no update; 19.5% of gene models incorrect. Tetraodon nigroviridis: automatic run-through (no manual intervention); >90% of gene models incorrect.

28 Sequence curation: sequencing errors. Other examples of sequencing errors include premature stop codons, read-throughs and erroneous initiator methionines.

29 Problem #2: proteome complexity. 1 Swiss-Prot entry = 1 gene (1 species). Genome: ~20,000 human protein-coding genes. Transcriptome: ~100,000 human transcripts (alternative splicing, alternative initiation, mRNA editing...). Proteome: >1,000,000 human proteins (post-translational modification). Annotation of sequence differences.

30 Merging entries. Multiple entries for the same protein exist in TrEMBL (redundancy) because of: 1) Errors: erroneous gene model predictions; sequence errors. 2) Natural variation: polymorphisms; alternative start sites; alternative splicing. Apart from 100% identical sequences, all merged sequences are analyzed by a curator so they can be annotated accordingly.

31 Example Multiple alignment of the end of the available GCR sequences: Annotation of the sequence differences (protein diversity):

32 Merging entries

33 Sequence curation Alternative Splicing

34 Sequence curation Alternative Splicing

35 Sequence curation Alternative Splicing

36 Sequence curation Alternative Splicing

37 Sequence curation Alternative Splicing

38 Sequence curation. Identification of amino acid variants... and of PTMs... and also

39 Sequence curation Domain annotation Binding sites

40 Swiss-Prot manual annotation. 1. Protein sequence: merge available CDS (coding sequences); annotate sequence discrepancies; report sequencing errors... 2. Biological information: extract literature information; orthologue data propagation; protein sequence analysis...

41 Sources of annotated information. UniProtKB/Swiss-Prot gathers information from multiple sources: publications (literature/PubMed); prediction programs (PROSITE, Anabelle); contact with experts; other databases; nomenclature committees.

42 Nomenclature Synonyms useful for literature searching

43 Nomenclature Provides synonyms and cleavage products of bifunctional proteins

44 Annotation comments Controlled vocabularies used whenever possible… >30 comment fields

45 Disease association. Mendelian Inheritance in Man provides information on genetic disease associations. Pharmacogenomics database.

46 Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein…

47 Sequence annotation (Features) Feature (e.g. domain) highlighted on sequence

48 Gene Ontology
1. Biological Process: a commonly recognized series of events. Examples: cell division, mitosis, organelle fission.
2. Molecular Function: an elemental activity or task or job. Examples: protein kinase activity, insulin binding, insulin receptor activity.
3. Cellular Component: where a gene product is located. Examples: mitochondrion, mitochondrial matrix, mitochondrial membrane.

49 Gene Ontology Annotation for human Rhodopsin:

50 Imported annotation. Binary interactions are taken from the IntAct database. Example: interactors of human p53.

51 Evidence for annotation. UniProtKB/Swiss-Prot distinguishes between experimental and predicted data.
Type of evidence: evidence tag
1st, experimental evidence: reference provided
2nd, light experimental evidence: "Probable"
3rd, inferred by similarity with homologous protein: "By similarity"
4th, inferred by sequence prediction: "Potential"

52 Evidence for annotation Proven Potential Proven By similarity

53 Sources references included

54 Versioning and archiving

55 Versioning and archiving Able to compare versions directly

56 Versioning and archiving

3) UniProtKB/TrEMBL automatic annotation

58 UniProtKB/TrEMBL !! Caution !! Quality of UniProtKB/TrEMBL entries depends upon quality of submissions in original EMBL/GenBank/DDBJ entry.

59 Annotated proteins guide TrEMBL entries. Example for rhodopsin: 379 annotated UniProtKB/Swiss-Prot entries; 9,186 un-annotated UniProtKB/TrEMBL entries. Automatic annotation is added using Swiss-Prot and InterPro (function prediction database), so that un-annotated TrEMBL entries are not skeleton entries with no information.

60 Automatic annotation. UniProtKB uses 2 prediction programs, both drawing on InterPro and Swiss-Prot. UniRule: maintains a set of manual annotation rules. SAAS: generates a set of decision trees using data mining (new set every UniProtKB release).

61 Automatic annotation: InterPro. Swiss-Prot: manually annotated sequences. InterPro: protein signatures for groups of related proteins (same family or shared domains). TrEMBL: uncharacterised sequences. Signature matches feed the automatic annotation pipeline.
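The general shape of rule-based automatic annotation can be sketched as follows. This is a minimal illustration under assumptions, not the actual UniRule or SAAS logic: the `Rule` and `Entry` classes, the signature name `SIG_EXAMPLE_OPSIN` and the annotation values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A hand-written rule: conditions plus the annotation to propagate."""
    required_signatures: frozenset
    taxon: str
    annotation: dict


@dataclass
class Entry:
    """An un-annotated entry with its signature matches and lineage."""
    accession: str
    signatures: set
    lineage: list
    annotation: dict = field(default_factory=dict)


def apply_rules(entry, rules):
    """Add annotation from every rule whose conditions the entry satisfies."""
    for rule in rules:
        has_signatures = rule.required_signatures <= entry.signatures
        in_taxon = rule.taxon in entry.lineage
        if has_signatures and in_taxon:
            for key, value in rule.annotation.items():
                # Never overwrite annotation that is already present.
                entry.annotation.setdefault(key, value)
    return entry


# Hypothetical rule and entry, for illustration only.
rules = [
    Rule(
        required_signatures=frozenset({"SIG_EXAMPLE_OPSIN"}),
        taxon="Metazoa",
        annotation={"family": "example opsin family", "evidence": "automatic"},
    )
]
entry = Entry("ENTRY_1", {"SIG_EXAMPLE_OPSIN"}, ["Eukaryota", "Metazoa"])
print(apply_rules(entry, rules).annotation)
```

Restricting a rule by both signature and taxonomy is one way to limit over-propagation of annotation to distantly related sequences.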

62 Browsing a UniProtKB/TrEMBL entry. Page elements: name (could be clone name); automatic annotation (derived from InterPro); ontologies (both automatic and manual curation); taxonomy.

4) Using the website

64 Useful features: integrated BLAST and alignments; batch retrieval in a variety of formats; simple and modular advanced searching.

65 uniprot.org: anatomy of an entry Entry Info Link to UniSave Link to UniRef Variety of formats Navigation bar Customize order

66 uniprot.org: anatomy of an entry Entry Info Link to UniSave Link to UniRef Variety of formats Navigation bar Customize order

67 Searching UniProt. Search tools include: text search; BLAST sequence search; additional search engines through EBI (e.g. SSearch and FASTA).

68 Search. Powerful text search tool with autocompletion and refinement options. Look for UniProt entries and documentation using biological information.

69 Search Search sequence database, literature, taxonomy… More search options

70 Search Refine search

71 Search results

72 Search results Define type and order of search results

73 Search results. Each result is linked to the UniProt entry and marked as Swiss-Prot or TrEMBL. Select specific entries.

74 Search results Can retrieve or BLAST sequence Keeps selected entries throughout session

75 Search results Can retrieve or align >2 sequences

76 BLAST. A tool with standard options for searching the UniProt databases by sequence. Search refinement (change parameters).

77 BLAST Can query using protein or nucleotide sequences

78 BLAST. Can query using an identifier (e.g. P00750): UniProtKB accession (P00750); specific version (P00750:2); splice variant (P ); name (A4_HUMAN); UniParc accession (UPI ); UniRef accession (UniRef100_P00750).
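A small classifier can tell these identifier types apart before a query is submitted. The sketch below covers only the four forms whose examples are complete in this transcript; the splice variant and UniParc forms are left out because their examples are truncated. The accession pattern follows the format documented by UniProt; the entry name and UniRef patterns are simplified approximations, and `classify_identifier` is a name invented for this example.

```python
import re

# UniProtKB accession format (6 or 10 characters).
ACCESSION = (
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
)

# Order matters: more specific patterns are tried first.
PATTERNS = [
    ("UniRef accession", re.compile(r"UniRef(?:100|90|50)_[A-Z0-9-]+")),
    ("UniProtKB accession with version", re.compile(ACCESSION + r":[0-9]+")),
    ("UniProtKB accession", re.compile(ACCESSION)),
    # Approximate pattern for entry names such as A4_HUMAN.
    ("UniProtKB entry name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def classify_identifier(text):
    """Return the identifier type, or 'unrecognised' if nothing matches."""
    token = text.strip()
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "unrecognised"


for example in ["P00750", "P00750:2", "A4_HUMAN", "UniRef100_P00750"]:
    print(example, "->", classify_identifier(example))
```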

79 BLAST. Threshold = expectation (E) value; provides the cut-off between good and poor hits. Hit legend (colour key not preserved): best; should verify; biological significance less likely.
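How the E-value threshold acts as a cut-off can be sketched as below. The relation E = m × n × 2^(−S′), with S′ the bit score, m the query length and n the database length, is the standard BLAST statistic rather than something stated on the slide; the hit records and the threshold values are hypothetical.

```python
def evalue_from_bit_score(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def filter_hits(hits, threshold=10.0):
    """Keep hits with E-value <= threshold, best (lowest E-value) first."""
    kept = [hit for hit in hits if hit["evalue"] <= threshold]
    return sorted(kept, key=lambda hit: hit["evalue"])


# Hypothetical hits, for illustration only.
hits = [
    {"id": "hit_a", "evalue": 1e-50},
    {"id": "hit_b", "evalue": 0.5},
    {"id": "hit_c", "evalue": 25.0},
]
print([hit["id"] for hit in filter_hits(hits, threshold=1.0)])
```

A lower threshold keeps only hits that are unlikely to arise by chance; a higher one returns more hits at the cost of more that need to be verified.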

80 BLAST. Matrix = assigns probability score for each position; controls sensitivity of search.

81 BLAST. Filtering = masks low-complexity regions. Stretches of cysteines or hydrophobic regions can cause spurious matches; filtering replaces them with X's.
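Low-complexity masking can be illustrated with a simple sliding-window entropy filter. This is a rough stand-in and not the filter BLAST actually uses; the window size of 12 and the entropy cut-off of 2.2 bits are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2):
    """Replace residues in every low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(0, len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# A diverse flank followed by a run of cysteines.
query = "ACDEFGHIKLMN" + "C" * 12
print(mask_low_complexity(query))
```

Windows that overlap the cysteine run can also fall below the cut-off, so a few flanking residues may be masked as well; real filters handle such boundaries more carefully.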

82 BLAST. Gapped = allows gaps in sequence. Yes = to find more distant homologues; No = to find closest matches (strict).

83 BLAST Hits = limits number of results

84 BLAST results Can filter or customize results

85 BLAST results Shows length of query sequence aligned Select match to see alignment

86 BLAST results – pairwise alignment Alignment of selected sequence

87 BLAST results – pairwise alignment Colour alignment by annotation or properties

88 BLAST results. Further down the results page… details about matching protein sequences.

89 BLAST results Can align checked sequences

90 BLAST results – multiple alignment Alignment of selected sequence Can add additional sequences to alignment

91 BLAST results – multiple alignment Colour alignment by annotation or properties

92 Align. ClustalW multiple alignment tool with amino-acid highlighting options and a feature annotation highlighting option.

93 Retrieve. UniProt-specific tool: retrieve a list of entries in several standard formats, then query the retrieved sequences with the UniProt search tool.

94 ID Mapping Allows mapping between different databases for a given protein

95 Other tools Sequence Similarity & Analysis

96 Other tools BLAST FASTA specialized searches

5) Computational access

98 Computational access to UniProt

99 Computational access to UniProt
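The transcript does not preserve the content of these two slides, so the following is only a sketch of what programmatic access can look like today. It assumes the present-day REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.<format>`, which postdates this presentation; the helper names are invented for the example, and nothing is fetched unless `fetch_entry` is called.

```python
import urllib.request

# Assumed base URL of the current UniProt REST service.
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession, fmt="fasta"):
    """Build the URL for one UniProtKB entry in the requested format."""
    return f"{BASE_URL}/{accession}.{fmt}"


def fetch_entry(accession, fmt="fasta", timeout=30):
    """Download one entry as text (requires network access)."""
    with urllib.request.urlopen(entry_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_uniprot_header(header):
    """Split a 'db|accession|entry_name description' FASTA header."""
    ident, _, description = header.partition(" ")
    db, accession, entry_name = ident.split("|")
    return {
        "db": db,  # 'sp' for Swiss-Prot, 'tr' for TrEMBL
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


# Example usage (needs a network connection):
#     records = parse_fasta(fetch_entry("P00750"))
#     print(parse_uniprot_header(records[0][0]))
```

Keeping the download and the parsing in separate functions means the parsing can be tested offline on saved files.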

100 Acknowledgements: Rolf Apweiler, Ioannis Xenarios, Cathy H. Wu, +100 annotators