MCB September, 2010 Protein Sequence Databases Marie-Claude.Blatter@isb-sib.ch Swiss-Prot group, Geneva SIB Swiss Institute of Bioinformatics Protein sequence databases: use and pitfalls http://education.expasy.org/cours/CJSFEAP2010/

Mr. Proteomics Mr. Protein sequence databases

Protein identification by database matching
[Figure: peptides from the samples are analysed by mass spectrometry; the measured peak lists (m/z 600 to 2200) are matched against peptides from database sequences.]
A large protein list is not the end point in proteomics -> importance of protein sequence annotation
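Database matching relies on an in-silico digest of every database sequence. The following is a minimal sketch, assuming trypsin-like cleavage (after K or R, except before P) and no missed cleavages. The function name and the toy sequence are illustrative and not taken from any search engine.

```python
def tryptic_digest(sequence):
    """Split a protein sequence after K or R, unless the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


# Toy sequence assembled for illustration only
print(tryptic_digest("GGAARGPGFKEHICLLGR"))  # ['GGAAR', 'GPGFK', 'EHICLLGR']
```

A search engine would then compare the masses of such theoretical peptides with the measured peaks, so the result depends directly on which sequences the chosen database contains.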

New challenge: flood of data -> needs to be stored, curated and made available for analysis and knowledge discovery

Many protein sequence databases…
- Which contains the highest-quality data?
- Which is comprehensive?
- Which is up to date?
- Which is redundant?
- Which is indexed (allows complex queries)?
- Which Web server responds most quickly?
- Which contains complete proteomes?
- …

A HUPO test sample study reveals common problems in mass spectrometry–based proteomics (PubMed 19448641, 2009)
- A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides).
- Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searching different protein databases can give different results).
- Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, partly because the search engines used cannot distinguish among different identifiers for the same protein…

Awareness of the content and usage of knowledge resources is a prerequisite for any type of « serious » research in the field of molecular life sciences (AMB, 2007)

Menu Introduction Nucleic acid sequence databases ENA, GenBank, DDBJ Protein sequence databases UniProt databases (UniProtKB) NCBI protein databases

Protein sequence origin: more than 99 % of the protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs) -> important to know where the protein sequence comes from… (sequencing & gene prediction quality)!

- ~2500 genomes sequenced (single organism, varying sizes, including viruses)
- ~5’000 ongoing genome sequencing projects
- cDNA sequencing projects (ESTs or cDNAs)
- metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms

Metagenomics: study of genetic material recovered directly from environmental samples
- Global Ocean Sampling (C. Venter’s Sorcerer II): 1 ml of sea water contains 1 million bacteria and 10 million viruses
- Whale fall (AAFZ00000000.1)
- Soil, sand beach, New York air, …
- Human fluids, mouse gut (millions of bacteria within the human body)
- Water treatment industry…
Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi

- ~2500 genomes sequenced (single organism, varying sizes)
- ~5’000 ongoing genome sequencing projects
- cDNA sequencing projects (ESTs or cDNAs)
- metagenome sequencing projects
- personal human genomes
New-generation sequencers: Illumina, 25 billion bp/day

http://www.youtube.com/watch?v=mVZI7NBgcWM
- 2’000’000 $ (2007)
- 70’000’000 $ (diploid, 2007)
- 3’000’000’000 $ (public consortium, 2000)
- 300’000’000 $ (Celera, 2000)
- 2010

How many protein-coding genes in the end?

190‘500'025'042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
- 20 million bacteria/archaea x 4'000 genes
- 1 million protists x 6'000 genes
- 5 million insects x 14'000 genes
- 2 million fungi x 6'000 genes
- 0.5 million plants x 20'000 genes
- 0.5 million molluscs, worms, arachnids, etc. x 20'000 genes
- 0.1 million vertebrates x 25'000 genes
The calculation: 2x10^7 x 4000 + 1x10^6 x 6000 + 5x10^6 x 14000 + 2x10^6 x 6000 + 5x10^5 x 20000 + 5x10^5 x 20000 + 1x10^5 x 25000 + 20000 (Craig Venter) + 42 (Douglas Adams) + …
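The second estimate can be re-derived with a few lines of Python. This is only a sketch of the arithmetic; the dictionary keys are labels chosen here, and the numbers are the ones listed on the slide.

```python
# (number of species, genes per species), as listed in the 2nd estimate
estimates = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (1_000_000, 6_000),
    "insects": (5_000_000, 14_000),
    "fungi": (2_000_000, 6_000),
    "plants": (500_000, 20_000),
    "molluscs, worms, arachnids, etc.": (500_000, 20_000),
    "vertebrates": (100_000, 25_000),
}

# The two explicit extra terms of the calculation (Venter, Adams)
extras = 20_000 + 42

total = sum(species * genes for species, genes in estimates.values()) + extras
print(f"{total:,}")  # 190,500,020,042
```

The explicitly listed terms add up to 190,500,020,042. The headline figure on the slide is 5,000 higher, a gap that the trailing "…" of the calculation leaves unspecified.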

About 190 billion proteins (?)
About 12.0 million ‘known’ protein sequences in 2010 (from ~290’000 species)
- More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
- Less than 1 % come from direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from… (sequencing & gene prediction quality)!

Menu Introduction Nucleic acid sequence databases ENA/GenBank, DDBJ Protein sequence databases UniProt databases (UniProtKB) NCBI protein databases

ENA (EMBL-Bank) GenBank DDBJ

http://www.insdc.org/ ENA/GenBank/DDBJ

The hectic life of a sequence…
cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ: archive of primary sequence data and corresponding annotation submitted by the laboratories that did the sequencing.
Data not submitted to public databases, delayed or cancelled…

Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ AC number is not available… ‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’ …not yet the case for protein sequences!

taxonomy Cross-references references accession number

CDS annotation (Prediction or experimentally determined) sequence CDS CoDing Sequence (proposed by submitters) annotation provided by the laboratories that did the sequencing

[Figure: alignment between an mRNA (CONTIG) and a genomic sequence (Genomic). The contig matches the genomic sequence over the exons and shows gaps over the introns; the CoDing Sequence runs from the ATG start codon to the TAA stop codon.]

CDS translation provided by ENA CDS provided by the submitters The first Met !
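The translation of a CDS starting at the first Met can be sketched as follows. This is a simplified illustration, assuming the standard genetic code and taking the first ATG found in the sequence as the start codon; real CDS annotation relies on the coordinates supplied by the submitters, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_met(nucleotides):
    """Translate from the first ATG until a stop codon or the end of the sequence."""
    seq = nucleotides.upper().replace(" ", "")
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons (e.g. with N)
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Start of the coding region shown in the mRNA/genomic alignment
print(translate_from_first_met("TCAACA ATG AAAGGTCGAAACCTG"))  # MKGRNL
```

Choosing a different ATG, or a sequence with a frameshift, changes the resulting protein, which is why the quality of the submitted CDS matters.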

Very rarely done… Pitfall no 1 – gene prediction ‘quality’

Complete genome (submitted) but only ~2,000 CDS/proteins available! Pitfall no 2 – CDS annotation and submission

http://www.ebi.ac.uk/swissprot/sptr_stats/index.html …annotated CDS in UniProtKB (no gene prediction) (~290’000 species)

From nucleic acid to amino acid sequence databases…

The hectic life of a protein sequence … cDNAs, ESTs, genomes, … ENA, GenBank, DDBJ Data not submitted to public databases, delayed or cancelled… …if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries) Protein sequence databases Nucleic acid databases Gene prediction RefSeq, Ensembl no CDS

Why do things in a simple way, when you can do them in a very complex one?

The hectic life of a sequence … TrEMBL Genpept CoDing Sequences provided by submitters cDNAs, ESTs, genomes, … ENA, GenBank, DDBJ Data not submitted to public databases, delayed or cancelled… Swiss-Prot RefSeq PRF Scientific publications derived sequences Ensembl CCDS UniParc UniProtKB PDB(PIR) + all ‘species’ specific databases (EcoGene, TAIR, …) (IPI) UniMES CoDing Sequences provided by submitters and gene prediction TPA

Major protein sequence database ‘sources’
UniProtKB: Swiss-Prot + TrEMBL
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA
- UniProtKB/Swiss-Prot: manually annotated protein sequences (12’200 species)
- UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non redundant with Swiss-Prot (290’000 species)
- GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (230’000 species ?)
- PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
- PDB: Protein Databank: 3D data and associated sequences (PIR-NRL3D)
- PRF: Protein Research Foundation: journal scan of ‘published’ peptide sequences
- RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
- TPA: Third Party Annotation Sequence Database (update of entries derived from GenBank primary data)
Integrated resources ‘cross-references’ Resources kept separated

Major protein sequence database ‘composite’
- Ensembl: UniProtKB + RefSeq + gene prediction (40 species)
- Vega: Ensembl + gene prediction + manual annotation (5 species)
- IPI: UniProtKB + RefSeq + Ensembl + TAIR (Arabidopsis db) + H-InvDB (human cDNAs manual annotation) + VEGA (Vertebrate Genome Annotation) (7 species). Closure in 2010!
- CCDS: consensus between EBI, NCBI, Sanger, UCSC (3 species)
Others:
- OWL: Swiss-Prot + PIR + PDB + GenPept (obsolete)
- MSDB (Mascot): Swiss-Prot + PIR + PDB + TrEMBL + GenBank…
- dbESTs: translated ESTs (in the 6 frames; no annotated CDSs, no gene prediction)

Different protein databases available for different online proteomic tools
- Phenyx: UniProtKB, IPI, NCBInr
- Mascot: NCBInr, Swiss-Prot, dbEST
- Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept
- Aldente: UniProtKB
- ProFound: NCBInr, Swiss-Prot, dbEST
- OMSSA: NCBInr and RefSeq

Databases used: study done on 28 proteomic papers (from 2010)
- ~61% of labs (the majority) use IPI
- ~18% use Swiss-Prot (mainly human, some bacteria)
- ~20% use other sources such as NCBI, SGD or in-house developed databases
Personal communication: Silvia Jimenez, Nov 2010

Menu Introduction Nucleic acid sequence databases ENA/GenBank, DDBJ Protein sequence databases UniProt databases (UniProtKB) NCBI protein databases

UniProt: SIB + EBI + PIR

- UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~12 million entries)
- UniParc: protein sequence archive (ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated (query, Blast, download) (~25 million entries)
- UniRef: 3 clusters of protein sequences with 100, 90 and 50 % similarity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)
- UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)

UniProt databases

UniProtKB: an encyclopedia on proteins, composed of 2 sections (UniProtKB/TrEMBL and UniProtKB/Swiss-Prot), released every 4 weeks

UniProtKB: from ENA to TrEMBL
UniProtKB protein sequence data are mainly derived from ENA (CDS) but also from Ensembl and other sequence resources such as RefSeq or model organism databases (MODs). Data from the PIR database have been integrated in UniProt since 2003.

ENA -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references + automated annotation

The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the information provided by the submitter of the original nucleotide entry.
Automated annotation:
- Redundancy check (100% merge)
- Family attribution (InterPro)
- Many other cross-references
- Rule-based automated annotation

UniProtKB: from TrEMBL to Swiss-Prot
Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL -> minimal redundancy

ENA -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references + automated annotation
TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information

Sequence Sequence features Ontologies References Nomenclature Splice variants Annotations

UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)

UniProtKB/Swiss-Prot protein sequence annotation
1 entry, 1 gene (1 species), 1 displayed sequence (annotation of alternative sequences, when available)
- The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.
- The displayed sequence is generally derived from the translation of the genomic sequence (when available).
- Sequence differences are documented.

What is the current status?
At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence.
Typical problems:
- unsolved conflicts;
- uncorrected initiation sites;
- frameshifts;
- other ‘problems’

… once upon a time, it was a gene on chromosome 11…
All these sequences are available in protein sequence databases (e.g. GenPept)!

Quality of protein information from genome projects
Let's look at proteins originating from genome projects:
- Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;
- Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but with no update since then (at least in the public view): 20% of the gene models are erroneous;
- Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins;
- Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made: start codons, missed small proteins (<100 aa)…

UniProtKB/Swiss-Prot Protein sequence annotation

Example of a problem (derived from a gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.

Producing a clean set of sequences is not a trivial task; It is not getting easier as more and more types of sequence data are submitted;

‘Protein existence’ tag
The ‘Protein existence’ tag indicates the evidence for the existence of a given protein (it is not associated with sequence quality). Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)

In order to avoid ‘pseudogenes’ and most of the improbable protein sequences, you can filter your query and exclude sequences with ‘protein existence tag’ = ‘Uncertain’
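The same filter can be applied to a downloaded flat file. The following is a minimal sketch, assuming UniProtKB flat-file text in which each entry has an ID line and a PE line and ends with `//`. The function names are hypothetical, and the two TOY entries in the sample are invented for illustration.

```python
import re

SAMPLE = """\
ID   URAD_HUMAN              Unreviewed;       171 AA.
PE   4: Predicted;
//
ID   TOY1_HUMAN              Unreviewed;       100 AA.
PE   5: Uncertain;
//
ID   TOY2_HUMAN              Reviewed;         200 AA.
PE   1: Evidence at protein level;
//
"""


def protein_existence_levels(flat_file_text):
    """Return {entry name: PE level} for every entry in the flat-file text."""
    levels = {}
    entry_name = None
    for line in flat_file_text.splitlines():
        if line.startswith("ID "):
            entry_name = line.split()[1]
        elif line.startswith("PE ") and entry_name:
            match = re.match(r"PE\s+(\d):", line)
            if match:
                levels[entry_name] = int(match.group(1))
        elif line.startswith("//"):
            entry_name = None  # end of the current entry
    return levels


def keep_supported(levels, excluded=(5,)):
    """Drop entries whose PE level is excluded (by default 5, 'Uncertain')."""
    return sorted(name for name, level in levels.items() if level not in excluded)


print(keep_supported(protein_existence_levels(SAMPLE)))  # ['TOY2_HUMAN', 'URAD_HUMAN']
```

Passing `excluded=(4, 5)` would also drop predicted entries such as URAD_HUMAN.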

The ‘alternative’ sequence(s)

Proteome complexity: example with human (Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154).
Not predictable at the genome level! -> important post-genomic data! ~20’000

Multiple alignment of the end of the available GCR sequences
Annotation of the sequence differences (protein diversity): 1 entry, 1 gene (1 species) …and natural variant

P04150 www.uniprot.org

Important remark: > 30’000 additional sequences (total), available in separate files!

The ‘alternative’ sequence(s) are not ‘directly available’ to a lot of tools, including protein identification tools and Blast, depending on the server!

UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)

Extract literature information and protein sequence analysis
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (Prosite, Anabelle)
- contacts with experts
- other databases
- nomenclature committees
Maximum usage of controlled vocabulary: Keywords, Tissues, Post-translational modifications, Strains, Species, Subcellular location, Extracellular domains, Journals, … Gene Ontology…

Protein nomenclature

…enable researchers to obtain a summary of what is known about a protein… General annotation (Comments) www.uniprot.org

Sequence annotation (Features) …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org

Ontologies: Swiss-Prot keywords, Gene Ontology (GO terms)

Human protein manual annotation: some statistics (Aug 2010)

Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between the two.
Level | Type of evidence | Qualifier
1st | Strong experimental evidence | (none)
2nd | Light experimental evidence | Probable
3rd | Inferred by similarity with homologous protein (data of 1st or 2nd level) | By similarity
4th | Inferred by sequence prediction | Potential

Do proteomic analysis tools make use of sequence annotation?
- Phenyx: UniProtKB**, IPI, NCBInr
- Mascot: NCBInr, Swiss-Prot*, dbEST, OWL, MSDB
- Protein prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL
- Aldente: UniProtKB**
- ProFound: NCBInr, Swiss-Prot, dbEST
- OMSSA: NCBInr and RefSeq
** the tool takes into account AP, PTM, … but not yet variant/conflict annotations
* the tool takes into account AP annotations

Identification of biologically active proteins: use Swiss-Prot annotations with Phenyx
- Sequence processing annotations: removal of signal peptides; removal of transit peptides; extraction of active chains
- Post-translational modifications
- Sequence variants: splicing variants; sequence mutations

Access to UniProtKB www.uniprot.org

Search: a very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information

The search interface guides users with helpful suggestions and hints

Result pages: highly customizable

The URL (results) can be bookmarked and manually modified.
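Because the query is carried in the URL, a results URL can also be assembled programmatically. This is a minimal sketch, assuming the 2010-era pattern `http://www.uniprot.org/<section>/?query=...` seen in this course; the helper name is hypothetical and the current website may use different endpoints.

```python
from urllib.parse import urlencode


def build_query_url(section, query, base="http://www.uniprot.org"):
    """Assemble a bookmarkable results URL for a site section and a query string."""
    return f"{base}/{section}/?{urlencode({'query': query})}"


print(build_query_url("taxonomy", "complete:yes"))
# http://www.uniprot.org/taxonomy/?query=complete%3Ayes
```

The colon is percent-encoded as `%3A`, which is equivalent to the literal `complete:yes` form.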

Blast: a tool with the standard options for searching sequences in UniProt databases

Blast results: customize display

Align: a ClustalW multiple alignment tool with options for highlighting amino acids and feature annotations

ClustalW multiple alignment of insulin sequences

Retrieve: a UniProt-specific tool for retrieving a list of entries in several standard formats. You can then query your ‘personal database’ with the UniProt search tool.

AC
A large protein list is not the end point in proteomics -> importance of protein sequence annotation

Retrieve tool (UniProt)

Play with the customize display tool…

~40 % of human proteins have no known function (experimental data)… but many more are associated with GO terms… (computer-assigned).

ID Mapping: maps the identifiers of a given protein between different databases

These identifiers all point to TP53 (p53): P04637, NP_000537, ENSG00000141510, CCDS11118, GC17M007512, UPI000002ED67, IPI00025087, etc.
Question: same protein sequence?
- Specific and unique for each database…
- Essential for retrieving your data and for citation
- Beware of their ‘stability’… (linked to a protein sequence or to a gene)
- ID mapping tools
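A first step before any ID mapping is recognising which database an identifier comes from. The following is a rough sketch using regular expressions. The patterns are simplified assumptions written for the identifiers listed above (in particular, the GeneCards pattern is a guess based on the single example GC17M007512), and the function name is hypothetical.

```python
import re

# (database label, pattern); patterns are simplified and illustrative only
ID_PATTERNS = [
    ("UniParc", r"UPI[0-9A-F]{10}"),
    ("IPI", r"IPI\d{8}"),
    ("Ensembl gene", r"ENSG\d{11}"),
    ("CCDS", r"CCDS\d+"),
    ("RefSeq protein", r"[NXY]P_\d+(\.\d+)?"),
    ("GeneCards", r"GC\w{2}[PM]\d+"),
    ("UniProtKB AC", r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9]"),
]


def guess_database(identifier):
    """Return the label of the first pattern that matches the whole identifier."""
    for label, pattern in ID_PATTERNS:
        if re.fullmatch(pattern, identifier):
            return label
    return "unknown"


for identifier in ["P04637", "NP_000537", "ENSG00000141510", "CCDS11118",
                   "GC17M007512", "UPI000002ED67", "IPI00025087"]:
    print(identifier, "->", guess_database(identifier))
```

Recognising the source database does not answer the question on the slide: a gene-level identifier such as ENSG00000141510 and a sequence-level identifier such as UPI000002ED67 need not correspond to the same protein sequence.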

Complete proteomes http://www.uniprot.org/taxonomy/?query=complete:yes KW: Complete proteome

Download

Downloading UniProt: http://www.uniprot.org/downloads

UniProtKB Statistics

520’000 (Swiss-Prot, 12’000 species) + 11’600’000 (TrEMBL, 290’000 species) -> 12’000’000
Swiss-Prot & TrEMBL introduce a new arithmetical concept!
Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot

UniProtKB/Swiss-Prot (manual annotation): 12’000 species, mainly model organisms
UniProtKB/TrEMBL (automatic annotation)

Not yet available

~200 new entries / day; new release every 4 weeks
- Annotation is useful, good annotation is better, update is essential!
- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot

UniProtKB entry history Always cite the primary accession number (AC) !

Menu Introduction Nucleic acid sequence databases ENA/GenBank, DDBJ Protein sequence databases UniProt databases (UniProtKB) NCBI protein databases

NCBI protein databases (Entrez protein, NCBI nr) http://www.ncbi.nlm.nih.gov/protein

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

RefSeq
Produced by NCBI and NLM
Information:
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch1#GenBank_ASM
http://www.ncbi.nlm.nih.gov/RefSeq/
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA

RefSeq
Reference Sequence (RefSeq) provides one example of each natural biological molecule (protein, mRNA, genomic DNA) for major organisms (11’000 species)
- several entries for the same gene (if alternative splicing)
- gene prediction
- manual annotation (‘reviewed’ tagged entries)
- annotations are mainly found in Entrez Gene (« interdependent curated resources »)
- can be queried via the Entrez protein system
- nice accession numbers: NP_, NM_, etc…

Query RefSeq

KW AC Taxonomy References

GenBank source and status Annotation and ontologies

UniProtKB vs RefSeq

RefSeq chooses one or several protein reference sequences for a given gene and does not annotate the sequence differences.
- If there is an alternative splicing event, there will be several distinct entries for a given gene
Example: GCR_HUMAN (UniProtKB/Swiss-Prot): 1 UniProtKB entry cross-linked with 7 RefSeq entries

GI number (‘GenInfo identifier’ number)
- In addition to an AC number specific to the original database, each protein sequence in the NCBInr database (including Swiss-Prot entries) has a GI number.

GI number: ‘GenInfo identifier’ number
- If the sequence changes in any way, a new GI number is assigned.
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi

ID/AC mapping

http://www.ebi.ac.uk/Tools/picr/

Protein databases for proteomic analysis…

IPI http://www.ebi.ac.uk/IPI/IPIhelp.html
IPI closure in 2010!

IPI: automatic approach that builds clusters by combining knowledge already present in the primary data sources (UniProtKB, RefSeq, Ensembl) with sequence similarity.
IPI = UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR and VEGA)
Human, mouse, rat, zebrafish, Arabidopsis, chicken and cow proteomes

No annotation

Mascot http://www.matrixscience.com/search_intro.html

Update: November 11th…

ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org)
Protein and nucleotide data | Genomic, RNA and protein data | Protein data only
Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry
Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL
Author submission | NCBI creates from existing data | UniProt creates from existing data
Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein of major organisms (in Swiss-Prot; TrEMBL is redundant)
Records can contradict each other | - | Identification and annotation of discrepancy
No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms
Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)

All documents are online: http://education.expasy.org/cours/CJSFEAP2010/

MCB September, 2010 Protein Sequence Databases Additional material

MCB September, 2010 Protein Sequence Databases PIR

PIR: the Protein Identification Resource. PIR-PSD is no longer updated, but exists as an archive.

PDB

PDB (Protein Data Bank), 3D structure
- Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
- Also contains the corresponding protein sequences *
- Includes protein sequences which are mutated, chimeric etc. (created specifically to study the effect of a mutation on the 3D structure)
* The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

PDB: Protein Data Bank www.rcsb.org/pdb/
- Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
- Associated specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
- Currently there are ~67’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!

PDB: example

Coordinates of each atom Sequence
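To make "coordinates of each atom" concrete, here is a minimal sketch of reading one ATOM record from a PDB-format file. The column positions follow the fixed-width PDB layout; the sample record is assembled field by field and its values are made up for illustration, not taken from a real entry.

```python
def parse_atom_record(line):
    """Extract the main fields of a fixed-width PDB ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


# Illustrative record built from its fixed-width fields (values are made up).
SAMPLE_LINE = (
    "ATOM  "      # record name
    + "    1"     # atom serial number
    + " "
    + " N  "      # atom name
    + " "         # alternate location indicator
    + "MET"       # residue name
    + " "
    + "A"         # chain identifier
    + "   1"      # residue sequence number
    + "    "      # insertion code + padding
    + "  27.340"  # x
    + "  24.430"  # y
    + "   2.614"  # z
)

if __name__ == "__main__":
    print(parse_atom_record(SAMPLE_LINE))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real PDB files can touch each other without a separating space.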

Visualisation with Jmol

PRF Protein Research Foundation

http://www.genome.jp/dbget-bin/www_bfind?prf Looks for the peptide sequences described in publications (and which have not been submitted to databases!)

Ensembl http://www.ensembl.org/
Review: http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610
Annotation pipeline: http://www.genome.org/cgi/content/full/14/5/942

- Ensembl aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)
- It also does gene prediction (-> novel genes)
Ensembl = UniProtKB + RefSeq + gene prediction
- DNA, RNA and protein sequences are available for several species.
- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.

Example of problem: Ensembl completes the human ‘proteome’ by annotating missing genes according to ortholog sequences.

ID URAD_HUMAN Unreviewed; 171 AA.
AC A6NGE7;
DT 24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT 24-JUL-2007, sequence version 1.
DT 02-OCT-2007, entry version 3.
DE 2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE (OHCU decarboxylase homolog) (Parahox neighbour).
GN Name=PRHOXNB;
…
DR EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR Ensembl; ENSG00000183463; Homo sapiens.
DR HGNC; HGNC:17785; PRHOXNB.
PE 4: Predicted;

In primates the genes coding for the enzymes catalyzing the degradation of uric acid were inactivated and converted to pseudogenes, so this predicted human protein most likely does not exist.
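Entries like this one can be screened automatically before a database search. The sketch below assumes the UniProtKB flat-file layout (a two-letter line code followed by the value) and flags entries that are unreviewed and whose protein existence (PE) level is 4 (predicted) or 5 (uncertain). The function names are illustrative, and the embedded entry is abridged.

```python
from collections import defaultdict

ENTRY = """\
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
GN   Name=PRHOXNB;
DR   Ensembl; ENSG00000183463; Homo sapiens.
PE   4: Predicted;
"""


def parse_flat_entry(text):
    """Collect the values of each two-letter line code of a flat-file entry."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code = line[:2]
        if len(code) == 2 and code.isalpha() and code.isupper():
            fields[code].append(line[2:].strip())
    return dict(fields)


def needs_caution(fields):
    """True for unreviewed entries with PE level 4 (predicted) or 5 (uncertain)."""
    unreviewed = "Unreviewed" in fields.get("ID", [""])[0]
    pe_level = fields.get("PE", [""])[0][:1]
    return unreviewed and pe_level in ("4", "5")


if __name__ == "__main__":
    parsed = parse_flat_entry(ENTRY)
    print(parsed["AC"], needs_caution(parsed))
```

A filter of this kind does not prove that a protein is absent; it only marks identifications that deserve a second look.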

CCDS

http://www.ncbi.nlm.nih.gov/CCDS/

CCDS (human) Combines different approaches (ab initio, by similarity) and takes advantage of the expertise acquired by different institutes, including manual annotation… Consensus between 4 institutions…

UniParc

UniParc
- non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins…)
- the equivalent of ENA/GenBank/DDBJ at the protein level
- species-merged: sequences from different species are merged when they are 100% identical over the whole length
- no annotation (only taxonomy)
- can be searched only with database names, taxonomy, checksum (CRC64) and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs
- Beware: contains wrong predictions, pseudogenes etc.
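The "species-merged" idea can be sketched as a dictionary keyed by a checksum of the sequence: every record with a 100% identical full-length sequence lands in the same archive entry, whatever its species or source database. UniParc itself uses CRC64; the MD5 digest below is only a stand-in chosen because it ships with Python, and all records in the example are made up.

```python
import hashlib


def build_archive(records):
    """Merge (database, accession, organism, sequence) records by exact sequence."""
    archive = {}
    for database, accession, organism, sequence in records:
        seq = sequence.upper()
        # Stand-in checksum; the real archive uses CRC64.
        key = hashlib.md5(seq.encode("ascii")).hexdigest()
        entry = archive.setdefault(
            key, {"sequence": seq, "sources": [], "organisms": set()}
        )
        entry["sources"].append((database, accession))
        entry["organisms"].add(organism)
    return archive


# Hypothetical records: the first two share an identical sequence.
RECORDS = [
    ("UniProtKB", "ACC_1", "Homo sapiens", "MKTAYIAKQR"),
    ("RefSeq", "ACC_2", "Pan troglodytes", "MKTAYIAKQR"),
    ("Ensembl", "ACC_3", "Homo sapiens", "MKTAYIAKQK"),
]

if __name__ == "__main__":
    for entry in build_archive(RECORDS).values():
        print(entry["sequence"], sorted(entry["organisms"]), entry["sources"])
```

Because the key depends only on the sequence, nothing in such an archive says whether the sequence is a real protein, which is why the warning about wrong predictions and pseudogenes applies.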

Query UniParc

UniRef

‘UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences’

«Collapsing BLAST results» Three collections of sequence clusters from UniProtKB and selected UniParc entries:
- One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12 %
- One UniRef90 entry -> sequences that have at least 90 % identity -> reduction of 40 %
- One UniRef50 entry -> sequences that are at least 50 % identical -> reduction of 65 %
Based on sequence identity -> independent of the species!
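How an identity threshold collapses a sequence set can be illustrated with a toy greedy clustering: sequences are visited from longest to shortest and each one joins the first cluster whose representative it matches at or above the threshold. This is a simplified sketch, not the procedure UniRef actually runs; in particular the identity measure is a crude ungapped, position-by-position comparison standing in for a real alignment, and the sequences are made up.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (ungapped)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def greedy_cluster(sequences, threshold):
    """Return a list of (representative, members) pairs."""
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


# Made-up sequences: a duplicate, a one-residue variant, a fragment, an outlier.
SEQUENCES = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAKQR", "GGGGGGGGGG", "MKTAY"]

if __name__ == "__main__":
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, len(greedy_cluster(SEQUENCES, threshold)))
```

Species never enters the computation, which mirrors the point that the clusters are independent of the species.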

Independent of species and sequence length UniRef 90

UniMES

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only GOS data for the moment). Download only (but included in UniParc -> BLAST).
- UniMES FASTA sequences
- UniMES matches to InterPro methods
ftp.uniprot.org/pub/databases/uniprot

UniMES: sequences in FASTA format
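Since UniMES is download-only, its FASTA files have to be read locally. A minimal FASTA reader can be sketched as follows; the headers and sequences in the sample are made up and do not reproduce the real UniMES header layout.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record sample; sequence lines may wrap.
SAMPLE = """\
>example_seq_1 marine metagenome
MKTAYIAKQR
QISFVKSHFS
>example_seq_2 marine metagenome
MGSSHHHHHH
"""

if __name__ == "__main__":
    for name, sequence in read_fasta(SAMPLE):
        print(name, len(sequence))
```

For a real download, the same function can be fed the contents of the file read with `open(path).read()`, or adapted to iterate over the file line by line if the file is large.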

- Phenyx: UniProtKB***, IPI, NCBInr
- Mascot: NCBInr, Swiss-Prot*, dbEST, OWL, MSDB
- Protein Prospector: NCBInr, Swiss-Prot, dbEST, GenPept, Ludwignr, OWL
- Aldente: UniProtKB***
- ProFound: NCBInr, Swiss-Prot, dbEST
- OMSSA: NCBInr and RefSeq
Translation of EST sequences in the 6 frames (ESTs are not associated with annotated CDSs!)
*** the tool takes into account AP, PTM, variant/conflict annotations
* the tool takes into account AP annotations (but not for the online public version)
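Because an EST carries no annotated CDS, a search engine has to translate it in all six reading frames (three on each strand) and match peptides against every result. A minimal sketch with the standard genetic code is shown below; the function names are illustrative, and codons containing anything other than A, C, G or T are rendered as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translation(dna):
    """Return the six conceptual translations keyed by frame (+1..+3, -1..-3)."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    print(six_frame_translation("ATGGCC"))
```

Most of the six translations of a real EST are biologically meaningless, which inflates the search space compared with a database of annotated protein sequences.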

Use of UniProtKB/Swiss-Prot annotation by Phenyx

Improve identification of biologically active proteins: use Swiss-Prot annotations with Phenyx
- Sequence processing annotations
  – Removal of signal peptides
  – Removal of transit peptides
  – Extraction of active chains
- Post-translational modifications
- Sequence variants
  – Splicing variants
  – Sequence mutations
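The effect of these annotations on the searched sequences can be sketched as follows: extracting the annotated chain removes the signal peptide, and each annotated variant produces an additional candidate sequence. This is a hypothetical illustration, not Phenyx's implementation; the `Feature` class, the feature names and the example protein are all made up, and variants are assumed to be same-length substitutions so that coordinates stay valid.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    kind: str              # "SIGNAL", "CHAIN" or "VARIANT" (illustrative names)
    start: int             # 1-based, inclusive
    end: int               # 1-based, inclusive
    replacement: str = ""  # substituted residues, for VARIANT only


def apply_variant(sequence, variant):
    """Replace residues start..end by the variant residues."""
    return sequence[:variant.start - 1] + variant.replacement + sequence[variant.end:]


def candidate_sequences(sequence, features):
    """Mature chains of the reference sequence and of each variant form."""
    chains = [f for f in features if f.kind == "CHAIN"]
    variants = [f for f in features if f.kind == "VARIANT"]
    full_forms = [sequence] + [apply_variant(sequence, v) for v in variants]
    candidates = []
    for form in full_forms:
        for chain in chains:
            piece = form[chain.start - 1:chain.end]
            if piece not in candidates:
                candidates.append(piece)
    return candidates


# Made-up precursor: 10-residue signal peptide followed by a 12-residue chain.
PRECURSOR = "MKLLVLLAGSAARTPEPTIDEK"
FEATURES = [
    Feature("SIGNAL", 1, 10),
    Feature("CHAIN", 11, 22),
    Feature("VARIANT", 13, 13, "Q"),
]

if __name__ == "__main__":
    print(candidate_sequences(PRECURSOR, FEATURES))
```

Peptides that span the signal-peptide cleavage site or that carry the variant residue can only be matched if the search works on such processed sequences rather than on the raw precursor.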

Phenyx and alternative sequences http://www.genebio.com/products/phenyx/features.html#section4
