Protein Sequence Databases
Marie-Claude.Blatter@isb-sib.ch
Swiss-Prot group, Geneva
SIB Swiss Institute of Bioinformatics
http://education.expasy.org/cours/Murcia2011/
Murcia, February, 2011

Menu
– Introduction
– Nucleic acid sequence databases: ENA, GenBank, DDBJ
– Protein sequence databases: UniProt databases (UniProtKB), NCBI protein databases, other databases (Ensembl, IPI, CCDS, …)

Indispensable for bioinformatic studies
1. Databases (free access on the web)
2. Software tools
3. Servers

What is a database?
A collection of related data, which are:
– structured
– searchable
– updated periodically
– cross-referenced
It also includes the associated tools necessary for access/query, download, etc.

Why biological databases?
– Exponential growth in biological data.
– Data (genomic sequences, protein sequences, 3D structures, 2D gel electrophoresis, MS analysis, microarrays, publications…) are no longer published in a conventional manner, but directly submitted to databases.
– Essential tools for biological research.

The NAR Online Molecular Biology Database collection in 2011
A total of 1’330 databases
http://nar.oxfordjournals.org/content/38/suppl_1

Categories of databases for Life Sciences
– Sequences (DNA, protein): DNA/RNA: EMBL/GenBank/DDBJ; protein: UniProtKB, NCBInr
– Genomics: OMIM, FlyBase
– 3D structure: PDB
– Mutation/polymorphism: dbSNP
– Protein domain/family: InterPro
– Metabolism/Pathways: KEGG
– Bibliography: PubMed
– ‘Others’ (protein-protein interaction, microarrays…)

DNA sequences Human Genome Gene Annotation Protein Sequences Macromolecular Structure Data Microarray Expression Data

Proliferation of databases
– Which contains the highest-quality data?
– Which is comprehensive?
– Which is up-to-date?
– Which is redundant?
– Which is indexed (allows complex queries)?
– Which Web server responds most quickly?
– …?

Awareness of the content and usage of knowledge resources is a prerequisite for any type of « serious » research in the field of molecular life sciences (AMB, 2007)

Where can we find…
– A video -> YouTube
– Info on S. Hawking -> Wikipedia
– A book -> Amazon
– A friend -> Facebook
Usually only one server.
– DNA sequence -> EMBL
– Protein sequence -> UniProtKB, RefSeq…
Several different servers give access to the ‘same’ database.

Servers
‘Any computer (…) serving out applications or services can technically be called a server.’ (Wikipedia)

EBI: http://www.ebi.ac.uk/

NCBI: http://www.ncbi.nlm.nih.gov/

ExPASy: http://expasy.org

UniProt: www.uniprot.org

How to find a database?
Beware: not all servers give access to the latest version of a database. It is important to know the ‘home server’ of a given database.
– ExPASy life sciences directory: links to the ‘home’ servers (www.expasy.org/alinks.html)
– Google (http://www.google.com): results are not always linked to the ‘home’ server

http://www.expasy.org/

http://www.expasy.org/links.html

The same data on different servers… UniProt, NCBI

http://srs.dna.affrc.go.jp/srs8/srs?-id+1QexuT1Yn4Di0xF+[uniprot_swissprot-AccNumber:P16855]+-e

Proteins… proteins

Protein sequences are the fundamental determinants of biological structure and function.
http://www.ncbi.nlm.nih.gov/protein

Protein sequence databases are essential for…
– Identification of proteins by proteomics --> completeness, sequence quality. ‘Producing large protein lists is not the end point in Proteomics’ -> extract knowledge
– Similarity searches, BLAST (functional prediction) --> sequence quality (no redundancy)
– Training datasets (prediction tools, PTM etc.) --> sequence and annotation quality
– Creation of DNA chips for mRNA expression studies --> completeness (complete proteome), sequence quality

TrEMBL, GenPept, Swiss-Prot, RefSeq, PRF, Ensembl, CCDS, UniParc, UniProtKB, PDB, (PIR), (IPI), UniMES, TPA, NCBInr ?

These identifiers all point to the same sequence, that of TP53 (p53)!
P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, HIT000320921, XP_001172091, DD954676, JT0436, etc.

A HUPO test sample study reveals common problems in mass spectrometry–based proteomics (PubMed 19448641, 2009)
– A single mass spectrometry experiment can identify up to about 4000 proteins (15’000 peptides).
– Protein databases vary greatly in terms of their curation, completeness and comprehensiveness (searches against different protein databases can give different results).
– Only 7 labs (out of 27) were able to identify the 20 human proteins present in a sample, mainly because the search engines used cannot distinguish among different identifiers for the same protein…
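One way to reduce this kind of identifier confusion is to recognise the naming scheme of each identifier before comparing protein lists. The following is a minimal sketch in Python, not the method used in the HUPO study. The function name `classify_identifier` and the label strings are hypothetical, and the patterns are simplified assumptions modelled on the identifiers shown in these slides (the UniProtKB pattern follows the accession format documented by UniProt).

```python
import re

# Simplified identifier patterns (assumptions for illustration only).
PATTERNS = [
    ("UniProtKB accession",
     r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"),
    ("RefSeq protein", r"[NXY]P_[0-9]+(?:\.[0-9]+)?"),
    ("Ensembl gene", r"ENSG[0-9]{11}"),
    ("CCDS", r"CCDS[0-9]+(?:\.[0-9]+)?"),
    ("UniParc", r"UPI[0-9A-F]{10}"),
    ("IPI", r"IPI[0-9]{8}(?:\.[0-9]+)?"),
]


def classify_identifier(identifier):
    """Return the name of the first scheme whose pattern matches the whole identifier."""
    for label, pattern in PATTERNS:
        if re.fullmatch(pattern, identifier):
            return label
    return "unknown"


for example in ["P04637", "NP_000537", "ENSG00000141510", "UPI000002ED67"]:
    print(example, "->", classify_identifier(example))
```

Identifiers that follow other schemes (for example JT0436 or DD954676 from the list above) are reported as "unknown" rather than guessed. Recognising the scheme is only a first step: mapping identifiers from different schemes onto the same protein still requires a cross-reference table.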

Protein sequence origin…

Protein sequence origin
More than 99 % of the protein sequences are derived from the translation of nucleotide sequences (genomes and/or cDNAs).
-> It is important to know where the protein sequence comes from… (sequencing & gene prediction quality)!

Flood of data: example with the genome sequences…

New challenge
Flood of data -> the data need to be stored, curated and made available for analysis and knowledge discovery

… ~ 2500 genomes sequenced (single organism, varying sizes, including viruses)
… ~ 5’000 ongoing genome sequencing projects

http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
~ 50-100 genomes/month + ~2’500 viral genomes => Total ~ 5’000 genomes

… ~ 2500 genomes sequenced (single organism, varying sizes, including viruses)
… ~ 5’000 ongoing genome sequencing projects
… cDNA sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects = environmental samples: multiple ‘unknown’ organisms

Metagenomics
Metagenomics: the study of genetic material recovered directly from environmental samples
– Global Ocean Sampling (C. Venter): 1 ml of sea water contains 1 million bacteria and 10 million viruses
– Whale fall (AAFZ00000000.1)
– Soil, sand beach, New York air, …
– Human fluids, mouse gut (millions of bacteria within the human body)
– Water treatment industry…
Lists of projects: http://www.ncbi.nlm.nih.gov/genomes/lenvs.cgi
Venter’s Sorcerer II

… ~ 2500 genomes sequenced (single organism, varying sizes)
… ~ 5’000 ongoing genome sequencing projects
… cDNA sequencing projects (ESTs or cDNAs)
… metagenome sequencing projects
… personal human genomes
New-generation sequencers: Illumina: 25 billion bp/day

http://www.youtube.com/watch?v=mVZI7NBgcWM
… 2700 genomes in 2010, 30’000 genomes in 2011 ?
– 2’000’000 $ (2007)
– 70’000’000 $ (diploid, 2007)
– 3’000’000’000 $ (public consortium, 2000)
– 300’000’000 $ (Celera, 2000)
2010

But… we now know that his apoE allele is the one associated with an increased risk of Alzheimer’s disease and that he has the ‘blue eye’ allele…

apoE gene (Ensembl genome browser)

New projects
– 1000 Genomes (first publication, October 2010)
– Multiple personal genomes (sexual cells, lymphoid cells, cancer cells…)
– International Cancer Genome Consortium (www.icgc.org): they look at the most common cancers and, for each, sequence the genomes of 500 patients with cancer and 500 healthy individuals…

How many protein-coding genes in the end?

Peabody Museum exhibition on the Tree of Life
http://www.peabody.yale.edu/exhibits/treeoflife/

190’500’025’042
1st estimate: ~30 million species (1.8 million named)
2nd estimate:
– 20 million bacteria/archaea x 4'000 genes
– 1 million protists x 6'000 genes
– 5 million insects x 14'000 genes
– 2 million fungi x 6'000 genes
– 0.5 million plants x 20'000 genes
– 0.5 million molluscs, worms, arachnids, etc. x 20'000 genes
– 0.1 million vertebrates x 25'000 genes
The calculation:
2x10^7 x 4000 + 1x10^6 x 6000 + 5x10^6 x 14000 + 2x10^6 x 6000 + 5x10^5 x 20000 + 5x10^5 x 20000 + 1x10^5 x 25000 + 20000 (Craig Venter) + 42 (Douglas Adams) + …
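The estimate above is a plain sum of products, and it can be checked in a few lines. This is a small sketch in Python: the dictionary name `SECOND_ESTIMATE` and the function name are hypothetical, while the species counts and gene counts are those listed on the slide. Only the seven taxonomic groups are included, not the humorous extra terms.

```python
# (number of species, average number of genes per species), as listed on the slide
SECOND_ESTIMATE = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (1_000_000, 6_000),
    "insects": (5_000_000, 14_000),
    "fungi": (2_000_000, 6_000),
    "plants": (500_000, 20_000),
    "molluscs, worms, arachnids, etc.": (500_000, 20_000),
    "vertebrates": (100_000, 25_000),
}


def total_protein_coding_genes(estimate):
    """Sum species x genes over all groups."""
    return sum(species * genes for species, genes in estimate.values())


total = total_protein_coding_genes(SECOND_ESTIMATE)
print(f"{total:,}")  # 190,500,000,000
```

The seven groups alone give about 1.9 x 10^11 protein-coding genes, with bacteria/archaea and insects contributing most of the total.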

About 190 billion proteins (?)
About 13.0 million ‘known’ protein sequences in 2011 (from ~300’000 species)
– More than 99 % of the protein sequences are derived from the translation of nucleotide sequences
– Less than 1 % come from direct protein sequencing (Edman, MS/MS…)
-> It is important that users know where the protein sequence comes from… (sequencing & gene prediction quality)!

The ideal life of a sequence…
cDNAs, ESTs, genes, genomes, … -> Nucleic acid sequence databases -> Protein sequence databases

Menu
– Introduction
– Nucleic acid sequence databases: ENA, GenBank, DDBJ
– Protein sequence databases: UniProt databases (UniProtKB), NCBI protein databases

ENA (European Nucleotide Archive, EMBL-Bank), GenBank, DDBJ (DNA Data Bank of Japan)
Archives of primary sequence data and the corresponding annotation, submitted by the laboratories that did the sequencing.

ENA/GenBank/DDBJ
http://www.insdc.org/

The hectic life of a sequence…
cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ
Data not submitted to public databases, delayed or cancelled…

Journals do not (SHOULD NOT) accept a paper dealing with a nucleic acid sequence if the ENA/GenBank/DDBJ accession (AC) number is not available…
‘journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.’
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?highlight=refseq&rid=handbook.section.GenBank_ASM#GenBank_ASM.RefSeq
… this is not the case for protein sequences!!! And it is no longer the case for a lot of genomes!!!

ENA/GenBank/DDBJ
Serve as archives: ‘nothing goes out’
Contain all public sequences derived from:
– Genome projects (> 80 % of entries)
– Sequencing centers (cDNAs, ESTs…)
– Individual scientists (15 % of entries)
– Patent offices (e.g. European Patent Office, EPO)
Currently: ~200 x 10^6 sequences, ~300 x 10^9 bp; sequences from > 300’000 different species

Archival databases:
– Can be very redundant for some loci
– Sequence records are owned by the original submitter and cannot be altered by a third party (except TPA)

Organisms with the highest redundancy…

taxonomy, cross-references, references, accession number

CDS annotation (prediction or experimentally determined), sequence
CDS: CoDing Sequence (proposed by submitters)

The hectic life of a sequence…
cDNAs, ESTs, genes, genomes, … -> ENA, GenBank, DDBJ, with or without annotated CDS provided by authors
Data not submitted to public databases, delayed or cancelled…
CDS: CoDing Sequence, the portion of DNA/RNA translated into protein (from Met to STOP)
Experimentally proved or derived from gene prediction!!! Not so well documented!!!
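The ‘from Met to STOP’ definition of a CDS can be illustrated with a small translation routine. This is a minimal sketch in Python under stated assumptions: it uses the standard genetic code, reads the forward strand only, starts at the first ATG it finds and expects an already spliced (intron-free) sequence. The function name `translate_cds` and the toy input are hypothetical; this is not how ENA or UniProtKB produce their translations.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(nucleotides):
    """Translate from the first ATG up to (not including) the first in-frame stop codon."""
    seq = nucleotides.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return ""  # no initiator Met found
    protein = []
    for pos in range(start, len(seq) - 2, 3):
        # Codons containing ambiguous bases such as N are reported as X
        amino_acid = CODON_TABLE.get(seq[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ccATGAAAGGTTAAgg"))  # MKG
```

Choosing the first ATG is itself an assumption: as the slides point out, the annotated start codon is often a prediction, and a wrong choice changes the resulting protein sequence.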

[Figure: alignment between an mRNA (CONTIG) and a genomic sequence (Genomic). The CoDing Sequence runs from the ATG to the TAA; the aligned blocks are exons and the genomic stretches absent from the mRNA are introns.]

CDS translation provided by ENA
CDS provided by the submitters
The first Met!

A eukaryotic gene (UCSC)
5’, 3’, Met, STOP, initial exon, internal exons, final exon, introns, 3’ untranslated region
This particular gene lies on the reverse strand!

UCSC: human EPO
5’, 3’, contig, mRNAs and their corresponding CDS annotation (from EMBL/GenBank/DDBJ)

Complete genome (submitted) but only ~ 2,000 CDS/proteins available!

… annotated CDS in UniProtKB
http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

ENA/GenBank/DDBJ
Variable level of sequence quality:
– Sequencing quality
– Gene prediction quality
Authors can specify the nature of the CDS by using the qualifier "/evidence=experimental" or "/evidence=not_experimental". This is very rarely done…

Variable level of sequence quality: DNA vs RNA

RNA
– EST: Expressed Sequence Tags, produced by one-shot sequencing of a cloned cDNA (no CDS, but proteomic tools give access to ‘translated ESTs’)
– HTC: High Throughput cDNAs (CDS annotation)
DNA
– GSS: Genome Survey Sequences: similar to the EST division, except that most of the sequences are genomic in origin (no annotation, no CDS, with some exceptions (Drosophila))
– HTG: High-Throughput Genomic sequences: single-pass, unfinished genomic sequences (no annotation, no CDS, with some exceptions (Leishmania))
– WGS: Whole Genome Shotgun: contigs of a sequencing project. WGS data can contain annotation and should be updated as sequencing progresses (CDS annotation)

Complete proteomes, complete genomes?

Complete genomes?? (UCSC)

27478 contigs
Genome Reference Consortium: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml
N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs equal to or larger than this value
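The N50 definition above translates directly into a short computation: sort the contigs from longest to shortest and walk down the list until half of the total assembly length has been covered. A minimal sketch in Python follows. The function name `n50` and the contig lengths are made up for illustration and are not taken from the human assembly.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Hypothetical assembly of 300 kb in seven contigs (lengths in kb)
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

In this toy assembly the two longest contigs (80 + 70 = 150) already hold half of the 300 kb, so the N50 is 70 even though the plain median contig length is only 40.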

Genome sequencing and assembly: some caveats to deal with…
– ~ 350 gaps in 2010 (human genome)
– In the near future, we will have to deal with ‘incomplete genome’ sequences (never finished, metagenome…)…
– Prediction of ‘partial’ genes/exons is complex!
– Updates of genome sequences: not always ‘stable’ data…
– We are all different: -> ‘pan genome’?

From nucleic acid to amino acid sequence databases…

The hectic life of a protein sequence…
cDNAs, ESTs, genomes, … -> ENA, GenBank, DDBJ (nucleic acid databases) -> protein sequence databases
Data not submitted to public databases, delayed or cancelled…
– … if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)
– no CDS -> gene prediction (RefSeq, Ensembl)

The hectic life of a protein sequence…
cDNAs, ESTs, genomes, … -> ENA, GenBank, DDBJ (nucleic acid databases) -> protein sequence databases
Data not submitted to public databases, delayed or cancelled…
– … if the submitters provide an annotated Coding Sequence (CDS) (1/10 ENA entries)
– no CDS -> gene prediction (RefSeq, Ensembl and other*)
* 1000 genomes: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/2010_11/

Why do things in a simple way when you can do them in a very complex one?

The hectic life of a sequence … TrEMBL Genpept CoDing Sequences provided by submitters cDNAs, ESTs, genomes, … ENA, GenBank, DDBJ Data not submitted to public databases, delayed or cancelled… Swiss-Prot RefSeq PRF Scientific publications derived sequences Ensembl CCDS UniParc UniProtKB PDB(PIR) + all ‘species’ specific databases (EcoGene, TAIR, …) (IPI) UniMES CoDing Sequences provided by submitters and gene prediction TPA

Major ‘general’ protein sequence database ‘sources’
– UniProtKB: Swiss-Prot + TrEMBL
– NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA
The individual sources:
– UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
– UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)
– GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
– PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
– PDB: Protein Data Bank: 3D data and associated sequences
– PRF: Protein Research Foundation: journal scan of ‘published’ peptide sequences
– RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
– TPA: Third Party Annotation
Legend: integrated resources (‘cross-references’); resources kept separate

Look for toll-like receptor 4 (Homo sapiens) at www.uniprot.org
Swiss-Prot, TrEMBL

Look for toll-like receptor 4 (Homo sapiens) at http://www.ncbi.nlm.nih.gov/
GenPept, Swiss-Prot, RefSeq

Menu
– Introduction
– Nucleic acid sequence databases: ENA, GenBank, DDBJ
– Protein sequence databases: UniProt databases (UniProtKB), NCBI protein databases

UniProt
UniProt consortium: EBI + SIB + PIR
– What is UniProt?
– UniProtKB sequence curation
– UniProtKB biological data curation
– Statistics
– Access to UniProtKB

www.uniprot.org

UniProt databases

– UniProtKB: protein sequence knowledgebase with 2 sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (~13 million entries)
– UniParc: protein sequence archive (the ENA equivalent at the protein level). Each entry contains a protein sequence with cross-links to the other databases where the sequence is found (active or not). Not annotated (query, Blast, download) (~25 million entries)
– UniRef: 3 sets of clusters of protein sequences at 100, 90 and 50 % sequence identity; useful to speed up sequence similarity searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million entries; UniRef50: 3.3 million entries)
– UniMES: protein sequences derived from metagenomic projects (mostly Global Ocean Sampling (GOS)) (download) (8 million entries, included in UniParc)

UniProtKB/Swiss-Prot and UniProtKB/TrEMBL give access to all the protein sequences which are available to the public. However, UniProtKB excludes the following protein sequences:
– Most non-germline immunoglobulins and T-cell receptors
– Synthetic sequences
– Most patent application sequences
– Small fragments encoded from nucleotide sequence (<8 amino acids)
– Pseudogenes*
– Fusion/truncated proteins
– Not real proteins
* Many putative pseudogene sequences may be expected to remain in UniProtKB for some time, as it can be difficult to prove the non-existence of a protein.

UniProtKB
An encyclopedia on proteins, composed of 2 sections: UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
Released every 4 weeks

UniProtKB: from ENA to TrEMBL
UniProtKB protein sequence data are mainly derived from ENA (CDS), but also from Ensembl and other sequence resources such as RefSeq or model organism databases (MODs). Data from the PIR database have been integrated into UniProt since 2003.

ENA -> TrEMBL
Automated extraction of protein sequence (translated CDS), gene name and references + automated annotation

The quality of UniProtKB/TrEMBL data, including the protein sequence, is directly dependent on the information provided by the submitter of the original nucleotide entry.
Automated annotation:
– Redundancy check (100% merge (same length, not fragment))
– Family attribution (InterPro)
– Many other cross-references
– Rule-based automated annotation (~38% of TrEMBL entries)
Automated annotation systems:
– UniRule (RuleBase, HAMAP; manually reviewed)
– SAAS (automatically generated rules, i.e. via InterPro)
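The ‘100% merge’ redundancy check can be pictured as grouping full-length records that share both species and sequence. The following is a minimal sketch in Python of that idea only; it is not the UniProt production pipeline. The record fields (`accession`, `species`, `sequence`, `fragment`), the function name and the accessions A1 to A4 are all hypothetical.

```python
from collections import defaultdict


def merge_identical(records):
    """Group accessions of full-length records that share species and sequence."""
    groups = defaultdict(list)
    kept_apart = []
    for rec in records:
        if rec["fragment"]:
            # Fragments are never merged, even when the residues are identical
            kept_apart.append([rec["accession"]])
        else:
            # Identical sequences necessarily have the same length
            groups[(rec["species"], rec["sequence"])].append(rec["accession"])
    return list(groups.values()) + kept_apart


records = [
    {"accession": "A1", "species": "Homo sapiens", "sequence": "MKG", "fragment": False},
    {"accession": "A2", "species": "Homo sapiens", "sequence": "MKG", "fragment": False},
    {"accession": "A3", "species": "Mus musculus", "sequence": "MKG", "fragment": False},
    {"accession": "A4", "species": "Homo sapiens", "sequence": "MKG", "fragment": True},
]
print(merge_identical(records))  # [['A1', 'A2'], ['A3'], ['A4']]
```

Keeping the species in the grouping key reflects the ‘one protein sequence, one species’ organisation of TrEMBL entries: identical sequences from two organisms stay in separate entries.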

UniProtKB/TrEMBL (www.uniprot.org)
One protein sequence, one species
– Protein and gene names
– Taxonomic information
– References
– Automated annotation: Function, Subcellular location, Catalytic activity, Sequence similarities…
– Automated annotation: transmembrane domains, signal peptide…
– Automated annotation: Keywords and Gene Ontology
– Cross-references to over 125 databases

UniProtKB: from TrEMBL to Swiss-Prot
Once manually annotated and integrated into Swiss-Prot, the entry is deleted from TrEMBL -> minimal redundancy

ENA -> TrEMBL -> Swiss-Prot
– ENA -> TrEMBL: automated extraction of protein sequence (translated CDS), gene name and references + automated annotation
– TrEMBL -> Swiss-Prot: manual annotation of the sequence and associated biological information

UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation:
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (sequence analysis, extract literature information, ortholog data propagation, …)

UniProtKB/Swiss-Prot (www.uniprot.org)
One protein sequence, one gene, one species
MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAAR
AFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVK
NMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFL
NKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE
GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRG
TVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGR
AGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKD
EGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV
VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG
– Protein and gene names
– Taxonomic information
– References
– Manual annotation: Function, Subcellular location, Catalytic activity, Disease, Tissue specificity, Pathway…
– Manual annotation: Post-translational modifications, variants, transmembrane domains, signal peptide…
– Manual annotation: Keywords and Gene Ontology
– Cross-references to over 125 databases
– Alternative products: protein sequences produced by alternative splicing, alternative promoter usage, alternative initiation…

In a UniProtKB/Swiss-Prot entry, you can expect to find:
– A (often corrected) protein sequence and the description of various isoforms/variants;
– All the names of a given protein (and of its gene);
– A summary of what is known about the protein: function, PTM, tissue expression, disease, 3D data etc.;
– A description of important sequence features: domains, PTMs, variations, etc.;
– A selection of references;
– Selected keywords and ontologies;
– Numerous cross-references (central hub).

UniProtKB
1- Sequence curation

UniProtKB: from TrEMBL to Swiss-Prot
Manual annotation:
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)

The displayed protein sequence
… canonical, representative, consensus…

UniProtKB/Swiss-Prot protein sequence annotation
‘Merging policy’: a gene-centric view of protein space
– 1 entry, 1 gene (1 species), 1 displayed sequence (annotation of alternative sequences, when available)
– The displayed sequence is the most prevalent protein sequence and/or the protein sequence which is also found in orthologous species.
– The displayed sequence is generally derived from the translation of the genomic sequence (when available).
– Sequence differences are documented.

What is the current status?
At least 20% of Swiss-Prot entries required a minimal amount of curation effort so as to obtain the “correct” sequence.
Typical problems:
– unsolved conflicts;
– uncorrected initiation sites;
– frameshifts;
– other ‘problems’

… once a gene on chromosome 11…

Quality of protein information from genome projects
Let’s look at proteins originating from genome projects:
– Drosophila: the paradigm of what a curated genome should look like (thanks to FlyBase): only 1.8% of the gene models conflict with Swiss-Prot sequences;
– Arabidopsis: a typical example of a genome where a lot of annotation was done when it was sequenced, but with no update since then (at least in the public view): 20% of the gene models are erroneous;
– Tetraodon nigroviridis: the typical example of a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins;
– Bacteria and Archaea have almost no splicing, so predictions are “easier”; however, errors are still made… Start codons, missed small proteins (<100 aa)…

UniProtKB/Swiss-Prot
Protein sequence annotation

Example of problem (derived from gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences…
ID   URAD_HUMAN     Unreviewed;     171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.

Producing a clean set of sequences is not a trivial task; It is not getting easier as more and more types of sequence data are submitted; It is important to pursue our efforts to make sure we provide our users with the most correct set of sequences for a given organism.

‘Protein existence’ tag
The ‘Protein existence’ tag indicates the evidence for the existence of a given protein. Different qualifiers:
1. Evidence at protein level (~18%) (MS, western blot (tissue specificity), immuno (subcellular location), …)
2. Evidence at transcript level (~19%)
3. Inferred from homology (~58 %)
4. Predicted (~5%)
5. Uncertain (mainly in TrEMBL)

In order to avoid ‘pseudogenes’ and most of the improbable protein sequences, you can filter your query to exclude sequences with ‘protein existence’ tag = ‘Uncertain’.
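The same filter can be applied to downloaded entries, because the flat-file format stores the tag on a `PE` line (for example `PE   4: Predicted;`). The following is a minimal sketch in Python that reads only the `AC` and `PE` lines and skips entries at level 5. The function names, the sample text and the accession X00001 are hypothetical; a real entry contains many more line types than shown here.

```python
SAMPLE = """AC   A6NGE7;
PE   4: Predicted;
//
AC   X00001;
PE   5: Uncertain;
//
"""


def parse_entries(flat_text):
    """Yield (primary accession, protein existence level) for each entry."""
    accession, level = None, None
    for line in flat_text.splitlines():
        code, _, value = line.partition("   ")
        if code == "AC" and accession is None:
            accession = value.split(";")[0].strip()  # first accession is the primary one
        elif code == "PE":
            level = int(value.split(":")[0])
        elif line.startswith("//"):  # end of entry
            if accession is not None:
                yield accession, level
            accession, level = None, None


def without_uncertain(flat_text):
    """Keep every entry whose protein existence level is not 5 (Uncertain)."""
    return [acc for acc, level in parse_entries(flat_text) if level != 5]


print(without_uncertain(SAMPLE))  # ['A6NGE7']
```

Entries without a `PE` line are kept by this sketch (their level is `None`), which is a design choice: dropping them silently would hide records rather than filter them.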

The ‘alternative’ sequence(s)

How many proteins in the end? Example with human

Proteome complexity: example with human
~20’000
Not predictable at the genome level! -> important post-genomic data!
(Jensen O.N., Curr. Opin. Chem. Biol., 2004, 8, 33-41, PMID: 15036154)

UniProtKB/Swiss-Prot
1 entry, 1 gene (1 species)
Annotation of the sequence differences (including conflicts, polymorphisms, splice variants etc.) -> annotation of protein diversity

Multiple alignment of the end of the available GCR sequences
Annotation of the sequence differences (protein diversity)… and natural variants
1 entry, 1 gene (1 species)

P04150 (www.uniprot.org)

UniProtKB (and RefSeq) under-represent alternatively spliced products. Transcript variants are only made when there is information available on the full-length nature of the product; if multiple alternate exons are found through the length of the gene, no assumption is made about the combination of the alternate exons that exists in vivo. http://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_does_a_reviewed_status_me

Important remark: available in separate files! > 30’000 additional sequences (total)

The ‘alternative’ sequence(s) are not ‘directly available’ to many tools, including protein identification tools and Blast, depending on the server!

Blast P04150 against Swiss-Prot / Homo sapiens @ UniProt: isoform sequences

Blast P04150 against Swiss-Prot / Homo sapiens @ NCBI: the isoform sequences are not present in the NCBI protein database! The .x number (P06401.4) corresponds to the version number of the sequence, not to an alternatively spliced sequence!
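
The difference between a sequence version and an isoform can be made explicit when identifiers are handled in scripts. The sketch below assumes two spellings: `ACCESSION.N` for a sequence version, as on the slide, and `ACCESSION-N` for a UniProtKB isoform identifier. The latter convention is not shown on the slide, and the function name `describe` is hypothetical.

```python
import re

# 'P06401.4' style: accession followed by a sequence version number.
VERSIONED = re.compile(r"^(?P<acc>[A-Z0-9]+)\.(?P<number>\d+)$")
# 'P04150-2' style: accession followed by an isoform number (assumed convention).
ISOFORM = re.compile(r"^(?P<acc>[A-Z0-9]+)-(?P<number>\d+)$")


def describe(identifier):
    """Return (kind, accession, number) for a protein identifier string."""
    match = VERSIONED.match(identifier)
    if match:
        return ("sequence version", match["acc"], int(match["number"]))
    match = ISOFORM.match(identifier)
    if match:
        return ("isoform", match["acc"], int(match["number"]))
    return ("accession", identifier, None)


for identifier in ("P06401.4", "P04150-2", "P04150"):
    print(identifier, describe(identifier))
```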

UniProtKB 2 - Biological data curation

UniProtKB: from TrEMBL to Swiss-Prot. Manual annotation:
1. Protein sequence (merge available CDS, annotate sequence discrepancies, report sequencing mistakes…)
2. Biological information (extract literature information, ortholog data propagation, protein sequence analysis…)

UniProtKB/Swiss-Prot general annotation
Summary of the current knowledge on a given protein.
Maximum usage of controlled vocabulary: Keywords, Tissues, Post-translational modifications, Strains, Species, Subcellular location, Extracellular domains, Journals…
Provides a reliable set of annotated protein entries for:
- Reference data for systems designed to automatically transfer annotation to similar, not yet (or never) characterized sequences
- Training of data mining tools, prediction programs

Extract literature information and protein sequence analysis; maximum usage of controlled vocabulary
UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/PubMed)
- prediction programs (Prosite, Anabelle)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the source of each annotation

Protein nomenclature

General annotation (Comments): …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org

Human protein manual annotation: some statistics (Aug 2010)

Sequence annotation (Features): …enable researchers to obtain a summary of what is known about a protein… www.uniprot.org

Human protein manual annotation: some statistics (PTM)

Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted data and makes a clear distinction between both.
Level | Type of evidence | Qualifier
1st | Strong experimental evidence | Ref.X
2nd | Light experimental evidence | Probable
3rd | Inferred by similarity with homologous protein (data of 1st or 2nd level) | By similarity
4th | Inferred by sequence prediction | Potential

Find all the proteins localized in the cytoplasm (experimentally proven) which are phosphorylated on a serine (experimentally proven)
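
One way to picture this query: an annotation counts as experimentally proven when it carries none of the qualifiers ‘Probable’, ‘By similarity’ or ‘Potential’. The Python sketch below applies that rule to a toy set of records. The protein names PROT_A to PROT_D and the record layout are hypothetical, and this is not how the UniProt search engine is implemented.

```python
# Qualifiers that mark an annotation as not strongly experimentally proven.
NON_EXPERIMENTAL = ("Probable", "By similarity", "Potential")


def is_experimental(annotation):
    """True when the annotation text carries none of the qualifiers."""
    return not any(f"({q})" in annotation for q in NON_EXPERIMENTAL)


# Hypothetical records: one subcellular location and a list of PTMs each.
records = {
    "PROT_A": {"location": "Cytoplasm", "ptm": ["Phosphoserine"]},
    "PROT_B": {"location": "Cytoplasm (By similarity)", "ptm": ["Phosphoserine"]},
    "PROT_C": {"location": "Cytoplasm", "ptm": ["Phosphoserine (Potential)"]},
    "PROT_D": {"location": "Nucleus", "ptm": ["Phosphoserine"]},
}


def matches(record):
    """Cytoplasmic and serine-phosphorylated, both without qualifiers."""
    location_ok = (record["location"].startswith("Cytoplasm")
                   and is_experimental(record["location"]))
    ptm_ok = any(p.startswith("Phosphoserine") and is_experimental(p)
                 for p in record["ptm"])
    return location_ok and ptm_ok


hits = sorted(name for name, record in records.items() if matches(record))
print(hits)
```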

UniProtKB: additional information can be found in the cross-references (to more than 140 databases)

Protein-centric view of the database network: DNA sequences, gene annotation, gene expression data, protein sequences, macromolecular structure data

UniProtKB/Swiss-Prot: 129 explicit links and 14 implicit links!
- 2D gel: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, UCD-2DPAGE, World-2DPAGE
- Family and domain: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPFAM, TIGRFAMs
- Organism-specific: AGD, ArachnoServer, CGD, ConoServer, CTD, CYGD, dictyBase, EchoBASE, EcoGene, euHCVdb, EuPathDB, FlyBase, GeneCards, GeneDB_Spombe, GeneFarm, GenoList, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, MaizeGDB, MGI, MIM, neXtProt, Orphanet, PharmGKB, PseudoCAP, RGD, SGD, TAIR, TubercuList, WormBase, Xenbase, ZFIN
- Protein family/group: Allergome, CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
- Genome annotation: Ensembl, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblPlants, EnsemblProtists, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
- Enzyme and pathway: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
- Other: BindingDB, DrugBank, NextBio, PMAP-CutDB
- Sequence: EMBL, IPI, PIR, RefSeq, UniGene
- 3D structure: DisProt, HSSP, PDB, PDBsum, ProteinModelPortal, SMR
- PTM: GlycoSuiteDB, PhosphoSite, PhosSite
- Proteomic: PeptideAtlas, PRIDE, ProMEX
- PPI: DIP, IntAct, MINT, STRING
- Phylogenomic dbs: eggNOG, GeneTree, HOGENOM, HOVERGEN, InParanoid, OMA, OrthoDB, PhylomeDB, ProtClustDB
- Polymorphism: dbSNP
- Gene expression: ArrayExpress, Bgee, CleanEx, Genevestigator, GermOnline
- Ontologies: GO

UniProtKB: access to UniProtKB

The UniProt web site: www.uniprot.org

The UniProt web site - www.uniprot.org
- Powerful search engine, Google-like and easy to use, but also supports very directed field searches (similar to SRS)
- Scoring mechanism presenting relevant matches first
- Entry views, search result views and downloads are customizable
- The URL of a result page reflects the query; all pages and queries are bookmarkable, supporting programmatic access
- Tools: Blast, Alignment, ID mapping, Batch retrieval (Retrieve)

Search: a very powerful text search tool with autocompletion and refinement options, for finding UniProt entries and documentation by biological information

UniProt query tool (www.uniprot.org): a mixture of Google and SRS. Example: find all human proteins with experimental evidence for their location in the nucleus

The search interface guides users with helpful suggestions and hints

Result pages: highly customizable

Custom downloads… (open with Excel etc.)
Accession | Genes | Domains | Protein Existence
P02768 | ALB (GIG20) (GIG42) (PRO0903) (PRO1708) (PRO2044) (PRO2619) (PRO2675) (UNQ696/PRO1341) | Albumin domains (3) | Evidence at protein level
P02769 | ALB | Albumin domains (3) | Evidence at protein level
P02770 | Alb | Albumin domains (3) | Evidence at protein level
P07724 | Alb (Alb-1) (Alb1) | Albumin domains (3) | Evidence at protein level
P08759 | alb-A | Albumin domains (3) | Evidence at transcript level
P14872 | alb-B | Albumin domains (3) | Evidence at transcript level
P43652 | AFM (ALB2) (ALBA) | Albumin domains (3) | Evidence at protein level
P08835 | ALB | Albumin domains (3) | Evidence at protein level
P49822 | ALB | Albumin domains (3) | Evidence at protein level
P19121 | ALB | Albumin domains (3) | Evidence at protein level
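
A custom download like this one can also be processed with a script instead of a spreadsheet. The sketch below assumes the file is tab-separated with the four column names shown on the slide; the text is built in memory from a few of the slide's rows so that the example is self-contained.

```python
import csv
import io

# A few rows from the slide, assembled as tab-separated text.
ROWS = [
    ["Accession", "Genes", "Domains", "Protein Existence"],
    ["P02769", "ALB", "Albumin domains (3)", "Evidence at protein level"],
    ["P07724", "Alb (Alb-1) (Alb1)", "Albumin domains (3)", "Evidence at protein level"],
    ["P08759", "alb-A", "Albumin domains (3)", "Evidence at transcript level"],
    ["P14872", "alb-B", "Albumin domains (3)", "Evidence at transcript level"],
]
download = "\n".join("\t".join(row) for row in ROWS)

# Read the table back, one dict per entry, keyed by column name.
reader = csv.DictReader(io.StringIO(download), delimiter="\t")
table = list(reader)

# Keep only entries with evidence at protein level.
protein_level = [row["Accession"] for row in table
                 if row["Protein Existence"] == "Evidence at protein level"]
print(protein_level)
```

With a real download, `io.StringIO(download)` would be replaced by an open file handle.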

The URL (results) can be bookmarked and manually modified.
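
Because the URL reflects the query, result pages can in principle be constructed by a script. The sketch below only builds a URL string and does not contact any server. The path `/uniprot/` and the parameter names `query` and `format` are assumptions made for illustration; they should be checked against the current UniProt documentation before use.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base address of the search page (not confirmed by the slides).
BASE = "http://www.uniprot.org/uniprot/"


def build_url(query, fmt=None):
    """Build a bookmarkable search URL from a query string."""
    params = {"query": query}      # assumed parameter name
    if fmt:
        params["format"] = fmt     # assumed parameter name
    return BASE + "?" + urlencode(params)


url = build_url("albumin", fmt="tab")
print(url)

# The query can be recovered from the URL, which is what makes it editable by hand.
print(parse_qs(urlsplit(url).query))
```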

Blast: a tool with the standard options to search sequences in UniProt databases

Blast results: customize display

Blast: use of UniProt annotation. Amino-acid highlighting options and feature annotation highlighting option in the local alignment

Align: a ClustalW multiple alignment tool with amino-acid highlighting options and a feature annotation highlighting option

ClustalW multiple alignment of insulin sequences: amino-acid highlighting options and feature annotation highlighting option in the alignment

Retrieve: a UniProt-specific tool to retrieve a list of entries in several standard formats. You can then query your ‘personal database’ with the UniProt search tool.

Your dataset: results of a ScanProsite search

ID Mapping: maps the identifiers of a given protein between different databases

These identifiers all point to TP53 (p53)! P04637, NP_000537, ENSG00000141510, CCDS11118, UPI000002ED67, IPI00025087, etc.
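
Before mapping identifiers it helps to know which database each one comes from. The sketch below guesses the source from the shape of the identifier. The regular expressions are simplified assumptions that fit the examples on the slide; they do not cover every valid identifier of each database.

```python
import re

# (database, pattern) pairs, tried in order; simplified for illustration.
PATTERNS = [
    ("UniParc", re.compile(r"^UPI[0-9A-F]{10}$")),
    ("IPI", re.compile(r"^IPI\d{8}$")),
    ("Ensembl gene", re.compile(r"^ENSG\d{11}$")),
    ("CCDS", re.compile(r"^CCDS\d+$")),
    ("RefSeq protein", re.compile(r"^NP_\d+$")),
    ("UniProtKB accession", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9][A-Z][A-Z0-9]{2}[0-9])$")),
]


def guess_database(identifier):
    """Return the name of the first database whose pattern matches."""
    for name, pattern in PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"


for identifier in ("P04637", "NP_000537", "ENSG00000141510",
                   "CCDS11118", "UPI000002ED67", "IPI00025087"):
    print(identifier, "->", guess_database(identifier))
```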

Download

Downloading UniProt: http://www.uniprot.org/downloads

Complete proteome: ‘gene’ centred or all known proteins?

http://www.uniprot.org/faq/38

Remark: some peptides are not associated with the keyword ‘Complete proteome’ because they do not match the human genome

UniProt proteome sets, if downloaded in UniProt flat file or XML format, contain one sequence per UniProt record! ‘Gene’ centred: all protein sequences in UniProtKB/Swiss-Prot… Missing: other alternatively spliced protein sequences in UniProtKB/TrEMBL

UniProtKB Statistics

520’000 + 13’000’000 ≈ 13’000’000: Swiss-Prot & TrEMBL introduce a new arithmetical concept! Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot. Swiss-Prot: 12’000 species; TrEMBL: 130’000 species

12’000 species, mainly model organisms

Not yet available

~200 new entries / day; new release every 4 weeks
- Annotation is useful, good annotation is better, update is essential!
- Some entries have gone through more than 120 versions since their integration in UniProtKB/Swiss-Prot

UniProtKB entry history: always cite the primary accession number (AC)!
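
In the flat-file format, an AC line can list several accession numbers separated by semicolons, and the first one is the primary accession. A short Python sketch of picking it out follows; the secondary accession X99999 is hypothetical and only there to show the ordering.

```python
def primary_accession(ac_line):
    """Return the first accession number of an 'AC' line."""
    content = ac_line[2:].strip()  # drop the 'AC' line code
    accessions = [a.strip() for a in content.split(";") if a.strip()]
    return accessions[0]


print(primary_accession("AC   A6NGE7;"))
print(primary_accession("AC   P04150; X99999;"))  # X99999 is hypothetical
```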

UniParc

UniParc
- non-redundant protein sequence archive, containing both active and inactive sequences (including sequences which are not in UniProtKB, e.g. immunoglobulins…)
- the equivalent of ENA/GenBank/DDBJ at the protein level
- species-merged: sequences from different species are merged when 100% identical over the whole length
- no annotation (only taxonomy)
- can be searched only with database names, taxonomy, checksum (CRC64) and accession numbers (ACs) or UniProtKB, UniRef and UniParc IDs
- Beware: contains wrong predictions, pseudogenes, etc.
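
The ‘species-merged’ idea can be sketched as an archive keyed by a checksum of the sequence, so that identical sequences from different sources and species share one record. UniParc uses CRC64; the sketch below substitutes MD5 from the Python standard library as a stand-in. The sequences and the record layout are made up for illustration.

```python
import hashlib


def checksum(sequence):
    """Stand-in checksum (MD5 here; UniParc itself uses CRC64)."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def build_archive(records):
    """Merge (source, species, sequence) records that share a sequence."""
    archive = {}
    for source, species, sequence in records:
        entry = archive.setdefault(
            checksum(sequence), {"sequence": sequence, "sources": []})
        entry["sources"].append((source, species))
    return archive


# Hypothetical toy records: the first two sequences are 100% identical.
records = [
    ("UniProtKB", "Homo sapiens", "MKTAYIAK"),
    ("RefSeq", "Pan troglodytes", "MKTAYIAK"),
    ("UniProtKB", "Mus musculus", "MKTAYIAR"),
]
archive = build_archive(records)
print(len(archive), "distinct sequences")
```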

Query UniParc

UniRef

‘UniRef is useful for comprehensive BLAST similarity searches by providing sets of representative sequences’

«Collapsing BLAST results»
Three collections of sequence clusters from UniProtKB and selected UniParc entries:
- One UniRef100 entry -> all identical sequences (identical sequences and sub-fragments are grouped in a single record) -> reduction of 12%
- One UniRef90 entry -> sequences that have at least 90% identity -> reduction of 40%
- One UniRef50 entry -> sequences that are at least 50% identical -> reduction of 65%
Based on sequence identity -> independent of the species!
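
The effect of the identity threshold can be illustrated with a toy greedy clustering. This is a simplified sketch, not the method used to build UniRef: identity is measured without gaps against the best-matching window of the cluster representative, and the longest sequences are taken as representatives first. The sequences are made up.

```python
def identity(representative, sequence):
    """Best ungapped identity of `sequence` against any window of `representative`.

    Assumes len(sequence) <= len(representative).
    """
    if not sequence:
        return 0.0
    best = 0.0
    for offset in range(len(representative) - len(sequence) + 1):
        matches = sum(a == b for a, b in zip(representative[offset:], sequence))
        best = max(best, matches / len(sequence))
    return best


def cluster(sequences, threshold):
    """Greedy clustering: each sequence joins the first cluster it matches."""
    clusters = []  # list of (representative, members)
    for sequence in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, sequence) >= threshold:
                members.append(sequence)
                break
        else:
            clusters.append((sequence, [sequence]))
    return clusters


# Made-up sequences: a duplicate, a sub-fragment, a one-residue variant, an outlier.
sequences = ["MKTAYIAKQR", "MKTAYIAKQR", "TAYIAK", "MKTAYIAKQL", "GGGGGGGGGG"]
print(len(cluster(sequences, 1.0)), "clusters at 100% identity")
print(len(cluster(sequences, 0.9)), "clusters at 90% identity")
```

At 100% the duplicate and the sub-fragment collapse into one record, as described for UniRef100; lowering the threshold to 90% also absorbs the one-residue variant.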

UniRef90: independent of species and sequence length

UniMES

The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental protein data (only GOS data for the moment). Download only (but included in UniParc -> Blast).
- UniMES FASTA sequences
- UniMES matches to InterPro methods
ftp.uniprot.org/pub/databases/uniprot

UniMES: sequences in FASTA format
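
FASTA files consist of a header line starting with ‘>’ followed by one or more lines of sequence. A minimal reader can be sketched in Python as follows; the headers and sequences in the sample are hypothetical, not real UniMES records.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical sample with two records.
SAMPLE = """\
>seq1 example environmental protein
MKTAYIAKQR
QISFVKSHFS
>seq2 another example
GGSGGS
"""

for header, sequence in read_fasta(SAMPLE):
    print(header, len(sequence))
```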

Menu: Introduction; Nucleic acid sequence databases (ENA/GenBank, DDBJ); Protein sequence databases: UniProt databases (UniProtKB), NCBI protein databases

NCBI protein databases (Entrez protein, NCBI nr) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Major ‘general’ protein sequence database ‘sources’
UniProtKB: Swiss-Prot + TrEMBL
NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq + TPA
- UniProtKB/Swiss-Prot: manually annotated protein sequences (12’000 species)
- UniProtKB/TrEMBL: submitted CDS (ENA) + automated annotation; non-redundant with Swiss-Prot (300’000 species)
- GenPept: submitted CDS (GenBank); redundant with Swiss-Prot (300’000 species ?)
- PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
- PDB: Protein Data Bank: 3D data and associated sequences
- PRF: Protein Research Foundation, journal scan of ‘published’ peptide sequences
- RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction + some manual annotation (11’000 species)
- TPA: Third Party Annotation
Figure legend: integrated resources; ‘cross-references’; resources kept separate

Query at Entrez protein: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein

Typical result of a query at «Entrez protein»: RefSeq, Swiss-Prot, GenPept

A Swiss-Prot entry with the NCBI look

GI number: ‘GenInfo identifier’ number
- In addition to an AC number specific to the original database, each protein sequence in the NCBI nr database (including Swiss-Prot entries) has a GI number.

AC

GI number: ‘GenInfo identifier’ number
- If the sequence changes in any way, a new GI number will be assigned: GI identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search.
- A separate GI number is assigned to each protein translation (alternative products)
- A Sequence Revision History tool is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record: http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi

ID/AC mapping

http://www.ebi.ac.uk/Tools/picr/

GenPept: translation from annotated CDS in GenBank. Contains all translated CDS annotated in GenBank/ENA/DDBJ sequences; equivalent to UniProtKB/TrEMBL, except that it is redundant with other databases (Swiss-Prot, RefSeq, PIR…)

GenPept: ‘translations from all annotated coding regions (CDS) in GenBank’

RefSeq: produced by NCBI and NLM
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch18.pdf
FAQ: http://www.ncbi.nlm.nih.gov/books/NBK50679/
http://www.ncbi.nlm.nih.gov/RefSeq/

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.
- Protein – mRNA – genomic sequence
- Also chromosomes, organelle genomes, plasmids, intermediate assembled genomic contigs, ncRNAs
- Tightly linked to Entrez Gene («interdependent curated resources»)

Example: NP_000790

KW, AC, Taxonomy, References

GenBank source and status; annotation and ontologies

Curated records

UniProtKB vs RefSeq

UniProtKB/Swiss-Prot merges all CDS available for a given gene and describes the sequence differences. Example: UniProtKB/Swiss-Prot P04150 (GCR_HUMAN)

RefSeq chooses one or several protein reference sequences for a given gene; it does not annotate the sequence differences.
- If there is an alternative splicing event, there will be several distinct entries for a given gene
Example: GCR_HUMAN (UniProtKB/Swiss-Prot): 1 UniProtKB entry cross-linked with 7 RefSeq entries

Protein feature annotation found in RefSeq
- Conserved domains
- Signal and mature peptides
- Propagation of a subset of features from Swiss-Prot

PTM annotation, Swiss-Prot vs RefSeq: GCR_HUMAN

RefSeq statistics. The numbers are not comparable: entries ‘sequence’ (RefSeq) vs entries ‘gene’ (UniProtKB/Swiss-Prot)

Summary: UniProtKB vs NCBI protein

ENA/GenBank/DDBJ | RefSeq (www.ncbi.nlm.nih.gov/RefSeq/) | UniProt (www.uniprot.org)
Protein and nucleotide data | Genomic, RNA and protein data | Protein data only
Biological data added by the submitters (gene name, tissue…) | Biological data annotated by curators, also found in the corresponding Entrez Gene entry | Biological data annotated by curators (Swiss-Prot), within the entry
Not curated | Partially manually curated (‘reviewed’ entries) | Manually curated in Swiss-Prot, not in TrEMBL
Author submission | NCBI creates from existing data + gene prediction | UniProt creates from existing data
Only author can revise (except TPA) | NCBI revises as new data emerge | UniProt revises as new data emerge
Multiple records for same loci common | Single records for each molecule of major organisms | Single records for each protein from one gene of major organisms (in Swiss-Prot, TrEMBL is redundant)
Records can contradict each other | | Identification and annotation of discrepancy
No limit to species included | Limited to model organisms | Priority (but not limited) to model organisms
Data exchanged among INSDC members | NCBI database; collaboration with UniProt | UniProt database; collaboration with NCBI (RefSeq, CCDS)

PIR

PIR: the Protein Information Resource. PIR-PSD is no longer updated, but exists as an archive

PDB

PDB (Protein Data Bank), 3D structure
- Contains the spatial coordinates of macromolecule atoms whose 3D structure has been obtained by X-ray or NMR studies
- Also contains the corresponding protein sequences*
- Includes protein sequences which are mutated, chimeric, etc. (created specifically to study the effect of a mutation on the 3D structure)
*The PIR-NRL3D database makes the sequence information in PDB available for similarity searches and other tools

PDB: Protein Data Bank www.rcsb.org/pdb/
- Managed by the Research Collaboratory for Structural Bioinformatics (RCSB) (USA).
- Associated specialized programs allow the visualization of the corresponding 3D structure (e.g., SwissPDB-viewer, Chime, Rasmol).
- Currently there are ~68’000 structures for about 15’000 different proteins, but far fewer protein families (highly redundant)!

PDB: example

Coordinates of each atom; sequence

Visualisation with Jmol

PRF: Protein Research Foundation

http://www.genome.jp/dbget-bin/www_bfind?prf Looks for peptide sequences described in publications (and which have not been submitted to databases!)

Other protein databases

Ensembl http://www.ensembl.org/
Review: http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D610
Annotation pipeline: http://www.genome.org/cgi/content/full/14/5/942

- Ensembl aligns the genomic sequences with all the sequences found in ENA, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (-> known genes)
- It also does gene prediction (-> novel genes)
Ensembl = UniProtKB + RefSeq + gene prediction
- DNA, RNA and protein sequences are available for several species.
- Ensembl concentrates on vertebrate genomes, but other groups have adapted the system for use with plant, fungal and metazoan genomes.

Example of a problem (derived from the gene prediction pipeline)
Ensembl completes the human ‘proteome’ by predicting/annotating missing genes according to orthologous sequences.
ID   URAD_HUMAN   Unreviewed;   171 AA.
AC   A6NGE7;
DT   24-JUL-2007, integrated into UniProtKB/TrEMBL.
DT   24-JUL-2007, sequence version 1.
DT   02-OCT-2007, entry version 3.
DE   2-oxo-4-hydroxy-4-carboxy-5-ureidoimidazoline decarboxylase homolog
DE   (OHCU decarboxylase homolog) (Parahox neighbour).
GN   Name=PRHOXNB;
…
DR   EMBL; AL591024; -; NOT_ANNOTATED_CDS; Genomic_DNA.
DR   Ensembl; ENSG00000183463; Homo sapiens.
DR   HGNC; HGNC:17785; PRHOXNB.
PE   4: Predicted;
In primates the genes coding for the enzymes for the degradation of uric acid were inactivated and converted to pseudogenes.

IPI http://www.ebi.ac.uk/IPI/IPIhelp.html IPI: closure!

Automatic approach that builds clusters by combining knowledge already present in the primary data sources (UniProtKB, RefSeq, Ensembl) with sequence similarity. IPI = UniProtKB + RefSeq + Ensembl (+ H-InvDB, TAIR + VEGA). Complete proteome sets include all alternative splicing sequences! Available for human, mouse, rat, zebrafish, Arabidopsis, chicken and cow

CCDS

http://www.ncbi.nlm.nih.gov/CCDS/

CCDS (human, mouse): combining different approaches (ab initio, by similarity) and taking advantage of the expertise acquired by different institutes, including manual annotation… Consensus between 4 institutions…

Gene Ontology (GO)

Standards: why are they so important? ‘The ever-increasing number of sequencing projects necessitates a standardized system (…) to ensure that the flood of information produced can be effectively utilized.’ (PMID 19577473) Standardization of biological data/information (data sharing and computational analysis). Aim: extract and compare annotation between different resources or species (semantic similarity).

Secreted or not secreted? PubMed 19299134

Gene Ontology (GO)
The Gene Ontology is a controlled vocabulary, a set of standard terms (words and phrases) used for indexing and retrieving information. In addition to defining terms, GO also defines the relationships between the terms, making it a structured vocabulary. Contains ~30’000 terms.

Gene Ontology (GO) terms
- biological process: broad biological phenomena, e.g. mitosis, growth, digestion
- molecular function: molecular role, e.g. catalytic activity, binding
- cellular component: subcellular location, e.g. nucleus, ribosome, origin recognition complex
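
Because GO defines relationships between terms, an annotation to a specific term also implies its more general parent terms. The sketch below walks up a toy hierarchy to collect them. The hierarchy is deliberately simplified and hypothetical; it is not the actual GO graph, and real GO terms can have several parents and several relationship types.

```python
# Toy child -> parents mapping (simplified, not the real GO graph).
PARENTS = {
    "nucleus": ["organelle"],
    "ribosome": ["organelle"],
    "organelle": ["cellular component"],
    "cellular component": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


print(sorted(ancestors("nucleus", PARENTS)))
```

A search for proteins annotated to ‘organelle’ can then also return proteins annotated only to ‘nucleus’ or ‘ribosome’.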

GO terms associated with human Erythropoietin

http://www.geneontology.org

Caveats: annotation is the process of assigning/mapping GO terms to gene products… Electronic vs manual annotation…

Example with EPO

Histone H4: large-scale derived data (‘proteome’)!

GO terms: essential link between biological knowledge and high-throughput genomic and proteomic datasets… PMID: 15514041 ‘summary of the gene ontology classifications for all mapped ESTs…’

~40% of human proteins have no known function (experimental data)… but many more are associated with GO terms… (computer-assigned).

All documents (including practicals) are online: http://education.expasy.org/cours/Murcia2011/
