Protein Sequence Databases Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University Medical Center.


1 Protein Sequence Databases Nathan Edwards Department of Biochemistry and Mol. & Cell. Biology Georgetown University Medical Center

2 Protein Sequence Databases
  Link between mass spectra and proteins
  A protein’s amino-acid sequence provides a basis for interpreting:
    Enzymatic digestion
    Separation protocols
    Fragmentation
    Peptide ion masses
  We must interpret database information as carefully as mass spectra.
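To illustrate how a sequence leads to enzymatic digestion products and peptide ion masses, here is a minimal sketch in Python. It assumes the common trypsin rule (cleave after K or R unless followed by P), no missed cleavages, no modifications, and standard monoisotopic residue masses. The function names and the toy sequence are illustrative and do not come from the presentation.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # added once per positive charge


def tryptic_peptides(sequence):
    """Split after K or R, except when the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mz(peptide, charge=1):
    """m/z of an unmodified peptide ion with the given positive charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    for pep in tryptic_peptides("AKRPGK"):
        print(pep, round(peptide_mz(pep), 4))
```

Every number this produces depends entirely on the database sequence, which is why an error or omission in the database propagates directly into the interpretation of a spectrum.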

3 More than sequence…
  Protein sequence databases provide much more than sequence:
    Names
    Descriptions
    Facts
    Predictions
    Links to other information sources
  Protein databases provide a link to the current state of our understanding about a protein.

4 Much more than sequence
  Names
    Accession, Name, Description
  Biological Source
    Organism, Source, Taxonomy
  Literature
  Function
    Biological process, molecular function, cellular component
    Known and predicted
  Features
    Polymorphism, Isoforms, PTMs, Domains
  Derived Data
    Molecular weight, pI

5 Database types
  Curated
    Swiss-Prot
    UniProt
    RefSeq NP
  Translated
    TrEMBL
    RefSeq XP, ZP
  Omnibus
    NCBI’s nr
    MSDB
    IPI
  Other
    PDB
    HPRD
    EST
    Genomic

6 SwissProt
  From ExPASy
    Expert Protein Analysis System
    Swiss Institute of Bioinformatics
  ~ 515,000 protein sequence “entries”
  ~ 12,000 species represented
  ~ 20,000 Human proteins
  Highly curated
  Minimal redundancy
  Part of UniProt Consortium

7 TrEMBL
  Translated EMBL nucleotide sequences
    European Molecular Biology Laboratory
    European Bioinformatics Institute (EBI)
  Computer annotated
  Only sequences absent from SwissProt
  ~ 10.5 M protein sequence “entries”
  ~ 230,000 species
  ~ 75,000 Human proteins
  Part of UniProt Consortium

8 UniProt
  Universal Protein Resource
  Combination of sequences from
    Swiss-Prot
    TrEMBL
  Mixture of highly curated (Swiss-Prot) and computer-annotated (TrEMBL) entries
  “Similar sequence” clusters are available
    50%, 90%, 100% sequence similarity

9 RefSeq
  Reference Sequence
  From NCBI (National Center for Biotechnology Information), NLM, NIH
  Integrated genomic, transcript, and protein sequences
  Varying levels of curation
    Reviewed, Validated, …, Predicted, …
  ~ 9.7 M protein sequence “entries”
    ~ 209,000 reviewed, ~ 90,000 validated
  ~ 39,000 Human proteins

10 RefSeq
  Particular focus on major research organisms
  Tightly integrated with genome projects
  Curated entries: NP accessions
  Predicted entries: XP accessions
  Others: YP, ZP, AP

11 IPI
  International Protein Index
  From EBI
  For a specific species, combines
    UniProt, RefSeq, Ensembl
    Species specific databases: HInv-DB, VEGA, TAIR
  ~ 87,000 (from ~ 307,000) human protein sequence entries
  Human, mouse, rat, zebra fish, arabidopsis, chicken, cow

12 MSDB
  From the Imperial College (London)
  Combines PIR, TrEMBL, GenBank, SwissProt
  Distributed with Mascot
    …so well integrated with Mascot
  ~ 3.2M protein sequence entries
  “Similar sequences” suppressed
    100% sequence similarity
  Not updated since September 2006 (obsolete)

13 NCBI’s nr
  “non-redundant”
  Contains
    GenBank CDS translations
    RefSeq Proteins
    Protein Data Bank (PDB)
    SwissProt, TrEMBL, PIR
    Others
  “Similar sequences” suppressed
    100% sequence similarity
  ~ 10.5 M protein sequence “entries”

14 Others
  HPRD
    Manually curated integration of literature
  PDB
    Focus on protein structure
  dbEST
    Part of GenBank - EST sequences
  Genome Sequences

15 Human Sequences
  Number of Human genes is believed to be between 20,000 and 25,000

  Database     Human sequences
  SwissProt    ~ 20,000
  RefSeq       ~ 39,000
  TrEMBL       ~ 75,000
  IPI-HUMAN    ~ 87,000
  MSDB         ~ 130,000
  nr           ~ 230,000

16 DNA to Protein Sequence
  Derived from http://online.itp.ucsb.edu/online/infobio01/burge

17 Genome Browsers
  Link genomic, transcript, and protein sequence in a graphical manner
    Genes, ESTs, SNPs, cross-species, etc.
  UC Santa Cruz: http://genome.ucsc.edu
  Ensembl: http://www.ensembl.org
  NCBI Map View: http://www.ncbi.nlm.nih.gov/mapview

18 UCSC Genome Browser
  Shows many sources of protein sequence evidence in a unified display

19 PeptideMapper Web Service
  I’m Feeling Lucky

20 PeptideMapper Web Service
  I’m Feeling Lucky

21 Unannotated Splice Isoform

22 Accessions
  Permanent labels
  Short, machine readable
  Enable precise communication
  Typos render them unusable!
  Each database uses a different format
    Swiss-Prot: P17947
    Ensembl: ENSG00000066336
    PIR: S60367; S60367
    GO: GO:0003700;
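Because each database uses its own accession format, a quick format check can catch many typos before a lookup fails. The sketch below is an illustration only: the Swiss-Prot pattern follows the accession format published by UniProt, while the Ensembl pattern covers human gene identifiers only and the GO pattern assumes seven digits. PIR is deliberately left out, and the function name is an assumption.

```python
import re

# Format checks only: a well-formed accession may still not exist.
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "Ensembl gene (human)": re.compile(r"ENSG[0-9]{11}"),
    "GO": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the names of all patterns that the whole token matches."""
    token = text.strip()
    return [name for name, pattern in ACCESSION_PATTERNS.items()
            if pattern.fullmatch(token)]


if __name__ == "__main__":
    for token in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
        print(token, classify_accession(token))
```

A truncated accession such as `P1794` matches nothing, which is the point of the slide: unlike a misspelled name, a mistyped accession cannot be guessed at.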

23 Names / IDs
  Compact mnemonic labels
  Not guaranteed permanent
  Require careful curation
  Conceptual objects
    ALBU_HUMAN    Serum Albumin
    RT30_HUMAN    Mitochondrial 28S ribosomal protein S30
    CP3A7_HUMAN   Cytochrome P450 3A7

24 Description / Name
  Free text description
  Human readable
  Space limited
  Hard for computers to interpret!
  No standard nomenclature or format
  Often abused…
    COX7R_HUMAN   Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]

25 FASTA Format

26 FASTA Format
  >
  Accession number
    No uniform format
    Multiple accessions separated by |
  One line of description
    Usually pretty cryptic
  Organism of sequence?
    No uniform format
    Official Latin name not necessarily used
  Amino-acid sequence in single-letter code
    Usually spread over multiple lines
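The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes the identifier is the first whitespace-delimited token, that multiple accessions are separated by `|`, and that an organism, when present, appears in square brackets at the end of the description (the NCBI convention seen later in this deck). The function names are illustrative, and the sequence in the example is a made-up placeholder.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs; sequence lines are joined."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Return (identifier fields, description, organism or None)."""
    ident, _, description = header.partition(" ")
    fields = [f for f in ident.split("|") if f]
    match = re.search(r"\[([^\[\]]+)\]\s*$", description)
    organism = match.group(1) if match else None
    return fields, description.strip(), organism


if __name__ == "__main__":
    # Header taken from the deck; the sequence is a placeholder.
    example = (
        ">gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]\n"
        "MKTAYIAK\n"
        "QRQISFVK\n"
    )
    for header, sequence in parse_fasta(example):
        print(split_header(header), sequence)
```

Since there is no uniform header format, `split_header` returns `None` for the organism whenever the bracket convention is not followed, rather than guessing.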

27 Organism / Species / Taxonomy
  The protein’s organism…
  …or the source of the biological sample
  The most reliable sequence annotation available
    Useful only to the extent that it is correct
  NCBI’s taxonomy is widely used
    Provides a standard of sorts; hierarchical
    Other databases don’t necessarily keep up
  Organism-specific sequence databases are starting to become available.

28 Organism / Species / Taxonomy
  Buffalo rat
  Gunn rats
  Norway rat
  Rattus PC12 clone IS
  Rattus norvegicus
  Rattus norvegicus8
  Rattus norwegicus
  Rattus rattiscus
  Rattus sp.
  Rattus sp. strain Wistar
  Sprague-Dawley rat
  Wistar rats
  brown rat
  laboratory rat
  rat
  rats
  zitter rats
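One way to cope with free-text organism labels like those listed above is to map each label to a single taxonomy identifier and to flag anything unrecognized for manual review. The sketch below assumes a small hand-made synonym table (not an export of NCBI's name list) pointing at NCBI taxonomy ID 10116 for Rattus norvegicus; the table contents and function name are illustrative.

```python
# Hand-made synonym table (an assumption for illustration only).
# Keys are lower-cased labels with single spaces.
RAT_TAXON_ID = 10116  # NCBI taxonomy ID for Rattus norvegicus

SYNONYMS = {
    "rattus norvegicus": RAT_TAXON_ID,
    "norway rat": RAT_TAXON_ID,
    "brown rat": RAT_TAXON_ID,
    "laboratory rat": RAT_TAXON_ID,
    "rat": RAT_TAXON_ID,
    "rats": RAT_TAXON_ID,
}


def taxon_id(label):
    """Return the taxonomy ID for a label, or None if it is unknown."""
    key = " ".join(label.lower().split())
    return SYNONYMS.get(key)


if __name__ == "__main__":
    for label in ["Norway rat", "brown rat", "Rattus norwegicus", "Rattus sp."]:
        print(repr(label), "->", taxon_id(label))
```

Misspellings such as "Rattus norwegicus" and ambiguous labels such as "Rattus sp." return `None` here on purpose: resolving them silently would hide exactly the kind of annotation problem the slide warns about.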

29 Controlled Vocabulary
  Middle ground between computers and people
  Provides precision for concepts
    Searching, sorting, browsing
    Concept relationships
  Vocabulary / Ontology must be established
    Human curation
  Link between concept and object:
    Manually curated
    Automatic / Predicted

30 Controlled Vocabulary

31 Controlled Vocabulary

32 Controlled Vocabulary

33 Controlled Vocabulary

34 Controlled Vocabulary

35 Controlled Vocabulary

36 Controlled Vocabulary

37 Controlled Vocabulary

38 Controlled Vocabulary

39 Controlled Vocabulary

40 Controlled Vocabulary

41 Controlled Vocabulary

42 Controlled Vocabulary

43 Controlled Vocabulary

44 Ontology Structure
  NCBI Taxonomy
    Tree
  Gene Ontology (GO)
    Molecular function
    Biological process
    Cellular component
    Directed, Acyclic Graph (DAG)
  Unstructured labels
    Overlapping?
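The practical difference between a tree and a directed acyclic graph is that a DAG term may have more than one parent, so collecting everything a term implies means following every path upward. The sketch below uses a toy ontology whose term names and edges are invented for illustration and are not taken from the Gene Ontology.

```python
# Toy ontology: each term maps to its list of parents.
# In a tree every list would have at most one entry; a DAG allows more.
PARENTS = {
    "nucleus": ["intracellular organelle"],
    "intracellular organelle": ["organelle", "intracellular part"],
    "organelle": ["cellular component"],
    "intracellular part": ["cellular component"],
    "cellular component": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("nucleus")))
```

The `seen` set matters only because of the DAG structure: "cellular component" is reachable along two paths and should be reported once.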

45 Ontology Structure

46 Protein Families
  Similar sequence implies similar function
  Similar structure implies similar function
  Common domains imply similar function
  Bootstrap up from small sets of proteins with well understood characteristics
  Usually a hybrid manual / automatic approach

47 Protein Families

48 Protein Families

49 Protein Families
  PROSITE, PFam, InterPro, PRINTS
  Swiss-Prot keywords
  Differences:
    Motif style, ontology structure, degree of manual curation
  Similarities:
    Primarily sequence based, cross species

50 Gene Ontology
  Hierarchical
    Molecular function
    Biological process
    Cellular component
  Describes the vocabulary only!
    Protein families provide GO association
  Not necessarily any appropriate GO category.
  Not necessarily in all three hierarchies.
  Sometimes general categories are used because none of the specific categories are correct.

51 Protein Family / Gene Ontology

52 Sequence Variants
  Protein sequence can vary due to
    Polymorphism
    Alternative splicing
    Post-translational modification
  Sequence databases typically do not capture all versions of a protein’s sequence

53 Sequence Variants
  “Swiss-Prot; a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases”
  - Swiss-Prot web site front page

54 Sequence Variants
  “b) Minimal redundancy
  Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.”
  - Swiss-Prot User Manual, Section 1.1

55 Sequence Variants
  “IPI provides a top level guide to the main databases that describe the proteomes of higher eukaryotic organisms. IPI:
    1. effectively maintains a database of cross references between the primary data sources
    2. provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)
    3. maintains stable identifiers (with incremental versioning) to allow the tracking of sequences in IPI between IPI releases.”
  - IPI web site front page

56 Swiss-Prot Variant Annotations

57 Swiss-Prot Variant Annotations

58 Swiss-Prot Variant Annotations

59 Peptides to Proteins
  Nesvizhskii et al., Anal. Chem. 2003

60 Peptides to Proteins

61 Peptides to Proteins
  A peptide sequence may occur in many different protein sequences
    Variants, paralogues, protein families
  Separation, digestion and ionization are not well understood
  Proteins in a sequence database are extremely non-random, and very dependent on one another
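The ambiguity described above becomes concrete once identified peptides are mapped back onto a database. The following is a minimal sketch using plain substring matching. The protein accessions and sequences are toy placeholders, the function names are illustrative, and this is not the method of the Nesvizhskii et al. paper cited earlier.

```python
from collections import defaultdict


def peptide_index(proteins, peptides):
    """Map each peptide to the set of protein accessions containing it."""
    index = defaultdict(set)
    for peptide in peptides:
        for accession, sequence in proteins.items():
            if peptide in sequence:
                index[peptide].add(accession)
    return dict(index)


def unique_peptides(index):
    """Peptides that point to exactly one protein in the database."""
    return {pep for pep, accessions in index.items() if len(accessions) == 1}


if __name__ == "__main__":
    # Toy database: PROT_A and PROT_B differ only in their last residue.
    proteins = {
        "PROT_A": "MKTAYIAKQR",
        "PROT_B": "MKTAYIAKQW",
        "PROT_C": "GGSLVEK",
    }
    index = peptide_index(proteins, ["TAYIAK", "SLVEK"])
    print(index)
    print(unique_peptides(index))
```

Whether a peptide counts as unique depends on which database is searched: the same peptide can be unique in a curated database and shared among many entries in an omnibus one.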

62 Omnibus Database Redundancy Elimination
  Source databases often contain the same sequences with different descriptions
  Omnibus databases keep one copy of the sequence, and
    An arbitrary description, or
    All descriptions, or
    Particular description, based on source preference
  Good definitions can be lost, including taxonomy
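The "particular description, based on source preference" strategy can be sketched as follows. This is an illustration under stated assumptions, not the procedure used by any particular omnibus database: the preference order, the record layout, and the function name are invented, and the shared sequence is a placeholder. The accessions and descriptions are taken from the COMMD4 example in this deck.

```python
# Assumed preference order: Swiss-Prot, RefSeq, GenBank, EMBL.
SOURCE_PREFERENCE = ["sp", "ref", "gb", "emb"]


def merge_identical(entries, preference=SOURCE_PREFERENCE):
    """Keep one record per distinct sequence, described by the preferred source."""
    rank = {source: i for i, source in enumerate(preference)}
    groups = {}
    for entry in entries:
        groups.setdefault(entry["sequence"], []).append(entry)
    merged = []
    for sequence, group in groups.items():
        # Unknown sources sort last; ties keep their input order.
        group.sort(key=lambda e: rank.get(e["source"], len(rank)))
        merged.append({
            "sequence": sequence,
            "description": group[0]["description"],
            "accessions": [e["accession"] for e in group],
        })
    return merged


if __name__ == "__main__":
    placeholder = "MAAAKTESTSEQ"  # placeholder, not the real sequence
    entries = [
        {"source": "emb", "accession": "CAB66806.1",
         "description": "hypothetical protein [Homo sapiens]", "sequence": placeholder},
        {"source": "ref", "accession": "NP_060298.2",
         "description": "COMM domain containing 4 [Homo sapiens]", "sequence": placeholder},
        {"source": "sp", "accession": "Q9H0A8",
         "description": "COMM domain containing protein 4", "sequence": placeholder},
    ]
    print(merge_identical(entries))
```

Had the record kept an arbitrary description instead, this protein could have ended up labeled "hypothetical protein". Note also that the preferred description here carries no organism in brackets, which is one way taxonomy can be lost.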

63 Description Elimination
  gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
  gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
  gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
  gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
  gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
  gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]

64 Description Elimination
  gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]
  gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]
  gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
  gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
  gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
  gi|1585500|prf||2201313A UDP galactose 4'-epimerase

65 Description Elimination
  gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]
  gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I [validated] - human
  gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog [human, liver, Peptide, 323 aa]
  gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
  gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
  gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]

66 Summary
  Protein sequence databases should be interpreted with as much care as mass spectra
  Protein sequences come from genes
  Use controlled vocabularies
  Understand the structure of ontologies
  Take advantage of computational predictions
  Look for sequence variants
  Mapping peptides to proteins is not as simple as it seems
  Be careful with omnibus databases

