Protein Sequence Databases for Proteomics: The good, the bad & the ugly
US HUPO: Bioinformatics for Proteomics
Nathan Edwards, March 12, 2006



2 Protein Sequence Databases
- Link between mass spectra and proteins
- A protein's amino-acid sequence provides a basis for interpreting
  - Enzymatic digestion
  - Separation protocols
  - Fragmentation
- We must interpret database information as carefully as mass spectra.
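As one illustration of how a sequence supports the interpretation of enzymatic digestion, the sketch below performs an in-silico tryptic digest. It assumes the commonly used simplified trypsin rule (cleave after K or R unless the next residue is P) and ignores missed cleavages; the function name and the toy sequence are hypothetical and not part of the original slides.

```python
def tryptic_peptides(sequence):
    """Split a protein sequence after K or R, except when followed by P.

    Simplified rule, assumed for illustration; no missed cleavages.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep the C-terminal peptide if the sequence does not end in K or R.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(tryptic_peptides("MKRPAAKG"))
```

Any error in the database sequence (a missing variant, a wrong start site) changes the peptide list, which is why the database deserves the same scrutiny as the spectra.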

3 More than sequence…
Protein sequence databases provide much more than sequence:
- Names
- Descriptions
- Facts
- Predictions
- Links to other information sources
Protein databases provide a link to the current state of our understanding about a protein.

4 Much more than sequence
- Names
  - Accession, Name, Description
- Biological Source
  - Organism, Source, Taxonomy
- Literature
- Function
  - Biological process, molecular function, cellular component
  - Known and predicted
- Features
  - Polymorphism, Isoforms, PTMs, Domains

5 Database types
- Curated: Swiss-Prot, PIR, RefSeq NP
- Translated: TrEMBL, RefSeq XP, ZP
- Omnibus: NCBI's nr, MSDB, IPI
- Other: PDB, HPRD, EST, Genomic

6 Human Sequences
The number of human genes is believed to be between 20,000 and 25,000. Approximate number of human sequences per database:
- PIR: ~10,500
- Swiss-Prot: ~12,000
- RefSeq: ~28,000
- IPI-HUMAN: ~48,000
- TrEMBL: ~52,000
- MSDB: ~105,000

7 Accessions
- Permanent labels
- Short, machine readable
- Enable precise communication
- Typos render them unusable!
- Each database uses a different format
  - Swiss-Prot: P17947
  - Ensembl: ENSG
  - PIR: S60367; S60367
  - GO: GO: ;
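Because a single typo makes an accession unusable, a cheap format check before a lookup can catch some mistakes. The sketch below assumes the accession pattern published in the UniProt help pages; the function name is hypothetical, and passing the check says nothing about whether the entry actually exists.

```python
import re

# Assumed pattern, as documented in the UniProt help on accession numbers.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_swissprot_accession(text):
    """Return True if text has the shape of a Swiss-Prot/UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(text) is not None


if __name__ == "__main__":
    # P17947 is the Swiss-Prot example from the slide; S60367 is the PIR one.
    for candidate in ["P17947", "P1794", "p17947", "S60367"]:
        print(candidate, looks_like_swissprot_accession(candidate))
```

The PIR accession fails the Swiss-Prot check, which reflects the point above: each database uses its own format, so a check has to be written per database.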

8 Names / IDs
- Compact mnemonic labels
- Not guaranteed permanent
- Require careful curation
- Conceptual objects
- Swiss-Prot names changed last year!
Examples:
  ALBU_HUMAN   Serum Albumin
  RT30_HUMAN   Mitochondrial 28S ribosomal protein S30
  CP3A7_HUMAN  Cytochrome P450 3A7

9 Description / Name
- Free text description
- Human readable
- Space limited
- Hard for computers to interpret!
- No standard nomenclature or format
- Often abused…
Example:
  COX7R_HUMAN  Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]

10 FASTA Format

11 FASTA Format
- Header line starts with ">"
- Accession number
  - No uniform format
  - Multiple accessions separated by |
- One line of description
  - Usually pretty cryptic
- Organism of sequence?
  - No uniform format
  - Official Latin name not necessarily used
- Amino-acid sequence in single-letter code
  - Usually spread over multiple lines.
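The layout described above can be read with a few lines of Python. This is a minimal sketch, not a full parser: it assumes the identifier is the first whitespace-delimited token of the header and that accessions inside it are separated by |, which, as noted, is not uniform across databases. The function names and the toy record are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence is usually spread over multiple lines; join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into (list of |-separated fields, free-text description)."""
    identifier, _, description = header.partition(" ")
    return identifier.split("|"), description


if __name__ == "__main__":
    toy = [
        ">db|ACC0001|EXAMPLE_TOY toy record [Toy organism]",
        "MAVMAPRTLL",
        "LLLSGALALT",
    ]
    for header, sequence in read_fasta(toy):
        print(split_header(header), sequence)
```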

12 Organism / Species / Taxonomy
- The protein's organism…
- …or the source of the biological sample
- The most reliable sequence annotation available
- Useful only to the extent that it is correct
- NCBI's taxonomy is widely used
  - Provides a standard of sorts; hierarchical
  - Other databases don't necessarily keep up
- Organism-specific sequence databases are also available.

13 Organism / Species / Taxonomy
Organism labels found for rat sequences:
- Buffalo rat
- Gunn rats
- Norway rat
- Rattus PC12 clone IS
- Rattus norvegicus
- Rattus norvegicus8
- Rattus norwegicus
- Rattus rattiscus
- Rattus sp.
- Rattus sp. strain Wistar
- Sprague-Dawley rat
- Wistar rats
- brown rat
- laboratory rat
- rat
- rats
- zitter rats
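Labels like these have to be mapped onto one canonical name before sequences can be grouped by organism. The sketch below is one possible approach, assuming a small hand-curated synonym table plus fuzzy matching from Python's difflib to absorb typos; the table contents, the cutoff of 0.9, and the function name are illustrative assumptions, and ambiguous labels such as "Rattus sp." are deliberately left unresolved.

```python
import difflib

# Hand-curated synonyms (assumed for illustration), keyed in lowercase.
SYNONYMS = {
    "rattus norvegicus": "Rattus norvegicus",
    "norway rat": "Rattus norvegicus",
    "brown rat": "Rattus norvegicus",
    "laboratory rat": "Rattus norvegicus",
}


def normalize_organism(label, cutoff=0.9):
    """Map a free-text organism label to a canonical name, or None if unsure."""
    key = label.strip().lower()
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Tolerate small typos such as "norwegicus" or a stray trailing digit.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None


if __name__ == "__main__":
    for label in ["Norway rat", "Rattus norwegicus", "Rattus norvegicus8", "Rattus sp."]:
        print(repr(label), "->", normalize_organism(label))
```

Returning None rather than guessing keeps the annotation "useful only to the extent that it is correct"; unresolved labels can then be reviewed by hand.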

14 Controlled Vocabulary
- Middle ground between computers and people
- Provides precision for concepts
  - Searching, sorting, browsing
  - Concept relationships
- Vocabulary / Ontology must be established
  - Human curation
- Link between concept and object:
  - Manually curated
  - Automatic / Predicted

15-20 Controlled Vocabulary (six slides with no transcribed text)

21 Ontology Structure
- NCBI Taxonomy
  - Tree
- Gene Ontology (GO)
  - Molecular function
  - Biological process
  - Cellular component
  - Directed acyclic graph (DAG)
- Unstructured labels
  - InterPro, Pfam, Swiss-Prot keywords
  - Overlapping?
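The practical difference between a tree and a DAG is that a term in a DAG may have several parents, so collecting a term's ancestors means following every parent link rather than a single path to the root. A minimal sketch follows, using an invented five-term vocabulary; the term names and their relationships are hypothetical and are not real GO content.

```python
# Toy child -> parents map (hypothetical terms, not taken from GO).
PARENTS = {
    "process": [],
    "metabolism": ["process"],
    "transport": ["process"],
    "sugar metabolism": ["metabolism"],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "sugar transport": ["transport", "sugar metabolism"],
}


def ancestors(term, parents):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("sugar transport", PARENTS)))
```

A protein annotated with a specific term is implicitly annotated with all of its ancestors, which is what makes searching and browsing by general category possible.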

22 Ontology Structure

23 Protein Families
- Similar sequence implies similar function
- Similar structure implies similar function
- Common domains imply similar function
- Bootstrap up from small sets of proteins with well understood characteristics
- Usually a hybrid manual / automatic approach

24-25 Protein Families (two slides with no transcribed text)

26 Protein Families
- PROSITE, Pfam, InterPro, PRINTS
- Swiss-Prot keywords
- Differences:
  - Motif style, ontology structure, degree of manual curation
- Similarities:
  - Primarily sequence based, cross species

27 Gene Ontology
- Hierarchical
  - Molecular function
  - Biological process
  - Cellular component
- Describes the vocabulary only!
- Protein families provide GO association
  - Not necessarily any appropriate GO category.
  - Not necessarily in all three hierarchies.
  - Sometimes general categories are used because none of the specific categories are correct.

28 Protein Family / Gene Ontology

29 Sequence Variants
- Protein sequence can vary due to
  - Polymorphism
  - Alternative splicing
  - Post-translational modification
- Sequence databases typically do not capture all versions of a protein's sequence

30 Sequence Variants
"Swiss-Prot; a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases"
- Swiss-Prot web site front page

31 Sequence Variants
"b) Minimal redundancy. Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry."
- Swiss-Prot User Manual, Section 1.1

32 Sequence Variants
"IPI provides a top level guide to the main databases that describe the proteomes of higher eukaryotic organisms. IPI:
1. effectively maintains a database of cross references between the primary data sources
2. provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)
3. maintains stable identifiers (with incremental versioning) to allow the tracking of sequences in IPI between IPI releases."
- IPI web site front page

33 Sequence Variants
- Swiss-Prot variants, isoforms and conflicts are retained as features
- Script varsplic.pl can enumerate all sequence variants
- Command-line options for full enumeration:
  -which full -varsplic -variant -conflict
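To make the idea of full enumeration concrete, the sketch below expands a sequence with single-residue substitution features into every combination of reference and variant residues. It is not the varsplic.pl algorithm and handles neither isoforms nor indels; the function name, the 0-based feature tuples, and the toy sequence are assumptions made for illustration.

```python
from itertools import product


def enumerate_variants(sequence, substitutions):
    """Return all sequences obtained by applying any subset of substitutions.

    substitutions: list of (position, reference_residue, variant_residue),
    with 0-based positions. The unmodified sequence is returned first.
    """
    for position, reference, _ in substitutions:
        if sequence[position] != reference:
            raise ValueError(f"expected {reference} at position {position}")
    results = []
    for choice in product([False, True], repeat=len(substitutions)):
        residues = list(sequence)
        for use_variant, (position, _, variant) in zip(choice, substitutions):
            if use_variant:
                residues[position] = variant
        results.append("".join(residues))
    return results


if __name__ == "__main__":
    # Toy fragment with an E/K and a T/S substitution (hypothetical features).
    for variant in enumerate_variants("RGEPRFSYTQA", [(2, "E", "K"), (8, "T", "S")]):
        print(variant)
```

The number of sequences doubles with each independent substitution, so full enumeration over a heavily annotated entry can inflate a search database considerably.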

34-35 Swiss-Prot Variant Annotations (two slides with no transcribed text)

36 Swiss-Prot Variant Annotations
- Feature viewer
- Variants

37 US HUPO: Bioinformatics for Proteomics Swiss-Prot VarSplic Output P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF ******************************************:*****************

38 US HUPO: Bioinformatics for Proteomics Swiss-Prot VarSplic Output P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ ************************************* *******:*********

39 Omnibus Database Redundancy Elimination
- Source databases often contain the same sequences with different descriptions
- Omnibus databases keep one copy of the sequence, and
  - An arbitrary description, or
  - All descriptions, or
  - Particular description, based on source preference
- Good definitions can be lost, including taxonomy
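The "particular description, based on source preference" strategy can be sketched as follows. This is an illustration of the idea only, not the procedure of any specific omnibus database; the record layout, the source names, the preference order, and the accessions in the example are all hypothetical.

```python
def merge_redundant(records, preference):
    """Collapse records with identical sequences into one entry per sequence.

    records: iterable of (source, accession, description, sequence).
    preference: list of source names, most preferred first.
    All accessions are kept; only the description from the most preferred
    source survives, so other descriptions are lost.
    """
    rank = {source: i for i, source in enumerate(preference)}
    merged = {}
    for source, accession, description, sequence in records:
        entry = merged.setdefault(
            sequence, {"accessions": [], "description": None, "rank": None}
        )
        entry["accessions"].append(accession)
        source_rank = rank.get(source, len(rank))  # unknown sources rank last
        if entry["rank"] is None or source_rank < entry["rank"]:
            entry["rank"] = source_rank
            entry["description"] = description
    return merged


if __name__ == "__main__":
    records = [
        ("translated", "ACC1", "hypothetical protein", "MKTAYIAK"),
        ("curated", "ACC2", "COMM domain containing protein 4", "MKTAYIAK"),
    ]
    print(merge_redundant(records, preference=["curated", "translated"]))
```

Reversing the preference list in the example keeps "hypothetical protein" instead, which is how a good definition gets lost.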

40 Omnibus Database Redundancy Elimination
- NCBI's nr: Keeps all descriptions, separated by ^A
- MSDB: Pecking order: PIR1-4, TrEMBL, GenBank, Swiss-Prot, NRL3D
- IPI: All accessions, one description
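Since nr keeps every description on one header line, recovering them means splitting on the ^A (Control-A, byte 0x01) character. A minimal sketch, with hypothetical function names and made-up identifiers in the example; the trailing-bracket organism heuristic is an assumption and will misfire on headers whose brackets hold something else.

```python
import re


def split_nr_header(header):
    """Split an nr-style FASTA header into its ^A-separated descriptions."""
    return [part.strip() for part in header.lstrip(">").split("\x01")]


def organism_of(description):
    """Return the text of a trailing [bracketed] field, or None.

    Heuristic only: not every header ends with an organism in brackets.
    """
    match = re.search(r"\[([^\[\]]+)\]\s*$", description)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Identifiers below are invented for illustration.
    header = (
        ">gi|1|ref|NP_000001| first description [Homo sapiens]"
        "\x01gi|2|sp|P00001|EX_HUMAN second description"
    )
    for description in split_nr_header(header):
        print(organism_of(description), "|", description)
```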

41 Description Elimination
gi| |emb|CAB | hypothetical protein [Homo sapiens]
gi| |gb|AAH | COMMD4 protein [Homo sapiens]
gi| |gb|AAS | COMMD4 [Homo sapiens]
gi| |ref|NP_ | COMM domain containing 4 [Homo sapiens]
gi| |sp|Q9H0A8| COM4_HUMAN COMM domain containing protein 4
gi| |emb|CAG | COMMD4 [Homo sapiens]

42 Description Elimination
gi| |gb|AAC | UDP-galactose 4' epimerase [Homo sapiens]
gi| |gb|AAB | UDP-galactose-4-epimerase [Homo sapiens]
gi| |pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi| |pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
gi| |sp|Q14376| GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
gi| |prf|| A UDP galactose 4'-epimerase

43 Description Elimination
gi| |gb|AAD | chlordecone reductase [Homo sapiens]
gi| |pir||A57407 chlordecone reductase (EC ) / 3alpha-hydroxysteroid dehydrogenase (EC ) I [validated] – human
gi| |gb|AAB | HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog [human, liver, Peptide, 323 aa]
gi| |sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
gi| |dbj|BAA | dihydrodiol dehydrogenase 4 [Homo sapiens]
gi| |dbj|BAA | dihydrodiol dehydrogenase 4 [Homo sapiens]

44 DNA to Protein Sequence Derived from

45 Translated sequences
- Gene models describe introns and exons
  - Start site? Splice sites? Alternative splicing?
- ESTs provide limited evidence of transcription only
- There is a lot we don't know about what protein sequences result from a gene
- The recent revision of the number of human genes suggests a bigger role for alternative splicing.

46 Genome Browsers
- Link genomic, transcript, and protein sequence in a graphical manner
  - Genes, ESTs, SNPs, cross-species, etc.
- UC Santa Cruz
- Ensembl
- NCBI Map View

47 UCSC Genome Browser
- Shows many sources of protein sequence evidence in a unified display
- Can use EST accession as a location!

48 Summary
- Protein sequence databases should be interpreted with as much care as mass spectra
- Use controlled vocabularies
- Understand the structure of ontologies
- Take advantage of computational predictions
- Look for sequence variants
- Be careful with omnibus databases