Bioinformatics. Strategies for proteomics: which database? Dr Richard J Edwards 27 August 2009; CALMARO workshop.

1 Bioinformatics. Strategies for proteomics: which database? Dr Richard J Edwards 27 August 2009; CALMARO workshop.

2 Bioinformatic Strategies for proteomics: which database?
- Sequence databases
  - The importance of sequence databases in proteomics
  - Database options
- Proteomics search strategies
  - Completeness vs. Redundancy
  - Local vs. online databases
  - Species-specific vs. multi-species
  - EST libraries
- Which database?
- Decoy databases
- Open Discussion

3 Sequence databases.

4 The importance of the database
[Diagram labels: Database; Protein List]

5 The importance of the database
- Idealised assumptions for matching spectra
  - Protein sample is pure
    - All peaks correspond to protein fragments
  - Protein fragmentation is complete and perfect
    - Can predict peaks from sequence
    - Every peak is present & precise
    - Peaks can be matched exactly
  - Exact protein will be present in database
    - Every peak should find in silico match
- Reality
  - Impurities
  - Incomplete digestion
  - Measurement error
  - Post-translational modifications
[Diagram labels: NOISE; Incomplete]
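The idealised step "can predict peaks from sequence" can be sketched as an in silico digest followed by a peptide mass calculation. This is a minimal sketch, not the method of any particular search engine. It assumes trypsin as the protease (cleave after K or R, but not before P), which the slides do not specify, and the function names are illustrative. The residue masses are standard monoisotopic values.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(protein, missed_cleavages=0):
    """Digest a protein in silico (assumed rule: cut after K/R unless P follows).

    missed_cleavages > 0 also returns peptides spanning skipped cut sites,
    which models the "incomplete digestion" listed under Reality.
    """
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
```

A real search must also tolerate measurement error and post-translational modifications, so predicted masses are matched within a tolerance rather than exactly.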

6 Quality vs. Quantity
- Trade-off in any analysis
[Diagram: axes Quantity (Completeness) and Quality (Accuracy); labels: Ideal; Trade-off; Assumptions; Database; Reality]

7 The importance of the database
- Idealised database for matching spectra
  - Every protein in sample is present in database
    - 100% coverage
  - No unnecessary duplication of data
    - Non-redundant
  - Sequence present in database matches sequence in sample
    - Every peak should find exact in silico match
- Reality
  - Incomplete proteome annotation
    - Proteins missing
    - Duplicate entries (redundancy)
  - Incomplete protein annotation
    - Truncated sequences
    - Sequencing errors
    - Missing sequence variants
[Diagram labels: NOISE; Incomplete]

8 The importance of the database
[Diagram labels: Database; Protein List; NOISE; Incomplete]

9 The importance of the database
[Diagram labels: Database; Protein List; NOISE; Incomplete]

10 Database options.

11 Common proteomics databases
Database | Description; source databases | Organisms | Update frequency
UniProt/Swiss-Prot [EBI] | Expertly curated; high level of annotation; minimum level of redundancy; high level of integration with other databases. | Many | Release every 4 months; updates every 2 weeks
UniProt/TrEMBL [EBI] | Computer-annotated supplement to UniProt/Swiss-Prot. Contains translated coding sequences from the GenBank nucleotide database, plus protein sequences extracted from the literature or submitted to UniProt/Swiss-Prot but not yet manually curated. | Many | Release every 4 months; updates every 2 weeks
RefSeq [NCBI] | Ongoing curation by NCBI staff; non-redundant; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. | Many | Release every 3 months
Ensembl [EBI] | Created using automated genome annotation pipeline; eukaryotic genomes only; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. Peptides identified by MS/MS can be mapped to the genome via the Ensembl Protein database and visualized using the Ensembl Genome Browser. | Several | Every 1-2 months
IPI [EBI] | Good balance between degree of redundancy and completeness; references to the primary data sources; attempts to maintain stable identifiers (with incremental versioning), but still in flux. Assembled from UniProt (Swiss-Prot + TrEMBL), RefSeq, Ensembl, H-Invitational database. | A few | Monthly
Entrez Protein (NCBInr) [NCBI] | More complete with regard to sequence polymorphisms and splice forms; annotations extracted from curated databases; high degree of sequence redundancy makes interpretation difficult. Assembled from GenBank and RefSeq coding sequence translations, Protein Information Resource (PIR), Protein Data Bank (PDB), UniProt/Swiss-Prot, Protein Research Foundation (PRF). | Many | Frequent updates

12 Which database?
- What resources are available?
- What is most important for your analysis?
[Diagram: Database trade-off between Quantity (Completeness) and Quality (Accuracy); labels: Protein List; NOISE; Incomplete]

13 Generic Protein Databases
- UniProtKB / NCBI
- Advantages:
  - Ease of access
  - Completeness
  - Updated
- Disadvantages:
  - Redundancy
  - High noise levels
    - Inappropriate species etc.
    - In silico annotation
  - Not made with proteomics in mind
    - May not have relevant variants
[Figure: UniProt data sources and data flow. Labels: Database sources; Patent Data; WormBase; FlyBase; Sub/Peptide Data; PDB; VEGA; Ensembl; RefSeq; INSDC (incl. WGS, Env.); UniParc; UniProtKB; UniMes; UniRef 100; UniRef 90; UniRef 50; UniSave; Proteome Sets; IPI]

14 Genome databases
- Ensembl/FlyBase/WormBase etc.
- Advantages:
  - Organism-specific information
  - Potentially high (~100%) coverage
  - Potentially very low redundancy/noise
- Disadvantages:
  - Very dependent on annotation level/quality
  - Poor annotation = low completeness
  - May need other databases to interpret results
- Best database to use if well-annotated genome available

15 EST libraries
- NCBI dbEST / Organism-specific EST Projects
  - Generate your own!
- Advantages:
  - Reasonable coverage of high expression proteins
    - Matches proteomics bias
  - Species-specific = more accurate matches
  - Enables identification of species-specific proteins
  - Not so reliant on annotation
    - Sequence variants
    - Transcripts without known homology/function
- Disadvantages:
  - Often very poor annotation = Extra work
  - High levels of redundancy = Extra work
  - Sequence fragments = missed identifications
- Search as DNA in six reading frames or annotate proteins first
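Searching ESTs "as DNA in six reading frames" means translating each sequence in three offsets on each strand. The sketch below shows one way to do this with the standard genetic code. The function names and the frame labels ("+1" to "-3") are illustrative choices, and codons containing ambiguous bases are translated as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame label ('+1'..'+3', '-1'..'-3') to its translation."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

At most one of the six frames is the true reading frame, so the other five add random translations to the search space. This is the noise cost of searching ESTs untranslated.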

16 Proteomics search strategies.

17 Proteomics search strategies
- No universal “best” strategy
- Trade-offs: best depends on focus
- Common trade-offs:
  - Completeness vs. Redundancy
  - Local vs. Online
  - Translated vs. Untranslated ESTs
  - Species-specific vs. Generic database
- Why are you doing the experiment?
- What do you want to identify?

18 Completeness vs. Redundancy
- Redundancy
  - Multiple identifications of essentially the same protein
    - Sequence variants within a species
    - Same protein (family) in different species
  - More redundancy = extra work
    - Which hits are unique? (Different peptides)
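The question "which hits are unique?" can be answered by checking which matched peptides belong to only one database entry. The sketch below is one simple way to do that, not a published protein inference algorithm. It assumes the search results are already available as a mapping from protein identifier to the set of peptides matched, and all identifiers and peptide strings in the usage example are hypothetical.

```python
from collections import defaultdict


def classify_hits(hits):
    """Separate distinguishable hits from redundant ones.

    hits: dict mapping protein id -> set of matched peptide sequences.
    Returns (unique, groups):
      unique: protein id -> peptides matched by that protein only
      groups: lists of proteins with identical peptide sets, which the
              data cannot tell apart
    """
    owners = defaultdict(set)
    for protein, peptides in hits.items():
        for peptide in peptides:
            owners[peptide].add(protein)

    unique = {
        protein: {p for p in peptides if len(owners[p]) == 1}
        for protein, peptides in hits.items()
    }

    by_peptide_set = defaultdict(list)
    for protein, peptides in hits.items():
        by_peptide_set[frozenset(peptides)].append(protein)
    groups = sorted(sorted(members) for members in by_peptide_set.values())
    return unique, groups


# Hypothetical example: two isoforms share all peptides, a third hit has one of its own.
example = {
    "tubA_iso1": {"PEPA", "PEPB"},
    "tubA_iso2": {"PEPA", "PEPB"},
    "tubB": {"PEPB", "PEPC"},
}
unique, groups = classify_hits(example)
```

Here the two isoforms end up in one group with no unique peptides, so they can only be reported together, while `tubB` is supported by its own peptide.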

19 Redundancy: identification issues
[Figure caption: Sequences of identified peptides often do not allow discrimination between different protein isoforms]
Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440. Copyright ©2005 American Society for Biochemistry and Molecular Biology

20 Completeness vs. Redundancy
- Redundancy
  - Multiple identifications of essentially the same protein
    - Sequence variants within a species
    - Same protein (family) in different species
  - More redundancy = extra work
    - Which hits are unique? (Different peptides)
    - Poor annotation
    - Splice variant vs. Protein family?
  - More redundancy = lower sensitivity
    - Larger databases = more random hits = stricter score thresholds
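The link between database size and score thresholds can be illustrated with a simple calculation. This is a sketch under stated assumptions, not the scoring scheme of any specific search engine. It assumes scores are reported as -10·log10(p), where p is the probability that a single comparison matches at random, and that we accept at most `alpha` expected random matches over all `n_candidates` comparisons.

```python
import math


def score_threshold(n_candidates, alpha=0.05):
    """Minimum score so that expected random matches stay at or below alpha.

    Assumed model: score = -10 * log10(p) per comparison, and
    expected random matches = n_candidates * p.
    """
    return -10 * math.log10(alpha / n_candidates)


small = score_threshold(1_000)    # focused database
large = score_threshold(10_000)   # ten times more candidates
```

Under this model every tenfold increase in candidates raises the threshold by 10 score units, so a genuine but weak match that passes against the small database can be rejected against the large one.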

21 Completeness vs. Redundancy
[Figure caption: Protein sequence databases differ in terms of their completeness and the degree of sequence redundancy]
[Diagram: Database trade-off between Quantity (Completeness) and Quality (Non-Redundancy)]
Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

22 NCBI Redundancy
[Figure from Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440]

23 Completeness vs. Redundancy
Large, Redundant Database | Small, Non-Redundant Database
Every major protein and specific variant is important | Happy to group similar hits to a single protein/family (e.g. HSP90)
Willing and able to perform extensive post-identification analysis | Want to minimise need for additional cleanup/data analysis
- Try to maximise both quality and quantity
  - Quality genome, IPI or custom-built search database
[Diagram: axes Quantity (Completeness) and Quality (Non-Redundancy); labels as extracted: dbEST Genome X IPI]

24 An example of a protein family: alpha tubulins
[Figure label: Inconclusive identification]
Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

25 Local vs. Online
- Standard databases plug in to search engines
- Can also search “Local” databases, stored on own machine
Standard Online Database | Local Database
Regularly updated: latest sequence data & annotation | More control of content: customise species & sequences
Easy to describe & reference | Stable database for multiple searches
Don’t have to worry about sequence formats/naming etc. | Eases comparisons & redundancy removal across multiple experiments
 | Eases generation of decoy database
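Customising a local database by species and removing redundancy can be sketched as a small FASTA filter. This is a minimal sketch that only removes exact duplicate sequences; near-identical variants would need a clustering tool. It assumes the species can be recognised from a tag in the header line, such as the `OS=` field used in UniProt FASTA headers, and the function names are illustrative.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_local_database(records, species_tag):
    """Keep entries whose header contains species_tag, dropping exact duplicates.

    The first entry seen for each distinct sequence is retained.
    """
    seen = set()
    kept = []
    for header, sequence in records:
        if species_tag not in header or sequence in seen:
            continue
        seen.add(sequence)
        kept.append((header, sequence))
    return kept
```

Writing the filtered records back to a FASTA file and keeping that file unchanged across experiments gives the stable, comparable search database described in the table above.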

26 EST libraries
- Can search untranslated in six reading frames (RFs)
- Or assemble, annotate & search proteins
Untranslated | Assembled & Annotated
Quick and easy preparation | Time-consuming and difficult (unless already done!)
Suffers from short fragments | Longer sequences = more chance of multiple peptides
Potential to detect SNPs/isoforms | Assembly may incorrectly assimilate/remove variants
Low coverage: less robust to sequencing error | High coverage: more robust to sequencing error
Large quantity of random translations (UTRs + 5 incorrect RFs) | Smaller, higher quality dataset = fewer False Positives
Detect novel proteins (no homology to known proteins) | ORFs without homology to known proteins may be removed
Need to annotate hits |

27 Search species
- Best to use species-specific data where possible
  - More chance of identical peptide sequences
  - More chance of family member/isoform discrimination
  - Can identify taxa-specific proteins
- Can search in other species and infer
  - Only works for conserved proteins
  - Increases noise and False Positive rate
- Compromise
  - Subset of well-annotated & closely-related species
  - Maximise completeness, minimise noise

28 Poor genome vs. Wrong species
- Annotation generally based on homology to known proteins
  - Proteins similar enough to be found by proteomics will be easy to find & annotate in genome
- What is the genome coverage?
  - High coverage = little to be gained by additional search
  - Low coverage = may be many conserved proteins missing
    - Multi-species search may find extra proteins
- Compromise:
  - Search available species data (genome/EST)
  - Second search against selected taxa (UniProtKB)
- Bacteria: can search genome in 6RF! (No introns)

29 Which database? How to choose.

30 How to choose?
- What do you want to do?
- Priority
  - Reproducibility/Comparability/Hypothesis testing
    - Fewer, high quality identifications
    - Smaller, more focused database
  - Hypothesis generation
    - More identifications, more potential false positives
    - Larger/multiple search databases
- How much post-identification analysis?
  - Not much
    - Higher quality, lower numbers
  - Detailed manual analysis
    - More identifications, more potential false positives
- Always sensible to look for probable identifications
  - Are sequences missing from your search database?

31 Experimental design
- Protein separation
  - Better discrimination of variants
  - Cope with more redundancy
- Shotgun (no separation)
  - Less discrimination
Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440

32 Decoy databases.

33 False positives: NOISE
[Diagram labels: Database; Protein List; NOISE; Incomplete]

34 False positives: NOISE
- How much noise?
- How many False Positives due to noise?
[Diagram labels: Database; Decoy Database; Protein List; NOISE; Incomplete; RANDOM]
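A decoy database gives a direct estimate of how many hits arise from noise: any match to a decoy sequence is random by construction. The sketch below assumes decoys are made by reversing each target sequence, which is one common construction (shuffling is another); the slides do not say which was intended. The `REV_` prefix, the function names, and the hit format of (protein id, score) pairs are all illustrative.

```python
def make_decoys(records, prefix="REV_"):
    """Build decoy entries by reversing each (header, sequence) record."""
    return [(prefix + header, sequence[::-1]) for header, sequence in records]


def estimate_false_positive_rate(hits, threshold, prefix="REV_"):
    """Estimate the false positive rate among target hits at a score threshold.

    hits: list of (protein_id, score) from a search of targets plus decoys.
    Assumes random matches hit targets and decoys equally often, so the
    number of decoy hits approximates the number of false target hits.
    """
    passing = [protein_id for protein_id, score in hits if score >= threshold]
    decoys = sum(protein_id.startswith(prefix) for protein_id in passing)
    targets = len(passing) - decoys
    return decoys / targets if targets else 0.0
```

Raising the threshold until the estimated rate falls to an acceptable level turns the completeness vs. noise trade-off into a number that can be reported alongside the protein list.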

35 Conclusions.

36 Summary
- Selection of database is very important for quality of results
  - Primarily a trade-off between completeness & noise
- Choice of database depends on:
  - Availability of data
  - Experimental design
  - Aims/objectives of study
  - Priorities of analysis
- Well annotated genomes (proteomes!) best
- Poorly annotated genomes & ESTs can be supplemented with searching related taxa
- Local databases give more control/repeatability
- Decoy databases can help estimate false positive rates

37 Open Discussion
