1
Bioinformatic Strategies for proteomics: which database? Dr Richard J Edwards. 27 August 2009, CALMARO workshop. R.Edwards@Southampton.ac.uk
2
Bioinformatic Strategies for proteomics: which database?
Sequence databases
  The importance of sequence databases in proteomics
  Database options
Proteomics search strategies
  Completeness vs. Redundancy
  Local vs. online databases
  Species-specific vs. multi-species
  EST libraries
Which database?
Decoy databases
Open Discussion
3
Sequence databases.
4
The importance of the database
[Diagram: Database searched to give Protein List]
5
Idealised assumptions for matching spectra
Protein sample is pure
  All peaks correspond to protein fragments
Protein fragmentation is complete and perfect
  Can predict peaks from sequence
  Every peak is present & precise
  Peaks can be matched exactly
Exact protein will be present in database
  Every peak should find in silico match
Reality:
  Impurities
  Incomplete digestion
  Measurement error
  Post-translational modifications
[Diagram: The importance of the database; labels: NOISE, Incomplete]
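The idealised assumption that peaks can be predicted from sequence can be sketched as an in silico digest followed by a mass calculation. This is a minimal illustration only: it assumes trypsin (cleavage after K or R, not before P), no missed cleavages and no modifications, and the function names are hypothetical rather than taken from any search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide (free N- and C-termini)


def tryptic_digest(protein):
    """Assumed rule: cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_peaks(observed, predicted, tol=0.01):
    """Return observed masses lying within tol (Da) of any predicted mass."""
    return [m for m in observed if any(abs(m - p) <= tol for p in predicted)]


if __name__ == "__main__":
    for pep in tryptic_digest("AKRPGKM"):
        print(pep, round(peptide_mass(pep), 4))
```

The "Reality" items each break one step of this sketch: impurities add observed masses with no predicted partner, incomplete digestion produces peptides the digest never generates, and modifications shift masses away from the predicted values.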
6
Quality vs. Quantity
Trade-off in any analysis
[Plot: Quantity (Completeness) vs. Quality (Accuracy); labels: Ideal, Trade-off, Assumptions, Database, Reality]
7
Idealised database for matching spectra
Every protein in sample is present in database
  100% coverage
No unnecessary duplication of data
  Non-redundant
Sequence present in database matches sequence in sample
  Every peak should find exact in silico match
Reality:
  Incomplete proteome annotation
    Proteins missing
    Duplicate entries (redundancy)
  Incomplete protein annotation
    Truncated sequences
    Sequencing errors
    Missing sequence variants
[Diagram: The importance of the database; labels: NOISE, Incomplete]
8
The importance of the database
[Diagram: Database searched to give Protein List; labels: NOISE, Incomplete]
9
The importance of the database
[Diagram: Database searched to give Protein List; labels: NOISE, Incomplete]
10
Database options.
11
Common proteomics databases
Database | Description; source databases | Organisms | Update frequency
UniProt/Swiss-Prot [EBI] | Expertly curated; high level of annotation; minimum level of redundancy; high level of integration with other databases. | Many | Release every 4 months; updates every 2 weeks
UniProt/TrEMBL [EBI] | Computer-annotated supplement to UniProt/Swiss-Prot. Contains translated coding sequences from the GenBank nucleotide database, protein sequences extracted from the literature or submitted to UniProt/Swiss-Prot but not yet manually curated. | Many | Release every 4 months; updates every 2 weeks
RefSeq [NCBI] | Ongoing curation by NCBI staff; non-redundant; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. | Many | Release every 3 months
Ensembl [EBI] | Created using automated genome annotation pipeline; eukaryotic genomes only; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. Peptides identified by MS/MS can be mapped to the genome via Ensembl Protein database and visualized using Ensembl Genome Browser. | Several | Every 1–2 months
IPI [EBI] | Good balance between degree of redundancy and completeness; references to the primary data sources; attempts to maintain stable identifiers (with incremental versioning), but still in flux. Assembled from UniProt (Swiss-Prot + TrEMBL), RefSeq, Ensembl, H-Invitational database. | A few | Monthly
Entrez Protein (NCBInr) [NCBI] | More complete with regard to sequence polymorphisms and splice forms; annotations extracted from curated databases; high degree of sequence redundancy makes interpretation difficult. Assembled from GenBank and RefSeq coding sequence translations, Protein Information Resource (PIR), Protein Data Bank (PDB), UniProt/Swiss-Prot, Protein Research Foundation (PRF). | Many | Frequent updates
12
Which database?
What resources are available?
What is most important for your analysis?
[Diagram: Database trade-off between Quantity (Completeness) and Quality (Accuracy); Database searched to give Protein List; labels: NOISE, Incomplete]
13
Generic Protein Databases
UniProtKB / NCBI
Advantages:
  Ease of access
  Completeness
  Updated
Disadvantages:
  Redundancy
  High noise levels
  Inappropriate species etc.
  In silico annotation
  Not made with proteomics in mind
  May not have relevant variants
[Figure: UniProt data sources and data flow. Labels: Database sources, Patent Data, UniParc, WormBase, FlyBase, Sub/Peptide Data, PDB, VEGA, Ensembl, RefSeq, INSDC (incl. WGS, Env.), UniProtKB, UniMES, UniRef100, UniRef90, UniRef50, UniSave, Proteome Sets, IPI]
14
Genome databases
Ensembl/FlyBase/WormBase etc.
Advantages:
  Organism-specific information
  Potentially high (~100%) coverage
  Potentially very low redundancy/noise
Disadvantages:
  Very dependent on annotation level/quality
  Poor annotation = low completeness
  May need other databases to interpret results
Best database to use if well-annotated genome available
15
EST libraries
NCBI dbEST / Organism-specific EST Projects
Generate your own!
Advantages:
  Reasonable coverage of high expression proteins
    Matches proteomics bias
  Species-specific = more accurate matches
    Enables identification of species-specific proteins
  Not so reliant on annotation
    Sequence variants
    Transcripts without known homology/function
Disadvantages:
  Often very poor annotation = Extra work
  High levels of redundancy = Extra work
  Sequence fragments = missed identifications
Search as DNA in six reading frames or annotate proteins first
16
Proteomics search strategies.
17
Proteomics search strategies
No universal “best” strategy
Trade-offs: best depends on focus
Common trade-offs:
  Completeness vs. Redundancy
  Local vs. Online
  Translated vs. Untranslated ESTs
  Species-specific vs. Generic database
Why are you doing the experiment?
What do you want to identify?
18
Completeness vs. Redundancy
Redundancy
  Multiple identifications of essentially the same protein
    Sequence variants within a species
    Same protein (family) in different species
More redundancy = extra work
  Which hits are unique? (Different peptides)
19
Redundancy: identification issues
Sequences of identified peptides often do not allow discrimination between different protein isoforms
[Figure: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440. Copyright ©2005 American Society for Biochemistry and Molecular Biology]
20
Completeness vs. Redundancy
Redundancy
  Multiple identifications of essentially the same protein
    Sequence variants within a species
    Same protein (family) in different species
More redundancy = extra work
  Which hits are unique? (Different peptides)
  Poor annotation
  Splice variant vs. Protein family?
More redundancy = lower sensitivity
  Larger databases = more random hits = stricter score thresholds
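The question "Which hits are unique?" can be illustrated by mapping each identified peptide back to every database entry that contains it. The sketch below uses plain substring matching on a toy two-isoform database; the identifiers and function names are hypothetical, and real search engines apply their own protein-grouping rules.

```python
def map_peptides(peptides, database):
    """Map each peptide to the sorted IDs of all entries containing it."""
    return {
        pep: sorted(pid for pid, seq in database.items() if pep in seq)
        for pep in peptides
    }


def unique_and_shared(hits):
    """Split peptides into those matching one entry and those matching several."""
    unique = {pep: ids[0] for pep, ids in hits.items() if len(ids) == 1}
    shared = {pep: ids for pep, ids in hits.items() if len(ids) > 1}
    return unique, shared


if __name__ == "__main__":
    # Toy database: two isoforms differing only near the C-terminus.
    database = {
        "isoA": "MKTAYIAKQRLLGS",
        "isoB": "MKTAYIAKQRVVGS",
    }
    hits = map_peptides(["TAYIAK", "LLGS"], database)
    unique, shared = unique_and_shared(hits)
    print("unique:", unique)  # peptides that discriminate between entries
    print("shared:", shared)  # peptides that cannot tell isoforms apart
```

Only the unique peptides support one entry over another; a redundant database increases the proportion of shared peptides and therefore the amount of manual sorting needed afterwards.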
21
Completeness vs. Redundancy
Protein sequence databases differ in terms of their completeness and the degree of sequence redundancy
[Figure: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440. Plot: Quantity (Completeness) vs. Quality (Non-Redundancy); labels: Trade-off, Database]
22
NCBI Redundancy
[Figure: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440]
23
Completeness vs. Redundancy
Large, Redundant Database | Small, Non-Redundant Database
Every major protein and specific variant is important | Happy to group similar hits to a single protein/family (e.g. HSP90)
Willing and able to perform extensive post-identification analysis | Want to minimise need for additional cleanup/data analysis
Try to maximise both quality and quantity
  Quality genome, IPI or custom-built search database
[Plot: Quantity (Completeness) vs. Quality (Non-Redundancy); points: dbEST, Genome, X, IPI]
24
An example of a protein family: alpha tubulins
Inconclusive identification
[Figure: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440]
25
Local vs. Online
Standard databases plug in to search engines
Can also search “Local” databases, stored on own machine
Standard Online Database | Local Database
Regularly updated: latest sequence data & annotation | More control of content: customise species & sequences
Easy to describe & reference | Stable database for multiple searches
Don’t have to worry about sequence formats/naming etc. | Ease comparisons & redundancy removal across multiple experiments
 | Eases generation of decoy database
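As a rough illustration of the "more control of content" point, a local database can be built by filtering a FASTA download by species and dropping exact duplicate sequences. The sketch assumes UniProt-style headers, which carry the organism as `OS=<name>`; the function names and toy records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_local_db(records, species):
    """Keep records from the chosen species, skipping identical sequences."""
    seen, kept = set(), []
    for header, seq in records:
        # Assumption: organism is given as "OS=<name>" in the header.
        if not any(f"OS={sp}" in header for sp in species):
            continue
        if seq in seen:  # exact-duplicate removal only
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept


if __name__ == "__main__":
    fasta = """>sp|P1|A_HUMAN Protein A OS=Homo sapiens OX=9606
MKTAYIAK
QRLLGS
>sp|P2|A_MOUSE Protein A OS=Mus musculus OX=10090
MKTAYIAKQRVVGS
>tr|P3|B_HUMAN Duplicate of A OS=Homo sapiens OX=9606
MKTAYIAKQRLLGS
"""
    for header, seq in build_local_db(parse_fasta(fasta), ["Homo sapiens"]):
        print(header, len(seq))
```

Saving the filtered file alongside the search results gives the stable, repeatable database described in the right-hand column; near-identical variants would need a separate clustering step.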
26
EST libraries
Can search untranslated in six reading frames (RFs)
Or Assemble, annotate & search proteins
Untranslated | Assembled & Annotated
Quick and easy preparation | Time-consuming and difficult (unless already done!)
Suffers from short fragments | Longer sequences = more chance of multiple peptides
Potential to detect SNPs/isoforms | Assembly may incorrectly assimilate/remove variants
Low coverage: less robust to sequencing error | High coverage: more robust to sequencing error
Large quantity of random translations (UTRs + 5 incorrect RFs) | Smaller, higher quality dataset = fewer False Positives
Detect novel proteins (no homology to known proteins) | ORFs without homology to known proteins may be removed
Need to annotate hits |
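Searching untranslated ESTs means translating each sequence in all six reading frames: three on the forward strand and three on the reverse complement. The following is a minimal sketch using the standard genetic code; function names are illustrative, and codons containing ambiguous bases are written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become X, stops *."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations keyed by frame label (+1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

At most one of the six outputs is the real protein, which is where the "5 incorrect RFs" of random translation in the table come from.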
27
Search species
Best to use species-specific data where possible
  More chance of identical peptide sequences
  More chance of family member/isoform discrimination
  Can identify taxa-specific proteins
Can search in other species and infer
  Only works for conserved proteins
  Increases noise and False Positive rate
Compromise
  Subset of well-annotated & closely-related species
  Maximise completeness, minimise noise
28
Poor genome vs. Wrong species
Annotation generally based on homology to known proteins
  Proteins similar enough to be found by proteomics will be easy to find & annotate in genome
What is the genome coverage?
  High coverage = little to be gained by additional search
  Low coverage = may be many conserved proteins missing
  Multi-species search may find extra proteins
Compromise:
  Search available species data (genome/EST)
  Second search against selected taxa (UniProtKB)
Bacteria: can search genome in 6RF! (No introns)
29
Which database? How to choose.
30
How to choose?
What do you want to do? Priority
  Reproducibility/Comparability/Hypothesis testing
    Fewer, high quality identifications
    Smaller, more focused database
  Hypothesis generation
    More identifications, more potential false positives
    Larger/multiple search databases
How much post-identification analysis?
  Not much
    Higher quality, lower numbers
  Detailed manual analysis
    More identifications, more potential false positives
Always sensible to look for probable identifications
  Are sequences missing from your search database?
31
Experimental design
Protein separation
  Better discrimination of variants
  Cope with more redundancy
Shotgun (no separation)
  Less discrimination
[Figure: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440]
32
Decoy databases.
33
False positives: NOISE
[Diagram: Database searched to give Protein List; labels: NOISE, Incomplete]
34
False positives: NOISE
How much noise?
How many False Positives due to noise?
[Diagram: Database (labels: NOISE, Incomplete) and Decoy Database (labels: NOISE, RANDOM), each searched to give a Protein List]
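One way to put a number on "how many False Positives due to noise?" is to search a decoy database alongside the real one and count decoy hits that pass the score threshold. The sketch below assumes reversed sequences as decoys (shuffled or random sequences are alternatives) and a simple decoy/target ratio; the `DECOY_` prefix, scores and function names are hypothetical.

```python
def make_decoy_db(database, prefix="DECOY_"):
    """Build decoy entries by reversing each target sequence (one common choice)."""
    return {prefix + pid: seq[::-1] for pid, seq in database.items()}


def estimate_fdr(hits, threshold, prefix="DECOY_"):
    """Estimate the false positive rate among accepted target hits.

    hits: iterable of (protein_id, score) from a search of target + decoy.
    Returns decoy hits / target hits at or above the threshold.
    """
    accepted = [pid for pid, score in hits if score >= threshold]
    decoys = sum(pid.startswith(prefix) for pid in accepted)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    targets = {"P1": "MKTAYIAKQRLLGS"}
    print(make_decoy_db(targets))

    # Hypothetical search results against the combined database.
    hits = [("P1", 80), ("P2", 60), ("DECOY_P3", 55), ("P4", 50), ("DECOY_P1", 20)]
    for threshold in (50, 70):
        print(threshold, round(estimate_fdr(hits, threshold), 3))
```

Because decoy sequences should not be in the sample, any decoy hit is noise by construction, and the number of decoy hits approximates the number of random hits hidden among the real identifications at the same threshold.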
35
Conclusions.
36
Summary
Selection of database is very important for quality of results
  Primarily a trade-off between completeness & noise
Choice of database depends on:
  Availability of data
  Experimental design
  Aims/objectives of study
  Priorities of analysis
Well annotated genomes (proteomes!) best
Poorly annotated genomes & ESTs can be supplemented with searching related taxa
Local databases give more control/repeatability
Decoy databases can help estimate false positive rates
37
Open Discussion R.Edwards@Southampton.ac.uk