1
Bioinformatics: analysis of proteomic data (in not much detail)
Dr Richard J Edwards, 28 August 2009; CALMARO workshop
Image ©Gary Larson
2
Bioinformatic analysis of proteomic data
- Improving sequence identifications
  - Dealing with redundancy
  - Annotating protein hits
- Adding value to protein lists
  - Accession number mapping & data integration
  - Gene Ontology analysis
  - Protein interaction networks
- Example: identifying E. huxleyi proteins with multi-species and EST sequence databases
- Open Discussion
3
Improving identifications: dealing with redundancy.
4
Identifying redundancy
- Choice of database affects redundancy identification
  - SwissProt/IPI indicate splice variants
  - EnsEMBL peptides map back onto non-redundant gene IDs
  - Poor annotation: hard to differentiate variant/error/family
Figure source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440. Copyright ©2005 American Society for Biochemistry and Molecular Biology
5
Identifying redundancy
- Sometimes, identification cannot be conclusive
Figure: Example: alpha tubulin protein family. Source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440
6
Identifying redundancy
- Sometimes, identification cannot be conclusive
- Different scenarios can present different problems
  - How important is it to study?
  - Might need to identify protein(s) through further experiments
Figure: Basic peptide grouping scenarios. Source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440
7
Identifying redundancy
- Final protein list:
  - Conclusive IDs
  - Protein groups
  - Inconclusive IDs
- Are inconclusive/group hits redundant?
  - Same protein from different species
  - Splice variants
- Does it matter?
  - Inflated numbers
  - Biased analyses
  - Comparisons between experiments
Figure: A simplified example of a protein summary list. Peptide key: Unique to protein; Unique to group; No unique. Source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: 1419-1440
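The three categories in the final protein list can be sketched in code. This is a minimal illustration and not the procedure from Nesvizhskii (2005): the function name, the input format (a dict from peptide to the set of proteins containing it) and the rule that proteins with identical peptide sets form one group are all assumptions made for the example.

```python
from collections import defaultdict

def classify_proteins(peptide_to_proteins):
    """Label each protein 'conclusive', 'group' or 'inconclusive'.

    peptide_to_proteins: dict mapping peptide sequence -> set of protein IDs
    containing that peptide (hypothetical input format).
    """
    # Invert the mapping: which peptides support each protein?
    protein_to_peptides = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides[protein].add(peptide)

    # Assumption: proteins with identical peptide sets cannot be told apart,
    # so they are reported together as one protein group.
    groups = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].add(protein)

    labels = {}
    for peptides, members in groups.items():
        # A peptide is unique to the group if it maps to no protein outside it.
        has_unique = any(peptide_to_proteins[p] <= members for p in peptides)
        if not has_unique:
            label = "inconclusive"   # no unique peptide
        elif len(members) == 1:
            label = "conclusive"     # peptide unique to one protein
        else:
            label = "group"          # peptide unique to the group only
        for protein in members:
            labels[protein] = label
    return labels
```

Counting only the conclusive IDs and one entry per group gives a protein total that is not inflated by shared peptides.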
8
Homology groupings
- Can use BLAST to identify groups of related proteins
  - Help identify possible redundancies
  - Need to look at peptides
- Particularly useful for “off-species” identifications
  - Tendency for many hits to same protein in different species
- Clustering proteins by %identity
http://www.southampton.ac.uk/~re1u06/software/gablam/
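Clustering by %identity can be illustrated with a small single-linkage sketch. It is not the GABLAM algorithm: the function name, the 90% default threshold and the input format (pairwise identities already parsed from BLAST output) are assumptions for the example.

```python
def cluster_by_identity(pairs, threshold=90.0):
    """Single-linkage clustering of proteins from pairwise % identity.

    pairs: iterable of (protein_a, protein_b, percent_identity) tuples,
    e.g. parsed from BLAST results (hypothetical input format).
    Returns a sorted list of clusters, each a sorted list of protein IDs.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity in pairs:
        root_a, root_b = find(a), find(b)
        # Merge two clusters when any pair of members passes the threshold.
        if identity >= threshold and root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])
```

Single linkage chains proteins together, so one cluster can contain members that are only related through an intermediate; the peptides still need checking before hits are called redundant.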
9
Improving identifications: annotating protein hits.
10
Protein annotation
- Poorly (un)annotated proteins
  - Real proteins or database noise?
  - Reliable annotation?
Diagram labels: Database; Protein List; NOISE
11
Where does the data come from?
- Most of our protein data comes from DNA sequences
  - PDB: 53,660 structures = 3D
  - SwissProt: 392,667 = Curated
  - TrEMBL: >6 million & UniParc: >16 million = Most inferred from DNA
- Most annotation inferred through sequence analysis
- Protein data from translated DNA
- Lots of errors!
  - Sequence errors
  - Annotation errors
Diagram labels: Annotation; Translation
12
Protein annotation
- Use standard sequence analysis tools
  - Manual guidance/care = better than automated databases!
- Homology searching
  - BLAST vs. UniProtKB
  - Protein domain searches, e.g. Pfam
- Conservation analysis
  - Multiple sequence alignment with homologues
  - Are functionally important sites conserved?
- Phylogenetic analysis
  - Evolutionary relationships can help distinguish function
  - Assignment to protein subfamily etc.
  - Useful where BLAST hits have competing annotation
http://www.southampton.ac.uk/~re1u06/software/haqesac/
13
Beyond proteomics: adding value to protein lists.
14
What Bioinformatics cannot (usually) do
- Magic
- Replace hypothesis driven research
  - Directed analysis is always better than “fishing” (e.g. GO)
- Provide a definitive answer
  - Ranking/prioritising better
15
Follow-up analyses
- Many possibilities
  - What was the aim of the study?
  - What resources are available for your organism?
- Imitation is the sincerest form of flattery
  - Find a good study and copy the best bits
  - Easier to describe
  - Easier to justify to reviewers
- Hypothesis-driven analysis is best
  - Many tools facilitate hypothesis generation (data exploration)
  - Be aware of risk of testing a hypothesis on data used to generate it
  - Be aware of multiple testing issues
16
Follow-up analyses
- EBI and NCBI both provide many useful tools
- EBI run many good courses at Hinxton
http://www.ebi.ac.uk/Tools/
17
Seek collaborations
- Find a tame bioinformatician to help if needed
- Good collaboration = Trade
  - Papers / Grants / improving the bioinformatics
  - E.g. adding your organism/database to an online resource
Diagram labels: Time / Energy; Reward; Bioinformatics
Image ©Gary Larson
18
Accession number mapping
- Other databases may contain better/specific annotation
  - UniProtKB, OMIM etc.
- Results from searches against older databases may need updating
- EBI tool: PICR [Protein Identifier Cross-Reference Service]
  - http://www.ebi.ac.uk/Tools/picr/
- BioMart: Query & Xref tool for many databases
  - www.biomart.org
19
BioMart
20
Gene Ontology analysis
- Gene Ontology [GO] = gene annotation project
- Controlled vocabulary allows standardisation & comparisons
http://www.geneontology.org/
21
Gene Ontology analysis
- Many Gene Ontology exploration tools
  - AmiGO, GOA, FatiGO, DAVID etc.
  - Depend on source databases
  - May need to map IDs using PICR first
- GO enrichment
  - Assess frequency of GO terms in your list against expectation
  - Often a big multiple testing issue
  - Be aware of biases: how is expectation derived?
    - E.g. abundant, conserved proteins more likely to be annotated & more likely to be identified in a proteomics experiment
  - Best if hypothesis-driven or used for data confirmation
    - E.g. enrichment of certain subcellular fraction
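The enrichment step can be sketched as a one-sided hypergeometric test with a Bonferroni correction for the number of terms tested. This is an illustrative sketch, not how AmiGO, FatiGO or DAVID compute their statistics: the function names, the annotation format and the choice of Bonferroni are assumptions. The `background` argument is where the bias warning applies: it should hold the proteins that could realistically have been identified, not the whole proteome.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): n proteins drawn from N, of which K carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def go_enrichment(hit_list, background, annotations, alpha=0.05):
    """Return [(term, p, bonferroni_p, significant)] sorted by p-value.

    annotations: dict protein ID -> set of GO term IDs (hypothetical format).
    background: all proteins that could have been identified.
    """
    background = set(background)
    hits = set(hit_list) & background
    # Only terms seen in the hit list are tested; each one counts as a test.
    terms = set()
    for protein in hits:
        terms |= annotations.get(protein, set())
    results = []
    for term in terms:
        K = sum(1 for p in background if term in annotations.get(p, set()))
        k = sum(1 for p in hits if term in annotations.get(p, set()))
        p = hypergeom_pvalue(k, len(hits), K, len(background))
        p_adj = min(1.0, p * len(terms))  # Bonferroni correction
        results.append((term, p, p_adj, p_adj <= alpha))
    return sorted(results, key=lambda r: (r[1], r[0]))
```

Bonferroni is conservative; a false discovery rate procedure could be swapped in, but any correction only addresses multiple testing and does nothing about a badly chosen background.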
22
Protein interaction networks
- Can be useful for identifying protein complexes in data
- E.g. STRING [http://string-db.org/]
23
Example: identifying E. huxleyi proteins with multi-species and EST sequence databases
24
Combined search strategy
- Genome unavailable (for download & searching)
- dbEST: 90,000 E hux ESTs
- Thalassiosira pseudonana
- Taxa-limited Database: :Rhodophyta: :Stramenopiles: :Haptophyceae: :Alveolata: :Cryptophyta:
- Protein List
25
Flow chart labels: EST dataset; BLAST database; MS/MS data; MASCOT hits; Translated to 6RFs; RFs and MASCOT peptides filtered; FIESTA consensus & annotation; Final protein identifications
Box labels: BUDAPEST CORE (steps 1-5); Poor quality RFs removed; OPTIONAL (MANUAL or AUTOMATED)
Counts on the chart: 90,000 E hux ESTs; 173 ESTs, 728; 189 RFs, 113, 615; Taxa-limited Database; 117 Cons, 321; 34 Cons, 34; 83 Cons, 287
- 173 EST hits (728 peptides)
- 83 Consensus sequences
  - 40 Clusters by homology (variants/isoforms)
- 287 Peptides
  - 239 Unique to one consensus
  - 48 Shared within one cluster
http://www.southampton.ac.uk/~re1u06/software/budapest/
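The "Translated to 6RFs" step refers to translating each EST in all six reading frames (three on each strand), since the coding strand and frame of an EST are unknown. A minimal sketch using the standard genetic code follows; it is not BUDAPEST's implementation, and the function names and the use of 'X' for ambiguous codons are assumptions for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate DNA from its first base; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_translation(est):
    """Return {frame: protein} for frames +1..+3 and -1..-3 ('*' marks stops)."""
    forward = est.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames
```

For example, `six_frame_translation("ATGGCCTAA")[1]` gives `"MA*"`. Frames full of stop codons or `X` characters are natural candidates for the poor-quality reading frames that a filtering step would remove.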
26
Annotating EST Consensus Sequences
- Homology searching & phylogenetics
Diagram labels: Sequence Database; Consensus; UniProt; Taxa-limited Database; Alignment
27
Protein family identification
28
Redundancy/ Variants
29
Combined search strategy
- Genome unavailable (for download & searching)
- dbEST: 90,000 E hux ESTs; 173 Hits; 83 Consensus; 40+ Proteins
- Thalassiosira pseudonana
- Taxa-limited Database (:Rhodophyta: :Stramenopiles: :Haptophyceae: :Alveolata: :Cryptophyta:): 96 Hits; 26+ Proteins
- 64+ Proteins (12 Common)
30
Conclusions.
31
Summary
- Extra analysis of raw protein lists adds value
  - False positives vs. real proteins
  - Annotation of uncharacterised hits
- Numerous tools for mining protein lists
  - Data exploration and/or hypothesis testing
  - Community/organism dependent
- Worth contacting bioinformaticians for further development
  - Development of customised bioinformatics solutions can greatly increase power of study
- Increased availability of high throughput technologies
  - Poor annotation & high error rates
  - Increased need for bioinformatics post-processing to improve quality
32
Open Discussion
R.Edwards@Southampton.ac.uk