Published by Brad Coatsworth. Modified over 9 years ago.
1
Proteogenomics: Refining and Improving Genome Annotation
Samuel H. Payne, J. Craig Venter Institute
2
State of Genome Annotation
Most prokaryotic genomes are auto-annotated. Sequence and function are inferred with comparative genomics; validation is sparse.
Difficulties with novel or horizontally transferred (HGT) genes
Mature protein features: localization, PTM, cleavage
(Salzberg 2007)
3
Diversity or Confusion
4
Proteomics
Input: protein sample
Output: list of peptides
5
Proteogenomics
Definition: using proteomics data to do genome annotation
Goals: find all coding regions of the genome, annotated and unannotated; submit improved annotation to NCBI; identify “mature protein” features
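To illustrate the first goal, one common way to locate coding regions regardless of annotation is to translate the genome in all six reading frames and look up each identified peptide. The sketch below is only an illustration under assumptions, not the pipeline used in this talk: the function names and the toy sequence are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(genome):
    """Map (strand, offset) to the translation of that reading frame."""
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames


def map_peptide(peptide, genome):
    """Return (strand, offset, nucleotide start on that strand) for every hit."""
    hits = []
    for (strand, offset), protein in six_frame(genome).items():
        pos = protein.find(peptide)
        while pos != -1:
            hits.append((strand, offset, offset + 3 * pos))
            pos = protein.find(peptide, pos + 1)
    return hits


# Toy example: the peptide MKRG is encoded starting at nucleotide 2.
toy_genome = "CC" + "ATGAAACGTGGTTAA" + "G"
toy_hits = map_peptide("MKRG", toy_genome)
```

Hits that fall outside any annotated gene would be candidates for unannotated coding regions; hits inside a gene support the existing annotation.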
6
Proteogenomics Protocol
Data sources: Yersinia pestis (Pieper et al., 2008, 2009); Bacillus anthracis (PRC/NIAID)
7
Correcting Errors
Unannotated genes: both known and totally novel
8
Correcting Errors
Unannotated genes: both known and totally novel
9
Correcting Errors
Start site assignment
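A peptide observed upstream of an annotated start, in the same reading frame, implies the true start lies further upstream. A minimal sketch of that reasoning follows, assuming forward-strand 0-based coordinates and an assumed start codon set of ATG, GTG, and TTG; the function name and the rule of taking the nearest upstream start codon are illustrative choices, not the method from the talk.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}  # assumed set of common bacterial start codons


def propose_start(genome, annotated_start, peptide_start):
    """Suggest a start codon position consistent with an observed peptide.

    Coordinates are 0-based on the forward strand. peptide_start is the
    position of the first codon of the most upstream peptide.
    Returns annotated_start if the peptide does not lie upstream of it,
    the nearest in-frame start codon at or upstream of the peptide otherwise,
    or None if a stop codon is reached first.
    """
    if (annotated_start - peptide_start) % 3 != 0:
        raise ValueError("peptide is not in the annotated reading frame")
    if peptide_start >= annotated_start:
        return annotated_start
    pos = peptide_start
    while pos >= 0:
        codon = genome[pos:pos + 3]
        if codon in STOPS:
            return None  # the open reading frame ends before any start codon
        if codon in STARTS:
            return pos
        pos -= 3
    return None


# Toy example: annotated start at 12, but a peptide begins at codon position 6.
toy = "TAA" + "ATG" + "GCT" + "AAA" + "ATG" + "CCC" + "TAA"
new_start = propose_start(toy, annotated_start=12, peptide_start=6)
```

Extending STARTS would be one way to accommodate non-canonical start codons, at the cost of more false candidates.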
10
Exceptions to Rules
Multi-ORF genes: self-splicing, frameshift
11
Exceptions to Rules
Non-canonical start codons
infC: ATT in enterobacteria (Sacerdot 1982, Payne 2010); ATA in Shewanella (Gupta 2007)
Deinococcus (Baudet 2009) suggests new non-standard starts
12
Overlaps/Wrong Frames
13
Pseudo?genes
Expression of an ABC transporter N-terminus. Missing critical motif elements.
5 peptides (with splicing) map to a transposable element gene. Sequence alignment to an Arabidopsis Ulp1. (Castellana 2008)
14
Signal Peptide
N-terminal motif that targets a protein for export (Perlman & Halvorson 1983)
Early basic residue, hydrophobic patch, AxB motif, where A = [I,V,L,A,G,S] and B = [A,G,S]
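The three features named on this slide can be checked with a few lines of code. The following is a rough sketch only: the AxB residue classes come from the slide, while the window length, the definition of "early", the hydrophobic alphabet, and the minimum patch length are all assumptions chosen for illustration.

```python
import re

# AxB motif from the slide: A in [IVLAGS], any residue, B in [AGS].
# A lookahead is used so that overlapping sites are all reported.
AXB = re.compile(r"(?=([IVLAGS].[AGS]))")

HYDROPHOBIC = set("AVLIFMWC")  # assumed hydrophobic alphabet


def longest_hydrophobic_run(seq):
    """Length of the longest uninterrupted run of hydrophobic residues."""
    best = run = 0
    for aa in seq:
        run = run + 1 if aa in HYDROPHOBIC else 0
        best = max(best, run)
    return best


def signal_peptide_features(protein, window=35, basic_within=8, min_patch=7):
    """Report the three slide features within the N-terminal window.

    window, basic_within, and min_patch are hypothetical thresholds.
    """
    n_term = protein[:window]
    return {
        "early_basic": any(aa in "KR" for aa in n_term[:basic_within]),
        "hydrophobic_patch": longest_hydrophobic_run(n_term) >= min_patch,
        "axb_sites": [m.start() for m in AXB.finditer(n_term)],
    }


# Toy N-terminal sequence used only to exercise the function.
features = signal_peptide_features("MKKTAIAIAVALAGFATVAQAAPKD")
```

Because AxB is a short, permissive pattern, it matches at many positions; in practice a candidate cleavage site would only be meaningful when it follows the basic residue and the hydrophobic patch.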
15
Profile of an Exported Protein
Early basic residue, hydrophobic patch, AxB motif
16
Future
Rinse and repeat: 30 proteomes in 3 years
Stable, robust pipeline for general use
Hosted at TeraGrid
Species           Novel   New Start
Y. pestis             4           5
B. anthracis          4           6
D. radiodurans      225         117
D. vulgaris          55          89
L. interrogans       20          23
17
When Gene Predictors Fail
Are GC extremes difficult?
GC near 50% (Y. pestis): 4 missed
GC in the 30s (B. anthracis, L. interrogans): 4 and 20 missed
GC in the 60s (D. vulgaris, D. radiodurans): 55 and 225 missed
18
Are They Strange?
Relative GC: does gene prediction fail on genes whose GC content differs from that of the other genes?
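Relative GC can be expressed as the difference between a gene's GC fraction and that of a reference set, such as the whole genome. The sketch below assumes that definition; the slide does not state how relative GC was computed, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return (seq.count("G") + seq.count("C")) / counted


def relative_gc(gene_seq, reference_seq):
    """Gene GC minus reference GC; positive means the gene is GC-richer."""
    return gc_fraction(gene_seq) - gc_fraction(reference_seq)


# Toy example: a GC-rich gene compared against a balanced reference.
delta = relative_gc("GGCCGGCA", "GGCCAATT")
```

Plotting this difference for missed genes against the same quantity for correctly predicted genes would show whether the missed genes are outliers in composition.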
19
Are They All Short?
20
We See What We Know
Proximity to model organism
Yersinia/Bacillus errors: 4 and 4
‘Remote species’ errors: 20, 55, >200
21
We See What We Know
Hypothetical vs. named: compare novel genes to the observed proteome
Hypergeometric test, where the null probability is taken from the observed proteome
Species           Hypothetical   Named   p-value
B. anthracis                 3       1   0.018
L. interrogans              12       8   0.018
D. radiodurans              31       8   10^-10
D. vulgaris                 39      16   10^-14
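A hypergeometric enrichment test of this kind can be sketched as follows. This is an illustration under assumptions: it computes a one-sided upper tail, which the slide does not specify, and the proteome counts in the usage example are placeholders because the slide does not report them.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when drawing n items without replacement from N items,
    of which K are 'successes'.

    In this setting (assumed mapping): N is the number of proteins in the
    observed proteome, K the number of those annotated as hypothetical,
    n the number of novel genes, and k the hypothetical ones among them.
    """
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Placeholder proteome: 2000 observed proteins, 400 of them hypothetical.
# These two numbers are NOT from the slides; only k=3 and n=4 echo the
# B. anthracis row (3 hypothetical out of 4 novel genes).
p_value = hypergeom_upper_tail(k=3, n=4, K=400, N=2000)
```

A small p-value would indicate that novel genes are hypothetical more often than expected from the composition of the observed proteome.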
22
Expressed Protein Resource
Protein sequences: >30 M sequences from nr, UniProt, JCVI metagenomics, JGI genomes
40,000 clusters
Cross-referenced with proteomics, for validated proteins
23
Acknowledgements
Eli Venter; Shih-Ting Huang; Rembert Pieper; Granger Sutton; Dick Smith, PNNL; NSF