1
Improving the Sensitivity of Peptide Identification for Genome Annotation
Nathan Edwards, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center
2
Why Tandem Mass Spectrometry?
MS/MS spectra provide evidence for the amino-acid sequence of functional proteins. Key concepts: spectrum acquisition is unbiased; amino-acid sequence is observed directly; the method is sensitive to small sequence variations.
3
Mass Spectrometry for Proteomics
Measure mass of many (bio)molecules simultaneously: high bandwidth. Mass is an intrinsic property of all (bio)molecules: no prior knowledge required.
4
Mass Spectrometer
Sample, then Ionizer, then Mass Analyzer, then Detector. Ionizer: MALDI, Electro-Spray Ionization (ESI). Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap. Detector: Electron Multiplier (EM).
5
Mass Spectrum
6
Mass is fundamental
7
Mass Spectrometry for Proteomics
Measure mass of many molecules simultaneously ...but not too many: abundance bias. Mass is an intrinsic property of all (bio)molecules ...but we need a reference to compare to.
8
Mass Spectrometry for Proteomics
Mass spectrometry has been around since the turn of the 20th century... ...so why is MS-based proteomics so new? Ionization methods: MALDI, Electrospray. Protein chemistry & automation: chromatography, gels, computers. Protein sequence databases: a reference for comparison.
9
Sample Preparation for MS/MS
Enzymatic Digest and Fractionation
10
Single Stage MS
11
Tandem Mass Spectrometry (MS/MS)
Precursor selection
12
Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision induced dissociation (CID) MS/MS
13
Peptide Fragmentation
Peptide: S-G-F-L-E-E-D-E-L-K

MW     ion   fragments          ion   MW
88     b1    S  GFLEEDELK       y9    1080
145    b2    SG  FLEEDELK       y8    1022
292    b3    SGF  LEEDELK       y7    875
405    b4    SGFL  EEDELK       y6    762
534    b5    SGFLE  EDELK       y5    633
663    b6    SGFLEE  DELK       y4    504
778    b7    SGFLEED  ELK       y3    389
907    b8    SGFLEEDE  LK       y2    260
1020   b9    SGFLEEDEL  K       y1    147
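The b- and y-ion masses listed for this peptide can be reproduced with a short calculation. The following is a minimal sketch, assuming singly charged fragments and standard monoisotopic residue masses; the function name `fragment_ions` and the mass table are illustrative and do not come from the slides. The computed values agree with the slide's integer masses to within rounding (under 1 Da).

```python
# Monoisotopic residue masses (Da) for the amino acids in SGFLEEDELK only.
MONO = {
    "G": 57.02146, "S": 87.03203, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of the charge-carrying proton
WATER = 18.01056   # H2O retained by the C-terminal (y) fragment


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] is the N-terminal fragment of length i,
    y[i-1] is the C-terminal fragment of length i.
    """
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append(sum(MONO[aa] for aa in peptide[:i]) + PROTON)
        y.append(sum(MONO[aa] for aa in peptide[-i:]) + WATER + PROTON)
    return b, y


if __name__ == "__main__":
    b_ions, y_ions = fragment_ions("SGFLEEDELK")
    for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
        print(f"b{i} {b:9.3f}    y{i} {y:9.3f}")
```

Each b ion is the sum of its residue masses plus a proton; each y ion additionally carries one water. Modified residues or higher charge states would need extra terms that this sketch leaves out.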
14
Unannotated Splice Isoform
Human Jurkat leukemia cell-line. Lipid-raft extraction protocol, targeting T cells. von Haller, et al. MCP 2003. LIME1 gene: LCK interacting transmembrane adaptor 1. LCK gene: Leukocyte-specific protein tyrosine kinase; proto-oncogene; chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications.
15
Unannotated Splice Isoform
16
Unannotated Splice Isoform
17
Translation start-site correction
Halobacterium sp. NRC-1: extreme halophilic Archaeon; insoluble membrane and soluble cytoplasmic proteins. Goo, et al. MCP 2003. GdhA1 gene: Glutamate dehydrogenase A1. Multiple significant peptide identifications. Observed start is consistent with Glimmer 3.0 prediction(s).
18
Halobacterium sp. NRC-1 ORF: GdhA1
[Figure: K-score E-value vs. 10% FDR.] Many peptides are inconsistent with the annotated translation start site of NP_279651.
19
Translation start-site correction
20
Phyloproteomics
Tandem mass spectra of proteins (top-down). High-accuracy instrument (Orbitrap, UMD Core). Proteins from unsequenced bacteria matching identical proteins in related organisms. Demonstration using Y. rohdei.
21
Protein Fragmentation Spectrum
Match to Y. pestis 50S ribosomal protein L32: AVQQNKPTRSKRGMRRSHDA LTTATLSVDKTSGETHLRHH ITADGFYRGRKVIG
22
Phyloproteomics
23
Phyloproteomics
[Figure: phylogenetic trees from protein sequence and from 16S-rRNA sequence, built with phylogeny.fr "One-Click".]
24
Shared "Biomarker" Proteins
25
Phyloproteomics
Recent extension to highly homologous proteins in related organisms: merely require the N- and/or C-terminus in common. Broadens applicability considerably. Phyloproteomic trees for E. herbicola and Enterobacter cloacae, neither sequenced. New paradigm for phylogenetic analysis?
26
Lost peptide identifications
Missing from the sequence database. Search engine strengths, weaknesses, quirks. Poor score or statistical significance. Thorough search takes too long.
27
Searching under the street-light…
Tandem mass spectrometry doesn’t discriminate against novel peptides, but protein sequence databases do! Searching traditional protein sequence databases biases the results in favor of well-understood and/or computationally predicted proteins and protein isoforms!
28
Peptide Sequence Databases
All amino-acid 30-mers, no redundancy. From ESTs, proteins, mRNAs. 30-40 fold reduction in size and search time. Formatted as a FASTA sequence database. One entry per gene/cluster.

Organism     Size (AA)   Size (Entries)
Human        248Mb       74,976
Mouse        171Mb       55,887
Rat          76Mb        42,372
Zebrafish    94Mb        40,490
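The idea of "all amino-acid 30-mers, no redundancy" can be illustrated by collecting the distinct k-mers across a set of sequences. This is only a sketch of the redundancy-removal idea, assuming plain in-memory strings; the function name `distinct_kmers` is hypothetical, and the code does not reproduce how the actual databases are compressed or written out as FASTA entries.

```python
def distinct_kmers(sequences, k=30):
    """Return the set of distinct length-k substrings over all sequences.

    Sequences shorter than k contribute nothing.
    """
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


def total_kmers(sequences, k=30):
    """Count k-mers with redundancy, for comparison with the distinct set."""
    return sum(max(len(seq) - k + 1, 0) for seq in sequences)


if __name__ == "__main__":
    # Toy input with k=3 so the overlap between the two sequences is visible.
    seqs = ["ABCDE", "BCDEF"]
    print(total_kmers(seqs, k=3), "k-mers in total,",
          len(distinct_kmers(seqs, k=3)), "distinct")
```

Overlapping sources such as ESTs, mRNAs, and proteins from the same gene share most of their k-mers, which is why keeping each one once can shrink the search space substantially.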
29
We can observe evidence for…
Known coding SNPs; unannotated coding mutations; alternate splicing isoforms; alternate/incorrect translation start-sites; microexons; alternate/incorrect translation frames. …though it must be treated thoughtfully.
30
PeptideMapper Web Service
I’m Feeling Lucky
31
PeptideMapper Web Service
I’m Feeling Lucky
32
PeptideMapper Web Service
I’m Feeling Lucky
33
PeptideMapper Web Service
Suffix-tree index on peptide sequence database: fast peptide to gene/cluster mapping; “compression” makes this feasible. Peptide alignment with cluster evidence: amino-acid or nucleotide; exact & near-exact. Genomic-loci mapping via UCSC “known-gene” transcripts and predetermined, embedded genomic coordinates.
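Fast peptide to gene/cluster mapping can be sketched with a much simpler index than a suffix tree. The example below swaps in a sorted list of suffixes searched with `bisect`, which answers the same exact-match query but uses far more memory; it is an illustration under that assumption, not the PeptideMapper implementation. The names `build_index` and `find_peptide` and the toy entries are hypothetical.

```python
import bisect


def build_index(entries):
    """Build a sorted list of (suffix, entry_name, offset) tuples.

    entries maps an entry name (e.g. a gene/cluster id) to its sequence.
    """
    suffixes = []
    for name, seq in entries.items():
        for i in range(len(seq)):
            suffixes.append((seq[i:], name, i))
    suffixes.sort()
    return suffixes


def find_peptide(index, peptide):
    """Return (entry_name, offset) for every exact occurrence of peptide."""
    # (peptide,) sorts before any (peptide..., name, offset) tuple, so this
    # lands on the first suffix that could start with the peptide.
    pos = bisect.bisect_left(index, (peptide,))
    hits = []
    while pos < len(index) and index[pos][0].startswith(peptide):
        hits.append((index[pos][1], index[pos][2]))
        pos += 1
    return hits


if __name__ == "__main__":
    idx = build_index({"g1": "MSGFLEEDELK", "g2": "AAELKAA"})
    print(find_peptide(idx, "ELK"))
```

All suffixes that begin with the query sit next to each other in sorted order, so one binary search plus a short scan finds every match. A suffix tree gives the same grouping with linear space, which matters at the database sizes quoted earlier.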
34
Comparison of search engine results
No single score is comprehensive. Search engines disagree. Many spectra lack a confident peptide assignment. [Venn diagram of spectra identified by X! Tandem, SEQUEST, and Mascot; region labels: 38%, 14%, 28%, 3%, 2%, 1%.] No single search engine gives the best results. Q: After improvement, what percentage of spectra is identified, and how large is the improvement? 25-30%. Searle et al. JPR 7(1), 2008.
35
Combining search engine results – harder than it looks!
Consensus boosts confidence, but... How to assess statistical significance? Gain specificity, but lose sensitivity! Incorrect identifications are correlated too! How to handle weak identifications? Consensus vs disagreement vs abstention Threshold at some significance? We apply unsupervised machine-learning.... Lots of related work unified in a single framework.
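The three outcomes named above (consensus, disagreement, abstention) can be made concrete with a simple per-spectrum tabulation. This sketch uses plain majority voting, which is not the unsupervised machine-learning approach the slide refers to; it only shows the bookkeeping. The function name `tabulate` and the input layout (engine name mapped to a peptide, or `None` when the engine reports nothing) are assumptions.

```python
from collections import Counter


def tabulate(assignments):
    """Classify one spectrum's results across search engines.

    assignments maps engine name -> peptide string, or None if the engine
    abstained. Returns (label, top_peptide, votes_for_top_peptide).
    """
    votes = Counter(p for p in assignments.values() if p is not None)
    if not votes:
        return ("abstention", None, 0)
    peptide, count = votes.most_common(1)[0]
    label = "consensus" if len(votes) == 1 else "disagreement"
    return (label, peptide, count)


if __name__ == "__main__":
    spectrum = {"X!Tandem": "SGFLEEDELK", "Mascot": "SGFLEEDELK", "OMSSA": None}
    print(tabulate(spectrum))
```

Counting votes this way says nothing about statistical significance, and correlated errors across engines can produce a confident-looking but wrong consensus, which is the difficulty the slide points at.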
36
Supervised Learning
37
Unsupervised Learning
38
Peptide Atlas A8_IP LTQ Dataset
39
Running many search engines
Search engine configuration can be difficult: correct spectral format; search parameter files and command-line; pre-processed sequence databases; tracking spectrum identifiers; extracting peptide identifications, especially modifications and protein identifiers.
40
Peptide Identification Meta-Search
Simple unified search interface for: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch, S-Score, InsPecT, KM-Score. Automatic decoy searches. Automatic spectrum file "chunking". Automatic scheduling: serial, multi-processor, cluster, grid.
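Decoy searches are commonly used to estimate a false discovery rate (FDR) for a score threshold. The sketch below assumes reversed-sequence decoys and the simple decoys-over-targets estimate; the slides do not say which decoy construction or estimator the meta-search uses, and the names `reverse_decoy` and `decoy_fdr` are hypothetical.

```python
def reverse_decoy(sequence):
    """Make a decoy entry by reversing a protein sequence."""
    return sequence[::-1]


def decoy_fdr(psms, threshold):
    """Estimate FDR among matches scoring at or above threshold.

    psms is a list of (score, is_decoy) pairs, higher score being better.
    The estimate is (#decoy hits) / (#target hits); 0.0 if no target hits.
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


if __name__ == "__main__":
    matches = [(10, False), (9, False), (8, True), (7, False), (6, True)]
    for cutoff in (9, 7):
        print(cutoff, decoy_fdr(matches, cutoff))
```

In practice the threshold is chosen as the most permissive score whose estimated FDR stays under a target such as the 10% used earlier in this deck.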
41
PepArML Meta-Search Engine
Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab scheduler & 48+ CPUs. Search engines as labeled on the diagram: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch. Secure communication. Scales easily to simultaneous searches. Single, simple search request.
42
PepArML Meta-Search Engine
Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs), Edwards Lab scheduler & 80+ CPUs. Search engines as labeled on the diagram: X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core); X!Tandem, KScore, OMSSA, MyriMatch. Secure communication. Scales easily to simultaneous searches. Single, simple search request.
43
PepArML Meta-Search Engine
Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab scheduler & 48+ CPUs. Secure communication. Simple search request.
44
PepArML Meta-Search Engine
Heterogeneous compute resources: NSF TeraGrid (1000+ CPUs), UMIACS (250+ CPUs), Edwards Lab scheduler & 48+ CPUs. Secure communication. Simple search request.
45
Peptide Identification Grid-Enabled Meta-Search
Access to high-performance computing resources for the proteomics community: NSF TeraGrid Community Portal; university/institute HPC clusters; individual lab compute resources. Contribute cycles to the community and get access to others’ cycles in return. Centralized scheduler. Compute capacity can still be exclusive, or prioritized. Compute client plays well with HPC grid schedulers.
46
Conclusions
Improve the scope and sensitivity of peptide identification for genome annotation, using: exhaustive peptide sequence databases; machine-learning for combining search engine results; meta-search tools to maximize consensus; grid-computing for thorough search.
47
Acknowledgements
Dr. Catherine Fenselau & students, University of Maryland Biochemistry. Dr. Yan Wang, University of Maryland Proteomics Core. Dr. Art Delcher, University of Maryland CBCB. Dr. Chau-Wen Tseng & Dr. Xue Wu, University of Maryland Computer Science. Funding: NIH/NCI.