
1 Improving the Sensitivity of Peptide Identification
Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center
Xue Wu, Chau-Wen Tseng
Department of Computer Science, University of Maryland, College Park

2 Lost peptide identifications
- Missing from the sequence database
- Search engine strengths, weaknesses, quirks
- Poor score or statistical significance
- Thorough search takes too long

3 Lost peptide identifications
- Missing from the sequence database
  - Build exhaustive peptide sequence databases
- Search engine strengths, weaknesses, quirks
  - Use multiple search engines and combine results
- Poor score or statistical significance
  - Use spectral-matching to identify weak spectra
  - Use search-engine consensus to boost confidence
  - Use machine-learning to distinguish true from false
- Thorough search takes too long
  - Harness the power of heterogeneous computational grids

4 Peptide Sequence Databases
All peptides at most 30 amino-acids long from:
- IPI and all IPI constituent protein sequences
  - IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank
- SwissProt variants, conflicts, splices, and signal peptide truncations.
- GenBank and RefSeq mRNA sequence
  - 3 frame translation
- GenBank EST and HTC sequences
  - 6 frame translation and found in at least 2 sequences
Grouped by UniGene cluster and compressed.
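Note: the EST step above (6 frame translation, keep peptides found in at least 2 sequences) can be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline. The function names, the choice to split translations at stop codons, and the reading of "at most 30 amino-acids" as sliding windows of length `max_len` are all assumptions.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna):
    """Yield stop-free stretches from all six reading frames."""
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            for orf in translate(strand[offset:]).split("*"):
                if orf:
                    yield orf


def windows(protein, max_len):
    """Yield the whole stretch if short, else every window of length max_len."""
    if len(protein) <= max_len:
        yield protein
        return
    for i in range(len(protein) - max_len + 1):
        yield protein[i:i + max_len]


def supported_peptides(sequences, max_len=30, min_support=2):
    """Keep peptides observed in at least min_support distinct input sequences."""
    support = defaultdict(set)
    for seq_id, dna in sequences.items():
        for orf in six_frame_orfs(dna.upper()):
            for pep in windows(orf, max_len):
                support[pep].add(seq_id)
    return {pep for pep, ids in support.items() if len(ids) >= min_support}


if __name__ == "__main__":
    # Toy ESTs (hypothetical); a tiny max_len keeps the demo readable.
    ests = {"est1": "ATGGCCAAA", "est2": "ATGGCCAAAGGG"}
    print(sorted(supported_peptides(ests, max_len=3)))
```

Requiring support from two sequences is a simple guard against sequencing errors in single ESTs, at the cost of dropping peptides seen only once.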

5 Peptide Sequence Databases
- Formatted as a FASTA sequence database
- Easy integration with search engines.
- One entry per gene/cluster.
- Automated rebuild every few months.

Organism    Size (AA)  Size (Entries)
Human       209Mb      75,043
Mouse       151Mb      55,929
Rat         67Mb       43,211
Zebra-fish  90Mb       47,922

6 Spectral Matching with HMMs

7 [Figure: state diagram with states I0 to I6 and b1, y1, b2, y2, b3, y3, annotated with percentages]

8 Hidden Markov Model
- States: Ion, Delete, Insert
- (m/z, intensity) pair emitted by ion and insert states

9 Boosting Identification Sensitivity
[Figure: diagram labeled Test, Train, Other (high confidence ids only) and Other, Model, None (low confidence ids)]

10 Spectral Matching of Peptide Variants
- DFLAGGVAAAISK
- DFLAGGIAAAISK
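Note: the two variants differ by a single V to I substitution at position 7, so only fragment ions containing that residue should move (by about 14.016 Da). The sketch below computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses and lists which ions shift. It only illustrates why a spectrum of one variant is informative about the other; it is not the HMM-based matching described in the talk, and the function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues in the two peptides.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "I": 113.08406, "D": 115.02694, "K": 128.09496,
    "F": 147.06841,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return singly charged m/z lists for b1..b(n-1) and y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:k]) + PROTON for k in range(1, n)]
    y = [sum(masses[n - k:]) + WATER + PROTON for k in range(1, n)]
    return b, y


def shifted_ions(pep_a, pep_b, tol=0.001):
    """Return the ion numbers of b and y ions whose m/z differs between peptides."""
    b_a, y_a = fragment_ions(pep_a)
    b_b, y_b = fragment_ions(pep_b)
    b_shift = [k + 1 for k, (m1, m2) in enumerate(zip(b_a, b_b)) if abs(m1 - m2) > tol]
    y_shift = [k + 1 for k, (m1, m2) in enumerate(zip(y_a, y_b)) if abs(m1 - m2) > tol]
    return b_shift, y_shift


if __name__ == "__main__":
    b_shift, y_shift = shifted_ions("DFLAGGVAAAISK", "DFLAGGIAAAISK")
    print("shifted b ions:", b_shift)
    print("shifted y ions:", y_shift)
```

For these two peptides, b1 to b6 and y1 to y6 are unchanged, while b7 to b12 and y7 to y12 shift.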

11 Spectral Matching Extrapolation

12 Comparison of search engine results
- No single score is comprehensive
- Search engines disagree
- Many spectra lack confident peptide assignment
[Figure: overlap of identifications from X! Tandem, SEQUEST, and Mascot; region labels as extracted: 38%, 14%, 28%, 14%, 3%, 2%, 1%. Searle et al., JPR 7(1), 2008]

13 Combining search engine results – harder than it looks!
- Consensus boosts confidence, but...
  - How to assess statistical significance?
  - Gain specificity, but lose sensitivity!
  - Incorrect identifications are correlated too!
- How to handle weak identifications?
  - Consensus vs disagreement vs abstention
  - Threshold at some significance?
- We apply unsupervised machine-learning...
  - Lots of related work unified in a single framework.
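Note: the "consensus vs disagreement vs abstention" distinction can be made concrete with a small sketch. It assumes each engine reports at most one peptide per spectrum, or `None` when it has no identification above its own threshold. This is a naive labelling rule for illustration only; the talk's approach is unsupervised machine-learning, not this rule, and the label names (including the extra "single" case) are hypothetical.

```python
from collections import Counter


def classify_spectrum(assignments):
    """Label one spectrum from a dict of engine name -> peptide (or None).

    Returns (label, peptide), where peptide is set only when the reporting
    engines all name the same sequence.
    """
    peptides = [p for p in assignments.values() if p is not None]
    if not peptides:
        return "abstention", None          # no engine made a call
    counts = Counter(peptides)
    if len(counts) > 1:
        return "disagreement", None        # engines named different peptides
    if len(peptides) == 1:
        return "single", peptides[0]       # only one engine made a call
    return "consensus", peptides[0]        # two or more engines agree


if __name__ == "__main__":
    # Hypothetical per-spectrum results from three engines.
    spectra = {
        "scan1": {"Mascot": "DFLAGGVAAAISK", "X!Tandem": "DFLAGGVAAAISK", "OMSSA": None},
        "scan2": {"Mascot": "DFLAGGVAAAISK", "X!Tandem": "DFLAGGIAAAISK", "OMSSA": None},
        "scan3": {"Mascot": None, "X!Tandem": None, "OMSSA": None},
    }
    for scan, calls in spectra.items():
        print(scan, classify_spectrum(calls))
```

A rule like this gains specificity but loses sensitivity, since every spectrum without agreement is discarded, which is the trade-off the slide points out.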

14 Supervised Learning

15 Unsupervised Learning

16 PepArML Combining Results
[Figure: panels labeled Q-TOF, LTQ, and MALDI]

17 Unsupervised Learning
[Figure: plot with series H, C-TMO, U-TMO, and U*-TMO; axis labels False Positive Rate and Iteration]

18 Searching for Consensus
Search engine quirks can destroy consensus:
- Initial methionine loss as tryptic peptide
- Charge state enumeration or guessing
- X!Tandem's refinement mode
- Pyro-Gln, Pyro-Glu modifications
- Difficulty tracking spectrum identifiers
- Precursor mass tolerance (Da vs ppm)
- Decoy searches must be identical!
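Note: the Da vs ppm item is easy to get wrong when the same search is configured for several engines, because a ppm window scales with precursor mass while a Da window does not. A minimal sketch of the conversion, with hypothetical helper names:

```python
def ppm_to_da(mass, ppm):
    """Absolute tolerance in Da equivalent to a ppm tolerance at a given mass."""
    return mass * ppm / 1e6


def da_to_ppm(mass, da):
    """Relative tolerance in ppm equivalent to a Da tolerance at a given mass."""
    return da / mass * 1e6


def within_tolerance(observed, theoretical, tol, unit):
    """Check a precursor match under either unit ('da' or 'ppm')."""
    if unit == "ppm":
        tol = ppm_to_da(theoretical, tol)
    elif unit != "da":
        raise ValueError("unit must be 'da' or 'ppm'")
    return abs(observed - theoretical) <= tol


if __name__ == "__main__":
    # The same 10 ppm setting is a different absolute window at different masses.
    for mass in (800.0, 1500.0, 3000.0):
        print(mass, "Da:", ppm_to_da(mass, 10.0), "Da window for 10 ppm")
```

Two engines given "equivalent" tolerances in different units will therefore consider different candidate peptides for some spectra, which is one way consensus is lost.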

19 Configuring for Consensus
Search engine configuration can be difficult:
- Correct spectral format
- Search parameter files and command-line
- Pre-processed sequence databases
- Tracking spectrum identifiers
- Extracting peptide identifications, especially modifications and protein identifiers

20 Peptide Identification Meta-Search
Parameters:
- Instrument: Precursor Tolerance, Fragment Tolerance, Max. Charge
- Sequence Database: Target/Decoy
- Modification: Fixed/Variable, Amino-Acids, Position, Delta
- Proteolytic Agent: Motif
- Peptide Candidates: Termini Specificity, Precursor Tolerance, Missed cleavages, Charge State Handling, # 13C Peaks
- Search Engines: Mascot, X!Tandem, OMSSA, MyriMatch

21 Peptide Identification Meta-Search
- Simple unified search interface for: Mascot, X!Tandem, OMSSA, MyriMatch
- Automatic decoy searches
- Automatic spectrum file "chunking"
- Automatic scheduling
  - Serial, Multi-Processor, Cluster, Grid
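Note: spectrum file "chunking" splits one large spectrum file into pieces that can be searched in parallel. The talk does not say which file format or chunk size is used; the sketch below assumes MGF text, where each spectrum sits between `BEGIN IONS` and `END IONS` lines, and uses a hypothetical `chunk_mgf` helper.

```python
def chunk_mgf(lines, spectra_per_chunk):
    """Yield lists of lines, each holding at most spectra_per_chunk whole spectra.

    Lines outside BEGIN IONS / END IONS blocks (file-level headers) are dropped
    in this sketch. Spectra are never split across chunks, and TITLE lines are
    kept so spectrum identifiers can still be tracked after the search.
    """
    chunk, count, in_spectrum = [], 0, False
    for line in lines:
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            in_spectrum = True
        if in_spectrum:
            chunk.append(line)
        if stripped == "END IONS":
            in_spectrum = False
            count += 1
            if count == spectra_per_chunk:
                yield chunk
                chunk, count = [], 0
    if chunk:
        yield chunk


if __name__ == "__main__":
    # Five toy spectra with made-up peaks.
    mgf = []
    for i in range(5):
        mgf += ["BEGIN IONS", f"TITLE=scan{i}", "PEPMASS=500.25",
                "100.1 10", "200.2 20", "END IONS"]
    for n, piece in enumerate(chunk_mgf(mgf, spectra_per_chunk=2)):
        titles = [l for l in piece if l.startswith("TITLE=")]
        print("chunk", n, titles)
```

Each chunk can then be submitted as an independent search job, and the results merged by spectrum title.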

22 Peptide Identification Meta-Search
- NSF TeraGrid: 1000+ CPUs
- UMIACS: 250+ CPUs
- Edwards Lab: Scheduler & 48+ CPUs
- Secure communication
- Heterogeneous compute resources
- Simple search request

23 Conclusions
Improve sensitivity of peptide identification:
- Exhaustive peptide sequence databases
- Machine-learning for matching and combining
- Meta-search tools maximize consensus
- Grid-computing to achieve thorough search

24 Acknowledgements
- Catherine Fenselau, University of Maryland Biochemistry
- Funding: NIH/NCI, USDA/ARS

