Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification. Nathan Edwards, Center for Bioinformatics and Computational Biology
2 Mass Spectrometer: Sample → Ionizer (MALDI or Electro-Spray Ionization (ESI)) → Mass Analyzer (Time-Of-Flight (TOF), Quadrupole, or Ion-Trap) → Detector (Electron Multiplier (EM))
3 Mass Spectrometer (MALDI-TOF) [Figure: instrument schematic. Labels: analyte/matrix on a grounded backing plate; UV laser (337 nm); pulse voltage; source region, length = s; extraction grid at source voltage -V_s; field-free drift zone (E = 0), length = D; detector grid at -V_s; microchannel plate detector.]
4 Mass is fundamental
5 Sample Preparation for MS/MS Enzymatic Digest and Fractionation
6 Single Stage MS MS
7 Tandem Mass Spectrometry (MS/MS) Precursor selection
8 Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) MS/MS
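9 Peptide Fragmentation [Figure: peptide backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH-. Cleaving the peptide bond after residue i gives fragments b_i and y_{n-i}; cleaving after residue i+1 gives b_{i+1} and y_{n-i-1}.]
10 Peptide Fragmentation [Figure: MS/MS spectrum, % intensity vs. m/z, annotated with the residues K L E D E E L F G S.]
11 Peptide Fragmentation. b ions (m/z): K 1166, L 1020, E 907, D 778, E 663, E 534, L 405, F 292, G 145, S 88. [Figure: MS/MS spectrum, % intensity vs. m/z, with peaks labeled y2 through y9 and b3 through b9.]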
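The fragment ladder on this slide can be reproduced with a few lines of arithmetic. The sketch below is an illustration, not the speaker's code: it assumes standard monoisotopic residue masses, singly charged ions, and that the slide's ladder is read starting from S, i.e. the peptide is SGFLEEDELK. Under those assumptions the b-ion values truncate to 88, 145, 292, ..., 1020, and the final value 1166 is consistent with the singly protonated intact peptide. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of the charge-carrying proton
WATER = 18.01056   # H2O carried by y ions and by the intact peptide


def fragment_ions(peptide):
    """Return (b, y): singly charged m/z lists, where b[i-1] is b_i and y[i-1] is y_i."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b, y = [], []
    total = 0.0
    for m in masses[:-1]:            # N-terminal prefixes
        total += m
        b.append(total + PROTON)
    total = 0.0
    for m in reversed(masses[1:]):   # C-terminal suffixes
        total += m
        y.append(total + WATER + PROTON)
    return b, y


def precursor_mh(peptide):
    """m/z of the singly protonated intact peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


if __name__ == "__main__":
    b, y = fragment_ions("SGFLEEDELK")
    for i, mz in enumerate(b, start=1):
        print(f"b{i}: {mz:.2f}")
    for i, mz in enumerate(y, start=1):
        print(f"y{i}: {mz:.2f}")
    print(f"MH+: {precursor_mh('SGFLEEDELK'):.2f}")
```

A useful sanity check is that complementary fragments b_i and y_{n-i} always sum to the precursor m/z plus one proton mass.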
12 MS/MS Search Engines: Fail when peptides are missing from the sequence database. Protein sequence databases serve many masters: full-length protein sequences are not needed for MS/MS; explicit variant enumeration is needed for MS/MS. Much peptide sequence information is lost, inaccessible, or not integrated: protein isoforms, sequence variants, SNPs, alternate splice forms, ESTs. Some peptides are more interesting than others: protein identification is only part of the story.
13 Human Sequences. Number of human genes is believed to be between 20,000 and 25,000. PIR ~ 10,500; SwissProt ~ 12,000; RefSeq ~ 28,000; IPI-HUMAN ~ 48,000; TrEMBL ~ 52,000; MSDB ~ 105,000.
14 DNA to Protein Sequence Derived from
15 UCSC Genome Browser
16 Genomic Peptide Sequences. Many putative peptide sequences never become “protein” sequences: genomic DNA, RefSeq mRNA, ESTs; SNP/polymorphism databases; variant records in SwissProt. Genomic annotation seeks “full length” genes and proteins.
17 Genomic Peptide Sequences. Genomic DNA: exons & introns, 6 frames, large (3Gb → 6Gb). RefSeq mRNA: no introns, 3 frames, small (36Mb → 36Mb); most protein sequences already represented in sequence databases. ESTs: no introns, 6 frames, large (3Gb → 6Gb); used by gene & alternative splicing pipelines; highly redundant, nucleotide error rate ~ 1%.
18 “Novel” Peptide
19 Novel peptide
20 EST Peptides. 6-frame translation. Ambiguous base enumeration (up to a point). Break at non-amino-acid symbols (stop codons and X). Discard AA sequences < 50 AA long. Result: ~ 3 Gb. Not as simple as it sounds!
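The steps on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: it uses the standard genetic code, expects uppercase DNA, maps any codon containing a non-ACGT base to X (ambiguous-base enumeration is left out here), and the function names and the `min_len` parameter are hypothetical.

```python
import re

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; codons with non-ACGT bases become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_peptides(dna, min_len=50):
    """Translate all six frames, break at stops and X, drop short pieces."""
    peptides = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            protein = translate(strand[offset:])
            for piece in re.split(r"[*X]+", protein):
                if len(piece) >= min_len:
                    peptides.append(piece)
    return peptides


if __name__ == "__main__":
    # Toy input with a low length cutoff so that something survives the filter.
    print(six_frame_peptides("GCT" * 5 + "TAA" + "GCT" * 3, min_len=3))
```

Calling `six_frame_peptides(est_sequence)` with the default `min_len=50` applies the slide's length cutoff.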
21 EST Peptides Lots of ambiguous bases >gi|272208|gb|M | TGCACAACCAAGTTTTGTGACTACGGGAAGGCT CCCGGGGCAGAGGAGTACGCTCAACAAGATGTG TTAAAGAAATCTTACTCCAAGGCCTTCACGCTG ACCATCTCTGCCCTCTTTGTGACACCCAAGACG ACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAA CTNCAGTTGTNGCCGAGTGATGTGGACAAGCTG TCACCCACTGACA
22 Codon Table
23 EST Peptides Frame 1 translation CTTKFCDYGKA PGAEEYAQQDV LKKSYSKAFTL TISALFVTPKT TGA[QPRL]VELSEQQ LQL[S*LW]PSDVDKL SPTD[IKMNSRT]
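The bracketed positions in the translation above come from codons that contain an ambiguous base. The sketch below shows one way to produce them, assuming the standard genetic code and IUPAC nucleotide codes. Two details are assumptions made to mirror the slide: a trailing partial codon is padded with N, and a hypothetical `max_options` cutoff stands in for "up to a point". The order of residues inside a bracket may differ from the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def codon_residues(codon):
    """Distinct residues (or '*') encoded by a possibly ambiguous codon."""
    codon = codon.ljust(3, "N")  # assumption: pad a trailing partial codon
    residues = []
    for bases in product(*(IUPAC[b] for b in codon)):
        aa = CODON_TABLE["".join(bases)]
        if aa not in residues:
            residues.append(aa)
    return residues


def translate_ambiguous(dna, max_options=8):
    """Translate frame 1, bracketing positions with several possible residues."""
    out = []
    for i in range(0, len(dna), 3):
        options = codon_residues(dna[i:i + 3])
        if len(options) == 1:
            out.append(options[0])
        elif len(options) <= max_options:
            out.append("[" + "".join(options) + "]")
        else:
            out.append("X")  # too ambiguous to enumerate
    return "".join(out)


if __name__ == "__main__":
    print(translate_ambiguous("ACTGGGGCCCNGGTGGAG"))
```

Each bracket multiplies the number of concrete peptides, which is why a cutoff on enumeration is needed in practice.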
24 Correcting EST Sequence. Align ESTs to genome. Use aligned genomic sequence. Must get splice sites right! 6-frame translation. Break at non-amino-acid symbols (stop codons and X). Discard AA sequences < 50 AA long. Result: ~ 1 Gb.
25 Genomic Coding Sequence. Use Genscan to predict exons. Use very low probability threshold. Alternative exons option. No need for translation (35Mb).
26 Exon “Pair” Enumeration [Figure labels: gene model, exon 4 w/ SNP (*); 30 AA; exon pairs & paths; 3-frame translation; C3 compression; peptide sequence.]
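One possible reading of exon-pair enumeration is sketched below. It is an assumption-laden illustration, not the method from the talk: exons are given in genomic order as nucleotide strings, every ordered pair is joined directly, and each junction keeps just enough flanking sequence (3 × 30 − 1 = 89 nt per side) to contain every 30-residue peptide that spans the junction. Exons shorter than the flank would really need paths through three or more exons, which this sketch ignores. The function name is hypothetical.

```python
from itertools import combinations


def exon_pair_junctions(exons, max_aa=30):
    """Return (i, j, sequence) for each exon pair i < j, trimmed around the junction."""
    # A max_aa-residue peptide covers 3*max_aa nt and must touch both exons,
    # so at most 3*max_aa - 1 nt on either side of the junction are relevant.
    flank = 3 * max_aa - 1
    junctions = []
    for (i, left), (j, right) in combinations(enumerate(exons), 2):
        junctions.append((i, j, left[-flank:] + right[:flank]))
    return junctions


if __name__ == "__main__":
    toy_exons = ["AAACCC", "GGGTTT", "ACGTACGT"]  # toy data, not real exons
    for i, j, seq in exon_pair_junctions(toy_exons, max_aa=2):
        print(i, j, seq)
```

Each junction sequence would then be translated in three frames and passed to compression along with the other peptide sources.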
27 Peptide Candidates. Parent ion: typically < 3000 Da. Tryptic peptides: cut at K or R. Search engines don’t handle charge states above 4+ well. Long peptides don’t fragment well. The number of distinct 30-mers upper-bounds the total peptide content.
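The candidate rules above can be sketched as an in-silico digest. The code below is a simplified illustration: it cuts after every K or R exactly as the slide states (the common "not before proline" exception is deliberately left out), uses standard monoisotopic residue masses, and treats the 3000 Da limit as a bound on the neutral peptide mass, which is an assumption. The function names and the `missed_cleavages` parameter are hypothetical.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein, missed_cleavages=0):
    """Cut after every K or R; optionally join adjacent pieces."""
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+", protein)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def neutral_mass(peptide):
    """Monoisotopic mass of the uncharged peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(protein, max_mass=3000.0, missed_cleavages=0):
    """Tryptic peptides light enough to be plausible parent ions."""
    return [
        p for p in tryptic_peptides(protein, missed_cleavages)
        if neutral_mass(p) < max_mass
    ]


if __name__ == "__main__":
    print(candidates("SGFLEEDELKAAR", missed_cleavages=1))
```

With an average residue mass a little over 100 Da, a 3000 Da limit corresponds to peptides of roughly 25 to 30 residues, which is why 30-mers are a natural unit for the rest of the talk.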
28 Sequence Database Compression. Construct a sequence database that is: Complete (all 30-mers are present); Correct (no other 30-mers are present); Compact (no 30-mer is present more than once).
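The three properties can be checked mechanically by comparing k-mer sets. The sketch below is illustrative only; the function names are hypothetical, and the toy example uses k = 4 with the short example sequences from the graph slides rather than real 30-mers.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every k-mer occurrence across a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def check_c3(original, compressed, k):
    """Return (complete, correct, compact) for a compressed database."""
    wanted = set(kmer_counts(original, k))
    got = kmer_counts(compressed, k)
    complete = wanted <= set(got)                    # every original k-mer is present
    correct = set(got) <= wanted                     # no new k-mers were introduced
    compact = all(c == 1 for c in got.values())      # no k-mer is repeated
    return complete, correct, compact


if __name__ == "__main__":
    original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    compressed = ["ACDEFGEFGI", "DEFACG"]
    print(check_c3(original, compressed, k=4))
    print(sum(map(len, original)), "->", sum(map(len, compressed)), "residues")
```

Checking the original database against itself fails only the compact test, which is exactly the redundancy that compression removes.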
29 SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI
30 Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI
31 Sequence Databases & CSBH-graphs Sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI
32 Sequence Databases & CSBH-graphs All k-mers represented by an edge have the same count
33 Sequence Databases & CSBH-graphs. Complete: all edges are on some path. Correct: output path sequence only. Compact: no edge is used more than once. A C3 path set uses all edges exactly once.
34 Sequence Databases & CSBH-graphs Use each edge exactly once ACDEFGEFGI, DEFACG
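A path set that uses each edge exactly once can be built with a simple greedy walk over the k-mer graph. The sketch below is an assumption-based illustration, not the algorithm from the talk: nodes are (k-1)-mers, edges are distinct k-mers, walks start at nodes with more outgoing than incoming edges, and any leftover cycles are emitted as separate paths. It does not try to minimize the number of paths or the total length, and it works on the uncompressed graph rather than the CSBH-graph. The function names are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def compress(seqs, k):
    """Return sequences whose k-mers are exactly the distinct input k-mers, each once."""
    edges = sorted({km for s in seqs for km in kmers(s, k)})
    out_edges = defaultdict(list)   # (k-1)-mer -> list of successor (k-1)-mers
    in_degree = defaultdict(int)
    for km in edges:
        out_edges[km[:-1]].append(km[1:])
        in_degree[km[1:]] += 1

    def walk(node):
        # Follow unused edges until the current node has none left.
        path = node
        while out_edges[node]:
            node = out_edges[node].pop(0)
            path += node[-1]
        return path

    paths = []
    # Nodes with surplus outgoing edges must start that many paths.
    excess = {n: len(out_edges[n]) - in_degree[n] for n in list(out_edges)}
    for node in sorted(excess):
        for _ in range(max(excess[node], 0)):
            if out_edges[node]:
                paths.append(walk(node))
    # Whatever remains consists of cycles; emit each as its own path.
    for node in sorted(out_edges):
        while out_edges[node]:
            paths.append(walk(node))
    return paths


if __name__ == "__main__":
    print(compress(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4))
```

For the slide's example with k = 4, this sketch produces two paths with 16 residues in total, the same total as the path set ACDEFGEFGI, DEFACG shown above, although the individual paths differ.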
35 Sequence Databases & CSBH-graphs All k-mers that occur at least twice ACDEFGI
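The "at least twice" filter amounts to counting k-mer occurrences and keeping the repeated ones. A minimal sketch, assuming occurrences are counted across all input sequences and using k = 4 on the slide's toy sequences (the function name and `min_count` parameter are hypothetical):

```python
from collections import Counter


def repeated_kmers(seqs, k, min_count=2):
    """Return the set of k-mers that occur at least min_count times overall."""
    counts = Counter(
        s[i:i + k] for s in seqs for i in range(len(s) - k + 1)
    )
    return {km for km, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    print(sorted(repeated_kmers(seqs, k=4)))
```

For the toy sequences the surviving 4-mers are exactly those of ACDEFGI, matching the slide. For error-prone sources such as ESTs, requiring a second observation is a cheap way to discard k-mers created by isolated sequencing errors.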
36 Relative Search Time [Figure: relative search time for the databases IPI-H, SP, SP-VS, UP, UP-VS.]
37 More Sensitive Peptide ID. Significances, p-values, Expect values: normalize for number of trials. Blast: size of sequence database. Mascot etc.: number of peptides scored against each spectrum. Redundant peptide sequences artificially increase the number of trials. Trials are not independent! Less redundancy results in a better significance estimate.
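The effect of redundancy on significance can be illustrated with the simplest trial correction, E = p × N, where N is the number of peptides scored against a spectrum. This is a deliberately simplified sketch: real search engines estimate significance in their own ways, and the peptide list and p-value below are made up for illustration.

```python
def expect_value(p_value, n_trials):
    """Expected number of matches at least this good arising by chance."""
    return p_value * n_trials


if __name__ == "__main__":
    # Hypothetical candidate list in which the same peptide appears repeatedly.
    scored = ["SGFLEEDELK", "SGFLEEDELK", "AAR", "SGFLEEDELK"]
    p = 0.001  # hypothetical per-comparison p-value of the best match
    print("redundant:", expect_value(p, len(scored)))
    print("distinct: ", expect_value(p, len(set(scored))))
```

Scoring the same peptide several times adds no new hypotheses, so counting it as several trials inflates the Expect value without any real loss of specificity.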
38 More Sensitive Peptide ID
39 Human Peptide Sequences. EST enumeration (30-mers must occur at least twice); EST corrections; Genscan exons. Uncompressed size: ~ 4.5Gb. Compressed size: ~ 263Mb.
40 Infrastructure. X!Tandem open source search engine, configured to search the aggressive peptide enumeration (human). Web interface for browsing results. Integrated with Condor. Results stored in MySQL database. Over 3 million publicly available MS/MS spectra from human samples.
41 “Novel” Peptide
42 “Novel” Peptide
43 “Novel” Peptide
44 Ongoing work. Integrate SNPs and exon pairs. Get (lots) more spectra! Solve the reverse mapping problem: Where did this peptide come from? What protein does this peptide represent?
45 Thanks Informatics ABI & Celera Ross Lippert, Clark Mobarry, Bjarni Halldorsson University of Maryland, CP V.S. Subrahmanian, Fritz McCall, Doan Pham Fenselau UM, CP University of Maryland, CP Chau-Wen Tseng, Xue Wu