Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification Nathan Edwards Center for Bioinformatics and Computational Biology.


1 Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification Nathan Edwards Center for Bioinformatics and Computational Biology

2 Mass Spectrometer: Sample → Ionizer → Mass Analyzer → Detector. Ionizers: MALDI, Electro-Spray Ionization (ESI). Mass analyzers: Time-Of-Flight (TOF), Quadrupole, Ion-Trap. Detector: Electron Multiplier (EM).

3 Mass Spectrometer (MALDI-TOF): [Diagram labels] Source (length = s); backing plate (grounded); analyte/matrix; UV (337 nm); pulse voltage; extraction grid (source voltage -V_s); field-free drift zone (length = D, E_d = 0); detector grid (-V_s); microchannel plate detector.

4 Mass is fundamental

5 Sample Preparation for MS/MS: Enzymatic Digest and Fractionation

6 Single Stage MS [Figure: MS]

7 Tandem Mass Spectrometry (MS/MS): Precursor selection

8 Tandem Mass Spectrometry (MS/MS): Precursor selection + collision induced dissociation (CID) [Figure: MS/MS]

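9 Peptide Fragmentation: [Diagram] Peptide backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH-. Breaking the bond after residue i gives fragment ions b_i and y_{n-i}; breaking the bond after residue i+1 gives b_{i+1} and y_{n-i-1}.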

10 Peptide Fragmentation: [Figure] MS/MS spectrum, % Intensity (0 to 100) vs. m/z (250 to 1000), annotated with the residues K L E D E E L F G S.

11 Peptide Fragmentation: [Figure] The same spectrum (% Intensity vs. m/z) with peaks b3 to b9 and y2 to y9 labeled, plus the ion ladder:
Residue  K     L     E     D     E     E     L     F     G     S
b ions   1166  1020  907   778   663   534   405   292   145   88
y ions   147   260   389   504   633   762   875   1022  1080  1166
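The ion ladder on slide 11 can be approximately reproduced from residue masses. The sketch below is an illustration, not the presenter's code: it assumes singly charged b and y ions, standard monoisotopic residue masses, and that the peptide reads SGFLEEDELK from N- to C-terminus (inferred from b1 = 88 sitting under S and y1 = 147 under K). The names `RESIDUE_MASS` and `fragment_ions` are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values; only the residues used here).
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.007276   # mass of the charge-carrying proton
WATER = 18.010565   # y ions keep the C-terminal OH plus an extra H


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for i = 1 .. n-1."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # first i residues
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # last i residues
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:8.2f}    y{i} = {y:8.2f}")
```

The computed values agree with the slide's ladder to within about 1 Da; the slide appears to use integer (nominal) masses, so small differences at the larger fragments are expected.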

12 MS/MS Search Engines: Fail when peptides are missing from the sequence database. Protein sequence databases serve many masters: full length protein sequences are not needed for MS/MS; explicit variant enumeration is needed for MS/MS. Much peptide sequence information is lost, inaccessible, or not integrated: protein isoforms, sequence variants, SNPs, alternate splice forms, ESTs. Some peptides are more interesting than others: protein identification is only part of the story.

13 Human Sequences: Number of human genes is believed to be between 20,000 and 25,000. Human sequences per database: PIR ~10,500; SwissProt ~12,000; RefSeq ~28,000; IPI-HUMAN ~48,000; TrEMBL ~52,000; MSDB ~105,000.

14 DNA to Protein Sequence. Derived from http://online.itp.ucsb.edu/online/infobio01/burge

15 UCSC Genome Browser

16 Genomic Peptide Sequences: Many putative peptide sequences never become “protein” sequences. Sources: genomic DNA, RefSeq mRNA, ESTs; SNP/polymorphism databases; variant records in SwissProt. Genomic annotation seeks “full length” genes and proteins.

17 Genomic Peptide Sequences. Genomic DNA: exons & introns, 6 frames, large (3Gb → 6Gb). RefSeq mRNA: no introns, 3 frames, small (36Mb → 36Mb); most protein sequences already represented in sequence databases. ESTs: no introns, 6 frames, large (3Gb → 6Gb); used by gene & alternative splicing pipelines; highly redundant, nucleotide error rate ~1%.

18 “Novel” Peptide

19 Novel peptide

20 EST Peptides: 6 frame translation. Ambiguous base enumeration (up to a point). Break at non-amino-acids: stop codons + X. Discard AA sequences < 50 AA long. Result: ~3 Gb. Not as simple as it sounds!
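The steps on slide 20 (6 frame translation, break at stop codons and X, discard short pieces) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the talk: it uses the standard genetic code, maps any codon containing a non-ACGT base to X instead of enumerating ambiguous bases, and the function names are hypothetical.

```python
import re

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    # Codons with anything other than A/C/G/T (e.g. N) become X in this sketch.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_fragments(dna, min_len=50):
    """Translate all 6 frames, break at '*' and 'X', keep pieces >= min_len."""
    fragments = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            protein = translate(strand[offset:])
            for piece in re.split(r"[*X]+", protein):
                if len(piece) >= min_len:
                    fragments.append(piece)
    return fragments


# Toy input with a lowered threshold so that something survives the filter.
print(six_frame_fragments("GCTGCTGCTTAAGCTGCT", min_len=3))
```

With the slide's threshold of 50 AA the toy input yields nothing, which is the point of the filter: short open stretches are discarded.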

21 EST Peptides: Lots of ambiguous bases
>gi|272208|gb|M61958.1|
TGCACAACCAAGTTTTGTGACTACGGGAAGGCT
CCCGGGGCAGAGGAGTACGCTCAACAAGATGTG
TTAAAGAAATCTTACTCCAAGGCCTTCACGCTG
ACCATCTCTGCCCTCTTTGTGACACCCAAGACG
ACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAA
CTNCAGTTGTNGCCGAGTGATGTGGACAAGCTG
TCACCCACTGACA

22 Codon Table

23 EST Peptides: Frame 1 translation
CTTKFCDYGKA
PGAEEYAQQDV
LKKSYSKAFTL
TISALFVTPKT
TGA[QPRL]VELSEQQ
LQL[S*LW]PSDVDKL
SPTD[IKMNSRT]
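The bracketed positions in the frame 1 translation come from expanding each ambiguous base into every nucleotide it could be. A minimal sketch of that enumeration, assuming the standard genetic code and handling only the ambiguity code N (other IUPAC codes are omitted); `codon_amino_acids` and `translate_ambiguous` are hypothetical names:

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Only N (any base) is expanded in this sketch.
EXPANSIONS = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}


def codon_amino_acids(codon):
    """Set of amino acids (or '*' for stop) that an ambiguous codon may encode."""
    options = [EXPANSIONS[base] for base in codon]
    return {CODON_TABLE["".join(bases)] for bases in product(*options)}


def translate_ambiguous(dna):
    """Translate frame 1; ambiguous positions are written as [sorted choices]."""
    out = []
    for i in range(0, len(dna) - 2, 3):
        choices = codon_amino_acids(dna[i:i + 3])
        if len(choices) == 1:
            out.append(next(iter(choices)))
        else:
            out.append("[" + "".join(sorted(choices)) + "]")
    return "".join(out)


print(translate_ambiguous("ACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAA"))
```

Note that CTN still translates to a single L because all four codons agree, while TNG includes a stop codon among its choices, which is why the slide breaks sequences at such positions.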

24 Correcting EST Sequence: Align ESTs to genome. Use aligned genomic sequence. Must get splice sites right! 6 frame translation. Break at non-amino-acids: stop codons + X. Discard AA sequences < 50 AA long. Result: ~1 Gb.

25 Genomic Coding Sequence: Use Genscan to predict exons. Use very low probability threshold. Alternative exons option. No need for translation (35Mb).

26 Exon “Pair” Enumeration: [Diagram labels] Gene model (exons 1, 2, 3, 4, 5 and 4*, exon 4 w/ SNP); 30 AA; Exon Pairs & Paths; 3-Frame Translation; C3 Compression; Peptide Sequence.

27 Peptide Candidates. Parent ion: typically < 3000 Da. Tryptic peptides: cut at K or R. Search engines: don’t handle charge states > 4+ well. Long peptides don’t fragment well. The number of distinct 30-mers upper bounds the total peptide content.
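The candidate rules on slide 27 can be illustrated with a small in-silico digest. This is a sketch under simplifying assumptions: cleavage after every K or R with no proline exception and no missed cleavages, and the 3000 Da bound applied to the neutral monoisotopic peptide mass rather than the observed parent ion. The function names are hypothetical.

```python
import re

# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def tryptic_peptides(protein):
    """Split after every K or R; a trailing piece without K/R is kept."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", protein)


def neutral_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def candidates(protein, max_mass=3000.0):
    """Tryptic peptides whose neutral mass is below max_mass."""
    return [p for p in tryptic_peptides(protein) if neutral_mass(p) < max_mass]


for peptide in candidates("SGFLEEDELKAARGG"):
    print(peptide, round(neutral_mass(peptide), 2))
```

Since no residue is lighter than glycine (about 57 Da), a 3000 Da bound caps the peptide length, which is consistent with the slide's use of fixed-length 30-mers as a proxy for total peptide content.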

28 Sequence Database Compression. Construct a sequence database that is: Complete (all 30-mers are present); Correct (no other 30-mers are present); Compact (no 30-mer is present more than once).

29 SBH-graph: ACDEFGI, ACDEFACG, DEFGEFGI

30 Compressed SBH-graph: ACDEFGI, ACDEFACG, DEFGEFGI

31 Sequence Databases & CSBH-graphs: Sequences correspond to paths. ACDEFGI, ACDEFACG, DEFGEFGI

32 Sequence Databases & CSBH-graphs: All k-mers represented by an edge have the same count. [Figure: edge counts 2, 2, 1, 2, 1]

33 Sequence Databases & CSBH-graphs. Complete: all edges are on some path. Correct: output path sequence only. Compact: no edge is used more than once. A C3 path set uses all edges exactly once.

34 Sequence Databases & CSBH-graphs: Use each edge exactly once. ACDEFGEFGI, DEFACG
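The three properties can be checked directly on the toy example from these slides. The sketch below is a verification aid only, not the compression algorithm itself; it assumes k = 4 for the toy sequences (the value for which the example works out, whereas the real databases use 30-mers), and `check_c3` is a hypothetical name.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer occurrence across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def check_c3(original, compressed, k):
    """Report whether `compressed` is complete, correct, and compact."""
    want = kmer_counts(original, k)
    got = kmer_counts(compressed, k)
    return {
        "complete": set(want) <= set(got),               # every k-mer is present
        "correct": set(got) <= set(want),                # no new k-mers appear
        "compact": all(n == 1 for n in got.values()),    # each k-mer occurs once
    }


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
compressed = ["ACDEFGEFGI", "DEFACG"]
report = check_c3(original, compressed, k=4)
print(report)
print(sum(map(len, original)), "->", sum(map(len, compressed)), "residues")
```

For this example the two output sequences carry the same ten distinct 4-mers as the three inputs, in 16 residues instead of 23.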

35 Sequence Databases & CSBH-graphs: All k-mers that occur at least twice. ACDEFGI [Figure: edge counts 2, 2, 1, 2, 1]
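The "at least twice" filter can be sketched with a k-mer counter. Assumptions in this illustration: k = 4 for the toy sequences, and occurrences are counted in total across all input sequences; `repeated_kmers` is a hypothetical name.

```python
from collections import Counter


def repeated_kmers(sequences, k, min_count=2):
    """Return the set of k-mers seen at least min_count times overall."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_count}


sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
kept = repeated_kmers(sequences, k=4)
print(sorted(kept))
```

The surviving 4-mers are exactly those of ACDEFGI, matching the sequence shown on the slide; k-mers seen only once (for example, those from a sequencing error in a single EST) are dropped.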

36 Relative Search Time: [Chart] IPI-H, SP, SP-VS, UP, UP-VS

37 More Sensitive Peptide ID. Significances, p-values, Expect values: normalize for number of trials. Blast: size of sequence database. Mascot etc.: number of peptides scored against each spectrum. Redundant peptide sequences increase the number of trials, artificially. Trials are not independent! Less redundancy results in a better significance estimate.
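A toy calculation shows why redundancy hurts. It assumes the common simplification that an Expect value is a per-comparison p-value multiplied by the number of trials; this is an illustration of the argument, not the scoring formula of Blast, Mascot, or any other engine, and the p-value and peptide list are made up.

```python
def expect_value(p_value, n_trials):
    """Expected number of chance hits at least this good over n_trials."""
    return p_value * n_trials


# Hypothetical candidate list in which the same peptide appears three times.
scored_peptides = ["SGFLEEDELK", "SGFLEEDELK", "AAR", "SGFLEEDELK"]
p_value = 1e-3  # made-up per-peptide p-value

redundant = expect_value(p_value, len(scored_peptides))
distinct = expect_value(p_value, len(set(scored_peptides)))
print(redundant, distinct)
```

Counting each distinct peptide once halves the Expect value in this example, so the same match is reported as more significant.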

38 More Sensitive Peptide ID

39 Human Peptide Sequences: EST enumeration (30-mers must occur at least twice). EST corrections. Genscan exons. Uncompressed size: ~4.5Gb. Compressed size: ~263Mb.

40 Infrastructure: X!Tandem open source search engine, configured to search the aggressive peptide enumeration (human). Web interface for browsing results. Integrated with Condor. Results stored in a MySQL database. Over 3 million publicly available MS/MS spectra from human samples.

41 “Novel” Peptide

42 “Novel” Peptide

43 “Novel” Peptide

44 Ongoing work: Integrate SNPs and exon pairs. Get (lots) more spectra! Solve the reverse mapping problem: Where did this peptide come from? What protein does this peptide represent?

45 Thanks. Informatics Research @ ABI & Celera: Ross Lippert, Clark Mobarry, Bjarni Halldorsson. UMIACS @ University of Maryland, CP: V.S. Subrahmanian, Fritz McCall, Doan Pham. Fenselau Lab @ UM, CP. CS @ University of Maryland, CP: Chau-Wen Tseng, Xue Wu.

