Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification Nathan Edwards Center for Bioinformatics and Computational Biology.


2 Mass Spectrometer. Sample → Ionizer → Mass Analyzer → Detector. Ionizer: MALDI, Electro-Spray Ionization (ESI). Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap. Detector: Electron Multiplier (EM).

3 Mass Spectrometer (MALDI-TOF) [Figure: MALDI-TOF schematic. Labels: UV (337 nm); analyte/matrix; backing plate (grounded); pulse voltage; source, length = s; extraction grid (source voltage -V_s); field-free drift zone, length = D, E_d = 0; detector grid (-V_s); microchannel plate detector.]

4 Mass is fundamental

5 Sample Preparation for MS/MS Enzymatic Digest and Fractionation

6 Single Stage MS MS

7 Tandem Mass Spectrometry (MS/MS) Precursor selection

8 Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) MS/MS

9 Peptide Fragmentation [Figure: peptide backbone -HN-CH-CO-NH-CH-CO-NH- with side chains R_i and R_{i+1}; cleavage sites labeled with fragment ions b_i, y_{n-i}, b_{i+1}, and y_{n-i-1}.]

10 Peptide Fragmentation [Figure: MS/MS spectrum, % Intensity (0 to 100) vs. m/z (250 to 1000), annotated with the residues K L E D E E L F G S.]

11 Peptide Fragmentation
Residue   b ions   y ions
K         1166     147
L         1020     260
E         907      389
D         778      504
E         663      633
E         534      762
L         405      875
F         292      1022
G         145      1080
S         88       1166
[Figure: MS/MS spectrum, % Intensity (0 to 100) vs. m/z (250 to 1000), with labeled peaks y2 to y9 and b3 to b9.]
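The b- and y-ion ladders on this slide can be approximated with a short calculation. The sketch below assumes monoisotopic residue masses, singly charged fragments, and that the peptide read in the b-ion direction is SGFLEEDELK (inferred from the b-ion masses on the slide); the name `fragment_ions` is illustrative only.

```python
# Monoisotopic residue masses (Da), only for the residues in this example.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i] is the N-terminal fragment of length i + 1;
    y[i] is the C-terminal fragment of length i + 1.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b, y = [], []
    total = 0.0
    for m in masses:                 # grow from the N-terminus
        total += m
        b.append(total + PROTON)
    total = 0.0
    for m in reversed(masses):       # grow from the C-terminus
        total += m
        y.append(total + WATER + PROTON)
    return b, y


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:8.2f}   y{i} = {y:8.2f}")
```

Under these assumptions the computed values agree with the slide's labels to within about 1 Da. The value 1166 listed with the b ions appears to be the intact protonated peptide (the same as the largest y value) rather than the full-length b ion computed here.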

12 MS/MS Search Engines. Fail when peptides are missing from sequence database. Protein sequence databases serve many masters. Full length protein sequences not needed for MS/MS. Explicit variant enumeration is needed for MS/MS. Much peptide sequence information is lost, inaccessible, or not integrated: protein isoforms, sequence variants, SNPs, alternate splice forms, ESTs. Some peptides are more interesting than others. Protein identification is only part of the story.

13 Human Sequences. Number of human genes is believed to be between 20,000 and 25,000. PIR: ~10,500. SwissProt: ~12,000. RefSeq: ~28,000. IPI-HUMAN: ~48,000. TrEMBL: ~52,000. MSDB: ~105,000.

14 DNA to Protein Sequence Derived from http://online.itp.ucsb.edu/online/infobio01/burge

15 UCSC Genome Browser

16 Genomic Peptide Sequences. Many putative peptide sequences never become “protein” sequences. Genomic DNA, RefSeq mRNA, ESTs. SNP/polymorphism databases. Variant records in SwissProt. Genomic annotation seeks “full length” genes and proteins.

17 Genomic Peptide Sequences. Genomic DNA: exons & introns, 6 frames, large (3Gb → 6Gb). RefSeq mRNA: no introns, 3 frames, small (36Mb → 36Mb); most protein sequences already represented in sequence databases. ESTs: no introns, 6 frames, large (3Gb → 6Gb); used by gene & alternative splicing pipelines; highly redundant, nucleotide error rate ~ 1%.

18 “Novel” Peptide

19 Novel peptide

20 EST Peptides. 6 frame translation. Ambiguous base enumeration (up to a point). Break at non-amino-acids (stop codons + X). Discard AA sequences < 50 AA long. Result: ~ 3 Gb. Not as simple as it sounds!
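The translate, break, and discard steps can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: it assumes the standard genetic code, maps any codon containing a non-ACGT base to X instead of enumerating ambiguous bases, and the names `translate` and `six_frame_peptides` are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(dna):
    """Translate one frame; codons with non-ACGT bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_peptides(dna, min_len=50):
    """Yield amino-acid stretches of at least min_len from all six frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]   # reverse complement
    for strand in (dna, reverse):
        for offset in range(3):
            protein = translate(strand[offset:])
            # Break at stop codons (*) and unknown residues (X).
            for piece in re.split(r"[*X]", protein):
                if len(piece) >= min_len:
                    yield piece
```

On real EST data the threshold would be the 50 AA stated on the slide; a smaller `min_len` is only useful for trying the function on short test strings.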

21 EST Peptides. Lots of ambiguous bases:
>gi|272208|gb|M61958.1|
TGCACAACCAAGTTTTGTGACTACGGGAAGGCT
CCCGGGGCAGAGGAGTACGCTCAACAAGATGTG
TTAAAGAAATCTTACTCCAAGGCCTTCACGCTG
ACCATCTCTGCCCTCTTTGTGACACCCAAGACG
ACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAA
CTNCAGTTGTNGCCGAGTGATGTGGACAAGCTG
TCACCCACTGACA

22 Codon Table

23 EST Peptides. Frame 1 translation:
CTTKFCDYGKA
PGAEEYAQQDV
LKKSYSKAFTL
TISALFVTPKT
TGA[QPRL]VELSEQQ
LQL[S*LW]PSDVDKL
SPTD[IKMNSRT]
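The bracketed residue sets in this translation come from enumerating every base an ambiguous position could be. A minimal sketch, assuming the standard genetic code and handling only the ambiguity code N (other IUPAC codes and the trailing partial codon are not handled); `ambiguous_residues` and `translate_ambiguous` are illustrative names.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def ambiguous_residues(codon):
    """Return the sorted residues (or '*') that a codon with N bases can encode."""
    choices = [BASES if base == "N" else base for base in codon]
    return sorted({CODON_TABLE["".join(c)] for c in product(*choices)})


def translate_ambiguous(dna):
    """Translate frame 1, writing multi-residue positions as [..] sets."""
    dna = dna.upper()
    out = []
    for i in range(0, len(dna) - 2, 3):
        residues = ambiguous_residues(dna[i:i + 3])
        if len(residues) == 1:
            out.append(residues[0])
        else:
            out.append("[" + "".join(residues) + "]")
    return "".join(out)


print(translate_ambiguous("ACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAA"))
print(translate_ambiguous("CTNCAGTTGTNGCCGAGTGATGTGGACAAGCTG"))
```

The sets contain the same residues as on the slide, but this sketch prints them in sorted order, so the order inside the brackets differs. A codon such as CTN collapses to a single residue because all four completions encode leucine.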

24 Correcting EST Sequence. Align ESTs to genome. Use aligned genomic sequence. Must get splice sites right! 6 frame translation. Break at non-amino-acids (stop codons + X). Discard AA sequences < 50 AA long. Result: ~ 1 Gb.

25 Genomic Coding Sequence. Use Genscan to predict exons. Use very low probability threshold. Alternative exons option. No need for translation (35Mb).

26 Exon “Pair” Enumeration [Figure: gene model with exons 1 to 5 and 4* (exon 4 w/ SNP); exon pairs & paths covering 30 AA; stages labeled Exon Pairs & Paths, 3-Frame Translation, C3 Compression, Peptide Sequence.]

27 Peptide Candidates. Parent ion: typically < 3000 Da. Tryptic peptides: cut at K or R. Search engines: don’t handle > 4+ well. Long peptides don’t fragment well. # of distinct 30-mers upper bounds total peptide content.
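The "cut at K or R" rule can be sketched in a few lines. This is a simplified illustration: it cuts after every K or R, ignores missed cleavages and any proline exception, and uses a length cap of 30 residues as a stand-in for the parent-ion mass limit; `tryptic_peptides` and `max_len` are hypothetical names.

```python
import re


def tryptic_peptides(protein, max_len=30):
    """Cut after every K or R and keep peptides of at most max_len residues."""
    # Each match is a run of non-K/R residues ending in K or R,
    # or the final run that reaches the end of the sequence.
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", protein)
    return [p for p in pieces if len(p) <= max_len]


print(tryptic_peptides("ACKDERFG"))
```

Because every retained peptide is at most 30 residues long, each one is contained in some 30-mer window of the source sequence (or in the whole sequence if it is shorter), which is why the set of distinct 30-mers bounds the peptide content.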

28 Sequence Database Compression. Construct sequence database that is: Complete (all 30-mers are present), Correct (no other 30-mers are present), Compact (no 30-mer is present more than once).

29 SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI
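A minimal sketch of how such a graph could be built from the three example strings, assuming the usual de Bruijn-style construction in which nodes are (k-1)-mers and each k-mer is an edge from its prefix to its suffix. The value k = 4 is an assumption chosen to fit these short strings (the talk works with 30-mers), and `sbh_edges` is an illustrative name.

```python
from collections import Counter


def sbh_edges(sequences, k):
    """Map each edge (prefix node, suffix node) to its k-mer count."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    # A k-mer joins its first k-1 letters to its last k-1 letters.
    return {(kmer[:-1], kmer[1:]): n for kmer, n in counts.items()}


edges = sbh_edges(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
for (src, dst), n in sorted(edges.items()):
    print(f"{src} -> {dst}  (count {n})")
```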

30 Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

31 Sequence Databases & CSBH-graphs Sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

32 Sequence Databases & CSBH-graphs. All k-mers represented by an edge have the same count. [Figure: CSBH-graph with edge counts 2, 2, 1, 2, 1.]

33 Sequence Databases & CSBH-graphs. Complete: all edges are on some path. Correct: output path sequence only. Compact: no edge is used more than once. C3 Path Set uses all edges exactly once.

34 Sequence Databases & CSBH-graphs. Use each edge exactly once: ACDEFGEFGI, DEFACG
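The complete, correct, and compact properties can be checked directly on k-mer sets. The sketch below tests the two-sequence output against the three original strings, assuming k = 4 (inferred from the example; the talk uses 30-mers); `kmers` and `is_c3` are illustrative names.

```python
def kmers(sequences, k):
    """List every k-mer occurrence, in order, across all sequences."""
    return [seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)]


def is_c3(original, compressed, k):
    """Return (complete, correct, compact) for a compressed database."""
    source = set(kmers(original, k))
    output = kmers(compressed, k)
    complete = source <= set(output)            # every original k-mer is present
    correct = set(output) <= source             # no new k-mers are introduced
    compact = len(output) == len(set(output))   # no k-mer occurs twice
    return complete, correct, compact


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
compressed = ["ACDEFGEFGI", "DEFACG"]
print(is_c3(original, compressed, k=4))
print(is_c3(original, original, k=4))
```

The original three strings are complete and correct with respect to themselves but not compact, since k-mers such as ACDE occur more than once.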

35 Sequence Databases & CSBH-graphs. All k-mers that occur at least twice: ACDEFGI. [Figure: CSBH-graph with edge counts 2, 2, 1, 2, 1.]
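Filtering by count is a one-pass operation over the k-mers. A minimal sketch, again assuming k = 4 for the example strings and counting every occurrence (including repeats within one sequence); `repeated_kmers` is an illustrative name.

```python
from collections import Counter


def repeated_kmers(sequences, k, min_count=2):
    """Return the set of k-mers that occur at least min_count times."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_count}


kept = repeated_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
print(sorted(kept))
```

The retained 4-mers overlap end to end, so they can be spelled by the single sequence ACDEFGI, as on the slide.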

36 Relative Search Time [Figure: relative search time for IPI-H, SP, SP-VS, UP, UP-VS.]

37 More Sensitive Peptide ID. Significances, p-values, Expect values: normalize for number of trials. Blast: size of sequence database. Mascot etc.: number of peptides scored against each spectrum. Redundant peptide sequences increase the number of trials, artificially. Trials are not independent! Less redundancy results in a better significance estimate.
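The effect of redundancy on the trial count can be illustrated with the simple approximation that an expect value is the p-value multiplied by the number of trials. This is an assumption made for illustration, not a description of how any particular search engine computes its scores, and all numbers below are hypothetical.

```python
def expect_value(p_value, n_trials):
    """Expected number of chance matches at least this good in n_trials."""
    return p_value * n_trials


p = 1e-6                 # hypothetical p-value of one peptide-spectrum match
distinct = 10_000        # hypothetical number of distinct candidate peptides
redundancy = 5           # hypothetical average copies of each peptide

# Counting redundant copies as separate trials inflates the expect value.
print(expect_value(p, distinct))
print(expect_value(p, distinct * redundancy))
```

The same match looks five times less significant when each distinct peptide is counted five times, even though the repeated copies add no new hypotheses.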

38 More Sensitive Peptide ID

39 Human Peptide Sequences. EST enumeration; 30-mers must occur at least twice. EST corrections. Genscan exons. Uncompressed size: ~ 4.5Gb. Compressed size: ~ 263Mb.

40 Infrastructure. X!Tandem open source search engine, configured to search aggressive peptide enumeration (human). Web interface for browsing results. Integrated with Condor. Results stored in MySQL database. Over 3 million publicly available MS/MS spectra from human samples.

44 Ongoing work. Integrate SNPs and exon pairs. Get (lots) more spectra! Solve the reverse mapping problem: Where did this peptide come from? What protein does this peptide represent?

45 Thanks. Informatics Research @ ABI & Celera: Ross Lippert, Clark Mobarry, Bjarni Halldorsson. UMIACS @ University of Maryland, CP: V.S. Subrahmanian, Fritz McCall, Doan Pham. Fenselau Lab @ UM, CP. CS @ University of Maryland, CP: Chau-Wen Tseng, Xue Wu.
