Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification Nathan Edwards Center for Bioinformatics and Computational Biology.

Presentation transcript:

Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification Nathan Edwards Center for Bioinformatics and Computational Biology

2 Mass Spectrometer: Sample, Ionizer, Mass Analyzer, Detector. Ionizer: MALDI, Electro-Spray Ionization (ESI). Mass Analyzer: Time-Of-Flight (TOF), Quadrupole, Ion-Trap. Detector: Electron Multiplier (EM).

3 Mass Spectrometer (MALDI-TOF) [Diagram labels: source, length = s; field-free drift zone, length = D, E_d = 0; microchannel plate detector; backing plate (grounded); extraction grid (source voltage -V_s); UV (337 nm); detector grid, -V_s; pulse voltage; analyte/matrix]

4 Mass is fundamental

5 Sample Preparation for MS/MS Enzymatic Digest and Fractionation

6 Single Stage MS MS

7 Tandem Mass Spectrometry (MS/MS) Precursor selection

8 Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) MS/MS

9 Peptide Fragmentation [Diagram: peptide backbone -HN-CH-CO-NH-CH-CO-NH- with side chains R_i and R_i+1; cleavage sites labeled b_i / y_n-i and b_i+1 / y_n-i-1]

10 Peptide Fragmentation [Spectrum: % Intensity vs. m/z; residue labels K L E D E E L F G S]

11 Peptide Fragmentation [Spectrum: % Intensity vs. m/z. b ions: K 1166, L 1020, E 907, D 778, E 663, E 534, L 405, F 292, G 145, S 88. Labeled peaks: b3 to b9 and y2 to y9]
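The b-ion labels on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes the peptide reads SGFLEEDELK from N- to C-terminus (inferred from the order of the b-ion labels), singly charged fragments, and standard monoisotopic residue masses. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da), only for the residues in this example.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
PROTON = 1.007276
WATER = 18.010565

def b_ions(peptide):
    """Singly charged b-ion m/z values b1..b(n-1) (N-terminal prefixes)."""
    masses, total = [], 0.0
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        masses.append(total + PROTON)
    return masses

def y_ions(peptide):
    """Singly charged y-ion m/z values y1..y(n-1) (C-terminal suffixes)."""
    masses, total = [], 0.0
    for residue in reversed(peptide[1:]):
        total += RESIDUE_MASS[residue]
        masses.append(total + WATER + PROTON)
    return masses

PEPTIDE = "SGFLEEDELK"  # assumed N- to C-terminal reading of the slide labels
B_SERIES = [round(m) for m in b_ions(PEPTIDE)]
Y_SERIES = [round(m) for m in y_ions(PEPTIDE)]
```

Under these assumptions the rounded b series matches the slide's labels from 88 up to 1020. The final label, 1166, corresponds to the intact singly protonated peptide rather than a b fragment.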

12 MS/MS Search Engines: Fail when peptides are missing from sequence database. Protein sequence databases serve many masters: full length protein sequences not needed for MS/MS; explicit variant enumeration is needed for MS/MS. Much peptide sequence information is lost, inaccessible, or not integrated: protein isoforms, sequence variants, SNPs, alternate splice forms, ESTs. Some peptides are more interesting than others: protein identification is only part of the story.

13 Human Sequences: Number of human genes is believed to be between 20,000 and 25,000. PIR ~ 10,500; SwissProt ~ 12,000; RefSeq ~ 28,000; IPI-HUMAN ~ 48,000; TrEMBL ~ 52,000; MSDB ~ 105,000.

14 DNA to Protein Sequence Derived from

15 UCSC Genome Browser

16 Genomic Peptide Sequences: Many putative peptide sequences never become “protein” sequences: genomic DNA, RefSeq mRNA, ESTs; SNP/polymorphism databases; variant records in SwissProt. Genomic annotation seeks “full length” genes and proteins.

17 Genomic Peptide Sequences: Genomic DNA: exons & introns, 6 frames, large (3Gb → 6Gb). RefSeq mRNA: no introns, 3 frames, small (36Mb → 36Mb); most protein sequences already represented in sequence databases. ESTs: no introns, 6 frames, large (3Gb → 6Gb); used by gene & alternative splicing pipelines; highly redundant, nucleotide error rate ~ 1%.
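One way to read the size arrows on this slide is as nucleotides in, amino acids out: each frame turns three bases into one residue, so translating in f frames multiplies the input length by f/3. The helper below is a hypothetical illustration of that reading, not something stated on the slide.

```python
def translated_size(nucleotides, frames):
    """Amino acids produced by translating `nucleotides` bases in `frames` frames."""
    return nucleotides * frames // 3

GENOME_AA = translated_size(3_000_000_000, 6)  # 6 frames: 3Gb of DNA -> 6G residues
MRNA_AA = translated_size(36_000_000, 3)       # 3 frames: 36Mb of mRNA -> 36M residues
```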

18 “Novel” Peptide

19 Novel peptide

20 EST Peptides: 6 frame translation; ambiguous base enumeration (up to a point); break at non-amino-acids (stop codons + X); discard AA sequences < 50 AA long. Result: ~ 3 Gb. Not as simple as it sounds!
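The translation steps listed on this slide can be sketched as follows. This is a minimal illustration, not the pipeline used for the talk: it uses the standard genetic code, treats any codon containing a non-ACGT base as X instead of enumerating it, and the names `translate` and `six_frame_peptides` are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def translate(dna):
    """Translate one frame; codons with non-ACGT bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_peptides(dna, min_len=50):
    """Yield (frame, peptide) for each stop/X-free stretch of at least min_len residues."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in ((1, dna), (-1, reverse)):
        for offset in range(3):
            protein = translate(seq[offset:])
            # Break at stop codons (*) and unknown residues (X).
            for piece in re.split(r"[*X]", protein):
                if len(piece) >= min_len:
                    yield strand * (offset + 1), piece
```

Frames are numbered 1 to 3 on the forward strand and -1 to -3 on the reverse complement. The `min_len=50` default mirrors the slide's length cutoff.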

21 EST Peptides: Lots of ambiguous bases
>gi|272208|gb|M |
TGCACAACCAAGTTTTGTGACTACGGGAAGGCT
CCCGGGGCAGAGGAGTACGCTCAACAAGATGTG
TTAAAGAAATCTTACTCCAAGGCCTTCACGCTG
ACCATCTCTGCCCTCTTTGTGACACCCAAGACG
ACTGGGGCCCNGGTGGAGTTAAGCGAGCAGCAA
CTNCAGTTGTNGCCGAGTGATGTGGACAAGCTG
TCACCCACTGACA

22 Codon Table

23 EST Peptides: Frame 1 translation
CTTKFCDYGKA
PGAEEYAQQDV
LKKSYSKAFTL
TISALFVTPKT
TGA[QPRL]VELSEQQ
LQL[S*LW]PSDVDKL
SPTD[IKMNSRT]
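The bracketed residues in this translation come from expanding each ambiguous base into all four possibilities. A minimal sketch of that enumeration follows, assuming the standard genetic code, N as the only ambiguity code, and a trailing partial codon padded with N. The `max_choices` cutoff is a hypothetical stand-in for the "up to a point" limit, and brackets are emitted in sorted order, which differs from the ordering on the slide.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_residues(codon):
    """Sorted residues an ambiguous codon can encode; N or a missing base matches any base."""
    codon = codon.ljust(3, "N")
    options = [BASES if base == "N" else base for base in codon]
    return sorted({CODON_TABLE["".join(c)] for c in product(*options)})

def translate_ambiguous(dna, max_choices=8):
    """Frame 1 translation; ambiguous codons become [..] sets, or X if too many choices."""
    out = []
    for i in range(0, len(dna), 3):
        residues = codon_residues(dna[i:i + 3])
        if len(residues) == 1:
            out.append(residues[0])
        elif len(residues) <= max_choices:
            out.append("[" + "".join(residues) + "]")
        else:
            out.append("X")
    return "".join(out)
```

For example, `translate_ambiguous("ACTGGGGCCCNGGTG")` gives `TGA[LPQR]V`, the same residue set as the slide's `TGA[QPRL]V`.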

24 Correcting EST Sequence: Align ESTs to genome; use aligned genomic sequence (must get splice sites right!); 6 frame translation; break at non-amino-acids (stop codons + X); discard AA sequences < 50 AA long. Result: ~ 1 Gb.

25 Genomic Coding Sequence: Use Genscan to predict exons; use very low probability threshold; alternative exons option. No need for translation (35Mb).

26 Exon “Pair” Enumeration [Diagram labels: gene model; exon 4 w/ SNP (*); 30 AA; exon pairs & paths; 3-frame translation; C3 compression; peptide sequence]

27 Peptide Candidates: Parent ion: typically < 3000 Da. Tryptic peptides: cut at K or R. Search engines: don’t handle > 4+ well. Long peptides don’t fragment well. # of distinct 30-mers upper bounds total peptide content.
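The "cut at K or R" rule can be sketched in a few lines. This is a simplified illustration that cleaves after every K or R and ignores refinements such as missed cleavages or suppressed cleavage before proline; the function name is hypothetical.

```python
def tryptic_peptides(protein):
    """Split a protein sequence after every K or R (fully tryptic, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # C-terminal remainder that does not end in K or R.
        peptides.append(protein[start:])
    return peptides
```

For example, `tryptic_peptides("MKWVTFRGSKLE")` returns `["MK", "WVTFR", "GSK", "LE"]`.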

28 Sequence Database Compression: Construct sequence database that is Complete (all 30-mers are present), Correct (no other 30-mers are present), and Compact (no 30-mer is present more than once).
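The three properties can be checked mechanically by comparing k-mer sets. The sketch below is illustrative only; it uses the toy sequences from the graph slides of this talk with k = 4, which is an assumption (the slides do not state k for the toy example), and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every length-k substring across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def check_c3(original, compressed, k):
    """Report whether `compressed` is complete, correct and compact w.r.t. `original`."""
    source, target = kmer_counts(original, k), kmer_counts(compressed, k)
    return {
        "complete": set(source) <= set(target),  # every original k-mer is present
        "correct": set(target) <= set(source),   # no new k-mers were introduced
        "compact": all(n == 1 for n in target.values()),  # no k-mer repeated
    }

ORIGINAL = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
COMPRESSED = ["ACDEFGEFGI", "DEFACG"]
RESULT = check_c3(ORIGINAL, COMPRESSED, k=4)
```

With k = 4, the two compressed sequences contain each of the ten distinct 4-mers of the original three sequences exactly once, so all three checks pass.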

29 SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI
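A sequencing-by-hybridization (SBH) graph of this kind can be built directly from the sequences. The sketch below assumes the usual construction, with (k-1)-mers as nodes and one edge per distinct k-mer, and assumes k = 4 for the toy sequences; neither detail is stated on the slide, and `sbh_graph` is a hypothetical name.

```python
from collections import defaultdict

def sbh_graph(sequences, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it has an edge to."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # The k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

GRAPH = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
```

In this graph the node DEF branches to EFG and EFA, and EFG branches to FGI and FGE, which is where the three input sequences diverge.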

30 Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

31 Sequence Databases & CSBH-graphs Sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

32 Sequence Databases & CSBH-graphs All k-mers represented by an edge have the same count

33 Sequence Databases & CSBH-graphs: Complete: all edges are on some path. Correct: output path sequence only. Compact: no edge is used more than once. C3 path set uses all edges exactly once.

34 Sequence Databases & CSBH-graphs Use each edge exactly once ACDEFGEFGI, DEFACG

35 Sequence Databases & CSBH-graphs All k-mers that occur at least twice ACDEFGI
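The "at least twice" filter amounts to counting k-mers and keeping those above a threshold. A minimal sketch follows, assuming k = 4 for the toy sequences and counting every occurrence across all input sequences; both are assumptions, and the function name is hypothetical.

```python
from collections import Counter

def frequent_kmers(sequences, k, min_count=2):
    """Return the k-mers that occur at least `min_count` times across all sequences."""
    counts = Counter(seq[i:i + k]
                     for seq in sequences
                     for i in range(len(seq) - k + 1))
    return {kmer for kmer, n in counts.items() if n >= min_count}

KEPT = frequent_kmers(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
```

Under these assumptions the surviving 4-mers are ACDE, CDEF, DEFG and EFGI, which are exactly the 4-mers of ACDEFGI.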

36 Relative Search Time [Chart categories: IPI-H, SP, SP-VS, UP, UP-VS]

37 More Sensitive Peptide ID: Significances, p-values, Expect values: normalize for number of trials. BLAST: size of sequence database. Mascot etc.: number of peptides scored against each spectrum. Redundant peptide sequences increase the number of trials, artificially. Trials are not independent! Less redundancy results in a better significance estimate.
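The effect of redundancy on significance can be illustrated with the simplest trial correction, where the expect value is the per-comparison p-value multiplied by the number of trials. This is a generic approximation and not the scoring formula of BLAST, Mascot, or any other engine; the p-value and trial counts below are hypothetical numbers chosen for illustration.

```python
def expect_value(p_value, trials):
    """Expected number of chance matches at least this good over `trials` comparisons."""
    return p_value * trials

P_VALUE = 1e-5            # hypothetical per-peptide p-value for one match
REDUNDANT_TRIALS = 10_000  # hypothetical: every copy of a peptide counted as a trial
DISTINCT_TRIALS = 1_000    # hypothetical: each distinct peptide counted once

E_REDUNDANT = expect_value(P_VALUE, REDUNDANT_TRIALS)
E_DISTINCT = expect_value(P_VALUE, DISTINCT_TRIALS)
```

The same match looks ten times less significant when duplicate peptides inflate the trial count, even though the duplicates add no independent chances of a false match.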

38 More Sensitive Peptide ID

39 Human Peptide Sequences: EST enumeration (30-mers must occur at least twice); EST corrections; Genscan exons. Uncompressed size: ~ 4.5Gb. Compressed size: ~ 263Mb.

40 Infrastructure: X!Tandem open source search engine, configured to search aggressive peptide enumeration (human). Web interface for browsing results. Integrated with Condor. Results stored in MySQL database. Over 3 million publicly available MS/MS spectra from human samples.

41 “Novel” Peptide

42 “Novel” Peptide

43 “Novel” Peptide

44 Ongoing work: Integrate SNPs and exon pairs. Get (lots) more spectra! Solve the reverse mapping problem: Where did this peptide come from? What protein does this peptide represent?

45 Thanks: Informatics, ABI & Celera: Ross Lippert, Clark Mobarry, Bjarni Halldorsson. University of Maryland, CP: V.S. Subrahmanian, Fritz McCall, Doan Pham. Fenselau, UM, CP. University of Maryland, CP: Chau-Wen Tseng, Xue Wu.