Generating Peptide Candidates from Protein Sequence Databases for Protein Identification via Mass Spectrometry Nathan Edwards Informatics Research.

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
1 EE5900 Advanced Embedded System For Smart Infrastructure Static Scheduling.
1 Maximum flow sender receiver Capacity constraint Lecture 6: Jan 25.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
How to identify peptides October 2013 Gustavo de Souza IMM, OUS.
Approximation Algorithms: Combinatorial Approaches Lecture 13: March 2.
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Peptide Identification by Tandem Mass Spectrometry Behshad Behzadi April 2005.
Data Processing Algorithms for Analysis of High Resolution MSMS Spectra of Peptides with Complex Patterns of Posttranslational Modifications Shenheng Guan.
Implicit Hitting Set Problems Richard M. Karp Harvard University August 29, 2011.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Sequence Alignment vs. Database Task: Given a query sequence and millions of database records, find the optimal alignment between the query and a record.
This material in not in your text (except as exercises) Sequence Comparisons –Problems in molecular biology involve finding the minimum number of edit.
The restriction mapping problem revisited Gopal Pandurangan and H. Ramesh Journal of Computer and System Sciences 526~544(2002)
Phylogenetic Networks of SNPs with Constrained Recombination D. Gusfield, S. Eddhu, C. Langley.
ProReP - Protein Results Parser v3.0©
Introduction to computational genomics – hands on course Gene expression (Gasch et al) Unit 1: Mapper Unit 2: Aggregator and peak finder Solexa MNase Reads.
Sequence alignment, E-value & Extreme value distribution
Proteomics Informatics – Protein identification II: search engines and protein sequence databases (Week 5)
Previous Lecture: Regression and Correlation
Scaffold Download free viewer:
My contact details and information about submitting samples for MS
Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University.
Novel Peptide Identification using ESTs and Sequence Database Compression Nathan Edwards Center for Bioinformatics and Computational Biology University.
Analysis of tandem mass spectra - II Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology.
Protein sequencing and Mass Spectrometry. Sample Preparation Enzymatic Digestion (Trypsin) + Fractionation.
Management Science 461 Lecture 9 – Arc Routing November 18, 2008.
Theory of Computing Lecture 10 MAS 714 Hartmut Klauck.
Mining Frequent Itemsets with Constraints Takeaki Uno Takeaki Uno National Institute of Informatics, JAPAN Nov/2005 FJWCP.
PROTEIN STRUCTURE NAME: ANUSHA. INTRODUCTION Frederick Sanger was awarded his first Nobel Prize for determining the amino acid sequence of insulin, the.
Nathan Edwards Center for Bioinformatics and Computational Biology
Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Common parameters At the beginning one need to set up the parameters.
Graph Theory And Bioinformatics Jason Wengert. Outline Introduction to Graphs Eulerian Paths & Hamiltonian Cycles Interval Graph & Shape of Genes Sequencing.
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
Laxman Yetukuri T : Modeling of Proteomics Data
Graph Indexing: A Frequent Structure- based Approach Alicia Cosenza November 26 th, 2007.
INF380 - Proteomics-101 INF380 – Proteomics Chapter 10 – Spectral Comparison Spectral comparison means that an experimental spectrum is compared to theoretical.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University.
Poster produced by Faculty & Curriculum Support (FACS), Georgetown University Medical Center Application of meta-search, grid-computing, and machine-learning.
Union-find Algorithm Presented by Michael Cassarino.
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
Protein Identification via Database searching Attila Kertész-Farkas Protein Structure and Bioinformatics Group, ICGEB, Trieste.
NP-COMPLETE PROBLEMS. Admin  Two more assignments…  No office hours on tomorrow.
Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.
Peptide Identification via Tandem Mass Spectrometry Sorin Istrail.
Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,
Improving the Sensitivity of Peptide Identification Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical.
Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression Nathan J. Edwards Center for Bioinformatics & Computational.
Implicit Hitting Set Problems Richard M. Karp Erick Moreno Centeno DIMACS 20 th Anniversary.
Flipping letters to minimize the support of a string Giuseppe Lancia, Franca Rinaldi, Romeo Rizzi University of Udine.
Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification Nathan Edwards Center for Bioinformatics and Computational Biology.
Output Sensitive Algorithm for Finding Similar Objects Jul/2/2007 Combinatorial Algorithms Day Takeaki Uno Takeaki Uno National Institute of Informatics,
Tag-based Blind Identification of PTMs with Point Process Model 1 Chunmei Liu, 2 Bo Yan, 1 Yinglei Song, 2 Ying Xu, 1 Liming Cai 1 Dept. of Computer Science.
Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,
ISA Kim Hye mi. Introduction Input Spectrum data (Protein database) Peptide assignment Peptide validation manual validation PeptideProphet.
1 Chapter 6 Reformulation-Linearization Technique and Applications.
CSCI2950-C Genomes, Networks, and Cancer
Chapter 5. Optimal Matchings
Randomized Algorithms CS648
Fast Sequence Alignments
Proteomics Informatics David Fenyő
5.4 T-joins and Postman Problems
Proteomics Informatics David Fenyő
Presentation transcript:

Generating Peptide Candidates from Protein Sequence Databases for Protein Identification via Mass Spectrometry Nathan Edwards Informatics Research

Protein Identification -Turns mass spectrometry into proteomics -Sequence is link to identity, annotation, literature, genomics -Proteomics workflows interrogate more than mass -Quality of AA sequence databases sequence & annotation varies wildly -Protein identification is not BLAST!

LC-MS/MS for Protein Id Tandem MS/MS LC-MS/MS: 1 MS spectrum followed by 2-5 Tandem MS/MS spectra every 5-10 sec.

LC-MS/MS for Protein Id -1 experiment produces 1000’s of MS/MS spectra -Suitable for complex mixtures -100’s-1000’s of proteins identified from a single experiment -High-throughput protein identification!

Sequence Database Search Engines Input:Set of MS/MS spectra and associated parent ion masses Output:Peptide sequence for each spectrum 1.Generate peptide candidates from a protein or genomic sequence database 2.Score and rank the peptide candidates

Sequence Database Search Engines Input:Set of MS/MS spectra and associated parent ion masses Output:Peptide sequence for each spectrum 1.Generate peptide candidates from a protein or genomic sequence database 2.Score and rank the peptide candidates

Peptide Candidate Generation Input:Sequence  (length n), from alphabet A (Additive) mass  (a) for a 2 A Query masses M 1,…,M k Output:All (distinct) pairs of query masses i and subsequences  with

Peptide Candidate Generation and Peptide Id -Sequence databases contain many individual proteins -Must avoid redundant scoring -Protein context is important

Simple Linear Scan MKWVTFISLLFLFSSAYSRGV… Query Mass = … Output: WVTFISLLFLFSSAYSR

Sequential Linear Scan -O(nk) time -Simple to implement -Easy to track protein context -Poor data locality -Redundant candidates -String scanning problem

Simultaneous Linear Scan MKWVTFISLLFLFSSAYSRGV… Max Query Mass = … … Lookup each candidate mass in turn.

Simultaneous Linear Scan -O(k log k + n L log k) time -Simple to implement -Easy to track protein context -Better data locality -Redundant candidates -Now a query mass lookup problem!

Overlap Plot from a LC/MS/MS Experiment

Redundant Candidate Elimination -Must avoid repeat scoring of the same peptide candidate -Want to avoid generating redundant candidates -Non-redundant sequence databases contain lots of substring redundancy!

Substring Density (  )

Redundant Candidate Elimination -Suffix trees represent all distinct substrings of a string. S SS SLFS S S S L F FLFSS SL F LFS

Redundant Candidate Elimination -Suffix trees represent all distinct substrings of a string. S SS SLFS S S S L F FLFSS SL F LFS

Suffix-Tree Traversal -O(k log k + n L  log k) time -Redundancy eliminated -Tricky to implement well -Memory overhead ¼ 5n -Protein context more involved -Data locality hard to quantify -Must preprocess sequence db -Still a query mass lookup problem!

Fast Query Mass Lookup -With (small) integer weights, O(M max + k + n L  O) time is possible -Use a query mass lookup table! -Can we achieve this for real weights and non-uniform tolerances? YES!

Fast Query Mass Lookup  mass

Fast Query Mass Lookup  Candidate mass L R O mass

Fast Query Mass Lookup  L R O mass Candidate mass

Fast Query Mass Lookup  L R O mass Candidate mass

Fast Query Mass Lookup  L R O mass Candidate mass

Fast Query Mass Lookup  Candidate mass L R O mass

Fast Query Mass Lookup  Candidate mass L R O mass

Fast Query Mass Lookup  Candidate mass L R O mass

Fast Query Mass Lookup  L R O mass Candidate mass

Fast Query Mass Lookup -Must have  · I min -Table size is O(M max /  + k I max /  -Practical for typical parameters -Running time: Table construction + O(n L  O) is dominated by size of output

Observations -Peptide candidate generation is a key subproblem. -Must eliminate substring redundancy. -As k increases, peptide candidate generation becomes an interval lookup problem. -Run time dominated by output size.

Sequence Database Search Engines -What if peptide isn’t in database? -Need richer set of peptide candidates -Protein isoforms, sequence variants, SNPs, alternate splice forms -Some have phenotypic or clinical annotations

Swiss-Prot

Swiss-Prot Variant Annotations

Swiss-Prot Sequence

Swiss-Prot -VarSplic enumerates all variants, conflicts, isoforms -Swiss-Prot sequence size: -56 Mb -VarSplic sequence size: -90 Mb -How many more peptide candidates?

Swiss-Prot Variant Annotations Feature viewer Variants

Swiss-Prot VarSplic Output P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF P MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF ******************************************:*****************

Swiss-Prot VarSplic Output P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ P SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ ************************************* *******:*********

Peptide Candidates -Parent ion -Typically < 3000 Da -Tryptic Peptides -Cut at K or R -Search engines -Don’t handle > 4+ well -Long peptides don’t fragment well -# of distinct 30-mers upper bounds total peptide content

Peptide Candidates -At most 2% additional peptides in ~ 1.6 times as much sequence Sequence Database Swiss-ProtVarSplic Size56 Mb90 Mb 30-mers (N 30 )44 Mb45 Mb Overhead27%97%

Sequence Database Compression Construct sequence database that is -Complete -All 30-mers are present -Correct -No other 30-mers are present -Compact -No 30-mer is present more than once

Sequence Database Compression Sequence DatabaseSwiss-ProtVarSplic Original Size55 Mb90 Mb Distinct 30-mers44 Mb45 Mb Overhead27%97% C 3 Size53 Mb54 Mb C 3 Overhead19%20% C 3 Compression93%61% Compression LB79%51%

SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

Sequence Databases & CSBH-graphs -Sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

Sequence Databases & CSBH-graphs -Complete -All edges are on some path -Correct -Output path sequence only -Compact -No edge is used more than once -C 3 Path Set uses all edges exactly once.

Size of C 3 Path Set for k-mers -Each path costs (k-1)-mer + path sequence + EOS -Sequence database with p paths N k + p k -Minimize sequence database size by minimizing number of paths -subject to C 3 constraints

Best case senario… …if CSBH-graph admits an Eulerian path. Sequence database size (k-1) + N k + 1 How many paths are required if the CSBH-graph is not Eulerian?

Non-Eulerian Components -Net degree -b(v) = # in edges - # out edges -Total degree surplus -B + =  b(v)>0 b(v) -For each path -Start node’s net degree +1 -End node’s net degree -1 -Otherwise, net degree: no change -To reduce all nodes to net degree 0, must have at least B + paths.

Components w/ B + (C) == 0 -Balanced component must have Eulerian tour, so require exactly one path. -m balanced components

# Paths Lower Bound The C 3 path set must contain at least B + + m paths. This lower bound is achievable! Just add (B + - 1) “restart” edges to non-Eulerian components

Achieving Path Lower Bound

AA Sequence Databases

Minimum Size C 3 Sequence Database

Implementation -Suitable for use by Mascot, SEQUEST, … -FASTA format -All connection to protein context is lost -Must do exact string search to find peptides in original database

Extensions -Drop compactness constraint! -Reuse edges rather than starting a new path -Similar to the “Chinese Postman Problem” -Solvable to optimality using a network flow formulation.

Other Ideas -We can drop correctness too! -Equivalent to shortest substring on the set of 30-mers -30-mer subsets …containing two tryptic sites? …containing Cysteine? -Smaller suffix-tree oracles for short queries