Tag-based Blind Identification of PTMs with Point Process Model 1 Chunmei Liu, 2 Bo Yan, 1 Yinglei Song, 2 Ying Xu, 1 Liming Cai 1 Dept. of Computer Science.

Presentation transcript:

Tag-based Blind Identification of PTMs with Point Process Model 1 Chunmei Liu, 2 Bo Yan, 1 Yinglei Song, 2 Ying Xu, 1 Liming Cai 1 Dept. of Computer Science 2 Dept. of Biochemistry and Molecular Biology, University of Georgia, USA

2 Tandem mass spectra of peptides e.g., MS/MS of GLSDGEWQQVLNVWGK (

3 Tandem mass spectra of peptides (a tutorial from Mass b1 + Mass y8 = Mass total
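The complementarity of b and y ions on this slide can be checked numerically. The following is a minimal sketch, assuming singly charged ions and standard monoisotopic masses; the 9-residue peptide and the function names are hypothetical and chosen only so that b1 and y8 are complementary. Under this convention the two fragment values sum to the neutral peptide mass plus two protons.

```python
# Standard monoisotopic residue masses in Da (only the residues used below).
RESIDUE = {
    "G": 57.02146, "L": 113.08406, "S": 87.03203, "D": 115.02694,
    "E": 129.04259, "W": 186.07931, "Q": 128.05858, "K": 128.09496,
}
WATER = 18.01056   # H2O
PROTON = 1.00728   # charge carrier for a singly charged ion


def b_ion(peptide, i):
    """m/z of the singly charged b_i ion (the first i residues)."""
    return sum(RESIDUE[a] for a in peptide[:i]) + PROTON


def y_ion(peptide, i):
    """m/z of the singly charged y_i ion (the last i residues)."""
    start = len(peptide) - i
    return sum(RESIDUE[a] for a in peptide[start:]) + WATER + PROTON


def neutral_mass(peptide):
    """Neutral monoisotopic mass of the intact peptide."""
    return sum(RESIDUE[a] for a in peptide) + WATER


peptide = "GLSDGEWQK"  # hypothetical 9-residue example
total = b_ion(peptide, 1) + y_ion(peptide, 8)
print(round(total - neutral_mass(peptide), 5))  # difference is two proton masses
```

The same relation holds for every split point i: b_i and y_(n-i) together cover each residue exactly once.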

4 Peptide sequencing. De novo sequencing: directly infer the target peptide from its MS data [Fernandez et al 1992; Dancik et al 1999; Chen et al 2001; Searle et al 2001; Ma et al 2003; Liu et al 2006]. Drawbacks: sensitive to the quality of the MS data (noise, missing peaks) and computationally difficult. DB search based: compare the target MS with theoretical MS in a peptide database [Eng et al 1994; Perkins et al 1999]. Drawbacks: slow; the target may not be in the database, or may have been modified after translation.

5 Post-translational modification (PTM)

6 Identifying PTMs in peptide sequencing. Limited modification types: assume a limited set of modification types and model them as pseudo amino acids [Yates et al 1995; Wilkins et al 1999; Tanner et al 2005]; regular sequencing tools can be applied, but PTMs of unknown types may be processed erroneously. Blind identification (unlimited modification types): spectral alignment based (difficult) [Pevzner et al 2000; Tsur et al 2005; Yan et al 2006]; de novo sequencing dependent [Han et al 2005].

7 This work. DB search based PTM identification (point process model, Yan et al 2006). Yan et al 2006 is comparable to Tsur et al 2005; both are the best blind PTM identification programs. Yan et al 2006 is faster and also reports hits of homologs. Peptide tag-based filtering of the database. Graph-theoretic approach to generate tags.

8 Our approach: details. Input: an experimental spectrum. Output: a peptide sequence and possible PTMs. Steps: (1) Construct an extended spectrum graph, find all maximum weighted anti-symmetric paths, and select tags from the paths. (2) Construct a DFA from the tags and use it to filter the peptide database to obtain candidate peptides. (3) Apply the point process model to the candidates to identify the peptide and potential PTMs by maximizing the spectral alignment score.

9 Spectrum graph. (Figure: a tandem mass spectrum, intensity vs. m/z, with b and y peaks, and the spectrum graph built from it, with vertices 1, 2, ..., 2n between a source and a sink.) Two vertices are connected when their mass difference is the mass of a single amino acid. De novo sequencing corresponds to finding a longest directed anti-symmetric path from source to sink [Dancik et al 1999, etc.]

10 Extended spectrum graph [Liu et al 2006]. Let an MS/MS spectrum S of a peptide P be a set of mass peaks, with parent mass M. If the difference between two peak masses is the mass of a single amino acid, connect the corresponding vertices with a directed edge. Connect each pair of complementary vertices with an undirected edge.

11 Extended spectrum graph. (Figure: (a) a spectrum, intensity vs. mass/charge; (b) the extended spectrum graph built from it, with edges labeled A, M, R, L.) Parent mass = 471. Peptide: AMRL/LRMA
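The AMRL/LRMA example can be reproduced with a small sketch. It assumes nominal (integer) residue masses and treats each peak directly as a vertex, ignoring water and proton offsets; the peak list and function names are hypothetical. It enumerates anti-symmetric paths by brute force, which is only practical for tiny graphs and is not the tree-decomposition dynamic programming described in this talk.

```python
# Nominal residue masses. L also stands for I, and K for Q (same nominal mass).
NOMINAL = {57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
           113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
           137: "H", 147: "F", 156: "R", 163: "Y", 186: "W"}


def build_graph(peaks, parent_mass):
    """Return directed edges and the complementary-vertex pairing."""
    vertices = sorted(set(peaks) | {0, parent_mass})
    # Directed edge u -> v when v - u is the mass of a single amino acid.
    directed = {u: [(v, NOMINAL[v - u]) for v in vertices if v - u in NOMINAL]
                for u in vertices}
    # Undirected edge between two peaks whose masses sum to the parent mass.
    complement = {u: parent_mass - u for u in vertices
                  if 0 < u < parent_mass and parent_mass - u in vertices}
    return directed, complement


def antisymmetric_paths(peaks, parent_mass):
    """All source-to-sink paths using at most one vertex per complementary pair."""
    directed, complement = build_graph(peaks, parent_mass)
    results = []

    def extend(u, used, letters):
        if u == parent_mass:
            results.append("".join(letters))
            return
        for v, residue in directed[u]:
            if complement.get(v) in used:
                continue  # the complementary vertex is already on the path
            extend(v, used | {v}, letters + [residue])

    extend(0, {0}, [])
    return sorted(results)


peaks = [71, 113, 202, 269, 358, 400]  # prefix and suffix masses of AMRL
print(antisymmetric_paths(peaks, 471))  # ['AMRL', 'LRMA']
```

The two results are the peptide read from either end, which is why the slide reports AMRL/LRMA: the spectrum alone does not say which peaks are prefix ions and which are suffix ions.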

12 Tag selection for the target peptide. Tag: a short sequence of amino acids. Previous work: PepNovo [Frank et al 2005] applies de novo sequencing algorithms first and identifies tags from the sequenced peptide. Advantage: effective. Disadvantages: the presence of noise, missing peaks, and PTMs makes it hard to improve the effectiveness; slow.

13 Tag selection for the target peptide. In this work: construct an extended spectrum graph (a mixed graph) from the target spectrum; tree-decompose the graph; use dynamic programming to find all maximum weighted anti-symmetric paths. Advantages: fast and effective, tolerating noise and missing peaks.

14 Graph Tree Decomposition. (Figure: a graph on vertices a to h, and a tree decomposition of it in which each tree node is a bag of graph vertices.)

15 Properties of tree decomposition. 1. Each vertex is contained in at least one bag. (Figure: the example graph and its tree decomposition.)

16 Properties of tree decomposition. 1. Each vertex is contained in at least one bag. 2. For any edge {g, f}: there is a bag containing both g and f. (Figure: the example graph and its tree decomposition.)

17 Properties of tree decomposition. 1. Each vertex is contained in at least one bag. 2. For any edge {g, f}: there is a bag containing both g and f. 3. For every vertex c: the bags that contain c form a connected subtree. (Figure: the example graph and its tree decomposition.)

18 Tree Width. Tree width of a tree decomposition: the maximum bag size minus one. Tree width of a graph: the minimum width over all tree decompositions of the graph. (Figure: two tree decompositions of the example graph, one with tree width = 2 and one, with bags {a, b, c, d, e} and {a, c, f, g, h}, with tree width = 4.)
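The three properties and the width definition can be checked mechanically. The sketch below is illustrative only: the small graph, the bag contents, and the function names are hypothetical (the graph drawn on the slides is not reproduced here), and the code assumes the given tree edges really form a tree, which it does not verify.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the three tree-decomposition properties.

    bags maps a tree-node id to a set of graph vertices;
    tree_edges lists pairs of tree-node ids (assumed to form a tree).
    """
    # Property 1: every vertex is contained in at least one bag.
    covered = set().union(*bags.values())
    if set(vertices) - covered:
        return False
    # Property 2: for every graph edge, some bag contains both endpoints.
    for u, v in edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    # Property 3: the bags containing a vertex form a connected subtree.
    for v in vertices:
        nodes = {i for i, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for a, b in tree_edges:
                for p, q in ((a, b), (b, a)):
                    if p == x and q in nodes and q not in seen:
                        seen.add(q)
                        stack.append(q)
        if seen != nodes:
            return False
    return True


def width(bags):
    """Width of a decomposition: maximum bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1


vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
good = {0: {"a", "b", "c"}, 1: {"c", "d"}}
print(is_tree_decomposition(vertices, edges, good, [(0, 1)]), width(good))  # True 2
```

Putting every vertex into a single bag always satisfies the three properties, but gives width n - 1; the tree width of the graph is the best width achievable.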

19 Tree bags are separators. Internal tree bags in a tree decomposition are separators of the graph. (Figure: the example graph and its tree decomposition.)

20 Tree bags are separators. Tree bags in a tree decomposition are separators of the graph. This allows efficient dynamic programming. (Figure: the example graph split by a separator bag.)

21 Dynamic programming. A table is maintained for each bag. Tables are computed bottom up. Each table contains partial optimal solutions; the root table contains the optimal one. Time complexity: O(6^t n^2), where t is the tree width.

22 Dynamic programming (cont'd). (Figure: tables attached to the bags of the tree decomposition, computed bottom-up.)

23 Score scheme and reliability of sequence tags. The score scheme of [Dancik et al 1999] is assigned as weights to the edges in spectrum graphs. Overall reliability of a tag t_i = w_1 r_1(t_i) + w_2 r_2(t_i), where r_1(t_i) is the reliability computed from t_i's normalized edge weights and r_2(t_i) is the reliability computed from the autocorrelation score [Liu et al 2005]. Refer to the paper for details.

24 PTM identification with point process blind search Find a set of PTMs to maximize the spectral alignment Can identify all possible PTMs through one round of cross-correlation calculation Computation time is independent of the number of PTMs

25 PTM identification with point process blind search. Treat a spectrum and the theoretical spectrum of a candidate peptide as one point process: x(t) = \sum_{i=1}^{N} \delta(t - t_i), where {t_i} is a set of mass locations with N peaks, and δ is the Kronecker delta function: δ(t) = 1 if t = 0, and 0 otherwise. Assume there are K PTMs; then {t_i} can be clustered into K+1 groups:

26 PTM identification with point process blind search When a PTM happens, a shift occurs to x k (t) to produce y k (t) Use C[.] to denote the total number of non-zero values in a point process:

27 PTM identification with point process blind search. For K=1, ∆ represents the mass of a possible PTM; we report the top candidate together with the ∆ that gives the maximum spectral alignment score.
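For the single-PTM case, the idea of finding ∆ in one round of cross-correlation can be illustrated with a simplified sketch. It assumes integer peak masses and scores a shift by counting coinciding peaks; the function name, the peak lists, and the tie-breaking rule are hypothetical, and the scoring in Yan et al 2006 is more elaborate than this count.

```python
from collections import Counter


def best_single_shift(theoretical, experimental):
    """Return (delta, score) for the best non-zero mass shift.

    The cross-correlation of two peak sets is the histogram of all
    pairwise mass differences. The bin at 0 counts unmodified matches;
    the largest other bin is the candidate PTM mass delta.
    """
    diffs = Counter(e - t for e in experimental for t in theoretical)
    unshifted = diffs.get(0, 0)
    candidates = {d: c for d, c in diffs.items() if d != 0}
    if not candidates:
        return None, unshifted
    # Prefer the highest count; break ties with the smallest absolute shift.
    delta = max(candidates, key=lambda d: (candidates[d], -abs(d)))
    return delta, unshifted + candidates[delta]


theoretical = [71, 202, 358, 471]    # nominal prefix masses of AMRL
experimental = [71, 218, 374, 487]   # same peptide with +16 on the M residue
print(best_single_shift(theoretical, experimental))  # (16, 4)
```

All candidate shifts are scored from the same histogram, so the cost does not grow with the number of modification types considered.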

28 Evaluations. Datasets: (1) 2657 annotated yeast ion trap tandem mass spectra from OPD (Prince et al 2004), with relatively low mass resolution; (2) 2620 modified spectra, each with one artificially added PTM (Yan et al 2006). Experiments: sequence tag generation; database search via a DFA-based model; blind PTM identification.

29 Performance in tag selection. Table columns: Tag length | Algorithm | Top 1 | Top 3 | Top 5 | Top 10 | Top 25 | Time (s). Rows compare Ours and PepNovo for each tag length (starting at 3), first without PTM and then with PTM. (Table values not recovered.) Columns: percentages of spectra that have at least one correct tag in the top 1, 3, 5, 10, 25. Comparisons are based on the sequencing results by SEQUEST [Eng et al 1994].

30 Performance in tag selection (cont'd). Time complexity of the tag selection depends on the tree width t of spectrum graphs: O(6^t n^2). About 90% of such graphs have tree width not exceeding 6. More than 10 times faster than PepNovo [Frank et al 2005].

31 Database search for PTM identification. Construct a DFA from the selected sequence tags and use it to filter a peptide database. Only a small portion of the peptides will remain. The point process model for PTM identification is then applied to identify the peptide and potential PTMs.
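The filtering step can be sketched as a multi-pattern matcher. The slides do not say how the DFA is built, so the following uses an Aho-Corasick automaton (a trie with failure links) as an assumed stand-in: it scans each peptide once and keeps it if any tag occurs as a substring. The function names, tags, and peptide list are hypothetical.

```python
from collections import deque


def build_automaton(tags):
    """Build trie transitions, failure links, and accepting flags."""
    goto, fail, accept = [{}], [0], [False]
    for tag in tags:
        state = 0
        for ch in tag:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                accept.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accept[state] = True
    # Breadth-first pass to fill in failure links below the first level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also accepts if its longest proper suffix state accepts.
            accept[nxt] = accept[nxt] or accept[fail[nxt]]
            queue.append(nxt)
    return goto, fail, accept


def contains_tag(peptide, automaton):
    """True if the peptide contains at least one tag as a substring."""
    goto, fail, accept = automaton
    state = 0
    for ch in peptide:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if accept[state]:
            return True
    return False


def filter_database(peptides, tags):
    """Keep only the peptides that contain at least one tag."""
    automaton = build_automaton(tags)
    return [p for p in peptides if contains_tag(p, automaton)]


database = ["GLSDGEWQQVLNVWGK", "AMRLK", "PEPTIDEK"]
print(filter_database(database, ["WQQ", "MRL"]))  # ['GLSDGEWQQVLNVWGK', 'AMRLK']
```

Each peptide is scanned in time linear in its length regardless of how many tags are loaded, which is what makes the filter cheap compared with running the point process model on the whole database.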

32 Performance in PTM identification. Table columns: Tag length | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 | Filtration ratio | T (s); the last row is the search without filtration. (Table values not recovered.) Columns: cumulative percentages of search results capturing the target peptides exactly in Top i; T is the total time for all 2620 experimental spectra. Comparisons are with Yan et al 2006, which does not employ filtration.

33 Summary. A new graph-theoretic approach for peptide tag selection: effective and efficient. Combined with the point process model to sequence peptides and identify PTMs: effective and efficient. More tests are needed (e.g., two PTMs). Tree decomposition based approaches have not been fully exploited (e.g., for improving tag selection effectiveness).

34 Acknowledgement Chunmei Liu Yinglei Song Bo Yan Ying Xu NSF NIH