De Novo Peptide Sequencing via Probabilistic Network Modeling PepNovo.



Peptide Fragmentation
[Figure: Collision-Induced Dissociation (CID) splits the peptide ACFETPGR (N-terminus to C-terminus) into the prefix ACFET, of mass M, and the suffix PGR, of mass PM-M.]

Peptide Fragmentation
• A peptide with mass PM that fragments into a prefix of mass m and a suffix of mass PM-m can produce different fragment ions:

    Prefix ion              Position     Suffix ion     Position
    b                       m+1          y              PM-m+19
    b-H2O                   m-17         y-NH3          PM-m+2
    b+2 (doubly charged)    (m+2)/2      y-H2O-H2O      PM-m-17
    ...

• The intensities at the expected offsets from mass m are used to create an intensity vector.
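The offsets listed above can be turned into an intensity vector with a few lines of code. The following is a minimal sketch, not PepNovo's implementation: the function names, the `tol` matching tolerance, and the rule of taking the strongest peak inside the tolerance are assumptions made for illustration.

```python
def expected_ion_positions(m, pm):
    """Nominal positions of the six fragment ions listed on the slide for a
    cleavage with prefix mass m in a peptide of mass pm."""
    return {
        "b": m + 1,
        "b-H2O": m - 17,
        "b+2": (m + 2) / 2,
        "y": pm - m + 19,
        "y-NH3": pm - m + 2,
        "y-H2O-H2O": pm - m - 17,
    }


def intensity_vector(spectrum, m, pm, tol=0.5):
    """Build the intensity vector for a putative cleavage at prefix mass m.

    spectrum is a list of (m/z, intensity) pairs. For each ion we keep the
    strongest peak within tol of its expected position (an assumed rule),
    or 0.0 when no peak is found.
    """
    vector = {}
    for ion, position in expected_ion_positions(m, pm).items():
        hits = [inten for mz, inten in spectrum if abs(mz - position) <= tol]
        vector[ion] = max(hits, default=0.0)
    return vector


# Toy spectrum with a b peak and a y peak for m = 500, PM = 900.
toy_spectrum = [(501.1, 12.0), (419.2, 30.0), (250.0, 5.0)]
toy_vector = intensity_vector(toy_spectrum, m=500, pm=900)
```

In the toy spectrum only the b and y positions are matched; the peak at 250.0 is a full unit away from the expected b+2 position of 251.0, so that entry stays at 0.0.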

The Spectrum Graph

Scoring for De Novo Sequencing
• All masses in the spectrum range can be considered putative cleavage sites.
• Given the observed intensities, how do we evaluate whether mass m is a cleavage site?
• A common statistical tool used by many scoring functions is the likelihood ratio test (Dancik et al. '99, Havilio et al. '03, ...).

Dancik et al. '99 – Hypotheses
• The main concept: give a premium for present peaks and penalties for missing peaks.
• Fragmentation hypothesis: uses a probability table:

    Fragment      Probability
    y             0.71 (P_1)
    b             0.66 (P_2)
    a             0.26 (P_3)
    ...
    y-H2O-H2O     0.09 (P_k)

• Random hypothesis: P_R, the probability of observing a random peak (~0.1).

Scoring a Cleavage Site (Dancik '99)
• Out of k possible ions for cleavage at m, t are detected (w.l.o.g. fragments 1,...,t) and k-t are missing (t+1,...,k).
• Score using a log ratio test: the log of the probability of cleavage site m according to the Fragmentation hypothesis divided by the probability of cleavage site m according to the Random hypothesis.
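The formula itself did not survive in this transcript. A minimal sketch of the test, assuming the ions are treated as independent so that each detected ion i contributes log(P_i / P_R) and each missing ion contributes log((1 - P_i) / (1 - P_R)), could look like this. The function name and the three-ion table are illustrative; the probabilities are the ones quoted on the previous slide.

```python
from math import log


def dancik_score(detected, ion_probs, p_random=0.1):
    """Log likelihood ratio of the Fragmentation vs. the Random hypothesis.

    detected  : set of ion names with a peak at the expected offset
    ion_probs : ion name -> probability of observing it at a true cleavage
    p_random  : probability of observing a random peak (P_R)
    """
    score = 0.0
    for ion, p in ion_probs.items():
        if ion in detected:
            score += log(p / p_random)              # premium for a present peak
        else:
            score += log((1 - p) / (1 - p_random))  # penalty for a missing peak
    return score


# Probabilities for the first three rows of the slide's table.
ION_PROBS = {"y": 0.71, "b": 0.66, "a": 0.26}

score_with_by = dancik_score({"y", "b"}, ION_PROBS)
score_with_none = dancik_score(set(), ION_PROBS)
```

With both y and b present the score is positive; with no ion detected every term is a penalty and the score is negative.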

PepNovo Scoring
• PepNovo implements a similar likelihood ratio test mechanism.
• It can be viewed as extending the scoring model of Dancik et al. '99.
• It includes several factors that are not sufficiently addressed in current scoring functions.

Enhancements to Dancik et al. ('99)
1. Several intensity values.
2. Combinations of fragment ions.
3. Incorporation of additional chemical knowledge (e.g., preferred cleavage sites).
4. Positional influence of the cleavage site.
5. Improved random model.

H_CID - Fragmentation Network
[Figure: probabilistic network for H_CID. Nodes: pos(m) (region in peptide), N-aa (N-terminal amino acid), C-aa (C-terminal amino acid), and the fragment ions y, b, y2, a, b2, a-NH3, a-H2O, b-NH3, b-H2O, y-NH3, y-H2O, b-H2O-NH3, b-H2O-H2O, y-H2O-NH3, y-H2O-H2O. Edges are annotated as amino acid influence, ion combinations, and positional influence. An example conditional probability table gives P(y2 | y, pos).]

Discrete Intensity Values
• Peak intensity is normalized according to the grass level (average of the weakest 33% of peaks in the spectrum).
• Normalized intensities are discretized into 4 intensity levels:
    zero:   I < 0.05
    low:    0.05 ≤ I < 2 (62% of peaks)
    medium: 2 ≤ I < 10 (26% of peaks)
    high:   I ≥ 10 (12% of peaks)
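The normalization and the four thresholds translate directly into code. This is a small sketch with hypothetical function names; it assumes strictly positive peak intensities, so the grass level is never zero, and it rounds the "weakest 33%" down to a whole number of peaks.

```python
def grass_level(intensities, fraction=0.33):
    """Average intensity of the weakest `fraction` of the peaks."""
    ranked = sorted(intensities)
    n_weak = max(1, int(len(ranked) * fraction))  # at least one peak
    weakest = ranked[:n_weak]
    return sum(weakest) / len(weakest)


def discretize(intensity, grass):
    """Map a raw intensity to one of the four levels on the slide."""
    normalized = intensity / grass
    if normalized < 0.05:
        return "zero"
    if normalized < 2:
        return "low"
    if normalized < 10:
        return "medium"
    return "high"


# Toy spectrum intensities: the two weakest peaks (2.0 and 4.0) set the grass level.
toy_intensities = [2.0, 4.0, 5.0, 8.0, 10.0, 20.0, 30.0, 40.0, 90.0]
toy_grass = grass_level(toy_intensities)
toy_levels = [discretize(i, toy_grass) for i in toy_intensities]
```

A missing peak can be passed in as intensity 0.0, which always maps to the zero level.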

Combinations of Fragments
• Different combinations have significantly different probabilities:
    P(b=high | y=high) = 0.36 vs. P(b=high | y=low) = [value missing]
    P(b-H2O > zero | b=high) = 0.5 vs. P(b-H2O > zero | b=zero) = [value missing]
[Figure: fragment-ion nodes of the network: y, b, y2, a, b2, a-NH3, a-H2O, b-NH3, b-H2O, y-NH3, y-H2O, b-H2O-NH3, b-H2O-H2O, y-H2O-NH3, y-H2O-H2O.]

Additional Chemical Knowledge
• The identity of the flanking amino acids influences the peak intensities:
    increased intensities N-terminal to Proline and Glycine;
    increased intensities C-terminal to Aspartic Acid.
• 400 amino acid combinations are reduced to 15 equivalence sets (X-P, X-G, etc.).
[Figure: network nodes N-aa (N-terminal amino acid) and C-aa (C-terminal amino acid), shown with y and b.]

Positional Influence
• Creates separate models for different locations in the peptide.
• Models phenomena such as:
    weak b/y ions near the ends;
    prevalence of a-ions in the first half of the peptide;
    prevalence of b2 towards the peptide's C-terminus and y2 near the N-terminus.
[Figure: network node pos(m) (region in peptide), shown with y, b, y2, a, b2.]

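Probability under H_CID
• From the decomposition properties of probabilistic networks, each node is independent of the rest of the nodes given the values of its parents, so the probability of the intensity vector factors as $P(\vec{I} \mid H_{CID}) = \prod_f P(f \mid \pi(f))$, where $\pi(f)$ are the parents of node f.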
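The factorization over parents can be illustrated on a three-node chain y → b → b-H2O. This is only a toy sketch: the network shape, the two-level intensity scale, and the function name are assumptions, and apart from the two values quoted on the "Combinations of Fragments" slide (0.36 and 0.5) every probability below is made up.

```python
from math import prod


def network_probability(values, parents, cpt):
    """Product over nodes f of P(value of f | values of the parents of f)."""
    return prod(
        cpt[f][tuple(values[p] for p in parents[f])][values[f]]
        for f in values
    )


# Hypothetical chain: y has no parent, b depends on y, b-H2O depends on b.
parents = {"y": (), "b": ("y",), "b-H2O": ("b",)}

# Conditional probability tables, keyed by the tuple of parent values.
cpt = {
    "y": {(): {"high": 0.3, "other": 0.7}},                   # made up
    "b": {
        ("high",): {"high": 0.36, "other": 0.64},             # 0.36 from the slides
        ("other",): {"high": 0.1, "other": 0.9},              # made up
    },
    "b-H2O": {
        ("high",): {"present": 0.5, "absent": 0.5},           # 0.5 from the slides
        ("other",): {"present": 0.2, "absent": 0.8},          # made up
    },
}

observed = {"y": "high", "b": "high", "b-H2O": "present"}
p_observed = network_probability(observed, parents, cpt)  # 0.3 * 0.36 * 0.5
```

Each node contributes exactly one table lookup, which is what keeps the number of parameters manageable compared with a full joint table over all ions.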

H_Random – Regional Density
[Figure: peaks of different intensity levels along the m/z axis; a bin of width 2ε lies inside a window of width w.]

Computing the Random Probability
• The probability of a single peak missing the bin is 1 - (2ε)/w.
• Let n_i, 1 ≤ i ≤ d, be the count of peaks with intensity level i in window w; the random probability of each intensity level is computed from these counts.
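The formula that followed is not in the transcript. One way to compute it under the stated independence assumption is sketched below. Two things are assumptions here: the name `alpha` for the probability 1 - (2ε)/w, and the rule that the level observed in the bin is that of the strongest peak landing in it.

```python
def random_level_probabilities(counts, eps, w):
    """Probability of each intensity level appearing in a bin at random.

    counts[i - 1] is n_i, the number of peaks with intensity level i
    (1..d) in the window of width w; the bin has width 2 * eps.
    Returns a list p where p[0] is the probability of seeing no peak and
    p[t] is the probability that the strongest peak in the bin has level t.
    """
    alpha = 1 - (2 * eps) / w  # a single peak misses the bin
    d = len(counts)
    probs = [0.0] * (d + 1)
    probs[0] = alpha ** sum(counts)  # every peak misses the bin
    for t in range(1, d + 1):
        stronger = sum(counts[t:])   # peaks with a level above t
        # No stronger peak lands in the bin, and at least one level-t peak does.
        probs[t] = alpha ** stronger * (1 - alpha ** counts[t - 1])
    return probs


# Toy window: 6 low, 3 medium and 1 high peak, w = 100, bin width 2 * 0.5 = 1.
toy_probs = random_level_probabilities([6, 3, 1], eps=0.5, w=100)
```

The terms telescope, so the returned probabilities sum to 1 for any choice of counts.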

Random Model for H_Random
• Peak occurrences are treated as random independent events.
• The probability of observing a peak at random is estimated from the local density of peaks in the spectrum.

The Likelihood Ratio Score
• A putative cleavage site is scored with the log ratio test of H_CID against H_Random.
• A peptide can be scored by summing the scores of its prefix masses.
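Written out, the two statements would take roughly the following form. The notation is an assumption: $\vec{I}$ stands for the intensity vector observed at the offsets of mass m, and $m_1, \dots, m_n$ for the prefix masses of the peptide.

```latex
\mathrm{Score}(m) = \log \frac{P(\vec{I} \mid m, H_{CID})}{P(\vec{I} \mid m, H_{Random})},
\qquad
\mathrm{Score}(\text{peptide}) = \sum_{i=1}^{n} \mathrm{Score}(m_i)
```

A positive score means the observed intensities are better explained by a true cleavage at m than by random peaks.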

PepNovo's De Novo Sequencing
• A spectrum graph is created from the experimental MS/MS spectrum.
• The nodes are scored using our method.
• The highest-scoring anti-symmetric path is found using a dynamic programming algorithm.

Spectrum Graph
• Acyclic graph.
• Nodes are cleavage sites; each has a mass m and a score s.
• Edges connect nodes with mass differences corresponding to an amino acid.
[Figure: example spectrum graph with nodes (m: 0, s: 5.0), (m: 71.2, s: 4.3), (m: 99.1, s: 8.1), (m: 113, s: -1.2), (m: 163.2, s: 2.8), (m: 199.4, s: 5.6) and edges labeled A, L, V, S, W, Q.]
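A simplified sketch of the path search on such a graph follows. It is not PepNovo's algorithm: it finds the highest-scoring path from the lightest to the heaviest node with plain dynamic programming and ignores the anti-symmetry constraint. The five-entry residue table, the tolerance `tol`, and the toy nodes are assumptions for illustration.

```python
# A few monoisotopic residue masses in Da; a full table would list all twenty.
RESIDUE_MASS = {"G": 57.021, "A": 71.037, "S": 87.032, "V": 99.068, "L": 113.084}


def best_path(nodes, tol=0.05):
    """Return (score, sequence) of the best path from the lightest to the
    heaviest node, or None if the two are not connected.

    nodes is a list of (mass, score) pairs. An edge joins two nodes when
    their mass difference matches a residue mass within tol.
    """
    nodes = sorted(nodes)          # mass order is a topological order
    n = len(nodes)
    best = [float("-inf")] * n     # best[j]: top score of a path ending at j
    back = [None] * n              # back[j]: (previous node, edge label)
    best[0] = nodes[0][1]
    for j in range(1, n):
        for i in range(j):
            if best[i] == float("-inf"):
                continue           # node i is not reachable from the start
            diff = nodes[j][0] - nodes[i][0]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol and best[i] + nodes[j][1] > best[j]:
                    best[j] = best[i] + nodes[j][1]
                    back[j] = (i, aa)
    if best[-1] == float("-inf"):
        return None
    letters, j = [], n - 1
    while back[j] is not None:     # walk back from the heaviest node
        i, aa = back[j]
        letters.append(aa)
        j = i
    return best[-1], "".join(reversed(letters))


# Toy graph: two routes reach mass 128.058 (G then A, or A then G).
toy_nodes = [(0.0, 0.0), (57.021, 2.0), (71.037, -1.0), (128.058, 3.0), (215.090, 1.5)]
toy_result = best_path(toy_nodes)
```

On the toy graph the route through the node at 57.021 wins because that node scores 2.0 while the node at 71.037 scores -1.0, so the result is the sequence GAS with a total score of 6.5.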

Results
[Table: columns Algorithm, Average Accuracy, Sequence Length, Tag 3, Tag 4, Tag 5, Tag 6; rows PepNovo, Sherenga, Peaks, Lutefisk. The numeric values are missing.]
Benchmarking reported for 280 spectra.

Q & A