De Novo Sequencing and Homology Searching with De Novo Sequence Tags.


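Possible Ways to Interpret MS/MS Data [diagram]: MS/MS spectra can be interpreted by database search against a protein DB (yielding peptides), by de novo sequencing, or by de novo sequencing followed by an inexact homology search against a protein DB (yielding homologous peptides).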

Why Bother?
De novo sequencing derives the sequence without looking into a database. It is useful for:
– unsequenced genomes (no protein database)
– novel peptides (unmatched spectra after database search)
– single amino acid polymorphisms
– unexpected PTMs
– database errors
– validating a database match

Outline
– Basics
– Manual De Novo Sequencing
– De Novo Sequencing Algorithm (PEAKS)
– Homology Search with De Novo Tags

Sequence-specific fragment ions [figure]: the doubly protonated precursor (M+2H)2+ cleaves at a peptide bond to give a b-ion, [N-term]-NH-CHR-CO+, and a protonated y-ion, NH2-[C-term] + H+. A b-ion can lose CO to give an a-ion, [N-term]-NH=CHR+. Each of the a-, b-, and y-ions may also appear with a loss of NH3 or H2O.

Non-sequence-specific fragmentations

Why does everyone analyze positively-charged tryptic peptides?
– Positively-charged peptide ions usually give better sensitivity.
– "Mobile protons" protonate peptide bonds and promote b/y fragmentation.
– Arg sequesters protons in the gas phase.
– Tryptic peptides typically have 0 or 1 Arg.
– Tryptic peptide ions typically have two protons.
– Therefore, at least one proton usually stays mobile, and tryptic peptides usually have b/y ions.
– Placing Arg at the C-terminus makes it more likely that a complete series of y-ions will be observed.

MS/MS spectrum of a doubly-charged tryptic peptide, YLYEIAR (one Arg and two protons) [spectrum: y1 to y6 are labeled, together with a2 and b2]

MS/MS spectrum of a doubly-charged non-tryptic peptide (two Arg's and two protons) [spectrum, relative abundance (%): residue labels Y S R R H P E; labeled peaks include y1 to y4, b4, a5-17, (b5+18)2+, (b6+18)2+, and (M+2H)2+]

CID in traps vs quadrupoles [figure: ion trap and Q-TOF MS/MS spectra of IPIGFAGAQGGFDTR, relative abundance vs m/z, with the b- and y-ion series and (M+2H)2+ labeled]

Annoying things to remember when sequencing peptides by MS/MS
– Leucine and isoleucine have the same mass.
– Glutamine and lysine differ by about 0.036 u.
– Phenylalanine and oxidized methionine differ by about 0.033 u.
– Cleavages do not occur at every bond (more often than not, there is no cleavage between the first and second residues).
– Certain amino acids have the same mass as pairs of other amino acids: G + G = N, A + G = Q, G + V ~ R, A + D ~ W, S + V ~ W.
– However, good mass accuracy resolves many of these ambiguities (all except the exact ones: Leu/Ile, G + G = N, and A + G = Q).
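To make the size of these near-coincidences concrete, here is a minimal Python sketch that computes them from standard monoisotopic residue masses. The table, the helper names (`mass`, `unresolved`) and the tolerance values are illustrative assumptions, not part of the original slides.

```python
# Monoisotopic residue masses in unified atomic mass units (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "M": 131.04049, "F": 147.06841, "R": 156.10111, "W": 186.07931,
}
OXYGEN = 15.99491  # mass added when methionine is oxidized


def mass(residues: str) -> float:
    """Sum of the monoisotopic residue masses of a short segment."""
    return sum(MONO[r] for r in residues)


# (left, right, label) for the residue-level coincidences named on the slide.
COINCIDENCES = [
    ("Q", "K", "Q vs K"),
    ("GG", "N", "G+G vs N"),
    ("AG", "Q", "A+G vs Q"),
    ("GV", "R", "G+V vs R"),
    ("AD", "W", "A+D vs W"),
    ("SV", "W", "S+V vs W"),
]


def unresolved(tolerance: float) -> list[str]:
    """Labels of the coincidences a given mass tolerance cannot separate."""
    return [label for left, right, label in COINCIDENCES
            if abs(mass(left) - mass(right)) <= tolerance]


phe_vs_oxidized_met = MONO["F"] - (MONO["M"] + OXYGEN)

for left, right, label in COINCIDENCES:
    print(f"{label}: {abs(mass(left) - mass(right)):.4f} u")
print(f"F vs oxidized M: {phe_vs_oxidized_met:.4f} u")
print(unresolved(0.5))    # a unit-resolution view cannot separate any of them
print(unresolved(0.005))  # only the exact coincidences remain
```

With a tolerance of 0.005 u only G+G vs N and A+G vs Q stay ambiguous, which is the sense in which mass accuracy resolves "many" but not all of the cases.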

Outline
– Basics
– Manual De Novo Sequencing
– De Novo Sequencing Algorithm (PEAKS)
– Homology Search with De Novo Tags

Two approaches to manually sequencing peptides from MS/MS spectra
1. Finding a series of ions in the middle of the peptide, and working out towards the termini (illustrated using ion trap data)
2. Finding the C-terminus and working towards the N-terminus (illustrated using Q-TOF data)

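Sequencing from the middle: look for ion series in the region above the precursor ion (m/z 615). A doubly-charged precursor at m/z 615 corresponds to a peptide MW of about 2 × 615 - 2 = 1228, the value used in the calculations that follow.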

An obvious series is the one that involves the more abundant fragment ions (m/z 575, 688, 775, 888, and 987). The successive mass differences (113, 87, 113, 99) spell LSLV, where L stands for Leu or Ile.
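Reading a partial sequence off an ion ladder is just a lookup of consecutive mass differences. The sketch below does this with nominal (integer) residue masses; the table and the function name `read_ladder` are illustrative assumptions, Ile is folded into L and Gln into K because they share a nominal mass with Leu and Lys.

```python
# Nominal residue masses; "L" stands for Leu/Ile and "K" for Lys/Gln.
NOMINAL = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def read_ladder(peaks):
    """Translate the gaps between consecutive peaks into residue letters.

    A gap that matches no single residue is reported as '?'.
    """
    ordered = sorted(peaks)
    gaps = [high - low for low, high in zip(ordered, ordered[1:])]
    return "".join(NOMINAL.get(gap, "?") for gap in gaps)


series = [575, 688, 775, 888, 987]
print(read_ladder(series))  # LSLV
```

The letters come out in order of increasing m/z; whether that is the N-to-C or the C-to-N direction depends on whether the ladder is a b- or a y-series, which the ladder alone does not reveal.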

Another ion series contains pairs separated by 18 Da (water losses); it spells LVES.

Two ion series have been identified in the region above the precursor ion
Problem: Two ion series defining partial sequences LSLV and LVES have been identified, but it is not known if these are y- or b-ions (i.e., the sequence direction is unknown).
Solution: Since ion trap data often exhibit high mass b-ions, check to see if the highest mass ion in either series corresponds to a loss of either Arg or Lys (the usual tryptic C-terminus). If not, check to see if the mass difference corresponds to a dipeptide containing Lys or Arg (it is possible that the b-ion defining the C-terminus is absent).
Calculation: Peptide MW - 17 - fragment ion = C-terminal residue mass. (A b-ion is the sum of its residue masses plus 1, and the peptide MW is the sum of all residue masses plus 18, so the two differ by the missing C-terminal residues plus 17.)

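For the first ion series (LSLV): 1228 - 17 - 987 = 224 Da. This is not the residue mass of Lys or Arg, and as a dipeptide it leaves 224 - 128 = 96 or 224 - 156 = 68, neither of which is an amino acid residue mass. Therefore this does not look like a b-ion series.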

For the second ion series (LVES): 1228 - 17 - 1083 = 128 (the residue mass of Lys). This looks like a b-ion series, so maybe the other one is a y-ion series. The partial sequence becomes LVESK.
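The C-terminal check applied to both ion series can be sketched as a small function. It uses nominal residue masses; the name `c_terminal_explanations` and the choice to report a dipeptide as "other residue followed by K or R" are assumptions made for illustration.

```python
# Nominal residue masses for the 20 standard amino acids.
RESIDUE = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def c_terminal_explanations(peptide_mw, top_ion):
    """Apply 'peptide MW - 17 - fragment ion' to the highest ion of a series.

    Returns the remainder and every way to explain it as a tryptic
    C-terminus: K or R alone, or a dipeptide ending in K or R.
    """
    remainder = peptide_mw - 17 - top_ion
    hits = []
    for c_term in ("K", "R"):
        if remainder == RESIDUE[c_term]:
            hits.append(c_term)
        for other, other_mass in RESIDUE.items():
            if remainder - RESIDUE[c_term] == other_mass:
                hits.append(other + c_term)
    return remainder, hits


print(c_terminal_explanations(1228, 987))   # (224, [])
print(c_terminal_explanations(1228, 1083))  # (128, ['K'])
```

An empty list for the series that tops out at m/z 987 and a single hit, K, for the series that tops out at m/z 1083 reproduce the conclusion reached by hand.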

The high mass b-series predicts the presence of some low mass y-ions; are they there?
b-series: …LVESK
y1: 147 No
y2: 234 Yes
y3: 363 Yes
y4: 462 Yes
y5: 575 Yes!!
(y-ion m/z = sum of the residue masses plus 19 Da)

The high mass y-series predicts the presence of some low mass b-ions; are they there?
y-series: [242]VLSL…
b2: 243 Yes
b3: 342 Yes
b4: 455 Yes
b5: 542 Yes
b6: 655 Yes!!
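Both cross-checks in this walk-through (low mass y-ions predicted from the C-terminal end, low mass b-ions predicted from the N-terminal end) come down to running sums of residue masses. A minimal sketch with nominal masses follows; the function names and the convention of passing the unresolved 242 Da segment as a number are assumptions for illustration.

```python
# Nominal residue masses for the residues that occur in the example.
RESIDUE = {"V": 99, "L": 113, "S": 87, "E": 129, "K": 128}


def y_ions(c_terminal_suffix):
    """Singly charged y-ions, y1 first: residue masses summed from the
    C-terminus, plus 19 Da."""
    ions, total = [], 19
    for residue in reversed(c_terminal_suffix):
        total += RESIDUE[residue]
        ions.append(total)
    return ions


def b_ions(unresolved_mass, residues):
    """Singly charged b-ions: an unresolved N-terminal mass segment followed
    by known residues, each sum plus 1 Da."""
    total = unresolved_mass + 1
    ions = [total]
    for residue in residues:
        total += RESIDUE[residue]
        ions.append(total)
    return ions


print(y_ions("LVESK"))      # [147, 234, 363, 462, 575]
print(b_ions(242, "VLSL"))  # [243, 342, 455, 542, 655]
```

The predicted values are the ones checked against the spectrum above; the only one not observed is y1 at m/z 147.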

Can I account for most of the remaining ions as neutral losses or internal fragments? [spectrum annotated with b-ions, y-ions, and neutral losses]
Sequence so far: [242]VLSLLVESK, where 242 = N+Q, N+K, or L+E
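The candidates for the unresolved 242 Da segment can be listed mechanically. This sketch enumerates unordered residue pairs by nominal mass; the table and the name `pairs_with_mass` are illustrative assumptions, and it also reports the Ile variant of L+E that the slide leaves implicit.

```python
from itertools import combinations_with_replacement

# Nominal residue masses for the 20 standard amino acids.
RESIDUE = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def pairs_with_mass(target):
    """Unordered residue pairs (letters sorted) whose nominal masses sum to target."""
    return ["".join(pair)
            for pair in combinations_with_replacement(sorted(RESIDUE), 2)
            if RESIDUE[pair[0]] + RESIDUE[pair[1]] == target]


print(pairs_with_mass(242))  # ['EI', 'EL', 'KN', 'NQ']
```

Each pair can occur in either order, so without a fragment ion between the first two residues the spectrum cannot distinguish among eight N-terminal dipeptides.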

Two approaches to manually sequencing peptides from MS/MS spectra
1. Finding a series of ions in the middle of the peptide, and working out towards one of the termini (illustrated using ion trap data)
2. Finding the C-terminus and working towards the N-terminus (illustrated using Q-TOF data)

Outline
– Basics
– Manual De Novo Sequencing
– De Novo Sequencing Algorithm (PEAKS)
– Homology Search with De Novo Tags

Algorithm Design
The first step in algorithm design is to define what a solution should look like. For the de novo sequencing problem, one wants to compute a peptide that "best matches" the given spectrum. In practice, this "best match" is defined by a scoring function.

Peptide-Spectrum Match Score [figure: a peptide split at a cleavage site into a prefix and a suffix]
A fragment score can be computed for every pair of adjacent amino acids, that is, for every cleavage site. This score depends on the presence of the corresponding b and y ions. The peptide score is the sum of the fragment scores.
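A deliberately simplified version of such a score can be sketched as follows. It awards one point for each b- or y-ion found at a cleavage site and sums over sites; the actual PEAKS scoring function is not given in this transcript, so the unit weights, exact integer matching, and the name `match_score` are assumptions. The peak list contains only m/z values quoted in the manual sequencing example.

```python
# Nominal residue masses for the residues used in the example peptides.
RESIDUE = {"N": 114, "K": 128, "V": 99, "L": 113, "S": 87, "E": 129}

# Singly charged fragment m/z values quoted in the manual sequencing example.
OBSERVED = {234, 363, 462, 575, 688, 775, 888, 987,
            243, 342, 455, 542, 655, 1083}


def match_score(peptide, peaks):
    """Sum of fragment scores over all cleavage sites.

    Each site contributes 1 if its b-ion is observed and 1 if its y-ion
    is observed (a simplification of a real scoring function).
    """
    total = sum(RESIDUE[r] for r in peptide)
    score = 0
    prefix = 0
    for residue in peptide[:-1]:       # one cleavage site per adjacent pair
        prefix += RESIDUE[residue]
        b_ion = prefix + 1             # prefix residues + 1
        y_ion = (total - prefix) + 19  # suffix residues + 19
        score += (b_ion in peaks) + (y_ion in peaks)
    return score


print(match_score("NKVLSLLVESK", OBSERVED))  # 14
print(match_score("KNVLSLLVESK", OBSERVED))  # 14, first two residues swapped
print(match_score("KSEVLLSLVKN", OBSERVED))  # 0, reversed sequence
```

Swapping the first two residues does not change the score because no observed ion separates them, which is the scoring-function view of a mass segment error.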


De Novo Sequencing

Algorithm Idea

Dynamic Programming BestScore
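The recurrence shown on this slide did not survive in the transcript. As a stand-in, here is a generic dynamic-programming sketch over integer prefix masses: the best score for a prefix of mass m is the best score of a prefix one residue shorter plus the fragment score at m. This is not the PEAKS algorithm itself. The unit fragment scores, nominal masses, tie-breaking by fewer residues, and all names are assumptions; Ile and Gln are omitted because they are indistinguishable from Leu and Lys at nominal mass. The peak list is the set of b- and y-ions implied by [242]VLSLLVESK in the manual example; 768, 867, and 996 are computed from that sequence rather than quoted on the slides.

```python
RESIDUE = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

# Illustrative singly charged peaks for a peptide of MW 1228 (see lead-in).
PEAKS = [234, 363, 462, 575, 688, 775, 888, 987,
         243, 342, 455, 542, 655, 768, 867, 996, 1083]


def de_novo(peaks, peptide_mw):
    """Return (sequence, score) maximizing the number of matched b/y ions."""
    peak_set = set(peaks)
    total = peptide_mw - 18  # sum of residue masses (MW minus water)

    def site_score(prefix):
        b_ion = prefix + 1
        y_ion = (total - prefix) + 19
        return (b_ion in peak_set) + (y_ion in peak_set)

    # best[m] = (score, -residue_count) of the best prefix with mass m.
    best = [None] * (total + 1)
    back = [None] * (total + 1)
    best[0] = (0, 0)
    for m in range(1, total + 1):
        gain = site_score(m) if m < total else 0  # full peptide is not a site
        for aa, weight in RESIDUE.items():
            if weight > m or best[m - weight] is None:
                continue
            score, neg_len = best[m - weight]
            candidate = (score + gain, neg_len - 1)
            if best[m] is None or candidate > best[m]:
                best[m] = candidate
                back[m] = aa
    if best[total] is None:
        return None
    sequence, m = [], total
    while m > 0:  # follow back-pointers from the full mass down to zero
        sequence.append(back[m])
        m -= RESIDUE[back[m]]
    return "".join(reversed(sequence)), best[total][0]


sequence, score = de_novo(PEAKS, 1228)
print(sequence, score)
```

On this input the recovered sequence ends in VLSLLVESK with a score of 17, while the first two residues are an arbitrary choice among the equal-scoring pairs that sum to 242, since no peak separates them.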

A Note on PTM
Variable PTMs do not cause a major slowdown for de novo sequencing algorithms.
– Instead of trying only the 20 regular amino acids in the maximization, the algorithm simply tries all modified amino acids too.
– The time complexity increases by a constant factor. (Compare this with the exponential growth in the database search approach.)
However, since the solution space is larger when many variable PTMs are allowed, the accuracy of the algorithm is reduced.

Accounting for Other Ion Types
When internal cleavage ions are considered in the scoring function, it becomes difficult to design an efficient algorithm to find the optimal sequence. A compromise between efficiency and accuracy is a two-stage approach:
– First, compute many (e.g. 10,000) candidate sequences using an efficient scoring function that uses only a few of the most important ions.
– Then, re-evaluate these candidates using a more sophisticated scoring function that takes additional ions into account.
This two-round approach trades off algorithm speed against accuracy.

Mass Segment Error
Most errors are due to incomplete ion ladders in the spectrum.
– Thus, a segment of amino acids cannot be determined.
– However, the total mass of the segment is fixed.
– E.g. [242]VLSLLVESK, where 242 = N+Q, N+K, or L+E
The first two or three residues often have low confidence because of a lack of fragment ions.
Most de novo sequencing software uses the precursor mass as a constraint (thus the mass of the derived sequence is usually correct).

Outline
– Basics
– Manual De Novo Sequencing
– De Novo Sequencing Algorithm
– Homology Search with De Novo Tags

Why Homology Search with De Novo Sequence?
Advantages:
– The database may not contain the exact peptide sequence, but a homologous one is there.
– De novo sequencing plus homology search makes it possible to use the database of one organism to study a similar organism.
Disadvantages:
– De novo sequencing can only provide partially correct sequence tags.
– Conventional homology search may fail due to de novo sequencing errors.

Traditional Sequence Alignment
Two peptide sequences are aligned by inserting spaces at appropriate positions. E.g.
  FVEVTKL-TDLTK
  | || || |||||
  FAEV-KLVTDLTK
Each column, which pairs two residues or a residue with a gap ('-'), has a similarity score that can be looked up in a pre-defined amino acid substitution matrix, such as BLOSUM or PAM. The alignment score is equal to the sum of the column-wise scores. There are algorithms to compute the optimal alignment that maximizes the alignment score.
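One such algorithm is the standard global-alignment dynamic program (Needleman-Wunsch). The sketch below computes only the optimal score and, to stay self-contained, replaces the BLOSUM or PAM lookup with unit scores (+1 match, -1 mismatch, -1 gap); those weights and the function name are assumptions for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of strings a and b.

    Column scores are a simple stand-in for a substitution matrix:
    `match` for identical residues, `mismatch` otherwise, `gap` for a
    residue aligned to '-'.
    """
    # prev[j] = best score aligning the residues of a seen so far with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            column = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + column,  # align a[i-1] with b[j-1]
                           prev[j] + gap,         # a[i-1] against a gap
                           cur[j - 1] + gap))     # b[j-1] against a gap
        prev = cur
    return prev[-1]


print(global_alignment_score("FVEVTKLTDLTK", "FAEVKLVTDLTK"))  # 7
```

Under these unit scores the alignment drawn above (10 matches, 1 mismatch, 2 gap columns) is optimal with a score of 7; the ungapped alignment of the same two peptides scores only 4.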

Limitations of Conventional Homology Search
Conventional search ignores the possible errors in de novo sequencing. Suppose a true sequence is SLCAFK, the de novo sequence is LSCFAK, and the homolog is SLAAFK.
  (denovo)  X: LSCFAK
                    |
  (homolog) Z: SLAAFK
Conventional search, using evolutionary similarities to explain the mismatches, results in a poor match.
  (denovo)  X: [LS]C[FA]K
  (real)    Y: [SL]C[AF]K
                ||   || |
  (homolog) Z: [SL]A[AF]K
If de novo sequencing errors are considered, the match becomes more significant.

A Simple Approach
We can enumerate all possible combinations of a mass segment, and search all of them together.
– MS BLAST will do this.
Difficulties:
– We do not know which portion of the sequence is erroneous.
– The number of possibilities grows exponentially.
Example: [LS]C[FA]K expands to LSCFAK, SLCFAK, TVCFAK, VTCFAK, LSCAFK, SLCAFK, TVCAFK, VTCAFK, among others.
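The enumeration can be sketched in a few lines. This is an illustration of the idea only, not how MS BLAST is implemented: each bracketed segment is replaced by every residue combination of the same length and the same nominal mass, and the alternatives are combined with a Cartesian product. The same-length restriction, the nominal masses (Ile and Gln omitted), and the names `equal_mass_segments` and `expand` are assumptions.

```python
import re
from itertools import product

RESIDUE = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def equal_mass_segments(segment):
    """All residue strings with the same length and nominal mass as segment."""
    target = sum(RESIDUE[r] for r in segment)
    return ["".join(combo)
            for combo in product(RESIDUE, repeat=len(segment))
            if sum(RESIDUE[r] for r in combo) == target]


def expand(tag):
    """Expand a tag such as '[LS]C[FA]K' into every candidate sequence."""
    tokens = re.findall(r"\[([A-Z]+)\]|([A-Z])", tag)
    options = [equal_mass_segments(segment) if segment else [single]
               for segment, single in tokens]
    return ["".join(parts) for parts in product(*options)]


candidates = expand("[LS]C[FA]K")
print(len(candidates))  # 48
print(sorted(equal_mass_segments("LS")))
```

Two uncertain dipeptides already give 8 × 6 = 48 candidates under these assumptions, and every additional uncertain segment multiplies the count again, which is the exponential growth noted above.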

SPIDER Model
Given a de novo sequence X and a database sequence Z, try to reconstruct the real sequence Y.
– The difference between X and Y is explained by de novo sequencing errors.
– The difference between Y and Z is explained by homology mutations.
The real Y should minimize the de novo errors and the homology mutations needed in the above explanation.
  (de novo) X: [LS]C[FA]K
  (real)    Y: [SL]C[AF]K
                ||   || |
  (homolog) Z: [SL]A[AF]K

Two exercises
Exercise 1:
  (denovo)  X: LSCFV
  (real)    Y: EACFV
  (homolog) Z: DACFV
  m(LS) = m(EA) = 200.1 Da
Exercise 2:
  (denovo)  X: LSCFAV
  (real)    Y: SLCFAV
  (homolog) Z: SLCF-V
[figure label: blosum62]
– The swap of L and S is more likely a de novo error than a mutation.
– The deletion of A is unlikely to be a de novo error (de novo sequencing does not change the peptide mass).
– Mutation and de novo error can overlap, as with LS, EA, and DA in the first exercise.
– This is hard to interpret manually; an algorithm is needed.
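The mass argument behind both exercises can be checked numerically: a difference that preserves the peptide mass is a plausible de novo error, while one that changes the mass has to be a mutation. The sketch below uses standard monoisotopic residue masses; the 0.05 Da tolerance and the function names are assumptions for illustration.

```python
# Monoisotopic residue masses (standard values) for the residues involved.
MONO = {
    "L": 113.08406, "S": 87.03203, "E": 129.04259, "A": 71.03711,
    "D": 115.02694, "C": 103.00919, "F": 147.06841, "V": 99.06841,
}


def mass(sequence):
    """Sum of monoisotopic residue masses."""
    return sum(MONO[r] for r in sequence)


def could_be_de_novo_error(x, y, tolerance=0.05):
    """True if x and y have the same mass within the tolerance, so that a
    de novo sequencer constrained by precursor mass could confuse them."""
    return abs(mass(x) - mass(y)) <= tolerance


# Exercise 1: LS vs EA is mass-preserving, EA vs DA is not.
print(round(mass("LS"), 1), round(mass("EA"), 1))
print(could_be_de_novo_error("LSCFV", "EACFV"))    # True
print(could_be_de_novo_error("EACFV", "DACFV"))    # False

# Exercise 2: swapping L and S keeps the mass, deleting A does not.
print(could_be_de_novo_error("LSCFAV", "SLCFAV"))  # True
print(could_be_de_novo_error("SLCFAV", "SLCFV"))   # False
```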

Conclusion
When the target peptides are not in a database:
– De novo sequencing
When the homologous peptides are in a database:
– Homology search with the de novo tags can find them
– Some de novo errors can be corrected by combining the homolog information