Part II. Tandem MS

Mass Analyzer (2) – Quadrupole
– Mass filter; a complete spectrum is obtained by scanning the whole mass range.
– Ions are lost.
– Mass range: 10–4,000 Da.

Hybrid Quadrupole/Time-of-Flight (Q-TOF) MS
[Figure: instrument schematic labeled Q1 (selection), Q2 (collision), pusher, TOF with reflectron, and detector.]

Electrospray MS and MS/MS of Proteins

Sample Preparation
[Figure: workflow labeled tissue, fraction, gel, add trypsin, peptides (MPSER …… GTDIMR PAK ……), HPLC, to MS/MS.]

Tandem Mass Spectrometer
[Figure: ESI Q-TOF schematic. The quadrupole mass analyzer selects parent ions (e.g. PAK+ out of MPSER, SG…, PAK); collision breaks them into fragment ions (P+ and AK+, PA+ and K+); the TOF mass analyzer and ion detector record the fragments, which are used for peptide sequencing.]

How Does a Peptide Fragment?
For a peptide $A_1 A_2 A_3 A_4$:
$m(y_1) = 19 + m(A_4)$
$m(y_2) = 19 + m(A_4) + m(A_3)$
$m(y_3) = 19 + m(A_4) + m(A_3) + m(A_2)$
$m(b_1) = 1 + m(A_1)$
$m(b_2) = 1 + m(A_1) + m(A_2)$
$m(b_3) = 1 + m(A_1) + m(A_2) + m(A_3)$
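The b- and y-ion formulas on this slide can be tried out with a minimal Python sketch. It assumes standard monoisotopic residue masses and keeps the slide's nominal offsets of 1 (b-ions) and 19 (y-ions); the function names `b_ions` and `y_ions` are illustrative and not part of the original slides.

```python
# Monoisotopic residue masses in Da (standard values; I is omitted because
# it has the same mass as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def b_ions(peptide):
    """m(b_k) = 1 + mass of the first k residues, for k = 1 .. n-1."""
    masses, total = [], 1.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def y_ions(peptide):
    """m(y_k) = 19 + mass of the last k residues, for k = 1 .. n-1."""
    masses, total = [], 19.0
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


if __name__ == "__main__":
    print("b ions of LGER:", [round(m, 3) for m in b_ions("LGER")])
    print("y ions of LGER:", [round(m, 3) for m in y_ions("LGER")])
```

The offsets 1 and 19 are the rounded values used on the slide; a production tool would use more precise proton and water masses.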

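How MS/MS corresponds to peptide
[Figure: for peptide LGER, the b1, b2, b3 peaks along the m/z axis read the sequence from the N-term (L, G, E, R), and the y1, y2, y3 peaks read it from the C-term (R, E, G, L).]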

Put both together
[Figure: the b-ion ladder (LGER) and the y-ion ladder (REGL) overlaid on one m/z axis.]
– In practice, there are many peaks other than b and y peaks.
– Many b and y peaks may be missing.

Matching Sequence with Spectrum

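[Figure: tandem mass spectrometry turns a peptide sequence (LGSSEVEQVQLVVDGVK) into an MS/MS spectrum; the sequence LGSSEVEQVQLVVDGVK is recovered from the spectrum either by de novo sequencing or by searching a database.]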

Database Search Methods
Mascot
– Matrix Science.
– General software.
Sequest
– John Yates et al.
– Distributed by Thermo Finnigan.
– Works for Thermo’s LTQ.
PEAKS
– Bin Ma et al.
– Distributed by Bioinformatics Solutions Inc.
– General software.

Mascot

PEAKS

De Novo Sequencing
(Dancik et al., JCB 6: )
– Given a spectrum and a mass value M, compute a sequence P such that m(P) = M and the matching score is maximized.
– We take the matching score of P to be the sum of the scores of the matched peaks.

Spectrum Graph Approach
– Convert the peak list to a graph.
– A peptide sequence corresponds to a path in the graph.
– Bartels (1990), Biomed. Environ. Mass Spectrom. 19:
– Taylor and Johnson (1997), Rapid Comm. Mass Spec. 11: (Lutefisk)
– Dancik et al. (1999), JCB 6:
– Chen et al. (2001), JCB 8:
– ……
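A toy version of the spectrum graph idea, as a hedged Python sketch rather than any of the cited algorithms: nodes are prefix masses (b-ion mass minus the slide's offset of 1), an edge joins two nodes whose mass difference matches one residue within a tolerance, and every path from 0 to the total mass spells a candidate peptide. The names `prefix_nodes`, `spectrum_graph` and `paths`, and the 0.01 Da tolerance, are assumptions made for illustration.

```python
# Monoisotopic residue masses in Da (I omitted: same mass as L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def prefix_nodes(peptide):
    """Ideal node set for a peptide: 0, every prefix mass, and the total mass."""
    nodes, total = [0.0], 0.0
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        nodes.append(total)
    return nodes


def spectrum_graph(nodes, tol=0.01):
    """Connect u -> v when v - u equals a residue mass within tol."""
    nodes = sorted(nodes)
    edges = {u: [] for u in nodes}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            for aa, w in RESIDUE_MASS.items():
                if abs((v - u) - w) <= tol:
                    edges[u].append((v, aa))
    return edges


def paths(edges, start, end):
    """All residue strings spelled by paths from start to end."""
    if start == end:
        return [""]
    found = []
    for v, aa in edges[start]:
        found.extend(aa + rest for rest in paths(edges, v, end))
    return found


if __name__ == "__main__":
    nodes = prefix_nodes("LGER")
    print(paths(spectrum_graph(nodes), nodes[0], nodes[-1]))
    # Dropping one node imitates a missing ion.
    damaged = [n for i, n in enumerate(nodes) if i != 2]
    print(paths(spectrum_graph(damaged), damaged[0], damaged[-1]))
```

With one node removed the path is broken and no peptide is found, which is the kind of failure a pure spectrum graph method has to work around.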

Difficulties
The spectrum graph approach has difficulty handling errors:
– Missing ions break a path.
– Too many peaks within a small error tolerance create too many edges connecting to the same peak (this reduces efficiency).
– Error accumulation.
– A peak may be used as both a y-ion and a b-ion.
It is still possible to solve these problems under the spectrum graph scheme:
– E.g. the y-b overlap problem was addressed by Dancik et al. (1999) and Chen et al. (2001).
– But things get complicated.
– Reliable signal preprocessing is required.

PEAKS’ approach
It handles errors and noise more naturally and more easily.
– Less dependent on signal preprocessing.
– Solves the missing-ion and y-b overlap problems naturally.
– Showed great success on real-life lab data.
– Has been licensed by tens of research labs in the public and private sectors.

A simplified case – Counting Only Y-ions

The Score of a Suffix
– Let Q be a suffix of the peptide. It determines some y-ions (y1, y2, y3).
– score(Q) is the sum of the scores of those y-ions of Q.
[Figure: a suffix Q with its y-ions y1, y2, y3; the label 19 marks the y-ion mass offset.]

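Recursive Computation of DP(m)
– Suppose Q is such that DP(m) = score(Q).
– Write Q = aQ’, where a is the first amino acid of Q. Then score(Q’) = DP(m(Q’)).
– Do not know a? Try every amino acid.
[Figure: Q split into a and Q’, with the y-ion mass offset 19.]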

Dynamic Programming
1. For m from 0 to M, compute DP(m).
2. Backtrack to decide the optimal peptide.
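The two steps can be sketched in Python for the simplified y-ion-only case. This is an illustration under stated assumptions, not the PEAKS implementation: masses are nominal integers, `peaks` maps an integer y-ion m/z to a score, and the recurrence DP(m) = max over a of DP(m - m(a)) + peak score at m + 19 is a reconstruction from the surrounding slides. The function name `de_novo_y_only` is hypothetical.

```python
# Nominal (integer) residue masses. I and Q are left out because they share
# a nominal mass with L and K, so this toy alphabet cannot tell them apart.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def de_novo_y_only(peaks, total_mass):
    """Return (peptide, score) maximizing the summed scores of matched y-ions.

    dp[m] is the best score of any suffix whose residues sum to m; a suffix
    of mass m produces a y-ion at m + 19 (the slide's nominal offset).
    """
    neg = float("-inf")
    dp = [neg] * (total_mass + 1)
    choice = [None] * (total_mass + 1)  # first residue of the best suffix
    dp[0] = 0.0
    # Step 1: fill the table for m from 0 to M.
    for m in range(1, total_mass + 1):
        for aa, w in RESIDUE_MASS.items():
            if w <= m and dp[m - w] > neg:
                cand = dp[m - w] + peaks.get(m + 19, 0.0)
                if cand > dp[m]:
                    dp[m], choice[m] = cand, aa
    if dp[total_mass] == neg:
        return None, neg
    # Step 2: backtrack from M; residues come out in N-term to C-term order.
    seq, m = [], total_mass
    while m > 0:
        seq.append(choice[m])
        m -= RESIDUE_MASS[choice[m]]
    return "".join(seq), dp[total_mass]


if __name__ == "__main__":
    # y1, y2, y3 of LGER at nominal masses 175, 304, 361; residue total 455.
    print(de_novo_y_only({175: 1.0, 304: 1.0, 361: 1.0}, 455))
```

Ties are possible: for example R and GV have the same nominal mass, so a spectrum without a peak that separates them supports both equally, and the sketch simply returns one of the best-scoring sequences.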

PEAKS – The Software

Comparison
LCQ data (ion trap instrument):
– Generously provided by Dr. Richard Johnson. 144 spectra.
Micromass Q-Tof data:
– Measured in UWO’s Protein ID lab. 61 spectra.
Sciex Q-Star data:
– Provided by U. Victoria’s Genome BC Proteomics Centre. 13 good/okay spectra.

PEAKS vs. Lutefisk
– Completely correct sequences: 38/144 vs. 15/144
– Correct amino acids: 1067/1702 vs. 767/1702
– Partially correct sequences with 5 or more contiguous correct amino acids: 94/144 vs. 64/144

PEAKS vs. Micromass PLGS
– Completely correct sequences: 23/61 vs. 7/61
– Correct amino acids: 559/764 vs. 232/764
– Partially correct sequences with 5 or more contiguous correct amino acids: 50/61 vs. 24/61

PEAKS vs. Sciex BioAnalyst
– Completely correct sequences: 7/13 vs. 1/13
– Correct amino acids: 115/150 vs. 86/150
– Partially correct sequences with 5 or more contiguous correct amino acids: 12/13 vs. 7/13

Post Translational Modification (PTM)

PTM
– PTMs are important to the functions of proteins.
– There are more than 500 types of PTMs included in the Unimod PTM database.
– For example, reversible phosphorylation of proteins is an important regulatory mechanism. Many enzymes are switched "on" or "off" by phosphorylation and dephosphorylation. The switch works through the structural change caused by the PTM.

Phosphorylation
[Figure: chemical structures of S, T and Y and of their phosphorylated forms pS, pT and pY.]
Monoisotopic mass change: PO3H =

PTM increases complexity
– Most protein databases do not have the PTM information.
– Therefore, when a PTM is present, one has to try different PTM possibilities to match a peptide with a spectrum.
– For peptide LGSSEVTMVYLK, if only phosphorylation is considered, there are 16 possibilities (4 candidate sites S, S, T, Y, each modified or not: 2^4 = 16).
– What if there are 10 possible PTM sites?
– PTMs of this type are called variable PTMs.
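The count of 16 can be checked by enumerating the forms directly. The sketch below assumes phosphorylation may occur on any S, T or Y and that every site is independently modified or not; `phospho_forms` is an illustrative name.

```python
from itertools import product

PHOSPHO_SITES = set("STY")  # residues that can carry a phosphate


def phospho_forms(peptide):
    """List every variable-phosphorylation form of a peptide.

    Modified residues are written with a 'p' prefix, e.g. pS.
    """
    sites = [i for i, aa in enumerate(peptide) if aa in PHOSPHO_SITES]
    forms = []
    for mask in product([False, True], repeat=len(sites)):
        chars = list(peptide)
        for i, modified in zip(sites, mask):
            if modified:
                chars[i] = "p" + chars[i]
        forms.append("".join(chars))
    return forms


if __name__ == "__main__":
    forms = phospho_forms("LGSSEVTMVYLK")
    print(len(forms), "forms, e.g.", forms[:3])
    print("10 sites would give", 2 ** 10, "forms")
```

The number of forms doubles with every additional site, which is why a database search with many variable PTMs becomes expensive.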

Fixed PTM
Some PTMs are known to be present all the time.
– These are called fixed PTMs.
Oxidation of M. Mass +16.
– It happens automatically in air, so people often make sure that all of the M are oxidized.
Carboxyamidomethyl cysteine (CamC). Mass
– These are added intentionally to break the disulphide bonds.
Fixed PTMs are easier to handle than variable PTMs.

Variable PTM in DB Search and De Novo
– For DB search, one has to try different combinations.
– For de novo, each variable PTM is like adding a new amino acid. For example, if pS, pT and pY are variable, then instead of 20 characters in the alphabet we have 23.
– But too many variable PTMs will reduce the accuracy of de novo sequencing.

Peptide Identification vs. Protein Identification

Common procedure for protein ID
[Figure: proteins (>protein A PAKGTIRHIHGCDKRGAPWPAS…, >protein B MSERNHLREIIGNEVR……, >protein C LSIMQDKDYSASFIS……) are digested into peptides (PAK, MSER, LSIMQDK, HIHGCDK, EIIGNEVR, SIMQMDYSASFIS); MS/MS and peptide sequencing identify the peptides, and protein ID maps the peptides back to the proteins.]

Problems
– A peptide may appear in several proteins.
– A protein family may share many peptides. Usually only one of them is true.
– A protein may have only one peptide, or two weak peptides. Is it a true or a false positive? This is the “one hit wonder”.

Estimate False Positives
Suppose you have a score for each identified protein. You want to choose a score threshold T.
– Score > T → positive (keep)
– Score <= T → negative (discard)
It is important to estimate the false positive rate for each given result.
False Positive Rate
– In statistics, FPR = #false positives / #real negatives = FP / (FP + TN).
– We care more about FPR = #false positives / #results reported as positives = FP / (FP + TP).

                     Positive (prediction)   Negative (prediction)
Positive (reality)   TP                      FN
Negative (reality)   FP                      TN

The two definitions are different!
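To see how far apart the two definitions can be, here is a small Python sketch that computes both from the four cells of the table. The counts in the demo are hypothetical and only illustrate the arithmetic; `false_positive_rates` is an illustrative name.

```python
def false_positive_rates(tp, fp, tn, fn):
    """Return (statistical FPR, FPR among reported positives).

    statistical: FP / (FP + TN), the share of real negatives that were kept
    reported:    FP / (FP + TP), the share of kept results that are false
    fn is accepted only so that all four table cells are passed in.
    """
    fpr_statistics = fp / (fp + tn)
    fpr_reported = fp / (fp + tp)
    return fpr_statistics, fpr_reported


if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    stat, reported = false_positive_rates(tp=490, fp=7, tn=9000, fn=3)
    print(f"statistical FPR = {stat:.5f}, FPR among reported = {reported:.5f}")
```

With many true negatives the statistical rate looks tiny even when a noticeable share of the reported proteins is wrong, which is why the second definition is the one that matters for a result list.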

Decoy Database Method
– Choose a decoy database, for example by reversing the database. Anything matched from this database is false.
– Search the real database and the decoy database separately.
– For the same T, if there are x proteins in the decoy database with score > T, then there are perhaps x false proteins in the real database with score > T.
Example, for a threshold T:
– the real DB has 497 proteins > T,
– the decoy DB has 7 proteins > T,
– the false positive rate is 7/497 = 1.4%.
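The decoy estimate amounts to counting hits above the threshold in both searches. The Python sketch below assumes each search returns a plain list of protein scores; the score lists in the demo are synthetic and built only to reproduce the 497 and 7 counts of the example, and `decoy_false_positive_rate` is an illustrative name.

```python
def decoy_false_positive_rate(real_scores, decoy_scores, threshold):
    """Estimate the false positive rate among real-database hits above threshold.

    Every decoy hit above the threshold is false by construction, so the
    number of decoy hits estimates the number of false real hits.
    """
    real_hits = sum(s > threshold for s in real_scores)
    decoy_hits = sum(s > threshold for s in decoy_scores)
    if real_hits == 0:
        return None  # nothing was kept, so the rate is undefined
    return decoy_hits / real_hits


if __name__ == "__main__":
    # Synthetic scores: 497 real hits and 7 decoy hits above T = 50.
    real = [60.0] * 497 + [10.0] * 1000
    decoy = [60.0] * 7 + [10.0] * 1490
    rate = decoy_false_positive_rate(real, decoy, threshold=50.0)
    print(f"estimated false positive rate: {100 * rate:.1f}%")
```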

Problems
– Only works for large datasets. The estimate is not statistically significant when the dataset is small.
– Does not care how many proteins are actually kept. Keeping only the true results is not our only goal; we also want to keep as many true results as possible.
– A decoy database is only good for validation and cannot substitute for a good scoring method.

SPIDER – listen to both parties!
The solution when there is no protein database and no perfect MS/MS.
兼听则明,偏听则暗 (Listen to both sides and you will be enlightened; heed only one side and you will be benighted.)

[Figure: de novo sequencing gives EISGNEVR; database search against a protein DB gives ESIGNEVR; homology search against a protein DB gives ESIGSEVR.]
PEAKS: Ma et al., Rapid Comm. Mass Spec. SI
PatternHunter: Ma, Tromp and Li, Bioinformatics
SPIDER: Han, Ma and Zhang, JBCB, 2005

Two purposes of our research
1. Searching: given a de novo sequence with errors, find a homolog of the real sequence.
2. Sequencing: using the de novo sequence and the homolog as input, compute the real sequence.

“Listen to both sides and you will be enlightened; heed only one side and you will be benighted.”
[Figure: the de novo sequence LSCFAV and the homolog DACFKAV on either side of the sequence EACFAV.]

Homology mutations
– Sequence alignment.
– Also called edit distance.
EACF-AVQR
DACFKAV-R
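For the alignment shown on this slide, a unit-cost edit distance can be computed with the classic dynamic program. The Python sketch below assumes every substitution, insertion and deletion costs 1; a substitution matrix such as BLOSUM62 would weight them differently, and this is not the scoring SPIDER itself uses.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # The two sequences of the slide, with gaps removed.
    print(edit_distance("EACFAVQR", "DACFKAVR"))
```

The slide's alignment uses one substitution (E/D), one insertion (K) and one deletion (Q), so a unit-cost distance of 3 is consistent with it.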

Common de novo sequencing errors
– Same-mass replacement: AN? NA? GAG?

Two exercises
Exercise 1, with m(LS) = m(EA) = 200.1 mu:
(denovo)  X: LSCFV
(real)    Y: EACFV
(homolog) Z: DACFV
Exercise 2, scored with blosum62:
(denovo)  X: LSCFAV
(real)    Y: SLCFAV
(homolog) Z: SLCF-V
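Whether two blocks such as LS and EA count as the same mass depends on the instrument tolerance. The Python sketch below assumes standard monoisotopic residue masses; the helper names `residue_sum` and `same_mass` and the tolerances in the demo are illustrative choices.

```python
# Monoisotopic residue masses in Da for the residues used below.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "L": 113.08406,
    "N": 114.04293, "E": 129.04259,
}


def residue_sum(block):
    """Total residue mass of a block of amino acids."""
    return sum(RESIDUE_MASS[aa] for aa in block)


def same_mass(block_a, block_b, tol):
    """True when the two blocks differ by at most tol Da."""
    return abs(residue_sum(block_a) - residue_sum(block_b)) <= tol


if __name__ == "__main__":
    print("m(LS) =", round(residue_sum("LS"), 4))
    print("m(EA) =", round(residue_sum("EA"), 4))
    print("LS vs EA at 0.05 Da:", same_mass("LS", "EA", 0.05))
    print("LS vs EA at 0.01 Da:", same_mass("LS", "EA", 0.01))
    print("AN vs GAG at 0.001 Da:", same_mass("AN", "GAG", 0.001))
```

LS and EA agree to one decimal place, so they are interchangeable only on instruments with a coarse tolerance, whereas AN, NA and GAG are indistinguishable by mass even at high accuracy.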

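More formally
– Let $d(X,Z) = \min_Y \left( \mathrm{seqError}(X,Y) + \mathrm{editDist}(Y,Z) \right)$.
– Sequencing: given de novo sequence X and homolog Z, find Y such that $\mathrm{seqError}(X,Y) + \mathrm{editDist}(Y,Z)$ is minimized.
– Searching: search a database for Z such that $d(X,Z)$ is minimized.
[Figure: X and Y are linked by seqError; Y and Z are linked by editDist.]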

How to compute $d_s(X,Y)$
– X and Y can easily be aligned together (according to mass).
– For each erroneous mass block with mass $m_i$, define the cost to be
– Define
(denovo) X: LSCFAV
(real)   Y: EACFAV
[Figure: X and Y are linked by seqError; Y and Z are linked by editDist.]

How to compute d(X,Z)
– A multiple alignment can be built from the alignments (X,Y) and (Y,Z).
– Lemma:
– Dynamic Programming! Let
(denovo)  X: LSCF-AV
(real)    Y: EACF-AV
(homolog) Z: DACFKAV
[Figure: X and Y are linked by seqError; Y and Z are linked by editDist.]

Four cases of the last block
[Figure: cases (A), (B), (C), and the case with no sequencing error.]
D(i,j) is the minimum of the four cases.

How to compute

Three cases of the alignment (1) (2)(3)

The algorithm for computing
1. for m from 0 to m(X) step Δ
     for i from 0 to |Z|
       for j from i to |Z|
Time complexity:

The algorithm for computing d(X,Z) and Y
1. for i from 1 to |X|
     for j from 1 to |Z|
2. Output D(|X|,|Z|) as d(X,Z).
3. Backtrack to get the best middle sequence Y.
Time complexity:
Total time complexity:

Experiment
– 28 spectra from ALBU_BOVIN.
– PEAKS de novo sequencing gives 13 correct and 15 partially correct sequences.
– SPIDER found good peptide homologues in the human protein DB for all of them.
– Correct peptide sequences were constructed for 24.
[Figure: the PEAKS de novo sequence EAEGNEVR, the human DB homolog ESIGSEVR, and the ALBU_BOVIN sequence ESIGNEVR, connected through SPIDER.]

Two exemplary results
(denovo)  X: CCQ[W ]DAEAC[AF] K
(real)    Y: CCK AD DAEAC FA VE GP K
(homolog) Z: CCK[AD]DKETC[FA] K

(denovo)  X: FVE LVTD[TL]K
(real)    Y: FVE VTK LVTD LT K
(homolog) Z: FAE LVTD[LT]K
[Figure legend: homology mutations, sequencing errors.]

Four modes in SPIDER
– Homology mode.
– Non-gapped homology mode: assumes sequencing errors and homology mutations do not overlap.
– Segment match mode: assumes no homology mutations.
– Exact match mode: assumes no sequencing errors or homology mutations.

Experiment
– 144 ion trap MS/MS spectra of lower quality.
– The proteins are all in Swissprot but not in the human database.
– PEAKS 2.0 was used for de novo sequencing.
– SPIDER searches the Swissprot and human databases, respectively.

People like SPIDER
Best Paper Award at CSB2004.
Some random e-mails we received:
– “I'm a big SPIDER fan!” Shinichi Iwamoto, Shimadzu Corporation
– “The results I've been getting have been consistently very good. Thank you for this great piece of software!” Jason W. H. Wong, University of Oxford
– “Your software is by far the fastest and more user-friendly I have found.” Juan Luis, University of Georgia
– ……
– “I plan to teach SPIDER in my Advanced Bioinformatics class. I wonder if your powerpoint slides are available?” Pavel Pevzner, Ronald R. Taylor Professor of Computer Science, UCSD
Included in PEAKS as both a separate tool and an intermediate step in generating protein candidates.
The best is yet to come:
– People have just started using the de novo + homology approach.