Improving Statistical Significance Assignment in Mass Spectrometry Based Peptide Identification


Improving Statistical Significance Assignment in Mass Spectrometry Based Peptide Identification
Overview:
  Statistical Significance in Peptide Identification
  Statistics through de novo method
  Example
  New way to Combine Search Results
Yi-Kuo Yu, Quantitative Molecular Biological Physics (QMBP) Group, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

QMBP Research using Biowulf
  Molecular Dynamics: Protein Folding Simulations
  Molecular Networks: Information Transduction in protein-protein interaction networks
  Molecular Interactions: Exact electrostatic force/energy
  Mass Spectrometry: statistics of peptide/protein ID

Mass Spect. Task force: LCDR Gelio Alves, Dr. Aleksey Ogurtsov
Relevant References:
1. Gelio Alves and Yi-Kuo Yu. Statistical Characterization of a 1D Random Potential Problem – with applications in score statistics of MS-based peptide sequencing. Physica A (2008), 387: doi:
2. G. Alves, A. Ogurtsov, Wells W. Wu, Guanghui Wang, R-F Shen and Yi-Kuo Yu. Calibrating E-values for MS2 Database Search Methods. Biology Direct, 2007, 2:26
3. Gelio Alves, Wells W. Wu, Guanghui Wang, Rong-Fong Shen and Yi-Kuo Yu. Enhancing Peptide Identification Confidence by Combining Search Methods. Journal of Proteome Research, 7: (2008).

Overview: MS-based Proteomics
Protein Identification is important for Proteomics/Systems Biology.
A generic pathway. Desirable to understand: proteins involved?
Important issues: (1) Protein ID in a mixture, (2) Protein Circuit / Localization, (3) Signaling and Communication.

What can mass spect do?
Protein identification through peptide identification: MS/MS produces partial-peptide fragments [(a, b, c) and (x, y, z) ions] and thus provides more information about the peptide for sequencing. Given a set of MS/MS spectra, one may identify the peptides involved by database search or de novo sequencing, and then infer the proteins involved.

What is the problem?
Confidence assignment in peptide identifications (how to confidently interpret biological experiments):
  Where to draw the line when selecting peptide candidates?
  How to rank peptide candidates across spectra?
  How to compare results analyzed using different search methods? (Does a top hit in method M1 carry the same meaning as a top hit in method M2?)
  How to compare results from different experiments?
A possible solution is robust statistical significance assignment that provides (1) a quantifiable confidence measure for peptide ID and (2) the flexibility to compare results from different spectra and even from different search methods.

Solid Statistics (E-values) might be our best rescue
In the context of peptide searches, both the E- and P-values may be viewed as monotonically decreasing functions of some algorithm-dependent quality score S.
For a given quality score cutoff, the P-value is the probability of finding a random hit with quality score greater than or equal to the cutoff.
The E-value is the expected number of hits in a random database with quality score greater than or equal to the cutoff:
  E = P × (random_db_size) [equivalent to Bonferroni correction]
Key assumption needed: aside from the true peptides, the rest of the peptides in the database appear to be random with respect to any given MS/MS spectrum.
Using correct E-values, we can compare search results from different spectra and even different search methods!
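The two definitions on this slide can be illustrated with a small sketch. The null score list, the cutoff, and the function names below are made up for illustration; a real search engine would derive P from its own score distribution rather than from ten toy numbers.

```python
def p_value(random_scores, cutoff):
    """Fraction of random (null) hits whose score is at or above the cutoff."""
    hits = sum(1 for s in random_scores if s >= cutoff)
    return hits / len(random_scores)


def e_value(p, random_db_size):
    """Expected number of random hits at or above the cutoff: E = P * db_size."""
    return p * random_db_size


# Hypothetical null scores for one spectrum (illustrative only).
null_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
p = p_value(null_scores, cutoff=9)       # two of ten scores are >= 9
e = e_value(p, random_db_size=1000)      # scale by an assumed database size
print(p, e)
```

Because E is a count expected under the null, a hit with E well below 1 is unlikely to be random no matter which spectrum or scoring function produced it, which is what makes E-values comparable across searches.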

Aren't there many methods reporting E-values already? Why not just use them?
Apparently, most E-values reported deviate from the textbook definition.
CR: removal of highly homologous clusters [Ref: Biol. Direct, 2007, 2:25]

To circumvent the statistical inaccuracy:
(1) We developed RAId_DbS, a new search method that has satisfactory statistics (see below) but without losing performance (see ROC curves to the right).
[Figure panels: using profile data; using centroid data]

(2) We provide a protocol to calibrate E-values: some methods do not report E-values. To compare the search results from these methods, one needs to calibrate their statistics; see G. Alves, A. Y. Ogurtsov, Y.-K. Yu, Calibrating E-values for MS2 Database Search Methods [Biol. Direct (2007), 2:26]. Problem: may lose spectrum-specific statistics.

Other advantage of having accurate E-values: a simple connection to the False Discovery Rate (FDR) [formula not preserved in this transcript], where E_c is the E-value cutoff, N is the total number of spectra, and H(E_c) is the cumulative number of hits with E-value smaller than or equal to E_c. No need to search a decoy database to get the FDR!
(3) Statistical calibration leads to a way to combine search results from different methods (but can't enforce spectrum-specific statistics); see Alves et al., Enhancing Peptide Identification Confidence by Combining Search Methods [JPR (2008), 7:3102–3113].
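The FDR formula itself is missing from the transcript. One reading consistent with the symbol definitions on this slide is FDR(E_c) ≈ N · E_c / H(E_c): with accurate E-values, about N · E_c random hits are expected at or below the cutoff across N spectra, out of H(E_c) hits actually observed. The sketch below assumes that form; it is a reconstruction, not the authors' stated equation, and the function name and data are hypothetical.

```python
def fdr_estimate(hit_e_values, n_spectra, e_cutoff):
    """Assumed form: FDR(E_c) ~ N * E_c / H(E_c), capped at 1."""
    accepted = sum(1 for e in hit_e_values if e <= e_cutoff)  # H(E_c)
    if accepted == 0:
        return 0.0  # no hits accepted, so no false discoveries to count
    return min(1.0, n_spectra * e_cutoff / accepted)


# Hypothetical E-values of reported hits from 5 spectra (illustrative only).
e_values = [1e-4, 1e-3, 0.01, 0.5, 2.0]
print(fdr_estimate(e_values, n_spectra=5, e_cutoff=0.01))
```

No decoy search appears anywhere in this calculation, which is the point the slide makes: the estimate needs only the reported E-values and the number of spectra.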

Spectrum-Specific Statistics
Why spectrum-specific statistics? Fragment peaks depend on:
  the parent ion charge state;
  the presence of co-eluted materials and their physical interactions with each other;
  the relative kinetic energy of the inert gas (CID) or of the electrons (ECD, ETD);
  the peptide/co-eluted material concentrations;
  the peptide/co-eluted material conformation in the gaseous phase; etc.
[Figure: spectra from the same peptide KVPQVSTPTLVEVSR]

The complication
Spectrum-specific noise demands spectrum-specific statistics. Not every search method can do this. Only two known methods use spectrum-specific statistics:
  X!Tandem (fitted empirically)
  RAId_DbS (derived theoretically)
Recently, SEQUEST developers have also investigated the possibility of using spectrum-specific XCorr statistics.

A new approach: obtaining statistical standard from scoring all possible peptides
Merit: bypasses the need for a decoy database (when FDR is considered) and the need for E-value calibration.
Challenge: the astronomically large number of peptides to score. For a peptide of molecular weight 2300 Da there are ≈ [...] possible peptides. Scoring 10^9 peptides per second would take ≈ 3.2 × 10^9 years!
[Figure labels: all possible human tryptic peptides; all possible tryptic peptides]
Solution: see our recent paper, Physica A (2008), 387: doi:
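The peptide count for 2300 Da did not survive in the transcript, but the two figures that did (10^9 peptides scored per second, about 3.2 × 10^9 years in total) imply its order of magnitude. The arithmetic below is a back-calculation under the assumption of a 365.25-day year, not a number quoted from the slide.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumed year length in seconds

rate_per_second = 1e9                   # scoring rate stated on the slide
years = 3.2e9                           # total time stated on the slide

# Count implied by the two stated figures; the slide's own value is missing.
implied_peptide_count = rate_per_second * years * SECONDS_PER_YEAR
print(f"{implied_peptide_count:.1e}")   # about 1.0e+26
```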

A new approach: obtaining statistical standard from scoring all possible peptides (cont…)
Algorithm: a dynamic programming algorithm that can score all possible peptides in a few seconds. It is also capable of incorporating internal structures such as peptide lengths, hydrophobicity, etc. by extending the dimension of the internal array. [Physica A (2008), 387: ]
Scoring functions: RAId_DbS, Hyperscore (X!Tandem), K-score, XCorr, WP.
A similar algorithm was proposed independently by Pevzner's group [JPR (2008), 7: ].
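The slide does not spell out the algorithm, so the following is only a toy sketch of the general idea: a dynamic program over integer masses that builds the score histogram of every sequence with a given total mass without enumerating the sequences. The five-residue alphabet with rounded masses and the scoring rule (one point per prefix mass that coincides with a peak) are simplifying assumptions, not the RAId_DbS, Hyperscore, K-score, or XCorr functions named above.

```python
from collections import defaultdict

# Hypothetical mini-alphabet with integer-rounded residue masses (Da).
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def score_histogram(total_mass, peak_masses, residue_mass=RESIDUE_MASS):
    """Return {score: number of sequences} over all sequences of total_mass.

    Toy score: +1 for every prefix mass of the sequence found in peak_masses.
    """
    # hist[m][s] = number of sequences with total mass m and score s.
    hist = [defaultdict(int) for _ in range(total_mass + 1)]
    hist[0][0] = 1  # the empty sequence
    for m in range(1, total_mass + 1):
        gain = 1 if m in peak_masses else 0
        for r in residue_mass.values():
            if r > m:
                continue
            # Extend every sequence of mass m - r by one residue of mass r.
            for score, count in hist[m - r].items():
                hist[m][score + gain] += count
    return dict(hist[total_mass])


# Mass 128 is reached only by GA and AG; with a peak at 57 only GA scores.
print(score_histogram(128, {57}))
```

The work grows with total mass × alphabet size × number of distinct scores, not with the number of sequences, which is why this style of algorithm can cover an astronomically large peptide space quickly. Extra properties such as peptide length would add another index to `hist`, in the spirit of "extending the dimension of the internal array".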

ISB, centroid data
The P-value of each candidate peptide from the 50 MB random database is inferred from the de novo score histogram of all possible peptides (RAId_DbS strategy).
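Reading a P-value off a score histogram amounts to taking the upper-tail fraction. The sketch below assumes the histogram is available as a plain `{score: count}` mapping; the counts are invented for illustration and the function name is hypothetical.

```python
def p_value_from_histogram(histogram, score):
    """Fraction of all scored peptides whose score is >= the given score."""
    total = sum(histogram.values())
    tail = sum(count for s, count in histogram.items() if s >= score)
    return tail / total


# Hypothetical histogram over all possible peptides for one spectrum.
toy_histogram = {0: 6, 1: 3, 2: 1}
print(p_value_from_histogram(toy_histogram, 2))
```

Because the histogram is built per spectrum, the resulting P-value is spectrum-specific by construction.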

Combining search results
For a given spectrum σ, search a database using methods M1 and M2, which return hit lists L1(σ) and L2(σ) respectively, along with database P-values P_db.

L1(σ)
  SAMPLER   1.4e-4
  TVPMRQK   4.6e-2
  HVGTMHK   0.13
  ...

L2(σ)
  GAMHLER   3.4e-6
  TVPMRQK   1.6e-3
  VGTMGSK   0.06
  ...

L1(σ) ∪ L2(σ)
  Peptide   P_db (M1)   P_db (M2)
  GAMHLER   1           3.4e-6
  SAMPLER   1.4e-4      1
  TVPMRQK   4.6e-2      1.6e-3
  VGTMGSK   1           0.06
  HVGTMHK   0.13        1
  ...

A peptide not present in a report list is assigned a database P-value of 1.
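The union step, with missing entries filled by a database P-value of 1, can be sketched as below using the example values from this slide. The slide does not say how the two P-values are merged into one number; `combine_two_p_values` is therefore an assumption, using the exact distribution of the product of two independent uniform P-values (the two-method case of Fisher's method), and should not be read as the authors' formula.

```python
import math


def union_hit_lists(list1, list2):
    """Merge two {peptide: P_db} hit lists; a missing peptide gets P_db = 1."""
    peptides = set(list1) | set(list2)
    return {pep: (list1.get(pep, 1.0), list2.get(pep, 1.0)) for pep in peptides}


def combine_two_p_values(p1, p2):
    """Assumed rule: P(product <= p1*p2) for two independent uniform P-values.

    Requires p1 > 0 and p2 > 0.
    """
    tau = p1 * p2
    return tau * (1.0 - math.log(tau))


l1 = {"SAMPLER": 1.4e-4, "TVPMRQK": 4.6e-2, "HVGTMHK": 0.13}
l2 = {"GAMHLER": 3.4e-6, "TVPMRQK": 1.6e-3, "VGTMGSK": 0.06}

table = union_hit_lists(l1, l2)
for pep, (p_m1, p_m2) in sorted(table.items()):
    print(pep, p_m1, p_m2, combine_two_p_values(p_m1, p_m2))
```

Assigning P = 1 to a missing peptide is the conservative choice: under the assumed rule, a peptide reported by only one method ends up with a combined P-value larger than the one that method reported, so it is not rewarded for being absent from the other list.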

Remarks and Acknowledgement
It is anticipated that combining search methods that are orthogonal to each other might be most advantageous. It is easy to check the correlation between the information utilized by various scoring methods.
RAId_denovo can be accessed from (standalone version will be available for download this summer)
We thank the administrative group of the Biowulf computers for constant technical support, which considerably helped our computational progress in improving the peptide identification statistics over the past few years. We thank Dr. R.-F. Shen for providing various peptide MS/MS data.