 We present the implementation of pluggable scoring in the X! Tandem MS/MS database search program.  We have modified the core search program to facilitate.

Slides:



Advertisements
Similar presentations
Chapter 10: The Traditional Approach to Design
Advertisements

Systems Analysis and Design in a Changing World, Fifth Edition
WASTE MANAGEMENT ©2010 SciQuest USA Confidential 1 Powered by RFx User Guide.
In-depth Analysis of Protein Amino Acid Sequence and PTMs with High-resolution Mass Spectrometry Lian Yang 2 ; Baozhen Shan 1 ; Bin Ma 2 1 Bioinformatics.
Chapter 10 The Traditional Approach to Design
Chapter 9: The Traditional Approach to Design Chapter 10 Systems Analysis and Design in a Changing World, 3 rd Edition.
EBI is an Outstation of the European Molecular Biology Laboratory. PRIDE associated tools: Practical exercise 1 PRIDE team, Proteomics Services Group PANDA.
1336 SW Bertha Blvd, Portland OR 97219
How to identify peptides October 2013 Gustavo de Souza IMM, OUS.
Prediction to Protein Structure Fall 2005 CSC 487/687 Computing for Bioinformatics.
PepArML: A model-free, result-combining peptide identification arbiter via machine learning Xue Wu, Chau-Wen Tseng, Nathan Edwards University of Maryland,
University of Leeds Department of Chemistry The New MCM Website Stephen Pascoe, Louise Whitehouse and Andrew Rickard.
De Novo Sequencing v.s. Database Search Bin Ma School of Computer Science University of Waterloo Ontario, Canada.
Peptide Identification by Tandem Mass Spectrometry Behshad Behzadi April 2005.
Three critical areas of change enabled the move from pipeline processing built mostly on unstructured custom code to a structured system where pipelines.
ProReP - Protein Results Parser v3.0©
FIGURE 5. Plot of peptide charge state ratios. Quality Control Concept Figure 6 shows a concept for the implementation of quality control as system suitability.
Scaffold Download free viewer:
My contact details and information about submitting samples for MS
Analysis of tandem mass spectra - II Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology.
Conclusions What’s next? * Implementation of additional input formats * Additional vendor support: As vendors become more open with their APIs for accessing.
EARTH SCIENCE MARKUP LANGUAGE “Define Once Use Anywhere” INFORMATION TECHNOLOGY AND SYSTEMS CENTER UNIVERSITY OF ALABAMA IN HUNTSVILLE.
The dynamic nature of the proteome
PROTEIN STRUCTURE NAME: ANUSHA. INTRODUCTION Frederick Sanger was awarded his first Nobel Prize for determining the amino acid sequence of insulin, the.
Introduction The GPM project (The Global Proteome Machine Organization) Salvador Martínez de Bartolomé Bioinformatics support –
Introduction to XML. XML - Connectivity is Key Need for customized page layout – e.g. filter to display only recent data Downloadable product comparisons.
0 eCPIC Admin Training: Custom Calculated Fields These training materials are owned by the Federal Government. They can be used or modified only by FESCOM.
StAR web server tutorial for ROC Analysis. ROC Analysis ROC Analysis: This module allows the user to input data for several classifiers to be tested.
Winrunner Usage - Best Practices S.A.Christopher.
INF380 - Proteomics-91 INF380 – Proteomics Chapter 9 – Identification and characterization by MS/MS The MS/MS identification problem can be formulated.
10 ITK261 The traditional approach to design Reading: Chapter 10 Oct 9, 11.
Introduction of Geoprocessing Topic 7a 4/10/2007.
Acknowledgements This work is supported by NSF award DBI , and National Center for Glycomics and Glycoproteomics, funded by NIH/NCRR grant 5P41RR
Common parameters At the beginning one need to set up the parameters.
EARTH SCIENCE MARKUP LANGUAGE Why do you need it? How can it help you? INFORMATION TECHNOLOGY AND SYSTEMS CENTER UNIVERSITY OF ALABAMA IN HUNTSVILLE.
Application of Hadoop to Proteomic Searches Steven Lewis 1, Attila Csordas 2, Sarah Killcoyne 1, Henning Hermjakob 2, John Boyle 1 1 Institute for Systems.
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
FlexElink Winter presentation 26 February 2002 Flexible linking (and formatting) management software Hector Sanchez Universitat Jaume I Ing. Informatica.
Laxman Yetukuri T : Modeling of Proteomics Data
INF380 - Proteomics-101 INF380 – Proteomics Chapter 10 – Spectral Comparison Spectral comparison means that an experimental spectrum is compared to theoretical.
PeptideProphet Explained Brian C. Searle Proteome Software Inc SW Bertha Blvd, Portland OR (503) An explanation.
The Functional Genomics Experiment Object Model (FuGE) Andrew Jones, School of Computer Science, University of Manchester MGED Society.
240-Current Research Easily Extensible Systems, Octave, Input Formats, SOA.
Attack signatures derived from Metasploit Final Presentation E. Ramirez A. Zoghbi
Poster produced by Faculty & Curriculum Support (FACS), Georgetown University Medical Center Application of meta-search, grid-computing, and machine-learning.
Overview of Bioinformatics 1 Module Denis Manley..
Software Project MassAnalyst Roeland Luitwieler Marnix Kammer April 24, 2006.
PEAKS: De Novo Sequencing using Tandem Mass Spectrometry Bin Ma Dept. of Computer Science University of Western Ontario.
Standards for proteomics: The HUPO Proteomics Standards Initiative (HUPO PSI) Public Repository for Mass spectrometry spectral.
Introduction of Geoprocessing Lecture 9. Geoprocessing  Geoprocessing is any GIS operation used to manipulate data. A typical geoprocessing operation.
EBI is an Outstation of the European Molecular Biology Laboratory. In silico analysis of accurate proteomics, complemented by selective isolation of peptides.
August 2003 At A Glance The IRC is a platform independent, extensible, and adaptive framework that provides robust, interactive, and distributed control.
Background Spectral library searching Spectral library searching is an alternative approach to traditional sequence database searching for peptide inference.
Fire Emissions Network Sept. 4, 2002 A white paper for the development of a NSF Digital Government Program proposal Stefan Falke Washington University.
Tag-based Blind Identification of PTMs with Point Process Model 1 Chunmei Liu, 2 Bo Yan, 1 Yinglei Song, 2 Ying Xu, 1 Liming Cai 1 Dept. of Computer Science.
Poster produced by Faculty & Curriculum Support (FACS), Georgetown University Medical Center Application of meta-search, grid-computing, and machine-learning.
Introduction of Geoprocessing Lecture 9 3/24/2008.
Application of meta-search, grid-computing, and machine-learning can significantly improve the sensitivity of peptide identification. The PepArML meta-search.
ECHO Technical Interchange Meeting 2013 Timothy Goff 1 Raytheon EED Program | ECHO Technical Interchange 2013.
ISA Kim Hye mi. Introduction Input Spectrum data (Protein database) Peptide assignment Peptide validation manual validation PeptideProphet.
Using Scaffold OHRI Proteomics Core Facility. This presentation is intended for Core Facility internal training purposes only.
CPAS Comparative Proteomics Analysis System Adam Rauch LabKey Software
Protein identification by mass spectrometry The shotgun proteomics strategy, based on digesting proteins into peptides and sequencing them using tandem.
Protein identification by mass spectrometry The shotgun proteomics strategy, based on digesting proteins into peptides and sequencing them using tandem.
Lesson: Sequence processing
Implementation Process
A Database of Peak Annotations of Empirically Derived Mass Spectra
System Design.
Bioinformatics for Proteomics
High level view of the MAE algorithm.
Presentation transcript:

 We present the implementation of pluggable scoring in the X! Tandem MS/MS database search program.  We have modified the core search program to facilitate the addition of new scoring algorithms without requiring any modifications to any part of the rest of the Tandem source code.  We present score function plug-in modules that can be used as templates to develop new scoring algorithms.  We demonstrate a framework for evaluating/comparing the performance of the various plug-ins within the CPAS analysis system.  Future work includes plans to investigate the development of more accurate and sensitive scoring algorithms; hopefully this work facilitates other computational researchers to do the same.  This work builds on the Tandem program and allows us to take advantage of the rich and growing open source development community fostered by the GPM project. The source code modifications also include the creation of a generalized plug-in manager. The plug-in manager maps a plug-in type to a set of named plug-ins; see mplugin.h and mplugin.cpp in the source distribution for details. The creation of a new score plug-in involves its implementation in a single C++ source file and corresponding single header file. Scoring modules can be added or removed simply by adding or removing the pair of.cpp/.h files from the project and recompiling. No other changes are required, to either the Makefile or the other source files, to add/remove these scoring plug-ins. The modifications made to Tandem, as well as Tandem’s native hyperscore score function (referred to as the mscore function) are available as part of the default Tandem distribution. Other example current and future scoring modules, including the ‘sscore’ module, which is a score function implementing simple spectral processing and scoring and intended to be used as a template for developing new plug-ins, will be available from the Hutch’s CPL website at proteomics.fhcrc.org. The framework for various MS/MS database search tools are conceptually very similar. Just about every database search tool contains the same core component steps: reading in input spectra, parsing sequence databases, identifying candidate peptides of the same mass as the input spectra, comparing sequence vs. spectrum to generate a score, storing relevant peptide hits as the database is being searched, and lastly writing out the search results. So the implementation and testing of a new score function typically requires developing an entire search tool just to investigate one (albeit very important) subcomponent. To facilitate the development and evaluation of novel score functions, we modified the Tandem program to allow scoring modules to be easily pluggable. The modifications enable adding new (and modifying existing) score functions without requiring any modifications to the rest of the source code. The modifications to the Tandem source involve first separating out general scoring infrastructure from the actual native Tandem score algorithm. Infrastructure for scoring functions, such as a state machine for choosing which peptide sequences to score, was left in the base class. Specific scoring decisions were moved into functions that were then made virtual so they could be overridden by other plug-ins. These virtual functions, which represent the true “Pluggable Scoring API,” are marked within the mscore.h file in the TANDEM project (see insert to the right). An open-source framework for evaluating MS/MS score functions: "pluggable scoring" in X! Tandem Jimmy Eng 1, Brendan Maclean 2, Ron Beavis 3, Martin McIntosh 1 1 Computational Proteomics Laboratory, Fred Hutchinson Cancer Research Center, Seattle, WA USA 2 LabKey Software, LLC, Seattle, WA USA 3 Beavis Informatics, LLC, Manitoba Canada Overview Methods Results Conclusions Peptide and protein identification via tandem mass spectrometry and sequence database searching has become ubiquitous in the realm of proteomics. As the field of proteomics grows and attracts a larger computational audience, there’s an ever increasing emphasis on developing more accurate and sensitive computational methods to identify the constituent proteins in a sample. Most of this research has focused on MS/MS database searching and more specifically variations on how to better score a match between a spectrum and candidate peptides from a sequence database. Widely used MS/MS database search tools such as Mascot and SEQUEST are proprietary and not amendable to user source modification. The recent availability of open source MS/MS search tools such as X! Tandem, OMSSA, and ProbID allows researchers the opportunity to transparently understand the mechanisms behind how they work and make modifications to them as needed. We focused on modifying the X! Tandem software to enable pluggable scoring because we found the software well architected, it arguably has the most traction of the three open source tools within the proteomics community, and it allows us to be compatible with and take advantage of the ever growing, rich set of tools of the GPM project. By extending Tandem to make the score function pluggable, this enables researchers to develop and evaluate new scoring algorithms without having to write an entire database search tool. Introduction The X! Tandem software tool has been adapted to allow novel score functions to be “plugged” in. The goal is to facilitate and foster the development, implementation, and evaluation of better ways of identifying peptides by tandem mass spectrometry database searching. This is achieved by modifying Tandem such that core infrastructure is separated from the actual score function module, enabling the scoring class to be overridden. We present a general introduction to the software modifications that enable pluggable scoring, supply example score function modules, and discuss plans for future work. ASMS )Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Electrophoresis Dec; 20(18): )Eng JK, McCormack AL, and Yates JR 3rd, J Am Soc Mass Spectrom. 1994; 5: )Craig R, Beavis RC. Bioinformatics Jun 12; 20(9): )Geer LY, Markey SP, Kowalak JA, Wagner L, Xu M, et al. J Proteome Res Sep-Oct;3(5): )Zhang N, Aebersold R, Schwikowski B. Proteomics Oct; 2(10): )Rauch A, Bellew M, Eng J, Fitzgibbon M, et al. J of Proteome Res. 5(1), , 01/2006 This work was funded by NCI contract 23XS144A and others??? References Read spectra /* The following section represents the pluggable scoring API devised and implemented by Brendan MacLean. It allows developers to simply add new scoring systems to X! Tandem for any purpose. Please do not modify without considering carefully how changes might impact external scoring plug-ins. */ virtual bool load_param(XmlParameter &_x); // allows score object to issue warnings/set vars based on xml virtual bool precondition(mspectrum &_s); // called before spectrum conditioning virtual bool add_details(mspectrum &_s); virtual bool add_mi(mspectrum &_s); virtual void prescore(const size_t _i); // called before scoring virtual float score(const size_t _i); /* If you have no need for scoring plug-ins other than mscore_tandem, making this function non-virtual can yield a 10% perf improvement. */ #ifdef PLUGGABLE_SCORING virtual unsigned long mconvert(double _m, const long _c); // convert mass to integer ion m/z for mi vector #else unsigned long mconvert(double _m, const long _c); // convert mass to integer ion m/z for mi vector #endif virtual double hfactor(long _l); // hyper scoring factor from number of ions matched virtual double sfactor(); // factor applied to final convolution score virtual float hconvert(float _h); // convert hyper score to histogram score virtual void report_score(char* _buff, float _h); // format hyper score for output virtual bool clear(); protected: virtual double dot(unsigned long *_v) = 0; // this is where the real scoring happens /* End puggable scoring API. */ Score peptides Read FASTA db Find peptides Store results Write output Score function 3 Score function 2 Default mscore score function K.THEANSWER.F mscore.h Some text describing the “sscore” algorithm Some text describing the “kscore” algorithm. Some text briefly describing CPAS and the analysis framework that Brendan created that generates all of the figures here. Some text talking about how we can use the expectation value as a score-function neutral metric to compare results. Related discussion about the histograms buckets used for the expectation value and their effects on the results. Would be nice to reconstruct the histograms for a couple of example searches to show the effect on the e-value calculation. Mention the Maclean et al. manuscript either in preparation or hopefully submitted. Use some subset of these plots for the results section.