Workflows, ontologies and standards for unbiased prediction in high-throughput proteomics

Cannataro M*, Barla A**, Gallo A*, Paoli S**, Jurman G**, Merler S**, Veltri P*, Furlanello C**
*University Magna Graecia of Catanzaro, Italy; **ITC-irst, Trento, Italy
MGED 9, September 7-10, 2006, Seattle, WA, U.S.A.

INTRODUCTION
We connect, in a single pipeline, an ontology-based environment for proteomics spectra management with a distributed complete validation platform for predictive analysis. We build on two existing software platforms (MS-Analyzer and BioDCV) and on emerging proteomics standards. In this set-up, BioDCV is accessed from the MS-Analyzer workflow as a service, which yields an end-to-end pipeline for proteomics data analysis. We present predictive classification studies on MALDI-TOF data based on this pipeline.

REFERENCES
[1] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Using ontologies for preprocessing and mining spectra data on the Grid. FGCS, 2006, in press.
[2] M. Cannataro, P.H. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Preprocessing of Mass Spectrometry Proteomics Data on the Grid. IEEE CBMS 2005.
[3] C. Furlanello, M. Serafini, S. Merler, G. Jurman. Semi-supervised learning for molecular profiling. IEEE Transactions on Computational Biology and Bioinformatics, 2(2).
[4] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, C. Furlanello. Proteome profiling without selection bias. IEEE CBMS 2006, 941-946.
[5] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Kong, Q. Le. Sample classification from protein mass spectrometry, by "peak probability contrasts". Bioinformatics, 20(17):3034-3044.

DATASETS
D1.
MALDI-TOF Ovarian Cancer Dataset, from ( Rdist) [5]
49 samples (24 diseased + 25 controls)
Each raw sample has m/z measurements (892 KB)
Each preprocessed sample has 564 m/z measurements (19 KB)
Preprocessing: normalization, binning
Biomarker identification: baseline subtraction, peak alignment and clustering; 67 features identified

D2. Lab calibration sample
MALDI-TOF, human serum, 20 technical replicates (10 control samples, 10 with 2 proteins), measurement, 347 m/z after preprocessing, predictive discrimination with 7 peaks

MS-ANALYZER
MS-Analyzer [1] is a platform for the integrated management and processing of proteomics spectra data. It supports the ontology-based design of “in silico” proteomics studies: ontologies model software tools and spectra data, while workflows model applications. MS-Analyzer uses a specialized spectra database and provides a set of pre-processing services:
Interface to heterogeneous mass spectrometry formats such as MALDI-TOF, SELDI-TOF, and ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative.
Acquisition, storage, and management of MS data with the SpecDB database. Spectra are stored at each stage (raw, pre-processed, prepared). Single spectra, multiple spectra, or portions of spectra can be queried (in-database preprocessing).
Preprocessing of MS data (smoothing, baseline subtraction, normalization, binning, peak alignment), as well as spectra preparation for further data mining (spectra to ARFF conversion) [2].
Sharing of experiment data, workflows, and knowledge.

[Figure: system architecture diagrams. Labels: SpecDB and SpecDB APIs with RSR, PPSR, PSR (raw, pre-processed, prepared spectra); Ontology-based Workflow Designer with Ontology Assistant (browsing, querying) and WF Editor (composition, browsing, selection, visualization); WF Schema (Abstract, Concrete); WF Resource Discovery Services; WF Translator, WF Scheduler, WF Monitor; Ontology manager, Ontologies; UDDI/MDS, Metadata, WSDL; Spectra Management, Visualization, Preparation, and Preprocessing Services (WS 1, WS 2); Network; M-WS; BioDCV WS; front-end server; FTP repository (data, metadata); repository URL; DMZ server (Apache, mod_python, ZSI module).]

BIODCV
The predictive modeling portion of the proposed system is provided by BioDCV, the ITC-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To handle the intensive data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature ranking procedure [3]. For proteomics, it includes methods for baseline subtraction, spectra alignment, peak clustering, and peak assignment that were adapted from existing R packages and chained to the complete validation system. BioDCV is also a grid application and has been used in production within the EGEE Biomed VO [4].

[Figure: BioDCV processing chain. Labels: feature extraction (within sample, across sample); complete validation; R scripts (visualization); ATE, sampletracking; PHP (biomarker lists); HTML publication; biomarkers, data, report.]

ACKNOWLEDGMENTS
ITC-irst: R. Flor, D. Albanese, B. Irler
UniCZ: G. Cuda, M. Gaspari, P.H. Guzzi, T. Mazza

Three Web Services are used to integrate the two main system components remotely.
The BioDCV component is invoked from the MS-Analyzer workflow through a Web Service client (biodcv-ws-client) in the UniCZ network: data and metadata are copied to an FTP repository, then the data URL and a notification address are sent to the BioDCV Web Service (biodcv-ws) in a DMZ area of the ITC-irst network. This service is run directly by Apache with mod_python and the Zolera SOAP Infrastructure (ZSI). The incoming data are transferred to the internal front-end server within the firewalled area. The front-end first launches the feature extraction module and then a full complete validation process using the BioDCV component. The system outputs are then formatted as graphs and tables by R and PHP scripts. The results are published by the front-end on the DMZ server, and a notification is sent back to the supplied notification address.

[Figure: WEB SERVICES ARCHITECTURE.]

[Figure: error rate (tumour tissue), error rate (non-tumoural tissue), and no-information error rate.]

[Figure: The BioDCV system: EGEE BioMed VO. Labels: 2-50 MB, grid-ftp, scp.]
Commands:
1. grid-url-copy/lcg-cp db from local to SE
2. edg-job-submit BioDCV.jdl
3. grid-url-copy/lcg-cp db from SE to local

[Figure: D2, mean spectra A and B with .95 Student bootstrap CI; axes: m/z, intensity; marked value 9133,17 Da.]
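
The complete validation idea named in the BIODCV section can be illustrated with a small, self-contained sketch. It is not BioDCV code. It assumes that "complete validation" means repeating feature selection inside every resampling split, so that held-out samples never influence the ranking. The ranking score (a scaled class-mean difference) and the nearest-centroid classifier are simplified stand-ins chosen for brevity, not E-RFE or the classifiers used by BioDCV, and all function names are hypothetical.

```python
import random
import statistics


def rank_features(X, y, k):
    """Indices of the k features with the largest class-mean difference,
    scaled by the overall standard deviation (a stand-in for E-RFE)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = statistics.pstdev(pos + neg) or 1.0
        scores.append(abs(statistics.mean(pos) - statistics.mean(neg)) / spread)
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order[:k]


def predict(X_train, y_train, X_test, features):
    """Nearest-centroid classifier restricted to the selected features."""
    centroids = {}
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroids[label] = [statistics.mean(r[j] for r in rows) for j in features]
    predictions = []
    for row in X_test:
        dist = {
            label: sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
            for label, centroid in centroids.items()
        }
        predictions.append(min(dist, key=dist.get))
    return predictions


def stratified_split(y, test_fraction, rng):
    """Random train/test index split that keeps both classes in each part."""
    train, test = [], []
    for label in (0, 1):
        idx = [i for i, l in enumerate(y) if l == label]
        rng.shuffle(idx)
        n_test = max(1, int(len(idx) * test_fraction))
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test


def error_rate(X, y, k, n_splits, rng, select_inside):
    """Mean test error over random stratified splits.

    select_inside=True  -> features are ranked on the training part only.
    select_inside=False -> features are ranked once on all samples (biased).
    """
    all_data_features = rank_features(X, y, k)  # used only by the biased protocol
    errors = []
    for _ in range(n_splits):
        train, test = stratified_split(y, 0.25, rng)
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_te, y_te = [X[i] for i in test], [y[i] for i in test]
        if select_inside:
            features = rank_features(X_tr, y_tr, k)
        else:
            features = all_data_features
        predicted = predict(X_tr, y_tr, X_te, features)
        errors.append(sum(p != t for p, t in zip(predicted, y_te)) / len(y_te))
    return statistics.mean(errors)


def noise_dataset(n_samples, n_features, rng):
    """Pure-noise data with balanced labels: no feature carries real signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


if __name__ == "__main__":
    X, y = noise_dataset(40, 300, random.Random(0))
    biased = error_rate(X, y, 10, 20, random.Random(1), select_inside=False)
    complete = error_rate(X, y, 10, 20, random.Random(1), select_inside=True)
    print(f"selection on all samples: {biased:.2f}")
    print(f"selection inside splits:  {complete:.2f}")
```

On pure-noise data no classifier should do better than chance, so an error estimate well below 0.5 indicates selection bias. In this sketch the protocol that ranks features on all samples is expected to report an optimistic error, while the protocol that ranks inside each split should stay close to the no-information rate.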