Workflows, ontologies and standards for unbiased prediction in high-throughput proteomics

Cannataro M*, Barla A**, Gallo A*, Paoli S**, Jurman G**, Merler S**, Veltri P*, Furlanello C**
*University Magna Graecia of Catanzaro, Italy; **ITC-irst, Trento, Italy
MGED 9, September 7-10, 2006, Seattle, WA, U.S.A.

INTRODUCTION

We connect, in a complete pipeline, an ontology-based environment for proteomics spectra management with a distributed complete validation platform for predictive analysis. We build on two existing software platforms (MS-Analyzer and BioDCV) and on emerging proteomics standards. In this set-up, BioDCV is accessed from the MS-Analyzer workflow as a service, thus providing a complete pipeline for proteomics data analysis. We present predictive classification studies on MALDI-TOF data based on this pipeline.

REFERENCES

[1] M. Cannataro, P. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Using ontologies for preprocessing and mining spectra data on the Grid. FGCS, 2006, in press.
[2] M. Cannataro, P.H. Guzzi, T. Mazza, G. Tradigo, P. Veltri. Preprocessing of Mass Spectrometry Proteomics Data on the Grid. IEEE CBMS 2005.
[3] C. Furlanello, M. Serafini, S. Merler, G. Jurman. Semi-supervised learning for molecular profiling. IEEE Transactions on Computational Biology and Bioinformatics, 2(2).
[4] A. Barla, B. Irler, S. Merler, G. Jurman, S. Paoli, C. Furlanello. Proteome profiling without selection bias. IEEE CBMS 2006, 941–946.
[5] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Kong, Q. Le. Sample classification from protein mass spectrometry, by "peak probability contrasts". Bioinformatics, 20(17):3034–3044.

DATASETS

D1. MALDI-TOF Ovarian Cancer Dataset, from www-stat.stanford.edu/~tibs/PPC/Rdist [5]
- 49 samples (24 diseased + 25 controls)
- Each raw sample: 892 KB of m/z measurements
- Each preprocessed sample: 564 m/z measurements (19 KB)
- Preprocessing: normalization, binning
- Biomarker identification: baseline subtraction, peak alignment, clustering
- 67 features identified

D2. Lab calibration sample
- MALDI-TOF, human serum
- 20 technical replicates: 10 control samples, 10 with 2 proteins
- 347 m/z measurements after preprocessing
- Predictive discrimination with 7 peaks

MS-ANALYZER

MS-Analyzer [1] is a platform for the integrated management and processing of proteomics spectra data. It supports the ontology-based design of "in silico" proteomics studies: ontologies model the software tools and the spectra data, while workflows model the applications. MS-Analyzer uses a specialized spectra database and provides a set of pre-processing services:
- Interfaces to heterogeneous mass spectrometer formats such as MALDI-TOF, SELDI-TOF and ICAT-based LC-MS/MS. Formats are unified into mzData, in compliance with the HUPO-PSI proteomics standardization initiative.
- Acquisition, storage and management of MS data with the SpecDB database. Spectra are stored at their different stages (raw, pre-processed, prepared). Single spectra, multiple spectra or portions of spectra can be queried (in-database preprocessing).
- Preprocessing of MS data (smoothing, baseline subtraction, normalization, binning, peak alignment), as well as spectra preparation for further data mining (spectra to ARFF conversion) [2].
- Sharing of experiment data, workflows and knowledge.

[Figure: MS-Analyzer architecture. Ontology-based Workflow Designer (Ontology Assistant: browsing, querying; WF Editor: composition, browsing, selection, visualization; abstract and concrete WF schema); Ontology manager and ontologies; Workflow Scheduler (WF Translator, WF Scheduler, WF Monitor); Resource Discovery Services (UDDI/MDS, metadata, WSDL); SpecDB APIs over raw (RSR), pre-processed (PPSR) and prepared (PSR) spectra; web services for spectra management, visualization, preparation and preprocessing.]

BIODCV

The predictive modeling portion of the proposed system is provided by BioDCV, the ITC-irst platform for machine learning in high-throughput functional genomics. BioDCV fully supports complete validation in order to control selection bias effects. To cope with the intensive data throughput, BioDCV uses E-RFE, an entropy-based acceleration of the SVM-RFE feature ranking procedure [3]. For proteomics, it includes methods for baseline subtraction, spectra alignment, peak clustering and peak assignment that were adapted from existing R packages and chained to the complete validation system. BioDCV is also a grid application, and it has been used in production within the EGEE Biomed VO [4].

[Figure: BioDCV processing chain. Feature extraction (within sample, across sample); complete validation; R scripts (visualization, ATE, sample tracking); PHP (biomarker lists, HTML publication); report with biomarkers and data.]

ACKNOWLEDGMENTS

ITC-irst: R. Flor, D. Albanese, B. Irler
UniCZ: G. Cuda, M. Gaspari, P.H. Guzzi, T. Mazza

WEB SERVICES ARCHITECTURE

Three Internet Web Services are used to integrate the two main system components remotely. The BioDCV component is invoked from the MS-Analyzer workflow as a Web Service (biodcv-ws-client) in the UniCZ network: data and metadata are copied to an FTP repository, then the data URL and a notification address are transmitted to the BioDCV Web Service (biodcv-ws) in a DMZ area of the ITC-irst network. This service is run directly by Apache with mod_python and the Zolera SOAP Infrastructure (ZSI). The incoming data are transferred to the internal front-end server (server-cz-tn.py) within the firewalled area. The front-end first launches the feature extraction module and then a full complete validation process using the BioDCV component. The system outputs are then formatted as graphs and tables by R and PHP scripts. The results are published by the front-end on the DMZ server, and a notification is sent back to the notification address.

[Figure: web services architecture. MS-Analyzer Ontology-based Workflow Designer with the BioDCV WS client; FTP repository (data, metadata); repository URL passed to the BioDCV WS on the DMZ server (Apache, mod_python, ZSI module); front-end server.]

[Figure: error-rate curves. Legend: error rate (tumour tissue), error rate (non-tumoural tissue), no-information error rate.]

The BioDCV system: EGEE BioMed VO

Commands:
1. grid-url-copy/lcg-cp db from local to SE
2. edg-job-submit BioDCV.jdl
3. grid-url-copy/lcg-cp db from SE to local

[Figure: database transfers between the local site and the grid by grid-ftp or scp (2-50 MB).]

[Figure: D2 mean spectra for classes A and B with 0.95 Student bootstrap confidence intervals; axes: m/z, intensity; marked peak at 9133.17 Da.]
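
The complete validation that BioDCV applies to control selection bias can be illustrated with a minimal Python sketch. This is not BioDCV code. The idea, as it is usually described, is that feature selection is repeated inside every resampling split, so held-out samples never influence which peaks are chosen. All function names below are hypothetical, and a simple class-mean gap filter and a nearest-centroid classifier stand in, purely as assumptions, for the E-RFE ranking and the classifier of the real platform.

```python
import random
from statistics import mean


def select_features(X, y, k):
    """Return indices of the k features with the largest gap between class means."""
    gaps = []
    for j in range(len(X[0])):
        class1 = [row[j] for row, label in zip(X, y) if label == 1]
        class0 = [row[j] for row, label in zip(X, y) if label == 0]
        gaps.append((abs(mean(class1) - mean(class0)), j))
    # Largest gap first; ties are broken by feature index.
    gaps.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in gaps[:k]]


def predict(X_train, y_train, sample, features):
    """Nearest-centroid prediction restricted to the selected features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [row for row, lab in zip(X_train, y_train) if lab == label]
        centroid = [mean([row[j] for row in rows]) for j in features]
        dist = sum((sample[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def stratified_split(y, test_fraction, rng):
    """Hold out a fraction of each class (assumes at least 2 samples per class)."""
    train, test = [], []
    for label in (0, 1):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        n_test = max(1, int(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def complete_validation(X, y, k, n_splits=20, test_fraction=0.25, seed=0):
    """Average test error over splits, selecting features anew in each split."""
    rng = random.Random(seed)
    errors, selected = [], []
    for _ in range(n_splits):
        train, test = stratified_split(y, test_fraction, rng)
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Key step: selection sees the training part only, never the test part.
        features = select_features(X_tr, y_tr, k)
        wrong = sum(predict(X_tr, y_tr, X[i], features) != y[i] for i in test)
        errors.append(wrong / len(test))
        selected.append(features)
    return mean(errors), selected
```

Calling `complete_validation(X, y, k=7)` on a list of preprocessed spectra `X` (one row of intensities per sample) and binary labels `y` returns the mean held-out error and the feature indices chosen in each split. Selecting features once on all samples and only then resampling would let test samples leak into the selection, which is the source of the optimistic bias that complete validation is meant to avoid.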