European Life Sciences Infrastructure for Biological Information www.elixir-europe.org “BILS-ProteomeXchange integration using EUDAT resources” ELIXIR-Pilot.

Presentation transcript:

European Life Sciences Infrastructure for Biological Information
“BILS-ProteomeXchange integration using EUDAT resources”: ELIXIR Pilot Project
Dr. Juan A. Vizcaíno, EMBL-EBI; Dr. Fredrik Levander, BILS

Juan A. Vizcaíno ELIXIR Webinar 20 May 2015
Main people involved directly in this pilot: Andy Jenkinson (Systems group), Rui Wang (PRIDE), Juan A. Vizcaíno (PRIDE), Fredrik Levander, Samuel Lampa, Janos Nagy, Mikael Borg, Jani Heikkinen

Overview
- Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
- Objectives of the pilot
- Report on the results
- Perspectives for the future and conclusions

PRIDE (PRoteomics IDEntifications) database
PRIDE stores mass spectrometry (MS)-based proteomics data:
- Peptide and protein expression data (identification and quantification)
- Post-translational modifications
- Mass spectra (raw data and peak lists)
- Technical and biological metadata
- Any other related information
Full support for tandem MS approaches.
Martens et al., Proteomics, 2005; Vizcaíno et al., NAR, 2013

ProteomeXchange Consortium
Goal: development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
- Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and MassIVE (UCSD, San Diego). Tranche and Peptidome were initially included but have been discontinued.
- Common identifier space (PXD identifiers).
- Two supported data workflows: MS/MS and SRM.
- Main objective: make life easier for researchers.
Vizcaíno et al., Nat Biotechnol, 2014

ProteomeXchange data workflow: PRIDE
[Figure: ProteomeXchange data workflow. Labels: researcher's results, raw data*, metadata; receiving repositories PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral (metadata / manuscript); journals; UniProt/neXtProt; PeptideAtlas; GPMDB; other DBs; reprocessed results.]
Vizcaíno et al., Nat Biotechnol, 2014

PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
   a. Complete submissions: result files can be converted to PRIDE XML or the mzIdentML data standard.
   b. Partial submissions: for workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files (optional):
   a. QUANT: Quantification related results
   b. PEAK: Peak list files
   c. GEL: Gel images
   d. OTHER: Any other file type
   e. FASTA
   f. SP_LIBRARY
[Figure labels: Published, Raw Files, Other files]
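The distinction between complete and partial submissions in point 2 can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide: the function name `submission_type` and the file-name suffixes are assumptions made for the example, not the behaviour of the PX Submission Tool.

```python
# Assumption: file-name suffixes stand in for the two formats the slide names
# (PRIDE XML and mzIdentML). The real submission tooling may detect formats differently.
SUPPORTED_RESULT_SUFFIXES = (".pride.xml", ".mzid", ".mzidentml")


def submission_type(result_files):
    """Return "complete" if every result file is in a supported format, else "partial"."""
    if not result_files:
        raise ValueError("a submission needs at least one result file")
    all_supported = all(
        name.lower().endswith(SUPPORTED_RESULT_SUFFIXES) for name in result_files
    )
    return "complete" if all_supported else "partial"


print(submission_type(["sample1.mzid", "sample2.pride.xml"]))
print(submission_type(["sample1.mzid", "search_output.dat"]))
```

A single search engine output file in its original form is enough to make the whole submission partial in this sketch.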

Current PSI Standard File Formats for MS
- MS data: mzML (Martens et al., MCP, 2011)
- Identification: mzIdentML (Jones et al., MCP, 2012)
- Quantitation: mzQuantML (Walter et al., MCP, 2013)
- SRM: TraML (Deutsch et al., MCP, 2012)
- Final results: mzTab (Griss et al., MCP, 2014)

PRIDE Components: Submission Process
[Figure labels: PRIDE Converter 2, PRIDE Inspector, PX Submission Tool, mzIdentML, PRIDE XML]

ProteomeXchange: 1,963 datasets up until 1st April, 2015
Type:
- 613 PRIDE complete
- 1177 PRIDE partial
- 79 PeptideAtlas/PASSEL complete
- 69 MassIVE
- 25 reprocessed
Publicly accessible: 959 datasets, 49% of all
- 88% PRIDE
- 9% PASSEL
- 3% MassIVE
Data volume:
- Total: ~102 TB
- Number of all files: ~250,000
- PXD : ~5 TB
- PXD000065: ~1.4 TB
Datasets/year: 2012: : : : 371
Origin: 396 USA, 224 Germany, 191 United Kingdom, 106 Netherlands, 105 China, 104 France, 94 Switzerland, 75 Canada, 55 Japan, 55 Spain, 54 Denmark, 52 Sweden, 50 Belgium, 48 Australia, 34 Austria, 25 Norway, 23 Taiwan, 22 India, 21 Finland, 20 Ireland, 20 Italy, 16 Brazil, 15 Russia, 14 Republic of Korea, 10 Israel, 10 Singapore, …
Top species studied by at least 20 datasets: 839 Homo sapiens, 232 Mus musculus, 79 Arabidopsis thaliana, 77 Saccharomyces cerevisiae, 44 Rattus norvegicus, 35 Escherichia coli, 21 Bos taurus, 21 Glycine max (~460 species in total)
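The headline figures on this slide are internally consistent, which a short check makes explicit. The numbers are taken from the slide; the variable names are only for illustration.

```python
# Dataset counts by type, as listed on the slide.
datasets_by_type = {
    "PRIDE complete": 613,
    "PRIDE partial": 1177,
    "PeptideAtlas/PASSEL complete": 79,
    "MassIVE": 69,
    "reprocessed": 25,
}

total = sum(datasets_by_type.values())
public = 959  # publicly accessible datasets, from the slide
public_share = round(100 * public / total)

print(f"total datasets: {total}")
print(f"publicly accessible: {public_share}% of all")
```

The per-type counts sum to the stated 1,963 datasets, and 959 of 1,963 rounds to the stated 49%.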

BILS – Bioinformatics Infrastructure for Life Sciences
- Distributed national research infrastructure supported by the Swedish Research Council
- Coordination with other bioinformatics activities
- Swedish node in ELIXIR
BILS provides:
- Bioinformatics support (consultancy)
- Bioinformatics infrastructure (data and tools); computing and storage is provided in collaboration with SNIC
- Bioinformatics network: nodes at each of the 6 large university cities, annual workshop
- Training

Main BILS proteomics support aims
- Data storage: secure, long-term, metadata, automated, publishing, standardised formats
- Data processing: accessible data processing workflows

Proteios: Software environment for proteomics
A multi-user platform for analysis and management of proteomics data
[Figure labels: web browser access and analysis of own data only; BILS; Scripts; Public access to released raw data]
Häkkinen et al. (2009) J Proteome Res

EUDAT
EUDAT aims to contribute to building and operating a Collaborative Data Infrastructure for European science. This involves a suite of co-ordinated and interoperable services for preserving scientific data, and for making them accessible to researchers. EUDAT collaborates with research communities across a range of disciplines, from social sciences to environmental science and including molecular biology (as represented by ELIXIR). These communities have diverse structures, cultures and scales but also share some common requirements regarding the management of data.

EUDAT services

B2SAFE

EUDAT: B2SAFE and iRODS
B2SAFE aims to provide a software ecosystem for persistently available data, including persistent identification, abstracted data storage, and reliable automated replication via auditable rules. It is built on top of the iRODS data management software and integrates a PID system such as that of the European Persistent Identification Consortium (EPIC, Handle API).

Overview
- PRIDE, ProteomeXchange, BILS and EUDAT
- Objectives of the pilot
- Report on the results
- Perspectives for the future and conclusions

Objective
To integrate the data repositories for MS proteomics data run by BILS (Sweden) and ProteomeXchange (via the PRIDE database, EMBL-EBI, UK), using EUDAT’s B2SAFE software.

Plans at European level
[Figure: national proteomics centers (metadata, results, raw data), a central repository (metadata, results, raw data) and data storage centers (metadata, raw data), with two replication steps labelled 1. ELIXIR replication and 2. EUDAT replication.]

Objective
To integrate the data repositories for MS proteomics data run by BILS (Sweden) and ProteomeXchange (via the PRIDE database, EMBL-EBI, UK), using EUDAT’s B2SAFE software. This project will also show the potential of collaboration among research infrastructures and e-infrastructures to better manage the data deluge. It will help to evaluate the requirements of such federated systems.

Overview
- Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
- Objectives of the pilot
- Report on the results
- Perspectives for the future and conclusions

Timeline
- The pilot started when Jani Heikkinen (EUDAT) installed B2SAFE at EMBL-EBI (July 2014).
- The data workflow was defined in September/October 2014.
- Implementation work happened in parallel, with regular weekly calls from January.
- The pilot is now finishing (May 2015).

Envisioned data workflow (September/October 2014)
- Default B2SAFE rules -> trigger replication of data from BILS to EBI
- PIDs assigned per file
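The envisioned workflow (one PID per file, with replication from BILS to EBI triggered by a rule) can be illustrated with a toy model. This is a minimal sketch under stated assumptions and does not use iRODS or B2SAFE: `Site`, `mint_pid` and `register_and_replicate` are hypothetical names, the prefix `0000` is a placeholder, and replication is reduced to copying an entry from one in-memory catalogue to another. How B2SAFE actually records PIDs for replicas is deliberately left out.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Site:
    """Toy stand-in for a data centre: maps PID -> file name."""
    name: str
    files: dict = field(default_factory=dict)


def mint_pid(prefix):
    """Build a Handle-style identifier of the form '<prefix>/<suffix>'."""
    return f"{prefix}/{uuid.uuid4()}"


def register_and_replicate(file_names, source, target, prefix):
    """Assign one PID per file at the source, then copy each entry to the target."""
    pids = {}
    for file_name in file_names:
        pid = mint_pid(prefix)
        source.files[pid] = file_name
        target.files[pid] = file_name  # stand-in for the replication rule firing
        pids[file_name] = pid
    return pids


bils = Site("BILS")
ebi = Site("EBI")
assigned = register_and_replicate(["run1.raw", "run2.raw"], bils, ebi, prefix="0000")
for file_name, pid in assigned.items():
    print(file_name, "->", pid)
```

In the pilot this step was meant to be handled by the default B2SAFE rules rather than by client code, so the sketch only shows the intended outcome: every file has its own identifier and exists at both sites.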

Implementation process (1)
- B2SAFE (including iRODS 3.3.1) was initially installed at EMBL-EBI. However, BILS had already moved to iRODS v4, and incompatibility problems were found.
- It was decided to install iRODS 4.0 at the EBI to solve the incompatibility issue.
- At the time, B2SAFE was not officially supported with iRODS version 4.0.3, so changes to the original install procedure were necessary to accommodate it.

Implementation process (2)
- EBI and BILS obtained Handle prefixes and made them available within EPIC. The integration with iRODS was successfully tested.
- The next step was to configure B2SAFE and achieve a test replication of a file from BILS to EBI using the B2SAFE PID creation and file transfer rules.
Unexpected delays:
- EBI experienced some network issues that affected communications between the EBI and BILS iRODS servers.
- Two successive bugs were discovered. Both centered on the rule execution engine and prevented B2SAFE from functioning. These bugs were solved by the EUDAT and iRODS developers.

Implementation process (3)
- With workarounds now in place, it was possible to manually trigger a successful replication of a file from BILS to EBI.
- However, it became apparent that the authorisation mechanism employed by iRODS in a federation would make the proposed submission workflow difficult to manage in a production environment: every BILS researcher able to submit data must first have a user created for them on the EBI server.
- Alternative customised solutions could solve this issue by decoupling the actions of researchers from the replication itself. However, this would inevitably add complexity.

Implementation process (4)
- At this point (March 2015) the pilot had overrun (it was expected to last 6 months), with more work required to integrate the B2SAFE replication process with the PRIDE submission pipeline.
- It was decided to halt the process and find an alternative way to achieve the same goals using existing resources.
- A detailed report has been written and sent to all the parties involved.

Implemented alternative solution
- Proteios is able to generate the metadata file needed for the submission to ProteomeXchange via PRIDE.
- The PX submission tool was extended to support loading of files not available locally at the moment of submission (URLs are specified).
- As a proof of concept, dataset PXD was submitted to PRIDE. It is now publicly available.
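The key change in the alternative solution is that a file listed in a submission may be referenced by URL instead of a local path. A minimal sketch of that distinction follows, assuming that `http`, `https` and `ftp` are the accepted remote schemes. The function names are hypothetical and this is not the PX submission tool's own logic.

```python
from urllib.parse import urlparse

# Assumption: which URL schemes count as "remote" is chosen for this example only.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def classify_reference(reference):
    """Return "remote" for a URL with a known scheme, otherwise "local"."""
    scheme = urlparse(reference).scheme.lower()
    return "remote" if scheme in REMOTE_SCHEMES else "local"


def split_references(references):
    """Partition file references into (local paths, remote URLs)."""
    local = [r for r in references if classify_reference(r) == "local"]
    remote = [r for r in references if classify_reference(r) == "remote"]
    return local, remote


local_files, remote_files = split_references([
    "/data/project/run1.raw",
    "https://example.org/proteomics/run2.raw",  # hypothetical URL
])
print("local:", local_files)
print("remote:", remote_files)
```

With a split like this, a submission tool could skip the local existence check for remote entries and leave the transfer to the receiving repository.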

PX submission tool updated to streamline BILS submissions

Submitted dataset (now publicly available in PRIDE)

Dataset tags in PRIDE Archive
es%20(BILS)%20network%20(Sweden)
- Datasets can be tagged with different attributes.
- Functionality available in the submission process.
- Stable URLs can be generated.

Overview
- Short intro to PRIDE & ProteomeXchange, BILS and EUDAT
- Objectives of the pilot
- Report on the results
- Perspectives for the future and conclusions

At present and in the near future…
- EMBL-EBI is involved in the EUDAT 2020 project (PI is Steven Newhouse).
- EMBL-EBI will therefore continue to collaborate with EUDAT to gain experience in the use of this software.
- PRIDE will evaluate the situation in the future to decide whether to implement the originally envisioned submission pipeline (based on B2SAFE and iRODS).

Conclusions
- The pilot establishes that the original use case is not the best application of B2SAFE at the present time. However, the situation will be kept under review by PRIDE.
- This conclusion is not a reflection on B2SAFE per se. Indeed, B2SAFE and iRODS have been found to be very flexible and are likely to be interesting candidates for other use cases outside of PRIDE, elsewhere in EMBL-EBI or ELIXIR.
- In particular, they may suit use cases focused on data management within or between data centres (i.e. bipartite collaborations), or environments where mature data submission, curation and archiving solutions do not already exist.
- In addition, we recommend that ELIXIR continues to explore EUDAT services and their relevance to ELIXIR use cases.

Conclusions: Technical recommendations
- Incorporate a fully functional RESTful interface for iRODS into B2SAFE that can be used by a client, to avoid installing iCommands on the client machine.
- The security model should be adapted to allow anonymous read/write (RW) access to a specified URL.
- If widespread deployment of EUDAT software is expected, effort must be committed by EUDAT 2020 to make the software more easily and quickly deployable by ‘ordinary’ system administrators.

Acknowledgements: Henning Hermjakob, Steven Newhouse, Rafael Jimenez, Bengt Persson, EUDAT management & developers