Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science Terena 2012, Reykjavik, May 2012.

Presentation transcript:

Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science Terena 2012, Reykjavik, May 2012

Overview Introduction – What is STFC? What do we do? Why do we do it? An example project (Practice) RCUK Data Principles (Policy)

What is STFC? Programme includes: Neutron and Muon Source, Synchrotron Radiation Source, Lasers, Space Science, Particle Physics, Computing and Data Management, Microstructures, Nuclear Physics, Radio Communications. Pictured: ESRF & ILL, Grenoble; Daresbury Laboratory; Square Kilometre Array; Large Hadron Collider

What is the science?

Data centric view of research Data Creation Archival Access Storage Compute Network Services Curation the researcher acts through ingest and access Virtual Research Environment the researcher shouldn’t have to worry about the information infrastructure Information Infrastructure

The 7 C’s Creation Collection Capacity Computation Curation Collaboration Communication Data Creation Archival Access Storage Compute Network Services Curation

Creation Linked systems for: Proposal submission, User management, Data acquisition. Metadata carried from each system to the next. Detectors moving from Hz to kHz, MHz, GHz,... [Image: Examining the detectors on MAPS instrument on ISIS]

Capacity Moore’s law for us is about 15 months: x1000 in 13 years, doubling every 1.3 years. Chart labels: 10 x Moore’s Law (2 years); 2 x Moore’s Law (1.5 years). 2012: currently store about 20 PetaBytes (20PB) of data
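The growth figures on this slide can be cross-checked with a little arithmetic. The sketch below is an illustration added here, not part of the talk: it assumes simple exponential growth, and the function names are hypothetical. It derives the doubling time implied by x1000 in 13 years and compares that growth with Moore's-law doublings of 2 and 1.5 years, which is one plausible reading of the "10 x" and "2 x" chart labels.

```python
import math

def doubling_time(factor: float, years: float) -> float:
    """Years per doubling for exponential growth by `factor` over `years`."""
    return years / math.log2(factor)

def growth(years: float, doubling_years: float) -> float:
    """Growth factor after `years` when doubling every `doubling_years`."""
    return 2 ** (years / doubling_years)

t = doubling_time(1000, 13)          # about 1.3 years per doubling
months = t * 12                      # between 15 and 16 months
ratio_2y = 1000 / growth(13, 2.0)    # roughly 11x a 2-year Moore's law
ratio_18m = 1000 / growth(13, 1.5)   # roughly 2.5x an 18-month Moore's law
print(f"{t:.2f} years, {months:.1f} months, {ratio_2y:.1f}x, {ratio_18m:.1f}x")
```

Under these assumptions the slide's "about 15 months" and "doubling every 1.3 years" describe the same growth rate, and the two chart labels come out as rounded ratios.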

Computation Compute intensive components on the grid Computational applications for Laser Theory Group’s adoption of HPC Laser real-time diagnostics & data flow pipeline. Fitting of experimental data to model

Curation Complexity of Facility Archives All ISIS data (~25 years) > 3,000,000 files All Diamond Data (~2.5 years) > 11,000,000 files Breadth of Data Sources: STFC (Tier 1) NERC (BADC, NEODC) BBSRC (All institutes) AHRC MRC Others: – The STFC Data Portal – The STFC Publications Archive – The CCPs (Collaborative Computational Projects) – The Chemical Database Service – The Digital Curation Centre – The EUROPRACTICE Software service – The HPCx Supercomputer – The JISCmail service – The NERC Datagrid – The Starlink Software suite – The UK Grid Support Centre – The World Data Centre for Solar-Terrestrial Physics Atlas Datastore Tape Robot The StorageTek tape robot with capacity of 20PB

Collection Proposal, Approval, Scheduling, Experiment, Data cleansing, Record Publication. Scientist submits application for beamtime. Facility committee approves application. Facility registers, trains, and schedules scientist’s visit. Scientist visits, facility runs experiment. Raw data filtered and cleansed. Data analysis: tools for processing made available. Subsequent publication registered with facility

Communication Immense Expectations ! Web enables: – access to everything Everything on-line Interlinking enables: – Validation of results – Repetition of experiment Discovery enables: – new knowledge from old Archiving enables: – Unplanned reuse of data Antarctic environmental data – Reuse of knowledge One paper has >20,000 downloads – Completing the cycle Publications entered in next proposal STFC’s “e-pubs” Institutional Repository has records of 30,000 publications spanning 25 years “The web has changed everything...”

Collaboration Technology integration facilitates scientific collaboration Cross facility/beamline Cross disciplinary Technology integration improves facility efficiency PaN-data –Photon and Neutron Data infrastructure ICAT also used in Australian Synchrotron and Oak Ridge National Lab

Overview Introduction An example project – PaNdata RCUK Data Policy Principles

PaN-data Partners PaN-data brings together 11 major European Research Infrastructures. PaN-data is coordinated by the e-Science Department at the Rutherford Appleton Laboratory, UK. ISIS is the world’s leading pulsed spallation neutron source. ILL operates the most intense slow neutron source in the world. PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser. HZB operates the BER II research reactor and the BESSY II synchrotron. CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor. ESRF is a third generation synchrotron light source jointly funded by 19 European countries. Diamond is a new 3rd generation synchrotron funded by the UK and the Wellcome Trust. DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser. Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007. ELETTRA operates a GeV synchrotron and is building the FERMI Free Electron Laser. ALBA is a new 3 GeV synchrotron facility due to become operational in 2010. JCNS: Juelich Centre for Neutron Science. MaxLab: Max IV Synchrotron. The partners operate hundreds of instruments used by over 30,000 scientists each year

PaNdata Vision EDNS - European Data Infrastructure for Neutron and Synchrotron Sources. Different Infrastructures → Different User Experiences: Facility 1, Facility 2 and Facility 3 each run their own chain of Raw Data, Data Analysis, Analysed Data, Publication Data and Publications. Single Infrastructure → Single User Experience: Raw Data Catalogue, Data Analysis, Analysed Data Catalogue, Publication Data Catalogue and Publications Catalogue, over shared Capacity Storage, Publications Repositories, Data Repositories and Software Repositories
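The idea of a single user experience over several facility catalogues can be sketched as a federated search. This is a minimal illustration only: the classes, field names and dataset identifiers are hypothetical, and it does not represent the ICAT or PaNdata interfaces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    facility: str
    dataset_id: str
    technique: str

class FacilityCatalogue:
    """Hypothetical in-memory stand-in for one facility's data catalogue."""
    def __init__(self, facility, records):
        self.facility = facility
        self._records = list(records)

    def search(self, technique):
        # Return only the records that match the requested technique.
        return [r for r in self._records if r.technique == technique]

def federated_search(catalogues, technique):
    """Query every facility catalogue and merge the results into one list."""
    results = []
    for catalogue in catalogues:
        results.extend(catalogue.search(technique))
    return results

catalogues = [
    FacilityCatalogue("Facility 1", [
        Record("Facility 1", "f1-001", "neutron diffraction"),
    ]),
    FacilityCatalogue("Facility 2", [
        Record("Facility 2", "f2-042", "x-ray diffraction"),
        Record("Facility 2", "f2-043", "neutron diffraction"),
    ]),
]
hits = federated_search(catalogues, "neutron diffraction")
print([h.dataset_id for h in hits])
```

The user issues one query and receives results from every facility, rather than learning a different interface per site.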

Science driver – Data Integration: Neutron diffraction, X-ray diffraction and NMR data combined for high-quality structure refinement

PaN-data Standardisation PaN-data Europe – building a sustainable data infrastructure for Neutron and Photon laboratories. PaN-data Europe is undertaking 5 standardisation activities:
1. Development of a common data policy framework
2. Agreement on protocols for shared user information exchange
3. Definition of standards for common scientific data formats
4. Strategy for the interoperation of data analysis software, enabling the most appropriate software to be used independently of where the data is collected
5. Integration and cross-linking of research outputs, completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs

ERA Open Access Sharing Initiatives (examples, etc) ERA Infrastructure Platform Initiatives (EGI, etc) PaNdata Support Action (Ends 30 Nov 11) Policies and Standards PaNdata ODI (begins end 2011) JRAs Users Data Software Integration Provenance Preservation Scalability PaNdata ODI (begins end 2011) Services Users Data PaNdata ODI Virtual Labs Policies Powder Diff SAXS & SANS Tomography

Data The Research Lifecycle – a personal view the researcher acts through ingest and access Research Environment Creation Archival Access Storage Compute Network Data Services the researcher shouldn’t have to worry about the information infrastructure Information Infrastructure MetaData/ Catalogues Portals User Info feed DAQ feed Data Analysis feed EGI GEANT Local resources e-Infrastructure Provenanced Research

Overview Introduction – What is STFC? What do we do? Why do we do it? An example project RCUK Data Policy Principles

RCUK Principles on Data Policy Seven (fairly) orthogonal principles: Public good Preservation Discoverability Confidentiality First use Recognition Costs } Data } Access } Rights

Repeat, Repeal, Repurpose Why might we want access to data? Three distinct reasons for sharing data: Repeat - Validation of previous analysis – How does this fit with peer review? Repeal/Reconsider/Reform/Reverse - Alternative hypotheses in the same field – c.f. Reuse – How does this fit with “right” to first use? Repurpose - New research in another field – c.f. Recycle – How does this fit with recognition of Intellectual contribution? (What’s in it for me?) Different concerns and requirements for each type of sharing

Data are a Public Good Publicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner that does not harm intellectual property. Public good – is nonrival and non-excludable [wikipedia]: nonrival, consumption by one does not reduce availability for others; non-excludable, no one can be effectively excluded from using. Research Data – recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings. As few restrictions as possible – Later (distinguish registration from restriction). Timely – Later (discipline specific). Responsible – Later (maximising access does not necessarily maximise research benefit). Intellectual Property – Later (balance contribution from sharing and from primary research)

Data should be managed Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long term value should be preserved and remain accessible and usable for future research Policies and Plans DMPs should exist for all data Institutional/Departmental v. Project Standards and Best practice discipline specific Acknowledged long term value discipline specific Future research (by current and future generations) Don’t lose it by accident

Data should be discoverable To enable research data to be discoverable and effectively reused by others, sufficient metadata should be recorded and made openly available to enable other researchers to understand the research and re-use potential of the data. Published results should always include information on how to access the supporting data Discoverable – repository? registration? Effectively reused by others – repeat, repeal, repurpose,.. Sufficient Metadata... openly available (stronger than for data)...To understand the re-use potential Published results should include (pointers) could be an address (but note longevity requirement)
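One way to make "sufficient metadata" concrete is a completeness check on each catalogue record. The sketch below is illustrative only: the required field list is an assumption chosen for the example, not something the RCUK principle specifies, and the function name is hypothetical.

```python
# Assumed minimum fields for discoverability (not defined by RCUK).
REQUIRED_FIELDS = ("title", "creator", "description", "access_pointer")

def missing_metadata(record):
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing

record = {
    "title": "Example powder diffraction dataset",
    "creator": "A. Researcher",
    "description": "Raw and reduced data supporting a published result.",
    "access_pointer": "",  # placeholder: a long-lived identifier would go here
}
print(missing_metadata(record))
```

Here the record describes the data well enough to be understood but lacks the pointer that the published results should carry, so the check flags it.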

Data may be protected RCUK recognises that there are legal, ethical and commercial constraints on release of research data. To ensure that the research process is not damaged by inappropriate release of data, research organisation policies and practices should ensure that these are considered at all stages in the research process. Legal Data Protection Act, Freedom of Information Act, Environmental Information Regulations, EU INSPIRE Directive (Spatial Data) Ethical Consent, Privacy, National Security, eg longitudinal cohort studies – protection of cohort participation Commercial – shared funding, patent pending, commercial in confidence... all stages in the research process... Plan in proposal – peer reviewed external review of access requests

Originators may have first use To ensure that research teams get appropriate recognition for the effort involved in collecting and analysing data, those who undertake research council funded work may be entitled to a limited period of privileged use of the data they have collected to enable them to publish the results of their research. The length of this period varies by research discipline and, where appropriate, is discussed further in the published policies of individual Research Councils.... may be entitled limited period of privileged use the length of this period varies... individual research councils’ policies elaborate
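A limited period of privileged use can be expressed as a simple date check. The sketch below is an illustration under stated assumptions: the three-year period is made up for the example (the principle says the length varies by discipline and Research Council), and the function name is hypothetical.

```python
from datetime import date

def is_openly_available(collected: date, embargo_years: int, today: date) -> bool:
    """True once the privileged-use period since collection has elapsed."""
    try:
        release = collected.replace(year=collected.year + embargo_years)
    except ValueError:
        # Collected on 29 February and the release year is not a leap year:
        # fall back to 1 March of the release year.
        release = date(collected.year + embargo_years, 3, 1)
    return today >= release

# Assumed embargo of 3 years, for illustration only.
print(is_openly_available(date(2009, 5, 1), 3, date(2012, 5, 21)))  # True
print(is_openly_available(date(2011, 5, 1), 3, date(2012, 5, 21)))  # False
```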

Reusers have responsibilities In order to recognise the intellectual contributions of researchers who generate, preserve and share key research datasets, all users of research data should acknowledge the sources of their data and abide by the terms and conditions under which they are accessed.... abide by the terms and conditions... terms and conditions may exist to monitor use to promote terms and conditions of use... should acknowledge the sources of their data... Data citation c.f. should be required to acknowledge....

Data sharing is not free It is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds. Data sharing costs! Marginal cost may be small but initial cost may be high. Even if data are free at the point of use (which they may not be) there are costs behind. Cost is fundable by RCs - but needs tensioning against other research. You can ask for funding but be reasonable

Outcomes Ensure the continuing availability of data of long-term value Facilitate the development of mechanisms to improve the management, accessibility and attribution of data Promote awareness of legislation and guidance relating to the management and dissemination of information

Overview Introduction An example project - PaNdata RCUK Data Policy Principles a final thought

The 7 C’s Creation Collection Capacity Computation Curation Collaboration Communication Permanent Access Provenanced Research The Knowledge Lifecycle Data Creation Archival Access Storage Compute Network Services Curation

Thank You

Motivation Data are a critical output of the research process: For the integrity, transparency and robustness of the research record Often value increases through aggregation Enables new research questions to be addressed Supports the wider exploitation of data

The Seven Ps Public good → Public good; Preservation → Preservation; Discoverability → Promotion; Confidentiality → Protection & Privacy; First use → Privilege; Recognition → Probity; Costs → Price


Objectives
Objective 2 – Users: To deploy, operate and evaluate a system for pan-European user identification across the participating facilities and implement common processes for the joint maintenance of that system.
Objective 3 – Data: To deploy, operate and evaluate a generic catalogue of scientific data across the participating facilities and promote its integration with other catalogues beyond the project.
Objective 4 – Provenance: To research and develop a conceptual framework, defined as a metadata model, which can record the analysis process, and to provide a software infrastructure which implements that model to record analysis steps, hence enabling the tracing of the derivation of analysed data outputs.
Objective 5 – Preservation: To add to the PaNdata infrastructure extra capabilities oriented towards long-term preservation and to integrate these within selected virtual laboratories of the project to demonstrate benefits. These capabilities should, as for the developments in the provenance JRA, be integrated into the normal scientific lifecycle as far as possible. The conceptual foundations will be the OAIS standard and the NeXus file format.
Objective 6 – Scalability: To develop a scalable data processing framework, combining parallel filesystems with parallelized standard data formats (pNexus, pHDF5) to permit applications to make most efficient use of dedicated multi-core environments and to permit simultaneous ingest of data from various sources, while maintaining the possibility for real-time data processing.
Objective 7 – Demonstration: To deploy and operate the services and technology developed in the project in virtual laboratories for three specific techniques, providing a set of integrated end-to-end data services.
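The provenance objective, recording analysis steps so that the derivation of an analysed output can be traced, can be sketched with a very small metadata model. This is an illustrative assumption, not the PaNdata ODI model: the class, the file names and the tool names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisStep:
    """One recorded processing step: named inputs produce one output."""
    output: str
    software: str
    inputs: list = field(default_factory=list)

def trace(output, steps):
    """Return the steps that led to `output`, starting from the output itself."""
    by_output = {s.output: s for s in steps}
    chain, pending, seen = [], [output], set()
    while pending:
        name = pending.pop()
        step = by_output.get(name)
        if step is None or name in seen:
            continue  # raw data with no recorded step, or already visited
        seen.add(name)
        chain.append(step)
        pending.extend(step.inputs)
    return chain

steps = [
    AnalysisStep("reduced.nxs", "reduction-tool", ["raw.nxs"]),
    AnalysisStep("fit.dat", "fitting-tool", ["reduced.nxs"]),
]
print([s.software for s in trace("fit.dat", steps)])
```

Walking the recorded steps backwards from an analysed file recovers the software and inputs used at each stage, down to the raw data.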

PaNdata ODI Joint Research Activities; PaNdata ODI Service Activities; PaNdata ODI Service Releases; Standards from PaNdata Support Action. Labels: uCat, dCat, vLabs; Prov, Pres, Scale; users, data, s/w, Integ. Releases: Rel 1, Rel 2, Rel 3, Rel 4. Timeline dates: Dec 2012, Jun 2013, Dec 2013, Jun 2014