Published by Jacob Taylor. Modified over 8 years ago.
1
Building an Open Data Infrastructure for Science: Policy and Practice Juan Bicarregui STFC e-Science Terena 2012, Reykjavik, May 2012
2
Overview Introduction – What is STFC? What do we do? Why do we do it? An example project (Practice) RCUK Data Principles (Policy)
3
What is STFC? Programme includes: Neutron and Muon Source, Synchrotron Radiation Source, Lasers, Space Science, Particle Physics, Computing and Data Management, Microstructures, Nuclear Physics, Radio Communications. Image labels: 250m; ESRF & ILL, Grenoble; Daresbury Laboratory; Square Kilometre Array; Large Hadron Collider
4
What is the science?
5
Data centric view of research Data Creation Archival Access Storage Compute Network Services Curation the researcher acts through ingest and access Virtual Research Environment the researcher shouldn’t have to worry about the information infrastructure Information Infrastructure
6
The 7 C’s Creation Collection Capacity Computation Curation Collaboration Communication Data Creation Archival Access Storage Compute Network Services Curation
7
Creation Linked systems for: Proposal submission, User management, Data acquisition. Metadata carried from each system to the next. Detectors moving from Hz to kHz, MHz, GHz,... Image: Examining the detectors on MAPS instrument on ISIS
8
Capacity Moore’s law for us is about 15 months: x1000 in 13 years, doubling every 1.3 years. Chart reference lines: Moore’s Law; 2 x Moore’s Law (1.5 years); 10 x Moore’s Law (2 years). 2012: currently store about 20 PetaBytes (20PB) of data
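The two growth figures quoted on this slide (a thousandfold increase in 13 years, and a doubling every 1.3 years) can be checked against each other with a short calculation. This is a minimal sketch; the function name is illustrative and does not come from the slides.

```python
import math


def doubling_time_years(growth_factor: float, period_years: float) -> float:
    """Doubling time implied by growing `growth_factor`-fold over `period_years`."""
    # Number of doublings needed to reach the growth factor.
    doublings = math.log2(growth_factor)
    return period_years / doublings


# Figures quoted on the slide: x1000 in 13 years.
t = doubling_time_years(1000, 13)
print(f"{t:.2f} years, about {t * 12:.1f} months")
```

A factor of 1000 is just under ten doublings, so the result is close to 1.3 years, in line with the slide's "about 15 months".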
9
Computation Compute intensive components on the grid Computational applications for Laser Theory Group’s adoption of HPC Laser real-time diagnostics & data flow pipeline. Fitting of experimental data to model
10
Curation Complexity of Facility Archives All ISIS data (~25 years) > 3,000,000 files All Diamond Data (~2.5 years) > 11,000,000 files Breadth of Data Sources: STFC (Tier 1) NERC (BADC, NEODC) BBSRC (All institutes) AHRC MRC Others: – The STFC Data Portal – The STFC Publications Archive – The CCPs (Collaborative Computational Projects) – The Chemical Database Service – The Digital Curation Centre – The EUROPRACTICE Software service – The HPCx Supercomputer – The JISCmail service – The NERC Datagrid – The Starlink Software suite – The UK Grid Support Centre – The World Data Centre for Solar-Terrestrial Physics Atlas Datastore Tape Robot The StorageTek tape robot with capacity of 20PB
11
Collection Proposal: Scientist submits application for beamtime. Approval: Facility committee approves application. Scheduling: Facility registers, trains, and schedules scientist’s visit. Experiment: Scientist visits, facility runs experiment. Data cleansing: Raw data filtered and cleansed. Data analysis: Tools for processing made available. Record Publication: Subsequent publication registered with facility.
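The collection workflow on this slide, together with the earlier point that metadata is carried from each system to the next, can be sketched as a record that moves through ordered stages and accumulates metadata on the way. This is only an illustration: the class names, field names and example values below are assumptions, not part of any STFC system.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    # Stage names follow the slide; the numbering is an assumption.
    PROPOSAL = 1
    APPROVAL = 2
    SCHEDULING = 3
    EXPERIMENT = 4
    DATA_CLEANSING = 5
    PUBLICATION = 6


@dataclass
class ExperimentRecord:
    proposal_id: str  # hypothetical identifier
    stage: Stage = Stage.PROPOSAL
    metadata: dict = field(default_factory=dict)

    def advance(self, **new_metadata) -> None:
        """Move to the next stage, keeping earlier metadata and adding new entries."""
        if self.stage is Stage.PUBLICATION:
            raise ValueError("record is already at the final stage")
        self.metadata.update(new_metadata)
        self.stage = Stage(self.stage + 1)


# Hypothetical example values.
record = ExperimentRecord("RB-0001", metadata={"applicant": "A. Scientist"})
record.advance(approved_by="facility committee")
record.advance(instrument="MAPS")
print(record.stage.name, record.metadata)
```

Because each step only adds to the metadata, information entered at proposal time is still attached when the publication is registered.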
12
Communication Immense Expectations ! Web enables: – access to everything Everything on-line Interlinking enables: – Validation of results – Repetition of experiment Discovery enables: – new knowledge from old Archiving enables: – Unplanned reuse of data Antarctic environmental data – Reuse of knowledge One paper has >20,000 downloads – Completing the cycle Publications entered in next proposal STFC’s “e-pubs” Institutional Repository has records of 30,000 publications spanning 25 years “The web has changed everything...”
13
Collaboration Technology integration facilitates scientific collaboration: Cross facility/beamline, Cross disciplinary. Technology integration improves facility efficiency. PaN-data – Photon and Neutron Data infrastructure. ICAT also used in Australian Synchrotron and Oak Ridge National Lab
14
Overview Introduction An example project – PaNdata RCUK Data Policy Principles
15
PaN-data brings together 11 major European Research Infrastructures PaN-data is coordinated by the e-Science Department at the Rutherford Appleton Laboratory, UK ISIS is the world’s leading pulsed spallation neutron source ILL operates the most intense slow neutron source in the world PSI operates the Swiss Light Source, SLS, and Neutron Spallation Source, SINQ, and is developing the SwissFEL Free Electron Laser HZB operates the BER II research reactor and the BESSY II synchrotron CEA/LLB operates neutron scattering spectrometers from the Orphée fission reactor ESRF is a third generation synchrotron light source jointly funded by 19 European countries Diamond is a new 3rd generation synchrotron funded by the UK and the Wellcome Trust DESY operates two synchrotrons, Doris III and Petra III, and the FLASH free electron laser Soleil is a 2.75 GeV synchrotron radiation facility in operation since 2007 ELETTRA operates a 2-2.4 GeV synchrotron and is building the FERMI Free Electron Laser ALBA is a new 3 GeV synchrotron facility due to become operational in 2010 PaN-data Partners JCNS Juelich Centre for Neutron Science MaxLab, Max IV Synchrotron The partners operate hundreds of instruments used by over 30,000 scientists each year
16
EDNS - European Data Infrastructure for Neutron and Synchrotron Sources PaNdata Vision Single Infrastructure Single User Experience Capacity Storage Publications Repositories Data Repositories Software Repositories Raw Data Data Analysis Analysed Data Publication Data Publications Facility 1 Raw Data Data Analysis Analysed Data Publication Data Publications Facility 2 Raw Data Data Analysis Analysed Data Publication Data Publications Facility 3 Different Infrastructures Different User Experiences Raw Data Catalogue Data Analysis Analysed Data Catalogue Publication Data Catalogue Publications Catalogue
17
Science driver – Data Integration Neutron diffraction X-ray diffraction } NMR High-quality structure refinement }
18
PaN-data Standardisation PaN-data Europe is undertaking 5 standardisation activities:
1. Development of a common data policy framework
2. Agreement on protocols for shared user information exchange
3. Definition of standards for common scientific data formats
4. Strategy for the interoperation of data analysis software enabling the most appropriate software to be used independently of where the data is collected
5. Integration and cross-linking of research outputs completing the lifecycle of research, linking all information underpinning publications, and supporting the long-term preservation of the research outputs
PaN-data Europe – building a sustainable data infrastructure for Neutron and Photon laboratories
19
ERA Open Access Sharing Initiatives (examples, etc) ERA Infrastructure Platform Initiatives (EGI, etc) PaNdata Support Action (Ends 30 Nov 11) Policies and Standards PaNdata ODI (begins end 2011) JRAs Users Data Software Integration Provenance Preservation Scalability PaNdata ODI (begins end 2011) Services Users Data PaNdata ODI Virtual Labs Policies Powder Diff SAXS & SANS Tomography
20
Data The Research Lifecycle – a personal view the researcher acts through ingest and access Research Environment Creation Archival Access Storage Compute Network Data Services the researcher shouldn’t have to worry about the information infrastructure Information Infrastructure MetaData/ Catalogues Portals User Info feed DAQ feed Data Analysis feed EGI GEANT Local resources e-Infrastructure Provenanced Research
21
Overview Introduction – What is STFC? What do we do? Why do we do it? An example project RCUK Data Policy Principles
22
RCUK Principles on Data Policy Seven (fairly) orthogonal principles: Public good, Preservation, Discoverability, Confidentiality, First use, Recognition, Costs (grouped on the slide under Data, Access and Rights)
23
Repeat, Repeal, Repurpose Why might we want access to data? Three distinct reasons for sharing data: Repeat - Validation of previous analysis – How does this fit with peer review? Repeal/Reconsider/Reform/Reverse - Alternative hypotheses in the same field – c.f. Reuse – How does this fit with “right” to first use? Repurpose - New research in another field – c.f. Recycle – How does this fit with recognition of Intellectual contribution? (What’s in it for me?) Different concerns and requirements for each type of sharing
24
Data are a Public Good Publicly funded research data are a public good, produced in the public interest, which should be made openly available with as few restrictions as possible in a timely and responsible manner that does not harm intellectual property. Public good – is nonrival and non-excludable [wikipedia]. Nonrival: consumption by one does not reduce availability for others. Non-excludable: no one can be effectively excluded from using. Research Data: recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings. As few restrictions as possible: Later (distinguish registration from restriction). Timely: Later (discipline specific). Responsible: Later (maximising access does not necessarily maximise research benefit). Intellectual Property: Later (balance contribution from sharing and from primary research)
25
Data should be managed Institutional and project specific data management policies and plans should be in accordance with relevant standards and community best practice. Data with acknowledged long term value should be preserved and remain accessible and usable for future research Policies and Plans DMPs should exist for all data Institutional/Departmental v. Project Standards and Best practice discipline specific Acknowledged long term value discipline specific Future research (by current and future generations) Don’t lose it by accident
26
Data should be discoverable To enable research data to be discoverable and effectively reused by others, sufficient metadata should be recorded and made openly available to enable other researchers to understand the research and re-use potential of the data. Published results should always include information on how to access the supporting data Discoverable – repository? registration? Effectively reused by others – repeat, repeal, repurpose,.. Sufficient Metadata... openly available (stronger than for data)...To understand the re-use potential Published results should include (pointers) could be an email address (but note longevity requirement)
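The discoverability principle asks for sufficient metadata plus a pointer to the supporting data. One way to picture this is a simple completeness check on a catalogue record. The sketch below is illustrative only: the required field names are assumptions chosen for the example, and neither RCUK nor PaN-data prescribes this list.

```python
# Hypothetical minimum set of fields for a discoverable dataset record.
REQUIRED_FIELDS = ("title", "creator", "description", "access_pointer")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in `record`."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Hypothetical record: the access pointer is left empty on purpose.
example = {
    "title": "Neutron diffraction run",
    "creator": "A. Scientist",
    "description": "Raw detector counts with instrument settings",
    "access_pointer": "",
}
print(missing_fields(example))
```

A persistent identifier would normally be a better access pointer than an email address, given the longevity requirement noted on the slide.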
27
Data may be protected RCUK recognises that there are legal, ethical and commercial constraints on release of research data. To ensure that the research process is not damaged by inappropriate release of data, research organisation policies and practices should ensure that these are considered at all stages in the research process. Legal Data Protection Act, Freedom of Information Act, Environmental Information Regulations, EU INSPIRE Directive (Spatial Data) Ethical Consent, Privacy, National Security, eg longitudinal cohort studies – protection of cohort participation Commercial – shared funding, patent pending, commercial in confidence... all stages in the research process... Plan in proposal – peer reviewed external review of access requests
28
Originators may have first use To ensure that research teams get appropriate recognition for the effort involved in collecting and analysing data, those who undertake research council funded work may be entitled to a limited period of privileged use of the data they have collected to enable them to publish the results of their research. The length of this period varies by research discipline and, where appropriate, is discussed further in the published policies of individual Research Councils. Key phrases: “may be entitled”; “limited period of privileged use”; “the length of this period varies”; individual research councils’ policies elaborate
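A limited period of privileged use amounts to an embargo: data become openly available once a discipline-specific period has passed since collection. The sketch below shows that check. The three-year period in the example is purely hypothetical, since the slide states that the length varies by discipline and research council.

```python
from datetime import date


def release_date(collected_on: date, embargo_years: int) -> date:
    """Date on which the privileged-use period ends."""
    try:
        return collected_on.replace(year=collected_on.year + embargo_years)
    except ValueError:
        # 29 February collected data released in a non-leap year: use 28 February.
        return collected_on.replace(year=collected_on.year + embargo_years, day=28)


def is_openly_available(collected_on: date, embargo_years: int, today: date) -> bool:
    """True once the embargo period has elapsed."""
    return today >= release_date(collected_on, embargo_years)


# Hypothetical example: data collected in May 2012 under an assumed 3-year embargo.
print(is_openly_available(date(2012, 5, 1), 3, date(2014, 1, 1)))
```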
29
Reusers have responsibilities In order to recognise the intellectual contributions of researchers who generate, preserve and share key research datasets, all users of research data should acknowledge the sources of their data and abide by the terms and conditions under which they are accessed.... abide by the terms and conditions... terms and conditions may exist to monitor use to promote terms and conditions of use... should acknowledge the sources of their data... Data citation c.f. should be required to acknowledge....
30
Data sharing is not free It is appropriate to use public funds to support the management and sharing of publicly-funded research data. To maximise the research benefit which can be gained from limited budgets, the mechanisms for these activities should be both efficient and cost-effective in the use of public funds. Data sharing costs! Marginal cost may be small but initial cost may be high. Even if data are free at the point of use (which they may not be) there are costs behind. Cost is fundable by RCs - but needs tensioning against other research. You can ask for funding ... but be reasonable
31
Outcomes Ensure the continuing availability of data of long-term value Facilitate the development of mechanisms to improve the management, accessibility and attribution of data Promote awareness of legislation and guidance relating to the management and dissemination of information
32
Overview Introduction An example project - PaNdata RCUK Data Policy Principles a final thought
33
The 7 C’s Creation Collection Capacity Computation Curation Collaboration Communication Permanent Access Provenanced Research The Knowledge Lifecycle Data Creation Archival Access Storage Compute Network Services Curation
34
www.pan-data.eu www.rcuk.ac.uk/research/Pages/DataPolicy.aspx Thank You
35
Motivation Data are a critical output of the research process: For the integrity, transparency and robustness of the research record Often value increases through aggregation Enables new research questions to be addressed Supports the wider exploitation of data
36
The Seven Ps Public good: Public good; Preservation: Preservation; Discoverability: Promotion; Confidentiality: Protection & Privacy; First use: Privilege; Recognition: Probity; Costs: Price
37
Building an Open Data Infrastructure for Science Policy and Practice Juan Bicarregui STFC e-Science Terena 2012, Reykjavik, May 2012
38
Objectives
Objective 2 – Users To deploy, operate and evaluate a system for pan-European user identification across the participating facilities and implement common processes for the joint maintenance of that system.
Objective 3 – Data To deploy, operate and evaluate a generic catalogue of scientific data across the participating facilities and promote its integration with other catalogues beyond the project.
Objective 4 – Provenance To research and develop a conceptual framework, defined as a metadata model, which can record the analysis process, and to provide a software infrastructure which implements that model to record analysis steps, hence enabling the tracing of the derivation of analysed data outputs.
Objective 5 – Preservation To add to the PaNdata infrastructure extra capabilities oriented towards long-term preservation and to integrate these within selected virtual laboratories of the project to demonstrate benefits. These capabilities should, as for the developments in the provenance JRA, be integrated into the normal scientific lifecycle as far as possible. The conceptual foundations will be the OAIS standard and the NeXus file format.
Objective 6 – Scalability To develop a scalable data processing framework, combining parallel filesystems with parallelized standard data formats (pNexus, pHDF5) to permit applications to make most efficient use of dedicated multi-core environments and to permit simultaneous ingest of data from various sources, while maintaining the possibility for real-time data processing.
Objective 7 – Demonstration To deploy and operate the services and technology developed in the project in virtual laboratories for three specific techniques providing a set of integrated end-to-end data services.
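The provenance objective describes a metadata model that records analysis steps so that the derivation of an analysed output can be traced. A toy version of that idea is sketched below. The class, its fields and the file names are all assumptions made for illustration; they are not the PaNdata ODI model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisStep:
    # Hypothetical minimal provenance record for one analysis step.
    output: str      # dataset produced by the step
    inputs: tuple    # datasets the step consumed
    software: str    # tool that performed the step


def trace(dataset: str, steps: list) -> list:
    """Return every dataset that `dataset` was derived from, nearest first."""
    by_output = {step.output: step for step in steps}
    lineage, seen, pending = [], set(), [dataset]
    while pending:
        current = pending.pop(0)
        step = by_output.get(current)
        if step is None:
            continue  # raw data: no recorded derivation
        for parent in step.inputs:
            if parent not in seen:
                seen.add(parent)
                lineage.append(parent)
                pending.append(parent)
    return lineage


# Hypothetical two-step analysis chain.
steps = [
    AnalysisStep("reduced.nxs", ("raw.nxs",), "reduction tool"),
    AnalysisStep("fit.nxs", ("reduced.nxs",), "fitting tool"),
]
print(trace("fit.nxs", steps))
```

Walking the recorded steps backwards from the fitted result reaches the reduced data first and then the raw data, which is the kind of tracing the objective describes.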
39
PaNdata ODI Joint Research Activities PaNdata ODI Service Activities PaNdata ODI Service Releases Standards from PaNdata Support Action uCat dCat vLabs Prov Pres Scale Rel 1 Rel 2 Rel 3 Rel 4 users data s/w Integ Jun 2014 Jun 2013 Dec 2013 Dec 2012