APARSEN Webinar, November 2014


Data Preservation at the Exa-Scale and Beyond: Challenges of the Next Decade(s)
Jamie.Shiers@cern.ch, APARSEN Webinar, November 2014

The Story So Far…
- Together, we have reached the point where a generic, multi-disciplinary, scalable e-infrastructure for long-term data preservation (LTDP) is achievable, and will hopefully be funded
- Built on standards, certified via agreed procedures, using the "cream of DP services"
- In parallel, business cases and cost models are increasingly well understood, working closely with projects, communities and funding agencies

Open Questions
- Long-term sustainability is still a technical issue
- Let's assume that we understand the business cases and cost models well enough…
- …and (we) even have agreed funding for key aspects
- But can the service providers guarantee a multi-decade service? Is this realistic? Is it even desirable?

4C Roadmap Messages ("A Collaboration to Clarify the Costs of Curation")
- Identify the value of digital assets and make choices
- Demand and choose more efficient systems
- Develop scalable services and infrastructure
- Design digital curation as a sustainable service
- Make funding dependent on costing digital assets across the whole lifecycle
- Be collaborative and transparent to drive down costs
OSD@Orsay - Jamie.Shiers@cern.ch

- "Observations" (unrepeatable) versus "measurements"
- "Records" versus "data"
- Choices & decisions: some (re-)uses of data are unforeseen!
- No "one size fits all"

Suppose these guys can build and share the most cost-effective, scalable and reliable federated storage services, e.g. for peta-, exa- or zetta-scale bit preservation. Can we ignore them?

H2020 EINFRA-1-2014: Managing, preserving and computing with big research data
- Proof of concept and prototypes of data-infrastructure-enabling software (e.g. for databases and data mining) for extremely large or highly heterogeneous data sets, scaling to zettabytes and trillions of objects
- Clean-slate approaches to data management targeting the 2020+ "data factory" requirements of research communities and large-scale facilities (e.g. ESFRI projects) are encouraged

Next Generation Data Factories
- HL-LHC (https://indico.cern.ch/category/4863/): "Europe's top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors, with a view to collecting ten times more data than in the initial design, by around 2030" (European Strategy for Particle Physics)
- SKA: The Square Kilometre Array project is an international effort to build the world's largest radio telescope, with a square kilometre (one million square metres) of collecting area
- Typified by SCALE in several dimensions: cost, longevity, data rates and volumes
- Lifetimes of decades; costs of O(EUR 10^9); EB / ZB data volumes

http://science.energy.gov/funding-opportunities/digital-data-management/
- "The focus of this statement is sharing and preservation of digital research data"
- All proposals submitted to the Office of Science (after 1 October 2014) for research funding must include a Data Management Plan (DMP) that addresses the following requirements:
- DMPs should describe whether and how data generated in the course of the proposed research will be shared and preserved. If the plan is not to share and/or preserve certain data, then the plan must explain the basis of the decision (for example, cost/benefit considerations, other parameters of feasibility, scientific appropriateness, or limitations discussed in #4).
- At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.

Data: Outlook for HL-LHC
- [Chart: projected data volume in PB, with a "we are here!" marker at the current point]
- Very rough estimate of new RAW data per year of running, using a simple extrapolation of the current data volume scaled by the output rates
- Still to be added: derived data (ESD, AOD), simulation, user data…
- At least 0.5 EB / year (x 10 years of data taking)
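The back-of-the-envelope arithmetic above can be sketched as follows. The 0.5 EB/year figure comes from the slide; the derived-data multiplier is purely an illustrative assumption, since the slide only says derived data is "still to be added":

```python
# Rough HL-LHC data-volume projection (illustrative only).
raw_eb_per_year = 0.5       # slide estimate: at least 0.5 EB of RAW data per year
years_of_data_taking = 10   # slide estimate: ~10 years of data taking
derived_multiplier = 2.0    # ASSUMPTION: derived data (ESD, AOD, simulation,
                            # user data) roughly doubles the total

raw_total_eb = raw_eb_per_year * years_of_data_taking
total_eb = raw_total_eb * derived_multiplier
print(f"RAW only: {raw_total_eb:.1f} EB; with derived data: ~{total_eb:.1f} EB")
```

Even the RAW-only lower bound of 5 EB is an order of magnitude beyond today's largest HEP archives.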

Bit-preservation WG one-slider
- Mandate summary (see w3.hepix.org/bit-preservation): collecting and sharing knowledge on bit preservation across HEP (and beyond); providing technical advice and recommendations for sustainable archival storage in HEP
- Survey of large HEP archive sites carried out and presented at the last HEPiX: 19 sites, covering areas such as archive lifetime, reliability, access, verification, migration
- HEP archiving has become a reality by fact rather than by design
- Overall positive, but the lack of SLAs, metrics, best practices and long-term costing has an impact

Verification & reliability
- Systematic verification of archive data is ongoing
- "Cold" archive: users only accessed ~20% of the data (2013)
- All "historic" data verified between 2010-2013; all new and repacked data being verified as well
- Data reliability significantly improved over the last 5 years: from annual bit-loss rates of O(10^-12) (2009) to O(10^-16) (2012)
- Still room for improvement: vendor-quoted bit error rates are O(10^-19..-20), but these only refer to media failures; errors (e.g. bit flips) appear across the complete chain
- ~35 PB verified in 2014, no losses
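To see why systematic verification still matters even at the improved rates, one can estimate the expected number of lost bits per year for an archive of a given size. A minimal sketch, using the slide's order-of-magnitude figures and assuming decimal petabytes:

```python
def expected_bit_losses(archive_pb: float, annual_bit_loss_rate: float) -> float:
    """Expected number of bits lost per year, for a per-bit annual loss rate."""
    bits = archive_pb * 1e15 * 8  # decimal PB -> bits
    return bits * annual_bit_loss_rate

# Rates from the slide, applied to the ~35 PB verified in 2014:
print(expected_bit_losses(35, 1e-12))  # 2009-era rate: hundreds of thousands of bits/year
print(expected_bit_losses(35, 1e-16))  # 2012-era rate: a few tens of bits/year
```

Even at O(10^-16), a multi-tens-of-PB archive expects a handful of flipped bits per year, which is why verification of the complete chain, not just the media, remains necessary.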

"LHC Cost Model" (simplified)
- Start with 10 PB, then +50 PB/year, then +50% every 3 years (or +15% / year)
- [Chart: archive growth curve, with 1 EB and 10 EB reference lines]
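The growth rule above can be sketched directly. The figures (10 PB start, 50 PB first-year increment, +15%/year growth of the increment) are the slide's own simplified assumptions:

```python
def archive_size_pb(years: int, start_pb: float = 10.0,
                    increment_pb: float = 50.0, growth: float = 0.15) -> float:
    """Archive size after `years` years under the simplified model:
    start at 10 PB, add 50 PB in year one, and grow the annual
    increment by 15% per year (roughly +50% every 3 years)."""
    size, increment = start_pb, increment_pb
    for _ in range(years):
        size += increment
        increment *= 1 + growth
    return size

for y in (10, 20, 30):
    print(f"after {y} years: ~{archive_size_pb(y) / 1000:.1f} EB")
```

With these assumptions the archive passes 1 EB after roughly a decade and 10 EB after roughly 25 years, consistent with the reference lines on the slide's chart.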

Case B) Increasing archive growth
- Total cost: ~$59.9M (~$2M / year)

Certification – Why Bother?
- Help align policies and practices across sites: improve reliability, eliminate duplication of effort, reduce the "costs of curation" (some of this is already being done via the HEPiX WG)
- Help address the "Data Management Plan" requirements imposed by funding agencies
- Increase "trust" with "customers" with respect to stewardship of the data
- Increase attractiveness for future H2020 bids and / or to additional communities

2020 Vision for LTDP in HEP
- Long-term, e.g. FCC timescales: disruptive change
- By 2020, all archived data (e.g. that described in the DPHEP Blueprint, including LHC data) easily findable and fully usable by the designated communities, with clear (Open) access policies and possibilities to annotate further
- Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards
- A DPHEP portal through which data and tools are accessed: a "HEP FAIRport" (Findable, Accessible, Interoperable, Re-usable)
- Agree clear targets and metrics with the funding agencies


Summary
- Next-generation data factories will bring many challenges for computing, networking and storage
- Data preservation, and data management in general, will be key to their success and must be an integral part of the projects: not an afterthought
- Raw "bit preservation" costs may drop to ~$100K / year / EB over the next 25 years

3 Points to Take Away: Efficient, Scalable, Sustainable
- A (small-ish) network of certified, trusted digital repositories can address all three