Managing, Preserving & Computing with Big Research Data Challenges, Opportunities and Solutions(?) EU-T0 F2F, April 2014 International.

Presentation transcript:

Managing, Preserving & Computing with Big Research Data Challenges, Opportunities and Solutions(?) EU-T0 F2F, April 2014 International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics

Introduction
For scientific, educational and cultural reasons it is essential that we “preserve” our data.
This is increasingly becoming a requirement from funding agencies (FAs), including in H2020.
A model for data preservation (DP) exists (OAIS), together with recognised methods for auditing sites.
Work is in progress (RDA) to harmonise these; some tens to ~a hundred sites are certified.

WLCG Storage Experience
We have developed significant knowledge and experience in offering reliable, cost-effective storage services
– “pain” documented (SIRs)
“Curation” is a commitment from WLCG Tier0 + Tier1 sites (aka “EU-T0”)
– WLCG MoU
HEPiX survey shows a very wide (10^4) range in “results”
– encourage sharing of best practices as well as issues involving loss or “near miss”
All this in parallel with, and with funding from, major e-infrastructure projects (EGEE, EGI, etc.)

DP (“Data Sharing”) & H2020
OAIS + DSA maps well to EINFRA
– Points 1 & 2 in particular
Does anyone else have the same (technical, financial) experience as us?
– IEEE MSST 2014: “large” = (yes, we know there are larger)
– DPHEP “cost model” very warmly received
Could make a significant contribution to realisation of the “Riding the Wave” vision
“Our” solutions have shortcomings in terms of Open Access optimisation
– plenty of work to do!

Generic e-infrastructure vs VREs
We have also learned that attempts to offer “too much” functionality in the core infrastructure do not work (e.g. FPS).
This is recognised (IMHO) in the H2020 calls, via “infrastructure” vs “virtual research environments”.
There is a fuzzy boundary, and things developed within one VRE can be deployed to advantage for others (EGI-InSPIRE SA3).

EGI-InSPIRE RI
SA3 Objectives
Transition to sustainable support:
+ Identify tools of benefit to multiple communities
– Migrate these as part of the core infrastructure
+ Establish support models for those relevant to individual communities
SA3 - PY3 - June

EINFRA DP & Other Players
EUDAT claim that they address “the long tail” of science
– Not addressing the multi-PB to few-EB scale AFAIK
– No clear funding model for sustainability (e.g. after project funding)
– Overlap between HEP and key EUDAT storage sites: SARA (NL-T1), FZK, RZG
– Interest (verbal) from the above in working together (& on RDA WGs)
EGI + APARSEN + SCIDIP-ES are also planning to prepare a proposal
Can we agree on a common approach?
– Did not manage over the past 6 months, with “EGI” holding separate events close in time to other “joint DP” workshops
IMHO none of us has the knowledge / experience to “go it alone”
Key issue to resolve asap

2020 Vision for LT DP in HEP
Long-term – e.g. LC timescales: disruptive change
– By 2020, all archived data – e.g. that described in DPHEP Blueprint, including LHC data – easily findable, fully usable by designated communities with clear (Open) access policies and possibilities to annotate further
– Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards
– DPHEP portal, through which data / tools accessed
Agree with Funding Agencies clear targets & metrics

DPHEP Portal – Zenodo like?

David South | Data Preservation and Long Term Analysis in HEP | CHEP 2012, May
Documentation projects with INSPIREHEP.net
> Internal notes from all HERA experiments now available on INSPIRE
A collaborative effort to provide “consistent” documentation across all HEP experiments – starting with those at CERN – as from 2015
(Often done in an inconsistent and/or ad-hoc way, particularly for older experiments)

Additional Metrics (Beyond DSA)
1. Open Data for educational outreach
– Based on specific samples suitable for this purpose
– MUST EXPLAIN BENEFIT OF OUR WORK FOR FUTURE FUNDING!
– Highlighted in European Strategy for PP update
2. Reproducibility of results
– A (scientific) requirement (from FAs)
– “The Journal of Irreproducible Results”
3. Maintaining full potential of data for future discovery / (re-)use

Additional Services / Requirements
The topics covered by EINFRA-1, whilst providing a solid basis, are not enough.
Nor (IMHO) is it realistic to try to put services that are likely to have VRC-components in generic infrastructure (incl. SCIDIP-ES services).
Hence the VRE calls (TWO OF THEM).
Use our existing multi-disciplinary contacts to build a multi-disciplinary, multi-faceted project?
Probably more important (for DP) than EINFRA-1.

Looking beyond H
The above ideas are “evolution” rather than “revolution” and as such rather short-term.
A more ambitious, but more promising, approach is being drafted by Marcello Maggi
– Open Data, not just Open Access (to proprietary formats not maintained in the long term)
It is not clear to me that this fits in the currently open calls, but we should start to sow seeds for the (near) future, including elaborating further the current proposal & evangelising.

A Possible Way Ahead
EINFRA DP
– Try to identify / agree on a core set of “data service providers” out of the WLCG / EU-T0 / EUDAT / EGI set
– Try to agree (with EUDAT?) on relevant WPs
VRE-DP (EINFRA-9, INFRADEV-4)
– Ditto, but for higher-level services, including those developed by SCIDIP-ES, those in preparation by “DPHEP” and other high-profile communities
Timeline: mid-May to mid-July for writing proposals; mid-July to mid-August “dead”; top-priority corrections in late August + repeated submission prior to the 2 Sep close
– I would do test submissions much earlier…

The Guidelines
Guidelines Relating to Data Producers:
1. The data producer deposits the data in a data repository with sufficient information for others to assess the quality of the data and compliance with disciplinary and ethical norms.
2. The data producer provides the data in formats recommended by the data repository.
3. The data producer provides the data together with the metadata requested by the data repository.

Guidelines Related to Repositories (4-8):
4. The data repository has an explicit mission in the area of digital archiving and promulgates it.
5. The data repository uses due diligence to ensure compliance with legal regulations and contracts including, when applicable, regulations governing the protection of human subjects.
6. The data repository applies documented processes and procedures for managing data storage.
7. The data repository has a plan for long-term preservation of its digital assets.
8. Archiving takes place according to explicit work flows across the data life cycle.

Guidelines Related to Repositories (9-13):
9. The data repository assumes responsibility from the data producers for access and availability of the digital objects.
10. The data repository enables the users to discover and use the data and refer to them in a persistent way.
11. The data repository ensures the integrity of the digital objects and the metadata.
12. The data repository ensures the authenticity of the digital objects and the metadata.
13. The technical infrastructure explicitly supports the tasks and functions described in internationally accepted archival standards like OAIS.

Guidelines Related to Data Consumers (14-16):
14. The data consumer complies with access regulations set by the data repository.
15. The data consumer conforms to and agrees with any codes of conduct that are generally accepted in the relevant sector for the exchange and proper use of knowledge and information.
16. The data consumer respects the applicable licences of the data repository regarding the use of the data.
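
The grouping of the 16 guidelines by role (producers 1-3, repositories 4-13, consumers 14-16) can be captured in a small lookup, which may help when tracking a self-assessment. This is only a minimal sketch: the names `GUIDELINE_GROUPS` and `role_for_guideline` are illustrative assumptions and are not part of any DSA tool.

```python
# Illustrative only: role grouping of the 16 DSA guidelines as listed above.
# The names used here are hypothetical and do not come from the DSA itself.
GUIDELINE_GROUPS = {
    "data producer": range(1, 4),      # guidelines 1-3
    "data repository": range(4, 14),   # guidelines 4-13
    "data consumer": range(14, 17),    # guidelines 14-16
}


def role_for_guideline(number: int) -> str:
    """Return the role that a given guideline number addresses."""
    for role, numbers in GUIDELINE_GROUPS.items():
        if number in numbers:
            return role
    raise ValueError(f"DSA guidelines are numbered 1-16, got {number}")


if __name__ == "__main__":
    for n in (2, 7, 13, 15):
        print(n, role_for_guideline(n))
```

Keeping the ranges in one place makes it easy to check that every guideline number from 1 to 16 is assigned to exactly one role.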

DSA self-assessment & peer review
Complete a self-assessment in the DSA online tool. The online tool takes you through the 16 guidelines and provides you with support.
Submit the self-assessment for peer review. The peer reviewers will go over your answers and documentation.
Your self-assessment and review will not become public until the DSA is awarded.
After the DSA is awarded by the Board, the DSA logo may be displayed on the repository’s Web site with a link to the organization’s assessment.

Run 1 – which led to the discovery of the Higgs boson – is just the beginning. There will be further data taking – possibly for another 2 decades or more – at increasing data rates, with further possibilities for discovery!
[Figure labels: “We are here”, “HL-LHC”]

Predrag Buncic, October 3, 2013, ECFA Workshop Aix-Les-Bains
Data: Outlook for HL-LHC
Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.
To be added: derived data (ESD, AOD), simulation, user data…
0.5 EB / year is probably an underestimate!
[Figure axis unit: PB]

CERN IT Department, CH-1211 Genève 23, Switzerland | Internet Services DSS
Cost Modelling: Regular Media Refresh + Growth
Start with 10PB, then +50PB/year, then +50% every 3y (or +15% / year)

Total cost: ~$60M (~$2M / year)
Case B) increasing archive growth
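
As a quick check of the figures on the two cost-modelling slides: +50% every 3 years compounds to roughly +15% per year, and ~$60M spent at ~$2M per year corresponds to about 30 years. The sketch below only reproduces that arithmetic. The function names are illustrative, and nothing is assumed here about how the original cost model combines media refresh, growth and price.

```python
def annual_rate_from_period(growth: float, years: float) -> float:
    """Convert compound growth over `years` into the equivalent annual rate."""
    return (1.0 + growth) ** (1.0 / years) - 1.0


def years_covered(total_cost: float, cost_per_year: float) -> float:
    """Number of years a total budget covers at a flat yearly spend."""
    return total_cost / cost_per_year


if __name__ == "__main__":
    rate = annual_rate_from_period(0.50, 3)
    print(f"+50% every 3 years is about +{rate:.1%} per year")
    print(f"+15% per year is about +{1.15 ** 3 - 1:.1%} over 3 years")
    print(f"$60M at $2M/year covers {years_covered(60, 2):.0f} years")
```

The two growth statements on the slide are therefore consistent with each other to within rounding.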

Volume: 100PB + ~50PB/year (+400PB/year from 2020)
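
The volume figure above can be turned into a rough projection. The sketch below is a hedged illustration, not the author's model: it assumes a base year of 2014 (the year of the talk), strictly linear growth, and a clean switch from +50PB/year to +400PB/year in 2020. The function name and parameters are hypothetical.

```python
def projected_volume_pb(year: int,
                        base_year: int = 2014,        # assumption: year of the talk
                        base_pb: float = 100.0,       # "100PB" from the slide
                        rate_pb: float = 50.0,        # "~50PB/year" from the slide
                        switch_year: int = 2020,      # "from 2020" on the slide
                        later_rate_pb: float = 400.0  # "+400PB/year" from the slide
                        ) -> float:
    """Linear projection of archive volume in PB under the stated assumptions."""
    if year < base_year:
        raise ValueError("year must not precede base_year")
    early_years = min(year, switch_year) - base_year
    late_years = max(year - switch_year, 0)
    return base_pb + rate_pb * early_years + later_rate_pb * late_years


if __name__ == "__main__":
    for y in (2014, 2020, 2025):
        print(y, projected_volume_pb(y), "PB")
```

Under these assumptions the archive passes the exabyte scale within a few years of 2020, in line with the talk's references to a few EB.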

Requirements from Funding Agencies
To integrate data management planning into the overall research plan, all proposals submitted to the Office of Science for research funding are required to include a Data Management Plan (DMP) of no more than two pages that describes how data generated through the course of the proposed research will be shared and preserved or explains why data sharing and/or preservation are not possible or scientifically appropriate. At a minimum, DMPs must describe how data sharing and preservation will enable validation of results, or how results could be validated if data are not shared or preserved.
Similar requirements from European FAs and EU (H2020)

20 Years of the Top Quark

How?
How are we going to preserve all this data?
And what about “the knowledge” needed to use it?
How will we measure our success?
And what’s it all for?

Answer: Two-fold
Specific technical solutions
– Mainstream;
– Sustainable;
– Standards-based;
– COMMON
Transparent funding model
Project funding for short-term issues
– Must have a plan for long-term support from the outset!
Clear, up-front metrics
– Discipline neutral, where possible;
– Standards-based;
– EXTENDED IFF NEEDED
Start with “the standard”, coupled with recognised certification processes
– See RDA IG
Discuss with FAs and experiments – agree!
(For sake of argument, let’s assume DSA)

Additional Metrics (aka “The Metrics”)
1. Open Data for educational outreach
– Based on specific samples suitable for this purpose
– MUST EXPLAIN BENEFIT OF OUR WORK FOR FUTURE FUNDING!
– Highlighted in European Strategy for PP update
2. Reproducibility of results
– A (scientific) requirement (from FAs)
– “The Journal of Irreproducible Results”
3. Maintaining full potential of data for future discovery / (re-)use

LHC Data Access Policies
Level (standard notation) | Access Policy
L0 (raw) (cf “Tier”) | Restricted even internally. Requires significant resources (grid) to use
L1 (1st processing) | Large fraction available after “embargo” (validation) period. Duration: a few years. Fraction: 30 / 50 / 100%
L2 (analysis level) | Specific (meaningful) samples for educational outreach: pilot projects on-going (CMS, LHCb, ATLAS, ALICE)
L3 (publications) | Open Access (CERN policy)

2. Digital library tools (Invenio) & services (CDS, INSPIRE, ZENODO) + domain tools (HepData, RIVET, RECAST…)
3. Sustainable software, coupled with advanced virtualization techniques, “snap-shotting” and validation frameworks
4. Proven bit preservation at the 100PB scale, together with a sustainable funding model with an outlook to 2040/50
5. Open Data
(and several EB of data)

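Some HEP data formats
Experiment | Accelerator | Format | Status
ALEPH | LEP | BOS | ?
DELPHI | LEP | Zebra | CERNLIB – no longer formally supported
L3 | LEP | Zebra | CERNLIB – no longer formally supported
OPAL | LEP | Zebra | CERNLIB – no longer formally supported
ALICE | LHC | ROOT | PH/SFT + FNAL
ATLAS | LHC | ROOT | PH/SFT + FNAL
CMS | LHC | ROOT | PH/SFT + FNAL
LHCb | LHC | ROOT | PH/SFT + FNAL
COMPASS | SPS | Objy | Support dropped at CERN ~10 years ago
COMPASS | SPS | DATE | ALICE online format – 300TB migrated
Other formats used at other labs – many previous formats no longer supported!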

Work in Progress…
By September, CMS should have made a public release of some data + complete environment
– LHCb, now also ATLAS, plan something similar, based on “common tools / framework”
By end 2014, a first version of the “DPHEP portal” should be up and running
“DSA++” – by end 2015???
More news in Sep (RDA-4) / Oct (APA)

Mapping DP to H2020
EINFRA “Big Research Data”
– Trusted / certified federated digital repositories with sustainable funding models that scale from many TB to a few EB
“Digital library calls”: front-office tools
– Portals, digital libraries per se etc.
VRE calls: complementary proposal(s)
– INFRADEV-4
– EINFRA-1/9

The Bottom Line
We have particular skills in the area of large-scale digital (“bit”) preservation AND a good (unique?) understanding of the costs
– Seeking to further this through RDA WGs and eventual prototyping -> sustainable services through H2020 across “federated stores”
There is growing realisation that Open Data is “the best bet” for long-term DP / re-use
We are eager to collaborate further in these and other areas…

Key Metrics For Data Sharing
1. (Some) Open Data for educational outreach
2. Reproducibility of results
3. Maintaining full potential of data for future discovery / (re-)use
“Service provider” and “gateway” metrics are still relevant but IMHO secondary to the above!

Data Sharing in Time & Space Challenges, Opportunities and Solutions(?) Workshop on Best Practices for Data Management & Sharing International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics