iPRES 2016, CH https://indico.cern.ch/event/448571/


CERN Services for LTDP
International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics (DPHEP)
iPRES 2016, CH – https://indico.cern.ch/event/448571/

Overview of CERN
- CERN – the European Organisation for Nuclear Research – is situated just outside Geneva, extending into France.
- Founded in 1954, it now has 22 member states.
- It operates a wide range of accelerators, of which the LHC is probably the best known.
- A Large Hadron Collider was first proposed in the late 1970s, when discussions on a lepton collider (LEP) were being held.
- A High Luminosity upgrade (HL-LHC) was approved in June 2016, extending the LHC's life until around 2040.
- A High Energy upgrade (HE-LHC) may follow…

[Figure: The CERN Accelerator Complex – the LHC occupies the former LEP ring]

LTDP: Now & Then in HEP
- Traditionally at CERN, users (experiments) were responsible for buying their own tapes and managing them. Capacity: 40-200 MB! (From 1600 bpi reels to the first 3480 cartridges.)
- This started to change with LEP (1989), with the introduction of robots and Unix-style filenames instead of tape numbers.
- But at the end of LEP (2000) there were still no sustainable preservation services: ~1 million tape volumes – impossible to automate!
- ALEPH: one PC with the full environment plus all data per collaborating institute.

The DPHEP Study Group
- Formed in late 2008 at the initiative of DESY.
- Included representatives from all major HEP labs worldwide, including from experiments due to end data taking shortly.
- Produced a Blueprint Report that detailed the situation and made concrete recommendations, now being acted upon.
- Input to the European Particle Physics Strategy update of 2012/2013 – highly influential!

What is the problem?
- The data from the world's particle accelerators and colliders (HEP data) is both costly and time-consuming to produce.
- The LHC data is a particularly striking example, ranging in volume from several hundred PB today to tens of EB by 2035 or so.
- HEP data contains a wealth of scientific potential, plus high value for educational outreach.
- Given that much of the data is unique, it is essential to preserve not only the data but also the full capability to reproduce past analyses and perform new ones. This means preserving data, documentation, software and "knowledge".
- There are numerous cases where data from a past experiment has been re-analysed: we must retain this ability in the future.

What does DPHEP do?
DPHEP has become a Collaboration, with signatures from the main HEP laboratories and some funding agencies worldwide. It has established a "2020 vision", whereby:
- All archived data – e.g. that described in the DPHEP Blueprint, including LHC data – should be easily findable and fully usable by the designated communities, with clear (Open) access policies and possibilities for further annotation;
- Best practices, tools and services should be well run-in, fully documented and sustainable, built in common with other disciplines and based on standards;
- There should be a DPHEP portal through which data and tools can be accessed;
- Clear targets and metrics to measure the above should be agreed between Funding Agencies, Service Providers and the Experiments.

What Makes HEP Different?
- We throw away most of our data before it is even recorded – "triggers".
- Our detectors are relatively stable over long periods of time (years) – not "doubling every 6 or 18 months".
- We make "measurements" – not "observations".
- Our projects typically last for decades – we need to keep data usable for at least this length of time.
- We have shared "data behind publications" for more than 30 years… (HEPData)

An OBSERVATION… The deflection of starlight by the Sun was first observed during the solar eclipse of 1919 by Sir Arthur Eddington, when the Sun was silhouetted against the Hyades star cluster.
(Barry Barish, ICHEP, Chicago, 9 August 2016)

And another… black holes merging, 1.3 billion years ago.
(Barry Barish, ICHEP, Chicago, 9 August 2016)

Future Circular Colliders (FCC)
International conceptual design study of a ~100 km ring, by an international collaboration of ~70 institutes:
- pp collider (FCC-hh): the ultimate goal, which defines the infrastructure requirements. √s ~ 100 TeV, L ~ 2 × 10^35 cm^-2 s^-1, 4 IPs, ~20 ab^-1 per experiment, assuming 10 years × 100 days/year = 1000 days of running (a worked check follows below).
- e+e- collider (FCC-ee): a possible first step. √s = 90-350 GeV, L ~ (200-2) × 10^34 cm^-2 s^-1 depending on energy, 2 IPs. FCC-ee options could have 1000 times the luminosity of LEP2.
- ep collider (FCC-he): an option. √s ~ 3.5 TeV, L ~ 10^34 cm^-2 s^-1.
Also part of the study: HE-LHC – FCC-hh dipole technology (~16 T) in the LHC tunnel, giving √s ~ 30 TeV.
GOAL: a CDR in time for the next European Strategy update.
Machine studies are site-neutral; however, an FCC at CERN would greatly benefit from the existing laboratory infrastructure and accelerator complex, and a 90-100 km ring fits the geology.
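The "10 years × 100 days" fragment on the slide is the running-time assumption behind the quoted ~20 ab^-1 per experiment. A short worked check (my arithmetic, not from the slide):

```latex
% Integrated luminosity = instantaneous luminosity x live time:
\[
  \int L\,dt \;\approx\; 2\times10^{35}\,\mathrm{cm^{-2}\,s^{-1}}
  \times \underbrace{1000 \times 86\,400\,\mathrm{s}}_{\approx\,8.6\times10^{7}\,\mathrm{s}}
  \;\approx\; 1.7\times10^{43}\,\mathrm{cm^{-2}}
  \;\approx\; 17\,\mathrm{ab^{-1}} \;\sim\; 20\,\mathrm{ab^{-1}},
\]
\[
  \text{using } 1\,\mathrm{ab^{-1}} = 10^{42}\,\mathrm{cm^{-2}}.
\]
```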

HEP LTDP Use Cases
- Bit preservation as a basic "service" on which higher-level components can build. ("Maybe CERN does bit preservation better than anyone else in the world" – David Giaretta)
- Preserve data, software and know-how in the collaborations: the basis for reproducibility.
- Share data and associated software with the (wider) scientific community, such as theorists or physicists who were not part of the original collaboration.
- Open access to reduced data sets for the general public (LHC experiments).
These match the requirements for DMPs very well.

Workshop on Active Data Management Plans
Agenda, talks, videos and conclusions are available, including more detailed talks about HEP data preservation and Open Data releases.

DMPs for the LHC experiments
- The first LHC experiment to produce a "DMP" was CMS, in 2012.
- This called for Open Data releases of significant fractions of the (cooked) data after an embargo period (see the ADMP workshop).
- Now all 4 main experiments have DMPs.
- We foresee capturing project-specific detail in DMPs, as opposed to overall site policy.
- Open Data releases are now "routine"! See this talk @ ICHEP for more details.

CERN Services for LTDP
- State-of-the-art "bit preservation", implementing practices that conform to the ISO 16363 standard;
- "Software preservation" – a key challenge in HEP, where the software stacks are large, complex and dynamic;
- Analysis capture and preservation, corresponding to a set of agreed Use Cases;
- Access to the data behind physics publications – the HEPData portal;
- An Open Data portal for released subsets of the (currently) LHC data (a minimal access sketch follows below);
- A DPHEP portal that also links to data preservation efforts at other HEP institutes worldwide.
Each of these is a talk topic in its own right!
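As one illustration of the portal services above, here is a minimal sketch of programmatic access to the Open Data portal, assuming an Invenio-style JSON API at /api/records. The endpoint, parameters and field names are assumptions for illustration, not taken from the slides.

```python
# Minimal sketch: searching the CERN Open Data portal for released datasets.
# Assumption: an Invenio-style JSON API at /api/records with q/size parameters.
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://opendata.cern.ch/api/records/"

def search_records(query: str, size: int = 5) -> list:
    """Return the first few record hits for a free-text query."""
    url = f"{BASE}?q={quote(query)}&size={size}"
    with urlopen(url) as resp:
        payload = json.load(resp)
    # Invenio-style responses nest the results under hits/hits.
    return payload.get("hits", {}).get("hits", [])

if __name__ == "__main__":
    for rec in search_records("CMS collision data"):
        meta = rec.get("metadata", {})
        print(meta.get("recid"), "-", meta.get("title"))
```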

Bit Preservation: Steps Include
- Regular media verification: when a tape is written, when it is filled, and every 2 years thereafter (a fixity-checking sketch follows below).
- Controlled media lifecycle: media are kept for a maximum of 2 drive generations.
- Reducing tape mounts: reduces media wear-out and increases efficiency.
- Data redundancy: for "smaller" communities, a 2nd copy can be created in a separate library in a different building (e.g. LEP – 3 copies at CERN!).
- Protecting the physical link between disk caches and tape servers.
- Protecting the environment: dust sensors! (Don't let users touch tapes.)
Constant improvement: the bit-loss rate has been reduced to 5 × 10^-16.
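At its core, "regular media verification" is periodic fixity checking: recompute each file's checksum and compare it with the value recorded at archiving time. A minimal sketch, assuming a simple JSON manifest of checksums; CERN's production tape systems implement this internally, not via a script like this.

```python
# Fixity checking sketch: flag files whose current checksum no longer
# matches the digest recorded when the archive was written.
import hashlib
import json
from pathlib import Path

def file_checksum(path: Path, algo: str = "sha256", chunk: int = 1 << 20) -> str:
    """Stream a file through a hash in 1 MB chunks (files may be huge)."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_archive(manifest_file: Path) -> list:
    """Return the relative paths showing bit-level damage."""
    # Assumed manifest format: {"relative/path": "hexdigest", ...}
    manifest = json.loads(manifest_file.read_text())
    base = manifest_file.parent
    return [rel for rel, digest in manifest.items()
            if file_checksum(base / rel) != digest]

if __name__ == "__main__":
    damaged = verify_archive(Path("archive/manifest.json"))
    print("bit-level damage detected in:", damaged or "none")
```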

Software Preservation
- HEP has long shared its software across international collaborations: CERNLIB, first started in 1964, was used by many communities worldwide.
- Today, HEP software is O(10^7) lines of code, tens to hundreds of modules, and many languages! (There is no single standard application.)
- Versioning filesystems and virtualisation look promising: we have demonstrated resurrecting software 15 years after data taking, and hope to provide stability 5-15 years into the future (a container-based sketch follows below).
- We believe we can analyse LEP data ~30 years after data taking ended! Does anyone have a better idea?
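A minimal sketch of the virtualisation approach mentioned above: replay a legacy analysis inside a container frozen at the historical software stack, with versioned releases mounted read-only (e.g. via CVMFS). The image name, paths and entry point are hypothetical; the slides name the technique, not these specifics.

```python
# Run an old analysis inside a pinned, immutable environment.
import subprocess

# Hypothetical versioned image capturing the historical OS + software stack.
LEGACY_IMAGE = "registry.example.cern/legacy-analysis:sl5-2003"

def run_legacy_analysis(dataset_dir: str) -> int:
    """Replay an old analysis with software and data mounted read-only."""
    cmd = [
        "docker", "run", "--rm",
        # Versioned software releases served read-only, e.g. via CVMFS.
        "-v", "/cvmfs:/cvmfs:ro",
        "-v", f"{dataset_dir}:/data:ro",
        LEGACY_IMAGE,
        # Hypothetical entry point inside the frozen environment.
        "/cvmfs/sw.example.cern.ch/analysis/run.sh", "/data",
    ]
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    raise SystemExit(run_legacy_analysis("/archive/lep/aleph/1998"))
```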

Analysis Preservation
The ability to reproduce analyses is not only required by Funding Agencies but also essential to the work of the experiments / collaborations. Use Cases include:
- An analysis that is underway has to be handed over, e.g. because someone is leaving the collaboration;
- A previous analysis has to be repeated;
- Data from different experiments have to be combined.
We need to capture: metadata, software, configuration options, high-level physics information, documentation, instructions, links to presentations, quality protocols, internal notes, etc. (a minimal capture record is sketched below).
At least one experiment (ALICE) would like demonstrable reproducibility to be part of the publication approval process!
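A minimal sketch of what "capturing" an analysis could look like: a self-describing record bundling the items listed above. Field names and values are illustrative; the real CERN Analysis Preservation service defines its own, much richer schema.

```python
# A toy analysis-capture record, serialised to JSON for archiving.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class AnalysisRecord:
    analysis_id: str
    experiment: str
    datasets: list            # dataset identifiers used as input
    software: dict            # package -> pinned version
    config_files: list        # paths/URLs of configuration used
    documentation: list = field(default_factory=list)  # notes, talks, links
    contacts: list = field(default_factory=list)       # for hand-over

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

# All identifiers below are hypothetical examples.
record = AnalysisRecord(
    analysis_id="HIG-EXAMPLE-001",
    experiment="EXAMPLE-LHC",
    datasets=["/Run2012B/DoubleMu/AOD"],
    software={"analysis-framework": "5.32/00"},
    config_files=["configs/selection.yaml"],
    documentation=["https://example.cern/notes/AN-2012-001"],
)
print(record.to_json())
```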

Portals
No time to discuss these in detail, but they clearly address the challenges of making the data "discoverable" and "usable" (if not necessarily F.A.I.R.).

Certification of the CERN Site
- We believe certification will allow us to ensure that best practices are implemented and followed up on in the long term: "written into the fabric of the organisation".
- Scope: Scientific Data and CERN's Digital Memory.
- Timescale: complete prior to the 2019/2020 ESPP update.
- Certification will also "ensure" adequate resources, staffing, training, succession plans, etc.
- CERN can expect to exist until the HL/HE-LHC era (2040/2050). And beyond? FCC? Depends on the physics…

Infrastructure & Security Risk Management (ISO 16363 metrics)
5.1 Technical Infrastructure Risk Management
- We do all of this, but is it documented?
- Technology watches, hardware & software changes, detection of bit corruption or loss, reporting, security updates, storage media refreshing, change management, critical processes, handling of multiple data copies, etc.
5.2 Security Risk Management
- Do we do all of this, and is it documented?
- Security risks (data, systems, personnel, physical plant), disaster preparedness and recovery plans… (ISO 27000 etc.)

Organisational Infrastructure (ISO 16363 metrics)
3.1 Governance & Organisational Viability
- Mission statement, preservation policy, implementation plan(s), etc. [CERN, CERN, project(s)]
3.2 Organisational Structure & Staffing
- Duties, staffing, professional development, etc. [APT etc.]
3.3 Procedural Accountability & Preservation Policy Framework
- Designated communities, knowledge bases, policies & reviews, change management, transparency & accountability, etc. [at least partially projects]
3.4 Financial Sustainability
- Business planning processes, financial practices and procedures, etc.
3.5 Contracts, Licenses & Liabilities
- For the digital materials preserved… [CERN? Projects?]

Collaboration with others
- The elaboration of a clear "business case" for long-term data preservation;
- The development of an associated "cost model";
- A common view of the Use Cases driving the need for data preservation;
- Understanding how to address Funding Agencies' requirements for Data Management Plans;
- Preparing for certification of HEP digital repositories and their long-term future.

How Much Data?
- 100 TB per LEP experiment: 3 copies at CERN (1 on disk, 2 on tape), plus copies outside.
- 1-10 PB for the experiments at the HERA collider at DESY, the Tevatron at Fermilab, or the BaBar experiment at SLAC.
- The LHC experiments are already in the multi-hundred-PB range (x00 PB).
- 10 EB or more including the High Luminosity upgrade of the LHC (HL-LHC).

Conclusions & Next Steps
- As is well known, data preservation is a journey, not a destination.
- Can we capture sufficient "knowledge" to keep the data usable beyond the lifetime of the original collaboration?
- Can we prepare for major migrations, similar to those that happened in the past? (Or will x86 and Linux last "forever"?)
- For the HL-LHC, we may have neither the storage resources to keep all (intermediate) data, nor the computational resources to re-compute them!
- You can't share or re-use data, nor reproduce results, if you haven't first preserved them (data, software, documentation, knowledge).
- Open Data releases – in addition to certification – provide a powerful way of measuring whether we are achieving our goals!