2
iPRES 2016, CH https://indico.cern.ch/event/448571/
International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics: CERN Services for LTDP
3
Overview of CERN
- CERN – the European Organisation for Nuclear Research – is situated just outside Geneva, extending into France
- Founded in 1954, it now has 22 member states
- It operates a wide range of accelerators, of which the LHC is probably the best known
- A Large Hadron Collider was first proposed in the late 1970s, when discussions on a Lepton Collider (LEP) were being held
- A High Luminosity upgrade (HL-LHC) was approved in June 2016, extending the LHC’s life until around 2040
- A High Energy upgrade (HE-LHC) may follow…
4
The CERN Accelerator Complex
The LHC (former LEP) ring
5
LTDP: Now & Then in HEP
- Traditionally at CERN, users (experiments) were responsible for buying their own tapes and managing them
- Capacity: MB! (from 1600 bpi reels to the first 3480 cartridges)
- This started to change with LEP (1989), including with the introduction of robots and Unix-style filenames instead of tape numbers
- But at the end of LEP (2000) there were still no sustainable preservation services
- ~1 million tape volumes! Impossible to automate!
- ALEPH: 1 PC with the full environment + all data per collaborating institute
6
The DPHEP Study Group
- Formed late 2008 at the initiative of DESY
- Included representatives from all major HEP labs worldwide, including from experiments due to end data-taking shortly
- Produced a Blueprint Report that detailed the situation and made concrete recommendations, now being acted upon
- Input to the European Particle Physics Strategy update of 2012/13 – highly influential!
7
What is the problem?
- The data from the world’s particle accelerators and colliders (HEP data) is both costly and time-consuming to produce
- That from the LHC is a particularly striking example, ranging in volume from several hundred PB today to tens of EB by 2035 or so
- HEP data contains a wealth of scientific potential, plus high value for educational outreach
- Given that much of the data is unique, it is essential to preserve not only the data but also the full capability to reproduce past analyses and perform new ones
- This means preserving data, documentation, software and “knowledge”
- There are numerous cases where data from a past experiment has been re-analyzed: we must retain this ability in the future
8
What does DPHEP do? DPHEP has become a Collaboration with signatures from the main HEP laboratories and some funding agencies worldwide. It has established a “2020 vision”, whereby:
- All archived data – e.g. that described in the DPHEP Blueprint, including LHC data – should be easily findable and fully usable by the designated communities, with clear (Open) access policies and possibilities to annotate further;
- Best practices, tools and services should be well run-in, fully documented and sustainable; built in common with other disciplines, based on standards;
- There should be a DPHEP portal through which data / tools are accessed;
- Clear targets & metrics to measure the above should be agreed between Funding Agencies, Service Providers and the Experiments.
9
What Makes HEP Different?
- We throw away most of our data before it is even recorded – “triggers”
- Our detectors are relatively stable over long periods of time (years) – not “doubling every 6 or 18 months”
- We make “measurements” – not “observations”
- Our projects typically last for decades – we need to keep the data usable for at least this length of time
- We have shared “data behind publications” for more than 30 years… (HEPData)
10
An OBSERVATION… First observed during the solar eclipse of 1919 by Sir Arthur Eddington, when the Sun was silhouetted against the Hyades star cluster
9-Aug-2016, Barry Barish; ICHEP - Chicago
11
And another... (Black holes merging, 1.3 billion years ago…)
9-Aug-2016, ICHEP - Chicago
12
Future Circular Colliders (FCC)
International Collaboration: ~70 institutes
International conceptual design study of a ~100 km ring:
- pp collider (FCC-hh): ultimate goal, defines infrastructure requirements; √s ~ 100 TeV, L ~ 2×10³⁵; 4 IP, ~20 ab⁻¹/expt
- e+e- collider (FCC-ee): possible first step; √s = GeV, L ~ (200-2)×10³⁴; 2 IP
- pe collider (FCC-he): option; √s ~ 3.5 TeV, L ~ 10³⁴; “LEP3”
Also part of the study – HE-LHC: FCC-hh dipole technology (~16 T) in the LHC tunnel, √s ~ 30 TeV
GOAL: CDR in time for the next ES
FCC-ee options could have 1000 times the luminosity of LEP2 (10 years × 100 days = 1000)
Machine studies are site-neutral; however, an FCC at CERN would greatly benefit from the existing laboratory infrastructure and accelerator complex – a ~100 km ring fits the geology
13
HEP LTDP Use Cases
- Bit preservation as a basic “service” on which higher-level components can build; “Maybe CERN does bit preservation better than anyone else in the world” (David Giaretta)
- Preserve data, software, and know-how in the collaborations; basis for reproducibility
- Share data and associated software with the (wider) scientific community, such as theorists or physicists not part of the original collaboration
- Open access to reduced data sets for the general public (LHC experiments)
These match very well to the requirements for DMPs
14
Workshop on Active Data Management Plans
Agenda, talks, videos, conclusions
Includes more detailed talks about HEP data preservation & Open Data releases
15
DMPs for the LHC experiments
- The first LHC experiment to produce a “DMP” was CMS in 2012
- This called for Open Data Releases of significant fractions of the (cooked) data after an embargo period (see the ADMP workshop)
- Now all 4 main experiments have DMPs
- Foresee capturing project-specific detail in DMPs (as opposed to overall site policy)
- Open Data Releases are now “routine”! See this ICHEP for more details
16
CERN Services for LTDP
- State-of-the-art “bit preservation”, implementing practices that conform to the ISO standard
- “Software preservation” – a key challenge in HEP, where the software stacks are both large and complex (and dynamic)
- Analysis capture and preservation, corresponding to a set of agreed Use Cases
- Access to data behind physics publications – the HEPData portal
- An Open Data portal for released subsets of (currently) the LHC data
- A DPHEP portal that also links to data preservation efforts at other HEP institutes worldwide
Each of these is a talk topic in its own right!
17
Bit Preservation: Steps Include
- Regular media verification: when tape written, filled, every 2 years…
- Controlled media lifecycle: media kept for max. 2 drive generations
- Reducing tape mounts: reduces media wear-out & increases efficiency
- Data redundancy: for “smaller” communities, a 2nd copy can be created in a separate library in a different building (e.g. LEP – 3 copies at CERN!)
- Protecting the physical link: between disk caches and tape servers
- Protecting the environment: dust sensors! (Don’t let users touch tapes)
Constant improvement: reduction in bit-loss rate to 5 × 10⁻¹⁶
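Regular media verification is, at heart, fixity checking: recompute each file's checksum and compare it against a stored manifest. A minimal illustrative sketch in Python (the manifest layout and file names are hypothetical; CERN's production tape systems implement this internally, at vastly larger scale):

```python
import hashlib
from pathlib import Path

def checksum(path: Path, algo: str = "sha256", chunk: int = 1 << 20) -> str:
    """Stream a file through a hash in 1 MiB chunks, so multi-GB archive
    files are never loaded fully into memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify(manifest: dict[str, str], root: Path) -> list[str]:
    """Compare stored checksums against freshly recomputed ones.
    Returns the names of missing or corrupted files."""
    bad = []
    for name, expected in manifest.items():
        p = root / name
        if not p.exists() or checksum(p) != expected:
            bad.append(name)
    return bad
```

At the quoted bit-loss rate of 5 × 10⁻¹⁶, a 100 PB archive corresponds to only a few hundred corrupted bits per full pass – but that figure is only known because such verification passes are actually run.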
18
Software Preservation
- HEP has long shared its software across international collaborations
- CERNLIB – first started in 1964 and used by many communities worldwide
- Today HEP s/w is O(10⁷) lines of code, 10s to 100s of modules and many languages! (No standard app)
- Versioning filesystems and virtualisation look promising: we have demonstrated resurrecting s/w 15 years after data taking and hope to provide stability 5-15 years into the future
- Believe we can analyse LEP data ~30 years after data taking ended! Does anyone have a better idea?
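A low-tech complement to versioning filesystems and virtualisation is to archive an exact description of the build environment alongside the software, so that a matching VM or container can be reconstructed years later. An illustrative sketch (the snapshot schema and the package names in the docstring are assumptions, not any experiment's actual format):

```python
import json
import platform
import sys
from datetime import datetime, timezone

def environment_snapshot(packages: dict[str, str]) -> str:
    """Serialise the platform and exact package versions to JSON for archiving
    next to the software itself. `packages` maps name -> version and would be
    supplied by the build system (e.g. {"cernlib": "2006", "root": "6.06"})."""
    snap = {
        "captured": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": packages,
    }
    return json.dumps(snap, indent=2, sort_keys=True)
```

The design choice is deliberate: plain JSON text is far more likely to remain readable in 30 years than any binary environment image, even if the image is what ultimately gets executed.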
19
Analysis Preservation
The ability to reproduce analyses is not only required by Funding Agencies but also essential to the work of the experiments / collaborations. Use Cases include:
- An analysis that is underway has to be handed over, e.g. as someone is leaving the collaboration;
- A previous analysis has to be repeated;
- Data from different experiments have to be combined.
Need to capture: metadata, software, configuration options, high-level physics information, documentation, instructions, links to presentations, quality protocols, internal notes, etc.
At least one experiment (ALICE) would like demonstrable reproducibility to be part of the publication approval process!
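The capture list above maps naturally onto a structured record archived with the data. A minimal illustrative sketch (the field names and example values are hypothetical, not the schema of any actual experiment's capture system):

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class AnalysisRecord:
    """One archived analysis: enough context for a hand-over or a re-run.
    Fields mirror the capture list in the slide; values are illustrative."""
    title: str
    datasets: list[str]                # input data identifiers
    software: dict[str, str]           # package -> exact version
    configuration: dict[str, str]      # cuts, options, calibration tags
    documentation: list[str] = field(default_factory=list)  # notes, talk links

    def to_json(self) -> str:
        """Serialise for deposit in a preservation archive."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

Making such a record a mandatory artifact of publication approval, as ALICE proposes, turns analysis preservation from an afterthought into part of the normal workflow.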
20
Portals
No time to discuss in detail, but these clearly address the challenges of making the data “discoverable” and “usable” (if not necessarily F.A.I.R.)
21
Certification of the CERN Site
- We believe certification will allow us to ensure that best practices are implemented and followed up on in the long term: “written into the fabric of the organisation”
- Scope: Scientific Data and CERN’s Digital Memory
- Timescale: complete prior to the 2019/2020 ESPP update
- Will also “ensure” adequate resources, staffing, training, succession plans etc.
- CERN can expect to exist until HL/HE-LHC (2040/50). And beyond? FCC? Depends on physics…
22
Infrastructure & Security Risk Management
ISO metrics:
- 5.1 Technical Infrastructure Risk Management – we do all of this, but is it documented? Technology watches, h/w & s/w changes, detection of bit corruption or loss, reporting, security updates, storage media refreshing, change management, critical processes, handling of multiple data copies etc.
- 5.2 Security Risk Management – do we do all of this, and is it documented? Security risks (data, systems, personnel, physical plant), disaster preparedness and recovery plans…
23
Organisational Infrastructure
ISO metrics – Organisational Infrastructure:
- 3.1 Governance & Organisational Viability: Mission Statement, Preservation Policy, Implementation plan(s) etc. [CERN, CERN, project(s)]
- 3.2 Organisational Structure & Staffing: duties, staffing, professional development etc. [APT etc.]
- 3.3 Procedural Accountability & Preservation Policy Framework: designated communities, knowledge bases, policies & reviews, change management, transparency & accountability etc. [at least partially projects]
- 3.4 Financial Sustainability: business planning processes, financial practices and procedures etc.
- 3.5 Contracts, Licenses & Liabilities: for the digital materials preserved… [CERN? Projects?]
24
Collaboration with others
- The elaboration of a clear “business case” for long-term data preservation
- The development of an associated “cost model”
- A common view of the Use Cases driving the need for data preservation
- Understanding how to address Funding Agencies’ requirements for Data Management Plans
- Preparing for Certification of HEP digital repositories and their long-term future
25
How Much Data?
- 100 TB per LEP experiment: 3 copies at CERN (1 on disk, 2 on tape) (+ copies outside)
- 1-10 PB for experiments at the HERA collider at DESY, the TEVATRON at Fermilab or the BaBar experiment at SLAC
- The LHC experiments are already in the multi-hundred PB range (x00 PB)
- 10 EB or more including the High Luminosity upgrade of the LHC (HL-LHC)
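The volumes quoted above span roughly five orders of magnitude. A back-of-the-envelope comparison (decimal units; the 300 PB figure for today's LHC is one illustrative value within the "multi-hundred PB" range, not an official number):

```python
# Everything in terabytes, decimal units: 1 PB = 1000 TB, 1 EB = 1000 PB.
TB, PB, EB = 1, 1_000, 1_000_000

lep_total = 4 * 100 * TB * 3   # 4 LEP experiments x 100 TB x 3 CERN copies
lhc_today = 300 * PB           # illustrative "multi-hundred PB" value
hl_lhc = 10 * EB               # projected, including the HL-LHC era

print(f"LEP (all CERN copies): {lep_total / PB:.1f} PB")   # 1.2 PB
print(f"HL-LHC vs LEP raw data: {hl_lhc / (4 * 100 * TB):,.0f}x")
```

The striking point is that the entire LEP archive, with full triple redundancy, is smaller than a single day's raw output of the LHC era, which is why LEP-style per-institute copies no longer scale.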
26
Conclusions & Next Steps
- As is well known, Data Preservation is a journey and not a destination
- Can we capture sufficient “knowledge” to keep the data usable beyond the lifetime of the original collaboration?
- Can we prepare for major migrations, similar to those that happened in the past? (Or will x86 and Linux last “forever”?)
- For the HL-LHC, we may have neither the storage resources to keep all (intermediate) data, nor the computational resources to re-compute them!
- You can’t share or re-use data, nor reproduce results, if you haven’t first preserved it (data, software, documentation, knowledge)
- Open Data Releases – in addition to Certification – provide a powerful way of measuring whether we are achieving our goals!