Overview of the computing activities of the LHC experiments
A. De Salvo (with T. Boccali, G. Carlino, C. Grandi, D. Elia) – 28 May 2014
The Computing Models at LHC
- Distributed: as in MONARC (1999), a political decision NOT to have everything in a single place
- Tiered: not all sites are equal, they live in a hierarchy
  - One reason is networking, not believed sufficient/affordable to guarantee a full mesh of connections between the (~200) sites
  - The other reason is to share responsibilities with the funding agencies (FAs)
- The Grid is the middleware which glues the sites together
- WLCG takes care of support/debugging/implementation
LHC Run schedule beyond Run 1
Run numbering and terminology (pp, for ATLAS and CMS)
- Run 2 (2015-2018): up to 1.5×10^34 cm^-2 s^-1, 25 ns, 13-14 TeV – up to 50 fb^-1/y, pile-up ~40
- Run 3 (2020-2022): up to 2.5×10^34 cm^-2 s^-1, 25 ns, 13-14 TeV – up to 100 fb^-1/y, pile-up ~60
- Run 4 (2025-2028): up to ~5×10^34 cm^-2 s^-1, 25 ns, 13-14 TeV – up to 300 fb^-1/y, pile-up ~140+; this is "Phase 2" for pp
ATLAS + CMS
- For Higgs physics and searches you cannot raise the single-lepton trigger thresholds too much, otherwise you start to lose reach; 20-30 GeV is deemed necessary
- So you cannot limit the trigger rate just by increasing thresholds…
- To maintain a trigger menu close to the 2012 one, you need(ed):
  - O(400 Hz) in 2012
  - O(1 kHz) in 2015-2018
  - O(5-10 kHz) in 2025+
ALICE: huge change in the DAQ+offline model for 2020+
- The input rate to the HLT goes from O(100 Hz) to O(50 kHz) peak, which is… "everything"; ~40% of it goes to offline
- HLT and offline get merged (O2): the HLT effectively does reconstruction, not selection
- To allow for this, the use of GPUs (TPC reconstruction) is needed soon (and has to a good extent already been tested)
- Luminosity: up to 0.5 nb^-1 (PbPb) in 2015+, up to 10 nb^-1 (PbPb) in 2020+
- Data-flow figure: 50 kHz readout, ~1 TB/s into reconstruction + compression, ~80 GB/s (1.5 MB/event) to storage, ~13 GB/s to tape
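A quick numerical check of the rates above (a minimal sketch; the ~20 MB raw event size is inferred from 1 TB/s at 50 kHz, it is not stated on the slide):

```python
# Back-of-the-envelope check of the ALICE 2020+ rates quoted above.
# The raw event size is inferred (1 TB/s divided by 50 kHz), not taken from the slide.
readout_rate_hz = 50e3        # PbPb readout rate
compressed_event_mb = 1.5     # compressed event size quoted on the slide

raw_throughput_gb_s = 1000    # ~1 TB/s out of the detector
raw_event_mb = raw_throughput_gb_s * 1e3 / readout_rate_hz        # ~20 MB/event (inferred)

compressed_throughput_gb_s = readout_rate_hz * compressed_event_mb / 1e3   # ~75 GB/s

print(f"inferred raw event size : {raw_event_mb:.0f} MB")
print(f"compressed throughput   : {compressed_throughput_gb_s:.0f} GB/s (slide quotes ~80 GB/s)")
```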
LHCb Upgrade @ Run 3 (2020+)
- Essentially trigger-less (40 MHz to the HLT)
- Able to sustain 2×10^33 cm^-2 s^-1 (25 ns), i.e. 5× the 2012 value (which used 50 ns), with more than twice the pile-up
- Up to 20 kHz "on tape" in 2018+ (it was 5 and 12 kHz in 2012 and 2015)
- 20 kHz × 5 Ms = 100 billion events/year! (but just ~100 kB/event)
- No resources are planned for reprocessing: prompt reconstruction has to be good enough
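The event count above, plus the raw-data volume it implies (the ~10 PB/year figure is derived here, it is not quoted on the slide):

```python
# LHCb 2018+ back-of-the-envelope: events per year and the data volume they imply.
rate_hz = 20e3          # events to tape per second
live_seconds = 5e6      # ~5 Ms of LHC running per year
event_size_kb = 100     # event size quoted on the slide

events_per_year = rate_hz * live_seconds                    # 1e11 = 100 billion
volume_pb = events_per_year * event_size_kb * 1e3 / 1e15    # ~10 PB/year (derived)

print(f"{events_per_year:.0e} events/year, ~{volume_pb:.0f} PB/year to tape")
```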
Disk space evolution
- Very rough estimate of the new RAW data per year of running, using a simple extrapolation of the current data volume scaled by the output rates
- To be added: derived data (ESD, AOD), simulation, user data (which is easily a big factor)
- The different experiments have different timing: for ATLAS/CMS the inflation comes with Run 4, for LHCb/ALICE with Run 3
- [Plot: RAW data per year (PB), disk/tape]
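A minimal sketch of the kind of rate-scaling extrapolation described above; the baseline numbers are illustrative placeholders, not the experiments' actual figures:

```python
# Sketch of the "scale current RAW volume by the output-rate increase" extrapolation.
# The baseline numbers below are illustrative placeholders only.
baseline_raw_pb_per_year = 7.0    # hypothetical Run 1 RAW volume for one experiment
baseline_rate_khz = 0.4           # hypothetical Run 1 output rate (~400 Hz)

# Future rates taken from the ATLAS/CMS trigger slide above (midpoint for 2025+).
future_rates_khz = {"Run 2": 1.0, "Run 4": 7.5}

for run, rate in future_rates_khz.items():
    scaled = baseline_raw_pb_per_year * rate / baseline_rate_khz
    print(f"{run}: ~{scaled:.0f} PB RAW/year (naive rate scaling only)")
```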
CPU evolution
Using network resources more efficiently
- The LHC experiments are recognized to use their computing resources very well, with the T0, T1s and T2s used at 90% or better
- The remainder is difficult to squeeze, due to MONARC limitations. Some examples:
  - The T0 can be free at times, but it can be difficult to feed it with analysis jobs, since the input data is not available at CERN
  - T2s could reprocess data during the Christmas break, but they have no RAW data
- We can flatten MONARC and improve the network
  - All T1s exceed 10 Gbps and are planning for 40+ Gbps
  - All the major T2s have at least 10 Gbps connections
Storage Federations
- The LHC experiments are deploying federated storage infrastructures based on the xrootd or http protocols
- They provide new access modes & redundancy
  - Jobs access data on shared storage resources via the WAN
  - Relaxes CPU-data locality, opening up regional storage models: from "jobs go to data" to "jobs go as close as possible to data"
  - Failover capability in case of local storage problems
  - Dynamic data caching based on access patterns
  - A data solution for computing sites without storage: opportunistic, cloud, Tier3
- Disk can be concentrated in the large sites only
  - Reduction of the operational load at sites
  - Lower disk storage demands
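A minimal sketch of the failover idea behind a federation: try the local replica first, then fall back to federation redirectors over the WAN. The URLs and the open_file() helper are illustrative placeholders, not real experiment endpoints:

```python
# Sketch of federated-storage failover: local replica first, then WAN redirectors.
# URLs and open_file() are hypothetical; a real job would use an xrootd-capable reader.
replicas = [
    "root://local-se.example.site//store/data/run1/file.root",        # local storage element
    "root://regional-redirector.example.org//store/data/run1/file.root",
    "root://global-redirector.example.org//store/data/run1/file.root",
]

def open_file(url):
    """Placeholder for an xrootd-capable open (e.g. via ROOT or uproot)."""
    raise IOError(f"cannot open {url}")   # simulate a local failure in this sketch

def open_with_failover(urls):
    for url in urls:
        try:
            return open_file(url)
        except IOError as err:
            print(f"failover: {err}")
    raise RuntimeError("no replica reachable")

# open_with_failover(replicas)  # would walk the list until one replica opens
```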
Using ICT resources more efficiently: Analysis Trains
- Instead of running N analysis jobs on a single input file, N being the number of ongoing analyses, run a single centrally managed job which executes N analysis modules (ALICE already does this; see the sketch after this slide)
- What do you gain?
  - I/O can be provisioned and planned
  - Fewer jobs in the system, easier to manage
  - Analysis trains are run by operators, not physicists
  - All the overheads (startup time, access to conditions, …) are also divided by N
  - The train gets much more resources than single user jobs… it can be much faster
- What do you lose?
  - Well, you can miss the train! When is the next one?
  - In general, less control by the users
  - It is not guaranteed that a single train is OK for everyone
  - Not optimal for last-minute analysis code changes or emergencies (i.e. conference panic)
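A minimal sketch of the train idea, with hypothetical analysis modules: the input is read once and every module sees each event, instead of N jobs each re-reading the same file:

```python
# Minimal sketch of an analysis train: one centrally managed pass over the input,
# with N user "wagons" (analysis modules) attached. Module names are hypothetical.
def wagon_dimuon(event):          # user analysis 1
    return event["n_muons"] >= 2

def wagon_jet_monitor(event):     # user analysis 2
    return event["leading_jet_pt"] > 30.0

train = [wagon_dimuon, wagon_jet_monitor]   # N modules, one job

def run_train(events):
    counts = {wagon.__name__: 0 for wagon in train}
    for event in events:          # the input is read ONCE for all analyses
        for wagon in train:
            if wagon(event):
                counts[wagon.__name__] += 1
    return counts

toy_events = [{"n_muons": 2, "leading_jet_pt": 45.0},
              {"n_muons": 0, "leading_jet_pt": 12.0}]
print(run_train(toy_events))      # {'wagon_dimuon': 1, 'wagon_jet_monitor': 1}
```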
Opportunistic resource usage: grab more resources (for free…)
- We DO have other resources outside the TierX hierarchies
  - The filter (HLT) farm is a specialized system, difficult to turn into an offline-capable system on short notice
  - The HLT is NOT used at all for what it is paid for during:
    - ~3 months a year (winter shutdown)
    - 2 weeks every couple of months (machine development)
    - roughly 50% of the run time (typical interfill/total fraction in Run 1)
- More general issue: how to use a big resource that
  - you cannot control (decide which OS, which network configuration, …)
  - comes and goes fast (hopefully you did not pay for it)
Opportunistic usage
- While we are very good at using our offline resources, the same cannot be said for other (scientific) fields
  - HPC centers are often keen to offer cycles to HEP (not CINECA, currently)
  - Google/Amazon are interested in experimenting with our workflows (hoping for future contracts)
  - Some commercial entities seem to have spare cycles to offer (reward = PR)
- Use Clouds to exploit them
  - Free opportunistic cloud resources
    - HLT farms accessible through a cloud interface during shutdowns or LHC inter-fill periods (LHCb since early 2013, ~20% of their resources in 2013)
    - Academic facilities
  - Cheap (?) opportunistic cloud resources
    - Commercial cloud infrastructures (Amazon EC2, Google): good deals but under restrictive conditions
- PROs: they can increase the available resources substantially; 2015 estimates are at the 10% level (excluding HLT). Unlike the Grid, you do not have to pay for the development of the Cloud
- CONs: you need an agile DM/WM system which accepts high failure rates
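A minimal sketch of what "accepting high failure rates" means for a workload manager on opportunistic resources: tasks are treated as idempotent and simply re-queued when a slot disappears. The 30% failure rate and the helper function are illustrative assumptions:

```python
import random

# Sketch of a failure-tolerant workload manager for opportunistic resources:
# any task can vanish with its node, so tasks are idempotent and simply re-queued.
# The 30% failure rate is an illustrative assumption, not a measured number.
FAILURE_RATE = 0.30

def run_on_opportunistic_slot(task_id):
    if random.random() < FAILURE_RATE:        # node reclaimed / preempted / gone
        raise RuntimeError(f"task {task_id} lost its slot")
    return f"output of task {task_id}"

def process(tasks, max_attempts=10):
    done, queue = {}, list(tasks)
    for _ in range(max_attempts):
        failed = []
        for task in queue:
            try:
                done[task] = run_on_opportunistic_slot(task)
            except RuntimeError:
                failed.append(task)            # just put it back in the queue
        if not failed:
            break
        queue = failed
    return done

print(len(process(range(100))), "tasks completed despite slot losses")
```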
Improve the algorithms
- In the pre-Run 1 phase the focus was on having running software
- Since then, a large effort has been spent on optimizing the critical parts, with huge success (a 2× speed-up per year was not uncommon)
- This will be much more difficult in the future: the low-hanging fruit is gone
- [Plot: ATLAS example]
4. Change technology (CPUs to GPUs, ARM, …)
- To get the most performance per $, we need to stay at the top of the commodity wave
Computer programming evolution
- Need(ed) to gain expertise in multicore programming
  - The Concurrency Forum is evolving into the HEP SW Collaboration next month; let's see what the proposal will be
- Need(ed) to gain expertise in GPU programming
  - For time-critical tasks, like tracking @ HLT (ALICE already does it, ATLAS has shown numbers)
- PROs: more performance per $, which surely helps to fill the gap (see the sketch below)
- CONs: we already suffered going from Fortran to C++, and this is much more difficult; we need years of preparation and good software frameworks
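A minimal sketch of the event-parallel, multicore pattern the slide refers to, using Python's standard multiprocessing; reconstruct() is a stand-in for a real per-event workload:

```python
from multiprocessing import Pool

# Minimal sketch of event-level parallelism on a multicore node.
# reconstruct() is a stand-in for a real per-event reconstruction workload.
def reconstruct(event_id):
    acc = 0.0
    for i in range(10_000):           # dummy CPU-bound work
        acc += (event_id * i) % 7
    return event_id, acc

if __name__ == "__main__":
    events = range(1000)
    with Pool(processes=4) as pool:   # one worker per core (4 assumed here)
        results = pool.map(reconstruct, events)
    print(f"reconstructed {len(results)} events on 4 cores")
```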
Evolution of infrastructures
- Technological evolution
  - The trend is towards a concentration of the activities and a "regionalization" of the sites
    - Concentration of the local activities in central farms
    - Diskless small sites (e.g. Tier3s), concentrating the storage in a few big sites
    - Reduction of manpower needs and storage costs
  - Distributed sites
    - Sharing of large sites' computing infrastructures with non-HEP and/or non-scientific activities to ensure self-sustainability (in INFN, ReCaS is a good example of that)
  - Moving progressively from Grid computing to Cloud computing
    - Load moves from the sites to central services
    - All the experiment software is delivered to the sites by central services
    - Server configurations are supplied as images directly by central experiment support
    - Very few services remain at the level of the site
    - Not a solution for everything (so far at least): it does not yet completely address the needs of I/O-intensive tasks from the storage point of view
- Governance
  - Coordination of the computing and software activities
  - Coordination of the activities for participation in H2020 calls
  - In INFN, interaction with all the Italian institutions involved in computing: GARR (CSD department), INAF, INGV, CNR, …
Conclusions
- The LHC computing infrastructure based on the Grid paradigm completely satisfied the needs of the experiments in Run 1
- The Grid will remain the baseline model for Run 2; nevertheless, new software and technologies are helping us evolve towards a more cost-effective and dynamic model
- Run 3 does not seem critical for ATLAS/CMS, but it is a major change for ALICE/LHCb: major changes are foreseen in their structures
- Run 4 is more demanding for ATLAS and CMS, but it is still too far away to have a clear idea of what will really happen!
- Many optimizations and infrastructure evolutions are ongoing in all the experiments, in order to make better use of industry standards, optimize the usage of resources and fully exploit all the computing facilities we are able to access
Backup slides
Moore and Friends
- Moore's law: "The number of transistors on integrated circuits doubles approximately every two years". Usually this translates to "every two years, for the same $$, you get a computer twice as fast"
- Kryder's law: "The capacity of hard drives doubles approximately every two years"
- Butters' law of photonics: "The amount of data coming out of an optical fiber doubles every nine months"
- Nielsen's law: "Bandwidth available to users increases by 50% every year"
More recent (worse? more realistic?) estimates
- B. Panzer / CERN (2013): CPU +25%/y, storage +20%/y (including reasonably timed resource retirement)
- This means roughly a doubling every 3-4 years:
  - Over 3 years (2013-2015): a factor ~2
  - Over 7 years (2013-2020): a factor ~5
  - Over 13 years (2013-2026): a factor ~10-15
- [Plot: CPU and disk growth at constant cost, with a "we are here" marker]
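A quick compound-growth check of the factors quoted above, using the +25%/y (CPU) and +20%/y (disk) figures:

```python
# Compound-growth check of the factors quoted above: +25%/y for CPU, +20%/y for disk.
cpu_growth, disk_growth = 1.25, 1.20

for years in (3, 7, 13):
    print(f"after {years:2d} years: CPU x{cpu_growth**years:4.1f}, disk x{disk_growth**years:4.1f}")

# after  3 years: CPU x 2.0, disk x 1.7
# after  7 years: CPU x 4.8, disk x 3.6
# after 13 years: CPU x18.2, disk x10.7   -> roughly consistent with the "factor 2 / 5 / 10-15" quoted
```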