
1 Overview of the computing activities of the LHC experiments. Alessandro De Salvo, T. Boccali, G. Carlino, C. Grandi, D. Elia. 28 May 2014

2 The Computing Models at LHC. Distributed: as decided with MONARC (1999), a political decision NOT to concentrate everything in a single place. Tiered: not all sites are equal, they live in a hierarchy. One reason is networking, not believed sufficient/affordable to guarantee a full mesh of connections between the ~200 sites; another reason is to share responsibilities with the Funding Agencies. The Grid is used as the middleware that glues the sites together, with WLCG taking care of support, debugging and implementation.

3 LHC Run schedule beyond Run1

4 Run numbering and terminology (pp, for ATLAS and CMS). Run2 (2015-2018): up to 1.5e34 cm^-2 s^-1, 25 ns, 13-14 TeV, up to 50 fb^-1/y, pile-up ~40. Run3 (2020-2022): up to 2.5e34 cm^-2 s^-1, 25 ns, 13-14 TeV, up to 100 fb^-1/y, pile-up ~60. Run4 (2025-2028): up to ~5e34 cm^-2 s^-1, 25 ns, 13-14 TeV, up to 300 fb^-1/y, pile-up 140+; this is "Phase2" for pp.
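
As a hedged cross-check of the pile-up values quoted above, a minimal sketch using mu ≈ L·σ_inel / (n_bunch·f_rev). The inelastic cross section (~80 mb) and the bunch/orbit parameters are standard LHC numbers assumed here, not taken from the slide.

```python
# Hedged cross-check of the pile-up numbers: mu ~= L * sigma_inel / (n_bunch * f_rev).
# sigma_inel, n_bunch and f_rev below are assumed standard LHC values, not from the slide.
SIGMA_INEL_CM2 = 80e-27       # ~80 mb inelastic pp cross section at 13-14 TeV
N_BUNCH = 2808                # nominal number of colliding bunches (25 ns spacing)
F_REV_HZ = 11245.0            # LHC revolution frequency

def pileup(lumi_cm2s: float) -> float:
    return lumi_cm2s * SIGMA_INEL_CM2 / (N_BUNCH * F_REV_HZ)

for label, lumi in [("Run2", 1.5e34), ("Run3", 2.5e34), ("Run4", 5e34)]:
    print(f"{label}: <mu> ~ {pileup(lumi):.0f}")
# -> about 38, 63 and 127, in line with the ~40 / ~60 / 140+ quoted on the slide
#    (HL-LHC reaches 140+ with luminosity levelling above 5e34).
```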

5 ATLAS + CMS. If you want to do Higgs physics and searches, you cannot raise the single-lepton trigger thresholds too much, otherwise you start to lose reach; 20-30 GeV is deemed necessary. So you cannot limit the trigger rate just by increasing thresholds. To maintain a selection level close to 2012, you need(ed) O(400 Hz) in 2012, O(1 kHz) in 2015-2018 and O(5-10 kHz) in 2025+.
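
A back-of-the-envelope sketch of why the output rate must grow: at fixed thresholds the physics trigger rate scales roughly linearly with instantaneous luminosity. The 2012 reference luminosity (~0.77e34 cm^-2 s^-1) is an assumption added here, not a number from the slide.

```python
# Naive linear scaling of the HLT output rate with luminosity, at 2012-like thresholds.
RATE_2012_HZ = 400.0          # O(400 Hz) HLT output in 2012 (from the slide)
LUMI_2012 = 0.77e34           # assumed 2012 peak luminosity, cm^-2 s^-1 (not on the slide)

def scaled_rate(lumi: float) -> float:
    """Rate at 2012 thresholds, scaled linearly with luminosity."""
    return RATE_2012_HZ * lumi / LUMI_2012

for label, lumi in [("Run2 (1.5e34)", 1.5e34), ("Run4 (5e34)", 5e34)]:
    print(f"{label}: ~{scaled_rate(lumi):.0f} Hz at 2012-like thresholds")
# -> roughly 0.8 kHz and 2.6 kHz; the O(1 kHz) and O(5-10 kHz) targets on the slide
#    also absorb pile-up growth and richer trigger menus, so they sit above this
#    naive linear estimate.
```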

6 ALICE. Huge change in the DAQ+offline model for 2020+. Input to the HLT goes from O(100 Hz) to O(50 kHz) peak, which is ... "everything", with about 40% of it going to offline. HLT and offline get merged (O2): the HLT effectively does reconstruction, not selection. To allow for this, the use of GPUs (TPC reconstruction) is needed soon (and is to a good extent already tested). 2015+: up to 0.5 nb^-1 (PbPb); 2020+: up to 10 nb^-1 (PbPb). (Slide dataflow figure: ~1 TB/s into the online system at 50 kHz; reconstruction + compression reduce this to ~80 GB/s (50 kHz at 1.5 MB/event); ~13 GB/s go to tape storage.)
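
A minimal sketch of the compressed throughput implied by the numbers on the slide (50 kHz Pb-Pb readout at ~1.5 MB/event after compression):

```python
# Throughput implied by the slide's numbers: rate x compressed event size.
RATE_HZ = 50_000              # Pb-Pb readout rate in Run3 (from the slide)
EVENT_SIZE_MB = 1.5           # compressed event size (from the slide)

throughput_gb_s = RATE_HZ * EVENT_SIZE_MB / 1000.0
print(f"~{throughput_gb_s:.0f} GB/s out of reconstruction + compression")
# -> ~75 GB/s, matching the ~80 GB/s in the slide's dataflow figure; a further
#    reduction step brings the rate written to tape down to ~13 GB/s.
```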

7 LHCb. Upgrade at Run3 (2020+): essentially trigger-less (40 MHz to the HLT), able to sustain 2x10^33 cm^-2 s^-1 (25 ns), i.e. 5x the 2012 luminosity (with 50 ns), with pile-up > 2. Up to 20 kHz "on tape" from 2018+; it was/will be 5, 12, 20 kHz in 2012, 2015, 2018+. 20 kHz x 5 Msec = 100 billion events/y (but just ~100 kB/event). No resources are planned for reprocessing: the prompt reconstruction has to be OK.
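
A quick check of the arithmetic on the slide, taking the 20 kHz output rate, ~5e6 live seconds per year and ~100 kB/event at face value:

```python
# Events and RAW volume per year from the slide's own numbers.
RATE_HZ = 20_000              # "on tape" rate (from the slide)
LIVE_SECONDS = 5e6            # "5 Msec" of data taking per year (from the slide)
EVENT_SIZE_KB = 100           # ~100 kB/event (from the slide)

events_per_year = RATE_HZ * LIVE_SECONDS
raw_pb_per_year = events_per_year * EVENT_SIZE_KB * 1e3 / 1e15   # bytes -> PB
print(f"{events_per_year:.1e} events/year, ~{raw_pb_per_year:.0f} PB RAW/year")
# -> 1e11 (100 billion) events and ~10 PB of RAW per year, before any derived
#    data or simulation is added.
```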

8 Disk space evolution. Very rough estimate of the new RAW data per year of running, using a simple extrapolation of the current data volume scaled by the output rates. Still to be added: derived data (ESD, AOD), simulation, user data (which is easily a big factor). The different experiments have different timing: for ATLAS/CMS the inflation comes at Run4, for LHCb/ALICE at Run3. (Slide plot: RAW data per year in PB, disk/tape.)
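
A hedged sketch of the kind of extrapolation behind the plot: RAW volume per year is roughly trigger rate x event size x live seconds. The event sizes and live times below are illustrative assumptions, not values from the slide.

```python
# Illustrative RAW-volume extrapolation; all parameters are assumptions.
def raw_pb_per_year(rate_hz: float, event_mb: float, live_s: float) -> float:
    return rate_hz * event_mb * live_s / 1e9     # MB -> PB

# assumed (rate in Hz, event size in MB, live seconds per year)
scenarios = {
    "ATLAS/CMS-like, Run2": (1_000, 1.0, 5e6),
    "ATLAS/CMS-like, Run4": (7_500, 2.0, 5e6),
    "LHCb-like, Run3":      (20_000, 0.1, 5e6),
}
for name, (rate, size, live) in scenarios.items():
    print(f"{name}: ~{raw_pb_per_year(rate, size, live):.0f} PB RAW/year")
# -> a few PB/year today, growing by more than an order of magnitude towards
#    Run4, which is the inflation point referred to on the slide.
```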

9 CPU evolution. (Slide plot: CPU requirement evolution.)

10 Using network resources more efficiently. The LHC experiments are recognized to use their computing resources very well, with the T0, T1s and T2s used at 90% or better. The remaining capacity is difficult to squeeze out, due to MONARC limitations. Some examples: the T0 can be free at times, but it can be difficult to feed it with analysis jobs, since the input data is not available at CERN; the T2s could reprocess data during the Christmas break, but they have no RAW data. We can flatten MONARC and improve the network: all T1s exceed 10 Gbps and are planning for 40+ Gbps, and all the major T2s have at least 10 Gbps connections.

11 Storage Federations. The LHC experiments are deploying federated storage infrastructures based on the xrootd or http protocols. They provide new access modes and redundancy: jobs access data on shared storage resources via the WAN. This relaxes CPU-data locality and opens up regional storage models, moving from "jobs-go-to-data" to "jobs-go-as-close-as-possible-to-data", with failover capability for local storage problems and dynamic data caching based on access patterns. It is also a data solution for computing sites without storage (opportunistic, cloud, Tier3): disk can be concentrated in large sites only, reducing the operational load at sites and lowering the disk storage demands.
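
A minimal sketch of what federated access looks like from a job, using PyROOT over xrootd (assuming a ROOT installation with xrootd support); the redirector hostname, file path and tree name below are placeholders, not real endpoints.

```python
# Sketch of WAN data access through an xrootd federation redirector with PyROOT.
# Hostname, path and tree name are hypothetical placeholders.
import ROOT

url = "root://federation-redirector.example.org//store/data/Run2012/events.root"
f = ROOT.TFile.Open(url)            # the redirector locates a replica transparently
if f and not f.IsZombie():
    tree = f.Get("Events")          # hypothetical tree name
    print("entries:", tree.GetEntries() if tree else "tree not found")
    f.Close()
else:
    print("remote open failed; a job could fail over to another replica here")
```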

12 Using ICT resources more efficiently: analysis trains. Instead of running N analysis jobs on a single input file, N being the number of ongoing analyses, run a single, centrally managed job which executes N analysis modules (ALICE already does this; see the sketch below). What do you gain? I/O can be provisioned and planned; there are fewer jobs in the system, which is easier to manage; analysis trains are run by operators, not physicists; all the overheads (startup time, access to conditions, ...) are also divided by N; and the train gets many more resources than single user jobs, so it can be much faster. What do you lose? Well, you can miss the train, and when is the next one? In general there is less control by the users, it is not guaranteed that a single train is OK for everyone, and it is not optimal for late analysis code changes or last-minute emergencies (i.e. conference panic).
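
A toy sketch of the train idea (not any experiment's actual framework): one job reads each event once and hands it to N independent analysis "wagons". The module names and event source are illustrative placeholders.

```python
# Toy analysis train: one pass over the input, N analysis modules per event.
from typing import Callable, Iterable

class AnalysisModule:
    def __init__(self, name: str, process: Callable[[dict], None]):
        self.name = name
        self.process = process        # user-provided per-event code

def run_train(events: Iterable[dict], wagons: list[AnalysisModule]) -> None:
    for event in events:              # the input is read exactly once
        for wagon in wagons:          # every analysis sees every event
            wagon.process(event)

# toy usage: two "wagons" sharing a single pass over the data
wagons = [
    AnalysisModule("dimuon_mass", lambda ev: None),    # placeholder analyses
    AnalysisModule("jet_spectrum", lambda ev: None),
]
run_train(({"event_number": i} for i in range(1000)), wagons)
```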

13 Grab more resources (for free...). We DO have other resources outside the TierX hierarchies: the filter (HLT) farm is a specialized system, difficult to turn into an offline-capable system on short notice, and the HLT is NOT used at all for what it is paid for during ~3 months a year (winter shutdown), 2 weeks every couple of months (machine development), and roughly 50% of the run time (typical interfill/total ratio in Run1). More generally, the issue is how to use a big resource that you cannot control (you do not decide which OS, which network configuration, ...) and that comes and goes fast (hopefully you did not pay for it): opportunistic resource usage.

14 Opportunistic usage. While we are very good at using our offline resources, the same cannot be said for other (scientific) fields. HPC centers are often keen to offer cycles to HEP (not CINECA, currently); Google/Amazon are interested in experimenting with our workflows (hoping for future contracts); some commercial entities seem to have spare cycles to offer (reward = PR). Clouds are used to exploit them. Free opportunistic cloud resources: HLT farms accessible through a cloud interface during shutdowns or LHC inter-fills (LHCb since early 2013, ~20% of their resources in 2013), and academic facilities. Cheap (?) opportunistic cloud resources: commercial cloud infrastructure (Amazon EC2, Google), a good deal but under restrictive conditions. PROs: they can substantially increase the available resources (2015 estimates are at the 10% level, excluding HLT), and unlike the GRID you do not have to pay for the development of the cloud. CONs: you need an agile DM/WM system which tolerates high failure rates.

15 Improve the algorithms. In the pre-Run1 phase the focus was on having running software; since then, a large effort has been spent on optimizing the critical parts, with huge success (2x gains per year were not uncommon). This will be much more difficult in the future: the low-hanging fruit is gone. (ATLAS example shown on the slide.)

16 4. Change technology (CPUs to GPUs, ARM, ...): to get the most performance per $, we need to stay at the top of the commodity wave.

17 Computer programming evolution. We need(ed) to gain expertise in multicore programming (the Concurrency Forum is evolving into the HEP SW Collaboration next month; let's see what the proposal is) and in GPU programming, for time-critical tasks like tracking at the HLT (ALICE already does this, ATLAS has shown numbers). PROs: more performance per $ surely helps to fill the gap. CONs: we already suffered through the Fortran-to-C++ migration, and this is much more difficult; we need years of preparation and good software frameworks.
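
A toy illustration of event-level parallelism with a worker pool. This is only the general pattern of distributing independent events over cores; the real experiment frameworks are multi-threaded C++, and the workload below is a stand-in.

```python
# Toy event-level parallelism: independent events distributed over CPU cores.
from multiprocessing import Pool

def reconstruct(event_id: int) -> float:
    # stand-in for per-event reconstruction work
    return sum((event_id * k) % 97 for k in range(10_000)) / 10_000.0

if __name__ == "__main__":
    event_ids = range(1_000)
    with Pool() as pool:                        # one worker per available core
        results = pool.map(reconstruct, event_ids, chunksize=50)
    print(f"reconstructed {len(results)} events")
```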

18 Evolution of infrastructures. Technological evolution: the trend is towards a concentration of activities and a "regionalization" of the sites, with local activities concentrated in central farms and small diskless sites (e.g. Tier3s) concentrating their storage in a few big sites, reducing manpower needs and storage costs. Distributed sites: sharing of the large sites' computing infrastructures with non-HEP and/or non-scientific activities to ensure self-sustainability (in INFN, ReCaS is a good example of this). Moving progressively from Grid computing to Cloud computing: the load moves from the sites to central services, with all the experiment software delivered by central services to the sites, server configurations supplied as images directly by central experiment support, and very few services still run at the level of the site. This is not a solution for everything (so far at least): it does not yet fully address the needs of I/O-intensive tasks from the storage point of view. Governance: coordination of the computing and software activities, coordination of the activities for participation in H2020 calls, and, in INFN, interaction with all the Italian institutions involved in computing: GARR (CSD department), INAF, INGV, CNR, ...

19 Conclusions. The LHC computing infrastructure based on the GRID paradigm completely satisfied the needs of the experiments in Run1. The GRID will remain the baseline model for Run2; meanwhile, new software and technologies are helping us evolve towards a more cost-effective and dynamic model. Run3 does not look critical for ATLAS/CMS, but it is a major change for ALICE/LHCb, whose computing structures will be deeply revised. Run4 is more demanding for ATLAS and CMS, but it is still too far away to have a clear idea of what will really happen. Many optimizations and infrastructure evolutions are going on in all the experiments, in order to adopt industry standards more closely, optimize the usage of resources and fully exploit all the computing facilities we are able to access.

20 Backup slides

21 Moore and Friends. Moore's law: "the number of transistors on integrated circuits doubles approximately every two years"; usually this translates to "every two years, for the same $$, you get a computer twice as fast". Kryder's law: "the capacity of hard drives doubles approximately every two years". Butters' law of photonics: "the amount of data coming out of an optical fiber doubles every nine months". Nielsen's law: "bandwidth available to users increases by 50% every year".

22 More recent (worse? more realistic?) estimates, B. Panzer / CERN (2013): CPU +25%/y, storage +20%/y (including reasonably timed resource retirement). This means roughly a doubling every 3-4 years: over 3 years (2013-2015) a factor 2, over 7 years (2013-2020) a factor 5, over 13 years (2013-2026) a factor 10-15. (Slide plot: projected CPU and disk growth, with a "we are here" marker at 2013.)
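
A quick check of the compound-growth factors quoted above, assuming a fixed budget and yearly price/performance gains of +25% (CPU) and +20% (disk):

```python
# Compound growth at a fixed budget: (1 + yearly gain) ** years.
def growth_factor(rate_per_year: float, years: int) -> float:
    return (1.0 + rate_per_year) ** years

for years in (3, 7, 13):                       # 2013 -> 2015 / 2020 / 2026 horizons
    cpu = growth_factor(0.25, years)
    disk = growth_factor(0.20, years)
    print(f"{years:2d} years: CPU x{cpu:4.1f}, disk x{disk:4.1f}")
# -> roughly x2 after 3 years, x4-5 after 7, and x11-18 after 13, of the same
#    order as the factors 2 / 5 / 10-15 quoted on the slide.
```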

