Slide 1: Computing at the HL-LHC
Predrag Buncic, on behalf of the Trigger/DAQ/Offline/Computing Preparatory Group
ECFA Workshop, Aix-Les-Bains, October 3, 2013
- ALICE: Pierre Vande Vyvre, Thorsten Kollegger, Predrag Buncic
- ATLAS: David Rousseau, Benedetto Gorini, Nikos Konstantinidis
- CMS: Wesley Smith, Christoph Schwick, Ian Fisk, Peter Elmer
- LHCb: Renaud Legac, Niko Neufeld

Slide 2: HL-LHC
[Timeline: Run 1, Run 2, Run 3 (ALICE + LHCb upgrades), Run 4 (ATLAS + CMS upgrades)]
- CPU needs (per event) will grow with track multiplicity (pileup)
- Storage needs are proportional to the accumulated luminosity
Outline:
1. Resource estimates
2. The probable evolution of distributed computing technologies

Slide 3: ALICE & LHCb, Run 3 (peak output to storage)
- LHCb: 40 MHz → 5-40 MHz → 20 kHz (0.1 MB/event) → 2 GB/s to storage
- ALICE: 50 kHz (1.5 MB/event) → 75 GB/s → online reconstruction + compression → storage
(Peak output is event rate times event size; the arithmetic is checked below.)
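As a quick consistency check of the numbers on this slide (nothing assumed beyond them):

\[
  \text{throughput} = R_{\text{event}} \times \langle S_{\text{event}} \rangle:\qquad
  \text{LHCb: } 20\,\text{kHz} \times 0.1\,\text{MB} = 2\,\text{GB/s},\qquad
  \text{ALICE: } 50\,\text{kHz} \times 1.5\,\text{MB} = 75\,\text{GB/s}.
\]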

Slide 4: ATLAS & CMS, Run 4 (peak output to storage)
- ATLAS: Level 1 → HLT → 5-10 kHz (2 MB/event) → storage
- CMS: Level 1 → HLT → 10 kHz (4 MB/event) → 40 GB/s → storage
(The rate-times-size arithmetic is checked below.)
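The same rate-times-size arithmetic applied to the Run 4 numbers above; the ATLAS throughput is not legible in the transcript, so the figure below is simply derived from the quoted rate and event size:

\[
  \text{ATLAS: } (5\text{--}10)\,\text{kHz} \times 2\,\text{MB} = 10\text{--}20\,\text{GB/s},\qquad
  \text{CMS: } 10\,\text{kHz} \times 4\,\text{MB} = 40\,\text{GB/s}.
\]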

Slide 5: Big Data
- "Data that exceeds the boundaries and sizes of normal processing capabilities, forcing you to take a non-traditional approach"
[Chart: data size vs. access rate, with the region where standard data processing is applicable separated from the "Big Data" region]

Slide 6: How do we score today?
[Chart: Run 1 data volume in PB, RAW + derived]

Slide 7: Data: Outlook for HL-LHC
- Very rough estimate of new RAW data per year of running, using a simple extrapolation of the current data volume scaled by the output rates (a back-of-the-envelope sketch follows below)
- Still to be added: derived data (ESD, AOD), simulation, user data…
[Chart: estimated RAW data per year, in PB]
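A minimal Python sketch of the kind of extrapolation this slide describes: RAW volume per year ≈ output rate × event size × live seconds. The rate and event-size pairs are taken from the earlier output-rate slides; the 1e7 seconds of data taking per year is an assumption made here for illustration, not a number from the slides.

    # Back-of-the-envelope RAW data volume per year (PB).
    def raw_per_year_pb(rate_hz, event_size_mb, live_seconds=1e7):
        # live_seconds=1e7 is an assumed yearly live time, for illustration only
        return rate_hz * event_size_mb * live_seconds / 1e9  # MB -> PB

    print("LHCb, Run 3 :", raw_per_year_pb(20e3, 0.1), "PB")  # 20 kHz x 0.1 MB/event -> ~20 PB
    print("CMS, Run 4  :", raw_per_year_pb(10e3, 4.0), "PB")  # 10 kHz x 4 MB/event  -> ~400 PB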

Slide 8: Digital Universe Expansion
[Figure: growth of the global "digital universe"]

Slide 9: Data storage issues
- Our data problems may still look small on the scale of the storage needs of the internet giants
- Business, video, music, smartphones and digital cameras generate more and more need for storage
- The cost of storage will probably continue to go down, but…
  - Commodity high-capacity disks may start to look more like tapes, optimized for multimedia storage and sequential access
  - They will need to be combined with flash-memory disks for fast random access
  - The residual cost of disk servers will remain
- While we might be able to write all this data, how long will it take to read it back? There is a need for sophisticated parallel I/O and processing (a minimal illustration follows below)
- We have to store this amount of data every year, and for many years to come (Long Term Data Preservation)
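As a small illustration of the parallel-I/O point above, a hedged Python sketch that reads many files concurrently instead of as one sequential stream; the file paths are placeholders and do not refer to any real dataset.

    from concurrent.futures import ThreadPoolExecutor

    def read_bytes(path):
        # stand-in for reading one chunk/file of RAW data
        with open(path, "rb") as f:
            return len(f.read())

    paths = [f"/data/run/chunk_{i:04d}.raw" for i in range(16)]  # hypothetical files
    with ThreadPoolExecutor(max_workers=8) as pool:
        total = sum(pool.map(read_bytes, paths))  # up to 8 reads in flight at a time
    print(total, "bytes read back in parallel")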

Slide 10: Exabyte Scale
- We are heading towards the Exabyte scale

Slide 11: CPU: Online requirements
- ALICE & LHCb, Runs 3 and 4: HLT farms for online reconstruction/compression or event rejection, with the aim of reducing the data volume to manageable levels
  - Event rate x 100, data volume x 10
  - ALICE estimate: 200k of today's core equivalents (2M HEP-SPEC-06)
  - LHCb estimate: 880k of today's core equivalents (8M HEP-SPEC-06)
  - Looking into alternative platforms to optimize performance and minimize cost: GPUs (1 GPU = 3 CPUs for the ALICE online tracking use case), ARM, Atom…
- ATLAS & CMS, Run 4: trigger rate x 10; CMS 4 MB/event, ATLAS 2 MB/event
  - CMS: assuming linear scaling with pileup, a total factor of 50 increase in HLT power (10M HEP-SPEC-06) would be needed wrt. today's farm
(The implied HEP-SPEC-06 per core is worked out below.)
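The core-equivalent and HEP-SPEC-06 figures quoted above imply a per-core benchmark value of roughly:

\[
  \text{ALICE: } \frac{2\times10^{6}\ \text{HS06}}{2\times10^{5}\ \text{cores}} \approx 10\ \text{HS06/core},\qquad
  \text{LHCb: } \frac{8\times10^{6}\ \text{HS06}}{8.8\times10^{5}\ \text{cores}} \approx 9\ \text{HS06/core}.
\]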

Slide 12: CPU: Online + Offline
- Very rough estimate of the new CPU requirements for online and offline processing per year of data taking, using a simple extrapolation of current requirements scaled by the number of events
- Little headroom left; we must work on improving the performance (see the growth sketch below)
[Chart: estimated CPU needs in MHS06, split into online and Grid, compared with the Moore's-law limit and the historical resource growth of 25%/year; "room for improvement" marked]
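A minimal sketch (not from the slides) of what the "historical growth of 25%/year" line on the chart delivers over time, to make the "little headroom" point concrete:

    # Capacity multiplier from a constant 25%/year growth of resources.
    growth_per_year = 1.25
    for years in (5, 10):
        print(f"after {years:2d} years: x{growth_per_year ** years:.1f} capacity")
    # ~x3 after 5 years and ~x9 after 10 years; the extrapolated HL-LHC CPU needs
    # grow faster than this, hence the need to improve the software performance.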

Slide 13: How to improve the performance?
- Levels of parallelism in today's hardware: clock frequency, vectors, instruction pipelining, instruction-level parallelism (ILP), hardware threading, multi-core, multi-socket, multi-node
- Running independent jobs per core (as we do now) is the optimal solution for High Throughput Computing applications (a minimal sketch of this model follows below)
- The slide groups these levels by expected benefit: "very little gain to be expected and no action to be taken", "gain in memory footprint and time-to-finish but not in throughput", and "potential gain in throughput and in time-to-finish"; part is covered in the next talk, the rest in this talk
- Improving the algorithms is the only way to reclaim factors in performance!
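The "one independent job per core" throughput model mentioned above, expressed as a small Python sketch; the workload function is a hypothetical stand-in for an independent reconstruction or simulation job, not any experiment's actual code.

    import os
    from multiprocessing import Pool

    def independent_job(job_id):
        # placeholder CPU-bound work; jobs do not communicate with each other
        return sum(i * i for i in range(200_000)) % (job_id + 7)

    if __name__ == "__main__":
        n_cores = os.cpu_count() or 1
        with Pool(processes=n_cores) as pool:            # one worker process per core
            results = pool.map(independent_job, range(n_cores))
        print(f"{len(results)} independent jobs finished on {n_cores} cores")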

Slide 14: WLCG Collaboration
- Distributed infrastructure of 150 computing centers in 40 countries
- 300+k CPU cores (~2M HEP-SPEC-06)
- The biggest site has ~50k CPU cores; 12 T2 sites have 2-30k CPU cores each
- Distributed data, services and operations infrastructure

Slide 15: Distributed computing today
- Running millions of jobs and processing hundreds of PB of data
- Efficiency (CPU/wall time) ranges from 70% (analysis) to 85% (organized activities: reconstruction, simulation)
- 60-70% of all CPU time on the Grid is spent running simulation

Slide 16: Can it scale?
- Each experiment today runs its own Grid overlay on top of the shared WLCG infrastructure
- In all cases the underlying distributed computing architecture is similar, based on the "pilot" job model (e.g. glideinWMS; a minimal sketch of the pattern follows below)
- Can we scale these systems to the expected levels?
- Lots of commonality and potential for consolidation
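A deliberately simplified Python sketch of the "pilot" pattern referred to above: a small placeholder job starts on a worker node and then pulls real payloads from a central task queue until none are left. This only illustrates the shape of the idea; the endpoint URL and JSON format are invented, and it is not the actual glideinWMS/PanDA/DIRAC/AliEn implementation.

    import json
    import subprocess
    import urllib.request

    TASK_QUEUE = "https://taskqueue.example.org"  # hypothetical central service

    def next_payload():
        # ask the central queue for the next job description, e.g.
        # {"id": 42, "cmd": ["simulate", "--events", "100"]}, or null when drained
        with urllib.request.urlopen(f"{TASK_QUEUE}/next") as resp:
            return json.load(resp)

    def run_pilot():
        while True:
            job = next_payload()
            if not job:
                break                                   # queue drained: the pilot simply exits
            rc = subprocess.run(job["cmd"]).returncode  # run the real payload
            urllib.request.urlopen(f"{TASK_QUEUE}/report/{job['id']}?rc={rc}")

    if __name__ == "__main__":
        run_pilot()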

Slide 17: Grid Tier model
- Network capabilities and data-access technologies have significantly improved our ability to use resources independent of location

Slide 18: Global Data Federation
- FAX – Federated ATLAS Xrootd; AAA – Any Data, Any Time, Anywhere (CMS); AliEn (ALICE)
- Figure: B. Bockelman, CHEP 2012
(An illustrative remote-read example follows below.)
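A hedged example of what federated access looks like from the user side: opening a file by its logical name through an XRootD redirector with PyROOT, without knowing where the data physically lives. It assumes a ROOT installation with XRootD support; the redirector host and file path are placeholders, not a real dataset.

    import ROOT

    url = "root://federation-redirector.example//store/data/example.root"  # hypothetical
    f = ROOT.TFile.Open(url)          # redirected to whichever site actually holds the file
    if f and not f.IsZombie():
        f.ls()                        # list the file contents, streamed over the WAN if remote
        f.Close()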

Slide 19: Virtualization
- Virtualization provides better system utilization by putting all those many cores to work, resulting in power & cost savings

Slide 20: Software integration problem
- Traditional model: horizontal layers (hardware, OS, libraries, tools, databases, application), independently developed, maintained by different groups, each with a different lifecycle
- The application is deployed on top of the stack and breaks if any layer changes
- It needs to be re-certified every time something changes, which results in a deployment and support nightmare
- Upgrades are difficult; switching to new OS versions is even worse

Slide 21: Application-driven approach
- 1. Start by analysing the application requirements and dependencies
- 2. Add the required tools and libraries
- Use virtualization to (1) build a minimal OS and (2) bundle all of this into a Virtual Machine image
- This separates the lifecycles of the application and of the underlying computing infrastructure: decoupling Apps and Ops

Slide 22: Cloud: Putting it all together
- IaaS = Infrastructure as a Service, accessed through an API
- Server consolidation using virtualization improves resource usage
- Public, on-demand, pay-as-you-go infrastructure with goals and capabilities similar to those of academic grids (an illustrative API call follows below)
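To make "public, on demand, pay as you go" concrete, an illustrative sketch of requesting virtual machines through a public IaaS API (here Amazon EC2 via boto3). The image id and instance type are placeholders, and credentials/region are assumed to come from the normal boto3 configuration.

    import boto3

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(
        ImageId="ami-00000000",   # hypothetical machine image (e.g. a CernVM-style image)
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=10,              # scale out on demand, pay per instance-hour
    )
    print([inst["InstanceId"] for inst in response["Instances"]])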

Slide 23: Public Cloud
- At present: 9 large sites/zones, up to ~2M CPU cores per site, ~4M in total
- 10x more cores on 1/10 of the sites compared to our Grid
- 100x more users (500,000)

Slide 24: Could we make our own Cloud?
- Open-source Cloud components:
  - Open-source Cloud middleware
  - VM building tools and infrastructure: CernVM + CernVM-FS, BoxGrinder…
  - Common API
- From the Grid heritage:
  - Common authentication/authorization
  - High-performance global data federation/storage
  - Scheduled file transfer
  - Experiment distributed computing frameworks already adapted to Clouds
- To do:
  - Unified access and cross-Cloud scheduling
  - Centralize key services at a few large centers
  - Reduce complexity in terms of the number of sites

Slide 25: Grid on top of Clouds
[Diagram (vision): AliEn and CAF-on-demand running across a private CERN cloud hosting the O2 online-offline facility (HLT, DAQ), public cloud(s), and T1/T2/T3 sites; the roles shown include reconstruction, calibration, analysis, simulation, data caches and custodial data stores]

Slide 26: And if this is not enough…
- A leadership-class supercomputer (Titan at Oak Ridge): 18,688 nodes, 299,008 processor cores in total, one GPU per node
- And there are spare CPU cycles available…
- Ideal for event-generator/simulation-type workloads (60-70% of all CPU cycles used by the experiments)
- Simulation frameworks must be adapted to use such resources efficiently where available

Slide 27: Conclusions
- The resources needed for computing at the HL-LHC will be large, but not unprecedented
- The data volume is going to grow dramatically
- The projected CPU needs are reaching the Moore's-law ceiling; Grid growth of 25% per year is by far not sufficient
- Speeding up the simulation is very important
- Parallelization at all levels (application, I/O) is needed to improve performance; this requires re-engineering of the experiment software
- New technologies such as clouds and virtualization may help to reduce the complexity and to fully utilize the available resources