
1 Software Performance: Understanding Performance
Markus Schulz, CERN-IT-DI-LCG

2 Will HL-LHC face a computational problem?
More data: luminosity, detector upgrades, trigger strategies
Higher complexity: pile-up
Estimates (CMS): 65 – 200 times larger needs than for LHC Run2
To be taken with a grain of salt (like all 10-year estimates in computing)
LHCb and ALICE will face massive increases already for Run3!

3 What do Moore's Law and friends offer?
Moore's 1st law: number of components per chip doubles every 24 months
House's law: performance doubles every 18 months (smaller structures allow higher frequency)
Kryder's law: disk storage density doubles every 18 months
Butters' Law of Photonics: data rate of a fibre doubles every 9 months
Pollack's Rule: architecture gain ~ sqrt(#transistors), in addition to frequency etc.
Proebsting's Law: compilers double code efficiency every 18 years
......
All this would give over 10 years a factor of "only" ... (the compounding is sketched below)
Ignoring market, usability of new architectures, ....
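The factor the slide arrives at is not in the transcript, but how the doubling periods listed above compound over ten years can be illustrated with a minimal Python sketch (numbers below come only from those doubling periods):

    import math

    YEARS = 10

    def compound(doubling_months, years=YEARS):
        # growth factor for a quantity that doubles every `doubling_months`
        return 2 ** (years * 12 / doubling_months)

    print(f"Moore (24 months):  x{compound(24):.0f}")   # components per chip
    print(f"House (18 months):  x{compound(18):.0f}")   # performance
    print(f"Kryder (18 months): x{compound(18):.0f}")   # disk density
    # Pollack: architecture gain is roughly the square root of the transistor-count increase
    print(f"Pollack gain from Moore alone: x{math.sqrt(compound(24)):.1f}")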

4 And it is unlikely to happen…..
Moore's 2nd (Rock's) law: the cost of semiconductor fabs grows exponentially; current generation: ~16 B$ per fab
2003: 25 state-of-the-art producers; 2015: 4, so not much incentive for change, or competition
Same pattern for disks, just worse...
Market shifts to mobile devices
General slowdown: Intel moved away from Tick-Tock, now two architecture changes for one new hardware generation
The International Technology Roadmap for Semiconductors (industry oracle) is expected to adjust its forecasts

5 For the times they are a-changin’ already and not in our favor…..
More cores is all we can expect
A challenge for our software and workflows

6 Putting it all together: costs per HS06 (Bernd Panzer, WLCG Workshop at CHEP 2015)
Gap: 8 (optimistic) – 25 (conservative); how these numbers arise is sketched below
Optimistic: factor 7.5; conservative: factor 2.X
This has to be compensated by improvements in efficiency (software, workflows, infrastructure)
We have 10 years
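Illustrative arithmetic only, combining the numbers quoted on slides 2 and 6; the "2.X" conservative factor is assumed to be 2.5 here:

    needs = 65          # x, low end of the 65-200x HL-LHC estimates (slide 2)
    optimistic = 7.5    # x capacity-per-cost improvement over 10 years (this slide)
    conservative = 2.5  # "factor 2.X" on the slide; 2.5 is an assumption

    print(f"optimistic gap:   x{needs / optimistic:.1f}")    # close to the slide's 8
    print(f"conservative gap: x{needs / conservative:.1f}")  # close to the slide's 25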

7 How are we progressing?
The WLCG community is very good at gathering data related to efficiency: monitoring, profiling, ..... (XX TBytes)
But not so good at combining and understanding the data; the Data Analytics WG made some progress
We are not so good at predicting performance in changing environments: Wigner, OpenStack, AthenaMP, ....
No good quantitative understanding of the impact of changes (memory, cores, SSDs, ...)
Many (sometimes isolated) parallel activities, especially between software and infrastructure/operations
Serious investments by ALICE and LHCb in GPUs and accelerators; some investment in adapting to HPC
Common expertise and tools: igprof, VTune, Coverity, FOM
Maintaining expertise isn't easy: optimisation is often done by transient staff (fellows, students), and expertise is often built by OpenLab projects, which end after a while ....

8 How are we progressing? Experiments have massively improved their code
Only a few low-hanging fruits are left; are we reaching the region of diminishing returns? Is a paradigm shift needed?
Good communication between experiments and tool providers, in the scope of:
the HEP Software Foundation (the workshop had a session on performance; agreed that joint activities are needed)
the Software Technology Forum (formerly known as the Concurrency Forum), GaudiHive
the "Connecting The Dots" workshop
experiment operations and infrastructure (CMS)

9 Understanding Performance
Team in IT, part of the WLCG team (IT-DI-LCG), working closely with the HSF and the Software Technology Forum
Started January 2016; a long-term activity
Focus on linking activities in the community: software and infrastructure
Aggregation of knowledge about existing tools and data
Providing tools to understand, measure and improve performance; analysing software and workflows
Members: Nathalie Rauschmayr, Markus Schulz, Andrea Sciaba, David Smith, Andrea Valassi

10 High Level Goals Agreeing on a common metric for efficiency
Based on throughput and cost, not time/event (a sketch of such a metric follows below)
Workload-based measurements, comparing testbeds with production environments
A high-level model of workloads and infrastructures, to help planning and answer "What if?" questions
A common approach between experiments: tools, experience, ...
Already happening: tracking software, GaudiHive, IgProf, ....
Moving efficiency into the spotlight
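A minimal sketch of what a throughput-and-cost metric could look like; the function name and all numbers are illustrative placeholders, not an agreed definition:

    def events_per_chf(events, wall_time_s, n_cores, chf_per_core_hour):
        # throughput per unit of money instead of wall time per event
        cost = n_cores * (wall_time_s / 3600.0) * chf_per_core_hour
        return events / cost

    # hypothetical job: 10k events on 8 cores for 2 hours at 0.02 CHF per core-hour
    print(f"{events_per_chf(10_000, 2 * 3600, 8, 0.02):.0f} events/CHF")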

11 Current Activities Developing tools and procedures
by working on concrete workflows together with experiment experts
Trying to make sense of experiment workflow records
Getting an overview of existing activities (within IT and WLCG, and within the community; a session at the last HSF workshop) and documenting them
Linking activities, organising workshops etc.

12 Tool development Tools and analysis of memory usage and dynamics
FOM tools (HSF), with Sami Kama (ATLAS): memory use evaluation, allocation lifetime measurements, ... (the kind of measurement is sketched below)
X32-ABI re-evaluation, applied to several experiment workflows
Studies on CPU hardware counters
Evaluation of kernel tracing tools: SystemTap, ......
Studies on Feedback-Directed Optimization: job profiles helping compilers auto-optimize code (AutoFDO); currently looking at GEANT and experiment reconstruction code
Building expertise: Coverity (project with LHCb), VTune
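The FOM tools instrument C++ workloads; purely to illustrate the kind of question they answer (peak memory, where allocations come from), a small Python sketch using tracemalloc, which is not the FOM implementation:

    import tracemalloc

    tracemalloc.start()

    buffers = [bytearray(1024 * 1024) for _ in range(50)]  # stand-in for event data
    del buffers[:25]                                       # half released early

    current, peak = tracemalloc.get_traced_memory()
    print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")

    for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
        print(stat)  # top allocation sites by size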

13 Workflow Analysis
Quantitative (detailed) analysis of ATLAS jobs running on the HLT: I/O patterns and constraints
Workload behaviour over time: CPU, I/O, memory, swap, ...... (a sampling sketch follows below)
Tuning of OS and VMs; comparing with testbeds
Next step: a guide and tools for assessment and tuning
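A rough sketch of the kind of sampling behind "workload behaviour over time"; it assumes a Linux /proc layout and a process name without spaces, and the actual study used more complete tooling:

    import time

    def sample(pid):
        with open(f"/proc/{pid}/stat") as f:
            fields = f.read().split()                      # assumes no spaces in the comm field
        cpu_jiffies = int(fields[13]) + int(fields[14])    # utime + stime
        rss_kb = 0
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    rss_kb = int(line.split()[1])
        read_b = write_b = 0
        with open(f"/proc/{pid}/io") as f:                 # needs same user or root
            for line in f:
                key, value = line.split(":")
                if key == "read_bytes":
                    read_b = int(value)
                elif key == "write_bytes":
                    write_b = int(value)
        return cpu_jiffies, rss_kb, read_b, write_b

    def profile(pid, interval_s=10, samples=6):
        for _ in range(samples):
            print(time.time(), *sample(pid))
            time.sleep(interval_s)

    # profile(12345)   # PID of the job under study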

14 Workflow Analysis: Analysing ATLAS PanDA logs
Differences between sites, differences between job types, ....
Some overlap with the Analytics Working Group; in close exchange
Goal: a tool to understand site differences and to identify additional information needed (an aggregation sketch follows below)
Focus: approach and tools, not the specific use case
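A sketch of the per-site, per-job-type aggregation; the field names (site, jobtype, events, wallclock_s) and the CSV input are hypothetical stand-ins for what a PanDA job record provides:

    import csv
    from collections import defaultdict

    def events_per_hour(path):
        totals = defaultdict(lambda: [0, 0.0])      # (site, jobtype) -> [events, hours]
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = (row["site"], row["jobtype"])
                totals[key][0] += int(row["events"])
                totals[key][1] += float(row["wallclock_s"]) / 3600.0
        return {k: ev / h for k, (ev, h) in totals.items() if h > 0}

    # for (site, jobtype), rate in sorted(events_per_hour("jobs.csv").items()):
    #     print(f"{site:20s} {jobtype:15s} {rate:8.1f} events/hour")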

15 Contributions to software development
Persistency framework optimisation: memory usage (very early)
Contributions to ROOT I/O

16 Infrastructure related
Computing on storage servers: many WLCG storage servers have low CPU utilisation
Proof-of-concept implementation: EOS (4 nodes), BOINC client, Condor client, external I/O load generator (a toy generator is sketched below)
Currently being evaluated: interference between compute and I/O services
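The actual load generator used in the proof of concept is not described in the transcript; a toy sketch of the idea, writing and re-reading fixed-size files to put a streaming load on the node (page-cache effects ignored), with a hypothetical target directory:

    import os
    import time

    def io_load(target_dir, file_mb=256, files=4, passes=3):
        block = os.urandom(1024 * 1024)              # 1 MB of incompressible data
        for p in range(passes):
            t0, moved = time.time(), 0
            for i in range(files):
                path = os.path.join(target_dir, f"load_{i}.bin")
                with open(path, "wb") as f:          # write phase
                    for _ in range(file_mb):
                        f.write(block)
                with open(path, "rb") as f:          # read-back phase
                    while f.read(8 * 1024 * 1024):
                        pass
                moved += 2 * file_mb                 # MB written + MB read back
            print(f"pass {p}: {moved / (time.time() - t0):.0f} MB/s aggregate")

    # io_load("/srv/scratch")   # hypothetical directory on the storage node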

17 Next Steps: Continue with current activities
Expand to more experiments and workflows
A "performance analysis on demand" service? And/or a training programme (together with experiment experts)
Within the HSF/Software Technology Forum: organize a workshop on performance, identify joint projects, define a common roadmap
Build a first version of a cost/performance model for a small number of workloads, including verification (a miniature sketch follows below)
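What a first cost/performance model could look like, in miniature; all names and numbers are placeholders, since the real model remains to be built:

    RESOURCE_COST = {"cpu_core_hour": 0.02, "disk_tb_year": 30.0}   # CHF, illustrative

    WORKLOADS = {  # events per core-hour and TB per million events: placeholder numbers
        "reconstruction": {"events_per_core_hour": 500, "tb_per_mevents": 1.2},
        "simulation":     {"events_per_core_hour": 40,  "tb_per_mevents": 0.3},
    }

    def annual_cost(workload, events_per_year, cpu_speedup=1.0):
        w = WORKLOADS[workload]
        cpu = (events_per_year / (w["events_per_core_hour"] * cpu_speedup)
               * RESOURCE_COST["cpu_core_hour"])
        disk = events_per_year / 1e6 * w["tb_per_mevents"] * RESOURCE_COST["disk_tb_year"]
        return cpu + disk

    # "What if" question: software gets 1.5x faster for the same yearly event count
    print(f"baseline:      {annual_cost('simulation', 5e9) / 1e6:.2f} MCHF/year")
    print(f"1.5x speedup:  {annual_cost('simulation', 5e9, cpu_speedup=1.5) / 1e6:.2f} MCHF/year")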

