HPC computing at CERN - use cases from the engineering and physics communities. Michal HUSEJKO, Ioannis AGTZIDIS (IT/PES/ES)

Presentation transcript:

HPC computing at CERN - use cases from the engineering and physics communities. Michal HUSEJKO, Ioannis AGTZIDIS (IT/PES/ES) 1

Agenda: Introduction – Where we are now; Applications used at CERN requiring HPC infrastructure; Use cases (Engineering) – Ansys Mechanical, Ansys Fluent; Physics HPC applications; Next steps; Q&A 2

Agenda: Introduction – Where we are now; Applications used at CERN requiring HPC infrastructure; Use cases (Engineering) – Ansys Mechanical, Ansys Fluent; Physics HPC applications; Next steps; Q&A 3

Introduction
– Some 95% of our applications are served well by bread-and-butter machines.
– We (CERN IT) have invested heavily in the Agile Infrastructure (AI), including a layered approach to responsibilities, virtualization and a private cloud.
– There are certain applications, traditionally called HPC applications, which have different requirements.
– Even though these applications sail under the common HPC name, they differ from one another and have different requirements.
– These applications need a detailed requirements analysis.
4

Scope of the talk
– We have contacted our user community and started to gather user requirements on a continuous basis.
– We have started a detailed system analysis of our HPC applications to gain knowledge of their behavior.
– In this talk I would like to present the progress and the next steps.
– At a later stage, we will look at how the HPC requirements can fit into the IT infrastructure.
5

HPC applications
Engineering applications:
– Used at CERN in different departments to model and design parts of the LHC machine.
– The IT-PES-ES section supports the user community of these tools.
– Tools used for structural analysis, fluid dynamics, electromagnetics and multiphysics.
– Major commercial tools: Ansys, Fluent, HFSS, Comsol, CST – but also open source: OpenFOAM (fluid dynamics).
Physics simulation applications:
– PH-TH Lattice QCD simulations
– BE LINAC4 plasma simulations
– BE beam simulations (CLIC, LHC, etc.)
– HEP simulation applications for theory and accelerator physics
6

Agenda: Introduction – Where we are now; Applications used at CERN requiring HPC infrastructure; Use cases (Engineering) – Ansys Mechanical, Ansys Fluent; Physics HPC applications; Next steps; Q&A 7

Use case 1: Ansys Mechanical
Where? – LINAC4 Beam Dump System
Who? – Ivo Vicente Leitao, Mechanical Engineer (EN/STI/TCD)
How? – Ansys Mechanical for design modeling and simulations (stress and thermal structural analysis)
8

How does it work?
Ansys Mechanical
– Structural analysis: stress and thermal, steady-state and transient
– Finite Element Method (FEM):
  We have a physical problem defined by differential equations.
  It is impossible to solve them analytically for a complicated structure (problem).
  We divide the problem into subdomains (elements).
  We solve the differential equations numerically for selected points (nodes).
  Then, by means of approximation functions, we project the solution onto the global structure (see the sketch below).
The example has 6.0 million (6M0) mesh nodes
– Compute intensive
– Memory intensive
Use case 1: Ansys Mechanical 9
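The sketch below is a minimal, self-contained illustration of the FEM steps listed above, using a 1D Poisson problem with linear elements. It is not how Ansys Mechanical is implemented; it only shows the general idea of element-by-element assembly, a numerical solve at the nodes, and reconstruction via shape functions.

```python
# Minimal FEM sketch: -u''(x) = f(x) on [0, 1] with u(0) = u(1) = 0,
# discretized with linear elements. Illustration of the idea only.
import numpy as np

n_elem = 10                      # number of elements (subdomains)
n_node = n_elem + 1              # nodes where the solution is computed
h = 1.0 / n_elem                 # element size
f = lambda x: 1.0                # constant load, an arbitrary example

K = np.zeros((n_node, n_node))   # global stiffness matrix
b = np.zeros(n_node)             # global load vector

# Assemble element by element: each element contributes a 2x2 block.
for e in range(n_elem):
    k_local = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    f_local = (h / 2.0) * np.array([f(e * h), f((e + 1) * h)])
    idx = [e, e + 1]
    K[np.ix_(idx, idx)] += k_local
    b[idx] += f_local

# Apply the boundary conditions and solve the linear system numerically,
# which gives the solution at the selected nodes.
u = np.zeros(n_node)
u[1:-1] = np.linalg.solve(K[1:-1, 1:-1], b[1:-1])

# Between nodes the solution is reconstructed with the (linear) shape
# functions, i.e. projected onto the global structure.
print(u)
```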

10 (no transcript text for this slide)

Simulation results
Measurement hardware configuration:
– 2x HP 580 G7 servers (4x E7-8837, 512 GB RAM, 32 cores each), 10 Gb low-latency Ethernet link
Time to obtain a single-cycle 6M0 solution:
– 8 cores -> 63 hours to finish the simulation, 60 GB RAM used during the simulation
– 64 cores -> 17 hours to finish the simulation, 2x 200 GB RAM used during the simulation
– The user is interested in 50 cycles: this would need 130 days on 8 cores, or 31 days on 64 cores
It is impossible to get simulation results for this case in a reasonable time on a standard engineering workstation. (A small speedup calculation from these numbers follows below.)
Use case 1: Ansys Mechanical 11
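As a back-of-the-envelope check, using only the run times quoted on this slide (63 h on 8 cores, 17 h on 64 cores), the speedup and throughput gains look roughly like this:

```python
# Rough speedup/throughput figures derived from the measured times above.
t8, t64 = 63.0, 17.0          # hours per simulation cycle
cores8, cores64 = 8, 64

speedup = t8 / t64                            # ~3.7x faster on 8x the cores
efficiency = speedup / (cores64 / cores8)     # ~46% parallel efficiency
cycles_per_week_8 = 7 * 24 / t8               # ~2.7 cycles/week
cycles_per_week_64 = 7 * 24 / t64             # ~9.9 cycles/week

print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}")
print(f"cycles/week: {cycles_per_week_8:.1f} -> {cycles_per_week_64:.1f}")
```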

Challenges
Why do we care?
– Every day we face users asking us how to speed up some engineering application.
Challenges:
– Problem size and complexity are challenging user workstations in terms of computing power, memory size and file I/O.
– This can be extrapolated to other engineering HPC applications.
How to solve the problem?
– Can we use the current infrastructure to provide a platform for these demanding applications?
– … or do we need something completely new?
– … and if something new, how could this fit into our IT infrastructure?
So, let's have a look at what is happening behind the scenes.
12

Analysis tools
Standard Linux performance monitoring tools used:
– Memory usage: sar
– Memory bandwidth: Intel PCM (Performance Counter Monitor, open source)
– CPU usage: iostat, dstat
– Disk I/O: dstat
– Network traffic monitoring: netstat
Monitoring scripts are started from the same node where the simulation job is started. Collection of the measurement results is done automatically by our tools. (A minimal wrapper sketch follows below.)
13
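A minimal sketch, not our production scripts, of how such monitors can be started alongside a simulation job from the same node. The tool invocations (sar, iostat, dstat) and their options are assumptions about the local installation and may need adapting.

```python
# Start a few standard Linux monitors, run the simulation job, stop monitors.
import subprocess, shlex, sys

job_cmd = sys.argv[1:]                 # e.g. the ANSYS/Fluent launch command
monitors = {
    "sar_mem.log": "sar -r 5",         # memory usage every 5 s
    "iostat.log":  "iostat -x 5",      # CPU usage and disk I/O every 5 s
    "dstat.log":   "dstat 5",          # combined CPU/disk/net overview
}

procs = []
for logfile, cmd in monitors.items():
    out = open(logfile, "w")
    procs.append((subprocess.Popen(shlex.split(cmd), stdout=out, stderr=out), out))

try:
    subprocess.run(job_cmd, check=True)   # run the simulation job itself
finally:
    for p, out in procs:                  # stop the monitors, flush the logs
        p.terminate()
        p.wait()
        out.close()
```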

Multi-core scalability
Measurement info:
– LINAC4 beam dump system, single-cycle simulation
– 2 nodes (quad-socket Westmere, E7-8837, 512 GB), 10 Gb iWARP
Results:
– The Ansys Mechanical simulation scales well beyond a single multi-core box.
– Greatly improved number of jobs/week, i.e. simulation cycles/week.
Next steps: scale to more than two nodes and measure the impact of MPI.
Conclusion:
– Multi-core platforms are needed to finish the simulation in a reasonable time.
Use case 1: Ansys Mechanical 14

Memory requirements
In-core / out-of-core simulations (avoiding costly file I/O):
– In-core = most of the temporary data is stored in RAM (the solver can still write to disk during the simulation).
– Out-of-core = uses files on the file system to store temporary data.
– The preferable mode is in-core, to avoid costly disk I/O accesses, but this requires more RAM and more memory bandwidth.
Ansys Mechanical (and some other engineering applications) has limited scalability:
– It depends heavily on the solver and the user problem.
All commercial engineering applications use some licensing scheme, which can skew the choice of platform.
Conclusion:
– We are investigating whether we can spread the required memory over multiple dual-socket systems, or whether 4-socket systems are necessary for some HPC applications (see the sizing sketch below).
– Certain engineering simulations seem to be limited by memory bandwidth; this also has to be considered when choosing a platform.
Use case 1: Ansys Mechanical 15
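A hypothetical sizing helper for the "spread the memory over several nodes" question. The 400 GB figure simply reuses the 2x 200 GB measured for the 64-core in-core run on the earlier slide; the node configurations and the 90% usable-RAM headroom are illustrative assumptions, not our procurement options.

```python
# How many nodes of a given size keep the working set in-core?
import math

required_ram_gb = 400          # total in-core working set (from the 6M0 case)
usable_fraction = 0.9          # leave headroom for OS, MPI buffers, etc.

for node_ram_gb, label in [(128, "dual-socket, 128 GB"),
                           (256, "dual-socket, 256 GB"),
                           (512, "quad-socket, 512 GB")]:
    nodes = math.ceil(required_ram_gb / (node_ram_gb * usable_fraction))
    print(f"{label}: {nodes} node(s) needed to stay in-core")
```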

Disk I/O impact
Ansys Mechanical – BE CLIC test system
– Two Supermicro servers (dual E5-2650, 128 GB), 10 Gb iWARP back to back.
Disk I/O impact on speedup; two configurations compared:
– Measured with sar and iostat.
– The application spends a lot of time in iowait (a minimal iowait sampler is sketched below).
– Using an SSD instead of an HDD increases jobs/week by almost 100%.
Conclusion:
– We need to investigate more cases to see whether this is a marginal case or something more common.
Use case 1: Ansys Mechanical 16
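A minimal sketch of the kind of iowait measurement referred to above: sample the aggregate CPU counters in /proc/stat (Linux) and report the fraction of CPU time spent waiting for I/O. This is an illustration only, not the scripts actually used for the measurements.

```python
# Periodically print the system-wide iowait fraction from /proc/stat.
import time

def cpu_times():
    with open("/proc/stat") as f:
        fields = f.readline().split()      # "cpu user nice system idle iowait ..."
    return [int(x) for x in fields[1:]]

def iowait_fraction(interval=5.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    delta = [a - b for a, b in zip(after, before)]
    return delta[4] / sum(delta)           # field 5 (index 4) is iowait

if __name__ == "__main__":
    while True:
        print(f"iowait: {iowait_fraction():.1%}")
```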

Agenda: Introduction – Where we are now; Applications used at CERN requiring HPC infrastructure; Use cases (Engineering) – Ansys Mechanical, Ansys Fluent; Physics HPC applications; Next steps; Q&A 17

Use case 2: Fluent
Computational Fluid Dynamics (CFD) application, Fluent (now provided by Ansys)
Beam dump system at the PS Booster:
– Heat is generated inside the dump and must be removed so that the dump does not melt or break because of mechanical stresses.
Extensively parallelized, MPI-based software.
Performance characteristics are similar to other MPI-based software:
– Low latency matters for short messages
– Bandwidth matters for medium and large messages
(A minimal ping-pong sketch illustrating both regimes follows below.)
18
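A minimal MPI ping-pong sketch (assuming an environment with mpi4py and any MPI library) that shows why latency dominates the transfer time of short messages and bandwidth that of large ones. It is a generic microbenchmark, not part of Fluent.

```python
# Run with e.g.: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 100

for size in (8, 1024, 1024 * 1024, 16 * 1024 * 1024):   # message sizes in bytes
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    t = (MPI.Wtime() - t0) / (2 * reps)                  # one-way time per message
    if rank == 0:
        print(f"{size:>9} B: {t*1e6:8.1f} us, {size / t / 1e6:8.1f} MB/s")
```

For 8-byte messages the measured time is essentially the interconnect latency; for 16 MB messages it is essentially size divided by bandwidth.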

Interconnect network latency impact
Ansys Fluent
– CFD “heavy” test case from the CFD group (EN-CV-PJ)
– 2 nodes (quad-socket Westmere, E7-8837, 512 GB), 10 Gb iWARP
Speedup beyond a single node can be diminished by a high-latency interconnect.
– The graph shows good scalability beyond a single box with 10 Gb low-latency Ethernet, and dips in performance when switching to 1 Gb for node-to-node MPI.
Next step: perform MPI statistical analysis (size and type of messages, computation vs. communication).
Use case 2: Fluent 19

Memory bandwidth impact
Ansys Fluent:
– Measured with Intel PCM
– Supermicro Sandy Bridge server (dual E5-2650), … GB/s peak memory bandwidth
Observed peaks of a few seconds demanding 57 GB/s, over a 5 s sampling period. This is very close to the numbers measured with the STREAM synthetic benchmark on this platform.
Memory bandwidth is measured with Intel PCM at the memory controller level.
Next step: check the impact of memory speed on solution time. (A rough STREAM-like bandwidth probe is sketched below.)
Use case 2: Fluent 20
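A rough, STREAM-copy-like probe of sustainable memory bandwidth using NumPy. The real STREAM benchmark is a compiled C/Fortran code; this sketch only illustrates the kind of GB/s figure the PCM peaks were compared against, and the array size is an assumption chosen to be much larger than the caches.

```python
# Stream one large array into another and report effective bandwidth.
import time
import numpy as np

n = 50_000_000                         # ~400 MB per array, far larger than caches
src = np.random.rand(n)
dst = np.empty_like(src)

reps = 10
t0 = time.perf_counter()
for _ in range(reps):
    np.copyto(dst, src)                # copy kernel: read src, write dst
elapsed = (time.perf_counter() - t0) / reps

bytes_moved = 2 * n * 8                # read src + write dst, 8 bytes per double
print(f"copy bandwidth: {bytes_moved / elapsed / 1e9:.1f} GB/s")
```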

Analysis done so far
We have invested our time in building a first generation of tools to monitor different system parameters:
– Multi-core scalability (Ansys Mechanical)
– Memory size requirements (Ansys Mechanical)
– Memory bandwidth requirements (Fluent)
– Interconnect network (Fluent)
– File I/O (Ansys Mechanical)
Redo some parts:
– Westmere 4-socket -> Sandy Bridge 4-socket
Next steps:
– Start performing detailed interconnect monitoring using MPI tracing tools (Intel Trace Analyzer and Collector)
21

Agenda: Introduction – Where we are now; Applications used at CERN requiring HPC infrastructure; Use cases (Engineering) – Ansys Mechanical, Ansys Fluent; Physics HPC applications; Next steps; Q&A 22

Physics HPC applications
PH-TH:
– Lattice QCD simulations
BE LINAC4 plasma simulations:
– Plasma formation in the Linac4 ion source
BE CLIC simulations:
– Preservation of the luminosity over time, under the effects of dynamic imperfections such as vibrations, ground motion and failures of accelerator components
23

Lattice QCD
MPI-based application with inline assembly in the most time-critical parts of the program.
The main objective is to investigate:
– The impact of memory bandwidth on performance
– The impact of the interconnection network on performance (comparison of 10 Gb iWARP and InfiniBand QDR)
24

BE LINAC4 plasma studies
MPI-based application.
Users are requesting a system with 250 GB of RAM for 48 cores.
The main objective is to investigate:
– The scalability of the application beyond 48 cores, in order to spread the memory requirement over more than 48 cores
25

Clusters
To better understand the requirements of CERN physics HPC applications, two clusters have been prepared:
– Investigate scalability
– Investigate the importance of interconnect, memory bandwidth and file I/O
Test configuration:
– 20x Sandy Bridge dual-socket nodes with a 10 Gb iWARP low-latency link
– 16x Sandy Bridge dual-socket nodes with Quad Data Rate (40 Gb/s) InfiniBand
26

Agenda: Introduction – Where we are now; Applications used at CERN requiring HPC infrastructure; Use cases (Engineering) – Ansys Mechanical, Ansys Fluent; Physics HPC applications; Next steps; Q&A 27

Next steps
An activity has started to better understand the requirements of CERN HPC applications.
The standard Linux performance monitoring tools give us a very detailed overview of system behavior for different applications.
The next steps are to:
– Refine our approach and our scripts to work at a higher scale (first target: 20 nodes).
– Gain more knowledge about the impact of the interconnection network on MPI jobs.
28

Thank you Q&A 29