Performance benchmark of LHCb code on state-of-the-art x86 architectures. Daniel Hugo Campora Perez, Niko Neufeld, Rainer Schwemmer. CHEP 2015, Okinawa.

Presentation transcript:

Performance benchmark of LHCb code on state-of-the-art x86 architectures
Daniel Hugo Campora Perez, Niko Neufeld, Rainer Schwemmer
CHEP 2015, Okinawa

Background
The LHCb High Level Trigger (HLT) farm
– 1800 servers
– Approximately x86-compatible cores
– Reduces the detector data stream from GB/s to 1 GB/s
– 100 m underground
– Heterogeneous compute cluster
Farm upgrade
– More computing power is necessary to handle the LHC luminosity upgrade
– Decommissioned the 550 oldest machines (Nehalem-based systems)
– Bought 800 Haswell systems as replacement
Constraints
– The farm needs to filter an additional events per second (2/3 of the old capacity)
– Has to fit in a 200 kW power envelope due to cooling limitations

Benchmark
– Determine the most cost-efficient compute platform
– Help our colleagues to tune the trigger
– Create a live DVD to distribute to vendors
The DVD has only 4 GB of space
– 1.x GB of operating system
– 2 GB of sample data for input
– Leaves < 1 GB of space for the actual HLT software
– Need to get rid of unnecessary baggage in our software packages
Use "strace" on a running HLT instance to find the minimum set of files required for the trigger (capture every open/stat/execv; see the sketch below)
Copy those files onto the live DVD, plus a one-touch wrapper script
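As an illustration of this strace-based pruning step (not the actual LHCb tooling; the log name and trigger command are placeholders), the sketch below parses an strace log and collects the set of files the trigger actually touched, so that only those need to be copied onto the DVD image.

```python
#!/usr/bin/env python3
"""Collect the set of files an application really uses, from an strace log.

Assumed to be produced with something like
    strace -f -e trace=open,openat,stat,execve -o hlt_trace.log <trigger command>
The log name and the call selection are illustrative, not the exact LHCb setup.
"""
import os
import re
import sys

# First quoted argument of open/openat/stat/execve calls.
CALL_RE = re.compile(r'\b(?:open|openat|stat|execve)\((?:[^"]*")([^"]+)"')
# Skip calls that failed (e.g. ENOENT probes along the search path).
FAILED_RE = re.compile(r'= -1 E')

def used_files(trace_path):
    files = set()
    with open(trace_path, errors="replace") as log:
        for line in log:
            if FAILED_RE.search(line):
                continue
            match = CALL_RE.search(line)
            if match and os.path.isfile(match.group(1)):
                files.add(os.path.realpath(match.group(1)))
    return files

if __name__ == "__main__":
    for path in sorted(used_files(sys.argv[1])):
        print(path)   # feed this list to cpio/rsync when building the DVD image
```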

Application Profile
The trigger processes very short-lived, highly independent tasks, a.k.a. Events
– ms per Event
No inherent multi-threading
– Parallelism is achieved by launching multiple instances of the program
– Launch N instances
– Choose N for maximum machine throughput
Mixture of strongly branching code and floating-point operations
Memory footprint is approximately 1.1 GB
– 600 MB static
– MB dynamic
Processes are created by fork()ing a master process
– Reduces memory footprint
– Accelerates startup
– Has some issues → see the NUMA discussion below
[Diagram: LHC running: Event Building, Overflow, N x HLT Tasks, Output. LHC off: Disk Reader, Overflow, N x HLT Tasks, Output.]
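A minimal sketch of the fork()-based process model described above, assuming a hypothetical process_events() event loop: the master loads the large static data once, then forks N workers that share it copy-on-write, which is what keeps the per-instance footprint well below the full 1.1 GB.

```python
import os

NUM_WORKERS = 8    # N, chosen for maximum machine throughput (assumed value)

def load_static_data():
    # Stand-in for the ~600 MB of static data (geometry, conditions, ...).
    return bytearray(600 * 1024 * 1024)

def process_events(static_data, worker_id):
    # Placeholder event loop; the real HLT task reads events and writes decisions.
    print(f"worker {worker_id} (pid {os.getpid()}) processing events")

def main():
    static_data = load_static_data()      # loaded once, in the master
    children = []
    for worker_id in range(NUM_WORKERS):
        pid = os.fork()                   # children share static_data copy-on-write
        if pid == 0:
            process_events(static_data, worker_id)
            os._exit(0)                   # never fall back into the master's loop
        children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)

if __name__ == "__main__":
    main()
```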

Results

Benchmark Output
The benchmark scans for the optimum number of program instances
Results shown for previous-generation and Haswell dual-socket farm nodes (Intel X5650, AMD Opteron 6272, and the new farm node's Xeon E5 v3 CPUs)
[Plot 1: Machine throughput vs. number of running instances]
[Plot 2: Increase in throughput per additional instance]
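A rough sketch of such an instance scan, under the assumption of a hypothetical ./hlt_benchmark command that processes a fixed event sample: for each candidate N it launches N concurrent copies, measures the aggregate event rate, and reports the marginal gain of the last instance (the quantities shown in plots 1 and 2).

```python
import subprocess
import time

BENCH_CMD = ["./hlt_benchmark", "--events", "1000"]   # hypothetical trigger benchmark
EVENTS_PER_RUN = 1000                                  # must match the command above
MAX_INSTANCES = 48

def throughput(n_instances):
    """Run n_instances copies concurrently and return aggregate events/s."""
    start = time.monotonic()
    procs = [subprocess.Popen(BENCH_CMD) for _ in range(n_instances)]
    for p in procs:
        p.wait()
    elapsed = time.monotonic() - start
    return n_instances * EVENTS_PER_RUN / elapsed

def main():
    previous = 0.0
    for n in range(1, MAX_INSTANCES + 1):
        total = throughput(n)
        print(f"{n:3d} instances: {total:8.1f} events/s "
              f"(marginal gain {total - previous:+7.1f})")
        previous = total

if __name__ == "__main__":
    main()
```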

Interesting little detail
The program seems to run faster on the first socket than on the second (plot annotations: ~630 vs. ~560)
– The effect is also visible on old-generation nodes, but to a lesser extent

Consequences of the NUMA Architecture
The fork()ed master process is scheduled to a particular NUMA node and allocates most of its memory there
Children are distributed over the two sockets depending on their number
– First to the socket with the master
– Then to the other socket(s)
Children running off the master's socket are hit by the additional latency of a QPI/HT hop
Solution: spawn as many masters as there are sockets (see the sketch below)
– Lock each master and its children to a specific NUMA node with numactl
– Up to 14% performance gain in machine throughput!
Haswell 10+ core CPUs are internally divided into two NUMA nodes
– 2-3% additional gain
– Enable the 'Cluster On Die' (COD) feature in the BIOS
[Table: decisions/s without NUMA pinning, decisions/s with NUMA pinning, and NUMA gain for Intel X5650 (8 cores), Opteron 6272, E5-2630 v3 (8 cores), E5-2650 v3 (10 cores)]
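The pinning described above can be expressed, in a simplified and assumed form, as one master per socket launched under numactl so that both CPU scheduling and memory allocation stay on the same node; the master command below is a placeholder.

```python
import subprocess

NUMA_NODES = [0, 1]                                  # one master per socket (dual-socket node)
MASTER_CMD = ["./hlt_master", "--children", "16"]    # hypothetical master process

def launch_masters():
    procs = []
    for node in NUMA_NODES:
        # Bind CPU scheduling and memory allocation of the master (and therefore
        # of its fork()ed children) to a single NUMA node.
        cmd = ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + MASTER_CMD
        procs.append(subprocess.Popen(cmd))
    return procs

if __name__ == "__main__":
    for p in launch_masters():
        p.wait()
```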

Other Memory Issues
Memory latency seems to have finally caught up with our application
Cache per core: Westmere 2 MB, Ivy Bridge 2.5 MB, Haswell 2.5 MB
(Tool: Intel Performance Counter Monitor)
While the L3 miss rate has gone down, each L3 miss now loses more clock cycles
DDR4 memory has better throughput but currently higher latency!
Potential for more optimization
Should become better in the future
– DDR4 is quite new
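A back-of-the-envelope illustration (all numbers are assumptions, not the measured LHCb values) of why a lower miss rate can still hurt: the share of cycles lost to L3 misses is roughly the misses per instruction times the per-miss penalty, relative to the baseline cycles per instruction, so a larger penalty can outweigh fewer misses.

```python
def stall_fraction(misses_per_kilo_instr, miss_penalty_cycles, base_cpi):
    """Fraction of total cycles spent waiting on L3 misses (simplified model)."""
    stall_cpi = (misses_per_kilo_instr / 1000.0) * miss_penalty_cycles
    return stall_cpi / (base_cpi + stall_cpi)

# Illustrative numbers only: fewer misses, but a larger penalty per miss
# (e.g. early DDR4 latency), can still mean a larger share of lost cycles.
old = stall_fraction(misses_per_kilo_instr=5.0, miss_penalty_cycles=180, base_cpi=1.0)
new = stall_fraction(misses_per_kilo_instr=4.0, miss_penalty_cycles=260, base_cpi=1.0)
print(f"old generation: {old:.0%} of cycles stalled on L3 misses")
print(f"new generation: {new:.0%} of cycles stalled on L3 misses")
```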

Performance Overview
Performance model based on the benchmark and the memory measurements
Assumes a dual-socket system
[Table: CPU model, core count, and frequency for the candidate Xeon E5 v3 CPUs (2630 v3 through the 2687W v3 and above) and the AMD part]
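A deliberately crude sketch of what such a per-CPU projection could look like: a measured reference throughput is scaled by core count and clock frequency for a dual-socket system. The reference numbers and the linear scaling are assumptions for illustration and ignore the memory effects that the real model accounted for.

```python
# Reference point: a measured dual-socket node (placeholder numbers, not LHCb data).
REF_CORES_PER_SOCKET = 8
REF_FREQ_GHZ = 2.4
REF_THROUGHPUT = 1000.0          # events/s for the reference dual-socket node

def projected_throughput(cores_per_socket, freq_ghz, sockets=2):
    """Naive projection: throughput scales with total core count and clock frequency."""
    scale = (sockets * cores_per_socket * freq_ghz) / (2 * REF_CORES_PER_SOCKET * REF_FREQ_GHZ)
    return REF_THROUGHPUT * scale

candidates = {
    "8 cores  @ 2.4 GHz": (8, 2.4),
    "10 cores @ 2.3 GHz": (10, 2.3),
    "12 cores @ 2.5 GHz": (12, 2.5),
}
for name, (cores, freq) in candidates.items():
    print(f"{name}: ~{projected_throughput(cores, freq):.0f} events/s (dual socket)")
```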

Costs / Power
Cost of a dual-socket system for fixed throughput
– 24 blades for Avoton
Costs included:
– Main board
– Memory
– PSU
– CPU
– Chassis
– Double memory required for the 2687W and above
– AMD 6376 and Avoton based on system quotes
Power consumption measured for:
– Avoton
– AMD 6376
– E5 v3
– E5 v3
Other power consumption estimated as 1.6 x TDP x N sockets (see the sketch below)
– Fits all measured systems within 5%
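Putting the quoted rule of thumb into code: the sketch below estimates a node's power draw as 1.6 x TDP x number of sockets, then checks how many such nodes a fixed event rate would require against the 200 kW envelope from the constraints slide. The throughput and TDP numbers in the example are placeholders, not the actual quotes.

```python
POWER_ENVELOPE_W = 200_000       # cooling limit of the farm (from the constraints slide)

def node_power_estimate(tdp_watts, sockets=2):
    """Estimated wall power of a server: 1.6 x TDP x number of sockets."""
    return 1.6 * tdp_watts * sockets

def farm_size(target_events_per_s, node_events_per_s):
    """Number of nodes needed to reach a given aggregate event rate."""
    return -(-target_events_per_s // node_events_per_s)   # ceiling division

# Placeholder candidate: 85 W TDP CPU, dual socket, 1000 events/s per node.
nodes = farm_size(target_events_per_s=500_000, node_events_per_s=1000)
total_power = nodes * node_power_estimate(tdp_watts=85)
print(f"{nodes} nodes, ~{total_power / 1000:.1f} kW "
      f"({'within' if total_power <= POWER_ENVELOPE_W else 'over'} the 200 kW envelope)")
```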

A word about AMD and Avoton
AMD
– Pretty good value for the price (not much choice)
– Power consumption is an issue
Avoton
– Actually better than portrayed here
– Suffers from two key shortcomings in our case:
  No Hyper-Threading
  Very little cache per core (512 kB) → instructions per clock cycle < 0.4
  If your code is optimized for this it should be quite efficient
– Many slow systems instead of a few fast ones means extra overhead:
  Network overhead
  Administration overhead
  HDDs (if you need them)

Conclusion
We have created a stand-alone version of our High Level Trigger
Significantly improved trigger performance with NUMA awareness
Memory access optimization has become even more critical with the latest Intel CPU generation
– Both Xeon and Avoton
Selected the E5 v3 as the most cost-efficient platform for our new farm
This does not mean it is also the best platform for you, though

Acknowledgements
Companies and institutions that helped with access to test platforms and power measurements:
– CERN IT
– E4 Computer Engineering
– Intel Swindon Labs
– ASUS
– Macle GmbH

Questions