Lessons from HLT benchmarking (For the new Farm) Rainer Schwemmer, LHCb Computing Workshop 2014.


The Setup
Goals
– Determine the most cost-efficient compute platform for next year's farm upgrade
– Help estimate what can be expected from the new farm
CPU time measurement:
– Run Moore similar to how it runs at P8 during data taking (actually deferred processing)
– Buffer Manager
– File Reader (instead of MEPRx)
– Variable number of HLT1/HLT1+2 instances
– Measure how many triggers were processed over a certain amount of time (typically 1 hour); a rate-measurement sketch follows below
Memory measurement:
– Intel Performance Counter Monitor (PCM)
– Profiles the entire system for cache behaviour, IPC and other interesting stats
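Not the actual benchmark code; a minimal sketch of the rate measurement described above: count processed triggers over a fixed wall-clock window and divide by the elapsed time. process_next_event() is a hypothetical stand-in for one pass of the trigger application; in the real setup events arrive through the buffer manager / file reader.

  #include <chrono>
  #include <cstdio>

  // Hypothetical stand-in for processing one event; here it just burns a
  // little CPU so the loop has something measurable to do.
  static bool process_next_event() {
      volatile double x = 0.0;
      for (int i = 0; i < 100000; ++i) x += i * 0.5;
      return true;
  }

  int main() {
      using clock = std::chrono::steady_clock;
      const auto window = std::chrono::seconds(10);  // 1 hour in the real measurements
      const auto start  = clock::now();
      long processed = 0;

      while (clock::now() - start < window)
          if (process_next_event())
              ++processed;

      const double seconds = std::chrono::duration<double>(clock::now() - start).count();
      std::printf("Trigger rate: %.1f Hz (%ld events in %.1f s)\n",
                  processed / seconds, processed, seconds);
      return 0;
  }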

Results
– Had access to quite a few next-generation prototype systems

Results
[Chart labels: machines that are interesting for the new farm, the current farm nodes, and the new farm node (x800)]

Interesting little detail
– HLT seems to run faster on the first socket than on the second (plot annotations: ~630 vs. ~560); a socket-pinning sketch follows below
– Effect is also visible on current farm nodes, but to a lesser extent
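One way to reproduce such a comparison is to pin the benchmark to the cores of one socket before running it. A minimal sketch using Linux sched_setaffinity; the CPU ranges (0-7 = socket 0, 8-15 = socket 1) are an assumption and have to be taken from the actual machine topology (e.g. lscpu).

  #include <sched.h>
  #include <cstdio>

  // Restrict the calling process to a contiguous range of logical CPUs.
  static bool pin_to_cpus(int first, int last) {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      for (int cpu = first; cpu <= last; ++cpu)
          CPU_SET(cpu, &mask);
      return sched_setaffinity(0, sizeof(mask), &mask) == 0;  // pid 0 = this process
  }

  int main(int argc, char** argv) {
      // Assumed mapping: CPUs 0-7 on socket 0, CPUs 8-15 on socket 1.
      const int socket = (argc > 1 && argv[1][0] == '1') ? 1 : 0;
      const int first  = (socket == 0) ? 0 : 8;
      const int last   = (socket == 0) ? 7 : 15;

      if (!pin_to_cpus(first, last)) {
          std::perror("sched_setaffinity");
          return 1;
      }
      std::printf("Pinned to socket %d (CPUs %d-%d); start the benchmark from here.\n",
                  socket, first, last);
      return 0;
  }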

NUMA Architecture
– Non-Uniform Memory Access hits us when we are launching applications as forks of a master process
[Diagram: the HLT master process holding >50% of the memory, with the HLT slave processes forked from it spread over the sockets of the node]

NUMA Architecture
– When off-core/off-socket instances access master memory they incur additional latency due to the socket-socket/core-core interconnect; a latency microbenchmark sketch follows below
[Diagram: same master/slave layout, with the remote accesses to the master's memory crossing the interconnect]
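A generic (non-LHCb) microbenchmark sketch of that latency penalty, using libnuma and assuming a machine with at least two NUMA nodes: dependent (pointer-chasing) loads through a buffer placed on the local node are timed against the same loads through a buffer placed on the remote node. Build with g++ ... -lnuma.

  #include <numa.h>
  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <numeric>
  #include <random>
  #include <vector>

  // Average latency of dependent loads through a buffer allocated on mem_node,
  // while the code itself runs on node 0.
  static double chase_ns_per_load(int mem_node, size_t n) {
      size_t* next = static_cast<size_t*>(numa_alloc_onnode(n * sizeof(size_t), mem_node));
      if (!next) return -1.0;

      // Build one random cycle so every load depends on the previous one.
      std::vector<size_t> perm(n);
      std::iota(perm.begin(), perm.end(), size_t(0));
      std::shuffle(perm.begin(), perm.end(), std::mt19937_64(42));
      for (size_t i = 0; i < n; ++i)
          next[perm[i]] = perm[(i + 1) % n];

      const size_t steps = 2 * n;
      size_t idx = 0;
      const auto t0 = std::chrono::steady_clock::now();
      for (size_t i = 0; i < steps; ++i)
          idx = next[idx];                       // serialised loads expose the latency
      const auto t1 = std::chrono::steady_clock::now();

      if (idx == n) std::puts("");               // keep idx observable (never true)
      numa_free(next, n * sizeof(size_t));
      return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
  }

  int main() {
      if (numa_available() < 0 || numa_max_node() < 1) {
          std::puts("needs a NUMA machine with at least two nodes");
          return 1;
      }
      numa_run_on_node(0);                       // execute on node 0 in both cases
      const size_t n = size_t(1) << 23;          // 64 MiB of indices, well beyond the caches
      std::printf("local  (node 0): %.1f ns/load\n", chase_ns_per_load(0, n));
      std::printf("remote (node 1): %.1f ns/load\n", chase_ns_per_load(1, n));
      return 0;
  }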

NUMA Architecture
– Solution: launch one master process per NUMA node (a sketch follows below)
– Disadvantage: every additional master needs memory but does not participate in data processing
[Diagram: one HLT master per NUMA node, each with its own set of forked HLT slaves]
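A minimal sketch (not the actual HLT control framework) of that scheme: the launcher forks one master per NUMA node, each master binds its execution and memory allocation to its node before building its large shared state, and then forks its slaves, so the copy-on-write pages the slaves read stay node-local. run_master/run_slave and slaves_per_node are illustrative placeholders.

  #include <numa.h>
  #include <sys/wait.h>
  #include <unistd.h>
  #include <cstdio>

  // Illustrative placeholder for the slave work.
  static void run_slave(int node) {
      std::printf("slave on node %d (pid %d)\n", node, getpid());
      _exit(0);
  }

  static void run_master(int node, int n_slaves) {
      // Bind CPU and memory allocation to this node *before* the master
      // initialises its large shared state, so the slaves inherit node-local pages.
      numa_run_on_node(node);
      numa_set_preferred(node);

      // ... build the shared state (the ">50% of mem") here ...

      for (int i = 0; i < n_slaves; ++i)
          if (fork() == 0) run_slave(node);      // slaves share the master's pages copy-on-write
      while (wait(nullptr) > 0) {}
      _exit(0);
  }

  int main() {
      if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }
      const int nodes = numa_num_configured_nodes();
      const int slaves_per_node = 8;             // illustrative; match the node's core count

      for (int node = 0; node < nodes; ++node)
          if (fork() == 0) run_master(node, slaves_per_node);
      while (wait(nullptr) > 0) {}
      return 0;
  }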

The numbers
– The raw numbers for HLT1/HLT1+2 without/with NUMA awareness; values are in Hz
– Core Value = HLT1+2_Classic / HLT1+2_Single / N_Cores
– Use it if you want to compute fully loaded performance from single-core performance (a worked sketch follows below)
– Warning: new machines are not 1.0 anymore!
– I have most of these values for most of the Haswell, AMD and Atom cores → can provide a spreadsheet if you are interested
[Table: CPU vs. HLT1 Single, HLT1 Classic, HLT1 NUMA, HLT1+2 Single, HLT1+2 Classic, HLT1+2 NUMA, NUMA Gain 1, NUMA Gain 1+2 and Core Value for the DELL, SM, AMD, E_2630 (8 cores) and E_2650 (10 cores) machines; only fragments of the values (e.g. "~1.063") survive in the transcript]
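A small worked example of how the Core Value would be used; the numbers plugged in are invented for illustration, not taken from the slide's table.

  #include <cstdio>

  // Core Value = HLT1+2_Classic / HLT1+2_Single / N_Cores: the fraction of the
  // ideal "single-instance rate times number of cores" that a fully loaded
  // machine actually delivers.
  static double core_value(double classic_hz, double single_hz, int n_cores) {
      return classic_hz / single_hz / n_cores;
  }

  int main() {
      // Measured on a reference machine (invented numbers):
      const double ref_single_hz  = 40.0;   // one HLT1+2 instance, otherwise idle node
      const double ref_classic_hz = 560.0;  // all cores loaded, classic forking
      const int    ref_cores      = 16;
      const double cv = core_value(ref_classic_hz, ref_single_hz, ref_cores);

      // Projection for a candidate machine where only a single-instance run exists.
      // Caveat from the slide: the Core Value of new machines is not 1.0 anymore
      // and is not guaranteed to carry over between CPU generations.
      const double cand_single_hz = 55.0;   // invented
      const int    cand_cores     = 24;
      std::printf("Core Value = %.3f, projected fully loaded rate = %.0f Hz\n",
                  cv, cand_single_hz * cand_cores * cv);
      return 0;
  }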

Frequency Scaling
– Benchmark performance vs. core frequency, results from 2010
– Performance scales more or less linearly with frequency (an extrapolation sketch follows below)
– Dashed: linear extrapolation based on the lowest measurement point
– Solid: measured performance
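A small sketch of the kind of extrapolation the dashed curve represents: assume performance is proportional to frequency and scale the lowest measured point up. The frequency/rate pairs are invented; the real curves are in the slide's plot.

  #include <cstdio>

  // Proportional extrapolation from the lowest measurement point:
  // expected(f) = measured(f_min) * f / f_min.
  static double expected_rate(double f, double f_min, double rate_at_f_min) {
      return rate_at_f_min * f / f_min;
  }

  int main() {
      const double f_min = 1.2, rate_min = 250.0;   // invented lowest point (GHz, Hz)
      const double measured[][2] = { {1.2, 250.0}, {1.8, 370.0}, {2.4, 480.0}, {3.0, 560.0} };

      std::puts(" f/GHz   measured/Hz   extrapolated/Hz");
      for (const auto& p : measured)
          std::printf("  %4.1f      %6.0f          %6.0f\n",
                      p[0], p[1], expected_rate(p[0], f_min, rate_min));
      return 0;
  }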

Frequency Scaling
– Benchmark performance vs. core frequency, more recent results
– Performance does not scale with frequency at all anymore
– There is a good chance the extrapolation curve is underestimated and could be better

Frequency Scaling
– Cause unclear so far: no profiler for Haswell yet
– Suspect high memory latency and bad data locality in the application
– DDR4 is still in its infancy → might get better, but we won't profit from it with the new farm

Conclusion
Forking is good for start-up, but we lost quite a bit of performance
– This did not go unnoticed → Johannes ca.
– We did not keep track of performance values, so it was blamed on changes in the application
Running the HLT NUMA-aware can give us 14% more performance on new machines
– 6%-8% on old farm nodes
– Memory consumption will go up
Performance does not scale with CPU frequency anymore
– Probably an issue with memory latency and bad data access patterns
– No plan to upgrade the farm again until the next shutdown
– Will not profit from better memory until after LS2
– Some more % can probably be gained by optimizing data structures and access patterns

Questions?