Investigation of the improved performance on Haswell processors

Investigation of the improved performance on Haswell processors Marco Guerri 7th February 2017 Pre Grid Development Board

The problem

The performance of some WLCG applications is much better on worker nodes with Intel Haswell (E5-2600 v3) processors than on nodes with Sandy Bridge (E5-2600) processors. Scaled by the HS06 score of the provided job slot, there is a "magic" boost of around 45%, far more than the expected improvement in general-purpose performance. The improvement can be observed with the Dirac benchmark.

            Sandy Bridge                      Haswell
CPU         Dual socket E5-2650 @ 2.00 GHz    Dual socket E5-2640v3 @ 2.60 GHz
RAM         64 GiB DDR3                       128 GiB DDR4
OS          SLC 6.8, 2.6.32-642.el6.x86_64 (both systems)
Compiler    4.4.7 (both systems)
CPU freq    userspace governor, 2.00 GHz      userspace governor, 2.00 GHz

Approach adopted

1) Identify the major contributors to the runtime
   Tools: perf, Intel Software Development Emulator
2) Verify that the instruction set used is the same and that there is no [...]
   Tools: Intel Software Development Emulator
3) Identify peculiarities of the major contributors
   Tools: perf, source code inspection
4) Try to write a synthetic benchmark reproducing the main peculiarities of the workload
   Tools: C/Python
5) Identify relevant performance counters, draft a hypothesis
   Tools: perf
6) Validate the hypothesis on the initial workload

Intel Software Development Emulator (SDE):
- Uses dynamic binary instrumentation to trace the execution, and emulates instructions that are not supported by the architecture according to CPUID (e.g. AVX-512)
- Builds a trace of each instruction executed

Profiling Dirac Benchmark

Profile obtained with Intel Software Development Emulator

Analysis of the main contributors 1/2: PyEval_EvalFrameEx

Analysis of the main contributors 2/2: PySequence_GetSlice

PyEval_EvalFrameEx

Large switch statement that constitutes the core of CPython. It dispatches byte-code instructions to the corresponding branches.

The synthetic benchmark is built from nested macros that double the number of cases at each level; the switch argument is incremented at each step (code compiled with -O2).

    #define CASE(N) \
        case N: \
            flag[N] = N; \
            break;
    #define CASE2(N) CASE(N) CASE(N+1)
    #define CASE4(N) CASE2(N) CASE2(N+2)
    #define CASE8(N) CASE4(N) CASE4(N+4)
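The macros on this slide could be assembled into a complete benchmark along the following lines. This is only a sketch under stated assumptions: the number of cases (64), the names run_dispatch and flag_checksum, the volatile qualifier and the wrap-around of the switch argument are not given in the slides, and whether the compiler actually emits a jump table depends on the compiler version and flags.

```c
#include <stddef.h>

/* Same doubling scheme as on the slide, with extra parentheses for safety. */
#define CASE(N) \
    case (N): \
        flag[(N)] = (N); \
        break;
#define CASE2(N)  CASE(N)   CASE((N) + 1)
#define CASE4(N)  CASE2(N)  CASE2((N) + 2)
#define CASE8(N)  CASE4(N)  CASE4((N) + 4)
#define CASE16(N) CASE8(N)  CASE8((N) + 8)
#define CASE32(N) CASE16(N) CASE16((N) + 16)
#define CASE64(N) CASE32(N) CASE32((N) + 32)

#define NUM_CASES 64  /* assumed size; the slides do not state it */

/* volatile keeps the compiler from discarding the stores in each case. */
static volatile int flag[NUM_CASES];

/* Executes `iterations` dispatches, incrementing the switch argument at
 * each step (wrapping at NUM_CASES), and returns the number of steps run. */
static unsigned long run_dispatch(unsigned long iterations)
{
    unsigned long steps = 0;
    unsigned int op = 0;
    unsigned long i;

    for (i = 0; i < iterations; i++) {
        switch (op) {
            CASE64(0)
        }
        steps++;
        op = (op + 1) % NUM_CASES;
    }
    return steps;
}

/* Sums flag[] so a caller can check which cases have been executed. */
static long flag_checksum(void)
{
    long sum = 0;
    size_t k;

    for (k = 0; k < NUM_CASES; k++)
        sum += flag[k];
    return sum;
}
```

Calling run_dispatch with a large iteration count from a small main and timing the binary under numactl would mirror how the results on the next slide were taken.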

Synthetic benchmark results

    [root@sandybridge ~]# time numactl --physcpubind=0 --membind=0 ./test
    real    0m4.009s
    user    0m4.007s
    sys     0m0.000s

    [root@haswell ~]# time numactl --physcpubind=0 --membind=0 ./test
    real    0m0.857s
    user    0m0.852s
    sys     0m0.005s

More than 4 times slower on Sandy Bridge (4.009 s vs. 0.857 s, a factor of about 4.7).

Profile of the synthetic benchmark (Sandy Bridge vs. Haswell)

30% of branches are mispredicted on Sandy Bridge, only 0.6% on Haswell.

Mispredicted branches (synthetic benchmark)

The mispredicted jump is actually:

    jmpq *0x402378(,%rcx,8)

The switch statement is implemented with a jump table. The argument of the switch is used to index the table, retrieve the address of the corresponding branch, and jump to it.

The same applies to Sandy Bridge and Ivy Bridge.
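What the jump table does can be imitated in C with a table of function pointers. This is an analogy rather than the compiler's output: it uses an indirect call where the compiled switch uses an indirect jump, and the names jump_table, entry_offset and dispatch, as well as the four targets, are made up for illustration.

```c
#include <stddef.h>

#define NUM_TARGETS 4  /* hypothetical number of switch cases */

static int last_target = -1;

static void target0(void) { last_target = 0; }
static void target1(void) { last_target = 1; }
static void target2(void) { last_target = 2; }
static void target3(void) { last_target = 3; }

/* Plays the role of the table at 0x402378: one pointer-sized entry per case. */
static void (*const jump_table[NUM_TARGETS])(void) = {
    target0, target1, target2, target3
};

/* Byte offset of entry `index` within the table; on x86-64 this is the
 * `%rcx * 8` part of the addressing mode. */
static size_t entry_offset(size_t index)
{
    return index * sizeof jump_table[0];
}

/* Looks up the target for `index` and transfers control to it.
 * Returns the target that ran, or -1 if `index` is out of range. */
static int dispatch(size_t index)
{
    if (index >= NUM_TARGETS)
        return -1;        /* compilers typically emit a comparable bounds check */
    jump_table[index]();  /* indirect transfer: the target is known only at run time */
    return last_target;
}
```

Because the destination is loaded from memory, the processor has to predict the target address itself, not just whether the branch is taken.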

Profiling Dirac benchmark (Sandy Bridge vs. Haswell)

3% of branches are mispredicted on Sandy Bridge, only 0.4% on Haswell.

Mispredicted branches (Dirac)

The large majority of mispredicted branches occur in PyEval_EvalFrameEx on the switch statement, which is again an indirect jump. Correctly decoded, the jump would be jmpq *%rax.

Conclusions

- The major contributor to the Dirac benchmark is PyEval_EvalFrameEx, which is the core of the CPython interpreter.
- The switch/case statement is the basic building block of this routine, and it benefits heavily from a branch prediction unit that can correctly identify the target of the indirect jump.
- This type of workload is probably also very relevant for other virtual machine implementations, and particular attention was given to branch prediction when designing the new Haswell microarchitecture.
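To make the link to other virtual machines concrete, here is a minimal sketch of a switch-based bytecode interpreter loop. It is not CPython code: the opcode set, the stack layout and the name run_bytecode are invented for illustration, and the stack has no overflow checks.

```c
/* Hypothetical opcodes of a toy stack machine. */
enum opcode { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Interprets `code` until OP_HALT and returns the value on top of the stack.
 * Returns -1 on an unknown opcode. */
static long run_bytecode(const int *code)
{
    long stack[16];
    int sp = 0;  /* index of the next free stack slot */
    int pc = 0;  /* index of the next instruction */

    for (;;) {
        /* One dispatch per instruction: the next target depends on the
         * bytecode stream, which is data from the hardware's point of view. */
        switch (code[pc++]) {
        case OP_PUSH:
            stack[sp++] = code[pc++];  /* operand follows the opcode */
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            return stack[sp - 1];
        default:
            return -1;
        }
    }
}
```

For example, the program { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT } evaluates (2 + 3) * 4. With only four cases a compiler may not choose a jump table at all; real interpreters have far more opcodes, which is where the indirect jump and its prediction start to matter.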

Investigation of the dual peak effect in ATLAS Kit Validation Marco Guerri 7th February 2017 Pre Grid Development Board

The problem

The distribution of ATLAS Kit Validation (KV) run times has a faster mode and a slower one. Slower runs always seem to be associated with CPU1. In particular, KV runs slower on hyperthreads 8 and 24, the two threads that belong to the first physical core of the second processor.
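The statement that hyperthreads 8 and 24 share a core is consistent with one common enumeration of logical CPUs, sketched below. The topology constants (2 sockets, 8 cores per socket, 2 threads per core) and the numbering scheme are assumptions inferred from the thread numbers on the slide, and locate_cpu is a hypothetical helper. On a real Linux system the authoritative mapping is in /sys/devices/system/cpu/cpuN/topology/.

```c
#define SOCKETS          2  /* assumed */
#define CORES_PER_SOCKET 8  /* assumed */
#define PHYSICAL_CORES   (SOCKETS * CORES_PER_SOCKET)

struct cpu_location {
    int socket;  /* 0 = first processor, 1 = second processor */
    int core;    /* physical core index within the socket */
    int thread;  /* 0 = first hyperthread, 1 = sibling hyperthread */
};

/* Maps a logical CPU number to its location, assuming logical CPUs
 * 0..15 are the first threads of all cores (socket 0 first) and
 * 16..31 are their hyperthread siblings in the same order. */
static struct cpu_location locate_cpu(int logical_cpu)
{
    struct cpu_location loc;
    int physical = logical_cpu % PHYSICAL_CORES;

    loc.thread = logical_cpu / PHYSICAL_CORES;
    loc.socket = physical / CORES_PER_SOCKET;
    loc.core   = physical % CORES_PER_SOCKET;
    return loc;
}
```

Under these assumptions, logical CPUs 8 and 24 both map to socket 1, core 0, as thread 0 and thread 1 respectively.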
