Simple Illustration of L1 Bandwidth Limitations on Vector Performance


Steve Lantz, 11/4/2016

What's the Point?
- Relatively simple parts of the fitting code do not get full vector speedup.
- A possible explanation is that the arithmetic intensity (AI = flop/byte) of the code is too low.
- Low AI makes it impossible to load and store data fast enough, even when the right vectors are sitting in L1 cache, due to the limited bandwidth to L1.
- It might be useful to look at an easy test code from TACC first to see how this works.
- Tests are run on phiphi (Intel Sandy Bridge), 1 core.
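As a quick worked example before the code (my arithmetic, not from the slides): the test kernel below computes x[j] = y[j] + z[j], which is 1 flop per element against 24 bytes of traffic (two 8-byte loads plus one 8-byte store), so its AI is 1/24, about 0.042 flop/byte. Keeping a 4-wide AVX adder busy at that AI would require more load/store bandwidth than one core's L1 ports can deliver.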

Super-Simple Code: vector.c

    #define N 256
    ...
    tstart = omp_get_wtime();
    // External loop to measure a reasonable time interval
    for (i = 0; i < 1000000; i++) {
    #pragma vector aligned
        for (j = 0; j < N; j++) {
            x[j] = y[j] + z[j];   // Low AI
        }
    }
    tend = omp_get_wtime();

All Arrays Fit in L1 Cache

    #define N 256
    ...
    double *x, *y, *z;
    // Allocate memory aligned to a 64-byte boundary
    x = (double *)memalign(64, N*sizeof(double));
    y = (double *)memalign(64, N*sizeof(double));
    z = (double *)memalign(64, N*sizeof(double));

- Total storage is 3 * 256 * 8 bytes = 6 KB.
- The Sandy Bridge L1 data cache is 32 KB.
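Pulling the two fragments together, a minimal self-contained harness might look like the following. This is a sketch reconstructed from the slides, not the original file: the includes, the initialization loop, and the final printf are my additions, and the original may differ in such details.

    #include <stdio.h>
    #include <stdlib.h>
    #include <malloc.h>   /* memalign() */
    #include <omp.h>      /* omp_get_wtime() */

    #define N 256

    int main(void) {
        /* Allocate memory aligned to a 64-byte boundary */
        double *x = (double *)memalign(64, N * sizeof(double));
        double *y = (double *)memalign(64, N * sizeof(double));
        double *z = (double *)memalign(64, N * sizeof(double));
        int i, j;

        /* Initialize inputs (not shown on the slides) */
        for (j = 0; j < N; j++) { y[j] = j; z[j] = 2.0 * j; x[j] = 0.0; }

        double tstart = omp_get_wtime();
        /* External loop to measure a reasonable time interval */
        for (i = 0; i < 1000000; i++) {
    #pragma vector aligned
            for (j = 0; j < N; j++) {
                x[j] = y[j] + z[j];   /* Low AI */
            }
        }
        double tend = omp_get_wtime();

        /* Print a result so the compiler cannot discard the timed loop */
        printf("time = %.3f s, x[N-1] = %g\n", tend - tstart, x[N - 1]);
        free(x); free(y); free(z);
        return 0;
    }

Built with something like icc -O3 -qopenmp; the "vector aligned" pragma is Intel-specific, and other compilers will simply warn and ignore it.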

What Do We Expect?
- Sandy Bridge has AVX capability: it supports vector operations on up to 256 bits = 32 bytes = 4 doubles.
- The expectation (fantasy) is 1 vector add per cycle...
- The Xeon E5-2620s on phiphi run at 2 GHz.
- The vector.c code adds 256 doubles 1M times.
- 256 M doubles / (4 doubles/cycle) / (2000 M cycles/s) = 0.032 s.
- Measured time for the loop is 0.068 s, roughly 2x slower than the predicted max... Why?
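The same back-of-the-envelope model written out as a trivially runnable check (the variable names and layout are mine, not from the slides):

    #include <stdio.h>

    /* Best-case model: one 4-wide AVX add retired every cycle. */
    int main(void) {
        double adds  = 256.0 * 1.0e6;  /* 256 doubles per pass x 1M passes */
        double lanes = 4.0;            /* one 256-bit AVX add = 4 doubles  */
        double freq  = 2.0e9;          /* phiphi Xeon E5-2620, 2 GHz       */
        printf("predicted: %.3f s\n", adds / lanes / freq);  /* 0.032 s */
        return 0;
    }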

Sandy Bridge Architecture

[Annotated block diagram of the SNB core, showing instruction ports and register stacks: data paths go via stacks, taking 1 cycle to load and/or cross; the previous CPU generation had SSE = 128 bits, SNB has AVX = 256 bits]

- The memory unit services three data accesses per cycle:
  - 2 read requests of up to 16 bytes each [2 x 128 bits = 1 AVX load], AND
  - 1 store of up to 16 bytes [128 bits = 0.5 AVX store]

Source: http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/3
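Those port widths are already enough to explain the 2x gap (my arithmetic, consistent with the next slide): each 4-double AVX iteration of x[j] = y[j] + z[j] must move 64 bytes of loads and 32 bytes of stores, but the L1 ports supply at most 32 bytes of reads and 16 bytes of stores per cycle. Both limits force one AVX iteration per 2 cycles, i.e., half the hoped-for add rate.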

Under-Filled AVX Pipeline in SNB

[Pipeline timing diagram, AVX stages (Load / Mult / Add / Store) vs. cycles: Load streams y0 z0 y4 z4 y8 ...; Mult sits idle; Add fires every other cycle (+0 +4 +8 +12 +16 ...); Store follows (x0 x4 x8 x12 ...); pipeline fill = 3 cycles; repeat unit = 8 cycles, done 16 times; pipeline drain = 3 cycles at the end]

- The compiler unrolls the j loop by 4 and creates a 4-stage pipeline.
- Two AVX loads take 2 cycles; one store also takes 2 cycles.
- Every 2 cycles, we get an AVX add on 4 doubles (= 4 flop).
- Adding 256 doubles takes 3 + 2 x 256/4 + 3 = 134 cycles.
- Do this 1M times: 134 M cycles / (2000 M cycles/s) = 0.067 s.
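Viewed as bandwidth (my arithmetic, not on the slide): the loop streams 24 bytes per element, so 256 M elements in 0.067 s is about 92 GB/s, while the L1 port limit of 48 bytes/cycle at 2 GHz is 96 GB/s. The kernel is running at roughly 96% of the L1 bandwidth ceiling, so it really is L1-bandwidth-bound.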

Code with Higher AI: vector2m2a.c

    #define N 256
    ...
    tstart = omp_get_wtime();
    // External loop to measure a reasonable time interval
    for (i = 0; i < 1000000; i++) {
    #pragma vector aligned
        for (j = 0; j < N; j++) {
            x[j] = 2.5*y[j] + 3.6*z[j] + 4.7;   // Higher AI: 4 flop per element
        }
    }
    tend = omp_get_wtime();
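For comparison with the earlier kernel (my arithmetic): this loop performs 4 flop per element (two multiplies, two adds) over the same 24 bytes of traffic, so its AI is 4/24, about 0.17 flop/byte, four times the AI of vector.c with identical memory traffic.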

Filling the AVX Pipeline in SNB?

[Pipeline timing diagram, AVX stages vs. cycles: Load streams y0 z0 y4 z4 ...; Mult does *c1 and *c2; Add does +c3 and then +0 +4 +8 +12 +16 ...; Store follows (x0 x4 x8 x12 ...); pipeline fill = 4 cycles; repeat unit = 8 cycles, done 16 times; pipeline drain = 4 cycles at the end]

- The compiler fills the 4-stage pipeline for vector2m2a.c.
- SNB can do one AVX add AND one AVX multiply per cycle!
- With low AI, we got only one AVX add every other cycle.
- Here, doing 1024 flop takes 4 + 2 x 256/4 + 4 = 136 cycles?
- If correct: 136 M cycles / (2000 M cycles/s) = 0.068 s.
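The payoff, stated as rates (my arithmetic): if the 136-cycle estimate holds, vector2m2a.c retires 1024 M flop in about 0.068 s, roughly 15 Gflop/s, versus 256 M flop in the same time for vector.c, roughly 3.8 Gflop/s. Raising the AI converts the same L1 traffic into four times the flop rate.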

OK, Prove It!
Use the performance models to predict total loop time...
- vector.c: 0.067 s predicted; 0.068 s measured (= 136 M cycles, about 2 cycles per AVX store).
- vector2m2a.c: 0.068 s predicted; 0.083 s measured (= 166 M cycles, about 2.5 cycles per AVX store).
- vectorma.c: 0.070 s measured (more pipeline latency than predicted).
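The slides never show vectorma.c; judging only by the name, it presumably does one multiply and one add per element. A hypothetical inner loop (my guess, dropped into the same harness as vector.c above) might be:

    #pragma vector aligned
    for (j = 0; j < N; j++) {
        x[j] = 2.5 * y[j] + z[j];   /* hypothetical: 1 multiply + 1 add */
    }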

Roofline Analysis

[Figure: roofline plot; slide image not transcribed]
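As background for reading the plot (standard roofline reasoning, not from the slide): attainable performance is bounded by P = min(P_peak, AI x B), where B is the relevant memory bandwidth. Kernels at AI of roughly 0.04 to 0.17 flop/byte, like the ones above, sit on the bandwidth-limited slope of the roof, well to the left of the ridge point where the compute peak takes over.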

KNL Empirical Roofline

[Figure: empirical roofline measured on KNL; slide image not transcribed]