Compilers for Embedded Systems

Presentation transcript:

Compilers for Embedded Systems Lecture 1 Integrated Systems of Hardware and Software V.I.Kelefouras University of Patras Department of Electrical and Computer Engineering VLSI lab Date: 12/11/2017

Loop unroll transformation (1)
Creates additional copies of the loop body. Always safe.

//C-code1
for (i=0; i < 100; i++)
  A[i] = B[i];

//C-code2
for (i=0; i < 100; i+=4) {
  A[i] = B[i];
  A[i+1] = B[i+1];
  A[i+2] = B[i+2];
  A[i+3] = B[i+3];
}

In code1, i takes the values 0, 1, 2, 3, ..., while in code2 it takes 0, 4, 8, ... Why do that? In code2, 4 iterations have been unrolled.
Pros: reduces the number of executed instructions; increases instruction-level parallelism.
Cons: increases code size; increases register pressure.
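
As a side note, when the trip count is not a multiple of the unroll factor, a remainder (epilogue) loop is needed. A minimal sketch, assuming a hypothetical copy loop over float arrays:

  // Unroll by 4 with an epilogue loop for the leftover iterations.
  void copy_unrolled(float *A, const float *B, int n) {
      int i;
      for (i = 0; i + 3 < n; i += 4) {   // main unrolled body
          A[i]   = B[i];
          A[i+1] = B[i+1];
          A[i+2] = B[i+2];
          A[i+3] = B[i+3];
      }
      for (; i < n; i++)                 // remainder loop (0-3 iterations)
          A[i] = B[i];
  }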

Loop unroll transformation (2)

// C code1
for (i=0; i<100; i++) { … }    // body: A[i] = B[i]; executed 100 times

// assembly code1
loop_i:
  …
  inc i          // increment i by 1
  cmp i, 100     // compare i to 100
  jl loop_i      // jump if i is lower than 100

// C code2
for (i=0; i<100; i+=4) { … }   // body: A[i] = B[i]; … A[i+3] = B[i+3]; executed 100/4 times

// assembly code2
loop_i:
  …
  add i, 4       // increment i by 4
  cmp i, 100     // compare i to 100
  jl loop_i      // jump if i is lower than 100

So, in order for the C code to run on the target platform, it has to be converted into assembly code and then into binary code; this slide shows the assembly code that a typical compiler generates when loop unroll is applied (code2) and when it is not (code1). The for-loop control is transformed into three assembly instructions: increment, compare, and jump. In code1 these three instructions are executed 100 times, while in code2 only 100/4 = 25 times. Thus, applying loop unroll reduces the number of add, compare, and jump instructions.
The number of arithmetical instructions is reduced: fewer add instructions for i (i=i+4 instead of i=i+1 four times), fewer compare instructions (i < 100 ?), fewer jump instructions.

Loop unroll transformation (2) – continued
Same codes as in the previous slide.
Execution time is reduced.
Energy consumption in the execution unit and the instruction fetch unit is reduced.

Scalar replacement transformation
Converts an array reference into a scalar reference. Always safe.

//Code-1
for (i=0; i < 100; i++){
  A[i] = … + B[i];
  C[i] = … + B[i];
  D[i] = … + B[i];
}

//Code-2
for (i=0; i < 100; i++){
  t = B[i];
  A[i] = … + t;
  C[i] = … + t;
  D[i] = … + t;
}

In Code-1, the B[i] array reference is accessed 3 times for every i and therefore 300 times in total.
Pros: reduces the number of L/S instructions; reduces the number of memory accesses; reduces the number of arithmetic instructions.
Cons: introduces extra dependencies and may disable other transformations.

Scalar Replacement Transformation example (1)

// C-code1
for (i=0; i<N; i++)
 for (j=0; j<N; j++)
  for (k=0; k<N; k++)
   C[i][j] += A[i][k] * B[k][j];

// C-code2
for (i=0; i<N; i++)
 for (j=0; j<N; j++) {
  tmp = C[i][j];
  for (k=0; k<N; k++) {
   tmp += A[i][k] * B[k][j];
  }
  C[i][j] = tmp;
 }

C[i][j] is not affected by the k loop. In C-code1, for every k, C[i][j] is redundantly loaded/stored from/to memory. A load/store instruction needs 1-3 CPU cycles.
[Figure: memory hierarchy (main memory, L2 unified cache, L1 instruction/data caches, RF, CPU); C[0][0] travels through every level on each redundant access.]

Scalar Replacement Transformation example (2)

// C code of MMM
for (i=0; i<N; i++)
 for (j=0; j<N; j++) {
  tmp = C[i][j];
  for (k=0; k<N; k++) {
   tmp += A[i][k] * B[k][j];
  }
  C[i][j] = tmp;
 }

[Figure: the N×N arrays C, A and B, with tmp holding the current element of C.]
The number of L/S instructions is reduced: N² instead of N³ loads and stores for the C array.
The number of arithmetical instructions is reduced: fewer address computations for C.
The number of L1 data accesses is reduced: N² instead of N³ L1 accesses for the C array.

Scalar Replacement Transformation (3)
[Figure: memory hierarchy (main memory, L2 unified cache, L1 instruction/data caches, RF, CPU), faster and smaller towards the CPU; data moves in cache lines between memories and in words into the RF. A pie chart shows the energy consumption of the HW components for the initial MMM code on ARM: most of the energy is consumed in the memory hierarchy.]
Dynamic power consumption in the memory hierarchy is reduced by reducing the number of memory accesses.
The power consumption in the instruction fetch unit and the execution unit is reduced by reducing the number of instructions.

Think Pair Share Exercise
When is code2 faster than code1?
a) Always  b) Never  c) It depends on the hardware architecture  d) It is impossible to know
Answer: when the code2 size becomes larger than the L1 instruction cache size, code2 is no longer efficient.

//code1
N=1000000;
for (i=0; i < N; i++)
  A[i] = B[i];

//code2
for (i=0; i < N; i+=10000) {
  A[i] = B[i];
  A[i+1] = B[i+1];
  A[i+2] = B[i+2];
  …
  A[i+9999] = B[i+9999];
}

So, code1 may consist of about 9 assembly instructions. These instructions are loaded from main memory into the L1 instruction cache once and are then fetched (from L1) a million times. Code2, in contrast, consists of a huge number of assembly instructions, which may not fit in L1. If the L1 instruction cache is smaller than the code2 size, the instructions of code2 are fetched from the L2 memory (roughly a million instruction fetches in total). Keep in mind that L2 is much slower and more energy-hungry than L1.
[Figure: memory hierarchy with main memory, L2 unified cache, L1 instruction/data caches, RF, CPU.]

Single Instruction Multiple Data (SIMD) (1)
Modern processors provide vector assembly instructions to increase performance.
Modern compilers use auto-vectorization.
There is dedicated HW supporting a variety of vector instructions as well as wide registers.
Normally, array operations are implemented using vector rather than scalar assembly instructions.
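
A minimal way to see auto-vectorization in action with GCC (the file and function names below are made up; -O3 and -fopt-info-vec are standard GCC options, with -O3 enabling the vectorizer):

  /* saxpy.c: a loop that a compiler can auto-vectorize. */
  void saxpy(float *y, const float *x, float a, int n) {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }

  /* Compile with, e.g.:
   *   gcc -O3 -fopt-info-vec -c saxpy.c
   * and GCC reports which loops it vectorized with SIMD instructions. */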

Single Instruction Multiple Data (SIMD) (2)
Intel MMX technology: 8 mmx registers of 64 bits, aliased onto the floating point registers; each can be handled as 8 8-bit, 4 16-bit, 2 32-bit or 1 64-bit operands. An entire L1 cache line is loaded to the RF in 1-3 cycles.
Intel SSE technology: 8/16 xmm registers of 128 bits (32-bit architectures support 8 registers only); each can be handled as anything from 16 8-bit to 1 128-bit operands.
Intel AVX technology: 8/16 ymm registers of 256 bits (32-bit architectures support 8 registers only); each can be handled as anything from 32 8-bit to 1 256-bit operands.

Single Instruction Multiple Data (SIMD) (3)

Single Instruction Multiple Data (SIMD) (4)
Vector load/store instructions work only on data written in consecutive main memory addresses.
Aligned load/store instructions are faster than unaligned ones.
MMX instructions have lower latency, but SSE instructions have higher throughput.
MMX instructions are preferred for 64-bit operations.
The packing/unpacking overhead may be high.
We can use both mmx and xmm registers.
SSE memory and arithmetical instructions can execute in parallel.

Basic SSE Instructions (1)
__m128 _mm_load_ps (float * p) – Loads four SP FP values. The address must be 16-byte-aligned.
__m128 _mm_loadu_ps (float * p) – Loads four SP FP values. The address need not be 16-byte-aligned.
[Figure: memory hierarchy and an L1 cache line holding A[0]…A[7]. Loading A[0]…A[3] from a 16-byte boundary is an aligned load; loading A[1]…A[4], which crosses the boundary, is a misaligned load.]

Basic SSE Instructions (2)
__m128 _mm_load_ps(float * p) – Loads four SP FP values. The address must be 16-byte-aligned.
__m128 _mm_loadu_ps(float * p) – Loads four SP FP values. The address need not be 16-byte-aligned.
float A[N] __attribute__((aligned(16)));   // forces Modulo(Address, 16) = 0
[Figure: with A[] aligned to 16 bytes, loading A[0]…A[3] is an aligned load; loads that start in the middle of a 16-byte block are misaligned loads.]
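
For heap-allocated arrays, a small sketch of how aligned memory is typically obtained (the function name and usage are illustrative; _mm_malloc and _mm_free come with the intrinsics headers):

  #include <xmmintrin.h>

  // Allocate n floats on a 16-byte boundary so that the aligned variants
  // _mm_load_ps / _mm_store_ps can be used on the array.
  float *alloc_aligned_array(int n) {
      return (float *) _mm_malloc(n * sizeof(float), 16);
  }

  // Use _mm_load_ps(&A[i]) on such an array (i a multiple of 4),
  // and release it with _mm_free(A).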

Basic SSE Instructions (3)
void _mm_store_ps(float * p, __m128 a) – Stores four SP FP values. The address must be 16-byte-aligned.
void _mm_storeu_ps(float * p, __m128 a) – Stores four SP FP values. The address need not be 16-byte-aligned.
__m128 _mm_mul_ps(__m128 a, __m128 b) – Multiplies the four SP FP values of a and b.
__m128 _mm_mul_ss(__m128 a, __m128 b) – Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.
XMM1 = _mm_mul_ss(XMM1, XMM0)
XMM1 = _mm_mul_ps(XMM1, XMM0)
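
A small sketch combining the loads, multiplies and stores above (hypothetical function and array names; n is assumed to be a multiple of 4 and the arrays 16-byte aligned):

  #include <xmmintrin.h>

  // c[i] = a[i] * b[i], four elements at a time.
  void mul_arrays_sse(float *c, const float *a, const float *b, int n) {
      for (int i = 0; i < n; i += 4) {
          __m128 va = _mm_load_ps(&a[i]);           // aligned load of a[i..i+3]
          __m128 vb = _mm_load_ps(&b[i]);           // aligned load of b[i..i+3]
          _mm_store_ps(&c[i], _mm_mul_ps(va, vb));  // aligned store of the products
      }
  }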

Basic SSE Instructions (4)
__m128 _mm_unpackhi_ps (__m128 a, __m128 b) – Selects and interleaves the upper two SP FP values from a and b.
__m128 _mm_unpacklo_ps (__m128 a, __m128 b) – Selects and interleaves the lower two SP FP values from a and b.
XMM0 = _mm_unpacklo_ps(XMM0, XMM1)
XMM0 = _mm_unpackhi_ps(XMM0, XMM1)
__m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8) – Selects four specific SP FP values from a and b, based on the mask imm8, e.g.
num0 = _mm_shuffle_ps(num1, num2, _MM_SHUFFLE(1,0,1,0));

Basic SSE Instructions (5)
__m128 _mm_hadd_ps (__m128 a, __m128 b) – Adds adjacent vector elements (horizontal add, SSE3).
void _mm_store_ss (float * p, __m128 a) – Stores the lower SP FP value.
__m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8) – Selects four specific SP FP values from a and b, based on the mask imm8, e.g.
num0 = _mm_shuffle_ps(num1, num2, _MM_SHUFFLE(1,0,1,0));
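
A tiny sketch of how _mm_hadd_ps and _mm_store_ss combine to reduce a vector to a scalar, the same pattern used in the MVM example on the next slide (the function name is made up; _mm_hadd_ps needs the SSE3 header pmmintrin.h):

  #include <pmmintrin.h>

  // Returns v0 + v1 + v2 + v3 for v = [v0, v1, v2, v3].
  float hsum_ps(__m128 v) {
      v = _mm_hadd_ps(v, v);   // [v0+v1, v2+v3, v0+v1, v2+v3]
      v = _mm_hadd_ps(v, v);   // all four lanes hold v0+v1+v2+v3
      float s;
      _mm_store_ss(&s, v);     // store the lower lane
      return s;
  }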

Example – MVM with SSE

for (i=0; i!=N; i++){
  num3 = _mm_setzero_ps();
  for (j=0; j!=N; j+=4){
    num0 = _mm_load_ps( &A[i][j] );
    num1 = _mm_load_ps( X + j );
    num3 += _mm_mul_ps(num0, num1);
  }
  num4 = _mm_hadd_ps(num3, num3);
  num4 = _mm_hadd_ps(num4, num4);
  _mm_store_ss((float *)Y + i, num4);
}

After the j loop finishes its execution (for i = 0), num3 contains the partial sums of y0: num3 = [ya, yb, yc, yd] and y0 = ya+yb+yc+yd.
After the 1st hadd -> num4 = [ya+yb, yc+yd, ya+yb, yc+yd].
After the 2nd hadd -> num4 = [ya+yb+yc+yd, ya+yb+yc+yd, ya+yb+yc+yd, ya+yb+yc+yd].
[Figure: Y = A (N×N) × X, showing which elements of A and X are held in num0 and num1 and which element of Y is produced in num3/num4.]

Example – MVM with SSE (2)

for (i=0; i!=N; i+=2){
  num5 = _mm_setzero_ps();
  num6 = _mm_setzero_ps();
  for (j=0; j!=N; j+=4){
    num3 = _mm_load_ps( &A[i][j] );
    num4 = _mm_load_ps( X + j );
    num5 += _mm_mul_ps(num3, num4);
    num3 = _mm_load_ps( &A[i+1][j] );
    num6 += _mm_mul_ps(num3, num4);
  }
  num5 = _mm_hadd_ps(num5, num5);
  num5 = _mm_hadd_ps(num5, num5);
  _mm_store_ss((float *)Y + i, num5);
  num6 = _mm_hadd_ps(num6, num6);
  num6 = _mm_hadd_ps(num6, num6);
  _mm_store_ss((float *)Y + i + 1, num6);
}

Previous code after loop unroll (factor 2 on the i loop) and scalar replacement: the X array is accessed two times less (data reuse through num4), and we use more registers.
[Figure: Y = A (N×N) × X, with num5 and num6 accumulating two consecutive rows of Y, num3 holding elements of A and num4 holding elements of X.]

dL1 accesses: N + N² + N²/unroll_factor (stores to Y, loads from A, and loads from X, respectively)

Speeding up MVM for regular matrices using SIMD (4)
There are several ways to sum the Y array's intermediate results:
a) accumulate the four values of each XMM register, pack the results into new registers, and store each one directly
b) accumulate the four values of each XMM register and store each single value separately
c) pack the Y values into new registers in such a way that elements of different registers are added

Example – MVM with SSE (5)
Assume the previous code with unroll factor 4. Therefore, 4 XMM registers are needed for the results:
y1 = [y1a y1b y1c y1d], y2 = [y2a y2b y2c y2d], y3 = [y3a y3b y3c y3d], y4 = [y4a y4b y4c y4d]
One way: 2 hadd instructions to get Y[0] = y1a+y1b+y1c+y1d and 1 store_ss instruction per register. Normally, hadd needs more than 5 cycles, and the store_ss latency is about 2x that of store_ps.
Another way: pack the registers into
[y1a y2a y3a y4a], [y1b y2b y3b y4b], [y1c y2c y3c y4c], [y1d y2d y3d y4d]
and then 3 add_ps() and 1 store_ps instruction produce Y[0], Y[1], Y[2], Y[3] at once.

Example – MVM with SSE (6)
Assume the previous code with unroll factor 4. Therefore, 4 XMM registers are needed for the results:
y1 = [y1a y1b y1c y1d], y2 = [y2a y2b y2c y2d], y3 = [y3a y3b y3c y3d], y4 = [y4a y4b y4c y4d]
m1 = unpacklo_ps(y1, y2)  ->  m1: [y1c y2c y1d y2d]
m2 = unpacklo_ps(y3, y4)  ->  m2: [y3c y4c y3d y4d]
k1 = shuffle_ps(m1, m2, (1,0,1,0))  ->  k1: [y1c y2c y3c y4c]
k2 = shuffle_ps(m1, m2, (3,2,3,2))  ->  k2: [y1d y2d y3d y4d]
Apply the same procedure using unpackhi_ps() to get
k3: [y1a y2a y3a y4a] and k4: [y1b y2b y3b y4b].
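
A compilable sketch of this register transpose plus the final reduction (the function and variable names are made up; xmmintrin.h also provides the _MM_TRANSPOSE4_PS macro, which performs the same 4x4 transpose using unpacklo/unpackhi and shuffle internally):

  #include <xmmintrin.h>

  // y1..y4 hold the partial sums of four consecutive Y elements.
  // After the transpose, each register holds one lane from every yi,
  // so 3 add_ps and 1 aligned store_ps produce Y[0..3] (Y assumed 16-byte aligned).
  static inline void reduce_store4(__m128 y1, __m128 y2, __m128 y3, __m128 y4, float *Y) {
      _MM_TRANSPOSE4_PS(y1, y2, y3, y4);            // 4x4 transpose of the registers
      __m128 sum = _mm_add_ps(_mm_add_ps(y1, y2),   // 3 x add_ps ...
                              _mm_add_ps(y3, y4));
      _mm_store_ps(Y, sum);                         // ... and 1 store_ps for Y[0..3]
  }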

MMM – Project 1b (1)

// Tiled MMM, scalar version (B is stored transposed as Btrans)
for (jj=0; jj!=M; jj+=Tile)
 for (i=0; i!=M; i++)
  for (j=jj; j!=jj+Tile; j++)
   for (k=0; k!=M; k++)
    C[M*i+j] += A[M*i+k] * Btrans[M*j+k];

// Tiled MMM, SSE version
for (jj=0; jj!=M; jj+=Tile)
 for (i=0; i!=M; i++)
  for (j=jj; j!=jj+Tile; j++){
   num3 = _mm_setzero_ps();
   for (k=0; k!=M; k+=4){
    num0 = _mm_load_ps(A + M*i + k);
    num1 = _mm_load_ps(Btrans + M*j + k);
    num3 += _mm_mul_ps(num0, num1);
   }
   num4 = _mm_hadd_ps(num3, num3);
   num4 = _mm_hadd_ps(num4, num4);
   _mm_store_ss((float *)C + M*i + j, num4);
  }

[Figure: C = A × Btrans with the j loop tiled by Tile; XMM0 holds elements of A, XMM1 elements of Btrans, XMM2 the accumulated C element.]

MMM – Project 1b (2)
2 rows of A[ ] (2 × M × 4 bytes) and Tile columns of Btrans[ ] (Tile × M × 4 bytes) fit in the L1 data cache:
(Tile + 2) × M × 4 ≈ L1 data cache size
Important: the tiles are written in consecutive main memory locations.
L2 accesses = N² + N²/Tile + N²
[Figure: the memory hierarchy and the tiled C = A × Btrans access pattern from the previous slide.]
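
For a concrete feel of the tile size (the numbers are illustrative assumptions, not from the slides): with a 32 KB L1 data cache and M = 1024 single-precision elements per row, (Tile + 2) × 1024 × 4 ≈ 32768 gives Tile + 2 ≈ 8, i.e. Tile ≈ 6.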

If-condition on SSE

for (i=0; i < n; i++)
  if ( x[i] > 2 || x[i] < -2 )
    a[i] += x[i];

[Figure: lane-wise example with thresholds 2 and -2. For x = [5, -3, 1, 1] the condition holds only in the first two lanes, so the result is [a[i]+x[i], a[i+1]+x[i+1], a[i+2], a[i+3]].]
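
A sketch of one common way to vectorize this branch with SSE compare and bitwise-mask intrinsics (this is an assumed implementation, not necessarily the exact code intended by the slide; the function name is made up and n is assumed to be a multiple of 4):

  #include <xmmintrin.h>

  void cond_add_sse(float *a, const float *x, int n) {
      __m128 hi = _mm_set1_ps( 2.0f);
      __m128 lo = _mm_set1_ps(-2.0f);
      for (int i = 0; i < n; i += 4) {
          __m128 xv = _mm_loadu_ps(&x[i]);
          __m128 av = _mm_loadu_ps(&a[i]);
          // Per-lane mask: all-ones where (x > 2) || (x < -2), all-zeros elsewhere.
          __m128 mask = _mm_or_ps(_mm_cmpgt_ps(xv, hi), _mm_cmplt_ps(xv, lo));
          // Keep x only in the lanes where the condition holds, then add.
          __m128 add  = _mm_and_ps(mask, xv);
          _mm_storeu_ps(&a[i], _mm_add_ps(av, add));
      }
  }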

Thank you Date 22/11/2017