C.E. Goutis, V.I. Kelefouras
University of Patras, Department of Electrical and Computer Engineering, VLSI lab
Date: 20/11/2015
Compilers for Embedded Systems, Lecture 7
Integrated Systems of Hardware and Software
Single Instruction Multiple Data (SIMD) (1)
o MMX technology
– 8 mmx registers of 64 bits each
– the mmx registers are an extension of the floating point registers (they are mapped onto them)
– each register can be handled as 8 8-bit, 4 16-bit, 2 32-bit or 1 64-bit operands
– 64 bits of data from the L1 cache are loaded to the register file (RF) in 1 clk
o SSE technology
– 8/16 xmm registers of 128 bits each (32-bit architectures support 8 registers only)
– each register can be handled as anything from 16 8-bit to 1 128-bit operands
– 128 bits of data from the L1 cache are loaded to the RF in 1 clk
o AVX technology
– 8/16 ymm registers of 256 bits each (32-bit architectures support 8 registers only)
– each register can be handled as anything from 32 8-bit to 1 256-bit operands
Single Instruction Multiple Data (SIMD) (2)
o SSE load/store instructions work only on data stored in consecutive main memory addresses
o Aligned load/store instructions are faster than unaligned ones
o MMX instructions have lower latency, but SSE instructions have higher throughput
o MMX instructions are preferred for 64-bit operations
– the packing/unpacking overhead may be high
o Both mmx and xmm registers can be used together
o SSE memory and arithmetic instructions are executed in parallel
VLSI lab, C.E. Goutis, V.I. Kelefouras
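The aligned/unaligned distinction can be illustrated with the SSE intrinsics from `<xmmintrin.h>`. This is a minimal sketch assuming an x86 compiler; the helper name `sum4_at` is hypothetical and not part of the lecture. It picks the aligned load only when the address is a multiple of 16 bytes, which is the precondition of `_mm_load_ps`.

```c
#include <stdint.h>     /* uintptr_t */
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_loadu_ps */

/* Hypothetical helper: adds the four consecutive floats starting at p,
   fetching them with a single 128-bit load. */
static float sum4_at(const float *p)
{
    __m128 v;
    if (((uintptr_t)p & 15u) == 0)
        v = _mm_load_ps(p);    /* aligned load: address must be a multiple of 16 */
    else
        v = _mm_loadu_ps(p);   /* unaligned load: any address is allowed */

    float lane[4];
    _mm_storeu_ps(lane, v);    /* copy the four lanes back to memory */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

In practice the alignment is normally guaranteed when the arrays are allocated, so that the aligned form can be used unconditionally; the run-time test here only makes the two cases visible.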
Speeding up MVM for regular matrices using SIMD (1)
[Figure: Y = A × X, with A of size N×N. The partial results y0 ... y5 are kept in XMM0 ... XMM5; XMM6 and XMM7 hold the A and X operands.]
o We select the schedule with the optimum production-consumption and the sub-optimum data reuse
– we use (Regs − 2) registers for Y, 1 register for A and 1 register for X
– m = Regs − 2 (7)
o The schedule with the optimum production-consumption of Y is the optimum one
– each register of Y contains more than one Y value; these values have to be summed, unpacked and stored into memory. Thus, by maximizing the production-consumption, the number of SSE instructions (both load/store and arithmetic) is minimized
o The schedule is shown below
// sum y0, y1, y2, y3, y4, y5 and store the results into Y[]
count += 6;
}
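Only the tail of the listing survives above, so the following is a hedged reconstruction of the idea rather than the authors' code. It assumes 8 XMM registers (six accumulators for Y, one register for A, one for X), a row-major float matrix, a row count that is a multiple of 6 and a column count that is a multiple of 4. The names `mvm6_rows` and `mvm` are hypothetical, and the final sums are stored one value at a time.

```c
#include <stddef.h>     /* size_t */
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical kernel: y[0..5] = (six consecutive rows of A) * x.
   A is row-major with n floats per row; n must be a multiple of 4. */
static void mvm6_rows(const float *A, const float *x, float *y, int n)
{
    __m128 acc[6];                          /* plays the role of XMM0..XMM5 */
    for (int r = 0; r < 6; r++)
        acc[r] = _mm_setzero_ps();

    for (int j = 0; j < n; j += 4) {
        __m128 xv = _mm_loadu_ps(x + j);    /* one register for X */
        for (int r = 0; r < 6; r++) {
            __m128 av = _mm_loadu_ps(A + (size_t)r * n + j);  /* one register for A */
            acc[r] = _mm_add_ps(acc[r], _mm_mul_ps(av, xv));
        }
    }

    /* sum the four partial values of each accumulator and store into y[] */
    for (int r = 0; r < 6; r++) {
        float lane[4];
        _mm_storeu_ps(lane, acc[r]);
        y[r] = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    }
}

/* Hypothetical driver: rows must be a multiple of 6. */
static void mvm(const float *A, const float *x, float *y, int rows, int n)
{
    for (int count = 0; count < rows; count += 6)
        mvm6_rows(A + (size_t)count * n, x, y + count, n);
}
```

Each loaded X register is reused by all six rows before the next one is fetched, and each Y accumulator stays in a register until its row is finished. Whether the compiler really keeps `acc[]` in XMM registers depends on the optimization level.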
Speeding up MVM for regular matrices using SIMD (3)
o There are several ways to sum the XMM0:XMM5 data
1. accumulate the four values of each XMM register, pack the results into new registers and store each packed register directly
2. accumulate the four values of each XMM register and store each single value separately
3. pack the XMM0:XMM5 values into new registers in such a way that elements of different registers are added together
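The third option can be sketched for four of the accumulators with a 4×4 transpose followed by ordinary vertical additions. This is one possible realisation under the assumption of an x86 compiler with `<xmmintrin.h>`; the slides do not say which shuffle sequence the authors use, and the names `hsum4` and `store_sums4` are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics and the _MM_TRANSPOSE4_PS macro */

/* Hypothetical helper: returns { sum(r0), sum(r1), sum(r2), sum(r3) },
   where sum(r) adds the four lanes of register r. */
static __m128 hsum4(__m128 r0, __m128 r1, __m128 r2, __m128 r3)
{
    /* After the transpose, r0 holds lane 0 of every input register,
       r1 holds lane 1 of every input register, and so on. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    /* Adding the transposed registers lane by lane therefore adds
       elements that originally belonged to the same register. */
    return _mm_add_ps(_mm_add_ps(r0, r1), _mm_add_ps(r2, r3));
}

/* Stores the four sums with a single 128-bit store. */
static void store_sums4(float *y, __m128 r0, __m128 r1, __m128 r2, __m128 r3)
{
    _mm_storeu_ps(y, hsum4(r0, r1, r2, r3));
}
```

With six accumulators, as in the schedule above, the remaining two registers still need a separate reduction, for example one of the first two options.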
Speeding up MVM for regular matrices using SIMD (4)
[Figure: panels a), b), c)]
Speeding up MVM for regular matrices using SIMD (5)
o If the m rows of A and X do not fit in the L1 data cache, tiling for L1 is applied
– in many general purpose processors, tiling for the data cache does not pay off (the extra addressing and load/store instructions degrade performance)
– tiling for L1 only reduces the number of cache misses of the X array, which is small anyway (Y is stored into memory once and A is fetched only once)
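A generic column-tiled MVM, written in plain scalar C, shows where the extra instructions come from. It is a sketch under assumptions, not the authors' routine: the function name `mvm_tiled` and the `tile` parameter are hypothetical, and the tile width would have to be chosen so that one tile of X plus the rows being processed fit in L1.

```c
#include <stddef.h>  /* size_t */

/* Hypothetical sketch: y = A * x with the columns split into tiles of
   `tile` elements, so that each tile of x is reused by all rows while it
   is still in the L1 data cache. A is row-major, rows x n. */
static void mvm_tiled(const float *A, const float *x, float *y,
                      int rows, int n, int tile)
{
    for (int i = 0; i < rows; i++)
        y[i] = 0.0f;

    for (int jj = 0; jj < n; jj += tile) {
        int jend = (jj + tile < n) ? jj + tile : n;   /* last tile may be short */
        for (int i = 0; i < rows; i++) {
            float sum = y[i];                         /* extra load per tile */
            for (int j = jj; j < jend; j++)
                sum += A[(size_t)i * n + j] * x[j];
            y[i] = sum;                               /* extra store per tile */
        }
    }
}
```

Every element of A is still read exactly once, so only the X accesses benefit, while the partial sums of y are reloaded and stored once per tile. That extra load/store and addressing work is the overhead the slide refers to.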
Speeding up BTMVM using SIMD
o Regarding SIMD, opt1 and opt2 are not good solutions
o The structure of the BT matrix cannot be profitably exploited by a SIMD architecture, for two basic reasons
1. loading address-aligned data is faster than loading unaligned data
2. loading 128 bits into one XMM register at once is faster than loading fewer bits and applying shuffle operations
o To implement opt1 with SSE, four copies of each element of X (for 4-byte floats) have to be stored into one XMM register; this needs more than one SSE load instruction
o To implement opt2 with SSE, more shuffle and pack operations are needed
o Thus, reusing the elements of the A array does not lead to high performance when SSE instructions are used
Speeding up BTMVM and TMVM using SIMD (1)
[Figure: Y = Arow × X, where Arow is a 1-d array of 2N − 1 elements. The partial results y0 ... y5 are kept in XMM0 ... XMM5; XMM6 and XMM7 hold the Arow and X operands.]
o For T and BT matrices, the same schedule as for regular matrices is used, but matrix A is stored as a 1-d array (Arow) of size (2 × N − 1)
o The first N elements of Arow are those of the first column of A in reversed order, i.e. N, N-1, ..., 1, and the next N-1 elements are the elements of the first row of A except for the first one
– it needs a larger number of SSE instructions than MVM, because the elements of A are fetched from unaligned memory locations
– the size of A is much smaller
– the m different elements fetched are in consecutive memory locations (smaller number of cache misses)
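The Arow layout described above can be written down in a few lines of scalar C. The function names `build_arow` and `tmvm` are hypothetical; the sketch assumes a plain (non-block) N×N Toeplitz matrix given by its first column and first row.

```c
/* Hypothetical helper: builds Arow (2n - 1 elements) from the first
   column and first row of an n x n Toeplitz matrix. */
static void build_arow(const float *first_col, const float *first_row,
                       int n, float *arow)
{
    for (int k = 0; k < n; k++)
        arow[k] = first_col[n - 1 - k];   /* first column, reversed */
    for (int j = 1; j < n; j++)
        arow[n - 1 + j] = first_row[j];   /* first row without its first element */
}

/* y = T * x. Row i of T is the contiguous slice arow[n-1-i .. 2n-2-i]. */
static void tmvm(const float *arow, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++) {
        const float *row = arow + (n - 1 - i);   /* start of row i inside Arow */
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += row[j] * x[j];
        y[i] = sum;
    }
}
```

Consecutive rows start only one float (4 bytes) apart inside Arow, so at most one row start in four can be 16-byte aligned. A vectorised version therefore has to use unaligned loads for most rows, while the whole matrix occupies only 2N − 1 elements.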
Speeding up BTMVM and TMVM using SIMD (2)
o When (m + 1) × N ≤ L1 holds, we use MVM instead of BTMVM or TMVM (the Toeplitz symmetry is not utilized)
– MVM is faster than BTMVM/TMVM because it needs a lower number of SSE instructions
o When (m + 1) × N > L1 holds, we use the BTMVM/TMVM routine
– BTMVM/TMVM is faster than MVM because MVM's advantage of fewer SSE instructions is outweighed by its higher number of data cache misses
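The selection rule amounts to a single comparison. The sketch below assumes that L1 in the inequality is the data cache capacity expressed in matrix elements (the slide does not state the unit); the enum and function names are hypothetical.

```c
#include <stddef.h>  /* size_t */

typedef enum { USE_MVM, USE_TMVM } mvm_routine;

/* Hypothetical helper: m rows of A plus the vector X, N elements each,
   must fit in L1 for the plain MVM routine to be chosen.
   l1_elems is the L1 data cache capacity in elements (assumed unit). */
static mvm_routine choose_routine(size_t m, size_t n, size_t l1_elems)
{
    return ((m + 1) * n <= l1_elems) ? USE_MVM : USE_TMVM;
}
```

For example, a 32 KiB L1 data cache holds 8192 4-byte floats, so with m = 6 the plain MVM routine would be selected for N up to 1170.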
// sum y0, y1, y2, y3, y4, y5 and store the results into Y[]
count += 6;
}
Speeding up BTMVM and TMVM using SIMD (4)
Speeding up BTMVM and TMVM using SIMD (5)
o For Toeplitz matrices, the MVM problem can also be implemented using the FFT algorithm
– we use three FFTs and one vector multiplication
– the cost is 15n log(2n) + 18n operations, i.e. O(n log n), which is lower than O(n²)
o For medium/large input sizes, MVM using FFT needs a lower number of instructions
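The count of three FFTs and one vector multiplication matches the standard circulant-embedding formulation, which is assumed here to be the method the slide refers to. The symbols t, c and x̂ are introduced only for this sketch: the n×n Toeplitz matrix is embedded in a 2n×2n circulant matrix whose first column is c, and x is zero-padded to length 2n.

```latex
\begin{align*}
T_{ij} &= t_{i-j}, \qquad 0 \le i, j \le n-1 \\[4pt]
c_k &=
  \begin{cases}
    t_k,      & 0 \le k \le n-1 \\
    0,        & k = n \\
    t_{k-2n}, & n+1 \le k \le 2n-1
  \end{cases} \\[4pt]
\hat{x} &= (x_0, \dots, x_{n-1}, \underbrace{0, \dots, 0}_{n})^{T} \\[4pt]
y_i &= \Big[ \mathrm{IFFT}\big( \mathrm{FFT}(c) \odot \mathrm{FFT}(\hat{x}) \big) \Big]_i,
  \qquad 0 \le i \le n-1
\end{align*}
```

The two forward FFTs and the inverse FFT all have length 2n, which is consistent with the log(2n) factor in the operation count, and the element-wise product ⊙ is the single vector multiplication.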