MMX-accelerated Matrix Multiplication

Name: MMX-accelerated Matrix Multiplication
Uploaded: 2017-07-30T09:26:54+00:00
Duration: PTM11S23
Channel: Dorcas Pope
Description: MMX-accelerated Matrix Multiplication

MMX-accelerated Matrix Multiplication
Assembly Language & System Software National Chiao-Tung Univ.

Motivation
Pentium processors support SIMD instructions for vector operations.
Multiple operations can be performed in parallel.
In this lecture, we shall show how to accelerate matrix multiplication by using MMX instructions.

Naïve Matrix Multiplication
    int16 vect[Y_SIZE];
    int16 matr[Y_SIZE][X_SIZE];
    int16 result[X_SIZE];
    int32 accum;

    for (i = 0; i < X_SIZE; i++) {
        accum = 0;
        for (j = 0; j < Y_SIZE; j++)
            accum += vect[j] * matr[j][i];
        result[i] = accum;
    }

MMX
A collection of:
  new SIMD instructions
  new registers mm0~mm7, each 64 bits wide
MMX is primarily for integer vector operations.

MMX™ registers
[Figure: data layouts. char a (8 bits); int b (32 bits, bytes b1 b2 b3 b4); an MMX register (64 bits) shown against an 80-bit floating-point register; memory at p and p+8 holding packed 16-bit words, four per 64-bit register.]
Speaker note: explain what a char, a 4-byte integer, and an 8-byte packed integer look like.

MMX™ instructions
movd, movq: Move Doubleword, Move Quadword
punpcklbw, punpcklwd, punpckldq: Unpack Low Data and Interleave (byte, word, doubleword)
punpckhwd: Unpack High Data and Interleave (word)
Naming convention: punpck + (h = high, l = low) + (source unit)(result unit, twice as wide as the source unit)
  PUNPCKLBW: low, byte to word
  PUNPCKLWD: low, word to doubleword
  PUNPCKLDQ: low, doubleword to quadword
  PUNPCKLQDQ: low, quadword to double quadword (SSE2 only)
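The interleaving done by the unpack instructions is easier to see in scalar form. The following is a minimal sketch in C, not MMX code: the type mmx_words and the model_* functions are hypothetical names introduced only for illustration, and element w[0] stands for the least significant word of the register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of a 64-bit MMX register seen as four
   16-bit words; w[0] is the least significant word. */
typedef struct { uint16_t w[4]; } mmx_words;

/* punpcklwd dst, src: interleave the LOW two words of dst and src. */
mmx_words model_punpcklwd(mmx_words dst, mmx_words src) {
    mmx_words r = {{ dst.w[0], src.w[0], dst.w[1], src.w[1] }};
    return r;
}

/* punpckhwd dst, src: interleave the HIGH two words of dst and src. */
mmx_words model_punpckhwd(mmx_words dst, mmx_words src) {
    mmx_words r = {{ dst.w[2], src.w[2], dst.w[3], src.w[3] }};
    return r;
}

/* punpckldq dst, src: low doubleword of dst, then low doubleword of src. */
mmx_words model_punpckldq(mmx_words dst, mmx_words src) {
    mmx_words r = {{ dst.w[0], dst.w[1], src.w[0], src.w[1] }};
    return r;
}

/* Small self-check that doubles as a usage example. */
void unpack_self_check(void) {
    mmx_words a = {{ 1, 2, 3, 4 }};
    mmx_words d = model_punpckldq(a, a);   /* duplicates the low doubleword */
    assert(d.w[0] == 1 && d.w[1] == 2 && d.w[2] == 1 && d.w[3] == 2);
}
```

In every case the result unit is twice as wide as the source unit, which is what the two-letter suffix (bw, wd, dq) encodes.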

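MMX™ instructions
pmaddwd: Multiply and Add Packed Integers (word). Multiplies four pairs of signed 16-bit words and adds adjacent products, giving two 32-bit sums.
paddd: Add Packed Integers (doubleword). Adds two pairs of 32-bit integers.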
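As a rough scalar illustration of these two instructions, the sketch below models them in plain C. The function names model_pmaddwd and model_paddd are hypothetical, index 0 stands for the least significant element, and the casts through uint32_t assume the usual two's-complement wraparound when converting back to int32_t.

```c
#include <assert.h>
#include <stdint.h>

/* pmaddwd: multiply four signed 16-bit pairs, then add adjacent
   32-bit products, producing two 32-bit sums. */
void model_pmaddwd(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    assert(a && b && out);
    for (int k = 0; k < 2; k++) {
        int32_t p0 = (int32_t)a[2 * k]     * b[2 * k];
        int32_t p1 = (int32_t)a[2 * k + 1] * b[2 * k + 1];
        /* Add in unsigned arithmetic so the sum wraps instead of overflowing. */
        out[k] = (int32_t)((uint32_t)p0 + (uint32_t)p1);
    }
}

/* paddd: add two packed 32-bit integers with wraparound. */
void model_paddd(int32_t acc[2], const int32_t x[2]) {
    assert(acc && x);
    for (int k = 0; k < 2; k++)
        acc[k] = (int32_t)((uint32_t)acc[k] + (uint32_t)x[k]);
}
```

With a = (1, 2, 3, 4) and b = (5, 6, 7, 8), model_pmaddwd yields 1*5 + 2*6 = 17 and 3*7 + 4*8 = 53, i.e. two two-element dot products from a single instruction.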

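MMX™ for Matrix Multiply
One matrix multiplication is divided into a series of multiplications of a 1*2 vector with a 2*4 sub-matrix.

[Figure: the input vector is addressed through [esi] and the sub-matrix through [edx]; each matrix line holds ecx elements.]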

    /* Pseudocode: accum is treated as a vector of four 32-bit sums */
    int16 vect[Y_SIZE];
    int16 matr[Y_SIZE][X_SIZE];
    int16 result[X_SIZE];
    int32 accum[4];

    for (i = 0; i < X_SIZE; i += 4) {
        accum = { 0, 0, 0, 0 };
        for (j = 0; j < Y_SIZE; j += 2)
            accum += MULT4x2(&vect[j], &matr[j][i]);
        result[i..i + 3] = accum;
    }
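The blocked loop above can be written as compilable C by substituting a scalar stand-in for MULT4x2. This is only a sketch under stated assumptions: the sizes X_SIZE = 8 and Y_SIZE = 4 are arbitrary example values, the matrix is passed as a flat row-major array, and mult4x2 and vec_mat_blocked are hypothetical names, not the MMX routine from the slides.

```c
#include <assert.h>
#include <stdint.h>

enum { X_SIZE = 8, Y_SIZE = 4 };   /* example sizes, chosen for illustration */

/* Scalar stand-in for MULT4x2: multiplies the 1x2 vector v by the 2x4
   block starting at m and adds the four products to accum.
   row_elems is the number of elements per matrix line. */
static void mult4x2(const int16_t *v, const int16_t *m, int row_elems,
                    int32_t accum[4]) {
    for (int c = 0; c < 4; c++)
        accum[c] += (int32_t)v[0] * m[c] + (int32_t)v[1] * m[row_elems + c];
}

/* result = vect * matr, four columns and two rows at a time.
   matr is row-major: element (j, i) is matr[j * X_SIZE + i]. */
void vec_mat_blocked(const int16_t *vect, const int16_t *matr, int16_t *result) {
    assert(X_SIZE % 4 == 0 && Y_SIZE % 2 == 0);
    for (int i = 0; i < X_SIZE; i += 4) {
        int32_t accum[4] = { 0, 0, 0, 0 };
        for (int j = 0; j < Y_SIZE; j += 2)
            mult4x2(&vect[j], &matr[j * X_SIZE + i], X_SIZE, accum);
        for (int c = 0; c < 4; c++)
            result[i + c] = (int16_t)accum[c];   /* plain narrowing, no saturation */
    }
}
```

The sketch requires X_SIZE to be a multiple of 4 and Y_SIZE a multiple of 2; other sizes would need padding or a scalar clean-up loop.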

MMX™ code for MULT4x2
    movd      mm7, [esi]        ; Load two elements from input vector
    punpckldq mm7, mm7          ; Duplicate input vector: x0:x1:x0:x1
    movq      mm0, [edx+0]      ; Load first line of matrix (4 elements)
    movq      mm6, [edx+2*ecx]  ; Load second line of matrix (4 elements)
    movq      mm1, mm0          ; Transpose matrix to column presentation
    punpcklwd mm0, mm6          ; mm0 keeps columns 0 and 1
    punpckhwd mm1, mm6          ; mm1 keeps columns 2 and 3
    pmaddwd   mm0, mm7          ; Multiply and add the 1st and 2nd column
    pmaddwd   mm1, mm7          ; Multiply and add the 3rd and 4th column
    paddd     mm2, mm0          ; Accumulate 32-bit results for col. 0/1
    paddd     mm3, mm1          ; Accumulate 32-bit results for col. 2/3
The offset is 2*ecx because each element is 2 bytes.
Note: there is an error in this piece of code (the first movd instruction…). It is okay to replace it with movq, since the extra upper doubleword that movq loads is overwritten by the punpckldq that follows.

MMX™ code for MULT4x2
Matrix states in multiplication
    movd      mm7, [esi]   ; Load two elements from input vector
    punpckldq mm7, mm7     ; Duplicate input vector: x0:x1:x0:x1
punpckldq interleaves the low halves of its operands in doubleword (DWORD) units.

MMX™ code for MULT4x2
    movq mm0, [edx+0]       ; Load first line of matrix
The 4x2 block is addressed through register edx.
    movq mm6, [edx+2*ecx]   ; Load second line of matrix
ecx contains the number of elements per matrix line.

MMX™ code for MULT4x2
    movq      mm1, mm0   ; Transpose matrix to column presentation
    punpcklwd mm0, mm6   ; mm0 keeps columns 0 and 1
    punpckhwd mm1, mm6   ; mm1 keeps columns 2 and 3

MMX™ code for MULT4x2
    pmaddwd mm0, mm7   ; Multiply and add the 1st and 2nd column
    pmaddwd mm1, mm7   ; Multiply and add the 3rd and 4th column

MMX™ code for MULT4x2
    paddd mm2, mm0   ; Accumulate 32-bit results for col. 0/1
    paddd mm3, mm1   ; Accumulate 32-bit results for col. 2/3
mm2 and mm3 are the two column accumulators; each holds two 32-bit integers.
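To see why this instruction sequence produces four column sums, one MULT4x2 step can be traced with ordinary arrays. The sketch below is a scalar walk-through in C, not a replacement for the MMX code: model_mult4x2 is a hypothetical name, the local arrays mm0, mm1 and mm7 merely mirror the register contents (index 0 is the least significant word), and 32-bit overflow is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* x:    the two vector elements loaded from [esi]
   row0: four elements of the first matrix line  ([edx])
   row1: four elements of the second matrix line ([edx + 2*ecx])
   acc:  four running sums; acc[0..1] play the role of mm2,
         acc[2..3] the role of mm3. */
void model_mult4x2(const int16_t x[2], const int16_t row0[4],
                   const int16_t row1[4], int32_t acc[4]) {
    assert(x && row0 && row1 && acc);

    /* movd + punpckldq: mm7 = x0:x1:x0:x1 */
    int16_t mm7[4] = { x[0], x[1], x[0], x[1] };

    /* punpcklwd / punpckhwd: interleave the two lines so that the two
       elements of each column sit next to each other. */
    int16_t mm0[4] = { row0[0], row1[0], row0[1], row1[1] };   /* columns 0, 1 */
    int16_t mm1[4] = { row0[2], row1[2], row0[3], row1[3] };   /* columns 2, 3 */

    /* pmaddwd + paddd: multiply word pairs, add neighbours, accumulate. */
    for (int k = 0; k < 2; k++) {
        acc[k]     += (int32_t)mm0[2 * k]     * mm7[2 * k]
                    + (int32_t)mm0[2 * k + 1] * mm7[2 * k + 1];
        acc[2 + k] += (int32_t)mm1[2 * k]     * mm7[2 * k]
                    + (int32_t)mm1[2 * k + 1] * mm7[2 * k + 1];
    }
}
```

For x = (2, 3), row0 = (1, 2, 3, 4) and row1 = (5, 6, 7, 8), column 0 contributes 1*2 + 5*3 = 17 and column 3 contributes 4*2 + 8*3 = 32. The accumulators have to start at zero for each block of four result elements; in the MMX version that would mean clearing mm2 and mm3 before the inner loop.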

MMX™ code for MULT4x2
Packing and storing results
    packssdw  mm2, mm2   ; Pack the results for columns 0 and 1 to 16 bits
    packssdw  mm3, mm3   ; Pack the results for columns 2 and 3 to 16 bits
    punpckldq mm2, mm3   ; All four 16-bit results in one register (mm2)
    movq      [edi], mm2 ; Store four results into output vector

MMX™ code for MULT4x2
    packssdw mm2, mm2
    packssdw mm3, mm3
Convert (shrink) signed DWORDs into WORDs with signed saturation: the four 32-bit integers are shrunk into four 16-bit integers, which are then written back to memory.

punpckldq interleaves in doubleword (D) units into a quadword (Q).
The source operand (mm3) is placed first. Little endian: Y, Z, W, V
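The packing step can also be modeled in scalar C. This is a minimal sketch with hypothetical names (saturate16, model_pack_results); the arrays stand for register contents with index 0 as the least significant element.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturation, as packssdw applies to each doubleword. */
static int16_t saturate16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* mm2 holds the sums for columns 0 and 1, mm3 those for columns 2 and 3.
   out receives the four 16-bit results in column order. */
void model_pack_results(const int32_t mm2[2], const int32_t mm3[2],
                        int16_t out[4]) {
    assert(mm2 && mm3 && out);

    /* packssdw mm2, mm2 -> c0:c1:c0:c1 */
    int16_t p2[4] = { saturate16(mm2[0]), saturate16(mm2[1]),
                      saturate16(mm2[0]), saturate16(mm2[1]) };
    /* packssdw mm3, mm3 -> c2:c3:c2:c3 */
    int16_t p3[4] = { saturate16(mm3[0]), saturate16(mm3[1]),
                      saturate16(mm3[0]), saturate16(mm3[1]) };

    /* punpckldq mm2, mm3 -> low doubleword of p2, then low doubleword of p3 */
    out[0] = p2[0];
    out[1] = p2[1];
    out[2] = p3[0];
    out[3] = p3[1];
}
```

Note the difference from a plain C assignment of a 32-bit value to a 16-bit variable: packssdw clamps out-of-range sums to 32767 or -32768 instead of discarding the upper bits, so the MMX and naïve versions can disagree whenever a sum does not fit in 16 bits.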

Memory Alignment
Memory operands for MMX should be aligned at 8-byte boundaries (16-byte boundaries for SSE2).
    .data
    ALIGN 8
    myBuf DWORD 128 DUP(?)
An alignment larger than 16 requires the segment itself to be aligned at a boundary at least as large. For example, if the data segment is aligned at 128 bytes, then aligning data objects at boundaries no larger than 128 bytes is okay. However, the default alignment of the data segment is a paragraph, i.e., 16 bytes, so the alignment for data objects can be 1, 2, 4, 8, or 16. If coarser-grained alignment is needed, align the data segment at a larger boundary.
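For the C versions of the exercise, the same idea can be expressed with C11 alignment specifiers. The sketch below is an illustration only: my_buf mirrors the myBuf example above, and is_aligned is a hypothetical helper for checking an address at run time.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* 128 doublewords, aligned at an 8-byte boundary like myBuf above. */
alignas(8) uint32_t my_buf[128];

/* Returns nonzero if p lies on the given power-of-two boundary. */
int is_aligned(const void *p, uintptr_t boundary) {
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((uintptr_t)p & (boundary - 1)) == 0;
}
```

A call such as is_aligned(my_buf, 8) can be placed in a debug build before a buffer is handed to the MMX routine; for SSE2 the boundary would be 16.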

CPU-Mode Directives
In Irvine32.inc, the CPU mode is specified as .686P.
MMX has been supported since the Pentium.
Additionally, you should specify .mmx to use MMX instructions.
If you want to use SSE2, specify .xmm.

Debugging with MMX
MMX/SSE2 registers are hidden in the debugger unless you explicitly choose to display them.

High-Resolution Counter
The PC clock ticks about 18.2 times every second, which is a low resolution.
Use the CPU internal clock counter for high-accuracy performance measurement.

RDTSC
Read the CPU cycle counter.
The counter increases by 1 every clock cycle, i.e. about 3 × 10^9 times every second for a 3 GHz CPU.
The result is put in EDX:EAX.
    readTSC PROC
        rdtsc
        ret
    readTSC ENDP

To calculate the time spent in a specific interval:
Record the starting time and the finish time.
Compute finish - start.
Time stamps are 64 bits wide, but the SUB instruction handles operands of up to 32 bits.
Use SBB (subtract with borrow) for the upper half.
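The SUB/SBB pair can be modeled in C to make the borrow explicit. In this sketch the struct tsc_t and the functions tsc_to_u64 and tsc_elapsed are hypothetical names; lo corresponds to EAX and hi to EDX as returned by rdtsc.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t lo, hi; } tsc_t;   /* lo = EAX, hi = EDX */

uint64_t tsc_to_u64(tsc_t t) {
    return ((uint64_t)t.hi << 32) | t.lo;
}

/* finish - start, computed on 32-bit halves the way SUB and SBB do. */
tsc_t tsc_elapsed(tsc_t start, tsc_t finish) {
    tsc_t d;
    uint32_t borrow = finish.lo < start.lo;    /* carry flag left by SUB */
    d.lo = finish.lo - start.lo;               /* SUB on the low halves  */
    d.hi = finish.hi - start.hi - borrow;      /* SBB on the high halves */

    /* Cross-check against native 64-bit arithmetic. */
    assert(tsc_to_u64(d) == tsc_to_u64(finish) - tsc_to_u64(start));
    return d;
}
```

For example, with start = 0x1FFFFFFF0 and finish = 0x200000010 the low subtraction wraps to 0x20 and sets the borrow, so the high half becomes 2 - 1 - 1 = 0 and the elapsed count is 0x20 cycles.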

SSE2
SIMD instructions extending MMX
Basically SSE2 and MMX are the same, except:
Registers for SSE2 are 128 bits instead of 64 bits, named xmm0~xmm7
  8 16-bit integers fit in one single register
  xmm8~xmm15 are accessible only on 64-bit processors (in 64-bit mode)
Memory operations should be aligned at 16-byte boundaries
Use the .xmm directive to enable SSE2 for MASM
Use MOVDQA (aligned) or MOVDQU (unaligned) instead of MOVQ for data movement

From MMX to SSE2
Change the multiplication for 1*2 x 2*4 matrices.
The rest is almost the same!

Things you have to do…
Understand the code of MULT4x2
Extend the logic to handle generic matrix multiplication
Understand alignment of memory operations
Remember to put an “EMMS” instruction at the end of your program
  Not required if you are using SSE2
Implement 1) naïve, 2) MMX-based, and 3) SSE2-based algorithms and measure their performance
