
1 ECE 1773- Spring ‘02 Some material © Hill, Sohi, Smith, Wood (UW-Madison) © A. Moshovos
Multimedia ISA Extensions
Intel’s MMX
–The Basics
–Instruction Set
–Examples
–Integration into Pentium
–Relationship to vector ISAs
AMD’s 3DNow!
Intel’s ISSE (a.k.a. KNI)

2 MMX: Basics
Multimedia applications are becoming popular
Are current ISAs a good match for them?
Methodology:
–Consider a number of “typical” applications
–Can we do better?
–Cost vs. performance vs. utility tradeoffs
Net Result: Intel’s MMX
Can also be viewed as an attempt to maintain market share
–If people are going to use these kinds of applications, we had better support them

3 Multimedia Applications
Most multimedia apps have lots of parallelism:
–for I = here to infinity
  out[I] = in_a[I] * in_b[I]
–At runtime:
  out[0] = in_a[0] * in_b[0]
  out[1] = in_a[1] * in_b[1]
  out[2] = in_a[2] * in_b[2]
  out[3] = in_a[3] * in_b[3]
  …..
Also, work on short integers:
–in_a[i] is 0 to 255, for example (8-bit color)
–or 0 to 64K (16-bit audio)

4 Observations
32-bit registers are wasted
–only using part of them, and we know it
–ALUs underutilized, and we know it
Instruction specification is inefficient
–even though we know that many identical operations will be performed, we still have to specify each of them individually
–Instruction bandwidth
–Discovering Parallelism
–Memory Ports?
  Could read four elements of an array with one 32-bit load
  Same for stores
The hardware will have a hard time discovering this
–Coalescing and dependences

5 MMX Contd.
Can do better than traditional ISA
–new data types
–new instructions
Pack data in 64-bit words
–bytes
–“words” (16 bits)
–“double words” (32 bits)
Operate on packed data like short vectors (arrays)

6 MMX: Example
Up to 8 operations (64-bit) go in parallel
–Potential improvement: 8x
–In practice less, but still good
Besides, another reason to think your machine is obsolete
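To make the "8 operations in parallel" idea concrete, here is a minimal portable C sketch that emulates a wraparound packed-byte add on eight 8-bit lanes held in one 64-bit word. It only illustrates the semantics, in the spirit of the MMX PADDB instruction; the function name `packed_add_u8` is hypothetical, and real MMX hardware does this in a single instruction rather than with masks.

```c
#include <stdint.h>

/* Hypothetical helper: adds eight unsigned bytes lane by lane with
 * wraparound, keeping carries from crossing lane boundaries. */
uint64_t packed_add_u8(uint64_t a, uint64_t b)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL; /* low 7 bits of each lane */
    const uint64_t top  = 0x8080808080808080ULL; /* top bit of each lane */

    /* Sum the low 7 bits; the carry lands in bit 7 of the same lane. */
    uint64_t sum = (a & low7) + (b & low7);

    /* Add the top bits with XOR, which discards the carry out of the lane. */
    return sum ^ ((a ^ b) & top);
}
```

An ordinary 64-bit add would let the carry out of the lowest byte (0xFF + 0x01) spill into its neighbor; the lane masks are what make the eight additions independent.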

7 MMX Data Types

8 A bit of History
This is a special case of SIMD
–Single Instruction
–Multiple Data
One instruction specifies that an operation should be applied:
–Repeatedly
–To possibly different data elements each time
–Each of these operations is independent
Conventional ISA is SISD
–Single Instruction/Single Data
First used in Livermore S-1 (> 25 years ago)

9 MMX: Instruction Set
57 new instructions
Integer Arithmetic
–add/sub/mul
–multiply add
–signed/unsigned
–saturating/wraparound
Shifts
Compare (form mask)
Pack/Unpack
Move
–from/to memory
–from/to registers

10 Arithmetic
Conventional: Wrap-around
–on overflow, wrap around to MININT
–on underflow, wrap around to MAXINT
Think of digital audio
–What happens when you turn volume to the MAX?
Brightness in pictures
Saturating arithmetic:
–on overflow, stay at MAXINT
–on underflow, stay at MININT
Two flavors:
–unsigned
–signed
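The difference between the two modes is easiest to see on a single 8-bit lane. The sketch below is scalar C, not MMX code; the function names are hypothetical, and the saturating versions only mirror the behavior of instructions such as PADDUSB and PADDSB for one element.

```c
#include <stdint.h>

/* Wraparound: the result is reduced modulo 256. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturation: clamp at 255 instead of wrapping. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > UINT8_MAX ? UINT8_MAX : (uint8_t)sum;
}

/* Signed saturation: clamp at +127 on overflow and -128 on underflow. */
int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = (int)a + b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```

With brightness values 200 and 100, wraparound produces a dark pixel (44) while saturation produces full white (255), which is why media code usually wants the saturating form.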

11 Operations
Mult/Add
Compares
Conversion
–Interpolation/Transpose
–Unpack (e.g., byte to word)
–Pack (e.g., word to byte)

12 Examples
Image Compositing
–A and B images fade-in and fade-out
–A * fade + B * (1 - fade), OR
–(A - B) * fade + B
Image Overlay
–Sprite: e.g., mouse cursor
–Sprite: normal colors + transparent
–for i = 1 to Sprite_Length
  if A[I] = clear_color then Out_frame[I] = C[I]
  else Out_frame[I] = A[I]
Matrix Transpose
–Convert from row major to column major
–Used in JPEG
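The two compositing formulas are algebraically the same, but the second needs one multiply per pixel instead of two. The scalar C sketch below checks this in 8.8 fixed point. The representation of `fade` as an integer from 0 to 256 (standing for 0.0 to 1.0) and both function names are assumptions made for illustration, not something the slides specify.

```c
#include <assert.h>
#include <stdint.h>

/* A * fade + B * (1 - fade), with fade scaled to 0..256. Two multiplies. */
uint8_t blend_two_mul(uint8_t a, uint8_t b, unsigned fade)
{
    assert(fade <= 256);
    return (uint8_t)((a * fade + b * (256 - fade)) >> 8);
}

/* (A - B) * fade + B, same scaling. One multiply; B * 256 is a shift. */
uint8_t blend_one_mul(uint8_t a, uint8_t b, unsigned fade)
{
    assert(fade <= 256);
    int diff = (int)a - (int)b;               /* may be negative */
    int t = diff * (int)fade + ((int)b << 8); /* never negative overall */
    return (uint8_t)(t >> 8);
}
```

Adding `B << 8` before the final shift keeps the intermediate value non-negative, so the right shift is well defined and both forms round identically.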

13 Matrix Transpose 4x4
Input rows (four 16-bit elements per 64-bit register):
m03 m02 m01 m00
m13 m12 m11 m10
m23 m22 m21 m20
m33 m32 m31 m30
After punpcklwd on each pair of rows:
m11 m01 m10 m00
m31 m21 m30 m20
After punpckldq: m30 m20 m10 m00
After punpckhdq: m31 m21 m11 m01
That’s for the first two rows
Transposed result:
m30 m20 m10 m00
m31 m21 m11 m01
m32 m22 m12 m02
m33 m23 m13 m03
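The unpack sequence can be traced in portable C by modeling each 64-bit register as four 16-bit words, with word 0 in the low bits. This is a sketch of the data movement only: the helper names are hypothetical, and the use of a high-word unpack for the last two output rows is an assumption, since the slide only shows the first two.

```c
#include <stdint.h>

typedef uint64_t mmx_t; /* four 16-bit words, word 0 in the low bits */

static uint64_t word(mmx_t v, int i) { return (v >> (16 * i)) & 0xFFFF; }

/* Interleave the two low words of a and b (models punpcklwd a, b). */
mmx_t unpack_lo_wd(mmx_t a, mmx_t b)
{
    return word(a, 0) | word(b, 0) << 16 | word(a, 1) << 32 | word(b, 1) << 48;
}

/* Interleave the two high words of a and b (models punpckhwd a, b). */
mmx_t unpack_hi_wd(mmx_t a, mmx_t b)
{
    return word(a, 2) | word(b, 2) << 16 | word(a, 3) << 32 | word(b, 3) << 48;
}

/* Combine the low doublewords (models punpckldq a, b). */
mmx_t unpack_lo_dq(mmx_t a, mmx_t b) { return (a & 0xFFFFFFFFULL) | (b << 32); }

/* Combine the high doublewords (models punpckhdq a, b). */
mmx_t unpack_hi_dq(mmx_t a, mmx_t b) { return (a >> 32) | (b & 0xFFFFFFFF00000000ULL); }

/* in[r] holds row r; out[c] receives column c. */
void transpose4x4(const mmx_t in[4], mmx_t out[4])
{
    mmx_t t0 = unpack_lo_wd(in[0], in[1]); /* m11 m01 m10 m00 */
    mmx_t t1 = unpack_lo_wd(in[2], in[3]); /* m31 m21 m30 m20 */
    mmx_t t2 = unpack_hi_wd(in[0], in[1]); /* m13 m03 m12 m02 */
    mmx_t t3 = unpack_hi_wd(in[2], in[3]); /* m33 m23 m32 m22 */
    out[0] = unpack_lo_dq(t0, t1);         /* m30 m20 m10 m00 */
    out[1] = unpack_hi_dq(t0, t1);         /* m31 m21 m11 m01 */
    out[2] = unpack_lo_dq(t2, t3);         /* m32 m22 m12 m02 */
    out[3] = unpack_hi_dq(t2, t3);         /* m33 m23 m13 m03 */
}
```

Eight unpack operations transpose all sixteen elements, with no per-element loads or stores.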

14 Chroma Keying
for (i=0; i<image_size; i++)
  if (x[i] == Blue) new_image[i] = y[i];
  else new_image[i] = x[i];

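15 Chroma Keying Code
Movq mm3, mem1
–Load eight pixels from the person’s image
Movq mm4, mem2
–Load eight pixels from the background image
Pcmpeqb mm1, mm3
–mm1 (assumed to hold eight copies of Blue) becomes a byte mask: all ones where the pixel is Blue
Pand mm4, mm1
–Keep background pixels where the mask is set
Pandn mm1, mm3
–Keep person pixels where the mask is clear
Por mm4, mm1
–Merge the two into the final eight pixels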
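The same compare-and-select sequence can be followed in portable C, treating a 64-bit integer as eight byte-sized pixels. This is a sketch of the semantics rather than real MMX code; the function names are hypothetical, and `key` is assumed to hold the Blue value replicated into all eight bytes, which the slide does not show being loaded.

```c
#include <stdint.h>

/* Models pcmpeqb: each byte lane becomes 0xFF where a == b, else 0x00. */
uint64_t cmpeq_u8(uint64_t a, uint64_t b)
{
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t lane = 0xFFULL << (8 * i);
        if (((a ^ b) & lane) == 0)
            mask |= lane;
    }
    return mask;
}

/* Selects background pixels wherever the person image equals the key. */
uint64_t chroma_key8(uint64_t person, uint64_t background, uint64_t key)
{
    uint64_t mask = cmpeq_u8(key, person); /* Pcmpeqb mm1, mm3 */
    uint64_t bg   = background & mask;     /* Pand    mm4, mm1 */
    uint64_t fg   = ~mask & person;        /* Pandn   mm1, mm3 */
    return bg | fg;                        /* Por     mm4, mm1 */
}
```

Both sides of the conditional are computed for every pixel and the mask picks the winner, so the loop body contains no branches.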

16 Integration into Pentium
Major issue: OS compatibility
–Create new registers?
–Share registers with FP
  Existing OSes will save/restore
Use 64-bit datapaths
Pipe capable of 2 MMX IPC
Separate MEM and Execute stage

17 “Recent” Multimedia Extensions
Intel MMX: integer arithmetic only
New algorithms -> new needs
Need for massive amounts of FP ops
Solution? MMX-like ISA, but for FP and not only integer
Example: AMD’s 3DNow!
–New data type: 2 packed single-precision FP
–2 x 32 bits
  »sign + exponent + significand
–New instructions
–Speedup potential: 2x

18 AMD’s 3DNow!
21 new instructions
Average: motivated by MPEG
Add, Sub, Reverse Sub, Mul
Accumulate
–(A1, A2) acc (B1, B2) = (B1 + B2, A1 + A2)
Comparison (create mask)
Min, Max (pairwise)
Reciprocal and SQRT
–Approximation: 1st step and other steps
Prefetch
Integer from/to FP conversion
All operate on packed FP data
–(-1)^sign * 2^(exponent - 127) * mantissa
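Two of the items above are easy to illustrate in scalar C: the layout of a single-precision element and the accumulate operation on a pair. The sketch assumes `float` is an IEEE 754 32-bit value, as on mainstream compilers; the type and function names are hypothetical, and `pf_acc` only models the result that the slide writes as (B1 + B2, A1 + A2).

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;     /* 1 bit */
    uint32_t exponent; /* 8 bits, biased by 127 */
    uint32_t fraction; /* 23 bits, implicit leading 1 for normal numbers */
} sp_fields;

/* Splits a single-precision value into its three bit fields. */
sp_fields sp_decode(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits); /* reinterpret without aliasing issues */
    sp_fields f = { bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF };
    return f;
}

/* Two packed single-precision values, as in one 64-bit 3DNow! register. */
typedef struct { float lo, hi; } pf2;

/* Accumulate: low result is the sum of a's pair, high is the sum of b's. */
pf2 pf_acc(pf2 a, pf2 b)
{
    pf2 r = { a.lo + a.hi, b.lo + b.hi };
    return r;
}
```

For example, -6.5 is -1.625 * 2^2, so it decodes to sign 1, biased exponent 129, and a fraction field whose top bits are 101.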

19 Recent Extensions Cont.
Intel’s ISSE
–very similar to AMD’s 3DNow!
–But has separate registers
Lessons?
–Applications change over time
–Careful when introducing new instructions
  How useful are they?
  Cost?
  LEGACY: are they going to be useful in the future?
Everyone has their own Multimedia Instruction set these days
–read handout

20 Intel’s SSE
Multimedia/Internet?
70 new instructions
Major Types:
–SIMD-FP
  128-bit wide: 4 x 32-bit single-precision FP
–Data movement and re-organization
–Type conversion
  Int to FP and vice versa
  Scalar/FP precision
–State Save/Restore
  New SSE registers, not shared with FP like MMX
–Memory Streaming
  Prefetch to specified hierarchy level
–New Media
  Absolute Diff, Rounded AVG, MIN/MAX
SSE2:
–SIMD-FP: two 64-bit FP values as 128 bits

21 Altivec (PowerPC Mmedia Ext)
128-bit registers
8, 16, or 32 bit data types
Scalar or single-precision FP
162 Instructions
Saturation or Modulo arithmetic
Four operand Instructions
–3 sources, 1 target

22 Altivec Design Process
Look at Mmedia Kernels
Justify new instructions
Video
–8-bit int LowQ, 16-bit int HighQ
Audio
–16-bit int LowQ, SP FP HighQ
Image Processing
–8-bit int LowQ, 16-bit int HighQ
3D Graphics
–16-bit int LowQ, SP FP HighQ
Speech Recog.
–16-bit int LowQ, SP FP HighQ
Communications/Crypto
–8-bit or 16-bit unsigned int

23 Vector Processors
Vector:
–One-dimensional array of numbers
Original Motivation:
–Scientific/Numerical Programs operate on vectors
Parallelism abounds
Example:
–Do i = 1 to 64
  C[I] = A[I] + B[I]
Vector Processors
–Registers are vectors
–Operations are element-wise across multiple vectors
Example:
–addv Rc, Ra, Rb

24 Vector Example
Do i = 1 to 64
  C[I] = A[I] + B[I]
addv rc, ra, rb
c[0] = a[0] + b[0]
c[1] = a[1] + b[1]
c[2] = a[2] + b[2]
…..
c[63] = a[63] + b[63]

25 Why Vector Processors?
Deeper Pipelines -> Faster Clock -> Higher Performance
BUT!
–Interlock logic becomes really complicated as pipeline deepens
–Bubbles due to data deps increase
Want Wider Machines to exploit Parallelism
BUT!
–Increasingly harder to increase issue width
Finally, recall the Fetch and Issue Bottleneck
–Can’t execute more than you fetch/decode

26 What’s Good About Vector Procs
Vectors facilitate deeper Pipelines
–No intra-vector interlocks
–No intra-vector data deps
–Inner loop control deps eliminated
  They were artificial to start with
–Single Instruction for Multiple operations
–Vector instruction provides information about what the machine is going to be doing for a while
  Could exploit in memory system
  Know that we are going to use 64 elements, which are likely one after the other

27 Vector Architectures
Vectors in Memory
–All vectors in memory
–Long startup latency
–Memory ports?
–Good for long vectors
Vectors in Registers
–Load/store
–Vector ops only on regs
–Register ports less expensive than memory ports
–Good for small vectors also
–Vector register length is the limiter
Fact: in most applications vectors are short
Hence Register Vectors are better

28 Vector ISA Example
Vector-Vector Insts
–VRC[i] = VRA[i] op VRB[i]
Vector-Scalar Inst
–VRB[i] = VRA[i] op CONST
Vector Load/Store
–Mem[i] = VRA[i]
–W/ Stride: M[r1 + i * r2] = VRA[i]
–Indexed: M[r1 + VRB[i]] = VRA[i]
  Also called scatter/gather
Support for shorter vectors
–Vector Length Register
Vector Masks
–VRb[i] = op VRa[i] if (VRc[i])
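Each instruction form above corresponds to a simple loop over the first `vl` elements, where `vl` plays the role of the vector length register. The scalar C sketch below spells out those loops; the function names, the 32-bit element type, and the choice of negation as the masked unary op are assumptions for illustration, not part of any particular vector ISA.

```c
#include <stddef.h>
#include <stdint.h>

/* Vector-vector: VRC[i] = VRA[i] + VRB[i], for i < vl. */
void addv(int32_t *vrc, const int32_t *vra, const int32_t *vrb, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vrc[i] = vra[i] + vrb[i];
}

/* Strided store: M[r1 + i * r2] = VRA[i]. */
void store_strided(int32_t *mem, size_t r1, size_t r2,
                   const int32_t *vra, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        mem[r1 + i * r2] = vra[i];
}

/* Indexed store (scatter): M[r1 + VRB[i]] = VRA[i]. */
void scatter(int32_t *mem, size_t r1, const size_t *vrb,
             const int32_t *vra, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        mem[r1 + vrb[i]] = vra[i];
}

/* Masked op: VRb[i] = -VRa[i] only where the mask VRc[i] is nonzero. */
void negv_masked(int32_t *vrb, const int32_t *vra,
                 const uint8_t *vrc, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if (vrc[i])
            vrb[i] = -vra[i];
}
```

Passing a `vl` smaller than the array length leaves the remaining elements untouched, which is how a vector length register handles vectors shorter than the hardware maximum.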

29 Vector Chaining
C[i] = A[i] * B[i]
D[i] = C[i] + x
MULTV VRC, VRA, VRB
ADDVI VRD, VRC, Rx
The add for VRD[i] can be initiated as soon as the multiply for element i finishes
We do not have to wait for the whole MULTV to finish
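A rough cycle count shows why chaining matters. The sketch below uses a deliberately simplified timing model that is an assumption, not something stated on the slide: each vector instruction has a fixed startup latency and then delivers one element per cycle, and the function and parameter names are hypothetical.

```c
/* Without chaining, the add waits for the entire multiply to finish. */
unsigned cycles_unchained(unsigned n, unsigned mul_startup, unsigned add_startup)
{
    return (mul_startup + n) + (add_startup + n);
}

/* With chaining, the add starts on the first product, so the two
 * element streams overlap and only the startup latencies add up. */
unsigned cycles_chained(unsigned n, unsigned mul_startup, unsigned add_startup)
{
    return mul_startup + add_startup + n;
}
```

Under this model, with 64 elements and illustrative startup latencies of 7 cycles for the multiply and 6 for the add, the pair takes 141 cycles unchained and 77 cycles chained.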

30 Vector Processors – A bit of History
CRAY-1: started in ’72, completed in ’74
12.5ns cycle time
8 Scalar Registers
8 Address Registers
8 Vector Registers of 64 words each
64 Scalar and 64 Address temporaries
12 Functional Units
1Mword memory: 4 clock cycles

31 Are Vectors Always a Win?
From Gordon Bell’s talk
Scalar is way better for short vectors
Vector 7x Scalar for larger vectors
(Plot: time per element vs. vector size)

32 Cray-1 Architecture

33 Vectors and SIMD
Vector Length
–Not programmable (no VL reg)
–Must be multiple of 64 total bits
Memory Load/Store
–stride one only
Arithmetic
–Integer only
Conditionals
–builds byte mask
–do both ways and choose
–no trap problem -- no trapping instructions
Data Movement
–minimal
–only pack/unpack

34 Specifying Independence
Vectors and SIMD are examples of “independence” ISAs
Conventional ISA
–One instruction after the other
–No way of explicitly stating: Inst A and B are independent
Vectors and SIMD
–A series of many conventional instructions that are the same -> one vector or SIMD inst.
Limited flexibility for specifying independence
Still, these were optimized for the common case in a specific class of applications

