ECE 1773, Spring ’02: Multimedia ISA Extensions
Some material © Hill, Sohi, Smith, Wood (UW-Madison) © A. Moshovos

Presentation transcript:

Multimedia ISA Extensions
Intel’s MMX
  – The Basics
  – Instruction Set
  – Examples
  – Integration into Pentium
  – Relationship to vector ISAs
AMD’s 3DNow!
Intel’s ISSE (a.k.a. KNI)

MMX: Basics
Multimedia applications are becoming popular
Are current ISAs a good match for them?
Methodology:
  – Consider a number of “typical” applications
  – Can we do better?
  – Cost vs. performance vs. utility tradeoffs
Net Result: Intel’s MMX
Can also be viewed as an attempt to maintain market share
  – If people are going to use these kinds of applications, we had better support them

Multimedia Applications
Most multimedia apps have lots of parallelism:
  – for I = here to infinity
      out[I] = in_a[I] * in_b[I]
  – At runtime:
      out[0] = in_a[0] * in_b[0]
      out[1] = in_a[1] * in_b[1]
      out[2] = in_a[2] * in_b[2]
      out[3] = in_a[3] * in_b[3]
      ...
Also, work on short integers:
  – in_a[i] is 0 to 255, for example (8-bit color)
  – or 0 to 64K (16-bit audio)

Observations
32-bit registers are wasted
  – only using part of them, and we know it
  – ALUs underutilized, and we know it
Instruction specification is inefficient
  – even though we know that a lot of the same operations will be performed, we still have to specify each of them individually
  – Instruction bandwidth
  – Discovering parallelism
  – Memory ports?
      Could read four elements of an array with one 32-bit load
      Same for stores
      The hardware will have a hard time discovering this
  – Coalescing and dependences

MMX Contd.
Can do better than traditional ISA
  – new data types
  – new instructions
Pack data in 64-bit words
  – bytes
  – “words” (16 bits)
  – “double words” (32 bits)
Operate on packed data like short vectors (arrays)

MMX: Example
Up to 8 operations (64-bit) go in parallel
  – Potential improvement: 8x
  – In practice less, but still good
  – Besides, another reason to think your machine is obsolete
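
To make the "up to 8 operations in parallel" point concrete, here is a minimal portable C sketch of a packed byte add with wraparound on a 64-bit word. It only emulates the per-lane behavior with ordinary integer operations; it is not MMX code, and the function name packed_add_u8 is a hypothetical one chosen for illustration.

```c
#include <stdint.h>

/* Add eight unsigned bytes packed into 64-bit words, lane by lane, with
   wraparound. A packed-byte add instruction does this in one step; this
   sketch reproduces the result with plain 64-bit arithmetic. */
uint64_t packed_add_u8(uint64_t a, uint64_t b)
{
    const uint64_t high = 0x8080808080808080ULL; /* top bit of every lane */

    /* Add the low 7 bits of each lane; a carry can reach bit 7 of its own
       lane but can never cross into the neighboring lane. */
    uint64_t low_sum = (a & ~high) + (b & ~high);

    /* Fold in the top bits with XOR, which discards the carry out. */
    return low_sum ^ ((a ^ b) & high);
}
```

With a = 0x01FF and b = 0x0101, the low lane wraps from 0xFF to 0x00 without disturbing the lane above it, so the result is 0x0200.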

MMX Data Types

A bit of History
This is a special case of SIMD
  – Single Instruction
  – Multiple Data
One instruction specifies that an operation should be applied:
  – Repeatedly
  – To possibly different data elements each time
  – Each of these operations is independent
Conventional ISA is SISD
  – Single Instruction/Single Data
First used in Livermore S-1 (> 25 years ago)

MMX: Instruction Set
57 new instructions
Integer Arithmetic
  – add/sub/mul
  – multiply add
  – signed/unsigned
  – saturating/wraparound
Shifts
Compare (form mask)
Pack/Unpack
Move
  – from/to memory
  – from/to registers

Arithmetic
Conventional: Wrap-around
  – on overflow, wrap to MININT
  – on underflow, wrap to MAXINT
Think of digital audio
  – What happens when you turn volume to the MAX?
Brightness in pictures
Saturating arithmetic:
  – on overflow, stay at MAXINT
  – on underflow, stay at MININT
Two flavors:
  – unsigned
  – signed
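
The difference between the two arithmetic modes can be sketched in scalar C. The functions below handle one element at a time, whereas a packed instruction would apply the same rule to every lane at once. The function names are hypothetical.

```c
#include <stdint.h>

/* Wraparound: the result is taken modulo 256. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturation: clamp at the largest representable value. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > UINT8_MAX ? UINT8_MAX : (uint8_t)sum;
}

/* Signed saturation: clamp at both ends of the 16-bit range. */
int16_t add_sat_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}
```

For a pixel brightness of 200 increased by 100, wraparound gives 44 (a dark pixel), while saturation gives 255 (full brightness), which is what an image or audio application wants.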

Operations
Mult/Add
Compares
Conversion
  – Interpolation/Transpose
  – Unpack (e.g., byte to word)
  – Pack (e.g., word to byte)

Examples
Image Compositing
  – A and B images fade-in and fade-out
  – A * fade + B * (1 - fade), OR
  – (A - B) * fade + B
Image Overlay
  – Sprite: e.g., mouse cursor
  – Sprite: normal colors + transparent
  – for i = 1 to Sprite_Length
      if A[i] = clear_color then
        Out_frame[i] = C[i]
      else
        Out_frame[i] = A[i]
Matrix Transpose
  – Convert from row major to column major
  – Used in JPEG
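
The fade formula A * fade + B * (1 - fade) can be sketched in integer C as follows. The representation of fade as a fixed-point value from 0 to 256 (where 256 stands for 1.0) is an assumption made for this example, as is the function name fade_blend; the slides do not say how fade is encoded.

```c
#include <stddef.h>
#include <stdint.h>

/* Blend two 8-bit images: out = A * fade + B * (1 - fade).
   Assumption: fade is fixed point in the range 0..256, with 256 meaning 1.0,
   so the final shift by 8 divides by 256 and the result fits in a byte. */
void fade_blend(const uint8_t *a, const uint8_t *b, uint8_t *out,
                size_t n, unsigned fade)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((a[i] * fade + b[i] * (256u - fade)) >> 8);
}
```

Every iteration is independent and works on short integers, which is the pattern the packed multiply and add instructions target.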

Matrix Transpose 4x4
Input rows (most significant word on the left):
  m03 m02 m01 m00
  m13 m12 m11 m10
  m23 m22 m21 m20
  m33 m32 m31 m30
punpcklwd on each pair of rows:
  rows 0 and 1: m11 m01 m10 m00
  rows 2 and 3: m31 m21 m30 m20
On the two punpcklwd results:
  punpckldq: m30 m20 m10 m00
  punpckhdq: m31 m21 m11 m01
That’s for the first two rows
Transposed result:
  m30 m20 m10 m00
  m31 m21 m11 m01
  m32 m22 m12 m02
  m33 m23 m13 m03
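
The unpack sequence can be checked with a small C emulation. This is a sketch: the mm_t type and helper names are hypothetical, each 64-bit register is modeled as four 16-bit words with w[0] as the least significant word, and the use of punpckhwd for the last two output rows is an assumption, since the slide only shows the steps for the first two.

```c
#include <stdint.h>

/* A 64-bit register modeled as four 16-bit words; w[0] is least significant. */
typedef struct { uint16_t w[4]; } mm_t;

/* punpcklwd: interleave the low words of d and s. */
static mm_t unpack_lo_wd(mm_t d, mm_t s)
{
    mm_t r = {{ d.w[0], s.w[0], d.w[1], s.w[1] }};
    return r;
}

/* punpckhwd: interleave the high words of d and s. */
static mm_t unpack_hi_wd(mm_t d, mm_t s)
{
    mm_t r = {{ d.w[2], s.w[2], d.w[3], s.w[3] }};
    return r;
}

/* punpckldq: low doubleword of d, then low doubleword of s. */
static mm_t unpack_lo_dq(mm_t d, mm_t s)
{
    mm_t r = {{ d.w[0], d.w[1], s.w[0], s.w[1] }};
    return r;
}

/* punpckhdq: high doubleword of d, then high doubleword of s. */
static mm_t unpack_hi_dq(mm_t d, mm_t s)
{
    mm_t r = {{ d.w[2], d.w[3], s.w[2], s.w[3] }};
    return r;
}

/* Transpose a 4x4 matrix of 16-bit words held one row per register. */
void transpose4x4(const mm_t in[4], mm_t out[4])
{
    mm_t t0 = unpack_lo_wd(in[0], in[1]); /* m00 m10 m01 m11 */
    mm_t t1 = unpack_lo_wd(in[2], in[3]); /* m20 m30 m21 m31 */
    mm_t t2 = unpack_hi_wd(in[0], in[1]); /* m02 m12 m03 m13 */
    mm_t t3 = unpack_hi_wd(in[2], in[3]); /* m22 m32 m23 m33 */

    out[0] = unpack_lo_dq(t0, t1);        /* column 0 */
    out[1] = unpack_hi_dq(t0, t1);        /* column 1 */
    out[2] = unpack_lo_dq(t2, t3);        /* column 2 */
    out[3] = unpack_hi_dq(t2, t3);        /* column 3 */
}
```

The comments list words from least to most significant, which is the reverse of the left-to-right order used on the slide.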

Chroma Keying
for (i=0; i<image_size; i++)
  if (x[i] == Blue) new_image[i] = y[i];
  else new_image[i] = x[i];

Chroma Keying Code
Movq mm3, mem1
  – Load eight pixels from the person’s image
Movq mm4, mem2
  – Load eight pixels from the background image
Pcmpeqb mm1, mm3
Pand mm4, mm1
Pandn mm1, mm3
Por mm4, mm1
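
The four-instruction select sequence can be followed step by step in portable C. This sketch emulates the instructions on 64-bit integers rather than using real MMX registers. It assumes that mm1 already holds the key color replicated into all eight bytes before the compare, which the slide does not show; the function names are hypothetical.

```c
#include <stdint.h>

/* Emulates pcmpeqb: each byte lane becomes 0xFF where a and b match, else 0. */
static uint64_t cmpeq_bytes(uint64_t a, uint64_t b)
{
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t lane = 0xFFULL << (8 * i);
        if (((a ^ b) & lane) == 0)
            mask |= lane;
    }
    return mask;
}

/* Chroma key eight 8-bit pixels at once, with no branches in the select. */
uint64_t chroma_key8(uint64_t person, uint64_t background, uint64_t blue)
{
    uint64_t mask    = cmpeq_bytes(blue, person); /* pcmpeqb mm1, mm3 */
    uint64_t from_bg = background & mask;         /* pand    mm4, mm1 */
    uint64_t from_fg = ~mask & person;            /* pandn   mm1, mm3 */
    return from_bg | from_fg;                     /* por     mm4, mm1 */
}
```

This is the "do both ways and choose" style of conditional: both candidate values are available and the byte mask picks one per lane.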

Integration into Pentium
Major issue: OS compatibility
  – Create new registers?
  – Share registers with FP
      Existing OSes will save/restore
Use 64-bit datapaths
Pipe capable of 2 MMX IPC
Separate MEM and Execute stage

“Recent” Multimedia Extensions
Intel MMX: integer arithmetic only
New algorithms -> new needs
Need for massive amounts of FP ops
Solution? MMX-like ISA, but for FP, not only integer
Example: AMD’s 3DNow!
  – New data type: 2 packed single-precision FP
  – 2 x 32 bits
      sign + exponent + significand
  – New instructions
  – Speedup potential: 2x

AMD’s 3DNow!
21 new instructions
Average: motivated by MPEG
Add, Sub, Reverse Sub, Mul
Accumulate
  – (A1, A2) acc (B1, B2) = (B1 + B2, A1 + A2)
Comparison (create mask)
Min, Max (pairwise)
Reciprocal and SQRT
  – Approximation: 1st step and other steps
Prefetch
Integer from/to FP conversion
All operate on packed FP data
  – sign * 2^(exponent - 127) * mantissa
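
The single-precision layout behind that formula can be illustrated with a short C sketch that splits a float into its three fields and rebuilds the value. It assumes float is a 32-bit IEEE 754 single, handles normalized numbers only (no zeros, denormals, infinities, or NaNs), and uses hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a 32-bit single-precision value. */
typedef struct {
    unsigned sign;      /* 1 bit */
    int      exponent;  /* 8 bits, biased by 127 */
    uint32_t fraction;  /* 23 bits, implicit leading 1 for normalized values */
} sp_fields;

/* Assumption: float is a 32-bit IEEE 754 single on this platform. */
sp_fields sp_decode(float x)
{
    uint32_t bits;
    sp_fields f;
    memcpy(&bits, &x, sizeof bits);
    f.sign     = bits >> 31;
    f.exponent = (int)((bits >> 23) & 0xFF);
    f.fraction = bits & 0x7FFFFF;
    return f;
}

/* Rebuild the value as sign * 2^(exponent - 127) * 1.fraction. */
double sp_value(sp_fields f)
{
    double m = 1.0 + f.fraction / 8388608.0; /* 8388608 = 2^23 */
    int e = f.exponent - 127;
    for (; e > 0; e--) m *= 2.0;
    for (; e < 0; e++) m /= 2.0;
    return f.sign ? -m : m;
}
```

For example, -6.5 is -1.625 * 2^2, so it decodes to sign 1, biased exponent 129, and fraction 0x500000.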

Recent Extensions Cont.
Intel’s ISSE
  – very similar to AMD’s 3DNow!
  – But has separate registers
Lessons?
  – Applications change over time
  – Careful when introducing new instructions
      How useful are they?
      Cost?
      LEGACY: are they going to be useful in the future?
Everyone has their own Multimedia Instruction set these days
  – read handout

Intel’s SSE
Multimedia/Internet?
70 new instructions
Major Types:
  – SIMD-FP
      128-bit wide: 4 x 32-bit FP
  – Data movement and re-organization
  – Type conversion
      Int to FP and vice versa
      Scalar/FP precision
  – State Save/Restore
      New SSE registers, not like MMX
  – Memory Streaming
      Prefetch to specified hierarchy level
  – New Media
      Absolute Diff, Rounded AVG, MIN/MAX
SSE2:
  – SIMD-FP: two 64-bit FP as 128-bit

Altivec (PowerPC Mmedia Ext)
128-bit registers
8-, 16-, or 32-bit data types
Scalar or single-precision FP
162 instructions
Saturation or modulo arithmetic
Four-operand instructions
  – 3 sources, 1 target

Altivec Design Process
Look at Mmedia kernels
Justify new instructions
Video
  – 8-bit int LowQ, 16-bit int HighQ
Audio
  – 16-bit int LowQ, SP FP HighQ
Image Processing
  – 8-bit int LowQ, 16-bit int HighQ
3D Graphics
  – 16-bit int LowQ, SP FP HighQ
Speech Recog.
  – 16-bit int LowQ, SP FP HighQ
Communications/Crypto
  – 8-bit or 16-bit unsigned int

Vector Processors
Vector:
  – One-dimensional array of numbers
Original Motivation:
  – Scientific/numerical programs operate on vectors
      Parallelism abounds
Example:
  – Do i = 1 to 64
      C[i] = A[i] + B[i]
Vector Processors
  – Registers are vectors
  – Operations are element-wise across multiple vectors
Example:
  – addv Rc, Ra, Rb

Vector Example
Do i = 1 to 64
  C[i] = A[i] + B[i]
addv rc, ra, rb
  c[0] = a[0] + b[0]
  c[1] = a[1] + b[1]
  c[2] = a[2] + b[2]
  ...
  c[63] = a[63] + b[63]

Why Vector Processors?
Deeper Pipelines -> Faster Clock -> Higher Performance
BUT!
  – Interlock logic becomes really complicated as pipeline deepens
  – Bubbles due to data deps increase
Want wider machines to exploit parallelism
BUT!
  – Increasingly hard to increase issue width
Finally, recall the fetch and issue bottleneck
  – Can’t execute more than you fetch/decode

What’s Good About Vector Procs
Vectors facilitate deeper pipelines
  – No intra-vector interlocks
  – No intra-vector data deps
  – Inner loop control deps eliminated
      They were artificial to start with
  – Single instruction for multiple operations
  – Vector instruction provides information about what the machine is going to be doing for a while
      Could exploit in memory system
      Know that we are going to use 64 elements, which are likely one after the other

Vector Architectures
Vectors in Memory
  – All vectors in memory
  – Long startup latency
  – Memory ports?
  – Good for long vectors
Vectors in Registers
  – Load/store
  – Vector ops only on regs
  – Register ports less expensive than memory ports
  – Good for small vectors also
  – Register Vector is the limiter
Fact: in most applications vectors are short
Hence register vectors are better

Vector ISA Example
Vector-Vector Insts
  – VRC[i] = VRA[i] op VRB[i]
Vector-Scalar Inst
  – VRB[i] = VRA[i] op CONST
Vector Load/Store
  – Mem[i] = VRA[i]
  – W/ Stride: M[r1 + i * r2] = VRA[i]
  – Indexed: M[r1 + VRB[i]] = VRA[i]
      Also called scatter/gather
Support for shorter vectors
  – Vector Length Register
Vector Masks
  – VRb[i] = op VRa[i] if (VRc[i])
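
The semantics of these instruction classes can be written out as scalar C loops. This is a sketch of what each instruction computes, not of how a vector machine implements it: the vreg type, MAX_VL, and all function names are hypothetical, the vl parameter plays the role of the vector length register, and negation is picked as an arbitrary example of a masked unary op.

```c
#include <stddef.h>

#define MAX_VL 64 /* hypothetical maximum vector register length */

typedef struct { double e[MAX_VL]; } vreg;

/* Vector-vector: VRC[i] = VRA[i] + VRB[i] for the first vl elements. */
void addv(vreg *c, const vreg *a, const vreg *b, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        c->e[i] = a->e[i] + b->e[i];
}

/* Strided store: M[base + i * stride] = VRA[i]. */
void store_strided(double *base, size_t stride, const vreg *a, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        base[i * stride] = a->e[i];
}

/* Indexed store (scatter): M[base + idx[i]] = VRA[i]. */
void store_indexed(double *base, const size_t *idx, const vreg *a, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        base[idx[i]] = a->e[i];
}

/* Masked op: VRb[i] = -VRa[i] only where mask[i] is set; other elements
   of VRb keep their old values. */
void negv_masked(vreg *b, const vreg *a, const unsigned char *mask, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            b->e[i] = -a->e[i];
}
```

Passing a vl smaller than MAX_VL is how a loop whose trip count is not a multiple of the register length handles its last, shorter chunk.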

Vector Chaining
C[i] = A[i] * B[i]
D[i] = C[i] + x
MULTV VRC, VRA, VRB
ADDVI VRD, VRC, Rx
The add for VRD[i] can be initiated as soon as the MULTV for element i finishes
We do not have to wait for the whole MULTV to finish

Vector Processors – A bit of History
CRAY-1: started in ’72, completed in ’74
12ns cycle time
8 Scalar Registers
8 Address Registers
8 Vector Registers of 64 words
64 Scalar and 64 Address temporaries
12 Functional Units
1Mword memory: 4 clock cycles

Are Vectors Always a Win?
From Gordon Bell’s talk
Scalar is way better for short vectors
Vector 7x Scalar for larger vectors
[Plot: time per element vs. vector size]

Cray-1 Architecture

Vectors and SIMD
Vector Length
  – Not programmable (no VL reg)
  – Must be multiple of 64 total bits
Memory Load/Store
  – stride one only
Arithmetic
  – Integer only
Conditionals
  – builds byte mask
  – do both ways and choose
  – no trap problem: no trapping instructions
Data Movement
  – minimal
  – only pack/unpack

Specifying Independence
Vectors and SIMD are examples of “independence” ISAs
Conventional ISA
  – One instruction after the other
  – No way of explicitly stating: Inst A and B are independent
Vectors and SIMD
  – A series of many conventional instructions that are the same -> one vector or SIMD inst.
Limited flexibility for specifying independence
Still, these were optimized for the common case in a specific class of applications