ALU Architecture and ISA Extensions Lecture notes from MKP, H. H. Lee and S. Yalamanchili.

(2) Reading
- Sections 3.2-3.5 (only those elements covered in class)
- Sections 3.6-3.8
- Appendix B.5
Goal: Understand the
- ISA view of the core microarchitecture
- Organization of functional units and register files into basic data paths

(3) Overview
- Instruction Set Architectures have a purpose
  - Applications dictate what we need
- We only have a fixed number of bits
  - Impact on accuracy
- More is not better
  - We cannot afford everything we want
- Basic Arithmetic Logic Unit (ALU) Design
  - Addition/subtraction, multiplication, division

(4) Reminder: ISA
- Who sees what?
- Register File (Programmer Visible State): entries 0x00, 0x01, 0x02, 0x03, ..., 0x1F
- Byte-addressed memory, up to address 0xFFFFFFFF
- Memory Map: Reserved, Text Segment, Data segment (static), Dynamic Data, stack
- Programmer Invisible State: Kernel registers, Instruction register
- [Figure: Processor containing the Register File, Program Counter, Arithmetic Logic Unit (ALU), Processor Internal Buses, and Memory Interface]

(5) Arithmetic for Computers
- Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
- Operations on floating-point real numbers
  - Representation and operations
- Let us first look at integers

(6) Integer Addition (3.2)
- Example: 7 + 6
- Overflow if result out of range
  - Adding +ve and –ve operands: no overflow
  - Adding two +ve operands: overflow if result sign is 1
  - Adding two –ve operands: overflow if result sign is 0
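
The sign rule on this slide can be checked with a small model. The following is a minimal sketch, assuming 32-bit two's complement operands held as Python integers; the names `add32`, `to_signed`, `MASK32`, and `SIGN32` are illustrative and not from the lecture.

```python
MASK32 = 0xFFFFFFFF   # keeps the low 32 bits
SIGN32 = 0x80000000   # bit 31, the sign bit


def add32(a, b):
    """Add two 32-bit two's complement bit patterns.

    Returns (result, overflow), where result is the low 32 bits of the sum.
    """
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    # Operands with different signs can never overflow.
    same_sign = ((a ^ b) & SIGN32) == 0
    # With equal operand signs, overflow shows up as a flipped result sign.
    overflow = same_sign and ((a ^ result) & SIGN32) != 0
    return result, overflow


def to_signed(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & SIGN32 else x
```

For example, `add32(7, 6)` yields 13 with no overflow, while adding 1 to `0x7FFFFFFF` (the largest positive value) sets the overflow flag because the result sign becomes 1.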

(7) Integer Subtraction
- Add negation of second operand (2's complement representation)
- Example: 7 – 6 = 7 + (–6)
    +7: 0000 0000 … 0000 0111
    –6: 1111 1111 … 1111 1010
    +1: 0000 0000 … 0000 0001
- Overflow if result out of range
  - Subtracting two +ve or two –ve operands: no overflow
  - Subtracting +ve from –ve operand: overflow if result sign is 0
  - Subtracting –ve from +ve operand: overflow if result sign is 1

(8) ISA Impact
- Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addiu, subu instructions
- Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub instructions
  - On overflow, invoke exception handler
    - Save PC in exception program counter (EPC) register
    - Jump to predefined handler address
    - mfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later)
- ALU design leads to many solutions. We look at one simple example

(9) Integer ALU (arithmetic logic unit) (B.5)
- Build a 1-bit ALU, and use 32 of them (bit-slice)
- [Figure: ALU symbol with inputs a, b, and operation, and output result; table with columns op, a, b, res]

(10) Single Bit ALU
- Implements only AND and OR operations
- [Figure: inputs A and B feed an AND gate and an OR gate; a multiplexor controlled by Operation (0 or 1) selects Result]

(11) Adding Functionality
- We can add additional operators (to a point)
- How about addition?
- Review full adders from digital design

\[ c_{out} = ab + a\,c_{in} + b\,c_{in} \]
\[ sum = a \oplus b \oplus c_{in} \]
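
The two full-adder equations can be written directly as bit operations. This is a minimal sketch; the function name `full_adder` is illustrative, and each argument is assumed to be 0 or 1.

```python
def full_adder(a, b, c_in):
    """One-bit full adder. Returns (sum_bit, c_out)."""
    sum_bit = a ^ b ^ c_in                      # sum = a XOR b XOR c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)   # c_out = ab + a*c_in + b*c_in
    return sum_bit, c_out
```

A quick sanity check is that `sum_bit + 2 * c_out` equals `a + b + c_in` for all eight input combinations.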

(12) Building a 32-bit ALU

(13) Subtraction (a – b)?
- Two's complement approach: just negate b and add 1
- How do we negate?
- A clever solution: invert the bits of b (Binvert) and supply the extra 1 through the CarryIn of ALU0, both driven by a single sub control line
- [Figure: ALU0 through ALU31 chained CarryOut to CarryIn, with inputs a0..a31 and b0..b31, outputs Result0..Result31, and shared Binvert and Operation controls]

(14) Tailoring the ALU to the MIPS
- Need to support the set-on-less-than instruction (slt)
  - remember: slt is an arithmetic instruction
  - produces a 1 if rs < rt and 0 otherwise
  - use subtraction: (a-b) < 0 implies a < b
- Need to support test for equality (beq $t5, $t6, Label)
  - use subtraction: (a-b) = 0 implies a = b

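(15) Unsigned vs. signed support
- What is Result31 when (a-b) < 0?
- [Figure: 1-bit ALU with inputs a, b, Binvert, CarryIn, and Less; a multiplexor controlled by Operation selects among inputs 0 to 3; outputs Result and CarryOut]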

(16) Test for equality
- Notice control lines:
    000 = and
    001 = or
    010 = add
    110 = subtract
    111 = slt
- Note: zero is a 1 when the result is zero!
- Note test for overflow!
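
The control encoding above can be exercised with a bit-slice model. The following is a sketch under stated assumptions, not the lecture's circuit: the top control bit is treated as Binvert (also tied to the CarryIn of bit 0), the low two bits select the operation, and slt uses the bit-31 sum XORed with the adder overflow so the comparison stays correct when a - b overflows. The name `alu32` is illustrative.

```python
MASK32 = 0xFFFFFFFF


def alu32(a, b, control):
    """32-bit ALU built from 32 one-bit slices.

    control: 0b000 and, 0b001 or, 0b010 add, 0b110 subtract, 0b111 slt.
    Returns (result, zero, overflow).
    """
    a &= MASK32
    b &= MASK32
    binvert = (control >> 2) & 1
    op = control & 0b11
    carry = binvert            # CarryIn of slice 0: a + ~b + 1 == a - b
    result = 0
    for i in range(32):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ binvert
        carry_in = carry
        sum_i = a_i ^ b_i ^ carry_in
        carry = (a_i & b_i) | (a_i & carry_in) | (b_i & carry_in)
        if op == 0:
            bit = a_i & b_i
        elif op == 1:
            bit = a_i | b_i
        elif op == 2:
            bit = sum_i
        else:
            bit = 0            # slt: only bit 0 can be set, filled in below
        result |= bit << i
    # After the loop, carry_in and carry belong to slice 31.
    adder_overflow = carry_in ^ carry
    if op == 3:
        result = sum_i ^ adder_overflow   # 1 when a < b as signed values
    zero = int(result == 0)
    overflow = adder_overflow if op == 2 else 0
    return result, zero, overflow
```

With this model, `alu32(5, 5, 0b110)` returns a zero result with the zero flag set, which is the condition a beq would test.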

(17) ISA View
- Register-to-register data path
- We want this to be as fast as possible
- [Figure: CPU/Core containing registers $0, $1, ..., $31 and the ALU]

(18) Multiplication (3.3)
- Long multiplication

       1000    multiplicand
     × 1001    multiplier
       ----
       1000
      0000
     0000
    1000
    -------
    1001000    product

- Length of product is the sum of operand lengths
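
Long multiplication maps onto a shift-and-add loop: each 1 bit in the multiplier adds a shifted copy of the multiplicand. This is a minimal sketch for unsigned operands; the name `multiply_unsigned` and the parameter `n` (operand width in bits) are illustrative.

```python
def multiply_unsigned(multiplicand, multiplier, n=32):
    """Shift-and-add multiplication of two n-bit unsigned integers."""
    product = 0
    for i in range(n):
        if (multiplier >> i) & 1:
            # Partial product: multiplicand shifted left by the bit position.
            product += multiplicand << i
        # A 0 bit contributes an all-zero partial product, so nothing is added.
    return product
```

Calling `multiply_unsigned(0b1000, 0b1001, n=4)` reproduces the slide's example, and two n-bit operands never need more than 2n bits for the product.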

(19) A Multiplier
- Uses multiple adders
  - Cost/performance tradeoff
- Can be pipelined
  - Several multiplications performed in parallel

(20) MIPS Multiplication
- Two 32-bit registers for product
  - HI: most-significant 32 bits
  - LO: least-significant 32 bits
- Instructions
  - mult rs, rt / multu rs, rt
    - 64-bit product in HI/LO
  - mfhi rd / mflo rd
    - Move from HI/LO to rd
    - Can test HI value to see if product overflows 32 bits
  - mul rd, rs, rt
    - Least-significant 32 bits of product -> rd
- Study Exercise: Check out signed and unsigned multiplication with QtSPIM
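
Before trying the study exercise in QtSPIM, the HI/LO split can be previewed with a small model. This is a sketch of the arithmetic only, not a simulator; the function names `mult`, `multu`, `to_signed32`, and `fits_signed32` are illustrative helpers.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def multu(rs, rt):
    """Unsigned 32x32 multiply. Returns (HI, LO)."""
    product = (rs & MASK32) * (rt & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def mult(rs, rt):
    """Signed 32x32 multiply. Returns (HI, LO) as 32-bit patterns."""
    product = (to_signed32(rs) * to_signed32(rt)) & MASK64
    return (product >> 32) & MASK32, product & MASK32


def fits_signed32(hi, lo):
    """True when HI is only the sign extension of LO (signed product fits in 32 bits)."""
    expected_hi = MASK32 if lo & 0x80000000 else 0
    return hi == expected_hi
```

The same bit pattern gives different HI values under the two instructions: `0xFFFFFFFF` times 2 is -2 for `mult` (HI all ones) but a 33-bit value for `multu` (HI equal to 1).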

(21) Division (3.4)
- Check for 0 divisor
- Long division approach
  - If divisor ≤ dividend bits
    - 1 bit in quotient, subtract
  - Otherwise
    - 0 bit in quotient, bring down next dividend bit
- Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back
- Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required

              1001     quotient
    1000 ) 1001010     dividend (divisor on the left)
          -1000
              10
              101
              1010
             -1000
                10     remainder

- n-bit operands yield n-bit quotient and remainder
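
Restoring division can be sketched as a loop that brings down one dividend bit per step, subtracts, and adds the divisor back when the remainder goes negative. The code below is an illustrative model, not the hardware datapath; the names `restoring_divide` and `signed_divide` are hypothetical, and the sign convention in `signed_divide` (remainder takes the sign of the dividend) is an assumption that matches truncating division.

```python
def restoring_divide(dividend, divisor, n=32):
    """Unsigned restoring division of n-bit operands. Returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    quotient = 0
    remainder = 0
    for i in range(n - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor            # do the subtract
        if remainder < 0:
            remainder += divisor        # restore: add divisor back
            quotient <<= 1              # 0 bit in quotient
        else:
            quotient = (quotient << 1) | 1   # 1 bit in quotient
    return quotient, remainder


def signed_divide(dividend, divisor, n=32):
    """Divide using absolute values, then adjust the signs."""
    q, r = restoring_divide(abs(dividend), abs(divisor), n)
    if (dividend < 0) != (divisor < 0):
        q = -q
    if dividend < 0:
        r = -r
    return q, r
```

Running `restoring_divide(0b1001010, 0b1000, n=7)` reproduces the slide's example: quotient 1001 and remainder 10 in binary.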

(22) Faster Division
- Can't use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
- Faster dividers (e.g. SRT division) generate multiple quotient bits per step
  - Still require multiple steps
- Customized implementations for high performance, e.g., supercomputers

(23) MIPS Division
- Use HI/LO registers for result
  - HI: 32-bit remainder
  - LO: 32-bit quotient
- Instructions
  - div rs, rt / divu rs, rt
  - No overflow or divide-by-0 checking
    - Software must perform checks if required
  - Use mfhi, mflo to access result
- Study Exercise: Check out signed and unsigned division with QtSPIM

(24) ISA View
- Additional function units and registers (Hi/Lo)
- Additional instructions to move data to/from these registers
  - mfhi, mflo
- What other instructions would you add? Cost?
- [Figure: CPU/Core containing registers $0, $1, ..., $31, the ALU, and Multiply/Divide units with the Hi and Lo registers]

(25) Floating Point (3.5)
- Representation for non-integral numbers
  - Including very small and very large numbers
- Like scientific notation
  - \(-2.34 \times 10^{56}\) (normalized)
  - \(+0.002 \times 10^{-4}\) (not normalized)
  - \(+987.02 \times 10^{9}\) (not normalized)
- In binary
  - \(\pm 1.xxxxxxx_2 \times 2^{yyyy}\)
- Types float and double in C

(26) IEEE 754 Floating-point Representation

Single Precision (32-bit)
    bit 31       bits 30-23          bits 22-0
    S (1 bit)    exponent (8 bits)   significand (23 bits)

\[ (-1)^{\text{sign}} \times (1 + \text{fraction}) \times 2^{\text{exponent} - 127} \]

Double Precision (64-bit)
    bit 63       bits 62-52           bits 51-32              bits 31-0
    S (1 bit)    exponent (11 bits)   significand (20 bits)   significand, continued (32 bits)

\[ (-1)^{\text{sign}} \times (1 + \text{fraction}) \times 2^{\text{exponent} - 1023} \]
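
The single-precision layout can be inspected from software. This is a minimal sketch using Python's standard `struct` module to get the 32-bit pattern; the helper names `decode_single` and `normalized_value` are illustrative, and `normalized_value` applies only to normalized numbers (exponent field between 1 and 254).

```python
import struct


def decode_single(x):
    """Split a value, rounded to single precision, into (sign, exponent, fraction) fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # bit 31
    exponent = (bits >> 23) & 0xFF     # bits 30-23, biased by 127
    fraction = bits & 0x7FFFFF         # bits 22-0
    return sign, exponent, fraction


def normalized_value(sign, exponent, fraction):
    """Evaluate (-1)^sign * (1 + fraction) * 2^(exponent - 127)."""
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)
```

For instance, -0.75 is -1.1 (binary) times 2 to the -1, so its fields are sign 1, exponent 126, and a fraction with only the top bit set.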

(27) Floating Point Standard
- Defined by IEEE Std 754-1985
- Developed in response to divergence of representations
  - Portability issues for scientific code
- Now almost universally adopted
- Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)

(28) FP Adder Hardware
- Much more complex than integer adder
- Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions
- FP adder usually takes several cycles
  - Can be pipelined
- Example: FP Addition

(29) FP Adder Hardware
- [Figure: FP adder datapath, annotated with Step 1 through Step 4]

(30) FP Arithmetic Hardware
- FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
- FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP ↔ integer conversion
- Operations usually take several cycles
  - Can be pipelined

(31) ISA Impact
- FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
- Separate FP registers
  - 32 single-precision: $f0, $f1, … $f31
  - Paired for double-precision: $f0/$f1, $f2/$f3, …
    - Release 2 of the MIPS ISA supports 32 × 64-bit FP registers
- FP instructions operate only on FP registers
  - Programs generally do not perform integer ops on FP data, or vice versa
  - More registers with minimal code-size impact

(32) ISA View: The Co-Processor
- Floating point operations access a separate set of 32-bit registers
  - Pairs of 32-bit registers are used for double precision
- [Figure: CPU/Core with registers $0, $1, ..., $31, the ALU, and Multiply/Divide units with Hi and Lo; Co-Processor 1 with FP registers $0, $1, ..., $31 and the FP ALU; Co-Processor 0 with the BadVaddr, Status, Cause, and EPC registers (covered later)]

(33) ISA View
- Distinct instructions operate on the floating point registers (pg. A-73)
  - Arithmetic instructions
    - add.d fd, fs, ft (double precision) and add.s fd, fs, ft (single precision)
- Data movement to/from floating point coprocessors
  - mfc1 rt, fs and mtc1 rd, fs
- Note that the ISA design implementation is extensible via co-processors
- FP load and store instructions
  - lwc1, ldc1, swc1, sdc1
    - e.g., ldc1 $f8, 32($sp)
- Example: DP Mean

(34) Associativity
- Floating point arithmetic is not associative
- Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail
- Need to validate parallel programs under varying degrees of parallelism
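
A three-value sum is enough to see the effect. This is a minimal sketch in double precision (Python floats); the values and the helper name `sum_in_order` are chosen for illustration, standing in for the different orders a parallel reduction might use.

```python
def sum_in_order(values):
    """Accumulate left to right, as a sequential loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


big, small = 1e20, 1.0

# (big + -big) + small: the large values cancel first, so small survives.
left = (big + -big) + small

# big + (-big + small): small is far below the rounding step of -big
# and is lost before the large values cancel.
right = big + (-big + small)
```

Here `left` and `right` differ even though the same three numbers are added, and `sum_in_order([big, -big, small])` differs from `sum_in_order([small, big, -big])` for the same reason.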

(35) Performance Issues
- Latency of instructions
  - Integer instructions can take a single cycle
  - Floating point instructions can take multiple cycles
  - Some (FP Divide) can take hundreds of cycles
- What about energy? (we will get to that shortly)
- What other instructions would you like in hardware?
  - Would some applications change your mind?
- How do you decide whether to add new instructions?

(36) Multimedia (3.6, 3.7, 3.8)
- Lower dynamic range and precision requirements
  - Do not need 32 bits!
- Inherent parallelism in the operations

(37) Vector Computation
- Operate on multiple data elements (vectors) at a time
- Flexible definition/use of registers
- Registers hold integers, floats (SP), doubles (DP)
- A 128-bit register can hold:
  - 1 × 128-bit integer
  - 4 × 32-bit single precision
  - 2 × 64-bit double precision
  - 8 × 16-bit short integers

(38) Processing Vectors
- When is this more efficient?
- When is this not efficient?
- Think of 3D graphics, linear algebra and media processing
- [Figure: vectors moving between memory and vector registers]

(39) Case Study: Intel Streaming SIMD Extensions
- Eight 128-bit XMM registers
  - x86-64 adds 8 more registers, XMM8-XMM15
- 8, 16, 32, 64 bit integers (SSE2)
- 32-bit (SP) and 64-bit (DP) floating point
- Signed/unsigned integer operations
- IEEE 754 floating point support
- Reading Assignment:
  - http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
  - http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I

(40) Instruction Categories
- Floating point instructions
  - Arithmetic, movement
  - Comparison, shuffling
  - Type conversion, bit level
- Integer
- Other
  - e.g., cache management
- ISA extensions!
- Advanced Vector Extensions (AVX)
  - Successor to SSE
- [Figure labels: register, memory]

(41) Arithmetic View
- Graphics and media processing operates on vectors of 8-bit and 16-bit data
  - Use 64-bit adder, with partitioned carry chain
    - Operate on 8×8-bit, 4×16-bit, or 2×32-bit vectors
  - SIMD (single-instruction, multiple-data)
- Saturating operations
  - On overflow, result is largest representable value
    - cf. 2s-complement modulo arithmetic
  - E.g., clipping in audio, saturation in video
- [Figure: a 64-bit word partitioned as 4×16-bit and as 2×32-bit]
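
The difference between a partitioned modular add and a saturating add can be modeled lane by lane. This is a sketch assuming four unsigned 16-bit lanes packed into a 64-bit word; the function name `packed_add_u16` and its `saturate` flag are illustrative and do not correspond to a specific instruction.

```python
def packed_add_u16(x, y, saturate):
    """Add four unsigned 16-bit lanes packed in 64-bit words x and y."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        a = (x >> shift) & 0xFFFF
        b = (y >> shift) & 0xFFFF
        s = a + b
        if saturate:
            s = min(s, 0xFFFF)   # clamp to the largest representable value
        else:
            s &= 0xFFFF          # carry chain is cut at the lane boundary
        result |= s << shift
    return result
```

With `x = 0x0001FFFF80000010` and `y = 0x0001000180000020`, the two middle lanes overflow: the modular version wraps them to 0x0000, while the saturating version pins them at 0xFFFF. No carry leaks into a neighboring lane in either case.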

(42) SSE Example

    // A 16-byte = 128-bit vector struct
    struct Vector4 { float x, y, z, w; };

    // Add two constant vectors and return the resulting vector
    // (MSVC-style __asm block for 32-bit x86)
    Vector4 SSE_Add(const Vector4 &Op_A, const Vector4 &Op_B)
    {
        Vector4 Ret_Vector;
        __asm {
            MOV EAX, Op_A               // Load pointers into CPU regs
            MOV EBX, Op_B
            MOVUPS XMM0, [EAX]          // Move unaligned vectors to SSE regs
            MOVUPS XMM1, [EBX]
            ADDPS XMM0, XMM1            // Add vector elements
            MOVUPS [Ret_Vector], XMM0   // Save the return vector
        }
        return Ret_Vector;
    }

From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
More complex example (matrix multiply) in Section 3.8, using AVX

(43) Characterizing Parallelism
- Characterization due to M. Flynn*

                                   Single data stream   Multiple data streams
    Single instruction stream      SISD                 SIMD
    Multiple instruction streams   MISD                 MIMD

- SISD: today's serial computing cores (von Neumann model)
- SIMD: single instruction multiple data stream computing, e.g., SSE
- MIMD: today's multicore
* M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C-21 (9): 948-960

(44) Parallelism Categories From http://en.wikipedia.org/wiki/Flynn%27s_taxonomy

(45) Data Parallel vs. Traditional Vector
- Process each square in parallel: data parallel computation
- [Figure: Vector Architecture, with Vector Registers A, B, and C and a pipelined functional unit; Data Parallel Architecture, with registers]

(46) ISA View
- Separate core data path
- Can be viewed as a co-processor with a distinct set of instructions
- [Figure: CPU/Core with registers $0, $1, ..., $31, the ALU, and Multiply/Divide units with Hi and Lo; Vector ALU with SIMD registers XMM0, XMM1, ..., XMM15]

(47) Domain Impact on the ISA: Example

    Scientific Computing     Embedded Systems
    Floats                   Integers
    Double precision         Lower precision
    Massive data             Streaming data
    Power constrained        Security support
                             Energy constrained

(48) Summary
- ISAs support operations required of application domains
  - Note the differences between embedded and supercomputers!
  - Signed, unsigned, FP, SIMD, etc.
- Bounded precision effects
  - Software must be careful about how the hardware is used, e.g., associativity
  - Need standards to promote portability
- Avoid "kitchen sink" designs
  - There is no free lunch
  - Impact on speed and energy: we will get to this later

(49) Study Guide
- Perform 2's complement addition and subtraction (review)
- Add a few more instructions to the simple ALU
  - Add an XOR instruction
  - Add an instruction that returns the max of its inputs
  - Make sure all control signals are accounted for
- Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review)
- Execute the SPIM programs (class website) that use floating point numbers. Study the memory/register contents via single step execution

(50) Study Guide (cont.)
- Write a few simple SPIM programs for
  - Multiplication/division of signed and unsigned numbers
    - Use numbers that produce >32-bit results
    - Move to/from HI and LO registers (find the instructions for doing so)
  - Addition/subtraction of floating point numbers
- Try to write a simple SPIM program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers)
- Look up additional SIMD instruction sets and compare
  - ARM NEON, AltiVec, AMD 3DNow!

(51) Glossary
- Co-processor
- Data parallelism
- Data parallel computation vs. vector computation
- Instruction set extensions
- Overflow
- MIMD
- Precision
- SIMD
- Saturating arithmetic
- Signed arithmetic support
- Unsigned arithmetic support
- Vector processing
