Published by Tabitha Hall. Modified over 9 years ago.
1
ALU Architecture and ISA Extensions. Lecture notes from MKP, H. H. Lee, and S. Yalamanchili.
2
(2) Reading: Sections 3.2-3.5 (only those elements covered in class); Sections 3.6-3.8; Appendix B.5. Goal: understand the ISA view of the core microarchitecture, i.e., the organization of functional units and register files into basic data paths.
3
(3) Overview. Instruction set architectures have a purpose: applications dictate what we need. We only have a fixed number of bits, which has an impact on accuracy. More is not better: we cannot afford everything we want. Basic arithmetic logic unit (ALU) design: addition/subtraction, multiplication, division.
4
(4) Reminder: ISA. Who sees what? [Figure labels: byte-addressed memory (addresses up to 0xFFFFFFFF); memory map with reserved, text segment, data segment (static), dynamic data, and stack regions; register file (programmer-visible state, registers 0x00 to 0x1F); arithmetic logic unit (ALU); program counter; instruction register; kernel registers; programmer-invisible state; processor internal buses; memory interface.]
5
(5) Arithmetic for Computers. Operations on integers: addition and subtraction, multiplication and division, dealing with overflow. Operations on floating-point real numbers: representation and operations. Let us first look at integers.
6
(6) Integer Addition (3.2). Example: 7 + 6. Overflow if the result is out of range. Adding a +ve and a –ve operand: no overflow. Adding two +ve operands: overflow if the result sign is 1. Adding two –ve operands: overflow if the result sign is 0.
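The sign rule on this slide can be sketched in a few lines of Python. This is only an illustration, not part of the original deck: the names `MASK32`, `to_signed32`, and `add32` are hypothetical, and Python's unbounded integers are masked to 32 bits to mimic a register.

```python
MASK32 = 0xFFFFFFFF  # keep only the low 32 bits, like a 32-bit register


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def add32(a, b):
    """Return (signed 32-bit sum, overflow flag) for a + b."""
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    sign_a = a >> 31
    sign_b = b >> 31
    sign_r = result >> 31
    # Overflow only when both operands share a sign and the result's sign differs.
    overflow = (sign_a == sign_b) and (sign_r != sign_a)
    return to_signed32(result), overflow
```

For example, `add32(7, 6)` stays in range, while `add32(0x7FFFFFFF, 1)` adds two positive operands and produces a result whose sign bit is 1, so the overflow flag is set.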
7
(7) Integer Subtraction. Add the negation of the second operand (2's complement representation).
Example: 7 – 6 = 7 + (–6)
    +7: 0000 0000 … 0000 0111
    –6: 1111 1111 … 1111 1010
    +1: 0000 0000 … 0000 0001
Overflow if the result is out of range. Subtracting two +ve or two –ve operands: no overflow. Subtracting a +ve from a –ve operand: overflow if the result sign is 0. Subtracting a –ve from a +ve operand: overflow if the result sign is 1.
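A minimal sketch of subtraction as "add the negation", assuming 32-bit registers modeled by masking Python integers. The helper names `negate32` and `sub32` are made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # 32-bit register width


def negate32(b):
    """Two's complement negation: invert every bit, then add 1."""
    b &= MASK32
    return ((b ^ MASK32) + 1) & MASK32


def sub32(a, b):
    """Return (32-bit result pattern, overflow flag) for a - b."""
    a &= MASK32
    b &= MASK32
    result = (a + negate32(b)) & MASK32
    sign_a = a >> 31
    sign_b = b >> 31
    sign_r = result >> 31
    # Overflow only when the operand signs differ and the result's sign
    # does not match the first operand's sign.
    overflow = (sign_a != sign_b) and (sign_r != sign_a)
    return result, overflow
```

Here `negate32(6)` gives the bit pattern `0xFFFFFFFA`, matching the –6 row of the worked example, and `sub32(7, 6)` yields 1 with no overflow.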
8
(8) ISA Impact. Some languages (e.g., C) ignore overflow: use the MIPS addu, addiu, subu instructions. Other languages (e.g., Ada, Fortran) require raising an exception: use the MIPS add, addi, sub instructions. On overflow, invoke the exception handler: save the PC in the exception program counter (EPC) register; jump to a predefined handler address; the mfc0 (move from coprocessor register) instruction can retrieve the EPC value, to return after corrective action (more later). ALU design leads to many solutions. We look at one simple example.
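The difference between the wrapping and the trapping add can be modeled with a toy machine state. Everything here is a hypothetical sketch: the `Cpu` class, the function names, and the `HANDLER_ADDRESS` constant are illustrative stand-ins, not a description of any real simulator.

```python
MASK32 = 0xFFFFFFFF
HANDLER_ADDRESS = 0x80000180  # assumed handler address, for illustration only


class Cpu:
    """Toy machine state: 32 registers, a PC, and an EPC."""

    def __init__(self):
        self.regs = [0] * 32
        self.pc = 0
        self.epc = 0


def signed_overflow(a, b, r):
    """True when a and b share a sign bit and r has the opposite one."""
    return ((a ^ r) & (b ^ r) & 0x80000000) != 0


def addu(cpu, rd, rs, rt):
    """Wrapping add: overflow is silently ignored."""
    cpu.regs[rd] = (cpu.regs[rs] + cpu.regs[rt]) & MASK32
    cpu.pc += 4


def add(cpu, rd, rs, rt):
    """Trapping add: on overflow, save the PC in EPC and jump to the handler."""
    a, b = cpu.regs[rs], cpu.regs[rt]
    r = (a + b) & MASK32
    if signed_overflow(a, b, r):
        cpu.epc = cpu.pc          # remember where the fault happened
        cpu.pc = HANDLER_ADDRESS  # destination register is left unchanged
        return
    cpu.regs[rd] = r
    cpu.pc += 4
```

The design point is that the same adder serves both instructions; only the reaction to the overflow signal differs.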
9
(9) Integer ALU (arithmetic logic unit) (B.5). Build a 1-bit ALU, and use 32 of them (bit-slice). [Figure: 1-bit ALU with inputs a, b, and operation and output result; truth table with columns op, a, b, res.]
10
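(10) Single-Bit ALU. Implements only the AND and OR operations. [Figure: 1-bit logic unit with inputs A and B, a multiplexer with inputs 0 and 1 selected by Operation, and output Result.]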
11
(11) Adding Functionality. We can add additional operators (to a point). How about addition? Review full adders from digital design: c_out = a·b + a·c_in + b·c_in; sum = a ⊕ b ⊕ c_in.
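The full-adder equations translate directly into bitwise operations, and chaining the carry from one bit to the next gives a ripple-carry adder. The function names below are hypothetical; the logic is the standard full adder.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: returns (sum, carry_out) for bits a, b, c_in."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out


def ripple_add(a, b, width=32, carry_in=0):
    """Add two width-bit values by chaining full adders from bit 0 upward."""
    result = 0
    carry = carry_in
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry  # carry is the CarryOut of the top bit
```

`ripple_add(7, 6)` reproduces the 7 + 6 example, and adding 1 to `0xFFFFFFFF` wraps to 0 with a carry out of the top bit.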
12
(12) Building a 32-bit ALU
13
(13) Subtraction (a – b)? Two's complement approach: just negate b and add 1. How do we negate? A clever solution: invert b with the Binvert control line and set the CarryIn of ALU0 to 1. [Figure: 32 one-bit ALUs, ALU0 through ALU31, chained from CarryOut to CarryIn, with inputs a0..a31 and b0..b31, outputs Result0..Result31, and shared Binvert and Operation lines; the sub signal drives Binvert and the CarryIn of ALU0.]
14
(14) Tailoring the ALU to MIPS. Need to support the set-on-less-than instruction (slt). Remember: slt is an arithmetic instruction; it produces a 1 if rs < rt and 0 otherwise. Use subtraction: (a-b) < 0 implies a < b. Need to support the test for equality (beq $t5, $t6, label). Use subtraction: (a-b) = 0 implies a = b.
15
(15) What is Result31 when (a-b) < 0? Unsigned vs. signed support. [Figure: 1-bit ALU with inputs a, b, Less, Binvert, and CarryIn; a multiplexer with inputs 0 to 3 selected by Operation; outputs Result and CarryOut.]
16
(16) Test for equality. Notice the control lines: 000 = and, 001 = or, 010 = add, 110 = subtract, 111 = slt. Note: Zero is 1 when the result is zero! Note the test for overflow!
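The five control codes can be exercised with a small behavioral model. This is a sketch under stated assumptions: the top control bit is treated as Binvert (also fed to the adder's carry-in) and the low two bits as the multiplexer select. As a refinement beyond the simple design, slt here uses the adder's sign bit XOR the overflow signal so that it stays correct when a – b overflows. The function name `alu` and the return shape are hypothetical.

```python
MASK32 = 0xFFFFFFFF
SUPPORTED = (0b000, 0b001, 0b010, 0b110, 0b111)  # and, or, add, subtract, slt


def alu(control, a, b):
    """Return (result, zero, overflow) for a 3-bit control code."""
    if control not in SUPPORTED:
        raise ValueError("unsupported control code")
    a &= MASK32
    b &= MASK32
    binvert = control >> 2          # 1 for subtract and slt
    operation = control & 0b11      # multiplexer select
    b_in = (b ^ MASK32) if binvert else b
    adder = (a + b_in + binvert) & MASK32   # CarryIn of bit 0 equals Binvert
    adder_overflow = bool((a ^ adder) & (b_in ^ adder) & 0x80000000)
    if operation == 0b00:
        result = a & b_in
    elif operation == 0b01:
        result = a | b_in
    elif operation == 0b10:
        result = adder
    else:
        # slt: sign of (a - b), corrected when the subtraction overflows
        result = (adder >> 31) ^ int(adder_overflow)
    overflow = adder_overflow if operation == 0b10 else False
    zero = int(result == 0)
    return result, zero, overflow
```

A beq comparison corresponds to calling `alu(0b110, a, b)` and checking the zero output.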
17
(17) ISA View. Register-to-register data path: we want this to be as fast as possible. [Figure: CPU/core with registers $0, $1, … $31 feeding the ALU.]
18
(18) Multiplication (3.3). Long multiplication:
         1000      (multiplicand)
       × 1001      (multiplier)
         ----
         1000
        0000
       0000
      1000
      -------
      1001000      (product)
The length of the product is the sum of the operand lengths.
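Long multiplication in binary is a sequence of shifts and conditional adds: each 1 bit in the multiplier contributes a shifted copy of the multiplicand. A minimal sketch, with a hypothetical function name:

```python
def shift_add_multiply(multiplicand, multiplier, width=32):
    """Multiply two unsigned width-bit values by shift-and-add."""
    product = 0
    for i in range(width):
        if (multiplier >> i) & 1:
            # This multiplier bit is 1: add the multiplicand, shifted i places.
            product += multiplicand << i
    return product  # needs up to 2 * width bits
```

With 4-bit operands, `shift_add_multiply(0b1000, 0b1001, 4)` gives `0b1001000`, the 7-bit product from the worked example.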
19
(19) A Multiplier. Uses multiple adders: a cost/performance tradeoff. Can be pipelined: several multiplications performed in parallel.
20
(20) MIPS Multiplication. Two 32-bit registers for the product: HI holds the most-significant 32 bits, LO the least-significant 32 bits. Instructions: mult rs, rt / multu rs, rt: 64-bit product in HI/LO. mfhi rd / mflo rd: move from HI/LO to rd; can test the HI value to see if the product overflows 32 bits. mul rd, rs, rt: least-significant 32 bits of the product -> rd. Study exercise: check out signed and unsigned multiplication with QtSPIM.
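A sketch of how the 64-bit product splits across HI and LO, and of the "test HI" overflow check for the signed case. The Python function names mirror the instruction mnemonics for readability but are otherwise hypothetical; `product_fits_signed32` is an illustrative helper.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def multu(rs, rt):
    """Unsigned multiply: return (HI, LO) halves of the 64-bit product."""
    product = (rs & MASK32) * (rt & MASK32)
    return product >> 32, product & MASK32


def mult(rs, rt):
    """Signed multiply: return (HI, LO) halves of the 64-bit product."""
    product = (to_signed32(rs) * to_signed32(rt)) & MASK64
    return product >> 32, product & MASK32


def product_fits_signed32(hi, lo):
    """A signed product fits in 32 bits when HI is the sign extension of LO."""
    return hi == (MASK32 if lo & 0x80000000 else 0)
```

The same bit patterns give different HI values under the two interpretations: `mult(0xFFFFFFFF, 2)` treats the first operand as –1, while `multu(0xFFFFFFFF, 2)` treats it as 4294967295.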
21
(21) Division (3.4). Check for a 0 divisor. Long division approach: if divisor ≤ dividend bits, 1 bit in quotient, subtract; otherwise, 0 bit in quotient, bring down the next dividend bit. Restoring division: do the subtract, and if the remainder goes < 0, add the divisor back. Signed division: divide using absolute values, then adjust the sign of the quotient and remainder as required.
                1001      (quotient)
    1000 ) 1001010        (divisor, dividend)
          -1000
              10
              101
              1010
             -1000
                10        (remainder)
n-bit operands yield an n-bit quotient and remainder.
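Restoring division can be sketched as a loop that brings down one dividend bit per step, tries the subtraction, and adds the divisor back if the remainder went negative. The function names are hypothetical. The signed wrapper assumes the common convention that the remainder takes the sign of the dividend; the slide only says to adjust signs "as required".

```python
def restoring_divide(dividend, divisor, width=32):
    """Unsigned restoring division: returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor is zero")
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        # Bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor              # trial subtraction
        if remainder < 0:
            remainder += divisor          # restore; quotient bit is 0
            quotient <<= 1
        else:
            quotient = (quotient << 1) | 1
    return quotient, remainder


def signed_divide(dividend, divisor, width=32):
    """Divide absolute values, then fix up the signs (assumed convention)."""
    q, r = restoring_divide(abs(dividend), abs(divisor), width)
    if (dividend < 0) != (divisor < 0):
        q = -q
    if dividend < 0:
        r = -r  # remainder follows the dividend's sign
    return q, r
```

`restoring_divide(0b1001010, 0b1000, 8)` reproduces the worked example: quotient `0b1001`, remainder `0b10`.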
22
(22) Faster Division. Can't use parallel hardware as in the multiplier: subtraction is conditional on the sign of the remainder. Faster dividers (e.g., SRT division) generate multiple quotient bits per step, but still require multiple steps. Customized implementations for high performance, e.g., supercomputers.
23
(23) MIPS Division. Use the HI/LO registers for the result: HI holds the 32-bit remainder, LO the 32-bit quotient. Instructions: div rs, rt / divu rs, rt. No overflow or divide-by-0 checking: software must perform checks if required. Use mfhi, mflo to access the result. Study exercise: check out signed and unsigned division with QtSPIM.
24
(24) ISA View. Additional function units and registers (HI/LO). Additional instructions to move data to/from these registers: mfhi, mflo. What other instructions would you add? Cost? [Figure: CPU/core with registers $0, $1, … $31, the ALU, and a Multiply/Divide unit with the Hi and Lo registers.]
25
(25) Floating Point (3.5). Representation for non-integral numbers, including very small and very large numbers. Like scientific notation: –2.34 × 10^56 (normalized), +0.002 × 10^–4 (not normalized), +987.02 × 10^9 (not normalized). In binary: ±1.xxxxxxx (base 2) × 2^yyyy. Types float and double in C.
26
(26) IEEE 754 Floating-point Representation.
Single precision (32-bit): bit 31 = S (sign, 1 bit); bits 30-23 = exponent (8 bits); bits 22-0 = significand (23 bits).
    value = (–1)^sign × (1 + fraction) × 2^(exponent – 127)
Double precision (64-bit): bit 63 = S (sign, 1 bit); bits 62-52 = exponent (11 bits); bits 51-32 = significand (20 bits); bits 31-0 = significand, continued (32 bits).
    value = (–1)^sign × (1 + fraction) × 2^(exponent – 1023)
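The single-precision layout can be checked by pulling the three fields out of a 32-bit pattern and plugging them back into the formula. A small sketch using Python's standard `struct` module; the function names are hypothetical, and the value formula is applied only to normalized numbers (exponent field 1 to 254).

```python
import struct


def float_to_bits(value):
    """Return the 32-bit single-precision encoding of value as an integer."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def decode_single(bits):
    """Split a 32-bit pattern into (sign, exponent, fraction) fields."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def normalized_value(sign, exponent, fraction):
    """Apply (-1)^sign x (1 + fraction) x 2^(exponent - 127)."""
    if not 1 <= exponent <= 254:
        raise ValueError("not a normalized number")
    return (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)
```

For –0.75, which is –1.1 (binary) × 2^–1, the fields come out as sign 1, exponent 126, and a fraction whose top bit is set.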
27
(27) Floating Point Standard. Defined by IEEE Std 754-1985. Developed in response to divergence of representations, which caused portability issues for scientific code. Now almost universally adopted. Two representations: single precision (32-bit) and double precision (64-bit).
28
(28) FP Adder Hardware. Much more complex than an integer adder. Doing it in one clock cycle would take too long: much longer than integer operations, and a slower clock would penalize all instructions. An FP adder usually takes several cycles; it can be pipelined. Example: FP addition.
29
(29) FP Adder Hardware. [Figure: FP adder data path, annotated Step 1 through Step 4.]
30
(30) FP Arithmetic Hardware. An FP multiplier is of similar complexity to an FP adder, but uses a multiplier for significands instead of an adder. FP arithmetic hardware usually does addition, subtraction, multiplication, division, reciprocal, square root, and FP ↔ integer conversion. Operations usually take several cycles; they can be pipelined.
31
(31) ISA Impact. FP hardware is coprocessor 1: an adjunct processor that extends the ISA. Separate FP registers: 32 single-precision registers $f0, $f1, … $f31, paired for double precision: $f0/$f1, $f2/$f3, … (Release 2 of the MIPS ISA supports 32 × 64-bit FP registers). FP instructions operate only on FP registers: programs generally do not perform integer ops on FP data, or vice versa. More registers with minimal code-size impact.
32
(32) ISA View: The Co-Processor. Floating point operations access a separate set of 32-bit registers. Pairs of 32-bit registers are used for double precision. [Figure: CPU/core with registers $0, $1, … $31, the ALU, and a Multiply/Divide unit with Hi and Lo; Co-Processor 1 with FP registers $0, $1, … $31 and an FP ALU; Co-Processor 0 with the BadVaddr, Status, Cause, and EPC registers (covered later).]
33
(33) ISA View. Distinct instructions operate on the floating point registers (pg. A-73). Arithmetic instructions: add.d fd, fs, ft (double precision) and add.s fd, fs, ft (single precision). Data movement to/from floating point coprocessors: mfc1 rt, fs and mtc1 rd, fs. Note that the ISA design and its implementation are extensible via co-processors. FP load and store instructions: lwc1, ldc1, swc1, sdc1, e.g., ldc1 $f8, 32($sp). Example: DP mean.
34
(34) Associativity. Floating point arithmetic is not associative. Parallel programs may interleave operations in unexpected orders, so assumptions of associativity may fail. Need to validate parallel programs under varying degrees of parallelism.
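A three-operand example is enough to show a grouping changing the answer. The sketch below uses Python floats, which are IEEE 754 double precision. The values are chosen so that 1.0 is far below the precision available at magnitude 1e20. The function names are hypothetical.

```python
def left_grouped(a, b, c):
    """Evaluate (a + b) + c."""
    return (a + b) + c


def right_grouped(a, b, c):
    """Evaluate a + (b + c)."""
    return a + (b + c)


A, B, C = 1e20, -1e20, 1.0

# (A + B) cancels exactly to 0.0, so adding C afterwards keeps the 1.0.
# (B + C) rounds back to -1e20 because 1.0 is too small to register,
# so the final sum is 0.0.
LEFT = left_grouped(A, B, C)
RIGHT = right_grouped(A, B, C)
```

A parallel reduction that splits a sum across threads effectively picks a different grouping on each run configuration, which is why results can change with the degree of parallelism.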
35
(35) Performance Issues. Latency of instructions: integer instructions can take a single cycle; floating point instructions can take multiple cycles; some (FP divide) can take hundreds of cycles. What about energy? (We will get to that shortly.) What other instructions would you like in hardware? Would some applications change your mind? How do you decide whether to add new instructions?
36
(36) Multimedia (3.6, 3.7, 3.8). Lower dynamic range and precision requirements: do not need 32 bits! Inherent parallelism in the operations.
37
(37) Vector Computation. Operate on multiple data elements (vectors) at a time. Flexible definition/use of registers: registers hold integers, floats (SP), doubles (DP). A 128-bit register can hold 1 × 128-bit integer, 4 × 32-bit single precision, 2 × 64-bit double precision, or 8 × 16-bit short integers.
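The "flexible use" of a vector register means that the same 128 bits are simply reinterpreted by each instruction. A sketch using the standard `struct` module, with 16 raw bytes standing in for the register. Little-endian byte order and the name `register_views` are assumptions for illustration.

```python
import struct


def register_views(raw):
    """Reinterpret 16 raw bytes as each of the four register layouts."""
    if len(raw) != 16:
        raise ValueError("a 128-bit register holds exactly 16 bytes")
    return {
        "4 x float32": struct.unpack("<4f", raw),
        "2 x float64": struct.unpack("<2d", raw),
        "8 x int16": struct.unpack("<8h", raw),
        "1 x int128": int.from_bytes(raw, "little"),
    }


# Fill the register with four single-precision values, then look at the
# same bits through the other views.
RAW = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
VIEWS = register_views(RAW)
```

Only the first view is meaningful for this data; the others show that nothing in the register itself records which layout was intended.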
38
(38) Processing Vectors. [Figure: vectors move between memory and vector registers.] When is this more efficient? When is this not efficient? Think of 3D graphics, linear algebra, and media processing.
39
(39) Case Study: Intel Streaming SIMD Extensions. Eight 128-bit XMM registers; x86-64 adds 8 more registers, XMM8-XMM15. 8-, 16-, 32-, and 64-bit integers (SSE2). 32-bit (SP) and 64-bit (DP) floating point. Signed/unsigned integer operations. IEEE 754 floating point support. Reading assignment: http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions and http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
40
(40) Instruction Categories. Floating point instructions: arithmetic, movement; comparison, shuffling; type conversion, bit level. Integer. Other, e.g., cache management. ISA extensions! Advanced Vector Extensions (AVX): successor to SSE. [Figure: register and memory operands.]
41
(41) Arithmetic View. Graphics and media processing operates on vectors of 8-bit and 16-bit data. Use a 64-bit adder with a partitioned carry chain: operate on 8 × 8-bit, 4 × 16-bit, or 2 × 32-bit vectors. SIMD (single-instruction, multiple-data). Saturating operations: on overflow, the result is the largest representable value (cf. 2's-complement modulo arithmetic), e.g., clipping in audio, saturation in video. [Figure: 64-bit word partitioned as 4 × 16-bit and 2 × 32-bit.]
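A partitioned carry chain means carries stop at lane boundaries. The sketch below models that lane by lane, with a wraparound (modulo) version and an unsigned saturating version side by side. The function names are hypothetical, and real hardware does this in one pass rather than with a loop.

```python
def partitioned_add(a, b, lane_bits=8, width=64):
    """Lane-wise modulo add: carries never cross a lane boundary."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, width, lane_bits):
        lane = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane & lane_mask) << shift   # drop the carry out
    return result


def saturating_add_unsigned(a, b, lane_bits=8, width=64):
    """Lane-wise unsigned add that clamps at the largest lane value."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, width, lane_bits):
        lane = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= min(lane, lane_mask) << shift  # clamp instead of wrapping
    return result
```

Adding `0x01` to a lane holding `0xFF` wraps to `0x00` in the first function and stays at `0xFF` in the second, which is the clipping behavior wanted for audio samples and pixel values.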
42
(42) SSE Example
    // A 16-byte = 128-bit vector struct
    struct Vector4 { float x, y, z, w; };

    // Add two constant vectors and return the resulting vector
    Vector4 SSE_Add(const Vector4 &Op_A, const Vector4 &Op_B)
    {
        Vector4 Ret_Vector;
        __asm {
            MOV EAX, Op_A              // Load pointers into CPU regs
            MOV EBX, Op_B
            MOVUPS XMM0, [EAX]         // Move unaligned vectors to SSE regs
            MOVUPS XMM1, [EBX]
            ADDPS XMM0, XMM1           // Add vector elements
            MOVUPS [Ret_Vector], XMM0  // Save the return vector
        }
        return Ret_Vector;
    }
From http://neilkemp.us/src/sse_tutorial/sse_tutorial.html#I
More complex example (matrix multiply) in Section 3.8, using AVX.
43
(43) Characterizing Parallelism. Characterization due to M. Flynn*:
                               Data streams
                               Single    Multiple
    Instruction    Single      SISD      SIMD
    streams        Multiple    MISD      MIMD
SISD: today's serial computing cores (von Neumann model). SIMD: single instruction, multiple data stream computing, e.g., SSE. MIMD: today's multicore.
*M. Flynn (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C-21 (9): 948-960.
44
(44) Parallelism Categories From http://en.wikipedia.org/wiki/Flynn%27s_taxonomy
45
(45) Data Parallel vs. Traditional Vector. [Figure: a vector architecture, with vector registers A, B, and C and a pipelined functional unit, next to a data parallel architecture with registers.] Process each square in parallel: data parallel computation.
46
(46) ISA View. Separate core data path: can be viewed as a co-processor with a distinct set of instructions. [Figure: CPU/core with registers $0, $1, … $31, the ALU, and a Multiply/Divide unit with Hi and Lo, plus SIMD registers XMM0, XMM1, … XMM15 feeding a vector ALU.]
47
(47) Domain Impact on the ISA: Example. Scientific computing: floats, double precision, massive data, power constrained. Embedded systems: integers, lower precision, streaming data, security support, energy constrained.
48
(48) Summary. ISAs support operations required of application domains: note the differences between embedded systems and supercomputers! Signed, unsigned, FP, SIMD, etc. Bounded precision effects: software must be careful how the hardware is used, e.g., associativity. Need standards to promote portability. Avoid "kitchen sink" designs: there is no free lunch. Impact on speed and energy: we will get to this later.
49
(49) Study Guide. Perform 2's complement addition and subtraction (review). Add a few more instructions to the simple ALU: add an XOR instruction; add an instruction that returns the max of its inputs; make sure all control signals are accounted for. Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review). Execute the SPIM programs (class website) that use floating point numbers. Study the memory/register contents via single-step execution.
50
(50) Study Guide (cont.). Write a few simple SPIM programs for: multiplication/division of signed and unsigned numbers (use numbers that produce >32-bit results; move to/from the HI and LO registers, finding the instructions for doing so); addition/subtraction of floating point numbers. Try to write a simple SPIM program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers). Look up additional SIMD instruction sets and compare: ARM NEON, AltiVec, AMD 3DNow!
51
(51) Glossary: co-processor; data parallelism; data parallel computation vs. vector computation; instruction set extensions; overflow; MIMD; precision; SIMD; saturating arithmetic; signed arithmetic support; unsigned arithmetic support; vector processing.