ALU Architecture and ISA Extensions Lecture notes from MKP, H. H. Lee and S. Yalamanchili.

Slides:



Advertisements
Similar presentations
Morgan Kaufmann Publishers Arithmetic for Computers
Advertisements

Spring 2013 Advising Starts this week! CS2710 Computer Organization1.
Chapter 3 Arithmetic for Computers. Chapter 3 — Arithmetic for Computers — 2 Arithmetic for Computers Operations on integers Addition and subtraction.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 3 Arithmetic for Computers.
Computer Organization CS224 Fall 2012 Lesson 19. Floating-Point Example  What number is represented by the single-precision float …00 
Arithmetic in Computers Chapter 4 Arithmetic in Computers2 Outline Data representation integers Unsigned integers Signed integers Floating-points.
CS/ECE 3330 Computer Architecture Chapter 3 Arithmetic.
Chapter 3 Arithmetic for Computers. Exam 1 CSCE
1 CONSTRUCTING AN ARITHMETIC LOGIC UNIT CHAPTER 4: PART II.
Lecture 16: Computer Arithmetic Today’s topic –Floating point numbers –IEEE 754 representations –FP arithmetic Reminder –HW 4 due Monday 1.
Lecture 15: Computer Arithmetic Today’s topic –Division 1.
CMPT 334 Computer Organization Chapter 3 Arithmetic for Computers [Adapted from Computer Organization and Design 5 th Edition, Patterson & Hennessy, ©
Lecture Objectives: 1)Perform binary division of two numbers. 2)Define dividend, divisor, quotient, and remainder. 3)Explain how division is accomplished.
1 Lecture 9: Floating Point Today’s topics:  Division  IEEE 754 representations  FP arithmetic Reminder: assignment 4 will be posted later today.
Chapter 3 Arithmetic for Computers. Multiplication More complicated than addition accomplished via shifting and addition More time and more area Let's.
Lecture 9 Sept 28 Chapter 3 Arithmetic for Computers.
ECE 15B Computer Organization Spring 2010 Dmitri Strukov Lecture 4: Arithmetic / Data Transfer Instructions Partially adapted from Computer Organization.
1  2004 Morgan Kaufmann Publishers Chapter Three.
1 Chapter 4: Arithmetic Where we've been: –Performance (seconds, cycles, instructions) –Abstractions: Instruction Set Architecture Assembly Language and.
1 Lecture 8: Binary Multiplication & Division Today’s topics:  Addition/Subtraction  Multiplication  Division Reminder: get started early on assignment.
ECE 15B Computer Organization Spring 2010 Dmitri Strukov Lecture 6: Logic/Shift Instructions Partially adapted from Computer Organization and Design, 4.
1 ECE369 Chapter 3. 2 ECE369 Multiplication More complicated than addition –Accomplished via shifting and addition More time and more area.
COMPUTER ARCHITECTURE & OPERATIONS I Instructor: Hao Ji.
Lecture Objectives: 1)Explain the relationship between addition and subtraction with twos complement numbering systems 2)Explain the concept of numeric.
Chapter 3 -Arithmetic for Computers Operations on integers – Addition and subtraction – Multiplication and division – Dealing with overflow Floating-point.
1 Bits are just bits (no inherent meaning) — conventions define relationship between bits and numbers Binary numbers (base 2)
CPS3340 COMPUTER ARCHITECTURE Fall Semester, /14/2013 Lecture 16: Floating Point Instructor: Ashraf Yaseen DEPARTMENT OF MATH & COMPUTER SCIENCE.
1 ECE369 Sections 3.5, 3.6 and ECE369 Number Systems Fixed Point: Binary point of a real number in a certain position –Can treat real numbers as.
1 EGRE 426 Fall 08 Chapter Three. 2 Arithmetic What's up ahead: –Implementing the Architecture 32 operation result a b ALU.
1  1998 Morgan Kaufmann Publishers Arithmetic Where we've been: –Performance (seconds, cycles, instructions) –Abstractions: Instruction Set Architecture.
CSF 2009 Arithmetic for Computers Chapter 3. Arithmetic for Computers Operations on integers Addition and subtraction Multiplication and division Dealing.
Lecture 9: Floating Point
Floating Point Representation for non-integral numbers – Including very small and very large numbers Like scientific notation – –2.34 × –
ECE 456 Computer Architecture
Chapter 3 Arithmetic for Computers (Integers). Florida A & M University - Department of Computer and Information Sciences Arithmetic for Computers Operations.
Chapter 3 Arithmetic for Computers. Chapter 3 — Arithmetic for Computers — 2 Arithmetic for Computers Operations on integers Addition and subtraction.
Chapter 3 Arithmetic for Computers. Chapter 3 — Arithmetic for Computers — 2 Arithmetic for Computers Operations on integers Addition and subtraction.
1 ELEN 033 Lecture 4 Chapter 4 of Text (COD2E) Chapters 3 and 4 of Goodman and Miller book.
C OMPUTER O RGANIZATION AND D ESIGN The Hardware/Software Interface 5 th Edition Chapter 3 Arithmetic for Computers.
Chapter 3 Arithmetic for Computers. Chapter 3 — Arithmetic for Computers — 2 Arithmetic for Computers Operations on integers Addition and subtraction.
Division Check for 0 divisor Long division approach – If divisor ≤ dividend bits 1 bit in quotient, subtract – Otherwise 0 bit in quotient, bring down.
Morgan Kaufmann Publishers Arithmetic for Computers
Prof. Hsien-Hsin Sean Lee
9/23/2004Comp 120 Fall September Chapter 4 – Arithmetic and its implementation Assignments 5,6 and 7 posted to the class web page.
By Wannarat Computer System Design Lecture 3 Wannarat Suntiamorntut.
1 CPTR 220 Computer Organization Computer Architecture Assembly Programming.
Arithmetic for Computers Chapter 3 1. Arithmetic for Computers  Operations on integers  Addition and subtraction  Multiplication and division  Dealing.
Computer Architecture & Operations I
ALU Architecture and ISA Extensions
Morgan Kaufmann Publishers Arithmetic for Computers
Computer Architecture & Operations I
Morgan Kaufmann Publishers Arithmetic for Computers
Integer Multiplication and Division
Computer Architecture & Operations I
Morgan Kaufmann Publishers Arithmetic for Computers
Morgan Kaufmann Publishers Arithmetic for Computers
Morgan Kaufmann Publishers
/ Computer Architecture and Design
Morgan Kaufmann Publishers
Ch 3: Computer Arithmetic
CDA 3101 Summer 2007 Introduction to Computer Organization
Arithmetic for Computers
CS352H: Computer Systems Architecture
Computer Architecture Computer Science & Engineering
ECE232: Hardware Organization and Design
Computer Arithmetic Multiplication, Floating Point
Morgan Kaufmann Publishers Arithmetic for Computers
MIPS Arithmetic and Logic Instructions
Presentation transcript:

ALU Architecture and ISA Extensions Lecture notes from MKP, H. H. Lee and S. Yalamanchili

(2) Reading Sections (only those elements covered in class) Sections Appendix B.5 Goal: Understand the  ISA view of the core microarchitecture  Organization of functional units and register files into basic data paths

(3) Overview Instruction Set Architectures have a purpose  Applications dictate what we need We only have a fixed number of bits  Impact on accuracy More is not better  We cannot afford everything we want Basic Arithmetic Logic Unit (ALU) Design  Addition/subtraction, multiplication, division

(4) Reminder: ISA byte addressed memory 0xFFFFFFFF Arithmetic Logic Unit (ALU) 0x00 0x01 0x02 0x03 0x1F Processor Internal Buses Memory Interface Register File (Programmer Visible State) stack Data segment (static) Text Segment Dynamic Data Reserved Program Counter Programmer Invisible State Kernel registers Who sees what? Memory Map Instruction register

(5) Arithmetic for Computers Operations on integers  Addition and subtraction  Multiplication and division  Dealing with overflow Operation on floating-point real numbers  Representation and operations Let us first look at integers

(6) Integer Addition(3.2) Example: Overflow if result out of range Adding +ve and –ve operands, no overflow Adding two +ve operands Overflow if result sign is 1 Adding two –ve operands Overflow if result sign is 0

(7) Integer Subtraction Add negation of second operand Example: 7 – 6 = 7 + (–6) +7: … –6: … : … Overflow if result out of range  Subtracting two +ve or two –ve operands, no overflow  Subtracting +ve from –ve operand oOverflow if result sign is 0  Subtracting –ve from +ve operand oOverflow if result sign is 1 2’s complement representation

(8) ISA Impact Some languages (e.g., C) ignore overflow  Use MIPS addu, addui, subu instructions Other languages (e.g., Ada, Fortran) require raising an exception  Use MIPS add, addi, sub instructions  On overflow, invoke exception handler oSave PC in exception program counter (EPC) register oJump to predefined handler address omfc0 (move from coprocessor register) instruction can retrieve EPC value, to return after corrective action (more later) ALU Design leads to many solutions. We look at one simple example

(9) Build a 1 bit ALU, and use 32 of them (bit-slice) b a operation result opabres Integer ALU (arithmetic logic unit)(B.5)

(10) Single Bit ALU 0 1 A B Result Operation Implements only AND and OR operations

(11) We can add additional operators (to a point) How about addition? Review full adders from digital design Adding Functionality c out = ab + ac in + bc in sum = a  b  c in

(12) Building a 32-bit ALU

(13) Two's complement approach: just negate b and add 1. How do we negate? A clever solution: Subtraction (a – b) ? Binvert b31 b0 b1 b2 Result31 a31 Result0 CarryIn a0 Result1 a1 Result2 a2 Operation ALU0 CarryIn CarryOut ALU1 CarryIn CarryOut ALU2 CarryIn CarryOut ALU31 CarryIn sub

(14) Need to support the set-on-less-than instruction( slt )  remember: slt is an arithmetic instruction  produces a 1 if rs < rt and 0 otherwise  use subtraction: (a-b) < 0 implies a < b Need to support test for equality ( beq $t5, $t6, $t7 )  use subtraction: (a-b) = 0 implies a = b Tailoring the ALU to the MIPS

(15) What Result31 is when (a-b)<0? 0 3 Result Operation a 1 CarryIn CarryOut 0 1 Binvert b 2 Less Unsigned vs. signed support

(16) Test for equality Notice control lines: 000 = and 001 = or 010 = add 110 = subtract 111 = slt Note: zero is a 1 when the result is zero! Note test for overflow!

(17) ISA View Register-to-Register data path We want this to be as fast as possible ALU $0 $1 $31 CPU/Core

(18) Multiplication (3.3) Long multiplication 1000 × Length of product is the sum of operand lengths multiplicand multiplier product

(19) A Multiplier Uses multiple adders  Cost/performance tradeoff Can be pipelined Several multiplication performed in parallel

(20) MIPS Multiplication Two 32-bit registers for product  HI: most-significant 32 bits  LO: least-significant 32-bits Instructions  mult rs, rt / multu rs, rt o64-bit product in HI/LO  mfhi rd / mflo rd oMove from HI/LO to rd oCan test HI value to see if product overflows 32 bits  mul rd, rs, rt oLeast-significant 32 bits of product – > rd Study Exercise: Check out signed and unsigned multiplication with QtSPIM

(21) Division(3.4) Check for 0 divisor Long division approach  If divisor ≤ dividend bits o1 bit in quotient, subtract  Otherwise o0 bit in quotient, bring down next dividend bit Restoring division  Do the subtract, and if remainder goes < 0, add divisor back Signed division  Divide using absolute values  Adjust sign of quotient and remainder as required n-bit operands yield n-bit quotient and remainder quotient dividend remainder divisor

(22) Faster Division Can’t use parallel hardware as in multiplier  Subtraction is conditional on sign of remainder Faster dividers (e.g. SRT division) generate multiple quotient bits per step  Still require multiple steps Customized implementations for high performance, e.g., supercomputers

(23) MIPS Division Use HI/LO registers for result  HI: 32-bit remainder  LO: 32-bit quotient Instructions  div rs, rt / divu rs, rt  No overflow or divide-by-0 checking oSoftware must perform checks if required  Use mfhi, mflo to access result Study Exercise: Check out signed and unsigned division with QtSPIM

(24) ISA View Additional function units and registers (Hi/Lo) Additional instructions to move data to/from these registers  mfhi, mflo What other instructions would you add? Cost? ALU Hi Multiply Divide Lo $0 $1 $31 CPU/Core

(25) Floating Point(3.5) Representation for non-integral numbers  Including very small and very large numbers Like scientific notation  –2.34 ×  × 10 –4  × 10 9 In binary  ±1.xxxxxxx 2 × 2 yyyy Types float and double in C normalized not normalized

(26) IEEE 754 Floating-point Representation Sexponentsignificand 1bit 8 bits 23 bits Sexponentsignificand 1bit 11 bits 20 bits significand (continued) 32 bits Single Precision (32-bit) Double Precision (64-bit) (–1) sign x (1+fraction) x 2 exponent-127 (–1) sign x (1+fraction) x 2 exponent-1023

(27) Floating Point Standard Defined by IEEE Std Developed in response to divergence of representations  Portability issues for scientific code Now almost universally adopted Two representations  Single precision (32-bit)  Double precision (64-bit)

(28) FP Adder Hardware Much more complex than integer adder Doing it in one clock cycle would take too long  Much longer than integer operations  Slower clock would penalize all instructions FP adder usually takes several cycles  Can be pipelined Example: FP Addition

(29) FP Adder Hardware Step 1 Step 2 Step 3 Step 4

(30) FP Arithmetic Hardware FP multiplier is of similar complexity to FP adder  But uses a multiplier for significands instead of an adder FP arithmetic hardware usually does  Addition, subtraction, multiplication, division, reciprocal, square-root  FP  integer conversion Operations usually takes several cycles  Can be pipelined

(31) ISA Impact FP hardware is coprocessor 1  Adjunct processor that extends the ISA Separate FP registers  32 single-precision: $f0, $f1, … $f31  Paired for double-precision: $f0/$f1, $f2/$f3, … oRelease 2 of MIPs ISA supports 32 × 64-bit FP reg’s FP instructions operate only on FP registers  Programs generally do not perform integer ops on FP data, or vice versa  More registers with minimal code-size impact

(32) ISA View: The Co-Processor Floating point operations access a separate set of 32-bit registers  Pairs of 32-bit registers are used for double precision ALU Hi Multiply Divide Lo $0 $1 $31 FP ALU $0 $1 $31 BadVaddr Status Causes EPC CPU/Core Co-Processor 1 Co-Processor 0 later

(33) ISA View Distinct instructions operate on the floating point registers (pg. A-73)  Arithmetic instructions oadd.d fd, fs, ft, and add.s fd, fs, ft Data movement to/from floating point coprocessors  mcf1 rt, fs and mtc1 rd, fs Note that the ISA design implementation is extensible via co-processors FP load and store instructions  lwc1, ldc1, swc1, sdc1 oe.g., ldc1 $f8, 32($sp) single precisiondouble precision Example: DP Mean

(34) Associativity Floating point arithmetic is not commutative Parallel programs may interleave operations in unexpected orders  Assumptions of associativity may fail Need to validate parallel programs under varying degrees of parallelism

(35) Performance Issues Latency of instructions  Integer instructions can take a single cycle  Floating point instructions can take multiple cycles  Some (FP Divide) can take hundreds of cycles What about energy (we will get to that shortly) What other instructions would you like in hardware?  Would some applications change your mind? How do you decide whether to add new instructions?

(36) Multimedia (3.6, 3.7, 3.8) Lower dynamic range and precision requirements  Do not need 32-bits! Inherent parallelism in the operations

(37) Vector Computation Operate on multiple data elements (vectors) at a time Flexible definition/use of registers Registers hold integers, floats (SP), doubles DP) 1x128 bit integer 4 x 32-bit single precision 2x64-bit double precision 8x16 short integers 128-bit Register

(38) Processing Vectors Memory vector registers When is this more efficient? When is this not efficient? Think of 3D graphics, linear algebra and media processing

(39) Case Study: Intel Streaming SIMD Extensions 8, 128-bit XMM registers  X86-64 adds 8 more registers XMM8-XMM15 8, 16, 32, 64 bit integers (SSE2) 32-bit (SP) and 64-bit (DP) floating point Signed/unsigned integer operations IEEE 754 floating point support Reading Assignment:  

(40) Instruction Categories Floating point instructions  Arithmetic, movement  Comparison, shuffling  Type conversion, bit level Integer Other  e.g., cache management ISA extensions! Advanced Vector Extensions (AVX)  Successor to SSE register memory

(41) Arithmetic View Graphics and media processing operates on vectors of 8-bit and 16-bit data  Use 64-bit adder, with partitioned carry chain oOperate on 8×8-bit, 4×16-bit, or 2×32-bit vectors  SIMD (single-instruction, multiple-data) Saturating operations  On overflow, result is largest representable value oc.f. 2s-complement modulo arithmetic  E.g., clipping in audio, saturation in video 4x16-bit2x32-bit

(42) SSE Example // A 16byte = 128bit vector struct struct Vector4 { float x, y, z, w; }; // Add two constant vectors and return the resulting vector Vector4 SSE_Add ( const Vector4 &Op_A, const Vector4 &Op_B ) { Vector4 Ret_Vector; __asm { MOV EAX Op_A // Load pointers into CPU regs MOV EBX, Op_B MOVUPS XMM0, [EAX] // Move unaligned vectors to SSE regs MOVUPS XMM1, [EBX] ADDPS XMM0, XMM1 // Add vector elements MOVUPS [Ret_Vector], XMM0 // Save the return vector } return Ret_Vector; } From More complex example (matrix multiply) in Section 3.8 – using AVX

(43) Characterizing Parallelism Characterization due to M. Flynn* SISDSIMD MISDMIMD Single instruction multiple data stream computing, e.g., SSE Data Streams Instruction Streams Today serial computing cores (von Neumann model) Today’s Multicore *M. Flynn, (September 1972). "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers, C–21 (9): 948–960tt

(44) Parallelism Categories From

(45) Data Parallel vs. Traditional Vector Vector Register A Vector Register B Vector Register C pipelined functional unit registers Vector Architecture Data Parallel Architecture Process each square in parallel – data parallel computation

(46) ISA View Separate core data path Can be viewed as a co-processor with a distinct set of instructions ALU Hi Multiply Divide Lo $0 $1 $31 Vector ALU XMM0 XMM1 XMM15 CPU/Core SIMD Registers

(47) Domain Impact on the ISA: Example Floats Double precision Massive data Power constrained Integers Lower precision Streaming data Security support Energy constrained Scientific ComputingEmbedded Systems

(48) Summary ISAs support operations required of application domains  Note the differences between embedded and supercomputers!  Signed, unsigned, FP, SIMD, etc. Bounded precision effects  Software must be careful how hardware used e.g., associativity  Need standards to promote portability Avoid “kitchen sink” designs  There is no free lunch  Impact on speed and energy  we will get to this later

(49) Study Guide Perform 2’s complement addition and subtraction (review) Add a few more instructions to the simple ALU  Add an XOR instruction  Add an instruction that returns the max of its inputs  Make sure all control signals are accounted for Convert real numbers to single precision floating point (review) and extract the value from an encoded single precision number (review) Execute the SPIM programs (class website) that use floating point numbers. Study the memory/register contents via single step execution

(50) Study Guide (cont.) Write a few simple SPIM programs for  Multiplication/division of signed and unsigned numbers oUse numbers that produce >32-bit results oMove to/from HI and LO registers ( find the instructions for doing so)  Addition/subtraction of floating point numbers Try to write a simple SPIM program that demonstrates that floating point operations are not associative (this takes some thought and review of the range of floating point numbers) Look up additional SIMD instruction sets and compare  AMD NEON, Altivec, AMD 3D Now

(51) Glossary Co-processor Data parallelism Data parallel computation vs. vector computation Instruction set extensions Overflow MIMD Precision SIMD Saturating arithmetic Signed arithmetic support Unsigned arithmetic support Vector processing