Presentation 1
MAD MAC st February, 2006
Architecture Proposal
W2: Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4), Shiven Seth (W2-5)
Project Objective: Design a crucial part of a GPU called the Multiply Accumulate Unit (MAC) which will revolutionize graphics.
MAD MAC 525
Status:
– Project chosen
– Specifications defined
– Architecture Design
– Behavioral Verilog
– Testbenches
To be done:
– Verilog: Gate Level Design
– Schematic
– Floor plan
– Layout
– Extraction, LVS, post-layout simulation
Overview - MAD MAC 525
Multiply Accumulate unit (MAC)
– Executes the function AB+C on 16-bit floating point inputs
– Multiplies and adds in parallel to greatly speed up the operation
– Rounding is performed only once, so the result is more accurate than separate multiply and add functions
MAD MAC accelerates FP16 blending to enable true HDR graphics
– Bright things can be really bright
– Dark things can be really dark
– And the details can be seen in both
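The accuracy claim can be checked with a small software model. The following is a minimal sketch, assuming Python's `struct` half-precision format (`"e"`, round-to-nearest-even) as a stand-in for FP16 rounding; the function names and the operand values are illustrative and do not come from the slides.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (round-to-nearest-even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mac_fused(a: float, b: float, c: float) -> float:
    # One rounding at the very end. For the operands below, a*b + c is
    # computed exactly in double precision before that single rounding.
    return to_fp16(a * b + c)

def mul_then_add(a: float, b: float, c: float) -> float:
    # Two roundings: the product is rounded to FP16 before the add.
    return to_fp16(to_fp16(a * b) + c)

# Illustrative operands, all exactly representable in FP16.
a = b = 1.0 + 2.0**-10
c = -(1.0 + 2.0**-9)

fused = mac_fused(a, b, c)        # keeps the 2**-20 term of a*b
separate = mul_then_add(a, b, c)  # loses it when the product is rounded
print(fused, separate)            # 9.5367431640625e-07 0.0
```

With separate operations the small term of the product is rounded away before the add, so the result collapses to zero; with a single final rounding it survives.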
Quick Overview of FP
A = … x 2^2, B = … x 2^5, C = … x 2^8
Step 1: A*B
– Multiply the significands: … * … = …
– Exponent of result is expA + expB = 7
– A*B = … x 2^7
Step 2: Align C
– To add two FP numbers, their exponents must be the same
– Shift by expA + expB – expC = 7 – 8 = -1
– Shift the significand of C left by 1: … –> …
Quick Overview of FP (contd.)
Step 3: Depending on the signs of A*B and C, add or subtract the two
– Suppose A, B, and C are all positive
– A*B + C = … + … = …
Step 4: Normalize the Result
– Currently the significand is … and the exponent is expA + expB = 7
– Normalized to … x 2^9
Step 5: Round the Result
– The significand needs to fit in 10 bits
– Based on bits 11 through 13, the significand is rounded to fit in 10 bits
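The five steps can be traced in software. This is a minimal sketch for positive, normal operands only. The exponents (2, 5, 8) follow the slides, but the significand values in the usage example are hypothetical, and round-to-nearest-even is an assumption since the slides do not name a rounding mode. Python's unbounded integers stand in for the fixed-width datapath, so no bits are dropped during alignment.

```python
def fma_steps(sig_a, exp_a, sig_b, exp_b, sig_c, exp_c):
    """Compute A*B + C for positive normal operands.

    Each operand is sig * 2**(exp - 10), where sig is an 11-bit integer
    (hidden 1 included, 1024..2047) and exp is the unbiased exponent.
    Returns (sig, exp) of the rounded result in the same convention.
    """
    # Step 1: multiply the significands, add the exponents.
    prod = sig_a * sig_b                  # up to 22 bits
    scale_p = exp_a + exp_b - 20          # product = prod * 2**scale_p

    # Step 2: align C. Both addends are brought to the smaller scale,
    # which is the same as shifting C by expA + expB - expC (plus the
    # fixed 10-bit offset between the product's and C's fraction widths).
    scale_c = exp_c - 10                  # C = sig_c * 2**scale_c
    scale = min(scale_p, scale_c)
    prod_aligned = prod << (scale_p - scale)
    c_aligned = sig_c << (scale_c - scale)

    # Step 3: all operands are positive here, so add.
    total = prod_aligned + c_aligned

    # Step 4: normalize. The leading 1 fixes the result exponent.
    n_bits = total.bit_length()
    exp_r = scale + n_bits - 1

    # Step 5: round to 11 significant bits (assumed round-to-nearest-even).
    # prod >= 2**20, so at least 10 bits are always dropped.
    drop = n_bits - 11
    sig = total >> drop
    remainder = total & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if remainder > half or (remainder == half and sig & 1):
        sig += 1
    if sig == 1 << 11:                    # rounding overflowed to 10.00...0
        sig >>= 1
        exp_r += 1
    return sig, exp_r

# Hypothetical significands: A = 1.25 x 2^2, B = 1.5 x 2^5, C = 1.1259765625 x 2^8
sig, exp = fma_steps(1280, 2, 1536, 5, 1153, 8)
print(sig, exp, sig * 2.0 ** (exp - 10))  # 1056 9 528.0
```

With these assumed inputs the exact sum is 528.25, which lies halfway between two FP16 values and rounds to 528.0, with a result exponent of 9.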
Block Diagram
[Figure: blocks shown are RegArray A, RegArray B, RegArray C (Input), Multiplier, Exp Calc, Align, Adder/Subtractor, Leading 0 Anticipator, Control Logic & Sign Dtrmin, Normalize, Round, Reg Y (Output); buses are labeled 16]
Design Decisions (Week 2):
Implementing a 16-bit (fp16) format
– 1-bit sign, 10-bit significand and 5-bit exponent
– Compatible with the OpenEXR format used in the latest games
Enable Ultra-Threading
– Implements high-speed register arrays and fast thread-switching logic to switch immediately to another available thread if the executing thread runs out of data
– Implementation: high-speed register arrays for each input
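To make the bit layout concrete, here is a minimal sketch that decodes a 16-bit pattern with the 1/5/10 field split above. The exponent bias of 15 and the handling of subnormals, infinities and NaNs follow the standard half-precision format used by OpenEXR; whether MAD MAC itself supports those special cases is not stated in the slides, and the function name is illustrative.

```python
def decode_fp16(bits: int) -> float:
    """Decode a 16-bit half-precision pattern into a Python float."""
    sign = (bits >> 15) & 0x1        # 1 bit
    exponent = (bits >> 10) & 0x1F   # 5 bits, bias 15
    fraction = bits & 0x3FF          # 10 bits

    if exponent == 0x1F:             # all ones: infinity or NaN
        value = float("inf") if fraction == 0 else float("nan")
    elif exponent == 0:              # zero or subnormal: 0.f * 2**-14
        value = fraction * 2.0**-24
    else:                            # normal: 1.f * 2**(exponent - 15)
        value = (1024 + fraction) * 2.0 ** (exponent - 15 - 10)
    return -value if sign else value

print(decode_fp16(0x3C00), decode_fp16(0xC000), decode_fp16(0x7BFF))
# 1.0 -2.0 65504.0
```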
Design Decisions (contd.):
Multiplier Implementation
– 11 x 11 Carry-Save Multiplier
– Reasons:
  Fast because it avoids having a ripple carry in every stage
  Enables a compact layout
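The carry-save idea can be modeled at the bit level. This is a sketch of the general technique, not the team's gate-level design: each stage folds one partial product into a (sum, carry) pair without propagating carries, and only one carry-propagate addition happens at the end. The function names are illustrative.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor: reduce three addends to a sum word and a carry word.

    Every bit position is computed independently, so no carry ripples
    across the word inside this stage.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word

def carry_save_multiply(a: int, b: int, width: int = 11) -> int:
    """Multiply two unsigned width-bit integers using carry-save stages."""
    sum_word, carry_word = 0, 0
    for i in range(width):
        # Partial product i is 'a' shifted left by i if bit i of b is set.
        partial = (a << i) if (b >> i) & 1 else 0
        sum_word, carry_word = carry_save_add(sum_word, carry_word, partial)
    # The only carry-propagate addition in the whole multiply.
    return sum_word + carry_word

print(carry_save_multiply(0b10100000000, 0b11000000000))  # 1966080
```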
Design Decisions (contd.):
2’s Complement Adder/Subtractor
– Variable Length Carry-Select Adder
  Reason: Reduces delay through Muxes
– Use the signs of the inputs to determine addition or subtraction
– Output: 35 bits from Align + 1 Carry Out = 36 bits
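A carry-select adder can be sketched as follows. Each block computes its sum for both possible carry-ins, and the real carry then selects one of them, as a mux would. The block widths in `BLOCKS` are hypothetical (the slides only say the lengths vary); they were chosen to total 35 bits. The inner `ripple` helper uses Python's `+` for brevity rather than modeling individual full adders.

```python
BLOCKS = (2, 3, 4, 5, 6, 7, 8)   # hypothetical block widths, 35 bits total

def ripple(a: int, b: int, cin: int, width: int):
    """Add two width-bit blocks; return (sum bits, carry out)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width

def carry_select_add(a: int, b: int, cin: int = 0, blocks=BLOCKS) -> int:
    """Add two 35-bit values; the result holds 35 sum bits plus carry out."""
    result, pos, carry = 0, 0, cin
    for width in blocks:
        mask = (1 << width) - 1
        a_blk, b_blk = (a >> pos) & mask, (b >> pos) & mask
        # Both candidate results are computed before the carry is known.
        sum0, cout0 = ripple(a_blk, b_blk, 0, width)
        sum1, cout1 = ripple(a_blk, b_blk, 1, width)
        # The incoming carry acts as the mux select.
        blk_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= blk_sum << pos
        pos += width
    return result | (carry << pos)   # bit 35 is the carry out

def subtract(a: int, b: int, width: int = 35) -> int:
    """a - b in 2's complement: invert b and set the carry-in."""
    return carry_select_add(a, ~b & ((1 << width) - 1), cin=1)

print(carry_select_add(100, 58), subtract(100, 58) & ((1 << 35) - 1))  # 158 42
```

For `subtract`, a carry out of 1 indicates a non-negative difference.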
Design Decisions (contd.):
Leading Zero Counter
– Carry-Save Adder to count the leading zeroes of C
  Reason: To pre-compute the shift needed to normalize the result of A*B + C
– This will speed up our design because the Leading Zero Counter will not be in the critical path (which is through our multiplier)
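What the normalization shift amount means can be shown with a plain leading-zero count. Note that this sketch counts zeros on an already-computed word; it does not model the anticipation that keeps the counter off the critical path. The 36-bit width matches the adder output; the function names are illustrative.

```python
def leading_zeros(value: int, width: int = 36) -> int:
    """Count zero bits above the most significant 1 in a width-bit word."""
    if value == 0:
        return width
    count = 0
    probe = 1 << (width - 1)          # start at the top bit
    while not (value & probe):
        count += 1
        probe >>= 1
    return count

def normalize(value: int, width: int = 36):
    """Shift left until the leading 1 sits in the top bit.

    Returns (shifted word, shift amount); the exponent would be
    decreased by the same amount.
    """
    shift = leading_zeros(value, width)
    return (value << shift) & ((1 << width) - 1), shift

print(leading_zeros(0b101), normalize(0b101))  # 33 (42949672960, 33)
```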
Design Decisions (contd.):
Align Exponent
– Always align the exponent of C to expA + expB
– Shift the significand of C by (expA + expB – expC)
  If negative, shift left because C is bigger than A*B
  If positive, shift right because C is smaller than A*B
– Implementation: n-Pass Shifter
Normalize
– Format the result of A*B + C to IEEE Format (i.e. change the significand from … to …)
– Align the exponent of the result as necessary
– n-Pass Shifter to shift the result of the adder by the amount given by the Leading Zero Counter
Round
– The result needs to fit into 16 bits
– To preserve precision, we round the result based on the last 3 bits
– Implementation: Incrementer and Shifter
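The rounding stage can be sketched as an incrementer followed by a one-bit shifter. This assumes the three extra bits are the usual guard, round and sticky bits and that the mode is round-to-nearest-even; the slides state neither explicitly, and the function name is illustrative.

```python
def round_grs(sig14: int, exp: int):
    """Round a 14-bit word: 11-bit significand + guard, round, sticky bits.

    Returns (11-bit significand, exponent) after rounding.
    """
    kept = sig14 >> 3                 # the 11 bits that survive
    guard = (sig14 >> 2) & 1
    round_bit = (sig14 >> 1) & 1
    sticky = sig14 & 1

    # Round up when above the halfway point, or exactly halfway with an
    # odd kept value (ties go to the even neighbor).
    if guard and (round_bit or sticky or (kept & 1)):
        kept += 1                     # incrementer

    if kept == 1 << 11:               # 1.11...1 rounded up to 10.00...0
        kept >>= 1                    # shifter
        exp += 1
    return kept, exp

print(round_grs(0b10000000001_100, 0))  # (1026, 0): tie, odd rounds up
print(round_grs(0b11111111111_110, 3))  # (1024, 4): overflow bumps the exponent
```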
Behavioral Verilog
Behavioral Verilog (contd.)
Behavioral Verilog (Output)
Updated Estimated Transistor Count
Registers (input, output, pipelining)    2500
Threading Logic                          3000
Carry-Save Multiplier                    5000
Carry-Select Adder                       2000
Alignment Shifter                        1500
Leading 0 Anticipator                     700
Normalize                                2000
Rounding                                 1500
Special Cases and Control Logic          2000
Total                                   20200
Problems and Questions?
Difficulty finding a high-level simulator to exhaustively test our behavioral Verilog, because both Matlab and C use the IEEE 32-bit format.
Currently we are thoroughly testing our behavioral Verilog and coming up with different test cases by hand.
Suggested Solutions:
- Make a scalable 32-bit version of our behavioral Verilog and test it against C
- Find code written for software simulation by the VAX and PDP microprocessors
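One possible way to produce reference results is a small script that emits test vectors with expected outputs. The sketch below is an assumption-laden illustration, not a tool the team used: it relies on Python's `struct` half-precision format for FP16 conversion, uses `fractions.Fraction` to keep only vectors whose double-precision sum is exact (so the final FP16 rounding is the only rounding), and skips infinities, NaNs and overflowing results. The function names, seed and output format are hypothetical.

```python
import random
import struct
from fractions import Fraction

def bits_to_float(bits: int) -> float:
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def float_to_bits(x: float) -> int:
    return struct.unpack("<H", struct.pack("<e", x))[0]

def is_finite(bits: int) -> bool:
    return (bits >> 10) & 0x1F != 0x1F   # exponent field not all ones

def expected_mac(a_bits: int, b_bits: int, c_bits: int):
    """Return the FP16 pattern of A*B + C, or None if this model skips it."""
    if not all(is_finite(v) for v in (a_bits, b_bits, c_bits)):
        return None
    a, b, c = (bits_to_float(v) for v in (a_bits, b_bits, c_bits))
    approx = a * b + c
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    if Fraction(approx) != exact:
        return None                      # double sum was inexact: skip
    try:
        return float_to_bits(approx)     # the single FP16 rounding
    except OverflowError:
        return None                      # result too large for FP16

def make_vectors(count: int, seed: int = 525):
    """Return 'count' lines of the form 'A B C expected' in hex."""
    rng = random.Random(seed)
    vectors = []
    while len(vectors) < count:
        a, b, c = (rng.randrange(0x10000) for _ in range(3))
        y = expected_mac(a, b, c)
        if y is not None:
            vectors.append(f"{a:04x} {b:04x} {c:04x} {y:04x}")
    return vectors

print(f"{expected_mac(0x3C00, 0x4000, 0x4200):04x}")  # 4500, i.e. 1*2 + 3 = 5
```

Lines in this whitespace-separated hex form could be loaded by a Verilog testbench with `$readmemh`, four 16-bit words per vector.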
Questions?