1 MAD MAC 525
  Presentation 5: Top Level Integration
  22nd February, 2006
  W2: Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4)
  Design Manager: Zack Menegakis
  Project Objective: Design a crucial part of a GPU called the Multiply Accumulate Unit (MAC), which will revolutionize graphics.
2 MAD MAC 525 Status:
  Project chosen
  Specifications defined
  Architecture Design
  Behavioral Verilog
  Testbenches
  Verilog: Gate Level Design
  Floor plan
  Schematic
  To be done:
  Layout (started)
  Extraction, LVS, post-layout simulation
3 Recap - MAD MAC 525
  Multiply Add (MAD) / Multiply Accumulate Unit (MAC)
  Executes the function AB+C on 16-bit floating-point inputs
  Multiply and add run in parallel to greatly speed up operation
  Rounding is performed only once, so accuracy is greater than with individual multiply and add functions
  One circuit to rule them all!
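The accuracy benefit of rounding once can be illustrated with a small software model. This is a sketch only: it assumes the 16-bit format behaves like IEEE 754 binary16 with round-to-nearest-even, which the slides do not state, and the names `to_half`, `exact_to_half`, `mac_fused`, and `mac_separate` are hypothetical helpers, not part of the design.

```python
import struct
from fractions import Fraction


def to_half(x):
    """Round a Python float to the nearest binary16 value (assumed format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def exact_to_half(q):
    """Round an exact rational to binary16 with a single rounding step.

    The value must fit in a double exactly, otherwise going through float
    would round twice, so that case is rejected instead of silently accepted.
    """
    f = float(q)
    if Fraction(f) != q:
        raise ValueError("value is not exactly representable as a double")
    return to_half(f)


def mac_fused(a, b, c):
    """A*B+C computed exactly, then rounded once (the MAC behaviour)."""
    return exact_to_half(Fraction(a) * Fraction(b) + Fraction(c))


def mac_separate(a, b, c):
    """Multiply, round, then add, round again (separate units)."""
    product = exact_to_half(Fraction(a) * Fraction(b))
    return exact_to_half(Fraction(product) + Fraction(c))


# Inputs chosen so the low bit of the product is lost by the first rounding.
A = 1 + 2**-10
B = 1 + 2**-9
C = -1.0

if __name__ == "__main__":
    print("separate:", mac_separate(A, B, C))
    print("fused:   ", mac_fused(A, B, C))
```

With these inputs the separate path drops the 2^-19 term when the product is rounded, while the fused path keeps it because the subtraction happens before the only rounding step.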
4 Block Diagram
  [Figure: Reg Array A, Reg Array B, Reg Array C, Multiplier, Exp Calc, Align, Adder/Subtractor, Control Logic & Sign Determine, Leading 0 Anticipator, Normalize, Round, Reg Y; Input and Output (16 bits)]
5 Floorplan
  [Figure: Multiplier, Align C, Reg A, Reg B, Exp Calc, Reg C, Pipeline Reg, Adder, Ld Zero, Pipeline Reg, Normalize, Round, Reg Y]
6 Design Decisions
  Adder – Variable length carry select adder
  Registers – Pulsed latches
  Pass logic in shifters
7 Adder Schematic – Carry Select
  Variable length carry select adder
  Very regular – good compromise between speed and ease of layout
  2.5 ns delay through 37 bits
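A behavioral sketch of a variable-length carry-select adder may help show why the structure is regular. Each block computes its sum for carry-in 0 and carry-in 1 in parallel, and the incoming carry only selects between the two. The block sizes in `BLOCK_SIZES` are hypothetical (chosen only so that they total 37 bits); the slides do not give the actual partitioning.

```python
def ripple_add(a_bits, b_bits, carry_in):
    """Ripple-carry add two equal-length, LSB-first bit lists."""
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sum_bits, carry


def carry_select_add(x, y, block_sizes, carry_in=0):
    """Add x and y using carry-select blocks of the given widths.

    Returns (sum modulo 2**width, carry_out), where width = sum(block_sizes).
    """
    width = sum(block_sizes)
    x_bits = [(x >> i) & 1 for i in range(width)]
    y_bits = [(y >> i) & 1 for i in range(width)]

    result_bits = []
    carry = carry_in
    pos = 0
    for size in block_sizes:
        a_blk = x_bits[pos:pos + size]
        b_blk = y_bits[pos:pos + size]
        # In hardware both candidates are computed at the same time.
        sum0, cout0 = ripple_add(a_blk, b_blk, 0)
        sum1, cout1 = ripple_add(a_blk, b_blk, 1)
        # The incoming carry acts as the select line of a mux.
        if carry:
            result_bits.extend(sum1)
            carry = cout1
        else:
            result_bits.extend(sum0)
            carry = cout0
        pos += size

    total = sum(bit << i for i, bit in enumerate(result_bits))
    return total, carry


# Hypothetical block widths, growing toward the MSB; they total 37 bits.
BLOCK_SIZES = [2, 2, 3, 4, 5, 6, 7, 8]

if __name__ == "__main__":
    total, carry_out = carry_select_add(0x123456789, 0x0FEDCBA98, BLOCK_SIZES)
    print(hex(total), carry_out)
```

Growing the block widths toward the most significant bits lets each block's ripple delay roughly match the time its select carry takes to arrive, which is the usual motivation for the variable-length form.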
8 Adder Schematic – 1 bit Carry Select
9 Pulsed Latch
  Advantage – Practically eliminates setup time
  120 ps Clock to ~Q delay (146 ps loaded)
  16 transistors
  Simplified version of those used in the Pentium 4
  Sizing does not seem to affect speed under load
  Clock pulse generator
10 More Pass Logic
  Compared different kinds of pass logic for shifters
  Transmission gates with buffers are the fastest
  Mux Type                        Propagation Delay (worst case)
  N-pass (Align)                  78.32 ps
  Transmission gate (Normalize)   50.5 ps
  NAND                            81.22 ps
11
  Block        Transistor Count   Area (µm²)   Prop. Delay   Power in mW (350MHz)
  Multiplier   3500               25000        4.43 ns       8.86
  Exponents    700                5000         942 ps        1.608
  Align        530                3800         480 ps        1.031
  Adder        3700               26500        3.24 ns       4.58
  Leading 0    350                2500         2.05 ns       0.232
  Normalize    900                6500         430 ps        2.291
  Round        300                2000         1.81 ns       0.198
  Registers    1800               9000         120 ps        -
  Total        11780              80300        -             -
12 Design Goals – On target
  At least 300MHz – 600 MFLOPS
  Will be achievable through optimization and pipelining
  Pipeline stages not fully determined – 6 stages expected
  Multiplier will be pipelined to cut its delay in half
  All other individual blocks can clock at ~500MHz
  Faster adder is being developed. It is not as easily pipelined as the multiplier, so the speed of this block will be the limiting factor for the entire circuit
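The frequency and throughput targets can be checked with simple arithmetic on the block delays reported on slide 11. The sketch below is an estimate only: it treats each block as if it occupied a pipeline stage by itself, ignores latch overhead and clock skew, and counts one MAC per cycle as two floating-point operations (one multiply, one add). The names `max_clock_mhz` and `mflops` are hypothetical.

```python
# Propagation delays in nanoseconds, taken from the slide 11 table.
BLOCK_DELAY_NS = {
    "Multiplier": 4.43,
    "Exponents": 0.942,
    "Align": 0.480,
    "Adder": 3.24,
    "Leading 0": 2.05,
    "Normalize": 0.430,
    "Round": 1.81,
}


def max_clock_mhz(delay_ns):
    """Highest clock rate a stage with this delay could support (no overhead)."""
    return 1000.0 / delay_ns


def mflops(clock_mhz, flops_per_cycle=2):
    """Throughput if one AB+C result (multiply + add) completes per cycle."""
    return clock_mhz * flops_per_cycle


# Model the stated plan: pipelining the multiplier halves its stage delay.
STAGE_DELAY_NS = dict(BLOCK_DELAY_NS)
STAGE_DELAY_NS["Multiplier"] = BLOCK_DELAY_NS["Multiplier"] / 2

# The slowest stage sets the clock for the whole pipeline.
SLOWEST_BLOCK = max(STAGE_DELAY_NS, key=STAGE_DELAY_NS.get)
CLOCK_LIMIT_MHZ = max_clock_mhz(STAGE_DELAY_NS[SLOWEST_BLOCK])

if __name__ == "__main__":
    for name, delay in STAGE_DELAY_NS.items():
        print(f"{name:<11} {delay:6.3f} ns  {max_clock_mhz(delay):7.1f} MHz")
    print("limiting block:", SLOWEST_BLOCK)
    print("target throughput at 300 MHz:", mflops(300), "MFLOPS")
```

Under these assumptions the adder is the slowest stage once the multiplier is split, at a little over 300MHz, which is consistent with the slide's statement that the adder limits the circuit.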
13 Top Level Schematic
14 Simulations: Normalize
15 Simulations: Align
16 Simulations: Multiplier
17 Problems
  Verilog simulation of the circuit generated don't cares after switching to the new, improved pass logic. Analog simulations work just fine.
  Pass logic can be evil if done wrong.
  Multiplier initially ran at only 50MHz due to transmission gate XORs. Buffers solved the problem.
18 Questions?