Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4). Presentation 12: MAD MAC 525. 26th April, 2006. Short Final Presentation.


1 1 Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4). Presentation 12: MAD MAC 525. 26th April, 2006. Short Final Presentation.
W2 Project Objective: Design a crucial part of a GPU, the Multiply Accumulate Unit (MAC), which will revolutionize graphics.
Design Manager: Zack Menegakis

2 2 Agenda
- Marketing (Jigar)
- Project Description (Farhan)
- Algorithmic Description (Farhan)
- Design Process (Sonali)
- Floorplan Evolution (Sonali)
- Layout (Avni)
- Design Specifications (Avni)
- Conclusion (Jigar)

3 3 MARKETING
- Application of product: HDR rendering in gaming graphics
- Why HDR? It is used in games like Far Cry
- Optimization for speed (chosen because of the market)
- Competition: possible barriers to entry if we enter the market

4 4 MAD MAC and HDR What is HDR? Show animation explaining concept

5 5 MAD MAC and HDR
MAD MAC accelerates FP16 blending to enable true HDR graphics.
What is HDR?
- HDR = High Dynamic Range
- Dynamic range is defined as the ratio of the largest value of a signal to the lowest measurable value
- Dynamic range of luminance in real-world scenes can be 100,000:1
- With HDR rendering, pixel intensities are allowed to extend beyond the [0..1] range of traditional graphics
- Nature isn’t clamped to [0..1] and neither should CG be
In lay terms:
- Bright things can be really bright
- Dark things can be really dark
- And the details can be seen in both
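The dynamic-range definition on this slide can be illustrated with a minimal Python sketch. The scene luminances below are hypothetical values chosen for illustration (they are not from the slides), and `to_fp16`, `dynamic_range` and `clamp01` are helper names introduced here. The sketch compares a pipeline clamped to [0..1] with one that stores values as 16-bit floats.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def dynamic_range(values):
    """Ratio of the largest value to the smallest nonzero value."""
    nonzero = [v for v in values if v > 0]
    return max(nonzero) / min(nonzero)

def clamp01(x: float) -> float:
    """Clamp to the [0..1] range of traditional graphics."""
    return min(1.0, max(0.0, x))

# Hypothetical scene luminances in arbitrary units (assumption, not slide data).
scene = [0.001, 0.25, 1.0, 8.0, 100.0]

ldr = [clamp01(v) for v in scene]   # highlights above 1.0 are lost
hdr = [to_fp16(v) for v in scene]   # FP16 keeps values above 1.0

print("scene range:", dynamic_range(scene))
print("clamped range:", dynamic_range(ldr))
print("FP16 range:", dynamic_range(hdr))
```

In the clamped list every value above 1.0 collapses to 1.0, so the range shrinks by two orders of magnitude, while the FP16 list keeps the range to within half-precision rounding error.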

6 6

7 7 PROJECT DESCRIPTION
Multiply Accumulate unit (MAC)
- Executes the function AB+C on 16-bit floating-point inputs
- Inputs will be in OpenEXR format
- Multiply and add in parallel to greatly speed up operation
- Rounding is performed only once, so accuracy is greater than with individual multiply and add functions
Also known as:
- Fused Multiply Add (FMA)
- Multiply Add (MAD/MADD) in graphics shader programs
Many applications benefit from a fast FMA:
- Graphics: HDR rendering, blending and shader ops
- DSPs: computing vector dot-products in digital filters
- Fast division, square root: eliminates extra hardware
Available in many newer CPUs and DSPs because it’s so cool
One ring (circuit) to rule them all!
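The single-rounding claim can be checked numerically. The following Python sketch is an illustration only, not the team's design: `round_fp16`, `mac_unfused` and `mac_fused` are names introduced here, and the operands are chosen so that the two approaches disagree. The fused version computes AB+C exactly with `Fraction` and rounds once. That final conversion goes through a double, which is exact for the operands shown but could double-round in rare cases with widely separated exponents.

```python
import struct
from fractions import Fraction

def round_fp16(x) -> float:
    """Round to the nearest half-precision value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]

def mac_unfused(a: float, b: float, c: float) -> float:
    """Separate multiply and add: two roundings."""
    product = round_fp16(a * b)      # first rounding
    return round_fp16(product + c)   # second rounding

def mac_fused(a: float, b: float, c: float) -> float:
    """Fused multiply-add: exact AB+C, then one rounding."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return round_fp16(exact)

# Operands chosen for illustration; all three are exact FP16 values.
a, b, c = 1 + 2**-10, 1 + 2**-9, -1.0

print("unfused:", mac_unfused(a, b, c))
print("fused:  ", mac_fused(a, b, c))
```

Here the exact product is 1 + 3·2^-10 + 2^-19. The unfused path discards the 2^-19 term when it rounds the product, so after subtracting 1 it returns 3·2^-10. The fused path keeps that term until the single final rounding and returns 3·2^-10 + 2^-19, which is the exact answer.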

8 8 ALGORITHMIC DESCRIPTION
- Step through the entire process
- Multiply and align occur concurrently; C is always aligned to A*B
- Outputs go to the adder, normalize, round, overflow checker and output register
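The stages named on this slide can be walked through in software. The sketch below is a behavioral model written for illustration, not the team's Verilog. It assumes IEEE 754 half-precision field widths, treats zero and denormal inputs as zero, does not handle infinity or NaN inputs, and uses Python's unbounded integers for alignment where hardware would use a fixed-width shifter. All function names are hypothetical.

```python
import struct

BIAS, FRAC_BITS = 15, 10  # half-precision exponent bias and fraction width

def bits(x: float) -> int:
    """FP16 bit pattern of a float."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def value(h: int) -> float:
    """Float value of an FP16 bit pattern."""
    return struct.unpack("<e", struct.pack("<H", h))[0]

def unpack(h: int):
    """Return (sign, exponent of the significand LSB, 11-bit significand)."""
    sign, exp, frac = h >> 15, (h >> 10) & 0x1F, h & 0x3FF
    if exp == 0:                      # zero or denormal: treated as zero here
        return sign, 0, 0
    return sign, exp - BIAS - FRAC_BITS, (1 << FRAC_BITS) | frac

def fma16(a_bits: int, b_bits: int, c_bits: int) -> int:
    sa, ea, ma = unpack(a_bits)
    sb, eb, mb = unpack(b_bits)
    sc, ec, mc = unpack(c_bits)

    # Multiplier and exponent calculation: 11 x 11 -> 22-bit product.
    sp, ep, mp = sa ^ sb, ea + eb, ma * mb

    # Align: bring the product and C to a common exponent.
    e = min(ep, ec)
    p = (mp << (ep - e)) * (-1 if sp else 1)
    c = (mc << (ec - e)) * (-1 if sc else 1)

    # Adder/subtractor and sign determination.
    total = p + c
    if total == 0:
        return 0
    sign, mag = (1 if total < 0 else 0), abs(total)

    # Normalize to an 11-bit significand, then round once (ties to even).
    shift = mag.bit_length() - (FRAC_BITS + 1)
    if shift > 0:
        kept, rem, half = mag >> shift, mag & ((1 << shift) - 1), 1 << (shift - 1)
        if rem > half or (rem == half and kept & 1):
            kept += 1
            if kept == 1 << (FRAC_BITS + 1):   # rounding carried out
                kept >>= 1
                shift += 1
    else:
        kept = mag << -shift

    # Overflow check; underflow is flushed to zero (no denormals).
    exp_field = e + shift + BIAS + FRAC_BITS
    if exp_field >= 31:
        return (sign << 15) | 0x7C00           # infinity
    if exp_field <= 0:
        return sign << 15
    return (sign << 15) | (exp_field << FRAC_BITS) | (kept & 0x3FF)

print(value(fma16(bits(2.0), bits(3.0), bits(1.0))))
```

The model rounds exactly once, after normalization, which matches the slide's ordering of adder, normalize, round and overflow checker.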

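9 9 Block Diagram
[Figure: block diagram from Input to Output. Blocks shown: RegArray A, RegArray B, RegArray C, Multiplier, Exp Calc, Align, Adder/Subtractor, Control Logic & Sign Determination, Leading 0 Anticipator, Normalize, Round, Ovf Checker, Reg Y. Bus-width labels in the figure: 10, 5, 5, 5, 14, 35, 22, 5, 4, 36, 14, 10, 1, 5, 5, 16, 15, 1, 1, 1.]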

10 10 IMPLEMENTATION Implementation of each module: how and why we chose a particular method, keeping in mind the goal of speed (multiplier, adder)

11 11 Design Decisions (contd.): Multiplier
Implementation: 11 x 11 Carry-Save Multiplier
Reasons:
- Fast because it avoids having ripple carry in every stage
- Enables compact layout
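The reason given for the carry-save choice can be seen in a small word-level model. This Python sketch is an illustration under assumptions, not the team's circuit. Each partial product is absorbed by a 3:2 compressor that produces separate sum and carry words, so no carry ripples within a stage, and only the final addition propagates carries. The function names and the default `width=11` are introduced here to match the 11 x 11 size on the slide.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor on whole words: x + y + z == s + c, with no carry ripple."""
    s = x ^ y ^ z                                   # bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1          # bitwise majority, shifted
    return s, c

def carry_save_multiply(a: int, b: int, width: int = 11) -> int:
    """Multiply two unsigned width-bit integers with an array of carry-save adders."""
    # One partial product per bit of b.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(width)]
    s, c = 0, 0
    for row in rows:
        # Invariant: s + c equals the sum of the rows absorbed so far.
        s, c = carry_save_add(s, c, row)
    return s + c                                    # single carry-propagate add

print(carry_save_multiply(1025, 1026))
```

The single carry-propagate addition at the end is where a fast adder pays off, since it is the only place carries must travel across the full word.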

12 12 Design Process
Verilog -> Schematic -> Layout
- Behavioral -> Structural Verilog
- Transistors/gates -> Full Schematic
- Gate/Component Layout -> Top Level
Transistor count fluctuated from 20,200 to 12,800
Major design decisions
- Decided against implementing denormal arithmetic because it would increase the complexity of the project beyond the scope of the class
- Rounding performed only once, at the end
- Picked nPass over Tgate in the normalize shifter
- Adder: variable-length carry select -> Han-Carlson binary tree adder
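For readers unfamiliar with the Han-Carlson adder named in the last decision, here is a bit-level Python sketch of the general technique. It is a textbook-style model, not the team's schematic, and the `width` parameter is an assumption because the slides do not state the adder width. Han-Carlson is a parallel-prefix adder that runs a Kogge-Stone network on the odd bit positions only, with one extra stage before and one after to cover the even positions.

```python
def han_carlson_add(a: int, b: int, width: int = 16) -> int:
    """Add two unsigned width-bit integers using a Han-Carlson prefix network."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
    G, P = g[:], p[:]

    def combine(hi, lo):
        """Merge (G, P) of an upper span with the adjacent lower span."""
        return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]

    # First stage: each odd bit absorbs its even neighbor.
    for i in range(1, width, 2):
        G[i], P[i] = combine((g[i], p[i]), (g[i - 1], p[i - 1]))

    # Kogge-Stone stages on odd bits only, with distances 2, 4, 8, ...
    d = 2
    while d < width:
        prev_g, prev_p = G[:], P[:]          # all cells in a stage update together
        for i in range(d + 1, width, 2):
            G[i], P[i] = combine((prev_g[i], prev_p[i]),
                                 (prev_g[i - d], prev_p[i - d]))
        d *= 2

    # Last stage: each even bit takes the carry from the odd bit below it.
    for i in range(2, width, 2):
        G[i], P[i] = combine((g[i], p[i]), (G[i - 1], P[i - 1]))

    # G[i] is now the carry out of bit i; the carry into bit 0 is 0.
    carries = [0] + G[:-1]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total | (G[width - 1] << width)   # include the carry-out bit

print(han_carlson_add(12345, 54321))
```

Compared with a full Kogge-Stone network, this arrangement uses roughly half the prefix cells and wires at the cost of one additional logic level.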

13 13 VERIFICATION OF DESIGN Verilog Simulations (show outputs): Overview; How/Why it works; Behavioral/Structural. Explain why we couldn’t get a high-level simulator and how we tested our Verilog design.

14 14 SCHEMATICS Show schematics of major blocks: adder, multiplier, and top-level HOW WE VERIFIED: analog simulation

15 15 Top Level Schematic

16 16 Multiplier Schematic

17 17 Adder Schematic

18 18 FLOORPLAN EVOLUTION Initial floorplan. How it evolved (with animation): why and how we changed it

19 19 Main Floorplan
[Figure: main floorplan. Blocks shown: Multiplier, Align C, Reg A, Reg B, Exp Calc, Reg C, Pipeline Reg, Adder, Ld Zero, Pipeline Reg, Normalize, Round, Reg Y.]

20 20 Floorplan

21 21 Full Chip Layout
[Figure: full chip layout with labeled regions: Exponent, Align, Zero, Adder, Multiplier, Normalize, Round, Ovf.]

22 22 Pipelining
- Initially planned 5-6 pipeline stages
- Reduced to 4 pipeline stages, made possible by implementing fast carry-lookahead adders in critical-path modules (adder and multiplier)

23 23 Pipelining Stages
[Figure: pipelining stages. Blocks shown: Multiplier, Align C, Reg A, Reg B, Exp Calc, Reg C, Adder, Ld Zero, Normalize, Round, Reg Y, Overflow checker, separated by four Pipeline Reg boundaries.]

24 24 LAYOUT Final Layout Layout of large blocks such as multiplier, adder and normalize

25 25 Layout Decisions
- 3 standard cell heights
- Uniform-width vdd and ground rails
- Wider vdd and ground rails in power-hungry modules
- Max of 8 flip-flops per clock pulse generator
- Metal directionality

26 26 Multiplier Layout with pipelining

27 27 Adder Layout

28 28 Normalize Layout

29 29 FINAL LAYOUT

30 30 Design Specifications
- Worst-case delay = 2.25 ns
- Long buses are all buffered (not tested yet)
- Estimated clocking speed = 400 MHz
- Height by width = 193.86 um x 301.545 um
- Area = 58,458 um^2
- Aspect ratio = 1:1.55
- Total transistor density = 0.22 transistors per um^2
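The figures on this slide are related by simple arithmetic, which the short Python sketch below reproduces. The variable names are introduced here, and the total transistor count of 12,730 is taken from the area table on slide 38 rather than from this slide.

```python
# Values from the slide.
worst_case_delay_ns = 2.25
height_um, width_um = 193.86, 301.545

# Total transistor count, taken from the area table on slide 38.
transistor_count = 12_730

# Upper bound on clock frequency if the worst-case delay is the clock period.
max_clock_mhz = 1e3 / worst_case_delay_ns

area_um2 = height_um * width_um
aspect_ratio = width_um / height_um
density = transistor_count / area_um2   # transistors per um^2

print(f"max clock    ~ {max_clock_mhz:.1f} MHz")
print(f"area         ~ {area_um2:.0f} um^2")
print(f"aspect ratio ~ 1:{aspect_ratio:.3f}")
print(f"density      ~ {density:.2f} transistors/um^2")
```

The 2.25 ns worst-case delay corresponds to a bound of about 444 MHz, so the 400 MHz estimate on the slide sits below that bound.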

31 31 Layout densities
- Active: 14.05%
- Poly: 9.25%
- Metal 1: 33.89%
- Metal 2: 18.00%
- Metal 3: 14.99%
- Metal 4: 6.29%

32 32 Layer Masks - Poly

33 33 Layer Masks – Metal 1

34 34 Layer Masks – Metal 2

35 35 Layer Masks – Metal 3

36 36 Layer Masks – Metal 4

37 37
Block            Schematic Power (mW, 350 MHz)   Layout Power (mW)   Schematic Delay   Layout Delay
Multiplier       2.97                            N/A                 3.38 ns           N/A
  w/ pipeline    ??                              ??                  1.9 ns            2.25 ns
Exponents        1.608                           2.21                1.01 ns           1.2 ns
Align            0.094                           0.113               480 ps            637 ps
Adder            8.48                            9.73                1.34 ns           1.7 ns
Leading 0        0.232                           0.857               506 ps            551 ps
Normalize        1.458                           1.546               407 ps            437 ps
Round            0.631                           1.21                864 ps            986 ps
OvfCheck         0.13                            0.19                453 ps            475 ps
Registers        ??                                                  179 ps            193 ps
Total            ??                                                  -                 -

38 38
Block                      Area (um^2)   Transistor Count   Transistor Density
Multiplier (w/ pipeline)   20,388        4,496              0.22
Exponents                  5,163         738                0.14
Align                      3,995         500                0.13
Adder                      13,202        3,174              0.24
Leading 0                  1,253         364                0.29
Normalize                  3,190         942                0.3
Round                      1,802         494                0.28
OvfCheck                   200           70                 0.35
Registers, etc.            N/A           1,948              N/A
Total                      58,458        12,730             0.22

39 39 Conclusion More marketing Summarize chip functionality Extending applications of chip

40 40 Comments?

