2 MAD MAC 525: Final Presentation, 1st May, 2006
Team: Farhan Mohamed Ali, Jigar Vora, Sonali Kapoor, Avni Jhunjhunwala
Design Manager: Zack Menegakis
Project: Design a crucial part of a GPU, the Multiply Accumulate Unit (MAC), which is revolutionizing graphics
3 Agenda
- Marketing: Jigar
- Project and Algorithm Description: Farhan
- Implementation Part I: Farhan
- Implementation Part II: Sonali
- Floorplan: Sonali
- Layout: Avni
- Verification: Avni
- Design Specifications: Avni
- Conclusion: Jigar
4 Marketing Jigar
5 Purpose
MAD MAC 525 accelerates FP16 blending to enable true HDR graphics
Huh??
Marketing | Description | Implementing | Floorplan | Layout | Specifications | Verify
7 Beauty of High Dynamic Range
- With HDR rendering, pixel intensity can extend beyond the range of traditional graphics
- Nature doesn’t have a limited pixel intensity and neither should Computer Graphics
In other words:
- Bright things can be really bright
- Dark things can be really dark
- And the details can be seen in both
8 Applications of HDR
9 Target Market
Target Market Segment:
- Graphic chip manufacturers
- High speed DSP manufacturers
- CPU co-processors
Potential Customers
10 Design Comparison
- Top 180nm graphics chip is the NVIDIA NV16
- Highest speed is only 250MHz
- 9 bit integer precision
- As games become more advanced, they need fast graphics chips
Conclusion: Market Needs a FAST MAD MAC
11 Description and Implementation I Farhan
12 Project Description
Multiply Accumulate unit (MAC)
- Executes function AB+C on 16 bit floating point inputs
- Format: 1 bit sign, 5 bit exponent and 10 bit significand
- Multiply and add in parallel to greatly speed up operation
- Rounding performed only once, so greater accuracy than individual multiply and add functions
Also known as:
- Fused Multiply Add (FMA)
- Multiply Add (MAD/MADD) in graphics shader programs
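The 1/5/10 bit layout and the "round only once" point can be illustrated with a short Python sketch. It is not the team's code: `fp16_fields` and `round_fp16` are hypothetical helpers built on the standard `struct` module's half-precision `e` format, and the inputs are chosen by hand so that rounding twice and rounding once give different answers.

```python
import struct
from fractions import Fraction

def fp16_fields(x: float):
    """Return the (sign, exponent, significand) bit fields of x stored as FP16."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

def round_fp16(value) -> float:
    """Round a value to the nearest FP16 number (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", float(value)))[0]

# Hypothetical inputs, all exactly representable in FP16.
a = b = 1 + 2.0 ** -10      # smallest FP16 value above 1.0
c = -(1 + 2.0 ** -9)

# Separate multiply and add: the product is rounded before the add.
separate = round_fp16(round_fp16(a * b) + c)

# Fused: the exact value of a*b + c is rounded a single time.
fused = round_fp16(Fraction(a) * Fraction(b) + Fraction(c))

print(fp16_fields(1.0), separate, fused)
```

With these inputs the separately rounded result loses the low-order product bits entirely, while the fused result keeps them.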
13 Algorithm
FP Multiply (A*B):
- Multiply significands
- Add exponents
- Normalize
- Round
FP Add (A+B):
- Align smaller number to larger number
- Add significands
- Normalize
- Round
14 Algorithm
FP Multiply-Add (AB+C):
- Align sig C based on exp A+B-C
- Multiply significands A and B
- Add sig A*B result to aligned sig C
- Normalize
- Round
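The multiply-add steps can be sketched in software on integer significands. This is a simplified model under stated assumptions, not the hardware datapath: `decode` and `fma_fp16` are hypothetical names, only normal nonzero inputs are handled, the result is not range-checked for FP16 overflow or underflow, and alignment uses exact left shifts on Python integers where hardware would right-shift C and track sticky bits.

```python
import math
import struct

def decode(x: float):
    """Split a normal FP16 value into (sign, exponent, 11-bit significand),
    with |x| == significand * 2**exponent."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign, exp, frac = bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF
    if not 0 < exp < 31:
        raise ValueError("sketch handles normal, nonzero inputs only")
    return sign, exp - 15 - 10, frac | 0x400  # restore the hidden leading 1

def fma_fp16(a: float, b: float, c: float) -> float:
    sa, ea, ma = decode(a)
    sb, eb, mb = decode(b)
    sc, ec, mc = decode(c)

    # Multiply significands A and B, add their exponents.
    sp, ep, mp = sa ^ sb, ea + eb, ma * mb

    # Align the product and C to a common exponent, then add.
    e = min(ep, ec)
    total = (-1) ** sp * (mp << (ep - e)) + (-1) ** sc * (mc << (ec - e))
    if total == 0:
        return 0.0

    # Normalize to 11 significant bits, then round once (ties to even).
    mag = abs(total)
    shift = mag.bit_length() - 11
    if shift > 0:
        rem = mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        mag >>= shift
        if rem > half or (rem == half and mag & 1):
            mag += 1
        e += shift
    result = math.ldexp(mag, e)
    return -result if total < 0 else result

print(fma_fp16(1.5, 2.0, 0.25))
```

Keeping the full-width product until the final rounding step is what separates this from a multiply followed by an add.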
15 Block Diagram
[Figure: block diagram. Inputs A, B, C; blocks Multiplier, Exp Calc, Align, Adder, Leading 0 Anticipator, Normalize, Round, Ovf Checker; Output Y]
16 Implementation
- Design target: 300MHz
- Speed is the design goal
- Ambitious target?
How we planned to achieve this:
- Fast Logic: parallelize ops as much as possible
- Pipelining
17 Implementation: Adder
Carry Select vs Carry Lookahead tree
18 Implementation: Adder
Han-Carlson based carry lookahead adder
- 6 lookahead logic stages for 32 bit adder
- Less logic than a Kogge-Stone adder
- Less wiring than a Brent-Kung adder
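A bit-level Python model of a Han-Carlson prefix network shows where the six stages of a 32 bit adder come from: one stage merging each odd bit with its even neighbour, four Kogge-Stone style stages over the odd bits only, and one final stage for the even bits. This is a behavioural sketch of the textbook structure, not the team's schematic; the function name and the power-of-two width restriction are assumptions.

```python
def han_carlson_add(a: int, b: int, width: int = 32):
    """Add two unsigned ints; returns (sum mod 2**width, carry_out, prefix stages)."""
    assert width >= 2 and width & (width - 1) == 0, "sketch assumes power-of-two width"
    p0 = [((a ^ b) >> i) & 1 for i in range(width)]   # per-bit propagate
    g = [((a & b) >> i) & 1 for i in range(width)]    # per-bit generate
    p = p0[:]
    stages = 0

    # Stage 1: every odd bit absorbs the even bit directly below it.
    for i in range(1, width, 2):
        g[i], p[i] = g[i] | (p[i] & g[i - 1]), p[i] & p[i - 1]
    stages += 1

    # Kogge-Stone style stages on odd bits only, with spans 2, 4, 8, ...
    d = 2
    while d < width:
        og, op = g[:], p[:]                 # all cells in a stage update in parallel
        for i in range(d + 1, width, 2):
            g[i] = og[i] | (op[i] & og[i - d])
            p[i] = op[i] & op[i - d]
        stages += 1
        d *= 2

    # Final stage: even bits take the carry from the odd bit below them.
    for i in range(2, width, 2):
        g[i] = g[i] | (p[i] & g[i - 1])
    stages += 1

    carries = [0] + g[:-1]                  # carry into bit i is the prefix of bit i-1
    total = sum((p0[i] ^ carries[i]) << i for i in range(width))
    return total, g[-1], stages

print(han_carlson_add(0xFFFFFFFF, 1))
```

Only the odd bit positions carry prefix cells in the middle stages, which is why the network needs roughly half the cells of a full Kogge-Stone tree at the cost of one extra stage.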
19 Implementation: Multiplier
Carry-Save Multiplier
- Avoids having ripple carry in every stage
- Enables regular and compact layout
- Easy to pipeline
- Final 10 bit add stage using carry lookahead adder
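The carry-save idea can be sketched at word level: each row of the array adds one partial product but keeps the result as separate sum and carry words, so no carry ripples until the single final add. The names below are hypothetical, the 11 bit default width assumes a 10 bit significand plus its hidden bit, and the final `s + c` stands in for the carry lookahead adder named on the slide.

```python
def csa(x: int, y: int, z: int):
    """3:2 carry-save adder on whole words: x + y + z == s + c, with no carry ripple."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def carry_save_multiply(a: int, b: int, width: int = 11) -> int:
    """Multiply two unsigned `width`-bit integers with a chain of carry-save rows."""
    s, c = 0, 0
    for i in range(width):
        partial = (a << i) if (b >> i) & 1 else 0   # one partial product per row
        s, c = csa(s, c, partial)                   # invariant: s + c == sum of rows so far
    return s + c                                    # the only carry-propagating add

print(carry_save_multiply(1025, 1025))
```

Because every row has the same shape, the structure maps to a regular bit-slice layout, and a pipeline register can be placed between any two rows.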
20 Implementation: Leading Zero Anticipator
- Predicts number of shifts to do in normalize
- Normalize begins with zero delay
- Operates in parallel with adder, so normalize shifts can be predicted with accuracy of 1 shift to left or right
21 Implementation: Latches
Pulse Latches
- Practically eliminates setup time
- 16 transistors per pulse generator
- Simplified version of those used in a certain high speed CPU
[Figure: clock pulse generator]
22 Implementation II and Floorplan Sonali
23 Design Decision: Pass Logic
Extensive use of Pass Logic
- Reduces transistor count
- Reduces area
- Transistor count reduced from 20,200 to 12,800
Example:
- Normalize: 3400 -> 942
- Align: 1500 -> 530
Ensure all pass logic is buffered
24 Design Decision: Pipelining
- Initially planned 6 pipeline stages
- Reduced to 4 pipeline stages
- Adder: Fast Carry Lookahead architecture
- Multiplier: Ripple Carry replaced by Carry Lookahead
25 Pipeline Stages
[Figure: pipeline diagram with blocks Reg A, Reg B, Reg C, Multiplier, Exp Calc, Align C, Adder, Ld Zero, Normalize, Round, Output]
26 Schematics: Multiplier
[Figure: multiplier schematic, labeled Inputs, Pipeline, Outputs]
27 Schematic: Adder
[Figure: adder schematic, labeled Inputs, Look Ahead Logic, Sum Logic, Outputs]
28 Floorplan Evolution
Initial Floorplan
[Figure: blocks Reg A, Reg B, Reg C, Multiplier, Exp Calc, Align C, Pipeline Reg, Adder, Ld Zero, Pipeline Reg, Normalize, Round, Overflow checker, Reg Y]
29 Floorplan Evolution
Final Floorplan
[Figure: blocks Reg A, Reg B, Reg C, Exponents, Align, Multiplier, Adder, Ld zero, Normalize, Round, Ovf, Output]
30 Layout, Verification & Specification Avni
31 Layout Decisions
- 3 cell heights: 6.03, 5.04 and 3.55
- Uniform width vdd and ground rails
- Wider vdd and ground rails in power hungry modules
- Max of 8 latches per clock pulse generator
- Uniform metal directionality within each block
32 Final Layout
33 Final Layout: Multiplier
34 Multiplier
Height: 191.6, Width: 206.38, Area: 20,388
[Figure: multiplier layout, labeled In, Bitslice, Pipeline Reg, Output]
35 Final Layout: Multiplier, Adder
36 Adder
Height: 122.9, Width: 110.2, Area: 13,202
[Figure: adder layout, labeled Adder and Incrementer]
37 Final Layout
[Figure: full layout with labeled blocks Input, Exponents, Align, Multiplier, Adder, Ld zero, Normalize, Round, Ovf, Out]
38 Layer Masks: Active 14.04%
39 Layer Masks: Poly 9.25%
40 Layer Masks: Metal 1 34.08%
41 Layer Masks: Metal 2 18.00%
42 Layer Masks: Metal 3 14.99%
43 Layer Masks: Metal 4 6.23%
44 Verification Of Design
- Behavioral and Structural Verilog
- Extensive Testing: unable to find C or Matlab code
- Schematic and Layout testing
- Analog Simulations: compare output with Behavioral
- Full Chip Verification
45 Design Specifications
- Critical path delay = 2.25 ns
- Clock speed = 400 MHz
- Pipeline stages = 4
- Height by width = 195.26 um * 303.255 um
- Area = 59,214 um^2
- Aspect ratio = 1:1.55
- Transistor density = 0.22
- Total Pin Count = 67
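The derived figures on this slide can be cross-checked with a few lines of arithmetic. The total of 12,820 transistors is taken from the area and transistor-count table elsewhere in the deck. Reading the density as transistors per um^2 is an assumption, since the slide gives no unit.

```python
critical_path_ns = 2.25
height_um, width_um = 195.26, 303.255
transistors = 12_820                      # total from the transistor-count table

f_max_mhz = 1e3 / critical_path_ns        # highest clock the critical path allows
area_um2 = height_um * width_um
aspect = width_um / height_um
density = transistors / area_um2          # assumed unit: transistors per um^2

print(f"max clock    ~ {f_max_mhz:.0f} MHz")   # about 444 MHz, above the 400 MHz spec
print(f"area         ~ {area_um2:,.0f} um^2")  # about 59,214
print(f"aspect ratio ~ 1:{aspect:.2f}")        # 1:1.55
print(f"density      ~ {density:.2f}")         # 0.22
```

The 2.25 ns critical path leaves a small margin over the 2.5 ns period of a 400 MHz clock.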
46 Power (mW)
Block                    Schematic   Layout      Schematic   Layout
                         (400 MHz)   (400 MHz)   (100 MHz)   (100 MHz)
Multiplier w/ pipeline   2.281       2.354       0.6168      0.6297
Exponents                0.3514      0.4094      0.0875      0.1029
Align                    0.0782      0.0926      0.0278      0.0324
Adder                    4.471       4.896       1.118       1.232
Leading 0                0.1313      0.1722      0.033       0.0433
Normalize                0.5865      0.6238      0.1481      0.1692
Round                    0.6339      0.6782      0.1593      0.1709
OvfCheck                 0.1632      0.1666      0.0408      0.04165
Total                    12.25       13.008      3.065       3.297
47 Area, Transistor Count and Delay
Block             Area (um^2)   Transistor   Transistor   Schematic    Layout
                                Count        Density      Delay (ns)   Delay (ns)
Multiplier        20,388        4496         0.22         3.38         N/A
  w/ pipeline                                             1.9          2.25
Exponents         5,163         738          0.14         1.01         1.2
Align             3,995         500          0.13         0.480        0.637
Adder             13,202        3174         0.24         1.34         1.7
Leading 0         1,253         364          0.29         0.506        0.551
Normalize         3,190         942          0.3          0.407        0.437
Round             1,802         494          0.28         0.864        0.986
OvfCheck          200           70           0.35         0.453        0.475
Registers, etc    N/A           2038         N/A          0.179        0.193
Total             59,214        12,820       0.22         -            -
48 Conclusion Jigar
49 Everyone Needs a MAD MAC
Graphics: HDR Rendering, Blending and Shader ops
- Fastest 180nm GPU: 250 MHz (9-bit Int)
- MAD MAC 525: 400 MHz (16-bit FP)
50 Everyone Needs a MAD MAC
DSPs: Computing Vector Dot-Products in Digital Filters
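A digital filter output is a dot product of coefficients and recent samples, so each filter tap costs exactly one multiply-accumulate. The sketch below models this with a hypothetical `mac_fp16` helper that computes a*b+acc exactly and rounds once to FP16. The 3-tap coefficients and the samples are made-up values, and inputs are assumed to be FP16-representable.

```python
import struct
from fractions import Fraction

def mac_fp16(a: float, b: float, acc: float) -> float:
    """Model of one MAC operation: a*b + acc, rounded once to FP16."""
    exact = Fraction(a) * Fraction(b) + Fraction(acc)
    return struct.unpack("<e", struct.pack("<e", float(exact)))[0]

def fir(coeffs, samples):
    """FIR filter: y[n] = sum over k of coeffs[k] * samples[n - k], one MAC per tap."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(coeffs):
            if n - k >= 0:                       # samples before the start count as zero
                acc = mac_fp16(h, samples[n - k], acc)
        out.append(acc)
    return out

print(fir([0.25, 0.5, 0.25], [1.0, 2.0, 3.0, 4.0]))
```

On a pipelined MAC, independent outputs can be interleaved so that a new tap enters the pipeline every cycle.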
51 Everyone Needs a MAD MAC
- Enables Fast Division, Square Root
- Eliminates extra Hardware to handle such computation
- Available in many new CPUs such as STI’s Cell
52 Future Enhancements
- 16 to 32 Bits
- Newer process technology
- Possible modifications for low power apps
53 Everyone Wants A MAD MAC 525