2 MAD MAC 525: Final Presentation, 1st May, 2006. Team: Farhan Mohamed Ali, Jigar Vora, Sonali Kapoor, Avni Jhunjhunwala. Design Manager: Zack Menegakis. Project: design a crucial part of a GPU called the Multiply Accumulate Unit (MAC), which is revolutionizing graphics.
3 Agenda: Marketing – Jigar; Project and Algorithm Description – Farhan; Implementation Part I – Farhan; Implementation Part II – Sonali; Floorplan – Sonali; Layout – Avni; Verification – Avni; Design Specifications – Avni; Conclusion – Jigar.
4 Marketing (Jigar)
5 Purpose: MAD MAC 525 accelerates FP16 blending to enable true HDR graphics. Huh?? Marketing | Description | Implementing | Floorplan | Layout | Specifications | Verify
7 Beauty of High Dynamic Range: With HDR rendering, pixel intensity can extend beyond the range of traditional graphics. Nature doesn't have a limited pixel intensity, and neither should computer graphics. In other words: bright things can be really bright, dark things can be really dark, and the details can be seen in both.
8 Applications of HDR
9 Target Market. Target market segment: graphics chip manufacturers; high-speed DSP manufacturers; CPU co-processors. Potential Customers.
10 Design Comparison: The top 180nm graphics chip is the NVIDIA NV16. Its highest speed is only 250MHz, with 9-bit integer precision. As games become more advanced, they need fast graphics chips. Conclusion: the market needs a FAST MAD MAC.
11 Description and Implementation I (Farhan)
12 Project Description: Multiply Accumulate unit (MAC). Executes the function AB+C on 16-bit floating-point inputs. Format: 1-bit sign, 5-bit exponent and 10-bit significand. Multiplies and adds in parallel to greatly speed up operation. Rounding is performed only once, so accuracy is greater than with individual multiply and add functions. Also known as Fused Multiply Add (FMA), or Multiply Add (MAD/MADD) in graphics shader programs.
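To make the format and the single-rounding point concrete, here is a minimal sketch. It is not the MAD MAC hardware: Python's `struct` half-precision codec stands in for the rounding step, and the helper names (`to_fp16`, `fields`, `mac_separate`, `mac_fused`) are invented for this illustration. The inputs are chosen so that the double-precision intermediate `a * b + c` is exact.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (round-half-to-even)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def fields(x):
    """Split an FP16 value into (sign, biased exponent, 10-bit significand field)."""
    bits = struct.unpack('<H', struct.pack('<e', x))[0]
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

def mac_separate(a, b, c):
    """Multiply, round to FP16, then add and round again (two roundings)."""
    return to_fp16(to_fp16(a * b) + c)

def mac_fused(a, b, c):
    """Compute a*b+c exactly (for these inputs) and round to FP16 once."""
    return to_fp16(a * b + c)

a, b, c = 1 + 2**-10, 1 + 2**-9, -1.0
print(fields(1.0))            # sign, biased exponent, significand field
print(mac_separate(a, b, c))  # the 2**-19 term is lost in the first rounding
print(mac_fused(a, b, c))     # the 2**-19 term survives
```

With these inputs the separately rounded result is 3 * 2**-10, while the fused result also keeps the 2**-19 term of the exact product.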
13 Algorithm. FP Multiply (A*B): multiply significands; add exponents; normalize; round. FP Add (A+B): align smaller number to larger number; add significands; normalize; round.
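The four multiply steps can be sketched on raw FP16 bit patterns. This is an illustrative software model under stated assumptions, not the design's logic: it handles normal numbers only (no zeros, subnormals, infinities or NaNs), assumes round-half-to-even, and the names `f2b`, `b2f` and `fp16_mul` are hypothetical.

```python
import struct

def f2b(x):
    """Float -> 16-bit FP16 pattern."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def b2f(bits):
    """16-bit FP16 pattern -> float."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def fp16_mul(a, b):
    sign = (a ^ b) >> 15
    ea, eb = (a >> 10) & 0x1F, (b >> 10) & 0x1F
    assert 1 <= ea <= 30 and 1 <= eb <= 30, "sketch handles normal inputs only"
    # Multiply significands (hidden 1 restored): 11 x 11 -> 21 or 22 bits.
    ma, mb = 0x400 | (a & 0x3FF), 0x400 | (b & 0x3FF)
    prod = ma * mb
    # Add exponents (remove one copy of the bias, 15).
    exp = ea + eb - 15
    # Normalize: a product in [2, 4) needs one extra right shift.
    shift = 10
    if prod >> 21:
        shift, exp = 11, exp + 1
    # Round to nearest, ties to even.
    kept = prod >> shift
    rem = prod & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and kept & 1):
        kept += 1
    if kept == 0x800:              # rounding carried out of the significand
        kept, exp = kept >> 1, exp + 1
    assert 1 <= exp <= 30, "sketch does not handle overflow or underflow"
    return (sign << 15) | (exp << 10) | (kept & 0x3FF)

print(b2f(fp16_mul(f2b(1.5), f2b(2.5))))
```

FP add follows the same pattern, with an alignment shift of the smaller operand in place of the significand multiply.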
14 Algorithm. FP Multiply-Add (AB+C): align significand C based on exp A + exp B - exp C; multiply significands A and B; add the A*B significand result to the aligned significand C; normalize; round.
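The multiply-add steps can be modelled in software as below. This is a sketch under assumptions, not the MAD MAC datapath: it uses Python's unbounded integers and aligns by shifting left so nothing is lost, whereas hardware would use a fixed-width alignment shifter. It handles normal inputs and normal or zero results only, assumes round-half-to-even, and `f2b`, `b2f`, `unpack16` and `fma16` are hypothetical names.

```python
import struct

def f2b(x):
    return struct.unpack('<H', struct.pack('<e', x))[0]

def b2f(bits):
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def unpack16(bits):
    """Return (sign, exponent of the significand LSB, 11-bit integer significand)."""
    exp = (bits >> 10) & 0x1F
    assert 1 <= exp <= 30, "sketch handles normal inputs only"
    return bits >> 15, exp - 15 - 10, 0x400 | (bits & 0x3FF)

def fma16(a, b, c):
    sa, ea, ma = unpack16(a)
    sb, eb, mb = unpack16(b)
    sc, ec, mc = unpack16(c)
    # Multiply significands A and B; the product exponent is exp A + exp B.
    sp, ep, mp = sa ^ sb, ea + eb, ma * mb
    # Align product and C to a common exponent (exact, by left shifts).
    e = min(ep, ec)
    total = (-1) ** sp * (mp << (ep - e)) + (-1) ** sc * (mc << (ec - e))
    if total == 0:
        return 0
    sign, mag = (1 if total < 0 else 0), abs(total)
    # Normalize to 11 significant bits, then round once (ties to even).
    shift = mag.bit_length() - 11
    if shift > 0:
        kept, rem = mag >> shift, mag & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if rem > half or (rem == half and kept & 1):
            kept += 1
        if kept == 0x800:          # rounding carried out of the significand
            kept, shift = kept >> 1, shift + 1
    else:
        kept = mag << -shift
    exp = e + shift + 10 + 15      # re-bias the exponent
    assert 1 <= exp <= 30, "sketch does not handle overflow or underflow"
    return (sign << 15) | (exp << 10) | (kept & 0x3FF)

print(b2f(fma16(f2b(2.0), f2b(3.0), f2b(1.0))))
```

Rounding happens in exactly one place, after the add, which is the property that distinguishes the fused operation from a multiply followed by an add.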
15 Block Diagram (figure). Blocks shown: inputs A, B, C; Multiplier; Exp Calc; Align; Adder; Leading 0 Anticipator; Normalize; Round; Ovf Checker; Output Y.
16 Implementation. Design target: 300MHz. Speed is the design goal. Ambitious target? How we planned to achieve this: fast logic (parallelize ops as much as possible) and pipelining.
17 Implementation: Adder. Carry Select vs Carry Lookahead tree.
18 Implementation: Adder. Han-Carlson based carry lookahead adder; 6 lookahead logic stages for a 32-bit adder; less logic than a Kogge-Stone adder; less wiring than a Brent-Kung adder.
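A bit-level software model of a Han-Carlson prefix network shows where the 6 stages for 32 bits come from. This follows the textbook structure (one pairing stage on odd bits, Kogge-Stone style stages on odd bits only, one final stage for even bits) and is only a sketch; the team's actual gate-level design is not given on the slide, and `han_carlson_add` is a hypothetical name.

```python
import random

def han_carlson_add(a, b, n=32):
    """Add two n-bit integers; return (sum mod 2**n, carry out, prefix stages)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # propagate
    G, P = g[:], p[:]
    # Stage 1: combine each odd bit with its even neighbour.
    for i in range(1, n, 2):
        G[i], P[i] = g[i] | (p[i] & g[i - 1]), p[i] & p[i - 1]
    stages = 1
    # Middle stages: Kogge-Stone prefix tree over the odd bits only.
    d = 2
    while d < n:
        G2, P2 = G[:], P[:]
        for i in range(d + 1, n, 2):
            G2[i] = G[i] | (P[i] & G[i - d])
            P2[i] = P[i] & P[i - d]
        G, P = G2, P2
        stages += 1
        d *= 2
    # Final stage: even bits take the carry from the odd bit below.
    for i in range(2, n, 2):
        G[i] = g[i] | (p[i] & G[i - 1])
    stages += 1
    carries = [0] + G[:-1]                               # carry into each bit
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[n - 1], stages

rng = random.Random(525)
x, y = rng.getrandbits(32), rng.getrandbits(32)
s, cout, stages = han_carlson_add(x, y)
print(s == (x + y) & 0xFFFFFFFF, cout == (x + y) >> 32, stages)
```

For comparison, a 32-bit Kogge-Stone tree needs 5 prefix stages but computes a prefix cell at every bit position in each stage, while this network touches only the odd positions in the middle stages.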
19 Implementation: Multiplier. Carry-save multiplier; avoids having a ripple carry in every stage; enables regular and compact layout; easy to pipeline; final 10-bit add stage using a carry lookahead adder.
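The carry-save idea can be sketched at word level: each row of the array compresses three numbers into a sum word and a carry word without propagating carries, and only the last step performs a real addition. This is a minimal illustration, not the team's array layout; the function names are hypothetical, and Python's `+` stands in for the final carry lookahead adder.

```python
def csa(x, y, z):
    """3:2 compressor on whole words: x + y + z == s + c, with no carry ripple."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def carry_save_multiply(a, b, n=11):
    """Multiply two n-bit unsigned significands using carry-save accumulation."""
    s, c = 0, 0
    for i in range(n):
        partial = (a << i) if (b >> i) & 1 else 0   # one partial-product row
        s, c = csa(s, c, partial)                   # no carry propagation here
    return s + c                                    # single final carry-propagate add

print(carry_save_multiply(0x600, 0x500))            # significands of 1.5 and 2.5
```

Because every row has the same shape and depends only on the previous sum and carry words, pipeline registers can be placed between any two rows.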
20 Implementation: Leading Zero Anticipator. Predicts the number of shifts to do in normalize, so normalize begins with zero delay. Operates in parallel with the adder, so normalize shifts can be predicted with an accuracy of 1 shift to the left or right.
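The anticipator circuit itself is not described on the slide, so the sketch below models only what the normalizer does with its output: apply a predicted shift that may be off by one in either direction, then correct it with a single extra shift. The exact count is computed here from the sum for illustration, whereas the real anticipator works from the adder inputs. Function names are hypothetical.

```python
def leading_zeros(value, width):
    """Exact number of leading zeros of a nonzero width-bit value."""
    assert 0 < value < (1 << width)
    return width - value.bit_length()

def normalize_with_prediction(value, width, predicted):
    """Shift left by the predicted amount, then fix an off-by-one prediction."""
    shifted = value << predicted
    if shifted >> width:                       # predicted one shift too many
        return shifted >> 1, predicted - 1
    if not (shifted >> (width - 1)) & 1:       # predicted one shift too few
        return shifted << 1, predicted + 1
    return shifted, predicted

value, width = 0x02C0, 16
exact = leading_zeros(value, width)
for guess in (exact - 1, exact, exact + 1):
    print(guess, normalize_with_prediction(value, width, guess))
```

All three guesses end with the leading 1 in the top bit position, so the correction costs at most one extra shift.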
21 Implementation: Latches. Pulse latches; practically eliminates setup time; 16 transistors per pulse generator; simplified version of those used in a certain high speed CPU. Figure: clock pulse generator.
22 Implementation II and Floorplan (Sonali)
23 Design Decision: Pass Logic. Extensive use of pass logic; reduces transistor count; reduces area. Transistor count reduced from 20,200 to 12,800. Example: Normalize: > 942; Align: > 530. Ensure all pass logic is buffered.
24 Design Decision: Pipelining. Initially planned 6 pipeline stages; reduced to 4 pipeline stages. Adder: fast carry lookahead architecture. Multiplier: ripple carry to carry lookahead.
25 Pipeline Stages (figure). Blocks shown: Reg A, Reg B, Reg C, Multiplier, Align C, Exp Calc, Adder, Ld Zero, Normalize, Round, Output.
26 Schematics: Multiplier (figure). Labels: Inputs, Pipeline, Outputs.
27 Schematic: Adder (figure). Labels: Inputs, Look Ahead Logic, Sum Logic, Outputs.
28 Floorplan Evolution: Initial Floorplan (figure). Blocks shown: Reg A, Reg B, Reg C, Multiplier, Align C, Exp Calc, Pipeline Reg, Adder, Ld Zero, Pipeline Reg, Normalize, Round, Overflow checker, Reg Y.
29 Floorplan Evolution: Final Floorplan (figure). Blocks shown: Reg A, Reg B, Reg C, Exponents, Align, Ld zero, Adder, Multiplier, Normalize, Round, Ovf, Output.
30 Layout, Verification & Specification (Avni)
31 Layout Decisions: 3 cell heights (6.03, 5.04 and 3.55); uniform-width vdd and ground rails; wider vdd and ground rails in power-hungry modules; max of 8 latches per clock pulse generator; uniform metal directionality within each block.
32 Final Layout
33 Final Layout: Multiplier
34 Multiplier (layout figure). Height: Width: Area: 20,388. Labels: In, Pipeline Reg, Bitslice, Output.
35 Final Layout: Multiplier, Adder
36 Adder (layout figure). Height: 122.9 Width: Area: 13,202. Labels: Adder, Incrementer.
37 Final Layout (figure). Blocks shown: Input, Exponents, Align, Ld zero, Adder, Multiplier, Normalize, Round, Ovf, Out.
38 Layer Masks. Active: 14.04%
39 Layer Masks. Poly: 9.25%
40 Layer Masks. Metal 1: 34.08%
41 Layer Masks. Metal 2: 18.00%
42 Layer Masks. Metal 3: 14.99%
43 Layer Masks. Metal 4: 6.23%
44 Verification of Design: behavioral and structural Verilog; extensive testing (unable to find C or Matlab code); schematic and layout testing; analog simulations (compare output with behavioral); full chip verification.
45 Design Specifications: Critical path delay = 2.25ns; Clock speed = 400MHz; Pipeline stages = 4; Height by width = um * um; Area = 59,214 um^2; Aspect ratio = 1:1.55; Transistor density = 0.22; Total pin count = 67.
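A few of these figures can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the deck (the 12,800 transistor count comes from the pass-logic slide). The latency line assumes one pipeline stage per clock cycle, and the density unit is taken to be transistors per um^2; neither is stated on the slide.

```python
critical_path_ns = 2.25
clock_mhz = 400.0
pipeline_stages = 4
area_um2 = 59_214
transistors = 12_800                 # from the pass-logic slide

period_ns = 1000.0 / clock_mhz                   # clock period at 400 MHz
slack_ns = period_ns - critical_path_ns          # margin left in each cycle
max_clock_mhz = 1000.0 / critical_path_ns        # limit set by the critical path
latency_ns = pipeline_stages * period_ns         # assumes one stage per cycle
density = transistors / area_um2                 # assumed unit: transistors per um^2

print(f"period  = {period_ns:.2f} ns, slack = {slack_ns:.2f} ns")
print(f"max clk = {max_clock_mhz:.0f} MHz")
print(f"latency = {latency_ns:.1f} ns")
print(f"density = {density:.2f}")
```

The quoted 2.25 ns critical path fits inside the 2.5 ns period of a 400 MHz clock, and the density computed from the two quoted totals agrees with the 0.22 on the slide.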
46 Table: power per block. Columns: Schematic Power, mW (400 MHz); Layout Power, mW (400 MHz); Schematic Power, mW (100 MHz); Layout Power, mW (100 MHz). Rows: Multiplier; -w/ pipeline; Exponents; Align; Adder; Leading; Normalize; Round; OvfCheck; Total.
47 Table: area, transistor count, density and delay per block. Columns: Area (um^2); Transistor Count; Transistor Density; Schematic Delay (ns); Layout Delay (ns). Rows as extracted: Multiplier -w/ pipeline N/A 2.25; Exponents 5,; Align 3,; Adder 13,; Leading 0 1,; Normalize 3,; Round 1,; OvfCheck; Registers, etc N/A 2038 N/A; Total 59,214 12,
48 Conclusion (Jigar)
49 Everyone Needs a MAD MAC. Graphics: HDR rendering, blending and shader ops. Fastest 180nm GPU: 250 MHz (9-bit Int). MAD MAC 525: 400 MHz (16-bit FP).
50 Everyone Needs a MAD MAC. DSPs: computing vector dot-products in digital filters.
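A dot product maps directly onto a MAC: each step multiplies one pair of elements and adds the running sum. The sketch below models this in software. `to_fp16`, `mac` and `dot` are hypothetical names, Python's `struct` half-precision codec stands in for the unit's single rounding, and the double-precision intermediate is exact only for operands of broadly similar magnitude.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def mac(a, b, c):
    """Model of one multiply-accumulate: a*b + c with one rounding to FP16."""
    return to_fp16(a * b + c)

def dot(xs, ys):
    """Dot product as a chain of MACs, one per element pair."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = mac(to_fp16(x), to_fp16(y), acc)
    return acc

taps = [0.25, 0.5, 0.25]          # example 3-tap smoothing filter
samples = [1.0, 2.0, 3.0]
print(dot(taps, samples))
```

One output sample of an N-tap FIR filter is exactly such a dot product, so it costs N MAC operations.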
51 Everyone Needs a MAD MAC. Enables fast division and square root; eliminates extra hardware to handle such computation; available in many new CPUs such as STI's Cell.
52 Future Enhancements: 16 to 32 bits; newer process technology; possible modifications for low power apps.
53 Everyone Wants A MAD MAC 525
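One common way a multiply-add unit replaces a divider is Newton-Raphson iteration on the reciprocal, where every step is a pair of multiply-adds. The slide does not say which method the team has in mind, so this is only a sketch of that standard technique, written in double precision. The `fma` helper is a stand-in (plain Python rounds twice, real FMA hardware rounds once), and the linear starting estimate and iteration count are illustrative choices.

```python
import math

def fma(a, b, c):
    """Stand-in for a hardware multiply-add (not truly fused in plain Python)."""
    return a * b + c

def reciprocal(d, iterations=4):
    """Approximate 1/d for d > 0 using only multiply-add steps."""
    assert d > 0
    m, e = math.frexp(d)                    # d = m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - (32.0 / 17.0) * m     # linear first guess for 1/m
    for _ in range(iterations):
        err = fma(-m, x, 1.0)               # err = 1 - m*x
        x = fma(x, err, x)                  # x = x + x*err, error is squared
    return math.ldexp(x, -e)                # undo the scaling: 1/d = (1/m) * 2**-e

def divide(n, d):
    return n * reciprocal(d)

print(divide(1.0, 3.0))
print(divide(10.0, 4.0))
```

Each iteration roughly squares the relative error, so a handful of multiply-add steps is enough; square root can be handled with a similar iteration on the reciprocal square root.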