® 1 Exponential Challenges, Exponential Rewards The Future of Moores Law Shekhar Borkar Intel Fellow Circuit Research, Intel Labs Fall, 2004
2 ISSCC 2003 Gordon Moore said… No exponential is forever… But We can delay Forever
3 Outline The exponential challenges The exponential challenges Circuit and Arch solutions Circuit and Arch solutions Major paradigm shifts in design Major paradigm shifts in design Integration & SOC Integration & SOC The exponential reward The exponential reward Summary Summary
4 Goal: 1TIPS by 2010 Pentium® Pro Architecture Pentium® 4 Architecture Pentium® Architecture How do you get there?
5 Technology Scaling GATE SOURCE BODY DRAIN Xj Tox D GATE SOURCE DRAIN Leff BODY Dimensions scale down by 30% Doubles transistor density Oxide thickness scales down Faster transistor, higher performance Vdd & Vt scaling Lower active power Technology has scaled well, will it in the future?
6 Technology Outlook High Volume Manufacturing Technology Node (nm) Integration Capacity (BT) Delay = CV/I scaling 0.7~0.7>0.7 Delay scaling will slow down Energy/Logic Op scaling >0.35>0.5>0.5 Energy scaling will slow down Bulk Planar CMOS High Probability Low Probability Alternate, 3G etc Low Probability High Probability Variability Medium High Very High ILD (K) ~3<3 Reduce slowly towards Reduce slowly towards RC Delay Metal Layers to 1 layer per generation
7 Is Transistor a Good Switch? On I = I = 0 Off I = 0 I 0 I = 1ma/u I 0 Sub-threshold Leakage
8 Exponential Challenge #1
9 Sub-threshold Leakage Sub-threshold leakage increases exponentially Assume: 0.25 m, I off = 1na/ 5X increase each generation at 30ºC
10 SD Leakage Power SD leakage power becomes prohibitive
11 Leakage Power Leakage power limits Vt scaling A. Grove, IEDM 2002
12 Exponential Challenge #2
13 Gate Oxide is Near Limit High-K dielectric is crucial 90nm MOS Transistor50nm Silicon substrate 1.2 nm SiO 2 Gate If Tox scaling slows down, then Vdd scaling will have to slow down
14 Exponential Challenge #3
15 Energy per Logic Operation Energy per logic operation scaling will slow down
16 The Power Crisis Business as usual is not an option
17 Exponential Challenge #4
18 Sources of Variations Heat Flux (W/cm 2 ) Results in Vcc variation Temperature Variation (°C) Hot spots Random Dopant Fluctuations 193nm 248nm 365nm LithographyWavelength 65nm 90nm 130nm Generation Gap 45nm 32nm 13nm EUV 180nm Source: Mark Bohr, Intel Sub-wavelength Lithography
19 Frequency & SD Leakage 0.18 micron ~1000 samples 20X 30% Low Freq Low Isb High Freq Medium Isb High Freq High Isb
20 ~30mV Vt Distribution 0.18 micron ~1000 samples Low Freq Low Isb High Freq Medium Isb High Freq High Isb
21 Exponential Challenge #5
22 Platform Requirements PC towerMini tower tower Slim lineSmall pc System Volume ( cubic inch) Shrinking volume Quieter Yet, High Performance Power (W) Thermal Budget ( o C/W) Heat-Sink Volume (in 3 ) Projected Heat Dissipation Volume Projected Air Flow Rate Pentium ® III Thermal Budget Air Flow Rate (CFM) Pentium ® 4 Thermal budget decreasing Higher heat sink volume Higher air flow rate
23 Exponential Challenge #6
24 Exponential Costs G. Moore ISSCC 03 Litho Cost FAB Cost $ per Transistor $ per MIPS
25 Product Cost Pressure Shrinking ASP, and shrinking $ budget for power
26 Must Fit in Power Envelope Technology, Circuits, and Architecture to constrain the power nm65nm45nm32nm22nm16nm Power (W), Power Density (W/cm 2 ) SiO2 Lkg SD Lkg Active 10 mm Die 1 MB 2 MB 4 MB 8 MB
27 Some Implications Tox scaling will slow downmay stop? Tox scaling will slow downmay stop? Vdd scaling will slow downmay stop? Vdd scaling will slow downmay stop? Vt scaling will slow downmay stop? Vt scaling will slow downmay stop? Approaching constant Vdd scaling Approaching constant Vdd scaling Energy/logic op will not scale Energy/logic op will not scale
28 The Gigascale Dilemma 1B T integration capacity will be available 1B T integration capacity will be available But could be unusable due to power But could be unusable due to power Logic T growth will slow down Logic T growth will slow down Transistor performance will be limited Transistor performance will be limitedSolutions Low power design techniques Low power design techniques Improve design efficiencyMulti everywhere Improve design efficiencyMulti everywhere Valued performance by even higher integration (of potentially slower transistors) Valued performance by even higher integration (of potentially slower transistors)
29 Poweractive and leakage VariationsMicroarchitecture
30 Active Power Reduction SlowFastSlow Low Supply Voltage High Supply Voltage Multiple Supply Voltages Logic Block Freq = 1 Vdd = 1 Throughput = 1 Power = 1 Area = 1 Pwr Den = 1 Vdd Logic Block Freq = 0.5 Vdd = 0.5 Throughput = 1 Power = 0.25 Area = 2 Pwr Den = Vdd/2 Logic Block Replicated Designs
31 Leakage Control Body Bias Vdd Vbp Vbn -Ve +Ve 2-10XReduction Sleep Transistor Logic Block XReduction Stack Effect Equal Loading 5-10XReduction
32 Circuit Design Tradeoffs Low-Vt usage low high Higher probability of target frequency with: 1.Larger transistor sizes 2.Higher Low-Vt usage But with power penalty Transistor size small large power target frequency probability
33 Impact of Critical Paths # of critical paths Mean clock frequency Clock frequency Number of dies 0% 20% 40% 60% # critical paths With increasing # of critical paths With increasing # of critical paths –Both and become smaller –Lower mean frequency
34 Impact of Logic Depth 0% 20% 40% -16%-8%0%8%16% Delay 20% 40% NMOS PMOS Device I ON Number of samples (%) Variation (%) NMOS Ion PMOS Ion / Delay / 5.6%3.0%4.2% Logic depth: Logic depth Ratio of delay- to Ion
Logic depth small large frequency target frequency probability # uArch critical paths lessmore Architecture Tradeoffs Architecture Tradeoffs Higher target frequency with: 1.Shallow logic depth 2.Larger number of critical paths But with lower probability
36 Variation-tolerant Design # uArch critical paths lessmore Balance power & frequency with variation tolerance Logic depth small large frequency target frequency probability Transistor size small large power target frequency probability Low-Vt usage low high
37 Leakage Power FrequencyDeterministicProbabilistic 10X variation ~50% total power Probabilistic Design Delay Path Delay Probability Deterministic design techniques inadequate in the future Due to variations in: V dd, V t, and Temp Delay Target # of Paths Deterministic Delay Target # of Paths Probabilistic
38 Shift in Design Paradigm Multi-variable design optimization for: Multi-variable design optimization for: – Yield and bin splits – Parameter variations – Active and leakage power – Performance Tomorrow: Global Optimization Multi-variateToday: Local Optimization Single Variable
39 Resistor Network 4.5 mm 5.3 mm Multiple subsites PD & Counter Resistor Network CUT Bias Amplifier Delay Die frequency: Min(F 1..F 21 ) Die power: Sum(P 1..P 21 ) Technology 150nm CMOS Number of subsites per die 21 Body bias range 0.5V FBB to 0.5V RBB Bias resolution 32 mV 1.6 X 0.24 mm, 21 sites per die 150nm CMOS Adaptive Body Bias--Experiment
40 Adaptive Body Bias--Results 0% 20% 60% 100% Accepted die noBB 100% yield ABB Higher Frequency 97% highest bin within die ABB For given Freq and Power density 100% yield with ABB 100% yield with ABB 97% highest freq bin with ABB for within die variability 97% highest freq bin with ABB for within die variability
41 Design & Arch Efficiency Employ efficient design & Architectures
42 Memory Latency MemoryCPUCache Small ~few Clocks Large ns Assume: 50ns Memory latency Cache miss hurts performance Worse at higher frequency Cache miss hurts performance Worse at higher frequency
43 Increase on-die Memory Large on die memory provides: 1.Increased Data Bandwidth & Reduced Latency 2.Hence, higher performance for much lower power
44 Multi-threading ST Wait for Mem MT1 Wait for Mem MT2 Wait MT3 Single Thread Multi-Threading Full HW Utilization Multi-threading improves performance without impacting thermals & power delivery Thermals & Power Delivery designed for full HW utilization
45 Chip Multi-Processing C1C2 C3C4 Cache Multi-core, each core Multi-threaded Shared cache and front side bus Each core has different Vdd & Freq Core hopping to spread hot spots Lower junction temperature
46 Special Purpose Hardware 2.23 mm X 3.54 mm, 260K transistors Opportunities: Network processing engines MPEG Encode/Decode engines Speech engines Special purpose HWBest Mips/Watt TCP Offload Engine
47 Valued Performance: SOC (System on a Chip) Special-purpose hardware more MIPS/mm² Special-purpose hardware more MIPS/mm² SIMD integer and FP instructions in several ISAs SIMD integer and FP instructions in several ISAs Die Area PowerPerformance General Purpose 2X2X~1.4X Multimedia Kernels <10%<10% X Si Monolithic Polylithic CPU Special HW Memory CMOS RF Wireline RF Opto- Electronics Dense Memory Heterogeneous Si, SiGe, GaAs
48 The Exponential Reward Multi-everywhere: MT, CMP Speculative, OOO Era of Instruction LevelParallelism Super Scalar Era of Pipelined Architecture Multi Threaded Era of Thread & ProcessorLevelParallelism Special Purpose HW Multi-Threaded, Multi-Core
49 SummaryDelaying Forever Gigascale transistor integration capacity will be availablePower and Energy are the barriers Gigascale transistor integration capacity will be availablePower and Energy are the barriers Variations will be even more prominentshift from Deterministic to Probabilistic design Variations will be even more prominentshift from Deterministic to Probabilistic design Improve design efficiency Improve design efficiency Multieverywhere, & SOC valued performance Multieverywhere, & SOC valued performance Exploit integration capacity to deliver performance in power/cost envelope Exploit integration capacity to deliver performance in power/cost envelope