1 VLSI Design Challenges for Gigascale Integration. Shekhar Borkar, Intel Corp. October 25, 2005
2 Outline
– Technology scaling challenges
– Circuit and design solutions
– Microarchitecture advances
– Multi-everywhere
– Summary
3 Goal: 10 TIPS by 2015
[Chart: performance trajectory across the Pentium®, Pentium® Pro, and Pentium® 4 architectures.]
How do you get there?
4 Technology Scaling
[Diagrams: MOS transistor cross-sections before and after scaling, labeling gate, source, drain, body, Xj, Tox, and Leff.]
– Dimensions scale down by 30%, doubling transistor density
– Oxide thickness scales down: faster transistor, higher performance
– Vdd & Vt scaling: lower active power
Scaling will continue, but with challenges!
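The slide's density claim is simple geometry: shrinking both linear dimensions by 30% halves the area per transistor. A minimal sketch of that arithmetic:

```python
# Classic scaling arithmetic for one process generation: a 30% reduction
# in linear dimensions roughly doubles transistor density.
scale = 0.7                       # linear shrink per generation (from the slide)
area_factor = scale ** 2          # each transistor occupies ~0.49x the area
density_gain = 1 / area_factor    # ~2.04x more transistors per unit area
print(f"area factor: {area_factor:.2f}, density gain: {density_gain:.2f}x")
```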
5 Technology Outlook
[Table: trends by high-volume-manufacturing technology node (nm) and integration capacity (BT), per generation:]
– Delay = CV/I scaling: 0.7 → ~0.7 → >0.7; delay scaling will slow down
– Energy/logic-op scaling: >0.35 → >0.5 → >0.5; energy scaling will slow down
– Bulk planar CMOS: high probability → low probability
– Alternate, 3G etc.: low probability → high probability
– Variability: medium → high → very high
– ILD (K): ~3 → <3, reducing slowly
– RC delay: reducing slowly
– Metal layers: growing, to 1 layer per generation
6 The Leakage(s)…
[TEM image: 90nm MOS transistor with 50nm Si gate length and a 1.2 nm SiO2 gate oxide.]
7 Must Fit in Power Envelope
[Chart: power (W) and power density (W/cm2) for a 10 mm die across the 65nm, 45nm, 32nm, 22nm, and 16nm nodes, split into active power, sub-threshold (SD) leakage, and gate-oxide (SiO2) leakage.]
Technology, circuits, and architecture must all be used to constrain the power.
8 Solutions
– Move away from frequency alone to deliver performance
– More on-die memory
– Multi-everywhere
  – Multi-threading
  – Chip-level multi-processing
– Throughput-oriented designs
– Valued performance by higher level of integration
  – Monolithic & polylithic
9 Leakage Solutions
[Diagrams: planar transistor with a 1.2 nm SiO2 gate oxide; planar transistor with a 3.0nm high-k gate dielectric; tri-gate transistor on a silicon substrate.]
For a few generations, then what?
10 Active Power Reduction
– Multiple supply voltages: run slow logic at a low supply voltage and only the fast, critical logic at the high supply voltage
– Throughput-oriented designs: replicate a logic block and run both copies at half voltage and half frequency
  One block: Freq = 1, Vdd = 1, Throughput = 1, Power = 1, Area = 1, Power Density = 1
  Two blocks: Freq = 0.5, Vdd = 0.5, Throughput = 1, Power = 0.25, Area = 2, Power Density = 0.125
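The throughput-oriented trade-off above follows directly from the dynamic power relation P = C·V²·f. A small sketch in normalized units (capacitance and the base operating point set to 1 as assumptions) reproduces the slide's numbers:

```python
# Dynamic power model P = C * V^2 * f (normalized units): two half-speed,
# half-voltage copies of a block match the original throughput at 1/4 power.
def dyn_power(vdd, freq, n_blocks=1, cap=1.0):
    return n_blocks * cap * vdd**2 * freq

base = dyn_power(vdd=1.0, freq=1.0)                  # one block: power = 1.0
parallel = dyn_power(vdd=0.5, freq=0.5, n_blocks=2)  # two blocks: power = 0.25
throughput = 2 * 0.5                                 # two blocks at half rate = 1.0
print(base, parallel, throughput)  # 1.0 0.25 1.0
```

The cost is 2x area, which is why this works only for throughput-oriented designs.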
11 Leakage Control
– Body bias (Vbp/Vbn applied +Ve/-Ve relative to Vdd): 2-10X reduction
– Sleep transistor gating the logic block's supply
– Stack effect (equal loading): 5-10X reduction
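The body-bias reduction works because sub-threshold leakage falls exponentially as the threshold voltage rises, and reverse body bias raises Vt. A sketch with illustrative constants (the Vt values and sub-threshold slope below are assumptions, not process data):

```python
# Sub-threshold leakage falls by one decade for every "sub-threshold slope"
# worth of Vt increase; reverse body bias raises Vt and so cuts leakage.
def leakage(vt, slope_mv=100.0):
    """Relative leakage for threshold voltage vt (V); slope in mV/decade."""
    return 10 ** (-vt * 1000 / slope_mv)

nominal = leakage(0.30)         # nominal Vt (assumed 300 mV)
rbb = leakage(0.30 + 0.10)      # reverse body bias raising Vt by ~100 mV
print(f"leakage reduction: {nominal / rbb:.0f}x")  # 10x for these numbers
```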
12 Optimum Frequency
Maximum performance comes at an optimum pipeline depth and an optimum frequency:
– Pushing relative frequency through deeper pipelining gives diminishing performance returns
– Sub-threshold leakage increases exponentially as the process is pushed to higher frequency
[Charts: power efficiency vs. relative pipeline depth, showing an optimum; performance vs. relative frequency from pipelining, showing diminishing returns.]
13 Memory Latency
[Diagram: CPU with a small cache (access in a few clocks) and a large external memory (access measured in ns).]
Assume 50ns memory latency: a cache miss hurts performance, and it hurts worse at higher frequency.
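The "worse at higher frequency" point is just unit conversion: a fixed latency in nanoseconds costs more core cycles as the clock speeds up. The frequencies below are illustrative:

```python
# A fixed 50 ns memory latency costs more core cycles at higher clock rates,
# which is why cache misses hurt fast cores disproportionately.
MEM_LATENCY_NS = 50
for ghz in (1, 2, 3):
    stall_cycles = MEM_LATENCY_NS * ghz  # ns * (cycles per ns)
    print(f"{ghz} GHz: a miss stalls ~{stall_cycles} cycles")
```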
14 Increase on-die Memory
Large on-die memory provides:
1. Increased data bandwidth & reduced latency
2. Hence, higher performance for much lower power
15 Multi-threading
[Diagram: a single thread (ST) spends much of its time waiting for memory; with multi-threading (MT1, MT2, MT3), other threads fill the wait slots for full HW utilization.]
Multi-threading improves performance without impacting thermals & power delivery, which are already designed for full HW utilization.
16 Single Core Power/Performance
Moore's Law provides more transistors for advanced architectures, delivering higher peak performance, but with lower power efficiency.
17 Chip Multi-Processing
[Diagram: four cores (C1-C4) with a shared cache.]
– Multi-core, each core multi-threaded
– Shared cache and front-side bus
– Each core has different Vdd & frequency
– Core hopping to spread hot spots
– Lower junction temperature
18 Dual Core
Rule of thumb: a 1% voltage reduction gives ~1% lower frequency, ~3% lower power, and ~0.66% lower performance.
In the same process technology:
– Single core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
– Dual core: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8
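Applying the rule of thumb with a 15% voltage drop to two cores roughly reproduces the slide's dual-core numbers. A minimal sketch of that arithmetic:

```python
# Rule of thumb from the slide: each 1% supply-voltage drop costs ~1%
# frequency and ~0.66% single-thread performance, but saves ~3% power.
v_drop_pct = 15
power_per_core = 1 - 0.03 * v_drop_pct    # ~0.55
perf_per_core = 1 - 0.0066 * v_drop_pct   # ~0.90
dual_power = 2 * power_per_core           # ~1.1, close to the slide's ~1
dual_perf = 2 * perf_per_core             # ~1.8, matching Perf = ~1.8
print(f"dual-core power ~{dual_power:.2f}, perf ~{dual_perf:.2f}")
```

So two slower, lower-voltage cores nearly double throughput within the single-core power budget, assuming the workload can use both cores.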
19 Multi-Core
[Diagram: one large core with cache vs. four small cores (C1-C4) sharing a cache.]
– A small core delivers Performance = 1/2 at Power = 1/4 relative to the large core
– Multi-core is power efficient
– Better power and thermal management
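Using the slide's small-core ratios, trading one large core for several small ones doubles throughput in the same power envelope. A sketch (the choice of four cores is an assumption that fills the large core's power budget exactly):

```python
# Trading one large core for small ones, using the slide's ratios:
# a small core gives 1/2 the performance for 1/4 the power.
small_power, small_perf = 0.25, 0.5
n = 4                            # four small cores fit the large core's budget
total_power = n * small_power    # 1.0: same envelope as one large core
total_perf = n * small_perf      # 2.0: double the throughput, if parallel
print(total_power, total_perf)   # 1.0 2.0
```

The "if parallel" caveat is exactly what the Amdahl's Law slide below quantifies.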
20 Special Purpose Hardware
[Die photo: 2.23 mm x 3.54 mm, 260K transistors.]
Opportunities:
– Network processing engines
– MPEG encode/decode engines, speech engines
– TCP/IP offload engine
Special-purpose HW provides the best MIPS/Watt.
21 Performance Scaling
Amdahl's Law: Parallel Speedup = 1 / (Serial% + (1 - Serial%)/N)
– Serial% = 6.7%: N = 16 cores gives Perf = 8 (N/2)
– Serial% = 20%: N = 6 cores gives Perf = 3 (N/2)
Parallel software is key to multi-core success.
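The slide's two data points fall straight out of the formula; even a small serial fraction caps the speedup well below the core count:

```python
# Amdahl's Law as stated on the slide: speedup saturates once the serial
# fraction dominates, no matter how many cores are added.
def speedup(serial_frac, n_cores):
    return 1 / (serial_frac + (1 - serial_frac) / n_cores)

print(round(speedup(0.067, 16), 1))  # ~8: 16 cores deliver half their peak
print(round(speedup(0.20, 6), 1))    # ~3: 20% serial code halves 6 cores
```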
22 From Multi to Many…
[Projection: a 13mm die with 100W, a 48MB cache, and 4B transistors in 22nm; configurations of 12 cores, 24 cores, or 144 cores.]
23 Future Multi-core Platform
[Diagram: heterogeneous multi-core platform; general purpose (GP) cores and special purpose (SP) HW blocks (labeled CC) connected by an interconnect fabric.]
24 The New Era of Computing
[Timeline:]
– Era of pipelined architecture
– Era of instruction-level parallelism: super-scalar, speculative, OOO
– Era of thread & processor-level parallelism: multi-everywhere (MT, CMP), multi-threaded, multi-core, special purpose HW
25 Summary
– Business as usual is not an option
  – Performance at any cost is history
– Must make a Right Hand Turn (RHT)
  – Move away from frequency alone
– Future architectures and designs
  – More memory (larger caches)
  – Multi-threading
  – Multi-processing
  – Special purpose hardware
  – Valued performance with higher integration