1 Thousand Core Chips: A Technology Perspective. Shekhar Borkar, Intel Corp. June 7, 2007
2 Outline. Technology outlook; Evolution of multi-core: thousands of cores?; How do you feed thousands of cores?; Future challenges: variations and reliability; Resiliency; Summary
3 Technology Outlook. Table rows: High Volume Manufacturing; Technology Node (nm); Integration Capacity (BT); Delay = CV/I scaling: 0.7, ~0.7, >0.7 (delay scaling will slow down); Energy/Logic Op scaling: >0.35, >0.5, >0.5 (energy scaling will slow down); Bulk Planar CMOS: High Probability, later Low Probability; Alternate, 3G etc: Low Probability, later High Probability; Variability: Medium, High, Very High; ILD (K): ~3, <3, reduce slowly towards; RC Delay; Metal Layers: to 1 layer per generation
4 Terascale Integration Capacity. Total transistors, 300 mm² die: ~1.5B logic transistors, ~100 MB cache. 100+B transistor integration capacity.
5 Scaling Projections (300 mm² die). Frequency scaling will slow down. Vdd scaling will slow down. Power will be too high.
6 Why Multi-core? Performance. Ever-increasing single cores yield diminishing performance within a power envelope. Multi-cores provide the potential for near-linear performance speedup.
7 Why Dual-core? Power. Rule of thumb, in the same process technology: a 1% change in voltage gives a 1% change in frequency, a 3% change in power, and a 0.66% change in performance. Single core with cache: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1. Two cores with cache: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8.
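The dual-core numbers on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the rule of thumb can be extrapolated linearly to a 15% voltage reduction and that the workload scales perfectly across both cores. The function names are hypothetical, not from the talk.

```python
# Rule of thumb from the slide: a 1% voltage change moves power by 3%
# and performance by 0.66% (same process technology).
POWER_PCT_PER_VOLT_PCT = 3.0
PERF_PCT_PER_VOLT_PCT = 0.66


def scaled_core(voltage_change_pct):
    """Return (power, perf) of one core relative to the baseline core.

    Assumption: the rule of thumb is applied linearly, which is only
    an approximation for changes as large as 15%.
    """
    power = 1.0 + POWER_PCT_PER_VOLT_PCT * voltage_change_pct / 100.0
    perf = 1.0 + PERF_PCT_PER_VOLT_PCT * voltage_change_pct / 100.0
    return power, perf


def multi_core(n_cores, voltage_change_pct):
    """Total (power, perf) of n identical cores, assuming ideal parallel scaling."""
    power, perf = scaled_core(voltage_change_pct)
    return n_cores * power, n_cores * perf


dual_power, dual_perf = multi_core(2, -15.0)
print(f"dual-core power = {dual_power:.2f}, perf = {dual_perf:.2f}")
```

Under these assumptions the two slowed-down cores land at roughly the baseline power and about 1.8 times the baseline performance, in line with the slide's Power = 1 and Perf = ~1.8.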
8 From Dual to Multi-core. Diagram: one large core with cache, and four small cores (C1 to C4) with cache. Small core relative to the large core: Power = 1/4, Performance = 1/2. Multi-core: power efficient, better power and thermal management.
9 Future Multi-core Platform. General-purpose (GP) cores, special-purpose (SP) hardware, and an array of cores (C) connected by an interconnect fabric. Heterogeneous multi-core platform (SoC).
10 Fine Grain Power Management. Cores with critical tasks: Freq = f at Vdd, TPT = 1, Power = 1. Cores shut down: TPT = 0, Power = 0. Non-critical cores: Freq = f/2 at 0.7 x Vdd, TPT = 0.5, Power =
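The power figure for the non-critical cores did not survive in this transcript. As a rough estimate only, and not the speaker's number, the sketch below assumes the textbook dynamic-power relation (power proportional to V² times f, ignoring leakage) and then totals throughput and power for a hypothetical mix of core states.

```python
def relative_dynamic_power(v_scale, f_scale):
    """Dynamic power relative to a core at full Vdd and full frequency.

    Assumption: P is proportional to C * V^2 * f with leakage ignored.
    """
    return v_scale ** 2 * f_scale


def chip_totals(n_critical, n_noncritical, n_off):
    """Return (throughput, power) in units of one full-speed core.

    Throughput values per state (1, 0.5, 0) are taken from the slide;
    the non-critical power is the estimate computed above.
    """
    noncritical_power = relative_dynamic_power(0.7, 0.5)
    throughput = n_critical * 1.0 + n_noncritical * 0.5 + n_off * 0.0
    power = n_critical * 1.0 + n_noncritical * noncritical_power + n_off * 0.0
    return throughput, power


# Hypothetical 16-core mix: 4 critical, 8 non-critical, 4 shut down.
tpt, pwr = chip_totals(4, 8, 4)
print(f"non-critical core power ~ {relative_dynamic_power(0.7, 0.5):.3f}")
print(f"throughput = {tpt:.1f}, power = {pwr:.2f}")
```

With this model a half-speed core at 0.7 x Vdd draws about a quarter of the power while delivering half the throughput, which is the motivation for managing voltage and frequency per core.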
11 Performance Scaling. Amdahl’s Law: Parallel Speedup = 1/(Serial% + (1 - Serial%)/N). Serial% = 6.7%: N = 16, N_1/2 = 8 (16 cores, Perf = 8). Serial% = 20%: N = 6, N_1/2 = 3 (6 cores, Perf = 3). Parallel software is key to multi-core success.
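Both examples on this slide follow directly from the formula. The sketch below evaluates Amdahl's Law for the two cases. The helper `half_efficiency_cores` is one possible reading of the slide's N and N_1/2 (the core count at which speedup equals half the number of cores), not something the slide states outright.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Parallel speedup = 1 / (serial + (1 - serial) / N)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


def half_efficiency_cores(serial_fraction):
    """Core count N at which speedup equals N/2.

    Setting 1 / (s + (1 - s) / N) = N / 2 and solving gives N = 1 + 1/s.
    This interpretation of the slide's N is an assumption.
    """
    return 1.0 + 1.0 / serial_fraction


for serial, cores in [(0.067, 16), (0.20, 6)]:
    speedup = amdahl_speedup(serial, cores)
    print(f"serial = {serial:.1%}, N = {cores}: speedup = {speedup:.2f}, "
          f"half-efficiency N = {half_efficiency_cores(serial):.1f}")
```

In both cases half of the added cores are effectively lost to the serial portion, which is why the slide stresses parallel software.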
12 From Multi to Many… 13 mm, 100 W, 48 MB cache, 4B transistors, in 22 nm. Options: 12 cores, 48 cores, 144 cores.
13 From Many to Too Many… 13 mm, 100 W, 96 MB cache, 8B transistors, in 16 nm. Options: 24 cores, 96 cores, 288 cores.
14 On-Die Network Power (300 mm² die). A careful balance of: 1. Throughput performance; 2. Single-thread performance (core size); 3. Core and network power.
15 Observations. Scaling multi-core demands more parallelism every generation: thread level, task level, application level. Many (or too many) cores do not always mean the highest performance, the highest MIPS/Watt, or the lowest power. If on-die network power is significant, then power is even worse. Now software, too, must follow Moore’s Law.
16 Memory BW Gap. Busses have become wider to deliver the necessary memory BW (10 to 30 GB/sec), yet memory BW is not enough. A many-core system will demand 100 GB/sec of memory BW. How do you feed the beast?
17 IO Pins and Power. State of the art: 100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec. At 25 mW/Gb/sec, that is 25 Watts. Bus width = 1,000/5 = 200 signals (the divisor implies 5 Gb/sec per signal), about 400 pins (differential). Too many signal pins, too much power.
18 Solution. High-speed busses (chip-to-chip bus > 5 mm): busses are transmission lines with L-R-C effects; they need signal termination; signal processing consumes power. Solutions (chip-to-chip distance < 2 mm): reduce distance to << 5 mm (R-C bus); reduce signaling speed (~1 Gb/sec); increase pins to deliver BW; 1-2 mW/Gbps. Result: 100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec; at 2 mW/Gb/sec, that is 2 Watts; bus width = 1,000/1 = 1,000 pins.
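The pin and power arithmetic for the long, fast bus and the short, slow bus can be compared side by side. The sketch below is a minimal calculator under stated assumptions: the 5 Gb/sec and 1 Gb/sec per-signal rates are read off the slides' divisors, and the short bus is treated as single-ended because the slide counts 1,000 pins for 1,000 signals. The function name `io_budget` is hypothetical.

```python
import math


def io_budget(bandwidth_gbps, energy_mw_per_gbps, signal_rate_gbps, differential):
    """Return (power_watts, signals, pins) for an off-chip memory bus.

    bandwidth_gbps      total bandwidth in Gb/sec
    energy_mw_per_gbps  signaling cost in mW per Gb/sec
    signal_rate_gbps    data rate of one signal in Gb/sec (assumed)
    differential        True if each signal needs two pins
    """
    power_watts = bandwidth_gbps * energy_mw_per_gbps / 1000.0
    signals = math.ceil(bandwidth_gbps / signal_rate_gbps)
    pins = signals * (2 if differential else 1)
    return power_watts, signals, pins


# The slides round 100 GB/sec (800 Gb/sec) up to 1,000 Gb/sec.
long_bus = io_budget(1000, 25, 5, differential=True)
short_bus = io_budget(1000, 2, 1, differential=False)

print("long bus  (> 5 mm): %.0f W, %d signals, %d pins" % long_bus)
print("short bus (< 2 mm): %.0f W, %d signals, %d pins" % short_bus)
```

The short bus trades more pins for far lower power, which only works if the memory sits close enough to the processor for that many connections to be practical.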
19 Anatomy of a Silicon Chip. Diagram labels: Package, Si Chip, Heat-sink, Heat, Power, Signals.
20 System in a Package. Diagram labels: Package, Si Chip. Limited pins: 10 mm / 50 micron = 200 pins. Signal distance is large (~10 mm), so power is higher. Complex package.
21 DRAM on Top. Diagram labels: Package, CPU, DRAM, Heat-sink. Temp = 85°C; Junction Temp = 100+°C. High temperature and hot spots: not good for DRAM.
22 DRAM at the Bottom. Diagram labels: Package, DRAM, CPU, Heat-sink. Power and IO signals go through the DRAM to the CPU. Thin DRAM die. Through-DRAM vias. The most promising solution to feed the beast.
23 Reliability. Soft error FIT/chip (logic & mem). Time-dependent device degradation. Burn-in may phase out…? Extreme device variations. Wider
24 Implications to Reliability. Extreme variations (static & dynamic) will result in unreliable components. It will be impossible to design reliable systems as we know them today: transient errors (soft errors), gradual errors (variations), time-dependent errors (degradation). Reliable systems with unreliable components: resilient architectures.
25 Implications to Test. One-time factory testing will be out. Burn-in to catch chip infant mortality will not be practical. Test HW will be part of the design. Dynamically self-test, detect errors, reconfigure, & adapt.
26 In a Nutshell… 100 billion transistors: 100 BT integration capacity. Billions unusable (variations). Some will fail over time. Intermittent failures. Yet, deliver high performance in the power & cost envelope.
27 Resiliency with Many-Core. Dynamic on-chip testing. Performance profiling. Cores in reserve (spares). Binning strategy. Dynamic, fine-grain performance and power management. Coarse-grain redundancy checking. Dynamic error detection & reconfiguration. Decommission aging cores, swap with spares. Dynamically: 1. Self-test & detect; 2. Isolate errors; 3. Confine; 4. Reconfigure; and 5. Adapt.
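The detect, isolate, reconfigure, and adapt loop can be illustrated with a toy software model. This is a sketch under heavy assumptions, not a description of any real hardware mechanism: the `Core` and `ManyCoreChip` classes are hypothetical, and a core's `healthy` flag stands in for the outcome of an on-chip self-test.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    core_id: int
    healthy: bool = True  # stand-in for the latest self-test result


@dataclass
class ManyCoreChip:
    active: list
    spares: list
    retired: list = field(default_factory=list)

    def self_test_and_detect(self):
        """Step 1: return the active cores that failed their self-test."""
        return [core for core in self.active if not core.healthy]

    def reconfigure(self):
        """Steps 2-4: isolate failing cores and swap in spares if any remain.

        Returns the number of active cores so the caller can adapt
        (step 5), for example by rescheduling work onto fewer cores.
        """
        for bad in self.self_test_and_detect():
            self.active.remove(bad)    # isolate and confine the failure
            self.retired.append(bad)   # decommission the aging core
            if self.spares:
                self.active.append(self.spares.pop(0))  # swap with a spare
        return len(self.active)


chip = ManyCoreChip(active=[Core(i) for i in range(4)], spares=[Core(4)])
chip.active[1].healthy = False  # simulated failures
chip.active[3].healthy = False
capacity = chip.reconfigure()
print("active:", [c.core_id for c in chip.active], "capacity:", capacity)
```

With two failures and one spare, the model keeps running on three cores, which mirrors the slide's point that the chip should degrade gracefully when cores wear out.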
28 Summary. Moore’s Law with terascale integration capacity will allow integration of thousands of cores. Power continues to be the challenge. On-die network power could be significant. Optimize for power with the size of the core and the number of cores. 3D memory technology is needed to feed the beast. Many-cores will deliver the highest performance in the power envelope, with resiliency.