Beyond Multi-core: The Dawning of the Era of Tera
Intel 80-core Tera-scale Research Processor
Jim Held, Intel Fellow & Director, Tera-scale Computing Research, Intel Corporation
Agenda
- Tera-scale Computing
  - Motivation
  - Platform Vision
- Teraflops Research Processor
  - Key Ingredients
  - Power Management
  - Performance
  - Programming
  - Key Learnings
- Work in Progress
- Summary
Emerging Applications will demand Tera-scale performance
[Chart: compute demand (GFLOPs, log scale from 0.1 to 10,000) vs. memory bandwidth (GB/s) for emerging workloads]
- Computer vision: video surveillance with 2 webcams; body tracking with 4 cameras
- Ray tracing: Beetle scene and Bar scene, 1M pixels
- Physical simulation: CFD at 150x100x100, 30 fps and at 75x50x50, 10 fps
- Financial analytics: ALM, 6 assets, 7 branches in 1 min; 6 assets, 10 branches in 1 sec
Courtesy of Prof. Ron Fedkiw, Stanford University
A Tera-scale Platform Vision
[Platform block diagram]
- Special-purpose engines
- Integrated I/O devices
- Last-level cache
- Scalable on-die interconnect fabric
- Integrated memory controllers
- High-bandwidth off-die I/O interconnect
- Socket interconnect
Tera-scale Computing Research
- Applications – identify, characterize & optimize
- Programming – empower the mainstream
- System software – scalable services
- Memory hierarchy – feed the compute engine
- On-die interconnect – high bandwidth, low latency
- Cores – power-efficient, general & special function
Teraflops Research Processor
Goals:
- Deliver tera-scale performance
  - Single-precision TFLOP at desktop power
  - Frequency target: 5 GHz
  - Bisection bandwidth on the order of terabits/s
  - Link bandwidth in hundreds of GB/s
- Prototype two key technologies
  - On-die interconnect fabric
  - 3D stacked memory
- Develop a scalable design methodology
  - Tiled design approach
  - Mesochronous clocking
  - Power-aware capability
[Die photo: 21.72 mm x 12.64 mm die with PLL, TAP, and I/O areas; a single tile is 1.5 mm x 2.0 mm]
Key Ingredients
- Special-purpose cores
  - High-performance dual FPMACs
- 2D mesh interconnect
  - High-bandwidth, low-latency crossbar router (39-bit, 40 GB/s links)
  - Phase-tolerant tile-to-tile communication through mesochronous interfaces (MSINT)
- Mesochronous clocking
  - Modular & scalable
  - Lower power
- Workload-aware power management
  - Sleep instructions and packets
  - Chip voltage & frequency control
[Tile block diagram: processing engine (PE) with two FPMACs (multiply, accumulate, normalize), a 6-read/4-write 32-entry register file, 3KB instruction memory (IMEM), 2KB data memory (DMEM), and a router interface block (RIB), linked via MSINTs to the crossbar router]
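The headline performance follows directly from these ingredients: each FPMAC can retire one multiply-accumulate (2 FLOPs) per cycle. A minimal C sketch of the arithmetic, assuming only the tile counts above and the 3.16 GHz operating point from the results slide:

```c
#include <stdio.h>

/* Peak single-precision FLOPS from the tile parameters above.
 * Assumption: each FPMAC retires one multiply-accumulate
 * (2 FLOPs) per cycle. */
int main(void) {
    const int tiles = 80;           /* 8 x 10 mesh of tiles */
    const int fpmacs_per_tile = 2;  /* dual FPMACs per PE */
    const int flops_per_mac = 2;    /* multiply + accumulate */
    const double freq_ghz = 3.16;   /* operating point from the results slide */

    double gflops = tiles * fpmacs_per_tile * flops_per_mac * freq_ghz;
    printf("Peak: %.0f GFLOPS at %.2f GHz\n", gflops, freq_ghz);
    /* 80 * 2 * 2 * 3.16 = ~1011 GFLOPS, i.e. just over 1 TFLOP */
    return 0;
}
```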
Fine-Grain Power Management
- 21 sleep regions per tile (not all shown)
- FP engines sleeping: 90% less power
- Router: 10% less power (stays on to pass traffic)
- Data memory sleeping: 57% less power
- Instruction memory sleeping: 56% less power
- Dynamic sleep
  - STANDBY: memory retains data, 50% less power per tile
  - FULL SLEEP: memories fully off, 80% less power per tile
Scalable power to match workload demands
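To see how chip power scales with workload, here is a minimal C sketch of the two dynamic sleep states; the percentage reductions come from this slide, while the active per-tile baseline is a hypothetical figure chosen purely for illustration:

```c
#include <stdio.h>

/* Illustrative model of per-tile power across the sleep states above.
 * Only the percentage reductions are from the slide; ACTIVE_W is a
 * hypothetical baseline chosen purely for illustration. */
typedef enum { ACTIVE, STANDBY, FULL_SLEEP } tile_state;

#define ACTIVE_W 2.0  /* hypothetical active power per tile, in watts */

static double tile_power_w(tile_state s) {
    switch (s) {
    case STANDBY:    return ACTIVE_W * 0.50;  /* memories retain data */
    case FULL_SLEEP: return ACTIVE_W * 0.20;  /* memories fully off   */
    default:         return ACTIVE_W;         /* tile fully awake     */
    }
}

int main(void) {
    /* Chip power scales with how many of the 80 tiles are awake. */
    for (int awake = 0; awake <= 80; awake += 20) {
        double chip_w = awake * tile_power_w(ACTIVE)
                      + (80 - awake) * tile_power_w(FULL_SLEEP);
        printf("%2d tiles awake: %5.1f W\n", awake, chip_w);
    }
    return 0;
}
```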
Leakage Savings
- Dynamic sleep: total measured idle power 13 W (~7 W sleeping)
- Regulated sleep for memory arrays with state retention
- Memory clamping
[Block diagram: estimated per-block leakage breakdown @ 1.2V, 110°C for the MSINTs, crossbar router, DMEM, IMEM, register file, RIB, and FPMACs, each annotated with awake vs. sleep leakage]
2X-5X leakage power reduction
Router Power Management
- Activity-based power management
- Individual port enables
- Queues put to sleep and clock-gated when a port is idle
- 7X power reduction for idle routers (from 924 mW)
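Conceptually, the per-port policy reduces to the check below; this is a C sketch of the idea, not the actual RTL, and all the field names are illustrative:

```c
#include <stdbool.h>
#include <stdio.h>

/* Conceptual sketch of activity-based router port gating. Each of the
 * router's ports is enabled individually; when a port's queue is empty
 * and no flit is arriving, the queue sleeps and its clock is gated. */
struct port {
    bool enabled;        /* individual port enable              */
    int  queue_depth;    /* flits currently buffered            */
    bool incoming_flit;  /* valid signal from neighboring tile  */
    bool clock_on;       /* result of the gating decision       */
    bool queue_asleep;   /* sleep transistors engaged on FIFO   */
};

void update_port_gating(struct port *p) {
    bool idle = (p->queue_depth == 0) && !p->incoming_flit;
    p->clock_on     = p->enabled && !idle;   /* gate clock when idle */
    p->queue_asleep = !p->enabled || idle;   /* put queue to sleep   */
}

int main(void) {
    struct port east = { .enabled = true, .queue_depth = 0,
                         .incoming_flit = false };
    update_port_gating(&east);
    printf("clock %s, queue %s\n", east.clock_on ? "on" : "gated",
           east.queue_asleep ? "asleep" : "awake");
    return 0;
}
```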
Power Performance Results (80°C, N=80)
[Frequency vs. Vcc: 0.32 TFLOP @ 1 GHz, 1 TFLOP @ 3.16 GHz, 1.63 TFLOP @ 5.1 GHz, 1.81 TFLOP @ 5.67 GHz]
[Measured power vs. Vcc: 1 TFLOP @ 97 W; 1.33 TFLOP @ 230 W; active/leakage power splits of 78 W / 15.6 W and 152 W / 26 W at intermediate points]
[Average power efficiency: 19.4 GFLOPS/W @ 394 GFLOPS, 10.5 GFLOPS/W @ ~1 TFLOP, 5.8 GFLOPS/W @ 1.33 TFLOP]
[Leakage as % of total power vs. Vcc, sleep disabled (all tiles awake) vs. sleep enabled (all tiles asleep): up to 2X reduction]
Stencil workload: 1 TFLOP @ 97 W, 1.07 V
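The efficiency points are consistent with simply dividing delivered GFLOPS by measured power; as a worked check, using only the numbers on this slide:

\[
\frac{1000\ \text{GFLOPS}}{97\ \text{W}} \approx 10.3\ \text{GFLOPS/W},
\qquad
\frac{1330\ \text{GFLOPS}}{230\ \text{W}} \approx 5.8\ \text{GFLOPS/W},
\]

and the peak-efficiency point, 394 GFLOPS at 19.4 GFLOPS/W, implies roughly \(394 / 19.4 \approx 20\ \text{W}\) of total power at that low-voltage operating point.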
Programming Results
- Not designed as a general software development vehicle
  - Small memory, ISA limitations, limited data ports
- Four kernels hand-coded to explore delivered performance (a generic sketch of the stencil follows below):
  - Stencil: 2D heat-diffusion equation
  - SGEMM on 100x100 matrices
  - Spreadsheet computing weighted sums
  - 64-point 2D FFT (on 64 tiles)
- Demonstrated the utility and high scalability of message-passing programming models on many-core
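For a flavor of what the Stencil kernel computes, here is a generic 5-point heat-diffusion update in plain C; this is not the hand-coded per-tile assembly, and the grid size and diffusion constant are arbitrary choices for illustration:

```c
/* Generic 5-point stencil for the 2D heat-diffusion equation, in the
 * spirit of the Stencil kernel above. Plain C for illustration, not
 * the hand-coded tile assembly; N and ALPHA are arbitrary. */
#define N 100
#define ALPHA 0.25f  /* diffusion coefficient x dt / h^2 */

void heat_step(float u[N][N], float v[N][N]) {
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            /* explicit Euler update: v = u + alpha * laplacian(u) */
            v[i][j] = u[i][j] + ALPHA * (u[i-1][j] + u[i+1][j]
                                       + u[i][j-1] + u[i][j+1]
                                       - 4.0f * u[i][j]);
        }
    }
}

int main(void) {
    static float u[N][N], v[N][N];  /* zero-initialized grids */
    u[N / 2][N / 2] = 1.0f;         /* point heat source      */
    heat_step(u, v);
    return 0;
}
```

On the research chip, the grid would be partitioned across tiles, with boundary rows exchanged as messages and the inner multiply-adds mapped onto the dual FPMACs.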
Key Learnings
- Teraflop performance is possible within a mainstream power envelope
  - Peak of 1.01 teraflops at 62 watts
  - Measured peak power efficiency of 19.4 GFLOPS/W
- The tile-based methodology fulfilled its promise
  - Design possible with half the team in half the time
  - Pre- and post-silicon debug reduced; fully functional on A0 silicon
- Fine-grained power management pays off
  - Hierarchical clock gating and sleep-transistor techniques
  - Up to 3X measured reduction in standby leakage power
  - Scalable low-power mesochronous clocking
- Excellent software performance is possible in this message-based architecture
  - Further improvements possible with additional instructions, larger memory, and wider data ports
Work in Progress: Stacked Memory Prototype
- 256 KB SRAM per core
- 4X C4 bump density
- 3,200 through-silicon vias
[Stack diagram: 80-tile processor "Polaris" atop memory die "Freya"; Cu bumps between them denser than C4 pitch, with C4 bumps and through-silicon vias down to the package]
Memory access to match the compute power
Teraflops on IA Pat Gelsinger – Intel Developer Forum 2007
Summary
- Emerging applications will demand teraflop performance
- Teraflop performance is possible within a mainstream power envelope
- Intel is developing technologies to enable Tera-scale computing
Questions
Acknowledgments Sriram Vangal, Jason Howard, Gregory Ruhl, Saurabh Dighe, Howard Wilson, James Tschanz, David Finan, Priya Iyer, Arvind Singh, Tiju Jacob, Shailendra Jain, Sriram Venkataraman, Yatin Hoskote, Nitin Borkar, Rob van der Wijngaart, Michael Frumkin and Tim Mattson