Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects
  Nikos Hardavellas, Parallel Architecture Group, Northwestern University
  Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik
Chip Power Scaling: chip power does not scale [Azizi 2010]. © Hardavellas
Voltage Scaling Has Slowed: in the last decade, 13x more transistors but only 30% lower voltage. Cannot run all transistors fast enough.
Pin Bandwidth Scaling [TU Berlin]: cannot feed cores with data fast enough to keep them busy.
Data Scaling
  SPEC and TPC dataset growth: faster than Moore's Law
  Same trends in scientific and personal computing
  Large Hadron Collider, March '11: 1.6 PB of data (Tier-1)
  Large Synoptic Survey Telescope: 30 TB/night, 2x Sloan Digital Sky Surveys per day
  Sloan: more data than the entire history of astronomy before it
  More data → more computing power needed to process it
Galaxy: Optically-Connected Disintegrated Processors [Pan, WINDS 2010]
  Physical constraints limit single-chip designs: area, yield, power, bandwidth
  Multi-chip designs break free of these limitations: processor disintegration, macro-chip integration
Outline
  Introduction
  Background
  Galaxy Architecture
  Experimental Methodology
  Results
  Sensitivity Studies
  Single-Chip Comparisons (Processor Disintegration)
  Multi-Chip Comparisons (Macrochip Integration)
  Thermal Modeling
  Conclude
  Overview of Other Research
Nanophotonic Components
  Figure labels: off-chip laser source, coupler, resonant modulators, resonant detectors, Ge-doped, waveguide
  Selective: couple optical energy of a specific wavelength
Modulation and Detection: wavelengths DWDM, μm waveguide pitch, 10 Gbps per link, 8 Tbps/mm bandwidth density or more.
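The bandwidth-density figure is the product of wavelengths per waveguide, data rate per wavelength, and waveguides per millimetre of chip edge. The sketch below shows only that arithmetic. The function name is hypothetical, it assumes each wavelength carries one 10 Gbps link, and the example inputs (64 wavelengths, 80 μm pitch) are illustrative placeholders rather than the values shown on the slide.

```python
def bandwidth_density_tbps_per_mm(wavelengths: int,
                                  gbps_per_wavelength: float,
                                  pitch_um: float) -> float:
    """Aggregate bandwidth per millimetre of chip edge, in Tbps/mm."""
    waveguides_per_mm = 1000.0 / pitch_um                  # 1 mm = 1000 um
    gbps_per_waveguide = wavelengths * gbps_per_wavelength
    return gbps_per_waveguide * waveguides_per_mm / 1000.0  # Gbps -> Tbps


# Hypothetical inputs for illustration only (not taken from the slide).
density = bandwidth_density_tbps_per_mm(64, 10.0, 80.0)
print(f"{density:.1f} Tbps/mm")
```

Halving the pitch or doubling the wavelength count doubles the density, which is why DWDM and tight waveguide or fiber pitch are the two levers the slide highlights.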
Galaxy Architecture
Routing Example
Galaxy Architecture
Galaxy MWSR Optical Crossbar: MWSR (multiple-writer, single-reader) is more energy-efficient than SWMR (single-writer, multiple-reader) at this scale. MWSR avoids a broadcast bus, but requires arbitration.
Token-Based Arbitration: 8 cycles on average for token arbitration (5 chiplets).
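In an MWSR crossbar, several writers share each reader's channel, so a writer must acquire that channel's token before modulating. The sketch below is a simplified, uncontended model of a token circulating around a ring of chiplets. All function names and the hop latency are hypothetical, and the model is not the one behind the slide's 8-cycle average.

```python
def token_wait_cycles(token_pos: int, requester: int,
                      num_nodes: int, hop_cycles: int) -> int:
    """Cycles until a free token, moving one node per hop, reaches the requester."""
    hops = (requester - token_pos) % num_nodes
    return hops * hop_cycles


def average_uncontended_wait(num_nodes: int, hop_cycles: int) -> float:
    """Mean wait when the token is equally likely to be at any node."""
    waits = [token_wait_cycles(pos, 0, num_nodes, hop_cycles)
             for pos in range(num_nodes)]
    return sum(waits) / num_nodes


def grant_order(requesters, token_pos: int, num_nodes: int) -> list:
    """Order in which competing writers acquire the token for one reader channel."""
    return sorted(requesters, key=lambda node: (node - token_pos) % num_nodes)


# Hypothetical 5-chiplet ring with a 1-cycle hop (illustrative values).
print(average_uncontended_wait(5, 1))
print(grant_order({1, 3, 4}, token_pos=2, num_nodes=5))
```

The ring order makes the arbitration fair without a central arbiter: the writer closest downstream of the token is always served first.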
Dense Off-Chip Coupling: dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]; ~3.8 dB loss, 8 Tbps/mm demonstrated. Misalignment loss <1 dB. Loss comparable to optical proximity couplers.
Nanophotonic Parameters
Architectural Parameters
Modeling Infrastructure: 3D-stack model; SimFlex sampling, 95% confidence; photonic-layer ring heating.
Load-Latency Curves: tokens provide optimal buffer depth.
Laser Power Sensitivity to Optical Parameters
  Panels: coupler loss, off-ring loss, waveguide and filter drop loss, modulator insertion loss
  Laser power is highly sensitive to coupler loss and insensitive to the other losses
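Optical losses in dB add along a path, and the laser must supply the detector's minimum power multiplied by 10^(loss/10). The sketch below shows that relationship with a loss budget in which couplers are traversed several times. Every number, the traversal counts, and the function names are hypothetical. Repeated coupler traversal is offered only as one plausible reason for the sensitivity the slide reports.

```python
def path_loss_db(components: dict) -> float:
    """Sum losses along one optical path; each value is (loss_dB, times_traversed)."""
    return sum(loss_db * count for loss_db, count in components.values())


def laser_power_mw(detector_sensitivity_mw: float, loss_db: float) -> float:
    """Laser power per wavelength so the detector still receives its minimum."""
    return detector_sensitivity_mw * 10 ** (loss_db / 10.0)


# Hypothetical loss budget (dB, traversals); none of these values are from the slides.
budget = {
    "coupler": (1.0, 3),
    "modulator_insertion": (0.5, 1),
    "waveguide": (1.0, 1),
}
base = laser_power_mw(0.01, path_loss_db(budget))
budget["coupler"] = (2.0, 3)   # 1 dB worse per coupler crossing
worse = laser_power_mw(0.01, path_loss_db(budget))
print(f"laser power grows {worse / base:.2f}x")
```

Because the relationship is exponential in dB, a component crossed three times contributes three times its per-crossing loss to the exponent, while a component crossed once contributes only once.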
Sensitivity to Fiber Density: 116 mm² chiplets, 43 mm along the chip edge. Enough room for μm pitch. fibers: within 3% of max performance.
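The 43 mm figure is consistent with the perimeter of a square 116 mm² die, which is an assumption here since the slide does not state the die shape. The sketch below checks that arithmetic and counts how many fibers fit along the edge at a given pitch. The 250 μm pitch and the function names are hypothetical placeholders, not the slide's values.

```python
import math


def square_die_perimeter_mm(area_mm2: float) -> float:
    """Perimeter of a square die with the given area."""
    return 4.0 * math.sqrt(area_mm2)


def fibers_along_edge(edge_mm: float, pitch_um: float) -> int:
    """Number of fibers that fit along an edge at a fixed centre-to-centre pitch."""
    return int(edge_mm * 1000.0 // pitch_um)


perimeter = square_die_perimeter_mm(116.0)
print(f"perimeter = {perimeter:.1f} mm")
# 250 um is a hypothetical pitch chosen for illustration only.
print(fibers_along_edge(perimeter, 250.0), "fibers")
```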
Performance Against Unlimited Designs
  Unlimited power (max speed of design, irrespective of temperature)
  Mesh_20MC & Corona_20MC: also unlimited bandwidth (20 MCs per chip, 5x more pins)
  Galaxy matches the performance of unlimited designs
Performance Against Realistic Designs
  Realistic: within power and bandwidth envelopes
  Galaxy chiplets stay within 66.2 °C, so chiplets run at max speed
  Galaxy: 2.2x speedup on average (3.4x max)
Energy-Delay Product: cool chiplets minimize leakage. Galaxy: 2.4x-2.8x smaller EDP on average (6.8x max).
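The energy-delay product multiplies the energy a run consumes by its execution time, so both a speedup and an energy saving shrink it. The sketch below shows the metric itself. The function names and workload numbers are hypothetical and do not reproduce the measurements behind the slide.

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-delay product in joule-seconds."""
    return energy_j * delay_s


def edp_reduction(base_energy_j: float, base_delay_s: float,
                  new_energy_j: float, new_delay_s: float) -> float:
    """How many times smaller the new design's EDP is than the baseline's."""
    return edp(base_energy_j, base_delay_s) / edp(new_energy_j, new_delay_s)


# Hypothetical workload: the new design is 2x faster and uses 10% less energy.
print(f"{edp_reduction(100.0, 10.0, 90.0, 5.0):.2f}x smaller EDP")
```

A design that is faster at equal energy already improves EDP by its speedup factor. Lower leakage from cooler chiplets reduces the energy term on top of that.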
Comparison Against Multi-Chip Alternatives: fiber-based Galaxy achieves a 2.5x speedup over Oracle Macrochip (6.8x max).
Tapered vs. Optical Proximity Couplers: 6x less laser power than Oracle Macrochip with demonstrated couplers.
80-core 5-chiplet Galaxy, Thermal CFD Modeling: 8 cm spacing allows cooling with cheap passive heatsinks.
9-chiplet Dense Array (Oracle Macrochip): tight arrangement points to a liquid cooling requirement.
9-chiplet Galaxy 2D: cooling 9 chiplets with passive heatsinks.
9-chiplet Galaxy 3D: flexible fibers allow the virtual chip to break free of 2D planar designs.
Galaxy Summary
  Virtual chips with the performance of unlimited designs
  Breaks free of typical physical constraints: large aggregate area; improved yield (break-even point: 60% yield for photonics); Tb/s/mm bandwidth density; pushes back the power wall
  Processor disintegration: 2.2x avg. speedup (3.4x max); 2.4x-2.8x smaller avg. EDP (6.8x max)
  Macrochip integration: 2.5x speedup over Oracle Macrochip (6.8x max); 6x more power-efficient links
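The yield argument for disintegration can be illustrated with the textbook Poisson defect model, in which die yield falls exponentially with area. This model is swapped in for illustration. The slides do not say which yield model the authors used, and the defect density, function names, and the omission of photonic-layer and assembly yield are all assumptions.

```python
import math


def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-area_mm2 * defects_per_mm2)


def silicon_per_good_system_mm2(die_area_mm2: float, dies_per_system: int,
                                defects_per_mm2: float) -> float:
    """Expected silicon fabricated per working system if bad dies are discarded before assembly."""
    good_fraction = poisson_yield(die_area_mm2, defects_per_mm2)
    return dies_per_system * die_area_mm2 / good_fraction


D0 = 0.002  # hypothetical defect density, defects per mm^2
monolithic = silicon_per_good_system_mm2(580.0, 1, D0)   # one 580 mm^2 die
chiplets = silicon_per_good_system_mm2(116.0, 5, D0)     # five 116 mm^2 chiplets
print(f"monolithic: {monolithic:.0f} mm^2, chiplets: {chiplets:.0f} mm^2")
```

Under these assumptions the same aggregate area costs far less fabricated silicon when split into chiplets, because a defect discards only one small die instead of the whole chip.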
Energy is Shaping the IT Industry
  #1 of Grand Challenges for Humanity in the Next 50 Years [Smalley Institute for Nanoscale Research and Technology, Rice U.]
  Computing worldwide: ~408 TWh in 2010 [Gartner]
  Datacenter energy consumption in US: ~150 TWh in 2011 [EPA]; 3.8% of domestic power generation, $15B
  CO2-equiv. emissions ≈ airline industry (2%)
  Carbon footprint of the world's data centers ≈ Czech Republic
  20 MW: 200x lower energy/instr. (2 nJ → 10 pJ)
  3% of the output of an average nuclear plant!
  10% annual growth in installed computers worldwide [Gartner]
  Exponential increase in energy consumption
Overall Focus: Energy-Efficient Computing
  Where does energy go? Integer add: 0.5 pJ; FP-FMA: 50 pJ.
  Data movement: 1200 pJ across a 400 mm² chip, 16000 pJ memory
    Elastic caches: minimize data transfers by adapting caches to workload demands [ISCA09, IEEEMicro10, DATE12]
  Processing: ~1500 pJ to schedule the operation
    SeaFire: specialized computing on dark silicon to eliminate general-purpose computing's overheads [IEEEMicro11, USENIX-Login11]
  Circuits: wide voltage guardbands
    Low voltages, process variation → timing errors → computing errors
    Elastic fidelity: allow errors at select code/data segments to save energy while maintaining a fidelity contract with the user [CoRR abs/ ]
  Chips are fundamentally limited by physical constraints. Need to break free.
    Galaxy: processor disintegration/macrochip integration using photonic interconnects [WINDS10]
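The per-operation energies on this slide can be put side by side to show how far data movement and scheduling outweigh the arithmetic itself. The sketch below uses only the slide's figures. The dictionary keys, the function name, and the choice of FP-FMA as the baseline are illustrative.

```python
# Energy per event in picojoules, as quoted on the slide.
ENERGY_PJ = {
    "integer add": 0.5,
    "FP-FMA": 50.0,
    "data movement across 400 mm^2 chip": 1200.0,
    "schedule the operation": 1500.0,
    "memory": 16000.0,
}


def overhead_vs(event: str, baseline: str = "FP-FMA") -> float:
    """Energy of an event expressed as a multiple of the baseline operation."""
    return ENERGY_PJ[event] / ENERGY_PJ[baseline]


for event, pj in ENERGY_PJ.items():
    print(f"{event:36s} {pj:>9.1f} pJ  {overhead_vs(event):>7.2f}x FP-FMA")
```

Moving an operand across the chip costs 24 FP-FMAs, scheduling costs 30, and a memory access costs 320. That ranking is what motivates attacking data movement and general-purpose overheads before the arithmetic units.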
Thank You!
Overcoming Data Movement and Processing Overheads
  Elastic caches: adapt the cache to workload demands
    Significant energy is spent on data movement and coherence requests
    Co-locate data, metadata, and computation
    Decouple address from placement location
    Capitalize on existing OS events to simplify hardware
    Cut on-chip interconnect traffic by half
  SeaFire: specialized computing on dark silicon
    Repurpose dark silicon to implement specialized cores
    Application cherry-picks a few cores; the rest of the chip is powered off
    Vast unused area → many specialized cores → likely to find good matches
    12x lower energy (conservative)
Overcoming Voltage Guardbands
  Elastic fidelity: selectively trade accuracy for energy
    We don't always need 100% accuracy, but hardware always provides it
    Language constructs specify the required fidelity for code/data segments
    Steer computation to execution/storage units with appropriate fidelity and lower voltage
    35% lower energy
  Figure panels: no errors; 10% errors