Published by Nickolas Emans. Modified over 10 years ago.
1
Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects. Nikos Hardavellas, PARAG@N – Parallel Architecture Group, Northwestern University. Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik.
2
Chip Power Scaling: Chip power does not scale [Azizi 2010]. © Hardavellas
3
Voltage Scaling Has Slowed: In the last decade, transistor counts grew 13x, but supply voltage fell by only 30%, so we cannot run all transistors fast enough.
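A back-of-envelope reading of these two numbers, as a minimal sketch: it assumes (our simplification, not the slide's) that dynamic power scales as N·C·V²·f at a fixed frequency, with `cap_scale` a hypothetical factor for how per-transistor capacitance changed. With `cap_scale = 1.0`, the same power budget keeps only a small fraction of the transistors switching at full speed.

```python
def power_growth(transistor_scale: float, voltage_scale: float, cap_scale: float = 1.0) -> float:
    """Relative dynamic chip power at fixed frequency, assuming P ~ N * C * V^2 * f.

    cap_scale is a hypothetical per-transistor capacitance scaling factor
    (1.0 means capacitance per transistor did not shrink).
    """
    return transistor_scale * cap_scale * voltage_scale ** 2


def active_fraction(transistor_scale: float, voltage_scale: float, cap_scale: float = 1.0) -> float:
    """Fraction of transistors that can switch at full speed within the old power budget."""
    return min(1.0, 1.0 / power_growth(transistor_scale, voltage_scale, cap_scale))


# Slide numbers: 13x more transistors, 30% lower voltage (voltage scale 0.7).
print(f"power growth at fixed frequency: {power_growth(13, 0.7):.2f}x")
print(f"fraction of transistors at full speed: {active_fraction(13, 0.7):.0%}")
```

Smaller `cap_scale` values soften the result, but any value well above 1/(13 · 0.7²) still leaves part of the chip unable to run at full speed.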
4
Pin Bandwidth Scaling [TU Berlin]: Pin bandwidth cannot feed the cores with data fast enough to keep them busy.
5
Data Scaling: SPEC and TPC datasets grow faster than Moore's law, and the same trends hold in scientific and personal computing. Large Hadron Collider, March '11: 1.6 PB of data (Tier-1). Large Synoptic Survey Telescope: 30 TB/night, i.e., 2x the Sloan Digital Sky Survey per day. Sloan alone produced more data than the entire history of astronomy before it. More data requires more computing power to process it.
6
Galaxy: Optically-Connected Disintegrated Processors [Pan, WINDS 2010]. Physical constraints (area, yield, power, bandwidth) limit single-chip designs. Multi-chip designs break free of these limitations through processor disintegration and macro-chip integration.
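Why splitting one large die into chiplets helps yield can be sketched with the classic Poisson die-yield model Y = exp(-A·D0). This is a textbook model swapped in for illustration, not the authors' yield analysis; the defect density `D0` is a hypothetical value, and the 116 mm² chiplet size is borrowed from the fiber-density slide later in the talk.

```python
import math


def poisson_yield(area_cm2: float, defect_density_per_cm2: float) -> float:
    """Poisson die-yield model: probability that a die of this area has zero defects."""
    return math.exp(-area_cm2 * defect_density_per_cm2)


D0 = 0.2            # hypothetical defect density, defects per cm^2
CHIPLET_CM2 = 1.16  # 116 mm^2 chiplets (size used later in the talk)
N_CHIPLETS = 5

monolithic = poisson_yield(N_CHIPLETS * CHIPLET_CM2, D0)
chiplet = poisson_yield(CHIPLET_CM2, D0)
print(f"one {N_CHIPLETS * CHIPLET_CM2:.2f} cm^2 die: {monolithic:.1%} yield")
print(f"one {CHIPLET_CM2:.2f} cm^2 chiplet: {chiplet:.1%} yield")
```

Five untested chiplets are all good with probability `chiplet ** 5`, which equals the monolithic yield; the gain comes from testing chiplets before assembly and discarding only the bad ones, so much less silicon is thrown away per working system. The sketch ignores the yield of the photonic components.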
7
Outline: Introduction, Background, Galaxy Architecture, Experimental Methodology, Results (sensitivity studies; single-chip comparisons, i.e., processor disintegration; multi-chip comparisons, i.e., macrochip integration; thermal modeling), Conclusion, Overview of Other Research.
8
Nanophotonic Components: off-chip laser source, coupler, resonant modulators, resonant detectors, Ge-doped waveguide. The resonant modulators and detectors are selective: each couples the optical energy of a specific wavelength.
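Wavelength selectivity follows from the resonance condition of a ring resonator, m·λ = n_eff·2πR: only wavelengths for which an integer number m fits around the ring couple into it. This minimal sketch assumes microring resonators (the slide says only "resonant"); the 5 µm radius and effective index of 2.5 are hypothetical values.

```python
import math
from typing import Dict


def ring_resonances_um(radius_um: float, n_eff: float, orders: range) -> Dict[int, float]:
    """Resonant wavelengths (um) of a microring from m * lambda = n_eff * 2 * pi * R."""
    optical_length = n_eff * 2 * math.pi * radius_um
    return {m: optical_length / m for m in orders}


# Hypothetical ring: 5 um radius, effective index 2.5.
for m, lam in ring_resonances_um(5.0, 2.5, range(49, 53)).items():
    print(f"order {m}: resonance at {lam:.4f} um")
```

Each ring responds to a comb of discrete wavelengths; tuning the ring shifts that comb, which is what lets one ring pick out one DWDM channel.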
9
Modulation and Detection: Dense wavelength-division multiplexing (DWDM) with 16-64 wavelengths per waveguide, a 5-20 μm waveguide pitch, and 10 Gbps per link yields a bandwidth density of 8 Tbps/mm or more.
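The 8 Tbps/mm figure can be checked with simple arithmetic, treating each 10 Gbps link as one wavelength (our assumption): the conservative ends of the slide's ranges (16 wavelengths, 20 μm pitch) give exactly 8 Tbps/mm, and the aggressive ends give far more.

```python
def bandwidth_density_tbps_per_mm(wavelengths: int, gbps_per_wavelength: float, pitch_um: float) -> float:
    """Bandwidth density of parallel DWDM waveguides, in Tbps per mm of chip edge."""
    gbps_per_waveguide = wavelengths * gbps_per_wavelength
    waveguides_per_mm = 1000.0 / pitch_um
    return gbps_per_waveguide * waveguides_per_mm / 1000.0


print(bandwidth_density_tbps_per_mm(16, 10, 20))  # 8.0: conservative end of the ranges
print(bandwidth_density_tbps_per_mm(64, 10, 5))   # 128.0: aggressive end of the ranges
```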
11
Galaxy Architecture
12
Routing Example
13
Galaxy Architecture
14
Galaxy MWSR Optical Crossbar: At this scale, a multiple-writer single-reader (MWSR) crossbar is more energy-efficient than a single-writer multiple-reader (SWMR) one. MWSR avoids a broadcast bus, but requires arbitration.
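The reason MWSR needs arbitration can be shown with a toy model: each destination chiplet owns one home channel that only it reads, and any source may write to it, so two sources targeting the same destination must be serialized. This is a hypothetical illustration; the class and method names are ours, and it does not model Galaxy's actual channel layout or timing.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MWSRCrossbar:
    """Toy multiple-writer single-reader crossbar: one home channel per destination."""
    num_chiplets: int
    # destination chiplet -> source currently writing on its home channel
    busy: Dict[int, int] = field(default_factory=dict)

    def try_send(self, src: int, dst: int) -> bool:
        """Start a transfer if dst's home channel is idle; return False if it is taken."""
        for node in (src, dst):
            if not 0 <= node < self.num_chiplets:
                raise ValueError(f"unknown chiplet {node}")
        if src == dst:
            raise ValueError("local traffic does not use the crossbar")
        if dst in self.busy:
            return False  # another writer already drives this channel
        self.busy[dst] = src
        return True

    def finish(self, dst: int) -> None:
        """Release dst's home channel at the end of a transfer."""
        self.busy.pop(dst, None)


xbar = MWSRCrossbar(num_chiplets=5)
print(xbar.try_send(0, 3))  # True: chiplet 3's channel is idle
print(xbar.try_send(1, 3))  # False: chiplet 0 is already writing to chiplet 3
print(xbar.try_send(1, 4))  # True: a different destination channel
xbar.finish(3)
print(xbar.try_send(1, 3))  # True: channel 3 was released
```

In an SWMR design the roles flip: each source owns a channel and every receiver listens to it, which removes write conflicts but turns each channel into a broadcast bus.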
15
Token-Based Arbitration: With 5 chiplets, token arbitration takes 8 cycles on average.
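A minimal sketch of token-ring arbitration, assuming one token per channel that circulates among the chiplets in one direction and is seized by a requester when it passes. The uncontended model and the `cycles_per_hop` parameter are hypothetical; the sketch does not try to reproduce the slide's 8-cycle average, which depends on token propagation latency and contention details the slides do not give.

```python
def token_wait_hops(token_pos: int, requester: int, num_chiplets: int) -> int:
    """Hops until a token circulating in one direction reaches the requester."""
    return (requester - token_pos) % num_chiplets


def average_wait_cycles(num_chiplets: int, cycles_per_hop: float) -> float:
    """Mean uncontended wait, assuming the token is equally likely to be at any chiplet."""
    hops = [token_wait_hops(pos, 0, num_chiplets) for pos in range(num_chiplets)]
    return cycles_per_hop * sum(hops) / len(hops)


# Hypothetical 1 cycle per hop; NOT the latency model behind the slide's figure.
print(average_wait_cycles(5, cycles_per_hop=1.0))
```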
16
Dense Off-Chip Coupling: A dense optical fiber array [Lee, OSA/OFC/NFOEC 2010] has demonstrated ~3.8 dB loss and 8 Tbps/mm, with misalignment loss below 1 dB. This loss is comparable to that of optical proximity couplers.
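To put these decibel figures in linear terms: a loss of L dB passes 10^(-L/10) of the optical power, so 3.8 dB leaves roughly 42% of the light. The combined line below assumes, which the slide does not state, that the misalignment loss comes on top of the 3.8 dB; losses in dB simply add.

```python
def db_to_transmission(loss_db: float) -> float:
    """Fraction of optical power that survives a loss of loss_db decibels."""
    return 10 ** (-loss_db / 10)


coupling_db = 3.8      # demonstrated fiber-array loss (slide)
misalignment_db = 1.0  # upper bound on misalignment loss (slide)
for label, loss in [("coupling", coupling_db),
                    ("misalignment", misalignment_db),
                    ("both, if additive (assumption)", coupling_db + misalignment_db)]:
    print(f"{label}: {loss:.1f} dB -> {db_to_transmission(loss):.1%} of the light survives")
```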
17
Nanophotonic Parameters
19
Architectural Parameters
20
Modeling Infrastructure: 3D-stack model; SimFlex sampling (95% confidence); photonic-layer ring heating.
22
Load-Latency Curves: 16 tokens provide the optimal buffer depth.
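One way to read "tokens as buffer depth" is credit-style flow control, sketched below under that assumption (the slide does not describe the mechanism): each token stands for one free slot in the receiver's buffer, a sender spends a token per packet and stalls when none are left, and the receiver returns a token each time it drains a packet. The class name and default depth of 16 are illustrative.

```python
from collections import deque
from typing import Optional


class TokenFlowControl:
    """Toy token (credit) flow control: each token stands for one free receiver buffer slot."""

    def __init__(self, buffer_depth: int = 16):
        self.tokens = buffer_depth  # tokens currently available to senders
        self.buffer = deque()       # packets queued at the receiver

    def send(self, packet: str) -> bool:
        """Transmit only while a token is available; otherwise the sender stalls."""
        if self.tokens == 0:
            return False
        self.tokens -= 1
        self.buffer.append(packet)
        return True

    def drain(self) -> Optional[str]:
        """Receiver consumes one packet and returns its token to the senders."""
        if not self.buffer:
            return None
        self.tokens += 1
        return self.buffer.popleft()


fc = TokenFlowControl(buffer_depth=2)
print([fc.send(p) for p in "abc"])  # [True, True, False]: the third send stalls
print(fc.drain())                   # a (frees one slot)
print(fc.send("c"))                 # True
```

Too few tokens throttle senders under load; too many cost buffer area and power, which is the trade-off a load-latency sweep exposes.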
23
Laser Power Sensitivity to Optical Parameters (coupler loss, off-ring loss, waveguide and filter drop loss, modulator insertion loss): Laser power is highly sensitive to coupler loss and insensitive to the other losses.
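A minimal laser-power budget sketch: the laser must supply the detector sensitivity plus every dB of loss along the worst-case path, so required power grows exponentially with total loss. All numbers below (sensitivity, per-instance losses, instance counts) are hypothetical and are not the talk's parameters; the sketch only shows how a sensitivity sweep over one loss term works.

```python
from typing import Dict, Tuple

Loss = Tuple[float, int]  # (loss per instance in dB, number of instances on the path)


def required_laser_power_mw(sensitivity_dbm: float, losses: Dict[str, Loss]) -> float:
    """Laser output (mW) needed so the worst-case path still reaches detector sensitivity."""
    total_db = sum(per_db * count for per_db, count in losses.values())
    return 10 ** ((sensitivity_dbm + total_db) / 10)  # dBm -> mW


# Hypothetical loss budget; none of these numbers come from the talk.
budget = {
    "coupler": (1.0, 2),
    "off_ring": (0.01, 100),
    "waveguide_and_filter_drop": (1.5, 1),
    "modulator_insertion": (1.0, 1),
}
base = required_laser_power_mw(-20.0, budget)
for coupler_db in (1.0, 2.0, 3.0):
    trial = dict(budget, coupler=(coupler_db, 2))
    ratio = required_laser_power_mw(-20.0, trial) / base
    print(f"coupler {coupler_db} dB -> {ratio:.2f}x laser power")
```

In this model every extra dB costs the same 1.26x factor, so which term dominates depends on how large each loss is, how much it can vary, and how many times a path incurs it.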
24
Sensitivity to Fiber Density: With 116 mm² chiplets, there are 43 mm along the chip edge, enough room for 172 fibers at a 250 μm pitch. With 128 fibers, performance is within 3% of the maximum.
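The fiber count follows from dividing edge length by fiber pitch: 43 mm / 250 μm = 172. Assuming a square die (the slide does not say), 43 mm is roughly the full perimeter of a 116 mm² chiplet, i.e., fibers attach on all four edges.

```python
import math


def square_perimeter_mm(area_mm2: float) -> float:
    """Perimeter of a square die with the given area."""
    return 4 * math.sqrt(area_mm2)


def max_fibers(edge_mm: float, pitch_um: float) -> int:
    """Fibers that fit side by side along edge_mm at the given pitch."""
    return math.floor(edge_mm * 1000 / pitch_um)


print(f"perimeter of a 116 mm^2 square die: {square_perimeter_mm(116):.1f} mm")
print(max_fibers(43, 250))  # 172
print(f"128 of 172 fibers = {128 / max_fibers(43, 250):.0%} of the available edge")
```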
26
Performance Against Unlimited Designs: The baselines Mesh_20MC and Corona_20MC have unlimited power (they run at the design's maximum speed, irrespective of temperature) and also unlimited bandwidth (20 memory controllers per chip, 5x more pins). Galaxy matches the performance of these unlimited designs.
27
Performance Against Realistic Designs: Realistic designs stay within power and bandwidth envelopes. Galaxy chiplets stay within 66.2 °C, so they run at maximum speed. Galaxy achieves a 2.2x speedup on average (3.4x max).
28
Energy-Delay Product: Cool chiplets minimize leakage. Galaxy achieves a 2.4x-2.8x smaller EDP on average (6.8x max).
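The energy-delay product (EDP) multiplies energy by execution time, so a design that is both faster and more frugal gains on both factors. The workload numbers below are hypothetical and only illustrate the arithmetic; they are not derived from the talk's results.

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_j * delay_s


def edp_reduction(base_energy: float, base_delay: float, new_energy: float, new_delay: float) -> float:
    """How many times smaller the new design's EDP is than the baseline's."""
    return edp(base_energy, base_delay) / edp(new_energy, new_delay)


# Hypothetical workload: the baseline takes 1.0 s and 1.0 J.
print(edp_reduction(1.0, 1.0, new_energy=1.0, new_delay=1.0 / 2.2))  # speedup only
print(edp_reduction(1.0, 1.0, new_energy=0.8, new_delay=1.0 / 2.2))  # speedup plus 20% less energy
```

Lower leakage from cooler chiplets would act on the energy term, while the speedup acts on the delay term.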
30
Comparison Against Multi-Chip Alternatives
31
Comparison Against Multi-Chip Alternatives: Fiber Galaxy achieves a 2.5x speedup over Oracle Macrochip (6.8x max).
32
Tapered vs. Optical Proximity Couplers: With demonstrated couplers, Galaxy needs 6x less laser power than Oracle Macrochip.
34
80-core 5-chiplet Galaxy Thermal CFD Modeling: 8 cm spacing allows cooling with cheap passive heatsinks (88.2 °C).
35
9-chiplet Dense Array (Oracle Macrochip): The tight arrangement points to a liquid-cooling requirement (249 °C).
36
9-chiplet Galaxy 2D: Cooling 9 chiplets with passive heatsinks (110 °C).
37
9-chiplet Galaxy 3D: Flexible fibers allow the virtual chip to break free of 2D planar designs (83.6 °C).
38
Galaxy Summary: Virtual chips with the performance of unlimited designs. Galaxy breaks free of typical physical constraints: large aggregate area, improved yield (break-even point: 60% yield for photonics), Tb/s/mm bandwidth density, and a pushed-back power wall. Processor disintegration: 2.2x average speedup (3.4x max) and 2.4x-2.8x smaller average EDP (6.8x max). Macrochip integration: 2.5x speedup over Oracle Macrochip (6.8x max) and 6x more power-efficient links.
40
Energy is Shaping the IT Industry: Energy is #1 of the Grand Challenges for Humanity in the Next 50 Years [Smalley Institute for Nanoscale Research and Technology, Rice U.]. Computing worldwide consumed ~408 TWh in 2010 [Gartner]. US datacenter energy consumption was ~150 TWh in 2011 [EPA]: 3.8% of domestic power generation, $15B. CO₂-equivalent emissions are comparable to the airline industry (2%), and the carbon footprint of the world's data centers is comparable to that of the Czech Republic. Exascale at 20 MW (3% of the output of an average nuclear plant!) requires 200x lower energy per instruction (2 nJ → 10 pJ). The installed base of computers grows 10% annually worldwide [Gartner], driving an exponential increase in energy consumption.
41
Overall Focus: Energy-Efficient Computing. Where does energy go? An integer add costs 0.5 pJ and an FP-FMA 50 pJ, but data movement costs 1200 pJ across a 400 mm² chip and 16000 pJ to memory. Elastic caches minimize data transfers by adapting caches to workload demands [ISCA09, IEEEMicro10, DATE12]. Processing costs ~1500 pJ just to schedule the operation. SeaFire uses specialized computing on dark silicon to eliminate general-purpose computing's overheads [IEEEMicro11, USENIX-Login11]. Circuits need wide voltage guardbands, because low voltages and process variation cause timing errors, which cause computing errors. Elastic fidelity allows errors in select code/data segments to save energy while maintaining a fidelity contract with the user [CoRR abs/1111.4279]. Chips are fundamentally limited by physical constraints, and we need to break free: Galaxy provides processor disintegration/macrochip integration using photonic interconnects [WINDS10].
42
Thank You!
43
Overcoming Data Movement and Processing Overheads. Elastic caches adapt the cache to workload demands: significant energy goes to data movement and coherence requests, so they co-locate data, metadata, and computation, decouple an address from its placement location, and capitalize on existing OS events to simplify hardware, cutting on-chip interconnect traffic by half. SeaFire provides specialized computing on dark silicon: it repurposes dark silicon to implement specialized cores, each application cherry-picks a few cores, and the rest of the chip is powered off. With vast unused area, many specialized cores are likely to find good matches, yielding 12x lower energy (conservative).
44
Overcoming Voltage Guardbands. Elastic fidelity selectively trades accuracy for energy: we don't always need 100% accuracy, but hardware always provides it. Language constructs specify the required fidelity for code/data segments, and computation is steered to execution/storage units with the appropriate fidelity and lower voltage, for 35% lower energy. [Figure: output with no errors vs. with 10% errors]