Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects

Nikos Hardavellas
PARAG@N – Parallel Architecture Group
Northwestern University
Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik

Chip Power Scaling
Chip power does not scale [Azizi 2010]
© Hardavellas

Voltage Scaling Has Slowed
In the last decade: 13x transistors but 30% lower voltage
Cannot run all transistors fast enough

Pin Bandwidth Scaling
[TU Berlin]
Cannot feed cores with data fast enough to keep them busy

Data Scaling
SPEC, TPC datasets growth: faster than Moore
Same trends in scientific, personal computing
Large Hadron Collider
  - March ’11: 1.6 PB data (Tier-1)
Large Synoptic Survey Telescope
  - 30 TB/night
  - 2x Sloan Digital Sky Surveys/day
  - Sloan: more data than entire history of astronomy before it
More data → more computing power to process them

Galaxy: Optically-Connected Disintegrated Processors
Physical constraints limit single-chip designs
  - Area, Yield, Power, Bandwidth
Multi-chip designs break free of these limitations
  - Processor disintegration
  - Macro-chip integration
[Pan, WINDS 2010]

Outline
Introduction
➔ Background
Galaxy Architecture
Experimental Methodology
Results
  - Sensitivity Studies
  - Single-Chip Comparisons (Processor Disintegration)
  - Multi-Chip Comparisons (Macrochip Integration)
  - Thermal Modeling
Conclude
Overview of Other Research

Nanophotonic Components
[Figure labels: off-chip laser source, coupler, resonant modulators, resonant detectors, Ge-doped, waveguide]
Selective: couple optical energy of a specific wavelength

Modulation and Detection
[Figure: bit streams 11010101 and 10001011]
16 - 64 wavelengths DWDM
5 - 20 μm waveguide pitch
10 Gbps per link
8 Tbps/mm bandwidth density or more!
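The 8 Tbps/mm figure can be reproduced from the other numbers on this slide. The sketch below assumes that "10 Gbps per link" means 10 Gbps per wavelength and that waveguides are laid side by side at the stated pitch; the function name is illustrative, not from the talk.

```python
def bandwidth_density_tbps_per_mm(wavelengths, gbps_per_wavelength, pitch_um):
    """Aggregate bandwidth per millimetre of edge, in Tbps/mm.

    Assumes one waveguide every `pitch_um` micrometres, each carrying
    `wavelengths` DWDM channels at `gbps_per_wavelength` Gbps.
    """
    waveguides_per_mm = 1000.0 / pitch_um
    gbps_per_waveguide = wavelengths * gbps_per_wavelength
    return waveguides_per_mm * gbps_per_waveguide / 1000.0


# Conservative end of the slide's ranges: 16 wavelengths, 20 um pitch.
low = bandwidth_density_tbps_per_mm(16, 10, 20)
# Aggressive end of the slide's ranges: 64 wavelengths, 5 um pitch.
high = bandwidth_density_tbps_per_mm(64, 10, 5)

print(f"conservative: {low:.0f} Tbps/mm, aggressive: {high:.0f} Tbps/mm")
```

Under these assumptions the conservative end of both ranges already gives the quoted 8 Tbps/mm, which is consistent with the slide's "or more".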

Outline ➔ Galaxy Architecture

Optical Crossbar

Routing Example

Single Chiplet Connectivity

Galaxy Architecture

Galaxy MWSR Optical Crossbar
More energy-efficient than SWMR at that scale
MWSR avoids broadcast bus, but requires arbitration

SWMR vs. MWSR Crossbar
Single-Writer Multiple-Reader (SWMR)
  - Broadcast bus
  - All receivers always read
  - On-rings → optical loss
  - High laser power
Multiple-Writer Single-Reader (MWSR)
  - Only one receiver reads
  - Only one ring is on → low loss
  - Low laser power
  - Needs arbitration
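A toy model of why the number of on-rings drives laser power. It is a sketch only: the ring count and the per-ring loss values are hypothetical placeholders, not the talk's parameters, and the model simply sums per-ring losses in dB along one waveguide.

```python
def path_loss_db(rings_passed, rings_on, on_ring_loss_db, off_ring_loss_db):
    """Total loss in dB when light passes `rings_passed` rings, `rings_on` of them on."""
    rings_off = rings_passed - rings_on
    return rings_on * on_ring_loss_db + rings_off * off_ring_loss_db


def relative_laser_power(loss_db):
    """Laser power needed to overcome `loss_db`, relative to a lossless path."""
    return 10 ** (loss_db / 10)


N_RINGS = 16          # hypothetical number of rings along the waveguide
ON_LOSS_DB = 0.5      # hypothetical loss of a ring that is on
OFF_LOSS_DB = 0.01    # hypothetical loss of a ring that is off

# SWMR: all receivers read, so every ring along the path is on.
swmr_db = path_loss_db(N_RINGS, N_RINGS, ON_LOSS_DB, OFF_LOSS_DB)
# MWSR: only one ring is on; the others are passed in the off state.
mwsr_db = path_loss_db(N_RINGS, 1, ON_LOSS_DB, OFF_LOSS_DB)

print(f"SWMR: {swmr_db:.2f} dB, {relative_laser_power(swmr_db):.2f}x laser power")
print(f"MWSR: {mwsr_db:.2f} dB, {relative_laser_power(mwsr_db):.2f}x laser power")
```

Because loss in dB adds along the path while laser power grows exponentially with it, the gap between the two schemes widens as the ring count grows.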

Dense Off-Chip Coupling
Dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]
~3.8 dB loss, 8 Tbps/mm demonstrated
Misalignment within [...] → loss <1 dB
Loss comparable to optical proximity couplers
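For readers less used to decibels, the coupler losses quoted above convert to transmitted power fractions as follows. This is the standard dB conversion; the helper name is illustrative.

```python
def transmitted_fraction(loss_db):
    """Fraction of optical power that survives a loss of `loss_db` decibels."""
    return 10 ** (-loss_db / 10)


coupler = transmitted_fraction(3.8)   # the ~3.8 dB fiber-array coupler loss
one_db = transmitted_fraction(1.0)    # the <1 dB misalignment bound

print(f"3.8 dB loss passes about {coupler:.1%} of the light")
print(f"1.0 dB loss passes about {one_db:.1%} of the light")
```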

Nanophotonic Parameters

Outline ➔ Experimental Methodology

Architectural Parameters

Modeling Infrastructure
[Figure labels: 3D-stack model; SimFlex sampling, 95% confidence; photonic-layer ring heating]

Outline ➔ Results: Sensitivity Studies

Laser Power Sensitivity to Optical Parameters
[Figure panels: Coupler Loss; Off-Ring Loss; Waveguide & Filter Drop Loss; Modulator Insertion Loss]
Highly sensitive to coupler loss, insensitive to other losses

Sensitivity to Fiber Density
116 mm² chiplets → 43 mm along the chip edge
Enough room for 172 fibers @ 250 μm pitch
128 fibers: within 3% of max performance
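The fiber count follows from simple geometry. The sketch below assumes a square chiplet and reads "along the chip edge" as the full perimeter; both are assumptions, chosen because they reproduce the slide's 43 mm and 172 fibers.

```python
import math

CHIPLET_AREA_MM2 = 116.0   # from the slide
FIBER_PITCH_MM = 0.250     # 250 um pitch, from the slide

# Assumption: square die, fibers can attach along all four sides.
side_mm = math.sqrt(CHIPLET_AREA_MM2)
perimeter_mm = 4 * side_mm
max_fibers = math.floor(perimeter_mm / FIBER_PITCH_MM)

print(f"side = {side_mm:.2f} mm, perimeter = {perimeter_mm:.1f} mm")
print(f"room for {max_fibers} fibers at 250 um pitch")
```

The 128-fiber configuration reported on the slide therefore uses roughly three quarters of the available edge.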

Outline ➔ Results: Single-Chip Comparisons (Processor Disintegration)

Performance Against “Unlimited” Designs
xxx_4MC: Unlimited power (max speed of design, irrespective of temp.)
xxx_20MC: Also unlimited bandwidth (20 MCs per chip, 5x more pins)
Galaxy matches the performance of “unlimited” designs

Performance Against Realistic Designs
Realistic: within power and bandwidth envelopes
Galaxy chiplets within 66.2 °C → chiplets run at max speed
Galaxy: 2.4x - 3.2x speedup on average (3.4x max)

Energy-Delay Product
Cool chiplets minimize leakage
Galaxy: 2.4x-2.8x smaller EDP on average (7.1x max)
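As a reminder of the metric, energy-delay product (EDP) is energy multiplied by execution time, so lower is better. The sketch below uses made-up normalized numbers purely to show how an improvement factor is computed; they are not measurements from the talk.

```python
def edp(energy_j, delay_s):
    """Energy-delay product: energy (J) times execution time (s)."""
    return energy_j * delay_s


def edp_improvement(base_energy, base_delay, new_energy, new_delay):
    """How many times smaller the new design's EDP is than the baseline's."""
    return edp(base_energy, base_delay) / edp(new_energy, new_delay)


# Hypothetical, normalized example: same energy, half the run time.
same_energy_faster = edp_improvement(1.0, 1.0, 1.0, 0.5)
# Hypothetical example: 20% less energy and half the run time.
less_energy_faster = edp_improvement(1.0, 1.0, 0.8, 0.5)

print(f"{same_energy_faster:.2f}x and {less_energy_faster:.2f}x smaller EDP")
```

A speedup alone improves EDP by the same factor; any energy saving, such as reduced leakage in cooler chiplets, multiplies on top of it.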

Outline ➔ Results: Multi-Chip Comparisons (Macrochip Integration)

Comparison Against Multi-Chip Alternatives
Fiber Galaxy: 2.5x speedup over Oracle Macrochip (6.8x max)
6x less laser power with demonstrated couplers

Outline ➔ Results: Thermal Modeling

Thermal CFD Modeling
80-core 5-chiplet Galaxy
8 cm spacing allows cooling with cheap passive heatsinks
[Figure label: 88.2 °C]

Galaxy Summary
“Virtual chips” with the performance of unlimited designs
Breaks free of typical physical constraints
  - Large aggregate area
  - Improved yield (break-even point: 60% yield for photonics)
  - Tb/s/mm bandwidth density
  - Pushes back power wall
Processor disintegration
  - 2.4x – 3.2x avg. speedup (3.4x max)
  - 2.4x – 2.8x avg. smaller EDP (7.1x max)
Macrochip integration
  - 2.5x speedup over Oracle Macrochip (6.8x max)
  - 6x more power-efficient links

Outline ➔ Overview of Other Research

Overall Focus: Energy-Efficient Computing
Integer add: 0.5 pJ; FP-FMA: 50 pJ. Where does energy go?
  - Data movement: 1200 pJ across 400 mm² chip, 16000 pJ memory
    Elastic caches: minimize data transfers through adapting caches to workload demands [ISCA’09, IEEEMicro’10, DATE’12]
  - Processing: ~1500 pJ to schedule the operation
    SeaFire: specialized computing on dark silicon to eliminate general-purpose computing’s overheads [IEEEMicro’11, USENIX-Login’11]
  - Circuits: wide voltage guardbands
    Low voltages, process variation → timing errors → computing errors
    Elastic fidelity: allow errors at select code/data segments to save energy while maintaining fidelity contract with user [CoRR abs/1111.4279]
Chips fundamentally limited by physical constraints. Need to break free.
  Galaxy: processor disintegration/macrochip integration using photonic interconnects [WINDS’10]
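To make the "where does energy go?" point concrete, the sketch below divides each overhead quoted on this slide by the cost of the arithmetic itself. All energy values are taken from the slide; the dictionary keys are illustrative labels.

```python
# Energy per event in picojoules, as quoted on the slide.
ENERGY_PJ = {
    "integer_add": 0.5,
    "fp_fma": 50.0,
    "move_across_chip": 1200.0,
    "memory_access": 16000.0,
    "schedule_operation": 1500.0,
}


def overhead_ratio(overhead, operation):
    """How many times more energy the overhead costs than the operation itself."""
    return ENERGY_PJ[overhead] / ENERGY_PJ[operation]


for overhead in ("move_across_chip", "memory_access", "schedule_operation"):
    vs_fma = overhead_ratio(overhead, "fp_fma")
    vs_add = overhead_ratio(overhead, "integer_add")
    print(f"{overhead}: {vs_fma:.0f}x an FP-FMA, {vs_add:.0f}x an integer add")
```

Even against a floating-point fused multiply-add, moving or scheduling the operation costs tens to hundreds of times more than computing it, which is the motivation for the three research directions listed above.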

Tapered vs. Optical Proximity Couplers
6x less laser power than Oracle Macrochip with demonstrated couplers

Energy is Shaping the IT Industry
#1 of Grand Challenges for Humanity in the Next 50 Years [Smalley Institute for Nanoscale Research and Technology, Rice U.]
Computing worldwide: ~408 TWh in 2010 [Gartner]
Datacenter energy consumption in US: ~150 TWh in 2011 [EPA]
  - 3.8% of domestic power generation, $15B
  - CO₂-equiv. emissions ≈ Airline Industry (2%)
Carbon footprint of world’s data centers ≈ Czech Republic
Exascale @ 20 MW: 200x lower energy/instr. (2 nJ → 10 pJ)
  - 3% of the output of an average nuclear plant!
10% annual growth on installed computers worldwide [Gartner]
Exponential increase in energy consumption
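The exascale and datacenter figures can be sanity-checked with a few lines of arithmetic. The sketch assumes "exascale" means 10^18 operations per second; the implied plant size and electricity price are derived values, not numbers stated in the talk.

```python
# Energy per instruction today vs. the exascale target, from the slide.
current_j = 2e-9      # 2 nJ
target_j = 10e-12     # 10 pJ
reduction = current_j / target_j

# Whole-machine energy budget per operation at 20 MW.
# Assumption: exascale = 1e18 operations per second.
budget_pj = 20e6 / 1e18 * 1e12

# Plant output implied by "20 MW is 3% of an average nuclear plant".
implied_plant_mw = 20 / 0.03

# Electricity price implied by 150 TWh costing $15B (1 TWh = 1e9 kWh).
implied_usd_per_kwh = 15e9 / (150 * 1e9)

print(f"required reduction: {reduction:.0f}x")
print(f"budget at 20 MW: {budget_pj:.0f} pJ per operation")
print(f"implied plant output: {implied_plant_mw:.0f} MW")
print(f"implied price: ${implied_usd_per_kwh:.2f}/kWh")
```

Under that assumption the 20 MW envelope leaves about 20 pJ per operation for the entire system, so a 10 pJ instruction budget leaves the remainder for everything outside the cores.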

Overcoming Data Movement and Processing Overheads
Elastic caches: adapt cache to workload’s demands
  - Significant energy on data movements and coherence requests
  - Co-locate data, metadata, and computation
  - Decouple address from placement location
  - Capitalize on existing OS events → simplify hardware
  - Cut on-chip interconnect traffic by half
SeaFire: specialized computing on dark silicon
  - Repurpose dark silicon to implement specialized cores
  - Application cherry-picks a few cores, rest of chip is powered off
  - Vast unused area → many specialized cores → likely to find good matches
  - 12x lower energy (conservative)

Overcoming Voltage Guardbands
Elastic fidelity: selectively trade accuracy for energy
  - We don’t always need 100% accuracy, but HW always provides it
  - Language constructs specify required fidelity for code/data segments
  - Steer computation to exec/storage units with appropriate fidelity and lower voltage
  - 35% lower energy
[Figure labels: No errors; 10% errors]
