1
Building manycore processor-to-DRAM networks using monolithic silicon photonics
Ajay Joshi†, Christopher Batten†, Vladimir Stojanović†, Krste Asanović‡
†MIT, 77 Massachusetts Ave, Cambridge, MA 02139
‡UC Berkeley, 430 Soda Hall, MC #1776, Berkeley, CA 94720
{joshi, cbatten, vlada}@mit.edu, krste@eecs.berkeley.edu
High Performance Embedded Computing (HPEC) Workshop, 23-25 September 2008
2
Manycore systems design space
3
Manycore system bandwidth requirements
4
Manycore systems – bandwidth, pin count and power scaling
- Assumes 1 Byte/Flop and 8 Flops/core @ 5 GHz (a quick check of the implied bandwidth follows below)
- [Charts: bandwidth, pin count, and power scaling trends for Server & HPC and Mobile Client systems]
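As a sanity check on these requirements, a back-of-envelope calculation using the slide's assumptions (the 256-core chip size is an illustrative choice, not a figure from this slide):

```python
# Back-of-envelope memory bandwidth demand (figures from the slide;
# the 256-core chip size is an illustrative assumption).
flops_per_core_per_cycle = 8
clock_hz = 5e9
bytes_per_flop = 1
cores = 256

bw_per_core = flops_per_core_per_cycle * clock_hz * bytes_per_flop  # 40 GB/s
bw_per_chip = bw_per_core * cores                                   # ~10.24 TB/s

print(f"per-core: {bw_per_core / 1e9:.0f} GB/s")
print(f"per-chip: {bw_per_chip / 1e12:.2f} TB/s")
```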
5-6
Interconnect bottlenecks
- [Diagram: a single-core CPU system and a manycore system, each with caches and DRAM DIMMs connected through an interconnect network]
- Bottlenecks arise from energy and bandwidth density limitations
- The on-chip and off-chip interconnect networks need to be jointly optimized
7
Outline
- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
- Manycore system using silicon photonics
- Conclusion
8
Unified on-chip/off-chip photonic link
- Supports dense wavelength-division multiplexing (DWDM), which improves bandwidth density
- Uses monolithic integration, which reduces energy consumption
- Utilizes the standard bulk CMOS flow
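To see why DWDM improves bandwidth density, a rough aggregate-bandwidth calculation for one waveguide. The 64-wavelength count appears in the backup slides; the 10 Gb/s line rate and 4 μm waveguide pitch are illustrative assumptions:

```python
# Aggregate bandwidth of one DWDM waveguide (64 wavelengths per the
# backup slides; 10 Gb/s per wavelength and a 4 um waveguide pitch
# are assumed values for illustration).
wavelengths = 64
rate_per_wavelength_gbps = 10
waveguide_pitch_um = 4

aggregate_gbps = wavelengths * rate_per_wavelength_gbps    # 640 Gb/s
density_gbps_per_um = aggregate_gbps / waveguide_pitch_um  # 160 Gb/s/um

print(f"{aggregate_gbps} Gb/s per waveguide, {density_gbps_per_um} Gb/s/um")
```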
9
Optical link components
- 65 nm bulk CMOS chip designed to test various optical devices
10
Silicon photonics area and energy advantage

| Metric | Energy (pJ/b) | Bandwidth density (Gb/s/μm) |
|---|---|---|
| Global on-chip photonic link | 0.25 | 160-320 |
| Global on-chip optimally repeated electrical link | 1 | 5 |
| Off-chip photonic link (50 μm coupler pitch) | 0.25 | 13-26 |
| Off-chip electrical SERDES (100 μm pitch) | 5 | 0.1 |
| On-chip/off-chip seamless photonic link | 0.25 | — |
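To translate the table's per-bit energies into chip-level power, a minimal sketch at the ~10 TB/s demand estimated earlier (that bandwidth figure is carried over from the back-of-envelope calculation, not from the table):

```python
# I/O power at a given memory bandwidth, using per-bit energies from
# the table above. The 10.24 TB/s demand is the earlier estimate.
bandwidth_bps = 10.24e12 * 8  # bytes/s -> bits/s

energy_pj_per_bit = {
    "off-chip electrical SERDES": 5.0,
    "off-chip photonic link": 0.25,
}

for link, pj in energy_pj_per_bit.items():
    watts = pj * 1e-12 * bandwidth_bps
    print(f"{link}: {watts:.0f} W")
# -> roughly 410 W electrical vs ~20 W photonic at this bandwidth
```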
11
Outline
- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
  - Baseline electrical mesh topology
  - Electrical mesh with optical global crossbar topology
- Manycore system using silicon photonics
- Conclusion
12
Baseline electrical system architecture
- One access point (AP) per DM, distributed across the chip
- Two on-chip electrical mesh networks:
  - Request path: core → access point → DRAM module
  - Response path: DRAM module → access point → core
- [Figures: mesh physical view and mesh logical view; C = core, DM = DRAM module]
13
Interconnect network design methodology
- Ideal throughput and zero-load latency are used as the design metrics
- An energy-constrained approach is adopted
- Energy components in the network:
  - Mesh energy (E_m): router-to-router links (RRL) and routers
  - I/O energy (E_io): logic-to-memory links (LML)
- Sizing flow (sketched below): choose a flit width → calculate on-chip RRL and router energy → calculate mesh throughput and total mesh energy → subtract from the total energy budget to obtain the LML energy budget → calculate the LML width → calculate I/O throughput → calculate zero-load latency
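A minimal sketch of that energy-constrained sizing flow. The flow follows the slide; every numeric constant except the 8 nJ/cyc budget is an illustrative assumption:

```python
# Energy-constrained network sizing: size the mesh first, then spend
# whatever energy remains on the logic-to-memory (I/O) links.
# All numeric constants except the total budget are assumptions.

TOTAL_BUDGET_NJ_PER_CYC = 8.0  # slide's energy budget (8 nJ/cyc)
E_RRL_PJ_PER_BIT = 0.05        # assumed router-to-router link energy
E_ROUTER_PJ_PER_BIT = 0.10     # assumed router traversal energy
E_LML_PJ_PER_BIT = 5.0         # assumed logic-to-memory link energy
AVG_HOPS = 10                  # assumed average mesh hop count

def size_network(flit_width_bits, injection_rate_flits_per_cyc):
    # Mesh energy per cycle: each flit pays link + router energy per hop
    # (coarse model).
    bits_per_cyc = flit_width_bits * injection_rate_flits_per_cyc * AVG_HOPS
    mesh_nj = bits_per_cyc * (E_RRL_PJ_PER_BIT + E_ROUTER_PJ_PER_BIT) * 1e-3

    # Whatever budget is left goes to the I/O (LML) links.
    lml_budget_nj = TOTAL_BUDGET_NJ_PER_CYC - mesh_nj
    if lml_budget_nj <= 0:
        return None  # mesh alone exhausts the budget
    lml_bits_per_cyc = lml_budget_nj * 1e3 / E_LML_PJ_PER_BIT
    return {"mesh_nj_per_cyc": mesh_nj, "io_bits_per_cyc": lml_bits_per_cyc}

print(size_network(flit_width_bits=128, injection_rate_flits_per_cyc=32))
```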
14-17
Network throughput and zero-load latency
- System throughput is limited by the on-chip mesh or by the I/O links
- The on-chip mesh can be over-provisioned (OPF = 1, 2, 4) to overcome the mesh bottleneck
- Zero-load latency is limited by data serialization, both on-chip and off-chip (a simple model is sketched below)
- (22 nm technology, 256 cores @ 2.5 GHz, 8 nJ/cyc energy budget)
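A minimal zero-load latency model separating hop delay from on-chip and off-chip serialization. All constants are illustrative assumptions, not the paper's parameters:

```python
# Zero-load latency = routing hops + serialization at the narrow links.
# All constants are illustrative assumptions.
ROUTER_CYCLES = 2  # assumed per-hop router pipeline delay
MSG_BITS = 512     # assumed message (cache line) size

def zero_load_latency(avg_hops, mesh_width_bits, lml_width_bits):
    hop_cycles = avg_hops * ROUTER_CYCLES
    mesh_serialization = MSG_BITS / mesh_width_bits  # on-chip serialization
    io_serialization = MSG_BITS / lml_width_bits     # off-chip serialization
    return hop_cycles + mesh_serialization + io_serialization

# Narrow off-chip links dominate: widening them (e.g., with photonics)
# cuts the serialization term directly.
print(zero_load_latency(avg_hops=10, mesh_width_bits=128, lml_width_bits=16))
print(zero_load_latency(avg_hops=10, mesh_width_bits=128, lml_width_bits=128))
```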
18
Outline
- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
  - Baseline electrical mesh topology
  - Electrical mesh with optical global crossbar topology
- Manycore system using silicon photonics
- Conclusion
19
Optical system architecture
- Off-chip electrical links are replaced with optical links
- Electrical-to-optical conversion occurs at the access point
- The wavelengths in each optical link are distributed across the core-DRAM module pairs (a toy assignment is sketched below)
- [Figures: mesh physical view and mesh logical view; C = core, DM = DRAM module]
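A toy sketch of distributing one optical link's wavelengths across access points. The 64-wavelength figure comes from the backup slides; the round-robin policy and the AP count are illustrative assumptions:

```python
# Statically partition one waveguide's wavelengths across access points,
# round-robin. 64 wavelengths per the backup slides; the assignment
# policy itself is an illustrative assumption.
WAVELENGTHS = 64

def assign_wavelengths(num_access_points):
    assignment = {ap: [] for ap in range(num_access_points)}
    for lam in range(WAVELENGTHS):
        assignment[lam % num_access_points].append(lam)
    return assignment

# 16 access points (one per DRAM module) -> 4 wavelengths each
print({ap: lams for ap, lams in assign_wavelengths(16).items() if ap < 2})
```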
20-21
Network throughput and zero-load latency
- Reduced I/O cost improves system bandwidth
- Latency is reduced because serialization latency is lower
- The on-chip network becomes the new bottleneck
22
Optical multi-group system architecture
- Break the single on-chip electrical mesh into several groups
- Each group has its own smaller mesh
- Each group still has one AP for each DM
- More APs means each AP is narrower (uses fewer λs); see the arithmetic below
- Use the optical network as a very efficient global crossbar
- A crossbar switch is needed at the memory for arbitration
- [Figure: Ci = core in group i, DM = DRAM module, S = global crossbar switch]
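Quick arithmetic on how grouping multiplies access points while narrowing each one, for the 256-core, 16-DM configuration used in the results. The per-DM wavelength budget is an illustrative assumption:

```python
# Grouping: each of the g groups keeps one access point (AP) per DRAM
# module, so APs multiply and each one gets a narrower slice of the
# per-DM wavelength budget. The budget value is illustrative.
DMS = 16
LAMBDAS_PER_DM = 64  # assumed wavelength budget per DRAM module

for groups in (1, 4, 16):
    aps_total = groups * DMS
    lambdas_per_ap = LAMBDAS_PER_DM // groups
    print(f"groups={groups:2d}  total APs={aps_total:3d}  lambdas/AP={lambdas_per_ap}")
```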
23
Network throughput vs. zero-load latency
- Grouping moves traffic from energy-inefficient mesh channels to energy-efficient photonic channels
- Grouping and silicon photonics together provide a 10x-15x throughput improvement
- Grouping reduces zero-load latency in the photonic range, but increases it in the electrical range
- [Plot: throughput vs. zero-load latency; points A and B mark the 10x-15x gap]
24
Simulation results
- Grouping: 2x improvement in bandwidth at comparable latency
- Overprovisioning: 2x-3x improvement in bandwidth at comparable latency for small group counts; minimal improvement for large group counts
- (256 cores, 16 DMs, uniform random traffic)
25
Simulation results
- Replacing off-chip electrical links with photonics (Eg1x4 → Og1x4): 2x improvement in bandwidth at comparable latency
- Using an opto-electrical global crossbar (Eg4x2 → Og16x1): 8x-10x improvement in bandwidth at comparable latency
- (256 cores, 16 DMs, uniform random traffic)
26
Outline
- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
- Manycore system using silicon photonics
- Conclusion
27-31
Simplified 16-core system design
- [Build sequence of figures illustrating the simplified 16-core system design; the slide text is identical across the sequence]
32
Full 256-core system design
33
Outline
- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
- Manycore system using silicon photonics
- Conclusion
34
Conclusion
- On-chip network design and memory bandwidth will limit manycore system performance
- A unified on-chip/off-chip photonic link is proposed to address this problem
- Grouping with an optical global crossbar improves system throughput
- Under an energy-constrained approach, photonics provides an 8x-10x improvement in throughput at comparable latency
35
Backup
36
MIT Eos1 65 nm test chip
- Texas Instruments standard 65 nm bulk CMOS process
- The first photonic chip in sub-100 nm CMOS
- Automated photonic device layout
- Monolithic integration with electrical modulator drivers
37
Eos1 test structures
- [Annotated die photo: ring modulator, paperclip waveguides, waveguide crossings, Mach-Zehnder test structures, digital driver, four ring filter banks, photodetector, two-ring filter, one-ring filter, vertical grating coupler]
38
Optical waveguide
- Waveguide made of polysilicon
- Silicon substrate under the waveguide is etched away to provide optical cladding
- 64 wavelengths per waveguide, traveling in opposite directions
- [Figures: SEM image of a polysilicon waveguide; cross-sectional view of a photonic chip]
39
Modulators and filters
- Second-order ring filters are used
- Rings are tuned via sizing and heating
- The modulator is tuned using charge injection
- Sub-100 fJ/bit energy cost for the modulator driver
- [Figures: resonant racetrack modulator; double-ring resonant filter]
40
Photodetectors
- Embedded SiGe is used to create the photodetectors
- Monolithic integration enables good optical coupling
- Sub-100 fJ/bit energy cost required for the receiver