Building manycore processor-to-DRAM networks using monolithic silicon photonics Ajay Joshi †, Christopher Batten †, Vladimir Stojanović †, Krste Asanović.

Presentation transcript:

Building manycore processor-to-DRAM networks using monolithic silicon photonics
Ajay Joshi†, Christopher Batten†, Vladimir Stojanović†, Krste Asanović‡
† MIT, 77 Massachusetts Ave, Cambridge MA
‡ UC Berkeley, 430 Soda Hall, MC #1776, Berkeley, CA
{joshi, cbatten,
High Performance Embedded Computing (HPEC) Workshop, September 2008

MIT/UCB Manycore systems design space

Manycore system bandwidth requirements

Manycore systems – bandwidth, pin count and power scaling
[Figure: scaling plot annotated "1 Byte/Flop, 8 5GHz"; system classes shown: Server & HPC, Mobile, Client]

Interconnect bottlenecks
[Figure: manycore system diagram; labels: CPU, cores, Cache, Interconnect Network, DRAM DIMM]
- Bottlenecks due to energy and bandwidth density limitations
- Need to jointly optimize on-chip and off-chip interconnect network

Outline
- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
- Manycore system using silicon photonics
- Conclusion

Unified on-chip/off-chip photonic link
- Supports dense wavelength-division multiplexing that improves bandwidth density
- Uses monolithic integration that reduces energy consumption
- Utilizes the standard bulk CMOS flow

Optical link components
[Figure: 65 nm bulk CMOS chip designed to test various optical devices]

Silicon photonics area and energy advantage

Link | Energy (pJ/b) | Bandwidth density (Gb/s/μ)
Global on-chip photonic link | |
Global on-chip optimally repeated electrical link | 1 | 5
Off-chip photonic link (50 μ coupler pitch) | |
Off-chip electrical SERDES (100 μ pitch) | 5 | 0.1
On-chip/off-chip seamless photonic link | 0.25 |
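As a rough illustration of what a pJ/b figure means at the system level, link power is energy per bit multiplied by bit rate. The sketch below is a minimal Python calculation under stated assumptions: the function name `link_power_watts` and the 1000 Gb/s aggregate load are hypothetical choices for illustration, and only the 0.25 pJ/b value is taken from the slide.

```python
def link_power_watts(energy_pj_per_bit: float, bandwidth_gbps: float) -> float:
    """Power drawn by a link: (energy per bit) x (bits per second)."""
    joules_per_bit = energy_pj_per_bit * 1e-12   # pJ -> J
    bits_per_second = bandwidth_gbps * 1e9       # Gb/s -> b/s
    return joules_per_bit * bits_per_second


# 0.25 pJ/b comes from the slide; 1000 Gb/s is an assumed, illustrative load.
photonic_w = link_power_watts(0.25, 1000.0)
print(f"{photonic_w:.3f} W")
```

Because power scales linearly with energy per bit, a link that costs several times more energy per bit draws several times more power at the same bandwidth, which is why the energy column matters for a fixed power budget.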

Outline
- Motivation
- Monolithic silicon photonic technology
- Processor-memory network architecture exploration
  - Baseline electrical mesh topology
  - Electrical mesh with optical global crossbar topology
- Manycore system using silicon photonics
- Conclusion

Baseline electrical system architecture
- Access point per DM distributed across the chip
- Two on-chip electrical mesh networks
  - Request path – core → access point → DRAM module
  - Response path – DRAM module → access point → core
[Figure: mesh physical view and mesh logical view; C = core, DM = DRAM module]

Interconnect network design methodology
- Ideal throughput and zero load latency used as design metrics
- Energy constrained approach is adopted
- Energy components in a network
  - Mesh energy (E_m) (router-to-router links (RRL), routers)
  - IO energy (E_io) (logic-to-memory links (LML))
[Flowchart boxes: Flit width; Calculate on-chip RRL energy; Calculate on-chip router energy; Calculate mesh throughput; Calculate total mesh energy; Total energy budget; Calculate energy budget for LML; Calculate LML width; Calculate I/O throughput; Calculate zero load latency]
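The energy-constrained flow on this slide can be sketched as a short calculation: choose a flit width, charge the mesh (routers plus RRL) against the total budget, and spend the remainder on the LML. This is a minimal sketch, not the authors' model. The function `split_energy_budget`, the per-bit energy costs, the router and channel counts, and the assumption that every router and channel moves one flit per cycle are all hypothetical; only the 8 nJ/cyc budget is taken from the later throughput slides.

```python
from dataclasses import dataclass


@dataclass
class EnergySplit:
    mesh_energy_pj: float   # E_m: routers plus router-to-router links (RRL)
    lml_budget_pj: float    # E_io: budget left for logic-to-memory links (LML)
    lml_width_bits: int     # LML bits per cycle that fit in that budget


def split_energy_budget(total_budget_pj, flit_width_bits, n_routers, n_channels,
                        router_pj_per_bit, rrl_pj_per_bit, lml_pj_per_bit):
    """Charge the mesh first, then size the LML from what is left.

    Assumes (pessimistically) that every router and every channel moves one
    flit per cycle. All per-bit costs are hypothetical inputs.
    """
    rrl_energy = n_channels * flit_width_bits * rrl_pj_per_bit
    router_energy = n_routers * flit_width_bits * router_pj_per_bit
    mesh_energy = rrl_energy + router_energy
    if mesh_energy > total_budget_pj:
        raise ValueError("mesh alone exceeds the total energy budget")
    lml_budget = total_budget_pj - mesh_energy
    lml_width = int(lml_budget // lml_pj_per_bit)
    return EnergySplit(mesh_energy, lml_budget, lml_width)


# 8 nJ/cycle = 8000 pJ/cycle; every other number is made up for illustration.
electrical = split_energy_budget(8000.0, 64, 256, 960, 0.05, 0.05, 5.0)
photonic = split_energy_budget(8000.0, 64, 256, 960, 0.05, 0.05, 0.25)
print(electrical.lml_width_bits, photonic.lml_width_bits)
```

With the mesh cost held fixed, a cheaper per-bit LML cost yields a wider LML for the same leftover budget, which is the lever the optical architecture relies on.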

Network throughput and zero load latency
- System throughput limited by on-chip mesh or I/O links
- On-chip mesh could be over-provisioned to overcome mesh bottleneck
- Zero load latency limited by data serialization
  - On-chip serialization
  - Off-chip serialization
[Figure: throughput and zero load latency plots (22nm tech, GHz, 8 nJ/cyc energy budget); series: OPF:1, OPF:2, OPF:4]
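The two limits named on this slide can be written down directly: system throughput is bounded by the slower of the mesh and the I/O links, and zero load latency grows with the number of cycles needed to serialize a message over a narrow channel. The sketch below is a simplified model under stated assumptions, not the authors' simulator. The function names, the additive latency model, and the example numbers (20 hop cycles, 512-bit message, 64-bit flits, 16-bit LML) are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def system_throughput(mesh_throughput: float, io_throughput: float) -> float:
    """The bottleneck stage (mesh or I/O links) sets system throughput."""
    return min(mesh_throughput, io_throughput)


def zero_load_latency(hop_cycles: int, message_bits: int,
                      flit_width_bits: int, lml_width_bits: int) -> int:
    """Hop latency plus on-chip and off-chip serialization, in cycles."""
    on_chip = ceil_div(message_bits, flit_width_bits)    # on-chip serialization
    off_chip = ceil_div(message_bits, lml_width_bits)    # off-chip serialization
    return hop_cycles + on_chip + off_chip


# Hypothetical example: 20 + 8 + 32 = 60 cycles.
print(zero_load_latency(20, 512, 64, 16))
# Doubling mesh throughput helps only while the mesh is the bottleneck.
print(system_throughput(2.0, 3.0), system_throughput(4.0, 3.0))
```

In this model, over-provisioning the mesh raises `mesh_throughput` until the I/O links become the limit, after which further mesh capacity buys nothing.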

Optical system architecture
- Off-chip electrical links replaced with optical links
- Electrical to optical conversion at access point
- Wavelengths in each optical link distributed across various core-DRAM module pairs
[Figure: mesh physical view and mesh logical view; C = core, DM = DRAM module]

Network throughput and zero load latency
- Reduced I/O cost improves system bandwidth
- Reduction in latency due to lower serialization latency
- On-chip network is the new bottleneck

Optical multi-group system architecture
- Break the single on-chip electrical mesh into several groups
  - Each group has its own smaller mesh
  - Each group still has one AP for each DM
  - More APs → each AP is narrower (uses fewer λs)
- Use optical network as a very efficient global crossbar
- Need a crossbar switch at the memory for arbitration
[Figure: Ci = core in group i, DM = DRAM module, S = global crossbar switch]
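The bookkeeping behind grouping can be sketched in a few lines: each group keeps one access point (AP) per DRAM module, so the AP count grows with the group count while a fixed pool of wavelengths is divided more finely. This is an illustrative sketch only. The function `grouped_layout`, the even split of wavelengths across APs, and the pool of 1024 wavelengths are assumptions; the 256-core, 16-DM configuration is the one named on the simulation slides.

```python
def grouped_layout(n_cores: int, n_dram_modules: int, n_groups: int,
                   total_wavelengths: int) -> dict:
    """Count APs and wavelengths per AP when the mesh is split into groups.

    Assumes cores divide evenly into groups and that a fixed pool of
    wavelengths is shared evenly among all APs (hypothetical policy).
    """
    if n_cores % n_groups != 0:
        raise ValueError("cores must divide evenly into groups")
    n_access_points = n_groups * n_dram_modules   # one AP per DM in every group
    return {
        "cores_per_group": n_cores // n_groups,
        "access_points": n_access_points,
        "wavelengths_per_ap": total_wavelengths // n_access_points,
    }


# 256 cores and 16 DMs as on the simulation slides; 1024 wavelengths is assumed.
for groups in (1, 4, 16):
    print(groups, grouped_layout(256, 16, groups, 1024))
```

Going from 1 to 16 groups in this example shrinks each mesh from 256 to 16 cores and each AP from 64 to 4 wavelengths, which matches the slide's point that more APs means narrower APs.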

Network throughput vs zero load latency
- Grouping moves traffic from energy-inefficient mesh channels to energy-efficient photonic channels
- Grouping and silicon photonics provide 10x-15x throughput improvement
- Grouping reduces zero load latency (ZLL) in the photonic range, but increases ZLL in the electrical range
[Figure: throughput vs zero load latency plot; annotations: A, B, 10x-15x]

Simulation results
- Grouping
  - 2x improvement in bandwidth at comparable latency
- Overprovisioning
  - 2x-3x improvement in bandwidth for small group count at comparable latency
  - Minimal improvement for large group count
[Figure: two plots, each labeled "256 cores, 16 DM, uniform random traffic"]

Simulation results
- Replacing off-chip electrical with photonics (Eg1x4 → Og1x4)
  - 2x improvement in bandwidth at comparable latency
- Using opto-electrical global crossbar (Eg4x2 → Og16x1)
  - 8x-10x improvement in bandwidth at comparable latency
[Figure: two plots, each labeled "256 cores, 16 DM, uniform random traffic"]

Simplified 16-core system design

Full 256-core system design

Conclusion
- On-chip network design and memory bandwidth will limit manycore system performance
- A unified on-chip/off-chip photonic link is proposed to solve this problem
- Grouping with an optical global crossbar improves system throughput
- Under an energy-constrained approach, photonics provides 8x-10x improvement in throughput at comparable latency

Backup

MIT Eos1 65 nm test chip
- Texas Instruments standard 65 nm bulk CMOS process
- First ever photonic chip in sub-100 nm CMOS
- Automated photonic device layout
- Monolithic integration with electrical modulator drivers

[Figure: test chip layout; labeled structures: ring modulator, paperclips, waveguide crossings, M-Z test structures, digital driver, 4 ring filter banks, photodetector, two-ring filter, one-ring filter, vertical coupler grating]

Optical waveguide
- Waveguide made of polysilicon
- Silicon substrate under waveguide etched away to provide optical cladding
- 64 wavelengths per waveguide in opposite directions
[Figures: SEM image of a polysilicon waveguide; cross-sectional view of a photonic chip]

Modulators and filters
- 2nd order ring filters used
- Rings tuned using sizing and heating
- Modulator is tuned using charge injection
- Sub-100 fJ/bit energy cost for the modulator driver
[Figures: resonant racetrack modulator; double-ring resonant filter]

Photodetectors
- Embedded SiGe used to create photodetectors
- Monolithic integration enables good optical coupling
- Sub-100 fJ/bit energy cost required for the receiver