Nikos Hardavellas – Parallel Architecture Group

Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects Nikos Hardavellas – Parallel Architecture Group.

Presentation transcript:

Galaxy: A High-Performance Energy-Efficient Multi-Chip Architecture Using Photonic Interconnects Nikos Hardavellas PARAG@N – Parallel Architecture Group Northwestern University Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik

Technology Scaling Runs Out of Steam Transistor counts increase exponentially, but… Power Wall: can no longer power the entire chip (voltage, cooling do not scale); 1.2V at 100nm -> 0.85V at 28nm. Bandwidth Wall: can no longer feed all cores with data fast enough (package pins do not scale); 580 mm2 die: 25600 pins to package at 150μm; 5x5 cm: 3844 substrate-to-board at 0.8mm. Low Yield: can no longer keep costs at bay (process variation, defects). Monolithic (single-chip) processor designs are running out of steam too © Hardavellas
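The two pin counts on this slide are consistent with a square grid of contacts at the stated spacing. The sketch below checks that arithmetic. It assumes a square die and substrate fully covered by an area array, and it treats the 150μm and 0.8mm figures as contact pitch; the function name is illustrative and not from the talk.

```python
import math

def area_array_contacts(side_mm: float, pitch_mm: float) -> int:
    """Contacts in a square grid that covers a square of the given side length."""
    per_side = math.floor(side_mm / pitch_mm)
    return per_side * per_side

# 580 mm2 die, assumed square, with die-to-package contacts at 150 um pitch.
die_side_mm = math.sqrt(580.0)
print(area_array_contacts(die_side_mm, 0.150))  # -> 25600

# 5x5 cm substrate with substrate-to-board contacts at 0.8 mm pitch.
print(area_array_contacts(50.0, 0.8))           # -> 3844
```

Both results match the slide, which supports reading the spacing figures as pitch.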

Demand for High-Performance Computing Grows SPEC, TPC dataset growth: faster than Moore's Law. Same trends in scientific, personal computing. Large Hadron Collider, March’11: 1.6PB data (Tier-1). Large Synoptic Survey Telescope: 30 TB/night, 2x Sloan Digital Sky Surveys/day. Sloan: more data than entire history of astronomy before it. More data -> more computing power to process it

Galaxy: Optically-Connected Disintegrated Processors [WINDS 2010, ICS 2014] Physical constraints limit single-chip designs: area, yield, power, bandwidth. Multi-chip designs break free of these limitations: processor disintegration, macro-chip integration

Electrical vs. Photonic Links [Nitta et al., 2013]

Outline Introduction ➔ Background Galaxy Architecture Experimental Methodology Results Sensitivity Studies Single-Chip Comparisons (Processor Disintegration) Multi-Chip Comparisons (Macrochip Integration) Thermal Modeling Conclude

Nanophotonic Components: off-chip laser source, coupler, waveguide, resonant modulators, Ge-doped resonant detectors. Selective: couple optical energy of a specific wavelength

Modulation and Detection [Figure: bit streams 11010101 and 10001011 at the modulators and at the detectors] 16-64 wavelengths DWDM; 5-20μm waveguide pitch; 10Gbps per link
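The DWDM figures on this slide imply an aggregate bandwidth per waveguide and a bandwidth density along a chip edge. The sketch below works through that arithmetic. It assumes that "10Gbps per link" means 10 Gbps per wavelength and that waveguides are laid side by side at the stated pitch; the function names are illustrative.

```python
def waveguide_bandwidth_gbps(wavelengths: int, gbps_per_wavelength: float = 10.0) -> float:
    """Aggregate bandwidth of one waveguide carrying `wavelengths` DWDM channels."""
    return wavelengths * gbps_per_wavelength

def edge_density_tbps_per_mm(wavelengths: int, pitch_um: float,
                             gbps_per_wavelength: float = 10.0) -> float:
    """Bandwidth per millimeter of edge when waveguides sit at `pitch_um` spacing."""
    waveguides_per_mm = 1000.0 / pitch_um
    per_waveguide = waveguide_bandwidth_gbps(wavelengths, gbps_per_wavelength)
    return waveguides_per_mm * per_waveguide / 1000.0

# Least dense corner of the quoted ranges: 16 wavelengths at 20 um pitch.
print(waveguide_bandwidth_gbps(16), edge_density_tbps_per_mm(16, 20.0))  # -> 160.0 8.0
# Densest corner: 64 wavelengths at 5 um pitch.
print(waveguide_bandwidth_gbps(64), edge_density_tbps_per_mm(64, 5.0))   # -> 640.0 128.0
```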

Outline Introduction Background ➔ Galaxy Architecture Experimental Methodology Results Sensitivity Studies Single-Chip Comparisons (Processor Disintegration) Multi-Chip Comparisons (Macrochip Integration) Thermal Modeling Conclude

Optical Crossbar

Routing Example

Single Chiplet Connectivity

Galaxy Architecture (5-chiplet example) 200mm2 die, 128 cores/chiplet, 9 chiplets, 16cm fiber: > 1K cores; 256 cores/chiplet, 17 chiplets: > 4K cores. 10 radix-8 MWSR crossbars, 64-bit flits, 16-way DWDM. Data: 320 fibers, 40960 rings; arbitration: 20 fibers, 3840 rings; forwarded clock: 10 fibers, 80 rings
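The slide does not explain how its fiber and ring totals are derived. The sketch below is one reading that reproduces the data and forwarded-clock totals. It assumes one data channel per reader in each MWSR crossbar, one ring per bit for every node attached to a channel, and one clock fiber per crossbar with one ring per node. The arbitration totals are not derived here, and all names are illustrative.

```python
def data_resources(crossbars: int = 10, radix: int = 8,
                   flit_bits: int = 64, dwdm: int = 16) -> tuple[int, int]:
    """Return (fibers, rings) for the data network under the assumptions above."""
    fibers_per_channel = flit_bits // dwdm   # 64 bits over 16 wavelengths -> 4 fibers
    channels = crossbars * radix             # assumed: one channel per reader
    fibers = channels * fibers_per_channel
    rings = channels * radix * flit_bits     # assumed: every node has a ring per bit
    return fibers, rings

def clock_resources(crossbars: int = 10, radix: int = 8) -> tuple[int, int]:
    """Return (fibers, rings) assuming one clock fiber per crossbar, one ring per node."""
    return crossbars, crossbars * radix

def total_cores(cores_per_chiplet: int, chiplets: int) -> int:
    return cores_per_chiplet * chiplets

print(data_resources())      # -> (320, 40960)
print(clock_resources())     # -> (10, 80)
print(total_cores(128, 9))   # -> 1152, i.e. > 1K cores
print(total_cores(256, 17))  # -> 4352, i.e. > 4K cores
```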

Galaxy MWSR Optical Crossbar MWSR avoids broadcast data bus, but requires arbitration

Why Fibers and not SOI Waveguides? More than twice as fast: 0.676c (fiber) vs. 0.286c (SOI waveguide). Negligible optical loss: 0.2dB/km (fiber) vs. 0.3dB/cm (SOI waveguide). Fibers are flexible -> do not restrict the design to a 2D plane. Minimize thermal transfer -> cheap cooling. Overlooked due to density concerns: fibers at 250μm pitch, waveguides at 20μm pitch
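To make the comparison concrete, the sketch below computes propagation delay and optical loss over a 16 cm link, the fiber length quoted on the Galaxy Architecture slide. It assigns the higher velocity and the per-kilometer loss figure to fiber. Helper names are illustrative.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum

def propagation_ns(length_m: float, velocity_fraction_of_c: float) -> float:
    """One-way propagation delay in nanoseconds."""
    return length_m / (velocity_fraction_of_c * C_M_PER_S) * 1e9

def loss_db(length_cm: float, db_per_cm: float) -> float:
    """Propagation loss in dB over the given length."""
    return length_cm * db_per_cm

LENGTH_CM = 16.0

fiber_ns = propagation_ns(LENGTH_CM / 100.0, 0.676)
waveguide_ns = propagation_ns(LENGTH_CM / 100.0, 0.286)

fiber_db = loss_db(LENGTH_CM, 0.2 / 100_000.0)  # 0.2 dB/km expressed per cm
waveguide_db = loss_db(LENGTH_CM, 0.3)          # 0.3 dB/cm

print(f"fiber:     {fiber_ns:.2f} ns, {fiber_db:.6f} dB")
print(f"waveguide: {waveguide_ns:.2f} ns, {waveguide_db:.1f} dB")
```

Under these assumptions the fiber takes about 0.79 ns with negligible loss, while the SOI waveguide takes about 1.87 ns and loses 4.8 dB.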

Dense Off-Chip Coupling 116 mm2 chiplets -> 43mm in length along the chip edge -> 172 fibers at 250μm pitch. Dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]: ~3.8dB loss, 8 Tbps/mm demonstrated. Misalignment within <0.7μm, 0.4μm, 0.7μm> -> loss <1 dB. Loss comparable to optical proximity couplers
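The 43 mm and 172-fiber figures follow from simple geometry. The sketch below assumes a square chiplet and that fibers can be attached along its whole perimeter; the function name is illustrative.

```python
import math

def perimeter_and_fibers(area_mm2: float, pitch_um: float) -> tuple[float, int]:
    """Perimeter of a square chiplet and the number of fibers that fit along it."""
    side_mm = math.sqrt(area_mm2)
    perimeter_mm = 4.0 * side_mm
    fibers = math.floor(perimeter_mm / (pitch_um / 1000.0))
    return perimeter_mm, fibers

perimeter_mm, fibers = perimeter_and_fibers(116.0, 250.0)
print(f"{perimeter_mm:.1f} mm of edge, {fibers} fibers")  # -> 43.1 mm of edge, 172 fibers
```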

Outline Introduction Background Galaxy Architecture ➔ Experimental Methodology Results Sensitivity Studies Single-Chip Comparisons (Processor Disintegration) Multi-Chip Comparisons (Macrochip Integration) Thermal Modeling Conclude

Nanophotonic Parameters mod/demodulation energy: 150 fJ/bit @ 10 GHz. Power generation & delivery are typically excluded. Additional coupling loss -> 2.9W. 25% efficiency -> wall-socket power 12W
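The step from 2.9 W to roughly 12 W at the wall socket is a division by the laser efficiency. The sketch below shows that step, along with the per-wavelength power implied by the modulation energy. It assumes the 2.9 W figure is the laser output required after coupling loss and that "10 GHz" means one bit per cycle (10 Gbps); the function names are illustrative.

```python
def wall_socket_power_w(laser_output_w: float, efficiency: float) -> float:
    """Electrical power drawn to produce the given laser output."""
    return laser_output_w / efficiency

def modulation_power_mw(energy_fj_per_bit: float, rate_gbps: float) -> float:
    """Average mod/demodulation power of one wavelength at full utilization."""
    watts = (energy_fj_per_bit * 1e-15) * (rate_gbps * 1e9)
    return watts * 1e3

print(f"{wall_socket_power_w(2.9, 0.25):.1f} W")    # 11.6 W, which the slide rounds to 12W
print(f"{modulation_power_mw(150.0, 10.0):.1f} mW") # 1.5 mW per wavelength
```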

Architectural Parameters Corona: 256-wide data channels, 80 crossbars, 16cm WG 10ns OCM, 2ns 3D-mem

Modeling Infrastructure SimFlex sampling 95% confidence photonic-layer ring heating target: 16nm node Workloads: SPLASH and scientific 3D-stack model

Outline Introduction Background Galaxy Architecture Experimental Methodology Results ➔ Sensitivity Studies Single-Chip Comparisons (Processor Disintegration) Multi-Chip Comparisons (Macrochip Integration) Thermal Modeling Conclude

Laser Power Sensitivity to Optical Parameters [Figure panels: Coupler Loss; Waveguide & Filter Drop Loss; Off-Ring Loss; Modulator Insertion Loss] Highly sensitive to coupler loss, insensitive to other losses

Sensitivity to Fiber Density 116mm2 chiplets -> 43mm along the chip edge. Enough room for 172 fibers @ 250μm pitch. 128 fibers: within 3% of max performance

Outline Introduction Background Galaxy Architecture Experimental Methodology Results Sensitivity Studies ➔ Single-Chip Comparisons (Processor Disintegration) Multi-Chip Comparisons (Macrochip Integration) Thermal Modeling Conclude

Performance Against “Unlimited” Designs Speedup of (power+bandwidth)-constrained design Speedup of bandwidth-constrained design Speedup of power-constrained design Speedup of unconstrained design Galaxy matches the performance of “unlimited” designs

Performance Against “Realistic” Designs Realistic: within power and bandwidth envelopes. Galaxy chiplets within 66.2°C -> chiplets run at max speed. Galaxy: 2.4x - 3.2x speedup on average (3.4x max)

Energy-Delay Product Galaxy: 2.4x-2.8x smaller EDP on average (7.1x max)

Outline Introduction Background Galaxy Architecture Experimental Methodology Results Sensitivity Studies Single-Chip Comparisons (Processor Disintegration) ➔ Multi-Chip Comparisons (Macrochip Integration) Thermal Modeling Conclude

Comparison Against Multi-Chip Alternatives SerDes on FR-4 incurs significant energy consumption or long delays (20 pJ/bit typically, and at best 2.5 pJ/bit and 2.5 ns latency over 4 inches of electrical strip)

Comparison Against Multi-Chip Alternatives Fiber WG: 1.25x Galaxy: 2.5x speedup over Oracle Macrochip (6.8x max) 6x less laser power with demonstrated couplers

Outline Introduction Background Galaxy Architecture Experimental Methodology Results Sensitivity Studies Single-Chip Comparisons (Processor Disintegration) Multi-Chip Comparisons (Macrochip Integration) ➔ Thermal Modeling Conclude

80-core 5-chiplet Galaxy Thermal CFD Modeling 88.2°C, 8cm spacing, 45°C ambient. 8cm spacing allows cooling with cheap passive heatsinks

9-chiplet Dense Array (Oracle Macrochip) Tight arrangement points to liquid cooling requirement

9-chiplet Galaxy 2D 110°C Cooling 9 chiplets with passive heatsinks

9-chiplet Galaxy 3D 83.6°C Flexible fibers allow “virtual chip” to break free of 2D planar designs

Galaxy Summary “Virtual chips” with the performance of unlimited designs. Breaks free of typical physical constraints: large aggregate area; improved yield (break-even point: 60% yield for photonics); Tb/s/mm bandwidth density; pushes back power wall. Processor disintegration: 2.4x - 3.2x avg. speedup (3.4x max); 2.4x - 2.8x avg. smaller EDP (7.1x max). Macrochip integration: 2.5x speedup over Oracle Macrochip (6.8x max); 6x more power efficient links

High Laser Wall-Plug Power Laser power consumption is generally high: high optical loss components. Galaxy restricts sharers of an optical path to at most 8. High-radix crossbars are impractical: Radix-16 MWSR: 20.1W; Radix-64 MWSR: 78.1W. Coupling the off-chip laser on chip: 2.4x power loss (3.8 dB). WDM-compatible lasers: 5-10% efficiency. What if we can power-gate the laser? Off-chip lasers: long latencies (10-16ns). On-chip Ge-doped lasers: 1ns on/off delay
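The "2.4x power loss (3.8 dB)" figure is the standard decibel-to-ratio conversion. The sketch below performs it, then combines it with the quoted 5-10% laser efficiency to estimate how much wall-plug power each watt of on-chip optical power costs. Multiplying the two factors is an assumption made for illustration, not a model taken from the talk, and the function names are illustrative.

```python
def db_to_power_ratio(loss_db: float) -> float:
    """Convert a loss in dB into a linear power ratio."""
    return 10.0 ** (loss_db / 10.0)

def wall_plug_w_per_onchip_w(coupling_loss_db: float, laser_efficiency: float) -> float:
    """Wall-plug watts needed per watt of optical power delivered on chip."""
    return db_to_power_ratio(coupling_loss_db) / laser_efficiency

print(f"{db_to_power_ratio(3.8):.2f}x")                # -> 2.40x
print(f"{wall_plug_w_per_onchip_w(3.8, 0.10):.0f} W")  # at 10% efficiency
print(f"{wall_plug_w_per_onchip_w(3.8, 0.05):.0f} W")  # at 5% efficiency
```

Under these assumptions, one watt of on-chip optical power costs roughly 24 W to 48 W at the wall, which is the motivation for power-gating the laser.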

EcoLaser MWSR Crossbar and Router Architecture

EcoLaser Energy/Flit for Radix-16 MWSR

EcoLaser + AdaptiveWidth for Radix-16 SWMR EcoLaser power savings -> higher power budget for cores -> 2x speedup

PARAG@N: Energy-Efficient Computing. Galaxy: nanophotonics to overcome physical single-chip limitations [WINDS’10, ICS’14]: processor disintegration, macrochip integration; arch/nanophotonics intersection. SeaFire: Design for Dark Silicon [IEEE Micro’11, USENIX-Login’11]: we cannot power up an entire chip; heterogeneous/specialized designs. Elastic Fidelity [CoRR abs/1111.4279]: some errors are ok; allow a few errors to make computers power efficient. Elastic Caches [ISCA’09, IEEE Micro’10, DATE’12, IEEE Computer’13]: dynamically adapt on-chip storage to workload requirements disciplined

Thank You!

BACKUP SLIDES

Chip Power Scaling [Azizi 2010] Chip power does not scale

Voltage Scaling Has Slowed 100nm -> 28nm: 1.2V -> 0.85V. In the last decade: 13x transistors but 30% lower voltage. Cannot run all transistors fast enough
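Both headline numbers on this slide can be recovered from the node and voltage figures. The sketch below does so, assuming transistor density scales with the inverse square of the feature size; the function names are illustrative.

```python
def density_gain(old_node_nm: float, new_node_nm: float) -> float:
    """Transistor density gain, assuming density scales as 1 / (feature size)^2."""
    return (old_node_nm / new_node_nm) ** 2

def voltage_reduction(old_v: float, new_v: float) -> float:
    """Fractional drop in supply voltage."""
    return 1.0 - new_v / old_v

print(f"{density_gain(100.0, 28.0):.1f}x transistors")     # -> 12.8x, i.e. about 13x
print(f"{voltage_reduction(1.2, 0.85):.0%} lower voltage")  # -> 29%, i.e. about 30%
```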

Pin Bandwidth Scaling 580 mm2 die: 25600 pins to package at 150μm; 5x5 cm: 3844 substrate-to-board at 0.8mm [TU Berlin]. Cannot feed cores with data fast enough to keep them busy

Electrical (SerDes) vs. SOI Waveguides vs. Fibers

SWMR vs. MWSR Crossbar Single-Writer Multiple-Reader (SWMR): broadcast bus; all receivers always read; on-rings -> optical loss; high laser power. Multiple-Writer Single-Reader (MWSR): only one receiver reads; only one ring is on -> low loss; low laser power; needs arbitration
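The contrast between the two crossbars can be illustrated by counting the detector rings that sit on-resonance on one channel. This is a simplified reading of the slide: it assumes one detector ring per wavelength per reader, and that in SWMR every node other than the writer listens. The function and its parameters are illustrative.

```python
def on_resonance_detector_rings(scheme: str, radix: int, wavelengths: int) -> int:
    """Detector rings drawing light from one channel under the assumptions above."""
    if scheme == "SWMR":
        # Broadcast bus: every reader keeps its rings on for every wavelength.
        return (radix - 1) * wavelengths
    if scheme == "MWSR":
        # Only the single reader that owns the channel has its rings on.
        return wavelengths
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("SWMR", "MWSR"):
    print(scheme, on_resonance_detector_rings(scheme, radix=8, wavelengths=16))
# -> SWMR 112
# -> MWSR 16
```

Fewer on-resonance rings in the light path means less optical loss to compensate with laser power, at the cost of arbitrating among the multiple writers.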

Token-Based Arbitration 8 cycles on average for token arbitration (5 chiplets)
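As a rough illustration of why token arbitration adds latency, the sketch below computes the mean wait for a token circulating on a unidirectional ring. It is a toy model, not Galaxy's arbitration protocol. The `hop_cycles` parameter is hypothetical, since the slide does not give the per-hop latency, so the code does not attempt to reproduce the 8-cycle figure.

```python
from itertools import product

def mean_token_wait_hops(nodes: int) -> float:
    """Mean hops until the token reaches a requester, averaged over every
    (token position, requester) pair on a unidirectional ring."""
    waits = [(requester - token) % nodes
             for token, requester in product(range(nodes), repeat=2)]
    return sum(waits) / len(waits)

def mean_token_wait_cycles(nodes: int, hop_cycles: float) -> float:
    """Scale the hop count by a hypothetical per-hop latency in cycles."""
    return mean_token_wait_hops(nodes) * hop_cycles

print(mean_token_wait_hops(5))  # -> 2.0
```

In this model the wait grows linearly with the number of nodes sharing the token, which is one reason to bound the number of sharers on an optical path.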

Load Latency (uniform random traffic)

Load-Latency Curves Buffer depth -> 16 tokens. Congested traffic: 72% utilization, 0.18 per-router injection rate. 16 tokens provide optimal buffer depth

Tapered vs. Optical Proximity Couplers 6x less laser power than Oracle Macrochip with demonstrated couplers

Energy per Instruction Galaxy: 12-20% lower energy/instruction on average (up to 2.3x less)

EcoLaser Backup

EcoLaser SWMR Crossbar and Router Architecture

EcoLaser 3-bit Token and Laser Controller FSM

EcoLaser Writer Node FSM

EcoLaser Nanophotonic Parameters

EcoLaser Energy/Flit for Radix-16 SWMR

EcoLaser Latency Impact on Radix-16 MWSR

EcoLaser Latency Impact on Radix-16 SWMR

EcoLaser Speedup for Radix-64 SWMR EcoLaser power savings -> ~2x speedup

EcoLaser Speedup for Radix-64 MWSR EcoLaser power savings -> ~2x speedup