Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects Nikos Hardavellas – Parallel Architecture Group.

Presentation transcript:

Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects. Nikos Hardavellas, Parallel Architecture Group, Northwestern University. Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik.

Chip Power Scaling © Hardavellas 2 Chip power does not scale [Azizi 2010]

Voltage Scaling Has Slowed: in the last decade, 13x the transistors but only 30% lower voltage. Cannot run all transistors fast enough.

Pin Bandwidth Scaling [TU Berlin]: cannot feed cores with data fast enough to keep them busy.

Data Scaling: SPEC and TPC dataset growth is faster than Moore's law, with the same trends in scientific and personal computing. Large Hadron Collider, March 2011: 1.6 PB of data (Tier-1). Large Synoptic Survey Telescope: 30 TB/night, 2x Sloan Digital Sky Surveys per day. Sloan: more data than the entire history of astronomy before it. More data means more computing power is needed to process it.

Galaxy: Optically-Connected Disintegrated Processors [Pan, WINDS 2010]. Physical constraints limit single-chip designs: area, yield, power, bandwidth. Multi-chip designs break free of these limitations through processor disintegration and macro-chip integration.

Outline: Introduction; Background; Galaxy Architecture; Experimental Methodology; Results; Sensitivity Studies; Single-Chip Comparisons (Processor Disintegration); Multi-Chip Comparisons (Macrochip Integration); Thermal Modeling; Conclude; Overview of Other Research.

Nanophotonic Components: off-chip laser source, coupler, resonant modulators, resonant detectors (Ge-doped), waveguide. Selective: each resonator couples optical energy of a specific wavelength.

Modulation and Detection: wavelengths DWDM; μm waveguide pitch; 10 Gbps per link; 8 Tbps/mm bandwidth density or more.
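
The relationship between per-link rate, DWDM wavelength count, and waveguide pitch on this slide can be sketched as a small calculation. This is a minimal sketch that assumes "10 Gbps per link" means per wavelength; the function name and the example wavelength count and pitch are hypothetical, since the slide's own values did not survive extraction.

```python
def bandwidth_density_tbps_per_mm(wavelengths, gbps_per_wavelength, pitch_um):
    """Aggregate bandwidth crossing one millimetre of a waveguide array, in Tbps/mm."""
    if wavelengths <= 0 or gbps_per_wavelength <= 0 or pitch_um <= 0:
        raise ValueError("all parameters must be positive")
    waveguides_per_mm = 1000.0 / pitch_um                 # 1 mm = 1000 um
    gbps_per_waveguide = wavelengths * gbps_per_wavelength
    return gbps_per_waveguide * waveguides_per_mm / 1000.0  # Gbps -> Tbps


# Hypothetical inputs, NOT the slide's values (those were lost in extraction).
example = bandwidth_density_tbps_per_mm(
    wavelengths=32, gbps_per_wavelength=10.0, pitch_um=25.0
)
print(f"{example:.1f} Tbps/mm")
```

Density scales linearly with the wavelength count and inversely with the pitch, so halving the pitch doubles the result.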

Galaxy Architecture

Routing Example

Galaxy Architecture

Galaxy MWSR Optical Crossbar: a multiple-writer single-reader (MWSR) crossbar is more energy-efficient than a single-writer multiple-reader (SWMR) one at that scale. MWSR avoids a broadcast bus, but requires arbitration.

Token-Based Arbitration: 8 cycles on average for token arbitration (5 chiplets).
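
To illustrate why token arbitration adds a wait that grows with the number of chiplets, here is a minimal sketch of an uncontended token ring. It assumes a single token circulating in one direction and a fixed per-hop latency; the function names and `hop_cycles=2` are hypothetical, and the sketch does not model contention or the actual Galaxy protocol, so it is not expected to reproduce the slide's 8-cycle figure.

```python
def token_wait_cycles(requester, token_pos, n_nodes, hop_cycles):
    """Cycles until a token travelling one way around the ring reaches the requester."""
    hops = (requester - token_pos) % n_nodes
    return hops * hop_cycles


def average_token_wait(n_nodes, hop_cycles):
    """Mean uncontended wait over every (requester, token position) pair."""
    waits = [
        token_wait_cycles(r, t, n_nodes, hop_cycles)
        for r in range(n_nodes)
        for t in range(n_nodes)
    ]
    return sum(waits) / len(waits)


# 5 chiplets as on the slide; hop_cycles=2 is a made-up latency for illustration.
print(average_token_wait(n_nodes=5, hop_cycles=2))
```

In this model the mean wait is `hop_cycles * (n_nodes - 1) / 2`, so it grows linearly with the ring size.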

Dense Off-Chip Coupling: dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]; ~3.8 dB loss, 8 Tbps/mm demonstrated; misalignment loss <1 dB; loss comparable to optical proximity couplers.

Nanophotonic Parameters

Architectural Parameters

Modeling Infrastructure: 3D-stack model; SimFlex sampling, 95% confidence; photonic-layer ring heating.

Load-Latency Curves: tokens provide optimal buffer depth.

Laser Power Sensitivity to Optical Parameters (panels: coupler loss, off-ring loss, waveguide and filter drop loss, modulator insertion loss): highly sensitive to coupler loss, insensitive to other losses.
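
A simple link-budget sketch shows how a loss in dB translates into required laser power. It assumes the laser must deliver the detector sensitivity plus the sum of all path losses, and that a component may be crossed more than once. The sensitivity of -20 dBm, the crossing counts, and the non-coupler losses are hypothetical; only the ~3.8 dB coupling loss comes from the slides. It does not claim to be the model behind the slide's result.

```python
def required_laser_power_mw(sensitivity_dbm, losses):
    """Laser power per wavelength in mW.

    losses: iterable of (loss_db, times_crossed) pairs along one optical path.
    """
    total_db = sum(loss_db * times for loss_db, times in losses)
    return 10 ** ((sensitivity_dbm + total_db) / 10.0)  # dBm -> mW


# Hypothetical path: a 3.8 dB coupler crossed twice, plus two smaller losses.
base = required_laser_power_mw(-20.0, [(3.8, 2), (1.0, 1), (0.5, 1)])
# Same path with the coupler 1 dB worse.
worse_coupler = required_laser_power_mw(-20.0, [(4.8, 2), (1.0, 1), (0.5, 1)])
print(worse_coupler / base)
```

Because power is exponential in dB, each extra dB on a component crossed `k` times multiplies the required power by `10 ** (k / 10)`.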

Sensitivity to Fiber Density: 116 mm² chiplets, 43 mm along the chip edge; enough room for μm pitch; fibers: within 3% of max performance.
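
The geometry on this slide can be checked with a short calculation. Assuming a square chiplet (an assumption, not stated on the slide), a 116 mm² die has a perimeter of about 43 mm, which agrees with the slide's edge figure. The 250 μm fiber pitch below is hypothetical, since the slide's pitch value was lost, and the function names are illustrative.

```python
import math


def square_perimeter_mm(area_mm2):
    """Perimeter of a square die with the given area."""
    return 4.0 * math.sqrt(area_mm2)


def fibers_that_fit(edge_length_mm, pitch_um):
    """Whole number of fibers that fit along an edge at a given centre-to-centre pitch."""
    if pitch_um <= 0:
        raise ValueError("pitch must be positive")
    return math.floor(edge_length_mm * 1000.0 / pitch_um)


perimeter = square_perimeter_mm(116.0)
print(round(perimeter, 2))                          # roughly 43 mm, as on the slide
print(fibers_that_fit(perimeter, pitch_um=250.0))   # 250 um is a hypothetical pitch
```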

Performance Against Unlimited Designs: unlimited power (max speed of design, irrespective of temperature). Mesh_20MC and Corona_20MC: also unlimited bandwidth (20 MCs per chip, 5x more pins). Galaxy matches the performance of unlimited designs.

Performance Against Realistic Designs: realistic means within power and bandwidth envelopes. Galaxy chiplets stay within 66.2 °C, so chiplets run at max speed. Galaxy: 2.2x speedup on average (3.4x max).

Energy-Delay Product: cool chiplets minimize leakage. Galaxy: 2.4x-2.8x smaller EDP on average (6.8x max).
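
For reference, the energy-delay product is energy multiplied by execution time, so an EDP improvement combines a speedup with any change in energy. The sketch below uses hypothetical helper names. The energy ratio of 0.9 is made up for illustration, and per-workload averages do not in general compose this way, so the output is not a reconstruction of the slide's numbers.

```python
def energy_delay_product(energy_j, delay_s):
    """EDP in joule-seconds."""
    return energy_j * delay_s


def edp_improvement(speedup, energy_ratio):
    """EDP_base / EDP_new, given speedup = T_base / T_new and energy_ratio = E_new / E_base."""
    if speedup <= 0 or energy_ratio <= 0:
        raise ValueError("speedup and energy_ratio must be positive")
    return speedup / energy_ratio


# At equal energy, the EDP gain equals the speedup.
print(edp_improvement(speedup=2.2, energy_ratio=1.0))
# A hypothetical 10% energy saving on top of the same speedup.
print(edp_improvement(speedup=2.2, energy_ratio=0.9))
```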

Comparison Against Multi-Chip Alternatives

Comparison Against Multi-Chip Alternatives: Fiber Galaxy achieves a 2.5x speedup over Oracle Macrochip (6.8x max).

Tapered vs. Optical Proximity Couplers: 6x less laser power than Oracle Macrochip with demonstrated couplers.

80-core 5-chiplet Galaxy, Thermal CFD Modeling: 8 cm spacing allows cooling with cheap passive heatsinks.

9-chiplet Dense Array (Oracle Macrochip): tight arrangement points to a liquid cooling requirement.

9-chiplet Galaxy 2D: cooling 9 chiplets with passive heatsinks.

9-chiplet Galaxy 3D: flexible fibers allow the virtual chip to break free of 2D planar designs.

Galaxy Summary: virtual chips with the performance of unlimited designs. Breaks free of typical physical constraints: large aggregate area; improved yield (break-even point: 60% yield for photonics); Tb/s/mm bandwidth density; pushes back the power wall. Processor disintegration: 2.2x avg. speedup (3.4x max); 2.4x-2.8x avg. smaller EDP (6.8x max). Macrochip integration: 2.5x speedup over Oracle Macrochip (6.8x max); 6x more power-efficient links.

Energy is Shaping the IT Industry: #1 of the Grand Challenges for Humanity in the Next 50 Years [Smalley Institute for Nanoscale Research and Technology, Rice U.]. Computing worldwide: ~408 TWh in 2010 [Gartner]. Datacenter energy consumption in the US: ~150 TWh in 2011 [EPA]; 3.8% of domestic power generation, $15B. CO₂-equivalent emissions comparable to the airline industry (2%). Carbon footprint of the world's data centers comparable to that of the Czech Republic. 20 MW: 200x lower energy/instr. (2 nJ to 10 pJ); 3% of the output of an average nuclear plant! 10% annual growth in installed computers worldwide [Gartner]. Exponential increase in energy consumption.

Overall Focus: Energy-Efficient Computing. Where does energy go? Integer add: 0.5 pJ; FP-FMA: 50 pJ. Data movement: 1200 pJ across a 400 mm² chip, 16000 pJ for memory. Elastic caches: minimize data transfers by adapting caches to workload demands [ISCA09, IEEEMicro10, DATE12]. Processing: ~1500 pJ to schedule the operation. SeaFire: specialized computing on dark silicon to eliminate general-purpose computing's overheads [IEEEMicro11, USENIX-Login11]. Circuits: wide voltage guardbands; low voltages and process variation lead to timing errors, which lead to computing errors. Elastic fidelity: allow errors at select code/data segments to save energy while maintaining a fidelity contract with the user [CoRR abs/ ]. Chips are fundamentally limited by physical constraints and need to break free. Galaxy: processor disintegration/macrochip integration using photonic interconnects [WINDS10].
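
The per-operation energies quoted on this slide can be compared directly. The sketch below uses only the slide's figures; the dictionary keys and helper names are hypothetical, and adding the costs into a single per-operation total is a simplifying assumption rather than the author's accounting.

```python
# Energy per event in picojoules, as quoted on the slide.
ENERGY_PJ = {
    "integer_add": 0.5,
    "fp_fma": 50.0,
    "cross_chip_move": 1200.0,  # data movement across a 400 mm^2 chip
    "memory_access": 16000.0,
    "schedule_op": 1500.0,      # approximate ("~1500pJ" on the slide)
}


def overhead_vs(op, baseline="fp_fma"):
    """How many times more energy `op` costs than the baseline operation."""
    return ENERGY_PJ[op] / ENERGY_PJ[baseline]


def useful_fraction(compute_op, overheads):
    """Share of total energy spent on the arithmetic itself (simplified additive model)."""
    total = ENERGY_PJ[compute_op] + sum(ENERGY_PJ[o] for o in overheads)
    return ENERGY_PJ[compute_op] / total


for op in ("schedule_op", "cross_chip_move", "memory_access"):
    print(op, overhead_vs(op))
print(useful_fraction("fp_fma", ["schedule_op", "memory_access"]))
```

Under this additive model, an FP-FMA that is scheduled by a general-purpose core and fetches an operand from memory spends well under 1% of its energy on the arithmetic itself.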

Thank You!

Overcoming Data Movement and Processing Overheads. Elastic caches: adapt the cache to workload demands. Significant energy is spent on data movement and coherence requests. Co-locate data, metadata, and computation. Decouple address from placement location. Capitalize on existing OS events to simplify hardware. Cut on-chip interconnect traffic by half. SeaFire: specialized computing on dark silicon. Repurpose dark silicon to implement specialized cores. The application cherry-picks a few cores; the rest of the chip is powered off. Vast unused area allows many specialized cores, so good matches are likely. 12x lower energy (conservative).

Overcoming Voltage Guardbands. Elastic fidelity: selectively trade accuracy for energy. We don't always need 100% accuracy, but hardware always provides it. Language constructs specify the required fidelity for code/data segments. Steer computation to execution/storage units with appropriate fidelity and lower voltage. 35% lower energy. Figure panels: no errors vs. 10% errors.