Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects Nikos Hardavellas – Parallel Architecture Group.

Presentation transcript:

Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects. Nikos Hardavellas, Parallel Architecture Group, Northwestern University. Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik.

Chip Power Scaling © Hardavellas 2 Chip power does not scale [Azizi 2010]

Voltage Scaling Has Slowed. In the last decade: 13x transistors but only 30% lower voltage. Cannot run all transistors fast enough.
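
The two figures on this slide can be combined in a back-of-the-envelope estimate. The sketch below uses the textbook CMOS dynamic-power relation (power proportional to transistor count x capacitance x voltage squared x frequency). Holding per-transistor capacitance and frequency constant is an illustrative assumption, not something the slides state, and the function name is hypothetical.

```python
def relative_dynamic_power(transistor_ratio, voltage_ratio,
                           freq_ratio=1.0, cap_ratio=1.0):
    """Relative dynamic power under P ~ N * C * V^2 * f.

    cap_ratio and freq_ratio default to 1.0 (unchanged), which is an
    assumption made only for illustration.
    """
    return transistor_ratio * cap_ratio * voltage_ratio ** 2 * freq_ratio


# Slide figures: 13x transistors, 30% lower voltage (0.70x).
power_growth = relative_dynamic_power(13.0, 0.70)
print(f"Relative dynamic power: {power_growth:.2f}x")
```

Under these assumptions the voltage reduction recovers only about half of the growth in transistor count, which is consistent with the slide's point that not all transistors can be run at full speed.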

Pin Bandwidth Scaling. Cannot feed cores with data fast enough to keep them busy. [TU Berlin]

Data Scaling. SPEC, TPC dataset growth: faster than Moore. Same trends in scientific and personal computing. Large Hadron Collider, March '11: 1.6 PB of data (Tier-1). Large Synoptic Survey Telescope: 30 TB/night, 2x Sloan Digital Sky Surveys per day. Sloan: more data than the entire history of astronomy before it. More data means more computing power is needed to process it.

Galaxy: Optically-Connected Disintegrated Processors. Physical constraints limit single-chip designs: area, yield, power, bandwidth. Multi-chip designs break free of these limitations: processor disintegration, macro-chip integration. [Pan, WINDS 2010]

Outline: Introduction; Background; Galaxy Architecture; Experimental Methodology; Results; Sensitivity Studies; Single-Chip Comparisons (Processor Disintegration); Multi-Chip Comparisons (Macrochip Integration); Thermal Modeling; Conclude; Overview of Other Research.

Nanophotonic Components. Components shown: off-chip laser source, coupler, resonant modulators, resonant detectors, Ge-doped waveguide. Selective: couple optical energy of a specific wavelength.

Modulation and Detection. DWDM wavelengths per waveguide, μm-scale waveguide pitch, 10 Gbps per link: 8 Tbps/mm bandwidth density or more.
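
The bandwidth-density figure follows from simple arithmetic: wavelengths per waveguide times waveguides per millimeter times the per-link rate. The sketch below is a minimal illustration, assuming that "link" means one wavelength channel. The function names are hypothetical, and the wavelength count and pitch are left as parameters rather than guessed.

```python
def bandwidth_density_tbps_per_mm(wavelengths_per_waveguide,
                                  waveguide_pitch_um, gbps_per_link):
    """Aggregate bandwidth per mm of chip edge, in Tbps/mm."""
    waveguides_per_mm = 1000.0 / waveguide_pitch_um
    total_gbps = wavelengths_per_waveguide * waveguides_per_mm * gbps_per_link
    return total_gbps / 1000.0


def links_per_mm_needed(target_tbps_per_mm, gbps_per_link):
    """Number of wavelength channels per mm required to hit a target density."""
    return target_tbps_per_mm * 1000.0 / gbps_per_link


# Slide figures: 8 Tbps/mm at 10 Gbps per link.
needed = links_per_mm_needed(8.0, 10.0)
print(f"Wavelength channels needed per mm of edge: {needed:.0f}")
```

Any combination of wavelength count and waveguide pitch that yields this many channels per millimeter reaches the quoted density.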

Galaxy Architecture

Routing Example

Galaxy Architecture

Galaxy MWSR Optical Crossbar. More energy-efficient than SWMR (single-writer, multiple-reader) at that scale. MWSR (multiple-writer, single-reader) avoids a broadcast bus, but requires arbitration.

Token-Based Arbitration. 8 cycles on average for token arbitration (5 chiplets).
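
The idea of token arbitration on a multiple-writer, single-reader channel can be sketched as a toy model: one token circulates among the writers, and only the writer holding the token may send. This is an abstract illustration, not the Galaxy implementation. The class name, the one-message-per-visit policy, and the step granularity are all assumptions, and the model does not attempt to reproduce the 8-cycle average quoted on the slide.

```python
from collections import deque


class TokenChannel:
    """Toy MWSR channel: many writers, one reader, one circulating token."""

    def __init__(self, num_writers):
        self.num_writers = num_writers
        self.token_at = 0  # index of the writer the token is visiting
        self.queues = [deque() for _ in range(num_writers)]
        self.delivered = []  # (writer, message) pairs seen by the reader

    def enqueue(self, writer, message):
        self.queues[writer].append(message)

    def step(self):
        """The token holder may send one message, then passes the token on."""
        writer = self.token_at
        if self.queues[writer]:
            self.delivered.append((writer, self.queues[writer].popleft()))
        self.token_at = (writer + 1) % self.num_writers


# Assumed setup: 5 chiplets, so one reader and four potential writers.
channel = TokenChannel(num_writers=4)
channel.enqueue(2, "a")
channel.enqueue(0, "b")
channel.enqueue(2, "c")
for _ in range(8):
    channel.step()
print(channel.delivered)
```

Because the token visits writers in a fixed order, no two writers ever drive the channel at once, and a writer with several queued messages must wait a full rotation between sends.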

Dense Off-Chip Coupling. Dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]: ~3.8 dB loss, 8 Tbps/mm demonstrated. Misalignment loss <1 dB. Loss comparable to optical proximity couplers.

Nanophotonic Parameters

Architectural Parameters

Modeling Infrastructure. 3D-stack model; SimFlex sampling, 95% confidence; photonic-layer ring heating.
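
Sampling-based simulation reports a mean together with a confidence interval. The sketch below shows one common way to compute a 95% interval from a set of sample measurements, using the normal approximation (z = 1.96). It is a generic statistics example, not SimFlex code, and the sample values are hypothetical.

```python
import math
import statistics


def mean_with_ci95(samples):
    """Return (mean, half_width) of a 95% confidence interval.

    Uses the normal approximation; for very small sample counts a
    Student-t multiplier would be more appropriate.
    """
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * stderr


# Hypothetical per-sample measurements (e.g. instructions per cycle).
mean, half_width = mean_with_ci95([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"{mean:.2f} +/- {half_width:.2f}")
```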

Load-Latency Curves © Hardavellas tokens provide optimal buffer depth

Laser Power Sensitivity to Optical Parameters. Parameters varied: coupler loss, off-ring loss, waveguide & filter drop loss, modulator insertion loss. Highly sensitive to coupler loss, insensitive to other losses.
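
Optical losses are quoted in dB, and every dB along the path multiplies the laser power needed to deliver a given power to the detector. The sketch below shows that conversion only. It is a simplified loss-budget model with hypothetical function names, and it does not reproduce the sensitivity study on the slide. The 3.8 dB value is the fiber-array loss quoted earlier in the talk.

```python
def db_to_power_factor(loss_db):
    """Multiplicative power factor corresponding to a loss in dB."""
    return 10 ** (loss_db / 10.0)


def required_laser_power_mw(power_at_detector_mw, losses_db):
    """Laser power needed so the detector still receives the target power.

    losses_db maps a loss name to its value in dB; dB losses add along a path.
    """
    total_db = sum(losses_db.values())
    return power_at_detector_mw * db_to_power_factor(total_db)


coupler_factor = db_to_power_factor(3.8)
print(f"A 3.8 dB loss multiplies required laser power by {coupler_factor:.2f}x")
```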

Sensitivity to Fiber Density. 116 mm² chiplets, 43 mm along the chip edge: enough room for fibers at a μm-scale pitch, within 3% of max performance.
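
The 43 mm figure is consistent with the perimeter of a square 116 mm² die, which the sketch below checks. Treating the chiplet as square and counting fibers along the full perimeter are assumptions made for illustration. The function names are hypothetical and the fiber pitch is left as a parameter.

```python
import math


def square_die_perimeter_mm(area_mm2):
    """Perimeter of a square die with the given area."""
    return 4.0 * math.sqrt(area_mm2)


def fibers_that_fit(edge_length_mm, fiber_pitch_um):
    """Number of fibers that fit along an edge at a given center-to-center pitch."""
    return math.floor(edge_length_mm * 1000.0 / fiber_pitch_um)


perimeter = square_die_perimeter_mm(116.0)
print(f"Perimeter of a square 116 mm^2 die: {perimeter:.1f} mm")
```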

Performance Against Unlimited Designs. Unlimited power (max speed of design, irrespective of temperature). Mesh_20MC and Corona_20MC: also unlimited bandwidth (20 memory controllers (MCs) per chip, 5x more pins). Galaxy matches the performance of unlimited designs.

Performance Against Realistic Designs. Realistic: within power and bandwidth envelopes. Galaxy chiplets stay within 66.2 °C, so chiplets run at max speed. Galaxy: 2.2x speedup on average (3.4x max).

Energy-Delay Product. Cool chiplets minimize leakage. Galaxy: 2.4x-2.8x smaller EDP on average (6.8x max).
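
Energy-delay product (EDP) is energy multiplied by execution time, so a design improves EDP both by finishing sooner and by using less energy. The sketch below illustrates the definition with hypothetical numbers. It is not derived from the measured results, and averages reported across workloads do not combine this simply.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def edp_improvement(base_energy, base_delay, new_energy, new_delay):
    """How many times smaller the new design's EDP is than the baseline's."""
    return edp(base_energy, base_delay) / edp(new_energy, new_delay)


# Hypothetical workload: same energy, but 2.2x shorter execution time.
improvement = edp_improvement(base_energy=1.0, base_delay=2.2,
                              new_energy=1.0, new_delay=1.0)
print(f"EDP is {improvement:.1f}x smaller")
```

With energy held equal, the EDP gain equals the speedup; any gain beyond the speedup has to come from lower energy, for example through reduced leakage in cooler chiplets.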

Comparison Against Multi-Chip Alternatives

Comparison Against Multi-Chip Alternatives. Fiber Galaxy: 2.5x speedup over Oracle Macrochip (6.8x max).

Tapered vs. Optical Proximity Couplers. 6x less laser power than Oracle Macrochip with demonstrated couplers.

80-core 5-chiplet Galaxy Thermal CFD Modeling. 8 cm spacing allows cooling with cheap passive heatsinks.

9-chiplet Dense Array (Oracle Macrochip). Tight arrangement points to a liquid cooling requirement.

9-chiplet Galaxy 2D. Cooling 9 chiplets with passive heatsinks.

9-chiplet Galaxy 3D. Flexible fibers allow the virtual chip to break free of 2D planar designs.

Galaxy Summary. Virtual chips with the performance of unlimited designs.
Breaks free of typical physical constraints: large aggregate area; improved yield (break-even point: 60% yield for photonics); Tb/s/mm bandwidth density; pushes back the power wall.
Processor disintegration: 2.2x avg. speedup (3.4x max); 2.4x-2.8x avg. smaller EDP (6.8x max).
Macrochip integration: 2.5x speedup over Oracle Macrochip (6.8x max); 6x more power-efficient links.

Energy is Shaping the IT Industry. #1 of the Grand Challenges for Humanity in the Next 50 Years [Smalley Institute for Nanoscale Research and Technology, Rice U.].
Computing worldwide: ~408 TWh in 2010 [Gartner].
Datacenter energy consumption in the US: ~150 TWh in 2011 [EPA]; 3.8% of domestic power generation, $15B.
CO2-equivalent emissions comparable to the airline industry (2%). Carbon footprint of the world's data centers comparable to the Czech Republic.
20 MW: 200x lower energy/instr. (2 nJ to 10 pJ); 3% of the output of an average nuclear plant!
10% annual growth in installed computers worldwide [Gartner].
Exponential increase in energy consumption.

Overall Focus: Energy-Efficient Computing. Where does energy go? Integer add: 0.5 pJ; FP-FMA: 50 pJ.
Data movement: 1200 pJ across a 400 mm² chip, 16000 pJ to memory. Elastic caches: minimize data transfers by adapting caches to workload demands [ISCA09, IEEEMicro10, DATE12].
Processing: ~1500 pJ to schedule the operation. SeaFire: specialized computing on dark silicon to eliminate general-purpose computing's overheads [IEEEMicro11, USENIX-Login11].
Circuits: wide voltage guardbands. Low voltages and process variation lead to timing errors, which lead to computing errors. Elastic fidelity: allow errors at select code/data segments to save energy while maintaining a fidelity contract with the user [CoRR abs/ ].
Chips are fundamentally limited by physical constraints and need to break free. Galaxy: processor disintegration/macrochip integration using photonic interconnects [WINDS10].
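
The energy figures on this slide are easier to compare as ratios against the useful work. The sketch below divides each overhead by the cost of one FP-FMA, using only the numbers quoted on the slide. The dictionary keys and the function name are illustrative.

```python
# Energy per event in picojoules, as quoted on the slide.
ENERGY_PJ = {
    "integer add": 0.5,
    "FP-FMA": 50.0,
    "data movement across chip": 1200.0,
    "schedule the operation": 1500.0,
    "memory access": 16000.0,
}


def overhead_vs(useful_op, costs=ENERGY_PJ):
    """Ratio of each cost to the energy of the chosen useful operation."""
    base = costs[useful_op]
    return {name: pj / base for name, pj in costs.items() if name != useful_op}


ratios = overhead_vs("FP-FMA")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:g}x the energy of one FP-FMA")
```

Moving an operand across the chip or scheduling the instruction each cost tens of times more than the arithmetic itself, which is the motivation given for elastic caches and SeaFire.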

Thank You!

Overcoming Data Movement and Processing Overheads.
Elastic caches: adapt the cache to workload demands. Significant energy is spent on data movement and coherence requests. Co-locate data, metadata, and computation. Decouple address from placement location. Capitalize on existing OS events to simplify hardware. Cut on-chip interconnect traffic by half.
SeaFire: specialized computing on dark silicon. Repurpose dark silicon to implement specialized cores. The application cherry-picks a few cores; the rest of the chip is powered off. Vast unused area allows many specialized cores, so applications are likely to find good matches. 12x lower energy (conservative).

Overcoming Voltage Guardbands. Elastic fidelity: selectively trade accuracy for energy. We don't always need 100% accuracy, but HW always provides it. Language constructs specify the required fidelity for code/data segments. Steer computation to exec/storage units with appropriate fidelity and lower voltage. 35% lower energy. Figure labels: no errors, 10% errors.