A Multi-Vdd Dynamic Variable-Pipeline On-Chip Router for CMPs


A Multi-Vdd Dynamic Variable-Pipeline On-Chip Router for CMPs
Hiroki Matsutani, Yuto Hirata, Michihiro Koibuchi, Kimiyoshi Usami, Hiroshi Nakamura, and Hideharu Amano
(Keio Univ, Japan; NII, Japan; Shibaura IT, Japan; Univ Tokyo, Japan)

Multi-core & many-core
[Chart: number of PEs per chip (caches are not included), growing from 2 to 256 over 2004-2011. Plotted chips include MIT RAW, STI Cell BE, Sun T1, Sun T2, TILERA TILE64, Intel Core, IBM Power7, AMD Opteron, Intel 80-core, Intel SCC, ClearSpeed CSX600/CSX700, picoChip PC102/PC205, UT TRIPS (OPN), and Fujitsu SPARC64]
Target: chip multiprocessor (CMP)

Our target: NoC for future CMPs
- 16-tile CMP example: each tile consists of a processor (UltraSPARC), L1 caches (I & D), and an L2 cache bank
- The L2 banks form a single shared L2 cache (S-NUCA)
- Tiles are interconnected via a Network-on-Chip (NoC) of on-chip routers
- Goal: a DVFS-like technique for on-chip routers in low-power CMPs

Problem: DVFS for on-chip routers
If DVFS is applied to on-chip routers, each router works at a different frequency (e.g., 333MHz vs. 400MHz). How do routers communicate across different frequency domains?
Possible solutions:
1. Restrict each router's frequency to an integer multiple of its neighbors' (e.g., 300MHz vs. 150MHz)
2. Use an asynchronous communication protocol: versatile but too costly
Any other solution? Yes: the multi-Vdd variable-pipeline router!

Multi-Vdd variable-pipeline router
- Existing DVFS router: voltage and frequency are adjusted, so communication between different frequency domains introduces the difficulties above
- Multi-Vdd variable-pipeline router: voltage and router pipeline depth are adjusted, and all routers work at the same frequency
Vdd and pipeline-mode switching policies (sketched in code below):
- Low workload: 2-cycle mode @ Vdd-low / high workload: 1-cycle mode @ Vdd-high
- Low temperature: 1-cycle mode @ Vdd-high / high temperature: 2-cycle mode @ Vdd-high
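The switching policy can be viewed as a small lookup from (workload, temperature) to an operating point. Below is a minimal sketch, assuming hypothetical boolean inputs and assuming temperature overrides workload (the slides list the two rules but not how they combine):

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    ONE_CYCLE = 1   # output buffer bypassed
    TWO_CYCLE = 2   # output buffer enabled

@dataclass
class OperatingPoint:
    mode: Mode
    vdd: float  # volts

# Hypothetical combination of the two policy rules on this slide:
# temperature takes priority, since timing margin must be preserved.
def select_operating_point(busy: bool, hot: bool,
                           vdd_high: float = 1.20,
                           vdd_low: float = 0.83) -> OperatingPoint:
    if hot:
        # High temperature: keep Vdd-high but add pipeline margin.
        return OperatingPoint(Mode.TWO_CYCLE, vdd_high)
    if busy:
        # High workload: shortest per-hop latency.
        return OperatingPoint(Mode.ONE_CYCLE, vdd_high)
    # Low workload at normal temperature: save power.
    return OperatingPoint(Mode.TWO_CYCLE, vdd_low)
```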

Outline: Multi-Vdd variable-pipeline router
- Problem of the DVFS router: each router works at a different frequency, requiring communication between different frequency domains
- Multi-Vdd variable-pipeline router: voltage and router pipeline depth are adjusted; all routers work at the same frequency
- Voltage switch policies: low-power policy; delay variation tolerance using a thermal sensor
- Evaluations: circuit-level and system-level

Original router: Overview & pipeline
Pipeline stages: routing computation (RC), arbitration (VSA), switch traversal (ST), and link traversal (LT).
[Block diagram: five input FIFOs (X+, X-, Y+, Y-, Core) feeding a 5x5 crossbar switch under an arbiter, with outputs to X+, X-, Y+, Y-, and Core]

Multi-Vdd variable-pipeline router
[Block diagram: the baseline router augmented with a voltage switch (Vdd select between Vdd-high and Vdd-low), level shifters, pipeline control, and output buffer bypassing on each output port]
Enabling/disabling the output buffer changes the router pipeline mode; a behavioral sketch of the bypass follows below.
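Functionally, the bypass is a mux in front of the output buffer: in 2-cycle mode a flit is latched in the output buffer for one cycle before link traversal, while in 1-cycle mode the buffer is skipped. A minimal behavioral sketch (not the authors' RTL; names are illustrative):

```python
class OutputPort:
    def __init__(self) -> None:
        self.bypass = False   # True: 1-cycle mode, False: 2-cycle mode
        self.buffer = None    # one-flit output buffer (2-cycle mode only)

    def cycle(self, flit_from_crossbar):
        """Returns the flit driven onto the link this cycle."""
        if self.bypass:
            # 1-cycle mode: ST and LT happen in the same cycle.
            return flit_from_crossbar
        # 2-cycle mode: the flit latched last cycle goes out now (LT),
        # and the newly switched flit is latched (ST).
        outgoing, self.buffer = self.buffer, flit_from_crossbar
        return outgoing
```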

Variable-pipeline: Three modes
All routers work at the same frequency.
- 2-cycle mode: output buffer enabled; lower performance (2 cycles per hop); short critical path, so Vdd-low (low power). [Pipeline: RC/VSA, then ST and LT in separate cycles]
- 1-cycle mode: output buffer bypassed; high performance (1 cycle per hop); long critical path, so Vdd-high (high power). [Pipeline: RC/VSA, with ST and LT in the same cycle]
- 3-cycle mode: adaptive routing becomes available at the cost of more communication latency (3 cycles per hop); hotspots and faulty routers can be dynamically avoided. [Pipeline: RC, VSA, ST, LT as separate stages; figure: 4x4 NoC with one hotspot]
The 3-cycle mode is useful for hotspot traffic or fault tolerance. The latency impact of the three modes is illustrated below.
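As a quick illustration of what the modes mean for end-to-end latency, assuming the 392.2MHz clock from the next slide and ignoring contention, serialization, and first-hop effects:

```python
FREQ_HZ = 392.2e6          # all routers share this clock
CYCLES_PER_HOP = {"1-cycle": 1, "2-cycle": 2, "3-cycle": 3}

def header_latency_ns(hops: int, mode: str) -> float:
    """Zero-load header latency across `hops` routers, in nanoseconds."""
    cycles = hops * CYCLES_PER_HOP[mode]
    return cycles / FREQ_HZ * 1e9

for mode in CYCLES_PER_HOP:
    # 6 hops is the maximum shortest-path distance on a 4x4 mesh.
    print(f"{mode}: {header_latency_ns(6, mode):.1f} ns over 6 hops")
    # -> 15.3 ns, 30.6 ns, 45.9 ns respectively
```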

Variable-pipeline router: Design
- Standard cells: Fujitsu 65nm library; custom cells: level shifter and voltage switch
- Synthesis: Synopsys Design Compiler; place and route: Synopsys Astro
- RC extraction: Cadence QRC; SPICE simulation: Synopsys HSIM
[SPICE waveform: voltage switch output transitioning between Vdd-low and Vdd-high (0.8V → 1.2V, Fujitsu 65nm, 25°C); the clock is stopped to clearly show the transition energy in the current trace]

Variable-pipeline: Freq & Vdd levels
All routers work at 392.2MHz (cycle time 2550psec); see the check below.
- 1-cycle mode: output buffer bypassed; high performance (1 cycle per hop); critical-path delay is 2550psec @ 1.2V, so Vdd-high = 1.20V is required
- 2-cycle mode: output buffer enabled; lower performance (2 cycles per hop); critical-path delay is only 1831psec @ 1.2V, and this slack allows lowering the voltage to Vdd-low = 0.83V
In short: all routers run at 392.2MHz, with 1-cycle mode @ 1.2V and 2-cycle mode @ 0.83V.
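The numbers line up as follows; this is a small arithmetic check of the slack (the voltage-to-delay scaling itself comes from the authors' SPICE characterization, not from this arithmetic):

```python
FREQ_HZ = 392.2e6
CYCLE_PS = 1e12 / FREQ_HZ          # ~2550 ps clock period

delay_1cycle_ps = 2550             # critical path, 1-cycle mode @ 1.2V
delay_2cycle_ps = 1831             # critical path, 2-cycle mode @ 1.2V

# 1-cycle mode just fits at Vdd-high = 1.2V.
assert delay_1cycle_ps <= CYCLE_PS + 1   # small rounding tolerance

# 2-cycle mode has ~719 ps of slack at 1.2V; per the SPICE
# characterization, dropping Vdd to 0.83V stretches its delay toward
# the full 2550 ps period while still meeting timing.
slack_ps = CYCLE_PS - delay_2cycle_ps
print(f"2-cycle mode slack at 1.2V: {slack_ps:.0f} ps")
```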

Outline: Multi-Vdd variable-pipeline router
- Problem of the DVFS router: each router works at a different frequency, requiring communication between different frequency domains
- Multi-Vdd variable-pipeline router: voltage and router pipeline depth are adjusted; all routers work at the same frequency
- Voltage switch policies: low-power policy; delay variation tolerance using a thermal sensor
- Evaluations: circuit-level and system-level

Policy 1: Standby-power reduction
Each router notifies routers 2 hops away of a packet arrival, so every router can detect the next packet's arrival 2 hops in advance.
- Packet arrival detected: 2-cycle to 1-cycle (boost) — raise the voltage, then reduce the pipeline depth
- After packet transfer: 1-cycle to 2-cycle (relax) — increase the pipeline depth, then reduce the voltage
- A Vdd transition takes 2 cycles; only the 1st hop cannot detect the packet arrival in advance, so the packet takes a 2-cycle transfer at its 1st hop
[Timeline: the Vdd-up completes during the 2-cycle 1st hop; subsequent hops are 1-cycle]
Result: routers sit at Vdd-low when no packets are in flight and run in 1-cycle mode @ Vdd-high when packets come (a behavioral sketch follows below).
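A behavioral sketch of the boost/relax state machine, assuming hypothetical signal names and a per-cycle simulation loop (the slides specify only the 2-hop advance notice and the 2-cycle Vdd transition; the relax path is simplified here):

```python
from enum import Enum, auto

class State(Enum):
    RELAXED = auto()    # 2-cycle mode @ Vdd-low (standby)
    BOOSTING = auto()   # raising Vdd, still 2-cycle mode
    BOOSTED = auto()    # 1-cycle mode @ Vdd-high

VDD_TRANSITION_CYCLES = 2

class RouterPowerFSM:
    def __init__(self) -> None:
        self.state = State.RELAXED
        self.timer = 0

    def tick(self, arrival_notice: bool, packets_in_flight: bool) -> None:
        """arrival_notice: a packet was detected 2 hops upstream."""
        if self.state is State.RELAXED and arrival_notice:
            # Boost: raise the voltage first; stay safe in 2-cycle mode.
            self.state, self.timer = State.BOOSTING, VDD_TRANSITION_CYCLES
        elif self.state is State.BOOSTING:
            self.timer -= 1
            if self.timer == 0:
                self.state = State.BOOSTED   # now safe to bypass output buf
        elif self.state is State.BOOSTED and not packets_in_flight:
            # Relax: re-enable the output buffer first, then lower Vdd.
            self.state = State.RELAXED
```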

Policy 2: Delay variation tolerance
Each router has a thermal sensor to detect delay changes; router temperature varies with the activity of the adjacent CPUs.
- Delay margin decreases (high temperature): more margin is needed, so switch from 1-cycle @ Vdd-high to 2-cycle @ Vdd-high
- Delay margin increases (back to normal temperature): the margin can be relaxed, so switch from 2-cycle @ Vdd-high to 1-cycle @ Vdd-high
(A thresholding sketch follows below.)
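One plausible way to act on the sensor reading is simple thresholding with hysteresis, so the mode does not chatter near the trip point; the threshold values here are made-up placeholders, not values from the paper:

```python
# Hypothetical trip points (the slides give no concrete values).
T_HOT_C = 85.0      # switch to 2-cycle mode above this
T_COOL_C = 75.0     # switch back to 1-cycle mode below this

class ThermalModeController:
    def __init__(self) -> None:
        self.two_cycle = False   # start in 1-cycle mode @ Vdd-high

    def update(self, temp_c: float) -> str:
        # Hysteresis: the gap between the two thresholds prevents
        # oscillating between modes when the temperature hovers
        # around a single trip point.
        if not self.two_cycle and temp_c >= T_HOT_C:
            self.two_cycle = True    # add timing margin
        elif self.two_cycle and temp_c <= T_COOL_C:
            self.two_cycle = False   # margin can be relaxed
        return "2-cycle @ Vdd-high" if self.two_cycle else "1-cycle @ Vdd-high"
```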

Outline: Multi-Vdd variable-pipeline router
- Problem of the DVFS router: each router works at a different frequency, requiring communication between different frequency domains
- Multi-Vdd variable-pipeline router: voltage and router pipeline depth are adjusted; all routers work at the same frequency
- Voltage switch policies: low-power policy; delay variation tolerance using a thermal sensor
- Evaluations: circuit-level (area, switching latency & energy, BET); system-level (application performance, standby power)

Evaluation: Area overhead
Hardware amount of the original and variable-pipeline routers [kilo gates]:

              Router      Level shifter  Voltage switch  Total
  Original    59.41       -              -               59.41
  Proposed    63.13 (*)   4.11           0.54            67.78 (+14.1%)

(*) Thermal sensor is not included.

Evaluation: Transition time & energy
Vdd-high is 1.2V; Vdd-low ranges from 0.6V to 1.1V.
- Low-to-high transition (e.g., 0.8V to 1.2V): transition time 3.1nsec; 231pJ is consumed
- High-to-low transition (e.g., 1.2V to 0.8V): transition time 5.3nsec; 148pJ is charged back
A pair of voltage transitions thus costs 83pJ (= 231 - 148) net, so frequent voltage transitions adversely increase the total power consumption. The break-even time (BET) is the minimum duration the router must stay at Vdd-low to compensate for this 83pJ overhead: here, BET is 58nsec (23 cycles @ 392.2MHz), as checked below.
[Plots: transition energy (left) and transition delay (right) vs. Vdd-low level, showing both energy consumed and energy charged]
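A quick back-of-the-envelope check of the BET figure (the ~1.4mW implied standby-power saving is derived here, not stated on the slide):

```python
FREQ_HZ = 392.2e6
E_OVERHEAD_PJ = 231 - 148          # net energy per boost/relax pair = 83 pJ
BET_NS = 58

# Cycles needed to amortize the overhead at 392.2 MHz:
bet_cycles = BET_NS * 1e-9 * FREQ_HZ
print(f"BET = {bet_cycles:.1f} cycles")          # ~22.7 -> 23 cycles

# Implied standby-power saving of Vdd-low vs. Vdd-high operation:
saving_mw = E_OVERHEAD_PJ * 1e-12 / (BET_NS * 1e-9) * 1e3
print(f"implied standby saving ~ {saving_mw:.2f} mW")   # ~1.43 mW
```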

CMP simulator: GEMS/Simics
Full-system CMP simulation:
- 16-tile CMP with a 4x4 mesh NoC (each tile: UltraSPARC core, L1 I & D caches, L2 cache bank, on-chip router)
- Sun Solaris 9, Sun Studio 12
- NAS Parallel Benchmarks (OpenMP version, 16 threads): IS, DC, MG, EP, LU, SP, BT, FT
Two configurations:
- Shared L2 CMP: all L2 cache banks are shared by all tiles (more communication between tiles)
- Private L2 CMP: every tile has its own private L1 and L2 caches (less communication between tiles)

Low-power policy: Performance overhead
Proposed setup: 2-cycle @ Vdd-low when no packets come (standby mode); 1-cycle @ Vdd-high when packets come (except the 1st hop). Compared against all-1-cycle transfer @ Vdd-high and all-2-cycle transfer @ Vdd-low.
- Shared L2 CMP (more communication): 2.1% performance decrease compared to all-1-cycle @ Vdd-high
- Private L2 CMP (less communication): 1.0% performance decrease compared to all-1-cycle @ Vdd-high

Low-power policy: Standby power reduction
Same setup and baselines as above.
- Shared L2 CMP (more communication): 10.4% standby-power reduction compared to all-1-cycle @ Vdd-high; the saving is limited by frequent short-term transitions shorter than the BET
- Private L2 CMP (less communication): 44.4% standby-power reduction compared to all-1-cycle @ Vdd-high

Summary: Multi-Vdd variable-pipeline router
- Existing DVFS router: voltage and frequency are adjusted, but communication between different frequency domains introduces some difficulties
- Multi-Vdd variable-pipeline router: voltage and router pipeline depth are adjusted; all routers work at the same frequency
- Voltage switch policies: low-power policy, and delay variation tolerance policy
- Evaluation results: area overhead is 14.1%; break-even time is 23 cycles; performance degradation is 1.0% to 2.1%; standby power reduction is 10.4% to 44.4%

On-going work
- Thermal sensor design: Oscillation-Ring based Thermal Sensor (ORTS) [Yang, ASPDAC'09][Wolpert, NOCS'10]; a pulse counter converts the oscillation into a thermal level
- More sophisticated policies: take the break-even time (BET) into account to avoid inefficient voltage transitions, e.g., stay tied to 1-cycle @ Vdd-high for high-volume traffic (a sketch follows below)
- Finer-grained multi-Vdd control, e.g., at the input channel or VC level, to exploit traffic locality
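As a rough sketch of what a BET-aware policy could look like (a speculative illustration of the stated idea, not the authors' design): defer the relax transition until the router has been idle long enough that staying at Vdd-low would actually pay back the 83pJ transition overhead.

```python
BET_CYCLES = 23   # break-even time at 392.2 MHz (from the evaluation)

class BetAwarePolicy:
    """Speculative: relax to Vdd-low only after a sustained idle period,
    so short gaps in high-volume traffic never trigger a boost/relax
    pair that costs more energy than it saves."""

    def __init__(self, idle_threshold: int = BET_CYCLES) -> None:
        self.idle_threshold = idle_threshold
        self.idle_cycles = 0
        self.boosted = True   # 1-cycle mode @ Vdd-high

    def tick(self, packet_seen: bool) -> None:
        if packet_seen:
            self.idle_cycles = 0
            self.boosted = True
        else:
            self.idle_cycles += 1
            # Relax only once the idle run already exceeds the BET,
            # using past idleness as a predictor of future idleness.
            if self.boosted and self.idle_cycles >= self.idle_threshold:
                self.boosted = False  # 2-cycle mode @ Vdd-low
```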