Runtime Power Gating of On-Chip Routers Using Look-Ahead Routing

Presentation transcript:

Runtime Power Gating of On-Chip Routers Using Look-Ahead Routing
Hiroki Matsutani (Keio Univ, Japan), Michihiro Koibuchi (NII, Japan), Daihan Wang (Keio Univ, Japan), Hideharu Amano (Keio Univ, Japan)

Background: Leakage & Power gating
- Leakage power: a major component of standby power
  - e.g., standby power of an on-chip router (90nm CMOS; 200MHz): Leakage (60.9%), remainder Dynamic
- Power gating (PG): leakage power reduction by turning the power supply to a circuit block on/off
  - [Figure: power-gated circuit block; labels: Vdd, power switch, virtual Vdd, circuit block, GND]
- Examples of PG: processor core; execution units (ALU, FPU, MAC, ...)
- We focus on power gating to reduce the standby power of NoCs

Outline
- Network-on-Chip (NoC)
  - On-chip router architecture
  - Power consumption
- Runtime power gating of routers
  - Overheads
  - Look-ahead sleep control
- Evaluations
  - Performance penalty
  - Compensated sleep cycles
  - Leakage reduction

Network-on-Chip (NoC)
- Processor core: the largest component; various low-power techniques are used
- e.g., standby current 11uA [Ishikawa,IEICE'05]
- On-chip router: its area is not so large, but it is infrastructure that affects on-chip communication
- Stopping routers makes the topology "irregular"
- [Figure: an example tile architecture (ASPLA 90nm CMOS) with processor core and router; labels: S, D, "Stop!!"]
- The next slides show the router architecture and its power

On-Chip Router: Architecture
- 5-input, 5-output router (data width is 64 bits)
- Two virtual channels (64-bit x 4 x 2)
- Hardware amount is 34 kilo gates; 64% of the area is used for FIFOs
- [Figure: input channels X+, X-, Y+, Y-, and CORE, each with a FIFO; ARBITER; 5x5 XBAR; outputs X+, X-, Y+, Y-, and CORE]

On-Chip Router: Pipeline
- A header flit goes through a router in 3 cycles: RC (Routing Computation), SA (Switch Allocation), ST (Switch Traversal)
- E.g., packet transfer from router A to C; the packet size is 4 flits including a 1-flit header

         @ROUTER A   @ROUTER B   @ROUTER C
HEAD     RC  SA  ST  RC  SA  ST  RC  SA  ST
DATA 1               ST          ST          ST
DATA 2                   ST          ST          ST
DATA 3                       ST          ST          ST
Cycle    1   2   3   4   5   6   7   8   9   10  11  12
(ELAPSED TIME [CYCLE])
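The timing chart above can be checked with a small zero-load latency calculation. This is only a sketch: the function name `packet_latency` and the assumption of no contention are not from the slides; the 3-cycle header pipeline and the one-flit-per-cycle body are.

```python
def packet_latency(hops: int, flits: int, header_cycles: int = 3) -> int:
    """Cycles until the last flit has traversed the last router.

    Assumes zero load (no contention): the header spends `header_cycles`
    (RC, SA, ST) in each router, and each following flit trails the
    previous one by a single ST cycle.
    """
    if hops < 1 or flits < 1:
        raise ValueError("hops and flits must be at least 1")
    return hops * header_cycles + (flits - 1)


# Routers A, B, C with a 4-flit packet, as in the slide's example.
print(packet_latency(hops=3, flits=4))  # 12
```

With three routers and a 4-flit packet the result is 12 cycles, which is the last cycle shown on the slide's time axis.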

On-Chip Router: Power consumption
- Placed and routed with 90nm CMOS
- Post-layout simulation at 200MHz
- [Figure: power consumption [mW] of a router when n ports are used]
- A router consumes more power as it processes more packets

On-Chip Router: Power consumption
- Power consumption when no port is used -> standby power
- Standby power of the on-chip router: Leakage (60.1%), Dynamic (39.9%)
- [Figure label: Channels (54.0%)]
- Leakage of the channel buffers is the largest component; it should be reduced

On-Chip Router: Leakage reduction
- Runtime power gating of router channels
  - No packets in a channel -> Sleep
  - A packet arrives at the channel -> Wakeup
- Link shutdown has been studied for on- and off-chip networks, but prior work uses SRAM buffers [Chen,ISLPED'03] [Soteriou,TPDS'07]
- We use small registered FIFOs for light-weight NoC routers
- [Figure: router with per-channel FIFOs (X+, X-, Y+, Y-, CORE), ARBITER, and 5x5 XBAR]

Power Gating: Various overheads
- Area overhead: power switches
- Performance overhead: wakeup delay; a pipeline stall of the router occurs while it waits for channel wakeup
- Power overhead: driving the power switches; short sleeps adversely increase dynamic power
- Countermeasures: early detection of packet arrivals; detect and avoid short-term sleeps
- [Figure: sleeping FIFO and active FIFO; power switch between Vdd and the circuit block, controlled by "sleep"]
- Sleep control that detects the arrival of packets early is needed

Look-Ahead Sleep Control
- Goal: mitigate the wakeup delay and short-term sleeps
- Normal routing: Router i calculates the output port of Router i
- Look-ahead routing: Router i calculates the output port of Router i+1
- Look-ahead: a router learns that a packet will arrive after two hops
  - E.g., a packet goes through R3, R4, R5, and R2; R2 detects the packet arrival when the packet arrives at R4
  - Five-cycle margin until packet arrival
- [Figure: routers R0 to R8; RC, SA, ST stages of Router 4, Router 5, and Router 2]
- Look-ahead can eliminate a wakeup delay of less than 5 cycles
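The R3, R4, R5, R2 example can be reproduced with a short sketch of dimension-order (XY) routing plus a two-hop look-ahead. Several things here are assumptions for illustration only: the 3x3 mesh with routers numbered row by row (inferred from the R0 to R8 labels), and the helper names `next_hop_xy`, `xy_path`, and `wakeup_target`. XY routing itself is the routing used in the evaluation.

```python
WIDTH = 3  # assumed 3x3 mesh, routers numbered row by row (R0..R8)


def next_hop_xy(current, dest, width=WIDTH):
    """XY routing: correct the X coordinate first, then Y. None at the destination."""
    cx, cy = current % width, current // width
    dx, dy = dest % width, dest // width
    if cx != dx:
        return current + (1 if dx > cx else -1)
    if cy != dy:
        return current + (width if dy > cy else -width)
    return None


def xy_path(src, dest, width=WIDTH):
    """Full list of routers visited from src to dest."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop_xy(path[-1], dest, width))
    return path


def wakeup_target(current, dest, width=WIDTH):
    """Router two hops downstream of `current`: `current` runs the routing
    function of its next router, so it knows where the packet goes after that.
    Returns None when fewer than two hops remain."""
    first = next_hop_xy(current, dest, width)
    if first is None:
        return None
    return next_hop_xy(first, dest, width)


print(xy_path(3, 2))        # [3, 4, 5, 2]
print(wakeup_target(4, 2))  # 2
```

When the packet is at R4, the look-ahead result identifies R2, so a wakeup signal can be sent to R2 while the packet is still two hops away.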

Evaluations: Sleep control methods
- Evaluation items: network throughput; leakage reduction
- Sleep control methods
  - Ideal method: ideal case; no wakeup delay
  - Look-ahead method: detects packet arrival 5 cycles ahead
  - Naïve method: original router; no look-ahead
- Parameters
  Topology       2-D Mesh (4x4)
  Routing        DOR (XY routing)
  Packet size    5-flit (1-flit header)
  Buffer size    4-flit (WH switching)
  # of VCs       2 VCs
  Latency        3 cycles per hop
- Traffic patterns: uniform and NPB programs (BT, SP, CG, MG, and IS)

Evaluations: Performance of "naïve"
- Throughput for various wakeup delays (e.g., 0, 1, 2, 3 cycles)
- Naïve: performance is reduced as Twakeup increases
- [Figures: Uniform traffic (16-core); MG.W traffic (16-core)]

Evaluations: Performance of "lookahead"
- Throughput for various wakeup delays (e.g., 0, 1, 2, 3 cycles)
- Naïve: performance is degraded as Twakeup increases
- Ideal: the same regardless of Twakeup
- Look-ahead: the same as ideal if Twakeup is less than 5
- [Figures: Uniform traffic (16-core); MG.W traffic (16-core)]
- Look-ahead can conceal a wakeup delay of less than 5 cycles

Evaluations: Breakeven point of PG
- Power gating model [Hu,ISLPED'04]
  - Eoverhead: energy consumed for turning the power switches (PS) on/off
  - Esaved: leakage energy saved by an N-cycle sleep
- How many cycles of sleep are required to compensate for Eoverhead?
- We calculate the breakeven point of PG based on the following parameters, which come from the post-layout simulation of the on-chip router (90nm CMOS)
  Supply voltage              1.0 V
  Switching factor            0.10
  Leakage power               95 uW
  Dynamic power (200MHz)      105 uW
  Dynamic power (500MHz)      261 uW
  Power switch size ratio     0.1
  Power switch cap ratio      0.5

Evaluations: Breakeven point of PG
- Power gating model [Hu,ISLPED'04]: how many cycles of sleep are required to compensate for Eoverhead (the cost of turning the power switches on/off) with Esaved (the leakage saved by an N-cycle sleep)?
- Breakeven point is 6 cycles (200MHz)
- Breakeven point is 14 cycles (500MHz)
- Power consumption is reduced as the sleep duration becomes longer
- [Figure: No power gating (PG); PG router (200MHz); PG router (500MHz)]
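To make the breakeven idea concrete, here is a deliberately simplified sketch: it assumes a sleeping channel leaks nothing and that each sleep costs one fixed overhead energy. This is not the [Hu,ISLPED'04] model, which also involves the switching factor and power switch ratios. The name `breakeven_cycles` and the overhead value `E_OVERHEAD_FJ = 2600` are hypothetical; the value was picked only so that this simplified model lands on the 6-cycle and 14-cycle figures from the slide. The 95 uW leakage power is from the slides.

```python
from fractions import Fraction
from math import ceil

P_LEAK_UW = 95        # leakage power per channel, from the slides
E_OVERHEAD_FJ = 2600  # hypothetical overhead energy in fJ (illustrative only)


def breakeven_cycles(e_overhead_fj: int, p_leak_uw: int, f_mhz: int) -> int:
    """Smallest sleep length N (in cycles) with Esaved >= Eoverhead.

    Simplified model: Esaved = N * (leakage energy per cycle).
    uW / MHz gives pJ per cycle; multiply by 1000 for fJ.
    Fractions keep the arithmetic exact before rounding up.
    """
    saved_per_cycle_fj = Fraction(p_leak_uw * 1000, f_mhz)
    return ceil(Fraction(e_overhead_fj) / saved_per_cycle_fj)


print(breakeven_cycles(E_OVERHEAD_FJ, P_LEAK_UW, 200))  # 6
print(breakeven_cycles(E_OVERHEAD_FJ, P_LEAK_UW, 500))  # 14
```

The sketch shows why the breakeven point grows with clock frequency: a cycle is shorter at 500MHz, so each sleeping cycle saves less leakage energy and more cycles are needed to pay back the same overhead.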

Evaluations: Compensated sleep ratio
- States of router channels
  - Nactive: active operation; power is consumed as usual
  - Ncsc: compensated sleep; sleep longer than Tbreakeven
  - Nusc: uncompensated sleep; sleep shorter than Tbreakeven
- Estimate the ratio of compensated sleep cycles
  - We performed the network simulation again
  - Comparison between the three sleep control methods: Ideal, Look-ahead, Naïve
- [Figure: channel timeline with sleep and wakeup events, marking Nactive, Nusc, and Ncsc periods]

Evaluations: Compensated sleep ratio
- States of router channels
  - Nactive: active operation; power is consumed as usual
  - Ncsc: compensated sleep; sleep longer than Tbreakeven
  - Nusc: uncompensated sleep; sleep shorter than Tbreakeven
- Ncsc rate: 80% (low workload), 25% (high workload)
- [Figures: Uniform traffic (16-core); MG.W traffic (16-core)]
- Ncsc decreases as traffic increases; Ideal > Look-ahead > Naïve
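The three channel states can be counted from a list of sleep lengths as in the sketch below. Two points are assumptions rather than statements from the slides: a sleep of exactly Tbreakeven cycles is counted as compensated, and the "Ncsc rate" is taken to be Ncsc divided by the total number of simulated cycles. The function name `sleep_breakdown` and the sample numbers are made up for illustration.

```python
def sleep_breakdown(sleep_lengths, total_cycles, t_breakeven):
    """Split a channel's cycles into Nactive, Ncsc, and Nusc.

    sleep_lengths: length in cycles of every sleep observed for the channel.
    A sleep of at least t_breakeven cycles is counted as compensated
    (the boundary case is an assumption).
    """
    n_csc = sum(n for n in sleep_lengths if n >= t_breakeven)
    n_usc = sum(n for n in sleep_lengths if n < t_breakeven)
    n_active = total_cycles - n_csc - n_usc
    if n_active < 0:
        raise ValueError("sleep cycles exceed total cycles")
    return {
        "Nactive": n_active,
        "Ncsc": n_csc,
        "Nusc": n_usc,
        "Ncsc_rate": n_csc / total_cycles,
    }


# Made-up trace: four sleeps within 100 cycles, Tbreakeven = 6.
result = sleep_breakdown([2, 10, 5, 20], total_cycles=100, t_breakeven=6)
print(result["Nactive"], result["Ncsc"], result["Nusc"])  # 63 30 7
```

A sleep control method that avoids short sleeps moves cycles from Nusc to either Nactive or Ncsc, which is what the comparison between the three methods measures.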

Evaluations: Leakage power reduction
- Leakage power at each channel (Tbreakeven = 6)
- Without power gating, a channel consumes 95 [uW]
- Leakage reduction of PG with the 3 sleep control methods; this includes the overhead energy to turn the power switches on/off
- [Figures: Uniform traffic (16-core); MG.W traffic (16-core)]
- Leakage increases as traffic increases; Ideal < Look-ahead < Naïve

Summary: Look-ahead sleep control
- Runtime power gating of router channels
  - Wakeup delay introduces pipeline stalls in routers
  - Short-term sleeps overwhelm the leakage reduction
- Look-ahead sleep control
  - An extension of "look-ahead routing"
  - Detects the arrival of packets five cycles ahead
- Evaluation results
  - Look-ahead conceals a wakeup delay of less than 5 cycles
  - Look-ahead reduces more leakage than naïve

Thank you for your attention

Backup slides

Look-ahead method: HW resources
- Routing computation of the next router
  - Just changing the routing function
  - Area overhead is very small
- Wakeup signals are needed
  - The sender asserts a "wakeup" signal to the receiver
- Wakeup signals become long
  - Negative impact of multi-cycle or repeater buffers
- NRC stage: Next Routing Computation
- [Timing chart: HEAD passes NRC, SA, ST at each of three routers; DATA 1 and DATA 2 follow with ST at each router; cycles 1 to 8 shown; wakeup signals to router 1]

Wakeup delay: Performance impact
- Wakeup delays in the literature
  - ALU: 2 cycles
  - AES core: approx. 4 cycles
  - FPMAC in Intel's 80-tile chip: 6 cycles
- It depends on circuit block size, clock frequency, noise, ...
- Performance of the look-ahead method (uniform traffic)
- [Figures: wakeup delay = 0, 1, 2, 3, 4, 5 cycles; wakeup delay = 5, 6, 7, 8 cycles]

Breakeven point: leakage reduction
- Breakeven point in the literature
  - Execution unit in a processor: 10 cycles
- It depends on circuit block size, clock frequency, ...
- Leakage power reduction (uniform traffic)
- A longer Tbreakeven reduces the opportunity for compensated sleep
- [Figures: Tbreakeven = 6 cycles; Tbreakeven = 14 cycles]

Finer grain PG of NoC routers
- Virtual channel (VC) level power gating
- Packet routing scheme for VC-level PG
  - All packets use VC#0 when they are injected into the NoC
  - The VC number is increased when the packet conflicts
- Only VC#0 is used if the workload is low
- [Figure: Router (a), Router (b), Router (c), each with VC#0, VC#1, VC#2]

Finer grain PG of NoC routers
- Virtual channel (VC) level power gating
- Packet routing scheme for VC-level PG
  - All packets use VC#0 when they are injected into the NoC
  - The VC number is increased when the packet conflicts
- All VCs are activated if the workload is high
- [Figure: Router (a), Router (b), Router (c), each with VC#0, VC#1, VC#2]
- High peak performance of VCs with the least leakage power

Buffer design: Registers or SRAMs
- It depends on buffer depth, not width
  - Depth > 32 flits -> buffers are designed with SRAMs
  - Otherwise -> buffers are designed with registers
- In our design the buffer depth is 4 flits, so the FIFO buffers are designed with registers
- [Figure: router with per-channel FIFOs (X+, X-, Y+, Y-, CORE), ARBITER, and 5x5 XBAR]

Leakage power calculation
- Power estimation flow:
  1. Perform the network simulation
  2. Obtain the length of every sleep during the simulation
  3. Estimate the average leakage of each sleep according to its length, based on the "sleep duration vs. leakage" graph
- [Figures: Leakage reduction (Tbreakeven = 6); Sleep duration vs. leakage power]
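The estimation flow above can be sketched as a time-weighted average. The slides do not give the "sleep duration vs. leakage" curve numerically, so `example_curve` below is a hypothetical stand-in: it assumes each sleep costs overhead worth 6 cycles of leakage and leaks nothing otherwise, which makes a 6-cycle sleep break even. The function names are also assumptions; only the 95 uW active leakage is from the slides.

```python
ACTIVE_LEAK_UW = 95.0  # leakage of an awake channel, from the slides


def example_curve(sleep_cycles: int) -> float:
    """Hypothetical average power [uW] during a sleep of the given length.

    Assumes overhead equal to 6 cycles of leakage spread over the sleep,
    so a 6-cycle sleep averages 95 uW (breakeven) and longer sleeps less.
    """
    return ACTIVE_LEAK_UW * 6 / sleep_cycles


def average_leakage_uw(sleep_lengths, total_cycles, curve=example_curve):
    """Time-weighted average leakage [uW] of one channel over a simulation."""
    sleeping = sum(sleep_lengths)
    if sleeping > total_cycles:
        raise ValueError("sleep cycles exceed total cycles")
    # Step 3 of the flow: average power of each sleep depends on its length.
    weighted = sum(n * curve(n) for n in sleep_lengths)
    # Cycles spent awake leak at the full rate.
    weighted += (total_cycles - sleeping) * ACTIVE_LEAK_UW
    return weighted / total_cycles


# Made-up trace: one long sleep and one short sleep within 30 cycles.
print(average_leakage_uw([12, 3], total_cycles=30))
```

Under this stand-in curve the 3-cycle sleep averages more than 95 uW, so it costs more than staying awake; that is the uncompensated-sleep penalty the evaluation accounts for.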

Look-ahead method: the 1st hop?
- Look-ahead for Router 3, Router 4, Router 5, ...: performed by the routers
- Look-ahead for Router 1 and Router 2: the network interface (NI) performs look-ahead
  - Packet construction takes several clock cycles
  - The NI of the source node can perform "look-ahead"
- [Figures: path Src, Router (1), Router (2), Router (3), Router (4), Dst, with "Look-ahead!!" markers]

Look-ahead method: Adaptive routing
- Routing algorithms
  - Deterministic routing -> the routing path is predictable
  - Adaptive routing -> the path changes dynamically
- Adaptive routing
  - It is difficult to predict the routing path
  - Look-ahead wakeup sometimes fails, e.g., by asserting wakeup signals to the wrong input channels
- An extension for adaptive routing
  - At low workload, using an output selection function (OSF) that tries to use the same output channel -> wakeup rarely fails
- We used deterministic routing, because it is popular in simple NoCs