Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks Hiroki Matsutani (Keio Univ, Japan) Michihiro Koibuchi (NII, Japan) Daihan Wang (Keio.

Slides:



Advertisements
Similar presentations
A Novel 3D Layer-Multiplexed On-Chip Network
Advertisements

1 A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky *Dept. of Electrical Engineering Dept. of Computer.
Keeping Hot Chips Cool Ruchir Puri, Leon Stok, Subhrajit Bhattacharya IBM T.J. Watson Research Center Yorktown Heights, NY Circuits R-US.
Evaluating Bufferless Flow Control for On-Chip Networks George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis Stanford University.
Ultra Fine-Grained Run-Time Power Gating of On-Chip Routers for CMPs
A Multi-Vdd Dynamic Variable-Pipeline On-Chip Router for CMPs
Lizhong Chen and Timothy M. Pinkston SMART Interconnects Group
Allocator Implementations for Network-on-Chip Routers Daniel U. Becker and William J. Dally Concurrent VLSI Architecture Group Stanford University.
1 Lecture 17: On-Chip Networks Today: background wrap-up and innovations.
1 Lecture 23: Interconnection Networks Paper: Express Virtual Channels: Towards the Ideal Interconnection Fabric, ISCA’07, Princeton.
L2 to Off-Chip Memory Interconnects for CMPs Presented by Allen Lee CS258 Spring 2008 May 14, 2008.
MINIMISING DYNAMIC POWER CONSUMPTION IN ON-CHIP NETWORKS Robert Mullins Computer Architecture Group Computer Laboratory University of Cambridge, UK.
Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim
1 Lecture 21: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO’03, Princeton A Gracefully Degrading and.
1 Lecture 13: Interconnection Networks Topics: flow control, router pipelines, case studies.
1 Lecture 25: Interconnection Networks Topics: flow control, router microarchitecture Final exam:  Dec 4 th 9am – 10:40am  ~15-20% on pre-midterm  post-midterm:
1 Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control Final exam reminders:  Plan well – attempt every question.
On-Line Adjustable Buffering for Runtime Power Reduction Andrew B. Kahng Ψ Sherief Reda † Puneet Sharma Ψ Ψ University of California, San Diego † Brown.
S. Reda EN160 SP’07 Design and Implementation of VLSI Systems (EN0160) Lecture 13: Power Dissipation Prof. Sherief Reda Division of Engineering, Brown.
1 Lecture 26: Interconnection Networks Topics: flow control, router microarchitecture.
Orion: A Power-Performance Simulator for Interconnection Networks Presented by: Ilya Tabakh RC Reading Group4/19/2006.
Lecture 7: Power.
Network-on-Chip: Communication Synthesis Department of Computer Science Texas A&M University.
Low-Latency Virtual-Channel Routers for On-Chip Networks Robert Mullins, Andrew West, Simon Moore Presented by Sailesh Kumar.
Performance and Power Efficient On-Chip Communication Using Adaptive Virtual Point-to-Point Connections M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol.
McRouter: Multicast within a Router for High Performance NoCs
Tightly-Coupled Multi-Layer Topologies for 3D NoCs Hiroki Matsutani (Keio Univ, JAPAN) Michihiro Koibuchi (NII, JAPAN) Hideharu Amano (Keio Univ, JAPAN)
1 Lecture 23: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm Next semester:
A Vertical Bubble Flow Network using Inductive-Coupling for 3D CMPs
Introduction to Interconnection Networks. Introduction to Interconnection network Digital systems(DS) are pervasive in modern society. Digital computers.
José Vicente Escamilla José Flich Pedro Javier García 1.
High-Performance Networks for Dataflow Architectures Pravin Bhat Andrew Putnam.
Three-Dimensional Layout of On-Chip Tree-Based Networks Hiroki Matsutani (Keio Univ, Japan) Michihiro Koibuchi (NII, Japan) D. Frank Hsu (Fordham Univ,
Elastic-Buffer Flow-Control for On-Chip Networks
Networks-on-Chips (NoCs) Basics
International Symposium on Low Power Electronics and Design NoC Frequency Scaling with Flexible- Pipeline Routers Pingqiang Zhou, Jieming Yin, Antonia.
Déjà Vu Switching for Multiplane NoCs NOCS’12 University of Pittsburgh Ahmed Abousamra Rami MelhemAlex Jones.
SMART: A Single- Cycle Reconfigurable NoC for SoC Applications -Jyoti Wadhwani Chia-Hsin Owen Chen, Sunghyun Park, Tushar Krishna, Suvinay Subramaniam,
1 EE 587 SoC Design & Test Partha Pande School of EECS Washington State University
Paper review: High Speed Dynamic Asynchronous Pipeline: Self Precharging Style Name : Chi-Chuan Chuang Date : 2013/03/20.
1 Michihiro Koibuchi, Takafumi Watanabe, Atsushi Minamihata, Masahiro Nakao, Tomoyuki Hiroyasu, Hiroki Matsutani, and Hideharu Amano
George Michelogiannakis William J. Dally Stanford University Router Designs for Elastic- Buffer On-Chip Networks.
A Lightweight Fault-Tolerant Mechanism for Network-on-Chip
CS 8501 Networks-on-Chip (NoCs) Lukasz Szafaryn 15 FEB 10.
1 Lecture 15: Interconnection Routing Topics: deadlock, flow control.
Michihiro Koibuchi(NII, Japan ) Tomohiro Otsuka(Keio U, Japan ) Hiroki Matsutani ( U of Tokyo, Japan ) Hideharu Amano ( Keio U/ NII, Japan ) An On/Off.
Non-Minimal Routing Strategy for Application-Specific Networks-on-Chips Hiroki Matsutani Michihiro Koibuchi Yutaka Yamada Jouraku Akiya Hideharu Amano.
Anshul Kumar, CSE IITD ECE729 : Advanced Computer Architecture Lecture 27, 28: Interconnection Mechanisms In Multiprocessors 29 th, 31 st March, 2010.
Jose Miguel Montanana (NII, Japan) Michihiro Koibuchi (NII, Japan ) Hiroki Matsutani ( U of Tokyo, Japan ) Hideharu Amano ( Keio U/ NII, Japan ) Stabilizing.
Runtime Power Gating of On-Chip Routers Using Look-Ahead Routing
Performance, Cost, and Energy Evaluation of Fat H-Tree: A Cost-Efficient Tree-Based On-Chip Network Hiroki Matsutani (Keio Univ, JAPAN) Michihiro Koibuchi.
Yu Cai Ken Mai Onur Mutlu
Lecture 16: Router Design
FPGA-Based System Design: Chapter 6 Copyright  2004 Prentice Hall PTR Topics n Low power design. n Pipelining.
Z. Feng MTU EE4800 CMOS Digital IC Design & Analysis 6.1 EE4800 CMOS Digital IC Design & Analysis Lecture 6 Power Zhuo Feng.
1 Lecture 15: NoC Innovations Today: power and performance innovations for NoCs.
1 Lecture 22: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO’03, Princeton A Gracefully Degrading and.
Virtual-Channel Flow Control William J. Dally
Predictive High-Performance Architecture Research Mavens (PHARM), Department of ECE The NoX Router Mitchell Hayenga Mikko Lipasti.
Network On Chip Cache Coherency Final presentation – Part A Students: Zemer Tzach Kalifon Ethan Kalifon Ethan Instructor: Walter Isaschar Instructor: Walter.
1 Lecture 29: Interconnection Networks Papers: Express Virtual Channels: Towards the Ideal Interconnection Fabric, ISCA’07, Princeton Interconnect Design.
1 Lecture 22: Interconnection Networks Topics: Routing, deadlock, flow control, virtual channels.
Lecture 23: Interconnection Networks
Physical constraints (1/2)
Lecture 23: Router Design
OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel
Rahul Boyapati. , Jiayi Huang
CS 6290 Many-core & Interconnect
Lecture 25: Interconnection Networks
Presentation transcript:

Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks Hiroki Matsutani (Keio Univ, Japan) Michihiro Koibuchi (NII, Japan) Daihan Wang (Keio Univ, Japan) Hideharu Amano (Keio Univ, Japan)

I am very sorry… My flight was canceled on April 6. I was waiting for rebooking at airport for seven hours, but I couldn’t get a ticket. I got a fever. I arrived at Newcastle on April 7. I couldn’t find my baggage; I wore only a shirt. My hotel reservation was canceled w/o asking; I didn’t have a place to sleep… I went to another hotel to book a room in my shirt sleeves in the rain. The fever was gone up. Ms. Jerder kindly did her presentation on Apr 8. I would like thank her and ASYNC/NOCS program committee.

Adding Slow-Silent Virtual Channels for Low-Power On-Chip Networks Hiroki Matsutani (Keio Univ, Japan) Michihiro Koibuchi (NII, Japan) Daihan Wang (Keio Univ, Japan) Hideharu Amano (Keio Univ, Japan) Power gating Voltage and frequency scaling

Introduction: Area and power Due to the finger process technology, –Area constraint is relaxed –But power density becomes more serious Adding extra hardware resources (e.g., VCs) –We can get a performance margin; so –We can reduce voltage and frequency to reduce power VC#0 Router (a) VC#0 Router (b) VC#0 Router (c) VC#1 VC#2 Issues to be tackled in this presentation Adding extra hardware increases the leakage power How much resource is required to minimize total power

Outline: Slow-silent virtual channels Network-on-Chip (NoC) On-Chip Router –Architecture and its power consumption Slow-silent virtual channels –Voltage and frequency scaling –Run-time power gating of virtual channels –Adaptive VC activation Evaluations (1VC, 2VC, 3VC, and 4VC) –Throughput –Power consumption (with PG & voltage freq scaling) –How many VCs are required to minimize power

Network-on-Chip (NoC) An example tile architecture (ASPLA 90nm CMOS) Processor core Router The next slides show “Router architecture” and “Its power” Processor core –Largest component –Various low-power techniques are used On-chip router –Area is not so large –Always preparing (active) for packet injection [Ishikawa,IEICE’05] e.g., Standby current 11uA

On-Chip Router: Architecture 5-input 5-output router (data width is 64-bit) 5x5 XBAR ARBITER FIFO X+ X- Y+ Y- CORE X+ X- Y+ Y- CORE Each physical channel has 2 VCs HW amount is 34 kilo gates and 64% of area is used for FIFO Each VC has a FIFO buffer (4 x 64 bits)

On-Chip Router: Pipeline A header flit goes through a router in 3 cycles –RC (Routing computation) –VSA (Virtual channel / Switch allocation) –ST (Switch traversal) E.g., Packet transfer from router A to C RCVSAST RCVSAST RCVSAST ELAPSED TIME [CYCLE] C HEAD DATA 1 DATA 2 DATA 3 A packet consists of a header and 3 data flits

On-Chip Router: Power consumption Place-and-routed with 90nm CMOS Post layout simulation at 200MHz Power consumption of a router when n ports are used [mW] A router consumes more power as the router processes more packets Packet switching power is large  Voltage freq scaling

On-Chip Router: Power consumption Standby power of the on-chip router Leakage (55.0%) Dynamic (45.0%) Channels (49.4%) Leakage of channel buf is the largest  Runtime power gating Power consumption when no port is used  standby power Packet switching power is large  Voltage freq scaling

Outline: Slow-silent virtual channels Network-on-Chip (NoC) On-Chip Router –Architecture and its power consumption Slow-silent virtual channels –Voltage and frequency scaling –Run-time power gating of virtual channels –Adaptive VC activation Evaluations (1VC, 2VC, 3VC, and 4VC) –Throughput –Power consumption (with PG & voltage freq scaling) –How many VCs are required to minimize power

Slow-Silent Virtual Channels Latency vs. accepted traffic Adding extra VCs –Performance improves –We can reduce voltage and frequency Voltage & frequency scaling (VFS) –Set the reduced voltage and frequency –In response to the performance margin Problem –Adding extra VCs increases leakage power –It may overwhelm VFS 1-VC 2-VC3-VC 4-VC Performance margin We focus on run-time power gating of VCs to reduce leakage

Power Gating of virtual channels Run-time power gating of virtual channels –No packets in a VC  Sleep (turn off the power supply) –Packet arrives at the VC  Wakeup (turn on the power) 5x5 XBAR ARBITER X+ X- Y+ Y- CORE X+ X- Y+ Y- CORE sleep

Power Gating of virtual channels 5x5 XBAR ARBITER X+ X- Y+ Y- CORE X+ X- Y+ Y- CORE sleep Link shutdown has been studied for on- & off-chip networks, but prior work uses SRAM buffers [Chen,ISLPED’03] [Soteriou,TPDS’07]  We use small registered FIFOs for light-weight NoC routers Run-time power gating of virtual channels –No packets in a VC  Sleep (turn off the power supply) –Packet arrives at the VC  Wakeup (turn on the power)

Power Gating: Various overheads Area overhead –Power switches Performance overhead –Wakeup delay –Pipeline stall is caused Power overhead –Driving power switches –Short sleeps adversely increases dynamic power FIFO Sleep Waiting for channel wakeup FIFO Active Pipeline stall of a router occurs  Frequent on/off should be avoided

Power Gating: Various overheads Area overhead –Power switches Performance overhead –Wakeup delay –Pipeline stall is caused Power overhead –Driving power switches –Short sleeps adversely increases dynamic power sleep Vdd Virtual Vdd GND Power switch Circuit block Control that gradually activates VCs in response to workload FIFO Sleep Waiting for channel wakeup Pipeline stall of a router occurs  Frequent on/off should be avoided FIFO Active

Power Gating: VC activation policy Virtual channel (VC) level power gating Virtual-channel selection: –All packets use VC#0 when they are injected to NoC –VC number is increased when the packet conflicts VC#0 Router (a) VC#1 VC#2 VC#0 Router (b) VC#1 VC#2 VC#0 Router (c) VC#1 VC#2 Only VC#0 is used if workload is low

Power Gating: VC activation policy Virtual channel (VC) level power gating Virtual-channel selection: –All packets use VC#0 when they are injected to NoC –VC number is increased when the packet conflicts Router (a)Router (b)Router (c) VC#0 VC#1 VC#2 VC#0 VC#1 VC#2 VC#0 VC#1 VC#2 High peak performance of VCs with the least leakage power All VCs are activated if workload is high

Power Gating: Routing design A virtual-channel layer –A virtual network consisting of VCs with the same VC# Deadlock-freedom –Moving upper to lower layers –Only bottom layer must guarantee deadlock-freedom VC#0 VC#1 VC#2 VC#3 VC#0 VC#1 VC#2 VC#3 VC#0 VC#1 VC#2 VC#3 Router (a)Router (b)Router (c) VC Layer #0 VC Layer #1 VC Layer #2 VC Layer #3 VC#0  VC#1  VC#2  VC#3 [Duato,TPDS’93][Koibuchi,ICPP’03] All VC layers except for the bottom can employ any routing, as far as the bottom guarantees deadlock-free by itself

Outline: Slow-silent virtual channels Network-on-Chip (NoC) On-Chip Router –Architecture and its power consumption Slow-silent virtual channels –Voltage and frequency scaling –Run-time power gating of virtual channels –Adaptive VC activation Evaluations (1VC, 2VC, 3VC, and 4VC) –Throughput –Power consumption (with PG & voltage freq scaling) –How many VCs are required to minimize power

Evaluations of slow-silent VCs Preliminary –Leakage modeling of PG –Breakeven point of PG Evaluation items –Original throughput –Power consumption w/o PG and VFS –Power consumption w/ PG and VFS Which is the best? –1VC, 2VC, 3VC, and 4VC Process technology –ASPLA 90nm CMOS –1.00V (baseline) Simulation parameters Traffic patterns –Unifrom + NPB traces (BT, SP, CG, MG, IS) Topology2-D Mesh (8x8) RoutingDOR (XY routing) Buffer size4-flit (WH switching) # of VCs1VC, 2VC, 3VC, 4VC Latency3-cycle per 1-hop

Preliminary: Leakage power modeling Supply voltage1.0 V Switching factor0.12 Leakage power52 uW Dynamic power (200MHz)78 uW Dynamic power (500MHz)194 uW Power switch size ratio0.1 Power switch cap ratio0.5 Based on the post layout simulation of on-chip router (90nm CMOS) Power gating model –E overhead : Power consumed for turning PS on/off –E saved : Leakage power saving for an N-cycle sleep [Hu,ISLPED’04] How many cycles are required to sleep for compensating E overhead ? We calculate the breakeven point of PG based on the following parameters

Preliminary: Leakage power modeling Power gating model –E overhead : Power consumed for turning PS on/off –E saved : Leakage power saving for N-cycle sleep Breakeven point is 7 cycle (200MHz) Breakeven point is 16 cycles (500MHz) No power gating (PG) PG router (200MHz) PG router (500MHz) How many cycles are required to sleep for compensating E overhead ? Power consumption is reduced as sleep duration becomes long [Hu,ISLPED’04]

Preliminary: Leakage power modeling Power gating model –E overhead : Power consumed for turning PS on/off –E saved : Leakage power saving for N-cycle sleep Breakeven point is… PG(200MHz): 7 cycles PG(300MHz): 10 cycles PG(400MHz): 13 cycles PG(500MHz): 16 cycles No power gating (PG) PG router (200MHz) PG router (500MHz) How many cycles are required to sleep for compensating E overhead ? Power consumption is reduced as sleep duration becomes long [Hu,ISLPED’04] PG router (300MHz) PG router (400MHz)

Evaluations of slow-silent VCs Preliminary –Leakage modeling of PG –Breakeven point of PG Evaluation items –Original throughput –Power consumption w/o PG and VFS –Power consumption w/ PG and VFS Which is the best? –1VC, 2VC, 3VC, and 4VC Process technology –ASPLA 90nm CMOS –1.00V (baseline) Simulation parameters Traffic patterns –Unifrom + NPB traces (BT, SP, CG, MG, IS) Topology2-D Mesh (8x8) RoutingDOR (XY routing) Buffer size4-flit (WH switching) # of VCs1VC, 2VC, 3VC, 4VC Latency3-cycle per 1-hop

Evaluations: Uniform (64-core) 1/4 1-VC 2-VC 3-VC 4-VC Original throughput

Evaluations: Uniform (64-core) 2/4 1-VC 2-VC 3-VC 4-VC Power (without PG & VFS) leakage total

Evaluations: Uniform (64-core) 3/4 1-VC 2-VC 3-VC 4-VC Power (without PG & VFS) Freq [MHz]Voltage [V] 1VC VC VC VC Static voltage and frequency scaling leakage total 1)We re-characterized low- voltage libraries ( V) by Cadence SignalStrom 2)We confirm our design works at these reduced voltages

Evaluations: Uniform (64-core) 4/4 1-VC 2-VC 3-VC 4-VC Power (without PG & VFS)Power (with PG & VFS) Freq [MHz]Voltage [V] 1VC VC VC VC Static voltage and frequency scaling leakage total leakage total 4-VC is the lowest The same results can be seen in all-to-all traffics (e.g., IS)

Evaluations: BT traffic (64-core) 1/4 1-VC 2-VC 3-VC 4-VC Original throughput Performance improvements of 3-VC and 4-VC are small

Evaluations: BT traffic (64-core) 2/4 Power (without PG & VFS) 1-VC 2-VC 3-VC 4-VC leakage total Performance improvements of 3-VC and 4-VC are small

Evaluations: BT traffic (64-core) 3/4 1-VC 2-VC 3-VC 4-VC Power (without PG & VFS) Freq [MHz]Voltage [V] 1VC VC VC VC Static voltage and frequency scaling leakage total 1)We re-characterized the low- voltage library (0.82V) by Cadence SignalStrom 2)We confirm our design works at this reduced voltage Almost the same

Evaluations: BT traffic (64-core) 4/4 1-VC 2-VC 3-VC 4-VC Power (without PG & VFS)Power (with PG & VFS) Freq [MHz]Voltage [V] 1VC VC VC VC Static voltage and frequency scaling leakage total leakage total 2-VC is the lowest The same result can be seen in neighboring traffics (e.g., SP)

How many VCs are best for LP? All-to-all traffic –Uniform, IS traffic –3 or 4VCs are better Neighboring traffic –BT, SP traffic –2VCs are enough Uniform (with SVFS & PG)BT traffic (with SVFS & PG)  It depends on the traffic pattern of application leakage total leakage total 1-VC 2-VC3-VC 4-VC 4-VC is the lowest 2-VC is the lowest

Slow-silent virtual channels –Adding extra VCs  Performance margin is available –We can reduce the freq and voltage –But adding extra VCs increases leakage power … Run-time power gating of VCs –Adaptive VC activation How many VCs are required for minimizing power? –It depends on the traffic pattern of application –All-to-all traffic: 3 or 4 VCs are better –Neighboring traffic: 2 VCs are enough Summary: Slow-silent virtual channels

Very “FAT” trees –Adding more trees & voltage frequency scaling –Run-time power gating There are a lot of types of Fat trees How many trees are required to minimize power? Future work: Slow-silent fat trees fatter

Thank you for your attention

Backup sides

Wakeup delay: Performance impact Twakeup=0 Twakeup=1 Twakeup=2 Twakeup=3 Wakeup delays in literatures –ALU: 2 cycle –FPMAC in Intel’s 80-tile chip: 6 cycle Performance impact of wakeup delay (naïve mode) [Tschanz,JSSC’03] [Vangal,ISSCC’07]

Eg., A packet goes through R3, R4, R5, and R2 Look-Ahead Sleep Control Look-ahead sleep control –To mitigate the wakeup delay and short-term sleeps Normal routing: –Router i calculates the output port of Router i Look-ahead routing: –Router i calculates the output port of Router i+1 R0R1R2 R3R4R5 R6R7R8 Five-cycle margin until packet arrival R2 detects a packet arrival when the packet arrives at R4 Look-Ahead: RCSAST RCSAST Router 4Router 5Router 2 RC Packet will arrive after two hops Look-ahead can eliminate a wakeup delay of less than 5-cycle [Matsutani,ASP-DAC’08]

Look-ahead method: HW resources Routing computation of next router –Just changing the routing function –Area overhead is very small Wakeup signals are needed –Sender asserts “wakeup” signal to receiver –Wakeup signals becomes long –Negative impact of multi-cycle or repeater buffers NRCSAST NRCSAST NRCSAST HEAD DATA 1 DATA 2 NRC stage: Next Routing Computation Wakeup signals to router 1 [Matsutani,ASP-DAC’08]

VC activation: three grouping methods 4VC x 1 (# of lane is 1) –Starting from VC#0, –A packet moves VC#0  VC#1  VC#2  VC#3 2VC x 2 (# of lanes is 2) –If (dst%2)=0: a packet moves VC#0  VC#1 –If (dst%2)=1: a packet moves VC#2  VC#3 1VC x 4 (# of lanes is 4) –If (dst%4)=0: a packet uses VC#0 –… –If (dst%4)=3: a packet uses VC#3 VC#0VC#1VC#2VC#3 VC#0VC#1VC#2VC#3 dst=0,4dst=1,5dst=2,6dst=3,7 VC#0VC#1VC#2VC#3 dst=0,2,4dst=1,3,5 All packets The first one (used in this paper) achieves the highest performance with the least leakage power

X+ X- Y+ Y- CORE Buffer design: Registers or SRAMs It depends on buffer depth, not width –Depth > 32-flit  Buffers are designed with SRAMs –Otherwise  Buffers are designed with registers 5x5 XBAR ARBITER FIFO X+ X- Y+ Y- CORE In our design: Buffer depth is 4-flit FIFO buffers are designed with registers