MINIMISING DYNAMIC POWER CONSUMPTION IN ON-CHIP NETWORKS Robert Mullins Computer Architecture Group Computer Laboratory University of Cambridge, UK.

Presentation transcript:


2/19 Communication-Centric Architectures
Future performance gains will come primarily from increasing the number of IP cores in a system, not from increasing their complexity or operating frequency.
Many reasons:
- Diminishing returns from simply scaling what we have
- Energy efficiency
- Complexity
- Fault tolerance
- Economics

3/19 On-Chip Networks
- An efficient, general-purpose, chip-wide communication infrastructure is becoming essential
- One flexible networking option is to use packet-switched networks with support for virtual channels

4/19 The Lochside Router
Router architecture:
- Highly parameterised implementation
- Packet-switched network with virtual-channel flow control
- Best-case latency is one cycle per network hop
Results presented here are from post-P&R simulations targeting a 90nm technology.
[Figure: Lochside chip (2004/05), 180nm technology. Labels: TILE; Traffic Generator, Debug & Test; R]

5/19 Exploiting Speculation to Reduce Communication Latency Peh/Dally (2001)

6/19 Exploiting Speculation to Reduce Communication Latency

7/19 Aims of this work
- Apply existing power-saving techniques to an on-chip network design
  - e.g. clock and signal gating, gate-level optimisations etc.
  - Importance of applying such techniques before making comparisons
- Measure power consumption and provide an accurate breakdown of where the remaining power is dissipated
- Where is the best place to look for future power savings?

8/19 Measuring and Optimizing Dynamic Power
Our test case:
- 8mm x 8mm die
- 4x4 mesh network
- Low-latency routers, best-case latency is one cycle per hop (incl. interconnect)
- 1.2V, 90nm technology
- 4 input buffers per VC
- 4 VCs per input port
- 48 x 80-bit network links
- Clock period of ~32 FO4 at worst-case (WC) PVT
- Results reported at 250MHz
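The test-case figures above can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes that the "48 links" are unidirectional router-to-router channels (one per direction between neighbouring routers) and that each link moves at most one 80-bit word per clock cycle. Neither assumption is stated on the slide, and the function and variable names are made up for this example.

```python
# Sanity check of the test-case numbers (a sketch, not the authors' method).
# Assumption: "48 links" counts unidirectional router-to-router channels.

def mesh_unidirectional_links(rows: int, cols: int) -> int:
    """Count router-to-router channels in a 2D mesh, one per direction."""
    horizontal = rows * (cols - 1)   # neighbour pairs within each row
    vertical = cols * (rows - 1)     # neighbour pairs within each column
    return 2 * (horizontal + vertical)

LINK_WIDTH_BITS = 80     # from the slide
CLOCK_HZ = 250e6         # results are reported at 250MHz

links = mesh_unidirectional_links(4, 4)
total_wires = links * LINK_WIDTH_BITS
clock_period_ns = 1e9 / CLOCK_HZ

# Assumption: at most one 80-bit transfer per link per cycle.
peak_gbit_per_s = total_wires * CLOCK_HZ / 1e9

print(f"links={links}, data wires={total_wires}")
print(f"clock period={clock_period_ns} ns, raw peak={peak_gbit_per_s} Gbit/s")
```

Under these assumptions a 4x4 mesh has exactly 48 channels, which matches the slide. The 250MHz reporting frequency gives a 4 ns period; this is not necessarily the same as the ~32 FO4 worst-case clock period, so no FO4 delay is derived here.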

9/19 Interconnect Delay/Energy Trade-offs
- Power dissipated in network links depends on how links are spaced and buffered
- At least a factor of 3 difference in energy consumption over the range of potential interconnect options
- Could move to low-swing differential schemes for even greater energy savings
- For the results we assume minimum-spaced wires with optimal energy x delay product

10/19 Clock Gating
Clock gating optimisations applied at two levels:
- Local clock gating
  - Automated clock gating within router
  - Some tuning of RTL involved to maximise opportunities for synthesis tool
- Router-level clock gating
  - Exploit opportunities to gate clock as it enters the router
  - Isolates router's clock completely, only static power consumption remains

11/19 Router-Level Clock Gating
- Clock gating exposes clock tree insertion delay
- Need to know early if router will be required
- Generate 'early valid' signals in neighbouring routers
  - Early-valid signals are slightly pessimistic
  - Based on what is requested, not granted
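A toy cycle-level model can illustrate why request-based early-valid signals are pessimistic but safe. This is a sketch under stated assumptions, not the Lochside implementation: it assumes a one-cycle lead time between an early-valid signal and the router's clock being enabled, and that a granted flit arrives one cycle after the grant. The traces and helper names are hypothetical.

```python
# Toy model of router-level clock gating with early-valid signals.
# Assumptions (not from the slides): one-cycle lead time, hypothetical traces.

def delay_one_cycle(signal):
    """Shift a per-cycle boolean trace one cycle later."""
    return [False] + list(signal[:-1])

# requests[t]: some neighbour requested this router in cycle t (early valid).
# grants[t]:   a neighbour was actually granted and sends a flit in cycle t.
requests = [True, True, False, False, True, False]
grants = [True, False, False, False, True, False]

# A grant can only follow a request, so early-valid covers every arrival.
assert all(r or not g for r, g in zip(requests, grants))

clock_enable = delay_one_cycle(requests)   # clock runs one cycle after a request
arrivals = delay_one_cycle(grants)         # flit arrives one cycle after a grant

clocked = sum(clock_enable)
useful = sum(a and e for a, e in zip(arrivals, clock_enable))
missed = sum(a and not e for a, e in zip(arrivals, clock_enable))
wasted = clocked - useful

print(f"clocked={clocked} of {len(requests)} cycles, "
      f"useful={useful}, wasted={wasted}, missed={missed}")
```

In this trace the router is clocked for 3 of 6 cycles, one of which carries no flit (a request that was not granted), and no arrival is ever missed. Basing the signal on requests rather than grants trades a few wasted clocked cycles for not having to wait for arbitration to finish.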

12/19 Gate-Level Optimizations and Signal Gating
- Automated signal gating and gate-level power optimisations had minimal impact
- Inserting signal gating logic manually did reduce input FIFO power requirements significantly
- The reported results could be further improved (by 12%) by enabling logic optimisation across module boundaries
  - This was restricted in order to determine accurately where power is dissipated

13/19 Analysis of Power Consumption
[Figure: Power consumption of a single router and its links]
- Simple power optimisations can quarter power requirements, and there are many more opportunities to save power
- Network is ~5% of core area
- Perhaps 10% of system power at present
- Don't make comparisons without optimizing power!

14/19 Analysis of Power Consumption
Power breakdown:
- 22% static power
- 11% inter-router links
- ~1% global clock tree
- 65% dynamic power, broken down as follows:
  - ~50% of dynamic power is consumed in local clock tree and input FIFOs
  - ~30% on router datapath
  - ~20% on scheduling and arbitration
    - Scheduling is probably more complex than typical implementations due to speculation
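The two levels of the breakdown can be combined to express each dynamic-power component as a share of the total. The sketch below uses only the percentages quoted on the slide; it assumes that the ~50/30/20 split applies to the 65% dynamic-power figure, and the dictionary names are illustrative.

```python
# Combine the two-level power breakdown (percentages taken from the slide).
# Assumption: the ~50/30/20 split applies to the 65% "dynamic power" share.

total_share = {
    "static power": 22,
    "inter-router links": 11,
    "global clock tree": 1,
    "dynamic power": 65,
}

dynamic_split = {
    "local clock tree and input FIFOs": 50,
    "router datapath": 30,
    "scheduling and arbitration": 20,
}

# The quoted figures are rounded, so they need not sum to exactly 100.
accounted = sum(total_share.values())

share_of_total = {
    name: total_share["dynamic power"] * pct / 100
    for name, pct in dynamic_split.items()
}

print(f"accounted for: {accounted}%")
for name, pct in share_of_total.items():
    print(f"{name}: {pct}% of total power")
```

On these figures the local clock tree and input FIFOs account for roughly a third of total power (32.5%), more than static power (22%) or the inter-router links (11%).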

15/19 Low-Power On-Chip Networks
- Interconnect and static power set to increase
  - Many low-power link technologies, e.g. low-swing differential techniques
  - Power gating and other leakage reduction techniques
- Potential power savings begin to require lots of different techniques: no one silver bullet?

16/19 Low-Power On-Chip Networks
Topology
- Don't want to sacrifice the general or at least multi-purpose nature of our networked SoC
- Results suggest higher-radix routers and longer interconnects could reduce power
  - Probably not a long-term solution
  - Reduces path diversity, bad for fault tolerance
Architecture
- Scope for minimising memory required to store precomputed router schedule (particular to our router)
- Simpler routers
- Single-cycle routers reduce power? Speculation for low power?

17/19 Supporting Best-Effort (BE) and Guaranteed Services (GS) Efficiently
- Current timing of the datapath and link suggests additional GS data could be routed in the same clock cycle
  - Allocate datapath/link to GS traffic for the first half of the clock cycle
- Double capacity of network
  - Exploit simpler GS circuit-switched routing when possible
  - Reduce power
- Very little additional overhead
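The capacity claim can be illustrated with a toy model of a single link. This is a sketch under assumptions the slide does not spell out: GS traffic uses the first half of each clock cycle, BE traffic uses the second half, and each half-cycle slot carries one flit. The function name and queue sizes are hypothetical.

```python
# Toy model of one link shared between GS and BE traffic.
# Assumption: GS uses the first half-cycle, BE the second; one flit per slot.

def flits_delivered(gs_queued: int, be_queued: int, cycles: int,
                    share_half_cycles: bool) -> int:
    """Count flits sent over one link in `cycles` clock cycles."""
    sent = 0
    for _ in range(cycles):
        if share_half_cycles:
            # First half-cycle: GS slot. Second half-cycle: BE slot.
            if gs_queued > 0:
                gs_queued -= 1
                sent += 1
            if be_queued > 0:
                be_queued -= 1
                sent += 1
        else:
            # Baseline: one flit per cycle, GS traffic served first.
            if gs_queued > 0:
                gs_queued -= 1
                sent += 1
            elif be_queued > 0:
                be_queued -= 1
                sent += 1
    return sent

baseline = flits_delivered(10, 10, cycles=10, share_half_cycles=False)
shared = flits_delivered(10, 10, cycles=10, share_half_cycles=True)
print(f"baseline={baseline} flits, shared={shared} flits")
```

With both queues saturated, the shared link delivers twice as many flits in the same number of cycles. The doubling only appears when both traffic classes have data to send; an empty GS or BE slot goes unused in this simple model.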

18/19 Clocking On-Chip Networks
- Network system timing issues are interesting
  - Naturally event-driven, not synchronous
- Work is investigating placing local data-driven clock generators in each network router
  - Clock is stretched when there is no data to be routed
  - Clock matches rate of incoming data streams
  - Robust synchronisation solution (true GALS)
  - Also investigating incorporating power gating support
- See also Distributed Clock Generator (DCG) (Fairbanks/Moore)

19/19 Challenges and Future Work
- These are early results in a much more rigorous study on the power requirements of networked on-chip communication
  - Much more soon!
- Exploiting a general-purpose on-chip network
  - Exploiting execution diversity to improve energy efficiency
  - Multi-use platforms and Virtual-IP
  - Fault tolerance
  - Networks of processing elements or networks that process?
- Scope for removing unnecessary interfaces and boundaries
- Impact of networking on IP and processor core design

Thank You