Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim

Presentation transcript:

Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip
Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim
Department of Computer Science and Engineering, Texas A&M University

Multi-Core Wave & Networks-On-Chip
- Uniprocessors hit the power wall.
- Multi-processors provide high performance at a lower power budget.
- Shared-bus architectures have limited scalability.
- Networks-On-Chip (NOCs) orchestrate chip-wide communication for future many-core processors.
- MIT Raw (0.18um, 300MHz): 16-core chip, four 4x4 mesh networks
- Intel Polaris (65nm, 4GHz): 80-core chip, 8x10 mesh network
First, let's look at two changes in our processor design.
Lei Wang - NOCS 2009

Challenges in On-Chip Communication
- High performance: low communication latency is critical for high system performance.
- Bandwidth efficiency: well-designed routing algorithms provide high network throughput.
- Power and area constraints: simple topologies and slim routers reduce communication power consumption and save chip area.
- Efficient multicast support: cache coherence protocols rely heavily on multicast or broadcast communication.
We propose a bandwidth-efficient routing algorithm for multicast communication in NOCs with low latency and low power consumption.

Prior Work in Multicast Communication
- Routing Evaluation Criteria for Multicast Communication [Ni93]: multicast in multicomputer systems
- Tree-based Multicast Routing for DSM Multiprocessors [Torrellas96]: short message multicast in DSM systems
- Virtual Circuit Tree Multicasting for NOCs [Lipasti08]: demonstrates the necessity of multicasting on-chip and proposes table-based multicast routing
- Region-based Multicast for CMPs [Duato08]: multicast routing for irregular topologies in CMPs

Outline
- Motivation
- Multicast Router Design
  - State-of-art Unicast Router Architecture
  - Replication Schemes
  - Destination List Management
- Recursive Partitioning Multicast (RPM)
  - Network Partitioning
  - Routing Rules
  - Example
  - Deadlock Avoidance
- Evaluation
- Conclusion

Different Bandwidth Usage Example
[Figure: two mesh diagrams (nodes numbered up to 15) with source and destination nodes marked, one for the left path and one for the right path]
- Left path: 11 link traversals, 12 buffer writes, 15 buffer reads, and 15 crossbar traversals.
- Right path: 5 link traversals, 6 buffer writes, 10 buffer reads, and 10 crossbar traversals.
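The gap between the two paths comes down to how many links are shared on the way to the destinations. The sketch below is a rough illustration of that idea only: it does not reproduce the figure's routes or its numbers. It assumes a 4x4 mesh with node id = row * 4 + column, dimension-order (X then Y) routing, and a hypothetical destination set; the function names are made up for this example.

```python
def xy_links(src, dst, width=4):
    """Directed links used by X-then-Y routing from src to dst (assumed routing)."""
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    links = []
    while x != dx:  # travel along the row first
        nx = x + (1 if dx > x else -1)
        links.append((y * width + x, y * width + nx))
        x = nx
    while y != dy:  # then along the column
        ny = y + (1 if dy > y else -1)
        links.append((y * width + x, ny * width + x))
        y = ny
    return links


def link_traversals(src, dests, width=4):
    """Return (links used by separate unicasts, links used when shared links carry one copy)."""
    paths = [xy_links(src, d, width) for d in dests]
    separate = sum(len(p) for p in paths)
    shared = len({link for p in paths for link in p})
    return separate, shared


# Hypothetical example: source 0, destinations down the last column.
separate, shared = link_traversals(0, [3, 7, 11, 15])
print(separate, shared)
```

Sending one copy over each shared link and replicating only where paths diverge is what a multicast-capable router makes possible.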

State-of-Art Wormhole Unicast Router
[Figure: pipeline stages RC, VA, SA, ST (router) followed by LT (link), shown for two consecutive hops]
RC: Route Computation; VA: VC Allocation; SA: Switch Allocation; ST: Switch Traversal; LT: Link Traversal
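To get a feel for what this pipeline costs per hop, here is a back-of-the-envelope sketch. It assumes one cycle per stage, no contention, and that a hop means one router plus its outgoing link; none of these figures come from the slides, and the function name is hypothetical.

```python
ROUTER_STAGES = ("RC", "VA", "SA", "ST")  # stages listed on the slide
LINK_STAGES = ("LT",)


def zero_load_latency(hops, packet_flits, cycles_per_stage=1):
    """Cycles until the tail flit arrives, assuming an uncontended pipeline."""
    per_hop = (len(ROUTER_STAGES) + len(LINK_STAGES)) * cycles_per_stage
    head_latency = hops * per_hop          # head flit walks every stage of every hop
    serialization = packet_flits - 1       # remaining flits follow one cycle apart
    return head_latency + serialization


print(zero_load_latency(hops=3, packet_flits=5))
```

Under these assumptions each extra hop adds five cycles, which is why the number of hops and replicated copies matters for multicast latency.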

What do we need in a Multicast Router?
- Packet Replication
  - Synchronous Replication
  - Asynchronous Replication
- Destination List Management
  - All-destination Encoding
  - Bit String Encoding
  - Multiple-region Broadcast Encoding
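Bit string encoding is usually understood as one bit per node, set when that node is a destination. The sketch below shows that common reading; the slide gives no details, so the bit order (bit i stands for node i) and the helper names are assumptions.

```python
def encode_destinations(dests, num_nodes):
    """Pack a destination set into an integer bit string (bit i = node i)."""
    bits = 0
    for node in dests:
        if not 0 <= node < num_nodes:
            raise ValueError(f"node {node} outside 0..{num_nodes - 1}")
        bits |= 1 << node
    return bits


def decode_destinations(bits, num_nodes):
    """Recover the sorted destination list from the bit string."""
    return [node for node in range(num_nodes) if bits >> node & 1]


header = encode_destinations([1, 4, 63], num_nodes=64)  # 8x8 mesh, 64-bit field
print(f"{header:064b}")
print(decode_destinations(header, 64))
```

The field has a fixed width equal to the node count, so its size does not depend on how many destinations a packet has.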

Synchronous Replication
[Figure: timing diagram (time in cycles) of head (H), middle (M), and tail (T) flits moving from input ports 0-3 to output ports 0-3]
Packet replication happens at the Switch Traversal stage.

Asynchronous Replication
[Figure: timing diagram (time in cycles) of head (H), middle (M), and tail (T) flits moving from input ports 0-3 to output ports 0-3]

Network Partitioning
[Figure: the network around the source node is divided into eight parts; each of the directions N, W, E, S covers three parts, labeled in the figure as (5, 6, 7), (0, 1, 7), (3, 4, 5), and (1, 2, 3)]
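The partitioning step can be sketched as classifying each destination by where it lies relative to the current node. The part numbers from the figure are not recoverable here, so this sketch names the eight parts by compass direction instead. The coordinate convention (x grows eastward, y grows northward) and the function names are assumptions.

```python
def part_of(cur, dst):
    """Name the part of the mesh that dst falls in, as seen from cur.

    Returns one of N, NE, E, SE, S, SW, W, NW, or "" when dst == cur.
    """
    (cx, cy), (dx, dy) = cur, dst
    ns = "N" if dy > cy else "S" if dy < cy else ""
    ew = "E" if dx > cx else "W" if dx < cx else ""
    return ns + ew


def partition(cur, dests):
    """Group destinations by the part they fall in (at most eight groups)."""
    parts = {}
    for d in dests:
        parts.setdefault(part_of(cur, d), []).append(d)
    return parts


print(partition((1, 1), [(1, 3), (3, 3), (0, 1), (0, 0)]))
```

Because the classification is relative to the current node, it can be repeated at every router a packet visits, which is where the "recursive" in RPM comes from.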

Basic Routing Rules
- North: top right corner.
- West: top left corner.
- South: bottom left corner.
- East: bottom right corner.
[Figure: source and destination nodes with the N, W, E, S output directions marked]
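One way to read these rules: destinations in the top-right corner leave through the North port, top-left through West, bottom-left through South, and bottom-right through East, while destinations straight along an axis leave through that axis's port. The sketch below applies that reading recursively at every router. It is an interpretation of the slide, not the authors' implementation; it ignores the optimized rules and deadlock handling, and the coordinate convention (x east, y north) and all names are assumptions.

```python
CORNER_TO_PORT = {"NE": "N", "NW": "W", "SW": "S", "SE": "E"}  # assumed reading of the slide
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def output_port(cur, dst):
    """Pick the output port for dst under the assumed basic rules."""
    (cx, cy), (dx, dy) = cur, dst
    ns = "N" if dy > cy else "S" if dy < cy else ""
    ew = "E" if dx > cx else "W" if dx < cx else ""
    part = ns + ew
    return CORNER_TO_PORT.get(part, part)  # axis-aligned parts map to themselves


def route_multicast(cur, dests):
    """Return the links used to reach every destination, one copy per link."""
    groups = {}
    for d in dests:
        if d != cur:  # a destination equal to cur is ejected locally
            groups.setdefault(output_port(cur, d), []).append(d)
    links = []
    for port, group in groups.items():
        nxt = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])
        links.append((cur, nxt))
        links.extend(route_multicast(nxt, group))  # re-partition at the next router
    return links


links = route_multicast((1, 1), [(3, 3), (1, 3), (0, 2)])
print(len(links), links)
```

Every move reduces the distance to each destination in its group, so the recursion always terminates and each destination is reached over a minimal path.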

Optimized Routing Rules
[Figure: source and destination nodes, with the annotation "Deadlock!!!"]

RPM Example, step 1
[Figure: multicast packet, source, destinations, partitioning]

RPM Example, step 2
[Figure: multicast packet, source, destinations, partitioning, ejection]

RPM Example, step 3
[Figure: multicast packet, source, destinations, partitioning]

RPM Example, step 4
[Figure: multicast packet, source, destinations, partitioning, ejections]

RPM Example, step 5
[Figure: multicast packet, source, destinations, partitioning, ejection]

Deadlock Avoidance
- RPM has no turn restrictions, potentially introducing deadlock.
- We use Virtual Networks (VNs) to avoid deadlock.
- Two VNs lie in the same physical network.
- The Virtual Channels of each port are equally divided between the virtual networks.
- The virtual network ID (0 or 1) of each packet is decided at the source.
[Figure: the mesh (nodes numbered up to 15) drawn twice, as Virtual Network 0 and Virtual Network 1]
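The equal division of virtual channels can be sketched as follows. The slide does not say which channels go to which virtual network, so the contiguous split here is an assumption, as is the helper name; how the source chooses the VN ID is not covered.

```python
def vcs_for_vn(vn_id, vcs_per_port=4, num_vns=2):
    """Virtual channel indices a packet in virtual network vn_id may use at a port."""
    if vcs_per_port % num_vns != 0:
        raise ValueError("VCs cannot be divided equally between virtual networks")
    if not 0 <= vn_id < num_vns:
        raise ValueError(f"vn_id must be in 0..{num_vns - 1}")
    size = vcs_per_port // num_vns
    return list(range(vn_id * size, (vn_id + 1) * size))  # assumed contiguous split


print(vcs_for_vn(0))
print(vcs_for_vn(1))
```

Since a packet keeps its VN ID from source to destination, packets in different virtual networks never compete for the same virtual channel.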

Evaluation Methodology
Performance model: cycle-accurate network simulator
- Models all router pipeline stages in detail
- Highly parameterized
Power model: Orion, with both dynamic and leakage power models
Network configuration:
  Topology: 8×8 Mesh (6×6 Mesh, 10×10 Mesh, 16×16 Mesh)
  Routing: RPM
  VC/Port: 4
  VC Depth:
  Packet Length (flits):
  Unicast Traffic Pattern: Uniform Random (Bit Complement, Transpose)
  Multicast Packet Portion: 10% (5%, 20%, 40%, 80%)
  Multicast Destination Number: 0-16 (uniformly distributed)

Uniform Random Traffic
[Figure: plots annotated with 50%, 40%, and 40%]
- Latency is improved by around 50% before network saturation.
- Network throughput is extended by 40%.

Link Utilization
[Figure: plot annotated with 33% and 45%]
- Under low workload, RPM reduces link utilization by 33%.
- Under high workload, RPM reduces link utilization by 45%.

Dynamic Power Consumption
[Figure: plot annotated with 50% and 40%]

Scalability Study: Network Size
[Figure: plot annotated with "Over 50%"]

Scalability Study: Multicast Traffic Portion

Scalability Study: Destination Number

Conclusion
- We propose a new multicast routing algorithm, Recursive Partitioning Multicast (RPM).
- Bandwidth-efficient and scalable
- Performance improvement
  - Up to 50% latency reduction
  - 33% link utilization reduction
- Power savings
  - Up to 40% total dynamic power savings
  - 25% crossbar and link power savings

Thank you!

Backup

Hardware Implementation of Routing Logic

Bit Complement Traffic

Transpose Traffic