McRouter: Multicast within a Router for High Performance NoCs


McRouter: Multicast within a Router for High Performance NoCs
Yuan He, Hiroshi Sasaki*, Shinobu Miwa, Hiroshi Nakamura
The University of Tokyo and *Kyushu University

Executive Summary
Like other networks, NoCs are latency critical, but our evaluations show that they also have plentiful bandwidth within the routers.
We propose multicasting packets within a router, that is, routing each packet to all possible outputs. Route computation is then completely hidden and is needed only to acknowledge the one correctly routed packet in a multicast.
Results show that McRouter makes more productive use of its internal bandwidth.
It outperforms the Prediction Router (the best router so far) on nearly all of the application traffic we evaluated.

Outline
Scope of the Work
Motivation
Proposal: Multicast within a Router
Evaluations and Results
Conclusion

Scope
On-chip routers
Standalone router designs, so not based on look-ahead routing: the Conventional Router and the Prediction Router (HPCA 2009, Matsutani et al.)
Mesh topology, but the idea should be applicable to other topologies as well

Motivation
Modern on-chip networks are latency critical: NoCs affect cache/memory access latency
Let us look at two router designs:
Conventional Router (4-cycle)
Prediction Router (1-cycle when prediction succeeds)

Conventional Router (CR)
Conventional virtual channel router: BW/RC -> VA -> SA -> ST
Problem -> 4 cycles
BW: Buffer Write
RC: Route Computation
VA: Virtual Channel Allocation
SA: Switch Allocation
ST: Switch Traversal

Prediction Router (PR, Hit)
Prediction Router (HPCA 2009, Matsutani et al.)
If the prediction hits (and VA/SA succeed with this predicted RC), only ST is needed (1 cycle)

Prediction Router (PR, Miss)
If the prediction misses, mis-routed packets get killed and the conventional data path is then used
Problem -> prediction accuracy is around 65% in our evaluation
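
The two router designs above can be compared with a quick expected-latency calculation. The following is a minimal sketch, not the authors' model: it assumes a miss costs exactly the conventional 4-cycle path with no extra recovery penalty, and that VA/SA always succeed on a hit. The function name `expected_pr_latency` is hypothetical.

```python
# Per-hop router latency in cycles, using the figures quoted on the slides.
CR_STAGES = ["BW/RC", "VA", "SA", "ST"]  # conventional pipeline, one cycle per stage
CR_LATENCY = len(CR_STAGES)              # 4 cycles
PR_HIT_LATENCY = 1                       # only ST is needed when the prediction hits


def expected_pr_latency(hit_rate, miss_latency=CR_LATENCY):
    """Expected Prediction Router latency per hop.

    Assumption (not from the slides): a miss is charged the conventional
    4-cycle path and nothing more, and a hit always passes VA/SA.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * PR_HIT_LATENCY + (1.0 - hit_rate) * miss_latency


if __name__ == "__main__":
    print(f"CR: {CR_LATENCY} cycles per hop")
    print(f"PR at 65% accuracy: {expected_pr_latency(0.65):.2f} cycles per hop")
```

Under these assumptions, a 65% hit rate gives roughly 2 cycles per hop on average, which is why a scheme that behaves like an always-correct prediction is attractive.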

Motivation (cont.)
Modern On-chip Networks: Bandwidth Plentiful
Observations

Observation 1: Average Link Utilization
[Figure: average link utilization (flits/link/cycle)]

Observation 1: Average Link Utilization
Worst case (FT): 0.031 flits/link/cycle
That is about 0.2 flits/crossbar/cycle, assuming a radix-6 router
Little contention internally
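
The crossbar figure appears to follow from multiplying the per-link utilization by the router radix. A small sketch of that arithmetic, assuming every one of the six ports carries the worst-case average load (the helper name `crossbar_load` is hypothetical):

```python
def crossbar_load(link_utilization, radix):
    """Flits per crossbar per cycle if each of `radix` ports carries link_utilization."""
    return link_utilization * radix


if __name__ == "__main__":
    load = crossbar_load(0.031, 6)  # worst-case workload (FT), radix-6 router
    print(f"{load:.3f} flits/crossbar/cycle")
```

This comes to about 0.19, which the slide rounds to 0.2 flits per crossbar per cycle.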

Observation 2: Concurrent Flits to a Router
[Figure: fraction of time by number of concurrent flits]

Observation 2: Concurrent Flits to a Router
Taking the worst-case workload, FT:
83% of the time -> no incoming flits
15% of the time -> 1 flit only
2% of the time -> 2+ flits
Very few chances of encountering concurrent flits
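
One way to read these numbers is to ask how often a busy cycle sees exactly one arriving flit. The sketch below derives that share from the three percentages on the slide; the derivation is ours rather than the authors', and the names `FT_ARRIVALS` and `single_flit_share` are hypothetical.

```python
# Fraction of cycles by number of flits arriving at a router (worst case, FT).
FT_ARRIVALS = {"none": 0.83, "one": 0.15, "two_or_more": 0.02}


def single_flit_share(dist):
    """Among cycles with at least one arriving flit, the share with exactly one."""
    busy = dist["one"] + dist["two_or_more"]
    if busy == 0:
        raise ValueError("distribution has no cycles with arriving flits")
    return dist["one"] / busy


if __name__ == "__main__":
    print(f"{single_flit_share(FT_ARRIVALS):.1%} of busy cycles see a single flit")
```

For FT this is 15/17, a little under 90% of the cycles in which any flit arrives.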

Proposal: Multicast within a Router
Or McRouter for short
A single-cycle router when there is enough bandwidth
Based on a multicast operation inside a router
A multicast is like an always-correct prediction
No predictors
[Figure: Conventional Router, Prediction Router, McRouter]

McRouter: Conditions to Invoke a Multicast
Only 1 flit arrives at the router (which means no concurrent flits)
Within this router, no flit is waiting to undertake ST (switch traversal)
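
The two conditions can be expressed as a single predicate. This is a minimal sketch: the `RouterState` fields and the `can_multicast` name are hypothetical and only mirror the wording of the slide, not the actual hardware signals.

```python
from dataclasses import dataclass


@dataclass
class RouterState:
    arriving_flits: int        # flits arriving at the router in this cycle
    flits_waiting_for_st: int  # buffered flits waiting for switch traversal


def can_multicast(state):
    """True only when both conditions from the slide hold."""
    exactly_one_arrival = state.arriving_flits == 1
    crossbar_idle = state.flits_waiting_for_st == 0
    return exactly_one_arrival and crossbar_idle


if __name__ == "__main__":
    print(can_multicast(RouterState(arriving_flits=1, flits_waiting_for_st=0)))
    print(can_multicast(RouterState(arriving_flits=2, flits_waiting_for_st=0)))
```

When the predicate is false, the flit would presumably take the conventional path instead.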

Multicasting Operation
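
As described in the executive summary, a multicast sends the flit to all possible outputs while route computation identifies the one correctly routed copy. The sketch below models that in software under several assumptions that are not from the slides: a five-port mesh router with hypothetical port names, dimension-order (XY) routing with y increasing to the north, and mis-routed copies simply reported as killed.

```python
PORTS = ("north", "east", "south", "west", "local")  # hypothetical port names


def xy_route(cur, dst):
    """Dimension-order (XY) routing on a mesh; an assumed routing function."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "east"
    if dx < cx:
        return "west"
    if dy > cy:
        return "north"
    if dy < cy:
        return "south"
    return "local"


def multicast_step(cur, dst, in_port):
    """Return (kept_port, killed_ports) for one multicast inside a router."""
    # Copies go to every output except the port the flit came in on.
    candidates = [p for p in PORTS if p != in_port]
    # In hardware this would run in parallel with the multicast itself.
    correct = xy_route(cur, dst)
    if correct not in candidates:
        raise ValueError("route points back at the input port")
    killed = [p for p in candidates if p != correct]
    return correct, killed


if __name__ == "__main__":
    kept, killed = multicast_step(cur=(1, 1), dst=(3, 1), in_port="west")
    print("kept:", kept, "killed:", killed)
```

The point of the sketch is only that the route computation result is used after the fact to pick the surviving copy, so it no longer sits on the critical path.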

A Summary on McRouter
Pros:
A single-cycle router when internal bandwidth allows
No predictors
Cons:
More complex control over the crossbar switch
More mis-routed flits to kill

Evaluation Methodology
CPU model: Simics 3.0.31; 16 cores, in-order
Memory model: GEMS 2.1.1; 32KB L1 I/D caches; 256KB L2 cache x 16 banks; 4 memory controllers, 4GB main memory
NoC model: GARNET; 4 x 4 mesh with virtual channel routers
NoC power model: Orion 2; 32nm process and 1V Vdd
Synthetic traffic: uniform random
Benchmarks: 13 workloads from SPLASH-2 and NPB-3
Counterparts: CR and PR
[Figure labels: Router, Link, Core/L1$s, L2$, Memory Controller]

Evaluations with Synthetic Traffic
[Figure annotations: 0.34 flits/link/cycle and 0.07 flits/link/cycle]

Evaluations with Application Traffic: Normalized System Speed-up

Sensitivity Study with Network Parameter Downscaling
[Figure panels: workload raytrace; workload FT]
Parameters downscaled: link width halved, number of VCs minimized
McRouter still works with thinned bandwidth
Its advantage over CR/PR does not come from over-designing

Conclusion
A new low-latency router
It successfully hides route computation and arbitration delays while still being a standalone design
It outperforms PR (the best router so far) in practice
Insight: with more aggressive use of the remaining internal bandwidth, a router's latency can be dramatically shortened with simple architectural changes

Thank you so much for your attention!