Design and Evaluation of Hierarchical Rings with Deflection Routing Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna.

Slides:

Advertisements

Similar presentations

George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝

Advertisements

1 On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects George Nychis ✝, Chris Fallin ✝, Thomas Moscibroda.

Application-to-Core Mapping Policies to Reduce Memory System Interference Reetuparna Das * Rachata Ausavarungnirun $ Onur Mutlu $ Akhilesh Kumar § Mani.

QuT: A Low-Power Optical Network-on-chip

A Novel 3D Layer-Multiplexed On-Chip Network

Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.

Presentation of Designing Efficient Irregular Networks for Heterogeneous Systems-on-Chip by Christian Neeb and Norbert Wehn and Workload Driven Synthesis.

1 An Efficient, Hardware-based Multi-Hash Scheme for High Speed IP Lookup Hot Interconnects 2008 Socrates Demetriades, Michel Hanna, Sangyeun Cho and Rami.

Flattened Butterfly Topology for On-Chip Networks John Kim, James Balfour, and William J. Dally Presented by Jun Pang.

REAL-TIME COMMUNICATION ANALYSIS FOR NOCS WITH WORMHOLE SWITCHING Presented by Sina Gholamian, 1 09/11/2011.

Aérgia: Exploiting Packet Latency Slack in On-Chip Networks

Evaluating Bufferless Flow Control for On-Chip Networks George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis Stanford University.

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu*, Kevin Chang, Rachata Ausavarungnirun,

CHIPPER: A Low-complexity Bufferless Deflection Router Chris Fallin Chris Craik Onur Mutlu.

Destination-Based Adaptive Routing for 2D Mesh Networks ANCS 2010 Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering University of California,

High Performance Router Architectures for Network- based Computing By Dr. Timothy Mark Pinkston University of South California Computer Engineering Division.

Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks Kapre et. al RC Reading Group – 3/29/2006 Presenter: Ilya Tabakh.

Network based System on Chip Part A Performed by: Medvedev Alexey Supervisor: Walter Isaschar (Zigmond) Winter-Spring 2006.

Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim

1 Indirect Adaptive Routing on Large Scale Interconnection Networks Nan Jiang, William J. Dally Computer System Laboratory Stanford University John Kim.

1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The.

A Scalable, Commodity Data Center Network Architecture Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat Presented by Gregory Peaker and Tyler Maclean.

1 On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects George Nychis ✝, Chris Fallin ✝, Thomas Moscibroda.

Dragonfly Topology and Routing

Performance and Power Efficient On-Chip Communication Using Adaptive Virtual Point-to-Point Connections M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol.

McRouter: Multicast within a Router for High Performance NoCs

High Performance Embedded Computing © 2007 Elsevier Lecture 16: Interconnection Networks Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte.

Déjà Vu Switching for Multiplane NoCs NOCS’12 University of Pittsburgh Ahmed Abousamra Rami MelhemAlex Jones.

1 Application Aware Prioritization Mechanisms for On-Chip Networks Reetuparna Das Onur Mutlu † Thomas Moscibroda ‡ Chita Das § Reetuparna Das § Onur Mutlu.

High-Level Interconnect Architectures for FPGAs An investigation into network-based interconnect systems for existing and future FPGA architectures Nick.

Author : Jing Lin, Xiaola Lin, Liang Tang Publish Journal of parallel and Distributed Computing MAKING-A-STOP: A NEW BUFFERLESS ROUTING ALGORITHM FOR ON-CHIP.

High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.

Data and Computer Communications Chapter 10 – Circuit Switching and Packet Switching (Wide Area Networks)

George Michelogiannakis, Prof. William J. Dally Concurrent architecture & VLSI group Stanford University Elastic Buffer Flow Control for On-chip Networks.

O1TURN : Near-Optimal Worst-Case Throughput Routing for 2D-Mesh Networks DaeHo Seo, Akif Ali, WonTaek Lim Nauman Rafique, Mithuna Thottethodi School of.

Express Cube Topologies for On-chip Interconnects Boris Grot J. Hestness, S. W. Keckler, O. Mutlu † The University of Texas at Austin † Carnegie Mellon.

Congestion Control in CSMA-Based Networks with Inconsistent Channel State V. Gambiroza and E. Knightly Rice Networks Group

CS 8501 Networks-on-Chip (NoCs) Lukasz Szafaryn 15 FEB 10.

50 th Annual Allerton Conference, 2012 On the Capacity of Bufferless Networks-on-Chip Alex Shpiner, Erez Kantor, Pu Li, Israel Cidon and Isaac Keslassy.

Non-Minimal Routing Strategy for Application-Specific Networks-on-Chips Hiroki Matsutani Michihiro Koibuchi Yutaka Yamada Jouraku Akiya Hideharu Amano.

COP 5611 Operating Systems Spring 2010 Dan C. Marinescu Office: HEC 439 B Office hours: M-Wd 2:00-3:00 PM.

University of Michigan, Ann Arbor

Performance, Cost, and Energy Evaluation of Fat H-Tree: A Cost-Efficient Tree-Based On-Chip Network Hiroki Matsutani (Keio Univ, JAPAN) Michihiro Koibuchi.

Yu Cai Ken Mai Onur Mutlu

Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research.

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. MishraChita R. DasOnur Mutlu.

Dynamic Traffic Distribution among Hierarchy Levels in Hierarchical Networks-on-Chip Ran Manevich, Israel Cidon, and Avinoam Kolodny Group Research QNoC.

Intel Slide 1 A Comparative Study of Arbitration Algorithms for the Alpha Pipelined Router Shubu Mukherjee*, Federico Silla !, Peter Bannon $, Joel.

Topology-aware QOS Support in Highly Integrated CMPs Boris Grot (UT-Austin) Stephen W. Keckler (NVIDIA/UT-Austin) Onur Mutlu (CMU) WIOSCA '10.

Achieving High Performance and Fairness at Low Cost Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, Onur Mutlu 1 The Blacklisting Memory.

Design Space Exploration for NoC Topologies ECE757 6 th May 2009 By Amit Kumar, Kanchan Damle, Muhammad Shoaib Bin Altaf, Janaki K.M Jillella Course Instructor:

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang Rachata Ausavarungnirun Chris Fallin Onur Mutlu.

Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Lavanya Subramanian 1.

1 Lecture 29: Interconnection Networks Papers: Express Virtual Channels: Towards the Ideal Interconnection Fabric, ISCA’07, Princeton Interconnect Design.

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

ESE532: System-on-a-Chip Architecture

Effective mechanism for bufferless networks at intensive workloads

Pablo Abad, Pablo Prieto, Valentin Puente, Jose-Angel Gregorio

Rachata Ausavarungnirun, Kevin Chang

Exploring Concentration and Channel Slicing in On-chip Network Router

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

Rahul Boyapati. , Jiayi Huang

Israel Cidon, Ran Ginosar and Avinoam Kolodny

Achieving High Performance and Fairness at Low Cost

Impact of Interconnection Network resources on CMP performance

Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti

Data Center Architectures

CS 6290 Many-core & Interconnect

HopliteBuf: FPGA NoCs with Provably Stall-Free FIFOs

A Case for Bufferless Routing in On-Chip Networks

Presentation transcript:

Design and Evaluation of Hierarchical Rings with Deflection Routing Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna Das, Gabriel H. Loh, Onur Mutlu

Executive Summary Rings do not scale well as core count increases Traditional hierarchical ring designs are complex and energy inefficient –Complicated buffering and flow control Solution: Hierarchical Rings with Deflection (HiRD) –Guarantees livelock freedom and delivery –Eliminates all buffers at local routers and most buffers at bridge routers HiRD provides higher performance and energy-efficiency than hierarchical rings HiRD is simpler than hierarchical rings 2

Outline Background and Motivation Key Idea: Deflection Routing End-to-end Delivery Guarantees Our Solution: HiRD Results Conclusion 3

Scaling Problems in a Ring NoC As the number of cores grows: –Lower performance –More power 4

Alternative: Hierarchical Designs Packets can reach far destination in fewer hops 5 Local Ring (Level 0) Global Ring (Level 1)

Single Ring vs. Hierarchical Rings 6 A hierarchical design provides better performance as the network scales

Complexity in Hierarchical Designs 7 Complex buffering and flow control

Single Ring vs. Hierarchical Rings 8 Design complexity increases power consumption

Our Goal Design a hierarchical ring that has lower complexity without sacrificing performance 9

Outline Background and Motivation Key Idea: Deflection Routing End-to-end Delivery Guarantees Our Solution: HiRD Results Conclusion 10

Key Idea Eliminate buffers Use deflection routing –Simpler flow control 11

Local Router Key functionality: –Accept new flits –Pass flits around the ring 12 Core Local East Local West

Eliminating Buffers in Local Routers 13 Core Local East Local West

Eliminating Buffers in Local Routers Flits can enter the ring if the output is available 14 Core Local East Local West Ejector No Buffer Simpler Crossbar

Deflection Routing 15 Core Local East Local West Ejector Deflected

Bridge Router 16 Local Ring Global Ring Crossbar WestEast WestEast

Eliminating Buffers in Bridge Routers 17 Crossbar Local Ring Global Ring WestEast West East

Eliminating Buffers in Bridge Routers 18 Local Ring Global Ring WestEast West East Simpler Buffering Fewer Buffers Simpler Crossbar

Outline Background and Motivation Key Idea: Deflection Routing End-to-end Delivery Guarantees Our Solution: HiRD Results Conclusion 19

Injection starvation 20 Src Unable to inject Starved Flit Ring Livelock in Deflection Routing

Ring HiRD: Injection Guarantee Throttling provides injection guarantee 21 After 150 cycles: All nodes stop injecting flits Src Unable to inject Starved Flit Throttled Router

Livelock in Deflection Routing Transfer starvation 22 Transfer FIFO Unable to Transfer Starved Flit Ring

HiRD: Transfer Guarantee Reservation provides transfer guarantee 23 After 10 looparounds Transfer FIFO Starved Flit Ring Reserved Slot

Ejection Guarantee Provided by a prior work Re-transmit once [Fallin et al., HPCA’11] –Drop a flit if there is no available slot –Reserve a buffer slot at the destination if a flit was dropped 24

End-to-end Delivery Guarantees 25 Local Ring Local Ring Global Ring src dest Injection Guarantee Transfer Guarantee Injection Guarantee Transfer Guarantee Injection Guarantee Ejection Guarantee

Outline Background and Motivation Key Idea: Deflection Routing End-to-end Delivery Guarantees Our Solution: HiRD Results Conclusion 26

An Overview of HiRD Deflection routing –Simpler flow control –Simpler crossbars and control logics No buffers in the local rings –Simpler and faster local routers Simpler bridge routers –Lower power, less area and simpler to design Provides end-to-end delivery guarantees –Injection guarantee by throttling –Transfer guarantee by reservation 27

Putting It All Together Deflection routing –Simpler flow control –Simpler crossbars and control logic No buffers in the local rings –Simpler and faster local routers Simpler bridge routers –Lower power, less area and simpler to design Provides end-to-end delivery guarantees –Injection guarantee by throttling –Transfer guarantee by reservation 28

Outline Background and Motivation Key Idea: Deflection Routing End-to-end Delivery Guarantees Our Solution: HiRD Results Conclusion 29

Methodology Cores –16 and 64 OoO CPU cores –64 KB 4-way private L1 –Distributed L2 Network –1 flit local-to-global buffer –4 flits global-to-local buffers –2-cycle per hop latency for local routers –3-cycle per hop latency for global routers 60 workloads consisting of SPEC2006 apps 30

Comparison to Previous Designs Single ring design –Kim and Kim, NoCArc’09 –64-bit links –128-bit links –256-bit links Buffered hierarchical ring design –Ravindran and Stumm, HPCA’97 –Identical topology –Identical bisection bandwidth –4-flit buffers in both local and global routers 31

Results: System Performance % 1.9% 1)Hierarchical designs provide better performance than a single ring on a larger network 2)HiRD performs better compared to buffered hierarchical rings due to lower latency in local routers and throttling

Results: Network Power 33 15%46.6% 1)Hierarchical designs consume much less power than the highest-performance single ring 2) Routers and flow control in HiRD are simpler than routers in buffered hierarchical rings

Router Area and Critical Path 16-node network with 8 bridge routers Verilog RTL design using 45nm Technology HiRD reduces NoC area by 50.3% compared to a buffered hierarchical ring design HiRD reduces local router critical path by 29.9% compared to a buffered hierarchical ring design 34

Additional Results Detailed power breakdown Synthetic evaluations Energy efficiency results Worst case analysis Techical Report: –Multithreaded evaluation –Average, 90 th percentile and max latency –Comparison against other topologies –Sensitivity analysis on different link bandwidths and number of buffers 35

Outline Background and Motivation Key Idea: Deflection Routing End-to-end Delivery Guarantees Our Solution: HiRD Results Conclusion 36

Conclusion Rings do not scale well as core count increases Traditional hierarchical ring designs are complex and energy inefficient –Complicated buffering and flow control Solution: Hierarchical Rings with Deflection (HiRD) –Guarantees livelock freedom and delivery –Eliminates all buffers at local routers and most buffers at bridge routers HiRD provides higher performance and energy-efficiency than hierarchical rings HiRD is simpler than hierarchical rings 37

Design and Evaluation of Hierarchical Rings with Deflection Routing Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna Das, Gabriel H. Loh, Onur Mutlu

Backup Slides 39

Network Intensive Workloads 15 network intensive workloads 40

System Performance %2.5% Deflections balance out the network load Thorttling reduces congestion

Network Power %37% More deflections happen when the network is congested

Detailed Results 43

Multithreaded Applications 44

Network Latency 45

Synthetic Traffic Evaluations 46

Topology Comparison 47

Sweep over Different Bandwidth 48

Packet Reassembly Borrowed from CHIPPER [Fallin et al. HPCA’10] –Retransmit-Once  Destination node reserves a buffer slot for a dropped packet –Provides ejection guarantee 49

Other Optimizations Map cores that communicate with each other a lot on the same local ring –Takes advantage of the faster local ring routers 50

Related Concurrent Works Clumsy Flow Control [Kim et al., IEEE CAL’13] –Requires coordination between cores and memory controllers Transportation inspired NoCs [Kim et al., HPCA’14] –tNoCs require an additional credit network –tNoCs have more complex flow control –HiRD is more lightweight 51

Some Related Previous Works Hierarchical Bus [Udipi et al., HPCA’10] –HiRD provides more scalability Concentrated Meshe [Das et al., HPCA’09] –Several nodes share one router –Used on meshed network –Less power efficient than HiRD Low-cost Mesh Router [J. Kim, MICRO’09] –Specifically designed for meshes –Does not solve issues in deflection-based flow control (HiRD does) 52