
Slide 1: On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects
George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝, Srinivasan Seshan✝
✝Carnegie Mellon University  ★Microsoft Research Asia
Presenter: Zhi Liu

Slide 2: What is the On-Chip Network?
Diagram of a multi-core processor (9-core) showing cores, memory controllers, GPUs, and cache banks.

Slide 3: What is the On-Chip Network?
In the same 9-core processor, each node has a router, and routers are joined by network links that carry packets from a source (S) to a destination (D).

Slide 4: Networking Challenges
The on-chip network raises a familiar discussion in the architecture community, e.g.:
 How to reduce congestion
 How to scale the network
 Choosing an effective topology
 Routing and buffer sizing

Slide 5: Characteristics of Bufferless NoCs
1. Different constraints: unique network design (zoomed in on a 3x3 on-chip network)
 Routing: minimal complexity; X-Y routing, low latency
 Links: cannot be over-provisioned
 Coordination: global coordination is often practical, since the topology is known
 Bufferless router: area reduced by 60%, power by 40%
2. Different workloads: unique style of traffic and flow

Slide 6: Characteristics of Bufferless NoCs (continued)
Zoomed in on a single node: the network layer (router) sits below the architecture layer, whose instruction window holds in-flight instructions (i5-i9). The closed-loop instruction window limits in-flight traffic per core. The constraints from the previous slide still apply: minimal-complexity X-Y routing with low latency, links that cannot be over-provisioned, practical global coordination for the known topology, and the bufferless router's savings (area: -60%, power: -40%).

Slide 7: Traffic and Congestion
 Arbitration: oldest packet first (dead- and live-lock free); each packet's age is initialized to 0 at injection. When two packets (from sources S1 and S2, headed to D) contend for the same output port, the oldest wins and the newest is deflected.
 Injection: a core may inject only when an output link is free.
Manifestation of congestion:
1. Deflection: arbitration causing a non-optimal hop.

Slide 8: Traffic and Congestion (continued)
A core cannot inject a packet without a free output port.
2. Starvation: when a core cannot inject (there is no packet loss; injection is simply delayed).
Definition: the starvation rate is the fraction of cycles in which a core is starved.
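The starvation-rate definition above can be sketched in a few lines of Python. This is illustrative simulation code, not from the talk: `starved_flags` is an assumed per-cycle record of whether the core wanted to inject but had no free output link.

```python
def starvation_rate(starved_flags):
    """Fraction of cycles in which the core was starved, i.e. it had a
    packet to inject but no output link was free. No packet is lost;
    injection is only delayed."""
    if not starved_flags:
        return 0.0
    return sum(starved_flags) / len(starved_flags)

# Over a 10-cycle window, the core was starved in 3 cycles:
rate = starvation_rate([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
```

The same quantity is what the hardware later approximates with a shift register and comparator (see the controller slide near the end of the talk).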

Slide 9: Outline
Bufferless on-chip networks: congestion & scalability
 Study of congestion at the network and application layers
 Impact of congestion on scalability
Novel application-aware congestion control mechanism
Evaluation of the congestion control mechanism
 Able to effectively scale the network
 Improves system throughput by up to 27%

Slide 10: Congestion and Scalability Study
Prior work studied moderate-intensity workloads on small on-chip networks, showing:
 Energy and area benefits of going bufferless
 Throughput comparable to buffered designs
Our study: high-intensity workloads and large networks (up to 4096 cores)
 Is throughput still comparable, while keeping the benefits of bufferless?
Methodology:
 Use real application workloads (e.g., matlab, gcc, bzip2, perl)
 Simulate the bufferless network and system components
 The simulator has been used to publish in ISCA, MICRO, HPCA, NOCS, …

Slide 11: Congestion at the Network Level
Evaluate 700 different loads (high, medium, and low intensity) in a 16-core system; each point in the plot represents a single workload.
Finding: unlike in traditional networks, network latency remains stable under congestion/deflections; the increase in network latency under congestion is only ~5-6 cycles (a ~25% increase).
What about the starvation rate? Starvation increases significantly with congestion (up to a ~700% increase).
Finding: starvation, not latency, is likely to impact performance; it is the indicator of congestion.

Slide 12: Congestion at the Application Level
Define system throughput as the sum of the instructions-per-cycle (IPC) of all applications in the system.
Unthrottling applications in a single high-intensity workload on a 4x4 system is sub-optimal under congestion.
Finding 1: throughput decreases under congestion.
Finding 2: self-throttling of cores prevents collapse; throughput does not collapse.
Finding 3: static throttling can provide some gain (e.g., 14%), but we will show up to 27% gain with application-aware throttling.
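The slide's definition of system throughput can be written directly as a small sketch (illustrative Python; the per-application instruction counts are made up for the example):

```python
def system_throughput(instructions_retired, cycles):
    """System throughput = sum of per-application IPC
    (instructions retired / cycles), as defined on the slide.
    All applications are measured over the same cycle window."""
    return sum(insns / cycles for insns in instructions_retired)

# Four applications measured over the same 1000-cycle window:
tput = system_throughput([800, 500, 1200, 300], cycles=1000)
```

This is the metric the throttling experiments below try to maximize: a throttled core retires fewer instructions in the window, so throttling pays off only if the rest of the system gains more IPC than the throttled core loses.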

Slide 13: Impact of Congestion on Scalability
Prior work evaluated much smaller systems; our work scales up to 4096 cores. As we increase the system's size (high load, locality model, average hop count of 1):
 The starvation rate increases: a core can be starved for up to 37% of all cycles.
 Per-node throughput decreases with system size: up to a 38% reduction.

Slide 14: Summary of Congestion Study
Network congestion limits scalability and performance
 Due to the starvation rate, not increased network latency
 The starvation rate is the indicator of congestion in an on-chip network
The self-throttling nature of cores prevents congestion collapse.
Throttling reduces congestion and improves performance
 Motivation for congestion control
Congestion control must be application-aware.

Slide 15: Outline
Bufferless on-chip networks: congestion & scalability
 Study of congestion at the network and application layers
 Impact of congestion on scalability
Novel application-aware congestion control mechanism
Evaluation of the congestion control mechanism
 Able to effectively scale the network
 Improves system throughput by up to 27%

Slide 16: Developing a Congestion Controller
Traditional congestion controllers are designed to:
 Improve network efficiency
 Maintain fairness of network access
 Provide stability (and avoid collapse)
 Operate in a distributed manner
When considering an on-chip network, a controller must additionally:
 Have minimal complexity
 Be area-efficient
 We show: be application-aware
(In the paper: a global and simple controller.)

Slide 17: Need for Application Awareness
Throttling reduces congestion and improves system throughput; but under congestion, which core should be throttled?
Setup: a 16-core system with 8 instances of each application; alternate a 90% throttling rate (delaying memory requests by 1 cycle) between the applications.
Finding 1: which application is throttled impacts system performance.
Finding 2: an application's throughput does not dictate whom to throttle.
Finding 3: different applications respond differently to an increase in network throughput; mcf is memory-intensive and benefits less, but Gromacs benefits greatly.
Observed throughput changes across throttling choices: -9%, +21%, +33%, <1%.
Unlike traditional congestion controllers (e.g., TCP), the controller cannot be application-agnostic.

Slide 18: Instructions-Per-Packet (IPP)
Key insight: not all packets are created equal; more L1 misses mean more traffic is needed to make the same progress.
Define instructions-per-packet: IPP = I/P (instructions retired per packet injected).
 A low value means the application injects more packets into the network.
 IPP depends only on the L1 miss rate; it is independent of the level of congestion and execution rate, so it provides a stable measure of an application's current network intensity.
 Example: mcf's IPP is 0.583; Gromacs' IPP is far higher.
Since the L1 miss rate varies over execution, IPP is dynamic, with phase behavior on the order of millions of cycles; throttling during a high-IPP phase would hurt performance.
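The IPP = I/P definition above is a single division over two counters that each core already maintains. A minimal sketch (illustrative, not the talk's code; the counter values are invented for the example):

```python
def instructions_per_packet(instructions_retired, packets_injected):
    """IPP = I / P: instructions retired per network packet injected.
    A low IPP means the application needs many packets per unit of
    progress (e.g., due to many L1 misses), i.e. it is
    network-intensive. Independent of congestion level, since both
    counters slow down together when the core stalls."""
    return instructions_retired / packets_injected

# A memory-intensive app retiring 583 instructions per 1000 packets
# yields the low IPP quoted on the slide for mcf:
ipp = instructions_per_packet(583, 1000)
```

Because both counters are measured over the same interval, the ratio stays stable even when congestion slows the core down, which is what makes IPP usable as a throttling signal.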

Slide 19: Instructions-Per-Packet (IPP) (continued)
Since the L1 miss rate varies over execution, IPP is dynamic; throttling during a high-IPP phase would hurt performance.
IPP provides application-layer insight into whom to throttle:
 When congested, throttle applications with low IPP.
 Because IPP is dynamic, throttling must be dynamic too.
 Fairness: scale the throttling rate by the application's IPP (details in the paper show the throttling is fair).

Slide 20: Summary of Congestion Control
A global controller runs an interval-based congestion control algorithm every 100,000 cycles:
 Detecting congestion: based on starvation rates.
 Whom to throttle: determine the IPP of each application and throttle applications with low IPP.
 Determining the throttling rate: scale by the application's IPP; applications with lower IPP get a higher throttling rate.
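One interval of the policy summarized above can be sketched as follows. This is a hypothetical illustration in the spirit of the slide, not the paper's algorithm: the starvation threshold, the maximum rate, and the inverse-IPP scaling formula are all assumptions.

```python
def congestion_control_interval(apps, starvation_threshold=0.05,
                                max_rate=0.9):
    """One interval (e.g., every 100,000 cycles) of an application-aware
    throttling policy. `apps` maps a core id to a dict with its measured
    'starvation_rate' and 'ipp'. The threshold and max_rate defaults are
    illustrative assumptions, not values from the talk. Returns a
    throttling rate per core: under congestion, cores running low-IPP
    (network-intensive) applications are throttled harder."""
    # Detect congestion from the reported starvation rates.
    congested = any(a["starvation_rate"] > starvation_threshold
                    for a in apps.values())
    if not congested:
        return {core: 0.0 for core in apps}
    # Scale throttling inversely with IPP: lower IPP -> higher rate.
    max_ipp = max(a["ipp"] for a in apps.values())
    return {core: max_rate * (1.0 - a["ipp"] / max_ipp)
            for core, a in apps.items()}

# Core 0 runs a network-intensive app (low IPP) and is starving;
# core 1 runs a compute-intensive app (high IPP).
rates = congestion_control_interval({
    0: {"starvation_rate": 0.10, "ipp": 0.5},
    1: {"starvation_rate": 0.01, "ipp": 2.0},
})
```

The shape matters more than the constants: congestion detection is global (any starving core), while the per-core response is proportional to network intensity, which is what makes the scheme application-aware.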

Slide 21: Outline
Bufferless on-chip networks: congestion & scalability
 Study of congestion at the network and application layers
 Impact of congestion on scalability
Novel application-aware congestion control mechanism
Evaluation of the congestion control mechanism
 Able to effectively scale the network
 Improves system throughput by up to 27%

Slide 22: Evaluation of Improved Efficiency
Evaluate with 875 real workloads across both system sizes, generating a balanced set of CMP workloads (as in cloud computing).
Plot: improvement in system throughput vs. network utilization with no congestion control.
 Improvement of up to 27% under congested workloads.
 Does not degrade non-congested workloads: only 4 of the 875 workloads see performance reduced by more than 0.5%.
 Does not unfairly throttle applications down (shown in the paper).

Slide 23: Evaluation of Improved Scalability
Comparison points:
 Baseline bufferless: doesn't scale.
 Buffered: expensive in area and power.
Contribution: keep the area and power benefits of bufferless while achieving comparable performance.
 Application-aware throttling yields an overall reduction in congestion.
 Power consumption is reduced through the increase in network efficiency.

Slide 24: Summary of Study, Results, and Conclusions
Highlighted a traditional networking problem in a new context; the unique design requires a novel solution.
We showed that congestion limits efficiency and scalability, and that the self-throttling nature of cores prevents collapse.
The study showed that congestion control must be application-aware.
Our application-aware congestion controller provides:
 A more efficient network layer (reduced latency)
 Improvements in system throughput (up to 27%)
 Effective scaling of the CMP (shown for up to 4096 cores)

Slide 25: Low-Complexity & Area-Efficient Controller
A global, interval-based controller with simple, low hardware requirements: only every 100k cycles does it decide whom to throttle, and at what rate.
Information needed from the cores:
 Starvation rate: to detect congestion
 IPP: whom to throttle, and at what rate
Communication with the controller is in-band; control packets include the needed information.
Hardware cost (zoomed in to a single core): only 149 bits per core, minimal compared to a 128KB L1:
 IPP: 2 counters (instructions retired, flits injected) + a divider
 Starvation rate: shift register + comparator
 Throttler: counter + comparator (is injection permitted?)

Slide 26: Evaluation of Fairness
The controller does not unfairly throttle low-IPP applications.

Slide 27: Breakdown by Workload
Break the results (overall system throughput) down by the application mix: H = high, M = medium, L = low intensity, and mixes of such workloads.
 As expected, the majority of the benefits come from the heavier workloads.
 Lower-intensity workloads are not unfairly throttled down.
 Improvements are greater in the larger network (8x8).