George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝


Next Generation On-Chip Networks: What Kind of Congestion Control Do We Need?
George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝
✝ Carnegie Mellon University  ★ Microsoft Research

Chip Multiprocessor (CMP) Background

Trend: towards ever larger chip multiprocessors (CMPs)
- the CMP overcomes the diminishing returns of increasingly complex single-core processors

Communication: critical to the CMP's performance
- between cores, cache banks, DRAM controllers ...
- delays in information can stall the pipeline

Common bus: does not scale beyond 8 cores
- electrical loading on the bus significantly reduces its speed
- the shared bus cannot support the bandwidth demand

The On-Chip Network

Build a network, routing information between endpoints
- increased bandwidth, and scales with the number of cores
[Figure: 3x3 CMP, each core paired with a router and connected by network links]

On-Chip Networks Are Walking a Familiar Line

The scale of on-chip networking is increasing:
- Intel's "Single-chip Cloud Computer" ... 48 cores
- Tilera Corporation's TILE-Gx ... 100 cores

- What should the topology be?
- How should efficient routing be done?
- What should the buffer size be? (hot in the arch. community)
- Can QoS guarantees be made in the network?
- How do you handle congestion in the network?

All historic topics in the networking field...

Can We Apply Traditional Solutions?

On-chip networks have a very different set of constraints.
Three first-class considerations in processor design:
- chip area & space, power consumption, implementation complexity
This impacts: integration (e.g., fitting more cores), cost, performance, thermal dissipation, design & verification ...

The on-chip network has a unique design, likely to require novel solutions to traditional problems: a chance for the networking community to weigh in.

Outline

- Unique characteristics of the Network-on-Chip (NoC), likely requiring novel solutions to traditional problems
- Initial case study: congestion in a next generation NoC
  - background on the next generation bufferless design
  - a study of congestion at network and application layers
- Novel application-aware congestion control mechanism

NoC Characteristics - What's Different?

- Topology: known and fixed beyond the design stage, and regular for efficiency reasons
- Routing: very minimal in complexity and low latency, to maintain efficient communication
- Links: thin due to area constraints, and for the same reason cannot be over-provisioned to handle temporal bursts of traffic
- No net flow: with cache banks spread across all of the cores, a single core's traffic pattern is one-to-many
- Latency: network latency is very low, only 1-2 cycles through the router and 1-2 cycles over a link
- Coordination: while global coordination is not feasible in the Internet, on the chip it is often less expensive than distributed techniques

Next Generation: Bufferless NoCs

The architecture community is now heavily evaluating buffers:
- 30-40% of static and dynamic energy (e.g., Intel Tera-Scale)
- 75% of NoC area in a prototype (TRIPS)

Push for bufferless (BLESS) NoC design:
- energy is reduced by ~40%, and area by ~60%
- comparable throughput for low to moderate workloads

The BLESS design has its own set of unique properties:
- no loss, retransmissions, or (N)ACKs

As previously mentioned, the question of buffer size is currently hotly debated in the architecture community, driven by the first-class design constraints of chip area and power.

Outline

- Unique characteristics of the Network-on-Chip (NoC), likely requiring novel solutions to traditional problems
- Initial case study: congestion in a next generation NoC
  - background on the next generation bufferless design
  - a study of congestion at network and application layers
- Novel application-aware congestion control mechanism

How Bufferless NoCs Work

- Packet creation: L1 miss, L1 service, write-back ...
- Injection: only when an output port is available
- Routing: commonly X,Y-routing (first the X direction, then Y)
- Arbitration: oldest-flit-first (dead/live-lock free); age is initialized at injection
- Deflection: arbitration causing a non-optimal hop
[Figure: sources S1 and S2 contending for the top port; the oldest flit wins, the newest is deflected]
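The routing, arbitration, and deflection steps above can be sketched in a few lines. This is an illustrative, single-router Python model, not the paper's implementation: the port names, the `(age, dst)` flit encoding, and the ejection handling are assumptions made for the sketch.

```python
# Hypothetical sketch of one arbitration step in a bufferless
# (BLESS-style) router: every incoming flit must leave on some output
# port each cycle. Flits are ranked oldest-first; a flit that loses its
# preferred (X,Y-routed) port is deflected to any remaining free port.

def xy_preferred_port(src, dst):
    """X,Y routing: correct the X coordinate first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    if sx != dx:
        return "E" if dx > sx else "W"
    if sy != dy:
        return "N" if dy > sy else "S"
    return "LOCAL"  # flit has arrived at its destination

def arbitrate(flits, router, ports=("N", "S", "E", "W")):
    """flits: list of (age, dst) tuples at router (an (x, y) tuple).
    Returns {flit: assigned_port}; losers are deflected, never dropped."""
    assignments = {}
    free = list(ports)
    # Oldest flit first (larger age means older, hence higher priority)
    for age, dst in sorted(flits, key=lambda f: -f[0]):
        want = xy_preferred_port(router, dst)
        if want == "LOCAL":
            assignments[(age, dst)] = "LOCAL"  # eject; contention ignored here
        elif want in free:
            free.remove(want)
            assignments[(age, dst)] = want
        else:
            # Deflection: take any remaining free port (a non-optimal hop)
            assignments[(age, dst)] = free.pop()
    return assignments
```

For example, two flits at router (1, 0) both heading east contend for port "E": the older flit (age 9) wins it, and the younger one is deflected to some other port.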

Starvation in Bufferless NoCs

- Remember, injection is possible only if an output port is free...
- A starvation cycle occurs when a core cannot inject: a flit has been created but cannot inject without a free output port
- The starvation rate (σ) is the fraction of starved cycles
- Keep starvation in mind ...
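The starvation-rate metric follows directly from its definition. A minimal sketch, where the per-cycle `(ready, injected)` encoding is a hypothetical representation chosen for illustration:

```python
# Hypothetical sketch of the starvation-rate metric: sigma is the
# fraction of cycles during which a core had a flit ready to inject
# but no free output port to inject it into.

def starvation_rate(cycles):
    """cycles: iterable of (ready, injected) booleans, one per cycle.
    A cycle is starved when a flit was ready but could not inject."""
    starved = total = 0
    for ready, injected in cycles:
        total += 1
        if ready and not injected:
            starved += 1
    return starved / total if total else 0.0
```

For a window of four cycles where a flit was ready in three of them but injected only once, σ = 2/4 = 0.5.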

Outline

- Unique characteristics of the Network-on-Chip (NoC), likely requiring novel solutions to traditional problems
- Initial case study: congestion in a next generation NoC
  - background on the next generation bufferless design
  - a study of congestion at network and application layers
- Novel application-aware congestion control mechanism

Congestion at the Network Level

Evaluate 700 real application workloads in a bufferless 4x4 mesh (each point in the figures represents a single workload):
- Finding: net latency remains stable with congestion/deflections; the separation between non-congested and congested net latency is only ~3-4 cycles, so net latency is not sufficient for detecting congestion
- What about starvation rate? Starvation increases significantly (+4x) under congestion
- Finding: starvation rate is representative of congestion

Congestion at the Application Level

Define system throughput as the sum of the instructions-per-cycle (IPC) of all applications on the CMP:

    system throughput = Σ_i IPC_i

Sample a 4x4 NoC and unthrottle applications:
- Finding 1: throughput decreases (is sub-optimal) under congestion
- Finding 2: self-throttling cores prevent congestion collapse
- Finding 3: static throttling can provide some gain (e.g., 14%), but we will show up to 27% gain with app-aware throttling

Need for Application Awareness

System throughput can be improved by throttling under congestion. But which application should be throttled?
Construct a 4x4 NoC and alternate a 90% throttle rate between applications:
- Finding 1: which app is throttled impacts system performance; overall system throughput increases or decreases based on the throttling decision
- Finding 2: instruction throughput does not dictate whom to throttle; mcf has lower application-level throughput, but is the one that should be throttled under congestion
- Finding 3: different applications respond differently to an increase in network throughput (unlike gromacs, mcf barely gains)

Instructions-Per-Flit (IPF): Whom To Throttle

Key insight: not all flits (packet fragments) are created equal
- apps need different amounts of traffic to retire instructions
- if congested, throttle the apps that gain least from traffic

IPF is a fixed value that depends only on the L1 miss rate:
- independent of the level of congestion & execution rate
- a low value means many flits are needed per instruction

We compute IPF for our 26 application workloads:
- mcf's IPF: 0.583, gromacs' IPF: 12.41
- IPF explains the mcf and gromacs throttling experiment
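The metric and the victim-selection rule it implies can be sketched as follows. The raw counts in the example are hypothetical values chosen only to reproduce the two IPF numbers quoted above; they are not measurements from the paper.

```python
# Hypothetical sketch of the instructions-per-flit (IPF) metric and the
# throttling order it implies: apps with low IPF need many flits per
# retired instruction, gain least per unit of network traffic, and are
# therefore throttled first under congestion.

def ipf(instructions_retired, flits_injected):
    """IPF = instructions retired per network flit injected."""
    return instructions_retired / flits_injected

def throttle_order(apps):
    """apps: {name: (instructions_retired, flits_injected)}.
    Returns app names sorted lowest-IPF-first (throttle these first)."""
    return sorted(apps, key=lambda a: ipf(*apps[a]))
```

With illustrative counts giving mcf an IPF of 0.583 and gromacs an IPF of 12.41, `throttle_order` puts mcf first, matching the throttling experiment above.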

App-Aware Congestion Control Mechanism

From our study of congestion in a bufferless NoC:
- When to throttle: monitor the starvation rate
- Whom to throttle: based on the IPF of the applications in the NoC
- Throttling rate: proportional to application intensity (IPF)
- Controller: centrally coordinated control
  - evaluation finds it less complex than a distributed controller
  - 149 bits per core (minimal compared to a 128KB L1 cache)
  - interval based, running only every 100k cycles
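One interval of such a controller might look like the sketch below. The starvation threshold, the maximum throttle rate, and the exact proportional formula are assumptions made for illustration; the paper's controller defines its own values and policy.

```python
# Hypothetical sketch of one interval of the centrally coordinated
# controller: sample each core's starvation rate; if any core is starved
# past a threshold, throttle cores in proportion to their network
# intensity (low IPF means high intensity, hence a high throttle rate).

STARVATION_THRESHOLD = 0.3   # assumed value, not from the paper
MAX_THROTTLE = 0.9           # assumed maximum injection-throttle rate

def control_interval(cores):
    """cores: {name: {'sigma': starvation rate, 'ipf': instr-per-flit}}.
    Returns {name: throttle_rate in [0, MAX_THROTTLE]}."""
    congested = any(c["sigma"] > STARVATION_THRESHOLD for c in cores.values())
    if not congested:
        return {name: 0.0 for name in cores}  # leave everyone unthrottled
    max_ipf = max(c["ipf"] for c in cores.values())
    # Low IPF relative to the most compute-bound app -> heavier throttling
    return {name: MAX_THROTTLE * (1 - c["ipf"] / max_ipf)
            for name, c in cores.items()}
```

Under congestion, a network-intensive app like mcf (IPF 0.583) would receive a heavy throttle rate while a compute-bound app like gromacs (IPF 12.41) would be left nearly untouched.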

Evaluation of Congestion Controller

Evaluate with 875 real workloads (700 16-core, 175 64-core):
- generate a balanced set of CMP workloads (cloud computing)
- parameters: 2D mesh, 2 GHz, 128-entry instruction window, 128KB L1

Results:
- improvement up to 27% under congested workloads
- does not degrade non-congested workloads: only 4/875 workloads have performance reduced by > 0.5%
- does not unfairly throttle applications down, but does reduce starvation (in paper)
[Figure: improvement in system throughput vs. network utilization with no congestion control]

Conclusions

We have presented the NoC and the bufferless NoC design:
- highlighted unique characteristics which warrant novel solutions to traditional networking problems

We showed a need for congestion control in a bufferless NoC:
- throttling can only be done properly with app-awareness
- we achieve app-awareness through the novel IPF metric
- we improve system performance by up to 27% under congestion

Opportunity for the networking community to weigh in on novel solutions to traditional networking problems in a new context.

Discussion / Questions?

- We focused on one traditional problem; what about other problems? load balancing, fairness, latency guarantees (QoS) ...
- Does on-chip networking need a layered architecture?
- Multithreaded application workloads?
- What are the right metrics to focus on?
  - instructions-per-cycle (IPC) is not all-telling
  - what is the metric of fairness? (CPU bound & net bound)