Aérgia: Exploiting Packet Latency Slack in On-Chip Networks


Aérgia: Exploiting Packet Latency Slack in On-Chip Networks
Reetuparna Das (Intel Labs / Penn State), Onur Mutlu (CMU), Thomas Moscibroda (Microsoft Research), Chita Das (Penn State)

Network-on-Chip is a critical resource shared by multiple applications

[Figure: a multicore chip in which cores (P), L2 cache banks, memory controllers, and an accelerator are all connected by the network-on-chip, which is shared by applications App 1 through App N.]

A multicore processor has many cores, cache banks, memory controllers, and possibly accelerators, all connected via the network-on-chip. The execution environment runs multiple diverse applications on different cores, which communicate with caches, memories, and accelerators (or among themselves) via the NoC. The NoC is therefore a critical resource shared by multiple applications.

Network-on-Chip

[Figure: a grid of routers (R), each attached to a processing element (PE): cores, L2 banks, memory controllers, etc. Inset: one router with input ports (from East, West, North, South, and the local PE), buffers organized as virtual channels (VC 0, VC 1, VC 2), a routing unit (RC), a VC allocator (VA), a switch allocator (SA), and a 5x5 crossbar.]

In the physical implementation, all of these processing elements (cores, caches, memory controllers, etc.) are connected via routers in a grid. Zooming in on one router: it has ports (North/South/East/West plus the local PE), and each port has buffers organized as virtual channels. The buffers feed a crossbar switch, and control logic routes packets and arbitrates the allocation of the switch and buffers.

Packet Scheduling in NoC

[Figure: the router datapath, highlighting the VC allocator (VA) and the switch allocator (SA).]

In this work we are interested in the VC and switch allocators: the units that allocate a buffer or the switch to one packet among several competing packets.

Packet Scheduling in NoC: Conceptual View

[Figure: the router viewed as banks of buffers, one per input port, holding packets from applications App 1 through App 8, each application shown in a different color.]

Conceptually, we can think of a router as a set of buffers (separated into ports) that store packets from many applications. The VC/SA allocators act as a packet scheduler, choosing one packet every cycle.

Packet Scheduling in NoC: Which Packet to Choose?

[Figure: the same conceptual view, with the scheduler facing competing packets from App 1 through App 8.]

The problem we address in this paper: which packet should the scheduler choose?

Packet Scheduling in NoC

Existing scheduling policies are round-robin and age (oldest-first). The problem: both treat all packets equally; they are application-oblivious, and they can make contradictory decisions between routers, so packets from one application may be prioritized at one router only to be delayed at the next. In reality, packets have different criticality: a packet is critical if its latency affects the application's performance, and packets differ in criticality because of memory-level parallelism (MLP). All packets are not the same! A sketch of the two baseline policies follows; after that, we look at the motivation behind our proposed application-aware policies.
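To make the baselines concrete, below is a minimal Python sketch (ours, not the paper's) of round-robin and oldest-first arbitration. The Packet fields and the one-slot-per-port candidates layout are illustrative assumptions; the point to notice is that neither policy ever reads app_id.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    app_id: int        # owning application -- neither baseline ever reads this
    inject_cycle: int  # cycle at which the packet entered the network

def pick_round_robin(candidates, last_granted_index):
    """candidates: list of Packet or None, one slot per input port.
    Grant ports in cyclic order, starting just after the last grant."""
    n = len(candidates)
    for offset in range(1, n + 1):
        port = (last_granted_index + offset) % n
        if candidates[port] is not None:
            return port
    return None  # no packet waiting

def pick_oldest(candidates):
    """Age policy: grant the packet that entered the network earliest."""
    live = [(pkt.inject_cycle, port)
            for port, pkt in enumerate(candidates) if pkt is not None]
    return min(live)[1] if live else None
```

Both functions pick a winner using only position or age, which is exactly the application-obliviousness criticized above.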

MLP Principle

[Timeline figure: a core computes, then issues three packets (green, blue, red) whose latencies overlap; the blue packet's stall is mostly hidden, and Stall(red) = 0.]

With memory-level parallelism (MLP), cores issue several memory requests in parallel in the hope of overlapping future load misses with current load misses. Consider an application with high MLP: it does some computation, sends three packets (green, blue, red), and stalls until they return. The blue packet's latency is largely overlapped, so it causes far fewer stall cycles, and the red packet's stall cycles are 0 because its latency is completely overlapped. Hence our first observation: packet latency != network stall time, and different packets have different criticality, i.e. different impact on execution time, due to MLP; here Criticality(green) > Criticality(red).
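The latency-versus-stall distinction can be made concrete with a deliberately simplified model (our sketch, not the paper's accounting): all misses in one stall episode overlap, the core resumes when the last one returns, and a packet is charged stall time only for the cycles it keeps the core waiting beyond every other outstanding miss. The slide's timeline is finer grained (a partially overlapped packet can still cause a few residual stall cycles), but the conclusion is the same.

```python
def network_stall_time(requests):
    """requests: list of (issue_cycle, latency) pairs for misses that
    overlap within one stall episode. The core resumes only when the
    last miss returns, so a packet's network stall time is the extra
    time it keeps the core waiting beyond every other outstanding miss."""
    finish = [issue + lat for issue, lat in requests]
    nst = []
    for i, f in enumerate(finish):
        latest_other = max((g for j, g in enumerate(finish) if j != i), default=0)
        nst.append(max(0, f - latest_other))
    return nst

# Three overlapped misses with latencies 20, 15, and 5: the second and
# third finish while the first is still outstanding, so only the first
# packet is charged any stall time.
print(network_stall_time([(0, 20), (1, 15), (2, 5)]))  # -> [4, 0, 0]
```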

Outline

Introduction: Packet Scheduling, Memory-Level Parallelism
Aérgia: Concept of Slack, Estimating Slack
Evaluation
Conclusion

What is Aérgia?

Aérgia is the spirit of laziness in Greek mythology. Some packets can afford to slack!

Outline

Introduction: Packet Scheduling, Memory-Level Parallelism
Aérgia: Concept of Slack, Estimating Slack
Evaluation
Conclusion

Slack of Packets

What is the slack of a packet? The slack of a packet is the number of cycles it can be delayed in a router without reducing the application's performance; it is a local, network-level notion of slack. The source of slack is memory-level parallelism (MLP): the latency of an application's packet is hidden from the application when it overlaps with the latency of pending cache-miss requests. The idea: prioritize packets with lower slack.

Concept of Slack

[Figure: instruction windows and execution timelines for two cores, core A (green) and core B (red). A load miss injects a packet into the network-on-chip; a second packet returns earlier than necessary, which creates slack while the core still stalls on the first.]

One packet's latency is 6 hops while its predecessor's is 26 hops, so the shorter packet returns earlier than necessary:

Slack(P) = Latency(predecessor) - Latency(P) = 26 - 6 = 20 hops

The packet can be delayed for its available slack cycles without reducing performance!
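A minimal sketch of this definition (ours; the hardware approximation is developed on the "Estimating Slack" slides):

```python
def slack(latency_p, predecessor_latencies):
    """Slack(P) = max(latencies of P's predecessors) - latency(P),
    floored at zero: a packet on the critical path has no slack."""
    if not predecessor_latencies:
        return 0
    return max(0, max(predecessor_latencies) - latency_p)

# The slide's example: the packet takes 6 hops while its predecessor
# takes 26 hops, leaving 26 - 6 = 20 hops of slack.
assert slack(6, [26]) == 20
```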

Prioritizing using Slack

[Figure: packets from core A and core B traversing the mesh, with their paths interfering for 3 hops.]

Core A: packet A-0 has latency 13 hops and slack 0 hops; packet A-1 has latency 3 hops and slack 10 hops.
Core B: packet B-0 has latency 10 hops and slack 0 hops; packet B-1 has latency 4 hops and slack 6 hops.

Packets B-1 and A-0 interfere in the network for 3 hops, so the scheduler must choose one over the other. A-0 has a slack of 0 hops and B-1 has a slack of 6 hops: Slack(B-1) > Slack(A-0). We can therefore prioritize A-0 over B-1, reducing core A's stall cycles without increasing core B's.

Slack in Applications

[Plot: cumulative distribution of per-packet slack cycles for one application.]

For this application, 50% of packets have 350+ slack cycles (non-critical), while 10% of packets have fewer than 50 slack cycles (critical).

Slack in Applications

[Plot: the same distribution for a second application.]

For this application, 68% of packets have zero slack cycles.

Diversity in Slack

[Plot: per-packet slack distributions across applications.]

Diversity in Slack

Slack varies between packets of different applications, and slack varies between packets of a single application.

Outline

Introduction: Packet Scheduling, Memory-Level Parallelism
Aérgia: Concept of Slack, Estimating Slack
Evaluation
Conclusion

Estimating Slack Priority

Slack(P) = Max(latencies of P's predecessors) - Latency(P), where Predecessors(P) are the packets of the cache-miss requests outstanding when P is issued.

Packet latencies are not known when a packet is issued, and the latency of an arbitrary packet Q is difficult to predict exactly, so we estimate both terms with heuristics. Estimating Max(predecessors' latencies): use the number of predecessors, and note that a predecessor's latency is higher if the predecessor is an L2 miss. Estimating the latency of P itself: latency is higher if P corresponds to an L2 miss, and higher if P has to travel a larger number of hops.

Estimating Slack Priority

Slack of P = Maximum predecessor latency - Latency of P, approximated in hardware by three fields:

Slack(P) = PredL2 (2 bits) | MyL2 (1 bit) | HopEstimate (2 bits)

PredL2: set if any predecessor packet is servicing an L2 miss.
MyL2: set if P is NOT servicing an L2 miss.
HopEstimate: Max(# of hops of predecessors) - hops of P.
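A minimal sketch of how these fields might be packed (our illustration, not the actual hardware). The slide describes PredL2 as a set/clear condition yet gives it 2 bits, so this sketch treats it as a saturating count of L2-miss predecessors, which is an assumption. A smaller packed value means less estimated slack, i.e. a more critical packet.

```python
def slack_priority_bits(num_pred_l2_misses, p_is_l2_miss, max_pred_hops, p_hops):
    """Pack the three slack fields into a 5-bit slack estimate.
    Routers grant the packet with the smallest value."""
    pred_l2 = min(num_pred_l2_misses, 3)              # 2 bits, saturating
    my_l2 = 0 if p_is_l2_miss else 1                  # set if P is NOT an L2 miss
    hop_est = max(0, min(max_pred_hops - p_hops, 3))  # 2 bits, saturating
    return (pred_l2 << 3) | (my_l2 << 2) | hop_est
```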

Estimating Slack Priority

How do we predict an L2 hit or miss at the core? Two predictors:
A global-branch-predictor-based L2 miss predictor, using a pattern history table with 2-bit saturating counters.
A threshold-based L2 miss predictor: if the number of L2 misses among the last M misses is >= a threshold T, predict that the next load is an L2 miss (a sketch follows).

Number of miss predecessors? Keep a list of outstanding L2 misses. Hops estimate? Hops = ΔX + ΔY distance; use the predecessor list to calculate the slack hop estimate.
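A sketch of the threshold-based predictor under stated assumptions: M and T are tuning knobs whose values here are illustrative, not taken from the paper.

```python
from collections import deque

class ThresholdL2MissPredictor:
    """If at least T of the last M observed misses were L2 misses,
    predict that the next load will also miss in the L2."""

    def __init__(self, m=16, t=8):  # illustrative M and T
        self.history = deque(maxlen=m)
        self.t = t

    def predict_l2_miss(self):
        return sum(self.history) >= self.t

    def record(self, was_l2_miss):
        self.history.append(1 if was_l2_miss else 0)
```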

Starvation Avoidance

Problem: prioritizing packets can lead to starvation of lower-priority packets.
Solution: time-based packet batching. A new batch is formed every T cycles, and packets of older batches are prioritized over packets of younger batches (a sketch of batch numbering follows).
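A sketch of the batch numbering (ours): the batch period and the wrap-around handling are assumptions; the 3-bit batch ID matches the priority format on the next slide.

```python
BATCH_PERIOD = 16_000  # T: cycles per batch -- an illustrative value
BATCH_BITS = 3         # batch ID width, matching the next slide

def batch_id(inject_cycle):
    """Batch number stamped into the packet header at injection.
    Real hardware must compare wrapped IDs relative to the oldest
    live batch; this sketch glosses over that detail."""
    return (inject_cycle // BATCH_PERIOD) % (1 << BATCH_BITS)
```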

Putting it all together

Tag the header of the packet with priority bits before injection. What is Priority(P)? First P's batch (highest priority: older batches are prioritized over younger batches), then P's slack, then a local round-robin as the final tie breaker.

Priority(P) = Batch (3 bits) | PredL2 (2 bits) | MyL2 (1 bit) | HopEstimate (2 bits)
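Putting the pieces into one arbiter, a hedged end-to-end sketch (the FlitHeader fields and the rr_order callback are our illustration): the comparison is lexicographic, so the batch field dominates, slack breaks ties within a batch, and round-robin breaks the rest.

```python
from dataclasses import dataclass

@dataclass
class FlitHeader:
    batch_rank: int  # 0 = oldest live batch (derived from the 3 batch bits)
    slack_bits: int  # 5-bit PredL2|MyL2|HopEstimate value; lower = more critical
    input_port: int

def schedule(candidates, rr_order):
    """Pick the winning header flit: oldest batch first, then least
    estimated slack, then the local round-robin rank as tie breaker."""
    return min(candidates,
               key=lambda h: (h.batch_rank, h.slack_bits, rr_order(h.input_port)))
```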

Outline

Introduction: Packet Scheduling, Memory-Level Parallelism
Aérgia: Concept of Slack, Estimating Slack
Evaluation
Conclusion

Evaluation Methodology

64-core system: x86 processor model based on the Intel Pentium M; 2 GHz, 128-entry instruction window; 32 KB private L1 caches and a 1 MB-per-core shared L2 cache with 32 miss buffers; 4 GB DRAM with 320-cycle access latency and 4 on-chip DRAM controllers.
Detailed network-on-chip model: 2-stage routers (with speculation and lookahead routing); wormhole switching (8-flit data packets); virtual-channel flow control (6 VCs, 5-flit buffer depth); 8x8 mesh with 128-bit bidirectional channels.
Benchmarks: multiprogrammed scientific, server, and desktop workloads (35 applications); 96 workload combinations.

We evaluate on a detailed cycle-accurate CMP simulator with a state-of-the-art NoC model.

Qualitative Comparison

Round-robin and age: both are local and application-oblivious. Age is biased toward heavy applications: heavy applications flood the network, so an older packet is more likely to belong to a heavy application.
Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]: provides bandwidth fairness at the expense of system performance and penalizes heavy and bursty applications. Each application gets an equal, fixed quota of flits (credits) in each batch; a heavy application quickly runs out of credits after injecting into all active batches and stalls until the oldest batch completes and frees fresh credits, underutilizing network resources.
Application-aware prioritization policies (SJF) [Das et al., MICRO 2009]: apply the shortest-job-first principle; these packet scheduling policies prioritize network-sensitive applications that inject a lower load.

System Performance

SJF provides an 8.9% improvement in weighted speedup. Aérgia improves system throughput by 10.3%, and Aérgia+SJF improves system throughput by 16.1%.

Network Unfairness

SJF does not worsen network fairness. Aérgia reduces network unfairness by 1.5X, and SJF+Aérgia reduces network unfairness by 1.3X.

Conclusions & Future Directions

Packets have different criticality, yet existing packet scheduling policies treat all packets equally. We propose a new approach to packet scheduling in NoCs: we define slack as a key measure that characterizes the relative importance of a packet, and we propose Aérgia, a novel architecture to accelerate low-slack, critical packets. Results: system performance improves by 16.1% and network fairness improves by 30.8%.

Future Directions

Can we determine slack more accurately, perhaps with models that take instruction-level dependencies into account? Can slack-based arbitration be used in bufferless on-chip networks (see [Moscibroda, Mutlu, ISCA 2009])? Can we combine the benefits of slack-based arbitration with fairness guarantees?

Backup

Heuristic 1: Number of Predecessors That Are L2 Misses

Recall that network stall time (NST) indicates the criticality of a packet: a high NST per packet implies low slack. Packets with 0 predecessors have the highest NST per packet and the least slack.

Heuristic 2: L2 Hit or Miss

Recall that a high NST per packet implies low slack. L2 misses have a much higher NST per packet (i.e., lower slack) than L2 hits.

Heuristic 3: Hop Estimate

Slack of P = Maximum predecessor hops - Hops of P. A lower hop-based slack estimate implies low slack and hence high criticality. Slack computed from hops is a good approximation.