Application-to-Core Mapping Policies to Reduce Memory System Interference Reetuparna Das * Rachata Ausavarungnirun $ Onur Mutlu $ Akhilesh Kumar § Mani.

Slides:

Advertisements

Similar presentations

George Nychis✝, Chris Fallin✝, Thomas Moscibroda★, Onur Mutlu✝

Advertisements

You have been given a mission and a code. Use the code to complete the mission and you will save the world from obliteration…

1 Utility-Based Partitioning of Shared Caches Moinuddin K. Qureshi Yale N. Patt International Symposium on Microarchitecture (MICRO) 2006.

Pricing for Utility-driven Resource Management and Allocation in Clusters Chee Shin Yeo and Rajkumar Buyya Grid Computing and Distributed Systems (GRIDS)

1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 3 CPUs.

Towards Automating the Configuration of a Distributed Storage System Lauro B. Costa Matei Ripeanu {lauroc, NetSysLab University of British.

1 Multi-Channel Wireless Networks: Capacity and Protocols Nitin H. Vaidya University of Illinois at Urbana-Champaign Joint work with Pradeep Kyasanur Chandrakanth.

An Alliance based Peering Scheme for P2P Live Media Streaming Darshan Purandare Ratan Guha University of Central Florida August 31, P2P-TV, Kyoto.

Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13

Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13

Title Subtitle.

Prefetch-Aware Shared-Resource Management for Multi-Core Systems Eiman Ebrahimi * Chang Joo Lee * + Onur Mutlu Yale N. Patt * * HPS Research Group The.

Gennady Pekhimenko Advisers: Todd C. Mowry & Onur Mutlu

Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt.

Re-examining Instruction Reuse in Pre-execution Approaches By Sonya R. Wolff Prof. Ronald D. Barnes June 5, 2011.

Feedback Directed Prefetching Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt §¥ ¥ §

SE-292 High Performance Computing

Best of Both Worlds: A Bus-Enhanced Network on-Chip (BENoC) Ran Manevich, Isask har (Zigi) Walter, Israel Cidon, and Avinoam Kolodny Technion – Israel.

CS 105 Tour of the Black Holes of Computing

Virtual Hierarchies to Support Server Consolidation Michael Marty and Mark Hill University of Wisconsin - Madison.

Light NUCA: a proposal for bridging the inter-cache latency gap Darío Suárez 1, Teresa Monreal 1, Fernando Vallejo 2, Ramón Beivide 2, and Victor Viñals.

Directory-Based Cache Coherence Marc De Melo. Outline Non-Uniform Cache Architecture (NUCA) Cache Coherence Implementation of directories in multicore.

Optimizing Communication and Capacity in 3D Stacked Cache Hierarchies Aniruddha Udipi N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian,

Adaptive Backpressure: Efficient Buffer Management for On-Chip Networks Daniel U. Becker, Nan Jiang, George Michelogiannakis, William J. Dally Stanford.

Improving DRAM Performance by Parallelizing Refreshes with Accesses

ABC Technology Project

1 PhD Defense Presentation Managing Shared Resources in Chip Multiprocessor Memory Systems 12. October 2010 Magnus Jahre.

1 Overview Assignment 4: hints Memory management Assignment 3: solution.

1 A Case for MLP-Aware Cache Replacement International Symposium on Computer Architecture (ISCA) 2006 Moinuddin K. Qureshi Daniel N. Lynch, Onur Mutlu,

Javier Lira (UPC, Spain)Carlos Molina (URV, Spain) David Brooks (Harvard, USA)Antonio González (Intel-UPC,

CRUISE: Cache Replacement and Utility-Aware Scheduling

Virtual Memory 1 Computer Organization II © McQuain Virtual Memory Use main memory as a cache for secondary (disk) storage – Managed jointly.

Bypass and Insertion Algorithms for Exclusive Last-level Caches

1 ICCD 2010 Amsterdam, the Netherlands Rami Sheikh North Carolina State University Mazen Kharbutli Jordan Univ. of Science and Technology Improving Cache.

Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications Adwait Jog 1, Evgeny Bolotin 2, Zvika Guz 2,a, Mike Parker.

Fairness via Source Throttling: A configurable and high-performance fairness substrate for multi-core memory systems Eiman Ebrahimi * Chang Joo Lee * Onur.

© 2012 National Heart Foundation of Australia. Slide 2.

Chapter 5 Test Review Sections 5-1 through 5-4.

KAIST Computer Architecture Lab. The Effect of Multi-core on HPC Applications in Virtualized Systems Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin.

Addition 1’s to 20.

25 seconds left…...

SE-292 High Performance Computing

We will resume in: 25 Minutes.

1 On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects George Nychis ✝, Chris Fallin ✝, Thomas Moscibroda.

Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer.

SE-292 High Performance Computing Memory Hierarchy R. Govindarajan

PSSA Preparation.

University of Minnesota Optimizing MapReduce Provisioning in the Cloud Michael Cardosa, Aameek Singh†, Himabindu Pucha†, Abhishek Chandra

Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.

High Performing Cache Hierarchies for Server Workloads

4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.

Aérgia: Exploiting Packet Latency Slack in On-Chip Networks

1 Multi-Core Systems CORE 0CORE 1CORE 2CORE 3 L2 CACHE L2 CACHE L2 CACHE L2 CACHE DRAM MEMORY CONTROLLER DRAM Bank 0 DRAM Bank 1 DRAM Bank 2 DRAM Bank.

Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian,

1 Application Aware Prioritization Mechanisms for On-Chip Networks Reetuparna Das Onur Mutlu † Thomas Moscibroda ‡ Chita Das § Reetuparna Das § Onur Mutlu.

Thread Cluster Memory Scheduling : Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter.

Design and Evaluation of Hierarchical Rings with Deflection Routing Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna.

Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.

Department of Computer Science and Engineering The Pennsylvania State University Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das Addressing.

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.

Parallelism-Aware Batch Scheduling Enhancing both Performance and Fairness of Shared DRAM Systems Onur Mutlu and Thomas Moscibroda Computer Architecture.

Topology-aware QOS Support in Highly Integrated CMPs Boris Grot (UT-Austin) Stephen W. Keckler (NVIDIA/UT-Austin) Onur Mutlu (CMU) WIOSCA '10.

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang Rachata Ausavarungnirun Chris Fallin Onur Mutlu.

UH-MEM: Utility-Based Hybrid Memory Management

Reducing Memory Interference in Multicore Systems

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Effective mechanism for bufferless networks at intensive workloads

Rachata Ausavarungnirun, Kevin Chang

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Presentation transcript:

Application-to-Core Mapping Policies to Reduce Memory System Interference Reetuparna Das * Rachata Ausavarungnirun $ Onur Mutlu $ Akhilesh Kumar § Mani Azimi § * University of Michigan $ Carnegie Mellon University § Intel

Multi-Core to Many-Core Multi-CoreMany-Core 2

Many-Core On-Chip Communication 3 Memory Controller Shared Cache Bank $ $ Light Heavy Applications

Task Scheduling Traditional When to schedule a task? – Temporal Many-Core When to schedule a task? – Temporal + Where to schedule a task? – Spatial Spatial scheduling impacts performance of memory hierarchy  Latency and interference in interconnect, memory, caches 4

Problem: Spatial Task Scheduling Applications Cores How to map applications to cores? 5

Challenges in Spatial Task Scheduling Applications Cores How to reduce destructive interference between applications? How to reduce communication distance? 6 How to prioritize applications to improve throughput?

Application-to-Core Mapping 7 ClusteringBalancing Isolation Radial Mapping Improve Locality Reduce Interference Improve Bandwidth Utilization Reduce Interference Improve Bandwidth Utilization

Step 1 — Clustering 8 Inefficient data mapping to memory and caches Memory Controller

Step 1 — Clustering Improved Locality 9 Reduced Interference Cluster 0Cluster 2 Cluster 1Cluster 3

Step 1 — Clustering Clustering memory accesses  Locality aware page replacement policy (cluster-CLOCK) When allocating free page, give preference to pages belonging to the cluster’s memory controllers (MCs) Look ahead “N” pages beyond the default replacement candidate to find page belonging to cluster’s MC Clustering cache accesses  Private caches automatically enforce clustering  Shared caches can use Dynamic Spill Receive * mechanism 10 *Qureshi et al, HPCA 2009

Step 2 — Balancing Heavy Light ApplicationsCores 11 Too much load in clusters with heavy applications

Step 2 — Balancing Is this the best we can do? Let’s take a look at application characteristics Heavy Light ApplicationsCores 12 Better bandwidth utilization

Application Types 13 （ c ） PHD Comics

Application Types Identify and isolate sensitive applications while ensuring load balance 14 Medium Med Miss Rate High MLP Guru There for cookies Heavy High Miss Rate High MLP Adversary Bitter rival Light Low Miss Rate Nice Guy No opinions Asst. Professor Sensitive High Miss Rate Low MLP Advisor Sensitive Thesis Committee Applications （ c ） PHD Comics

Step 3 — Isolation Heavy Light ApplicationsCores Sensitive Medium Isolate sensitive applications to a cluster 15 Balance load for remaining applications across clusters

Step 3 — Isolation How to estimate sensitivity?  High Miss— high misses per kilo instruction (MPKI)  Low MLP— high relative stall cycles per miss (STPM)  Sensitive if MPKI > Threshold and relative STPM is high Whether to or not to allocate cluster to sensitive applications? How to map sensitive applications to their own cluster?  Knap-sack algorithm 16

Step 4 — Radial Mapping Heavy Light ApplicationsCores Sensitive Medium Map applications that benefit most from being close to memory controllers close to these resources 17

Step 4 — Radial Mapping What applications benefit most from being close to the memory controller?  High memory bandwidth demand  Also affected by network performance  Metric => Stall time per thousand instructions 18

Putting It All Together 19 Balancing Radial Mapping Isolation Clustering Inter-Cluster Mapping Intra-Cluster Mapping Improve Locality Reduce Interference Improve Shared Resource Utilization

Evaluation Methodology 60-core system  x86 processor model based on Intel Pentium M  2 GHz processor, 128-entry instruction window  32KB private L1 and 256KB per core private L2 caches  4GB DRAM, 160 cycle access latency, 4 on-chip DRAM controllers  CLOCK page replacement algorithm Detailed Network-on-Chip model  2-stage routers (with speculation and look ahead routing)  Wormhole switching (4 flit data packets)  Virtual channel flow control (4 VCs, 4 flit buffer depth)  8x8 Mesh (128 bit bi-directional channels) 20

Configurations Evaluated configurations  BASE—Random core mapping  BASE+CLS—Baseline with clustering  A2C Benchmarks  Scientific, server, desktop benchmarks (35 applications)  128 multi-programmed workloads  4 categories based on aggregate workload MPKI MPKI500, MPKI1000, MPKI1500, MPKI

System Performance 22 System performance improves by 17%

Network Power 23 Average network power consumption reduces by 52%

Summary of Other Results A2C can reduce page fault rate 24

Summary of Other Results A2C can reduce page faults Dynamic A2C also improves system performance  Continuous “Profiling” + “Enforcement” intervals  Retains clustering benefits  Migration overheads are minimal A2C complements application-aware packet prioritization* in NoCs A2C is effective for a variety of system parameters  Number of and placement of memory controllers  Size and organization of last level cache 25 *Das et al, MICRO 2009

Conclusion Problem: Spatial scheduling for Many-Core processors  Develop fundamental insights for core mapping policies Solution: Application-to-Core (A2C) mapping policies A2C improves system performance, system fairness and network power significantly 26 Clustering Balancing Radial Isolation

Application-to-Core Mapping Policies to Reduce Memory System Interference Reetuparna Das * Rachata Ausavarungnirun $ Onur Mutlu $ Akhilesh Kumar § Mani Azimi § * University of Michigan $ Carnegie Mellon University § Intel