Understanding a Problem in Multicore and How to Solve It


Understanding a Problem in Multicore and How to Solve It
Onur Mutlu, Carnegie Mellon University

An Example: Multi-Core Systems

[Slide figure: die photo of a four-core chip (credit: AMD Barcelona) showing CORE 0 through CORE 3, each with a private L2 cache (L2 CACHE 0-3), a SHARED L3 CACHE, the DRAM INTERFACE, and the DRAM MEMORY CONTROLLER connecting to the DRAM BANKS.]

Unexpected Slowdowns in Multi-Core

[Slide figure: bar graph of the slowdowns of the two applications, one per core (Core 0, Core 1); one is labeled the memory performance hog, and the high-priority/low-priority annotations refer to the priority experiment described below.]

What kind of performance do we expect when we run two applications on a multi-core system? To answer this question, we performed an experiment. We took two applications we cared about, ran them together on different cores of a dual-core system, and measured their slowdown compared to when each runs alone on the same system. This graph shows the slowdown each application experienced. (DATA explanation…)

Why do we get such a large disparity in the slowdowns? Is it the priorities? No. We went back and gave high priority to gcc and low priority to matlab; the slowdowns did not change at all, because neither the software nor the hardware enforced the priorities. Is it other applications or the OS interfering with gcc, stealing its time quanta? No. Is it contention in the disk? We checked this possibility, but these applications did not perform any disk accesses in the steady state: both fit in physical memory and therefore did not interfere in the disk. What is it, then? Why do we get such a large disparity in slowdowns on a dual-core system? I will call an application like this a "memory performance hog." Now, let me tell you why this disparity in slowdowns happens.

Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems," USENIX Security 2007.
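As a point of reference (this is the usual definition in this line of work, not spelled out on the slide), each application's slowdown is its execution time when the two run together divided by its execution time when it runs alone on the same system:

    \text{Slowdown}_i \;=\; \frac{T_i^{\text{together}}}{T_i^{\text{alone}}}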

A Question or Two

Can you figure out why there is a disparity in slowdowns if you do not know how the processor executes the programs? Can you fix the problem without knowing what is happening "underneath"?

Why the Disparity in Slowdowns?

[Slide figure: a multi-core chip with matlab on CORE 1 and gcc on CORE 2, each with its own L2 cache, connected through the INTERCONNECT to the shared DRAM memory system (DRAM MEMORY CONTROLLER plus DRAM Banks 0-3); the unfairness arises in this shared memory system.]

Almost all systems today contain multi-core chips: multiple on-chip cores and caches that share some hardware resources. In particular, the cores share the DRAM memory system, which consists of the DRAM banks that store data (multiple banks allow parallel accesses) and the DRAM memory controller that mediates between the cores and DRAM, scheduling the memory operations the cores generate.

When we run matlab on one core and gcc on another, both cores generate memory requests to access the DRAM banks. When these requests arrive at the DRAM controller, the controller favors matlab's requests over gcc's. As a result, matlab makes progress and continues generating memory requests, and those requests are again favored over gcc's. Therefore, gcc starves waiting for its requests to be serviced in DRAM, whereas matlab makes very quick progress, as if it were running alone.

Why does this happen? Because the scheduling algorithms employed by the DRAM controller are unfair. But why are these algorithms unfair? Why do they prioritize matlab's accesses? This talk is about exploiting these unfair algorithms in the memory controllers to deny memory service to running threads. To understand how that happens, we first need to understand how a DRAM bank operates.

DRAM Bank Operation

[Slide animation: a DRAM bank consists of rows and columns of cells, a row decoder, a row buffer, and a column mux. An access (e.g., to Row 0, Column 0) first brings the addressed row into the row buffer, then the column mux selects the addressed column and returns the data. Subsequent accesses to other columns of the already open row (Column 1, Column 85) are row hits; an access to a different row (Row 1) is a row conflict, which must close the open row and open the new one before data can be read.]
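As a rough mental model, a bank can be thought of as a single "open row" register. The toy C sketch below is only illustrative: the latencies are made-up placeholders, not real DRAM timing parameters, and an access to an empty (closed) row buffer is charged like a conflict for simplicity even though it does not need to close an old row first.

    /* Toy model of one DRAM bank's row buffer, showing why a row-conflict
     * access costs much more than a row-hit access. */
    enum { ROW_HIT_LATENCY = 1, ROW_CONFLICT_LATENCY = 3 };  /* illustrative only */

    typedef struct {
        int open_row;   /* row currently in the row buffer (-1 = empty) */
    } Bank;

    int access_bank(Bank *b, int row)
    {
        if (b->open_row == row)
            return ROW_HIT_LATENCY;   /* row hit: data read straight from the row buffer */
        b->open_row = row;            /* close the old row, open the requested one       */
        return ROW_CONFLICT_LATENCY;
    }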

DRAM Controllers

A row-conflict memory access takes significantly longer than a row-hit access, and current controllers take advantage of the row buffer. The commonly used scheduling policy (FR-FCFS) [Rixner 2000]* works as follows:
(1) Row-hit first: service row-hit memory accesses first.
(2) Oldest-first: then service older accesses first.
This scheduling policy aims to maximize DRAM throughput.

*Rixner et al., "Memory Access Scheduling," ISCA 2000.
*Zuravleff and Robinson, "Controller for a synchronous DRAM …," US Patent 5,630,096, May 1997.
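To make the policy concrete, here is a minimal C sketch of FR-FCFS request selection for a single bank. The Request structure, its field names, and the frfcfs_pick function are hypothetical illustrations (not from the slides or the cited papers); the logic is simply the two rules above: prefer a row hit over a non-hit, and break ties by age.

    #include <stddef.h>

    /* Hypothetical request record in one bank's memory request buffer. */
    typedef struct {
        int row;                /* DRAM row this request targets            */
        unsigned long arrival;  /* arrival time, for oldest-first ordering  */
    } Request;

    /* FR-FCFS: (1) row-hit first, (2) oldest first.
     * 'open_row' is the row currently held in the bank's row buffer.
     * Returns the index of the request to service next, or -1 if none. */
    int frfcfs_pick(const Request *q, size_t n, int open_row)
    {
        int best = -1;
        int best_is_hit = 0;
        for (size_t i = 0; i < n; i++) {
            int is_hit = (q[i].row == open_row);
            if (best < 0 ||
                (is_hit && !best_is_hit) ||               /* prefer row hits            */
                (is_hit == best_is_hit &&
                 q[i].arrival < q[best].arrival)) {       /* then prefer older requests */
                best = (int)i;
                best_is_hit = is_hit;
            }
        }
        return best;
    }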

The Problem

Multiple threads share the DRAM controller, and DRAM controllers are designed to maximize DRAM throughput. As a result, DRAM scheduling policies are thread-unfair:
- Row-hit first unfairly prioritizes threads with high row buffer locality (threads that keep accessing the same row).
- Oldest-first unfairly prioritizes memory-intensive threads.
This makes the DRAM controller vulnerable to denial-of-service attacks: one can write programs that exploit the unfairness.

A Memory Performance Hog

STREAM (streaming access):

    // initialize large arrays A, B
    for (j=0; j<N; j++) {
        index = j*linesize;
        A[index] = B[index];
        …
    }

- Sequential memory access
- Very high row buffer locality (96% hit rate)
- Memory intensive

RANDOM (random access):

    // initialize large arrays A, B
    for (j=0; j<N; j++) {
        index = rand();
        A[index] = B[index];
        …
    }

- Random memory access
- Very low row buffer locality (3% hit rate)
- Similarly memory intensive

STREAM streams through memory, performing operations on two 1D arrays with sequential accesses. Because consecutive accesses are a cache line apart (the index advances by linesize), every access touches a new cache line and misses in the cache, hence the program is very memory intensive. Link to the real code…

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.

What Does the Memory Hog Do?

[Slide animation: the memory request buffer of one bank holds many requests from T0 (STREAM), all to Row 0, interleaved with a few requests from T1 (RANDOM) to other rows (Rows 5, 111, 16); Row 0 sits in the row buffer, so T0's requests are row hits and T1's are row conflicts.]

The picture demonstrates how STREAM denies memory service to RANDOM. STREAM continuously accesses columns in row 0 (it streams through a row after opening it), so almost all of its requests are row hits. RANDOM's requests are row conflicts (no locality). The DRAM controller reorders STREAM's requests to the open row ahead of other requests, even older ones, to maximize DRAM throughput. Hence, RANDOM's requests do not get serviced as long as STREAM keeps issuing requests at a fast enough rate; in this example, the red thread's request to another row is not serviced until STREAM stops issuing requests to row 0.

With a row size of 8KB and a cache block size of 64B, 128 (8KB/64B) requests of T0 are serviced before one of T1. As row buffer sizes increase, which is the industry trend, this problem becomes more severe. This is not the worst case (STREAM falls off the row buffer at some point), but it is easy to construct and understand; I leave it to the listeners to construct a case worse than this (it is possible).

Moscibroda and Mutlu, "Memory Performance Attacks," USENIX Security 2007.
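A quick worked check of the 128-requests figure above (the small C program below merely restates the slide's arithmetic; the variable names are illustrative):

    #include <stdio.h>

    int main(void)
    {
        /* Values from the slide: 8 KB row, 64 B cache blocks. */
        int row_bytes   = 8 * 1024;
        int block_bytes = 64;
        /* Under row-hit-first, T0 can stream every block of the open row
         * before T1's older row-conflict request is serviced:           */
        printf("%d T0 requests serviced before T1's request\n",
               row_bytes / block_bytes);   /* prints 128 */
        return 0;
    }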

Now That We Know What Happens Underneath

How would you solve the problem? What is the right place to solve it? The programmer? System software? The compiler? Hardware (the memory controller)? Hardware (DRAM)? Circuits?

Two other goals of this course: make you think critically, and make you think broadly.

[Slide figure: the transformation hierarchy: Problem → Algorithm → Program/Language → Runtime System (VM, OS, MM) → ISA (Architecture) → Microarchitecture → Logic → Circuits → Electrons.]