MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms. Apr 9, 2012. Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Marco Caccamo, Lui Sha.


MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms Apr 9, 2012 Heechul Yun +, Gang Yao +, Rodolfo Pellizzoni *, Marco Caccamo +, Lui Sha + + University of Illinois at Urbana-Champaign * University of Waterloo

Real-Time Applications
Resource-intensive real-time applications
– Multimedia processing(*), real-time data analytics(**), object tracking
Requirements
– More performance at lower cost → Commercial Off-The-Shelf (COTS)
– Performance guarantees (i.e., temporal predictability and isolation)
(*) ARM, QoS for High-Performance and Power-Efficient HD Multimedia, 2010
(**) Intel, The Growing Importance of Big Data and Real-Time Analytics, 2012

Modern System-on-Chip (SoC)
More cores
– Freescale P4080 has 8 cores
More sharing
– Shared memory hierarchy (LLC, MC, DRAM)
– Shared I/O channels
More performance, less cost. But isolation?

Problem: Shared Memory Hierarchy
– Shared hardware resources
– OS has little control
[Figure: four cores (Core1–Core4) above a shared last-level cache (LLC) and memory controller (MC) feeding DRAM; the LLC suffers space contention, the MC access contention.]

Memory Performance Isolation
Q. How can we guarantee worst-case performance?
– Need to guarantee memory performance
[Figure: four cores sharing the LLC, memory controller, and DRAM.]

Inter-Core Memory Interference
– Significant slowdown (up to 2x on 2 cores)
– Slowdown is not proportional to memory bandwidth usage
[Figure: runtime slowdown of foreground benchmarks (X-axis) co-running with a background task (470.lbm) on an Intel Core2 with shared L2; the annotated solo bandwidth usages (1.6, 1.5, 1.4 GB/s) do not predict the slowdowns.]

Background: DRAM Chip
– Organized into banks (Bank 1–4), each an array of rows with a per-bank row buffer
– activate loads a row into the row buffer; precharge closes it; reads/writes access columns of the open row
– State-dependent access latency: row miss 19 cycles, row hit 9 cycles (*)
[Figure: a bank with Rows 1–5 and a row buffer, servicing READ (Bank 1, Row 3, Col 7).]
(*) PC6400-DDR2 with (RAS-CAS-CL latency setting)
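The state-dependent latency can be modeled in a few lines, using the 9/19-cycle figures quoted above (the bank/row trace and names are invented for illustration):

```python
# Minimal model of state-dependent DRAM bank latency. Cycle counts
# are the row-hit/row-miss figures from the slide (PC6400-DDR2);
# the access trace below is made up.

ROW_HIT_CYCLES = 9    # column access into the already-open row buffer
ROW_MISS_CYCLES = 19  # precharge + activate + column access

def access_latency(row_buffer, bank, row):
    """Return latency in cycles and update the bank's open-row state."""
    if row_buffer.get(bank) == row:
        return ROW_HIT_CYCLES
    row_buffer[bank] = row  # close the old row, open the new one
    return ROW_MISS_CYCLES

buf = {}
trace = [(1, 3), (1, 3), (1, 5), (2, 3)]  # (bank, row) pairs
latencies = [access_latency(buf, b, r) for b, r in trace]
print(latencies)  # [19, 9, 19, 19]: only the repeated access hits
```

The second access hits because Row 3 is still open in Bank 1; switching to Row 5 pays the full miss penalty again, which is exactly the state dependence that makes worst-case latency hard to bound.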

Background: Memory Controller (MC)
Request queue
– Buffers read/write requests from CPU cores
– Unpredictable queuing delay due to reordering
Bruce Jacob et al., “Memory Systems: Cache, DRAM, Disk,” Fig 13.1.

Background: MC Queue Reordering
– Improves row-hit ratio and throughput
– Causes unpredictable queuing delay
Initial queue (2 row switches):
– Core1: READ Row 1, Col 1
– Core2: READ Row 2, Col 1
– Core1: READ Row 1, Col 2
Reordered queue (1 row switch):
– Core1: READ Row 1, Col 1
– Core1: READ Row 1, Col 2
– Core2: READ Row 2, Col 1
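The reordering above resembles FR-FCFS (first-ready, first-come-first-served) scheduling. A toy model (our simplification, not the controller's actual logic) reproduces the slide's example:

```python
# Sketch of FR-FCFS-style reordering: among pending requests, serve the
# oldest one that hits the currently open row; fall back to plain FCFS
# when no request hits. Requests are (core, row, col) tuples.

def reorder_fr_fcfs(queue, open_row):
    ordered, pending, row = [], list(queue), open_row
    while pending:
        hit = next((r for r in pending if r[1] == row), None)
        req = hit if hit is not None else pending[0]
        pending.remove(req)
        ordered.append(req)
        row = req[1]  # serving the request leaves its row open
    return ordered

def row_switches(ordered, open_row):
    switches, row = 0, open_row
    for _, r, _ in ordered:
        if r != row:
            switches, row = switches + 1, r
    return switches

queue = [("core1", 1, 1), ("core2", 2, 1), ("core1", 1, 2)]
print(reorder_fr_fcfs(queue, open_row=1))
# core1's two row-1 requests are served back to back; core2's older
# request is pushed behind a younger one -- the source of the
# unpredictable queuing delay.
```

With Row 1 initially open, the reordered schedule needs one row switch instead of two, but core2's request is delayed by a request that arrived after it.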

Challenges for Real-Time Systems
Memory controller (MC) level queuing delay
– Main source of interference
– Unpredictable (reordering)
DRAM state-dependent latency
– Depends on DRAM row open/close state
State of the art
– Predictable DRAM controller hardware: [Akesson’07] [Paolieri’09] [Reineke’11] → not in COTS

Our Approach
OS-level approach
– Works on commodity hardware
– Guarantees performance of each core
– Maximizes memory performance when possible

MemGuard
– Memory bandwidth reservation and reclaiming
[Figure: MemGuard sits in the operating system between the multicore processor and the memory controller/DRAM DIMM; per-core bandwidth (BW) regulators driven by performance monitoring counters (PMCs) enforce reservations (e.g., 0.9 GB/s vs. 0.1 GB/s), and a reclaim manager redistributes unused budget.]

Memory Bandwidth Reservation
Idea
– OS monitors and enforces each core’s memory bandwidth usage
[Figure: timeline of one core's budget over two 1 ms periods; the budget drops on each memory fetch, the core's tasks are dequeued when the budget hits zero, and they are re-enqueued with a fresh budget at the next period boundary.]
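The regulation loop can be sketched in user-space pseudocode (names are ours; the real MemGuard is a Linux kernel module that programs a per-core PMC to raise an interrupt when the budget is consumed):

```python
# Hedged sketch of per-core bandwidth regulation: each core gets a
# budget of memory accesses per regulation period; when the budget is
# spent the core's tasks are dequeued until the period timer fires.

class Regulator:
    def __init__(self, budget):
        self.budget = budget      # accesses allowed per period
        self.remaining = budget
        self.throttled = False

    def on_access(self):
        """Models the PMC event fired per last-level-cache miss."""
        if self.throttled:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.throttled = True  # dequeue this core's tasks

    def on_period(self):
        """Periodic timer: replenish the budget, re-enqueue tasks."""
        self.remaining = self.budget
        self.throttled = False

reg = Regulator(budget=3)
for _ in range(5):
    reg.on_access()
print(reg.throttled)   # True: budget exhausted mid-period
reg.on_period()
print(reg.throttled)   # False: fresh budget at the period boundary
```

The key design point is that enforcement is per period: a core can burst within its budget, but it cannot exceed its reserved share over any period.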

Memory Bandwidth Reservation

Guaranteed Bandwidth: r_min
(*) PC6400-DDR2 with (RAS-CAS-CL latency setting)
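The formula on this slide did not survive the transcript. Based on the DRAM timing slide earlier, a plausible reconstruction (our assumption, not necessarily the paper's exact expression) is the bandwidth the DRAM delivers when every access is a worst-case row miss:

```latex
% Hedged reconstruction: L is the memory transaction (cache-line) size,
% T_miss the worst-case (row-miss) service time of one transaction.
r_{\min} = \frac{L}{T_{\mathrm{miss}}}
```

Any set of per-core reservations summing to at most r_min can then be guaranteed regardless of access patterns; the evaluation slides later use r_min = 1.2 GB/s for this PC6400-DDR2 platform.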

Memory Access Pattern
– Memory access patterns vary over time
– Static reservation is inefficient
[Figure: memory requests over time (ms) for two workloads, showing time-varying demand.]

Memory Bandwidth Reclaiming
Key objective
– Redistribute excess bandwidth to demanding cores
– Improve memory bandwidth utilization
Predictive bandwidth donation and reclaiming
– Donate unneeded budget predictively
– Reclaim on an on-demand basis

Reclaim Example
– Time 0: initial budget 3 for both cores
– Time 3, 4: budgets decremented
– Time 10 (period 1): predictive donation (total 4)
– Time 12, 15: Core 1 reclaims
– Time 16: Core 0 reclaims
– Time 17: Core 1 has no remaining budget; its tasks are dequeued
– Time 20 (period 2): Core 0 donates 2; Core 1 does not donate
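The mechanism behind this example can be replayed with a toy model of the protocol (class names, the one-unit reclaim granularity, and the last-period-usage predictor are our assumptions):

```python
# Sketch of predictive donation and on-demand reclaiming: at each
# period start a core donates the budget it does not expect to use
# (predicted from its last period) to a global pool; a core that runs
# dry reclaims from the pool, and is throttled only when the pool is
# empty too.

class ReclaimManager:
    def __init__(self):
        self.pool = 0  # donated budget units, shared by all cores

class Core:
    def __init__(self, budget):
        self.budget = budget
        self.used_last = budget   # predictor: last period's usage
        self.remaining = 0
        self.used = 0

    def start_period(self, mgr):
        donate = max(0, self.budget - self.used_last)  # predictive donation
        self.remaining = self.budget - donate
        mgr.pool += donate
        self.used = 0

    def access(self, mgr):
        """Returns False when the core must be throttled."""
        if self.remaining == 0:
            if mgr.pool == 0:
                return False      # no budget anywhere: dequeue tasks
            mgr.pool -= 1         # reclaim one unit on demand
            self.remaining = 1
        self.remaining -= 1
        self.used += 1
        return True

    def end_period(self):
        self.used_last = self.used  # update the predictor

mgr = ReclaimManager()
c0, c1 = Core(budget=3), Core(budget=3)
c0.used_last, c1.used_last = 1, 3   # core 0 was mostly idle last period
c0.start_period(mgr); c1.start_period(mgr)
print(mgr.pool)                     # 2: core 0 donated its excess
served = [c1.access(mgr) for _ in range(5)]
print(served)                       # core 1 reclaims past its own budget
```

Core 1 performs five accesses against a budget of three by reclaiming core 0's donation; a misprediction (core 0 suddenly needing its donated budget back) is what the soft-reservation slide at the end quantifies.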

Best-Effort Bandwidth Sharing
Key objective
– Utilize best-effort bandwidth whenever possible
Best-effort bandwidth
– The bandwidth available after all cores have used their budgets (i.e., the guaranteed bandwidth has been delivered) and before the next period begins
Sharing policy
– Maximize throughput
– Broadcast to all cores to continue executing
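A minimal sketch of that policy decision (function names are ours):

```python
# Best-effort phase sketch: once every core has consumed its reserved
# budget, the guaranteed bandwidth for this period has been delivered,
# so all throttled cores can be released until the period boundary.

def all_budgets_exhausted(remaining_budgets):
    """True once every core's reserved budget is spent this period."""
    return all(b <= 0 for b in remaining_budgets)

def cores_to_resume(remaining_budgets, throttled):
    if all_budgets_exhausted(remaining_budgets):
        return set(throttled)  # broadcast: everyone runs freely
    return set()               # a reservation is still outstanding

print(cores_to_resume([0, 0, 0, 0], {1, 3}))  # {1, 3}
print(cores_to_resume([2, 0, 0, 0], {1, 3}))  # set()
```

As long as any core still holds reserved budget, throttled cores stay dequeued so that the remaining reservation cannot be violated.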

Evaluation Platform
– Intel Core2Quad Q8400, 4MB L2 cache, PC6400 DDR2 DRAM
– Prefetchers were turned off for the evaluation
– PowerPC-based P4080 (8 cores) and ARM-based Exynos4412 (4 cores) ports also exist
– MemGuard implemented as a kernel module on a modified Linux kernel
– Used all 29 benchmarks from SPEC2006 plus synthetic benchmarks
[Figure: Core2Quad layout — two pairs of cores, each pair sharing an L2 cache, connected to the memory controller and DRAM.]

Evaluation Results
– Setup: 462.libquantum (foreground) co-runs with memory hogs (background) on an Intel Core2 with shared L2
– 53% performance reduction of the foreground under interference
[Figure: normalized IPC of the foreground 462.libquantum.]

Evaluation Results
– Reservation: 1 GB/s for the foreground 462.libquantum, 0.2 GB/s for the background memory hogs
– Reservation provides performance isolation: guaranteed performance
[Figure: normalized IPC of the foreground 462.libquantum.]

Evaluation Results
– Same setup: 1 GB/s for the foreground 462.libquantum, 0.2 GB/s for the background memory hogs
– Reclaiming and sharing maximize performance while keeping the guarantee
[Figure: normalized IPC of the foreground 462.libquantum.]

SPEC2006 Results
– Guarantees (soft) performance of the foreground (X-axis) w.r.t. a 1.0 GB/s memory bandwidth reservation
– Setup: foreground (X-axis, 1 GB/s) vs. 470.lbm background (0.2 GB/s) on an Intel Core2 with shared L2
[Figure: IPC normalized to guaranteed performance.]

SPEC2006 Results
– Improves overall throughput: background +368%, foreground (X-axis) +6%
[Figure: IPC normalized to guaranteed performance; same setup as the previous slide.]

Conclusion
Inter-core memory interference
– A big challenge for multi-core based real-time systems
– Sources: queuing delay in the MC, state-dependent latency in DRAM
MemGuard
– An OS mechanism providing efficient per-core memory performance isolation on COTS hardware
– Supports memory bandwidth reservation and reclaiming

Thank you.

Effect of Reclaim
[Figure: IPC improvement of the background and IPC reduction of the foreground under reclaiming.]

Reclaim Underrun Error

Effect of Spare Sharing
– IPC of the background improves
– IPC of the foreground also improves

Isolation and Throughput
– Effect of r_min
– 4-core configuration

Isolation Effect of Reservation
– Sum of bandwidth reservations ≤ r_min (1.2 GB/s) → isolation
– 1.0 GB/s (X-axis) + 0.2 GB/s (lbm) = r_min
[Figure: isolation vs. solo performance; Core 0 reserves 1.0 GB/s for the X-axis benchmark, Core 2 reserves 0.2–2.0 GB/s for lbm.]

Effect of MemGuard
– Soft real-time application on each core
– Provides differentiated memory bandwidth: per-core weights 1:2:4:8 for the guaranteed bandwidth, with spare bandwidth sharing enabled

Hard/Soft Reservation on MemGuard
Hard reservation (w/o reclaiming)
– Guarantees memory bandwidth B_i in each period regardless of other cores
– Budget is wasted if not used
Soft reservation (w/ reclaiming)
– Misprediction can cause a missed bandwidth guarantee in a period
– The error rate is small: less than 5%
Selectively applicable on a per-core basis