RTAS 2014: Bounding Memory Interference Delay in COTS-based Multi-Core Systems. Hyoseung Kim, Dionisio de Niz, Björn Andersson, Mark Klein, Onur Mutlu, Raj Rajkumar

Presentation transcript:

RTAS 2014. Bounding Memory Interference Delay in COTS-based Multi-Core Systems. Hyoseung Kim, Dionisio de Niz, Björn Andersson, Mark Klein, Onur Mutlu, Raj Rajkumar

Copyright 2014 Carnegie Mellon University and IEEE. This material is based upon work funded and supported by the Department of Defense under Contract No. FA C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT. This material has been approved for public release and unlimited distribution. This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute. Carnegie Mellon® is registered in the U.S. Patent and Trademark Office by Carnegie Mellon University.

Why Multi-Core Processors?
Processor development trend
–Increasing overall performance by integrating multiple cores
Embedded systems: actively adopting multi-core CPUs
Automotive:
–Freescale i.MX6 4-core CPU
–NVIDIA Tegra K1 platform
Avionics and defense:
–Rugged Intel i7 single-board computers
–Freescale P-series multi-core CPU

Multi-Core Memory System
[Figure: cores 1..P with a shared cache, connected through the memory controller to DRAM banks 1..x]
Cores share memory resources (cache, memory controller, DRAM banks), which causes memory interference
An upper bound on the memory interference delay is needed to check task schedulability

Impact of Memory Interference (1/2)
Task execution times with memory attacker tasks
–Intel i7 4-core processor + DDR3 SDRAM 8GB
[Figure: one core runs a PARSEC benchmark task while the other cores run memory attacker tasks; all cores share the cache and the DDR3 SDRAM]
* S/W cache partitioning is used

Impact of Memory Interference (2/2)
[Figure: normalized execution time (%) of the PARSEC benchmarks (blackscholes, bodytrack, canneal, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264) with 1, 2, and 3 memory attackers]
–1 attacker: max 5.5x increase
–2 attackers: max 8.4x increase
–3 attackers: max 12x increase
A 12x increase in execution time was observed
We should predict, bound, and reduce the memory interference delay!

Issues with Memory Models
Q1. Can we assume that each memory request takes a constant service time?
–The memory access time varies considerably depending on the requested address and the rank/bank/bus states
Q2. Can we assume memory requests are serviced in either Round-Robin or FIFO order?
–Memory requests arriving early may be serviced later than ones arriving later in today's COTS memory systems
No. An over-simplified memory model may produce pessimistic or optimistic estimates of the memory interference delay

Our Approach
Explicitly considers the timing characteristics of major DRAM resources
–Rank/bank/bus timing constraints (JEDEC standard)
–Request re-ordering effect
Bounding memory interference delay for a task
–Combines the request-driven approach (based on the task's own memory requests) and the job-driven approach (based on the interfering memory requests that arrive during the job's execution)
Software DRAM bank partitioning awareness
–Analyzes the effect of dedicated and shared DRAM banks

Outline
Introduction
Details on DRAM Systems
–DRAM organization
–Memory controller and scheduling policy
–DRAM bank partitioning
Bounding Memory Interference Delay
Evaluation
Conclusion

DRAM Organization
[Figure: a DRAM rank consists of 8 chips sharing the address and command buses and a 64-bit data bus (8 bits per chip); each chip contains 8 banks, and each bank is an array of rows and columns with its own row buffer, row decoder, column decoder, and command decoder]
DRAM access latency varies depending on which row is stored in the row buffer
–Row hit: the requested row is already in the row buffer
–Row conflict: a different row is currently in the row buffer and must be replaced first
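To make the row-buffer effect concrete, here is a minimal sketch (not from the paper) that computes the approximate latency of a row hit versus a row conflict. The tCL/tRCD/tRP structure follows standard DDR3 terminology, but the numeric values are placeholder assumptions, not figures from any specific device.

```c
/* Illustrative sketch: approximate latency (in DRAM clock cycles) until read
 * data is returned, for a row hit vs. a row conflict. Timing values are
 * placeholder DDR3-style parameters (assumptions), not taken from the paper. */
#include <stdio.h>

typedef struct {
    int tCL;   /* column (CAS) read latency                        */
    int tRCD;  /* row activate to column access delay              */
    int tRP;   /* precharge delay to close the currently open row  */
} dram_timing_t;

/* Row hit: the requested row is already in the row buffer. */
static int row_hit_latency(const dram_timing_t *t)
{
    return t->tCL;
}

/* Row conflict: another row is open, so precharge it, activate the
 * requested row, then issue the column read. */
static int row_conflict_latency(const dram_timing_t *t)
{
    return t->tRP + t->tRCD + t->tCL;
}

int main(void)
{
    dram_timing_t ddr3 = { .tCL = 9, .tRCD = 9, .tRP = 9 };  /* placeholder values */
    printf("row hit:      %d cycles\n", row_hit_latency(&ddr3));
    printf("row conflict: %d cycles\n", row_conflict_latency(&ddr3));
    return 0;
}
```

Even with these rough numbers, a row conflict costs roughly three times as long as a row hit, which is why the requested address and bank state matter so much.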

Memory Controller
[Figure: memory requests from the CPU cores enter read/write buffers and a request buffer with per-bank priority queues; a scheduler per bank feeds a channel scheduler, which issues commands on the DRAM address/command buses and transfers data between the processor and DRAM data buses]
Two-level hierarchical scheduling structure: per-bank schedulers plus a channel scheduler

Memory Scheduling Policy
FR-FCFS: First-Ready, First-Come First-Serve
–Goal: maximize DRAM throughput by maximizing the row-buffer hit rate
1. Bank scheduler
–Considers bank timing constraints
–Prioritizes row-hit requests
–In case of a tie, prioritizes older requests
2. Channel scheduler
–Considers channel timing constraints
–Prioritizes older requests
Memory access interference occurs at both schedulers: intra-bank interference at the bank schedulers and inter-bank interference at the channel scheduler
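A minimal sketch of the FR-FCFS priority rule described above, written as a comparison function. It illustrates the policy only; the request fields are assumptions for illustration, not a real controller interface.

```c
/* Illustrative sketch of the FR-FCFS priority rule: row-hit requests are
 * served first; among equals, the oldest request wins. Field names are
 * assumptions for illustration, not a real memory-controller interface. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t arrival_time;  /* when the request entered the bank queue */
    uint32_t row;           /* DRAM row targeted by the request        */
} mem_request_t;

/* Returns true if request a should be scheduled before request b,
 * given the row currently held in the bank's row buffer. */
static bool fr_fcfs_before(const mem_request_t *a, const mem_request_t *b,
                           uint32_t open_row)
{
    bool a_hit = (a->row == open_row);
    bool b_hit = (b->row == open_row);
    if (a_hit != b_hit)
        return a_hit;                          /* first-ready: row hits first  */
    return a->arrival_time < b->arrival_time;  /* then first-come, first-serve */
}
```

This is also why an older request from one core can be delayed behind younger row-hit requests from another core, which is the re-ordering effect the analysis has to account for.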

DRAM Bank Partitioning
Prevents intra-bank interference by dedicating different DRAM banks to each core
–Can be supported in the OS kernel
(1) w/o bank partitioning: cores share banks, so both intra-bank and inter-bank interference occur
(2) w/ bank partitioning: each core uses its own banks, so only inter-bank interference (at the channel scheduler) remains
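As a rough illustration of how an OS kernel can dedicate banks to cores, the sketch below uses page coloring: it derives a bank "color" from assumed physical address bits and restricts each core's page frames to its own color. The bit positions and the one-bank-per-core mapping are assumptions; real platforms map banks from controller-specific address bits, and the Linux/RK mechanism used in the paper may differ.

```c
/* Illustrative sketch of software DRAM bank partitioning via page coloring.
 * The bank-bit positions below are assumptions for illustration; real
 * controllers derive the bank index from platform-specific address bits. */
#include <stdbool.h>
#include <stdint.h>

#define BANK_SHIFT 13u   /* assumed: bank index starts at physical bit 13 */
#define BANK_BITS  3u    /* assumed: 2^3 = 8 banks                        */

/* Bank "color" of a physical page frame. */
static inline uint32_t bank_color(uint64_t phys_addr)
{
    return (uint32_t)((phys_addr >> BANK_SHIFT) & ((1u << BANK_BITS) - 1u));
}

/* A page allocator for core 'core_id' would only hand out frames whose
 * color matches that core's reserved bank, e.g. one bank per core: */
static inline bool frame_allowed_for_core(uint64_t phys_addr, unsigned core_id)
{
    return bank_color(phys_addr) == (core_id & ((1u << BANK_BITS) - 1u));
}
```

With such an allocation policy, requests from different cores never target the same bank, removing the intra-bank interference shown in case (1) above.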

Outline
Introduction
Details on DRAM Systems
Bounding Memory Interference Delay
–System Model
–Request-Driven Bounding
–Job-Driven Bounding
–Schedulability Analysis
Evaluation
Conclusion

System Model

Bounding Memory Interference Delay
1. Request-Driven Bounding
2. Job-Driven Bounding
Both feed into a response-time based schedulability analysis

Request-Driven (RD) Approach
Per-request interference delay
–Bounded by using only DRAM and CPU core parameters
–Not by using the task parameters of other tasks

Job-Driven (JD) Approach
–Bounds the delay by the interfering memory requests generated during the job's execution

RTAS 2014 Memory interference delay cannot exceed any results from the RD and JD approaches –We take the smaller result from the two approaches Extended response-time test Response-Time Test Classical iterative response-time test Request-Driven (RD) Approach Job-Driven (JD) Approach 19/25

Outline
Introduction
Details on DRAM Systems
Bounding Memory Interference Delay
Evaluation
Conclusion

Experimental Setup
Target system
–Linux/RK
–Intel i7 quad-core processor
–DDR3 SDRAM (single-channel configuration)
Workload: PARSEC benchmarks
Methodology
–Compare the observed and predicted response times in the presence of memory interference
–Core 1: each PARSEC benchmark
–Cores 2, 3 and 4: tasks generating interfering memory requests, with severe and non-severe memory interference (modified versions of the stream benchmark)
–Cache partitioning: to avoid cache interference
–Bank partitioning: to test both private and shared banks
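As a rough illustration of what an interfering "memory attacker" task looks like, the sketch below streams through a buffer assumed to be much larger than the last-level cache so that nearly every access misses in cache and generates DRAM traffic. The buffer size and stride are assumptions; the evaluation actually uses modified versions of the stream benchmark.

```c
/* Illustrative sketch of a memory-attacker task: stream through a buffer far
 * larger than the last-level cache so nearly every access misses in cache and
 * generates DRAM traffic. Sizes and stride are assumptions for illustration;
 * the paper uses modified versions of the stream benchmark instead. */
#include <stdlib.h>

#define BUF_BYTES   (256u * 1024u * 1024u)  /* assumed: much larger than LLC */
#define CACHE_LINE  64u                     /* assumed cache-line size        */

int main(void)
{
    volatile char *buf = malloc(BUF_BYTES);
    if (buf == NULL)
        return 1;
    for (;;) {                                   /* run until killed                  */
        for (size_t i = 0; i < BUF_BYTES; i += CACHE_LINE)
            buf[i]++;                            /* roughly one cache miss per line   */
    }
}
```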

Shared DRAM Bank, Severe Memory Interference (1 of 2)
[Figure: observed response times vs. response times predicted with and without the re-ordering effect, for each PARSEC benchmark]
–Up to a 12x increase due to memory interference
–Predictions without the re-ordering effect are overly optimistic
The request re-ordering effect should be considered in COTS memory systems

Private DRAM Bank, Severe Memory Interference (2 of 2)
[Figure: observed vs. predicted response times for each PARSEC benchmark with private DRAM banks]
–Up to a 4.1x increase: DRAM bank partitioning helps reduce the memory interference
Our analysis enables the quantification of the benefit of DRAM bank partitioning

Private DRAM Bank, Non-severe Memory Interference
[Figure: observed vs. predicted response times for each PARSEC benchmark]
–Average over-estimate is 8% (13% for a shared bank)
Our analysis bounds the memory interference delay with low pessimism under both high and low memory contention

Conclusions
Analysis for bounding memory interference
–Based on a realistic memory model: JEDEC DDR3 SDRAM standard, FR-FCFS policy of the memory controller, shared and private DRAM banks
–Combination of the request-driven and job-driven approaches, which reduces pessimism in the analysis (8% under severe interference)
Advantages
–Does not require any modifications to hardware components or application software
–Readily applicable to COTS-based multi-core real-time systems
Future work
–Prefetcher, multi-channel memory systems