DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems, by Magnus Jahre, Marius Grannaes and Lasse Natvig

Presentation transcript:

1 DIEF: An Accurate Interference Feedback Mechanism for Chip Multiprocessor Memory Systems
Magnus Jahre†, Marius Grannaes†‡ and Lasse Natvig†
† Norwegian University of Science and Technology
‡ Energy Micro

2 Chip Multiprocessor Resources
Hardware-controlled, shared resources
– Interconnect bandwidth
– Shared cache capacity
– Memory bus bandwidth
– Memory capacity is allocated by the operating system
Interference can occur in all shared units
Current CMP implementations do not take interference into account

3 Why Control Resource Allocation?
Provide predictable performance
Support OS scheduler assumptions
Cloud: Fulfill Service Level Agreement

4 Resource Allocation Tasks Focus of this work

5 Resource Allocation Baselines
Baseline = Interference-free configuration
Quantify performance impact from interference
Private Mode and Shared Mode

6 Multi-Programmed Baseline
All processes in a workload run concurrently
Static and equal partitioning of all shared resources

7 Single Program Baseline
The process is run alone in one core
All other cores are idle
Exclusive access to all shared resources

8 Baseline Weaknesses
Multiprogrammed Baseline
– Only accounts for interference in partitioned resources
– Static and equal division of DRAM bandwidth does not give equal latency
– Complex relationship between resource allocation and performance
Single Program Baseline
– Does not exist in shared mode
Dynamic Interference Estimation Framework (DIEF)

9 Outline
Introduction
Dynamic Interference Estimation Framework
– Shared Cache
– Memory Bus
– On-chip interconnect
Results
Summary

10 Interference Estimation
Full-System Interference Estimation
Aggregate interference from different units
Common unit of measure: Average Latency (Clock Cycles)
DIEF: General, component-based framework
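The aggregation idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `UnitSample` type, the unit names, the cycle counts, and the rule that the private mode estimate equals the summed shared mode latency minus the summed interference are all hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class UnitSample:
    """Latency bookkeeping for one shared unit, in clock cycles (hypothetical type)."""
    name: str
    shared_latency: float  # average latency measured in shared mode
    interference: float    # average latency attributed to other cores


def private_latency_estimate(units):
    """Assumed rule: total shared mode latency minus total estimated interference."""
    shared = sum(u.shared_latency for u in units)
    interference = sum(u.interference for u in units)
    return shared - interference


# Hypothetical per-unit averages for one core.
units = [
    UnitSample("interconnect", shared_latency=20.0, interference=4.0),
    UnitSample("shared cache", shared_latency=35.0, interference=10.0),
    UnitSample("memory bus", shared_latency=180.0, interference=60.0),
]
estimate = private_latency_estimate(units)
print(estimate)
```

Because every unit reports in the same unit of measure, the per-unit values can simply be added, which is what makes a component-based framework straightforward to extend with further units.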

11 Interference Definition
[Figure: relates interference and estimate error to the shared mode latency, the private mode latency estimate, and the private mode latency measurement]

12 Shared Cache Interference
[Figure: CPU 0 and CPU 1 access a shared cache, with auxiliary tag directories alongside it; cache access shown: B]

13 Shared Cache Interference
[Figure: same setup after block C enters the shared cache; cache access shown: B]
Eviction may not be interference

14 Shared Cache Interference
[Figure: same setup; the access to B is labeled with Hit and Miss]
Interference cost = miss penalty
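The following Python sketch illustrates the auxiliary tag directory idea from slides 12 to 14 on a single cache set. It is a simplified model under stated assumptions, not the authors' hardware design: the `LRUSet` class, the LRU replacement policy, the choice to give each core's directory the full associativity of the shared cache, and the 120-cycle miss penalty are all hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set that keeps tags in LRU order, oldest first (hypothetical model)."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()

    def access(self, tag):
        """Return True on a hit; on a miss insert the tag, evicting the LRU entry if full."""
        if tag in self.tags:
            self.tags.move_to_end(tag)
            return True
        if len(self.tags) >= self.ways:
            self.tags.popitem(last=False)
        self.tags[tag] = None
        return False


def cache_interference(accesses, shared_ways, private_ways, miss_penalty):
    """Charge a miss penalty when an access misses in the shared set but hits in the
    per-core auxiliary tag directory, i.e. it would have hit without the other cores."""
    shared = LRUSet(shared_ways)
    atd = {}
    interference = {}
    for cpu, tag in accesses:
        atd.setdefault(cpu, LRUSet(private_ways))
        interference.setdefault(cpu, 0)
        shared_hit = shared.access((cpu, tag))
        private_hit = atd[cpu].access(tag)
        if private_hit and not shared_hit:
            interference[cpu] += miss_penalty
    return interference


# CPU 1 evicts both A and B, but only B is accessed again by CPU 0.
accesses = [(0, "A"), (0, "B"), (1, "M"), (1, "N"), (0, "B")]
result = cache_interference(accesses, shared_ways=2, private_ways=2, miss_penalty=120)
print(result)
```

In this example the eviction of A costs nothing because A is never accessed again, while the second access to B is charged one miss penalty, matching the two observations on the slides.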

15 Bus Interference Requirements
Out-of-order memory bus scheduling
Shared-mode-only cache misses and cache hits
Shared cache writebacks
Computing private latency based on shared mode queue contents is difficult
Emulate private scheduling in the shared mode

16 Memory Latency Estimation Buffer
[Figure: shared bus queue with arrival order, execution order, and head pointer; latency lookup table; emulation registers holding bank and open page; estimated queue latency]
Bank/Page Mapping: A → (0,15), B → (0,19), C → (0,15), D → (1,32)
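A rough Python sketch of emulating open-page state to estimate queue latency is shown below. The bank/page mapping is the one on the slide; everything else is an assumption made for illustration: the three latency classes, their cycle values, and the function name are hypothetical, and the sketch does not attempt to reproduce the latency value shown in the original figure.

```python
# Hypothetical latency lookup table in clock cycles (values are not from the slides).
LATENCY = {"hit": 40, "closed": 80, "conflict": 120}

# Bank/page mapping from the slide: request -> (bank, page).
MAPPING = {"A": (0, 15), "B": (0, 19), "C": (0, 15), "D": (1, 32)}


def estimate_queue_latency(order, mapping, latency=LATENCY):
    """Replay requests in the given order against emulated open-page registers
    and sum the looked-up latency of each request."""
    open_page = {}  # emulation registers: bank -> currently open page
    total = 0
    for request in order:
        bank, page = mapping[request]
        if bank not in open_page:
            kind = "closed"    # no page open in this bank yet
        elif open_page[bank] == page:
            kind = "hit"       # request targets the open page
        else:
            kind = "conflict"  # a different page must be opened first
        total += latency[kind]
        open_page[bank] = page
    return total


in_order = estimate_queue_latency(["A", "B", "C", "D"], MAPPING)
reordered = estimate_queue_latency(["A", "C", "B", "D"], MAPPING)
print(in_order, reordered)
```

Serving C right after A turns one page conflict into a page hit, which is why the emulated execution order, rather than the arrival order, determines the estimate under out-of-order scheduling.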

17 Interconnect Interference
[Figure: requests from CPU 0 and CPU 1 to L2 Bank 0 and L2 Bank 1, with interference counters]
CPU 1 delays CPU 0
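The interference counters can be illustrated with a small Python model. This is a sketch under assumptions that are not from the slides: each L2 bank serves one request at a time in arrival order, every request occupies the bank for a fixed `service_cycles`, and a core's counter accumulates only the cycles it spends waiting behind requests from other cores.

```python
def interconnect_interference(requests, service_cycles=4):
    """requests: iterable of (arrival_cycle, cpu, bank) tuples.
    Returns a dict mapping cpu -> cycles spent waiting behind other CPUs' requests."""
    busy = {}      # bank -> list of (start, end, cpu) occupancy intervals
    counters = {}  # per-CPU interference counters
    for arrival, cpu, bank in sorted(requests):
        intervals = busy.setdefault(bank, [])
        # The request starts once the bank has finished all earlier requests.
        start = max([arrival] + [end for _, end, _ in intervals])
        counters.setdefault(cpu, 0)
        for s, e, owner in intervals:
            if owner != cpu:
                # Count only the part of the wait caused by another CPU.
                counters[cpu] += max(0, min(e, start) - max(s, arrival))
        intervals.append((start, start + service_cycles, cpu))
    return counters


# CPU 1 reaches bank 0 first, so CPU 0's request to bank 0 has to wait.
requests = [(0, 1, 0), (1, 0, 0), (2, 0, 1)]
counters = interconnect_interference(requests)
print(counters)
```

Here CPU 0 arrives at bank 0 one cycle after CPU 1 and waits three cycles, while its later request to the idle bank 1 is not delayed at all.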

19 Relative Estimation Errors

20 RMS Error Breakdown Remaining units contribute less than 2 clock cycles
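For reference, the two error metrics named on slides 19 and 20 can be computed as in the Python sketch below. The formulas are the standard definitions of relative error and root-mean-square error; the function names and the sample latencies are hypothetical, and the slides do not state exactly how the authors normalized their results.

```python
import math


def relative_errors(estimates, measurements):
    """Signed error of each estimate relative to the measured value."""
    return [(e - m) / m for e, m in zip(estimates, measurements)]


def rms_error(estimates, measurements):
    """Root-mean-square of the absolute errors, in the unit of the inputs (clock cycles)."""
    if len(estimates) != len(measurements) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - m) ** 2 for e, m in zip(estimates, measurements)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical private mode latency estimates and measurements in clock cycles.
estimates = [105.0, 195.0]
measurements = [100.0, 200.0]
rel = relative_errors(estimates, measurements)
rms = rms_error(estimates, measurements)
print(rel, rms)
```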

21 Auxiliary Tag Directory Accuracy

23 Summary
Memory system interference causes unpredictable performance
DIEF provides
– Accurate private mode latency estimates
– Accurate shared mode latency measurements
Future opportunities
– Guiding dynamic optimizations
– Guiding OS scheduling decisions
– Debugging and optimization

24 Thank you! Visit our website: Questions?

25 Experiment Methodology
M5 simulator
– Extended with crossbar and ring on-chip interconnect models
– DDR2 memory bus model
Randomly generated workloads of SPEC2000 benchmarks
– 40 4-core workloads
– 20 8-core workloads
– 10 16-core workloads
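One possible way to generate workloads with these counts is sketched below in Python. The slide only says the workloads were randomly generated, so the details are assumptions: the benchmark list is the full SPEC CPU2000 suite (the subset actually used is not stated), benchmarks are drawn without repetition within a workload, and the seed and function name are hypothetical.

```python
import random

# SPEC CPU2000 benchmark names (integer and floating-point suites).
SPEC2000 = [
    "gzip", "vpr", "gcc", "mcf", "crafty", "parser", "eon", "perlbmk",
    "gap", "vortex", "bzip2", "twolf",
    "wupwise", "swim", "mgrid", "applu", "mesa", "galgel", "art", "equake",
    "facerec", "ammp", "lucas", "fma3d", "sixtrack", "apsi",
]


def generate_workloads(count, cores, rng):
    """Draw `count` workloads, each with `cores` distinct benchmarks (assumed policy)."""
    return [rng.sample(SPEC2000, cores) for _ in range(count)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
workloads = {
    cores: generate_workloads(count, cores, rng)
    for cores, count in [(4, 40), (8, 20), (16, 10)]
}
print({cores: len(group) for cores, group in workloads.items()})
```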