Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor June 14th, 2015
Energy Consumption on Mobile Platforms
Heterogeneous Multicore Systems (Kumar, MICRO'03)
- Multiple cores with different implementations (e.g., ARM big.LITTLE)
- Application migration: each application is mapped to its most energy-efficient core and migrates between cores as it runs
- Migration overhead is high, so instruction phases must be long (100M-500M instructions)
- Fine-grained phases expose more opportunities, but only if migration overhead is reduced → the Composite Core
Composite Core (Lukefahr, MICRO'12)
- Big μEngine and Little μEngine share the front-end and the L1 caches
- Little μEngine: 0.5x performance, 5x less power
- Primary thread on the Big μEngine; secondary thread on the Little μEngine
Problem: Cache Contention
- Threads compete for cache resources
- L2 caches in traditional multicore systems: memory-intensive threads grab most of the space, decreasing total throughput
- L1 cache contention on Composite Cores / SMT: a foreground and a background thread share the L1 caches
Performance Loss of Primary Thread (normalized IPC): worst case 28% decrease, average 10% decrease
Solutions to L1 Cache Contention
- Cache partitioning: resolves the contention and maximizes total throughput
- Naïve solution: give all of the data cache to the primary thread → performance loss on the secondary thread
Existing Cache Partitioning Schemes
- Placement-based, e.g., molecular caches (Varadarajan, MICRO'06)
- Replacement-based, e.g., PriSM (Manikantan, ISCA'12)
- Limitations: they target the last-level cache, have high overhead, and place no limit on primary thread performance loss
- A different scheme is needed for the L1 caches of a Composite Core
Adaptive Cache Partitioning Scheme
- Bounds the primary thread's performance loss while maximizing total throughput
- Way-partitioning and an augmented LRU policy: low overhead, suited to the structural limitations of L1 caches
- Adaptive priorities for the inherent heterogeneity of a Composite Core: cache space is resized dynamically at a fine granularity
Augmented LRU Policy
(figure: a cache access indexes a set and misses; the LRU victim is selected, with each line tagged as Primary or Secondary)
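As a rough illustration of way-partitioning with an augmented LRU replacement, here is a minimal Python sketch; the CacheSet class, the per-thread way quotas, and the rule that a thread at its quota evicts its own LRU line are assumptions for illustration, not the exact policy from the talk.

    # Hypothetical sketch: a way-partitioned set with an augmented LRU victim choice.
    # The quota rule below (a thread at or over its quota evicts its own LRU line,
    # otherwise it may take the global LRU line) is an illustrative assumption.
    class CacheSet:
        def __init__(self, num_ways, ways_per_thread):
            # ways_per_thread example: {"primary": 3, "secondary": 1} for a 4-way set
            self.ways_per_thread = ways_per_thread
            self.lines = [None] * num_ways           # each entry is (owner, tag) or None
            self.lru_order = list(range(num_ways))   # index 0 = LRU, last = MRU

        def _touch(self, way):
            self.lru_order.remove(way)
            self.lru_order.append(way)               # mark this way as most recently used

        def access(self, owner, tag):
            # Hit: update recency and report it.
            for way, line in enumerate(self.lines):
                if line == (owner, tag):
                    self._touch(way)
                    return True
            # Miss: prefer an empty way; otherwise pick a victim.
            victim = next((w for w in self.lru_order if self.lines[w] is None), None)
            if victim is None:
                owned = [w for w in self.lru_order if self.lines[w][0] == owner]
                at_quota = len(owned) >= self.ways_per_thread[owner]
                # At or over quota: recycle this thread's own LRU line.
                # Under quota: take the global LRU line (grow into the other partition).
                victim = owned[0] if (at_quota and owned) else self.lru_order[0]
            self.lines[victim] = (owner, tag)
            self._touch(victim)
            return False

With ways_per_thread = {"primary": 3, "secondary": 1}, for instance, the secondary thread roughly settles at its single allocated way and then only recycles its own line, while the primary thread occupies the rest of the set.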
L1 Caches of a Composite Core
- Limitations of L1 caches: hit latency, low associativity, smaller size than most working sets
- Instruction phases have fine-grained memory sets
- Inherent heterogeneity: heterogeneous memory access and different thread priorities
Adaptive Scheme
- Cache partitioning priority is determined by the cache reuse rate and the size of the memory sets
- Cache space is resized based on priorities: raise (↑), lower (↓), or maintain (=) each thread's priority
- The primary thread tends to get higher priority
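A hedged sketch of what the per-phase priority update and resizing step could look like; the thresholds, the priority scale, the primary-thread bias, and the one-way-at-a-time resizing are hypothetical placeholders, chosen only to match the maintain/raise/lower behavior described above.

    # Hypothetical priority update driven by reuse rate and memory-set size.
    # The thresholds and the primary-thread bias are illustrative assumptions.
    HIGH_REUSE = 0.8        # assumed cutoff for a "high" cache reuse rate
    SMALL_SET_WAYS = 2      # assumed cutoff for a "small" memory set, in ways

    def update_priority(priority, reuse_rate, memory_set_ways, is_primary):
        """Return the thread's partitioning priority after one instruction phase."""
        if reuse_rate >= HIGH_REUSE and memory_set_ways <= SMALL_SET_WAYS:
            return priority          # maintain (=): current space is already well used
        if reuse_rate >= HIGH_REUSE or is_primary:
            return priority + 1      # raise: more cache space is likely to pay off
        return priority - 1          # lower: little benefit expected from more space

    def resize_partition(ways, prio_primary, prio_secondary, total_ways):
        """Shift one way toward the higher-priority thread, keeping one way for each."""
        primary_ways = ways["primary"]
        if prio_primary > prio_secondary and primary_ways < total_ways - 1:
            primary_ways += 1
        elif prio_primary < prio_secondary and primary_ways > 1:
            primary_ways -= 1
        return {"primary": primary_ways, "secondary": total_ways - primary_ways}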
Case: Contention (gcc* – gcc*)
(figure: set index in the data cache over time; the two threads' memory sets overlap)
- Memory sets overlap
- High cache reuse rate + small memory sets → both threads maintain their priorities
Evaluation
- Multiprogrammed workloads: Benchmark1 – Benchmark2 (primary – secondary)
- 95% performance limit on the primary thread; baseline: the primary thread with all of the data cache
- Oracle simulation: instruction phases of 100K instructions, μEngine switching disabled, only the data cache partitioned
- Each phase runs under six cache partitioning modes, and the oracle picks the mode that maximizes total throughput under the primary thread performance limit
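Written out (with notation that is mine, not the slides'), the oracle keeps the primary thread at or above 95% of its all-cache baseline and, among the modes that satisfy that limit, picks the one with the highest total throughput:

    \[
      \frac{\mathrm{IPC}_{p}(m)}{\mathrm{IPC}_{p}^{\text{all-cache}}} \;\ge\; 0.95
    \]
    \[
      m^{*} \;=\; \arg\max_{m}\; \bigl(\mathrm{IPC}_{p}(m) + \mathrm{IPC}_{s}(m)\bigr)
      \quad\text{subject to the constraint above}
    \]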
Cache Partitioning Modes
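A minimal sketch of the per-phase oracle selection over the partitioning modes; the mode list and the simulate() hook are placeholders, and only the selection rule itself (maximum total throughput subject to the 95% primary limit) comes from the evaluation setup above.

    # Hypothetical oracle mode selection for one instruction phase.
    # `modes` and `simulate` are placeholders; simulate(mode) is assumed to return
    # (primary IPC, secondary IPC) for the phase under that partitioning mode.
    PRIMARY_LIMIT = 0.95   # primary thread must keep >= 95% of its baseline IPC

    def pick_mode(modes, simulate, baseline_ipc_primary):
        """Return the mode with the best total throughput that respects the limit."""
        best_mode, best_throughput = None, float("-inf")
        for mode in modes:
            ipc_primary, ipc_secondary = simulate(mode)
            if ipc_primary / baseline_ipc_primary < PRIMARY_LIMIT:
                continue                              # violates the primary-thread limit
            if ipc_primary + ipc_secondary > best_throughput:
                best_mode, best_throughput = mode, ipc_primary + ipc_secondary
        return best_mode                              # None only if no mode meets the limit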
Architecture Parameters
- Big μEngine: 3-wide out-of-order @ 2.0 GHz, 12-stage pipeline, 92 ROB entries, 144-entry register file
- Little μEngine: 2-wide in-order @ 2.0 GHz, 8-stage pipeline, 32-entry register file
- Memory system: 32 KB L1 I-cache, 64 KB L1 D-cache, 1 MB L2 cache (18-cycle access), 4 GB main memory (80-cycle access)
Performance Loss of Primary Thread (normalized IPC): below 5% for all workloads, 3% on average
Total Throughput (normalized IPC): the limit on primary thread performance loss sacrifices some total throughput, but not much
Conclusion
- Adaptive cache partitioning scheme: way-partitioning and an augmented LRU policy for the L1 caches of a Composite Core
- Cache partitioning priorities adapt cache space to each thread
- Limiting the primary thread's performance loss sacrifices some total throughput
Questions?