Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Accelerating Dependent Cache Misses with an Enhanced Memory Controller Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, Yale N. Patt June 21, 2016

Overview
- Dependent Cache Misses
- Enhanced Memory Controller Microarchitecture
- Dependence Chain Generation
- EMC Performance Evaluation and Analysis
- Conclusions

Effective Memory Access Latency
The effective latency of accessing main memory is made up of two components:
- DRAM access latency
- On-chip latency

On-Chip Delay

Dependent Cache Misses
LD [P17] -> P10
MOV P10 -> P7
LD [P7] -> P13
ADD P3, P13 -> P20
LD [P20] -> P2
ADD P2, P3 -> P12

Dependent Cache Misses
- The impact of effective memory latency on processor performance is magnified when a cache miss has dependent memory operations that will also result in a cache miss.
- Dependent cache misses form chains of long-latency operations that fill the reorder buffer and prevent the core from making forward progress.
- Important in pointer-based workloads (see the sketch below).
- Dependent cache misses tend to have addresses that are difficult to predict with prefetching.
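
To make the pattern concrete, here is a minimal pointer-chasing example (not from the talk, just an illustration): each load's address is produced by the previous load, so the misses cannot overlap and instead serialize into a chain.

```cpp
#include <cstdint>

// Linked-list traversal: the address of each iteration's load (n->next)
// is the data returned by the previous iteration's load, so consecutive
// cache misses form a serialized dependence chain.
struct Node {
    Node*    next;
    uint64_t value;
};

uint64_t sum_list(const Node* head) {
    uint64_t sum = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        sum += n->value;  // a miss here stalls the address of the next load
    }
    return sum;
}
```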

Prefetching and Dependent Cache Misses

Dependence Chains
LD [P17] -> P10
MOV P10 -> P7
ADD P13, P3 -> P20
LD [P20] -> P2

Dependence Chain Length

Reducing Effective Memory Access Latency
Goal: transparently reduce memory access latency for these dependent cache misses.
- Add compute capability to the memory controller.
- Modify the core to automatically identify the operations in the dependence chain of a cache miss (a sketch of this step follows below).
- Migrate the dependence chain to the new enhanced memory controller (EMC) for execution when the source data arrives from main memory.
- Maintain the traditional sequential execution model.
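
A minimal sketch of the chain-identification step, assuming a simplified ROB of micro-ops with physical-register operands (the field names and the 256-register bound are illustrative, not the paper's exact structures): starting from the miss-causing load, every younger uop that reads a value produced by the chain is pulled into the chain.

```cpp
#include <vector>

// Simplified micro-op record; real hardware tracks this in the ROB.
struct MicroOp {
    int dest_reg;      // physical destination register
    int src_regs[2];   // physical source registers (-1 if unused)
};

// Forward pass from the miss-causing load: any uop that reads a
// chain-produced register joins the chain, and its destination is
// marked as chain-produced in turn.
std::vector<int> build_chain(const std::vector<MicroOp>& rob, int miss_idx) {
    std::vector<int> chain{miss_idx};
    std::vector<bool> tainted(256, false);   // one flag per physical register
    tainted[rob[miss_idx].dest_reg] = true;
    for (size_t i = miss_idx + 1; i < rob.size(); ++i) {
        for (int src : rob[i].src_regs) {
            if (src >= 0 && tainted[src]) {
                chain.push_back(static_cast<int>(i));
                tainted[rob[i].dest_reg] = true;
                break;
            }
        }
    }
    return chain;
}
```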

Enhanced Memory Controller (EMC) Execution

On Core:
Op 0: MEM_LD (0xc[R3] -> R1)
Op 1: ADD (R2 + 1 -> R2)
Op 2: MOV (R3 -> R5)
Initiate EMC Execution

On EMC:
Op 3: MEM_LD ([R1] -> R1)
Op 4: SUB (R1 - [R4] -> R1)
Op 5: MEM_LD ([R1] -> R3)
Registers to Core

On Core:
Op 6: MEM_ST (R1 -> [R3])

Quad-Core System

EMC Microarchitecture
- No front-end
- 16 physical registers (not 256, as in the core)
- No register renaming
- 2-wide (not 4-wide, as in the core)
- No floating-point or vector pipeline
- 4 kB streaming data cache

Dependence Chain Generation
Core uops are remapped one at a time into EMC uops (a sketch of the remapping logic follows below):

LD [P17] -> P10    becomes   LD [P17] -> E0
MOV P10 -> P7      becomes   MOV E0 -> E1
LD [P7] -> P13     becomes   LD [E1] -> E2
ADD P3, P13 -> P20 becomes   ADD L0, E2 -> E3
LD [P20] -> P2     becomes   LD [E3] -> E4

Live-In Vector: P3 (referenced as L0)

Register Remapping Table:
Core Physical Register -> EMC Physical Register
P10 -> E0
P7  -> E1
P13 -> E2
P20 -> E3
P2  -> E4
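
Below is a hedged sketch of the remapping logic the slide's table implies (the class and the negative-index encoding are assumptions for illustration): destinations written inside the chain get fresh EMC registers, while sources produced outside the chain are recorded as live-ins that the core ships along with the chain.

```cpp
#include <unordered_map>
#include <vector>

struct Remapper {
    std::unordered_map<int, int> core_to_emc;  // e.g., P10 -> E0
    std::vector<int> live_ins;                 // core regs read but not produced in-chain
    int next_emc = 0;

    // Destinations produced inside the chain get fresh EMC registers.
    int remap_dest(int core_reg) {
        core_to_emc[core_reg] = next_emc;
        return next_emc++;
    }

    // Sources produced in-chain map to EMC registers; anything else becomes
    // a live-in slot. Negative return encodes the slot (-1 = L0, -2 = L1, ...).
    int remap_src(int core_reg) {
        auto it = core_to_emc.find(core_reg);
        if (it != core_to_emc.end()) return it->second;
        live_ins.push_back(core_reg);
        return -static_cast<int>(live_ins.size());
    }
};
```

For example, remapping ADD P3, P13 -> P20 yields ADD L0, E2 -> E3: P13 was produced in-chain (already mapped to E2), while P3 was not, so it becomes live-in L0, matching the slide.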

Memory Operations
- Virtual memory support: virtual address translation occurs through a 32-entry TLB per core; execution at the EMC is cancelled on a page fault.
- Loads first query the EMC data cache; misses query the LLC in parallel with the memory access (see the load-path sketch below).
- A miss predictor [Qureshi and Loh, MICRO 2012] is maintained to reduce the bandwidth cost.
- Memory operations are retired in program order back at the core.
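
A sketch of the EMC load path described above, with stub helpers standing in for the real TLB, caches, and predictor (all names are hypothetical): the miss predictor decides whether to start the DRAM access in parallel with the LLC probe, so bandwidth is only spent on the parallel access when an LLC miss is likely.

```cpp
#include <cstdint>

// Illustrative stubs standing in for the real TLB, caches, and predictor.
bool tlb_translate(uint64_t v, uint64_t* p) { *p = v; return true; }
bool cancel_chain()                         { return false; }
bool data_cache_lookup(uint64_t)            { return false; }
bool predict_llc_miss(uint64_t)             { return true; }
void issue_dram_access(uint64_t)            {}
bool llc_lookup(uint64_t)                   { return false; }

// EMC load path: translate through the per-core 32-entry TLB (a page fault
// cancels EMC execution), probe the 4 kB streaming data cache, then probe
// the LLC, optionally overlapping the DRAM access with the LLC probe.
bool emc_load(uint64_t vaddr) {
    uint64_t paddr;
    if (!tlb_translate(vaddr, &paddr))
        return cancel_chain();      // page fault: chain falls back to the core
    if (data_cache_lookup(paddr))
        return true;                // hit in the EMC's streaming data cache
    if (predict_llc_miss(paddr))
        issue_dram_access(paddr);   // start DRAM early instead of waiting on the LLC
    return llc_lookup(paddr);
}
```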

System Configuration

Quad-Core:
- 4-wide issue
- 256-entry reorder buffer
- 92-entry reservation station

Caches:
- 32 KB, 8-way set-associative L1 I/D-cache
- 1 MB, 8-way set-associative shared last-level cache per core, non-uniform access latency

Memory:
- DDR3 system
- 128-entry memory queue
- Batch scheduling

Prefetchers:
- Stream, Markov, Global History Buffer
- Feedback-directed prefetching: dynamic degree 1-32

EMC Compute:
- 2-wide issue
- 2 issue contexts, each with a 16-entry uop buffer and a 16-entry physical register file
- 4 kB streaming data cache

Workloads

Effective Memory Access Latency Reduction

Quad-Core Performance

Quad-Core Performance with Prefetching

Latency Reduction Factors

Row-Buffer Conflict Reduction

EMC Data Cache Hit Rate

Overhead: On-Chip Bandwidth
H1-H10 observe a 33% average increase in data ring messages and a 7% increase in control ring messages, from:
- Sending dependence chains and live-ins to the EMC
- Sending live-outs back to the core

Overhead: Memory Bandwidth

Energy Consumption

Other Information in the Paper
- 8-core evaluation with multiple distributed memory controllers
- Analysis of how the EMC and prefetching interact
- EMC sensitivity to increasing memory bandwidth
- More details of EMC execution: memory and control operations
- 5% EMC area overhead

Conclusions
Adding an enhanced, compute-capable memory controller to the system yields two benefits:
- The EMC generates cache misses faster than the core by bypassing on-chip contention.
- The EMC increases the likelihood of a memory access hitting an open row buffer.
Result: a 15% average gain in weighted speedup and an 11% reduction in energy consumption over a quad-core baseline with no prefetching. Memory requests issued from the EMC observe a 20% lower latency on average than requests issued from the core.

Accelerating Dependent Cache Misses with an Enhanced Memory Controller Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, Yale N. Patt June 21, 2016