Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Presentation transcript:

Accelerating Dependent Cache Misses with an Enhanced Memory Controller
Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, Yale N. Patt
Tuesday, June 21, Session 7A, 3:30pm

Memory Access Latency
The latency of accessing main memory is made up of two parts:
- DRAM access latency
- On-chip latency
[Slide figure: a multiprocessor chip connected to DRAM]
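As a back-of-the-envelope illustration (the numbers below are assumed for this example, not taken from the paper), the two components simply add up to the latency the core observes:

    effective miss latency = on-chip latency + DRAM access latency
                           ~ 30 ns (cache lookups, interconnect, queuing)
                           + 50 ns (row activation, column access, transfer)
                           = 80 ns, of which roughly 38% is spent on chip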

On-Chip Delay

Dependent Cache Misses
    LD  [R3] -> R5
    ADD R4, R5 -> R9
    ADD R9, R1 -> R6
    LD  [R6] -> R8        (cache miss)
The address of the final load (in R6) is computed from the data returned by the first load, so if both loads miss, the second miss cannot even be issued until the first miss returns and the intervening ADDs execute: it is a dependent cache miss.
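This pattern arises naturally in pointer chasing. A minimal C sketch (the struct and function names are illustrative, not from the paper) in which every load's address comes from the data returned by the previous load:

    #include <stddef.h>

    struct node {
        struct node *next;   /* the loaded value is the next address */
        long payload;
    };

    /* If the list does not fit in the caches, each iteration is a
     * dependent cache miss: n->next must return before the next
     * load can even be issued. */
    long chase(struct node *n)
    {
        long sum = 0;
        while (n != NULL) {
            sum += n->payload;
            n = n->next;     /* dependent load */
        }
        return sum;
    }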

Compute Capable Memory Controller
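The idea, at a high level: when a load misses to DRAM, the short chain of simple operations that produces the next (dependent) load address can be shipped to an enhanced memory controller and executed there as soon as the data arrives, so the dependent access is issued to DRAM without first traveling back up the on-chip hierarchy to the core. A conceptual C-style sketch of that flow, assuming a small per-miss dependence chain (the types and function names are illustrative, not the paper's implementation):

    /* Conceptual sketch: a miss carries the dependence chain that turns
     * the loaded value into the next load address (e.g., the two ADDs
     * from the previous slide). */
    struct dep_chain {
        int    n_ops;
        long (*ops[4])(long);      /* simple ALU ops applied to the value */
    };

    /* At the enhanced memory controller: when DRAM returns the data for
     * the first miss, compute the dependent address locally and issue the
     * second access immediately, skipping the on-chip round trip. */
    long mc_service_miss(long addr, const struct dep_chain *chain,
                         long (*dram_read)(long))
    {
        long value = dram_read(addr);        /* first cache miss */
        for (int i = 0; i < chain->n_ops; i++)
            value = chain->ops[i](value);    /* recompute the address chain */
        return dram_read(value);             /* dependent miss, issued at the MC */
    }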

Effective Memory Access Latency Reduction

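Using the same illustrative numbers as above (assumed, not measured in the paper), serving the dependent access at the memory controller removes the on-chip portion of the second round trip:

    baseline:  two serialized misses ~ 2 x (30 ns on-chip + 50 ns DRAM) = 160 ns
    enhanced:  first miss pays 30 + 50 = 80 ns; the dependent access, issued
               at the controller, pays only the ~50 ns DRAM access
               total ~ 130 ns, roughly a 19% reduction in effective latency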

Accelerating Dependent Cache Misses with an Enhanced Memory Controller
Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, Yale N. Patt
Tuesday, June 21, Session 7A, 3:30pm