Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems
Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen
Department of Electrical and Computer Engineering, University of Pittsburgh


Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department of Electrical and Computer Engineering University of Pittsburgh

Introduction
Data prefetching, i.e., fetching data from memory into the on-chip cache before it is demanded, is important. Designs span a spectrum from simple software-aided schemes to complicated heterogeneous software/hardware prefetchers.

Introduction
Data prefetching is important, but far from efficient:
– Average accuracy seldom exceeds 60% across most designs, i.e., roughly 40% of prefetches are useless
– Useless prefetches waste memory bandwidth and cause access conflicts and cache pollution
– The resulting degradation is sensitive to write latency

An example: the stream prefetcher in the IBM POWER4 microprocessor. One LLC read can trigger up to 4 prefetches; the prefetcher keeps track of 32 prefetching streams in a prefetch table, and each stream can cover the memory addresses of 64 contiguous cache lines. This is a heavy burden for the LLC and memory subsystem!
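The table-based tracking above can be sketched as follows. This is a minimal illustrative model, not the POWER4 implementation: the confirmation policy, the LRU eviction, and tracking only ascending streams are simplifying assumptions.

```python
# Minimal sketch of a stream prefetcher with a bounded prefetch table.
from collections import OrderedDict

LINE = 64          # cache-line size in bytes
MAX_STREAMS = 32   # streams tracked in the prefetch table
DEGREE = 4         # prefetches issued per confirmed LLC read

class StreamPrefetcher:
    def __init__(self):
        # stream head line -> direction (+1 ascending; descending omitted)
        self.table = OrderedDict()

    def on_llc_read(self, addr):
        line = addr // LINE
        issued = []
        # A read one line ahead of a tracked stream confirms it: advance the
        # head and issue up to DEGREE prefetch addresses along its direction.
        for head, step in list(self.table.items()):
            if line == head + step:
                del self.table[head]
                self.table[line] = step
                issued = [(line + i * step) * LINE
                          for i in range(1, DEGREE + 1)]
                return issued
        # Otherwise allocate a new stream, evicting the oldest if full.
        if len(self.table) >= MAX_STREAMS:
            self.table.popitem(last=False)
        self.table[line] = +1
        return issued
```

Two sequential reads confirm a stream, and each further hit triggers a burst of prefetch fills that then compete with demand traffic at the LLC banks.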

Introduction
Spin-Transfer Torque Random Access Memory (STT-RAM) as Last-Level Cache (LLC):
– High write cost of STT-RAM, ~10 ns
– Combining an STT-RAM based LLC with data prefetching exposes a cell-size trade-off:
– High-density/small-size STT-RAM cells → bank access conflicts (long writes)
– Low-density/large-size STT-RAM cells → prefetching-incurred cache pollution (reduced capacity)

Introduction
Our work: reduce the negative impact of data prefetching on the system performance of CMPs with an STT-RAM based LLC, via
– Request Prioritization
– Hybrid local-global prefetch control

Outline Motivation STT-RAM Basics Methodology – Request prioritization – Hybrid local-global prefetch control Result Conclusion

STT-RAM Basics
STT-RAM cell: one access transistor plus an MTJ (Magnetic Tunnel Junction); denoted as a 1T-1J cell.
MTJ: a free layer and a reference layer separated by a barrier.
– Read: magnetization direction → resistance; parallel (R_Low) encodes 0, anti-parallel (R_High) encodes 1
– Write: current direction → magnetization direction (write-0 vs. write-1)

STT-RAM Basics
Write latency vs. write current: the MTJ switching time depends on the write current, so write latency trades off against the drive current and hence the access-transistor (cell) size.

Outline Motivation STT-RAM Basics Methodology – Request Prioritization – Hybrid local-global prefetch control Result Conclusion

Request Prioritization (RP)
Four types of LLC access requests contend for the banks: 1. read, 2. fill (on a miss), 3. write back, 4. prefetch fill.
– Access conflicts among them are further aggravated by the long write latency
– LLC access conflict is the source of performance degradation

Request Prioritization (RP)
A critical request can be blocked by a non-critical one: e.g., a demand read stalls behind an in-flight prefetch fill at the same LLC bank.

Request Prioritization (RP)
Criticality ordering of the four request types:
– Read is the most critical
– Fill is slightly less critical than read
– Prefetch fill is less critical than read/fill
– Write back is the least critical

Request Prioritization (RP)
Assign a priority to each request based on its criticality, and respond to requests in priority order. Even then, a high-priority request can still be blocked by a low-priority request that already occupies the bank (e.g., a read arriving while a prefetch fill is in flight).

Request Prioritization (RP)
Retirement accomplishment degree (RAD): determines at what stage a low-priority request can be preempted by a high-priority one. An in-flight request whose completion progress is below the RAD threshold is preempted (and retried later); past the threshold it is allowed to finish.
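The priority-plus-RAD policy can be sketched as follows. The numeric priorities, the RAD threshold, and the progress bookkeeping are illustrative assumptions, not the paper's exact mechanism.

```python
# Sketch of an LLC bank scheduler with criticality-ordered priorities and
# RAD-based preemption of in-flight low-priority requests.
import heapq

PRIORITY = {"read": 0, "fill": 1, "prefetch_fill": 2, "writeback": 3}
RAD = 0.6  # requests completed less than 60% may be preempted (assumed value)

class Bank:
    def __init__(self):
        self.queue = []      # min-heap of (priority, seq, name)
        self.seq = 0
        self.active = None   # (name, completion progress in [0, 1])

    def enqueue(self, name):
        heapq.heappush(self.queue, (PRIORITY[name], self.seq, name))
        self.seq += 1
        self._maybe_preempt()

    def _maybe_preempt(self):
        # Preempt the in-flight request if a strictly more critical request
        # is waiting and the in-flight one has not yet passed the RAD stage.
        if self.active is None:
            return
        name, progress = self.active
        best = self.queue[0][0] if self.queue else PRIORITY[name]
        if best < PRIORITY[name] and progress < RAD:
            heapq.heappush(self.queue, (PRIORITY[name], self.seq, name))
            self.seq += 1          # preempted request is re-queued for retry
            self.active = None

    def dispatch(self):
        # Start the highest-priority waiting request if the bank is idle.
        if self.active is None and self.queue:
            _, _, name = heapq.heappop(self.queue)
            self.active = (name, 0.0)
        return self.active
```

With this policy a young prefetch fill yields the bank to an arriving read, while one that is nearly retired (progress ≥ RAD) finishes, avoiding wasted write energy from late preemptions.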

Outline Motivation STT-RAM Basics Methodology – Request prioritization – Hybrid local-global prefetch control Result Conclusion

Hybrid Local-Global Prefetch Control (HLGPC)
Prefetching efficiency is affected by the capacity (cell size) of the STT-RAM based LLC:
– A large cell size alleviates bank conflicts by reducing the blocking time of write operations
– But cache pollution incurred by prefetching becomes more severe due to the reduced total capacity
Prefetch control should therefore consider LLC access contention: dynamically control the aggressiveness of the prefetchers by tuning the prefetch distance/prefetch degree.
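Tuning distance/degree is typically done by walking a ladder of aggressiveness levels. The levels below are illustrative assumptions in the spirit of feedback directed prefetching, not the paper's exact configuration.

```python
# Sketch of dynamic aggressiveness control via (distance, degree) levels.
LEVELS = [(4, 1), (8, 1), (16, 2), (32, 4), (64, 4)]  # assumed ladder

class AggressivenessController:
    def __init__(self, level=2):
        self.level = level  # start at a middle aggressiveness level

    def scale_up(self):
        # More aggressive: prefetch farther ahead and fetch more lines.
        self.level = min(self.level + 1, len(LEVELS) - 1)
        return LEVELS[self.level]

    def scale_down(self):
        # Less aggressive: shrink distance/degree to cut pollution.
        self.level = max(self.level - 1, 0)
        return LEVELS[self.level]
```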

Hybrid Local-Global Prefetch Control (HLGPC)
Local (per-core) prefetch control (LPC) focuses on maximizing the performance of each CPU core. HLGPC achieves a balanced dynamic aggressiveness control:
– At core level, feedback directed prefetching (FDP) serves as the LPC
– At chip level, the global prefetch control (GPC) may retain or override the LPC decision based on the runtime information of the LLC

GPC decision table (inputs: core i's prefetch accuracy and prefetch frequency, plus the LLC access frequency):
Case 1: accuracy low, pref. frequency high, LLC acc. frequency high → force scale down
Case 2: accuracy high, pref. frequency high, LLC acc. frequency high → disable scale up
Case 3: accuracy low, pref. frequency high, LLC acc. frequency low → disable scale up
Otherwise → allow local decision
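The GPC override logic in the table can be sketched as a small decision function; the high/low classification of each counter is left abstract here as boolean inputs, since the cutoff thresholds are tuning parameters.

```python
# Sketch of the chip-level GPC decision over core i's local (FDP) decision.
def gpc_decision(accuracy_high, pref_freq_high, llc_freq_high):
    """Return the global override applied to a core's local prefetch decision."""
    if pref_freq_high and llc_freq_high and not accuracy_high:
        return "force_scale_down"   # inaccurate, busy prefetcher on a busy LLC
    if pref_freq_high and llc_freq_high and accuracy_high:
        return "disable_scale_up"   # accurate, but the LLC is already saturated
    if pref_freq_high and not llc_freq_high and not accuracy_high:
        return "disable_scale_up"   # inaccurate; do not let it grow further
    return "allow_local_decision"   # otherwise defer to the per-core FDP
```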

Hybrid Local-Global Prefetch Control (HLGPC)
Sampling runtime information: dynamically identify the execution phase of an application with a decayed running count over execution intervals,
UpdateCount_i = 0.5 × UpdateCount_{i-1} + CurrentCount_i
where i indexes the current interval.
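The decayed sampling update is a one-liner; note that the 0.5 decay factor comes from the slide, while the reconstructed form of the update rule is an assumption from the garbled transcript.

```python
# Decayed running count for phase tracking: halve the accumulated history
# each interval, then fold in the newest sample.
def update_count(prev_update_count, current_count, decay=0.5):
    return decay * prev_update_count + current_count
```

Older intervals thus fade geometrically, so the counter reacts to a phase change within a few sampling intervals while still smoothing out single-interval noise.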

Outline Motivation STT-RAM Basics Methodology – Request prioritization – Hybrid local-global prefetch control Result Conclusion

Experiment Methodology
Workload construction:
– 2 SPEC CPU 2000 & 16 SPEC CPU 2006 benchmarks
– 6 four-app workloads: at least 2 memory-intensive, 1 prefetch-intensive, 1 memory non-intensive
Simulated system (4 cores, ring NoC):
– Core: out-of-order, 4-issue; IBM POWER4 stream prefetcher
– 32 KB D/I L1; 256 KB L2, 4 banks
– STT-RAM based LLC: 2/4/8 MB, 8 banks, 128 MSHRs; read 12/16/12 cycles, write 20/48/128 cycles
– 200 cycles to memory

Enhancement by RP
Three priority assignments (P1, P2, P3) were evaluated; from P1 to P3, the aggressiveness of the read request goes up (each scheme sets RAD thresholds, e.g., 60%, governing when reads/fills may preempt prefetch fills and write backs).
Performance improvement:
– Prioritizing LLC access requests always improves system performance
– The improvement is more substantial for large LLCs with long write access latency
– The highest performance is achieved by P1

More Insights for RP
Average stall cycles of different types of LLC requests (no prioritization vs. P1, P2, P3):
– Shorter read stall cycles
– Unchanged fill stall cycles
– Stall cycles are shifted from reads to write backs/prefetch fills
LLC energy consumption:
– Retries of preempted write operations unavoidably increase dynamic energy; P3 increases dynamic energy the most
– Static energy still dominates the total energy consumption
– Shorter execution time reduces leakage energy consumption

Effectiveness of HLGPC
Performance improvement (P1 vs. LPC+P1 vs. HLGPC+P1):
– LPC alone achieves little improvement
– HLGPC alone is better
– HLGPC+P1 achieves the highest improvement
The total energy consumption is further reduced:
– Dynamic energy decreases further
– Much shorter execution time leads to a continuing reduction of leakage energy
– EDP is improved by 7.8%

Outline Motivation STT-RAM Basics Methodology – Request prioritization – Hybrid local-global prefetch control Result Conclusion

Conclusion
In CMP systems with aggressive prefetching, an STT-RAM based LLC suffers from increased access latency due to higher write pressure and cache pollution.
Request prioritization significantly mitigates the negative impact of bank conflicts on large LLCs.
Coupling GPC and LPC alleviates cache pollution on small LLCs.
RP+HLGPC unveils the performance potential of prefetching in CMP systems: system performance improves by 9.1%, 6.5%, and 11.0% for 2 MB, 4 MB, and 8 MB STT-RAM LLCs, while the corresponding LLC energy consumption is reduced by 7.3%, 4.8%, and 5.6%, respectively.

Q&A Thank you