Experiment Evaluation

Slides:

Advertisements

Similar presentations

Pooja ROY, Manmohan MANOHARAN, Weng Fai WONG National University of Singapore ESWEEK (CASES) October 2014 EnVM : Virtual Memory Design for New Memory Architectures.

Advertisements

Thank you for your introduction.

Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.

LEVERAGING ACCESS LOCALITY FOR THE EFFICIENT USE OF MULTIBIT ERROR-CORRECTING CODES IN L2 CACHE By Hongbin Sun, Nanning Zheng, and Tong Zhang Joseph Schneider.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

A Scalable and Reconfigurable Search Memory Substrate for High Throughput Packet Processing Sangyeun Cho and Rami Melhem Dept. of Computer Science University.

Department of Computer Science iGPU: Exception Support and Speculative Execution on GPUs Jaikrishnan Menon, Marc de Kruijf Karthikeyan Sankaralingam Vertical.

Probabilistic Design Methodology to Improve Run- time Stability and Performance of STT-RAM Caches Xiuyuan Bi (1), Zhenyu Sun (1), Hai Li (1) and Wenqing.

Chuanjun Zhang, UC Riverside 1 Low Static-Power Frequent-Value Data Caches Chuanjun Zhang*, Jun Yang, and Frank Vahid** *Dept. of Electrical Engineering.

Wish Branches A Review of “Wish Branches: Enabling Adaptive and Aggressive Predicated Execution” Russell Dodd - October 24, 2006.

Register Packing Exploiting Narrow-Width Operands for Reducing Register File Pressure Oguz Ergin*, Deniz Balkan, Kanad Ghose, Dmitry Ponomarev Department.

Scheduling Reusable Instructions for Power Reduction J.S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M.J. Irwin Proceedings of the Design, Automation.

Efficient Internet Traffic Delivery over Wireless Networks Sandhya Sumathy.

An Intelligent Cache System with Hardware Prefetching for High Performance Jung-Hoon Lee; Seh-woong Jeong; Shin-Dug Kim; Weems, C.C. IEEE Transactions.

Dyer Rolan, Basilio B. Fraguela, and Ramon Doallo Proceedings of the International Symposium on Microarchitecture (MICRO’09) Dec /7/14.

NVSleep: Using Non-Volatile Memory to Enable Fast Sleep/Wakeup of Idle Cores Xiang Pan and Radu Teodorescu Computer Architecture Research Lab

A Hardware-based Cache Pollution Filtering Mechanism for Aggressive Prefetches Georgia Institute of Technology Atlanta, GA ICPP, Kaohsiung, Taiwan,

Evaluating Impact of Storage on Smartphone Energy Efficiency David T. Nguyen.

University of Michigan Electrical Engineering and Computer Science 1 Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications.

Dong Hyuk Woo Nak Hee Seong Hsien-Hsin S. Lee

1 Tuning Garbage Collection in an Embedded Java Environment G. Chen, R. Shetty, M. Kandemir, N. Vijaykrishnan, M. J. Irwin Microsystems Design Lab The.

Yun-Chung Yang SimTag: Exploiting Tag Bits Similarity to Improve the Reliability of the Data Caches Jesung Kim, Soontae Kim, Yebin Lee 2010 DATE(The Design,

Low-Power Cache Organization Through Selective Tag Translation for Embedded Processors with Virtual Memory Support Xiangrong Zhou and Peter Petrov Proceedings.

CPS3340 COMPUTER ARCHITECTURE Fall Semester, /3/2013 Lecture 9: Memory Unit Instructor: Ashraf Yaseen DEPARTMENT OF MATH & COMPUTER SCIENCE CENTRAL.

Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *,

11 Online Computing and Predicting Architectural Vulnerability Factor of Microprocessor Structures Songjun Pan Yu Hu Xiaowei Li {pansongjun, huyu,

Copyright © 2010 Houman Homayoun Houman Homayoun National Science Foundation Computing Innovation Fellow Department of Computer Science University of California.

Hardware Architectures for Power and Energy Adaptation Phillip Stanley-Marbell.

IP Routing table compaction and sampling schemes to enhance TCAM cache performance Author: Ruirui Guo a, Jose G. Delgado-Frias Publisher: Journal of Systems.

1 IP Routing table compaction and sampling schemes to enhance TCAM cache performance Author: Ruirui Guo, Jose G. Delgado-Frias Publisher: Journal of Systems.

Click to edit Present’s Name AP-Tree: Efficiently Support Continuous Spatial-Keyword Queries Over Stream Xiang Wang 1*, Ying Zhang 2, Wenjie Zhang 1, Xuemin.

Optimizing Interconnection Complexity for Realizing Fixed Permutation in Data and Signal Processing Algorithms Ren Chen, Viktor K. Prasanna Ming Hsieh.

Hang Zhang1, Xuhao Chen1, Nong Xiao1,2, Fang Liu1

Improving Multi-Core Performance Using Mixed-Cell Cache Architecture

Presenter: Darshika G. Perera Assistant Professor

Authors: Jiang Xie, Ian F. Akyildiz

Failure-Atomic Slotted Paging for Persistent Memory

Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh

Memory Segmentation to Exploit Sleep Mode Operation

Seth Pugsley, Jeffrey Jestes,

Seyed Mohammad Seyedzadeh, Rakan Maddah, Alex Jones, Rami Melhem

Multilevel Memories (Improving performance using alittle “cash”)

A Scalable Routing Architecture for Prefix Tries

System Control based Renewable Energy Resources in Smart Grid Consumer

Toward Advocacy-Free Evaluation of Packet Classification Algorithms

Cache Memory Presentation I

Improving Program Efficiency by Packing Instructions Into Registers

Computer Architecture & Operations I

BitWarp Energy Efficient Analytic Data Processing on Next Generation General Purpose GPUs Jason Power || Yinan Li || Mark D. Hill || Jignesh M. Patel.

Linchuan Chen, Xin Huo and Gagan Agrawal

BlueScan: Boosting Wi-Fi Scanning Efficiency Using Bluetooth Radio

Fine-Grain CAM-Tag Cache Resizing Using Miss Tags

Pyramid Sketch: a Sketch Framework

Gurunath Kadam (College of William and Mary)

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform

Lecture 6: Reliability, PCM

Guihai Yan, Yinhe Han, Xiaowei Li, and Hui Liu

Morgan Kaufmann Publishers Memory Hierarchy: Virtual Memory

†UCSD, ‡UCSB, EHTZ*, UNIBO*

GATES: A Grid-Based Middleware for Processing Distributed Data Streams

A Small and Fast IP Forwarding Table Using Hashing

STT-RAM Design Fengbo Ren Advisor: Prof. Dejan Marković Dec. 3rd, 2010

Literature Review A Nondestructive Self-Reference Scheme for Spin-Transfer Torque Random Access Memory (STT-RAM) —— Yiran Chen, et al. Fengbo Ren 09/03/2010.

Achieving Resilient Routing in the Internet

A Novel Cache-Utilization Based Dynamic Voltage Frequency Scaling (DVFS) Mechanism for Reliability Enhancements *Yen-Hao Chen, *Yi-Lun Tang, **Yi-Yu Liu,

Relax and Adapt: Computing Top-k Matches to XPath Queries

Address-Stride Assisted Approximate Load Value Prediction in GPUs

Haonan Wang, Adwait Jog College of William & Mary

Yingze Wang and Shi-Kuo Chang University of Pittsburgh

Restrictive Compression Techniques to Increase Level 1 Cache Capacity

Presentation transcript:

Experiment Evaluation Red-Shield: Shielding Read Disturbance for STT-RAM Based Register Files on GPUs Hang Zhang, Xuhao Chen, Nong Xiao, Fang Liu, Zhiguang Chen College of Computer, National University of Defense Technology Abstraction RED-Shield Design To address the high energy consumption issue of SRAM on GPUs, emerging Spin-Transfer Torque (STT-RAM) memory technology has been intensively studied to build GPU register files for better energy-efficiency, thanks to its benefits of low leakage power, high density, and good scalability. However, STT-RAM suffers from a reliability issue, read disturbance, which stems from the fact that the voltage difference between read current and write current becomes smaller as technology scales. The read disturbance leads to high error rates for read operations, which cannot be effectively protected by SECDEC ECC on large-capacity register files of GPUs. Prior schemes (e.g. read-restore) to mitigate the read disturbance usually incur either non-trivial performance loss or excessive energy overheads, thus not applicable for the GPU register file design which aims to achieve both high performance and energy-efficiency. 2. Read Buffer Promotion 3. Adaptive Restore Design A reference tracking algorithm to record register reference count information, and exploit register access locality. a small SRAM read buffer to hold register values with high temporal locality direct restore is fast but consumes more energy. selective restore is slow but consumes less energy. adaptive restore: adaptively choose restore schemes, according to the busy status of the accessed register bank. Motivation Figure 4: The overall architecture of our proposed design. STT-RAM read operations have the same operating mechanism as write operations but with smaller voltage. As technology scales, the read current becomes close to write current. Read operations may change the store value, referred as read disturbance. The line error rate is so high that cannot be protected by the SECDEC code (Table 1). Both direct restore scheme and selective restore scheme will incur unacceptable performance loss and energy overheads. Figure 5: A conceptual comparison between the direct restore operation and the selective restore. Experiment Evaluation base: the baseline register file built with SRAM; WB: the STT-RAM based register file with a SRAM write buffer. RD: the STT-RAM based register file design with the selective restore scheme. CO: configured similar with RD, but with the dead read identification technique. CORB: configured similar with CO, but with the read buffer promotion design. CORBAR: Configured similar with CORB, but with the adaptive restore scheme. Performance: Our proposed dead read identification scheme (CO) can effectively improve throughput and achieve 90.6% of the baseline performance. Finally, CORBAR achieves 93.1% of the baseline performance on average. Energy Consumption: Our proposed CO scheme can effectively mitigate the restore energy consumption on average and achieve 78.7% energy consumption of the baseline, even under read disturbance. CORB consumes 79.8% of the baseline energy on average. CORBAR consumes 81.2% of the baseline energy on average. Though CORBAR consumes a little bit higher energy than CORB, the energy-efficiency of the register file is not sacrificed thanks to the low leakage power of STT-RAM. Figure 1: The scaling of read and write currents[2]. Table 1. Read Disturbance Error Rate for a 1024-bit GPU register entry. BER: Raw bit error rate. LER: Line error rate. SECDEC: Line error rate under the SECDEC ECC protection Tech(nm) 45 32 22 15 11 BER 1.38E-08 3.38E-07 3.07E-06 2.16E-05 1.20E-04 LER 1.42E-05 3.50E-04 3.17E-03 2.21E-02 1.17E-01 SECDEC 1.02-10 6.12E-08 5.04E-06 2.46E-04 7.11E-03 RED-Shield Design The Red-Disturbance Shield design Dead Read Identification Observation: on average, nearly 49.3% of generated register values are referenced (read) only once, while nearly 27.9% and 11.6% of them are referenced twice and three times, respectively. (Fig. 2) Figure 6: GPU throughput comparison between different configurations. Figure 2: Register read access count distribution. Policy: identify dead read a) Do not need to restore register values that will not be re-referenced in the future. b) For a register value that will be re-referenced several times, the last reference does not need to be restored. Figure 7: Energy consumption comparison between different configurations. Conclusions The read disturbance issue of STT-RAM imposes great challenges for GPU register file design as technology further scales. We present a novel design, Red-shield, to employ complier optimization to filter out short-lifetime reads so as to mitigate the performance and energy overhead incurred by the read-restore operations. Coupled with the read buffer promotion design as well as the optimized read-restore scheme, Red-shield can effectively alleviate pipeline stalls caused by read disturbance and improve the energy efficiency. Our design can promote STT-RAM based GPU register files to become a feasible and effective solution for future GPU architectures. Implementation: Goal: identify dead reads and avoid unnecessary restore operations. Solution: standard register analysis algorithm to analyze register liveness coverage. Figure 3: A snippet of the PTX source code for the backprop benchmark. R: read operation. W: write operation. L: register is alive. Contact References Hang Zhang College of Computer, National University of Defense Technology Email: hangzhang@nudt.edu.cn Z. Sun, H. Li, and W. Wu, “A dual-mode architecture for fast-switching STT-RAM,” in ISLPED, 2012. R. Wang, L. Jiang, Y. Zhang et al., “Selective restore: an energy efficient read disturbance mitigation scheme for future STT-MRAM,” in DAC, 2015. X. Liu, M. Mao, X. Bi et al., “An efficient STT-RAM-based register file in GPU architectures,” in ASP-DAC, 2015. G. Li, X. Chen, G. Sun et al., “A STT-RAM-based low-power hybrid register file for GPGPUs,” in DAC, 2015.