Sangyeun Cho Hyunjin Lee

Slides:

Advertisements

Similar presentations

Tuning of Loop Cache Architectures to Programs in Embedded System Design Susan Cotterell and Frank Vahid Department of Computer Science and Engineering.

Advertisements

Rethinking Database Algorithms for Phase Change Memory

AN ANALYTICAL MODEL TO STUDY OPTIMAL AREA BREAKDOWN BETWEEN CORES AND CACHES IN A CHIP MULTIPROCESSOR Taecheol Oh, Hyunjin Lee, Kiyeon Lee and Sangyeun.

ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors Mohammad Hammoud, Sangyeun Cho, and Rami Melhem Presenter: Socrates Demetriades.

Lecture 12 Reduce Miss Penalty and Hit Time

LEVERAGING ACCESS LOCALITY FOR THE EFFICIENT USE OF MULTIBIT ERROR-CORRECTING CODES IN L2 CACHE By Hongbin Sun, Nanning Zheng, and Tong Zhang Joseph Schneider.

Practical Caches COMP25212 cache 3. Learning Objectives To understand: –Additional Control Bits in Cache Lines –Cache Line Size Tradeoffs –Separate I&D.

Zhongkai Chen 3/25/2010. Jinglei Wang; Yibo Xue; Haixia Wang; Dongsheng Wang Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China This paper.

CML Efficient & Effective Code Management for Software Managed Multicores CODES+ISSS 2013, Montreal, Canada Ke Bai, Jing Lu, Aviral Shrivastava, and Bryce.

A Scalable and Reconfigurable Search Memory Substrate for High Throughput Packet Processing Sangyeun Cho and Rami Melhem Dept. of Computer Science University.

University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.

University of Michigan Electrical Engineering and Computer Science 1 A Distributed Control Path Architecture for VLIW Processors Hongtao Zhong, Kevin Fan,

4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.

Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.

Exploiting Spatial Locality in Data Caches using Spatial Footprints Sanjeev Kumar, Princeton University Christopher Wilkerson, MRL, Intel.

Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs Mrinmoy Ghosh Hsien-Hsin S. Lee School.

Probabilistic Design Methodology to Improve Run- time Stability and Performance of STT-RAM Caches Xiuyuan Bi (1), Zhenyu Sun (1), Hai Li (1) and Wenqing.

Accurately Approximating Superscalar Processor Performance from Traces Kiyeon Lee, Shayne Evans, and Sangyeun Cho Dept. of Computer Science University.

Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring Lei Jin and Sangyeun Cho Dept. of Computer Science University.

STT-RAM as a sub for SRAM and DRAM

Phase Change Memory What to wear out today? Chris Craik, Aapo Kyrola, Yoshihisa Abe.

Chuanjun Zhang, UC Riverside 1 Low Static-Power Frequent-Value Data Caches Chuanjun Zhang*, Jun Yang, and Frank Vahid** *Dept. of Electrical Engineering.

Power Efficient IP Lookup with Supernode Caching Lu Peng, Wencheng Lu*, and Lide Duan Dept. of Electrical & Computer Engineering Louisiana State University.

Defining Wakeup Width for Efficient Dynamic Scheduling A. Aggarwal, O. Ergin – Binghamton University M. Franklin – University of Maryland Presented by:

Instruction Set Architecture (ISA) for Low Power Hillary Grimes III Department of Electrical and Computer Engineering Auburn University.

Restrictive Compression Techniques to Increase Level 1 Cache Capacity Prateek Pujara Aneesh Aggarwal Dept of Electrical and Computer Engineering Binghamton.

Skewed Compressed Cache

CAFO: Cost Aware Flip Optimization for Asymmetric Memories RAKAN MADDAH *, SEYED MOHAMMAD SEYEDZADEH AND RAMI MELHEM COMPUTER SCIENCE DEPARTMENT UNIVERSITY.

DEUCE: WRITE-EFFICIENT ENCRYPTION FOR PCM March 16 th 2015 ASPLOS-XX Istanbul, Turkey Vinson Young Prashant Nair Moinuddin Qureshi.

Due to the economic downturn, Microsoft Research has eliminated all funding for title slides. We sincerely apologize for any impact these austerity measures.

Xiaodong Wang  Dilip Vasudevan Hsien-Hsin Sean Lee University of College Cork  Georgia Tech Global Built-In Self-Repair for 3D Memories with Redundancy.

Shuchang Shan † ‡, Yu Hu †, Xiaowei Li † † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences.

Seok-Won Seong and Prabhat Mishra University of Florida IEEE Transaction on Computer Aided Design of Intigrated Systems April 2008, Vol 27, No. 4 Rahul.

Use of PCM in Computer Systems: an End-to-End Exploration Sangyeun Cho Computer Science Department University of Pittsburgh We need V.

Defining Anomalous Behavior for Phase Change Memory

CPU Cache Prefetching Timing Evaluations of Hardware Implementation Ravikiran Channagire & Ramandeep Buttar ECE7995 : Presentation.

Storage Class Memory Architecture for Energy Efficient Data Centers Bruce Childers, Sangyeun Cho, Rami Melhem, Daniel Mossé, Jun Yang, Youtao Zhang Computer.

Min Xu1, Yunfeng Zhu2, Patrick P. C. Lee1, Yinlong Xu2

A Hardware-based Cache Pollution Filtering Mechanism for Aggressive Prefetches Georgia Institute of Technology Atlanta, GA ICPP, Kaohsiung, Taiwan,

StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache Hyunjin Lee Sangyeun Cho Bruce R. Childers Dept. of Computer Science University.

NOTE: To change the image on this slide, select the picture and delete it. Then click the Pictures icon in the placeholder to insert your own image. NON.

A Low-Power CAM Design for LZ Data Compression Kun-Jin Lin and Cheng-Wen Wu, IEEE Trans. On computers, Vol. 49, No. 10, Oct Presenter: Ming-Hsien.

Sogang University Advanced Computing System Chap 1. Computer Architecture Hyuk-Jun Lee, PhD Dept. of Computer Science and Engineering Sogang University.

Memory Intro Computer Organization 1 Computer Science Dept Va Tech March 2006 ©2006 McQuain & Ribbens Built using D flip-flops: 4-Bit Register Clock input.

|Processors designed for low power |Architectural state is correct at basic block granularity rather than instruction granularity 2.

RDIS: A Recursively Defined Invertible Set Scheme to Tolerate Multiple Stuck-At Faults in Resistive Memory Rami Melhem, Rakan Maddah and Sangyeun cho Computer.

Energy Reduction for STT-RAM Using Early Write Termination Ping Zhou, Bo Zhao, Jun Yang, *Youtao Zhang Electrical and Computer Engineering Department *Department.

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Sangyeun Cho and Lei Jin Dept. of Computer Science University of Pittsburgh.

Copyright © 2010 Houman Homayoun Houman Homayoun National Science Foundation Computing Innovation Fellow Department of Computer Science University of California.

Comp 335 File Structures Data Compression. Why Study Data Compression? Conserves storage space Files can be transmitted faster because there are less.

07/11/2005 Register File Design and Memory Design Presentation E CSE : Introduction to Computer Architecture Slides by Gojko Babić.

Computer Architecture Lecture 25 Fasih ur Rehman.

Rakan Maddah1, Sangyeun2,1 Cho and Rami Melhem1

High-Speed Stochastic Circuits Using Synchronous Analog Pulses M

Seyed Mohammad Seyedzadeh, Rakan Maddah, Alex Jones, Rami Melhem

Multilevel Memories (Improving performance using alittle “cash”)

Cache Memory Presentation I

Jason F. Cantin, Mikko H. Lipasti, and James E. Smith

Improving Program Efficiency by Packing Instructions Into Registers

Scalable High Performance Main Memory System Using PCM Technology

Fine-Grain CAM-Tag Cache Resizing Using Miss Tags

Dynamically Reconfigurable Architectures: An Overview

Xiaoyang Zhang1, Yuchong Hu1, Patrick P. C. Lee2, Pan Zhou1

Jianbo Dong, Lei Zhang, Yinhe Han, Ying Wang, and Xiaowei Li

Reducing Cache Traffic and Energy with Macro Data Load

MICRO-50 Swamit Tannu Zachary Myers Douglas Carmean Prashant Nair

Reseeding-based Test Set Embedding with Reduced Test Sequences

4-Bit Register Built using D flip-flops:

Restrictive Compression Techniques to Increase Level 1 Cache Capacity

Presentation transcript:

Sangyeun Cho Hyunjin Lee Flip-N-Write: A Simple Deterministic Technique to Improve PRAM Write Performance, Energy and Endurance Sangyeun Cho Hyunjin Lee Dept. of Computer Science University of Pittsburgh

Phase-change RAM (PRAM) top electrode GST metal bit line select TR current pulse source memory VCC Vgate SET RESET Amorphous = high resistivity Crystalline = low resistivity (Pictures from Hegedüs and Elliott, Nature Materials, March 2008)

read latency << write latency PRAM asymmetries read latency << write latency

read power << write power PRAM asymmetries read power << write power

read endurance >> write endurance PRAM asymmetries read endurance >> write endurance

Writes are Bad… … and are Hated!

Partial writes [Lee et al. ‘09] 1 cache block replaced and written to PRAM 1 introduce multiple dirty bits to isolate and not write clean data 4B granularity gives a 6x improvement (over 64B)

Differential write [Yang et al. ‘07][Zhou et al. ‘09] 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 cache block replaced to be written to PRAM content of the corresponding memory block 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Differential write [Yang et al. ‘07][Zhou et al. ‘09] 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 cache block replaced to be written to PRAM READ out (old) memory content 1 content of the corresponding memory block 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Differential write [Yang et al. ‘07][Zhou et al. ‘09] cache block replaced to be written to PRAM 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 0 0 1 Compare old and new data bit by bit 2 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Differential write [Yang et al. ‘07][Zhou et al. ‘09] cache block replaced to be written to PRAM 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1 Probabilistically, 50% of bits will be updated; In practice, (typically) fewer bits are updated as bit-level redundancies are common in data 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 Update bits that differ 3 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 1

Contributions of this work We propose Flip-N-Write, a differential write scheme based on Read-Modify-Write and data encoding We use a simple bi-modal data encoding strategy: Intact or flipped Flip bit is introduced to denote the mode Importantly, Flip-N-Write will update at most N/2 bits at a time when updating N bits c.f., Differential write updates at most N bits Write power is deterministically bounded We perform a comprehensive comparative study Conventional write scheme Differential write Flip-N-Write

Flip-N-Write basic idea cache block replaced to be written to PRAM “New data” 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 “Old data” 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Flip-N-Write basic idea cache block replaced to be written to PRAM “New data” 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 11 bits are different! “Old data” 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Flip-N-Write basic idea cache block replaced to be written to PRAM “New data” 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 “Flipped new data” 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0 “Old data” 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 Only five bits are different!

Flip-N-Write basic idea cache block replaced to be written to PRAM “New data” 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 “Flipped new data” 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0 1 “Flip bit” “Old data” 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 (5+1) bits need be updated…

Flip-N-Write steps 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 cache block replaced to be written to PRAM content of the corresponding memory block 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Flip-N-Write steps 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 cache block replaced to be written to PRAM READ out (old) memory content 1 content of the corresponding memory block 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Flip-N-Write steps cache block replaced to be written to PRAM 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 2 Compare old and new data and compute hamming distance 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Flip-N-Write steps cache block replaced to be written to PRAM 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 1 2 If hamming distance > N/2, flip new data 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Flip-N-Write steps cache block replaced to be written to PRAM 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0 1 2 If hamming distance > N/2, flip new data 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0

Flip-N-Write steps At most N/2 bits are updated; cache block replaced to be written to PRAM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 At most N/2 bits are updated; In practice, (typically) fewer bits are updated as bit-level redundancies are common in data Update bits that differ 3 0 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 0 1 0 0 1

Analysis: bit update reduction improvement over conventional word width

Analysis: bit update reduction improvement over conventional improvement over DW word width

Analysis: flip bit overheads improvement over conventional improvement over DW overhead word width

Analysis: comparison with DW # bit updates/16-bit write in-position bit flip probability

Analysis: comparison with DW # bit updates/16-bit write Flip-N-Write in-position bit flip probability

Implications Savings in bit updates improves power and endurance On average, in terms of # update bits, Flip-N-Write is better than, but not far superior to DW However, in terms of maximum # update bits, Flip-N-Write has an edge: N/2 vs. N This deterministic property is extremely useful with a limited cell write current requirement Write current limited write time (M bits, S-bit update rate): Conventional: (M/S)×TSET DW: TREAD + (M/S)×TSET Flip-N-Write: TREAD + (M/2S)×TSET

Circuit design (16): read path cell blocks have more bits (flip bits) larger read buffer (flip bits) logic for bypassing/flipping data (Baseline design from Lee et al., ISSCC, February 2007)

Circuit design (16): write prep. determine if flipping is needed bit mask showing what bits need programming bit-wise SET/RESET commands (Baseline design from Lee et al., ISSCC, February 2007)

Circuit design (16): write driver write (SET/RESET) pulse Cell current SET operation suppressed if outcome is “0” (not needed) RESET operation suppressed if outcome is “0” (Baseline design from Lee et al., ISSCC, February 2007)

Experimental setup We have three sets of experiments Storage environment: # of bit updates from firmware/data update Main memory environment: Program execution time Four-core processor chip with ATOM-like cores PRAMs with conventional write, DW, and Flip-N-Write Based on Samsung’s 512Mbit prototype chip Workloads Firmware update: MiBench compiled with gcc Data update: photos from New York Times “Pictures of the Day” (late April of 2009), music files “ripped” from two albums Program run: SPEC2006 (multiprogrammed), SPLASH-2, SPECjbb

Firmware update (ARM) RESET SET # bit updates/1,024 code bits

Firmware update (ARM) conventional DW Flip-N-Write RESET # bit updates/1,024 code bits

Many more “SET” operations than “RESET” Firmware update (ARM) # bit updates/1,024 code bits Many more “SET” operations than “RESET”

For DW & Flip-N-Write, # SET ~ # RESET Firmware update (ARM) # bit updates/1,024 code bits For DW & Flip-N-Write, # SET ~ # RESET

Conventional > DW > Flip-N-Write Firmware update (ARM) # bit updates/1,024 code bits 305.7 278.6 Conventional > DW > Flip-N-Write

Improvement w/ Flip-N-Write over DW ~14% Firmware update (x86) # bit updates/1,024 code bits 385.0 331.3 Improvement w/ Flip-N-Write over DW ~14%

Compressed files have roughly equal 0’s and 1’s Music file update # bit updates/1,024 data bits Compressed files have roughly equal 0’s and 1’s

Main memory performance conventional DW Flip-N-Write AMAL (cycles) NoDelay

Main memory performance AMAL (cycles) Higher memory intensity leads to larger AMAL

Main memory performance AMAL (cycles) Flip-N-Write eases bandwidth shortage problem

PRAM write bandwidth remains a challenge… Program performance relative performance (to NoDelay) PRAM write bandwidth remains a challenge…

Flip-N-Write salvages performance… Program performance relative performance (to NoDelay) Flip-N-Write salvages performance…

Bandwidth, bandwidth, bandwidth # banks/device L2 cache size PRAM update time conventional DW AMAL (cycles) Flip-N-Write NoDelay Write bandwidth is the key to high performance

Conclusions Flip-N-Write is a differential write scheme to reduce PRAM’s bit update actions using simple encoding Power/energy reduction Potential improvement in write endurance Simple algorithm allows straightforward design Deterministic bound on # update bits beneficial for power provisioning or bandwidth improvement Faster firmware/data update latency Performance improvement can be sizable Write bandwidth remains a challenge

Sangyeun Cho Hyunjin Lee Flip-N-Write: A Simple Deterministic Technique to Improve PRAM Write Performance, Energy and Endurance Sangyeun Cho Hyunjin Lee Dept. of Computer Science University of Pittsburgh

Flip-N-Write, more formally