Donghyuk Lee, Mike O’Connor, Niladrish Chatterjee

Presentation transcript:

REDUCING DATA TRANSFER ENERGY BY EXPLOITING SIMILARITY WITHIN A DATA TRANSACTION Donghyuk Lee, Mike O’Connor, Niladrish Chatterjee

GPU with 384-bit wide DRAM Interface

Terminated Pseudo-Open-Drain I/O Transmitting a ‘0’ value Transmitting a ‘1’ value

Terminated Pseudo-Open-Drain I/O Transmitting a ‘0’ value

Basic Idea: fewer ‘1’ bits in the data means less energy required on the DRAM interface. Is there a simple & effective encoding to reduce the number of ‘1’ bits?

Typical 32B cache sector/DRAM burst 01000000010010010000111111011011 01000000110111110101101101111110 01000001000111011110100111100110 01000000100011100010011010011010 01000000011010010000111111011011 00000000000000000000000000000000 00111111101101001111110111110100 00111111111111111110110000110101 Consists of eight 32-bit data elements

Typical 32B cache sector/DRAM burst 01000000010010010000111111011011 01000000110111110101101101111110 01000001000111011110100111100110 01000000100011100010011010011010 01000000011010010000111111011011 00000000000000000000000000000000 00111111101101001111110111110100 00111111111111111110110000110101 Many instances of data similarity between adjacent data elements

Base + XOR Transfer: keep the first element as a base value and XOR each subsequent element with its predecessor [8]. Original burst (121 ‘one’ values): 01000000010010010000111111011011 01000000110111110101101101111110 01000001000111011110100111100110 01000000100011100010011010011010 01000000011010010000111111011011 00000000000000000000000000000000 00111111101101001111110111110100 00111111111111111110110000110101. After the transform (109 ‘one’ values, 10% savings): 01000000010010010000111111011011 00000000100101100101010010100101 00000001110000101011001010011000 00000001100100111100111101111100 00000000111001110010100101000001 01000000011010010000111111011011 00111111101101001111110111110100 00000000010010110001000111000001. [8] K. Lee, S.-J. Lee, and H.-J. Yoo, “SILENT: Serialized Low Energy Transmission Coding for On-chip Interconnection Networks,” in ICCAD, 2004.
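A minimal sketch of this Base + XOR transform in Python, applied to the burst shown on the slide; the function and variable names are illustrative, not from the paper.

def base_xor_encode(elems):
    # Send the first element unchanged (the base value), then each later
    # element XORed with its predecessor.
    return [elems[0]] + [prev ^ cur for prev, cur in zip(elems, elems[1:])]

def base_xor_decode(coded):
    # A running XOR over the received values reconstructs the original elements.
    out = [coded[0]]
    for c in coded[1:]:
        out.append(out[-1] ^ c)
    return out

def count_ones(elems):
    # Energy proxy used on the slides: total number of '1' bits in the burst.
    return sum(bin(e).count("1") for e in elems)

# The 32B burst from the slide, as eight 32-bit elements.
burst = [int(b, 2) for b in (
    "01000000010010010000111111011011", "01000000110111110101101101111110",
    "01000001000111011110100111100110", "01000000100011100010011010011010",
    "01000000011010010000111111011011", "00000000000000000000000000000000",
    "00111111101101001111110111110100", "00111111111111111110110000110101",
)]
coded = base_xor_encode(burst)
assert base_xor_decode(coded) == burst
print(count_ones(burst), "->", count_ones(coded))  # 121 -> 109 ones, the 10% savings on the slide

Decoding is just the running XOR, so no extra metadata travels with the burst.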

How are we doing? Avg. 29% reduction in # of ‘1’s, but 18% of apps get worse.

Challenge #1: Zero Data Values. 01000000010010010000111111011011 01000001000111011110100111100110 01000000011010010000111111011011 00111111101101001111110111110100 (four non-zero elements) versus 01000000010010010000111111011011 00000000000000000000000000000000 01000001000111011110100111100110 01000000011010010000111111011011 00111111101101001111110111110100 (the same data with a zero-valued element mixed in): 2× more ‘1’ bits. Zero-valued elements mixed with non-zero data can cause significant increases in ‘1’s.

Zero Data Remapping: remap a zero-valued element Z to A = B ⊕ C, where B is the neighboring data element and C = 00000000000010000000000000000000 (an arbitrary low-weight constant), so the XOR differences around the zero stay low weight. Example values: B = 01000000010010010000111111011011, Z = 00000000000000000000000000000000, A = B ⊕ C = 01000000010000010000111111011011.
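The slides do not spell out the full remapping rule, so the sketch below is one plausible reading: before the XOR transform, the zero value and the value (previous element ⊕ C) are swapped, which matches the A = B ⊕ C value shown above and keeps the encoding invertible without metadata. All names are illustrative.

ZDR_CONST = 0x00080000  # the arbitrary low-weight constant from the slide (a single '1' bit)

def zdr_swap(value, prev_orig):
    # Swap the all-zero value with (previous original element XOR the constant).
    # The swap is its own inverse, so the decoder reuses the same function.
    if value == 0:
        return prev_orig ^ ZDR_CONST
    if value == prev_orig ^ ZDR_CONST:
        return 0
    return value

def encode_base_xor_zdr(elems):
    # Remap each element against its original predecessor, then apply Base + XOR.
    remapped = [elems[0]] + [zdr_swap(cur, prev) for prev, cur in zip(elems, elems[1:])]
    return [remapped[0]] + [a ^ b for a, b in zip(remapped, remapped[1:])]

def decode_base_xor_zdr(coded):
    # A running XOR recovers the remapped stream; the swap is then undone
    # element by element against the already-decoded original predecessor.
    remapped = [coded[0]]
    for c in coded[1:]:
        remapped.append(remapped[-1] ^ c)
    orig = [remapped[0]]
    for m in remapped[1:]:
        orig.append(zdr_swap(m, orig[-1]))
    return orig

# Elements taken from the slide's example (note the zero in the middle).
data = [0x40490FDB, 0x00000000, 0x40490FDB]
assert decode_base_xor_zdr(encode_base_xor_zdr(data)) == data

With this reading, the differences on either side of a zero element carry only the weight of the constant C rather than the full weight of the neighboring value.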

Zero Data Remapping: with the zero element remapped to a low-weight value (01000000010010010000111111011011 00001000000000000000000000000000 01000001000111011110100111100110 01000000011010010000111111011011 00111111101101001111110111110100) instead of the original zero-containing data (01000000010010010000111111011011 00000000000000000000000000000000 01000001000111011110100111100110 01000000011010010000111111011011 00111111101101001111110111110100), the cost is only 4 more ‘1’ bits (1.5%): zero-valued elements now cause only a limited increase.

How are we doing now? 12% of apps get worse (was 18%) Avg. 30% reduction in # of ‘1’s (was 29%)

Challenge #2: Granularity of Data Little similarity at 32-bit granularity 01000000000010010010000111111011 01010100010001000010110100011000 01000000000110111110101101101111 10110000010111000110110011110110 01000000001000111011110100111100 11001001101111100100010111011110 01000000000100011100010011010011 00111001110010001100011011001011 Consists of four 64-bit data elements

Challenge #2: Granularity of Data. XOR at 32-bit granularity gives 01000000000010010010000111111011 00010100010011010000110011100011 00010100010111111100011001110111 11110000010001111000011110011001 11110000011111111101000111001010 10001001100111011111100011100010 10001001101011111000000100001101 01111001110110010000001000011000 (122 ‘one’ values) versus the original data 01000000000010010010000111111011 01010100010001000010110100011000 01000000000110111110101101101111 10110000010111000110110011110110 01000000001000111011110100111100 11001001101111100100010111011110 01000000000100011100010011010011 00111001110010001100011011001011 (117 ‘one’ values): 4% worse w/ the wrong granularity.

Challenge #2: Granularity of Data. Perform the XOR at 64-bit granularity instead: the original data 01000000000010010010000111111011 01010100010001000010110100011000 01000000000110111110101101101111 10110000010111000110110011110110 01000000001000111011110100111100 11001001101111100100010111011110 01000000000100011100010011010011 00111001110010001100011011001011 (117 ‘one’ values) encodes to 01000000000010010010000111111011 01010100010001000010110100011000 00000000000100101100101010010100 11100100010110000100000111101110 00000000001110000101011001010011 01111001111000100010100100101001 00000000001100100111100111101111 11110000011101101000001100010101 (103 ‘one’ values): 12% savings w/ the correct granularity.
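Changing granularity only changes how the 32 bytes are grouped into elements before the transform sketched earlier; a small, hypothetical helper (not from the paper):

def group_elements(sector: bytes, elem_bytes: int):
    # Split a 32B sector into fixed-width big-endian integers:
    # elem_bytes=4 gives eight 32-bit elements, elem_bytes=8 gives four 64-bit elements.
    return [int.from_bytes(sector[i:i + elem_bytes], "big")
            for i in range(0, len(sector), elem_bytes)]

# e.g. base_xor_encode(group_elements(sector, 8)) performs the XOR at 64-bit granularity.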

Dynamic selection of granularity? 8B isn’t too bad even when 2B or 4B is best

Observation: similarity among 2B and 4B elements can also be exploited at 8B granularity. If the N-byte elements of a stream (e.g. 3901 3903 3905 3907 3909 390b 390d 390f) are similar to one another, then the 2N-byte and 4N-byte elements built from them are similar as well.

Universal Base: perform the Base + XOR transform at 8B granularity. Original data: 01111001000000010111100100000011 01111001000001010111100100000111 01111001000010010111100100001011 01111001000011010111100100001111 01111001000100010111100100010011 01111001000101010111100100010111 01111001000110010111100100011011 01111001000111010111100100011111. The 8B differences are low weight (00000000000010000000000000001000 00000000000110000000000000011000 …), but the 8B base 01111001000000010111100100000011 01111001000001010111100100000111 still carries many ‘1’s: a missed opportunity.

Universal Base: 1) Perform the Base + XOR transform at 8B granularity: 01111001000000010111100100000011 01111001000001010111100100000111 (8B base) followed by the low-weight differences 00000000000010000000000000001000 00000000000110000000000000011000. 2) Perform a 4B-granularity transform within the 8B base, leaving 01111001000000010111100100000011 00000000000001000000000000000100. 3) Perform a 2B-granularity transform within the 4B base, leaving 01111001000000010000000000000010.
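A sketch of how this hierarchical transform could be composed, assuming (as the slide's values suggest) that the base element is simply re-encoded at half the granularity down to 2B; the names and byte-oriented packing are illustrative, not from the paper.

def universal_base_encode(data: bytes, elem_bytes: int = 8, min_bytes: int = 2) -> bytes:
    # Base + XOR at elem_bytes granularity; the base element itself is then
    # re-encoded at half the granularity, down to min_bytes, so similarity at
    # finer granularities is still captured without any metadata.
    elems = [data[i:i + elem_bytes] for i in range(0, len(data), elem_bytes)]
    diffs = [bytes(a ^ b for a, b in zip(prev, cur))
             for prev, cur in zip(elems, elems[1:])]
    base = elems[0]
    if elem_bytes > min_bytes:
        base = universal_base_encode(base, elem_bytes // 2, min_bytes)
    return base + b"".join(diffs)

# The sector from the slide: eight 32-bit values 0x79017903, 0x79057907, ...
sector = bytes.fromhex("79017903790579077909790b790d790f"
                       "79117913791579177919791b791d791f")
coded = universal_base_encode(sector)
ones = lambda buf: sum(bin(b).count("1") for b in buf)
print(ones(sector), "->", ones(coded))  # far fewer '1' bits after the transform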

Now, how are we doing? 3.7% of apps get worse (was 12%) Avg. 35% reduction in # of ‘1’s (was 30%)

What’s up with the 33% Increase? With ZDR (00000000000000000000000011010010 00001000000000000000000000000000 00000000110100100000000000000000 00000000000000001101001000000000 11010010000000000000000000000000) versus the original data (00000000000000000000000011010010 00000000000000000000000000000000 00000000110100100000000000000000 00000000000000001101001000000000 11010010000000000000000000000000), there are 4 more ‘1’ bits due to ZDR, but there were very few 1’s to begin with: 16 ones -> 20 ones (6% -> 8%), which the relative metric reports as a 33% increase in ‘1’s.

What’s up with the 33% Increase? The corresponding power increases are very small.

Secondary Effect: Switching Reduction. Fewer ‘1’ values reduce the probability of a 1→0 or 0→1 transition on the bus. Across our benchmarks we see a 23% reduction in switching activity on average. This also saves power on the interface due to charging/discharging the channel capacitance, so it is good for unterminated/on-chip buses, too!
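A toggle count between successive bus beats is the usual way to quantify switching activity; a minimal sketch (the beat width and framing are assumptions, since the real interface spreads a burst across a 384-bit bus):

def switching_activity(beats):
    # Number of 0->1 and 1->0 transitions between successive beats on the bus.
    return sum(bin(prev ^ cur).count("1") for prev, cur in zip(beats, beats[1:]))

# Comparing switching_activity(raw_beats) with switching_activity(encoded_beats)
# over real traffic is how the ~23% average reduction on the slide would be measured.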

Synergy with DBI: Data Bus Inversion adds a bit of metadata per byte and conditionally inverts the data if more than 50% of its bits are ‘1’s; it also reduces SSO by limiting the difference between min/max power. DBI-encoded bytes (data:flag): 01000000:0 01001001:0 00001111:0 00100100:1 01000000:0 00100000:1 10100100:1 10000001:1 01000001:0 00011101:0 00010110:1 00011001:1 01000000:0 10001110:0 00100110:0 10011010:0 01000000:0 01101001:0 00001111:0 00100100:1 00000000:0 00000000:0 00000000:0 00000000:0 11000000:1 10110100:0 00000010:1 00001011:1 11000000:1 00000000:1 00010011:1 00110101:0. Original bytes: 01000000 01001001 00001111 11011011 01000000 11011111 01011011 01111110 01000001 00011101 11101001 11100110 01000000 10001110 00100110 10011010 01000000 01101001 00001111 11011011 00000000 00000000 00000000 00000000 00111111 10110100 11111101 11110100 00111111 11111111 11101100 00110101.
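A sketch of per-byte DBI as described on the slide (invert a byte and set its flag when more than half of its bits are ‘1’s); the function names are illustrative.

def dbi_encode(data: bytes):
    # Return (byte, flag) pairs: a byte is inverted and flagged when it has more than four '1's.
    return [((~b & 0xFF, 1) if bin(b).count("1") > 4 else (b, 0)) for b in data]

def dbi_decode(pairs):
    return bytes((~b & 0xFF) if flag else b for b, flag in pairs)

raw = bytes([0b01000000, 0b01001001, 0b00001111, 0b11011011])  # first four bytes from the slide
assert dbi_decode(dbi_encode(raw)) == raw
assert dbi_encode(raw)[3] == (0b00100100, 1)  # 11011011 has six '1's, so it is inverted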

Synergy with DBI: we reduce 1’s by 30% over DBI and reduce switching by 24%.

Implementation costs: very simple to implement. Verilog for the encoder: 45 non-comment lines. Encoder + decoder: 2,232 µm² in a 16nm FinFET process, 0.027 mm² total for a 384-bit DRAM interface.

Overall DRAM Energy Savings: a detailed model of a GDDR5X DRAM system, assuming 70% utilization and typical traffic patterns (e.g. R/W mix, activation rate), modeling the reduction of 1’s and switching activity as well as the incremental power for the encoder/decoder logic. Saves 4.4% of total GDDR5X DRAM system energy.

Great! What about for CPUs? 32% of apps get worse Avg. 12% reduction in # of ‘1’s

Why aren’t CPUs seeing the benefits? Typical GPU/SIMD coding style keeps data of the same type/content together (struct of arrays), whereas typical CPU coding style interleaves data of different types/contents in memory (array of structs), which breaks the premise of adjacent data elements being similar. Array of Structs: struct Circle { int color; float radius; int x; int y; }; Circle circles[1000]; Struct of Arrays: struct Circles { int color[1000]; float radius[1000]; int x[1000]; int y[1000]; };

Conclusions: simple, cheap, & easy to implement; no additional metadata or code-storage SRAMs; works with off-the-shelf DRAMs today. Reduces the number of 1’s by 30-35% (depending on DBI) and switching activity by 23-24% (depending on DBI). Applicable to any bus transferring predominantly vector data (e.g. for GPUs or data for SIMD units), but especially good for terminated buses: e.g. saves 4.4% of total DRAM energy for GDDR5X.