Presentation is loading. Please wait.

Presentation is loading. Please wait.

Computer Architecture Lecture 30: In-memory Processing

Similar presentations


Presentation on theme: "Computer Architecture Lecture 30: In-memory Processing"— Presentation transcript:

1 18-447 Computer Architecture Lecture 30: In-memory Processing
Vivek Seshadri Carnegie Mellon University Spring 2015, 4/13/2015

2 Goals for This Lecture Understand DRAM technology
How it is built? How it operates? What are the trade-offs? Can we use DRAM for more than just storage? In-DRAM copying In-DRAM bitwise operations

3 DRAM Module and Chip

4 Goals of DRAM Design Cost Latency Bandwidth Parallelism Power Energy
Reliability

5 DRAM Chip Bank

6 DRAM Cell – Capacitor Empty State Fully Charged State Logical “0”
1 Small – Cannot drive circuits 2 Reading destroys the state

7 Sense Amplifier top enable Inverter bottom

8 Sense Amplifier – Two Stable States
VDD en en VDD Logical “1” Logical “0”

9 Sense Amplifier Operation
VT VDD VT > VB en dis VB

10 Capacitor to Sense Amplifier
? en VDD en VDD

11 DRAM Cell Operation Cell regains charge Cell loses charge ½VDD ½VDD+δ
dis en ½VDD

12 Amortizing Cost – DRAM Tile
Row Driver

13 DRAM Subarray Row Decoder Row Driver Tile Tile Tile

14 DRAM Subarray Tile Tile Tile Tile Tile Tile Tile Row Decoder
Row Driver Tile Tile Tile Tile Tile Tile Tile

15 Array of Sense Amplifiers (8Kb) Array of Sense Amplifiers
DRAM Bank Row Decoder Cell Array Array of Sense Amplifiers (8Kb) Cell Array Address Row Decoder Cell Array Array of Sense Amplifiers Cell Array Bank I/O (64b) Data Address

16 Array of Sense Amplifiers (8Kb) Array of Sense Amplifiers
DRAM Chip Shared internal bus Row Decoder Array of Sense Amplifiers (8Kb) Cell Array Array of Sense Amplifiers Bank I/O (64b) Memory channel - 8bits

17 Array of Sense Amplifiers
DRAM Operation Row Decoder 1 ACTIVATE Row Row Address 2 READ/WRITE Column Row Decoder Cell Array Array of Sense Amplifiers Cell Array 3 PRECHARGE Bank I/O Data Column Address

18 Goals for This Lecture Understand DRAM technology
How it is built? How it operates? What are the trade-offs? Can we use DRAM for more than just storage? In-DRAM copying In-DRAM bitwise operations

19 Trade-offs in DRAM Design
Cost Latency Bandwidth Parallelism Power Energy Reliability Rows/Subarray Data width, Chips/DIMM Banks/Chip

20 Goals for This Lecture Understand DRAM technology
How it is built? How it operates? What are the trade-offs? Can we use DRAM for more than just storage? In-DRAM copying In-DRAM bitwise operations

21 Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization
RowClone Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization Vivek Seshadri Put down “Vivek” Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry

22 Memory Channel – Bottleneck
Limited Bandwidth Memory Core Cache Channel MC Core High Energy

23 Goal: Reduce Memory Bandwidth Demand
Core Cache Channel MC Core Reduce unnecessary data movement

24 Bulk Data Copy and Initialization
src dst Bulk Data Copy Bulk Data Initialization dst val

25 Bulk Copy and Initialization – Applications
Zero initialization (e.g., security) Forking Checkpointing VM Cloning Deduplication Page Migration Many more

26 Shortcomings of Existing Approach
High Energy (3600nJ to copy 4KB) Core Cache dst Channel MC src Core 1 micro second even with pipelining… High latency (1046ns to copy 4KB) Interference

27 Our Approach: In-DRAM Copy with Low Cost
X High Energy Core Cache dst Channel MC ? src Core X High latency X Interference

28 RowClone: In-DRAM Copy

29 Bulk Copy in DRAM – RowClone
½VDD +δ ½VDD VDD Data gets copied 1 ½VDD

30 Fast Parallel Mode – Benefits
Bulk Data Copy (4KB across a module) 11X 74X Latency Energy 1046ns to 90ns 3600nJ to 40nJ No bandwidth consumption Very little changes to the DRAM chip

31 Fast Parallel Mode – Constraints
Location constraint Source and destination in same subarray Size constraint Entire row gets copied (no partial copy) Can still accelerate many existing primitives (copy-on-write, bulk zeroing) 1 Alternate mechanism to copy data across banks (pipelined serial mode – lower benefits than Fast Parallel) 2

32 End-to-end System Design
Software interface memcpy and meminit instructions Managing cache coherence Use existing DMA support! Maximizing use of Fast Parallel Mode Smart OS page allocation

33 Results Summary

34 Goals for This Lecture Understand DRAM technology
How it is built? How it operates? What are the trade-offs? Can we use DRAM for more than just storage? In-DRAM copying In-DRAM bitwise operations

35 Triple Row Activation C(A + B) + ~C(AB) Final State AB + BC + AC ½VDD
dis en ½VDD

36 Throughput Results L1 L2 L3 Memory

37 Goals for This Lecture Understand DRAM technology
How it is built? How it operates? What are the trade-offs? Can we use DRAM for more than just storage? In-DRAM copying In-DRAM bitwise operations


Download ppt "Computer Architecture Lecture 30: In-memory Processing"

Similar presentations


Ads by Google