1
In-Situ Compute Memory Systems
Reetuparna Das, Assistant Professor, EECS Department
2
Massive amounts of data generated each day
3
Near Data Computing
Move compute near storage
We can solve the data movement problem for huge volumes of data
4
Evolution...
1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
Emergence of Big Data: data movement dominates Joules/Op; 3D memories available
2012: Resurgence of Processing in Memory (PIM): logic layer near memory, enabled by 3D technology
2014: Automaton Processor: associative memory with custom interconnects
2015: Compute Memories (bit-line computing): in-situ computation inside memory arrays
5
Problem 1: Memories are big and inactive
Memory consumes most of the aggregate die area (figure: cache on the Haswell Core i7-5960X die)
6
Problem 2: Significant energy is wasted moving data through the memory hierarchy
Figure: energy (pJ) of data movement vs. operation; labels: 1-50 pJ, 20-40x
7
Problem 3: General purpose processors are inefficient for data parallel applications
Scalar: inefficient
Small vector (32 bytes): more efficient
Very large vector: 1000x larger?
8
Problem summary: Conventional systems process data inefficiently
Figure: CORE and Memory blocks compared by energy and area, annotated with Problem 1 (area) and Problem 3
9
Key Idea Memory = Storage + In-place compute
10
Proposal: Repurpose memory logic for compute
Massive parallelism (up to 100X)
Energy efficiency (up to 20X)
Note: the figure depicts dynamic energy. With total energy, even the baseline is dominated by caches.
Figure: CORE and Memory blocks compared by energy and area
11
Compute Caches for Efficient Very Large Vector Processing
PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw
Students: Shaizeen Aga, Supreet Jeloka, Arun Subramanian
12
Proposal: Compute Caches
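In-place compute SRAM: operands A and B in a sub-array, result = A op B
Cache controller extension
Memory disambiguation support
Challenges: data orchestration, managing parallelism, coherence and consistency
Figure: CORE0 to CORE3, each with L1 and L2, connected through an interconnect to L3-Slice0 to L3-Slice3; each slice is divided into banks and sub-arrays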
13
Opportunity 1: Large vector computation
Operation width = row size
Many smaller sub-arrays; each sub-array can compute in parallel
Parallelism available (16 MB L3): 512 sub-arrays * 64 B = 32 KB operand
Figure labels: L1, L2, L3; 128X saved
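The arithmetic on this slide can be checked with a short sketch. The 512 sub-arrays and 64 B rows come from the slide, and the 32-byte small vector comes from the Problem 3 slide; the function and variable names are illustrative assumptions, not part of the talk.

```python
# Aggregate operand width when every sub-array operates on one row at once.
# 512 sub-arrays and 64 B per row are the slide's numbers for a 16 MB L3;
# the names below are hypothetical and chosen for illustration only.

def aggregate_operand_bytes(num_subarrays: int, row_bytes: int) -> int:
    """Bytes covered by one in-place operation across all sub-arrays."""
    return num_subarrays * row_bytes


l3_operand = aggregate_operand_bytes(num_subarrays=512, row_bytes=64)
small_vector = 32  # bytes, the "small vector" width from the Problem 3 slide

print(l3_operand)                  # 32768 bytes, i.e. 32 KB
print(l3_operand // small_vector)  # 1024
```

Under these numbers the L3-wide operand is about three orders of magnitude wider than a 32-byte vector, which lines up with the "1000x larger?" question raised earlier.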
14
Opportunity 2: Save data movement energy
A significant portion of cache energy is wire energy (60-80%)
Save wire energy
Save energy in moving data to higher cache levels
Figure: L1, L2, and L3 with H-tree
15
Compute Cache Architecture
Memory disambiguation for large vector operations
Cache controller extension
In-place compute SRAM
More details in upcoming HPCA paper
Figure: CORE0 to CORE3, each with L1 and L2, connected through an interconnect to L3-Slice0 to L3-Slice3
16
SRAM array operation
Read: precharge bit, bit_b; raise wordline
Write: drive data onto bit, bit_b; raise wordline
Figure (read operation): sub-array with address input to the row decoder, wordlines, precharge circuit, bitlines BL0/BLB0 to BLn/BLBn, and differential sense amplifiers (SA)
17
In-place compute SRAM: Changes
Figure: sub-array with Row Decoder and an added Row Decoder-O driving the wordlines, bitlines BL0/BLB0 to BLn/BLBn, differential sense amplifiers (SA), and single-ended sense amplifiers (SA with Vref)
18
In-place compute SRAM: A AND B
Figure: rows A and B selected through Row Decoder and Row Decoder-O; single-ended sense amplifiers (SA with Vref) on bitlines BL0/BLB0 to BLn/BLBn sense A AND B
Example row contents: A = 0 1 0 1, B = 1 0 0 1
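A behavioral sketch may help make the A AND B slide concrete. It assumes the usual bit-line computing behavior: with two wordlines raised at once, a precharged bitline stays high only if both activated cells store 1, so a single-ended sense against Vref reads the AND. This models logic only, not the circuit, and the function name is hypothetical.

```python
from typing import List


def bitline_and(row_a: List[int], row_b: List[int]) -> List[int]:
    """Behavioral model of activating two wordlines and sensing each bitline.

    Assumption: a precharged bitline is pulled low by any activated cell
    that stores 0, so it remains high only when both cells store 1.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same width")
    result = []
    for a, b in zip(row_a, row_b):
        bitline = 1               # precharge the bitline high
        if a == 0 or b == 0:      # a stored 0 discharges the bitline
            bitline = 0
        result.append(bitline)    # single-ended sense against Vref
    return result


# Row contents taken from the slide's figure.
A = [0, 1, 0, 1]
B = [1, 0, 0, 1]
print(bitline_and(A, B))  # [0, 0, 0, 1]
```

Because every column is sensed at the same time, one such access produces a full row of results, which is where the row-wide operation width comes from.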
19
SRAM Prototype Test Chip
20
Compute Cache ISA So Far
cc_copy a, b, n
cc_search a, k, r, n
cc_logical a, b, c, n
cc_clmul a, k, r, n
cc_buz a, n
cc_cmp a, b, r, n
cc_not a, b, n
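The slide lists only mnemonics and operands, so the following is a software sketch under assumed semantics, not the ISA definition: a, b, c, r, and k are taken to be byte offsets into a flat memory, n a length in bytes, comparison results are written as one byte per element, and the search key is a single byte stored at k. The class name and the extra `op` argument to `cc_logical` are hypothetical, and `cc_clmul` is left out.

```python
import operator


class ComputeCacheModel:
    """Toy byte-granular emulation of some listed operations (assumed semantics)."""

    _OPS = {"and": operator.and_, "or": operator.or_, "xor": operator.xor}

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)

    def cc_copy(self, a: int, b: int, n: int) -> None:
        self.mem[b:b + n] = self.mem[a:a + n]       # b[i] = a[i]

    def cc_buz(self, a: int, n: int) -> None:
        self.mem[a:a + n] = bytes(n)                # a[i] = 0

    def cc_not(self, a: int, b: int, n: int) -> None:
        self.mem[b:b + n] = bytes(x ^ 0xFF for x in self.mem[a:a + n])

    def cc_logical(self, op: str, a: int, b: int, c: int, n: int) -> None:
        f = self._OPS[op]                           # c[i] = a[i] op b[i]
        self.mem[c:c + n] = bytes(
            f(x, y) for x, y in zip(self.mem[a:a + n], self.mem[b:b + n]))

    def cc_cmp(self, a: int, b: int, r: int, n: int) -> None:
        self.mem[r:r + n] = bytes(                  # r[i] = 1 if a[i] == b[i]
            int(x == y) for x, y in zip(self.mem[a:a + n], self.mem[b:b + n]))

    def cc_search(self, a: int, k: int, r: int, n: int) -> None:
        key = self.mem[k]                           # assumed: one-byte key at k
        self.mem[r:r + n] = bytes(int(x == key) for x in self.mem[a:a + n])


m = ComputeCacheModel(64)
m.mem[0:4] = b"\x0f\xf0\xaa\x55"
m.mem[8:12] = b"\xff\x0f\xaa\x00"
m.cc_logical("and", 0, 8, 16, 4)
print(m.mem[16:20].hex())   # 0f00aa00
m.cc_cmp(0, 8, 24, 4)
print(list(m.mem[24:28]))   # [0, 0, 1, 0]
```

In hardware each of these would act on whole cache rows in place; the loop-per-byte form here only pins down what result each operation is assumed to produce.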
21
Applications modeled using Compute Caches
Text processing: StringMatch, Wordcount
In-memory checkpointing
FastBit bitmap database
Bit matrix multiplication
Inputs: WC: 10MB; SM: 50MB; BMM: 256; Bitmap: ~256KB of bitmaps; checkpointing: 100,000 ins checkpointing interval
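To illustrate why a bitmap database maps well onto bulk logical operations, here is a minimal bitmap-index sketch. It is a generic illustration, not FastBit's implementation or API; the columns, values, and helper names are made up for the example.

```python
from typing import Dict, List


def build_bitmaps(column: List[str]) -> Dict[str, int]:
    """One bitmap per distinct value; bit i is set when row i holds that value."""
    bitmaps: Dict[str, int] = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def rows_set(bitmap: int) -> List[int]:
    """Row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical table with two columns.
color = ["red", "blue", "red", "green", "red"]
size = ["S", "S", "L", "L", "S"]

color_bm = build_bitmaps(color)
size_bm = build_bitmaps(size)

# Query "color = red AND size = S" is a single bitwise AND of two bitmaps.
match = color_bm["red"] & size_bm["S"]
print(rows_set(match))  # [0, 4]
```

The query reduces to a wide bitwise AND over long bit vectors, the kind of operation the in-place compute SRAM slides describe performing a full row at a time.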
22
Compute Cache Results Summary
1.9X 2.4X 4%
23
Compute Cache Summary
Empower caches to compute
Performance: large vector parallelism
Energy: reduce on-chip data movement
In-place compute SRAM
Data placement and cache geometry for increased operand locality
8% area overhead
2.1X performance, 2.7X energy savings
24
Future
25
Compute Memory System Stack
Redesign Applications: Data Analytics, Crypto, Image Processing, Bioinformatics, Graphs, OS primitives, FSA, Machine Learning. Express computation using in-situ operations.
Adapt PL/Compiler: Data-flow languages, Java/C++, OpenCL, Google's TensorFlow [14], RAPID [20]. Express parallelism.
Design Architecture: ISA, Data-flow, Large SIMD. Data orchestration, Coherence & Consistency, Manage parallelism.
Design Compute Memories: Non-Volatile (Re-RAM, STT-RAM, MRAM, Flash), Volatile (DRAM, SRAM), Cache. Customize hierarchy. Where to compute? Rich operation set.
In-situ technique: Bit-line Parallel Automaton Bit-line Locality
Operation set: Logical, Data migration, Comparison, Search, Addition, Multiplication, Convolution, FSM
26
In-Situ Compute Memory Systems
Thank You! Reetuparna Das