In-Situ Compute Memory Systems. Reetuparna Das, Assistant Professor, EECS Department
Massive amounts of data generated each day
Near Data Computing: move compute near storage. We can solve the data movement problem for huge volumes of data.
Evolution:
1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
2012: Resurgence of Processing in Memory: emergence of big data, data movement dominates Joules/op, 3D memories available; a logic layer near memory is enabled by 3D technology.
2014: Automata Processor: associative memory with custom interconnects.
2015: Compute Memories (bit-line computing): in-situ computation inside memory arrays.
Problem 1: Memories are big but inactive. Memory consumes most of the aggregate die area (e.g., the cache in a Haswell Core i7-5960X die).
Problem 2: Significant energy is wasted moving data through the memory hierarchy. Data movement costs 1000-2000 pJ versus 1-50 pJ for an operation, a 20-40x gap.
Problem 3: General-purpose processors are inefficient for data-parallel applications. Scalar execution is inefficient; small vectors (32 bytes) are more efficient; what about very large vectors, 1000x larger?
Problem summary: Conventional systems process data inefficiently. [Diagram: cores and memories annotated with Problems 1-3, highlighting area and energy costs.]
Key Idea Memory = Storage + In-place compute
Proposal: Repurpose memory logic for compute. Benefits: massive parallelism (up to 100X) and energy efficiency (up to 20X). (The energy figures depict dynamic energy; with total energy, even the baseline is dominated by caches.)
Compute Caches for Efficient Very Large Vector Processing. PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw. Students: Shaizeen Aga, Supreet Jeloka, Arun Subramanian.
Proposal: Compute Caches. In-place compute SRAM (B = A op B) inside cache sub-arrays, with memory disambiguation support and a cache controller extension. [Diagram: CORE0-CORE3 with private L1/L2 caches, an interconnect, and L3 slices 0-3 built from banks of sub-arrays.] Challenges: data orchestration, managing parallelism, coherence and consistency.
Opportunity 1: Large vector computation. Operation width = row size. The cache is built from many smaller sub-arrays, and each sub-array can compute in parallel. Parallelism available in a 16 MB L3: 512 sub-arrays x 64 B rows = a 32 KB operand processed at once (128X).
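A quick back-of-the-envelope check of that arithmetic, written as a small C program (the 16 MB / 512 sub-array / 64 B row figures are the slide's example geometry, not a general rule):

```c
#include <stdio.h>

int main(void) {
    /* Example cache geometry from the slide:
       a 16 MB L3 built from 512 sub-arrays with 64-byte rows. */
    const long l3_bytes      = 16L * 1024 * 1024;
    const long num_subarrays = 512;
    const long row_bytes     = 64;

    long per_subarray = l3_bytes / num_subarrays;   /* capacity of one sub-array */
    long operand      = num_subarrays * row_bytes;  /* bytes operated on in parallel */

    printf("capacity per sub-array : %ld KB\n", per_subarray / 1024);  /* 32 KB */
    printf("parallel operand width : %ld KB\n", operand / 1024);       /* 32 KB */
    return 0;
}
```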
Opportunity 2: Save data movement energy. A significant portion of cache energy (60-80%) is wire energy, spent in the H-tree. Computing in place saves that wire energy and the energy of moving data to higher cache levels (L3 to L2 to L1).
Compute Cache Architecture: memory disambiguation for large vector operations, a cache controller extension, and in-place compute SRAM. [Diagram: CORE0-CORE3, private L1/L2 caches, interconnect, L3 slices 0-3.] More details in the upcoming HPCA paper.
SRAM array operation. A sub-array consists of wordlines, bit-line pairs (BL/BLB), a row decoder, precharge circuitry, and differential sense amplifiers. Read: precharge bit and bit_b, then raise the wordline; the differential sense amplifiers resolve the stored values. Write: drive data onto bit and bit_b, then raise the wordline.
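A behavioral sketch of that read/write sequence in C (a functional model only, assuming a simple 0/1 cell array; it ignores timing, precharge circuits, and analog behavior):

```c
#include <stdint.h>
#include <stdio.h>

#define ROWS 4
#define COLS 8

/* One bit per cell; cell[r][c] models the stored value Q (Qb is its complement). */
static uint8_t cell[ROWS][COLS];

/* Read: precharge BL and BLB, raise one wordline, let the accessed cell
   discharge one side, and resolve with a differential sense amplifier. */
static void sram_read(int row, uint8_t out[COLS]) {
    for (int c = 0; c < COLS; c++) {
        int bl = 1, blb = 1;            /* both bit-lines precharged high   */
        if (cell[row][c]) blb = 0;      /* cell stores 1: Qb pulls BLB down */
        else              bl  = 0;      /* cell stores 0: Q pulls BL down   */
        out[c] = (uint8_t)(bl > blb);   /* differential sensing             */
    }
}

/* Write: drive data onto BL/BLB and raise the wordline. */
static void sram_write(int row, const uint8_t in[COLS]) {
    for (int c = 0; c < COLS; c++) cell[row][c] = in[c] & 1;
}

int main(void) {
    uint8_t data[COLS] = {1, 0, 1, 1, 0, 0, 1, 0}, out[COLS];
    sram_write(2, data);
    sram_read(2, out);
    for (int c = 0; c < COLS; c++) printf("%d", out[c]);
    printf("\n");
    return 0;
}
```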
In-place compute SRAM changes: add a second row decoder (Row Decoder-O) so two wordlines can be activated at once, and replace the differential sense amplifiers with single-ended sense amplifiers referenced against Vref.
In-place compute SRAM: activate two rows, A and B, simultaneously; sensing the bit-lines single-ended yields A AND B directly on the columns. Example: A = 0101, B = 1001, so A AND B = 0001.
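A behavioral sketch of the same idea in C, using the operand bits from the slide (the NOR-on-BLB companion output is how such bit-line compute designs are commonly described and is included here as an assumption; only the AND result appears on the slide):

```c
#include <stdint.h>
#include <stdio.h>

#define COLS 4

/* Bit-line compute model: both wordlines (rows A and B) are raised at once
   and the bit-lines are sensed single-ended against Vref. */
static void bitline_compute(const uint8_t A[COLS], const uint8_t B[COLS],
                            uint8_t and_out[COLS], uint8_t nor_out[COLS]) {
    for (int c = 0; c < COLS; c++) {
        /* BL stays precharged only if neither cell pulls it down,
           i.e. both cells store 1: sensing BL yields A AND B. */
        and_out[c] = (uint8_t)(A[c] & B[c]);
        /* BLB stays precharged only if both cells store 0:
           sensing BLB yields A NOR B (assumed companion output). */
        nor_out[c] = (uint8_t)(!(A[c] | B[c]));
    }
}

int main(void) {
    uint8_t A[COLS] = {0, 1, 0, 1};   /* operand row A from the slide */
    uint8_t B[COLS] = {1, 0, 0, 1};   /* operand row B from the slide */
    uint8_t and_out[COLS], nor_out[COLS];
    bitline_compute(A, B, and_out, nor_out);
    printf("A AND B = ");
    for (int c = 0; c < COLS; c++) printf("%d", and_out[c]);   /* 0001 */
    printf("\n");
    return 0;
}
```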
SRAM Prototype Test Chip
Compute Cache ISA so far:
cc_copy a, b, n
cc_search a, k, r, n
cc_logical a, b, c, n
cc_clmul a, k, r, n
cc_buz a, n
cc_cmp a, b, r, n
cc_not a, b, n
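A sketch of the architectural semantics of a few of these operations as plain C reference functions (the cc_ mnemonics and the operands a/b/c, key k, result r, and size n follow the slide; the C signatures below are assumptions for illustration, and the real instructions execute in bulk, in place, inside cache sub-arrays rather than through the core):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* cc_copy a, b, n : copy n bytes from region a to region b. */
static void cc_copy(const void *a, void *b, size_t n) { memcpy(b, a, n); }

/* cc_buz a, n : bulk-zero n bytes starting at a. */
static void cc_buz(void *a, size_t n) { memset(a, 0, n); }

/* cc_not a, b, n : bitwise NOT of region a into region b. */
static void cc_not(const uint8_t *a, uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) b[i] = (uint8_t)~a[i];
}

/* cc_logical a, b, c, n : bitwise logical op of a and b into c
   (AND shown; the family also covers OR/XOR). */
static void cc_and(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) c[i] = a[i] & b[i];
}

/* cc_cmp a, b, r, n : compare regions a and b, writing a match flag to r. */
static void cc_cmp(const uint8_t *a, const uint8_t *b, uint8_t *r, size_t n) {
    *r = (uint8_t)(memcmp(a, b, n) == 0);
}

int main(void) {
    uint8_t src[8] = {1, 2, 3, 4, 5, 6, 7, 8}, dst[8], flag;
    cc_copy(src, dst, sizeof dst);
    cc_cmp(src, dst, &flag, sizeof dst);
    printf("copy matches source: %d\n", flag);   /* prints 1 */
    return 0;
}
```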
Applications modeled using Compute Caches: text processing (WordCount: 10 MB, StringMatch: 50 MB), the FastBit bitmap database with bit matrix multiplication (256 bitmaps, ~256 KB of bitmaps), and in-memory checkpointing (100,000-instruction checkpointing interval).
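As a toy illustration of how a FastBit-style bitmap workload maps onto these operations, here is a conjunctive bitmap query expressed with a software stand-in for cc_logical (the predicates, bitmap contents, and C signature are made up for illustration):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BITMAP_BYTES 64   /* one 64 B cache row per bitmap in this toy example */

/* Software stand-in for cc_logical (AND); the real operation runs in-place
   inside the cache, one row per sub-array in parallel. */
static void cc_and(const uint8_t *a, const uint8_t *b, uint8_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) c[i] = a[i] & b[i];
}

static int popcount_bytes(const uint8_t *a, size_t n) {
    int count = 0;
    for (size_t i = 0; i < n; i++)
        for (int bit = 0; bit < 8; bit++) count += (a[i] >> bit) & 1;
    return count;
}

int main(void) {
    /* Two column bitmaps from a bitmap index: bit i is set if row i
       satisfies the corresponding (made-up) predicate. */
    uint8_t age_over_30[BITMAP_BYTES], state_is_MI[BITMAP_BYTES], result[BITMAP_BYTES];
    memset(age_over_30, 0xF0, sizeof age_over_30);
    memset(state_is_MI, 0x3C, sizeof state_is_MI);

    /* The conjunctive query "age > 30 AND state = MI" becomes one bulk
       bitwise AND over the bitmaps, followed by a population count. */
    cc_and(age_over_30, state_is_MI, result, BITMAP_BYTES);
    printf("matching rows: %d\n", popcount_bytes(result, BITMAP_BYTES));
    return 0;
}
```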
Compute Cache Results Summary. [Charts: headline figures of 1.9X, 2.4X, and 4%.]
Compute Cache Summary: empower caches to compute. Key components: in-place compute SRAM (8% area overhead) and data placement / cache geometry for increased operand locality. Performance: large vector parallelism, 2.1X speedup. Energy: reduced on-chip data movement, 2.7X energy savings.
Future
Compute Memory System Stack:
Redesign applications (data analytics, crypto, image processing, bioinformatics, graphs, OS primitives, FSA, machine learning): express computation using in-situ operations.
Adapt PL/compiler (Java/C++, OpenCL, data-flow languages such as Google's TensorFlow [14], RAPID [20]): express parallelism.
Design architecture (ISA, data-flow, large SIMD): data orchestration, coherence and consistency, managing parallelism.
Design compute memories (volatile: SRAM, DRAM; non-volatile: Flash, MRAM, STT-RAM, Re-RAM): customize the hierarchy, decide where to compute, provide a rich operation set.
In-situ techniques: bit-line parallel, bit-line locality, automaton. Operation set: logical, data migration, comparison, search, addition, multiplication, convolution, FSM.
In-Situ Compute Memory Systems Thank You! Reetuparna Das