CAMEO: A Cache-Like Memory Organization for 3D Memory Systems
Chiachen Chou (Georgia Tech), Aamer Jaleel (Intel), Moinuddin K. Qureshi (Georgia Tech)
MICRO, Cambridge, UK, 12/15/2014
EXECUTIVE SUMMARY
How to use Stacked DRAM: Cache or Memory?
–Cache: software-transparent, fine-grained data transfer, but sacrifices memory capacity
–Memory: larger memory capacity, but requires software support and uses coarse-grained data transfer
–CAMEO: software-transparent, fine-grained data transfer, and almost full memory capacity
–Results: CAMEO provides a 78% speedup, outperforming both Cache (50%) and Two-Level Memory (50%)
MEMORY BANDWIDTH WALL
Computer systems face a memory bandwidth wall. Stacked DRAM (High Bandwidth Memory, Hybrid Memory Cube) helps overcome it.
Stacked DRAM: Bandwidth 2-8X, Latency 0.5-1X
Courtesy: JEDEC, Intel, Micron
HYBRID MEMORY SYSTEM
How to use Stacked DRAM: Cache or Main Memory?
[Figure: hybrid memory system combining Stacked DRAM (1-4 GB) and Commodity DRAM (8-16 GB)]
Courtesy: JEDEC, Intel, Micron
AGENDA
Introduction
Background
–Cache
–Two-Level Memory
CAMEO
–Concept
–Implementation
Methodology
Results
Summary
HARDWARE-MANAGED CACHE
Stacked DRAM is architected as a DRAM cache
[Figure: memory hierarchy from fast to slow: CPU, L1$, L2$, L3$, DRAM cache (stacked DRAM), off-chip DRAM]
HARDWARE-MANAGED CACHE
Cache: software-transparent, fine-grained data transfer, but no capacity benefits
[Figure: below the OS, a shared L3 cache; an L3 miss goes to the L4 cache, an L4 miss goes to off-chip memory; data moves in 64B units]
3D DRAM AS A CACHE
[Figure: CPUs with a DRAM cache (4GB stacked DRAM) in front of 12GB commodity DRAM; of the 16GB total, 12GB is usable as memory when the stacked DRAM serves as a cache]
                   Cache    TLM       CAMEO
Need OS Support    No       Yes       No
Data               64B      4KB       64B
Memory Capacity    No 3D    Plus 3D   += 3D
TWO-LEVEL MEMORY (TLM)
Stacked DRAM is architected as part of the OS-visible memory space (Two-Level Memory)
[Figure: CPU, L1$, L2$, L3$; the OS sees 4GB stacked DRAM plus 12GB off-chip DRAM as 16GB of memory]
TWO-LEVEL MEMORY (NO MIGRATION)
Static page mapping does not exploit locality
[Figure: below the shared L3 cache, the OS maps 25% of pages to the 4GB stacked DRAM and 75% of pages to the 12GB off-chip DRAM]
TWO-LEVEL MEMORY (WITH MIGRATION)
TLM: needs OS support and uses bandwidth inefficiently
[Figure: an L3 miss from the shared L3 cache requests 64B, but page migration, performed with OS support, transfers a whole page (4KB transfer)]
MOTIVATION
[Figure: performance of Cache and TLM. Baseline: 12GB off-chip DRAM w/ 4GB stacked DRAM. Small WS: Small Working Set (<12GB); Large WS: Large Working Set (>12GB)]
MOTIVATION
Cache performs poorly on Large WS workloads, as does TLM on Small WS workloads
[Figure: performance of Cache and TLM. Baseline: 12GB off-chip DRAM w/ 4GB stacked DRAM. Small WS (<12GB), Large WS (>12GB); annotated value: 31%]
OVERVIEW
[Figure: left, CPUs with a DRAM cache in front of off-chip DRAM; right, CPUs with stacked DRAM and off-chip DRAM both in the OS-visible memory space]
                   Cache    TLM       Ideal
Need OS Support    No       Yes       No
Data               64B      4KB       64B
Memory Capacity    No 3D    Plus 3D
CAMEO: A CAche-Like MEmory Organization
Software gets the full capacity; hardware performs data migration
[Figure: shared L3 cache above 4GB stacked DRAM and 12GB commodity DRAM; OS pages span the full 16GB]
CAMEO: A CAche-Like MEmory Organization
CAMEO transfers only 64B cache lines
[Figure: on an L3 miss from the shared L3 cache, hardware swaps 64B lines between stacked memory and off-chip memory (fine-grained transfer)]
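CAMEO – CONGRUENCE GROUP
[Figure: the line address space is split into four regions of N lines each: 0 to N-1 in the 4GB stacked memory, and N to 2N-1, 2N to 3N-1, 3N to 4N-1 in the 12GB off-chip memory. Lines A, B, C, and D, one from each region, form a congruence group]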
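The figure suggests that a congruence group collects one line from each N-line region. A minimal Python sketch of that mapping follows, assuming a simple modulo scheme (group index = line address mod N); the slide does not spell out the exact function, and the names `congruence_group` and `home_slot` are hypothetical.

```python
LINE_SIZE = 64                    # bytes per line (from the slides)
STACKED_BYTES = 4 * 2**30         # 4GB stacked DRAM
N = STACKED_BYTES // LINE_SIZE    # lines in stacked DRAM = number of groups


def congruence_group(line_addr: int) -> int:
    """Group index shared by lines A, B, C, D (assumed modulo mapping)."""
    return line_addr % N


def home_slot(line_addr: int) -> int:
    """Initial slot in the group: 0 = stacked region, 1..3 = off-chip regions."""
    return line_addr // N


# Four lines at the same offset in each region land in the same group.
a, b, c, d = 5, N + 5, 2 * N + 5, 3 * N + 5
groups = {congruence_group(x) for x in (a, b, c, d)}
slots = [home_slot(x) for x in (a, b, c, d)]
```

Under this assumption, only lines of the same group ever compete for a given stacked-DRAM slot, which keeps the swap decision local to four lines.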
MIGRATION IN CONGRUENCE GROUP
Requests to B, B, and C, starting from A B C D (the first position is in stacked DRAM):
–Request to B: swap lines A and B, giving B A C D
–Request to B: hit in stacked DRAM, still B A C D
–Request to C: swap lines C and B, giving C A B D
Swapping changes a line's location, so an indexing structure is required to keep track of the location.
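The request sequence on this slide can be replayed with a short Python sketch. It models one congruence group as a four-element list whose first element is the stacked-DRAM slot; the function name `access` and the list representation are illustrative assumptions, not the hardware design.

```python
def access(slots: list, line: str) -> bool:
    """Serve a request to `line`; slots[0] is the stacked-DRAM slot.

    Returns True on a stacked-DRAM hit. On a miss, the requested line
    is swapped with whatever currently sits in the stacked slot.
    """
    pos = slots.index(line)
    if pos == 0:
        return True
    slots[0], slots[pos] = slots[pos], slots[0]
    return False


group = ["A", "B", "C", "D"]
hits = [access(group, line) for line in ("B", "B", "C")]
# group now holds the final placement from the slide: C A B D
```

The final placement differs from the initial one, which is why some structure has to record where each line currently lives.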
LINE LOCATION TABLE (LLT)
Location table for a congruence group: for each requested line (A, B, C, D), the LLT records which of the 4 physical locations currently holds it
[Figure: physical locations hold C, A, B, D]
LINE LOCATION TABLE (LLT)
Size of location table per congruence group: log2(4) = 2 bits per line, so 4 lines need 8 bits (1 byte)
Storing the LLT in SRAM is impractical: 64M groups need 64MB
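The 64MB figure follows from the numbers given elsewhere in the talk (4GB stacked DRAM, 64B lines, 4 locations per group). A short Python check, with variable names chosen here for illustration:

```python
LINE_SIZE = 64                    # bytes per line
STACKED_BYTES = 4 * 2**30         # 4GB stacked DRAM
LOCATIONS_PER_GROUP = 4           # lines A, B, C, D

# One congruence group per line slot in stacked DRAM.
groups = STACKED_BYTES // LINE_SIZE

# Bits needed to name one of 4 locations: ceil(log2(4)) = 2.
bits_per_entry = (LOCATIONS_PER_GROUP - 1).bit_length()

bytes_per_group = LOCATIONS_PER_GROUP * bits_per_entry // 8
total_llt_bytes = groups * bytes_per_group
```

Here `groups` comes out to 64M and `total_llt_bytes` to 64MB, matching the slide.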
LLT IN DRAM
LLT in DRAM incurs serialization latency
–Optimizing for the common case: hit in stacked DRAM
–Co-locate the Line Location Table of each congruence group with data in stacked DRAM
LEAD (Location Entry And Data): 1 byte LLT + 64 byte data
[Figure: on an L3 miss, stacked DRAM returns the LLT entry and data together on hits; 2KB holds 31 LEAD, at some capacity loss]
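The "31 LEAD" figure can be reproduced with simple arithmetic, assuming the "2KB" on the slide is the size of one stacked-DRAM row (an interpretation, not stated explicitly). The capacity loss computed below is derived from that assumption rather than read from the slide.

```python
ROW_BYTES = 2048                  # assumed: one stacked-DRAM row is 2KB
LINE_BYTES = 64                   # data portion of a LEAD
LLT_ENTRY_BYTES = 1               # location entry portion of a LEAD
LEAD_BYTES = LLT_ENTRY_BYTES + LINE_BYTES

leads_per_row = ROW_BYTES // LEAD_BYTES    # 65-byte LEADs that fit in a row
lines_per_row = ROW_BYTES // LINE_BYTES    # plain 64B lines that would fit

# Fraction of stacked-DRAM data capacity given up to hold the LLT entries.
capacity_loss = 1 - leads_per_row / lines_per_row
```

With these inputs a row holds 31 LEADs instead of 32 plain lines, so roughly one line in 32 is given up.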
AVOID LLT LOOKUP LATENCY FOR HIT
Avoiding LLT lookup latency on a stacked DRAM hit (lines in stacked memory)
–Co-locate the Line Location Table of each congruence group with data in stacked DRAM
–Hit: one access to stacked DRAM, using the address, returns the data
Co-locate the LLT to avoid latency on hits
AVOID LLT LOOKUP LATENCY FOR MISS
Avoiding LLT lookup latency on a stacked DRAM miss (lines in off-chip memory)
–Use a Line Location Predictor to fetch data from the possible location in parallel
–LEAD: verify the location when both are ready
[Figure: the address feeds the Line Location Predictor, which triggers a parallel access to the possible location among lines A, B, C, D]
AVOID LLT LOOKUP LATENCY FOR MISS
Avoiding LLT lookup latency on a stacked DRAM miss (lines in off-chip memory)
–LLP makes an M-ary prediction: Stacked, Off-chip #1, Off-chip #2, or Off-chip #3
–LLP uses the instruction address and the last location to make a prediction
–Storage: 64 bytes per core
Predictor         Accuracy
Always Stacked    70%
LLP               92%
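A last-location predictor indexed by instruction address could look like the Python sketch below. The class name, the modulo index function, and the 256-entry table are assumptions made for illustration; 256 entries of 2 bits each are chosen only because they add up to the 64 bytes per core quoted on the slide.

```python
class LineLocationPredictor:
    """Illustrative last-location predictor, indexed by instruction address."""

    def __init__(self, entries: int = 256):
        self.entries = entries
        # Each entry holds a location: 0 = stacked, 1..3 = off-chip #1..#3.
        self.table = [0] * entries

    def _index(self, pc: int) -> int:
        return pc % self.entries  # hypothetical index function

    def predict(self, pc: int) -> int:
        """Predict where the line requested by instruction `pc` lives."""
        return self.table[self._index(pc)]

    def update(self, pc: int, actual: int) -> None:
        """Record the location that was verified after the access."""
        self.table[self._index(pc)] = actual


llp = LineLocationPredictor()
first = llp.predict(0x400123)     # untrained entry predicts stacked (0)
llp.update(0x400123, 2)           # the line was actually in off-chip #2
second = llp.predict(0x400123)    # same instruction now predicts off-chip #2
storage_bytes = llp.entries * 2 // 8   # 2 bits per entry
```

An untrained entry behaves like the "Always Stacked" baseline; updating with the verified location is what lets the prediction follow lines that have been swapped off-chip.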
AVOIDING LLT LATENCY OVERHEAD
–On hit in stacked DRAM: co-locate the LLT of each congruence group with data in stacked DRAM
–On miss in stacked DRAM: use the Line Location Predictor to fetch data from the possible location (Stacked, Off-chip #1, Off-chip #2, Off-chip #3) in parallel
We co-locate the Line Location Table with data and use the Line Location Predictor to mitigate latency overhead
METHODOLOGY
Core chip: 32 cores, 3.2GHz 2-wide out-of-order; 32MB 32-way shared L3 cache
[Figure: CPU, Stacked DRAM, Commodity DRAM, SSD]
METHODOLOGY
            Stacked DRAM                     Commodity DRAM
Capacity    4GB                              12GB
Bus         DDR 3.2GHz, 128-bit              DDR 1.6GHz, 64-bit
Latency     22ns                             44ns
Channels    16 channels, 16 banks/channel    8 channels, 8 banks/channel
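As a rough sanity check on these parameters, peak bandwidth scales with transfer rate, bus width, and channel count. The Python sketch below compares the two configurations in relative units; the helper `peak_bandwidth` is hypothetical, and it assumes the two "DDR ... GHz" figures are quoted on the same basis so that only their ratio matters.

```python
def peak_bandwidth(rate_ghz: float, bus_bits: int, channels: int) -> float:
    """Relative peak bandwidth: rate x bus width x channels (arbitrary units)."""
    return rate_ghz * bus_bits * channels


stacked = peak_bandwidth(3.2, 128, 16)
commodity = peak_bandwidth(1.6, 64, 8)
ratio = stacked / commodity       # 2x rate, 2x width, 2x channels
```

Under these assumptions the evaluated stacked DRAM has 8 times the peak bandwidth of the commodity DRAM, at the upper end of the 2-8X range quoted earlier in the talk.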
METHODOLOGY
–Baseline: 12GB off-chip DRAM
–Cache: Alloy Cache [MICRO’12]
–Two-Level Memory: page migration enabled
–SSD latency: 32 microseconds
–SPEC2006 in rate mode; Small Working Set (<12GB) and Large Working Set (>12GB)
[Figure: CPU, Stacked DRAM, Commodity DRAM, SSD]
PERFORMANCE IMPROVEMENT: Small WSet
CAMEO is as good as Cache in Small WS apps
PERFORMANCE IMPROVEMENT: Large WSet
CAMEO outperforms both Cache and TLM, and is very close to DoubleUse
CAMEO outperforms TLM in Large WS apps [figure annotation: 28%]
EXECUTIVE SUMMARY
How to use Stacked DRAM: Cache or Memory?
–Cache: software-transparent, fine-grained data transfer, but sacrifices memory capacity
–Memory: larger memory capacity, but requires software support and uses coarse-grained data transfer
–CAMEO: software-transparent, fine-grained data transfer, and almost full memory capacity
–Results: CAMEO provides a 78% speedup, outperforming both Cache (50%) and Two-Level Memory (50%)
Thank You!
Backup slides
LINE LOCATION TABLE
Size of location table per congruence group: 4 locations need log2(4) = 2 bits per line, so 4 lines need 8 bits (1 byte)
# Locations    Size
4              1 byte
8              3 byte
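The per-group sizes in this table follow one rule: each of the n lines needs enough bits to name one of n locations, and the total is rounded up to whole bytes. A small Python sketch, where the function name `llt_bytes_per_group` and the round-up-to-bytes step are assumptions for illustration:

```python
def llt_bytes_per_group(locations: int) -> int:
    """Bytes of location table needed for a group with `locations` lines."""
    bits_per_entry = (locations - 1).bit_length()   # ceil(log2(locations))
    total_bits = locations * bits_per_entry
    return (total_bits + 7) // 8                    # round up to whole bytes


sizes = {n: llt_bytes_per_group(n) for n in (4, 8)}
```

For 4 locations this gives 1 byte and for 8 locations 3 bytes, in line with the two rows that survive in the table.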
POWER AND ENERGY
[Figure: power and energy results; annotated values: 14% and 34%]