A Cache-Like Memory Organization for 3D memory systems CAMEO 12/15/2014 MICRO Cambridge, UK Chiachen Chou, Georgia Tech Aamer Jaleel, Intel Moinuddin K. Qureshi, Georgia Tech

Presentation transcript:

A Cache-Like Memory Organization for 3D memory systems CAMEO 12/15/2014 MICRO Cambridge, UK Chiachen Chou, Georgia Tech Aamer Jaleel, Intel Moinuddin K. Qureshi, Georgia Tech

EXECUTIVE SUMMARY How to use Stacked DRAM: Cache or Memory? Cache: software-transparent, fine-grained data transfer, but sacrifices memory capacity. Memory: larger memory capacity, but requires software support and coarse-grained data transfer. CAMEO: software-transparent, fine-grained data transfer, and almost full memory capacity. Results: CAMEO provides 78% speedup, outperforming both Cache (50%) and Two-Level Memory (50%).

MEMORY BANDWIDTH WALL (Courtesy: JEDEC, Intel, Micron) Computer systems face a memory bandwidth wall. Stacked DRAM (e.g., High Bandwidth Memory, Hybrid Memory Cube) helps overcome it: relative to commodity DRAM, it provides 2-8X bandwidth at 0.5-1X latency.

HYBRID MEMORY SYSTEM (Courtesy: JEDEC, Intel, Micron) A hybrid memory system pairs Stacked DRAM (1-4 GB) with Commodity DRAM (8-16 GB). How to use Stacked DRAM: Cache or Main Memory?

AGENDA Introduction; Background (Cache, Two-Level Memory); CAMEO (Concept, Implementation); Methodology; Results; Summary

HARDWARE-MANAGED CACHE Stacked DRAM is architected as a DRAM Cache. In the memory hierarchy, fast to slow: CPU, L1$, L2$, shared L3$, DRAM Cache (stacked DRAM), off-chip DRAM.

HARDWARE-MANAGED CACHE On an L3 miss, a 64B line is looked up in the DRAM cache (L4); on an L4 miss, it is fetched from off-chip memory. Cache: software transparency and fine-grained data transfer, but no capacity benefit, since the stacked DRAM capacity is not visible to the OS.

3D DRAM AS A CACHE With the 4GB stacked DRAM used as a cache in front of 12GB of commodity DRAM, the OS-visible capacity remains 12GB instead of 16GB.
                   Cache    TLM       CAMEO
 Need OS Support   No       Yes       No
 Data Transfer     64B      4KB       64B
 Memory Capacity   No 3D    Plus 3D   Almost full 3D

AGENDA Introduction; Background (Cache, Two-Level Memory); CAMEO (Concept, Implementation); Methodology; Results; Summary

TWO-LEVEL MEMORY (TLM) Stacked DRAM is architected as part of the OS-visible memory space (Two-Level Memory): 4GB stacked + 12GB off-chip = 16GB visible to the OS.

TWO-LEVEL MEMORY (NO MIGRATION) The OS maps pages statically: 25% of pages land in the 4GB stacked DRAM and 75% in the 12GB off-chip DRAM. Static page mapping does not exploit locality.

TWO-LEVEL MEMORY (WITH MIGRATION) On an L3 miss for a 64B line, the OS migrates the entire 4KB page containing it into stacked DRAM. TLM therefore requires OS support and uses bandwidth inefficiently (a 4KB transfer to service a 64B demand).

MOTIVATION Baseline: 12GB off-chip DRAM with 4GB stacked DRAM. Workloads are classified as Small WS (working set < 12GB) or Large WS (working set > 12GB).

MOTIVATION Baseline: 12GB off-chip DRAM with 4GB stacked DRAM. Cache performs poorly on Large WS workloads, just as TLM does on Small WS workloads (up to a 31% gap in the figure).

OVERVIEW A DRAM cache keeps only the off-chip DRAM in the OS-visible memory space; TLM exposes both stacked and off-chip DRAM to the OS. The ideal design combines the strengths of both:
                   Cache    TLM       Ideal
 Need OS Support   No       Yes       No
 Data Transfer     64B      4KB       64B
 Memory Capacity   No 3D    Plus 3D   Plus 3D

AGENDA Introduction; Background (Cache, Two-Level Memory); CAMEO (Concept, Implementation); Methodology; Results; Summary

CAMEO A CAche-Like MEmory Organization. Software gets the full capacity (4GB stacked + 12GB commodity = 16GB OS-visible); hardware performs the data migration.

CAMEO A CAche-Like MEmory Organization. On an L3 miss, hardware swaps 64B lines between stacked and off-chip memory; CAMEO transfers only 64B cache lines (fine-grained transfer).

CAMEO – CONGRUENCE GROUP The physical space is divided into four equal regions: lines 0..N-1 in the 4GB stacked memory, and lines N..2N-1, 2N..3N-1, 3N..4N-1 in the 12GB off-chip memory. The four lines at the same offset in each region (A, B, C, D) form a congruence group.
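To make the address mapping concrete, here is a minimal sketch, assuming 64B lines, one 4GB stacked region, and three 4GB off-chip regions; the function and type names (decode, line_id_t) are illustrative, not from the paper:

```c
/* Illustrative sketch: map a 64-byte line address in a 16GB space
 * (4GB stacked + 12GB off-chip) onto a congruence group and a region.
 * Region 0 is the stacked DRAM; regions 1-3 are the three 4GB slices of
 * off-chip DRAM.  Lines sharing the same group index across the four
 * regions form one congruence group. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES       64ULL
#define REGION_BYTES     (4ULL << 30)                  /* 4GB per region */
#define LINES_PER_REGION (REGION_BYTES / LINE_BYTES)   /* N = 64M lines  */

typedef struct {
    uint64_t group;   /* congruence-group index, 0 .. N-1 */
    unsigned region;  /* 0 = stacked, 1-3 = off-chip      */
} line_id_t;

static line_id_t decode(uint64_t paddr)
{
    uint64_t line = paddr / LINE_BYTES;
    line_id_t id;
    id.group  = line % LINES_PER_REGION;   /* same offset in every region */
    id.region = (unsigned)(line / LINES_PER_REGION);
    return id;
}

int main(void)
{
    /* Lines A, B, C, D of one congruence group sit at the same offset
     * in each of the four regions. */
    uint64_t offset = 0x12345ULL * LINE_BYTES;
    for (unsigned r = 0; r < 4; r++) {
        line_id_t id = decode(r * REGION_BYTES + offset);
        printf("region %u -> group %llu\n", id.region,
               (unsigned long long)id.group);
    }
    return 0;
}
```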

MIGRATION IN CONGRUENCE GROUP Consider requests to B, B, and C, with the group initially [A B C D] (A in stacked DRAM). Request to B: swap lines A and B, giving [B A C D]. Request to B: hit in stacked DRAM. Request to C: swap lines C and B, giving [C A B D]. Swapping changes a line's location, and requires an indexing structure to keep track of where each line currently resides.
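A minimal sketch of the swap sequence above; the array-based bookkeeping is illustrative, not the paper's hardware. loc[i] records which physical slot currently holds line i, and a request to a line outside slot 0 swaps it with whatever line occupies the stacked slot:

```c
/* Illustrative sketch of migration within one congruence group.
 * loc[i] holds the physical slot (0 = stacked, 1-3 = off-chip) of line i,
 * where lines are named 0=A, 1=B, 2=C, 3=D. */
#include <stdio.h>

static int loc[4] = {0, 1, 2, 3};   /* initial state: A in stacked DRAM */

static void access_line(int line)
{
    if (loc[line] == 0) {                 /* already in stacked DRAM: hit */
        printf("line %c: hit in stacked DRAM\n", 'A' + line);
        return;
    }
    /* find the line currently occupying the stacked slot and swap */
    for (int other = 0; other < 4; other++) {
        if (loc[other] == 0) {
            loc[other] = loc[line];       /* victim moves to requester's slot */
            loc[line]  = 0;               /* requested line moves to stacked  */
            printf("line %c: swapped with %c\n", 'A' + line, 'A' + other);
            return;
        }
    }
}

int main(void)
{
    access_line(1);   /* request B: swap A and B  -> [B A C D] */
    access_line(1);   /* request B: hit in stacked DRAM        */
    access_line(2);   /* request C: swap C and B  -> [C A B D] */
    return 0;
}
```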

LINE LOCATION TABLE (LLT) Each congruence group has a location table. For the state [C A B D], the LLT maps each requested line (A, B, C, D) to its current physical location within the group.

LINE LOCATION TABLE (LLT) Size of the location table per congruence group: each of the 4 locations needs log2(4) = 2 bits, so 4 lines take 8 bits (1 byte) per group. With 64M congruence groups the LLT totals 64MB, so storing the LLT in SRAM is impractical.
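As a quick arithmetic check of the sizing on this slide, using the 4GB stacked capacity and 64B line size stated elsewhere in the talk:

```c
/* Quick arithmetic check of the LLT sizing from the slide. */
#include <stdio.h>

int main(void)
{
    unsigned long long lines_per_group = 4;                 /* A, B, C, D */
    unsigned long long bits_per_entry  = 2;                 /* log2(4)    */
    unsigned long long bytes_per_group =
        lines_per_group * bits_per_entry / 8;               /* 1 byte     */
    unsigned long long groups = (4ULL << 30) / 64;          /* 4GB / 64B  */
    printf("LLT size = %llu MB\n",
           groups * bytes_per_group >> 20);                 /* 64 MB      */
    return 0;
}
```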

LLT IN DRAM Storing the LLT in DRAM incurs serialization latency. We optimize for the common case, a hit in stacked DRAM, by co-locating the Line Location Table of each congruence group with its data in stacked DRAM: a 1-byte LLT entry is stored alongside each 64-byte data line, forming a LEAD (Location Entry And Data), at a small loss in stacked-DRAM capacity.
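One way to picture the co-located layout, as a sketch only: it assumes each stacked-DRAM line is stored next to its group's 1-byte LLT entry, and the struct, field names, and bit packing are invented for illustration:

```c
/* Illustrative layout of a LEAD (Location Entry And Data) in stacked DRAM:
 * the 64-byte data line and the 1-byte LLT entry for its congruence group
 * are fetched together, so a hit needs only one stacked-DRAM access. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint8_t data[64];   /* the 64-byte cache line itself                   */
    uint8_t llt;        /* 4 entries x 2 bits: locations of lines A,B,C,D  */
} lead_t;

/* Extract the 2-bit location of line `which` (0=A .. 3=D) from the LLT byte. */
static unsigned lead_location(const lead_t *lead, unsigned which)
{
    return (lead->llt >> (2 * which)) & 0x3u;
}

int main(void)
{
    /* State [C A B D]: A is at location 1, B at 2, C at 0, D at 3.
     * Packed as 2-bit fields D|C|B|A = 11 00 10 01 = 0xC9. */
    lead_t lead = { .data = {0}, .llt = 0xC9 };
    for (unsigned i = 0; i < 4; i++)
        printf("line %c -> location %u\n", 'A' + i, lead_location(&lead, i));
    return 0;
}
```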

AVOID LLT LOOKUP LATENCY FOR HIT For lines resident in stacked memory, co-locating the Line Location Table of each congruence group with its data means a hit is served by a single stacked-DRAM access; we co-locate the LLT to avoid the lookup latency on hits.

AVOID LLT LOOKUP LATENCY FOR MISS For lines resident in off-chip memory, a Line Location Predictor is used to fetch the data from the predicted location in parallel with the LLT read; the LEAD then verifies the predicted location when both are ready.

AVOID LLT LOOKUP LATENCY FOR MISS The Line Location Predictor (LLP) makes an M-ary prediction among the possible locations (stacked, off-chip #1, off-chip #2, off-chip #3), using the instruction address and the line's last location, and costs 64 bytes per core.
 Predictor        Accuracy
 Always Stacked   70%
 LLP              92%
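A minimal sketch of such a predictor; the table size of 256 entries, the hash function, and the last-location update policy are assumptions chosen to match the 64-byte-per-core budget, not the paper's exact design:

```c
/* Illustrative line-location predictor: a per-core table indexed by a hash
 * of the load/store PC.  Each entry remembers the last location observed
 * for that PC (0 = stacked, 1-3 = off-chip region); 256 x 2-bit entries
 * match the 64-byte budget from the slide, though the sketch stores one
 * entry per byte for simplicity. */
#include <stdint.h>
#include <stdio.h>

#define LLP_ENTRIES 256                 /* 256 x 2 bits = 64 bytes per core */

static uint8_t llp[LLP_ENTRIES];        /* 2-bit locations, one per byte    */

static unsigned llp_index(uint64_t pc)
{
    return (unsigned)((pc >> 2) ^ (pc >> 10)) % LLP_ENTRIES;
}

static unsigned llp_predict(uint64_t pc)
{
    return llp[llp_index(pc)];
}

static void llp_update(uint64_t pc, unsigned actual_location)
{
    llp[llp_index(pc)] = (uint8_t)(actual_location & 0x3);
}

int main(void)
{
    uint64_t pc = 0x400a3c;
    printf("predicted location: %u\n", llp_predict(pc)); /* initially 0 (stacked) */
    llp_update(pc, 2);                  /* LLT later reports the line in region 2 */
    printf("predicted location: %u\n", llp_predict(pc)); /* now 2 */
    return 0;
}
```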

AVOIDING LLT LATENCY OVERHEAD On a hit in stacked DRAM, the co-located Line Location Table avoids a separate lookup; on a miss, the Line Location Predictor fetches data from the predicted location (stacked, off-chip #1, #2, or #3) in parallel. We co-locate the LLT and use the LLP to mitigate the latency overhead.

AGENDA Introduction; Background (Cache, Two-Level Memory); CAMEO (Concept, Implementation); Methodology; Results; Summary

METHODOLOGY Core: 3.2GHz, 2-wide out-of-order. Chip: 32 cores, 32MB 32-way shared L3 cache. Memory system: stacked DRAM, commodity DRAM, and SSD.

METHODOLOGY
             Stacked DRAM                    Commodity DRAM
 Capacity    4GB                             12GB
 Bus         DDR 3.2GHz, 128-bit             DDR 1.6GHz, 64-bit
 Latency     22ns                            44ns
 Channels    16 channels, 16 banks/channel   8 channels, 8 banks/channel

METHODOLOGY Baseline: 12GB off-chip DRAM. Cache: Alloy Cache [MICRO'12]. Two-Level Memory: page migration enabled. SSD latency: 32 microseconds. Workloads: SPEC2006 in rate mode, classified into Small Working Set (<12GB) and Large Working Set (>12GB).

PERFORMANCE IMPROVEMENT (Small Working Set) CAMEO is as good as Cache on Small WS applications.

PERFORMANCE IMPROVEMENT (Large Working Set) CAMEO outperforms TLM by 28% on Large WS applications; it outperforms both Cache and TLM and comes very close to DoubleUse.

EXECUTIVE SUMMARY How to use Stacked DRAM: Cache or Memory? Cache: software-transparent, fine-grained data transfer, but sacrifices memory capacity. Memory: larger memory capacity, but requires software support and coarse-grained data transfer. CAMEO: software-transparent, fine-grained data transfer, and almost full memory capacity. Results: CAMEO provides 78% speedup, outperforming both Cache (50%) and Two-Level Memory (50%).

Thank You! 35

A Cache-Like Memory Organization for 3D memory system CAMEO 12/15/2014 MICRO Cambridge, UK Chiachen Chou, Georgia Tech Aamer Jaleel, Intel Moinuddin K. Qureshi, Georgia Tech

Backup slides 37

LINE LOCATION TABLE Size of the location table per congruence group: with 4 locations, each entry needs log2(4) = 2 bits, so 4 lines take 8 bits (1 byte).
 # Locations   Size
 4             1 byte
 8             3 bytes

POWER AND ENERGY (results figure; annotated values: 14%, 34%)