15-740/18-740 Computer Architecture Lecture 5: Project Example Justin Meza Yoongu Kim Fall 2011, 9/21/2011.

Slides:

Advertisements

Similar presentations

Jaewoong Sim Alaa R. Alameldeen Zeshan Chishti Chris Wilkerson Hyesoon Kim MICRO-47 | December 2014.

Advertisements

Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das.

Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.

1 A Hybrid Adaptive Feedback Based Prefetcher Santhosh Verma, David Koppelman and Lu Peng Louisiana State University.

1 Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines Moinuddin K. Qureshi M. Aater Suleman Yale N. Patt HPCA 2007.

4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri , Yoongu Kim,

Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture

Data Mapping for Higher Performance and Energy Efficiency in Multi-Level Phase Change Memory HanBin Yoon*, Naveen Muralimanohar ǂ, Justin Meza*, Onur Mutlu*,

Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu,

1 Multi-Core Systems CORE 0CORE 1CORE 2CORE 3 L2 CACHE L2 CACHE L2 CACHE L2 CACHE DRAM MEMORY CONTROLLER DRAM Bank 0 DRAM Bank 1 DRAM Bank 2 DRAM Bank.

Phase Change Memory What to wear out today? Chris Craik, Aapo Kyrola, Yoshihisa Abe.

Lecture 12: DRAM Basics Today: DRAM terminology and basics, energy innovations.

1 Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections )

1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks Vivek Seshadri Samihan Yedkar ∙ Hongyi Xin ∙ Onur Mutlu Phillip.

Dyer Rolan, Basilio B. Fraguela, and Ramon Doallo Proceedings of the International Symposium on Microarchitecture (MICRO’09) Dec /7/14.

1 Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu.

The Dirty-Block Index Vivek Seshadri Abhishek Bhowmick ∙ Onur Mutlu Phillip B. Gibbons ∙ Michael A. Kozuch ∙ Todd C. Mowry.

HIgh Performance Computing & Systems LAB Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems Rachata Ausavarungnirun,

Thread Cluster Memory Scheduling : Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter.

The Memory Hierarchy 21/05/2009Lecture 32_CA&O_Engr Umbreen Sabir.

Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.

Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.

CSE378 Intro to caches1 Memory Hierarchy Memory: hierarchy of components of various speeds and capacities Hierarchy driven by cost and performance In early.

MIAO ZHOU, YU DU, BRUCE CHILDERS, RAMI MELHEM, DANIEL MOSSÉ UNIVERSITY OF PITTSBURGH Writeback-Aware Bandwidth Partitioning for Multi-core Systems with.

Adaptive GPU Cache Bypassing Yingying Tian *, Sooraj Puthoor†, Joseph L. Greathouse†, Bradford M. Beckmann†, Daniel A. Jiménez * Texas A&M University *,

Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research.

The Evicted-Address Filter

Achieving High Performance and Fairness at Low Cost Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, Onur Mutlu 1 The Blacklisting Memory.

Sunpyo Hong, Hyesoon Kim

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang Rachata Ausavarungnirun Chris Fallin Onur Mutlu.

Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Lavanya Subramanian 1.

Cache Issues Computer Organization II 1 Main Memory Supporting Caches Use DRAMs for main memory – Fixed width (e.g., 1 word) – Connected by fixed-width.

1 Lecture 16: Main Memory Innovations Today: DRAM basics, innovations, trends HW5 due on Thursday; simulations can take a few hours Midterm: 32 scores.

1 Lecture: Memory Basics and Innovations Topics: memory organization basics, schedulers, refresh,

Improving Multi-Core Performance Using Mixed-Cell Cache Architecture

UH-MEM: Utility-Based Hybrid Memory Management

Reducing Memory Interference in Multicore Systems

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Adaptive Cache Partitioning on a Composite Core

18-447: Computer Architecture Lecture 34: Emerging Memory Technologies

Understanding Latency Variation in Modern DRAM Chips Experimental Characterization, Analysis, and Optimization Kevin Chang Abhijith Kashyap, Hasan Hassan,

Cache Memory Presentation I

Morgan Kaufmann Publishers Memory & Cache

Bank-aware Dynamic Cache Partitioning for Multicore Architectures

ECE 445 – Computer Organization

Lecture 15: DRAM Main Memory Systems

Row Buffer Locality Aware Caching Policies for Hybrid Memories

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Lecture: DRAM Main Memory

Lecture 23: Cache, Memory, Virtual Memory

Lecture 22: Cache Hierarchies, Memory

Address-Value Delta (AVD) Prediction

Lecture: DRAM Main Memory

Memory Hierarchy Memory: hierarchy of components of various speeds and capacities Hierarchy driven by cost and performance In early days Primary memory.

CARP: Compression-Aware Replacement Policies

Adapted from slides by Sally McKee Cornell University

Reducing DRAM Latency via

Lecture 22: Cache Hierarchies, Memory

Milestone 2 Enhancing Phase-Change Memory via DRAM Cache

CS 3410, Spring 2014 Computer Science Cornell University

Lecture 21: Memory Hierarchy

Page Cache and Page Writeback

Architecting Phase Change Memory as a Scalable DRAM Alternative

Presented by Florian Ettinger

Presentation transcript:

15-740/ Computer Architecture Lecture 5: Project Example Justin Meza Yoongu Kim Fall 2011, 9/21/2011

Reminder: Project Proposals Project proposals due NOON on Monday 9/26 Two to three pages consisting of – Problem – Novelty – Idea – Hypothesis – Methodology – Plan All the details are in the project handout 2

Agenda for Today’s Class Brief background on hybrid main memories Project example from Fall 2010 Project pitches and feedback Q & A 3

Main Memory in Today’s Systems DRAM HDD/SSD CPU 4

HDD/SSD CPU Main Memory in Today’s Systems DRAM Main memory 5

DRAM Pros – Low latency – Low cost Cons – Low capacity – High power Some new and important applications require HUGE capacity (in the terabytes) 6

HDD/SSD CPU Main Memory in Today’s Systems DRAM Main memory 7

HDD/SSD CPU Hybrid Memory (Future Systems) Hybrid main memory DRAM (cache) New memories (high capacity) 8

Row Buffer Locality-Aware Hybrid Memory Caching Policies Justin Meza HanBin Yoon Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Motivation Two conflicting trends: 1.ITRS predicts the end of DRAM scalability 2.Workloads continue to demand more memory Want future memories to have – Large capacity – High performance – Energy efficient Need scalable DRAM alternatives 10

Motivation Emerging memories can offer more scalability Phase change memory (PCM) – Projected to be 3−12× denser than DRAM However, cannot simply replace DRAM – Longer access latencies (4−12× DRAM) – Higher access energies (2−40× DRAM) Use DRAM as a cache to large PCM memory 11 [Mohan, HPTS ’09; Lee+, ISCA ’09]

Phase Change Memory (PCM) Data stored in form of resistance – High current melts cell material – Rate of cooling determines stored resistance – Low current used to read cell contents 12

Projected PCM Characteristics (~2013) [Mohan, HPTS ’09; Lee+, ISCA ’09] nmDRAMPCMRelative to DRAM Cell size6 F 2 0.5–2 F 2 3–12× denser Read latency60 ns300–800 ns6–13× slower Write latency60 ns1400 ns24× slower Read energy1.2 pJ/bit2.5 pJ/bit2× more energy Write energy0.39 pJ/bit16.8 pJ/bit40× more energy DurabilityN/A10 6 –10 8 writesLimited lifetime

Row Buffers and Locality Memory array organized in columns and rows Row buffers store contents of accessed row Row buffers are important for mem. devices – Device slower than bus: need to buffer data – Fast accesses for data with spatial locality – DRAM: Destructive reads – PCM: Writes are costly: want to coalesce 14

LOAD X+1LOAD X Row Buffers and Locality 15 ADDRADDR ADDRADDR ROW DATA LOAD X+1 Row buffer miss! Row buffer hit!

Key Idea Since DRAM and PCM both use row buffers, – Row buffer hit latency same in DRAM and PCM – Row buffer miss latency small in DRAM – Row buffer miss latency large in PCM Cache data in DRAM which – Frequently row buffer misses – Is reused many times  because miss penalty is smaller in DRAM 16

Hybrid Memory Architecture 17 CPU Memory Controller DRAM Cache (Low density) DRAM Cache (Low density) PCM (High density) PCM (High density) Memory channel

Hybrid Memory Architecture 18 CPU DRAM Ctlr DRAM Cache (Low density) DRAM Cache (Low density) PCM (High density) PCM (High density) PCM Ctlr

Hybrid Memory Architecture 19 CPU Memory Controller DRAM Cache (Low density) DRAM Cache (Low density) PCM (High density) PCM (High density) Tag store: 2 KB rows

Hybrid Memory Architecture 20 CPU Tag store: X  DRAM Memory Controller DRAM Cache (Low density) DRAM Cache (Low density) PCM (High density) PCM (High density) LOAD X

Hybrid Memory Architecture 21 CPU How does data get migrated to DRAM?  Caching Policy How does data get migrated to DRAM?  Caching Policy Tag store: Y  PCM Memory Controller DRAM Cache (Low density) DRAM Cache (Low density) PCM (High density) PCM (High density) LOAD Y

Methodology Simulated our system configurations – Collected program traces using a tool called Pin – Fed instruction trace information to a timing simulator modeling an OoO core and DDR3 memory – Migrated data at the row (2 KB) granularity Collected memory traces from a standard computer architecture benchmark suite – SPEC CPU2006 Used an in-house simulator written in C# 22

Conventional Caching Data is migrated when first accessed Simple, used for many caches 23

Conventional Caching Data is migrated when first accessed Simple, used for many caches 24 CPU Memory Controller DRAM Cache (Low density) DRAM Cache (Low density) PCM (High density) PCM (High density) LD Rw1 Row Data Tag store: Z  PCM Bus contention! LD Rw2 How does conventional caching perform in a hybrid main memory?

Conventional Caching 25

Conventional Caching 26 Beneficial for some benchmarks

Conventional Caching 27 Performance degrades due to bus contention

Conventional Caching 28 Many row buffer hits: don’t need to migrate data

Conventional Caching 29 Want to identify data which misses in row buffer and is reused

Problems with Conventional Caching Performs useless migrations – Migrates data which are not reused – Migrates data which hit in the row buffer Causes bus contention and DRAM pollution – Want to cache rows which are reused – Want to cache rows which miss in row buffer 30

Problems with Conventional Caching Performs useless migrations – Migrates data which are not reused – Migrates data which hit in the row buffer Causes bus contention and DRAM pollution – Want to cache rows which are reused – Want to cache rows which miss in row buffer 31

A Reuse-Aware Policy Keep track of the number of accesses to a row Cache row in DRAM when accesses ≥ A – Reset accesses every Q cycles Similar to CHOP [Jiang+, HPCA ’10] – Cached “hot” (reused) pages in on-chip DRAM – To reduce off-chip bandwidth requirements We call this policy 32 A-COUNT

A Reuse-Aware Policy 33

A Reuse-Aware Policy 34 Performs fewer migrations: reduces channel contention

A Reuse-Aware Policy 35 Too few migrations: too many accesses go to PCM

A Reuse-Aware Policy 36 Rows with many hits still needlessly migrated

Problems with Reuse-Aware Policy Agnostic of DRAM/PCM access latencies – May keep data which row buffer misses in PCM – Missed opportunity: could save cycles in DRAM 37

Agnostic of DRAM/PCM access latencies Problems with Reuse-Aware Policy 38

Row Buffer Locality-Aware Policy Cache rows which benefit from being in DRAM – I.e., those with frequent row buffer misses Keep track of number of misses to a row Cache row in DRAM when misses ≥ M – Reset misses every Q cycles We call this policy 39 M-COUNT

Row Buffer Locality-Aware Policy 40

Row Buffer Locality-Aware Policy 41 Recognizes rows with many hits and does not migrate them

Row Buffer Locality-Aware Policy 42 Lots of data with just enough misses to get cached but little reuse after being cached  need to also track reuse

Combined Reuse/Locality Approach Cache rows with reuse and which frequently miss in the row buffer – Use A-COUNT as predictor of future reuse and – M-COUNT as predictor of future row buffer misses Cache row if accesses ≥ A and misses ≥ M We call this policy 43 AM-COUNT

Combined Reuse/Locality Approach 44

Combined Reuse/Locality Approach 45 Reduces useless migrations

Combined Reuse/Locality Approach 46 And data with little reuse kept out of DRAM

Dynamic Reuse/Locality Approach Previously mentioned policies require profiling – To determine the best A and M thresholds We propose a dynamic threshold policy – Performs a cost-benefit analysis every Q cycles – Simple hill-climbing algorithm to maximize benefit – (Side note: we simplify the problem slightly by just finding the best A threshold, because we observe that M = 2 performs the best for a given A.) 47

Cost-Benefit Analysis Each quantum, we measure the first-order costs and benefits of the current A threshold – Cost = cycles of bus contention due to migrations – Benefit = cycles saved at the banks by servicing a request in DRAM versus PCM Cost = Migrations × t migration Benefit = Reads DRAM × (t read,PCM − t read,DRAM ) + Writes DRAM × (t write,PCM − t write,DRAM ) 48

Cost-Benefit Maximization Algorithm 49 // net benefit // too many migrations? // increase threshold // last A beneficial // increasing benefit? // try next A // decreasing benefit // too strict, reduce

Dynamic Policy Performance 50

Dynamic Policy Performance 51 29% improvement over All PCM, Within 18% of All DRAM 29% improvement over All PCM, Within 18% of All DRAM

Evaluation Methodology/Metrics 16-core system Averaged across 100 randomly-generated workloads of varying working set size – LARGE = working set size > main memory size Weighted speedup (performance) = Maximum slowdown (fairness) = 52

16-core Performance & Fairness 53

16-core Performance & Fairness 54 More contention  more benefit

16-core Performance & Fairness 55 Dynamic policy can adjust to different workloads

Versus All PCM and All DRAM Compared to an All PCM main memory – 17% performance improvement – 21% fairness improvement Compared to an All DRAM main memory – Within 21% of performance – Within 53% of fairness 56

Robustness to System Configuration 57

Implementation/Hardware Cost Requires a tag store in memory controller – We currently assume 36 KB of storage per 16 MB of DRAM – We are investigating ways to mitigate this overhead Requires a statistics store – To keep track of accesses and misses 58

Conclusions DRAM scalability is nearing its limit – Emerging memories (e.g. PCM) offer scalability – Problem: must address high latency and energy We propose a dynamic, row buffer locality- aware caching policy for hybrid memories – Cache rows which miss frequently in row buffer – Cache rows which are reused many times 17/21% perf/fairness improvement vs. all PCM Within 21/53% perf/fairness of all DRAM system 59

Thank you! Questions? 60

Backup Slides 61

Related Work 62

PCM Latency 63

DRAM Cache Size 64

Versus All DRAM and All PCM 65

Performance vs. Statistics Store Size 66 (8 ways, LRU)

Performance vs. Statistics Store Size 67 Within ~1% of infinite storage with 200 B of storage (8 ways, LRU)

All DRAM 8 Banks 68

All DRAM 16 Banks 69

Simulation Parameters 70

Overview DRAM is reaching its scalability limits – Yet, memory capacity requirements are increasing Emerging memory devices offer scalability – Phase-change, resistive, ferroelectric, etc. – But, have worse latency/energy than DRAM We propose a scalable hybrid memory arch. – Use DRAM as a cache to phase change memory – Cache data based on row buffer locality and reuse 71

Methodology Core model – 3-wide issue with 128-entry instruction window – 32 KB L1 D-cache per core – 512 KB shared L2 cache per core Memory model – 16 MB DRAM / 512 MB PCM per core Scaled based on workload trace size and access patterns to be smaller than working set – DDR3 800 MHz, single channel, 8 banks per device – Row buffer hit: 40 ns – Row buffer miss: 80 ns (DRAM); 128, 368 ns (PCM) – Migrate data at 2 KB row granularity 72

Outline Overview Motivation/Background Methodology Caching Policies Multicore Evaluation Conclusions 73

16-core Performance & Fairness 74 Distributing data benefits small working sets, too TODO: change diagram to two channels so that this can be explained