Lecture 7: Caching in the Row Buffer of DRAM


1 Lecture 7: Caching in Row-Buffer of DRAM Adapted from “A Permutation-based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality” by X. Zhang et al.

2 A Bigger Picture [Diagram: the memory hierarchy from CPU registers, TLB, and L1/L2/L3 caches, across the CPU-memory bus to the DRAM row buffer and controller buffer, then across the I/O bus to the buffer cache and the disk cache on disk. The row buffer is one more caching level in this chain.]

3 DRAM Architecture [Diagram: the CPU/cache connects over a bus to a DRAM organized as banks 0 through n; each bank has its own core and its own row buffer.]

4 Caching in DRAM
DRAM is the center of the memory hierarchy:
–High density and high capacity
–Low cost but slow access (compared to SRAM)
A cache miss has long been treated as a constant delay; this is wrong:
–Non-uniform access latencies exist within DRAM
The row buffer serves as a fast cache in DRAM:
–Its access patterns have received little attention.
–Reusing row-buffer data minimizes DRAM latency.

5 DRAM Access
Precharge: charge a DRAM bank before a row access
Row access: activate a row (page) of a DRAM bank
Column access: select and return a block of data within an activated row
Refresh: periodically read and rewrite DRAM to keep data valid

6 [Timing diagram: the processor reaches the DRAM core through the row buffer; precharge, row access, and column access make up the DRAM latency before data moves over the bus.] Row-buffer misses come from a sequence of accesses to different pages in the same bank.

7 When to Precharge --- Open Page vs. Close Page
The policy determines when precharge is performed.
Close page: start precharge after every access
–May reduce latency for row-buffer misses
–Increases latency for row-buffer hits
Open page: delay precharge until a miss occurs
–Minimizes latency for row-buffer hits
–Increases latency for row-buffer misses
Which is better? It depends on the row-buffer miss rate.

8 Non-uniform DRAM Access Latency
Case 1: Row buffer hit — column access only (20+ ns)
Case 2: Row buffer miss, bank already precharged — row access + column access (40+ ns)
Case 3: Row buffer miss, bank not precharged — precharge + row access + column access (≈70 ns)

9 Amdahl’s Law applies in DRAM
As bandwidth improves, DRAM latency comes to dominate the cache-miss penalty.
[Chart: time (ns) to fetch a 128-byte cache block, split into a latency component and a bandwidth (transfer) component.]

10 Row Buffer Locality Benefit
Objective: serve memory requests from the row buffer, avoiding accesses to the DRAM core as much as possible. This reduces latency by up to 67%.

11 SPEC95: Miss Rate to Row Buffer
[Chart: row-buffer miss rates for SPECfp95 applications under the conventional page-interleaving scheme, with 32 DRAM banks and a 2 KB page size.]
Why is it so high? Can we reduce it?

12 Effective DRAM Bandwidth
Case 1: Row buffer hits — back-to-back column accesses keep the bus fully utilized
Case 2: Row buffer misses to different banks — row accesses overlap across banks, so data transfers stay pipelined
Case 3: Row buffer conflicts (misses to the same bank) — access 2 must wait for the precharge and row access, leaving a bubble on the bus before its data transfer

13 Conventional Data Layout in DRAM ---- Cacheline Interleaving
Consecutive cache lines are spread across banks: Bank 0 holds cachelines 0, 4, …; Bank 1 holds cachelines 1, 5, …; Bank 2 holds cachelines 2, 6, …; Bank 3 holds cachelines 3, 7, ….
Address format: page index (r bits) | upper page offset (p−b bits) | bank (k bits) | cacheline offset (b bits)
Spatial locality is not well preserved!

14 Conventional Data Layout in DRAM ---- Page Interleaving
Consecutive pages are spread across banks:
Bank 0: pages 0, 4, …  Bank 1: pages 1, 5, …  Bank 2: pages 2, 6, …  Bank 3: pages 3, 7, …
Address format: page index (r bits) | bank (k bits) | page offset (p bits)

15 Compare with Cache Mapping
Cache-related representation: cache tag (t bits) | cache set index (s bits) | block offset (b bits)
Cacheline interleaving: page index (r) | upper page offset (p−b) | bank (k) | block offset (b)
Page interleaving: page index (r) | bank (k) | page offset (p)
1. Observation: the bank index bits fall inside the cache set index bits.
2. Inference: for all x and y, if x and y conflict in the cache, then x and y conflict in the row buffer.

16 Sources of Row-Buffer Conflicts --- L2 Conflict Misses
L2 conflict misses may cause severe row-buffer conflicts. Example: assume x and y conflict in a direct-mapped cache (the address distance between x[0] and y[0] is a multiple of the cache size):
sum = 0; for (i = 0; i < 4; i++) sum += x[i] + y[i];

17 Sources of Row-Buffer Conflicts --- L2 Conflict Misses (Cont’d)
[Diagram: alternating accesses to x and y miss both in the cache line they share and in the row buffer they share.] Thrashing at both the cache and the row buffer!

18 Sources of Row-Buffer Conflicts --- L2 Writebacks
Writebacks interfere with reads at the row buffer:
–Writeback addresses conflict in the L2 with read addresses
Example: assume a writeback cache (the address distance between x[0] and y[0] is a multiple of the cache size):
for (i = 0; i < N; i++) y[i] = x[i];

19 Sources of Row-Buffer Conflicts --- L2 Writebacks (Cont’d)
[Diagram: loads of x, x+b, x+2b, … interleave with writebacks of y, y+b, y+2b, …; both streams map to the same row buffer, so every access conflicts.]

20 Key Issues
To exploit spatial locality, we should use the maximal interleaving granularity (the row-buffer size).
To reduce row-buffer conflicts, we cannot draw the bank bits only from the cache set index bits.
Page interleaving: page index (r) | bank (k) | page offset (p)
Cache mapping: cache tag (t) | cache set index (s) | block offset (b)

21 Permutation-based Interleaving
[Diagram: the k conventional bank bits are XORed with the low k bits of the L2 cache tag to form the new bank index; the page index and page offset are unchanged.]

22 Scheme Properties (1)
L2-conflicting addresses are distributed onto different banks.
Under conventional interleaving, L2-conflicting addresses carry the same bank index; under permutation-based interleaving, XORing with their distinct tag bits yields different bank indexes.

23 Scheme Properties (2)
The spatial locality of memory references is preserved.
Addresses within one page share both their bank bits and their tag bits, so the XOR maps them to the same bank index under either scheme.

24 Scheme Properties (3)
Pages are uniformly mapped onto ALL memory banks (C = cache size, P = page size):
bank 0   bank 1   bank 2   bank 3
0        P        2P       3P
4P       5P       6P       7P
…        …        …        …
C+1P     C        C+3P     C+2P
C+5P     C+4P     C+7P     C+6P
…        …        …        …
2C+2P    2C+3P    2C       2C+1P
2C+6P    2C+7P    2C+4P    2C+5P
…        …        …        …

25 Experimental Environment
SimpleScalar, simulating an XP1000
Processor: 500 MHz
L1 cache: 32 KB instruction, 32 KB data
L2 cache: 2 MB, 2-way, 64-byte block
MSHR: 8 entries
Memory bus: 32 bytes wide, 83 MHz
Banks:
Row-buffer size: 1–8 KB
Precharge: 36 ns; Row access: 36 ns; Column access: 24 ns

26 Row-buffer Miss Rate for SPECfp95

27 Miss Rate for SPECint95 & TPC-C

28 Miss Rate of Applu: 2KB Buf. Size

29 Comparison of Memory Stall Time

30 Improvement of IPC

31 Contributions of the Work
We study interleaving for DRAM:
–DRAM has a row buffer that acts as a natural cache
We study page interleaving in the context of superscalar processors:
–Memory stall time is sensitive to both latency and effective bandwidth
–Cache miss patterns have a direct impact on row-buffer conflicts, and thus on access latency
–Address-mapping conflicts at the cache level, including address conflicts and writeback conflicts, inevitably propagate to DRAM under a standard memory interleaving method, causing significant memory access delays
We propose the permutation-based interleaving technique as a low-cost solution to these conflict problems.

32 Conclusions
Row-buffer conflicts can significantly increase memory stall time; we have analyzed the sources of these conflicts.
Our permutation-based page-interleaving scheme effectively reduces row-buffer conflicts and exploits data locality.