1 CSCI 2510 Computer Organization Memory System II Cache In Action

2 Cache-Main Memory Mapping
A mapping records which part of the Main Memory is currently in the cache. Design concerns:
- Be efficient: fast determination of cache hits/misses
- Be effective: make full use of the cache; increase the probability of cache hits
In the following discussion, we assume:
- 16-bit addresses, i.e. 2^16 = 64K bytes in the Main Memory
- One word = one byte
- Synonym: cache line === cache block

3 Imagine: Trivial Conceptual Case
Cache size == Main Memory size == 64KB
Trivial one-to-one mapping
Do we need Main Memory any more?
[Figure: CPU (fastest) connected to a 64KB Cache (fast) and a 64KB Main Memory (slow)]

4 Reality: Cache Block/ Cache Line
Cache size is much smaller than the Main Memory size
A block in the Main Memory maps to a block in the Cache: a many-to-one mapping
[Figure: Main Memory Blocks 0..4095 mapping onto Cache Blocks 0..127, each cache block holding a tag]

5 Direct Mapping
Block j of Main Memory maps to Cache block (j mod 128) [same colour in the figure]
A cache hit occurs if the stored tag matches the tag field of the desired address
- There are 2^4 = 16 words (bytes) in a block
- There are 2^7 = 128 Cache blocks
- There are 2^(7+5) = 2^7 x 2^5 = 4096 Main Memory blocks
16-bit Main Memory address = 5-bit tag | 7-bit Cache block number | 4-bit word/byte address within the block; the tag and block fields together form the 12-bit Main Memory block number/address
[Figure: Main Memory Blocks 0..4095 mapped onto Cache Blocks 0..127]

6 Direct Mapping
The memory address is divided into 3 fields:
- The Cache block number determines the position of the block in the cache
- The tag keeps track of which block is in the cache (many Main Memory blocks can map to the same cache position)
- The last 4 bits of the address select the target word in the block
Given an address t,b,w (16-bit):
- See if the block is already in cache by comparing t with the tag stored in block b
- If not, cache miss! Replace the current block at b with the new one from memory block t,b (12-bit)
E.g. the CPU is looking for [A7B4]:
- MAR = 1010 0111 1011 0100
- Go to cache block 1111011 (123)
- See if the stored tag is 10100
- If YES, cache hit! Get the word/byte 0100 in the cache block
(A lookup sketch follows below.)
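To make the field decomposition concrete, here is a minimal C sketch of the direct-mapped lookup just described. It is illustrative only; the names (dm_lookup, cache_tags, cache_valid) are not from the slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define WORD_BITS  4    /* 2^4 = 16 bytes per block */
    #define BLOCK_BITS 7    /* 2^7 = 128 cache blocks   */

    static uint8_t cache_tags[1 << BLOCK_BITS];   /* 5-bit tag per block */
    static bool    cache_valid[1 << BLOCK_BITS];

    /* Decompose a 16-bit address and test for a direct-mapped hit. */
    bool dm_lookup(uint16_t addr)
    {
        uint16_t word  = addr & ((1 << WORD_BITS) - 1);                 /* bits 3..0   */
        uint16_t block = (addr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1); /* bits 10..4  */
        uint16_t tag   = addr >> (WORD_BITS + BLOCK_BITS);              /* bits 15..11 */
        (void)word;   /* on a hit, the word offset selects the byte within the block */
        return cache_valid[block] && cache_tags[block] == tag;
    }

    /* For addr = 0xA7B4: tag = 10100, block = 1111011 (123), word = 0100. */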

7 Associative Mapping
In direct mapping, a Main Memory block is restricted to a fixed position in the Cache (determined by mod)
Associative mapping allows a Main Memory block to reside in any Cache block location
All 128 tag entries must be compared with the address tag in parallel (by hardware)
16-bit Main Memory address = 12-bit tag | 4-bit word/byte address; the tag width is 12 bits
E.g. the CPU is looking for [A7B4]:
- MAR = 1010 0111 1011 0100
- See if the tag 1010 0111 1011 matches one of the 128 cache tags
- If YES, cache hit! Get the word/byte 0100 in the matching (BINGO) cache block
[Figure: any Main Memory block 0..4095 can occupy any Cache block 0..127]
(A lookup sketch follows below.)
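A hedged C sketch of the fully associative lookup. In hardware all 128 comparisons happen at once; the loop below only models the outcome. The names assoc_tags/assoc_valid/assoc_lookup are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCKS 128

    static uint16_t assoc_tags[BLOCKS];   /* 12-bit tag per block */
    static bool     assoc_valid[BLOCKS];

    /* Returns the matching block index, or -1 on a miss.
     * addr >> 4 drops the 4-bit word/byte offset, leaving the 12-bit tag. */
    int assoc_lookup(uint16_t addr)
    {
        uint16_t tag = addr >> 4;
        for (int b = 0; b < BLOCKS; b++)   /* done in parallel by hardware */
            if (assoc_valid[b] && assoc_tags[b] == tag)
                return b;
        return -1;
    }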

8 Set Associative Mapping
A combination of direct and associative mapping
Main Memory blocks 0, 64, 128, …, 4032 [same colour in the figure] map to cache set 0 and can occupy either of the 2 positions within that set
A cache with k blocks per set is called a k-way set associative cache
- (j mod 64) derives the Set Number
- Within the target set, compare the 2 tag entries with the address tag in parallel (by hardware)
16-bit Main Memory address = 6-bit tag | 6-bit set number | 4-bit word/byte address; the tag width is 6 bits
[Figure: Cache blocks 0..127 grouped in pairs into Sets 0..63]

9 Set Associative Mapping
E.g. 2-way set associative: the CPU is looking for [A7B4]:
- MAR = 1010 0111 1011 0100
- Go to cache Set 111011 (59 decimal), i.e. Block 118 or Block 119
- See if ONE of the TWO tags in the set is 101001
- If YES, cache hit! Get the word/byte 0100 in the matching (BINGO) cache block
(Tags not shown in the figure but are needed; a lookup sketch follows below.)
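Continuing the C style of the earlier sketches, a minimal 2-way set-associative decomposition and tag check; sa_tags/sa_valid/sa_lookup are illustrative names, not from the slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define SET_BITS 6    /* 2^6 = 64 sets         */
    #define WAYS     2    /* 2-way set associative */

    static uint8_t sa_tags[1 << SET_BITS][WAYS];   /* 6-bit tag per way */
    static bool    sa_valid[1 << SET_BITS][WAYS];

    /* Returns the way index on a hit, or -1 on a miss. */
    int sa_lookup(uint16_t addr)
    {
        uint16_t set = (addr >> 4) & ((1 << SET_BITS) - 1);  /* bits 9..4   */
        uint16_t tag = addr >> (4 + SET_BITS);               /* bits 15..10 */
        for (int way = 0; way < WAYS; way++)   /* in parallel in hardware */
            if (sa_valid[set][way] && sa_tags[set][way] == tag)
                return way;
        return -1;                             /* miss */
    }

    /* For addr = 0xA7B4: set = 111011 (59), tag = 101001, word = 0100. */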

10 Replacement Algorithms
Direct mapped cache:
- The position of each block is fixed
- Whenever replacement is needed (i.e. cache miss, so a new block must be loaded), the choice is obvious, so no "replacement algorithm" is needed
Associative and Set Associative:
- Need to decide which block to replace (and thus keep/retain blocks likely to be used again in the near future)
- One strategy is least recently used (LRU); e.g. for a 4-block/set cache, use a log2(4) = 2-bit counter for each block:
  Reset the counter to 0 whenever the block is accessed; increment the counters of the other blocks in the same set
  On a cache miss, replace/uncache the block whose counter has reached 3
- Another is random replacement (choose a random block); its advantage is that it is easier to implement at high speed
(An LRU counter sketch follows below.)
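A minimal sketch of the 2-bit-counter LRU scheme just described, for one 4-way set. Note one refinement over the slide's simple rule of incrementing all other counters: incrementing only the blocks that were more recently used than the accessed one keeps every counter within 0..3. Function and array names are illustrative.

    #include <stdint.h>

    #define WAYS 4   /* 4 blocks per set -> log2(4) = 2-bit counters, values 0..3 */

    /* Update the LRU counters of one set when block 'used' is accessed.
     * The accessed block becomes most recent (counter 0); blocks that were
     * more recent than it are aged by one. */
    void lru_access(uint8_t ctr[WAYS], int used)
    {
        for (int w = 0; w < WAYS; w++)
            if (w != used && ctr[w] < ctr[used])
                ctr[w]++;        /* age only blocks younger than the one used */
        ctr[used] = 0;           /* accessed block is now the most recent */
    }

    /* On a miss, the victim is the block whose counter has reached WAYS-1 (= 3). */
    int lru_victim(const uint8_t ctr[WAYS])
    {
        for (int w = 0; w < WAYS; w++)
            if (ctr[w] == WAYS - 1)
                return w;
        return 0;                /* fallback before the set has warmed up */
    }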

11 Cache Example
Assume separate instruction and data caches; we consider only the data
The cache has space for 8 blocks
A block contains one 16-bit word (i.e. 2 bytes)
A[10][4] is an array of words located at 7A00-7A27 (word addresses) in row-major order
The program normalizes the first column of the array (a vertical vector) by its average:

    short A[10][4];
    int sum = 0;
    int j, i;
    double mean;

    // forward loop
    for (j = 0; j <= 9; j++)
        sum += A[j][0];
    mean = sum / 10.0;

    // backward loop
    for (i = 9; i >= 0; i--)
        A[i][0] = A[i][0] / mean;

12 Cache Example
[Figure: array contents A[0][0]..A[9][3] (40 elements), with the tag fields shown for Direct Mapped, Set-Associative and Associative mapping]
To simplify the discussion in this example:
- 16-bit word addresses; byte addresses are not shown
- One item == one block == one word == two bytes
- 8 blocks in the cache; 3 bits encode the cache block number
- 4 blocks/set, 2 cache sets; 1 bit encodes the cache set number

13 Direct Mapping
The least significant 3 bits of the (word) address determine the location in the cache
No replacement algorithm is needed in Direct Mapping
When i = 9 and i = 8 we get cache hits (2 hits in total)
Only 2 of the 8 cache positions are used: very inefficient cache utilization
Content of the data cache after each loop pass (time line: j = 0, 1, …, 9, then i = 9, 8, …, 0):
- Block 0 (even rows): A[0][0], A[2][0], A[4][0], A[6][0], A[8][0] during the forward loop; hit at i = 8; then A[6][0], A[4][0], A[2][0], A[0][0] during the backward loop
- Block 4 (odd rows): A[1][0], A[3][0], A[5][0], A[7][0], A[9][0] during the forward loop; hit at i = 9; then A[7][0], A[5][0], A[3][0], A[1][0] during the backward loop
(Tags not shown but are needed; see the simulation sketch below)
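To check the hit count, here is a small, self-contained C sketch that simulates this direct-mapped data cache over the example's access pattern. It is purely illustrative and assumes the slide's model: one word per block, block = word address mod 8.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* 8 one-word blocks; word address of A[r][c] = 0x7A00 + 4*r + c */
        uint16_t block_addr[8];
        int      valid[8] = {0}, hits = 0;

        uint16_t trace[20];
        int n = 0;
        for (int j = 0; j <= 9; j++) trace[n++] = 0x7A00 + 4 * j;  /* forward  */
        for (int i = 9; i >= 0; i--) trace[n++] = 0x7A00 + 4 * i;  /* backward */

        for (int k = 0; k < n; k++) {
            int b = trace[k] & 7;                     /* low 3 bits -> block */
            if (valid[b] && block_addr[b] == trace[k]) hits++;
            else { valid[b] = 1; block_addr[b] = trace[k]; }
        }
        printf("hits = %d\n", hits);                  /* prints: hits = 2 */
        return 0;
    }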

14 Associative Mapping
With the LRU replacement policy we get cache hits for i = 9, 8, …, 2 (8 hits in total)
If the i loop were a forward one, we would get no hits!
Content of the data cache (time line: j = 0, …, 9, then i = 9, …, 0):
- Blocks 0-7 are filled with A[0][0]-A[7][0] during j = 0..7
- j = 8 evicts the LRU block A[0][0], so Block 0 becomes A[8][0]; j = 9 evicts A[1][0], so Block 1 becomes A[9][0]
- The backward loop then hits for i = 9 down to i = 2; i = 1 reloads A[1][0] into Block 1 and i = 0 reloads A[0][0] into Block 0
(Tags and LRU counters not shown but are needed)

15 Set Associative Mapping
Since all accessed words have even addresses (7A00, 7A04, 7A08, ...), they all map to set 0, so only half of the cache is used
With the LRU replacement policy we get hits for i = 9, 8, 7 and 6 (4 hits)
Random replacement would have better average performance here
If the i loop were a forward one, we would get no hits!
Content of set 0 (time line: j = 0, …, 9, then i = 9, …, 0); set 1 stays empty:
- Block 0: A[0][0], A[4][0], A[8][0] during the forward loop; then A[4][0], A[0][0]
- Block 1: A[1][0], A[5][0], A[9][0] during the forward loop; then A[5][0], A[1][0]
- Block 2: A[2][0], A[6][0]; then A[2][0]
- Block 3: A[3][0], A[7][0]; then A[3][0]
(Tags and LRU counters not shown but are needed)

16 Comments on the Example
In this example, Associative performs best, then Set-Associative, and lastly Direct Mapping
What are the advantages and disadvantages of each scheme?
In practice:
- Hit rates as low as in this example are very rare
- Set-Associative with an LRU replacement scheme is the usual choice
- Larger blocks and more blocks, i.e. more cache memory, greatly improve the cache hit rate

17 Real-life Example: Intel Core 2 Duo
[Figure: two cores, each with its own processing unit, 32KB L1 instruction cache and 32KB L1 data cache, sharing a 4MB L2 cache connected via the system bus to main memory and input/output]

18 Real-life Example: Intel Core 2 Duo
Number of processors: 1
Number of cores: 2 per processor
Number of threads: 2 per processor
Name: Intel Core 2 Duo E6600
Code Name: Conroe
Specification: Intel(R) Core(TM)2 CPU 2.40GHz
Technology: 65 nm
Core Speed: 2400 MHz
Multiplier x Bus speed: 9.0 x 266.7 MHz ≈ 2400 MHz
Front-Side-Bus speed: 4 x 266.7 MHz ≈ 1066 MHz
Instruction sets: MMX, SSE, SSE2, SSE3, SSSE3, EM64T
L1 Data cache: 2 x 32 KBytes, 8-way set associative, 64-byte line size
L1 Instruction cache: 2 x 32 KBytes, 8-way set associative, 64-byte line size
L2 cache: 4096 KBytes, 16-way set associative, 64-byte line size

19 Memory Module Interleaving
Processor and cache are fast; main memory is slow
Try to hide access latency by interleaving memory accesses across several memory modules
Each memory module has its own Address Buffer Register (ABR) and Data Buffer Register (DBR)
Which scheme below interleaves better?
- (a) Consecutive words in a module: the high-order k bits of the MM address select the module, the low-order m bits give the address within the module
- (b) Consecutive words in consecutive modules: the low-order k bits select the module, the high-order m bits give the address within the module
[Figure: 2^k modules (0 .. 2^k - 1), each with its own ABR and DBR, for both address layouts]

20 Memory Module Interleaving
Memory interleaving is realized in technologies such as "Dual Channel Memory Architecture": two or more compatible (ideally identical) memory modules are used in matching banks
Within a memory module, 8 chips are in fact used in "parallel" to achieve an 8 x 8 = 64-bit memory bus; this is also a kind of memory interleaving

21 Memory Interleaving Example
Suppose we have a cache read miss and need to load a block from main memory
Assume a cache with 8-word blocks (cache line size = 8 bytes)
Assume it takes one clock to send the address to the DRAM and one clock to send each data word back; in addition, the DRAM has a 6-cycle latency for the first word, while each subsequent word in the same row takes only 4 cycles
Read 8 bytes without interleaving: 1 + (1 x 6) + (7 x 4) + 1 = 36 cycles
With a 4-module interleaved scheme: 1 + (1 x 6) + (1 x 8) = 15 cycles
- The transfer of the first 4 words is overlapped with the access of the second 4 words
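The two cycle counts can be reproduced with a small helper; a sketch under the slide's timing assumptions only (function names are illustrative, and the interleaved formula assumes enough modules that accesses fully overlap with transfers, as with 4 modules here).

    #include <stdio.h>

    /* Timing model from the slide: 1 cycle to send the address, 6 cycles DRAM
     * latency for the first word, 4 cycles per subsequent word within a
     * module, 1 cycle per word on the bus. */
    int cycles_no_interleave(int words)
    {
        return 1 + 6 + (words - 1) * 4 + 1;   /* +1: last word's bus cycle */
    }

    int cycles_interleaved(int words)
    {
        return 1 + 6 + words * 1;             /* one word per cycle back */
    }

    int main(void)
    {
        printf("%d %d\n", cycles_no_interleave(8), cycles_interleaved(8));
        /* prints: 36 15 */
        return 0;
    }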

22 Memory Without Interleaving
Single memory read: 1 + 6 + 1 = 8 cycles
Read 8 bytes (non-interleaved): 1 + (1 x 6) + (7 x 4) + 1 = 36 cycles
[Timing diagram not shown]

23 Four Memory Modules Interleaved
Read 8 bytes (interleaved): 1 + (1 x 6) + (1 x 8) = 15 cycles
[Timing diagram not shown]

24 Cache Hit Rate and Miss Penalty
The goal is a big memory system with the speed of the cache
High cache hit rates (> 90%) are essential
- The miss penalty must also be reduced
Notation:
- h is the hit rate
- M is the miss penalty
- C is the cache access time
- Average memory access time = h * C + (1 - h) * M

25 Cache Hit Rate and Miss Penalty
Optimistically assume 8 cycles to read a single memory item and 15 cycles to load an 8-byte block from main memory (previous example); cache access time = 1 cycle
For every 100 instructions, statistically 30 instructions are data reads/writes:
- Instruction fetch: 100 memory accesses, assume hit rate = 0.95
- Data read/write: 30 memory accesses, assume hit rate = 0.90
Execution cycles without cache = (100 + 30) x 8 = 1040
Execution cycles with cache = 100(0.95x1 + 0.05x15) + 30(0.9x1 + 0.1x15) = 170 + 72 = 242
Ratio = 1040 / 242 ≈ 4.30, i.e. the cache speeds up execution more than 4 times!
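A short C check of this arithmetic using the average-access-time formula from the previous slide (avg_access is an illustrative name):

    #include <stdio.h>

    /* Average access time per the slide: h*C + (1-h)*M */
    static double avg_access(double h, double c, double m)
    {
        return h * c + (1.0 - h) * m;
    }

    int main(void)
    {
        /* Per 100 instructions: 100 fetches (h=0.95), 30 data accesses
         * (h=0.90); cache access 1 cycle, miss penalty 15 cycles,
         * no cache: 8 cycles per access. */
        double without = (100 + 30) * 8.0;
        double with    = 100 * avg_access(0.95, 1, 15)
                       +  30 * avg_access(0.90, 1, 15);
        printf("without=%.0f with=%.0f ratio=%.2f\n",
               without, with, without / with);
        /* prints: without=1040 with=242 ratio=4.30 */
        return 0;
    }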

26 Summary
Cache Organizations: Direct, Associative, Set-Associative
Cache Replacement Algorithms: Random, Least Recently Used
Memory Interleaving
Cache Hit Rate and Miss Penalty