Chapter 4 Cache Memory. Yunju Baek, yunju@pusan.ac.kr, 2006 spring.


Topics Computer Memory System Overview Memory Hierarchy Cache Memory Principles Elements of Cache Design Pentium and PowerPC Cache

Computer Memory System Overview Characteristics of memory systems Location Capacity Unit of transfer Access method Performance Physical type Physical characteristics Organization

Location CPU: registers, control memory Internal: main memory External: secondary memory

Capacity: Expressed in words or bytes (1 byte = 8 bits). Word size is the natural unit of organization; 8, 16, and 32 bits are common, and even 64 bits.

Unit of Transfer Number of data elements transferred at a time Internal Usually governed by data bus width External Usually a block which is much larger than a word

Addressable Unit: Smallest location which can be uniquely addressed (word or byte). E.g., Motorola 68000: word 16 bits, internal transfer unit 16 bits, addressable unit 8 bits (byte addressable). Let A = address length in bits and N = number of addressable units; then $2^A = N$.

Access Methods (1) Sequential Data does not have a unique address Start at the beginning and read through in order Must read intermediate data items until the desired item found Access time depends on location of data and previous location e.g. tape

Access Methods (2) Direct Individual blocks have unique addresses Access is by jumping to vicinity plus sequential search Access time depends on location and previous location e.g. disk

Access Methods (3) Random Individual addresses identify locations exactly Location can be selected randomly and addressed and accessed directly Access time is independent of location or previous access (i.e., constant) e.g. RAM

Access Methods (4) Associative A variation of random access Data is located by a comparison with contents of a portion of the store All words are searched simultaneously Access time is independent of location or previous access e.g. cache

Performance (1): Access time and memory cycle time. Access time: time between presenting the address and getting the valid data. For random access memory, this is the time to address the data unit and perform the transfer; for non-random access memory, it is the time to position the hardware mechanism at the desired position. Memory cycle time: primarily applied to random access memory. Time may be required for the memory to "recover" before the next access, so cycle time = access time + recovery time.

Performance (2): Transfer rate R (bps): rate at which data can be transferred into or out of memory. For random access memory, $R = 1/(\text{memory cycle time})$. For non-random access memory, $T_N = T_A + N/R$, where $T_N$ is the average time to read or write $N$ bits, $T_A$ is the average access time, and $N$ is the number of bits.
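The two transfer-rate relations on this slide can be sketched in a few lines of Python. The function names and the device figures (10 ms access time, 1 Mbit/s) are made up for illustration and do not come from the slides.

```python
def random_access_rate(cycle_time_s: float) -> float:
    """Transfer rate for random access memory: R = 1 / (memory cycle time)."""
    return 1.0 / cycle_time_s


def transfer_time(access_time_s: float, n_bits: int, rate_bps: float) -> float:
    """Average time to read or write n_bits on non-random access memory:
    T_N = T_A + N / R."""
    return access_time_s + n_bits / rate_bps


# Hypothetical device: T_A = 10 ms, R = 1 Mbit/s, transferring N = 8000 bits.
t_n = transfer_time(access_time_s=0.010, n_bits=8_000, rate_bps=1_000_000)
print(f"T_N = {t_n * 1000:.1f} ms")
```

For small transfers the access time term dominates, which is why non-random access devices move data in large blocks.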

Physical Types Semiconductor RAM Magnetic Disk & Tape Optical CD & DVD

Physical Characteristics Decay Volatility Erasability Power consumption

Organization Physical arrangement of bits into words Not always obvious e.g. interleaved

The Bottom Line: How much? Capacity. How fast? Time is money. How expensive? Cost/bit.

Memory Hierarchy (1): Major design objective of memory systems: provision of adequate storage capacity at an acceptable level of performance and a reasonable cost. Memory technologies: smaller access time → greater cost/bit; greater capacity → smaller cost/bit; greater capacity → greater access time. This is the DILEMMA. Solution: MEMORY HIERARCHY.

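Memory Hierarchy (2): If memory is organized according to A) - C), and data and instructions are distributed according to D), then overall cost is reduced and the level of performance is maintained. Here A) to C) are the trends going down the hierarchy (decreasing cost per bit, increasing capacity, increasing access time) and D) is decreasing frequency of access by the processor. How can we validate D)?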

Locality of Reference (1): Basis for validity of D). During the course of the execution of a program, memory references tend to cluster. Examples? Over a long period of time, the clusters in use migrate from one locality to another; over a short period of time, fixed clusters are used primarily. Current locality kept in high speed memory → average access time reduced.

Locality of Reference (2) Spatial locality Tendency of execution to involve a number of memory locations that are clustered E.g., sequential instruction access, subroutines, arrays, tables Temporal locality Tendency to access memory locations that have been used recently E.g., iteration loops

Typical Memory Hierarchy Registers In CPU Internal or Main memory May include one or more levels of cache “RAM” External memory Backing store

Hierarchy List Registers L1 Cache L2 Cache Main memory Disk cache Disk Optical Tape

Performance example (1): Assume a 2-level memory system. Level 1: access time $T_1$. Level 2: access time $T_2$. Hit ratio $H$: fraction of time a reference can be found in level 1. Average access time: $T_{ave} = \text{prob(found in level 1)} \times T(\text{found in level 1}) + \text{prob(not found in level 1)} \times T(\text{not found in level 1}) = H \times T_1 + (1 - H) \times (T_1 + T_2) = T_1 + (1 - H) T_2$

Performance example (2): Assume a 2-level memory system. Level 1: access time $T_1 = 1$ µs. Level 2: access time $T_2 = 10$ µs. Hit ratio $H = 95\%$. Average access time: $T_{ave} = H \times T_1 + (1 - H) \times (T_1 + T_2) = 0.95 \times 1 + (1 - 0.95) \times (1 + 10) = 0.95 + 0.05 \times 11 = 1.5$ µs

Performance example (3): Higher hit ratio → better performance
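A short Python sketch of the two-level average access time formula, using the numbers from the performance example (access times of 1 and 10 in the same time unit). The function name and the extra hit ratios in the loop are illustrative choices, not part of the slides.

```python
def average_access_time(hit_ratio: float, t1: float, t2: float) -> float:
    """Two-level memory: a hit costs T1, a miss costs T1 + T2.
    T_ave = H * T1 + (1 - H) * (T1 + T2), which simplifies to T1 + (1 - H) * T2."""
    return hit_ratio * t1 + (1.0 - hit_ratio) * (t1 + t2)


# T1 = 1 and T2 = 10, in the same (arbitrary) time unit as the example.
for h in (0.50, 0.90, 0.95, 0.99):
    print(f"H = {h:.2f}  T_ave = {average_access_time(h, 1.0, 10.0):.2f}")
```

As the hit ratio approaches 1, the average access time approaches T1, the access time of the faster level.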

So you want Speed? It is possible to build a computer which uses only static RAM (technique for cache) This would be very fast This would need no cache How can you cache cache? This would cost a very large amount Stick with memory hierarchy!

Cache Memory Principles Objective High speed Large memory size Less expensive memory system Cache Small amount of fast memory Sits between normal main memory and CPU May be located on CPU chip or module

Cache and Main Memory: Cache contains a copy of portions of main memory. Cache: smaller, faster. Main memory: larger, slower.

Cache operation - overview: Consider a READ operation. CPU requests contents of a memory location. Check cache for this data. If present, get it from cache (fast). If not present, read the required block from main memory into cache, then deliver from cache to CPU. Q: Why deliver a whole block into cache? Cache includes tags to identify which block of main memory is in each cache slot.
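The read flow above can be sketched as follows. This is a minimal illustration under stated assumptions: the block size of 4 words, the dictionary used as the cache (with unlimited capacity, so no replacement happens), and the memory contents are all invented for the example.

```python
BLOCK_SIZE = 4  # words per block (assumed for illustration)


def read(address, cache, main_memory):
    """Return (word, hit). On a miss, the whole block is copied into the cache
    first, and the word is then delivered from the cache."""
    block_number, offset = divmod(address, BLOCK_SIZE)
    hit = block_number in cache  # the block number plays the role of the tag
    if not hit:
        start = block_number * BLOCK_SIZE
        cache[block_number] = main_memory[start:start + BLOCK_SIZE]
    return cache[block_number][offset], hit


main_memory = list(range(100, 164))  # 64 words of made-up data
cache = {}                           # block number -> list of words
results = [read(a, cache, main_memory) for a in (8, 9, 10, 11, 8)]
print(results)
```

Only the first access misses; the next three addresses fall in the same block and hit, which hints at the answer to the question on the slide: fetching a whole block exploits spatial locality.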

Typical Cache Organization

Cache/Main-Memory Structure: Main memory: $2^n$ addressable words, each word has a unique n-bit address; M fixed-length blocks of K words each → $M = 2^n / K$. Cache: C slots (lines) of K words each, with C << M.

Cache/Main-Memory Structure At any time, some subset of blocks resides in lines As C << M, each line includes a tag indicating which block is being stored tag is a portion of an address


Elements of Cache Design Size Mapping Function Replacement Algorithm Write Policy Block Size Number of Caches

Size does matter: Usually 1K - 512K. Cost: more cache is more expensive. Speed: more cache is faster (up to a point), but checking cache for data takes time.

Comparison of Cache Sizes
Columns: Processor; Type; Year of introduction; L1 cache (a); L2 cache; L3 cache
IBM 360/85; Mainframe; 1968; 16 to 32 KB; -
PDP-11/70; Minicomputer; 1975; 1 KB
VAX 11/780; 1978; 16 KB
IBM 3033; 64 KB
IBM 3090; 1985; 128 to 256 KB
Intel 80486; PC; 1989; 8 KB
Pentium; 1993; 8 KB/8 KB; 256 to 512 KB
PowerPC 601; 32 KB
PowerPC 620; 1996; 32 KB/32 KB
PowerPC G4; PC/server; 1999; 256 KB to 1 MB; 2 MB
IBM S/390 G4; 1997; 256 KB
IBM S/390 G6; 8 MB
Pentium 4; 2000
IBM SP; High-end server/supercomputer; 64 KB/32 KB
CRAY MTA (b); Supercomputer
Itanium; 2001; 16 KB/16 KB; 96 KB; 4 MB
SGI Origin 2001; High-end server
Itanium 2; 2002; 6 MB
IBM POWER5; 2003; 1.9 MB; 36 MB
CRAY XD-1; 2004; 64 KB/64 KB; 1 MB
(a) Two values separated by a slash refer to instruction and data caches. (b) Both caches are instruction only; no data caches.

Mapping Function: Algorithms for mapping main memory blocks to cache lines. Needed, as C << M. Approaches: direct, associative, set associative.

Mapping Function Example: Cache of 64 KBytes. Cache block of 4 bytes, i.e. cache is 16K ($2^{14}$) lines of 4 bytes (why?). 16 MBytes main memory, byte addressable: 24-bit address ($2^{24}$ = 16M), 4M blocks. C = 16K, M = 4M, C << M.
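The figures in this example follow from simple division, which the sketch below reproduces in Python. The constant and variable names are illustrative only.

```python
CACHE_BYTES = 64 * 1024           # 64 KByte cache
BLOCK_BYTES = 4                   # 4-byte blocks (lines)
MEMORY_BYTES = 16 * 1024 * 1024   # 16 MByte byte-addressable main memory

lines = CACHE_BYTES // BLOCK_BYTES              # C: number of cache lines
blocks = MEMORY_BYTES // BLOCK_BYTES            # M: number of main memory blocks
address_bits = (MEMORY_BYTES - 1).bit_length()  # bits needed to address every byte

print(f"C = {lines} lines, M = {blocks} blocks, address = {address_bits} bits")
```

This also answers the "(why?)" on the slide: 64 KBytes divided by 4 bytes per line gives 16K lines.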

Direct Mapping (1) Each block of main memory maps to only one possible cache line i.e. if a block is in cache, it must be in one specific place Mapping i = j mod m, where i : cache line number j : memory block number m : number of lines (i.e., C )

Direct Mapping (2): Example of mapping with 16 blocks and 4 lines. Line 0: blocks 0, 4, 8, 12. Line 1: blocks 1, 5, 9, 13. Line 2: blocks 2, 6, 10, 14. Line 3: blocks 3, 7, 11, 15. Which block is in the line? No two blocks that map to the same line have the same Tag field in their address. Check contents of cache by finding the line and then checking the Tag.
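The mapping i = j mod m can be used to regenerate the table on this slide. The sketch below is illustrative; the function name is not from the slides.

```python
def direct_map_table(num_blocks: int, num_lines: int) -> dict:
    """Group memory block numbers j by the cache line i = j mod m they map to."""
    table = {line: [] for line in range(num_lines)}
    for block in range(num_blocks):
        table[block % num_lines].append(block)
    return table


# 16 blocks and 4 lines, as in the example.
for line, mapped_blocks in direct_map_table(16, 4).items():
    print(f"line {line}: blocks {mapped_blocks}")
```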

Direct Mapping - Address Structure: Address is in 3 fields. The least significant w bits identify a unique word in a block (or line). The most significant s bits specify one memory block. The s MSBs are split into a cache line field of r bits, where $m = 2^r$ (or $C = 2^r$), and a tag of s - r (most significant) bits.

Direct Mapping Cache Line Table (cache line: main memory blocks held). Line 0: 0, m, 2m, …, $2^s - m$. Line 1: 1, m+1, 2m+1, …, $2^s - m + 1$. … Line m-1: m-1, 2m-1, 3m-1, …, $2^s - 1$.

Direct Mapping Cache Organization

Direct Mapping Example (1): Address fields: Tag (s - r) = 8 bits, Line or Slot (r) = 14 bits, Word (w) = 2 bits. 24-bit address: 2-bit word identifier (4-byte block) and 22-bit block identifier, which splits into an 8-bit tag (= 22 - 14) and a 14-bit slot or line. Again: no two blocks that map to the same line have the same Tag field. Check contents of cache by finding the line and checking the Tag.

Direct Mapping Example (2): Q1: Where in cache is the word from main memory location 16339D mapped? In binary, 16339D = 0001 0110 | 00 1100 1110 0111 | 01 (tag 8 bits | line 14 bits | word 2 bits). Ans: line 0CE7, tag 16, word offset 1. Q2: Where in cache is the word from main memory location ABCDEF mapped?
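The field extraction in Q1 is a matter of shifting and masking. The sketch below assumes the 8/14/2-bit layout of the running example; the function and constant names are illustrative. It can also be used to check an answer to Q2.

```python
WORD_BITS = 2   # w: byte within a 4-byte block
LINE_BITS = 14  # r: one of 2**14 cache lines


def split_direct(address: int):
    """Split a 24-bit address into (tag, line, word) for direct mapping."""
    word = address & ((1 << WORD_BITS) - 1)
    line = (address >> WORD_BITS) & ((1 << LINE_BITS) - 1)
    tag = address >> (WORD_BITS + LINE_BITS)
    return tag, line, word


for addr in (0x16339D, 0xABCDEF):
    tag, line, word = split_direct(addr)
    print(f"{addr:06X}: tag {tag:02X}, line {line:04X}, word {word}")
```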

Direct Mapping Summary: Address length = (s + w) bits. Number of addressable units = $2^{s+w}$ words or bytes. Block size = line size = $2^w$ words or bytes. Number of blocks in main memory = $2^{s+w} / 2^w = 2^s$. Number of lines in cache = $m = 2^r$. Size of tag = (s - r) bits.

Direct Mapping pros & cons: Advantages: simple; inexpensive to implement. Disadvantage: fixed location for a given block. If a program repeatedly accesses 2 blocks that map to the same line, the cache miss rate is very high → these blocks will be continually swapped in and out → hit ratio will be low.
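The disadvantage can be demonstrated with a tiny simulation. This is a sketch under simplifying assumptions: the cache has 4 lines, tracks only block numbers, and the two access traces are invented for illustration.

```python
def direct_mapped_hit_ratio(block_trace, num_lines):
    """Simulate a direct-mapped cache that records which block each line holds."""
    lines = [None] * num_lines
    hits = 0
    for block in block_trace:
        index = block % num_lines   # i = j mod m
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block    # miss: the resident block is swapped out
    return hits / len(block_trace)


conflicting = [0, 4] * 50  # blocks 0 and 4 both map to line 0 when m = 4
friendly = [0, 1] * 50     # blocks 0 and 1 map to different lines

print(direct_mapped_hit_ratio(conflicting, 4))
print(direct_mapped_hit_ratio(friendly, 4))
```

The conflicting trace never hits even though three of the four lines stay empty, while the friendly trace misses only on the first touch of each block.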

Associative Mapping: A main memory block can load into any line of cache. Memory address is interpreted as tag and word. Tag uniquely identifies a block of memory. Cache searching gets expensive: every line's tag must be examined simultaneously for a match.

Fully Associative Cache Organization


Associative Mapping Address Structure Example: Address fields: Tag 22 bits, Word 2 bits (24-bit address). A 22-bit tag is stored with each 32-bit block of data. Compare the tag field with the tag entries in cache to check for a hit. The least significant 2 bits of the address identify which byte is required from the 32-bit data block.

Associative Mapping Example: Address fields: Tag 22 bits, Word 2 bits. Columns: Address; Tag; Cache line; Offset; Data. FFFFFC; 3FFFFF; 3FFF; 00; 24. 16339D; 058CE7; 0001; 01; DC. ABCDEF; ?; ?; ?; ?.
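A fully associative lookup compares the address tag against the tag of every line. The Python sketch below does this with a sequential loop, whereas the hardware compares all tags simultaneously. The tag 058CE7 and the byte DC at offset 01 come from the example row; the other bytes in the block and the list layout are made up.

```python
WORD_BITS = 2  # low 2 bits select a byte in the 4-byte block


def lookup(address, lines):
    """lines: list of (tag, block_bytes) entries, or None for an empty line.
    Returns the addressed byte on a hit, or None on a miss."""
    tag = address >> WORD_BITS               # 22-bit tag
    offset = address & ((1 << WORD_BITS) - 1)
    for entry in lines:                      # hardware checks all lines at once
        if entry is not None and entry[0] == tag:
            return entry[1][offset]
    return None


# The block for address 16339D may sit in any line; line 1 is an arbitrary choice.
lines = [None, (0x058CE7, bytes([0x11, 0xDC, 0x22, 0x33])), None]
print(lookup(0x16339D, lines))  # hit: byte at offset 1
print(lookup(0xABCDEF, lines))  # miss: no line holds tag 2AF37B
```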

Associative Mapping Summary: Address length = (s + w) bits. Number of addressable units = $2^{s+w}$ words or bytes. Block size = line size = $2^w$ words or bytes. Number of blocks in main memory = $2^{s+w} / 2^w = 2^s$. Number of lines in cache = undetermined. Size of tag = s bits.

Associative Mapping pros & cons: Advantage: flexible. Disadvantages: cost; complex circuit for simultaneous comparison.

Set Associative Mapping: Compromise between the previous two. Cache is divided into v sets of k lines each: m = v x k, where m is the number of lines. i = j mod v, where i is the cache set number and j is the memory block number. A given block maps to any line in a given set. This is a k-way set associative cache; 2-way and 4-way are common.

Set Associative Mapping Example: m = 16 lines, v = 8 sets → k = 2 lines/set (2-way set associative mapping). Assume 32 blocks in memory, i = j mod v. Set 0: blocks 0, 8, 16, 24. Set 1: blocks 1, 9, 17, 25. … Set 7: blocks 7, 15, 23, 31. A given block can be in one of 2 lines in only one set, e.g., block 17 can be assigned to either line 0 or line 1 in set 1.

Set Associative Mapping Address Structure: Address fields: Tag (s - d) bits, Set d bits, Word w bits. d bits: $v = 2^d$, specify one of v sets. s bits: specify one of $2^s$ blocks. Use the set field to determine which cache set to look in. Compare the tag field simultaneously with the tags in that set to see if we have a hit.

K Way Set Associative Cache Organization

Set Associative Mapping Example: Address fields: Tag 9 bits, Set 13 bits, Word 2 bits. Same example, 2-way set associative: $2^{14}$ lines, 2 lines/set → $2^{13}$ sets → $2^9$ blocks map to each set and can be loaded into either of the two lines in that set. Each block mapped into a set has a unique tag. E.g., columns: Address → (top 9 bits, low 15 bits); Tag; Set; Offset; Data. FFFFFF → (1FF, 7FFF); 1FF; 1FFF; 11; 68. 16339D → (02C, 339D); 02C; 0CE7; 01; DC. ABCDEF → ?; ?; ?; ?; ?.

Set Associative Mapping Summary: Address length = (s + w) bits. Number of addressable units = $2^{s+w}$ words or bytes. Block size = line size = $2^w$ words or bytes. Number of blocks in main memory = $2^s$. Number of lines in set = k. Number of sets = $v = 2^d$. Number of lines in cache = $kv = k \times 2^d$. Size of tag = (s - d) bits.

Remarks: Why is the simultaneous comparison cheaper here than in associative mapping? The tag is much smaller, and only the k tags within a set are compared. Relationship between set associative and the first two: they are the extreme cases of set associative. k = 1 → v = m → direct (1 line/set). k = m → v = 1 → associative (one big set).
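The relationship between the three mappings can be checked with one address-splitting function parameterized by k. This sketch assumes the running example ($2^{14}$ lines, 4-byte blocks, 24-bit addresses) and that the number of sets is a power of two; the names are illustrative.

```python
TOTAL_LINES = 2 ** 14  # m, from the running example
WORD_BITS = 2          # w


def split_address(address: int, k: int):
    """Split an address into (tag, set_index, word) for a k-way set
    associative cache with TOTAL_LINES lines."""
    num_sets = TOTAL_LINES // k            # v = m / k
    set_bits = num_sets.bit_length() - 1   # d, valid when v is a power of two
    word = address & ((1 << WORD_BITS) - 1)
    set_index = (address >> WORD_BITS) & (num_sets - 1)
    tag = address >> (WORD_BITS + set_bits)
    return tag, set_index, word


for k in (1, 2, TOTAL_LINES):
    tag, set_index, word = split_address(0x16339D, k)
    print(f"k = {k:5d}: tag {tag:X}, set {set_index:X}, word {word}")
```

With k = 1 the result matches the direct mapping example (tag 16, line 0CE7); with k = 2 it matches the 2-way example (tag 02C, set 0CE7); with k = m there is a single set and the tag is the full 22-bit block number, as in associative mapping.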

Replacement Algorithms (1) Direct mapping: When a new block is brought into cache, one of the existing blocks must be replaced. With direct mapping there is no choice: each block maps to only one line, so replace that line.

Replacement Algorithms (2) Associative & Set Associative: Hardware-implemented algorithm (for speed). Least recently used (LRU): e.g., in 2-way set associative, which of the 2 blocks is LRU? First in first out (FIFO): replace the block that has been in cache longest. Least frequently used (LFU): replace the block which has had the fewest hits. Random.
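LRU replacement within one set can be sketched in software with an ordered dictionary. Real caches implement this in hardware, so the class below is only a behavioural model; the class name, the string tags, and the access sequence are invented for the example.

```python
from collections import OrderedDict


class LRUSet:
    """One k-line set with least recently used replacement."""

    def __init__(self, k: int):
        self.k = k
        self.lines = OrderedDict()  # tag -> block; least recently used first

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss, load the block and return False."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.lines) >= self.k:
            self.lines.popitem(last=False)   # evict the least recently used line
        self.lines[tag] = None               # block contents omitted in this sketch
        return False


two_way = LRUSet(k=2)
hits = [two_way.access(tag) for tag in ("A", "B", "A", "C", "B")]
print(hits, list(two_way.lines))
```

In this sequence the access to C evicts B rather than A, because A was referenced more recently. Swapping `popitem(last=False)` on a structure that is never reordered on hits would give FIFO instead.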

Write Policy Must not overwrite a cache block unless main memory is up to date Multiple CPUs may have individual caches I/O may address main memory directly

Write through: All writes go to main memory as well as cache, so both copies always agree. Multiple CPUs can monitor main memory traffic to keep their local (to CPU) caches up to date. Disadvantage: lots of traffic → bottleneck.

Write back Updates initially made in cache only Update bit for cache slot is set when update occurs If block is to be replaced, write to main memory only if update bit is set, i.e., only if the cache line is dirty, i.e., only if at least one word in the cache line is updated Other caches get out of sync I/O must access main memory through cache N.B. 15% of memory references are writes
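The traffic difference between the two write policies can be illustrated by counting main memory writes for a sequence of events. This is a simplified model: the event format, the function name, and the trace are invented, and a write-back transfer moves a whole block whereas a write-through transfer moves a single word, a distinction the count ignores.

```python
def memory_writes(operations, policy):
    """Count writes to main memory.
    operations: list of ("write", line) or ("evict", line) events.
    policy: "write-through" or "write-back"."""
    dirty = set()  # lines whose update bit is set
    writes = 0
    for op, line in operations:
        if op == "write":
            if policy == "write-through":
                writes += 1        # every write also goes to main memory
            else:
                dirty.add(line)    # write back: only set the update bit
        elif op == "evict" and line in dirty:
            writes += 1            # dirty line is written back on replacement
            dirty.discard(line)
    return writes


# Five writes to line 0, then line 0 and a clean line 1 are replaced.
ops = [("write", 0)] * 5 + [("evict", 0), ("evict", 1)]
print(memory_writes(ops, "write-through"), memory_writes(ops, "write-back"))
```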

Block Size: Block size = line size. As block size increases from very small → hit ratio increases because of "the principle of locality". As block size becomes very large → hit ratio decreases, as the number of blocks decreases and the probability of referencing all words in a block decreases. 4 - 8 addressable units is reasonable.

Number of Caches Two aspects Number of levels Unified vs. split

Multilevel Caches Modern CPU has on-chip cache (L1) that increases overall performance e.g., 80486: 8KB Pentium: 16KB PowerPC: up to 64KB Secondary, off-chip cache (L2) provides high speed access to main memory Generally 512KB or less

Unified vs. Split: Unified cache: stores data and instructions in one cache; flexible and can balance the load between data and instruction fetches → higher hit ratio; only one cache to design and implement. Split cache: two caches, one for data and one for instructions; trend toward split cache; good for superscalar machines that support parallel execution, prefetch, and pipelining; overcomes cache contention.

Pentium 4 Cache: 80386: no on-chip cache. 80486: 8 KB, using 16-byte lines and four-way set associative organization. Pentium (all versions): two on-chip L1 caches (data & instructions). Pentium III: L3 cache added off chip. Pentium 4: L1 caches: 8 KB, 64-byte lines, four-way set associative. L2 cache: feeding both L1 caches, 256 KB, 128-byte lines, 8-way set associative. L3 cache on chip.

Intel Cache Evolution (problem; solution; processor on which feature first appears)
Problem: External memory slower than the system bus. Solution: Add external cache using faster memory technology. Processor: 386.
Problem: Increased processor speed results in external bus becoming a bottleneck for cache access. Solution: Move external cache on-chip, operating at the same speed as the processor. Processor: 486.
Problem: Internal cache is rather small, due to limited space on chip. Solution: Add external L2 cache using faster technology than main memory. Processor: 486.
Problem: Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place. Solution: Create separate data and instruction caches. Processor: Pentium.
Problem: Increased processor speed results in external bus becoming a bottleneck for L2 cache access. Solutions: Create separate back-side bus (BSB) that runs at higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache (Pentium Pro). Move L2 cache on to the processor chip (Pentium II).
Problem: Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small. Solutions: Add external L3 cache (Pentium III). Move L3 cache on-chip (Pentium 4).

Pentium 4 Block Diagram

Pentium 4 Core Processor: Fetch/decode unit: fetches instructions from L2 cache, decodes them into micro-ops, stores micro-ops in L1 cache. Out-of-order execution logic: schedules micro-ops based on data dependence and resources; may speculatively execute. Execution units: execute micro-ops; data from L1 cache; results in registers. Memory subsystem: L2 cache and system bus.

Pentium 4 Design Reasoning: Decodes instructions into RISC-like micro-ops before the L1 cache. Micro-ops are fixed length, which suits superscalar pipelining and scheduling; Pentium instructions are long and complex. Performance is improved by separating decoding from scheduling and pipelining (more later, ch. 14). Data cache is write back but can be configured to write through. L1 cache is controlled by 2 bits in a register: CD = cache disable, NW = not write through. There are 2 instructions: one to invalidate (flush) the cache, and one to write back then invalidate. L2 and L3 are 8-way set associative with a line size of 128 bytes.

PowerPC Cache Organization: 601: single 32 KB cache, 8-way set associative. 603: 16 KB (2 x 8 KB), two-way set associative. 604: 32 KB. 620: 64 KB. G3 & G4: 64 KB L1 cache, 8-way set associative; 256 KB, 512 KB or 1 MB L2 cache, two-way set associative. G5: 32 KB instruction cache, 64 KB data cache.

PowerPC G5 Block Diagram