Presentation transcript:

CS61C L31 Caches I (1) Garcia 2005 © UCB
Peer Instruction Answer
A. Memory hierarchies were invented before 1950. (UNIVAC I wasn’t delivered ‘til 1951.)
B. If you know your computer’s cache size, you can often make your code run faster.
C. Memory hierarchies take advantage of spatial locality by keeping the most recent data items closer to the processor.
Choices (ABC): 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT
Answers:
A. “We are…forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less accessible.” (von Neumann, 1946)
B. Certainly! That’s called “tuning”.
C. “Most recent” items => temporal locality, not spatial locality.

Peer Instruction Answer
1. All caches take advantage of spatial locality. FALSE: with block size = 1, there is no spatial locality!
2. All caches take advantage of temporal locality. TRUE: that’s the idea of caches; we’ll need the data again soon.
3. On a read, the return value will depend on what is in the cache. FALSE: it better not! If the data is there, use it; otherwise, get it from memory.
Choices (ABC): 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT

CS61C L32 Caches II (3) Garcia, 2005 © UCB
Review: Direct-Mapped Cache
Cache location 0 can be occupied by data from memory locations 0, 4, 8, ...
With 4 blocks, that is any memory location that is a multiple of 4.
[Figure: memory addresses mapped by cache index onto a 4-byte direct-mapped cache]
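The mapping rule on this slide can be sketched in a few lines of Python. This is only an illustration under the slide's assumptions (4 one-byte blocks); the names `NUM_BLOCKS` and `cache_index` are made up for the example.

```python
# Direct-mapped placement for the 4-block cache described on the slide.
NUM_BLOCKS = 4  # cache size in blocks (from the slide)


def cache_index(address: int) -> int:
    """Return the single cache location that may hold this memory address."""
    return address % NUM_BLOCKS


# Memory locations among the first 16 that compete for cache location 0.
competitors = [a for a in range(16) if cache_index(a) == 0]
print(competitors)  # [0, 4, 8, 12]
```

Every address in the printed list maps to the same cache location, which is why a direct-mapped cache needs a tag to tell them apart.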

Caching Terminology
When we try to read memory, 3 things can happen:
1. cache hit: the cache block is valid and contains the proper address, so read the desired word
2. cache miss: nothing is in the cache in the appropriate block, so fetch from memory
3. cache miss, block replacement: the wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory (the cache always holds a copy)

Cache access (Fig 7.7)
1 word = 4 bytes; byte addressing
[Figure labels: the part of the cache that actually stores data; cache block size:]

Issues with Direct-Mapped
Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
What if we have a block size > 1 byte?
Answer: divide the memory address into three fields:
  ttttttttttttttttt iiiiiiiiii oooo
  tag: to check if we have the correct block
  index: to select the block
  byte offset: position within the block
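A minimal sketch of the three-field split, assuming 4 offset bits and 10 index bits (read off the "oooo" and "iiiiiiiiii" fields above). The function name and the sample address 0x8014 are hypothetical, chosen only for illustration.

```python
OFFSET_BITS = 4   # assumed: 16-byte blocks, matching the "oooo" field
INDEX_BITS = 10   # assumed: 1024 rows, matching the "iiiiiiiiii" field


def split_address(address: int) -> tuple:
    """Split a byte address into (tag, index, offset)."""
    offset = address & ((1 << OFFSET_BITS) - 1)                 # low bits
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)  # middle bits
    tag = address >> (OFFSET_BITS + INDEX_BITS)                 # remaining high bits
    return tag, index, offset


# Hypothetical address used only as an example.
print(split_address(0x8014))  # (2, 1, 4)
```

The index selects the row, the tag stored in that row is compared against the address's tag, and the offset picks the byte within the block.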

Use of spatial locality
The previous cache design takes advantage of temporal locality.
To use spatial locality in cache design, make the cache block larger than 1 word.
On a cache miss, we then fetch multiple adjacent words at once.

Cache with 4-word blocks (Fig 7.10)
[Figure: 4-word block address]

Advantage of multiple-word block (spatial locality)
Ex. access words with byte addresses 16, 24, 20
1-word block cache:
  16 - cache miss
  24 - cache miss
  20 - cache miss
4-word block cache:
  16 - cache miss, load the 4-word block
  24 - cache hit
  20 - cache hit
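The hit/miss pattern in this example can be reproduced with a small direct-mapped simulation. This is a sketch, not the textbook's code: the cache size of 4 blocks is an assumption (the slide does not state it), and the cache stores the whole block number instead of a separate tag for simplicity.

```python
def access_trace(byte_addresses, words_per_block, num_blocks=4, bytes_per_word=4):
    """Return 'hit' or 'miss' for each access to a direct-mapped cache.

    num_blocks=4 is an assumed cache size; the slide does not give one.
    """
    block_bytes = words_per_block * bytes_per_word
    cache = [None] * num_blocks  # block number held in each cache entry
    results = []
    for addr in byte_addresses:
        block_number = addr // block_bytes
        index = block_number % num_blocks
        if cache[index] == block_number:
            results.append("hit")
        else:
            cache[index] = block_number  # fetch the whole block on a miss
            results.append("miss")
    return results


print(access_trace([16, 24, 20], words_per_block=1))  # ['miss', 'miss', 'miss']
print(access_trace([16, 24, 20], words_per_block=4))  # ['miss', 'hit', 'hit']
```

With 4-word (16-byte) blocks, byte addresses 16, 20, and 24 all fall in the same block, so only the first access misses.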

Multiple-word cache: write miss
[Figure: a write of 1-word data misses; reload the 4-word block, then write the 1-word data]

Accessing data in a direct mapped cache
Ex.: 16 KB of data, direct-mapped, 4-word blocks
Read 4 addresses: 1. 0x… 2. 0x…C 3. 0x… 4. 0x…
Memory (address in hex, value of word): consecutive words hold a, b, c, d; then e, f, g, h; then i, j, k, l; ...

16 KB Direct Mapped Cache, 16 B blocks
Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid).
Each row, selected by index, holds: Valid, Tag, and data bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f.

1. Read 0x…
   Split the address into tag field, index field, and offset.
   So we read block 1.
   No valid data.
   So load that data into the cache, setting tag and valid bit. Block 1 now holds a b c d.
   Read from the cache at the offset; return word b.

2. Read 0x…C = 0…
   Index is valid.
   Index valid, tag matches.
   Index valid, tag matches; return word d.

3. Read 0x… = 0…
   So read block 3.
   No valid data.
   Load that cache block (e f g h); return word f.

4. Read 0x… = 0…
   So read cache block 1; data is valid.
   Cache block 1 tag does not match (0 != 2).
   Miss, so replace block 1 with new data and tag. Block 1 now holds i j k l.
   And return word j.
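The four reads above can be replayed with a small simulation of a 16 KB direct-mapped cache with 16-byte blocks. The addresses and memory layout below are assumptions chosen to be consistent with the walk-through (block 1 then block 3, tags 0 and 2, words b, d, f, j); the original hex values did not survive in this transcript. The class and dictionary names are hypothetical.

```python
BLOCK_BYTES = 16                      # 4-word blocks, 4 bytes per word
NUM_ROWS = 16 * 1024 // BLOCK_BYTES   # 16 KB of data -> 1024 rows

# Assumed memory contents: block base address -> four one-letter words.
MEMORY = {
    0x00000010: "abcd",
    0x00000030: "efgh",
    0x00008010: "ijkl",
}


class DirectMappedCache:
    def __init__(self):
        # index -> (tag, block); a missing index means the valid bit is off
        self.rows = {}

    def read(self, address):
        """Return (outcome, word) for a word read at a byte address."""
        offset = address % BLOCK_BYTES
        index = (address // BLOCK_BYTES) % NUM_ROWS
        tag = address // (BLOCK_BYTES * NUM_ROWS)
        row = self.rows.get(index)
        if row is None:
            outcome = "miss"
        elif row[0] != tag:
            outcome = "miss with replacement"
        else:
            outcome = "hit"
        if outcome != "hit":
            # Fetch the whole block from memory and set the tag.
            self.rows[index] = (tag, MEMORY[address - offset])
        block = self.rows[index][1]
        return outcome, block[offset // 4]


cache = DirectMappedCache()
# Assumed addresses, labeled hypothetical as noted above.
for addr in (0x00000014, 0x0000001C, 0x00000034, 0x00008014):
    print(hex(addr), cache.read(addr))
```

Under these assumed addresses, the four reads produce a miss returning b, a hit returning d, a miss returning f, and a miss with replacement returning j, matching the steps above.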

Do an example yourself. What happens?
Choose from:
  Cache: Hit, Miss, Miss w. replace
  Values returned: a, b, c, d, e, ..., k, l
Read address 0x… ?
Read address 0x…c ?
Cache contents: the block i j k l and the block e f g h (valid = 1, tag = 0) loaded by the previous reads.

Advantage of multiple-word block (spatial locality)
Comparison of miss rate:

Program   Block size (words)   Instruction miss rate   Data miss rate
gcc       1                    6.1%                    2.1%
gcc       4                    2.0%                    1.7%
spice     1                    1.2%                    1.3%
spice     4                    0.3%                    0.6%

Why is the improvement in instruction miss rate so significant? Instruction references have better spatial locality.

Miss rate vs. block size
Why does the miss rate go back up for very large blocks? The number of blocks in the cache decreases!

Short conclusion
- Direct mapped cache: map a memory word to a cache block; valid bit, tag field
- Cache read: hit, read miss, miss penalty
- Cache write: write-through, write-back, write miss penalty
- Multi-word cache (use spatial locality)

Outline
- Introduction
- Basics of caches
- Measuring cache performance
- Set associative cache
- Multilevel cache
- Virtual memory
- Make memory system fast

Cache performance
How does the cache affect system performance?
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x clock cycle time
  CPU execution clock cycles: includes cache hit time
  Memory-stall clock cycles: caused by cache misses
Memory-stall cycles = Read-stall cycles + Write-stall cycles
Read-stall cycles = (Reads / Program) x Read miss rate x Read miss penalty
Assume the read and write miss penalties are the same:
Memory-stall cycles = (Memory accesses / Program) x Miss rate x Miss penalty

Ex. Calculate cache performance
CPI = 2 without any memory stalls
For gcc: instruction cache miss rate = 2%, data cache miss rate = 4%, miss penalty = 40 cycles
Sol: Set instruction count = I
Instruction miss cycles = I x 2% x 40 = 0.8 x I
Data miss cycles = I x 36% x 4% x 40 = 0.58 x I (36% is the percentage of lw/sw instructions)
Memory-stall cycles = 0.8I + 0.58I = 1.38I
CPU time with stalls / CPU time with perfect cache = (2I + 1.38I) / 2I = 1.69

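Why is memory the bottleneck for system performance?
In the previous example, suppose we make the processor faster and change the CPI from 2 to 1.
Memory-stall cycles remain the same = 1.38I
CPU time with stalls / CPU time with perfect cache = (I + 1.38I) / I = 2.38
Percentage of time spent on memory stalls: 1.38 / 3.38 = 41% becomes 1.38 / 2.38 = 58%
As the CPU gets faster (lower CPI or higher clock rate), memory accounts for a larger share of the impact on system performance.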
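The arithmetic in the two slides above can be checked with a short calculation per instruction. This is a sketch using the slide's numbers as defaults; the function name `stall_analysis` is made up for the example, and the slides round the stall cycles to 1.38 where the unrounded value is 1.376.

```python
def stall_analysis(base_cpi, instr_miss_rate=0.02, data_miss_rate=0.04,
                   mem_access_fraction=0.36, miss_penalty=40):
    """Return (stall CPI, slowdown vs. perfect cache, fraction of time stalled)."""
    instr_stalls = instr_miss_rate * miss_penalty                       # 0.8 per instruction
    data_stalls = mem_access_fraction * data_miss_rate * miss_penalty   # about 0.58
    stall_cpi = instr_stalls + data_stalls
    total_cpi = base_cpi + stall_cpi
    return stall_cpi, total_cpi / base_cpi, stall_cpi / total_cpi


for cpi in (2, 1):
    stall, slowdown, fraction = stall_analysis(cpi)
    print(f"base CPI {cpi}: stalls {stall:.2f}, slowdown {slowdown:.2f}, "
          f"stall share {fraction:.0%}")
```

Halving the base CPI leaves the stall cycles unchanged, so the share of time lost to memory grows from about 41% to about 58%.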

Outline
- Introduction
- Basics of caches
- Measuring cache performance
- Set associative cache (reduce miss rate)
- Multilevel cache
- Virtual memory
- Make memory system fast

How to improve cache performance?
Memory-stall cycles = (Memory accesses / Program) x Miss rate x Miss penalty
- Reduce cache miss rate: larger cache; set associative cache (a new placement rule other than direct mapping)
- Reduce cache miss penalty: multi-level cache

Flexible placement of blocks
Recall: in a direct mapped cache, one address -> one block in the cache.
Alternative: one address -> more than one possible block in the cache. A memory address can map to more than one block in the cache.

Fully associative cache
Memory data can be placed in any block in the cache.
Disadvantage: must search all entries in the cache for a match, using parallel comparators.

Set-associative cache
Between direct mapped and fully associative.
Memory data can be placed in any block within one particular set of the cache.
Disadvantage: must search all entries in the set for a match, using parallel comparators.
Set = (address) modulo (number of sets in cache). Ex. 12 modulo 4 = 0
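The placement rule can be sketched as a one-line function. The 8-block cache size used in the loop is an assumption for illustration, and the names `set_index` and `NUM_BLOCKS` are hypothetical.

```python
def set_index(block_address: int, num_sets: int) -> int:
    """Set that a block maps to: (address) modulo (number of sets in cache)."""
    return block_address % num_sets


print(set_index(12, 4))  # 0, the slide's example

# Assumed 8-block cache: more ways per set means fewer sets.
NUM_BLOCKS = 8
for ways in (1, 2, 4, 8):
    num_sets = NUM_BLOCKS // ways
    print(f"{ways}-way: {num_sets} sets, block 12 goes to set {set_index(12, num_sets)}")
```

With 1 way per set the rule reduces to direct mapping, and with a single set it reduces to a fully associative cache.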

Example: 4-way set-associative cache Parallel comparators

All schemes can be viewed as cases of set associativity
Ex. 8-block cache: 1-way (direct mapped), 2-way, 4-way, and 8-way (fully associative)

Example: set-associative caches (p. 499)
A cache with 4 blocks
Load data with block addresses 0, 8, 0, 6, 8
- one-way set-associative cache (direct mapped): 5 misses
- 2-way set-associative cache: 4 misses
- 4-way set-associative cache: 3 misses
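The three miss counts can be reproduced with a short simulation. This is a sketch that assumes LRU replacement within each set (the slide does not name the policy at this point); the function name `count_misses` is made up for the example.

```python
from collections import OrderedDict


def count_misses(block_addresses, num_blocks, ways):
    """Count misses for a set-associative cache, assuming LRU replacement."""
    num_sets = num_blocks // ways
    # One OrderedDict per set; the least recently used block comes first.
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_addresses:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # set is full: evict the LRU block
            lines[block] = True
    return misses


accesses = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    print(f"{ways}-way: {count_misses(accesses, num_blocks=4, ways=ways)} misses")
```

In the direct-mapped case, blocks 0 and 8 keep evicting each other from the same location; added associativity lets them coexist.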

CS61C L34 Caches IV (48) Garcia © UCB
Block Replacement Policy (1/2)
- Direct-Mapped Cache: the index completely specifies which position a block can go in on a miss
- N-Way Set Assoc: the index specifies a set, but the block can occupy any position within the set on a miss
- Fully Associative: the block can be written into any position
Question: if we have the choice, where should we write an incoming block?

Block Replacement Policy (2/2)
If there are any locations with the valid bit off (empty), then usually write the new block into the first one.
If all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets “cached out” on a miss.

Block Replacement Policy: LRU
LRU (Least Recently Used)
Idea: cache out the block which has been accessed (read or write) least recently
Pro: temporal locality => recent past use implies likely future use; in fact, this is a very effective policy
Con: with 2-way set assoc, it is easy to keep track (one LRU bit); with 4-way or greater, it requires complicated hardware and much time to keep track of this

Block Replacement Example
We have a 2-way set associative cache with a four-word total capacity and one-word blocks. We perform the following word accesses (ignore bytes for this problem):
0, 2, 0, 1, 4, 0, 2, 3, 5, 4
How many hits and how many misses will there be for the LRU block replacement policy?

Block Replacement Example: LRU
Addresses 0, 2, 0, 1, 4, 0, ...
0: miss, bring into set 0 (loc 0)
2: miss, bring into set 0 (loc 1)
0: hit
1: miss, bring into set 1 (loc 0)
4: miss, bring into set 0 (loc 1, replace 2)
0: hit
[Figure: contents of set 0 and set 1 (loc 0, loc 1) and the LRU marker after each access]
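The trace above, continued through the full access sequence, can be sketched as follows. This is a minimal illustration assuming 2 sets of 2 one-word blocks with set = address modulo 2; the function name `lru_trace` is hypothetical, and the code tracks recency with list order instead of a hardware LRU bit.

```python
def lru_trace(addresses, num_sets=2, ways=2):
    """Return (address, 'hit' or 'miss', evicted block or None) per access."""
    sets = [[] for _ in range(num_sets)]  # each list is ordered LRU first
    trace = []
    for addr in addresses:
        lines = sets[addr % num_sets]
        if addr in lines:
            lines.remove(addr)
            trace.append((addr, "hit", None))
        else:
            # Evict the least recently used block only when the set is full.
            evicted = lines.pop(0) if len(lines) == ways else None
            trace.append((addr, "miss", evicted))
        lines.append(addr)  # most recently used goes last
    return trace


trace = lru_trace([0, 2, 0, 1, 4, 0, 2, 3, 5, 4])
for addr, outcome, evicted in trace:
    note = f" (replace {evicted})" if evicted is not None else ""
    print(f"{addr}: {outcome}{note}")

hits = sum(1 for _, outcome, _ in trace if outcome == "hit")
print(hits, len(trace) - hits)  # 2 8
```

Under these assumptions the full sequence yields 2 hits and 8 misses, with the first six accesses matching the trace listed above.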

Short conclusion
A higher degree of associativity gives:
- Lower miss rate
- More hardware cost to search

Outline
- Introduction
- Basics of caches
- Measuring cache performance
- Set associative cache
- Multilevel cache (reduce miss penalty)
- Virtual memory
- Make memory system fast

Multi-level cache
Goal: reduce miss penalty
[Figure: primary cache (L1) -> on an L1 cache miss, secondary cache (L2) -> on an L2 cache miss, main memory; a cache hit is served at that level]

Example: Performance of multilevel cache
CPI = 1 without cache misses, clock rate = 500 MHz
Primary cache: miss rate = 5%
Secondary cache: miss rate = 2%, access time = 20 ns
Main memory: access time = 200 ns
Total CPI = Base CPI + memory-stall CPI = 1 + ?

Example: Performance of multilevel cache (cont.)
Total CPI = Base CPI + memory-stall CPI = 1 + ?
Access to main memory = 200 ns x 500M clocks/sec = 100 clocks
Access to L2 cache = 20 ns x 500M clocks/sec = 10 clocks
One-level cache: Total CPI = 1 + 5% x 100 = 6
Two-level cache: Total CPI = 1 + L1 miss penalty + L2 miss penalty = 1 + 5% x 10 + 2% x 100 = 3.5
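The CPI figures in this example can be recomputed directly. This is a sketch using the slide's numbers; the constant and function names are made up for the example, and the 2% secondary-cache miss rate is treated as the fraction of all instructions that go to main memory, as the slide's formula does.

```python
CLOCK_MHZ = 500   # clock rate from the slide
BASE_CPI = 1.0    # CPI without cache misses


def cycles(access_time_ns: float) -> float:
    """Convert an access time in ns to clock cycles at CLOCK_MHZ."""
    return access_time_ns * CLOCK_MHZ / 1000


l2_penalty = cycles(20)     # 10 clocks
mem_penalty = cycles(200)   # 100 clocks

# One-level cache: every L1 miss (5%) pays the full main-memory penalty.
one_level_cpi = BASE_CPI + 0.05 * mem_penalty

# Two-level cache: L1 misses (5%) pay the L2 access; 2% also pay main memory.
two_level_cpi = BASE_CPI + 0.05 * l2_penalty + 0.02 * mem_penalty

print(one_level_cpi, two_level_cpi)
```

Under these numbers the second-level cache cuts the total CPI from about 6 to about 3.5, because most L1 misses pay only the 10-clock L2 access.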

Cache Things to Remember
- Caches are NOT mandatory: the processor performs arithmetic, memory stores data, and caches simply make data transfers go faster
- Caches speed up due to temporal locality: store data used recently
- Block size > 1 word => spatial locality speedup: store words next to the ones used recently
- Cache design choices:
  size of cache: speed v. capacity
  N-way set assoc: choice of N (direct-mapped and fully-associative are just special cases for N)