
Chap. 7 Memory System (Jen-Chang Liu, Spring 2006)

Big Ideas so far
15 weeks to learn big ideas in CS&E:
- Principle of abstraction, used to build systems as layers
- Pliable data: a program determines what it is
- Stored-program concept: instructions are just data
- Greater performance by exploiting parallelism (pipelining)
- Principle of locality, exploited via a memory hierarchy (caches)
- Principles/pitfalls of performance measurement

Five components of a computer: input, output, memory, datapath, control.

Outline
- Introduction
- Basics of caches
- Measuring cache performance
- Set-associative caches
- Multilevel caches
- Virtual memory
Caches make the memory system fast; virtual memory makes it big.

Introduction
The programmer's view of memory: an unlimited amount of fast memory. How do we create that illusion?
Analogy: a library. The bookshelves hold many books; your desk holds the one book you are reading right now.

Principle of locality
Programs access a relatively small portion of their address space at any instant of time.
- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
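Both kinds of locality show up in even the simplest code. A minimal C illustration (the array and its size are invented for the example):

```c
#include <stdio.h>

int main(void) {
    int a[1024];                      /* hypothetical array */
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    int sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += a[i];  /* a[i]: spatial locality (sequential addresses);
                         sum and i: temporal locality (reused every iteration) */
    printf("%d\n", sum);
    return 0;
}
```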

Cost and performance of memory technologies

  Technology      Access time        $ per GB in 2004
  SRAM            0.5-5 ns           $4000-$10000
  DRAM            50-70 ns           $100-$200
  Magnetic disk   5-20 million ns    $0.5-$2

SRAM: static random access memory. DRAM: dynamic random access memory.
How do we build a memory system from these memory technologies?

Memory hierarchy
Example: the disk holds all data; DRAM holds a subset of the disk's data; SRAM holds a subset of the DRAM data. Each level closer to the processor holds less data but is faster.

Operation in the memory hierarchy

if data is found in the upper level    /* hit: access time = hit time */
    transfer it to the processor;
else                                   /* miss: access time = hit time + miss penalty */
    transfer the data to the upper level, then to the processor;

Hit time: the time to access the upper level. Miss penalty: the time to fetch a block from the level below.
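These hit/miss costs are commonly summarized as the average memory access time (AMAT), a standard textbook formula not spelled out on this slide; a sketch with illustrative numbers:

```c
#include <stdio.h>

/* Average memory access time = hit time + miss rate x miss penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* e.g., 1-cycle hit, 5% miss rate, 100-cycle miss penalty (made-up numbers) */
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));   /* 6.0 */
    return 0;
}
```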

Outline
- Introduction
- Basics of caches
- Measuring cache performance
- Set-associative caches
- Multilevel caches
- Virtual memory
How do we design a memory hierarchy?

Cache
Webster's dictionary: "a safe place for hiding or storing things."
In a memory hierarchy: the level between the CPU and main memory; more generally, any storage managed to take advantage of locality of access.

What does a cache do?

Problems in designing a cache
The cache contains only part of the data in memory or on disk.
Q1: How do we know whether a data item is in the cache?
Q2: Where should data fetched from memory be placed in the cache?

Direct-mapped cache (Fig. 7.5)
Each memory block maps to exactly one location in the cache:
cache location = (block address) modulo (number of cache blocks in the cache)
The figure shows how each word address maps to its location in the cache.

Direct-mapped cache (cont.)
Many memory words map to the same location in the cache.
Q: Which memory word is currently in the cache? Use a tag field to identify it.
Q: Does the cache block hold valid data at all? (Initially, the cache is empty.) Use a valid bit.
Each cache entry thus holds a valid bit, a tag, and a data word.
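A minimal sketch of a direct-mapped, one-word-block cache lookup, including the miss handling described later; the structure and the read_memory helper are hypothetical, and real hardware does all of this in parallel logic rather than sequential code:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS 16384              /* 2^14 one-word blocks, as in the 64 KB example */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t data;                    /* one-word (4-byte) block */
} CacheLine;

static CacheLine cache[NUM_BLOCKS];

extern uint32_t read_memory(uint32_t addr);      /* hypothetical backing-store access */

uint32_t cache_read(uint32_t addr) {
    uint32_t block_addr = addr >> 2;               /* drop the 2-bit byte offset */
    uint32_t index      = block_addr % NUM_BLOCKS; /* direct mapping */
    uint32_t tag        = block_addr / NUM_BLOCKS;

    if (cache[index].valid && cache[index].tag == tag)
        return cache[index].data;                  /* hit */

    /* miss: fetch from memory, fill the entry, update tag, set valid bit */
    cache[index].data  = read_memory(addr);
    cache[index].tag   = tag;
    cache[index].valid = true;
    return cache[index].data;
}
```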

Fig. 7.6

Cache access (Fig. 7.7)
One word = 4 bytes. The address splits into tag, index, and byte-offset fields. The data field is the part of the cache that actually stores data; the tag, valid bit, and comparator are overhead. Cache block size here: one word.

Ex. Calculating the bits in a cache
How many bits are required for a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address?
64 KB = 16K words = 2^14 words = 2^14 blocks.
Byte offset: 2 bits; index: 14 bits; tag: 32 - 14 - 2 = 16 bits; plus 1 valid bit per block.
Cache bits: 2^14 x (32 + 16 + 1) = 2^14 x 49 = 784 Kbits = 98 KB.
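A quick runnable check of that arithmetic (the variable names are just for the example):

```c
#include <stdio.h>

int main(void) {
    int addr_bits   = 32;
    int words       = 64 * 1024 / 4;           /* 16384 = 2^14 one-word blocks */
    int index_bits  = 14;                      /* log2(16384) */
    int offset_bits = 2;                       /* byte offset within a word */
    int tag_bits    = addr_bits - index_bits - offset_bits;    /* 16 */
    long total_bits = (long)words * (32 + tag_bits + 1);       /* data + tag + valid */
    printf("tag = %d bits, total = %ld Kbits = %ld KB\n",
           tag_bits, total_bits / 1024, total_bits / 8 / 1024); /* 784 Kbits = 98 KB */
    return 0;
}
```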

Ex. Real machine: the DECstation 3100 uses exactly this organization: 64 KB of data and a 98 KB total cache size (2^14 one-word blocks).

Ex. DECStation 3100
Uses the MIPS R2000 CPU, pipelined as in Chap. 6. The pipeline has both an instruction-memory stage and a data-memory stage: why two memory units?

Ex. DECStation 3100 caches
Separate instruction and data caches: a 64 KB instruction cache and a 64 KB data cache.

Ex. DECStation 3100 cache access: read
Instruction fetches index the 64 KB instruction cache with the PC; loads index the 64 KB data cache with the address calculated by the ALU. On a cache hit the word is returned; on a cache miss the word is fetched from memory and the cache is updated.

Peer Instruction (from CS61C, Garcia, 2005 © UCB)
A. Memory hierarchies were invented before 1950. (UNIVAC I wasn't delivered 'til 1951.)
B. If you know your computer's cache size, you can often make your code run faster.
C. Memory hierarchies take advantage of spatial locality by keeping the most recent data items closer to the processor.
Answers (A B C): 1: FFF, 2: FFT, 3: FTF, 4: FTT, 5: TFF, 6: TFT, 7: TTF, 8: TTT

Peer Instruction Answer
A. TRUE: "We are...forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less accessible." (von Neumann, 1946)
B. TRUE: certainly! That's called "tuning".
C. FALSE: keeping the most recent items closer exploits temporal locality, not spatial locality.

Peer Instruction
1. All caches take advantage of spatial locality.
2. All caches take advantage of temporal locality.
3. On a read, the return value will depend on what is in the cache.
Answers (1 2 3): 1: FFF, 2: FFT, 3: FTF, 4: FTT, 5: TFF, 6: TFT, 7: TTF, 8: TTT

Peer Instruction Answer
1. FALSE: with a block size of one word there is no spatial locality to exploit.
2. TRUE: that's the whole idea of caches; we'll need it again soon.
3. FALSE: it had better not! If the data is in the cache, use it; otherwise fetch it from memory. Either way the value returned is the same.

Handling cache misses
Processing a cache miss:
1. Stall the processor.
2. Fetch the data from memory.
3. Write the cache entry: put the data in the data field, update the tag field, and set the valid bit.
4. Continue execution.
(The cache_read sketch above shows the same steps in code.)

Ex. DECStation 3100 cache access: write
A store writes the new value into the cache, so the data in the cache and in memory become inconsistent. Two write policies:
1. Write-through: update the cache and write to main memory at the same time.
2. Write-back: write only to the cache; memory is updated later, when the block is replaced.

Problems with write-through
Writing to main memory on every store slows down performance.
Ex.: CPI without cache misses = 1.2 clock cycles; each write to memory costs an extra 10 cycles; 13% of instructions in gcc are stores. Effective CPI = 1.2 + 13% x 10 = 1.2 + 1.3 = 2.5 clock cycles.
Solution: a write buffer. The data is stored into the write buffer while it waits to be written to memory, and the processor continues execution as soon as the data is in the cache and the write buffer.
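A conceptual sketch of a write buffer as a small FIFO (purely illustrative: the depth, the names, and the write_memory helper are invented; real write buffers are hardware queues drained by the memory controller):

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_DEPTH 4                      /* write buffers are typically small */

typedef struct { uint32_t addr, data; } WbEntry;
static WbEntry wb[WB_DEPTH];
static int wb_head = 0, wb_count = 0;

/* Processor side: enqueue the store and keep executing; stall only if full. */
bool write_buffer_put(uint32_t addr, uint32_t data) {
    if (wb_count == WB_DEPTH)
        return false;                               /* buffer full: processor stalls */
    wb[(wb_head + wb_count) % WB_DEPTH] = (WbEntry){ addr, data };
    wb_count++;
    return true;
}

extern void write_memory(uint32_t addr, uint32_t data);   /* hypothetical */

/* Memory side: drain one entry to main memory when the bus is free. */
void write_buffer_drain(void) {
    if (wb_count > 0) {
        write_memory(wb[wb_head].addr, wb[wb_head].data);
        wb_head = (wb_head + 1) % WB_DEPTH;
        wb_count--;
    }
}
```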

Problems with write-back
The new value is written only to the cache, so the cache and memory are inconsistent, and the scheme is more complex to implement. Ex.: when a cache entry is replaced, its modified block must first be written back to the corresponding memory address.

Use of spatial locality
The cache designs so far exploit temporal locality. To also exploit spatial locality, make each cache block longer than one word: on a cache miss, we fetch multiple adjacent words at once.

One-word-block cache (Fig. 7.7): the address supplies a tag, an index, and a byte offset.

Multiple-word-block cache: 4-word blocks; the address now also supplies a block offset that selects a word within the block.

Advantage of multiple-word blocks (spatial locality)
Ex.: access the words at byte addresses 16, 24, 20, starting from an empty cache.
1-word-block cache: 16 - cache miss; 24 - cache miss; 20 - cache miss.
4-word-block cache: 16 - cache miss, load the 4-word block (bytes 16-31) from memory; 24 - cache hit; 20 - cache hit.
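A tiny simulation that reproduces those counts (direct-mapped; the helper function is illustrative only):

```c
#include <stdio.h>
#include <stdbool.h>

/* Count misses for a direct-mapped cache with the given block size in bytes. */
static int count_misses(const int *addrs, int n, int block_bytes, int num_blocks) {
    long tags[64];
    bool valid[64] = { false };
    int misses = 0;
    for (int i = 0; i < n; i++) {
        long block = addrs[i] / block_bytes;
        int  index = block % num_blocks;
        if (!valid[index] || tags[index] != block) {   /* miss: fill the block */
            valid[index] = true;
            tags[index]  = block;
            misses++;
        }
    }
    return misses;
}

int main(void) {
    int addrs[] = { 16, 24, 20 };
    printf("1-word blocks: %d misses\n", count_misses(addrs, 3, 4, 16));  /* 3 */
    printf("4-word blocks: %d misses\n", count_misses(addrs, 3, 16, 16)); /* 1 */
    return 0;
}
```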

Multiple-word-block cache: write miss
A one-word store that misses cannot simply overwrite part of a block: the cache must first reload the 4-word block from memory, then write the one-word data into it.

Worked example: reads in a direct-mapped cache with 4-word (16-byte) blocks (adapted from CS61C Caches II, Garcia, 2005 © UCB). Each address splits into a tag field, an index field, and an offset; each cache entry holds a valid bit, a tag, and four data words at offsets 0x0-3, 0x4-7, 0x8-b, and 0xc-f.

1. Read 0x00000014 (tag 0, index 1, offset 0x4). Read block 1: no valid data, so this is a miss. Load the 4-word block (a, b, c, d) into index 1, setting the tag and valid bit, then read from the cache at the offset and return word b.

2. Read 0x0000001C (tag 0, index 1, offset 0xc). Index 1 is valid and the tag matches: a hit. Return word d.

3. Read 0x00000034 (tag 0, index 3, offset 0x4). Read block 3: no valid data, a miss. Load that cache block (e, f, g, h) and return word f.

4. Read 0x00008014 (tag 2, index 1, offset 0x4). Block 1 is valid, but its tag does not match (0 != 2): a miss. Replace block 1 with the new data (i, j, k, l) and tag, and return word j.

Advantage of multiple-word blocks (spatial locality): miss rate comparison

  Program  Block size (words)  Instruction miss rate  Data miss rate
  gcc      1                   6.1%                   2.1%
  gcc      4                   2.0%                   1.7%
  spice    1                   1.2%                   1.3%
  spice    4                   0.3%                   0.6%

Why is the improvement in the instruction miss rate so significant? Instruction references have better spatial locality.

Miss rate vs. block size
Why does the miss rate climb again at very large block sizes? For a fixed cache size, larger blocks mean fewer blocks, so blocks are replaced more often.

Short conclusion
- Direct-mapped cache: maps each memory word to one cache block; valid bit and tag field.
- Cache read: hit, read miss, miss penalty.
- Cache write: write-through vs. write-back; write-miss penalty.
- Multi-word blocks exploit spatial locality.

Outline
- Introduction
- Basics of caches
- Measuring cache performance
- Set-associative caches
- Multilevel caches
- Virtual memory
Goal of this part: make the memory system fast.

Cache performance
How does the cache affect system performance?

CPU time = (CPU execution clock cycles + memory-stall clock cycles) x clock cycle time

Cache hits are part of the normal CPU execution cycles; cache misses add memory-stall cycles.

Memory-stall cycles = read-stall cycles + write-stall cycles
Read-stall cycles = (reads / program) x read miss rate x read miss penalty

Assuming the read and write miss penalties are the same:

Memory-stall cycles = (memory accesses / program) x miss rate x miss penalty

Ex. Calculating cache performance
CPI = 2 without any memory stalls; miss penalty = 40 cycles. For gcc: instruction cache miss rate = 2%, data cache miss rate = 4%, and 36% of instructions are loads/stores.
Sol: let the instruction count be I.
Instruction miss cycles = I x 2% x 40 = 0.80 I
Data miss cycles = I x 36% x 4% x 40 = 0.58 I   (36% is the percentage of lw/sw)
Memory-stall cycles = 0.80 I + 0.58 I = 1.38 I
CPU time with stalls / CPU time with perfect cache = (2 I + 1.38 I) / 2 I = 1.69
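The same arithmetic as a runnable check (parameters taken straight from the example):

```c
#include <stdio.h>

int main(void) {
    double base_cpi     = 2.0;
    double miss_penalty = 40.0;
    double i_miss_rate  = 0.02;    /* instruction cache */
    double d_miss_rate  = 0.04;    /* data cache */
    double mem_fraction = 0.36;    /* lw/sw share of instructions */

    double stall_cpi = i_miss_rate * miss_penalty
                     + mem_fraction * d_miss_rate * miss_penalty;   /* 0.8 + 0.576 */
    printf("memory-stall CPI = %.2f\n", stall_cpi);                 /* 1.38 */
    printf("slowdown = %.2f\n", (base_cpi + stall_cpi) / base_cpi); /* 1.69 */
    return 0;
}
```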

Why is memory the bottleneck for system performance?
In the previous example, suppose we make the processor faster by reducing CPI from 2 to 1. The memory-stall cycles remain the same, 1.38 I.
CPU time with stalls / CPU time with perfect cache = (1 I + 1.38 I) / 1 I = 2.38
Percentage of time spent in memory stalls: 1.38 / 3.38 = 41% at CPI 2, but 1.38 / 2.38 = 58% at CPI 1.
As the CPU gets faster (lower CPI, or higher clock rate), memory accounts for an ever larger share of system performance.

Outline
- Introduction
- Basics of caches
- Measuring cache performance
- Set-associative caches (reduce the miss rate)
- Multilevel caches
- Virtual memory
Goal of this part: make the memory system fast.

How to improve cache performance?

Memory-stall cycles = (memory accesses / program) x miss rate x miss penalty

- Reduce the miss rate: larger caches, or set-associative caches (a placement rule more flexible than direct mapping).
- Reduce the miss penalty: multilevel caches.

Flexible placement of blocks
Recall the direct-mapped cache: one address maps to exactly one block in the cache. What if one address could map to more than one block in the cache?

Fully associative cache
A memory block can be placed in any block in the cache.
Disadvantage: all entries in the cache must be searched for a match, using parallel comparators.

Set-associative cache
A compromise between direct-mapped and fully associative: a memory block can be placed in any block of one set in the cache, where
set index = (block address) modulo (number of sets in the cache); e.g., block address 12 with 4 sets: 12 modulo 4 = 0.
Disadvantage: all entries in the set must be searched for a match, with parallel comparators.
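A sketch of a set-associative lookup (the structure names are hypothetical, and LRU is one common replacement policy, which the slide does not fix):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 4
#define WAYS     2           /* 2-way set-associative */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t lru;            /* larger = more recently used; 0 = never used */
} Way;

static Way cache[NUM_SETS][WAYS];
static uint32_t clock_tick = 0;

/* Returns true on a hit; on a miss, fills the least-recently-used way of the set. */
bool access_block(uint32_t block_addr) {
    uint32_t set = block_addr % NUM_SETS;
    uint32_t tag = block_addr / NUM_SETS;

    for (int w = 0; w < WAYS; w++)           /* hardware compares all ways in parallel */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].lru = ++clock_tick;        /* hit */
            return true;
        }

    int victim = 0;                          /* miss: pick the LRU way (invalid ways have lru 0) */
    for (int w = 1; w < WAYS; w++)
        if (cache[set][w].lru < cache[set][victim].lru)
            victim = w;
    cache[set][victim] = (Way){ true, tag, ++clock_tick };
    return false;
}
```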

Example: a 4-way set-associative cache, with parallel comparators on the four tags of the selected set.

All placement schemes are special cases of set-associativity. Ex.: an 8-block cache can be organized as direct-mapped (eight 1-block sets), 2-way set-associative (four sets), 4-way set-associative (two sets), or fully associative (one 8-block set).

Example: set-associative caches (p. 500)
A cache with 4 blocks; load data with block addresses 0, 8, 0, 6, 8.
One-way set-associative cache (direct-mapped): 5 misses.

Example (cont.): 2-way set-associative cache: 4 misses. 4-way set-associative cache (fully associative here): 3 misses. A driver for the lookup sketch above reproduces these counts below.
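A hypothetical driver for the access_block sketch above; recompile with NUM_SETS and WAYS set to each configuration (assuming LRU replacement, as in the textbook example):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

bool access_block(uint32_t block_addr);   /* from the sketch above */

int main(void) {
    uint32_t stream[] = { 0, 8, 0, 6, 8 };   /* block addresses from p. 500 */
    int misses = 0;
    for (int i = 0; i < 5; i++)
        if (!access_block(stream[i]))
            misses++;
    /* NUM_SETS=4, WAYS=1 -> 5 misses; NUM_SETS=2, WAYS=2 -> 4; NUM_SETS=1, WAYS=4 -> 3 */
    printf("%d misses\n", misses);
    return 0;
}
```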

Short conclusion
A higher degree of associativity gives a lower miss rate, but costs more hardware for the parallel search.

Outline
- Introduction
- Basics of caches
- Measuring cache performance
- Set-associative caches
- Multilevel caches (reduce the miss penalty)
- Virtual memory
Goal of this part: make the memory system fast.

Multi-level cache
Goal: reduce the miss penalty. An access first tries the primary cache (L1); on an L1 cache miss it tries the secondary cache (L2); on an L2 cache miss it goes to main memory.

Example: performance of a multilevel cache
CPI = 1 without cache misses; clock rate = 500 MHz.
Primary cache: miss rate 5% (per instruction).
Secondary cache: miss rate 2% (misses that go all the way to main memory), access time 20 ns.
Main memory: access time 200 ns.
Total CPI = base CPI (1) + memory-stall CPI (?)

Example: performance of a multilevel cache (cont.)
Access to main memory = 200 ns x 500M clocks/sec = 100 clocks.
Access to the L2 cache = 20 ns x 500M clocks/sec = 10 clocks.
Two-level cache: Total CPI = 1 + L1 stalls + L2 stalls = 1 + 5% x 10 + 2% x 100 = 1 + 0.5 + 2.0 = 3.5
One-level cache: Total CPI = 1 + 5% x 100 = 6
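The same two-level arithmetic as a runnable sketch (variable names invented):

```c
#include <stdio.h>

int main(void) {
    double clock_ghz    = 0.5;                    /* 500 MHz */
    double l2_cycles    = 20.0  * clock_ghz;      /* 20 ns  -> 10 cycles */
    double mem_cycles   = 200.0 * clock_ghz;      /* 200 ns -> 100 cycles */
    double l1_miss_rate = 0.05;                   /* per instruction */
    double l2_miss_rate = 0.02;                   /* per instruction, to main memory */

    double cpi_two_level = 1.0 + l1_miss_rate * l2_cycles + l2_miss_rate * mem_cycles;
    double cpi_one_level = 1.0 + l1_miss_rate * mem_cycles;
    printf("two-level CPI = %.1f, one-level CPI = %.1f\n",
           cpi_two_level, cpi_one_level);         /* 3.5 and 6.0 */
    return 0;
}
```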