
1 Memory Design Principles Principle of locality dominates design Smaller = faster Hierarchy goal: total memory system almost as cheap as the cheapest component, almost as fast as the fastest component

2 Memory Hierarchy Design Chapter 5 covers the memory hierarchy, mainly the caches and main memory. Later in the course, in chapter 7, we discuss the lower level – the I/O system. Several performance examples will be studied, along with a continuing example: the HP AlphaServer ES40 using the Alpha 21264 microprocessor

3 Remarks Most of this chapter is aimed at performance issues surrounding the transfer of data and instructions between the CPU and the cache. Some main memory material is presented in the chapter too

4 Outline Four memory questions (block placement, block identification, block replacement, write strategy); HP AlphaServer ES40; cache performance (three examples; out-of-order processors)

5 Outline - continued Improving cache performance (reducing miss penalty, reducing miss rate, use of parallelism, reducing the hit time); main memory – improving performance; real-world application of the concepts – the AlphaServer ES40 with the 21264 memory hierarchy

6 The ABC’s of Caches We are not going to review much, so be sure you know the material in section 5.2

7 Four Memory Hierarchy Questions Block placement – three categories of cache organization: direct mapped, fully associative, and set associative. Make sure you understand all three categories. Today, direct-mapped, 2-way, and 4-way set-associative organizations dominate the market (a small sketch follows below)
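A minimal sketch, not from the text, of how these organizations differ: the candidate set for a block is (block address) mod (number of sets), and the associativity just determines how many sets a cache of a given size has. The cache size of 8 blocks is a made-up example.

```c
/* Sketch: where a memory block may be placed, assuming a cache of
   NUM_BLOCKS block frames and an associativity of assoc ways
   (1 = direct mapped, NUM_BLOCKS = fully associative). */
#include <stdio.h>

#define NUM_BLOCKS 8    /* hypothetical cache size in blocks */

/* Returns the index of the set that may hold block_addr;
   the block may go in any of the assoc frames of that set. */
unsigned set_index(unsigned block_addr, unsigned assoc)
{
    unsigned num_sets = NUM_BLOCKS / assoc;
    return block_addr % num_sets;   /* (block address) mod (number of sets) */
}

int main(void)
{
    unsigned block = 12;
    printf("direct mapped    : set %u of 8\n", set_index(block, 1));
    printf("2-way set assoc  : set %u of 4\n", set_index(block, 2));
    printf("4-way set assoc  : set %u of 2\n", set_index(block, 4));
    printf("fully associative: set %u of 1\n", set_index(block, 8));
    return 0;
}
```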

8 Block Placement and Associativity

9 How is a Block Found in the Cache? Addresses have three fields: tag, index, and block offset. Tags are searched in parallel to save time. Block offsets don't need to be searched, and neither do indexes (explanation on page 399)
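A small sketch of the field split, using illustrative parameters (64-byte blocks and 1024 sets, not the Alpha's actual geometry): the offset selects the byte within the block, the index selects the set, and only the tag is compared against the stored tags.

```c
/* Sketch: splitting an address into block offset, index, and tag
   for a cache with 64-byte blocks and 1024 sets (made-up parameters). */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6    /* 64-byte block -> 6 offset bits  */
#define INDEX_BITS  10   /* 1024 sets     -> 10 index bits  */

int main(void)
{
    uint64_t addr   = 0x1234ABCD;
    uint64_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* Only the tag is compared with the stored tags of the indexed set. */
    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```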

10 Which Block Should be Replaced on a Cache Miss? Situation: a CPU reference misses the cache, so a block must be brought in from main memory – but the set is full – so which block currently in the cache should be replaced (and written back to main memory if dirty) to make room for the new block? Strategies: LRU, FIFO, random
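A sketch of LRU victim selection within one set, assuming each block frame keeps a last_used counter that is updated on every hit (the field and function names are illustrative, not from the text).

```c
/* Sketch: pick the frame to evict from one set under LRU. */
#include <stdint.h>

struct frame {
    int      valid;
    uint64_t tag;
    uint64_t last_used;   /* timestamp of the most recent access */
};

int choose_victim(struct frame set[], int assoc)
{
    int victim = 0;
    for (int i = 0; i < assoc; i++) {
        if (!set[i].valid)
            return i;                         /* free frame: no eviction needed */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                       /* older access -> better victim  */
    }
    return victim;
}
```

FIFO would track fill time instead of last use, and random simply picks any frame in the set.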

11 What Happens on a Write? Writes are more complicated than reads, and they get much more complicated when the multiprocessor problem is studied – more on that in chapter 6. Be sure you fully understand write back, write through, and dirty bits
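A minimal sketch of the two write policies on a write hit, assuming a write-back cache marks the block dirty while a write-through cache updates memory immediately. The structure and write_word_to_memory stub are hypothetical stand-ins, not the book's code.

```c
/* Sketch: write-hit handling under write back vs. write through. */
#include <stdint.h>

struct line {
    int      valid;
    int      dirty;       /* write back: block modified since it was filled */
    uint64_t tag;
    uint8_t  data[64];
};

/* Stand-in for the memory interface. */
static void write_word_to_memory(uint64_t tag, unsigned offset, uint8_t value)
{
    (void)tag; (void)offset; (void)value;
}

void write_hit(struct line *l, unsigned offset, uint8_t value, int write_through)
{
    l->data[offset] = value;
    if (write_through)
        write_word_to_memory(l->tag, offset, value); /* memory updated now */
    else
        l->dirty = 1;   /* memory updated only when the dirty block is evicted */
}
```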

12 What is the Big Problem with Writes? On a read, the block in the cache can be read at the same time that the tag is read and compared. If the read is a hit, fine; if it is a miss, the data is simply ignored – no benefit, but no harm. On a write, modifying a block cannot begin until the tag has been checked to confirm the address is a hit. Because the tag check and the data modification cannot overlap, writes normally take longer than reads

13 Reads vs. Writes - Frequency This is only one study, but it is representative: page 400, quoting Figure 2.32, puts writes at about 7% of overall memory traffic

14 Real World Example: The Alpha 21264 Microprocessor http://h18002.www1.hp.com/alphaserver/es40/ Up to 4 processors, mid-level server family. 64KB instruction cache, 64KB data cache (on chip). The ES40 uses an 8MB direct-mapped second-level cache (see pg 485); 1-16MB across the server family. Benchmark results are listed on the web page

15 The Alpha 21264 Microprocessor - continued Cache (most of these facts are on pp. 404-5): two-way set associative; FIFO replacement; 64-byte blocks; write back. 44-bit physical address – not the full 64-bit virtual space, but the designers did not think anyone needed that large an address space yet

16 Data Cache in the Alpha 21264 microprocessor

17 Cache Performance Section 5.3 has several interesting and instructive examples. Minimizing average memory access time is our usual goal for this chapter. However, the final goal is to reduce CPU execution time, and these two goals can actually give different results (example on page 409). Key parameters: miss rate, miss penalty, and hit time

18 Cache Performance Equations The main equation: Average memory access time = Hit time + Miss rate × Miss penalty. Miss rate is often divided into two separate miss rates: the instruction miss rate and the data miss rate
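A worked instance of the main equation with made-up numbers (1-cycle hit time, 2% miss rate, 100-cycle miss penalty), just to show how the terms combine.

```c
/* Sketch: Average memory access time = Hit time + Miss rate * Miss penalty. */
#include <stdio.h>

int main(void)
{
    double hit_time = 1.0, miss_rate = 0.02, miss_penalty = 100.0;
    double amat = hit_time + miss_rate * miss_penalty;  /* 1 + 0.02*100 = 3 cycles */
    printf("Average memory access time = %.2f cycles\n", amat);
    return 0;
}
```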

19 Cache Performance Equations - continued Design parameters that affect the equation's terms: separate or unified cache, direct-mapped vs. associative cache, the time required to find a block in the cache, and others. Average memory access time = Hit time + Miss rate × Miss penalty
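When the instruction and data miss rates are kept separate, the overall average is a weighted combination. The split and miss rates below are illustrative numbers, not figures from the text.

```c
/* Sketch: combining separate instruction and data miss rates, assuming
   (made-up split) 75% of memory references are instruction fetches. */
#include <stdio.h>

int main(void)
{
    double hit_time = 1.0, miss_penalty = 100.0;
    double i_frac = 0.75, i_miss = 0.005;   /* instruction fraction / miss rate */
    double d_frac = 0.25, d_miss = 0.04;    /* data fraction / miss rate        */

    double amat = i_frac * (hit_time + i_miss * miss_penalty)
                + d_frac * (hit_time + d_miss * miss_penalty);
    printf("Overall AMAT = %.3f cycles\n", amat);  /* 0.75*1.5 + 0.25*5 = 2.375 */
    return 0;
}
```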

20 Out-of-Order Execution Out-of-order execution changes our definitions – there is no single correct definition. Read pages 182-184 for a discussion of out-of-order processing; we will get back to the topic later in the course

21 Reducing Cache Miss Penalty Multilevel caches – the first level is small enough to match the clock cycle time of a fast CPU, while the second level is large enough to capture many accesses that would otherwise go to main memory and pay a large penalty. Check out the definitions of local miss rate and global miss rate (a worked example follows below). Study the two diagrams, Figures 5.10 and 5.11, for simulation studies of the Alpha 21264
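A sketch of the local/global distinction with made-up numbers: the local L2 miss rate is measured against L2 accesses (i.e., L1 misses), while the global L2 miss rate is measured against all CPU references, so it is the product of the two local rates.

```c
/* Sketch: local vs. global miss rates for a two-level cache (made-up numbers). */
#include <stdio.h>

int main(void)
{
    double l1_miss_local = 0.04;   /* L1 misses / L1 accesses             */
    double l2_miss_local = 0.50;   /* L2 misses / L2 accesses (L1 misses) */
    double l2_miss_global = l1_miss_local * l2_miss_local;  /* misses / CPU refs */

    double hit_l1 = 1.0, hit_l2 = 10.0, mem_penalty = 100.0;
    double amat = hit_l1 + l1_miss_local * (hit_l2 + l2_miss_local * mem_penalty);

    printf("L2 global miss rate = %.3f\n", l2_miss_global);  /* 0.020 */
    printf("AMAT = %.2f cycles\n", amat);  /* 1 + 0.04*(10 + 0.5*100) = 3.4 */
    return 0;
}
```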

22 Miss Rates vs. Cache Size for Multilevel Caches

23 Relative Execution Time by Second-level Cache Size

24 More Methods for Reducing Cache Miss Penalty Critical word first and early restart. Give priority to read misses over write misses. Merging write buffers. Victim caches – remember what was discarded in case it is needed again; the Alpha 21264 uses this concept. Study the next diagram (a small sketch also follows below)
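A minimal sketch of the victim-cache idea, assuming a tiny fully associative buffer checked on a main-cache miss. The structure, size, and function names are illustrative and are not meant to describe the 21264's actual victim buffer.

```c
/* Sketch: a small fully associative victim cache of recently evicted blocks. */
#include <stdint.h>

#define VICTIM_ENTRIES 4

struct victim_entry { int valid; uint64_t block_addr; };
static struct victim_entry victim[VICTIM_ENTRIES];

/* On a main-cache miss, see whether the evicted block is still here; if so,
   it can be swapped back in far more cheaply than going to main memory. */
int victim_hit(uint64_t block_addr)
{
    for (int i = 0; i < VICTIM_ENTRIES; i++)
        if (victim[i].valid && victim[i].block_addr == block_addr)
            return 1;
    return 0;
}

/* When the main cache evicts a block, remember it (simple rotating slot). */
void victim_insert(uint64_t block_addr)
{
    static int next;
    victim[next].valid = 1;
    victim[next].block_addr = block_addr;
    next = (next + 1) % VICTIM_ENTRIES;
}
```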

25 “Victim” Caches

26 Reducing Miss Rate Larger block size – larger blocks take more advantage of spatial locality, but they also increase the miss penalty; study the following diagram. Larger caches – an obvious technique, but a drawback is longer hit time and higher cost

27 Miss Rate vs. Block Size

28 Reducing Miss Rate - continued Higher associativity – rules of thumb apply; check them out on page 429. Way prediction – predicting the way, or block within the set, of the next cache access (a sketch follows below)
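A sketch of way prediction for a 2-way set-associative cache: probe the predicted way first, and only pay extra time to check the other way when the prediction is wrong. The structure and field names are illustrative, not the 21264's actual mechanism.

```c
/* Sketch: way prediction in a 2-way set-associative cache. */
#include <stdint.h>

struct way { int valid; uint64_t tag; };
struct set { struct way ways[2]; int predicted_way; };

/* Returns the matching way, or -1 on a miss; *extra_cycle is set when the
   prediction was wrong and the second way had to be probed. */
int lookup(struct set *s, uint64_t tag, int *extra_cycle)
{
    int p = s->predicted_way;
    *extra_cycle = 0;
    if (s->ways[p].valid && s->ways[p].tag == tag)
        return p;                          /* fast hit in the predicted way */
    *extra_cycle = 1;
    if (s->ways[1 - p].valid && s->ways[1 - p].tag == tag) {
        s->predicted_way = 1 - p;          /* update the predictor */
        return 1 - p;
    }
    return -1;                             /* miss in both ways */
}
```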

29 Reducing Miss Rate - continued Compiler optimization. Observations: code can be rearranged without affecting correctness, and reordering instructions can maximize use of the data in a cache block before it is discarded. Very important – widely used, especially in scientific and DSP software, which make heavy use of matrices, iterated loops, and very predictable code. Study the examples that go with the next two diagrams (pages 432-5); a sketch of one such rearrangement follows below
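A sketch of one such compiler-style rearrangement, loop interchange: C stores arrays in row-major order, so the second nest below walks through x in the order it sits in memory and uses each cache block fully before discarding it, while the first nest strides across rows and misses far more often. The array dimensions are illustrative.

```c
/* Sketch: loop interchange to improve spatial locality (row-major C arrays). */
#define N 100
#define M 5000
double x[N][M];

void column_order(void)        /* strides through memory by M doubles per step */
{
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

void row_order(void)           /* sequential accesses: far fewer cache misses */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            x[i][j] = 2.0 * x[i][j];
}
```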

30 Arrays x, y, and z when i=1

31 The Age of accesses to x, y and z

32 Reducing Hit Time Key observation: a time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address. Hence smaller = faster!

33 Summary of Techniques Study the (large) table on page 449

34 Virtual Memory Note the relative penalty for missing the cache versus the relative penalty for missing main memory; hence a lower miss rate is always the preferred goal (see Table 5.32 on page 426)

35 Main Memory Revisit the four questions asked about the memory hierarchy Pages 463-5

36 Summary of Virtual Memory and Caches “With virtual memory, first-level caches and second-level caches all mapping portions of the virtual and physical address space, it can get confusing what bits go where.” (page 467) Study the following diagram (Figure 5.37) and review the related concepts if anything is unclear

37 Overall Picture of the Memory Hierarchy

38 The Cache Coherence Problem There is a short section (pages 480-2) about this. Read it over – it is easy enough. We will greatly expand upon this idea later for multiprocessors in chapter 6

39 The Alpha 21264 Memory Hierarchy Note the location of the following components: ITLB, DTLB, victim buffer. Note also that the 21264 is an out-of-order execution processor that fetches up to four instructions per clock cycle. The ES40 has an 8MB direct-mapped L2 cache. Way prediction is used in the instruction cache

40 Alpha 21264 Performance Look over the benchmark results in Table 5.45 and the comments on pages 487-8. The SPEC95 programs do not tax the 21264 memory hierarchy, but the database benchmarks do. The text suggests that microprocessors designed for use in servers may see much heavier demands on their memory systems than those designed for desktops

41 Sony Playstation 2

