Dr Masri Ayob TK 2123 COMPUTER ORGANISATION & ARCHITECTURE Lecture 7: CPU and Memory (3)


2 Contents This lecture will discuss: Cache. Error Correcting Codes.

3 The Memory Hierarchy Trade-off: cost, capacity and access time. Faster access time means greater cost per bit. Greater capacity means smaller cost per bit. Greater capacity means slower access time. Access time - the time it takes to perform a read or write operation. Memory cycle time - the time the memory may need to "recover" before the next access, i.e. access time plus recovery time. Transfer rate - the rate at which data can be moved.

4 Memory Hierarchies A five-level memory hierarchy.

5 Hierarchy List Registers, L1 cache, L2 cache, main memory, disk cache, disk, optical, tape. Moving down the list from internal memory (registers through main memory) to external memory (disk, optical, tape): decreasing cost/bit, increasing capacity, and slower access time.

6 Hierarchy List It would be nice to use only the fastest memory, but because that is the most expensive memory, we trade off access time for cost by using more of the slower memory. The design challenge is to organise the data and programs in memory so that the accessed memory words are usually in the faster memory. In general, it is likely that most future accesses to main memory by the processor will be to locations recently accessed. So the cache automatically retains a copy of some of the recently used words from the DRAM. If the cache is designed properly, then most of the time the processor will request memory words that are already in the cache.

7 Hierarchy List No one technology is optimal in satisfying the memory requirements for a computer system. As a consequence, the typical computer system is equipped with a hierarchy of memory subsystems: some internal to the system (directly accessible by the processor) and some external (accessible by the processor via an I/O module).

8 Cache Small amount of fast memory. Sits between normal main memory and the CPU. May be located on the CPU chip or in a separate module.

9 Cache The cache contains a copy of portions of main memory. When the processor attempts to read a word of memory, a check is made to determine if the word is in the cache. If so (hit), the word is delivered to the processor. If not (miss), a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor. Because of the phenomenon of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block. The ratio of hits to the total number of requests is known as the hit ratio.
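As a rough illustration of why the hit ratio matters, the sketch below computes an effective (average) access time from a hit ratio; the timing values and the function name are illustrative assumptions, not figures from the lecture.

```python
# Minimal sketch: effective access time as a function of the hit ratio.
# The nanosecond figures below are made-up illustrative values.

def effective_access_time(hit_ratio, t_cache_ns, t_main_ns):
    """Hits cost one cache access; misses cost the cache check plus a main-memory access."""
    return hit_ratio * t_cache_ns + (1 - hit_ratio) * (t_cache_ns + t_main_ns)

hits, requests = 950, 1000
hit_ratio = hits / requests                 # hit ratio = hits / total requests
print(f"hit ratio = {hit_ratio:.2f}")       # 0.95
print(f"average access time = {effective_access_time(hit_ratio, 2, 50):.1f} ns")   # 4.5 ns
```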

10 Cache/Main Memory Structure

11 Cache operation – overview The CPU requests the contents of a memory location. Check the cache for this data. If present, get it from the cache (fast). If not present, read the required block from main memory into the cache, then deliver the word from the cache to the CPU. The cache includes tags to identify which block of main memory is in each cache slot.
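A minimal sketch of that read flow for a direct-mapped cache is shown below; the class and parameter names (SimpleCache, num_lines, block_size) are illustrative assumptions rather than anything defined in the lecture.

```python
# Minimal sketch of the read flow: check the tag, fill the line on a miss,
# then deliver the requested word from the cache.

class SimpleCache:
    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines          # which block each line currently holds
        self.data = [None] * num_lines          # the cached block contents

    def read(self, address, main_memory):
        block = address // self.block_size      # main-memory block number
        offset = address % self.block_size      # word within the block
        line = block % self.num_lines           # the one line this block may occupy
        tag = block // self.num_lines           # identifies the block held in that line
        if self.tags[line] != tag:              # miss: fetch the whole block first
            start = block * self.block_size
            self.data[line] = main_memory[start:start + self.block_size]
            self.tags[line] = tag
        return self.data[line][offset]          # deliver the word (hit or just-filled)

main_memory = list(range(1024))                 # fake memory: each word holds its own address
cache = SimpleCache(num_lines=16, block_size=4)
print(cache.read(37, main_memory))              # miss: block 9 fetched, word 37 returned
print(cache.read(38, main_memory))              # hit: same block, served from the cache
```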

12 Cache Operation

13 Cache Design Size. Mapping function. Replacement algorithm. Write policy. Block size. Number of caches – L1, L2, L3, etc.

14 Size does matter Cost: more cache is more expensive. Speed: more cache is faster (up to a point), but checking the cache for data takes time. We would like the cache to be small enough that the overall average cost per bit is close to that of main memory alone, and large enough that the overall average access time is close to that of the cache alone. The larger the cache, the larger the number of gates involved in addressing the cache. The result is that large caches tend to be slightly slower than small ones.

15 Comparison of Cache Sizes
Processor | Type | Year of Introduction | L1 cache (a) | L2 cache | L3 cache
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | — | —
PDP-11/70 | Minicomputer | 1975 | 1 KB | — | —
VAX 11/780 | Minicomputer | 1978 | 16 KB | — | —
IBM 3033 | Mainframe | 1978 | 64 KB | — | —
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | — | —
Intel 80486 | PC | 1989 | 8 KB | — | —
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | —
PowerPC 601 | PC | 1993 | 32 KB | — | —
PowerPC 620 | PC | 1996 | 32 KB/32 KB | — | —
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | —
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | —
IBM SP | High-end server/supercomputer | 2000 | 32 KB/32 KB | 8 MB | —
CRAY MTA (b) | Supercomputer | 2000 | 8 KB | 2 MB | —
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | —
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB
CRAY XD-1 | Supercomputer | 2004 | 64 KB/64 KB | 1 MB | —
(a) Two values separated by a slash refer to separate instruction and data caches. (b) Both caches are instruction only; no data caches.

16 Cache: Mapping Function There are fewer cache lines than main memory blocks, so an algorithm is needed for mapping main memory blocks into cache lines. Three techniques: direct, associative, and set associative.

17 Direct Mapping Each block of main memory maps to only one cache line, i.e. if a block is in cache, it must be in one specific place. Pros and cons: simple; inexpensive; fixed location for a given block, so if a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high.
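As an illustration of that fixed placement, the sketch below splits an address into tag, line and word fields for a direct-mapped cache; the geometry (64 lines, 4 words per block) is an assumed example, not a figure from the lecture.

```python
# Minimal sketch: how a memory address is interpreted under direct mapping.

NUM_LINES  = 64          # illustrative cache geometry
BLOCK_SIZE = 4           # words per block/line

def direct_map(address):
    """Split an address into (tag, line, word) for a direct-mapped cache."""
    word  = address % BLOCK_SIZE        # word within the block
    block = address // BLOCK_SIZE       # main-memory block number
    line  = block % NUM_LINES           # the one line this block can occupy
    tag   = block // NUM_LINES          # distinguishes blocks sharing that line
    return tag, line, word

print(direct_map(0x1234))   # (18, 13, 0): block 0x48D maps to line 13 with tag 18
```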

18 Associative Mapping A main memory block can load into any line of cache. The memory address is interpreted as a tag and a word. The tag uniquely identifies a block of memory, and every line's tag is examined for a match. Disadvantage: cache searching gets expensive, because complex circuitry is required to examine the tags of all cache lines in parallel.

19 Set Associative Mapping A compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages. The cache is divided into a number of sets. Each set contains a number of lines. A given block maps to any line in a given set, e.g. block B can be in any line of set i. With fully associative mapping, the tag in a memory address is quite large and must be compared to the tag of every line in the cache. With k-way set associative mapping, the tag in a memory address is much smaller and is only compared to the k tags within a single set.
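The sketch below shows the corresponding address interpretation for k-way set associative mapping; the geometry (128 lines, 4-way, 4-word blocks) is again an illustrative assumption.

```python
# Minimal sketch: address interpretation for a k-way set associative cache.

NUM_LINES  = 128
K          = 4                      # lines per set (k-way)
NUM_SETS   = NUM_LINES // K
BLOCK_SIZE = 4                      # words per block

def set_assoc_map(address):
    """Split an address into (tag, set, word); the block may sit in any of the
    K lines of that set, so only K tags have to be compared on a lookup."""
    word    = address % BLOCK_SIZE
    block   = address // BLOCK_SIZE
    set_idx = block % NUM_SETS
    tag     = block // NUM_SETS
    return tag, set_idx, word

print(set_assoc_map(0x1234))        # (36, 13, 0) with the assumed geometry
```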

20 Replacement Algorithms When cache memory is full, some block in cache memory must be selected for replacement. Direct mapping: no choice. Each block only maps to one line, so replace that line.

21 Replacement Algorithms (2) Associative & set associative: the algorithm is implemented in hardware (for speed). Least recently used (LRU): keeps track of the usage of each block and replaces the block that was last used the longest time ago. First in first out (FIFO): replace the block that has been in the cache longest. Least frequently used (LFU): replace the block that has had the fewest hits. Random.
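A minimal software sketch of LRU for a single k-way set is given below; real caches implement this in hardware, and the class and method names here are illustrative assumptions.

```python
# Minimal sketch of LRU replacement for one cache set, using an ordered dict
# whose insertion order doubles as the recency order (oldest entry first).

from collections import OrderedDict

class LRUSet:
    def __init__(self, k):
        self.k = k                          # k-way set: at most k resident blocks
        self.lines = OrderedDict()          # tag -> block data, least recent first

    def access(self, tag, fetch_block):
        if tag in self.lines:               # hit: mark as most recently used
            self.lines.move_to_end(tag)
            return self.lines[tag]
        if len(self.lines) >= self.k:       # miss with a full set: evict the LRU block
            self.lines.popitem(last=False)
        self.lines[tag] = fetch_block(tag)  # fill the freed line
        return self.lines[tag]

lru = LRUSet(2)
for t in [1, 2, 1, 3, 2]:                   # tag 2 is evicted by 3, so the last access misses again
    lru.access(t, lambda tag: f"block {tag}")
```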

22 Write Policy Issues: Must not overwrite a cache block unless main memory is up to date. Multiple CPUs may have individual caches. I/O may address main memory directly.

23 Write through All writes go to main memory as well as to the cache. Multiple CPUs can monitor main memory traffic to keep their local (to the CPU) cache up to date. Disadvantages: lots of traffic; slows down writes; can create a bottleneck.
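The sketch below illustrates the write-through idea: every write goes to main memory, and the cached copy is updated too when the block is resident. The dict-based layout and function name are illustrative assumptions, not the lecture's notation.

```python
# Minimal sketch of the write-through policy.

def write_through(cache_lines, main_memory, address, value, block_size=4):
    main_memory[address] = value                    # the write always reaches memory
    block, offset = divmod(address, block_size)
    if block in cache_lines:                        # keep any cached copy coherent
        cache_lines[block][offset] = value

main_memory = [0] * 64
cache_lines = {5: [0, 0, 0, 0]}                     # block 5 is currently cached
write_through(cache_lines, main_memory, 22, 99)     # address 22 is block 5, offset 2
print(main_memory[22], cache_lines[5][2])           # 99 99
```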

24 Cache: Line Size As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality. Two issues: Larger blocks reduce the number of blocks that fit into a cache; because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after it is fetched. As a block becomes larger, each additional word is farther from the requested word, and therefore less likely to be needed in the near future.

25 Number of Caches Multilevel caches: On-chip cache – a cache on the same chip as the processor. It reduces the processor's external bus activity and therefore speeds up execution times and increases overall system performance. Is an external cache still desirable? Yes – most contemporary designs include both on-chip and external caches, e.g. a two-level cache with the internal cache (L1) and the external cache (L2). Why? If there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, then the processor must access DRAM or ROM memory across the bus – poor performance.

26 Number of Caches More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data. There are two potential advantages of a unified cache: for a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically, and only one cache needs to be designed and implemented. Nevertheless, the trend is toward split caches, as in the Pentium and PowerPC, which emphasize parallel instruction execution and the prefetching of predicted future instructions. Advantage: a split cache eliminates contention for the cache between the instruction fetch/decode unit and the execution unit.

27 Intel Cache Evolution
Problem | Solution | Processor on which feature first appears
External memory slower than the system bus. | Add external cache using faster memory technology. | 386
Increased processor speed results in external bus becoming a bottleneck for cache access. | Move external cache on-chip, operating at the same speed as the processor. | 486
Internal cache is rather small, due to limited space on chip. | Add external L2 cache using faster technology than main memory. | 486
Contention occurs when both the Instruction Prefetcher and the Execution Unit simultaneously require access to the cache. In that case, the Prefetcher is stalled while the Execution Unit's data access takes place. | Create separate data and instruction caches. | Pentium
Increased processor speed results in external bus becoming a bottleneck for L2 cache access. | Create a separate back-side bus that runs at higher speed than the main (front-side) external bus. The BSB is dedicated to the L2 cache. | Pentium Pro
(same problem as above) | Move the L2 cache onto the processor chip. | Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data. The on-chip caches are too small. | Add external L3 cache. | Pentium III
(same problem as above) | Move the L3 cache on-chip. | Pentium 4

28 Locality Why does the principle of locality make sense? In most cases, the next instruction to be fetched immediately follows the last instruction fetched (except for branch and call instructions). A program remains confined to a rather narrow window of procedure-invocation depth; thus, over a short period of time, references to instructions tend to be localised to a few procedures. Most iterative constructs consist of a relatively small number of instructions repeated many times. In many programs, much of the computation involves processing data structures, such as arrays or sequences of records; in many cases, successive references to these data structures will be to closely located data items.

29 Internal Memory (revision)

30 Memory Packaging and Types A SIMM holding 256 MB; two of the chips control the SIMM. A group of chips, typically 8 or 16, is mounted on a tiny PCB and sold as a unit. SIMM – single inline memory module, has a row of connectors on one side. DIMM – dual inline memory module, has a row of connectors on both sides.

31 Error Correction Hard failure: a permanent defect, caused by harsh environmental abuse, manufacturing defects, and wear. Soft error: random and non-destructive, with no permanent damage to memory; caused by power supply problems. Errors are detected using a Hamming error-correcting code.

32 Error Correction When reading out the stored word, a new set of K code bits is generated from the M data bits and compared with the fetched code bits. Possible results: No errors – the fetched data bits are sent out. An error is detected and it is possible to correct it: data bits + error correction bits go through a corrector, which sends out the corrected set of M bits. An error is detected, but it is not possible to correct it; this condition is reported.

33 Error Correcting Code Function A function of the M data bits produces the K code bits. Stored codeword: M + K bits.

34 Error Correcting Codes: Venn diagram (a) Encoding of 1100 (b) Even parity added (c) Error in AC

35 Error Correction: Hamming Distance The number of bit positions in which two codewords differ is called the Hamming distance. If two codewords are a Hamming distance d apart, it will require d single-bit errors to convert one into the other. E.g. the codewords 10001001 and 10110001 are a Hamming distance 3 apart because it takes 3 single-bit errors to convert one into the other. To detect d single-bit errors, you need a distance d + 1 code. To correct d single-bit errors, you need a distance 2d + 1 code. To determine how many bits differ, just compute the bitwise Boolean EXCLUSIVE OR of the two codewords, and count the number of 1 bits in the result.
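That last remark translates directly into code; a minimal sketch, with a hypothetical function name:

```python
# Minimal sketch: Hamming distance via bitwise XOR and a count of the 1 bits.

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two codewords differ."""
    return bin(a ^ b).count("1")      # XOR marks the differing positions; count them

print(hamming_distance(0b10001001, 0b10110001))   # 3
```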

36 Example: Hamming algorithm All bits whose bit number (starting with bit 1) is a power of 2 are parity bits; the rest are used for data. E.g. with a 16-bit word, 5 parity bits are added: bits 1, 2, 4, 8, and 16 are parity bits, and all the rest are data bits. Bit b is checked by those parity bits b1, b2, ..., bj such that b1 + b2 + ... + bj = b. For example, bit 5 is checked by bits 1 and 4 because 1 + 4 = 5.
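Equivalently, the parity bits covering bit b are just the powers of two in b's binary representation; a minimal sketch, with a hypothetical function name:

```python
# Minimal sketch: the parity bits that check bit position b are the powers of
# two that sum to b (its binary decomposition), e.g. 5 = 4 + 1.

def covering_parity_bits(b: int):
    """Return the parity-bit positions (powers of two) that check bit b."""
    return [1 << i for i in range(b.bit_length()) if b & (1 << i)]

print(covering_parity_bits(5))    # [1, 4]  -> bit 5 is checked by parity bits 1 and 4
print(covering_parity_bits(11))   # [1, 2, 8]
```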

37 Example: Hamming algorithm Construction of the Hamming code for a 16-bit memory word by adding 5 check bits to the 16 data bits. We will (arbitrarily) use even parity in this example.

38 Example: Hamming algorithm Consider what would happen if bit 5 were inverted by an electrical surge on the power line. The new codeword would differ from the stored codeword only in bit 5. The 5 parity bits are checked, with the following results: parity bits 1 and 4 are incorrect, but 2, 8, and 16 are correct, so bit 5 (1 + 4) has been inverted.
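A minimal sketch of this check is shown below: it encodes an arbitrary 16-bit word with 5 even-parity check bits, flips bit 5, and re-runs the parity checks. The function names and the chosen data word are illustrative assumptions, not the lecture's example.

```python
# Minimal sketch: build a 21-bit Hamming codeword (even parity, check bits at
# positions 1, 2, 4, 8, 16), invert bit 5, and see which parity checks fail.

def encode16(data_bits):
    """Place the 16 data bits in the non-power-of-two positions of a 21-bit
    codeword and set the 5 parity bits for even parity."""
    word = [0] * 21
    data_positions = [i for i in range(1, 22) if i & (i - 1)]   # not a power of two
    for pos, bit in zip(data_positions, data_bits):
        word[pos - 1] = bit
    for p in (1, 2, 4, 8, 16):
        word[p - 1] = sum(word[i - 1] for i in range(1, 22) if i & p and i != p) % 2
    return word

def failing_parity_bits(word):
    """Return the parity positions whose even-parity check does not hold."""
    n = len(word)
    return [p for p in (1, 2, 4, 8, 16)
            if sum(word[i - 1] for i in range(1, n + 1) if i & p) % 2 != 0]

word = encode16([1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0])   # arbitrary data
word[5 - 1] ^= 1                        # invert bit 5, as in the surge example
print(failing_parity_bits(word))        # [1, 4] -> 1 + 4 = 5, so bit 5 was flipped
```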

39 Thank you Q & A