Presentation is loading. Please wait.

Presentation is loading. Please wait.

0 High-Performance Computer Architecture Memory Organization Chapter 5 from Quantitative Architecture January 2006.

Similar presentations


Presentation on theme: "0 High-Performance Computer Architecture Memory Organization Chapter 5 from Quantitative Architecture January 2006."— Presentation transcript:

1 0 High-Performance Computer Architecture Memory Organization Chapter 5 from Quantitative Architecture January 2006

2 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 1 Basic Ideas -- Locality Instructions and data exhibit both spatial and temporal locality Temporal locality: If a particular instruction or data item is used now, there is a good chance that it will be used again in the near future. Spatial locality: If a particular instruction or data item is used now, there is a good chance that the instructions or data items that are located in memory immediately following or preceding this item will soon be used. Therefore, it is a good idea to move such instruction and data items that are expected to be used soon from slow memory to fast memory (cache). This is prediction, and therefore will not always be correct – depends on the extent of locality.

3 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 2 Memory lags in speed

4 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 3 Memory Hierarchy – Why and How Memory is much slower compared to processor. Faster memories are more expensive. Due to high decoding time and other reasons, larger memories are always slower. Therefore, locate small but very fast memory (SRAM: L1 cache) very close to processor. L2 cache will be larger and slower. Main memory is usually GBs large and are made up of DRAMs: several nanoseconds. Secondary memory is on discs (hard disk, CD, DVD), are hundreds of GBs large, and take milliseconds to access. Similar to warehouse. CPU registers are the closest to CPU, but do not use memory address – they have separate ids.

5 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 4 Speedup due to cache – Example Assume 1 GHz processor using 10 ns memory, and 35% of all executed instructions are load or store. The application runs 1 billion instructions. Execution time = (1 + 1.35*10)*10^9*10^-9 = 14.5 s. Assume all instruction and data required are stored in a perfect cache that operates within the clock period. Exec. Time with perfect cache = 1 ns. Now, assume that the cache has a hit rate of 90%. Exec. time with cache = (1 + 1.35*0.1*10) = 2.35 s. Caches are 95-99% successful in having the required instructions and 75-90% successful for data. How do we design a better cache (or caches)?

6 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 5 von Neumann Architecture Memory holds instructions (in a sequence) and data, and memory items are not distinguished based on their contents – any memory item is a string of bits. Most modern processors have instruction pipelines. Instruction storage exhibits stronger locality. Due to the above two points, we usually split the L1 cache into IC and DC – Harvard architecture. Nowadays, IC or DC usually has 8,16, 32, 64 or 128 kB of information, plus overheads. Tape uses sequential access; RAM/ROM uses random access; disc uses random/sequential access; caches use associative access (i.e. uses a part of the information to find the remaining).

7 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 6 Cache – mapping options

8 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 7 Cache – how is a block found? Index bits are decoded to select one set. Tag bits corresponding to each block in this set are compared with tag bits in the address. Cache miss: no comparison succeeds. Cache hit: 1 comparison succeeds, block identified. Block offset bits: selects byte or word needed. Fully-associative: no index; Direct mapped: largest index field, but only one comparison.

9 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 8 2-way set-associative cache

10 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 9 Cache performance Memory access time = cache hit time + cache miss rate * miss penalty To improve performance, i.e. reduce memory time, we need to reduce hit time, miss rate & miss penalty. As L1 caches are in the critical path of instruction execution, hit time is the most important parameter. When one parameter is improved, others might suffer Misses: Compulsory, Capacity, Conflict. Compulsory miss: always occurs first time. Capacity miss: reduces with cache size. Conflict miss: reduces with level of associativity. IC/DC: 2-way; DC: write through & write no-allocate.

11 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 10 Cache – write hit & write miss Write: Need to update lower levels of cache and main memory whenever a store instruction modifies DC. Write Hit: the item to be modified is in DC. Write Through: as if no DC, write also to L2. Write Back: set a Dirty Bit, and update L2 before replacing this block. Although write through is an inefficient strategy, most DCs and some lower level caches follow this so that read hit time is not affected due to complicated logic to update dirty bit. Write Miss: the item to be modified is not in DC. Write allocate: exploit locality, and bring the block to DC. Write no-allocate: do not fetch the missing block. Usually, write through and write no-allocate policies are used together; write back and write allocate.

12 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 11 Cache size examples Overhead bits: tag bits, valid bit, dirty bit? #address bits: 46; block size: 64B; 64kB 2-way IC/DC, DC uses write through, write no-allocate. #block offset bits: 6; #index bits: 9 as 512 sets, 1024 blocks; tag bits: 46 – (6+9) = 31. IC/DC size: 64*8*1024 + (31+1)*1024 = 64kB + 4kB. #address bits: 40; block size: 32B; 1MB L2 cache, 4- way set-associative, uses write back & write allocate. #block offset bits: 5; #index bits: 13 as 8192 sets, 2^15 blocks; tag bits: 40 – (5+13) = 22. L2 cache size: 1MB + (22+1+1)/8*2^15 B = 1MB+96kB

13 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 12 Cache performance examples 5 GHz pipelined processor; IC hits 98%; L1 read miss 10%, write miss 5%; 25% of all instructions are loads and 10% are writes. Fetching a block from L2 cache takes 40 clk. L2 misses 12%, with a penalty 25 ns. Instruction execution time = 1 + (0.02*(40+(0.12*125))) + (0.25*0.1*(40+0.12*125))) + (0.1*0.05*(40+(0.12*125))) clk = 4.025 clk = 0.805 ns. An L3 cache might improve the performance. Most general purpose computers today use CPU chips that have on-chip IC&DC, on-package L2 cache, and an L3 cache on the motherboard. Caches are also used in the sound card, video card and with other special purpose hardware.

14 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 13 Main memory organizations

15 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 14 Memory interleaving Access time = address transmit + decode + amplify + data transmit. Decode is the largest component. In memory interleaved system, address is transmitted and decode takes place commonly for all the banks; only data transmission occurs from each bank in a pipelined fashion. 0.5 + 6.0 + 0.3 + 4*0.6 = 9.2 ns.

16 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 15 Virtual Memory Relocatability: programmer and compiler do not have to worry about exact location of information. Use any addresses, and leave it to loader at run time. Security: Using extra bits, access of any information can be regulated. Controlled sharing is possible. Different size: virtual memory can be larger than real memory. This is not the major reason nowadays. Virtual memory can be implemented using Paging: most common; discussed in detail here. Segmentation: used in early Intel processors. Hybrid: some processors use them. Virtual Address and Real Address: translation.

17 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 16 Disk & Real memories Operating System (OS) uses processor identifier bits (PID or ASN) and thus permits the same virtual addresses to be used by different processors

18 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 17 Address Translation Page Table is located in the main memory, contains as each entry real page address and addressed using virtual page address. Page size: 16kB or 64kB. Every instruction execution refers to memory roughly 1.35 times, and so a cache for PT is needed => TLB.

19 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 18 Address Translation Cache: TLB Usually split as ITLB & DTLB; only one level, and when TLB misses, exception occurs and OS consults PT and updates TLB. TLB takes care of protection. ITLB: 12 – 40 entries => 200B to 1kB; DTLB: 32 – 64 entries; TLBs are read-only, fully-associative.

20 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 19 Memory Access: TLB & caches TLB first, L1 cache next, and then, if necessary, L2 cache, etc.

21 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 20 Page Table access

22 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 21

23 Winter 2005 ENGR9861 R. Venkatesan High-Performance Computer Architecture 22 Summary Memory design is extremely important: hierarchical. Caches improve performance due to locality present in instruction and data access. L1 cache design is important as it is on critical path; it is split into IC & DC if processor is pipelined. L2 cache (mixed) will be larger and slower than L1; L3 cache will be larger and slower than L2, and main memory will be larger and slower than L3. Cache performance suffers when store is performed. Virtual memory is used for relocatability & protection. Page Tables are used for virtual to real page address translation; TLBs are caches for PT.


Download ppt "0 High-Performance Computer Architecture Memory Organization Chapter 5 from Quantitative Architecture January 2006."

Similar presentations


Ads by Google