CSE 586 Computer Architecture, Lecture 5
Jean-Loup Baer, CSE 586 Spring 00
Highlights from last week
Multiple issue implications:
- Will increase throughput
- The instruction fetch step requires buffering and can become a critical point in the design
- The commit stage must be able to retire multiple instructions in a given cycle
- Decoding, issuing, and dispatching can encounter more structural hazards
Highlights from last week (c'ed)
VLIW - EPIC:
- The compiler plays a major role in scheduling operations
- Merced/Itanium implementation: "bundles" of predicated instructions
- Large register files with "rotating" registers to facilitate loop unrolling, software pipelining, and call/return paradigms
- Predication and sophisticated branch prediction
- Powerful floating-point units
- SIMD instructions for 3D graphics and multimedia
Highlights from last week (c'ed)
Predication:
- Partial predication (conditional moves)
- Full predication (predicate definitions); unconditional and OR predicates
Multithreading:
- Fine-grain multithreading (Tera MTA)
- Simultaneous multithreading
Principle of Locality: Memory Hierarchies
Text and data are not accessed randomly.
- Temporal locality: recently accessed items will be accessed in the near future (e.g., code in loops, top of stack)
- Spatial locality: items at addresses close to those of recently accessed items will be accessed in the near future (sequential code, elements of arrays)
This leads to a memory hierarchy at two main interface levels:
- Processor - main memory: introduction of caches
- Main memory - secondary memory: virtual memory (paging systems)
Processor - Main Memory Hierarchy
- Registers: those visible to the ISA plus those renamed by hardware
- (Hierarchy of) caches, plus their enhancements: write buffers, victim caches, etc.
- TLBs and their management
- Virtual memory system (O.S. level) and hardware assists (page tables)
Inclusion of information (or of space to gather information) level per level is almost always true.
Questions that Arise at Each Level
What is the unit of information transferred from level to level?
- Word (byte, double word) to/from a register
- Block (line) to/from cache
- Page table entry + misc. bits to/from TLB
- Page to/from disk
When is the unit of information transferred from one level to a lower level in the hierarchy?
- Generally, on demand (cache miss, page fault)
- Sometimes earlier (prefetching)
Questions that Arise at Each Level (c'ed)
Where in the hierarchy is that unit of information placed?
- For registers, directed by the ISA
- For caches, in general in L1; possibility of hinting to another level (Merced ISA), of bypassing the cache entirely, or of placing it in special buffers
How do we find out whether a unit of information is in a given level of the hierarchy?
- Depends on the mapping; hardware/software structures (page tables)
What happens if there is no room for it?
- Replacement algorithm
What happens when we change the contents of the information, i.e., what happens on a write?
Caches (on-chip, off-chip)
Caches consist of a set of entries where each entry has:
- a block (or line) of data: the information contents (initially, the image of some main memory contents)
- a tag: allows recognizing whether the block is there (depends on the mapping)
- status bits: valid, dirty, state for multiprocessors, etc.
Capacity (or size) of a cache = number of blocks * block size; i.e., the cache metadata (tag + status bits) is not counted in the cache capacity.
Notation: first-level (on-chip) cache: L1; second-level (on-chip/off-chip): L2; third-level (off-chip): L3.
Cache Organization -- Direct-mapped
Most restricted mapping: direct-mapped cache. A given memory location (block) can be mapped to only a single place in the cache, generally given by:
(block address) mod (number of blocks in cache)
To make the mapping easier, the number of blocks in a direct-mapped cache is a power of 2. There have been proposals, for vector processors, for caches with a number of blocks that is a Mersenne prime (the modulo arithmetic for those numbers has some "nice" properties).
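As a toy illustration of the mod mapping above (a sketch; the function name is mine, not from the slides):

```python
def dm_cache_index(block_address, num_blocks):
    # Direct-mapped placement: a block can go in exactly one frame,
    # (block address) mod (number of blocks in cache).
    # With num_blocks a power of 2, the mod reduces to a bit mask.
    assert num_blocks & (num_blocks - 1) == 0, "power of 2 expected"
    return block_address & (num_blocks - 1)

# Blocks whose addresses differ by a multiple of num_blocks collide:
print(dm_cache_index(5, 256))            # 5
print(dm_cache_index(5 + 3 * 256, 256))  # 5 -- same frame, a conflict
```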
Direct-mapped Cache
[Figure: a cache of C lines and main memory; all addresses mod C map to the same cache block.]
Fully-associative Cache
Most general mapping: fully-associative cache. A given memory location (block) can be mapped anywhere in the cache.
No cache of decent size is implemented this way, but this is the (general) mapping for pages (disk to main memory), for small TLBs, and for some small buffers used as cache assists (e.g., victim caches, write caches).
Fully-associative Cache
[Figure: any memory block can map to any cache block.]
Set-associative Caches
Less restricted mapping: set-associative cache. Blocks in the cache are grouped into sets, and a given memory location (block) maps into a set; within the set the block can be placed anywhere. Associativities of 2 (two-way set-associative), 3, 4, 8, and even 16 have been implemented.
- Direct-mapped = 1-way set-associative
- Fully associative with m entries = m-way set-associative
Capacity = number of sets * set-associativity * block size
Set-associative Cache
[Figure: Bank 0 and Bank 1; a memory block maps into a specific block of either bank.]
Cache Hit or Cache Miss?
How to detect whether a memory address (a byte address) has a valid image in the cache: the address is decomposed into 3 fields:
- block offset or displacement (depends on block size)
- index (depends on number of sets and set-associativity)
- tag (the remainder of the address)
The tag array has a width equal to the tag field.
Hit Detection
The address fields, from high to low bits, are: tag | index | displ. Example: cache capacity C, block size b (32-bit addresses):
- Direct-mapped: displ = log2 b; index = log2(C/b); tag = 32 - index - displ
- N-way set-associative: displ = log2 b; index = log2(C/(bN)); tag = 32 - index - displ
So what does it mean to have 3-way (N = 3) set-associativity?
Examples
16 KB direct-mapped cache with 64-byte lines:
- Number of entries: 16 K / 64 = 256
- Displacement: 6 bits (2^6 = 64)
- Index: 8 bits (log2 256 = 8)
- Tag: 18 bits (32 - 6 - 8)
16 KB 4-way set-associative cache with 64-byte lines:
- Number of sets: (16 K / 64) / 4 = 64
- Displacement: 6 bits
- Index: 6 bits (log2 64 = 6)
- Tag: 20 bits (32 - 6 - 6)
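The two breakdowns above can be checked mechanically (a sketch; 32-bit addresses assumed, as on the previous slide):

```python
from math import log2

def address_fields(capacity, block_size, assoc, addr_bits=32):
    # (displacement, index, tag) widths for a capacity-C cache with
    # b-byte blocks and N-way set associativity.
    displ = int(log2(block_size))
    index = int(log2(capacity // (block_size * assoc)))
    return displ, index, addr_bits - index - displ

print(address_fields(16 * 1024, 64, 1))  # (6, 8, 18): the direct-mapped example
print(address_fields(16 * 1024, 64, 4))  # (6, 6, 20): the 4-way example
```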
Why Set-associative Caches?
Cons:
- The higher the associativity, the larger the number of comparisons to be made in parallel for high performance (can have an impact on cycle time for on-chip caches)
- Higher associativity requires a wider tag array (minimal impact)
Pros:
- Better hit ratio: great improvement from 1 to 2, less from 2 to 4, minimal after that, but can still be important for large L2 caches
- Allows parallel search of TLB and caches for larger (but still small) caches (see later)
Replacement Algorithm
- None for direct-mapped
- Random, LRU, or pseudo-LRU for set-associative caches
Not an important factor for performance at low associativity; can become important for large associativity and large caches.
Writing in a Cache
On a write hit, should we write:
- in the cache only (write-back policy), or
- in the cache and in main memory (or the higher-level cache) (write-through policy)?
On a write miss, should we:
- allocate a block as on a read (write-allocate), or
- write only in memory (write-around)?
The Main Write Options
Write-through (aka store-through):
- On a write hit, write both in cache and in memory
- On a write miss, the most frequent option is write-around
- Pro: consistent view of memory (better for I/O); no ECC required for the cache
- Con: more memory traffic (can be alleviated with write buffers)
Write-back (aka copy-back):
- On a write hit, write only in the cache (requires a dirty bit)
- On a write miss, most often write-allocate (fetch on miss), but variations are possible
- Pros and cons are the reverse of write-through
Classifying the Cache Misses: The 3 C's
- Compulsory misses (cold start): the first time you touch a block. Reduced (for a given cache capacity and associativity) by having large blocks.
- Capacity misses: the working set is too big for the ideal cache of the same capacity and block size (i.e., fully associative with an optimal replacement algorithm). Only remedy: a bigger cache!
- Conflict misses (interference): mapping of two blocks to the same location. Increasing associativity decreases this type of miss.
There is a fourth C: coherence misses (cf. multiprocessors).
Example of Cache Hierarchies
[Figure: table of example cache hierarchies, not recoverable from the transcript.]
Examples (c'ed)
[Figure continued.]
Cache Performance
CPI contributed by the cache: CPIc = miss rate * number of cycles to handle the miss.
Another important metric:
Average memory access time = cache hit time * hit rate + miss penalty * (1 - hit rate)
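Plugging illustrative numbers into the formula above (the hit rate and miss penalty here are made up for the example):

```python
def avg_mem_access_time(hit_time, hit_rate, miss_penalty):
    # Average memory access time =
    #   cache hit time * hit rate + miss penalty * (1 - hit rate)
    return hit_time * hit_rate + miss_penalty * (1 - hit_rate)

# 1-cycle hit, 95% hit rate, 40-cycle miss penalty:
print(round(avg_mem_access_time(1, 0.95, 40), 2))  # 2.95 cycles
```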
Improving Cache Performance
To improve cache performance:
- Decrease the miss rate without increasing the time to handle the miss (more precisely: without increasing average memory access time)
- Decrease the time to handle the miss without increasing the miss rate
A slew of techniques, hardware and/or software:
- Increase capacity, associativity, etc.
- Hardware assists (victim caches, write buffers, etc.)
- Tolerating memory latency: prefetching (hardware and software), lock-up free caches
- O.S. interaction: mapping of virtual pages to decrease cache conflicts
- Compiler interactions: code and data placement; tiling
Obvious Solutions to Decrease Miss Rate
Increase cache capacity:
- Yes, but the larger the cache, the slower the access time
- Limitations for first-level (L1) on-chip caches; solution: cache hierarchies (even on-chip)
- Increasing L2 capacity can be detrimental on multiprocessor systems because of the increase in coherence misses
Increase cache associativity:
- Yes, but "law of diminishing returns" (after 4-way for small caches; not sure of the limit for large caches)
- More comparisons needed, i.e., more logic and therefore a longer time to check for hit/miss
- Alternative: make the cache look more associative than it really is (see later)
What about Cache Block Size?
For a given application, cache capacity, and associativity, there is an optimal cache block size.
Long cache blocks:
- Good for spatial locality (code, vectors)
- Reduce compulsory misses (implicit prefetching)
- But take more time to bring from the next level of the memory hierarchy (can be compensated by "critical word first" and subblocks)
- Increase the possibility of fragmentation (only a fraction of the block is used, or reused)
- Increase the possibility of false sharing in multiprocessor systems
What about Cache Block Size? (c'ed)
In general, the larger the cache, the longer the best block size (e.g., 32 or 64 bytes for on-chip caches; 64, 128, or even 256 bytes for large off-chip caches).
Longer block sizes in I-caches:
- Sequentiality of code
- Matching with the IF unit
Example of (naïve) Analysis
Choice between 32 B (miss rate m32) and 64 B (miss rate m64) block sizes.
Time of a miss = send request + access time + transfer time:
- Send request is independent of block size (e.g., 4 cycles)
- Access time can be considered independent of block size (memory interleaving) (e.g., 28 cycles)
- Transfer time depends on bus width; for a bus width of, say, 64 bits, the transfer time for 64 B (say 16 cycles) is twice that for 32 B (8 cycles)
In this example, 32 B is better if (4 + 28 + 8) m32 < (4 + 28 + 16) m64.
Example of (naïve) Analysis (c'ed)
Case 1: 16 KB cache: m32 = 2.87, m64 = 2.64, and 2.87 * 40 < 2.64 * 48
Case 2: 64 KB cache: m32 = 1.35, m64 = 1.06, and 1.35 * 40 > 1.06 * 48
So 32 B is better for 16 KB and 64 B is better for 64 KB. (Of course the example was designed to support my "in general" statement of a previous slide.)
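The two cases reduce to the comparison below (a sketch of the slide's arithmetic, using the same made-up cycle counts):

```python
def miss_cost(miss_rate, transfer, request=4, access=28):
    # Cost contributed by misses: miss rate * (request + access + transfer)
    return miss_rate * (request + access + transfer)

# 16 KB cache: 32 B blocks win (114.8 < 126.72)
print(miss_cost(2.87, 8) < miss_cost(2.64, 16))   # True
# 64 KB cache: 64 B blocks win (54.0 > 50.88)
print(miss_cost(1.35, 8) < miss_cost(1.06, 16))   # False
```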
Impact of Associativity
"Old" conventional wisdom:
- Direct-mapped caches are faster; cache access is the bottleneck for on-chip L1, so make L1 caches direct-mapped
- For on-board (L2) caches, direct-mapped are 10% faster
"New" conventional wisdom:
- We can make 2-way set-associative caches fast enough for L1; this allows larger caches to be addressed with only the page offset bits (see later)
- Time-wise it does not seem to make much difference for L2/L3 caches, hence provide more associativity (though if caches are extremely large there might not be much benefit)
Reducing Cache Misses with more "Associativity" -- Victim Caches
First example (in this course) of a "hardware assist".
Victim cache: a small fully-associative buffer "behind" the L1 cache and "before" the L2 cache. (It can of course also exist "behind" L2 and "before" main memory.)
Main goal: remove some of the conflict misses in L1 direct-mapped caches (or in any cache with low associativity).
[Figure: victim cache operation. 1: hit in the cache (indexed by index + tag). 2: miss in L1, hit in the victim cache; send data to the register and swap. 3: miss in both; fill from the next level of the memory hierarchy, with the evicted L1 block (3') placed in the victim cache.]
Operation of a Victim Cache
1. Hit in L1: nothing else needed.
2. Miss in L1 for the block at location b, hit in the victim cache at location v: swap the contents of b and v (takes an extra cycle).
3. Miss in L1, miss in the victim cache: load the missing item from the next level and put it in L1; put the entry replaced in L1 into the victim cache; if the victim cache is full, evict one of its entries.
A victim buffer of 4 to 8 entries works well for a 32 KB direct-mapped cache.
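The three cases above can be sketched as follows (a toy model; the class name, sizes, and FIFO eviction choice are illustrative, and block numbers stand in for full lines with tags):

```python
from collections import OrderedDict

class VictimCacheModel:
    def __init__(self, l1_blocks=8, vc_entries=4):
        self.l1 = [None] * l1_blocks   # direct-mapped L1: one block per frame
        self.vc = OrderedDict()        # small fully-associative victim cache
        self.vc_entries = vc_entries

    def access(self, block):
        slot = block % len(self.l1)
        if self.l1[slot] == block:             # case 1: hit in L1
            return "L1 hit"
        if block in self.vc:                   # case 2: hit in VC -- swap
            self.vc.pop(block)
            if self.l1[slot] is not None:
                self.vc[self.l1[slot]] = True
            self.l1[slot] = block
            return "VC hit"
        # case 3: miss in both; victim from L1 goes to the VC
        if self.l1[slot] is not None:
            if len(self.vc) >= self.vc_entries:
                self.vc.popitem(last=False)    # evict oldest VC entry
            self.vc[self.l1[slot]] = True
        self.l1[slot] = block
        return "miss"
```

For example, two blocks that conflict in L1 (say 1 and 9 with 8 frames) ping-pong through the victim cache instead of missing all the way to L2.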
Bringing more Associativity -- Column-associative Caches
Split (conceptually) a direct-mapped cache in two halves.
- Probe the first half according to the index; on a hit, proceed normally.
- On a miss, probe the 2nd half; if it hits, send the data to the register and swap it with the entry in the first half (takes an extra cycle).
- On a miss in both halves, go to the next level, load into the 2nd half, and swap.
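A sketch of that probe sequence (illustrative only; real designs rehash by flipping index bits, simplified here to the same offset in the other half):

```python
class ColumnAssocSketch:
    def __init__(self, blocks=8):
        self.tags = [None] * blocks
        self.half = blocks // 2

    def access(self, block):
        i = block % self.half          # first-half probe (conventional index)
        j = i + self.half              # rehashed probe in the second half
        if self.tags[i] == block:
            return "first-probe hit"
        if self.tags[j] == block:      # second probe hits: swap the entries
            self.tags[i], self.tags[j] = self.tags[j], self.tags[i]
            return "second-probe hit"
        # Miss in both halves: the old first-half entry moves to the
        # second half and the incoming block takes the first-half slot.
        self.tags[j] = self.tags[i]
        self.tags[i] = block
        return "miss"
```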
Column-associative Caches (c'ed)
Implementation: need two more bits of metadata:
- One extra tag bit to indicate whether the entry corresponds to the 1st or the 2nd half
- One bit to indicate whether the entry has been swapped
The latter bit is needed to prevent a block a from remaining in the cache for strings of references of the form abcbcbcbc..., where a, b, and c map to the same cache entry.
Can be extended to set-associative caches.
Is it Worthwhile?
Analysis in terms of either CPI contributed by the cache, the time to resolve a miss, or the average memory access time. For example, computing the average memory access time:
Av. mem. access time = hit rate in L1 * hit time in L1 + miss rate in L1 * miss penalty
If the hit time in L1 is the same for direct-mapped and 2-way, there is no advantage to a victim cache or a pseudo-associative cache. With different hit times:
- 1 cycle for direct-mapped
- 1.1 cycles for 2-way
- 1 cycle for a first-probe hit and 2 or 3 cycles for a second-probe hit
Pseudo-Associative Cache Analysis (c'ed)
The miss penalty to the next level of the memory hierarchy is the same in all cases, so the comparison depends on the first and second hit rates. As a first approximation:
- hit rate in one probe = hit rate of the direct-mapped cache
- hit rate in two probes = hit rate of the 2-way cache - hit rate of the direct-mapped cache
Then plug in the appropriate numbers.
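Plugging numbers into that approximation (a sketch; the hit rates, probe times, and penalty are made-up examples):

```python
def pseudo_assoc_amat(h_dm, h_2way, t_first=1, t_second=3, penalty=40):
    # hit rate in one probe  ~ direct-mapped hit rate
    # hit rate in two probes ~ 2-way hit rate - direct-mapped hit rate
    h_two_probes = h_2way - h_dm
    return h_dm * t_first + h_two_probes * t_second + (1 - h_2way) * penalty

# 90% direct-mapped hit rate, 93% two-way hit rate:
print(round(pseudo_assoc_amat(0.90, 0.93), 2))  # 3.79 cycles
```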
Skewed-associative Caches
Use different mappings for the two (or more) banks of the set-associative cache: the first mapping is conventional; the second one "disperses" the addresses (XOR a few bits). The hit ratio of a 2-way skewed cache is as good as that of a 4-way conventional cache.
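One possible skewing function (a sketch; real skewed caches use carefully chosen XOR-based hashes, this one just folds in the next-higher address bits):

```python
def skewed_indices(block_address, sets):
    # Bank 0: conventional index. Bank 1: XOR in the bits above the index
    # so that bank-0 conflicts usually land in different bank-1 sets.
    i0 = block_address % sets
    i1 = (block_address ^ (block_address // sets)) % sets
    return i0, i1

print(skewed_indices(5, 64))       # (5, 5)
print(skewed_indices(5 + 64, 64))  # (5, 4): conflicts in bank 0, not in bank 1
```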
Reducing Conflicts -- Page Coloring
Interaction of the O.S. with the hardware. In caches where cache size > page size * associativity, bits of the physical address (besides the page offset) are needed for the index. On a page fault, the O.S. selects a mapping that tries to minimize conflicts in the cache.
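The condition can be quantified (a sketch; a 4 KB page is assumed):

```python
from math import log2

def color_bits(capacity, assoc, page_size=4096):
    # Index + offset span log2(capacity / assoc) address bits; any bits
    # beyond the page offset come from the physical page number, and
    # those are the bits page coloring tries to control.
    return max(0, int(log2(capacity // assoc)) - int(log2(page_size)))

print(color_bits(64 * 1024, 2))  # 3 bits -> 8 page "colors"
print(color_bits(16 * 1024, 4))  # 0 bits -> no coloring needed
```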
Options for Page Coloring
Option 1: assume that the faulting process is using the whole cache; attempt to map the page such that the cache will access data as if it were indexed by virtual addresses.
Option 2: do the same thing but hash with bits of the PID (process identification number); this reduces inter-process conflicts (e.g., prevents pages corresponding to the stacks of various processes from mapping to the same area in the cache).
Implemented by keeping "bins" of free pages.
The CML Approach: Monitoring Page Conflicts at Run Time
CML (Cache Miss Look-aside buffer): a TLB-like device that counts misses on a per-page basis. Pages with a high level of reference misses are considered "hot". The O.S. inspects the CML (by polling or interrupt) and can remap pages if two (or more) of them are hot and map to the same area of the cache. Remapping requires physically copying the memory contents.
Tolerating/Hiding Memory Latency
One particular technique: prefetching. Goal: bring data into the cache just in time for its use:
- not too early, otherwise cache pollution
- not too late, otherwise "hit-wait" cycles
under the constraints of (among others) imprecise knowledge of the instruction stream and imprecise knowledge of the data stream.
Hardware/software prefetching works well for regular-stride data accesses but is difficult when there are pointer-based accesses.
Why, What, When, Where
Why? Cf. the goals: hide memory latency and/or reduce cache misses.
What? Ideally a semantic object; practically a cache block, or a sequence of cache blocks.
When? Ideally, just in time; practically, depends on the prefetching technique.
Where? In the cache or in a prefetch buffer.
Hardware Prefetching
Nextline prefetching for instructions: bring the missing block and the next one (if not already there).
OBL ("one block look-ahead") for data prefetching: as nextline, but with more variations -- e.g., depending on whether prefetching was successful the previous time.
Use of special assists:
- Stream buffers, i.e., FIFO queues to fetch consecutive lines (good for instructions, not that good for data)
- Stream buffers with hardware stride-detection mechanisms
- Use of a reference prediction table, etc.
Software Prefetching
Use of special instructions (cache hints: touch in PowerPC, load into register 31 for Alpha). These are non-binding prefetches (in contrast with proposals to prefetch into registers): if an exception occurs, the prefetch is ignored. Prefetches must be inserted by software (compiler analysis).
Advantage: no special hardware. Drawback: more instructions executed.
Hit Rate Improvement Techniques: Algorithm and Compiler Optimizations
- Tiling or blocking: reuse the contents of the data cache as much as possible (excellent results in linear algebra programs)
- Code reordering (for improving I-cache performance): use of profiling information; graph coloring techniques
- Cache-conscious data placement: a good area of research
Techniques to Reduce Cache Miss Penalty
- Give priority to reads -> write buffers
- Send the requested word first -> critical-word-first or wrap-around strategy
- Sectored (subblock) caches
- Lock-up free (non-blocking) caches
- Cache hierarchy
Write Policies
Loads (reads) occur twice as often as stores (writes), while the miss ratios of reads and writes are pretty much the same. Although it is more important to optimize read performance, write performance should not be neglected. Write misses can be delayed without impeding the progress of execution of subsequent instructions.
A Sample of Write Mechanisms
- Fetch-on-write and write-allocate: proceed as on a read miss followed by a write hit. The preferred method for write-back caches.
- No-write-before-hit, no-fetch-on-write, and no-write-allocate (called "write-around"): the cache is not modified on write misses.
- No-fetch-on-write and write-allocate ("write-validate"): write directly into the cache and invalidate all other parts of the line being written. Requires a valid bit per writable entity. Good for initializing a data structure. Assumedly the best policy for eliminating write misses in write-through caches, but more expensive (dirty bits).
Write Mechanisms (c'ed)
- Write-before-hit, no-fetch-on-write, and no-write-allocate (called "write-invalidate"): possible only for direct-mapped write-through caches. The data is written into the cache fully in parallel with the tag check; on a miss, the rest of the line is invalidated, as in the write-validate case.
A mix of these policies can be (and has been) implemented:
- Dictate the policy on a page-per-page basis (bits set in the page table entry)
- Have the compiler generate instructions (hints) for dynamic changes in policy (e.g., write-validate on initialization of a data structure)
Write Buffers
Reads are more important than:
- writes to memory, for a write-through cache
- replacements of dirty lines, for a write-back cache
Hence buffer the writes in write buffers: FIFO queues that hold the data. Since writes have a tendency to come in bunches (e.g., on procedure calls, context switches, etc.), write buffers must be "deep".
Write Buffers (c'ed)
Writes from the write buffer to the next level of the memory hierarchy can proceed in parallel with computation. Loads must now check the contents of the write buffer (this also makes cache coherency more complex in multiprocessors). Read misses are allowed to bypass the writes in the write buffer.
Coalescing Write Buffers and Write Caches
Writes to an address (block) already in the write buffer are combined. Note the tension between writing the coalescing buffer to memory at a high rate (more writes) versus coalescing to the maximum (the buffer might become full). Write buffers can be extended to small fully-associative write caches with a write-back strategy and a dirty bit per byte -- not implemented in any machine I know of.
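The coalescing idea can be sketched like this (a toy model; the class name, depth, and block size are illustrative, and draining simply drops the oldest entry):

```python
from collections import OrderedDict

class CoalescingWriteBuffer:
    def __init__(self, depth=4, block_size=32):
        self.entries = OrderedDict()   # block address -> set of dirty offsets
        self.depth = depth
        self.block_size = block_size

    def write(self, address):
        block = address - address % self.block_size
        if block in self.entries:      # coalesce into the queued entry
            self.entries[block].add(address - block)
            return "coalesced"
        if len(self.entries) >= self.depth:
            self.entries.popitem(last=False)  # drain oldest entry to memory
        self.entries[block] = {address - block}
        return "new entry"
```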
Critical Word First
Send first, from the next level of the memory hierarchy, the word for which there was a miss. Send that word directly to a CPU register (or to the IF buffer if it is an I-cache miss) as soon as it arrives. Needs a one-block buffer to hold the incoming block (and shift it) before storing it in the cache.
Sectored (or subblock) Caches
The first cache ever (IBM 360/85, in the late 60's) was a sector cache. On a cache miss, send only a subblock; change the tag and invalidate all other subblocks.
- Saves memory bandwidth
- Reduces the number of tags, but requires good spatial locality in the application
- Requires metadata status bits (valid, dirty) per subblock
- Might reduce false sharing in multiprocessors
The Alpha L2 uses a dirty bit per 16 B for a 64 B block size.
Sector Cache
[Figure: one sector cache entry: a tag, then status bits and data for subblock 1 through subblock n.]
Lock-up Free Caches
Proposed in the early 1980's but implemented only recently because quite complex. Allow the cache to have several outstanding miss requests (hit under miss). A cache miss "happens" during the EX stage, i.e., with a longer (unpredictable) latency, so it is important not to slow down operations that don't depend on the result of the load. A single hit under miss (HP PA 1700) is relatively simple; several outstanding misses require the use of MSHRs (Miss Status Holding Registers).
MSHRs
The outstanding misses do not necessarily come back in the order they were detected; for example, miss 1 can percolate from L1 to main memory while miss 2 is resolved at the L2 level. Each MSHR must hold information about the particular miss it will handle, such as:
- information relative to its placement in the cache
- information relative to the "missing" item (word, byte) and where to forward it (CPU register)
Implementation of MSHRs
Quite a variety of alternatives (MIPS R10000, Alpha 21164, Pentium Pro). One particular way of doing it:
- Valid (busy) bit (a limited number of MSHRs -- structural hazard)
- Address of the requested cache block
- Index in the cache where the block will go
- Comparator (to prevent using a second MSHR for a miss to the same block)
- If data is to be forwarded to the CPU at the same time as to the cache, the addresses of the destination registers (one per possible word/byte)
- Valid bits (for writes)
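The fields listed above can be sketched as a record plus an allocation check (field and function names are illustrative, not from any particular machine):

```python
from dataclasses import dataclass, field

@dataclass
class MSHR:
    valid: bool = False   # busy bit; all-valid means a structural hazard
    block_address: int = 0  # address of the requested cache block
    cache_index: int = 0    # where the block will be placed when it returns
    forward_to: dict = field(default_factory=dict)  # word offset -> register

def allocate(mshrs, block_address, cache_index):
    # The comparator role: a secondary miss to an in-flight block reuses
    # its MSHR; otherwise grab a free one; None means a structural hazard.
    for m in mshrs:
        if m.valid and m.block_address == block_address:
            return m
    for m in mshrs:
        if not m.valid:
            m.valid = True
            m.block_address, m.cache_index = block_address, cache_index
            return m
    return None
```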
Cache Hierarchy
Two, and even three, levels of caches are quite common now. L2 (or L3, i.e., board-level) caches are very large, but since L1 filters many references, their "local" hit rate might appear low (maybe 50%) (compulsory misses still happen). In general, L2 caches have longer cache blocks and larger associativity, and are write-back, write-allocate.
Characteristics of Cache Hierarchy
Multi-level inclusion (MLI) property between the off-board cache (L2 or L3) and the on-chip cache(s) (L1 and maybe L2):
- L2 contents must be a superset of L1 contents (or at least have room to store those contents if L1 is write-back)
- If L1 and L2 are both on chip, they could be mutually exclusive (and inclusion would be with L3)
MLI is very important for cache coherence in multiprocessor systems (it shields the on-chip caches from unnecessary interference). Prefetching at the L2 level is an interesting challenge (made easier if the L2 tags are kept on-chip).
"Virtual" Address Caches
(We will get back to this after we study TLBs.)
Virtually addressed, virtually tagged caches: the main problem to solve is the synonym problem (two virtual addresses corresponding to the same physical address).
Virtually addressed, physically tagged caches:
- Advantage: cache and TLB accesses can proceed concurrently
- Easy, and usually done, for small L1 caches, i.e., capacity <= page size * associativity
- Can be done for larger caches if the O.S. does a form of page coloring such that the "index" is the same for synonyms
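The "small L1" condition above can be checked directly (a sketch; 4 KB pages assumed):

```python
def vipt_without_coloring(capacity, assoc, page_size=4096):
    # Cache and TLB can be accessed in parallel with no O.S. help when
    # all index + offset bits fall within the page offset, i.e.,
    # capacity <= page size * associativity.
    return capacity <= page_size * assoc

print(vipt_without_coloring(16 * 1024, 4))  # True: a 16 KB 4-way L1 is fine
print(vipt_without_coloring(32 * 1024, 2))  # False: needs page coloring
```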
Miscellaneous Techniques
Improving write time: pipeline the write, with a buffer to delay the data write by one cycle (not needed for reads, where tag and data are accessed in parallel).
For superscalar machines: duplicate the L1 cache(s) (could be cheaper than multiple ports?).
For (highly) associative caches: keep, for each set, the MRU index so that it is checked first (cf. the MIPS R10000, which has an 8K x 1 prediction table to that effect). Etc.
Impact of Branch Prediction on Caches
If we are on a predicted path and:
- an I-cache miss occurs, what should we do: stall or fetch?
- a D-cache miss occurs, what should we do: stall or fetch?
If we fetch and we are on the right path, it's a win. If we fetch and are on the wrong path, it is not necessarily a loss: it could be a form of prefetching (if the branch was mispredicted, there is a good chance that that path will be taken later). However, the channel between the cache and the higher level of the hierarchy is occupied while something more pressing could be waiting for it.