© Karen Miller
What do we want from our computers?
correct results: we assume this feature, but consider... who defines what is correct?
fast: fast at what? (easy answer: fast at my programs)
[figure: price vs. performance; slow machines cost ¢, fast machines cost $$$]
Architectural features: ways of increasing speed generally fall into 2 categories:
1. parallelism
2. memory hierarchies
parallelism
Suppose we have 3 tasks: t1, t2, and t3, and they are independent.
serial: a serial implementation on 1 computer runs t1, then t2, then t3, one after another.
parallel: a parallel implementation (given that we have 3 computers) runs t1, t2, and t3 at the same time.
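A minimal sketch of the serial vs. parallel idea in C with POSIX threads (the task bodies are invented placeholders; with 3 processors the three threads can truly run at once):

    #include <pthread.h>
    #include <stdio.h>

    /* three independent tasks; the bodies are placeholders */
    void *t1(void *arg) { puts("t1 done"); return NULL; }
    void *t2(void *arg) { puts("t2 done"); return NULL; }
    void *t3(void *arg) { puts("t3 done"); return NULL; }

    int main(void) {
        /* serial: 1 computer runs the tasks one after another */
        t1(NULL); t2(NULL); t3(NULL);

        /* parallel: 3 threads, which 3 processors can run at the same time */
        pthread_t a, b, c;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_create(&c, NULL, t3, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        pthread_join(c, NULL);
        return 0;
    }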
memory woes
The processor (P) and main memory (M) are physically separate, which makes memory accesses SLOW!
Co-locate P and M? Very expensive! Or the memory ends up too small!
The HW design technique used to make some memory accesses complete faster is the implementation of a hierarchical memory (also known as caching).
Recall the fetch and execute cycle:
fetch instruction   (requires a memory access)
PC update
decode
get operands        (a memory access, for a load)
do operation
store result        (a memory access, for a store)
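A toy C sketch of that cycle (the memory, registers, and instruction format are all invented; not a real ISA), just to mark where the memory accesses happen:

    #include <stdint.h>

    uint32_t mem[1024];   /* main memory: fetches, loads, and stores all touch it */
    uint32_t regs[32];
    uint32_t pc = 0;

    void cycle(void) {
        uint32_t instr = mem[pc];             /* fetch instruction: a memory access */
        pc = pc + 1;                          /* PC update */
        uint32_t op   = instr >> 26;          /* decode */
        uint32_t rd   = (instr >> 21) & 31;
        uint32_t addr = instr & 0xffff;
        switch (op) {
        case 0: regs[rd] = mem[addr]; break;  /* get operands: a load's extra memory access */
        case 1: mem[addr] = regs[rd]; break;  /* store result: a store's extra memory access */
        default: break;                       /* do operation: registers only, no memory access */
        }
    }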
Now look at the memory access patterns of lots of programs. In general, memory access patterns are not random. They exhibit locality:
1. temporal
2. spatial
temporal locality
Recently referenced memory locations are likely to be referenced again (soon!)
loop:  instr   (at address A1)
       instr   (at address A2)
       instr   (at address A3)
       b loop  (at address A4)
Instruction stream references: A1 A2 A3 A4 A1 A2 A3 A4 A1 A2 A3...
Note that the same memory locations are repeatedly read (for the fetch).
spatial locality
Memory locations near recently referenced locations are likely to also be referenced.
Example: an array in memory. Code must do something to each element of the array, so it must load each element...
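A small C loop showing both kinds of locality at once (array name and size invented):

    #define N 1024
    int a[N];

    long sum(void) {
        long s = 0;
        /* temporal: the loop's few instructions (and s and i) are
           referenced again on every one of the N iterations */
        for (int i = 0; i < N; i++)
            s += a[i];   /* spatial: a[i+1] sits right next to a[i] in memory */
        return s;
    }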
The fetch of the code itself exhibits a high degree of spatial locality: I1 I2 I3 I4 I5... I2 is next to I1 in memory. If these instructions are not branches, then we fetch I1, I2, I3, etc. in order.
cache
A cache is designed to hold copies of a subset of memory locations.
- smaller (in terms of bytes) than main memory
- faster than main memory
- co-located: processor and cache are on the same chip
[image: Intel 386 chip (1985)]
[image: Pentium II (1997)]
P sends a memory request to the cache (C).
hit: the requested location's copy is in C.
miss: the requested location's copy is NOT in C, so send the memory access on to M.
Needed terminology:
miss ratio = (# of misses) / (total # of accesses)
hit ratio  = (# of hits) / (total # of accesses), or 1 - miss ratio
You already assumed that total # of accesses = # of misses + # of hits.
So, when designing a cache, keep the bytes likely to be referenced (again), and their neighbors, in the cache... So, what is in the cache is different for each different program. On average, for a given program:
Average Memory Access Time (AMAT) = Tc + (miss ratio)(Tm)
For example: Tc = 1 nsec, Tm = 20 nsec, and a specific program has 98% hits...
AMAT = 1 + (.02)(20) = 1.4 nsec
Each individual memory access takes 1 nsec (a hit) or 21 nsec (a miss).
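The same arithmetic as a small C function (a sketch; the 1 nsec / 20 nsec / 98% numbers are the slide's example):

    #include <stdio.h>

    /* AMAT = Tc + (miss ratio)(Tm) */
    double amat(double tc, double tm, double miss_ratio) {
        return tc + miss_ratio * tm;
    }

    int main(void) {
        printf("%.1f nsec\n", amat(1.0, 20.0, 0.02));   /* prints 1.4 nsec */
        return 0;
    }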
Divide all of memory up into fixed-size blocks... On a miss, copy the entire block into the cache. Make the block size greater than 1 word.
An unrealistic cache, with 4 block frames:
frame 00, frame 01, frame 10, frame 11 (each frame holds one block)
Each main memory block maps to a specific block frame... [diagram: main memory blocks mapping into the cache] 2 bits of the address define this mapping.
Take advantage of spatial locality by making the block size greater than 1 word. On a miss, copy the entire block into the cache, and then keep it there as long as possible. (Why?)
How the cache uses the address to do a lookup:
| ? | index # | byte/word within block |
The index # selects which block frame.
The "which block frame" field is known as the index # or (sometimes) the line #. But many main memory blocks map to the same cache block frame... and only one may be in the frame at a time! We must distinguish which one is in the frame right now.
tag: the most significant bits of the block's address, used to distinguish which main memory block is in the cache block frame. The tag is kept in the cache together with its data block.
How the address is utilized by the cache (so far):
address: | tag | index # | byte w/i block |
The cache keeps tags together with their data blocks.
Still missing... we must distinguish block frames that have nothing in them from ones that hold a block from main memory (consider power-up for a computer system: nothing is in the cache). We need 1 bit per block frame, most often called a valid bit (sometimes called a present bit).
cache access (or cache lookup):
The index # is used to find the correct block frame.
Is the block frame valid?
  NO: MISS
  YES: compare the address's tag to the block frame's tag:
    match: HIT
    no match: MISS
Completed diagram of the cache:
address: | tag | index # | byte w/i block |
Each frame in the cache: valid bit, tag, data block.
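A C sketch of that lookup for a hypothetical direct-mapped cache with 4 block frames and 4-byte blocks (all sizes invented to keep the bit arithmetic visible; real caches are much larger):

    #include <stdbool.h>
    #include <stdint.h>

    #define NFRAMES   4    /* 4 block frames -> 2 index bits  */
    #define BLOCKSIZE 4    /* 4 bytes/block  -> 2 offset bits */

    struct frame {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCKSIZE];
    };
    struct frame cache[NFRAMES];

    bool lookup(uint32_t addr) {
        uint32_t index = (addr / BLOCKSIZE) % NFRAMES;   /* which block frame     */
        uint32_t tag   = addr / (BLOCKSIZE * NFRAMES);   /* most significant bits */
        /* valid? then compare tags: match -> HIT, otherwise MISS */
        return cache[index].valid && cache[index].tag == tag;
    }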
This cache is called direct mapped, or 1-way set associative, or set associative with a set size of 1. Each index # maps to exactly 1 block frame.
Compare (same total number of frames, so the same amount of data):
direct mapped: 3 bits for index # (8 frames, 1 per set)
2-way set associative: 2 bits for index # (4 sets of 2 frames)
How about 4-way set associative, or 8-way set associative? For a fixed number of block frames:
- larger set size tends to lead to higher hit ratios
- larger set size means that the amount of HW (circuitry) goes up, and Tc increases
Implementing writes:
1. write through: change the data in the cache, and also send the write to main memory. Slow, but very little circuitry.
2. write back: at first, change the data only in the cache; write to memory only when necessary. A dirty bit is set on a write, to identify blocks that must be written back to memory. When a program completes, all dirty blocks must be written to memory...
write back (continued)
faster: multiple stores to the same location result in only 1 main memory access
more circuitry: must maintain the dirty bit
dirty miss: a miss caused by a read or write of a block not in the cache, when the required block frame has its dirty bit set. So there is a write of the dirty block, followed by a read of the requested block.
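A C sketch of a write-back store, including the dirty-miss case (the two memory-transfer helpers are hypothetical stubs, and BLOCKSIZE is the same invented constant as in the lookup sketch):

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCKSIZE 4

    struct wb_frame {
        bool     valid, dirty;
        uint32_t tag;
        uint8_t  data[BLOCKSIZE];
    };

    /* stand-ins for the cache-to-memory block transfers */
    static void write_block_to_memory(struct wb_frame *f) { (void)f; }
    static void read_block_from_memory(struct wb_frame *f, uint32_t tag) {
        f->valid = true; f->tag = tag; f->dirty = false;
    }

    void store(struct wb_frame *f, uint32_t tag, uint32_t off, uint8_t byte) {
        if (!f->valid || f->tag != tag) {      /* miss */
            if (f->valid && f->dirty)
                write_block_to_memory(f);      /* dirty miss: write the old block back first */
            read_block_from_memory(f, tag);    /* then read the requested block */
        }
        f->data[off] = byte;   /* change the data in the cache only... */
        f->dirty = true;       /* ...and mark the block to be written back later */
    }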
How about 2 separate caches?
I-cache: for instructions only; can be rather small and still have excellent performance.
D-cache: for data only; needs to be fairly large.
We can send memory accesses to the 2 caches independently... (increased parallelism)
[diagram: P sends fetches to the I-cache and loads/stores to the D-cache; both caches connect to M]
P - C - M: this single cache is called an L1 cache (level 1). This hierarchy works so well that most systems have 2 levels of cache: P - L1 - L2 - M.
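With two levels the AMAT formula nests: an L1 miss pays the L2 access time, and an L2 miss additionally pays the main memory time. A hedged extension of the earlier example (all of these numbers are invented):

    /* AMAT with two cache levels:
       AMAT = T_L1 + (L1 miss ratio)(T_L2 + (L2 miss ratio)(Tm)) */
    double amat2(double t_l1, double t_l2, double t_m,
                 double miss_l1, double miss_l2) {
        return t_l1 + miss_l1 * (t_l2 + miss_l2 * t_m);
    }
    /* e.g. amat2(1, 5, 20, 0.02, 0.25) = 1 + .02 * (5 + .25 * 20) = 1.2 nsec */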