Towards a Theory of Cache-Efficient Algorithms Summary for the seminar: Analysis of algorithms in hierarchical memory – Spring 2004 by Gala Golan
The RAM Model
In the previous lecture we discussed a cache in an operating system. We saw a lower bound on the number of I/Os needed for sorting, Ω((N/B) log_{M/B}(N/B)), where:
N = number of elements to sort
B = number of elements in each block
M = memory size
The I/O Model
1. A datum can be accessed only from fast memory.
2. B elements are brought to memory in each access.
3. Computation cost << I/O cost.
4. A block of data can be placed anywhere in fast memory.
5. I/O operations are explicit.
The Cache Model
1. A datum can be accessed only from fast memory. √
2. B elements are brought to memory in each access. √
3. Instead of "computation cost << I/O cost": L denotes the normalized cache latency, and accessing a block from cache costs 1.
4. Instead of "a block of data can be placed anywhere in fast memory": a fixed mapping distributes main memory blocks over the cache.
5. Instead of "I/O operations are explicit": the cache is not visible to the programmer.
Notation
I(M,B) – the I/O model
C(M,B,L) – the cache model
n = N/B, m = M/B – the size of the data and of memory in blocks (instead of elements)
The goal of algorithm design is to minimize running time = (number of cache accesses) + L × (number of memory accesses).
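The cost measure above can be written down directly. The following is only a minimal sketch: the function names are hypothetical, and rounding sizes up to whole blocks is an assumption (the slide simply writes n = N/B).

```python
def cache_model_time(cache_accesses, memory_accesses, latency):
    """Running time in C(M,B,L): a cache access costs 1, a memory access costs L."""
    return cache_accesses + latency * memory_accesses


def in_blocks(num_elements, block_size):
    """Size in blocks (n = N/B or m = M/B), rounded up to whole blocks (assumption)."""
    return -(-num_elements // block_size)


if __name__ == "__main__":
    # Example values are made up for illustration only.
    print(cache_model_time(cache_accesses=1000, memory_accesses=10, latency=50))
    print(in_blocks(1024, 16))
```

With these definitions, an algorithm is judged by both terms: reducing memory accesses matters more the larger L is.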
Reminder – Cache Associativity
Associativity specifies the number of different frames in which a memory block can reside.
[Figure: fully associative, direct-mapped, and 2-way set-associative caches]
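To make the three cases concrete, here is a small sketch that lists the frames in which a given memory block may reside. It assumes the common modulo mapping (block number mod number of sets) and contiguous frames per set; neither detail is stated on the slide, and the function name is hypothetical.

```python
def candidate_frames(block, num_frames, ways):
    """Frames in which memory block `block` may reside (modulo mapping assumed)."""
    if num_frames % ways != 0:
        raise ValueError("ways must divide num_frames")
    num_sets = num_frames // ways
    set_index = block % num_sets
    # Assumption: set i owns the contiguous frames [i*ways, (i+1)*ways).
    return list(range(set_index * ways, (set_index + 1) * ways))


if __name__ == "__main__":
    print(candidate_frames(13, num_frames=8, ways=1))  # direct mapped: one frame
    print(candidate_frames(13, num_frames=8, ways=2))  # 2-way set associative
    print(candidate_frames(13, num_frames=8, ways=8))  # fully associative: any frame
```

Direct mapped corresponds to ways = 1 and fully associative to ways = num_frames.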
Emulation Theorem
An algorithm A in I(M,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in C(M,B,L) that runs in O(I + (L+B)T) steps. The additional memory requirement is m blocks. In other words, an algorithm that is efficient in the I/O model can be made efficient in the cache model.
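Proof (1)
[Figure: cache C[] with frames 1..m, main memory Mem[] with blocks 1..n, and buffer Buf[] with blocks 1..m]
Proof (2)
[Figure: C[], Mem[] and Buf[] as above, with blocks a and b marked]
Proof (3)
[Figure: C[], Mem[] and Buf[] as above, with blocks a, b and q marked]
Proof (4)
[Figure: C[], Mem[] and Buf[] as above, with blocks a and b marked]
Proof (5)
[Figure: C[], Mem[] and Buf[] as above, with blocks a, b and q marked]
Proof (6)
[Figure: C[], Mem[] and Buf[] as above, with blocks a, b and q marked]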
Block efficient algorithms
In a block-efficient algorithm, computation is done on at least a constant fraction of the elements in the blocks transferred. In that case B·T = O(I), so an algorithm for I(M,B) can be emulated in C(M,B,L) in O(I + L·T) steps. The algorithms for sorting, FFT, and matrix transposition are block efficient.
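The simplification of the emulation bound can be spelled out in one line of algebra. In the sketch below, c is a hypothetical name for the constant implied by "a constant fraction": if at least a 1/c fraction of the B·T transferred elements is computed on, then B·T ≤ c·I.

```latex
\begin{align*}
I + (L+B)\,T &= I + L\,T + B\,T \\
             &\le I + L\,T + c\,I && \text{since } B\,T \le c\,I \\
             &= (1+c)\,I + L\,T \;=\; O(I + L\,T).
\end{align*}
```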
Extension to set-associative cache
In a k-way set-associative cache, if all k lines of a set are occupied, the hardware uses LRU to choose which line the referenced block replaces. Unlike in the emulation technique described before, we do not have explicit control of the replacement. Instead, a property of LRU will be used, and the cache will be used only partially.
Optimal Replacement Algorithm for Cache
OPT (or MIN) is a hypothetical algorithm that minimizes cache misses for a given (finite) access trace.
Offline: it knows in advance which blocks will be accessed next.
It evicts from the cache the block whose next access lies furthest in the future.
It was proven to be optimal: at least as good as any online algorithm.
Proposed by Belady.
Used to test the efficiency of online algorithms theoretically.
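The eviction rule can be illustrated with a short simulator. This is a minimal sketch, not a reference implementation: the function name and the example trace are made up, the cache is assumed fully associative with capacity of at least one block, and ties between blocks that are never used again are broken arbitrarily.

```python
def opt_misses(trace, capacity):
    """Count misses of the offline OPT (MIN) policy on a finite block trace."""
    # next_use[i] = position of the next reference to trace[i] after i, or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # block -> position of its next reference
    misses = 0
    for i, block in enumerate(trace):
        if block not in cache:
            misses += 1
            if len(cache) >= capacity:
                # Evict the block whose next reference lies furthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[block] = next_use[i]
    return misses


if __name__ == "__main__":
    print(opt_misses([1, 2, 3, 1, 2, 4, 1, 2, 3, 4], capacity=3))
```

Because the whole trace must be known in advance, this policy cannot run in real hardware; it serves only as a baseline.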
LRU vs. OPT
For any constant factor c > 1, LRU with fast memory size m makes at most c times as many misses as OPT with fast memory size (1-1/c)m. For example, LRU with cache size m causes at most 3 times as many misses as OPT with memory of size (2/3)m.
[Figure: LRU with 9 blocks, at most 3X misses; OPT with 6 blocks, X misses; 6 = (1-1/3)·9]
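For experimenting with this comparison, an LRU miss counter and a helper for the reduced OPT memory size can be sketched as below. Both function names are hypothetical, the cache is assumed fully associative, and c is assumed to be an integer so that the size can be computed without floating-point rounding.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses of LRU on a block trace with a fully associative cache."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return misses


def opt_size_for_factor(lru_size, c):
    """Memory size (1 - 1/c) * m granted to OPT, rounded down; integer c assumed."""
    return lru_size * (c - 1) // c


if __name__ == "__main__":
    print(lru_misses([1, 2, 3, 1, 2, 4, 1, 2, 3, 4], capacity=3))
    print(opt_size_for_factor(9, 3))
```

The miss count of LRU with size m can then be set against an OPT miss count obtained for size opt_size_for_factor(m, c) on the same trace.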
Extension to set-associative cache – Cont.
Similarly, LRU with cache size m causes at most twice as many misses as OPT with memory of size m/2. We emulate the I/O algorithm using only half the size of Buf[]. Instead of k cache lines for every set, there are now k/2. These k/2 lines are managed optimally, according to the optimality of the I/O algorithm. In the real cache, the k lines are managed by LRU and experience at most twice the misses.
Extension to set-associative cache – Cont.
[Figure: cache C[] with frames 1..m, main memory Mem[] with blocks 1..n, and buffer Buf[]]
Generalized Emulation Theorem
An algorithm A in I(M/2,B) using T block transfers and I processing time can be converted to an equivalent algorithm Ac in the k-way set-associative cache model C(M,B,L) that runs in O(I + (L+B)T) steps. The additional memory requirement is m/2 blocks.
The cache complexity of sorting
The lower bound for sorting in I(M,B) is T = Ω(n log_m n) = Ω((N/B) log_{M/B}(N/B)).
The lower bound for sorting in C(M,B,L) is Ω(N log N + L · (N/B) log_{M/B}(N/B)).
I = computations, T = I/O operations
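To get a feel for the magnitudes, the standard I/O-model sorting bound n log_m n (with n = N/B and m = M/B) can be evaluated numerically. The sketch below takes the hidden constant as 1, which is an assumption, and its function names and example sizes are hypothetical.

```python
import math


def sort_transfer_bound(N, M, B):
    """n * log_m(n) block transfers, with n = N/B and m = M/B (requires m > 1)."""
    n = N / B
    m = M / B
    return n * math.log2(n) / math.log2(m)


def memory_access_cost(N, M, B, L):
    """Cost of those transfers in the cache model, where each memory access costs L."""
    return L * sort_transfer_bound(N, M, B)


if __name__ == "__main__":
    # Example sizes are made up: 2^20 elements, 2^12-element cache, 16-element blocks.
    print(sort_transfer_bound(2**20, 2**12, 2**4))
    print(memory_access_cost(2**20, 2**12, 2**4, L=50))
```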
Cache Miss Classes
Compulsory miss – a block is referenced for the first time.
Capacity miss – a block was evicted from the cache because the cache is too small to hold all the blocks in use.
Conflict miss – a block was evicted from the cache because another block was mapped to the same set.
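One common way to separate the three classes on a concrete trace is to run a fully associative LRU cache of the same total size alongside the real cache: a non-first-reference miss that the fully associative cache would also suffer is counted as a capacity miss, and the rest as conflict misses. The sketch below follows that convention; the modulo set mapping, the LRU policy inside each set, and the function name are assumptions rather than anything stated on the slide.

```python
from collections import OrderedDict


def classify_misses(trace, num_frames, ways):
    """Return (compulsory, capacity, conflict) miss counts for a block trace."""
    num_sets = num_frames // ways
    sets = [OrderedDict() for _ in range(num_sets)]  # real cache, LRU per set
    full = OrderedDict()                              # fully associative reference
    seen = set()
    compulsory = capacity = conflict = 0
    for block in trace:
        # Reference cache: same total size, fully associative, LRU.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) >= num_frames:
                full.popitem(last=False)
            full[block] = True
        # Real cache: block maps to set (block mod num_sets), an assumption.
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)
        else:
            if block not in seen:
                compulsory += 1   # first reference ever
            elif not full_hit:
                capacity += 1     # would miss even with full associativity
            else:
                conflict += 1     # caused only by the restricted mapping
            if len(lines) >= ways:
                lines.popitem(last=False)
            lines[block] = True
        seen.add(block)
    return compulsory, capacity, conflict


if __name__ == "__main__":
    # Blocks 0 and 4 collide in a direct-mapped cache with 4 frames.
    print(classify_misses([0, 4, 0, 4], num_frames=4, ways=1))
```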
Average case performance of merge-sort in the cache model
We want to estimate the number of cache misses incurred by the algorithm:
Compulsory misses are unavoidable.
Capacity misses are minimized by the I/O algorithm.
Conflict misses: we can quantify their expected number.
When does a conflict miss occur?
s cache sets are available for k runs S_1 … S_k.
The expected number of elements in any run S_i is N/k.
A leading block is a cache line containing a leading element of a run; b_i is the leading block of S_i.
A conflict occurs when two leading blocks are mapped to the same cache set.
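When does a conflict miss occur – Cont.
Formally: a conflict miss occurs for element S_{i,j+1} when there is at least one element x in a leading block b_k, k ≠ i, such that S_{i,j} < x < S_{i,j+1} and S(b_i) = S(b_k), where S(b) denotes the cache set to which block b is mapped.
[Figure: runs S_i and S_k, with an element x of S_k falling between elements j and j+1 of S_i]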
How many conflict misses to expect
P_i = the probability of a conflict for element i, 1 ≤ i ≤ N.
Assume uniform distributions for:
the leading blocks among cache sets;
the leading element within the leading block.
If k is Ω(s) then P_i is Ω(1), so in each pass the expected number of conflict misses is Ω(N).
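Part of this claim can be checked numerically. The sketch below covers only the set-collision condition S(b_i) = S(b_k): the probability that at least one of the other k-1 leading blocks lands in the same cache set as b_i. It assumes the leading blocks are placed independently as well as uniformly (independence is an added assumption), and it ignores the ordering condition on x. The function names are hypothetical.

```python
import random


def set_collision_probability(k, s):
    """P(some other leading block shares b_i's set), for k runs and s cache sets."""
    return 1.0 - (1.0 - 1.0 / s) ** (k - 1)


def simulate(k, s, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        placement = [rng.randrange(s) for _ in range(k)]
        if placement[0] in placement[1:]:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k, s in [(4, 64), (64, 64), (256, 64)]:
        print(k, s, set_collision_probability(k, s), simulate(k, s))
```

For k = s the closed form approaches 1 - 1/e as s grows, a constant bounded away from zero, whereas for k much smaller than s it is roughly (k-1)/s.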
How many conflict misses to expect – Cont.
Summed over all passes of merge-sort, the expected number of conflict misses includes O(N) misses for each pass. Choosing k << s lowers the probability of conflict misses, but incurs more capacity misses.
Conclusions
There is a way to transform I/O-efficient algorithms into cache-efficient algorithms.
It applies only to a blocking, direct-mapped cache that does not distinguish between reads and writes.
The constants are important at these orders of magnitude.