Software Methods to Increase Data Cache Performance
Presented by Philip Marshall
Outline
- Introduction
- Example: Multiple Vector Additions
- Example: Linked List
- Example: Binary Tree
- Conclusion
Introduction
- Cache hit time is critical to system performance; it often determines a processor's clock period
- Cache controllers must therefore be as simple as possible
- The miss rate of a cache can be decreased if we know something about the access patterns
- If we use software to create better access patterns, or to hint at how the cache can best be used, we can improve performance
Introduction
Various methods can be used:
- Loop Fusion – combine multiple loops that access the same elements
- Array Merge – combine multiple arrays to increase spatial locality
- Cache Prefetch – ask for values to be loaded into the cache in advance
- Cache Bypass – prevent certain accesses from allocating in the cache
Vector Addition – Base Code

#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N];
int s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++)
    s1[i] = a[i] + b[i];
for (int i = 0; i < SIZE_N; i++)
    s2[i] = a[i] + c[i];
Vector Addition – Base Code
- Assume a perfect instruction cache
- Ignore conflict data misses
- Assume a cache line size of 4 words
- Assume write miss penalties can be hidden
- First loop: a, b: 256 misses each (every 4th access misses)
- Second loop: a, c: 256 misses each, unless the cache is large enough to hold the entire a and b arrays
- 1024 total misses
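A quick check of these counts, under the stated assumptions (4-word lines, sequential access, and a no longer resident when the second loop runs): each array suffers 1024 / 4 = 256 misses per pass, so the first loop misses on a and b (512 misses) and the second loop misses on a and c (512 misses), for 1024 in total.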
Vector Addition – Loop Fusion

#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}
Vector Addition – Loop Fusion
- a, b, c: 256 misses each
- 768 total misses
- Are there always loops that can be combined?
Vector Addition – Array Merge

#define SIZE_N 1024
/* Interleave a, b, and c so the values for one index share a cache line */
struct vectors_type {
    int a;
    int b;
    int c;
};
int s1[SIZE_N], s2[SIZE_N];
struct vectors_type vectors[SIZE_N];

for (int i = 0; i < SIZE_N; i++) {
    s1[i] = vectors[i].a + vectors[i].b;
    s2[i] = vectors[i].a + vectors[i].c;
}
Vector Addition – Array Merge
- 3072 accesses, every 4th one misses: 768 misses
- May not be a viable optimization method in all cases:
- What if we have a large set of vectors and want to be able to add any two?
- What about dynamically allocated memory?
- What if we only want to traverse one vector?
Vector Addition – Prefetch
- Speculatively load data into the cache before we need it
- Useful if we know which data we need far enough in advance
- Assume a prefetch is useful if we know the address 10 iterations in advance
- Assume prefetching past the end of an array is non-faulting
Vector Addition – Prefetch

#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
    /* Hint that the elements 10 iterations ahead should be loaded into cache */
    prefetch(&a[i+10]);
    prefetch(&b[i+10]);
    prefetch(&c[i+10]);
}
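As one concrete form of the prefetch() hint used above (a sketch, assuming a GCC- or Clang-style compiler), the loop can be written with the __builtin_prefetch builtin:

#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

void add_with_prefetch(void)
{
    for (int i = 0; i < SIZE_N; i++) {
        s1[i] = a[i] + b[i];
        s2[i] = a[i] + c[i];
        /* __builtin_prefetch takes an address and is only a hint; this
           matches the slide's assumption that a prefetch past the end of
           the array does not fault */
        __builtin_prefetch(&a[i + 10]);
        __builtin_prefetch(&b[i + 10]);
        __builtin_prefetch(&c[i + 10]);
    }
}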
Vector Addition – Prefetch
- Only 30 misses
- 3072 prefetch instructions issued
- Does the cost outweigh the benefit?
- 768 - 30 = 738 fewer misses
- The miss cost only needs to be about 4.2 cycles for prefetch to be worthwhile
- Multiple-issue processors can help hide the cost of issuing prefetches
- Improves performance even if we're only adding 2 vectors
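Where the 4.2-cycle break-even figure comes from, assuming each prefetch instruction costs roughly one cycle to issue: 3072 prefetch instructions / 738 avoided misses ≈ 4.2, so prefetching pays off once the average miss penalty exceeds about 4.2 cycles.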
Vector Addition – Prefetch
- Do we want a special load instruction that prefetches several blocks ahead?
- It would reduce the instruction count
- Works in the case of sequential access, but what if we want to prefetch from non-contiguous locations?
Vector Addition – Cache Bypass
- Assume a fully associative cache with 2 lines and a 4-word line size

for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}

- Assume write non-allocate
- Very worst case: the cache always misses (4096 misses)
- If we use LRU and write our assembly so that a is always in cache: 2048 misses for b[i] and c[i], plus 256 misses for a[i] = 2304
- If we use non-caching reads for c[i]: 1024 uncached accesses for c[i], and 256 misses each for a[i] and b[i]: 1536 total
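A minimal sketch of the bypass variant, assuming a hypothetical load_nocache() intrinsic for a non-allocating read (the slides do not name a specific instruction; on real machines this corresponds to non-temporal or uncached loads):

#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

/* Hypothetical stand-in for a cache-bypassing load: reads the value from
   memory without allocating a line in the cache */
extern int load_nocache(const int *addr);

void add_with_bypass(void)
{
    for (int i = 0; i < SIZE_N; i++) {
        s1[i] = a[i] + b[i];                   /* a and b may allocate in cache */
        s2[i] = a[i] + load_nocache(&c[i]);    /* c bypasses the cache */
    }
}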
Linked List
- Suppose we are sequentially traversing a linked list
- We can prefetch the next several items
- Recomputing the prefetch address each time could be expensive (it requires multiple memory accesses to walk ahead through the list)
- Use 2 pointers: one for the traversal and one running ahead for prefetch
Linked List – Base Code

struct linked {
    int data;
    struct linked *next;
};
struct linked *start;

struct linked *temp = start;
int a[SIZE], index = 0;   /* SIZE assumed to be defined elsewhere */

while (temp->next) {
    a[index++] = temp->data;
    temp = temp->next;
}
Linked List – Prefetch

struct linked {
    int data;
    struct linked *next;
};
struct linked *start;

struct linked *temp = start, *temp2 = start;
int a[SIZE], index = 0;

/* Advance the second pointer 10 nodes ahead of the traversal pointer */
for (int i = 0; i < 10; i++)
    temp2 = temp2->next;

while (temp->next) {
    a[index++] = temp->data;
    temp = temp->next;
    if (temp2->next) {
        temp2 = temp2->next;
        prefetch(temp2->next);   /* hint: bring a node ~10 iterations ahead into cache */
    }
}
Linked List – Prefetch
- Instead of every element potentially missing the cache, only the first 10 do
- If prefetches take longer to complete, we must prefetch farther ahead, so more cache space is necessary to hold the outstanding data
Binary Tree
- Suppose we are traversing a binary tree where we can't easily predict which branch we'll access next. Is prefetch useful?
- We can speculatively prefetch all possible next values – but how far down the tree? Cache pollution becomes a concern
- It may be valuable to speculatively fetch the next two possible elements if we can do useful work until the prefetch completes (i.e., if it takes enough cycles to determine which branch to take)
Binary Tree

struct node {
    int data;
    struct node *left, *right;
};
struct node *top;

struct node *temp = top;
int search_value, found = 0;

do {
    /* Speculatively fetch both children before we know which branch is taken */
    if (temp->left) prefetch(temp->left);
    if (temp->right) prefetch(temp->right);
    temp = next_node(temp, search_value, &found);
} while (!found);
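The slides leave next_node() undefined; a minimal sketch of one possible version for a binary search tree, assuming the value being searched for is present in the tree:

/* Hypothetical helper: take one step of a binary-search-tree lookup,
   setting *found when the value is located */
struct node *next_node(struct node *current, int search_value, int *found)
{
    if (current->data == search_value) {
        *found = 1;
        return current;
    }
    if (search_value < current->data)
        return current->left;    /* smaller values are in the left subtree */
    return current->right;       /* larger values are in the right subtree */
}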
Conclusion
- Some methods (loop fusion, array merge) improve contrived cases, but are they always useful?
- Prefetch works well for predictable access patterns – what about dynamic memory and pointers?
- Is prefetch worthwhile for large block sizes and random access of small elements?
Conclusion
- Cache miss time measured in clock cycles is increasing
- This requires prefetching farther ahead and larger caches
- Software methods are static
- Low cost of implementation
- Potentially pipeline-independent