ASPLOS’02, presented by Kim, Sun-Hee. Technology trends: the rate of frequency scaling is slowing down, so performance must come from exploiting concurrency.

1 ASPLOS’02 Presented by Kim, Sun-Hee

2
• Technology trends
  ◦ The rate of frequency scaling is slowing down
    ▪ Performance must come from exploiting concurrency
  ◦ The global on-chip wire delay problem is growing
    ▪ Architectures must be partitioned
• NUCA (Non-Uniform Cache Architecture)
  ◦ Composable on-chip memories
  ◦ Addresses the increasing wire delay problem in future large caches
  ◦ An array of fine-grained memory banks connected by a switched network

3
• UCA (Uniform Cache Architecture)
  ◦ Traditional cache
  ◦ Poor performance
    ▪ Internal wire delays
    ▪ Restricted number of ports

4
• ML-UCA (Multi-Level UCA)
  ◦ L2 and L3
  ◦ Aggressively banked
    ▪ Multiple parallel accesses
• Inclusion is enforced, replicating lines across levels

5
• S-NUCA-1 (Static NUCA)
  ◦ Non-uniform access without inclusion
  ◦ Mapping is predetermined
    ▪ Based on the block index
    ▪ Each block can reside in only one bank of the cache
  ◦ Each bank has a private, two-way, pipelined transmission channel

6
• S-NUCA-2
  ◦ 2D switched network
    ▪ Permits a larger number of smaller, faster banks
    ▪ Circumvents the wire and decoder area overhead

7
• D-NUCA (Dynamic NUCA)
  ◦ Migrating cache lines
    ▪ Data may be mapped to many banks
    ▪ Most requests are serviced by the fastest banks
  ◦ Fewer misses
    ▪ By adapting to the working set

8
• Experimental Methodology
  ◦ Cacti is used to derive the cache access times
  ◦ sim-alpha is used to simulate cache performance
• UCA Evaluation

9
• Mappings of data to banks are static
  ◦ The low-order bits of the index determine the bank
  ◦ Four-way set associative
• Advantages
  ◦ Access time differs per bank, in proportion to the bank's distance
  ◦ Accesses to different banks may proceed in parallel
    ▪ Reducing contention
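
The static mapping on this slide can be sketched in a few lines. This is only an illustration, not the paper's implementation: the block size, the bank count, and the function name `snuca_bank` are assumptions chosen for the example.

```python
def snuca_bank(address, block_bytes=64, num_banks=32):
    """Return the single bank that statically holds `address`.

    Assumed parameters (not from the slides): 64-byte blocks, 32 banks.
    Both must be powers of two so the mapping is a pure bit selection.
    """
    if block_bytes < 1 or num_banks < 1:
        raise ValueError("block_bytes and num_banks must be positive")
    if block_bytes & (block_bytes - 1) or num_banks & (num_banks - 1):
        raise ValueError("block_bytes and num_banks must be powers of two")
    offset_bits = block_bytes.bit_length() - 1
    index = address >> offset_bits          # drop the block-offset bits
    return index & (num_banks - 1)          # low-order index bits pick the bank


# Consecutive blocks land in consecutive banks and wrap around.
print([snuca_bank(a) for a in (0, 64, 128, 64 * 32)])
```

Because the bank is a fixed function of the address, a request never has to search: it goes to exactly one bank, and requests to different banks can proceed in parallel.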

10
• Two private, per-bank 128-bit channels
  ◦ Each bank can be accessed independently at its maximum speed
  ◦ Small-bank advantages vs. area overheads
• Bank conflict contention model
  ◦ Conservative policy: b+2d+3 cycles
  ◦ Aggressive pipelining policy: b+3 cycles
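
The two contention expressions can be compared numerically. The sketch below reads `b` as the bank access time and `d` as the one-way delay on the bank's channel, both in cycles. The slide does not define these symbols, so that reading is an assumption, as is the function name.

```python
def bank_busy_cycles(b, d, aggressive=False):
    """Cycles before a bank's channel can accept the next request.

    b: bank access time in cycles (assumed meaning)
    d: one-way channel transmission delay in cycles (assumed meaning)
    """
    if aggressive:
        return b + 3          # aggressive pipelining policy from the slide
    return b + 2 * d + 3      # conservative policy from the slide


# Example values are illustrative only, not measurements from the paper.
for b, d in [(3, 2), (3, 10)]:
    print(b, d, bank_busy_cycles(b, d), bank_busy_cycles(b, d, aggressive=True))
```

Under this reading, the conservative figure grows with the distance to the bank, while the pipelined figure depends only on the bank itself.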

11
• Lightweight, wormhole-routed 2-D mesh
• Centralized tag store, or broadcasting the tags to all of the banks

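12
• Spread sets
  ◦ The multibanked cache is treated as a set-associative structure
  ◦ Each set is spread across a bank set
[Figure labels: bank set; bank set, 4-way. Notes: the number of rows may not match the number of ways; different latencies; equal latencies; complex path within a set; potentially longer latencies; more contention; fastest bank access]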

13
• Incremental search
  ◦ Starts from the closest bank
  ◦ Minimizes messages: low energy, but lower performance
• Multicast search
  ◦ Multicasts the address to the banks in a set
  ◦ Higher performance, at the cost of more energy and contention
• Limited multicast
  ◦ Searches the first M banks in parallel, then the rest incrementally
• Partitioned multicast
  ◦ The bank set is split into subsets, which are searched iteratively
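
The four search policies differ only in which banks are probed together. A minimal sketch follows, assuming the banks of one bank set are listed closest first and that the members of a partitioned-multicast subset are probed in parallel. The function name, the policy strings, and the default values of `m` and `subset_size` are illustrative, not taken from the slides.

```python
def search_schedule(banks, policy, m=2, subset_size=2):
    """Return the probe order as a list of rounds.

    Each round is a list of banks probed in parallel; rounds run one
    after another until the line is found. `banks` is ordered closest
    first, and `m` and `subset_size` are assumed to be at least 1.
    """
    banks = list(banks)
    if policy == "incremental":
        return [[b] for b in banks]                    # one bank per round
    if policy == "multicast":
        return [banks]                                 # every bank at once
    if policy == "limited_multicast":
        return [banks[:m]] + [[b] for b in banks[m:]]  # first m, then one by one
    if policy == "partitioned_multicast":
        return [banks[i:i + subset_size]
                for i in range(0, len(banks), subset_size)]
    raise ValueError(f"unknown policy: {policy}")


for policy in ("incremental", "multicast",
               "limited_multicast", "partitioned_multicast"):
    print(policy, search_schedule([0, 1, 2, 3], policy))
```

Fewer rounds means lower worst-case latency, while fewer banks per round means fewer messages. That is the energy versus performance trade-off the slide describes.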

14
• Challenges in a distributed cache array
  ◦ Many banks may need to be searched
  ◦ Miss resolution time grows as the number of ways increases
• Partial tag comparison
  ◦ Reduces bank lookups and miss resolution time
• Smart search
  ◦ Stores the partial tag bits in the cache controller
  ◦ ss-performance: enough tag bits to reduce false hits
  ◦ ss-energy: serialized search from the closest bank
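
Partial tag comparison can be illustrated with a small filter. The sketch assumes the controller keeps the low-order bits of each bank's tag. The 6-bit width, the helper names, and the example tags are hypothetical.

```python
def partial_tag(tag, bits=6):
    """Keep only the low-order `bits` of a tag (the width is an assumption)."""
    return tag & ((1 << bits) - 1)


def candidate_banks(controller_partial_tags, tag, bits=6):
    """Banks whose stored partial tag matches the request.

    `controller_partial_tags[i]` is the partial tag held for bank i
    (closest first), or None if that bank's way is empty. An empty
    result means a definite miss, so no bank needs to be searched.
    """
    want = partial_tag(tag, bits)
    return [bank for bank, stored in enumerate(controller_partial_tags)
            if stored == want]


# Hypothetical full tags held by three banks of one bank set.
full_tags = [0x1A3, 0x2A3, 0x0F0]
controller = [partial_tag(t) for t in full_tags]

print(candidate_banks(controller, 0x1A3))  # bank 1 is a false hit on 6 bits
print(candidate_banks(controller, 0x7FF))  # no match: early miss
```

With more tag bits stored, false hits such as bank 1 above become rarer, at the cost of more storage in the controller.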

15
• Maximize the hit ratio in the closest bank
  ◦ The MRU line is in the closest bank
  ◦ Generational promotion
    ▪ Approximates an LRU mapping
    ▪ Reduces the amount of copying that pure LRU requires
    ▪ On a hit, the line is swapped with the line in the next-closest bank
    ▪ Zero-copy policy, one-copy policy
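
Generational promotion can be sketched as a single swap per hit. The model below is a simplification that assumes one line per bank within a bank set, with index 0 as the closest bank. The function name and the list representation are illustrative.

```python
def access_with_promotion(bank_set, tag):
    """Look up `tag`; on a hit, swap it one bank closer to the controller.

    bank_set: list of tags, index 0 = closest (fastest) bank.
    Returns the bank index where the hit occurred, or None on a miss.
    """
    if tag not in bank_set:
        return None
    hit = bank_set.index(tag)
    if hit > 0:
        # Swap with the line in the next-closest bank (promotion by one bank).
        bank_set[hit - 1], bank_set[hit] = bank_set[hit], bank_set[hit - 1]
    return hit


lines = ["A", "B", "C", "D"]
access_with_promotion(lines, "C")
access_with_promotion(lines, "C")
print(lines)  # "C" has moved two banks closer after two hits
```

Each hit moves exactly two lines, whereas a strict LRU ordering would shift every line between the hit position and the front.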

16
• Mapping
  ◦ Simple or shared
• Search
  ◦ Multicast, incremental, or a combination
• Promotion
  ◦ Promotion distance (1 bank), promotion trigger (1 hit)
• Insertion
  ◦ Location (slowest bank) and replacement (zero copy)
• Compare to pure LRU
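
The insertion choice listed here (slowest bank, zero copy) can be sketched as follows. The sketch reads "zero copy" as meaning that the displaced line leaves the cache instead of being copied to another bank. That reading, the one-line-per-bank model, and the function name are assumptions.

```python
def insert_on_miss(bank_set, tag):
    """Insert a newly fetched line into the slowest bank of its bank set.

    bank_set: list of tags, index 0 = closest bank, last = slowest bank;
    None marks an empty way. Under the assumed zero-copy reading, the
    line previously in the slowest bank is evicted from the cache.
    Returns the evicted tag, or None if the way was empty.
    """
    victim = bank_set[-1]
    bank_set[-1] = tag
    return victim


lines = ["A", "B", "C", None]
print(insert_on_miss(lines, "D"), lines)  # fills the empty way
print(insert_on_miss(lines, "E"), lines)  # evicts "D" with no extra copying
```

New lines start far from the controller and only move closer if they are hit again, so a burst of single-use lines does not displace the lines in the fastest banks.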

17
[Figure labels]
UCA: 67.7, ML-UCA: 22.3, S-NUCA: 30.4
UCA: 0.41, S-NUCA: 0.65

18
• Comparison to ML-UCA
  ◦ Similar to D-NUCA in that frequently used data is kept closer
  ◦ Working set > 2MB

19
• Low latency access
• Technology scalability
• Performance stability
• Flattening the memory hierarchy

21
• Cache Design Comparison

