An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches (ASPLOS'02)
Presented by Kim, Sun-Hee
Technology trends
◦ The rate of frequency scaling is slowing down, so performance must come from exploiting concurrency
◦ Growing global on-chip wire delay means architectures must be partitioned
NUCA (Non-Uniform Cache Architecture)
◦ Composable on-chip memories
◦ Addresses the growing wire-delay problem in future large caches
◦ An array of fine-grained memory banks connected by a switched network
UCA (Uniform Cache Architecture)
◦ The traditional monolithic cache
◦ Poor performance: internal wire delays and a restricted number of ports
ML-UCA (Multi-Level UCA)
◦ On-chip L2 and L3 levels
◦ Aggressively banked, allowing multiple parallel accesses
◦ Enforces inclusion, replicating lines across levels
S-NUCA-1 (Static NUCA)
◦ Non-uniform access times, without inclusion
◦ The mapping of data to banks is predetermined: based on the block index, each block can reside in only one bank
◦ Each bank has a private, two-way, pipelined transmission channel
S-NUCA-2
◦ Replaces the private channels with a 2-D switched network
◦ Permits a larger number of smaller, faster banks
◦ Circumvents the wire and decoder area overhead of per-bank channels
D-NUCA (Dynamic NUCA)
◦ Migrates cache lines: data may be mapped to any of several banks, so most requests are serviced by the fastest banks
◦ Fewer misses, by adapting to the working set
Experimental methodology
◦ Cacti to derive cache access times
◦ sim-alpha to simulate cache performance
UCA evaluation
Static mapping of data to banks
◦ Low-order bits of the block index determine the bank (sketched below)
◦ Four-way set-associative banks
Advantages
◦ Access time differs in proportion to the distance of the bank
◦ Accesses to different banks may proceed in parallel, reducing contention
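A minimal sketch of this static mapping in C, assuming 64-byte blocks, 32 banks, and 128 sets per bank; these parameters are illustrative, not the paper's configuration.

/* S-NUCA-1-style static mapping: every address resolves to exactly
 * one bank before the access is issued. Parameters are assumptions. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BITS 6   /* 64-byte blocks (assumed)       */
#define BANK_BITS  5   /* 32 banks (assumed)             */
#define SET_BITS   7   /* 128 sets per bank (assumed)    */

/* Low-order bits of the block index select the bank. */
static uint32_t bank_of(uint64_t addr) {
    return (addr >> BLOCK_BITS) & ((1u << BANK_BITS) - 1);
}

/* The next bits select the set within that bank. */
static uint32_t set_of(uint64_t addr) {
    return (addr >> (BLOCK_BITS + BANK_BITS)) & ((1u << SET_BITS) - 1);
}

int main(void) {
    uint64_t addr = 0x7f3a9c40;
    printf("bank %u, set %u\n", bank_of(addr), set_of(addr));
    return 0;
}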
Two private, per-bank 128-bit channels
◦ Each bank can be accessed independently at its maximum speed
◦ Small-bank advantages must be weighed against channel area overheads
Bank-conflict contention model (see the sketch below)
◦ Conservative policy: b + 2d + 3 cycles
◦ Aggressive pipelining policy: b + 3 cycles
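A toy calculation of the two policies, assuming b is the bank access time and d the one-way channel transmission delay in cycles; that reading of b and d is an assumption, since the slide does not define them.

/* Per-request occupancy under the two contention policies named
 * above. The meaning of b and d here is an assumed interpretation. */
#include <stdio.h>

/* Conservative: bank and channel stay busy for the full round trip. */
static int conservative_busy(int b, int d) { return b + 2 * d + 3; }

/* Aggressive pipelining: the channel is reused after b + 3 cycles. */
static int aggressive_busy(int b, int d)   { (void)d; return b + 3; }

int main(void) {
    int b = 3, d = 4;  /* illustrative values only */
    printf("conservative: %d cycles, aggressive: %d cycles\n",
           conservative_busy(b, d), aggressive_busy(b, d));
    return 0;
}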
S-NUCA-2 network
◦ Lightweight, wormhole-routed 2-D mesh
◦ Tags: either a centralized tag store, or broadcasting the tags to all of the banks
Spread sets
◦ Treats the multibanked cache as a set-associative structure: each bank set holds one set, and each bank within it holds one way (e.g., a 4-way bank set)
Mapping policies
◦ Simple mapping: each column of banks forms a bank set; the number of rows may not match the number of ways, and different bank sets see different latencies
◦ Fair mapping: bank sets see roughly equal latencies, at the cost of more complex paths within a set
◦ Shared mapping: bank sets share the fastest banks, giving every set fast-bank access, with potentially longer latencies and more contention
Search policies
◦ Incremental search: banks are searched in order, starting from the closest; minimizes messages and energy, at some cost in performance
◦ Multicast search: the address is multicast to all banks in a set; higher performance, at the cost of more energy and contention (both policies are compared in the sketch below)
◦ Limited multicast: the first M banks are searched in parallel, the rest incrementally
◦ Partitioned multicast: the bank set is divided into subsets, which are searched iteratively
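A toy model in C of the incremental/multicast trade-off; the per-bank latencies and the lookup() helper are invented for illustration.

/* Compare message count and latency for the two search policies
 * over one bank set. All numbers are illustrative assumptions. */
#include <stdbool.h>
#include <stdio.h>

#define WAYS 8

/* One latency per bank in the set, closest first (invented). */
static const int bank_latency[WAYS] = {4, 6, 8, 10, 12, 14, 16, 18};

/* Hypothetical tag check: pretend the line lives in bank 5. */
static bool lookup(int bank, unsigned tag) {
    return bank == 5 && tag == 0xbeef;
}

/* Incremental: probe banks one at a time, closest first. Fewest
 * messages; latency accumulates (pessimistically serialized here). */
static int incremental(unsigned tag, int *msgs) {
    int t = 0;
    for (int b = 0; b < WAYS; b++) {
        t += bank_latency[b];
        (*msgs)++;
        if (lookup(b, tag)) return t;
    }
    return -1;
}

/* Multicast: send the address to every bank at once. Fastest, but
 * one message per bank on every access. */
static int multicast(unsigned tag, int *msgs) {
    *msgs += WAYS;
    for (int b = 0; b < WAYS; b++)
        if (lookup(b, tag)) return bank_latency[b];
    return -1;
}

int main(void) {
    int mi = 0, mm = 0;
    printf("incremental: %d cycles, %d msgs\n", incremental(0xbeef, &mi), mi);
    printf("multicast:   %d cycles, %d msgs\n", multicast(0xbeef, &mm), mm);
    return 0;
}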
Challenges in a distributed cache array
◦ Many banks may need to be searched
◦ Miss-resolution time grows as associativity increases
Partial tag comparison
◦ Reduces bank lookups and miss-resolution time (sketched below)
Smart search
◦ Stores the partial tag bits in the cache controller
◦ ss-performance: keeps enough tag bits to make false hits rare
◦ ss-energy: serializes the search, starting from the closest bank
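A sketch of partial-tag filtering in the controller, with an invented 6-bit partial-tag width; an empty candidate mask means the miss can be declared early, without probing any bank.

/* Partial-tag filter: the controller keeps a few tag bits per way
 * and probes only banks whose partial tag matches. Widths invented. */
#include <stdint.h>
#include <stdio.h>

#define WAYS 8
#define PTAG_BITS 6
#define PTAG_MASK ((1u << PTAG_BITS) - 1)

static uint32_t ptag_of(uint64_t tag) { return (uint32_t)(tag & PTAG_MASK); }

/* Returns a bitmask of banks worth searching; 0 means a guaranteed
 * miss, so the request can go to memory immediately. */
static uint32_t candidate_banks(const uint64_t stored_tags[WAYS], uint64_t tag) {
    uint32_t mask = 0;
    for (int w = 0; w < WAYS; w++)
        if (ptag_of(stored_tags[w]) == ptag_of(tag))
            mask |= 1u << w;
    return mask;
}

int main(void) {
    uint64_t tags[WAYS] = {0x11, 0x2a, 0x33, 0x2a, 0x55, 0x66, 0x77, 0x08};
    printf("banks to search: 0x%02x\n", candidate_banks(tags, 0x2a));
    return 0;  /* prints 0x0a: only banks 1 and 3 are probed */
}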
Goal: maximize the hit ratio in the closest bank
◦ Keep the MRU line in the closest bank
Generational promotion (see the sketch below)
◦ Approximates an LRU mapping while reducing the copying that pure LRU would require
◦ On a hit, the line is swapped with the line in the next-closest bank
◦ Insertion uses a zero-copy or one-copy policy
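A toy version of generational promotion, swapping the hit line one bank closer per hit (matching the one-bank promotion distance in the baseline policies on the next slide); the data layout is invented.

/* Generational promotion over one bank set: a hit moves the line
 * one bank toward the controller by swapping with its neighbor. */
#include <stdio.h>

#define WAYS 8

/* bank_set[0] is the closest/fastest bank; values are line IDs. */
static int bank_set[WAYS] = {10, 11, 12, 13, 14, 15, 16, 17};

static void promote_on_hit(int hit_way) {
    if (hit_way > 0) {  /* already in the fastest bank? then no-op */
        int tmp = bank_set[hit_way - 1];
        bank_set[hit_way - 1] = bank_set[hit_way];
        bank_set[hit_way] = tmp;
    }
}

int main(void) {
    promote_on_hit(5);  /* line 15 moves one bank closer */
    for (int w = 0; w < WAYS; w++) printf("%d ", bank_set[w]);
    printf("\n");       /* 10 11 12 13 15 14 16 17 */
    return 0;
}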
Mapping
◦ Simple or shared
Search
◦ Multicast, incremental, or a combination
Promotion
◦ Promotion distance (1 bank), promotion trigger (1 hit)
Insertion
◦ Location (slowest bank) and replacement (zero-copy)
◦ Compared against pure LRU
[Results figure: UCA 67.7, ML-UCA 22.3, S-NUCA 30.4; UCA 0.41, S-NUCA 0.65]
Comparison to ML-UCA
◦ Like D-NUCA, keeps frequently used data closer
◦ D-NUCA wins when the working set exceeds 2 MB
Conclusions
◦ Low-latency access
◦ Technology scalability
◦ Performance stability
◦ Flattening of the memory hierarchy
Cache Design Comparison