Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi
Sharif University of Technology, University of Victoria
This Work: Improving Snoop Coherency
Goal: Improving energy efficiency in snoop-based CMPs.
Motivation: Broadcasting/processing the entire tag is inefficient.
Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.
Key Results:
- Performance: 2.9% improvement
- Tag array power: 52% reduction
- Bandwidth utilization: 78.5% reduction
Our Solution (PTC) vs. Conventional
[Diagram: D$s connected through the interconnect to the upper-level cache, shown for the conventional design and for our solution]
Conventional: Fast (+); Power & Bandwidth
Our solution: Fast (++, early miss detection); Power & Bandwidth Efficient (+)
Conventional Snooping
[Diagram: CPUs with D$s and a controller connected by the Address Bus, Snoop Bus, and Command Bus]
Redundant (miss): ~70%
Snoop Filters
Goal: Eliminate redundant snoop requests.
Examples: RegionScout (ISCA05), CGCT (ISCA05), SSP (ASPLOS08)
PTC:
(1) Early miss detection using a subset of tag bits.
(2) Once a miss is detected, the snoop is avoided.
How often is that possible?
How often is using n bits enough to detect a miss?
More than 95% of misses can be detected using 8 bits.
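The idea behind early miss detection can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function names and the list-of-tags representation of a cache set are assumptions. If none of the stored tags shares the low n bits with the requested tag, the full tags cannot match either, so the miss is certain; a partial match, by contrast, proves nothing.

```python
def low_bits(tag: int, n_bits: int) -> int:
    """Return the n_bits least significant bits of a tag."""
    return tag & ((1 << n_bits) - 1)


def partial_miss(req_tag: int, set_tags: list, n_bits: int = 8) -> bool:
    """True if the partial comparison alone proves a miss.

    When no stored tag shares the low n_bits with the request,
    no full tag can match, so the miss is guaranteed.
    """
    want = low_bits(req_tag, n_bits)
    return all(low_bits(t, n_bits) != want for t in set_tags)


def detectable_fraction(requests, n_bits: int = 8) -> float:
    """Fraction of true misses caught by the partial comparison.

    `requests` is a list of (req_tag, set_tags) pairs (hypothetical trace format).
    """
    misses = [(r, tags) for r, tags in requests if r not in tags]
    if not misses:
        return 0.0
    caught = sum(partial_miss(r, tags, n_bits) for r, tags in misses)
    return caught / len(misses)
```

A request whose low bits collide with a stored tag is still a miss, but it can only be resolved by the full comparison; such cases are what keeps the detectable fraction below 100%.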
PTC-Filter
[Flow diagram: the LSBs of the address from the D$ address bus are looked up in the PTC-Filter]
Miss: Avoid Snoop, Access Upper Level
Hit: Snoop Potential Targets
PTC-Filter (4-way D$)
[Diagram: filter organization]
Filter fields: Core 1's LSB | Core 2's LSB | Core 3's LSB | …
Each field: V | D | LSB (8 bits)
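One way to picture the filter organization is the sketch below. It is a simplified model under stated assumptions: the class and method names are hypothetical, only the valid bit and the 8-bit LSB field are modeled (the D bit is left out because the slide does not spell out its use), and the filter is indexed by remote core, set, and way.

```python
from dataclasses import dataclass

LSB_BITS = 8  # partial-tag width taken from the slide


@dataclass
class Entry:
    valid: bool = False
    lsb: int = 0  # low LSB_BITS of the remote line's tag


class PTCFilter:
    """Hypothetical per-core filter mirroring partial tags of remote D$s."""

    def __init__(self, remote_cores, num_sets, ways=4):
        self.table = {
            core: [[Entry() for _ in range(ways)] for _ in range(num_sets)]
            for core in remote_cores
        }

    def record(self, core, set_index, way, tag):
        """Note that `core` now holds `tag` in (set_index, way)."""
        self.table[core][set_index][way] = Entry(True, tag & ((1 << LSB_BITS) - 1))

    def potential_targets(self, set_index, tag):
        """Remote cores whose stored LSBs match; an empty list is a filter miss."""
        want = tag & ((1 << LSB_BITS) - 1)
        return [
            core
            for core, sets in self.table.items()
            if any(e.valid and e.lsb == want for e in sets[set_index])
        ]
```

An empty result corresponds to the filter-miss path (skip the snoop and go to the upper level); a non-empty result lists the caches that still need a full snoop.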
PTC: Filter Miss
[Diagram: CPUs with D$s and a controller on the Address Bus, Snoop Bus, and Command Bus; steps labeled 1, 2, 3]
PTC: Filter Hit
[Diagram: CPUs with D$s and a controller on the Address Bus, Snoop Bus, and Command Bus; steps labeled 2, 4]
Filter Maintenance
[Diagram: CPU with PTC-Filter, Address Bus, Snoop Controller, Command Bus, Core 0 … Core i; steps labeled 1 to 4]
D$ contents: B, F, D, E. Request = A.
Miss on A: place it in the position of tag F.
Pending Request Table (fields: Addr., C, W, D): {Address=A, C=0, W=1, D=1}
Command Bus: A, 0, 1, 1 (place A, insert in Way 1 of Core 0)
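The maintenance step can be sketched as follows. This is an interpretation rather than the authors' design: C and W are read as core and way, following the "insert in Way 1 of Core 0" label, the D field is carried through without interpretation, and the dictionary-based filter table is hypothetical. The set count follows from the 64KB, 4-way DL1 with 64 B lines given in the methodology.

```python
from typing import NamedTuple

LSB_BITS = 8      # partial-tag width from the slides
LINE_BYTES = 64   # cache line size from the methodology
NUM_SETS = 256    # 64KB / (64 B * 4 ways)


class PendingRequest(NamedTuple):
    address: int
    c: int  # core that will hold the line (assumed meaning of "C")
    w: int  # way the line will occupy (assumed meaning of "W")
    d: int  # "D" field, passed along uninterpreted


def split(address: int):
    """Split a byte address into (set_index, tag)."""
    line = address // LINE_BYTES
    return line % NUM_SETS, line // NUM_SETS


def apply_update(filter_table: dict, req: PendingRequest) -> dict:
    """Overwrite the slot for (core, set, way) with the new partial tag."""
    set_index, tag = split(req.address)
    filter_table[(req.c, set_index, req.w)] = tag & ((1 << LSB_BITS) - 1)
    return filter_table
```

In this model, replacing tag F with A in Way 1 of Core 0 simply overwrites one slot, so the stale partial tag of F disappears from the filter in the same step.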
Methodology
- SESC simulator
- 4-way CMP
- SPLASH-2 benchmarks
- CACTI
- L2: MB, 4-banked, 16-way, 10-cycle latency
- 6-cycle arbitration + 2-cycle core-to-controller latency
- Crossbar data network
- MESI protocol
- DL1/IL1: 4-way/2-way, 64KB/32KB, 3-cycle latency
- 64 B cache line
- 500-cycle memory access
Performance
Average: 2.9%
Bandwidth
Average: 78.5%
Tag Power
Average: 52%
Discussion
Why do benchmarks show different performance improvements?
- Different cache miss frequency
- Different early miss detection frequency
- Not all cache misses are on the critical path
Filter overhead:
- Timing: 1 cycle
- Power: 78.5% of a single tag array access
Summary
PTC: Using a subset of tag bits to improve bandwidth/power efficiency.
Results:
- Performance: 2.9%
- Tag Power: 52%
- Bandwidth: 78.5%
Global vs. Local Miss
[Diagram: a D$ asks the other D$s "Have B?" over the interconnect, with the upper-level cache behind it]
Global miss: every other D$ answers NO.
Local miss: some D$s answer NO while another answers YES.
Local miss detection: better power/bandwidth profile
Remote miss detection (source-based approach) vs. (destination-based filter)
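The distinction can be made concrete with a small sketch. It assumes a hypothetical mapping from each remote core to the tags in the relevant set, and the labels "global", "local", and "none" are illustrative names only. A core is ruled out when none of its tags shares the low n bits with the request.

```python
def classify(req_tag: int, remote_sets: dict, n_bits: int = 8):
    """Return (kind, targets) for a request, using partial tags only.

    kind is "global" when every remote core is ruled out, "local" when
    only some are, and "none" when no core can be ruled out.
    targets lists the cores that still need a full snoop.
    """
    mask = (1 << n_bits) - 1
    want = req_tag & mask
    targets = [
        core
        for core, tags in remote_sets.items()
        if any((t & mask) == want for t in tags)
    ]
    if not targets:
        return "global", targets
    if len(targets) < len(remote_sets):
        return "local", targets
    return "none", targets
```

Detecting local misses lets the requester narrow the snoop to the remaining targets even when the block is present somewhere, which is where the extra power and bandwidth savings would come from.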
Partial tag lookup: global miss
Partial tag lookup: local miss