Published by Lucas Salazar. Modified over 11 years ago.
1
Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi
Sharif University of Technology, University of Victoria
2
This Work: Improving Snoop Coherency
Goal: Improving energy efficiency in snoop-based CMPs.
Motivation: Broadcasting/processing the entire tag is inefficient.
Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.
Key Results: Performance (2.9%), Tag array power (52%), Bandwidth utilization (78.5%)
3
Our Solution (PTC) vs. Conventional
[Figure: D$ caches connected through an interconnect to the upper-level cache, shown for the conventional design and for our solution.]
Conventional: Fast +, Power & Bandwidth
Our solution: Fast ++ (early miss detection), Power & Bandwidth Efficient +
4
Conventional Snooping
[Figure: CPUs with D$ caches and a controller connected by the address bus, snoop bus, and command bus; steps numbered 1 to 5.]
Redundant (miss): ~70%
5
Snoop Filters
Goal: Eliminate redundant snoop requests.
Examples: RegionScout (ISCA05), CGCT (ISCA05), SSP (ASPLOS08)
PTC: (1) Early miss detection using a subset of tag bits. (2) Once a miss is detected, the snoop is avoided.
How often is that possible?
6
How often is using n bits enough to detect a miss?
95+% of misses can be detected using 8 bits.
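As a rough sanity check on why a few bits can go a long way, the sketch below computes how often a partial comparison would rule out a block if tag bits were uniform and independent. That uniformity, the function name `miss_detection_rate`, and the choice of 12 candidate entries (three remote 4-way caches) are assumptions made for illustration; the figure on the slide comes from benchmark measurements, not from this model.

```python
def miss_detection_rate(n_bits: int, entries_checked: int) -> float:
    """Probability that none of `entries_checked` unrelated entries shares
    the requested block's low `n_bits` tag bits.

    Assumes uniformly distributed, independent tag bits (an idealization).
    """
    no_match_one_entry = 1.0 - 2.0 ** -n_bits
    return no_match_one_entry ** entries_checked


if __name__ == "__main__":
    # Hypothetical setup: 3 remote 4-way caches -> 12 entries in the indexed set.
    for n in (2, 4, 8):
        print(n, round(miss_detection_rate(n, 12), 3))
```

Under this idealized model the detection rate rises quickly with the number of compared bits; real workloads have correlated tags, so measured rates can differ.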
7
PTC-Filter
[Figure: the LSBs from the address bus are looked up in the PTC-Filter alongside the D$. Miss: avoid snoop, access upper level. Hit: snoop potential targets.]
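The hit/miss decision above can be sketched as a comparison on the low tag bits only. This is a minimal illustration, not the authors' hardware: the helper names `low_bits`, `early_miss`, and `decide` are hypothetical, and a set is modeled as a plain list of full tags.

```python
from typing import Iterable


def low_bits(tag: int, n_bits: int = 8) -> int:
    """Keep only the n_bits least significant bits of a tag."""
    return tag & ((1 << n_bits) - 1)


def early_miss(requested_tag: int, set_tags: Iterable[int], n_bits: int = 8) -> bool:
    """True when no stored tag shares the requested low bits.

    A partial mismatch implies a full mismatch, so this never reports a
    miss for a block that is actually present (no false negatives).
    """
    wanted = low_bits(requested_tag, n_bits)
    return all(low_bits(t, n_bits) != wanted for t in set_tags)


def decide(requested_tag: int, set_tags: Iterable[int], n_bits: int = 8) -> str:
    """Map the filter outcome to the action named on the slide."""
    if early_miss(requested_tag, set_tags, n_bits):
        return "avoid snoop, access upper level"
    return "snoop potential targets"


if __name__ == "__main__":
    stored = [0x1A2B, 0x0F10, 0x7744, 0x00FF]  # made-up tags for one set
    print(decide(0x9C33, stored))  # low bits 0x33 match nothing
    print(decide(0x5510, stored))  # low bits 0x10 match 0x0F10: partial hit only
```

A partial hit can still turn out to be a full-tag miss, which is why a filter hit leads to a snoop rather than to a guaranteed cache hit.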
8
PTC-Filter for a 4-way D$
[Figure: filter entries 0, 1, 2, 3, … hold Core 1's LSB, Core 2's LSB, and Core 3's LSB; each entry has V, D, and LSB (8 bits) fields.]
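One way to picture the structure in the figure is a per-core table that mirrors the low tag bits held by every other core's D$. The sketch below is an assumption-laden model: the class and field names are hypothetical, only the V bit and the 8-bit LSB field are modeled, and the D bit is left out because its role is not spelled out on the slide.

```python
from dataclasses import dataclass

N_BITS = 8  # width of the LSB field shown on the slide


@dataclass
class Entry:
    valid: bool = False  # the V bit
    lsb: int = 0         # low N_BITS of the mirrored tag


class PTCFilter:
    """Mirrored partial tags of the remote cores' D$, as seen by one core."""

    def __init__(self, remote_cores, num_sets, ways=4):
        self.entries = {
            core: [[Entry() for _ in range(ways)] for _ in range(num_sets)]
            for core in remote_cores
        }

    def potential_targets(self, set_index, tag):
        """Remote cores whose mirrored set holds a matching partial tag."""
        wanted = tag & ((1 << N_BITS) - 1)
        return [
            core
            for core, sets in self.entries.items()
            if any(e.valid and e.lsb == wanted for e in sets[set_index])
        ]


if __name__ == "__main__":
    ptc = PTCFilter(remote_cores=[1, 2, 3], num_sets=4)
    ptc.entries[2][0][1] = Entry(valid=True, lsb=0x33)
    print(ptc.potential_targets(0, 0x9C33))  # only core 2 can hold the block
    print(ptc.potential_targets(0, 0x9C34))  # empty list: filter miss
```

An empty result corresponds to a filter miss, so no snoop is needed; a non-empty result narrows the snoop to the listed cores.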
9
PTC: Filter Miss
[Figure: CPUs with D$ caches and a controller on the address bus, snoop bus, and command bus; steps numbered 1 to 3.]
10
PTC: Filter Hit
[Figure: CPUs with D$ caches and a controller on the address bus, snoop bus, and command bus; steps numbered 1 to 6.]
11
Filter Maintenance
[Figure: a CPU requests address A and misses; A is to be placed in the position of tag F. The snoop controller keeps a Pending Request Table with columns Addr., C, W, D for Core 0 … Core i. The entry {Address=A, C=0, W=1, D=1} means: place A, insert in Way 1 of core 0. Steps 1 to 6 run over the address bus and command bus and update the PTC-Filter.]
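The update in this example can be sketched as overwriting one mirrored entry when the pending request completes. Everything here is illustrative: the dictionary layout, the names `split_address` and `apply_fill`, and the reading of C as core and W as way are assumptions, and the D field is ignored. The bit widths follow from the methodology figures (64 B lines, 64 KB 4-way DL1, hence 256 sets).

```python
N_BITS = 8       # partial tag width
OFFSET_BITS = 6  # 64 B cache line
INDEX_BITS = 8   # 64 KB / 4 ways / 64 B = 256 sets


def split_address(addr: int):
    """Return (set index, low N_BITS of the tag) for an address."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return index, tag & ((1 << N_BITS) - 1)


def apply_fill(filter_lsbs: dict, pending: dict) -> None:
    """Replace the mirrored partial tag at (core, set, way) with the new block's."""
    index, lsb = split_address(pending["address"])
    filter_lsbs[(pending["core"], index, pending["way"])] = lsb


if __name__ == "__main__":
    addr_f = (0xF1 << 14) | (5 << 6)  # made-up block F in set 5
    addr_a = (0xAB << 14) | (5 << 6)  # made-up block A mapping to the same set
    mirror = {(0, 5, 1): split_address(addr_f)[1]}
    apply_fill(mirror, {"address": addr_a, "core": 0, "way": 1})
    print(hex(mirror[(0, 5, 1)]))  # partial tag of A now occupies F's slot
```

Keeping the mirror in step with every fill is what lets a later partial lookup trust a mismatch as a real miss.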
12
Methodology
SESC simulator, 4-way CMP, SPLASH-2 benchmarks, CACTI 6.0
L2: 4 MB, 4-banked, 16-way, 10-cycle latency
6-cycle arbitration, 2-cycle core-to-controller latency
Crossbar data network, MESI protocol
DL1/IL1: 4-way/2-way, 64 KB/32 KB, 3-cycle latency
64 B cache line, 500-cycle memory access
13
Performance
Average: 2.9%
14
Bandwidth
Average: 78.5%
15
Tag Power
Average: 52%
16
Discussion
Why do benchmarks show different performance improvements?
Different cache miss frequency
Different early miss detection frequency
Not all cache misses are on the critical path
Filter overhead:
Timing: 1 cycle
Power: 78.5% of a single tag array access
17
Summary
PTC: Using a subset of tag bits to improve bandwidth/power efficiency.
Results: Performance: 2.9%, Tag Power: 52%, Bandwidth: 78.5%
19
Global vs. Local Miss
[Figure: two panels, Global Miss and Local Miss. In each, D$ caches on the interconnect below the upper-level cache answer "Have B?" with NO or YES.]
Local miss detection: better power/bandwidth profile.
Remote miss detection (source-based approach) vs. (destination-based filter)
20
Partial tag lookup: global miss
21
Partial tag lookup: local miss