Published by Donald Bridges. Modified over 9 years ago.
1
Time-based Snoop Filtering in Chip Multiprocessors
Iman Faraji, Amirkabir University of Technology, Tehran, Iran
Amirali Baniasadi, University of Victoria, Victoria, Canada
2
This work: Reducing redundant snoops in chip multiprocessors
Our Goal: Improving the energy efficiency of write-through (WT) based CMPs.
Our Motivation: There are long time intervals where snooping fails, wasting energy and bandwidth.
Our Solution: Detect such intervals and avoid snoops.
Key Results: Memory Energy: 18%; Snoop Traffic: 93%; Performance: 3.8%
3
Conventional Snooping
[Figure: CPUs with private data caches (D$) connected through an interconnect to a controller, with snoop steps numbered 1 to 6.]
Redundant (miss): ~70%
4
WB vs. WT
Write-through configuration: high memory traffic; simple coherency mechanism
Write-back configuration: low memory traffic; sophisticated coherency mechanism
[Figure: Relative memory energy consumption]
5
Previous Work: Snoop Filters
Good snoop filter:
1. Fast & simple
2. Accurate and effective
Eliminate redundant snoop (local & global) requests.
Local: one core fails to provide data.
Global: all cores fail.
Examples:
RegionScout: Detects Memory Regions Not Shared (Moshovos)
Selective Snoop Request: Predicts Supplier (Atoofian & Baniasadi)
Serial Snooping: Requests Nodes One by One (Saldanha & Lipasti)
6
Our Work
Time-based Snoop Filtering
Motivation: There are long intervals where snooping fails consecutively.
But how long, and how often?
7
Our Work (Cont.)
8
Global Read Miss (GRM): Occurs when a snoop fails in all processors.
Local Read Miss (LRM): A redundant snoop, occurring when the snoop fails in a single processor.
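The two definitions can be illustrated with a small Python sketch. It assumes each snoop outcome is available as one boolean per processor (True when that processor can supply the data); the function name and this representation are hypothetical and not taken from the slides.

```python
from typing import List, Sequence, Tuple


def classify_read_snoop(hits: Sequence[bool]) -> Tuple[List[int], bool]:
    """Classify one snoop request.

    hits[i] is True when processor i can supply the requested data
    (an assumed representation, for illustration only).
    Returns the processors that saw a local read miss (LRM) and
    whether the request is a global read miss (GRM).
    """
    # A processor that fails to supply the data sees a local read miss.
    local_misses = [node for node, hit in enumerate(hits) if not hit]
    # The request is a global read miss only if every processor failed.
    global_miss = len(hits) > 0 and len(local_misses) == len(hits)
    return local_misses, global_miss
```

For example, `classify_read_snoop([False, True, False])` reports local misses in processors 0 and 2 but no global miss, because processor 1 supplies the data.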
9
Distribution
[Figure: (a) LRM distribution for different processors; (b) GRM distribution]
Periods of data scarcity are usually long.
10
Time-based Global Miss predictor (TGM)
TGM Types:
1. TGM-First: The first processor that has failed snooping survives.
2. TGM-Last: The last processor that has failed snooping survives.
TGM Goals:
1. Detect GRM intervals.
2. Shut down snooping in all processors but one (the surviving node).
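One possible reading of the two TGM variants is sketched below in Python. It assumes a GRM interval has already been detected and that the order in which the processors failed snooping is known; the function names, the `failure_order` list, and the boolean mask are hypothetical, since the slides do not say how the hardware tracks this.

```python
from typing import List, Sequence


def pick_survivor(failure_order: Sequence[int], policy: str) -> int:
    """Return the node that keeps snooping during a GRM interval.

    failure_order lists node ids in the order they failed snooping
    (an assumed input). policy is "first" (TGM-First) or "last" (TGM-Last).
    """
    if not failure_order:
        raise ValueError("failure_order must not be empty")
    if policy == "first":
        return failure_order[0]
    if policy == "last":
        return failure_order[-1]
    raise ValueError("policy must be 'first' or 'last'")


def snoop_enabled_mask(num_nodes: int, failure_order: Sequence[int],
                       policy: str) -> List[bool]:
    """Snooping stays enabled only in the surviving node."""
    survivor = pick_survivor(failure_order, policy)
    return [node == survivor for node in range(num_nodes)]
```

With four nodes failing in the order 2, 0, 3, 1, the "first" policy leaves only node 2 snooping and the "last" policy leaves only node 1.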
11
TGM implementation
[Figure: TGM-enhanced CMP]
12
TGM
[Figure: (a) Coverage; (b) Accuracy]
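The slides do not define coverage and accuracy. A common reading for filters of this kind, assumed here, is that coverage is the share of all redundant snoops that the filter removes, and accuracy is the share of filtered snoops that really were redundant. The Python sketch below computes both under that assumption; the function and parameter names are hypothetical.

```python
from typing import Tuple


def coverage_and_accuracy(filtered_redundant: int,
                          total_redundant: int,
                          filtered_total: int) -> Tuple[float, float]:
    """Compute (coverage, accuracy) under the assumed definitions.

    filtered_redundant: filtered snoops that were truly redundant
    total_redundant:    all redundant snoops, filtered or not
    filtered_total:     all snoops the filter suppressed
    """
    # Guard against division by zero when nothing was redundant or filtered.
    coverage = filtered_redundant / total_redundant if total_redundant else 0.0
    accuracy = filtered_redundant / filtered_total if filtered_total else 0.0
    return coverage, accuracy
```

Under this reading, a filter that suppresses 75 snoops, 60 of them truly redundant, out of 80 redundant snoops overall has a coverage of 0.75 and an accuracy of 0.8.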
13
Time-based Local Miss predictor (TLM)
Goal: Detect LRMs
How?
1. Count consecutive snoop misses in a node.
2. Disable snooping when the count exceeds a threshold.
3. Restart snooping after a number of cycles.
14
TLM implementation
[Figure: TGM-enhanced CMP. Each processor contains a Processing Unit (PU), a First Level Cache, and a Predictor with a Redundant SNoop (RSN) counter and a ReStarT (RST) counter.]
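The TLM steps (count consecutive misses, disable past a threshold, restart after some cycles) can be sketched per node in Python as follows. The counters are named after the RSN and RST counters of the implementation slide; the class name, the method names, and the threshold and restart values are hypothetical, since the slides give no concrete settings.

```python
class TimeBasedLocalMissPredictor:
    """Per-node sketch of TLM; not the authors' hardware design."""

    def __init__(self, miss_threshold: int, restart_cycles: int) -> None:
        self.miss_threshold = miss_threshold  # assumed tunable parameter
        self.restart_cycles = restart_cycles  # assumed tunable parameter
        self.rsn = 0  # consecutive redundant snoops seen so far
        self.rst = 0  # cycles left until snooping restarts

    @property
    def snooping_enabled(self) -> bool:
        return self.rst == 0

    def tick(self) -> None:
        """Advance one cycle; counts down while snooping is disabled."""
        if self.rst > 0:
            self.rst -= 1

    def observe_snoop(self, hit: bool) -> None:
        """Record the outcome of a snoop that reached this node."""
        if not self.snooping_enabled:
            return  # the snoop was filtered, so there is nothing to record
        if hit:
            self.rsn = 0  # a useful snoop breaks the run of misses
            return
        self.rsn += 1
        if self.rsn > self.miss_threshold:
            # Run of misses exceeded the threshold: disable snooping.
            self.rsn = 0
            self.rst = self.restart_cycles
```

With `miss_threshold=2` and `restart_cycles=3`, the third consecutive miss disables snooping in the node, and three calls to `tick()` enable it again.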
15
TLM features
[Figure: (a) Coverage; (b) Accuracy]
16
Methodology
Our Simulator: SESC
Benchmarks: Splash-2
To evaluate energy: Cacti 6.5
System used: Quad-Core CMP

SPLASH-2 Benchmarks and Input Parameters
Barnes: 16K particles
Cholesky: tk29.O
FFT: 1024k complex data points
Ocean: 258x258 ocean
Volrend: Head
Water-Nsqrd: 512 molecules
Water-spatial: 512 molecules

System Parameters
Processor
Frequency: 5 GHz
Technology: 68 nm
Branch Predictor: 16K entry bimodal and gshare
Fetch/Issue/Commit: 4/4/5
Branch Penalty: 17 cycles
RAS: 32 entries
BTB: 2k entries, 2 way
Interconnection Network
Data Interconnect: crossbar
Interconnect Width: 64 B
Memory
IL1: 64KB/2 way
DL1: 64KB/4way/Write Through
Access Time: 1 cycle
Block Size: 64
Cache line size: 32
L2: 512KB/8way/Write Through
Access Time: 11 cycles
Block Size: 64
Memory: 1GB
Access Time: 70 cycles
Page Size: 4 Kbit
17
Relative Snoop Traffic Reduction
TGM-F: 58%
TGM-L: 57%
TLM: 77%
18
Relative Memory Energy
TGM-F: 8%
TGM-L: 8.5%
TLM: 11%
19
Relative Memory Delay
TGM-F: 1.1%
TGM-L: 2.1%
TLM: 1.7%
20
Relative Performance
TGM-F: No change
TGM-L: 0.4%
TLM: 0.3%
21
Summary
We showed:
Long data scarcity periods (DSPs) exist during workload runtime.
During DSPs, redundant snoops happen frequently and consecutively.
Our solutions:
TGM: Uses snoop behavior on all processors to detect and filter redundant snoops. Shuts down snooping on as many processors as possible.
TLM: Filters redundant snoops in a single node. Counts recent redundant snoops to detect data scarcity periods and filter upcoming redundant snoops.
Simulation Results:
Snoop Reduction: TGM-F: 58%, TGM-L: 57%, TLM: 77%
Memory Energy: TGM-F: 8%, TGM-L: 8.5%, TLM: 11%
Memory Delay: TGM-F: 1.1%, TGM-L: 2.1%, TLM: 1.7%
Performance: TGM-F: no change, TGM-L: 0.4%, TLM: 0.3%
22
Thanks for your attention
23
Backup Slides
24
Discussion
How do the characteristics of the benchmarks affect the memory energy/delay reduction achieved by our solution?
1. True detection of redundant snoops
2. Share of redundant snoops
25
Memory Energy.Delay
Memory Energy = energy consumed to provide the requested data
Memory Delay = time required to provide the requested data
26
Volrend Benchmark
While running, Volrend rarely sends snoop requests.
This application renders a three-dimensional volume.
It renders several frames from changing viewpoints; consecutive frames in rotation sequences often vary only slightly in viewpoint.
High Temporal Locality
Volrend does load distribution very well.
High Spatial Locality