1
The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
Jian Li and José F. Martínez – Computer Systems Laboratory
Michael C. Huang – Electrical & Computer Engineering
2
Motivation
- Multiprocessor architectures sprouting everywhere: large compute servers; small servers and desktops; chip multiprocessors
- High energy consumption is a problem – more so in multiprocessors
- Most power-aware techniques are tailored to uniprocessors
- Multiprocessors present unique challenges: processor coordination, synchronization
3
Case: Barrier Synchronization
- Fast threads spin-wait for slower ones (a minimal spin barrier is sketched below)
- Spin-wait is wasteful by definition: it reacts quickly, but only the last spin iteration is useful
[Diagram: per-thread timelines of compute followed by spin-wait at the barrier]
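For reference, a minimal C11 sketch (illustrative, not taken from the paper) of the kind of sense-reversing spin barrier the slide describes; the spin loop at the end is exactly the energy waste the thrifty barrier targets.

```c
/* Minimal C11 sketch of a sense-reversing spin barrier (illustrative, not
 * the paper's implementation).  The while loop at the bottom is the
 * spin-wait whose energy the thrifty barrier tries to recover. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;    /* threads still expected at this episode */
    atomic_bool sense;    /* global sense flag, flipped on release  */
    int         nthreads; /* total number of participating threads  */
} spin_barrier_t;

void spin_barrier_wait(spin_barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;               /* sense for this episode */
    if (atomic_fetch_sub(&b->count, 1) == 1) {  /* last thread to arrive  */
        atomic_store(&b->count, b->nthreads);   /* reset for next episode */
        atomic_store(&b->sense, *local_sense);  /* release the others     */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                   /* spin-wait: wasted energy */
    }
}
```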
4
Proposal: Thrifty Barrier
- Reduce spin-wait energy waste in barriers: leverage existing processor sleep states (e.g., ACPI)
- Minimize impact on execution time: achieve timely wake-up
[Diagram: conventional vs. thrifty barrier timelines]
5
Challenges
- Should the thread sleep? Transition times (sleep + wake-up) are non-negligible
- Which sleep state? More energy savings → longer transition times
- When to wake up?
  - Early w.r.t. barrier release → may hurt energy savings
  - Late w.r.t. barrier release → may hurt performance
- Must predict barrier stall time accurately
6
Findings
- Many barrier stall times are large enough to leverage sleep states
- Stall times are predictable: discriminate through PC indexing; predict indirectly using barrier interval times
- Timely wake-up comes from a combination of two mechanisms: a coherence message bounds wake-up latency; a watchdog timer anticipates wake-up
7
Thrifty Barrier Mechanism
[Flow diagram: barrier arrival → stall time prediction → sleep? → (no: residual spin) / (yes: sleep state S1/S2/S3 → wake-up signal → residual spin) → barrier departure]
8
Sleep Mechanism
[Same flow diagram, highlighting the stall-time prediction and sleep-state selection steps]
9
Predicting Stall Time
- Splash-2's FMM example: 3 important barriers, 4 iterations; randomly picked thread (always the same)
- PC indexing reduces variability
- Interval time (BIT) is a more stable metric than stall time (BST)
10
Stall Time vs. Interval Time
- Barriers separate computation phases; PC indexing reduces variability
- Barrier stall time (BST) varies considerably even with PC indexing
  - Barrier-dependent, but also thread-dependent: computation shifts among threads across invocations
- Barrier interval time (BIT) varies much less
  - Quite stable if PC indexing is used
  - Barrier-dependent, but not thread-dependent
  - Last-value prediction is adequate for most applications (predictor sketched below)
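A minimal sketch of what a PC-indexed, last-value BIT predictor could look like; the table size and hash function are assumptions of this sketch, not the paper's predictor organization.

```c
/* Sketch of a PC-indexed last-value predictor for barrier interval time
 * (BIT).  Table size and hash function are assumptions, not the paper's. */
#include <stdint.h>

#define BIT_PRED_ENTRIES 64                 /* assumed table size */

typedef struct {
    uintptr_t barrier_pc;                   /* PC of the barrier call site */
    uint64_t  last_bit;                     /* last observed BIT (ticks)   */
    int       valid;
} bit_pred_entry_t;

static bit_pred_entry_t bit_pred[BIT_PRED_ENTRIES];

bit_pred_entry_t *bit_pred_lookup(uintptr_t barrier_pc)
{
    return &bit_pred[(barrier_pc >> 2) % BIT_PRED_ENTRIES];  /* simple hash */
}

/* Last-value prediction: the next interval is assumed to equal the last
 * one observed for the same barrier PC (0 means "no prediction yet"). */
uint64_t predict_bit(uintptr_t barrier_pc)
{
    bit_pred_entry_t *e = bit_pred_lookup(barrier_pc);
    return (e->valid && e->barrier_pc == barrier_pc) ? e->last_bit : 0;
}
```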
11
Predicting Stall Time Indirectly
- Can use BIT to predict BST indirectly (sketch below)
  - Compute time is measurable upon arrival at the barrier
  - Subtract it from the predicted BIT to derive the predicted BST
- How to manage the timing information?
[Diagram: BIT = compute time + BST along a thread's timeline]
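A small sketch of the subtraction described above; predict_bit() stands in for the BIT predictor and is an assumption of this sketch, and all times are in local-clock ticks.

```c
/* Sketch of indirect stall-time prediction: pBST = pBIT - Compute.
 * predict_bit() is the (assumed) PC-indexed BIT predictor; all times are
 * in local-clock ticks. */
#include <stdint.h>

extern uint64_t predict_bit(uintptr_t barrier_pc);

uint64_t predict_stall_time(uintptr_t barrier_pc,
                            uint64_t  arrival_ts,       /* now, local clock       */
                            uint64_t  prev_release_ts)  /* BRTS from barrier b-1  */
{
    uint64_t compute  = arrival_ts - prev_release_ts;   /* measured on arrival    */
    uint64_t pred_bit = predict_bit(barrier_pc);        /* predicted interval     */
    return (pred_bit > compute) ? pred_bit - compute : 0;   /* predicted stall    */
}
```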
12
Managing Time Info
- Threads depart from barrier instance b-1 toward instance b
- Each thread t has a local record of its release timestamp BRTS(t, b-1)
- Assumptions:
  - No global clock
  - Local wallclock stays active even if the CPU sleeps
  - All CPUs run at the same nominal clock frequency
[Diagram: timeline from barrier b-1 to barrier b, annotated with BRTS(t, b-1)]
13
Managing Time Info
- Thread t arrives, knowing BRTS(t, b-1) and Compute(t, b):
  - Make prediction pBIT(b)
  - Derive pBST(t, b) = pBIT(b) – Compute(t, b)
  - Use pBST(t, b) to pick a sleep state (if warranted): best fit based on transition time (selection sketched below)
[Diagram: timeline from b-1 to b showing Compute(t, b), pBST(t, b), and pBIT(b)]
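A sketch of the "best fit based on transition time" step: pick the deepest sleep state whose round-trip transition still fits within the predicted stall. The state names follow the S1/S2/S3 labels in the flow diagram; the latency numbers are placeholders, not measured values.

```c
/* Sketch of sleep-state selection: deepest state whose sleep + wake-up
 * transition time still fits within the predicted stall.  The latency
 * numbers are placeholders, not measured ACPI or Pentium figures. */
#include <stdint.h>
#include <stddef.h>

typedef enum { STATE_NONE, STATE_S1, STATE_S2, STATE_S3 } sleep_state_t;

static const struct {
    sleep_state_t state;
    uint64_t      transition_ticks;   /* assumed sleep + wake-up cost */
} sleep_table[] = {
    { STATE_S3, 200000 },             /* deepest, slowest to enter/exit */
    { STATE_S2,  20000 },
    { STATE_S1,   2000 },             /* shallowest, fastest            */
};

sleep_state_t pick_sleep_state(uint64_t predicted_stall)
{
    for (size_t i = 0; i < sizeof sleep_table / sizeof sleep_table[0]; i++)
        if (predicted_stall > sleep_table[i].transition_ticks)
            return sleep_table[i].state;   /* deepest state that pays off */
    return STATE_NONE;                     /* stall too short: just spin  */
}
```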
14
Managing Time Info
- Last thread u arrives, knowing BRTS(u, b-1):
  - Derive actual BIT(b) = current time – BRTS(u, b-1)
  - Update the (shared) predictor with BIT(b) (sketch below)
  - Release the barrier
[Diagram: timeline from b-1 to b showing the measured BIT(b) relative to BRTS(u, b-1)]
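A sketch of the last arriver's bookkeeping, reusing the predictor table sketched earlier; release_barrier() is a placeholder for whatever releases the underlying barrier.

```c
/* Sketch of the last-arriving thread's bookkeeping.  bit_pred_entry_t and
 * bit_pred_lookup() refer to the predictor sketched earlier;
 * release_barrier() is a placeholder for the underlying barrier release. */
#include <stdint.h>

extern void release_barrier(void);

void on_last_arrival(uintptr_t barrier_pc,
                     uint64_t  arrival_ts,          /* now, local clock */
                     uint64_t  my_prev_release_ts)  /* BRTS(u, b-1)     */
{
    uint64_t actual_bit = arrival_ts - my_prev_release_ts;  /* BIT(b) */

    bit_pred_entry_t *e = bit_pred_lookup(barrier_pc);      /* shared table */
    e->barrier_pc = barrier_pc;
    e->last_bit   = actual_bit;          /* last-value update */
    e->valid      = 1;

    release_barrier();                   /* wakes spinning and sleeping threads */
}
```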
15
Managing Time Info
- Every thread t (possibly after waking up late):
  - Reads BIT(b) from the updated predictor
  - Computes the actual BRTS(t, b) = BRTS(t, b-1) + BIT(b) (sketch below)
- Threads never use timestamps (BRTS) from other threads, so no global clock is needed
[Diagram: timeline showing BRTS(t, b) derived from BRTS(t, b-1) and BIT(b)]
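The timestamp chain itself is a one-line update; because each thread only ever adds the shared BIT(b) to its own previous timestamp, no timestamps cross threads.

```c
/* Sketch of the per-thread release-timestamp update after barrier b.
 * Each thread combines only its own previous timestamp with the shared
 * interval time, so no global clock is needed; a thread that wakes up
 * late still reconstructs the same logical release time. */
#include <stdint.h>

uint64_t update_release_timestamp(uint64_t my_prev_release_ts, /* BRTS(t, b-1) */
                                  uint64_t shared_actual_bit)  /* BIT(b)       */
{
    return my_prev_release_ts + shared_actual_bit;             /* BRTS(t, b)   */
}
```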
16
Thrifty Barrier Mechanism
[Flow diagram repeated: barrier arrival → stall time prediction → sleep? → sleep state S1/S2/S3 → wake-up signal → residual spin → barrier departure]
17
Wake-up Mechanism
[Same flow diagram, highlighting the wake-up signal and residual spin steps]
18
Wake-up Mechanism
- Communicate barrier completion to sleeping CPUs: a signal sent to a CPU pin
- Options: external vs. internal wake-up (software analogy sketched below)
- External (passive): initiated by the processor that releases the barrier
  - Leverages the coherence protocol: invalidation to the spin-lock line
  - Must supply the spin-lock address to the cache controller
- Internal (active): triggered by a watchdog timer
  - Programmed with the predicted BST before going to sleep
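The paper's wake-up paths are hardware-level (a coherence invalidation to the spin-lock line, and a watchdog timer driving a CPU pin). As a software analogy only, a Linux futex wait with a timeout captures the same hybrid shape: the releasing thread's wake plays the external role, and the timeout, programmed with the predicted stall time, plays the internal-watchdog role.

```c
/* Software analogy only (NOT the paper's hardware mechanism): a Linux futex
 * wait with a relative timeout.  The releaser's FUTEX_WAKE stands in for the
 * external, coherence-triggered wake-up; the timeout, set from the predicted
 * stall time, stands in for the internal watchdog timer. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <limits.h>
#include <time.h>

/* Sleep until woken by the releaser or until the predicted stall elapses. */
static long futex_wait_predicted(int *flag, int expected, uint64_t stall_ns)
{
    struct timespec ts = { .tv_sec  = (time_t)(stall_ns / 1000000000ull),
                           .tv_nsec = (long)(stall_ns % 1000000000ull) };
    /* Returns on FUTEX_WAKE (external), on timeout (internal), or at once
     * if *flag already changed; a short residual spin would follow. */
    return syscall(SYS_futex, flag, FUTEX_WAIT, expected, &ts, NULL, 0);
}

/* Called by the thread that releases the barrier. */
static long futex_wake_sleepers(int *flag)
{
    return syscall(SYS_futex, flag, FUTEX_WAKE, INT_MAX, NULL, 0);
}
```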
19
Early vs. Late Wake-up
- Early wake-up (underprediction): energy waste due to residual spin
- Late wake-up (overprediction): possible impact on execution time
- External wake-up guarantees late wake-up (but bounded)
- Internal wake-up can lead to both (late wake-up not bounded)
- Our approach: hybrid wake-up
  - External provides an upper bound
  - Internal strives for timely wake-up using the prediction
20
Other Considerations (see paper)
- Sleep states that do not snoop for coherence requests: flush dirty data before sleeping; defer invalidations to clean data
- Overprediction threshold (sketch below): for frequent, swinging BITs of modest size, turn off prediction if overprediction exceeds a threshold
- Interaction with context switching and I/O: underprediction threshold
- Time-sharing issues: multiprogramming, overthreading
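A hedged sketch of the overprediction guard mentioned above; the threshold value and the exact disable policy are assumptions of this sketch, not the paper's.

```c
/* Hedged sketch of an overprediction guard (threshold value and disable
 * policy are assumptions): if the predicted interval overshoots the
 * observed one by more than a threshold, stop sleeping at this barrier. */
#include <stdint.h>

#define OVERPRED_THRESHOLD_TICKS 50000ull   /* assumed threshold */

typedef struct {
    int sleep_disabled;                     /* per-barrier policy flag */
} barrier_policy_t;

void check_overprediction(barrier_policy_t *p,
                          uint64_t predicted_bit,
                          uint64_t actual_bit)
{
    if (predicted_bit > actual_bit &&
        predicted_bit - actual_bit > OVERPRED_THRESHOLD_TICKS)
        p->sleep_disabled = 1;              /* too aggressive: fall back to spin  */
    else
        p->sleep_disabled = 0;              /* prediction acceptable: allow sleep */
}
```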
21
Experimental Setup
- Simulated system: 64-node CC-NUMA
  - 6-way dynamic superscalar cores
  - L1: 16KB, 64B lines, 2-way, 2 clk; L2: 64KB, 64B lines, 8-way, 12 clk
  - 16B/4clk memory bus, 60ns SDRAM
  - Hypercube, wormhole routing, 4-clk pipelined routers; 16 clk pin to pin
- Energy modeling: Wattch (CPU + L1 + L2)
  - Sleep states along the lines of the Pentium family
22
Experimental Setup
- All Splash-2 applications except:
  - Raytrace: no barriers
  - LU: a better version without barriers is widely available
- Efficiency (64p): 40–82%, avg. 58%
- Target group: ≥ 10%
23
Energy Savings
24
Performance Impact
25
Related Work Highlights
- Quite a bit of work in the uniprocessor domain
- Elnozahy et al.: server farms, clusters
  - The thrifty barrier targets shared memory and parallel applications
- Moshovos et al.; Saldanha and Lipasti: energy-aware cache coherence
  - Probably compatible with, and complementary to, the thrifty barrier
26
Conclusions
- Energy-aware multiprocessor mechanisms can and should be pursued
- Case study: energy-aware barrier synchronization
  - Simple indirect prediction of barrier stall time
  - Hybrid wake-up scheme to minimize impact on execution time
- Encouraging results on the target applications:
  - 17% avg. energy savings
  - 2% avg. performance impact
27
The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
Jian Li and José F. Martínez – Computer Systems Laboratory
Michael C. Huang – Electrical & Computer Engineering