1
Simple Criticality Predictors for Dynamic Performance and Power Management in CMPs
Group Talk: Dec 10, 2008
Abhishek Bhattacharjee
2
Why Thread Criticality Prediction (TCP)?
Significant load imbalance in typical parallel programs: threads of a parallel program don't finish at the same time.
Worse with heterogeneity (process variation, thermal emergencies, etc.).
Relative thread speed, or criticality, is difficult to predict.
If we can deduce that thread 2 is critical, and by how much, what can we do with this?
3
Performance Improvements from TCP
Use TCP to improve the performance of dynamic parallelism managers (e.g., TBB).
Steal tasks from critical threads.
4
Energy Efficiency from TCP
TCP-guided DVFS for barriers: slow down non-critical threads without affecting runtime.
5
Our Goals
Develop low-overhead hardware TCP schemes: harness counters and metrics already available on-chip, aim for high accuracy across architectures, and keep TCP general for a variety of applications.
Improve TBB performance with TCP-guided task stealing.
Improve barrier energy efficiency with TCP-guided DVFS.
6
Related Work
Thrifty Barrier [Li, Martinez, Huang, HPCA '05]: transition fast, non-critical threads into low-power sleep modes to save energy at barriers; predict barrier stall time based purely on history; still wastes energy in the compute phase.
Meeting Points [Cai et al., PACT '08]: DVFS threads to save energy without performance penalty; only applicable to parallel loops; insert meeting points to track loop iterations executed per core; broadcast iteration counts to all cores; use software to calculate appropriate DVFS settings.
This work:
1. Unlike thrifty barrier, predict before the barrier is reached.
2. Unlike meeting points, avoid specialized instructions and broadcasts.
3. Unlike meeting points, broader applicability than parallel loops.
4. Unlike both approaches, use software-independent criticality calculation.
5. Target versatility: TBB performance, barrier energy waste, SMT priority schemes, memory priority schemes, etc.
7
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
8
Methodology: Simulators
TCP accuracy is harder to achieve with in-order pipelines.
Energy savings are assessed accurately, with the OS running, on an FPGA-based emulator.
Leakage is assumed to be a fixed 50% of baseline energy.
9
Methodology: Benchmarks
Use larger, realistic data sets for Splash-2 [Bienia et al., PACT '08].
10
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
11
Per-Core Fully-Local History
If behavior is sufficiently repetitive, fully-local history can be used (thrifty barrier).
Difficult to achieve on in-order pipelines.
Solution: target comparative information across threads.
Rationale: the criticality of a single thread is determined relative to the others.
12
Comparative Metric: Instruction Count
Normalize compute times and the metric against the critical thread.
13
Comparative Metric: Instruction Count
Average error measured over barrier iterations and 10% execution snapshots of non-barrier applications (Swaptions, Fluidanimate).
Instruction count alone gives poor accuracy across all tested benchmarks.
14
Comparative Metric: Cache Statistics
With cache-based statistics, Ocean and LU still suffer from over 25% error.
Next, include L1 I-cache misses…
15
Comparative Metric: Cache Statistics
Instruction count and control flow effects affect LU, Water-Nsq, and Water-Sp.
But Ocean still has over 22% error.
Next, include L2 cache misses…
16
Comparative Metric: Cache Statistics
Memory-intensive Ocean, Volrend, Radix, and the PARSEC benchmarks are particularly improved.
Now check the weighted cache miss metric on an out-of-order machine.
17
Comparative Metric: Cache Statistics
Memory-intensive Ocean, Volrend, Radix, and the PARSEC benchmarks are least impacted on the out-of-order machine.
The weighted cache miss metric remains the most accurate.
18
Comparative Metric: Control Flow and TLB Misses
Control flow (branch misprediction) penalties and instruction count effects are already tracked by L1 I-cache misses.
TLB misses show little variation among threads (they usually access closely spaced data).
It is trivial to include a weighted TLB component if necessary.
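As an illustration only, here is a minimal software sketch of the weighted cache-miss criticality metric described above. The function names, core count, and the L2 weight value are assumptions for the sketch (the weight is meant to reflect the larger penalty of an L2 miss relative to an L1 miss), not the actual hardware design.

```cpp
#include <array>
#include <cstdint>

// Hypothetical per-core criticality counters, updated on cache misses.
// L2_WEIGHT approximates the ratio of L2-miss penalty to L1-miss penalty;
// the value 10 is an illustrative assumption, not taken from the talk.
constexpr int NUM_CORES = 16;
constexpr uint64_t L2_WEIGHT = 10;

enum class MissType { L1_INSTR, L1_DATA, L2 };

std::array<uint64_t, NUM_CORES> criticality{};  // zero-initialized

// Conceptually invoked by the shared L2 controller, which observes all misses.
void on_cache_miss(int core, MissType type) {
    switch (type) {
        case MissType::L1_INSTR:  // also captures control-flow and instruction-count effects
        case MissType::L1_DATA:
            criticality[core] += 1;
            break;
        case MissType::L2:        // far more costly, so weighted more heavily
            criticality[core] += L2_WEIGHT;
            break;
    }
}

// The core with the largest counter is predicted to be the most critical thread.
int most_critical_core() {
    int best = 0;
    for (int c = 1; c < NUM_CORES; ++c)
        if (criticality[c] > criticality[best]) best = c;
    return best;
}
```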
19
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
20
Basic TCP Hardware
Criticality counters are placed with the shared, unified L2 cache.
Simple, scalable hardware.
Broadcasts are eliminated because the L2 controller sees all cache misses.
Can accommodate more cache levels, a split L2, or distributed LLCs with trivial additional hardware and messages.
21
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
22
The TBB Task Scheduler
Concurrency is expressed in tasks; the TBB dynamic scheduler stores and distributes them.
The scheduler controls worker threads through per-thread software queues.
Threads first try to extract tasks from their local queue.
If the local queue is empty, a thread steals tasks from a remote queue.
If the steal is unsuccessful, the thread backs off for a pre-determined time before retrying.
Problem: the steal victim is chosen randomly, which gives poor performance at higher core counts and under high load imbalance.
23
Our TCP-Guided TBB Stealing Algorithm

if (cache miss from core P) {
    update criticality counter for core P based on cache miss type
}
if (steal request from a core) {
    scan all criticality counters to find the maximum value
    report the core with the highest criticality counter value as the steal victim
}
if (message indicates steal from victim core P was unsuccessful) {
    reset criticality counter for core P
}
if ((number of cycles % interval bound) == 0) {
    reset all criticality counters
}
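For concreteness, a minimal software sketch of the victim-selection part of this logic follows. The function names are hypothetical and the counter array is the one from the earlier metric sketch; in the actual design this logic sits in hardware next to the L2 controller, and the 100K-cycle interval and counter width come from the next slide.

```cpp
#include <array>
#include <cstdint>

// Per-core criticality counters, updated on cache misses as in the earlier sketch.
constexpr int NUM_CORES = 64;
constexpr uint64_t INTERVAL_BOUND = 100000;  // 100K cycles, per the next slide

std::array<uint64_t, NUM_CORES> criticality{};

// A steal request is answered with the core holding the largest counter,
// i.e. the thread currently predicted to be most critical.
int choose_steal_victim() {
    int victim = 0;
    for (int c = 1; c < NUM_CORES; ++c)
        if (criticality[c] > criticality[victim]) victim = c;
    return victim;
}

// An unsuccessful steal resets that victim's counter so the same
// (apparently task-less) core is not chosen again immediately.
void on_steal_failed(int victim) {
    criticality[victim] = 0;
}

// Counters are cleared every interval so they reflect only recent behavior.
void on_cycle(uint64_t cycle) {
    if (cycle % INTERVAL_BOUND == 0) criticality.fill(0);
}
```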
24
Hardware Details
Interval bound = 100K cycles.
Simple and scalable hardware: even at 64 cores with 14 bits per criticality counter, only 114 bytes of storage.
Minimal message overhead.
A TCP access takes the same latency as an L2 cache miss.
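As a quick sanity check on the storage figure (assuming the per-core counters dominate the cost, which is an assumption here):

64 cores × 14 bits per criticality counter = 896 bits = 112 bytes

which is consistent with the quoted 114 bytes once roughly two extra bytes of control state are included.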
25
Steal Rate Improvements with TCP
TCP-guided task stealing keeps false negatives under 7%.
At higher core counts, steal success rates are now greater.
26
Performance Improvements with TCP
Up to 32% performance gains over random stealing.
Regularly outperforms occupancy-based stealing.
Streamcluster is highly load imbalanced; for it, TCP and occupancy-based benefits are similar.
27
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
28
TCP-Driven Energy Efficiency in Barrier-Based Parallel Programs
Goals recap: multiple threads reach barriers at different times; use TCP to predict thread speeds and DVFS fast threads down to lower frequencies.
Assume frequency settings f0, 0.85f0, 0.70f0, 0.55f0.
Aim: energy efficiency with no performance penalty.
But: TCP mispredictions can arise from spurious program behavior, and DVFS transition overhead must be minimized.
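As a rough illustration (not a figure from the talk), and assuming supply voltage scales approximately linearly with frequency:

P_dyn ∝ f · V² ≈ f³, so P(0.70 f0) / P(f0) ≈ 0.70³ ≈ 0.34

i.e. a non-critical thread run at 0.70 f0 during its slack dissipates roughly a third of its full-frequency dynamic power while still reaching the barrier on time. If the platform scales frequency only, the savings are smaller (roughly linear in f).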
29
TCP-Driven DVFS Hardware
On a cache miss from core P, update criticality counter P.
Is P running at f0 and is its counter above the threshold T?
30
TCP-Driven DVFS Hardware
For all cores, find the closest SST match to the criticality counter.
Does the SST suggest a frequency switch?
31
TCP-Driven DVFS Hardware
If the SST suggests a frequency switch, increment the suggested SCT counter and decrement the others.
If the new DVFS setting now holds the maximum counter, perform the actual DVFS switch.
32
TCP-Driven DVFS Hardware
If the SST does not suggest a frequency switch, increment the SCT counter for the current setting and decrement the others.
33
TCP-Driven DVFS Hardware
Finally, reset the criticality counters.
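Pulling the previous slides together, here is a minimal software sketch of one evaluation of this DVFS logic. The SST breakpoints, helper names, and trigger placement are assumptions for illustration (the slides do not give the SST contents); the 1024 threshold and 2-bit SCT counters come from later slides, and the real mechanism is a small hardware state machine rather than software.

```cpp
#include <array>
#include <cstdint>

constexpr int NUM_CORES = 16;
constexpr int NUM_FREQS = 4;        // f0, 0.85 f0, 0.70 f0, 0.55 f0
constexpr uint64_t T = 1024;        // criticality-counter threshold (value from a later slide)
constexpr int SCT_MAX = 3;          // 2-bit saturating confidence counters (later slide)

std::array<uint64_t, NUM_CORES> criticality{};
std::array<int, NUM_CORES> current_freq{};                 // index 0 = f0 (fastest)
std::array<std::array<int, NUM_FREQS>, NUM_CORES> sct{};   // Suggestion Confidence Table

// Hypothetical SST lookup: maps a core's criticality counter to a suggested
// frequency level. Few misses -> thread is ahead -> suggest a slower setting.
// The breakpoints are placeholders, not values from the talk.
int sst_lookup(uint64_t counter) {
    if (counter < 256) return 3;    // 0.55 f0
    if (counter < 512) return 2;    // 0.70 f0
    if (counter < 768) return 1;    // 0.85 f0
    return 0;                       // f0
}

// Triggered when a core running at f0 pushes its counter above T (earlier slides):
// evaluate every core, update confidence, and switch only when confidence says so.
void evaluate_dvfs() {
    for (int c = 0; c < NUM_CORES; ++c) {
        int suggested = sst_lookup(criticality[c]);
        for (int f = 0; f < NUM_FREQS; ++f) {
            if (f == suggested) { if (sct[c][f] < SCT_MAX) ++sct[c][f]; }  // build confidence
            else                { if (sct[c][f] > 0)       --sct[c][f]; }  // decay the rest
        }
        int best = 0;  // setting currently holding the maximum confidence counter
        for (int f = 1; f < NUM_FREQS; ++f)
            if (sct[c][f] > sct[c][best]) best = f;
        if (best != current_freq[c])
            current_freq[c] = best;   // stand-in for the actual DVFS transition
    }
    criticality.fill(0);              // reset criticality counters (this slide)
}
```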
34
Criticality Counter Threshold
In-order pipeline results (the harder case), 16 cores.
The threshold balances TCP speed against accuracy.
Average accuracy at a threshold of 1024 = 78.19%.
35
Integrating the Suggestion Confidence Table
Average accuracy with 2-bit SCT counters = 92.68%.
36
Case Study: Impact of SCT on Streamcluster
37
Impact of Memory Parallelism
38
Gradual DVFS
May not save as much energy as direct DVFS at low MSHR counts.
But at high MSHR counts, it is much more accurate than direct DVFS.
39
TCP-Guided DVFS Scheme Performance
40
Energy Savings
FPGA platform with 4 cores, 50% fixed leakage cost.
Even higher savings are expected with greater core counts, greater core complexity, realistic leakage modeling (temperature impact), and on-chip switching regulators.
41
Hardware Characteristics
Based on readily available on-chip cache statistics.
All thread criticality calculation is done in hardware with the SST.
Minimal network messages.
Low overhead and scalable: a 16-core CMP needs 71 bytes of storage; a 64-core CMP needs 215 bytes.
42
Conclusion
Low-overhead TCPs help manage parallelism for both energy and performance.
Accurate TCPs can be built from simple cache statistics.
The TCP-based TBB task stealer offers 12.9% to 32% performance improvements on a 32-core CMP.
TCP-based DVFS offers 15% energy savings on a 4-core CMP.
Future work: TLB prefetching, DRAM scheduling.