
1 Simple Criticality Predictors for Dynamic Performance and Power Management in CMPs Group Talk: Dec 10, 2008 Abhishek Bhattacharjee

2 Why Thread Criticality Prediction (TCP)?
- Significant load imbalance in typical parallel programs
- Threads of a parallel program don't finish at the same time
- Worse with heterogeneity (process variation, thermal emergencies, etc.)
- Relative thread speed, or criticality, is difficult to predict
If we can deduce that thread 2 is critical, and by how much, what can we do with this?

3 Performance Improvements from TCP
- Use TCP to improve the performance of dynamic parallelism management (e.g., TBB)
- Steal tasks from critical threads

4 Energy Efficiency from TCP
- TCP-guided DVFS for barriers
- Slow down non-critical threads without affecting runtime

5 Our Goals
- Develop low-overhead hardware TCP schemes
- Harness counters and metrics available on-chip
- Aim for high accuracy across architectures
- TCP should be general for a variety of applications
- Improve TBB performance with TCP-guided task stealing
- Improve barrier energy efficiency with TCP-guided DVFS

6 Related Work
- Thrifty Barrier [Li, Martinez, Huang, HPCA ‘05]
  - Transition fast, non-critical threads into low-power sleep modes to save energy at barriers
  - Predict barrier stall time based purely on history
  - Still wastes energy in the compute phase…
- Meeting Points [Cai et al., PACT ’08]
  - DVFS threads to save energy without performance penalty
  - Only applicable to parallel loops
  - Insert meeting points to track loop iterations executed per core
  - Broadcast iteration counts to all cores
  - Use software to calculate appropriate DVFS settings
Our approach:
1. Unlike thrifty barrier, predict before the barrier is reached
2. Unlike meeting points, avoid specialized instructions and broadcasts
3. Unlike meeting points, broader applicability than parallel loops
4. Unlike both approaches, use software-independent criticality calculation
5. Target versatility: TBB performance, barrier energy waste, SMT priority schemes, memory priority schemes, etc.

7 Outline
- Methodology
- Predicting Thread Criticality
- Basic TCP Hardware
- Improving TBB Performance with TCPs
- Minimizing Energy Waste at Barriers with TCPs

8 Methodology: Simulators
- Evaluate on in-order pipelines, where TCP accuracy is harder to achieve
- Assess energy savings accurately, with a full OS, on an emulator platform
- Assume a fixed leakage cost of 50% of baseline energy

9 Methodology: Benchmarks
- Use larger, realistic data sets for Splash-2 [Bienia et al., PACT ’08]

10 Outline
- Methodology
- Predicting Thread Criticality
- Basic TCP Hardware
- Improving TBB Performance with TCPs
- Minimizing Energy Waste at Barriers with TCPs

11 Per-Core Fully-Local History
- If behavior is sufficiently repetitive, fully-local history can be used (as in the thrifty barrier)
- Difficult to achieve on in-order pipelines
- Solution: use comparative information across threads
- Rationale: a single thread's criticality is determined relative to the other threads

12 Comparative Metric: Instruction Count
- Normalize compute times and the metric against the critical thread

13 Comparative Metric: Instruction Count
- Avg. error measured over barrier iterations and 10% execution snapshots of the non-barrier apps (Swaptions, Fluidanimate)
- Poor accuracy across all tested benchmarks

14 Comparative Metric: Cache Statistics
- Ocean and LU still suffer from over 25% error
- Next, include L1 I-Cache misses …

15 Comparative Metric: Cache Statistics
- Instruction count and control flow effects matter for LU, Water-Nsq, and Water-Sp (now captured via L1 I-cache misses)
- But Ocean still has over 22% error
- Next, include L2 cache misses …

16 Comparative Metric: Cache Statistics
- Memory-intensive Ocean, Volrend, Radix, and the PARSEC apps improve particularly
- Next, check the weighted cache misses metric on an out-of-order machine

17 Comparative Metric: Cache Statistics
- On the out-of-order machine, memory-intensive Ocean, Volrend, Radix, and the PARSEC apps are least impacted
- The weighted cache misses metric remains the most accurate

18 Comparative Metric: Control Flow and TLB Misses
- Control flow (branch misprediction) penalties and instruction count effects are already tracked by L1 I-Cache misses
- TLB misses:
  - Little variation among threads (they usually access closely spaced data)
  - Trivial to include a weighted TLB component if necessary
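Taken together, the slides above build up a comparative metric that is effectively a weighted sum of each thread's cache-miss counts. The slides do not state the exact weights, so the form below is a hedged summary rather than the talk's precise formula:

    criticality(i) ≈ Σ_m w_m × misses_m(i), summing over the tracked miss types m (L1 and L2 cache misses here),
    with w_L2 > w_L1 to reflect the larger L2 miss penalty; a weighted TLB term could be added the same way if needed.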

19 Outline
- Methodology
- Predicting Thread Criticality
- Basic TCP Hardware
- Improving TBB Performance with TCPs
- Minimizing Energy Waste at Barriers with TCPs

20 Basic TCP Hardware
- Criticality counters placed with the shared, unified L2 cache
- Simple, scalable hardware
- Eliminates broadcasts, since the L2 controller sees all cache misses
- Can accommodate more cache levels, a split L2, or distributed LLCs with trivial additional hardware and messages
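A minimal C++ sketch of such a per-core counter table, co-located with the L2 controller, is shown below. The class and method names and the L2-vs-L1 weight are illustrative assumptions rather than the talk's exact hardware.

#include <array>
#include <cstddef>
#include <cstdint>

// Which level of the cache hierarchy missed, as seen by the L2 controller.
enum class MissType { L1, L2 };

template <std::size_t NumCores>
class CriticalityTable {
public:
    // The L2 controller calls this for every miss it observes from `core`,
    // so no cross-core broadcasts are needed.
    void on_miss(std::size_t core, MissType type) {
        // Assumed weights: L2 misses count more than L1 misses because their
        // penalty is larger; the exact ratio is not given on the slides.
        counters_[core] += (type == MissType::L2) ? 10 : 1;
    }

    // The core with the highest counter is the predicted most-critical core.
    std::size_t most_critical_core() const {
        std::size_t best = 0;
        for (std::size_t i = 1; i < NumCores; ++i)
            if (counters_[i] > counters_[best]) best = i;
        return best;
    }

    void reset(std::size_t core) { counters_[core] = 0; }
    void reset_all() { counters_.fill(0); }

private:
    std::array<std::uint64_t, NumCores> counters_{};
};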

21 Outline
- Methodology
- Predicting Thread Criticality
- Basic TCP Hardware
- Improving TBB Performance with TCPs
- Minimizing Energy Waste at Barriers with TCPs

22 The TBB Task Scheduler
- Concurrency is expressed as tasks
- The TBB dynamic scheduler stores and distributes tasks
- The scheduler controls worker threads, each with a per-thread software queue
- Threads try to extract tasks from their local queue
- If the local queue is empty, threads steal tasks from remote queues
- If the steal is unsuccessful, back off for a pre-determined time before retrying
- Problem: the steal victim is chosen randomly
- Poor performance at higher core counts and high load imbalance

23 Our TCP-Guided TBB Stealing Algorithm

If (Cache miss from Core P) {
    Update criticality counter for Core P based on cache miss type
}
If (Steal request from a Core) {
    Scan all criticality counters to find the maximum value
    Report core with highest criticality counter value as steal victim
}
If (Message indicating steal from victim Core P unsuccessful) {
    Reset criticality counter for Core P
}
If ((Number of Cycles % Interval Bound) == 0) {
    Reset all criticality counters
}
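Building on the CriticalityTable sketch above, the four events in this algorithm could be wired up roughly as follows. The wrapper type and method names are hypothetical; the 100K-cycle interval is taken from the next slide.

#include <cstddef>
#include <cstdint>

// Assumed extension of the earlier CriticalityTable sketch.
template <std::size_t NumCores>
struct TcpStealLogic {
    CriticalityTable<NumCores> table;
    std::uint64_t interval_bound = 100000;   // "Interval Bound = 100K cycles"

    // Cache miss observed by the L2 controller for core `core`.
    void on_cache_miss(std::size_t core, MissType type) { table.on_miss(core, type); }

    // Steal request: report the core with the highest criticality counter as the victim.
    std::size_t on_steal_request() const { return table.most_critical_core(); }

    // Unsuccessful steal from victim P: reset its counter so it is not re-chosen immediately.
    void on_steal_failed(std::size_t victim) { table.reset(victim); }

    // Periodic reset every interval_bound cycles.
    void on_cycle(std::uint64_t cycle) {
        if (cycle % interval_bound == 0) table.reset_all();
    }
};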

24 Hardware Details
- Interval bound = 100K cycles
- Simple and scalable hardware: even at 64 cores with 14 bits per criticality counter, 114 bytes of storage
- Minimal message overhead
- A TCP access takes the same latency as an L2 cache miss

25 Steal Rate Improvements with TCP
- TCP-guided task stealing limits false negatives to under 7%
- At higher core counts, steal success rates are now greater

26 Performance Improvements with TCP
- Up to 32% performance gains over random stealing
- Regularly outperforms occupancy-based stealing
- Streamcluster is highly load-imbalanced; TCP and occupancy benefits are similar there

27 Outline
- Methodology
- Predicting Thread Criticality
- Basic TCP Hardware
- Improving TBB Performance with TCPs
- Minimizing Energy Waste at Barriers with TCPs

28 TCP-Driven Energy Efficiency in Barrier-Based Parallel Programs
- Goals recap:
  - Multiple threads reach barriers at different times
  - Use TCP to predict thread speeds
  - DVFS fast threads down to low frequencies
  - Assume frequency settings f0, 0.85f0, 0.70f0, 0.55f0
  - Aim: energy efficiency with no performance penalty
- But:
  - TCP mispredictions occur due to spurious program behavior
  - DVFS transition overhead should be minimized
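As a rough illustration of "DVFS fast threads down to low frequencies", the sketch below picks one of the four listed settings from a thread's criticality counter relative to the most critical thread's. The thresholds and the ratio-based policy are assumptions; the slides leave the exact mapping to the SST described next.

#include <cstdint>

// The four frequency settings listed above: f0, 0.85*f0, 0.70*f0, 0.55*f0.
enum class Freq { F0, F0_85, F0_70, F0_55 };

// Hypothetical mapping from relative criticality to a frequency setting.
// A thread whose counter is far below the most critical thread's is predicted
// to be well ahead of it, so it can be slowed further without hurting runtime.
Freq pick_frequency(std::uint64_t my_counter, std::uint64_t critical_counter) {
    if (critical_counter == 0) return Freq::F0;
    double ratio = static_cast<double>(my_counter) /
                   static_cast<double>(critical_counter);
    if (ratio > 0.85) return Freq::F0;       // nearly as critical: full speed
    if (ratio > 0.70) return Freq::F0_85;
    if (ratio > 0.55) return Freq::F0_70;
    return Freq::F0_55;                      // far ahead: slowest setting
}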

29 TCP-Driven DVFS Hardware
- On a cache miss from core P, update criticality counter P
- Is P running at f0, and is its counter above threshold T?

30 TCP-Driven DVFS Hardware
- For all cores, find the SST entry closest to the criticality counter value
- Is the SST suggesting a frequency switch?

31 TCP-Driven DVFS Hardware
- If the SST suggests a frequency switch, increment the suggested SCT counter and decrement the others
- If the new DVFS setting now has the maximum counter, perform the actual DVFS transition

32 TCP-Driven DVFS Hardware
- If the SST does not suggest a frequency switch, increment the current setting's SCT counter and decrement the others

33 TCP-Driven DVFS Hardware
- Finally, reset the criticality counters
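The SST/SCT steps walked through on the last few slides can be summarized in one sketch of the confidence filtering. The saturation limit follows the 2-bit SCT counters evaluated on slide 35, while the struct name and the strict-maximum check are assumptions.

#include <array>
#include <cstdint>

constexpr int kNumFreqs = 4;              // f0, 0.85f0, 0.70f0, 0.55f0
constexpr std::uint8_t kSaturate = 3;     // assumed 2-bit saturating counters (see slide 35)

struct SuggestionConfidence {
    std::array<std::uint8_t, kNumFreqs> sct{};   // one confidence counter per setting
    int current_freq = 0;                        // index of the setting currently applied

    // Called with the SST's suggested setting (equal to current_freq when the
    // SST is not suggesting a switch). Returns the setting to actually apply.
    int on_suggestion(int suggested) {
        for (int f = 0; f < kNumFreqs; ++f) {
            if (f == suggested) {
                if (sct[f] < kSaturate) ++sct[f];   // increment the suggested entry
            } else if (sct[f] > 0) {
                --sct[f];                           // decrement the others
            }
        }
        // Perform the actual DVFS transition only once the suggested setting
        // holds the strict maximum confidence, filtering out spurious suggestions.
        bool is_max = true;
        for (int f = 0; f < kNumFreqs; ++f)
            if (f != suggested && sct[f] >= sct[suggested]) is_max = false;
        if (is_max) current_freq = suggested;
        return current_freq;
    }
};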

34 Criticality Counter Threshold
- In-order pipeline results (the harder case), 16 cores
- The threshold balances TCP speed against accuracy
- Avg. accuracy at a threshold of 1024: 78.19%

35 Integrating the Suggestion Confidence Table (SCT)
- Avg. accuracy with 2-bit SCT counters: 92.68%

36 Case Study: Impact of SCT on Streamcluster

37 Impact of Memory Parallelism

38 Gradual DVFS
- May not save as much energy as direct DVFS at low MSHR counts
- But at high MSHR counts, much higher accuracy than direct DVFS
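A hedged sketch of the contrast: direct DVFS jumps straight to the target setting, while gradual DVFS moves one frequency level per decision, trading some energy at low MSHR counts for robustness to mispredictions at high MSHR counts. The one-level-per-decision stepping is an assumption about what "gradual" means here.

// Direct DVFS: jump straight to the target setting suggested by the TCP.
int next_freq_direct(int /*current*/, int target) {
    return target;
}

// Gradual DVFS: move one frequency level per decision toward the target,
// so a misprediction shifts the core at most one level in the wrong direction.
int next_freq_gradual(int current, int target) {
    if (target > current) return current + 1;
    if (target < current) return current - 1;
    return current;
}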

39 TCP-Guided DVFS Scheme Performance

40 Energy Savings
- FPGA platform with 4 cores, 50% fixed leakage cost
- Even higher savings expected with greater core counts, more complex cores, realistic leakage modeling (temperature impact), and on-chip switching regulators

41 Hardware Characteristics
- Based on readily available on-chip cache statistics
- All thread criticality calculation is done in hardware with the SST
- Minimal network messages
- Low overhead and scalable:
  - 16-core CMP: 71 bytes of storage
  - 64-core CMP: 215 bytes of storage

42 Conclusion
- Low-overhead TCPs help manage parallelism for energy and performance
- Accurate TCPs can be based on simple cache statistics
- The TCP-based TBB task stealer offers 12.9% to 32% performance improvements on a 32-core CMP
- TCP-based DVFS offers 15% energy savings on a 4-core CMP
- Future work: TLB prefetching, DRAM scheduling

