Published by Junior Wilkins. Modified over 9 years ago.
1
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors
Abhishek Bhattacharjee and Margaret Martonosi
International Symposium on Computer Architecture (ISCA), 2009
Presented by Yu-Hsin Lin, 2010/10/01
2
Outline
- Thread Criticality Predictor (TCP)
- Methodology
- Architectural metrics for predicting criticality
- Basic design of TCP
- TCP uses
  - Apply to Intel’s Threading Building Blocks (TBB)
  - Apply for energy efficiency in barrier-based programs
- Conclusion
4
Thread Criticality Predictor (TCP)
Thread Criticality Predictors (TCPs) determine which threads are critical (or non-critical). This information guides:
- Task-stealing decisions for performance
- Dynamic voltage and frequency scaling (DVFS) for energy efficiency
5
TCP Goals
The TCP needs to be:
- Highly accurate
- Low overhead
- Versatile across a range of applications
6
The Case for TCPs
9
Methodology: TCP Accuracy Evaluation
10
Methodology: TCP-Aided TBB Performance Evaluations
11
Methodology: TCP-Guided DVFS Energy Evaluation
12
Methodology: Benchmarks
14
Architectural metrics
- History-based local TCPs
- Instruction counts
- Cache misses
15
History-Based Local TCPs
Compute and stall times are highly variable!
16
Instruction Counts
17
L1 D Cache Misses
[Figure annotations: "Under 8% error"; "Over 25% error"]
18
L1 I & D Cache Misses
19
All L1 & L2 Cache Misses
20
All L1 & L2 Cache Misses
[Figure annotation: "error increases of 3% of GEMS"]
22
Basic Design of TCP
23
Criticality Counter
- Criticality Counters count the L1 and L2 cache misses resulting from each core’s references.
- The proposed criticality counter values are weighted: since L2 misses incur a larger penalty, their weight is proportionately higher.
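The weighting formula itself did not survive in this transcript, so the following is only a minimal sketch of one plausible weighting. It assumes the weight of an L2 miss is the ratio of the L2-miss penalty to the L1-miss penalty; the function name, the penalty values (10 and 100 cycles), and the per-core miss counts are all hypothetical.

```python
def weighted_criticality(l1_misses, l2_misses, l1_penalty=10, l2_penalty=100):
    """Combine miss counts into a single criticality value.

    l1_misses: misses in the L1 that hit in the L2 (assumed meaning).
    l2_misses: misses that also miss in the L2.
    The default penalties (in cycles) are illustrative assumptions,
    not values taken from the slides.
    """
    weight = l2_penalty // l1_penalty  # an L2 miss counts `weight` times
    return l1_misses + weight * l2_misses


# Hypothetical (L1 misses, L2 misses) per core over one interval.
per_core = {0: (40, 2), 1: (25, 9), 2: (60, 0), 3: (10, 1)}
counters = {core: weighted_criticality(*m) for core, m in per_core.items()}

# The core with the largest weighted count is predicted to be critical.
critical_core = max(counters, key=counters.get)
```

With these made-up numbers, core 1 has fewer L1 misses than core 2 but is still predicted critical, because its L2 misses dominate the weighted count.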
24
Interval Bound Register
- Incremented on every clock cycle.
- Ensures that criticality predictions are based on relatively recent application behavior.
- All Criticality Counters are reset whenever the Interval Bound Register reaches a pre-defined threshold M (100K cycles).
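The reset behavior can be illustrated with a toy software model. This is a sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, and clearing the Interval Bound Register itself after a reset is an assumption the slide does not state.

```python
class ThreadCriticalityPredictor:
    """Toy model of per-core Criticality Counters plus an Interval Bound Register."""

    def __init__(self, num_cores, interval_bound=100_000):
        self.counters = [0] * num_cores
        self.interval_bound = interval_bound  # threshold M from the slide
        self.interval_register = 0

    def record_miss(self, core, weight=1):
        """Add a (possibly weighted) miss to one core's counter."""
        self.counters[core] += weight

    def tick(self, cycles=1):
        """Advance the clock; reset every counter once the threshold is reached."""
        self.interval_register += cycles
        if self.interval_register >= self.interval_bound:
            self.counters = [0] * len(self.counters)
            self.interval_register = 0  # assumption: the register restarts at zero


tcp = ThreadCriticalityPredictor(num_cores=4)
tcp.record_miss(1)             # e.g. an L1 miss on core 1
tcp.record_miss(2, weight=10)  # e.g. a more heavily weighted miss on core 2
tcp.tick(99_999)
before_reset = list(tcp.counters)
tcp.tick(1)                    # cycle 100,000: the threshold M is reached
after_reset = list(tcp.counters)
```

Stale history from an earlier program phase is discarded at each interval boundary, which is what keeps predictions tied to recent behavior.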
25
Basic TCP Hardware Example
[Diagram: Core 0 to Core 3, each with a private L1 I$ and L1 D$, a shared L2 cache, and TCP hardware holding the Criticality Counters]
Criticality Counters: 0000
26
Basic TCP Hardware Example
L1 cache miss!
Criticality Counters: 0100
27
Basic TCP Hardware Example
L1 cache miss!
Criticality Counters: 0110
28
Basic TCP Hardware Example
L1 cache miss! L2 cache miss!
Criticality Counters: 01110
29
Basic TCP Hardware Example
L1 cache miss! L2 cache miss!
Criticality Counters: 01110
Periodically refresh the criticality counters with the Interval Bound Register.
31
Intel’s Threading Building Blocks
TBB task stealing
- The TBB dynamic scheduler distributes tasks.
- Each thread maintains a software queue filled with tasks.
- Empty queue: the thread steals a task from another thread’s queue.
Approaches
- Random task stealing
- Occupancy-based task stealing [1], based on the number of items in each queue
[1] G. Contreras and M. Martonosi. Characterizing and Improving the Performance of Intel Threading Building Blocks. IEEE Intl. Symp. on Workload Characterization, 2008.
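The stealing policies differ only in how the victim queue is chosen. The sketch below is not TBB's scheduler code; the function names and data are hypothetical. It also assumes that TCP-guided stealing picks the thread with the highest criticality counter, which the slides imply but do not spell out.

```python
import random
from collections import deque


def _candidates(thief, queues):
    """Threads other than the thief that still have tasks queued."""
    return [t for t, q in queues.items() if t != thief and q]


def pick_victim_random(thief, queues, rng):
    """Random task stealing: any non-empty queue is equally likely."""
    cands = _candidates(thief, queues)
    return rng.choice(cands) if cands else None


def pick_victim_by_occupancy(thief, queues):
    """Occupancy-based stealing: choose the fullest queue."""
    cands = _candidates(thief, queues)
    return max(cands, key=lambda t: len(queues[t])) if cands else None


def pick_victim_by_criticality(thief, queues, criticality):
    """TCP-guided stealing (assumed policy): choose the most critical thread."""
    cands = _candidates(thief, queues)
    return max(cands, key=lambda t: criticality[t]) if cands else None


# Hypothetical state: thread 0 has run out of work.
queues = {0: deque(), 1: deque(["a", "b", "c"]), 2: deque(["d"]), 3: deque()}
criticality = {0: 5, 1: 20, 2: 90, 3: 0}

random_victim = pick_victim_random(0, queues, random.Random(0))
occupancy_victim = pick_victim_by_occupancy(0, queues)
tcp_victim = pick_victim_by_criticality(0, queues, criticality)

# Thread 0 takes a task from the TCP-chosen victim's queue.
stolen = queues[tcp_victim].popleft()
```

In this made-up state the two informed policies disagree: occupancy points at thread 1, which has the longest queue, while the criticality counters point at thread 2, which has one task but is progressing most slowly.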
32
Developing Predictor Hardware To Improve TBB Task Stealing
- 14-bit Criticality Counters
- Interval Bound value of 100K cycles
- A 64-core CMP requires 114 bytes for the Criticality Counters and Interval Bound Register
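The 114-byte figure can be checked with quick arithmetic. Only the counter width, core count, and total come from the slide; the 2 bytes attributed here to the Interval Bound Register is inferred by subtraction rather than stated.

```python
NUM_CORES = 64
COUNTER_BITS = 14
TOTAL_BYTES = 114  # total stated on the slide

counter_bits_total = NUM_CORES * COUNTER_BITS    # 64 * 14 = 896 bits
counter_bytes = counter_bits_total // 8          # 896 / 8 = 112 bytes
remaining_bytes = TOTAL_BYTES - counter_bytes    # left for the Interval Bound Register

# Largest value a single 14-bit counter can hold before saturating or wrapping.
max_counter_value = 2 ** COUNTER_BITS - 1

print(counter_bytes, remaining_bytes, max_counter_value)
```

The counters account for nearly all of the storage, and the cost grows linearly with the number of cores.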
33
Experimental Results
[Figure: random stealing compared with TCP-guided stealing]
34
TCP-guided stealing versus occupancy-based stealing
36
Adapting TCP for Energy Efficiency
Apply DVFS to non-critical threads to eliminate barrier stall time.
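One way to picture the idea is a policy that slows each non-critical core in proportion to how far its criticality counter trails the most critical core's, so that all threads reach the barrier at roughly the same time. This is a simplified proportional policy assumed for illustration, not the mechanism evaluated in the paper; the function name, the frequency levels, and the counter values are hypothetical.

```python
def choose_frequencies(counters, levels_ghz):
    """Pick a frequency per core from its criticality counter.

    The most critical core runs at the top frequency. Every other core gets
    the lowest available level that is still at least its proportional share.
    """
    levels = sorted(levels_ghz)
    f_max = levels[-1]
    top = max(counters)
    if top == 0:
        # No misses recorded yet: nothing to distinguish, so run at full speed.
        return [f_max] * len(counters)
    chosen = []
    for count in counters:
        target = f_max * (count / top)
        # Lowest level that does not undershoot the target frequency.
        chosen.append(next((f for f in levels if f >= target), f_max))
    return chosen


# Hypothetical counters for four cores and three available DVFS levels.
frequencies = choose_frequencies([60, 115, 60, 20], [1.0, 1.5, 2.0])
```

Rounding up to the next available level errs toward performance: a non-critical thread may still wait briefly at the barrier, but it should not become the new critical thread.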
37
TCP for DVFS: Results
39
Conclusion
Simple but effective thread criticality predictors
- Accuracy: based on simple cache statistics
- Low-overhead hardware: scalable per-core criticality counters
- Versatility:
  - TBB improved by 13.8% at 32 cores
  - DVFS used to achieve 15% energy savings
40
Thank you for listening!
41
Appendix A. Two Benchmark Suites: SPLASH-2 and PARSEC
SPLASH-2 (Stanford Parallel Applications for Shared Memory)
- S. Woo et al. The SPLASH-2 Programs: Characterization and Methodological Considerations. Intl. Symp. on Computer Architecture, 1995.
PARSEC (Princeton Application Repository for Shared-Memory Computers)
- C. Bienia et al. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008.
SPLASH-2 versus PARSEC
- C. Bienia, S. Kumar, and K. Li. PARSEC vs SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip Multiprocessors. IEEE Intl. Symp. on Workload Characterization, 2008.
42
SPLASH-2
43
PARSEC