Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee and Margaret Martonosi.

International Symposium on Computer Architecture (ISCA), 2009. Presented by Yu-Hsin Lin, 2010/10/01.

Outline
  Thread Criticality Predictor (TCP)
  Methodology
  Architectural metrics for predicting criticality
  Basic design of TCP
  TCP uses
    Apply to Intel’s Threading Building Blocks (TBB)
    Apply for energy efficiency in barrier-based programs
  Conclusion


Thread Criticality Predictor (TCP)
  Thread Criticality Predictors (TCPs) determine which threads are critical (or non-critical). The prediction guides:
    Task-stealing decisions for performance
    Dynamic voltage and frequency scaling (DVFS) for energy efficiency

TCP Goals
  The TCP needs to be:
    Highly accurate
    Low overhead
    Versatile across a range of applications

The Case for TCPs


Methodology
  TCP accuracy evaluation
  TCP-aided TBB performance evaluations
  TCP-guided DVFS energy evaluation
  Benchmarks

Architectural metrics
  History-based local TCPs
  Instruction counts
  Cache misses

History-Based Local TCPs
  Compute and stall times are highly variable!

Instruction Counts

L1 D Cache Misses
  Figure annotations: under 8% error; over 25% error

L1 I & D Cache Misses

All L1 & L2 Cache Misses

All L1 & L2 Cache Misses (continued)
  Figure annotation: error increases of 3% of GEMS

Basic Design of TCP

Criticality Counter
  Criticality Counters count the L1 and L2 cache misses resulting from each core’s references.
  The proposed metric is a weighted criticality counter value.
  Since L2 misses incur a larger penalty, their weight is proportionately higher.
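The weighting idea on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' hardware: the function name `weighted_criticality` and both penalty values are assumptions chosen only for the example, since the slide does not give concrete latencies.

```python
# Assumed latencies in cycles; the slides do not state concrete values.
L1_MISS_PENALTY = 10    # hypothetical: an L1 miss that hits in the L2
L2_MISS_PENALTY = 200   # hypothetical: an L2 miss that goes to memory


def weighted_criticality(l1_misses: int, l2_misses: int) -> int:
    """Combine miss counts into one criticality value.

    Each L1 miss counts as 1. Each L2 miss counts as the ratio of the two
    penalties, so costlier misses raise the counter proportionately more.
    """
    l2_weight = L2_MISS_PENALTY // L1_MISS_PENALTY
    return l1_misses + l2_weight * l2_misses


# Per-core (L1 misses, L2 misses) observed during one interval (made-up data).
per_core = {0: (40, 0), 1: (10, 3), 2: (25, 1), 3: (5, 0)}
scores = {core: weighted_criticality(*misses) for core, misses in per_core.items()}

# The core with the largest weighted count is predicted to be critical.
critical_core = max(scores, key=scores.get)
```

With these assumed numbers, core 1 has the fewest L1 misses among the busy cores but still ranks as most critical, because its three L2 misses dominate the weighted sum.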

Interval Bound Register
  Incremented on every clock cycle.
  Ensures that criticality predictions are based on relatively recent application behavior.
  All Criticality Counters are reset whenever the Interval Bound Register reaches a predefined threshold M (100K cycles).
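The reset behavior can be modeled in software as follows. This is a sketch under stated assumptions: the class name `CriticalityCounters` and its methods are hypothetical, and the example uses a tiny threshold instead of 100K cycles so the reset is easy to observe.

```python
class CriticalityCounters:
    """Per-core counters that are cleared every `threshold` cycles."""

    def __init__(self, num_cores: int, threshold: int = 100_000):
        self.counters = [0] * num_cores
        self.threshold = threshold   # M on the slide (100K cycles by default)
        self.interval_bound = 0      # models the Interval Bound Register

    def record_miss(self, core: int, weight: int = 1) -> None:
        """Add a (possibly weighted) miss to one core's counter."""
        self.counters[core] += weight

    def tick(self) -> None:
        """Advance one clock cycle; reset all counters when M is reached."""
        self.interval_bound += 1
        if self.interval_bound >= self.threshold:
            self.counters = [0] * len(self.counters)
            self.interval_bound = 0


tcp = CriticalityCounters(num_cores=4, threshold=5)  # small M for illustration
tcp.record_miss(core=1)
for _ in range(4):
    tcp.tick()
before_reset = list(tcp.counters)   # the miss is still visible
tcp.tick()                          # fifth cycle reaches the threshold
after_reset = list(tcp.counters)    # all counters cleared
```

Clearing the counters at a fixed interval keeps stale misses from an earlier program phase from influencing the current prediction.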

Basic TCP Hardware Example
  Four cores (Core 0 to Core 3), each with its own L1 I cache and L1 D cache, share an L2 cache. The TCP hardware holds the Criticality Counters.
  Criticality Counters as the example proceeds:
    0000 (initial state)
    0100 (L1 cache miss)
    0110 (L1 cache miss)
    01110 (L2 cache miss)
  The Criticality Counters are periodically refreshed using the Interval Bound Register.

Intel’s Threading Building Blocks
  TBB task stealing
    The TBB dynamic scheduler distributes tasks.
    Each thread maintains a software queue filled with tasks.
    Empty queue: the thread steals a task from another thread’s queue.
  Approaches
    Random task stealing
    Occupancy-based task stealing [1], based on the number of items in the queue

[1] G. Contreras and M. Martonosi. Characterizing and Improving the Performance of Intel Threading Building Blocks. IEEE Intl. Symp. on Workload Characterization, 2008.
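The victim-selection policies can be compared with a small sketch. It does not use the real TBB API: the function names, the queue representation, and the criticality values are all hypothetical. The third function assumes that TCP-guided stealing picks the thread with the highest criticality counter as the victim.

```python
import random
from collections import deque


def _candidates(queues, thief):
    """Threads other than the thief that still have queued tasks."""
    return [t for t, q in enumerate(queues) if t != thief and q]


def pick_random(queues, thief, rng=random):
    """Random stealing: any other thread with work may be the victim."""
    victims = _candidates(queues, thief)
    return rng.choice(victims) if victims else None


def pick_by_occupancy(queues, thief):
    """Occupancy-based stealing: the victim has the fullest queue."""
    victims = _candidates(queues, thief)
    return max(victims, key=lambda t: len(queues[t])) if victims else None


def pick_by_criticality(queues, thief, criticality):
    """TCP-guided stealing (assumed policy): the victim is the most
    critical thread that still has work."""
    victims = _candidates(queues, thief)
    return max(victims, key=lambda t: criticality[t]) if victims else None


# Thread 0 has run out of work; threads 1 and 2 still hold tasks.
queues = [deque(), deque(["a", "b", "c"]), deque(["d"]), deque()]
criticality = [0, 12, 87, 3]   # made-up criticality counter values
thief = 0
by_occupancy = pick_by_occupancy(queues, thief)
by_criticality = pick_by_criticality(queues, thief, criticality)
```

In this made-up case the two informed policies disagree: occupancy favors thread 1 because its queue is longer, while the criticality-based choice favors thread 2 because its counter is higher.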

Developing Predictor Hardware to Improve TBB Task Stealing
  14-bit Criticality Counters
  Interval Bound value of 100K cycles
  A 64-core CMP requires 114 bytes for the Criticality Counters and Interval Bound Register
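The 114-byte figure can be checked with simple arithmetic. The slide does not state the width of the Interval Bound Register, so the sketch below assumes the smallest width able to count to 100K, which is 17 bits.

```python
NUM_CORES = 64
COUNTER_BITS = 14
INTERVAL_BOUND = 100_000

# One 14-bit Criticality Counter per core: 896 bits, i.e. 112 bytes.
counter_bits = NUM_CORES * COUNTER_BITS

# Assumption: the register is just wide enough to hold 100,000 (17 bits).
interval_register_bits = INTERVAL_BOUND.bit_length()

total_bits = counter_bits + interval_register_bits
total_bytes = total_bits / 8   # slightly above 114 bytes

print(f"{total_bits} bits = {total_bytes} bytes")
```

Under that assumption the total is 913 bits, which matches the roughly 114 bytes quoted on the slide.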

Experimental Results
  Random stealing
  TCP-guided stealing

TCP-guided stealing versus occupancy-based stealing

Adapting TCP for Energy Efficiency
  Apply DVFS to non-critical threads to eliminate barrier stall time
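One way to turn criticality counters into DVFS settings is sketched below. It is an illustration only, not the mechanism evaluated in the talk: the frequency levels, the function name `choose_frequencies`, and the rule of scaling each core in proportion to its counter relative to the most critical core are all assumptions.

```python
FREQ_LEVELS_GHZ = [1.0, 2.0, 3.0, 4.0]   # hypothetical operating points, ascending


def choose_frequencies(counters, levels=FREQ_LEVELS_GHZ):
    """Pick a frequency per core from its criticality counter.

    The most critical core keeps the top frequency. Every other core gets
    the lowest level that is at least f_max * counter / max_counter, so it
    slows down instead of idling at the barrier.
    """
    f_max = levels[-1]
    c_max = max(counters)
    chosen = []
    for c in counters:
        if c_max == 0:
            chosen.append(f_max)   # no criticality information yet
            continue
        target = f_max * c / c_max
        chosen.append(next(f for f in levels if f >= target))
    return chosen


# Made-up counters for four threads heading toward the same barrier.
freqs = choose_frequencies([100, 45, 80, 10])
```

Rounding up to the next available level is a conservative choice here: a non-critical thread is never slowed so much that it would become the new critical thread under this simple proportional model.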

TCP for DVFS: Results

Conclusion
  Simple but effective thread criticality predictors
  Accuracy
    Based on simple cache statistics
  Low-overhead hardware
    Scalable per-core criticality counters
  Versatility
    TBB improved by 13.8% at 32 cores
    DVFS used to achieve 15% energy savings

Thank you for listening!

Appendix A. Two Benchmark Suites: SPLASH-2 and PARSEC
  SPLASH-2 (Stanford ParalleL Applications for SHared memory)
    S. Woo et al. The SPLASH-2 Programs: Characterization and Methodological Considerations. Intl. Symp. on Computer Architecture, 1995.
  PARSEC (Princeton Application Repository for Shared-Memory Computers)
    C. Bienia et al. The PARSEC Benchmark Suite: Characterization and Architectural Implications. Intl. Conf. on Parallel Architectures and Compilation Techniques, 2008.
  SPLASH-2 versus PARSEC
    C. Bienia, S. Kumar, and K. Li. PARSEC vs SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip Multiprocessors. IEEE Intl. Symp. on Workload Characterization, 2008.

SPLASH-2

PARSEC
