Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors. Abhishek Bhattacharjee and Margaret Martonosi.

Presentation transcript:

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors. Abhishek Bhattacharjee and Margaret Martonosi, Princeton University.

Why Thread Criticality Prediction?
[Figure: instructions executed by threads T0-T3, broken down into D-cache misses, I-cache misses, and stalls; threads 1 and 3 are critical.]
- Critical threads cause performance degradation and energy inefficiency
- Sources of variability: algorithm, process variations, thermal emergencies, etc.
- With thread criticality prediction:
  1. Task stealing for performance
  2. DVFS for energy efficiency
  3. Many others...

Related Work
- Instruction criticality [Fields et al., Tune et al., etc.]
- Thrifty barrier [Li et al. 2005]: faster cores are transitioned into a low-power mode based on a prediction of barrier stall time
- DVFS for energy efficiency at barriers [Liu et al. 2005]
- Meeting points [Cai et al. 2008]: DVFS non-critical threads by tracking the loop-iteration completion rate across cores (parallel loops)
Our approach:
1. Also handles non-barrier code
2. Works with constant or variable loop-iteration sizes
3. Predicts criticality at any point in time, not just at barriers

Thread Criticality Prediction Goals
Design goals:
1. Accuracy: absolute TCP accuracy and relative TCP accuracy
2. Low-overhead implementation: simple hardware (allowing software policies to be built on top)
3. One predictor, many uses
Design decisions:
1. Find a suitable architectural metric
2. History-based local approach versus thread-comparative approach
3. This paper: TBB, DVFS. Other uses: shared LLC management, SMT and memory priority, ...

Outline of this Talk
- Thread Criticality Predictor Design
  - Methodology
  - Identify microarchitectural events impacting thread criticality
  - Introduce basic TCP hardware
- Thread Criticality Predictor Uses
  - Apply to Intel's Threading Building Blocks (TBB)
  - Apply for energy efficiency in barrier-based programs

Methodology
- Evaluations on a range of architectures: high-performance and embedded domains
- Full-system, including the OS
- Detailed power/energy studies using an FPGA emulator

Infrastructure  | Domain                                      | System                      | Cores         | Caches
GEMS Simulator  | High-performance, wide-issue, out-of-order  | 16-core CMP with Solaris 10 | 4-issue SPARC | 32KB L1, 4MB L2
ARM Simulator   | Embedded, in-order                          | 4-32 core CMP               | 2-issue ARM   | 32KB L1, 4MB L2
FPGA Emulator   | Embedded, in-order                          | 4-core CMP with Linux       | SPARC         | 4KB I-Cache, 8KB D-Cache

Why not History-Based TCPs?
+ Information is local to the core: no communication needed
- Requires repetitive barrier behavior
- Problematic for in-order pipelines: highly variable IPCs

Thread-Comparative Metrics for TCP: Instruction Counts

Thread-Comparative Metrics for TCP: L1 D Cache Misses

Thread-Comparative Metrics for TCP: L1 I & D Cache Misses

Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

Outline of this Talk (recap): Thread Criticality Predictor Design; Thread Criticality Predictor Uses

Basic TCP Hardware
[Figure/animation: four cores with private L1 I- and D-caches share an L2 cache; the TCP hardware sits at the L2 controller and holds per-core criticality counters. The animation steps through L1 I-cache, L1 D-cache, and L2 misses, showing a poorly cached thread falling behind the others.]
- Per-core criticality counters track poorly cached, slow threads
- Criticality counters are periodically refreshed using an Interval Bound Register
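To make the counter mechanism concrete, here is a minimal software sketch of per-core criticality counters that are incremented by latency-weighted cache-miss penalties and periodically refreshed. The class name, miss weights, and refresh interval are illustrative assumptions, not the paper's exact hardware parameters.

```python
# Minimal sketch of per-core criticality counters (illustrative only).
# The miss weights stand in for relative miss latencies; the real TCP
# hardware sits at the shared-L2 controller, not in software.

class ThreadCriticalityPredictor:
    # Assumed relative penalties per miss type (not the paper's values).
    MISS_WEIGHTS = {"L1I": 1, "L1D": 1, "L2": 10}

    def __init__(self, num_cores, interval_bound=100_000):
        self.counters = [0] * num_cores          # per-core criticality counters
        self.interval_bound = interval_bound     # plays the role of the Interval Bound Register
        self.cycles_since_refresh = 0

    def record_miss(self, core_id, miss_type):
        """Called when core `core_id` suffers a cache miss of the given type."""
        self.counters[core_id] += self.MISS_WEIGHTS[miss_type]

    def tick(self, cycles=1):
        """Periodically refresh the counters so stale history does not dominate."""
        self.cycles_since_refresh += cycles
        if self.cycles_since_refresh >= self.interval_bound:
            self.counters = [0] * len(self.counters)
            self.cycles_since_refresh = 0

    def most_critical_core(self):
        """The core with the largest counter is predicted to be most critical."""
        return max(range(len(self.counters)), key=lambda c: self.counters[c])
```

A runtime policy (task stealing or DVFS) would query most_critical_core() at its decision points.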

Outline of this Talk (recap): Thread Criticality Predictor (TCP) Design; Thread Criticality Predictor Uses

TBB Task Stealing & Thread Criticality
- The TBB dynamic scheduler distributes tasks
  - Each thread maintains a software queue filled with tasks
  - When its queue is empty, a thread "steals" a task from another thread's queue
- Approach 1: default TBB uses random task stealing
  - More failed steals at higher core counts, hence poor performance
- Approach 2: occupancy-based task stealing [Contreras, Martonosi 2008]
  - Steal based on the number of items in each software queue
  - Must track and compare maximum occupancy counts

TCP-Guided TBB Task Stealing
[Figure/animation: four cores with software task queues; on a steal request, the TCP control logic scans the criticality counters for the maximum value and directs the steal to the most critical thread's queue.]
- The TCP initiates steals from the critical thread
- Modest message overhead: an L2 access latency
- Scalable: 14-bit criticality counters amount to 112 bytes for 64 cores
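As a rough illustration of the stealing policy above, the sketch below picks the steal victim as the core whose criticality counter is largest among cores with pending work, breaking ties by queue occupancy. The function names and the tie-breaking rule are assumptions for illustration (reusing the ThreadCriticalityPredictor sketch from the hardware slide), not TBB's actual internals.

```python
# Illustrative TCP-guided steal-victim selection (not TBB's real scheduler).
# `tcp` is the ThreadCriticalityPredictor sketch above; `queues` maps
# core id -> list of pending tasks in that core's software queue.

def choose_steal_victim(tcp, queues, thief_core):
    candidates = [c for c in queues if c != thief_core and queues[c]]
    if not candidates:
        return None  # nothing to steal anywhere
    # Prefer the most critical core; break ties by queue occupancy (assumption).
    return max(candidates, key=lambda c: (tcp.counters[c], len(queues[c])))

def steal(tcp, queues, thief_core):
    victim = choose_steal_victim(tcp, queues, thief_core)
    if victim is None:
        return None
    task = queues[victim].pop()        # offload work from the critical thread's queue
    queues[thief_core].append(task)
    return task
```

Because the victim is always a non-empty queue, the false negatives of random stealing disappear, and each steal relieves the thread predicted to finish last.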

TCP-Guided TBB Performance
[Figure: % performance improvement versus random task stealing; TCP accesses are penalized with the L2 latency.]
- Average improvement over random stealing (32 cores): 21.6%
- Average improvement over occupancy-based stealing (32 cores): 13.8%

Outline of this Talk (recap): Thread Criticality Predictor Design; Thread Criticality Predictor Uses

Adapting TCP for Energy Efficiency in Barrier-Based Programs
[Figure: instructions executed by threads T0-T3; T1 suffers an L2 D-cache miss and becomes critical, so T0, T2, and T3 can be scaled down with DVFS.]
- Approach: apply DVFS to non-critical threads to eliminate barrier stall time
- Challenges: relative criticalities, misprediction costs, DVFS overheads
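To make the intuition concrete, here is a hedged sketch of the kind of decision a barrier-oriented DVFS policy could make: scale each non-critical core down just enough that it still reaches the barrier no later than the predicted critical thread. The proportional-slowdown model, the work estimates, and the frequency levels (taken from the backup slides) are illustrative assumptions; the paper's actual mechanism uses the SST/SCT tables outlined later.

```python
# Illustrative barrier-DVFS decision. Assumes run time scales as 1/frequency,
# which ignores memory-bound phases; frequency levels follow the backup slides.

FREQ_LEVELS = [1.00, 0.85, 0.70, 0.55]   # available levels as fractions of f0

def pick_frequencies(remaining_work, critical_core):
    """remaining_work[c]: estimated cycles left at f0 for core c before the barrier."""
    t_critical = remaining_work[critical_core]   # the critical thread stays at f0
    freqs = {}
    for core, work in remaining_work.items():
        if core == critical_core:
            freqs[core] = 1.00
            continue
        # Slowest level that still finishes within the critical thread's time.
        feasible = [f for f in FREQ_LEVELS if work / f <= t_critical]
        freqs[core] = min(feasible) if feasible else 1.00
    return freqs

# Example: core 1 is critical; cores 0, 2, and 3 have slack and are scaled down.
print(pick_frequencies({0: 60, 1: 100, 2: 80, 3: 40}, critical_core=1))
# -> {0: 0.7, 1: 1.0, 2: 0.85, 3: 0.55}
```

The challenges listed above are exactly why the real design adds thresholds and confidence counters: mispredicting the critical thread or switching too eagerly can erase the savings.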

TCP for DVFS: Results
- FPGA platform with 4 cores and a 50% fixed leakage cost
- Average energy savings of 15%
- See the paper for details: TCP mispredictions, DVFS overheads, etc.

Conclusions
- Goal 1: Accuracy
  - Accurate TCPs based on simple cache statistics
- Goal 2: Low-overhead hardware
  - Scalable per-core criticality counters
  - TCP placed in a central location where cache information is already available
- Goal 3: Versatility
  - TBB improved by 13.8% over the best known scheme at 32 cores
  - DVFS used to achieve 15% energy savings
  - Two uses shown; many others possible...


Backup

Benchmarks

Benchmark     | Suite    | Problem Size                          | TCP App. Studied
LU            | SPLASH-2 | 1024x1024 matrix, 64x64 blocks        | DVFS
Barnes        | SPLASH-2 | 65,536 particles                      | DVFS
Volrend       | SPLASH-2 | Head                                  | DVFS
Ocean         | SPLASH-2 | 514x514 grid                          | DVFS
FFT           | SPLASH-2 | 4,194,304 data points                 | DVFS
Cholesky      | SPLASH-2 | tk29.O                                | DVFS
Radix         | SPLASH-2 | 8,388,608 integers                    | DVFS
Water-Nsq     | SPLASH-2 | molecules                             | DVFS
Water-Sp      | SPLASH-2 | molecules                             | DVFS
Blackscholes  | PARSEC   | 16,385 (simmedium)                    | DVFS, TBB
Streamcluster | PARSEC   | 8192 points per block (simmedium)     | DVFS, TBB
Swaptions     | PARSEC   | 32 swaptions, 5000 sims. (simmedium)  | TBB
Fluidanimate  | PARSEC   | 5 frames, 100K particles (simmedium)  | TBB

- Larger, more realistic data sets are used for SPLASH-2 [Bienia et al. '08]

How is the %Error of a Metric Calculated?
For one barrier iteration (or a 10% execution snapshot), track the following per thread i:
1. Number of instructions, I_i
2. Number of cache misses per instruction, CM_i
3. Compute time, CT_i
Suppose thread 1 is critical. For every other thread i (here threads 0, 2, and 3), form the metric ratios I_i / I_1 and CM_i / CM_1 and compare them with the compute-time ratio CT_i / CT_1.
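The slide stops at "compare with"; one natural reading, shown in the hedged sketch below, is to report the relative difference between each metric ratio and the ground-truth compute-time ratio, averaged over the non-critical threads. The exact error definition here is an assumption; the paper gives the definitive formula.

```python
# Hedged sketch of one plausible %Error computation for a criticality metric.
# metric[i] is the per-thread metric value (e.g. cache misses per instruction),
# compute_time[i] is the measured compute time CT_i, and `critical` is the
# index of the critical (slowest) thread.

def percent_error(metric, compute_time, critical):
    errors = []
    for i in range(len(metric)):
        if i == critical:
            continue
        metric_ratio = metric[i] / metric[critical]
        time_ratio = compute_time[i] / compute_time[critical]
        # Relative mismatch between the predicted and the actual criticality ratio.
        errors.append(abs(metric_ratio - time_ratio) / time_ratio * 100.0)
    return sum(errors) / len(errors)

# Example with four threads; thread 1 is critical.
print(percent_error(metric=[0.8, 1.0, 0.6, 0.9],
                    compute_time=[85.0, 100.0, 70.0, 90.0],
                    critical=1))   # ~6.7% under this assumed definition
```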

Thread-Comparative Metrics for TCP: Control Flow Changes & TLB Misses
- Control-flow changes are captured by I-cache misses
- TLB misses are similar across cores, making them a poor indicator of criticality

TBB Random Stealer
- The TBB dynamic scheduler distributes tasks among per-thread software queues
[Figure/animation: a thread with an empty queue randomly picks a victim queue; the first attempt hits an empty queue (a false negative), forcing a back-off and retry before a successful steal, while the critical thread (the ideal steal victim) is ignored.]
- Poor performance at higher core counts and under load imbalance

TBB Stealing with the Occupancy-Based Approach
- Occupancy-based approach [Contreras, Martonosi 2008]
[Figure/animation: each software queue tracks its occupancy; the thief steals from the queue with the highest occupancy, so the steal succeeds on the first try.]
- False negatives are eliminated, but the steal still does not come from the critical thread

Applying TCP to DVFS
- Assume the available frequencies are f0, 0.85f0, 0.70f0, and 0.55f0
[Figure/animation: per-core criticality counters with current-DVFS tags feed a Switching Suggestion Table (SST) of criticality-counter thresholds (multiples of a parameter T), indexed by current and target frequency, and a Switching Confidence Table (SCT) of saturating counters.]
The animation walks through one decision:
1. Check whether a core with criticality counter value T is running at f0.
2. For all cores, find the closest SST match to each criticality counter; the SST suggests a DVFS action for core 1.
3. The SST suggestion goes to the SCT: the suggested SCT counter is incremented and the others are decremented.
4. Scan for the maximum SCT counter and check whether it corresponds to a DVFS action.
5. Initiate DVFS on core 1 and refresh the criticality counters.
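Below is a hedged software sketch of how the SST/SCT pair described above might drive a decision: the SST maps a core's criticality-counter value (relative to the threshold parameter T) to a suggested target frequency via a closest-match lookup, and a small saturating confidence counter per suggestion filters out noise before any switch is made. The fourth SST threshold, the saturation level, and the act-only-on-saturation rule are assumptions reconstructed from the slide, not the paper's exact tables.

```python
# Hedged sketch of an SST/SCT-style DVFS decision (illustrative values only).

T = 1024   # SST threshold parameter (value taken from the parameter slide)

# SST row for cores currently at f0: threshold (as a multiple of T) -> target.
# The 0.40*T entry is an assumption; it is not visible in the transcript.
SST_F0 = {1.00: 1.00, 0.70: 0.85, 0.55: 0.70, 0.40: 0.55}

def sst_suggest(counter):
    """Suggest the target frequency whose SST threshold is closest to the counter."""
    closest = min(SST_F0, key=lambda thr: abs(counter - thr * T))
    return SST_F0[closest]

class SwitchingConfidence:
    """Small saturating counters per (core, target) suggestion, as on the slide."""
    def __init__(self, num_cores, max_count=3):          # 2-bit counters saturate at 3
        self.max_count = max_count
        self.counts = {c: {f: 0 for f in SST_F0.values()} for c in range(num_cores)}

    def update(self, core, suggestion):
        for f in self.counts[core]:
            bump = 1 if f == suggestion else -1
            self.counts[core][f] = min(self.max_count, max(0, self.counts[core][f] + bump))
        # Act only when the most confident entry is the current suggestion
        # and it has saturated (assumed policy).
        best = max(self.counts[core], key=self.counts[core].get)
        if best == suggestion and self.counts[core][best] == self.max_count:
            return best        # initiate DVFS to this frequency
        return None            # keep the current frequency for now

# Example: core 1's counter sits at ~0.83*T, so the SST suggests 0.85*f0.
print(sst_suggest(850))   # -> 0.85
```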

TCP Parameters for DVFS
- Carried out a number of experiments to gauge T and the number of bits per SCT counter
- T = 1024 gives 78.19% accuracy
- 2 bits per SCT counter
- Refer to the paper for details...

Handling DVFS Transition Overheads

TCP vs. Meeting Points
- Meeting points are unsuitable for extremely irregular parallel program behavior
- Example: LU from SPLASH-2

Pseudocode for a thread:
    for all k from 0 to N-1
        if I own A_kk, factorize it
        BARRIER
        for all my blocks A_kj in the pivot row:
            A_kj <- A_kj * A_kk^-1
        BARRIER
        for all my blocks A_ij in the active interior of the matrix:
            A_ij <- A_ij - A_ik * A_kj
    end for
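To illustrate why this pattern defeats iteration-counting approaches, the toy sketch below, assuming a simple 2D block-cyclic ownership as a stand-in for SPLASH-2 LU's actual decomposition, counts how many active-interior blocks each thread owns at each outer iteration k: the interior shrinks and per-thread work diverges, so progress through loop iterations stops tracking criticality.

```python
# Illustrative only: per-thread interior-block counts per outer iteration k,
# assuming a 2D block-cyclic ownership (not SPLASH-2 LU's exact layout).

def interior_blocks_per_thread(num_blocks, threads_per_dim, k):
    """Count active-interior blocks A_ij (i > k, j > k) owned by each thread."""
    counts = {}
    for i in range(k + 1, num_blocks):
        for j in range(k + 1, num_blocks):
            owner = (i % threads_per_dim, j % threads_per_dim)
            counts[owner] = counts.get(owner, 0) + 1
    return counts

# 16x16 blocks on a 2x2 thread grid: total work shrinks with k and, near the
# end, only some threads own any interior blocks at all.
for k in (0, 7, 14):
    print(k, interior_blocks_per_thread(16, 2, k))
```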

TCP vs Meeting Point ctd…

TCP Stability in Out-of-Order and In-Order Architectures
- In-order architectures see highly variable IPCs
- Thread criticality measured over an instruction window may differ from overall criticality across a barrier run
- Experiment: observe how IPCs change over 5000-cycle windows and compare the per-window thread criticalities against the barrier criticality

Comparison to Other Works
- Thread Motion (TM)
  - Could use TCP to trigger TM instead of using DVFS for energy efficiency in barrier-based programs
  - TM has already been shown to successfully use a last-level-cache-miss-driven approach
- Temperature-constrained power control
  - TCP can be used as a performance proxy instead of MIPS to guide the controller's power allocation
  - Could be used to guide DVFS of programs under temperature constraints