Canturk Isci Advisor: Margaret Martonosi

Presentation transcript:

Workload Adaptive Power Management with Live Phase Monitoring and Prediction. Canturk Isci. Advisor: Margaret Martonosi. Princeton University. November 19, 2018

Power Critical Across Computing Spectrum Cooling Packaging Battery Lifetime Peak Performance/Utilization Reliability Cost of Ownership Computing Scale

Powerful Observations. [NY Times, 06/2006] The Dalles, OR. WA – 48MW, WA – 42MW, > 30,000 homes. [Eugene Gorbatov - Intel’06] [Figure: cost savings ($1M to $5M) vs. power savings (5% to 25%); cost of power $0.10/KWh, 1100 racks in datacenter.] US EPA Power Efficiency Specs: $1.8 billion savings over the next 5 years, equivalent to the annual savings of 2.7 million cars. Power triple play: save 1W processor power → ~1W power supply conversion → ~1W cooling power

Power Management. Axes: wide application scope; live adaptation and response. Approaches: Static/Offline, Dynamic/Online. Levels: Circuits, Architecture, System, Data center, Application. Techniques shown: Dual-VT; VDD gating; dynamic pipeline/cache reconfiguration; positional adaptation; compiler driven frequency scaling; process cruise control; energy-aware / coordinated allocation; dedicated servers; instruction/procedure energy accounting; dynamic compilation

My Work. Levels: Circuits, Architecture, System, Data center, Application. Approaches: Static/Offline, Dynamic/Online. Techniques shown: Dual-VT; VDD gating; positional adaptation; dynamic pipeline/cache reconfiguration; compiler driven frequency scaling; dedicated servers; instruction/procedure energy accounting; process cruise control; energy-aware / coordinated allocation; dynamic compilation. My work: CMP power budgeting; phase-driven dynamic adaptation; energy-efficient resource allocation. Leverage application scope with architectural insight to infer dynamic workload behavior → Phases. Detect and predict repetitive behavior for dynamic adaptations → Adaptive power management

Thesis Contributions A power-oriented phase analysis framework for real-system studies Generic runtime phase prediction methodology Phase-prediction-driven workload-adaptive power management Runtime power monitoring and estimation with hardware performance counters Phase detection approach resilient to system-induced variations

This Talk. PART-I: On-the-fly phase monitoring and prediction. We look at simple, target-specific phases for application to dynamic management [Micro’06]. PART-II: Power-oriented phase analysis. We look at detailed phases reflecting power behavior for workload power phase characterization [Micro’03, WWC’03, HPCA’05]. Notes: the IISWC’05 material is not covered here; it is kept in the extra slides and would need 15-20 minutes by itself.

Program Phases Distinct and often-recurring regions of program behavior How can we detect recurrent execution under real-system variability? How can we predict future phase patterns? How can we leverage predicted phase behavior for workload-adaptive power management? Can we do better than simple, reactive methods? Useful for: Characterizing execution regions Use current phase/behavior to predict future behavior Managing dynamic adaptation

Dynamic Power Management with Live, Runtime Phase Prediction. Current common dynamic adaptation approach: assume last/recent observed behavior will persist. [Figure: tracked characteristic vs. time t.] Great for stable execution! Inaccurate response for highly variable behavior! Key questions: How can we accurately predict future application phase behavior on all types of execution? Can predicted phase behavior be leveraged for workload-adaptive power management?

Phases for Dynamic Power Management. Need phases that: represent dynamic voltage and frequency scaling (DVFS) potential; are resilient to system variations; are invariant to dynamic management actions. DVFS potential ∝ memory access rate. Memory accesses per instruction (MPI): resilient to variations with fixed instruction granularity tracking; invariant to DVFS power modes. Different MPI rates → different phases → DVFS settings [Wu et al. Micro’05]:
MPI               Phase #   DVFS Setting
< 0.005           1         (1500 MHz, 1484 mV)
[0.005, 0.010)    2         (1400 MHz, 1452 mV)
:                 :         :
> 0.030           6         (600 MHz, 956 mV)
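The MPI-to-phase-to-DVFS lookup on this slide can be sketched in a few lines. This is only an illustration: the slide gives the rows for phases 1, 2 and 6, so the intermediate bin edges used below (0.015, 0.020) are placeholder assumptions, and the names `classify_phase`, `dvfs_for_phase`, `ASSUMED_UPPER_BOUNDS` and `KNOWN_DVFS` are hypothetical.

```python
from bisect import bisect_right

# Exclusive upper bounds of phases 1..5; phase 6 is everything above the last.
# Only 0.005, 0.010 and 0.030 appear on the slide. The values 0.015 and 0.020
# are ASSUMED placeholders for the rows the slide elides.
ASSUMED_UPPER_BOUNDS = [0.005, 0.010, 0.015, 0.020, 0.030]

# (frequency in MHz, voltage in mV) for the phases the slide actually lists.
KNOWN_DVFS = {
    1: (1500, 1484),
    2: (1400, 1452),
    6: (600, 956),
}


def classify_phase(mpi, upper_bounds=ASSUMED_UPPER_BOUNDS):
    """Map a memory-accesses-per-instruction value to a 1-based phase number."""
    # bisect_right makes each bin half-open: [lower, upper).
    return bisect_right(upper_bounds, mpi) + 1


def dvfs_for_phase(phase):
    """Return the (MHz, mV) pair for a phase, or None if the slide omits it."""
    return KNOWN_DVFS.get(phase)


if __name__ == "__main__":
    for mpi in (0.001, 0.007, 0.045):
        phase = classify_phase(mpi)
        print(mpi, phase, dvfs_for_phase(phase))
```

Because MPI is computed per instruction rather than per cycle, the same lookup can be reused whatever frequency the processor is currently running at, which is the invariance property the slide asks for.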

How Can We Predict Application Phase Behavior on All Types of Execution? Applu execution snapshot: [Figure: MPI rate (0.000 to 0.020) and MPI phases (1 to 5) vs. cycles (2.80E+10 to 3.30E+10).] Significant variations exist! Phase patterns expose available recurrence!

Predicting Phases with the Global Phase History Table (GPHT) Predictor. [Figure: the GPHR holds the last observed phases from performance counters, Pt, Pt-1, …, Pt-N (GPHR depth); each PHT entry holds a tag (a stored phase pattern Pt’, Pt’-1, …, Pt’-N), a prediction (Pt’+1), and an Age / Invalid field.] Predicted phase: from GPHR(0) if no matching pattern; from the corresponding PHT prediction entry if there is a matching pattern in the PHT. Similar to a global history branch predictor. Implemented in OS for on-the-fly phase prediction
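A minimal sketch of a predictor in the spirit of this slide follows. It is not the implementation from the talk: the class name `GPHT`, the dictionary-based table, the logical-clock ageing and the replace-oldest policy are assumptions chosen to keep the example short. What it does take from the slide is the lookup rule: predict the stored successor when the history pattern matches a table tag, otherwise fall back to the last observed phase, GPHR(0).

```python
from collections import deque


class GPHT:
    """Sketch of a global phase history predictor (assumed details, see lead-in)."""

    def __init__(self, depth=8, entries=128):
        self.entries = entries
        # History register, most recent phase at index 0, initially all zeros.
        self.gphr = deque([0] * depth, maxlen=depth)
        # tag (tuple of phases) -> [predicted next phase, last-used time]
        self.pht = {}
        self.clock = 0
        self.pending_tag = None  # pattern whose true successor is not yet known

    def observe(self, phase):
        """Record the phase just measured; return the prediction for the next one."""
        self.clock += 1
        # Train: the previous history pattern was followed by `phase`.
        if self.pending_tag is not None:
            if self.pending_tag not in self.pht and len(self.pht) >= self.entries:
                oldest = min(self.pht, key=lambda tag: self.pht[tag][1])
                del self.pht[oldest]  # evict the least recently used entry
            self.pht[self.pending_tag] = [phase, self.clock]
        # Shift the new phase into the history register.
        self.gphr.appendleft(phase)
        tag = tuple(self.gphr)
        self.pending_tag = tag
        # Predict: table hit -> stored successor, miss -> last value.
        entry = self.pht.get(tag)
        if entry is not None:
            entry[1] = self.clock
            return entry[0]
        return self.gphr[0]


if __name__ == "__main__":
    sequence = [1, 1, 2, 3] * 10
    predictor = GPHT(depth=8, entries=128)
    predictions = [predictor.observe(p) for p in sequence]
    hits = sum(predictions[i] == sequence[i + 1] for i in range(len(sequence) - 1))
    print("correct next-phase predictions:", hits, "of", len(sequence) - 1)
```

On a repeating pattern such as the one in the usage example, a last-value predictor is right only when two consecutive samples happen to share a phase, while the table-based lookup becomes exact once each history pattern has been seen and trained.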

Prediction Accuracies. [Figure: prediction accuracy (%), 40 to 100, per benchmark (gzip_log, mcf_inp, gcc_200, gap_ref, gcc_scilab, gcc_expr, ammp_in, gcc_166, apsi_ref, parser_ref, mgrid_in, applu_in, equake_in, wupwise_ref, gcc_integrate, bzip2_program, bzip2_source, bzip2_graphic) for LastValue and for GPHT with PHT:1024, PHT:128, PHT:64 and PHT:1, all with GPHR:8.] Compare to reactive approaches (Last Value prediction): GPHT performs significantly better for highly varying applications. Up to 6X and on average 2.4X misprediction improvement. Similar results for the misprediction distances. Good performance down to 128 PHT entries

Full-System Implementation. [Figure: the application binary runs on top of the OS on a Pentium-M processor. In the OS, the PMI interrupt handler stops/reads the performance counters, predicts the next phase using the phase history and predictor state, checks/sets the DVFS state through the DVFS registers, and restarts the counters. A data acquisition system measures CPU (V, I) and is connected via the parallel port.]

Phase-Driven Dynamic Adaptation: Complete Example. [Figure, three panels vs. instructions (1.5E+09 to 5.0E+09): MPI with ACTUAL_PHASE and PRED_PHASE (GPHT); Power [W] for Baseline and GPHT; BIPS for Baseline and GPHT.] GPHT can accurately predict varying application behavior! Significant power savings compared to baseline! Small performance degradation!

Improvement over Reactive Methods. [Figure, two panels per benchmark (bzip2_program, bzip2_source, bzip2_graphic, mgrid_in, applu_in, equake_in, swim_in, mcf_inp, average): Energy-Delay Product (EDP) improvement for Last Value and GPHT, axis 0% to 50% with bar labels 63%, 66%, 70% (higher is better); performance degradation for Last Value and GPHT, axis 0% to 20% (lower is better).] 7% EDP improvement over reactive approach! GPHT 2X better for variable applications! Comparable or less performance degradation! Notes: plots show EDP improvement and performance degradation for GPHT and Last Value with respect to baseline execution.
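For readers unfamiliar with the metric, the energy-delay product is energy multiplied by execution time, so a management scheme that saves power but runs longer is penalized twice for the slowdown. The sketch below shows the arithmetic only; the power and time values in the usage example are made-up illustration numbers, not measurements from the talk, and the function names are hypothetical.

```python
def energy_delay_product(avg_power_w, exec_time_s):
    """EDP = energy * delay = (average power * time) * time."""
    energy_j = avg_power_w * exec_time_s
    return energy_j * exec_time_s


def edp_improvement(baseline_edp, managed_edp):
    """Fractional EDP improvement of a managed run over the baseline run."""
    return 1.0 - managed_edp / baseline_edp


if __name__ == "__main__":
    # Hypothetical run: 10 W for 100 s unmanaged, 8 W for 105 s with DVFS.
    base = energy_delay_product(10.0, 100.0)
    managed = energy_delay_product(8.0, 105.0)
    print(f"EDP improvement: {edp_improvement(base, managed):.1%}")
```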

This Talk. PART-I: On-the-fly phase monitoring and prediction. Summary: GPHT accurately predicts phases at runtime (2.4X fewer mispredictions). Phase-driven dynamic management improves power efficiency (EDP improvement: 27% over baseline, 7% over reactive). PART-II: Power-oriented phase analysis: detailed phases reflecting power behavior, for workload power phase characterization

Power Phase Characterization. Insight into power behavior is very useful for: identifying varying workload power demand; allocating power budgets; guiding thermal management. Key questions: Can we leverage hardware performance counters to understand workload power behavior? How well do different application attributes characterize workload power phases? Notes: understanding workload power behavior and how it changes is useful for many end goals. Identifying varying workload power demand: LAPSU / match (virtual) machine power cap allocation/cooling. Allocating power budgets: CMP core knobs / VM consolidate. Responding to thermal implications: swap hot code / cooperate hotspot shifting

Performance Counters Reflect Application Power Behavior. Composition of performance monitoring counters (PMC vectors) as a proxy for power behavior. [Figure labels: high issue & exec. power; high bus power; high L2 cache power; high L1 cache power. Workload classes: CPU bound, L1 bound, L2 bound, memory bound.] [Micro’03]

Identifying Phases with PMC Vectors. Application execution ≡ PMC vector samples. Similar PMC vectors ≡ same phase. How to quantify vector (dis)similarity: “Similarity Distance (SD)”, built from the absolute Manhattan (L1) distance and the normalized Manhattan distance

Phase Classification Results. All pair-wise distances constitute a “Similarity Matrix”. Classify the similarity matrix into phases. A small number of phases (5-10) captures power variation within 10%. [WWC’03]

Evaluating Different Features for Phase Analysis Several studied program characteristics - Specific metrics (IPC, EPI) - Hardware performance vectors - Branch counts - Working sets - Basic block vectors - Procedures Two main approaches: Control Flow Methods Basic Block Vectors (BBVs) [Sherwood et al. ASPLOS’02] Event Monitoring Techniques Performance Monitoring Counters (PMCs) [Isci and Martonosi Micro’03] Key Question: How do these methods perform in terms of accurate representations of power phase behavior?

Experimental Setup. Goal: to acquire control flow, performance metric and power behavior of workload execution at matching & controlled observation points on a real system. [Figure: the application binary runs under Pin; a Pintool instruments basic block heads and samples basic block head addresses; OS: serial device file; hardware: performance counters and external power measurement via current probe.]

A {BBV,PMC,Power} Sample. [Figure: raw data (visited basic block addresses such as 0x8048520 and 0x8048554, PMC counts, and a power history between 35.9W and 37.5W) is reduced to 1 sample = 1 PMC vector, 1 power number (37W) and 1 BBV, which is hashed into a BBV32. Interval labels in the figure: every 1M instructions; every 100M instructions.]

Evaluation Main Steps Cluster BBV samples Cluster PMC vectors Compare each to true measured power Also compare to Oracle: classify directly for power Random: assign samples to target clusters randomly Deviation from power represents our error
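The evaluation steps listed on this slide can be sketched as follows. The exact error definition is an assumption: here each sample is represented by the mean measured power of its cluster, and the error is the average relative deviation from the sample's own measured power. `cluster_power_error` and `random_labels` are hypothetical names, and the power values in the usage example are made up.

```python
import random
from collections import defaultdict


def cluster_power_error(labels, power):
    """Mean relative deviation of each sample's power from its cluster's mean power."""
    groups = defaultdict(list)
    for label, p in zip(labels, power):
        groups[label].append(p)
    means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
    return sum(abs(p - means[label]) / p for label, p in zip(labels, power)) / len(power)


def random_labels(n_samples, n_clusters, seed=0):
    """Random baseline: assign every sample to a cluster uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(n_clusters) for _ in range(n_samples)]


if __name__ == "__main__":
    power = [10.0, 10.0, 20.0, 20.0]           # made-up measured power per sample
    good = [0, 0, 1, 1]                        # clustering that follows power
    bad = [0, 1, 0, 1]                         # clustering that ignores power
    print(cluster_power_error(good, power))    # small error
    print(cluster_power_error(bad, power))     # larger error
    print(cluster_power_error(random_labels(4, 2), power))
```

The same function can score BBV-based labels, PMC-based labels, the random baseline and an oracle that clusters on power directly, so all four land on one comparable scale.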

Comparison of Techniques. [Figure: percent error w.r.t. actual power (0% to 30%) for Random, BBV, PMC and Oracle, shown for AVE(SPECint), AVE(SPECfp), AVE(OTHER) and AVE(Overall). Annotations: BBVs 70% of Random; PMCs 40% of Random; Oracle 30% of BBVs; Oracle 50% of PMCs.] PMCs achieve 40% less error than BBVs. Consistent results regardless of clustering method. BBVs and PMCs both improve on the upper bound (Random), but a significant gap remains to the lower bound (Oracle). PMCs generally lead to less error than BBVs. [HPCA’05] Notes: shown for first pivot and agglomerative complete-linkage clustering only, and once for the whole set of suites.

This Talk. PART-II: Power-oriented phase analysis. Summary: Event counters are useful to track the runtime workload power profile. PMC vectors better characterize power phase behavior with simpler monitoring and control (40% less error than BBVs)

Future Directions. Broader picture: many core/mini core/accelerators; virtualization/scalable enterprise; intentional/unintentional variability. DEMAND BASED EVERYTHING! Accurately projecting workload demand is key for dynamic adaptations! Configurable architectures/cores. Allocating/migrating workloads/VMs to heterogeneous resources. Dynamically tuning to schedules in data centers/real-time systems. Cooperating PMC and control-flow features → ‘action’ dependent phases → multiple actions

Conclusions. Event counters are useful to track/predict/manage power. Runtime phase predictions with the Global Phase History Table can accurately predict future application behavior (2.4X fewer mispredictions). Dynamic power management guided by phase predictions improves system power-performance efficiency (27% EDP improvement over baseline, 7% over reactive). “PMC vector” similarity identifies application phases. PMC features provide a better proxy for workload power behavior than control-flow features (PMC phases achieve 40% less error than control flow)

THANKS!

Thanks! Canturk Isci Princeton University Department of Electrical Engineering Parapet Research Group Advisor: Margaret Martonosi canturk@princeton.edu http://www.princeton.edu/~canturk

Collaborators Princeton Margaret Martonosi Gilberto Contreras Qiang Wu Intel Eugene Gorbatov Sameer Abhinkar Rick Forand IBM Pradip Bose Alper Buyuktosunoglu Chen-Yong Cher Prabhakar Kudva Zhigang Hu Georgia Tech Ripal Nathuji

Research Overview. Components: Application; Runtime Monitoring (Hardware Performance Counters, Dynamic Program Flow); Power Estimation; Phase Analysis; Dynamic Management; Real Measurements. Steps: Monitor application execution via specific features. Classify features into phases. Detect/predict phase behavior. Apply dynamic power management guided by phase predictions. Validate with real measurements

Publications. Runtime Monitoring: Hardware Performance Counters, Dynamic Program Flow. Power Estimation: [MICRO’03] → Runtime Power; [WHPM in HPCA’05] → Counters & Power. Phase Analysis: [WWC’03] → Power Phases; [IEEE MICRO’05] → Durations; [IISWC’05] → Detection; [HPCA’06] → PMC vs. BBV. Dynamic Management: [MICRO’06] → GPHT; [MICRO’06] → CMP Budget; [ICAC’07] → Hetero Datacenter

Thesis Outline Power and Performance Measurement and Estimation on Real Systems: Methods and Basics [Micro-36’03][PMC Workshop in HPCA-11’05] Power Oriented Phase Analysis [WWC’03] [HPCA-12’06] Detecting Repetitive Phase Patterns with Real-System Variability [IISWC’05] Predicting Stable Phase Durations [IEEE MICRO’05] Runtime Phase Tracking and Phase Driven Dynamic Management [Micro-39’06]

Other Work. Event Counter Based Runtime Power Estimation: estimate component power breakdowns based on access rates [Micro’03]. [Figure: measured/estimated CPU power vs. time for Gcc, Gzip, Vpr, Vortex, Gap, Crafty.] Detecting Recurrent Phase Behavior under Real-System Variability: phase transformations due to variability effects; transition-guided phase detection framework [IISWC’05]

…Other Work. Long-Term Value and Duration Prediction: predict duration and rate of change for stable phases [IEEE MICRO’05]. Global Power Management for Chip Multiprocessors: optimize throughput for fixed global power budget [Micro’06]. Power Management in Heterogeneous Data Centers: allocate workloads to heterogeneous platforms [ICAC’07]

Future Directions Many immediate research paths: Phase predictions and dynamic adaptations for thermally limited systems Extending to detailed ‘action-dependent’ phases via ‘across-mode’ phase predictions Dynamically tuning to service-level agreements in data centers/real-time systems Leveraging control-flow information in coordination with event-counters

…Future Directions Broader picture: Many core/Mini core/Accelerators Virtualization/Scalable enterprise Intentional/Unintentional variability DEMAND BASED EVERYTHING! Accurately projecting workload demand is key for dynamic adaptations! Multiconfigurable architectures Locally adapting CMP cores to workload demand Allocating/migrating workloads/VMs to heterogeneous resources (and vice-versa)

What Would I do? Real-System: Alternative mgmt → cache config. Multidimension phases → across-mode predictions (golden patterns/BF model) → better characterizing power?? Woodcrest/Sossaman: DP-CMP → workload shifting for aligning phases → opportunity study first. AMD-Barcelona(’07) → core-level mgmt benefits. OPEN: How can I use control flow granularities w/o PIN? Simulation: Multiple features → big classifier → state dependent phase tables → Next-Phase State Machine → multiple actions cooperatively. Phases to guide power gating? Phased/associativity-aware cache: Dual Vt or shutdown banks. Phase-Driven Runahead: when to runahead & when not to, in conjunction with MLP

Future. Many core/mini core/NoC/accelerators/helper engines. Security - LaGrande. Parallelism – TM. Virtualization/Platformization/Scalable Enterprise. Consolidation/isolation/migration. Process Variation. Ultra low power/embedded/disposable/cheap computing. Intentional/Unintentional heterogeneity. DEMAND BASED EVERYTHING!

What ifs… We consider CMP+MP… SMT… Multithreaded? This is interesting (multiple virtual counters, pid dependent GPHT?). Non-thrashing vs. thrashing trade-off. Thermal limited. Fused caches?

Phase Visions

Phase-Driven Management Vision PC X A C S1 S3 S8 N V M S6 O Action to Controller Events PMCs Classifier History & State Table Phase State Machine I$ D$ Commit I$ Misses D$ Misses Instr-ns Completed DVS Cache Reconfig Phase State Next Phase

1.1) Why Care About Phases? Characterizing execution regions: summarize execution into representative execution regions

1.1) Why Care About Phases? Characterizing execution regions Managing dynamic adaptation OFF ON Dynamic/adaptive mgmt

1.1) Why Care About Phases? Characterizing execution regions. Managing dynamic adaptation. Use current phase/behavior to predict future behavior. [Figure: load refs and store misses (0.1 to 1) vs. time [s].]

1.2) Why Care About Power Phases? Useful for: guiding power budget / temperature limit management, e.g. Montecito/Foxton. [Figure: power [W] and temperature [°C] vs. time [s], uncontrolled T vs. enforced T; “Slow down!”]

1.2) Why Care About Power Phases? Useful for: guiding power budget / temperature limit management; power/temperature aware scheduling [Bellosa et al. COLP’03]. [Figure: power [W] vs. time [s].] This helps in 2 ways: reduce cooling cost/heat removal rate for a server; extend battery life for a mobile, as less cooling power/time is needed

1.2) Why Care About Power Phases? Useful for: Guiding power budget / temperature limit management Power/Temperature aware scheduling Power balancing for multiprocessor systems/activity migration Power Power Task1 Task2 Swap hot task Migrate hot task Or Slow down hot core Core/μP 1 Core/μP 2 Speed up! Slow down!

Counter Based Power Estimation

This Talk. Counter-based runtime power estimation. [Diagram: Application; Runtime Monitoring (Hardware Performance Counters, Dynamic Program Flow); Power Estimation; Phase Analysis; Dynamic Management.]

Counter-Based Power Estimation. Power of component i: Power[i] = MaxPower[i] · ArchScaling[i] · AccessRate[i] + NonGatedPower[i]. Sources (in order): die area & stressmarks; microarch. properties; performance counters; empirical power measurements. Fast (real-time) power estimation. Offers estimated view of on-chip detail. 22 subcomponents in the Pentium4 die → 22 dimensional “Power Vector (PV)”. EX: Trace cache delivers 3 uops/cycle in deliver mode and 1 uop/cycle in build mode: Power(TC) = [Access-Rate(TC)/3 + Access-Rate(ID)] x MaxPower(TC) + Non-gated TC CLK power
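The per-component model on this slide is a single multiply-add per component, which is what makes it fast enough for runtime use. The sketch below evaluates that formula and the trace-cache example exactly as written on the slide; all numeric inputs in the usage example are made-up placeholders, not Pentium 4 values, and the function names are hypothetical.

```python
def component_power(max_power, arch_scaling, access_rate, non_gated_power):
    """Power[i] = MaxPower[i] * ArchScaling[i] * AccessRate[i] + NonGatedPower[i]."""
    return max_power * arch_scaling * access_rate + non_gated_power


def trace_cache_power(tc_access_rate, id_access_rate, max_power_tc, non_gated_clk):
    """Slide example: [AccessRate(TC)/3 + AccessRate(ID)] * MaxPower(TC) + non-gated clock power."""
    return (tc_access_rate / 3 + id_access_rate) * max_power_tc + non_gated_clk


def power_vector(components):
    """Evaluate a list of (max_power, arch_scaling, access_rate, non_gated) tuples."""
    return [component_power(*c) for c in components]


if __name__ == "__main__":
    # Placeholder inputs for three imaginary components (watts, unitless rates).
    parts = [(10.0, 0.5, 0.4, 1.0), (4.0, 1.0, 0.25, 0.5), (6.0, 1.0, 0.0, 0.2)]
    vec = power_vector(parts)
    print(vec, "total:", sum(vec))
    print(trace_cache_power(0.6, 0.2, 5.0, 1.0))
```

In the talk's setting the vector would have 22 entries, one per subcomponent, and the total estimate is the sum of the entries.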

Experimental Framework 1mV/Adc conversion Voltage readings via RS232 to logging machine POWER SERVER POWER CLIENT Counter based access rates over ethernet Convert voltage to measured power Convert access rates to modeled powers

Power Estimation Results. [Figure: measured/estimated CPU power vs. time for Gcc, Gzip, Vpr, Vortex, Gap, Crafty.] Average estimation error: 3W (~6%)

Identifying Phases with Power Vectors

This Talk. On-the-fly phase monitoring and prediction: we consider simple, target-specific phases for application to dynamic management. Offline phase characterization: we consider detailed phases reflecting power behavior for workload power phase characterization

This Talk. Counter-based runtime power estimation. Identifying workload phase behavior with event counter information. [Diagram: Application; Runtime Monitoring (Hardware Performance Counters, Dynamic Program Flow); Power Estimation; Phase Analysis; Dynamic Management.]

Power Vectors: Similarity Metrics. Absolute Distance (AD): Manhattan (L1) distance between vectors r, c. Normalized Distance (ND): Manhattan distance between normalized r, c. Similarity Distance (SD): combination of AD & ND
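The three metrics named on this slide can be sketched as below. The formulas themselves did not survive in the transcript, so two details are assumptions: vectors are normalized by their L1 sum before computing ND, and SD is taken as a weighted sum of AD and ND with hypothetical weights `w_ad` and `w_nd`. The original combination rule may differ.

```python
def manhattan(r, c):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(r, c))


def l1_normalize(v):
    """Scale a vector so its absolute values sum to 1 (assumed normalization)."""
    total = sum(abs(x) for x in v)
    return [x / total for x in v] if total else list(v)


def absolute_distance(r, c):
    """AD: sensitive to the overall magnitude of the vectors."""
    return manhattan(r, c)


def normalized_distance(r, c):
    """ND: sensitive only to how the total is distributed across components."""
    return manhattan(l1_normalize(r), l1_normalize(c))


def similarity_distance(r, c, w_ad=1.0, w_nd=1.0):
    """SD as a weighted sum of AD and ND (ASSUMED combination rule)."""
    return w_ad * absolute_distance(r, c) + w_nd * normalized_distance(r, c)


def similarity_matrix(vectors):
    """All pair-wise SD values; entry (r, c) compares sample r with sample c."""
    return [[similarity_distance(r, c) for c in vectors] for r in vectors]


if __name__ == "__main__":
    samples = [[1.0, 2.0], [2.0, 4.0], [4.0, 2.0]]
    for row in similarity_matrix(samples):
        print([round(x, 3) for x in row])
```

Using both distances separates two cases that either one alone would confuse: samples with the same component mix at different total power (ND near zero, AD large) and samples with similar totals but a different mix.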

Phase Classification. All pair-wise distances constitute a “Similarity Matrix”: SD(r,c) → matrix entry (r,c). Classify execution into phases: First Pivot Clustering. Target: O(10) phases. Cumulative errors: Max: 4.7W & RMS: 3.1W (~6%). Total power error < Σ(comp. errors). [WWC’03]

Similarity Matrix and Phase Classification. All pair-wise distances constitute a “Similarity Matrix”: SD(r,c) → matrix entry (r,c). Classify execution into phases: First Pivot Clustering. Cumulative errors: Max: 4.7W & RMS: 3.1W. Total power error < Σ(comp. errors). [WWC’03]
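The slide names first pivot clustering without spelling it out. One plausible reading, sketched here as an assumption rather than the talk's exact procedure, is a single greedy pass: the first unassigned sample becomes the pivot of a new phase, and every unassigned sample within a distance threshold of that pivot joins it. The `threshold` parameter is hypothetical; in practice it would be tuned until the number of phases is on the order of the slide's target.

```python
def first_pivot_clustering(dist, threshold):
    """Greedy clustering over a pair-wise distance matrix (assumed procedure)."""
    n = len(dist)
    labels = [None] * n
    next_label = 0
    for pivot in range(n):
        if labels[pivot] is not None:
            continue  # already absorbed by an earlier pivot
        for j in range(pivot, n):
            if labels[j] is None and dist[pivot][j] <= threshold:
                labels[j] = next_label
        next_label += 1
    return labels


if __name__ == "__main__":
    # One-dimensional toy samples; the matrix holds absolute differences.
    points = [0.0, 1.0, 10.0, 11.0, 0.5]
    matrix = [[abs(a - b) for b in points] for a in points]
    print(first_pivot_clustering(matrix, threshold=2.0))
```

The pass visits each pivot once, so the cost is bounded by the size of the similarity matrix, and the result depends on sample order because earlier samples claim pivots first.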

Evaluating Different Features for Phase Analysis

This Talk. Counter-based runtime power estimation. Identifying workload phase behavior with event counter information. Evaluating event-counter and control-flow techniques for power phase characterization. [Diagram: Application; Runtime Monitoring (Hardware Performance Counters, Dynamic Program Flow); Power Estimation; Phase Analysis; Dynamic Management.]

Backup: Different Target Number of Clusters. PMCs perform relatively better for the practical range of target clusters. Relative BBV error is significantly larger than PMC error for a small number of phases [1-10]. Why worse for a small number of phases: too much granularity in control flow. [IISWC]

Summary. Identifying phases with event counter vectors. Comparison of event counter and control flow approaches. Event counters provide a good proxy for the runtime power profile of applications. Simple similarity analysis on the composition of PMC events helps identify workload power phases. PMC-based features generally provide a better characterization of power behavior than control-flow features. Notes: by now we have a good idea of how to track phases, and confidence in our features. Next: detecting recurrence under variability, and phase-guided dynamic management

Detecting Recurrent Phase Behavior under Real-System Variability

This Talk. Counter-based runtime power estimation. Identifying workload phase behavior with event counter information. Evaluating event-counter and control-flow techniques for power phase characterization. Detecting Recurrent Phase Behavior under Real-System Variability. [Diagram: Application; Runtime Monitoring (Hardware Performance Counters, Dynamic Program Flow); Power Estimation; Phase Analysis; Dynamic Management.]

Detecting Recurrent Phase Behavior under Real-System Variability Repetitive phases inevitably exhibit different behavior Values & durations vary Phase distributions vary Key Questions: How do phases manifest themselves with real-system effects? How can we extract recurrent behavior in spite of these variations?

Real-System Variability Effects on Phases. [Figure: metric vs. time t, with phase labels under each effect. Ideal: A B C. Glitch: A B C D. Gradient: A B C D E. Shift: A B C D E. Mutation: A B C D E F. Time dilation: A B C D E F.]

Comparing Phase Signatures Metric Ideal t A B C Metric Final t A B C D E F A direct apples to apples comparison of phase signatures is not very relevant in real world!

Value-Based Phases. Value Based Phases (VBP). Let’s revisit our concept example. [Figure: value-based phase sequences over time t for the ideal signature (A B C) and the final signature (A B C D E F).] Value based phase representations do not show good correlation

Our Proposed Solution with Transitions. Transition Based Phases (TBP). [Figure: transition vectors (1 at each phase transition, 00…0 elsewhere) over time t for the ideal signature (A B C) and the final signature (A B C D E F).] Tracking phase transitions rather than phase sequences is more useful in detecting recurrent behavior. Notes: shifts are intentionally ignored here to make the point; in the end all analyses are shift invariant. In the remainder of the talk, we focus on transitions and prune the remaining effects

Our Transition-Guided Detection Framework. For each of Phase #1 and Phase #2: sample PMCs to form 12D vectors → vector stream → identify transitions → TBPinit → apply glitch/gradient filtering → TBPgg → apply near-neighbor blurring → TBPggN. Then apply cross correlation: Match ⇒ peak at best alignment. Mismatch ⇒ no observable peak

Sampling Effects: Glitches & Gradients. Glitch: instability where before & after are the same → spurious transitions. Gradient: instability where before & after are different → a single true transition. Glitch/Gradient Filtering: very simple: no consecutive transitions. [Figure: initial transitions, with glitches and gradients marked, and refined transitions.]
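One way to read the "no consecutive transitions" rule is sketched below; the details are assumptions rather than the talk's implementation. A run of back-to-back transitions is treated as one unstable region: if the phase before and after the region is the same, it is a glitch and all of its transitions are dropped; if the phase differs, it is a gradient and a single transition is kept at the start of the region. The function names are hypothetical.

```python
def transitions(phases):
    """1 where the phase differs from the previous sample, else 0."""
    if not phases:
        return []
    return [0] + [int(a != b) for a, b in zip(phases, phases[1:])]


def filter_glitch_gradient(phases):
    """Collapse runs of consecutive transitions (assumed reading of the rule)."""
    t = transitions(phases)
    n = len(t)
    out = [0] * n
    i = 1
    while i < n:
        if t[i] == 0:
            i += 1
            continue
        j = i
        while j + 1 < n and t[j + 1] == 1:
            j += 1  # extend over the whole run of consecutive transitions
        before, after = phases[i - 1], phases[j]
        if before != after:
            out[i] = 1  # isolated transition or gradient: keep exactly one
        # before == after means a glitch: keep nothing
        i = j + 1
    return out


if __name__ == "__main__":
    print(filter_glitch_gradient([1, 1, 5, 1, 1]))      # glitch removed
    print(filter_glitch_gradient([1, 1, 2, 3, 3, 3]))   # gradient -> one transition
    print(filter_glitch_gradient([1, 1, 2, 2]))         # clean transition kept
```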

Time Dilations. Observation: dilations exist as small jitters (a few samples). Proposed solution: “near-neighbor blurring”. Blur edges slightly ⇒ consider transitions as distributions around their actual locations. Tolerance: spread of this distribution, [t-x, t+x] samples. Example: matching improvement with tolerance=2. [Figure: transition signatures of run1 and run2 vs. time t before blurring: mismatch.]

Time Dilations (continued) [Figure: the same run1 and run2 transition signatures vs. time t after near-neighbor blurring with tolerance=2, with the values .7 and .3 shown beside a transition: match.]
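
A minimal sketch of near-neighbor blurring followed by cross correlation, under stated assumptions: the triangular weighting (which happens to give roughly .7 and .3 for tolerance=2) is a guess at the distribution shape, and the function names and the two example runs are hypothetical.

```python
def blur(transitions, tolerance):
    """Spread each transition over [t - tolerance, t + tolerance].

    Assumed weighting: triangular, 1.0 at the transition and falling
    linearly with distance. Overlapping spreads keep the larger weight.
    """
    n = len(transitions)
    out = [0.0] * n
    for t, bit in enumerate(transitions):
        if not bit:
            continue
        for d in range(-tolerance, tolerance + 1):
            j = t + d
            if 0 <= j < n:
                weight = 1.0 - abs(d) / (tolerance + 1)
                out[j] = max(out[j], weight)
    return out


def cross_correlation(a, b, max_lag):
    """Return {lag: sum_i a[i] * b[i + lag]} for lag in [-max_lag, max_lag]."""
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                total += x * b[j]
        scores[lag] = total
    return scores


# Two hypothetical runs of the same behavior; run2 is shifted by one
# sample and its middle transition is dilated by one more sample.
run1 = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
run2 = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

raw = cross_correlation(run1, run2, max_lag=3)
blurred = cross_correlation(run1, blur(run2, tolerance=2), max_lag=3)
best_lag = max(blurred, key=blurred.get)
```

Without blurring, the dilated transition contributes nothing at the best alignment; with blurring it contributes a partial match, so the correlation peak becomes more pronounced.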

Receiver Operating Characteristics [Figure: ROC curve, P{hit} vs. P{false alarm}, swept by detection threshold.] Detection threshold of 0: P{hit} = 1, P{false alarm} = 1. Very high detection threshold: P{hit} = 0, P{false alarm} = 0. Desired operating point: P{hit} ~ 1, P{false alarm} ~ 0. The best detection scheme (tolerance=1) achieves 100% hit detection with <5% false alarms.

Improvement with Transition-Based Phases. In all cases, transitions perform better. In almost all cases, near-neighbor blurring improves detection.

Summary. Detecting phase behavior on real systems has interesting challenges resulting from system-induced variability. Phase transition information improves detection capabilities: TBP show 6X better detection capabilities than VBP. Supporting methods, such as glitch/gradient filtering and near-neighbor blurring, improve detectability of transition signatures. Near-neighbor blurring with tolerance=1 achieves 100% recurrence detection with <5% false alarms.

Dynamic Power Management with Live, Runtime Phase Prediction

This Talk. Left: on-the-fly phase monitoring and prediction, with simple, target-specific phases, for application to dynamic management (Micro’06). Right: offline phase characterization, with detailed phases reflecting power behavior, for workload power phase characterization (roughly Micro’03 + WWC’03, and HPCA’05).

This Talk [Outline diagram: Application; Runtime Monitoring (Hardware Performance Counters, Dynamic Program Flow); Power Estimation; Phase Analysis; Dynamic Management.] Topics: counter-based runtime power estimation; identifying workload phase behavior with event counter information; evaluating event-counter and control-flow techniques for power phase characterization; detecting recurrent phase behavior under real-system variability; workload-adaptive power management with live, runtime phase predictions.

Design Constraints and Decisions. Target management technique: dynamic voltage and frequency scaling (DVFS). Experimental platform: Pentium-M (Banias) ⇒ 2 PMCs. Instruction-based monitoring to eliminate timing variations: first PMC ⇒ instructions retired. DVFS potential: ∝ memory boundedness of the application; ∝ (available concurrent execution)^-1. Second PMC ⇒ memory accesses per instruction (MPI). DVFS invariance: tracked features should not change with dynamic adaptations. Here we shift gears from general-purpose phase analysis to a specific target.

Guiding Dynamic Power Management. Target management technique: dynamic voltage and frequency scaling (DVFS). [Table: DVFS potential for low/high memory access rate and low/high CPU-memory overlap.] [Figure: CPU and MEM activity timelines vs. t at frequency f and at ½ f.] Here we shift gears from general-purpose phase analysis to a specific target. Track (main) memory accesses per instruction (MPI). Different MPI rates ⇒ different DVFS settings.

Phase Definitions. Assign different MPI ranges to different phases. Higher phase number ⇒ more memory-bound phase.

MPI              Phase #   DVFS Setting
< 0.005          1         (1500 MHz, 1484 mV)
[0.005, 0.010)   2         (1400 MHz, 1452 mV)
[0.010, 0.015)   3         (1200 MHz, 1356 mV)
[0.015, 0.020)   4         (1000 MHz, 1228 mV)
[0.020, 0.030)   5         ( 800 MHz, 1116 mV)
> 0.030          6         ( 600 MHz,  956 mV)

[Based on Wu et al., Micro’05.] Important phase properties: resilient to system variations; invariant to dynamic power management actions.
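
The table above amounts to a threshold lookup, sketched here. The constant and function names are hypothetical, and the sketch assumes an MPI of exactly 0.030 falls into phase 6, since the table leaves that boundary value unassigned.

```python
from bisect import bisect_right

# Upper bounds of phases 1..5, taken from the table above.
MPI_UPPER_BOUNDS = [0.005, 0.010, 0.015, 0.020, 0.030]

# Phase number -> (frequency in MHz, voltage in mV), from the table above.
PHASE_TO_DVFS = {
    1: (1500, 1484),
    2: (1400, 1452),
    3: (1200, 1356),
    4: (1000, 1228),
    5: (800, 1116),
    6: (600, 956),
}


def mpi_to_phase(mpi):
    """Map memory accesses per instruction to a phase number in 1..6.

    Intervals are closed on the left, so 0.005 belongs to phase 2.
    Assumption: exactly 0.030 is treated as phase 6.
    """
    return bisect_right(MPI_UPPER_BOUNDS, mpi) + 1


def dvfs_setting(mem_accesses, instructions):
    """Translate two counter readings into a (MHz, mV) setting."""
    return PHASE_TO_DVFS[mpi_to_phase(mem_accesses / instructions)]
```

For example, 1.2 million memory accesses over 100 million instructions gives an MPI of 0.012, which maps to phase 3 and the (1200 MHz, 1356 mV) setting.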

Phase Definitions. Need phases that: represent dynamic voltage and frequency scaling (DVFS) potential; are resilient to system variations; are invariant to dynamic management actions. DVFS potential: f(memory access rate, overlapping CPU execution): memory accesses per instruction (MPI); executed IPC; (ROB entries)/(RS entries). Resilient to variations with fixed instruction granularity tracking. MPI is invariant to DVFS power modes. Different MPI rates ⇒ different phases ⇒ DVFS settings.

Variability and Power Savings Quadrants

GPHT Prediction Accuracies [Figure: prediction accuracy (%), 40 to 100, per SPEC benchmark run, for LastValue, FixWindow_8, VarWindow_128_0.005, and GPHT with PHT:1024, GPHR:8. Benchmark runs on the x-axis: gzip_log, mcf_inp, gcc_200, gap_ref, gcc_scilab, gcc_expr, ammp_in, gcc_166, apsi_ref, mgrid_in, applu_in, parser_ref, equake_in, wupwise_ref, gcc_integrate, bzip2_program, bzip2_source, bzip2_graphic.] Compared to reactive approaches (last value, fixed window history, variable window history), GPHT performs significantly better for highly varying applications: up to 6X and on average 2.4X misprediction improvement.
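
A minimal software sketch of a history-table phase predictor in the spirit of the GPHT configuration named above (PHT:1024, GPHR:8). The slides do not give the table organization, so the LRU replacement, the fallback to the last value on a table miss, and all names here are assumptions rather than the authors' design.

```python
from collections import OrderedDict, deque


class GPHTPredictor:
    """History-indexed next-phase predictor (illustrative sketch only)."""

    def __init__(self, history_depth=8, table_entries=1024):
        self.history = deque(maxlen=history_depth)  # recent phases (GPHR)
        self.table = OrderedDict()                  # history -> next phase (PHT)
        self.table_entries = table_entries

    def predict(self):
        if not self.history:
            return None                             # nothing observed yet
        key = tuple(self.history)
        if key in self.table:
            self.table.move_to_end(key)             # mark as recently used
            return self.table[key]
        return self.history[-1]                     # assumed fallback: last value

    def update(self, observed_phase):
        if len(self.history) == self.history.maxlen:
            key = tuple(self.history)
            self.table[key] = observed_phase        # learn what followed
            self.table.move_to_end(key)
            if len(self.table) > self.table_entries:
                self.table.popitem(last=False)      # evict least recently used
        self.history.append(observed_phase)


def accuracy(predictor, phases):
    """Fraction of samples whose phase was predicted correctly."""
    hits = total = 0
    for phase in phases:
        guess = predictor.predict()
        if guess is not None:
            total += 1
            hits += int(guess == phase)
        predictor.update(phase)
    return hits / total


# Hypothetical, strictly periodic phase sequence.
phases = [1, 1, 3, 3, 5] * 40
gpht_accuracy = accuracy(GPHTPredictor(), phases)
last_value_accuracy = sum(a == b for a, b in zip(phases, phases[1:])) / (len(phases) - 1)
```

On a repeating pattern like this one, the table learns each history once and then predicts the transitions that a last-value predictor always misses.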

Impact of PHT Size [Figure: prediction accuracy (%), 40 to 100, per SPEC benchmark run, for LastValue and for GPHT with GPHR:8 and PHT sizes 1024, 128, 64, and 1. Benchmark runs on the x-axis: gzip_log, mcf_inp, gcc_200, gap_ref, gcc_scilab, gcc_expr, ammp_in, gcc_166, apsi_ref, parser_ref, mgrid_in, applu_in, equake_in, wupwise_ref, gcc_integrate, bzip2_program, bzip2_source, bzip2_graphic.] A 128-entry PHT is plenty. Accuracy converges to last value as PHT entries ⇒ 1.

Impact of Phase Granularities. Average accuracy over the experimented applications: N = 1 ⇒ both 100%; N ~ O(10,000) ⇒ both ⇒ 0%.

Mispredicted Distance vs. Prediction Accuracy. Average distance between actual and predicted phase numbers over the whole execution. Note: phases are not uniformly spaced. [Figure: prediction error distance, 0.1 to 1, per SPEC benchmark run, for LastValue and GPHT_8_1024. Benchmark runs on the x-axis: gzip_log, mcf_inp, gcc_200, gcc_scilab, wupwise_ref, gap_ref, gcc_integrate, gcc_expr, ammp_in, gcc_166, parser_ref, apsi_ref, bzip2_program, mgrid_in, bzip2_source, bzip2_graphic, applu_in, equake_in.]

DVFS Invariance. An important constraint when talking about “actions”. If actions change phase classifications: obsolete past history and unreliable predictions.

Control Flow. Dynamic adaptation control, entered from application execution every 100 million instructions: stop/read performance counters ⇒ translate to phases ⇒ update phase predictor states ⇒ predict next phase ⇒ translate to DVFS setting ⇒ same as current setting? No: apply new DVFS setting. Yes: skip. Then clear interrupt ⇒ restart counters ⇒ exit to program execution.
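
The control flow on this slide can be sketched as a single handler step. This is a simulation under stated assumptions, not the kernel implementation: the counter readings are passed in as plain numbers, `apply_setting` is a hypothetical callback standing in for the DVFS register write, and a last-value predictor stands in for the real phase predictor. The MPI ranges and settings are those of the Phase Definitions slide.

```python
from bisect import bisect_right

MPI_UPPER_BOUNDS = [0.005, 0.010, 0.015, 0.020, 0.030]
PHASE_TO_DVFS = {
    1: (1500, 1484), 2: (1400, 1452), 3: (1200, 1356),
    4: (1000, 1228), 5: (800, 1116), 6: (600, 956),
}  # phase -> (MHz, mV)


class LastValuePredictor:
    """Stand-in predictor: the next phase is assumed to equal the last one."""

    def __init__(self):
        self.last = 1

    def update(self, phase):
        self.last = phase

    def predict(self):
        return self.last


def handle_sample(mem_accesses, instructions, predictor,
                  current_setting, apply_setting):
    """One pass through the adaptation flow; returns the setting in effect."""
    # Counters are assumed stopped and read by the caller.
    phase = bisect_right(MPI_UPPER_BOUNDS, mem_accesses / instructions) + 1
    predictor.update(phase)                  # update predictor state
    wanted = PHASE_TO_DVFS[predictor.predict()]
    if wanted != current_setting:            # same as current setting?
        apply_setting(wanted)                # no: apply the new setting
    # Clearing the interrupt and restarting counters would follow here.
    return wanted


applied = []
predictor = LastValuePredictor()
setting = (1500, 1484)
for _ in range(2):
    # 2.5M memory accesses per 100M instructions: MPI = 0.025, phase 5.
    setting = handle_sample(2_500_000, 100_000_000, predictor,
                            setting, applied.append)
```

The second identical sample finds the setting already in place, so the callback is invoked only once.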

Full-System Implementation [Block diagram. Software: application binary; OS with PMC and phase log, predictor state, and the performance monitoring interrupt (PMI) handler, which stops/reads counters, predicts the next phase, checks/sets DVFS state, and restarts counters. Hardware: Pentium-M processor with performance counters and DVFS registers; power supply and voltage regulator; resistors R1, R2 = 2 mΩ with currents I1, I2 and voltages V1, V2, VCPU; data acquisition system; parallel port.]
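
The diagram labels suggest how processor power could be derived from the measured voltages. The sketch below is an assumption based only on those labels and Ohm's law: it treats R1 and R2 as series resistors between the regulator outputs (V1, V2) and the processor supply (VCPU), so that each current is the voltage drop divided by the resistance. The function name and the example voltages are hypothetical.

```python
def cpu_power_watts(v1, v2, v_cpu, r1=0.002, r2=0.002):
    """Estimate CPU power from three voltage readings, in volts.

    Assumed topology: I1 = (V1 - VCPU) / R1, I2 = (V2 - VCPU) / R2,
    and P = VCPU * (I1 + I2). Resistances default to 2 milliohms.
    """
    i1 = (v1 - v_cpu) / r1
    i2 = (v2 - v_cpu) / r2
    return v_cpu * (i1 + i2)


# Hypothetical readings: a 10 mV drop across each resistor at 1.484 V.
power = cpu_power_watts(1.494, 1.494, 1.484)
```

With these readings each resistor carries about 5 A, giving roughly 14.84 W.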

Power Savings. Averages: GPHT 21%, Last Value 14% ⇒ 50% improvement.

Bounding Performance Degradation. Phase mappings are dynamically configurable. Performance degradation can be limited by sacrificing power efficiency.

GPHT Overhead Insignificant ~0.02%

Summary. Phase characterizations help identify repetitive application behavior under real-system variability and dynamic management actions. Runtime phase predictions with the Global Phase History Table can accurately predict future application behavior: up to 6X and on average 2.4X fewer mispredictions than reactive approaches. Dynamic power management guided by these phase predictions helps improve system power-performance efficiency: 27% EDP improvement over baseline and 7% over reactive approaches.

Repository

Full-System Implementation [Block diagram: OS kernel with predictor state, PMC and phase log, and PMI interrupt handler; Pentium-M processor with performance counters and DVFS mode set registers; power supply and voltage regulator; resistors R1 = 2 mΩ and R2 = 2 mΩ with I2 and voltages V1, V2, VCPU.]

Guiding Dynamic Power Management. Target management technique: dynamic voltage and frequency scaling (DVFS). DVFS potential: ∝ memory boundedness of the application; ∝ (available concurrent execution)^-1. [Figure: CPU and MEM activity timelines vs. t at frequency f and at ½ f.] Here we shift gears from general-purpose phase analysis to a specific target. Track memory accesses per instruction (MPI). Different MPI rates ⇒ different DVFS settings.

CMP Management: AMD Barcelona. AMD's Barcelona core, due out in Q2 '07, will support independent clocks per core, with all cores sharing the same voltage.

PHASES & Phases & phases