Canturk Isci Gilberto Contreras Margaret Martonosi

Presentation transcript:

Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management
Canturk Isci, Gilberto Contreras, Margaret Martonosi
Princeton University
MICRO-39, Dec 9-13, 2006, Orlando, FL
October 17, 2019

Workload Variability Enables Dynamic Power Management
Power is critical across the computing spectrum.
Dynamic Power Management (DPM): tune the system to varying application demand.
Enablers: inter-workload variability, and often repetitive intra-workload variability (phase behavior).
[Figure: power (W), IPC, and memory references per benchmark (swim, equake, mgrid, lucas, applu, fma3d, wupwise, mcf, apsi, art, facerec, gap, gcc, bzip2, vortex, vpr, galgel, parser, eon, ammp, perlbmk, mesa, twolf, gzip, sixtrack, crafty), illustrating inter-workload and intra-workload variability.]
Our work: how can we project varying (repetitive) application behavior to better guide DPM techniques?

Research Overview
Live, runtime phase monitoring and prediction on real systems, with application to dynamic power management.
[Diagram blocks: Application, Performance Counters, Runtime Monitoring, Phase Classification, Phase Prediction, Dynamic Management, Real Measurements.]

Research Overview
1) Runtime monitoring: track memory accesses per instruction (MPI) via performance counters.
2) Phase classification: classify execution into phase patterns based on MPI rates.
3) Phase prediction: predict future behavior with the Global Phase History Table (GPHT) predictor.
4) Dynamic management: use phase predictions to guide dynamic voltage and frequency scaling (DVFS).
All evaluated with real measurements.

Dynamic Power Management with Live, Runtime Phase Prediction
Current (reactive) dynamic adaptation approach: assume the last or recently observed behavior will persist.
Great for stable execution, but an inaccurate response for highly variable behavior.
[Figure: tracked characteristic over time t.]
Key questions:
How can we accurately predict future application phase behavior on all types of execution?
Can we use phase predictions to improve workload-adaptive power management?

Research Questions
How can we accurately predict future application phase behavior on all types of execution?
  Specific phase definitions
  Phase prediction with the GPHT
  Prediction results
Can we use phase predictions to improve workload-adaptive power management?

Guiding Dynamic Power Management
Example technique: dynamic voltage and frequency scaling (DVFS).
[Figure: CPU and memory time at frequency f and at ½ f, for low and high memory accesses and for low and high overlapping CPU execution.]
Track memory accesses per instruction (MPI): different MPI rates → different DVFS settings.
Speaker note: here we shift gears from general-purpose phase analysis to a specific target.
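The intuition behind this slide can be illustrated with a deliberately simple timing model. This is only a sketch under stated assumptions, not the authors' model: CPU time is assumed to scale inversely with frequency, memory time is assumed to be unaffected by frequency, and the two either overlap fully or not at all. The function name and parameters are hypothetical.

```python
def exec_time(t_cpu, t_mem, freq_scale, overlapped=True):
    """Estimated run time after scaling frequency by freq_scale (1.0 = full speed).

    Assumptions (illustrative only): CPU time scales as 1/frequency,
    memory time does not change with CPU frequency.
    """
    scaled_cpu = t_cpu / freq_scale  # CPU work slows down as frequency drops
    if overlapped:
        # Fully overlapped: the longer of the two components sets the run time.
        return max(scaled_cpu, t_mem)
    # No overlap: the components simply add up.
    return scaled_cpu + t_mem


# Memory-bound interval: halving the frequency costs nothing in this model.
memory_bound = exec_time(t_cpu=1.0, t_mem=4.0, freq_scale=0.5)
# CPU-bound interval: halving the frequency doubles the run time.
cpu_bound = exec_time(t_cpu=4.0, t_mem=1.0, freq_scale=0.5)
```

Under these assumptions, memory-bound intervals tolerate a lower frequency with little slowdown, which is why higher MPI rates are mapped to lower DVFS settings.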

Phase Definitions
Assign different MPI ranges to different phases; a higher phase number means a more memory-bound phase.

MPI range        Phase #   DVFS setting
< 0.005          1         1500 MHz, 1484 mV
[0.005, 0.010)   2         1400 MHz, 1452 mV
[0.010, 0.015)   3         1200 MHz, 1356 mV
[0.015, 0.020)   4         1000 MHz, 1228 mV
[0.020, 0.030)   5          800 MHz, 1116 mV
>= 0.030         6          600 MHz,  956 mV

[Based on Wu et al., MICRO'05]
Important phase properties: resilient to system variations; invariant to dynamic power management actions.
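The mapping on this slide can be sketched as a small lookup. The thresholds and DVFS settings are taken from the slide; the function names are hypothetical, and treating an MPI of exactly 0.030 as phase 6 is an assumption, since the slide's ranges leave that boundary open.

```python
from bisect import bisect_right

# Upper bounds of the half-open MPI ranges for phases 1 to 5 (from the slide).
MPI_BOUNDS = [0.005, 0.010, 0.015, 0.020, 0.030]

# Phase number -> (frequency in MHz, voltage in mV), from the slide.
DVFS_SETTINGS = {
    1: (1500, 1484),
    2: (1400, 1452),
    3: (1200, 1356),
    4: (1000, 1228),
    5: (800, 1116),
    6: (600, 956),
}


def mpi(mem_accesses, instructions):
    """Memory accesses per instruction from two counter readings."""
    return mem_accesses / instructions


def classify_phase(mpi_value):
    """Map an MPI value to a phase number 1..6.

    Assumption: an MPI of exactly 0.030 falls into phase 6.
    """
    return bisect_right(MPI_BOUNDS, mpi_value) + 1


# Example with made-up counter values: 1.2M memory accesses in 100M instructions.
phase = classify_phase(mpi(1_200_000, 100_000_000))
setting = DVFS_SETTINGS[phase]
```

Because the thresholds are sorted, `bisect_right` finds the range in logarithmic time, although with only six phases a chain of comparisons would work just as well.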

Execution Phase Patterns
[Figure: applu execution snapshot, showing MPI rate (0.000 to 0.020) and the corresponding phases (1 to 5) over cycles 2.80E+10 to 3.30E+10.]
Significant variations exist, and phase patterns expose repetition.
Speaker note: going back to our first question, let's look at a real example.

Predicting Phases with the Global Phase History Table (GPHT) Predictor
[Diagram: the GPHR holds the most recent phases Pt, Pt-1, Pt-2, ..., Pt-N (GPHR depth), where Pt is the last observed phase from the performance counters. Each PHT entry holds a tag (for example Pt', Pt'-1, ..., Pt'-N), a prediction (Pt'+1), and an age/invalid field.]
Predicted phase:
  From GPHR(0) if there is no matching pattern.
  From the corresponding PHT prediction entry if there is a matching pattern in the PHT.
Inspired by a global history branch predictor.
Software: implemented in the OS for on-the-fly phase prediction.
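A minimal sketch of how a predictor of this shape could be written in software. It follows the lookup rule on the slide (PHT prediction on a tag match, otherwise GPHR(0)); the least-recently-used replacement via an age stamp, the training step, and all class and method names are assumptions for illustration, not the authors' kernel module.

```python
from collections import deque


class GPHT:
    """Illustrative Global Phase History Table predictor (hypothetical names)."""

    def __init__(self, gphr_depth=8, pht_entries=128):
        self.depth = gphr_depth
        self.capacity = pht_entries
        self.gphr = deque(maxlen=gphr_depth)  # index 0 holds the most recent phase
        self.pht = {}  # tag (tuple of phases) -> [predicted next phase, age stamp]
        self.clock = 0
        self.pending_tag = None  # history pattern waiting to learn what followed it

    def update_and_predict(self, observed_phase):
        """Record the phase just observed and return a prediction for the next one."""
        self.clock += 1
        # Train: the previous history pattern was followed by observed_phase.
        if self.pending_tag is not None:
            if self.pending_tag not in self.pht and len(self.pht) >= self.capacity:
                # Assumed policy: evict the entry with the oldest age stamp.
                oldest = min(self.pht, key=lambda tag: self.pht[tag][1])
                del self.pht[oldest]
            self.pht[self.pending_tag] = [observed_phase, self.clock]
        # Shift the new phase into the history register.
        self.gphr.appendleft(observed_phase)
        if len(self.gphr) < self.depth:
            self.pending_tag = None
            return observed_phase  # not enough history yet: last value
        tag = tuple(self.gphr)
        self.pending_tag = tag
        entry = self.pht.get(tag)
        if entry is None:
            return observed_phase  # no matching pattern: GPHR(0), i.e. last value
        entry[1] = self.clock  # refresh age on a hit
        return entry[0]


# A repeating pattern is learned after one pass through the history.
sequence = [1, 2, 3] * 5
predictor = GPHT(gphr_depth=2, pht_entries=8)
predictions = [predictor.update_and_predict(p) for p in sequence]
```

With a PHT of one entry the table can hold only the most recent pattern, so the sketch mostly falls back to last-value behavior.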

GPHT Prediction Accuracies
[Figure: prediction accuracy (%) for LastValue, GPHT (PHT:1024, GPHR:8), FixWindow_8, and VarWindow_128_0.005. The x-axis shows a subset of SPEC benchmarks: gzip_log, mcf_inp, gcc_200, gap_ref, gcc_scilab, gcc_expr, ammp_in, gcc_166, apsi_ref, mgrid_in, applu_in, parser_ref, equake_in, wupwise_ref, gcc_integrate, bzip2_program, bzip2_source, bzip2_graphic.]
Compared to reactive approaches: last value, fixed window history, variable window history.
GPHT performs significantly better for highly varying applications.
Up to 6X and on average 2.4X misprediction improvement.
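For comparison, the reactive baselines can be sketched as follows. The last-value predictor simply repeats the most recent phase. The fixed-window variant shown here uses a majority vote over the last few phases, which is an assumption: the slides do not say how the window is reduced to a prediction. The variable-window scheme is not sketched, and all function names are hypothetical.

```python
from collections import Counter, deque


def last_value_predictions(phases):
    """Prediction made after each observation: the next phase equals this one."""
    return list(phases)


def fixed_window_predictions(phases, window=8):
    """Assumed variant: predict the most common phase in the last `window` phases."""
    history = deque(maxlen=window)
    predictions = []
    for phase in phases:
        history.append(phase)
        counts = Counter(history)
        best = max(counts.values())
        # Break ties in favor of the most recently seen phase.
        predictions.append(next(p for p in reversed(history) if counts[p] == best))
    return predictions


def accuracy(phases, predictions):
    """Fraction of predictions that match the phase actually observed next."""
    hits = sum(1 for i in range(len(phases) - 1) if predictions[i] == phases[i + 1])
    return hits / (len(phases) - 1)


stable = [3, 3, 3, 1, 3, 3]       # one short glitch in otherwise stable behavior
alternating = [1, 2] * 4          # behavior that changes every interval
last_value_on_alternating = accuracy(alternating, last_value_predictions(alternating))
```

On the alternating trace the last-value predictor is wrong at every step, which is the kind of highly varying behavior where a pattern-based predictor has room to improve.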

Impact of PHT Size
[Figure: prediction accuracy (%) for LastValue and for GPHT with PHT:1024, PHT:128, PHT:64, and PHT:1 entries (all GPHR:8), on gzip_log, mcf_inp, gcc_200, gap_ref, gcc_scilab, gcc_expr, ammp_in, gcc_166, apsi_ref, parser_ref, mgrid_in, applu_in, equake_in, wupwise_ref, gcc_integrate, bzip2_program, bzip2_source, bzip2_graphic.]
A 128-entry PHT is plenty.
Accuracy converges to last value as PHT entries → 1.

Impact of Phase Granularities
Average accuracy over experimented applications:
  N = 1 → both 100%
  N → O(10,000) → both → 0%

Research Questions
How can we accurately predict future application phase behavior on all types of execution?
Can we use phase predictions to improve workload-adaptive power management?
  Real-system implementation
  Dynamic power management results

Real-System Implementation
[Diagram: Application: application binary. OS: PMI interrupt handler (stop/read counters, predict next phase, check/set DVFS state, restart counters), with phase history and predictor state. Hardware: Pentium-M processor with performance counters and DVFS registers. Measurement: multimeter (DAQ) reading CPU (V, I); parallel port.]

Phase-Driven Dynamic Adaptation: Complete Example
[Figure, plotted against instructions (1.5E+09 to 5.0E+09): MPI with actual and GPHT-predicted phases (1 to 5); power (W) for baseline and GPHT; BIPS for baseline and GPHT.]
GPHT can accurately predict varying application behavior.
Significant power savings compared to baseline.
Small performance degradation.

Improvement over Reactive Methods
[Figure: EDP improvement (axis 0% to 50%; bar labels shown: 63%, 66%, 70%) and performance degradation (axis 0% to 20%) for Last Value and GPHT, with respect to baseline execution, on bzip2_program, bzip2_source, bzip2_graphic, mgrid_in, applu_in, equake_in, swim_in, mcf_inp, and the average.]
7% EDP improvement over reactive methods.
Comparable or less performance degradation.
Speaker note: power and performance first, then EDP.
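For readers unfamiliar with the metric, energy-delay product (EDP) is energy multiplied by execution time, so with average power P and run time T it equals P × T². The sketch below shows how an EDP improvement relative to a baseline could be computed; the function names are hypothetical and the numbers in the example are made up for illustration, not results from the talk.

```python
def edp(avg_power_w, exec_time_s):
    """Energy-delay product in joule-seconds: (P * T) * T."""
    energy_j = avg_power_w * exec_time_s
    return energy_j * exec_time_s


def edp_improvement(base_power_w, base_time_s, new_power_w, new_time_s):
    """Fractional EDP reduction relative to the baseline (positive is better)."""
    return 1.0 - edp(new_power_w, new_time_s) / edp(base_power_w, base_time_s)


# Made-up example: power drops from 10 W to 8 W while run time grows by 5%.
improvement = edp_improvement(10.0, 10.0, 8.0, 10.5)
```

Because delay enters squared, a power saving only improves EDP if the accompanying slowdown stays small, which is why the slide reports performance degradation alongside EDP.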

Summary
Phase characterizations help identify repetitive application behavior under real-system variability and dynamic management actions.
Runtime phase predictions with the Global Phase History Table can accurately predict future application behavior: up to 6X and on average 2.4X fewer mispredictions than reactive approaches.
Dynamic power management guided by these phase predictions helps improve system power-performance efficiency: 27% EDP improvement over baseline and 7% over reactive approaches.

Thanks!
Download: www.princeton.edu/~canturk/platypus/ (GPHT LKM, used kernel)

Phase-Driven Management Vision
[Diagram: events from the processor (I$ misses, D$ misses, instructions completed) feed the PMCs and a classifier; a phase state machine with a history and state table maps the phase state to the next phase; actions go to a controller (DVS, cache reconfiguration).]

Design Constraints and Decisions
Target management technique: dynamic voltage and frequency scaling (DVFS).
Experimental platform: Pentium-M (Banias) → 2 PMCs.
Instruction-based monitoring: eliminates timing variations. First PMC → instructions retired.
DVFS potential: ∝ memory-boundedness of the application; ∝ (available concurrent execution)^-1. Second PMC → memory accesses per instruction (MPI).
DVFS invariance: tracked features should not change with dynamic adaptations.

Mispredicted Distance vs. Prediction Accuracy
Average distance between actual and predicted phase numbers over the whole execution.
Note: phases are not uniformly spaced.

Operation Flowchart
Application execution is interrupted every 100 million instructions. Dynamic adaptation control then:
1) Stops and reads the performance counters.
2) Translates the readings to phases.
3) Updates the phase predictor state.
4) Predicts the next phase.
5) Translates the predicted phase to a DVFS setting.
6) If the setting differs from the current one, applies the new DVFS setting.
7) Clears the interrupt, restarts the counters, and exits to program execution.
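The control flow on this slide can be sketched as a single handler function. This is a user-space simulation under stated assumptions, not the authors' interrupt handler: the counter values are passed in as arguments, `set_dvfs` is a hypothetical callback standing in for writing the DVFS registers, and the predictor is any object with an `update_and_predict` method (a last-value stand-in is used here). The MPI thresholds and frequencies come from the Phase Definitions slide.

```python
from bisect import bisect_right

MPI_BOUNDS = [0.005, 0.010, 0.015, 0.020, 0.030]  # phase thresholds from the slides
PHASE_TO_MHZ = {1: 1500, 2: 1400, 3: 1200, 4: 1000, 5: 800, 6: 600}


class LastValuePredictor:
    """Stand-in predictor: assumes the next phase equals the current one."""

    def update_and_predict(self, observed_phase):
        return observed_phase


def handle_interrupt(mem_accesses, instructions, predictor, current_mhz, set_dvfs):
    """One pass through the flowchart; returns the frequency now in effect."""
    # Counters are assumed to be stopped and read by the caller.
    observed_phase = bisect_right(MPI_BOUNDS, mem_accesses / instructions) + 1
    predicted_phase = predictor.update_and_predict(observed_phase)
    target_mhz = PHASE_TO_MHZ[predicted_phase]
    if target_mhz != current_mhz:
        set_dvfs(target_mhz)  # hypothetical hook; real code would write DVFS registers
    # Clearing the interrupt and restarting the counters would happen here.
    return target_mhz


applied = []  # records every DVFS change requested by the handler
freq = handle_interrupt(2_500_000, 100_000_000, LastValuePredictor(), 1500, applied.append)
freq = handle_interrupt(2_500_000, 100_000_000, LastValuePredictor(), freq, applied.append)
```

The second call finds the target setting already in effect, so no DVFS change is requested, matching the "same as current setting?" branch of the flowchart.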

Improvement over Reactive Methods
[Figure: power savings (axis 0% to 60%; bar labels shown: 66%, 76%) and performance degradation (axis 0% to 20%) for Last Value and GPHT, with respect to baseline execution, on mgrid_in, applu_in, equake_in, swim_in, mcf_inp, bzip2_program, bzip2_source, bzip2_graphic, and the average.]
Improved power savings.
Comparable or less performance degradation.

Improvement over Reactive Methods
[Figure: EDP improvement (axis 0% to 50%; bar labels shown: 63%, 66%, 70%) for Last Value and GPHT, with respect to baseline execution, on mgrid_in, applu_in, equake_in, swim_in, mcf_inp, bzip2_program, bzip2_source, bzip2_graphic, and the average.]
27% EDP improvement over baseline.
7% EDP improvement over reactive methods.

Bounding Performance Degradation
Phase mappings are dynamically configurable.
Performance degradation can be limited by sacrificing some power efficiency.

GPHT Overhead
Insignificant: ~0.02%

DVFS Invariance
An important constraint when taking "actions".
If actions change phase classifications, past history becomes obsolete and predictions become unreliable.

DVFS Invariance
Slide note: needs a better explanation.

Program Phases
Distinct and often-recurring regions of program behavior.
Useful for:
  Characterizing execution regions
  Using current phase/behavior to predict future behavior
  Managing dynamic adaptation
Questions:
  How can we detect recurrent execution under real-system variability?
  How can we predict future phase patterns?
  How can we leverage predicted phase behavior for workload-adaptive power management?
  Can we do better than simple, reactive methods?


Prediction Accuracies
[Figure: prediction accuracy (%) for LastValue and for GPHT with PHT:1024, PHT:128, PHT:64, and PHT:1 entries (all GPHR:8), on gzip_log, mcf_inp, gcc_200, gap_ref, gcc_scilab, gcc_expr, ammp_in, gcc_166, apsi_ref, parser_ref, mgrid_in, applu_in, equake_in, wupwise_ref, gcc_integrate, bzip2_program, bzip2_source, bzip2_graphic.]
Compared to reactive approaches (last value prediction):
GPHT performs significantly better for highly varying applications.
Up to 6X and on average 2.4X misprediction improvement.
Good performance down to 128 PHT entries.
Accuracy converges to last value as PHT entries → 1.

Full-System Implementation
[Diagram: Application: application binary. OS: PMI interrupt handler triggered by the performance monitoring interrupt (stop/read counters, predict next phase, check/set DVFS state, restart counters), with PMC and phase log and predictor state. Hardware: Pentium-M processor with performance counters and DVFS registers; power supply and voltage regulator supplying VCPU; currents I1, I2 and voltages V1, V2 measured across R1,2 = 2 mΩ by the data acquisition system; parallel port.]

