
1 PPEP: ONLINE PERFORMANCE, POWER, AND ENERGY PREDICTION FRAMEWORK BO SU † JUNLI GU ‡ LI SHEN † WEI HUANG ‡ JOSEPH L. GREATHOUSE ‡ ZHIYING WANG † † NUDT ‡ AMD RESEARCH DECEMBER 17, 2014

2 BACKGROUND
• Dynamic Voltage and Frequency Scaling (DVFS)
  ‒ Widely employed
  ‒ Boost performance, lower power, and improve energy efficiency
  ‒ Good DVFS decisions require accurate performance & power predictions across Voltage-Frequency (VF) states
• Challenges Brought by Modern Processors
  ‒ Cores and north bridge: different clock domains and power planes
  ‒ Off-chip memory: own clock domain

3 TWO QUESTIONS
Q1: How does the application’s performance change with the VF state?
Q2: How does the application’s power change with the VF state?

4 EXPERIMENTAL SETUP: PLATFORM & BENCHMARKS
• Processor: AMD FX-8320
  ‒ 2nd Generation Family 15h “Piledriver”
  ‒ 4 CUs (Compute Units)
  ‒ 2 cores in a CU
  ‒ 5 VF (Voltage-Frequency) states
  ‒ The NB (North Bridge): shared by all CUs; L3$ and memory controller
• Power Measurement
  ‒ Pololu ACS711: current sensor
  ‒ Arduino: power reading
• Benchmark Suites
  ‒ SPEC® CPU 2006, PARSEC, NAS Parallel Benchmarks
[Figure labels: CU, North Bridge, ASUS M5A97, ACS711, Power Supply, FX-8320 CPU, Arduino]

5 OUTLINE
• Background
• Experimental Setup
• Performance Prediction Across DVFS States
• Power Prediction Across DVFS States
• PPEP Framework
• Conclusion
Q1: How does the application’s performance change with the VF state?

6 HOW CPI CHANGES WITH FREQUENCY
Estimate #1: CPI does not change when frequency increases.
Estimate #2: CPI also increases when frequency increases.
[Figure labels: bzip2, mcf, GAP (lower is better)]

7 WHAT MAKES THE CPI DIFFERENCE?
• Cycles working in the CPU core: core frequency doesn’t matter.
• Cycles stalled on memory accesses (L3 + DRAM): scale with core frequency.
• When the core frequency changes from f to f’: core cycles do not scale; memory stall cycles scale by f’/f.
• Key: memory stall cycles estimation
[Figure labels: bzip2, mcf, Core, L3+DRAM, Core clock cycles, GAP]

8 “LEADING LOADS” MEMORY TIME ESTIMATION
[CF’10 Keramidas+], [TOC’10 Eyerman+], and [IGCC’11 Rountree+]
• Leading Loads Model: memory stall cycles = cycles that leading loads are active
• AMD hardware has a MAB0 performance counter which can estimate Leading Loads cycles. [USENIX-ATC’14 Su+]
[Figure labels: Core, L3+DRAM, Leading Load, Not Leading, Memory Stall Cycles, Core clock cycles]

9 CPI PREDICTOR
• 3 HW events
• Step = 200 ms

#   Code  Event Name
E1  0x76  CPU_Clocks_not_Halted
E2  0xc0  Retired_Instructions
E3  0x69  MAB0_Wait_Cycles (MAB-based Leading Loads cycles)

LL-MAB CPI Predictor
• Running frequency f: core cycles = E1 - E3, memory stall cycles = E3, instruction number = E2
  CPI(f) = E1/E2
• Another frequency f’: core cycles = E1 - E3, memory stall cycles = E3 * (f’/f), instruction number = E2
  CPI(f’) = (E1 + E3*(f’/f - 1))/E2
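
The LL-MAB formula on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name, argument names, and counter values are hypothetical, and the inputs are assumed to be counter deltas collected over one step.

```python
def predict_cpi(e1_cycles, e2_instructions, e3_mab0_wait, f_current, f_target):
    """Predict CPI at f_target from counters sampled at f_current.

    e1_cycles       -- CPU_Clocks_not_Halted delta (E1)
    e2_instructions -- Retired_Instructions delta (E2)
    e3_mab0_wait    -- MAB0_Wait_Cycles delta (E3), the leading-load cycles
    """
    if e2_instructions <= 0:
        raise ValueError("no instructions retired in this step")
    core_cycles = e1_cycles - e3_mab0_wait                # does not scale with frequency
    stall_cycles = e3_mab0_wait * (f_target / f_current)  # scales by f'/f
    return (core_cycles + stall_cycles) / e2_instructions


# Hypothetical counter values, not measurements from the talk.
cpi_same = predict_cpi(2_000_000, 1_000_000, 500_000, 1.7, 1.7)
cpi_doubled = predict_cpi(2_000_000, 1_000_000, 500_000, 1.7, 3.4)
```

With f_target equal to f_current the expression reduces to E1/E2, which is a quick sanity check on the algebra.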

10 PREDICTED CPI CURVE VS. MEASURED CPI CURVE
1.7 GHz (VF2) -> 3.5 GHz (VF5)

11 CPI PREDICTION ERROR
• FX-8320 Processor
• All 52 benchmarks in SPEC, PARSEC, and NPB
• 1.7 GHz (VF2) -> 3.5 GHz (VF5)
[Figure values: 15.5%, 14.2%, 3.0%]

12 OUTLINE
• Background
• Experimental Methodology
• Performance Prediction Across DVFS States
• Power Prediction Across DVFS States
• PPEP Framework
• Conclusion
Q2: How does the application’s power change with the VF state?

13 IDLE POWER MODEL + DYNAMIC POWER MODEL
[Figure labels: Heating, Cooling, Relationship?]

VF State  Voltage (V)  Frequency (GHz)  Error of P_idle
VF5       1.320        3.5              2.4%
VF4       1.242        2.9              2.6%
VF3       1.128        2.3              3.6%
VF2       1.008        1.7              2.8%
VF1       0.888        1.4              2.7%

14 IDLE POWER MODEL + DYNAMIC POWER MODEL

#    Event                Code
E4   Retired_uOPs         0xc1
E5   FPU_Pipe_Assignment  0x00
E6   L1I$_Fetches         0x80
E7   L1D$_Accesses        0x40
E8   L2$_Requests         0x7d
E9   Retired_Branches     0xc2
E10  Retired_MisBranches  0xc3
E11  L2$_Misses           0x7e
E12  Dispatch_Stalls      0xd1
(PS = per-second count value)

15 POWER MODEL VALIDATION: EXPERIMENT
• Benchmark combinations
  ‒ 152 combinations in total
  ‒ Thread number: 1, 2, 3, and 4
  ‒ SPEC: single-/multi-programmed
  ‒ PARSEC: single-/multi-threaded
  ‒ NPB: single-/multi-threaded
• 4-fold cross validation
  ‒ Divide the 152 benchmark combinations into 4 sets
  ‒ Train on all combinations of 3 sets
  ‒ Test on the remaining set

Number    SPEC  PARSEC  NPB  Total
1-thread  29    13      10   52
2-thread  15    13      10   38
3-thread  10    12      10   32
4-thread  7     13      10   30
Total     61    51      40   152
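
The 4-fold procedure above can be sketched as follows. The slides do not say how the 152 combinations were assigned to sets, so the seeded random shuffle, the function name, and the placeholder combination labels are assumptions made for illustration.

```python
import random


def four_fold_splits(items, folds=4, seed=0):
    """Yield (train, test) pairs: each set is held out once for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment to sets
    sets = [shuffled[i::folds] for i in range(folds)]
    for held_out in range(folds):
        test = sets[held_out]
        train = [x for j, s in enumerate(sets) if j != held_out for x in s]
        yield train, test


# Placeholder labels standing in for the 152 benchmark combinations.
combos = [f"combo{i}" for i in range(152)]
splits = list(four_fold_splits(combos))
```

Each combination appears in exactly one test set, so every reported error comes from a model that never saw that combination during training.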

16 POWER MODEL VALIDATION (LOWER IS BETTER)
• Error: 4.6%
• Standard Deviation: 2.8%

17 PREDICTING POWER ACROSS VF STATES
• 3 events => CPI(VF1), CPI(VF2), CPI(VF3), CPI(VF4), CPI(VF5), through LL-MAB
• T => P_idle(VF1), P_idle(VF2), P_idle(VF3), P_idle(VF4), P_idle(VF5), using the same T
• 9 events => P_dyn(VF1), P_dyn(VF2), P_dyn(VF3), P_dyn(VF4), P_dyn(VF5)
  ‒ Needs HW event prediction: E_PS(i)(VF1), E_PS(i)(VF2), E_PS(i)(VF3), E_PS(i)(VF4), E_PS(i)(VF5), i = 4, …, 12
  ‒ HW event prediction: 2 observations + LL-MAB CPI predictor

18 OBSERVATION 1 + OBSERVATION 2
• At any given point in the execution of a program, core-private HW event counts per instruction are independent of VF state.
• => To execute the same section of instructions in a program, the activities of core-private resources are independent of VF state.

#    Event Name
E4   Retired_uOPs
E5   FPU_Pipe_Assignment
E6   L1I$_Fetches
E7   L1D$_Accesses
E8   L2$_Requests
E9   Retired_Branches
E10  Retired_MisBranches
E11  L2$_Misses

Error on FX-8320, between 3.5 GHz (VF5) and 1.7 GHz (VF2): 0.6%, 0.9%, 0.7%, 5.0%, 0.7%, 1.3%, 4.0%

19 OBSERVATION 1 + OBSERVATION 2

#    Event
E1   CPU_Clocks_not_Halted
E12  Dispatch_Stalls

[Figure: execution from start to end at Freq = f and Freq = f’, with cycles broken down into cycles retiring 4, 3, 2, and 1 instructions, cycles without retiring, overheads of mispredicted branches and exceptions, and dispatch stalls. Labels: Frequency Independent, Gap]
Error on FX-8320, between 3.5 GHz (VF5) and 1.7 GHz (VF2): 1.7%, 0.6%, 0.9%, 1.0%, 1.5%

20 DYNAMIC POWER PREDICTION: FLOW
• HW events (from PMC)

#    Event
E1   CPU_Clocks_not_Halted
E2   Retired_Instructions
E3   MAB0_Wait_Cycles
E4   Retired_uOPs
E5   FPU_Pipe_Assignment
E6   L1I$_Fetches
E7   L1D$_Accesses
E8   L2$_Requests
E9   Retired_Branches
E10  Retired_MisBranches
E11  L2$_Misses
E12  Dispatch_Stalls

[Figure labels: LL-MAB, Observation 1, Observation 2]
• Prediction flow (bridge: LL-MAB + 2 observations)
PS = per-second, PI = per-instruction
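
One way to read the PS/PI bridge for the core-private events covered by Observation 1 is sketched below. This is an interpretation, not the flow from the slide figure: it assumes per-instruction counts stay fixed across VF states and that instructions per second at the target state equal f’/CPI(f’). The function and variable names and the counter values are hypothetical, and the treatment of E1 and E12 under Observation 2 is not covered.

```python
def predict_event_rates(counts, instructions, cpi_target, f_target_hz):
    """Convert counts sampled at the current VF state into predicted
    per-second (PS) rates at a target VF state.

    counts       -- dict of event name -> count over one step (current state)
    instructions -- Retired_Instructions over the same step
    cpi_target   -- CPI predicted for the target state (e.g. by LL-MAB)
    f_target_hz  -- core frequency of the target state in Hz
    """
    # Observation 1 (assumed to hold exactly): per-instruction (PI) counts
    # do not depend on the VF state.
    per_instruction = {name: c / instructions for name, c in counts.items()}
    # Instructions per second at the target state.
    ips_target = f_target_hz / cpi_target
    return {name: pi * ips_target for name, pi in per_instruction.items()}


# Hypothetical counts for two of the events, not measured data.
rates = predict_event_rates(
    {"Retired_uOPs": 1_500_000, "L2$_Misses": 10_000},
    instructions=1_000_000,
    cpi_target=2.5,
    f_target_hz=3.5e9,
)
```

The predicted rates would then feed whatever dynamic power model was fitted on per-second event counts.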

21 PREDICTION ERROR OF AVERAGE CHIP POWER (LOWER IS BETTER)
• Error: 4.2%
• Standard Deviation: 3.6%

22 OUTLINE
• Background
• Experimental Methodology
• Performance Prediction Across DVFS States
• Power Prediction Across DVFS States
• PPEP Framework
• Conclusion
Q1: How does the application’s performance change with the VF state?
Q2: How does the application’s power change with the VF state?

23 PPEP FRAMEWORK
• PMC-based user daemon
• Runs periodically
[Figure labels: Performance, Power, Energy, Online Prediction, Commercial Processors]
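
Energy follows from the other two predictions, since energy is power multiplied by time and the time per instruction is CPI/f. The sketch below shows that arithmetic only. The function name and the CPI and power numbers are hypothetical; the two frequencies are the VF2 and VF5 values from the idle power table.

```python
def energy_per_instruction(power_watts, cpi, freq_hz):
    """Joules per instruction = power * (seconds per instruction)."""
    return power_watts * cpi / freq_hz


# VF state -> (frequency in Hz, predicted CPI, predicted chip power in W).
# CPI and power values are made up for illustration.
predictions = {
    "VF5": (3.5e9, 2.5, 70.0),
    "VF2": (1.7e9, 2.0, 30.0),
}

energy = {
    vf: energy_per_instruction(power, cpi, freq)
    for vf, (freq, cpi, power) in predictions.items()
}
most_efficient = min(energy, key=energy.get)
```

A daemon could rank VF states by this value, by predicted CPI, or by predicted power, depending on the goal.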

24 CONCLUSION
On a commercial processor:
• Demonstrated an across-VF CPI predictor “LL-MAB”
  ‒ According to the Leading Loads theory
• Demonstrated an across-VF power predictor
  ‒ Through the “LL-MAB” CPI predictor and 2 observations
• Combining them together: PPEP Framework
  ‒ Supplies performance, power, and energy information
  ‒ SW method w/o requiring HW or OS modification

25 THANK YOU!
• Questions

26 DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

27 Backup Slides

28 PERFORMANCE & POWER PREDICTION
• Running VF state: VF4
• Predicting performance and power of the other VF states
[Figure: from the running state, performance prediction gives CPI(VF1) through CPI(VF5) and power prediction gives Power(VF1) through Power(VF5), across lower and higher core VF states]

29 EXPERIMENTAL METHODOLOGY: SOFTWARE TOOLS
• Operating System
  ‒ Ubuntu 12.04 LTS Desktop (kernel version 3.2.0-24)
• Tools
  ‒ taskset: A2C mapping
  ‒ msr-tools: PMC control
  ‒ CPUFreq userspace governor: VF scaling
  ‒ hwmon tree in sysfs: temperature measurement
• Benchmark Suites
  ‒ SPEC® CPU 2006 v1.2 (29 benchmarks)
  ‒ PARSEC v2.1 (13 benchmarks)
  ‒ NAS Parallel Benchmarks v3.3.1 (10 benchmarks)

30 HOW TO ESTIMATE “MEMORY STALL CYCLES”? MODERN CORES MAKE THIS DIFFICULT
• Memory-level parallelism: multiple parallel accesses
• Accesses overlap computation
• Variable latencies
• Memory access counts do not estimate memory time accurately
[Figure labels: Core Work, L3+DRAM, Core clock cycles]

31 LEADING LOADS ON AMD PROCESSORS
• L2 cache misses are held in the Miss Address Buffer (MAB)
  ‒ MAB entries have a static priority (e.g. MAB0 is highest priority)
  ‒ Highest priority empty MAB holds the miss until it returns from memory
• Performance event 0x69 allows SW to count the number of cycles with filled MABs
[Figure labels: CPU Core & L1 Cache, L2 Cache, Miss Address Buffers (MAB0, MAB1, MAB2, …, MABn), L3 Cache & DRAM, Core Clock Domain, NB and Memory Clock Domains]

32 MAB BASED LEADING LOADS MODEL: LL-MAB (EXPERIMENT)
• Benchmark: running on Core 0.
• Manager: running on Core 7. Collects the counts every 200 ms.
[Figure: chip layout with CU 0 to CU 3 (Core 0 to Core 7), L2 caches, and the L3 cache & memory controller; every 200 ms the Manager on Core 7 reads the counters of the benchmark on Core 0 and pairs the measured CPI(f) with the predicted CPI(f’)]

33 PREDICTED CPI VS. MEASURED CPI
1.7 GHz -> 3.5 GHz
• a) Divide each curve into 20 segments;
• b) Compare cycle numbers in each segment pair;
• c) Average absolute error => CPI prediction error.
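
Steps a) to c) can be sketched as below. Several details are assumptions, since the slide does not spell them out: both curves are taken as equal-length lists of cycle counts sampled at the same instruction intervals, the per-segment error is computed relative to the measured cycles, and the function names and sample values are hypothetical.

```python
def segment_sums(values, segments=20):
    """Split values into `segments` contiguous chunks and sum each chunk."""
    n = len(values)
    bounds = [i * n // segments for i in range(segments + 1)]
    return [sum(values[bounds[i]:bounds[i + 1]]) for i in range(segments)]


def cpi_prediction_error(predicted_cycles, measured_cycles, segments=20):
    """Average absolute relative error over the segment pairs."""
    if len(predicted_cycles) != len(measured_cycles):
        raise ValueError("curves must have the same number of samples")
    if len(measured_cycles) < segments:
        raise ValueError("need at least one sample per segment")
    pred = segment_sums(predicted_cycles, segments)
    meas = segment_sums(measured_cycles, segments)
    # Assumes every measured segment has a positive cycle count.
    errors = [abs(p - m) / m for p, m in zip(pred, meas)]
    return sum(errors) / segments


# Hypothetical curves: the prediction is uniformly 10% too high.
measured = [100] * 40
predicted = [110] * 40
error = cpi_prediction_error(predicted, measured)
```

Comparing segment sums rather than individual samples keeps small timing misalignments between the two runs from dominating the error.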

34 CPI PREDICTION ERROR VS. MEMORY BOUNDEDNESS
1.7 GHz -> 3.5 GHz

35 CHIP IDLE POWER MODEL
• Model and error

36 CHIP DYNAMIC POWER MODEL: EXPERIMENT
• App threads mapping & Manager
[Figure: chip layout with CU 0 to CU 3 (Core 0 to Core 7), L2 caches, and the L3 cache & memory controller, showing app threads and the Manager]
[Timeline labels: Step = 200 ms; 1st half-step (100 ms); 2nd half-step (100 ms); power-step (20 ms) x10; Manager sleeping; Manager working; Time]

37 FITTING ERROR: DYNAMIC POWER MODEL (LOWER IS BETTER)
• Error: 10.6%
• Standard Deviation: 5.8%

38 (LOWER IS BETTER)
• Error: 8.3%
• Standard Deviation: 6.9%

39 THE IMPACT OF POWER GATING

40 THE IMPACT OF POWER GATING
• Micro benchmark “A”
  ‒ Dataset fits in the L1 data cache => no NB activity (dynamic power)
  ‒ N instances of “A” multiply the dynamic power of “A” by N
• A2C mapping (CU0: C0, C1; CU1: C2, C3; CU2: C4, C5; CU3: C6, C7)
  ‒ Cases: 4 CUs, 3 CUs, 2 CUs, 1 CU, Idle; each busy CU runs “A, -” and the remaining CUs are idle
  ‒ For each case: Power and Power Difference
• When power gating is disabled (4 CUs, 3 CUs, 2 CUs, 1 CU, Idle): 4P_dyn(A)+P_idle(Chip), 3P_dyn(A)+P_idle(Chip), 2P_dyn(A)+P_idle(Chip), P_dyn(A)+P_idle(Chip), P_idle(Chip)
• When power gating is enabled: P_idle(CU), 2P_idle(CU), 3P_idle(CU), 4P_idle(CU), +P_idle(NB), P_idle(Base)
• Get P_idle(CU), P_idle(NB), P_idle(Base)
• n = busy cores in chip; m = busy cores in CU

41 PPEP FRAMEWORK: COMPUTATION COMPLEXITY
• Computation complexity on each core
• Core number is c, VF number is k

For the current VF state  ADD  MUL  DIV
CPI calculation           -    -    1
Power calculation         18   29   -
Total                     18   29   1

For every other VF state  ADD  MUL  DIV
CPI prediction            4    1    11
Power prediction          18   39   1
Total                     22   40   12

OP   Number             Example: FX-8320, c=8, k=5
ADD  c*(18+22*(k-1))    848
MUL  c*(29+40*(k-1))    1512
DIV  c*(1+12*(k-1))     392

<3k FPOPs per step (computing flow w/o optimization)
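
The per-step operation counts in the table can be reproduced directly from the formulas. The function name below is hypothetical; the coefficients come from the slide.

```python
def ppep_ops_per_step(cores, vf_states):
    """Floating-point operation counts per step for c cores and k VF states."""
    others = vf_states - 1  # every VF state except the current one
    return {
        "ADD": cores * (18 + 22 * others),
        "MUL": cores * (29 + 40 * others),
        "DIV": cores * (1 + 12 * others),
    }


# FX-8320 example from the slide: c = 8 cores, k = 5 VF states.
ops = ppep_ops_per_step(cores=8, vf_states=5)
total_ops = sum(ops.values())
```

For the FX-8320 this gives 848 ADD, 1512 MUL, and 392 DIV, a total of 2752 operations, which matches the "<3k FPOPs" figure.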

42 PPEP FRAMEWORK: RUNNING OVERHEAD
• Measured running overhead of PPEP (FX-8320, VF5)
  ‒ Power: negligible at Step = 200 ms; 1.3 W at Step = 20 ms
  ‒ Performance (App and PPEP not attached to the same core): negligible
  ‒ Performance (App and PPEP attached to the same core): 1.6% slowdown at Step = 200 ms; 5.8% slowdown at Step = 20 ms

43 PPEP USAGE EXAMPLE: ONE-STEP POWER CAPPING
• PPEP could rapidly reach the cap
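
A possible shape for a one-step capping decision is sketched below. The selection policy (pick the fastest VF state whose predicted power stays under the cap, otherwise fall back to the slowest state) is an assumption, not the policy described in the talk. The function name and the predicted power values are hypothetical; the frequencies are those of the FX-8320 VF table.

```python
# Frequencies of the five VF states, in GHz.
VF_FREQ_GHZ = {"VF1": 1.4, "VF2": 1.7, "VF3": 2.3, "VF4": 2.9, "VF5": 3.5}


def pick_vf_under_cap(predicted_power_w, cap_w):
    """Return the fastest VF state whose predicted power is within the cap."""
    feasible = [vf for vf, power in predicted_power_w.items() if power <= cap_w]
    if not feasible:
        # Assumed fallback: nothing fits, so use the lowest-frequency state.
        return min(VF_FREQ_GHZ, key=VF_FREQ_GHZ.get)
    return max(feasible, key=VF_FREQ_GHZ.get)


# Hypothetical per-state power predictions in watts.
predicted = {"VF1": 30.0, "VF2": 38.0, "VF3": 52.0, "VF4": 70.0, "VF5": 95.0}
choice = pick_vf_under_cap(predicted, cap_w=60.0)
fallback = pick_vf_under_cap(predicted, cap_w=20.0)
```

Because power is predicted for every VF state at once, the target state can be chosen in a single step instead of stepping through states and re-measuring.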

