Identifying Program Power Phase Behavior Using Power Vectors

Slides:

Advertisements

Similar presentations

Discovering and Exploiting Program Phases Timothy Sherwood, Erez Perelman, Greg Hamerly, Suleyman Sair, Brad Calder CSE 231 Presentation by Justin Ma.

Advertisements

CS 7810 Lecture 4 Overview of Steering Algorithms, based on Dynamic Code Partitioning for Clustered Architectures R. Canal, J-M. Parcerisa, A. Gonzalez.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

Combining Statistical and Symbolic Simulation Mark Oskin Fred Chong and Matthew Farrens Dept. of Computer Science University of California at Davis.

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

November 12, 2013Computer Vision Lecture 12: Texture 1Signature Another popular method of representing shape is called the signature. In order to compute.

CS 211: Computer Architecture Lecture 5 Instruction Level Parallelism and Its Dynamic Exploitation Instructor: M. Lancaster Corresponding to Hennessey.

CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.

Automatically Characterizing Large Scale Program Behavior Timothy Sherwood Erez Perelman Greg Hamerly Brad Calder.

Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.

Runtime Processor Power Monitoring VICE VERSA 07/18/2003 Talk Canturk ISCI.

Is Out-Of-Order Out Of Date ? IA-64’s parallel architecture will improve processor performance William S. Worley Jr., HP Labs Jerry Huck, IA-64 Architecture.

Nicolas Tjioe CSE 520 Wednesday 11/12/2008 Hyper-Threading in NetBurst Microarchitecture David Koufaty Deborah T. Marr Intel Published by the IEEE Computer.

Hyper-Threading Technology Architecture and Micro-Architecture.

Tahir CELEBI, Istanbul, 2005 Hyper-Threading Technology Architecture and Micro-Architecture Prepared by Tahir Celebi Istanbul, 2005.

Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data Canturk Isci & Margaret Martonosi Princeton University MICRO-36.

Automatically Characterizing Large Scale Program Behavior Timothy Sherwood Erez Perelman Greg Hamerly Brad Calder Used with permission of author.

Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Michael D’Mello

Intel Multimedia Extensions and Hyper-Threading Michele Co CS451.

Identifying Program Power Phase Behavior Using Power Vectors Canturk Isci & Margaret Martonosi WWC Austin, TX.

Sunpyo Hong, Hyesoon Kim

Exploiting Value Locality in Physical Register Files Saisanthosh Balakrishnan Guri Sohi University of Wisconsin-Madison 36 th Annual International Symposium.

Best detection scheme achieves 100% hit detection with

Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

Parapet Research Group, Princeton University EE Workshop on Hardware Performance Monitor Design and Functionality HPCA-11 Feb 13, 2005 Hardware Performance.

PINTOS: An Execution Phase Based Optimization and Simulation Tool) PINTOS: An Execution Phase Based Optimization and Simulation Tool) Wei Hsu, Jinpyo Kim,

Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.

Experience Report: System Log Analysis for Anomaly Detection

Improving Multi-Core Performance Using Mixed-Cell Cache Architecture

Canturk ISCI Margaret MARTONOSI

Data Prefetching Smruti R. Sarangi.

Dynamic Branch Prediction

William Stallings Computer Organization and Architecture 8th Edition

Multiscalar Processors

CSC 4250 Computer Architectures

PowerPC 604 Superscalar Microprocessor

Cache Memory Presentation I

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Yogesh Mahajan, Sharad Malik Princeton University

Flavius Gruian < >

Pipelining: Advanced ILP

Out of Order Processors

CSCI1600: Embedded and Real Time Software

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Comparison of Two Processors

Ka-Ming Keung Swamy D Ponpandi

Phase Capture and Prediction with Applications

Lecture: Out-of-order Processors

Adapted from the slides of Prof

How to improve (decrease) CPI

Advanced Computer Architecture

Guest Lecturer TA: Shreyas Chand

Canturk ISCI Gilberto CONTRERAS Margaret MARTONOSI

Parallelization of Sparse Coding & Dictionary Learning

* From AMD 1996 Publication #18522 Revision E

Instruction Execution Cycle

COMS 361 Computer Organization

Data Prefetching Smruti R. Sarangi.

Adapted from the slides of Prof

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

Dynamic Hardware Prediction

How does the CPU work? CPU’s program counter (PC) register has address i of the first instruction Control circuits “fetch” the contents of the location.

Chapter 11 Processor Structure and function

Lois Orosa, Rodolfo Azevedo and Onur Mutlu

CSCI1600: Embedded and Real Time Software

Conceptual execution on a processor which exploits ILP

Phase based adaptive Branch predictor: Seeing the forest for the trees

Ka-Ming Keung Swamy D Ponpandi

Spring 2019 Prof. Eric Rotenberg

Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project

Presentation transcript:

Identifying Program Power Phase Behavior Using Power Vectors Canturk Isci & Margaret Martonosi The title is our goal WWC-6 10.27.2003 Austin, TX

Power Phase Behavior Existence of distinguishable intervals during an application’s execution lifetime such that: They share significantly higher resemblance within themselves in terms of power behavior the application exhibits on a given processor This similarity is carried out by not only the total processor power, but also the distribution of power into processor sub-units Before describing the essence of our work, What do we mean by power phase behavior

Our Power Phase Analysis Goal: Identify phases in program power behavior Determine execution points that correspond to these phases Define small set of power signatures that represent overall power behavior

Our Approach Our Approach – Outline: Collect samples of estimated power values for processor sub-units <Power Vectors> at application runtime Define a power vector similarity metric Group sampled program execution into phases Determine execution points and representative signature vectors for each phase group Analyze the accuracy of our approximation Basically, what we’ll talk about today: Collect samples of estimated power values for processor sub-units <Power Vectors> at application runtime Define a similarity metric regarding these power vectors Process the outcome of this similarity metric to group sampled execution points (/Power Vectors) into phases Determine execution points and representative signature vectors for each phase group Quantify the closeness of our approximation based on these vectors to original power behavior

Motivation Characterizing power behavior: Utilizing power vectors: Future power-aware architectures and applications Dynamic power/thermal management Architecture research Utilizing power vectors: Direct relation to actual processor power consumption Acquired at runtime Identify program phases with no knowledge of application Dynamic power/thermal management Multiconfigurable hardware Thread scheduling, DVS, DFS Recurring phase prediction Architecture research Representative –reduced– simulation points Acquired at runtime  Similarity relations generated quickly Easy repeatability for different datasets/compilations Identify (recurring) phases over large scales of execution Identify program phases with no knowledge of application (i.e. no basic block profile, PC sampling, code space info, etc.)

Generating Power Vectors 1mV/Adc conversion POWER SERVER Voltage readings via RS232 to logging machine 22 Entries of each power vector sample Counter based access rates over ethernet POWER CLIENT Convert voltage to measured power Convert access rates to component powers

Power Vector Similarity Metric How to quantify the ‘power behavior dissimilarity’ between two execution points? Consider solely total power difference  Consider manhattan distance between the corresponding 2 vectors  Consider manhattan distance between the corresponding 2 vectors normalized  Consider a combination of (2) & (3)  Construct a “similarity matrix” to represent similarity among all pairs of execution points Each entry in the similarity matrix: Consider solely total power difference  Conceals significant phase information Consider manhattan distance between the corresponding 2 vectors  Vectors with small magnitudes are inherently closer Consider manhattan distance between the corresponding 2 vectors normalized  Indifferent to magnitude of power consumption Consider a combination of (2) & (3)  Restricts us to both absolute and ratio-wise similarities

(Similar to SimPoints work of Sherwood et al.) Gcc & Gzip Matrix Plots 0.44 44 44.44 88 132 88.44 176 220 132.44 264 Timeline [s] Timeline [s] 176.44 308 352 396 220.44 440 (Similar to SimPoints work of Sherwood et al.) 264.44

Gcc & Gzip Matrix Plots 0.44 44.44 88.44 132.44 Timeline [s] 176.44 Gcc Elaboration: Very variant power Almost identical power behavior at 30, 50, 180s. Although 88s, 110s, 140s, 210s and 230s show similar total power; 88, 210 and 230 share higher similarity. 44.44 88.44 132.44 Timeline [s] 176.44 220.44 264.44

44 88 Gcc & Gzip Matrix Plots 132 176 220 Timeline [s] 264 308 352 396 Gzip Elaboration: Much regular power behavior Spurious similarities are again distinguished by the similarity analysis 44 88 132 176 220 264 Timeline [s] 308 352 396 440

Grouping Execution Points “Thresholding Algorithm”: Define a threshold of similarity < % of max dissimilarity> Start from first execution point (0,0) and identify ones in the fwd execution path that lie within threshold for both normalized and absolute metrics Tag the corresponding execution points (j,j) as the same group Find next untagged execution point (r,r) and do the same along forward path Rule: A tagged execution point cannot add new elements to its group! We demonstrate the outcome of thresholding with Grouping Matrices So, we can identify similar power phases: I.e. informally: if similarity matrix(r,c) is DARK  Execution points r & c have similar power behavior 2 Questions: 1) How do we group the execution points (power vectors) based on their similarity? 2) Could we represent power behavior with reasonable accuracy, with a small number of ‘signature’ vectors?

Gzip Grouping Matrices # of Groups: 3 # of Groups: 1 # of Groups: 33 # of Groups: 254 Original Similarity Matrix # of Groups: 909 Also shows for 10%, 100-150 & 200-280s regions are discriminated Gzip has 974 power vectors Cluster vectors based on similarity using “thresholding” Max Gzip power dissimilarity: 47.35W

Generated Group Distributions

Representative Vectors & Execution Points We have each execution point assigned to a group For Each Group: For Each Execution Point: We can represent whole execution with as many power vectors as the number of generated groups Define a representative vector as the average of all instances of that group Select the execution point that started the group (The earliest point in each group) Assign the corresponding group’s representative vector as that point’s power vector Assign the power vector of the selected execution point for that group as that point’s power vector Back to the question: Could we represent power behavior with reasonable accuracy, with a small number of ‘signature’ vectors? And also select execution points?

Reconstructing Power Trace with Representative Vectors:

Reconstructing Power Trace with Selected Execution Points:

Component Power Characterizations

Approximation Error Due to thresholding algorithm  Errors for selected exec. points are bounded with the threshold Max Error: 4.71W & RMS Error: 3.08W As representative vectors are group centroids  Cumulative errors for repr. vectors are lower Max Error: 7.10W & RMS Error: 2.31W Error in total power< Σ(Component errors) Fig-s small, but show the dist-n Expoints more evenly distributed

Conclusion Presented a power oriented methodology to identify program phases that uses power vectors generated during program runtime Provided a similarity metric to quantify power behavior similarity of different execution samples Demonstrated our representative sampling technique to characterize program power behavior Can be useful for power & characterization research: Power Phase identification/prediction Reduced power simulation Dynamic power/thermal management Demonstrated our representative sampling technique to characterize program power behavior Representative vectors for program power signatures Execution points for representative simulation

Related Work Dhodapkar and Smith [ISCA’02] Working set signatures to detect phase changes Sherwood et. al. [PACT’01,ASPLOS’02,ISCA’30] Similarity analysis based on program basic block profiles to identify phases Todi [WWC’01] Clustering based on counter information to identify similar behavior Our work in comparison Power oriented Power behavior similarity metric Runtime No information about the application is required Bounded approximation error with thresholding Some here, rest in paper Similarity based on both distribution and magnitude of power Our metric is more appropriate for power

EOP

EXTRA SLIDES Power Vector Components Different Similarity Metrics More detail on Power Vectors Different Similarity Metrics Similarity matrices and equations for all discussed techniques Similarity & Grouping Matrices Exemplified description of the two matrices and plots Current & Future Research Discussion of the ongoing and future research Includes also some new ideas and some things to do to make our current analysis solid Questions & Rebuttals Starts with the discussion of presented work Discusses some shortcomings, things that need to be done to improve and to verify that it is unique and solid (Some parts of current work also discusses these issues) Also provides some answers to reviewers’ questions Includes some possible new ideas Some here, rest in paper Similarity based on both distribution and magnitude of power Our metric is more appropriate for power

Power Vector Components Defining Components Bus Control L2 Cache ITLB & Ifetch 2nd Level BPU L1 Cache Mem Cntrl Int RF Instr-n Queue Int EXE MOB DTLB Scheduler Check Replay Retire Instr-n Decode FP EXE FP RF I am not gonna talk about P4 architecture becoz of time instead I’ll use the die layout as a proxy memory sub: L2 ve BUS to chipset main mem Front end: fetch, decode, ITLB ref etc. O-O-O: does buffr alloc., reg rename, sched. and dispatch, and retire reorders instr-s EX: Rapid exec. + L1 stuff, Actual layout components important to use as thermal model based on those physical components 22 components here Trace Cache Instr-n Queue 1st Level BPU Ucode ROM Allocate Rename

P4 Architecture vs Layout Power Vector Components P4 Architecture vs Layout First let’s have a look at the March & layout to see what we are dealing with Components to Model: MOB Mem Control DTLB Int EXE FP EXE Int RF FP RF Decode Trace $ 1st Level BPU Microcode ROM Allocation Rename Inst-n Qs Schedule Retirement Bus Control L2 Cache 2nd Level BPU ITLB & Ifetch L1 Cache Back

Defining Events  Access Rates Power Vector Components Defining Events  Access Rates We determined 24 events to approximate access rates for 22 components Used Several Heuristics to represent each access rate Examples: Need to rotate counters 4 times to collect all event data Used 15 counters & 4 rotations to collect all event data 2nd step in the flowchart

Access Rates  Component Powers Power Vector Components Access Rates  Component Powers “Performance Counter based Access Rate estimations are used as proxy for max component power weighting together with microarchitectural details in order to estimate processor sub-unit powers” EX: Trace cache delivers 3 uops/cycle in deliver mode and 1 uop/cycle in build mode: Power(TC)=[Access-Rate(TC)/3 + Access-Rate(ID)] x MaxPower(TC) + Non-gated TC CLK power Total power is computed as the sum of all 22 component powers + measured idle power (8W): 3rd bullet is all about pwr conversion: we weight the comp. max powers with access rates and consider max. number of accesses that would mean max power, but at the end, they all end up as a single weighting factor merged together 1

Counter Access Heuristics Power Vector Components Counter Access Heuristics 1) BUS CONTROL: No 3rd Level cache  BSQ allocations ~ IOQ allocations Metric1: Bus accesses from all agents Event: IOQ_allocation Counts various types of bus transactions Should account for BSQ as well access based rather than duration MASK: Default req. type, all read (128B) and write (64B) types, include OWN,OTHER and PREFETCH Metric2: Bus Utilization(The % of time Bus is utilized) Event: FSB_data_activity Counts DataReaDY and DataBuSY events on Bus Mask: Count when processor or other agents drive/read/reserve the bus Expression: FSB_data_activity x BusRatio / Clocks Elapsed To account for clock ratios processor core puts accesses to BSQ, those that go to front side bus will be put in FSQ_IOQ (in our case all, as no L3)

Counter Access Heuristics Power Vector Components Counter Access Heuristics 2) L2 Cache: Metric: 2nd Level cache references Event: BSQ_cache_reference Counts cache ref-s as seen by bus unit MASK: All MESI read misses (LD & RFO) 2nd level WR misses 3) 2nd Level BPU: Metric 1: Instructions fetched from L2 (predict) Event: ITLB_Reference Counts ITLB translations Mask: All hits, misses & UC hits Metric 2: Branches retired (history update) Event: branch_retired Counts branches retired Count all Taken/NT/Predicted/MissP processor core puts accesses to BSQ, those that go to front side bus will be put in FSQ_IOQ (in our case all, as no L3)

Counter Access Heuristics Power Vector Components Counter Access Heuristics 4) ITLB & I-Fetch: etc……… 10) FP Execution: Metric: FP instructions executed event1: packed_SP_uop counts packed single precision uops event2: packed_DP_uop event3: scalar_SP_uop counts scalar double precision uops event4: scalar_DP_uop event5: 64bit_MMX_uop counts MMX uops with 64bit SIMD operands event6: 128bit_MMX_uop counts integer SSE2 uops with 128bit SIMD operands event7: x87_FP_UOP counts x87 FP uops event8: x87_SIMD_moves_uop counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops processor core puts accesses to BSQ, those that go to front side bus will be put in FSQ_IOQ (in our case all, as no L3) Back

Similarity Based on Total Power Different Similarity Metrics now here is how the rest goes

Similarity Based on Absolute Power Vectors Different Similarity Metrics now here is how the rest goes

Similarity Based on Normalized Power Vectors Different Similarity Metrics now here is how the rest goes

Similarity Based on Both Absolute and Normalized Power Vectors Different Similarity Metrics now here is how the rest goes Back

Similarity Matrix Example 1 2 3 Consider 4 vectors, each with 4 dimensions: Similarity & Grouping Matrices Log all distances in the similarity matrix Color-scale from black to white (only for upper diagonal) 1 2 3 1 2 3 Similarity Matrix Plot: now here is how the rest goes 6 3 7 6 3 7 1 1 1 2 2 2 3 3 3

Interpreting Similarity Matrix Plot Level of darkness at any location (r,c) shows the amount of similarity between vectors –samples– r & c. i.e. 0 & 2 Similarity Matrix Plot: Vertically above the diagonal shows similarity of the sample at the diagonal to previous samples i.e. 1 vs. 0 1 Horizontally right of the diagonal shows similarity of the sample at the diagonal to future samples i.e. 1 vs. 2,3 Similarity & Grouping Matrices 2 All samples are perfectly similar to themselves All (r,r) are black 3 now here is how the rest goes Back

Grouping Matrix Example 1 2 3 Consider same 4 vectors: Mark execution pairs with distance ≤ Threshold Similarity & Grouping Matrices 1 2 3 1 2 3 Grouping Matrix Plots: 6 3 7 6 3 7 1 1 1 Threshold: 10% 2 2 2 now here is how the rest goes 3 3 3 1 2 3 1 2 3 6 3 7 6 3 7 1 1 1 Threshold: 50% 2 2 2 Back 3 3 3

Current & Future Research FOLLOWING SLIDES DISCUSS ONGOING RESEARCH RELATED TO POWER PHASES. PLANS FOR FUTURE RESEARCH ARE ALSO DISCUSSED now here is how the rest goes

Current & Future Research THE BIG PICTURE Performance Monitoring Real Power Measurement Power Modeling Thermal Phases Program Profiling Structure Bottom line… To Estimate component power & temperature breakdowns for P4 at runtime… To analyze how power phase behavior relates to program structure now here is how the rest goes

Current & Future Research Phase Branch Power Phase Behavior Similarity Based on Power Vectors Identifying similar program regions Profiling Execution Flow Sampling process’ execution “PCsampler” LKM Program Structure Execution vs. Code space Power Phases  Exec. Phases NOT YET? Program Profiling Power Modeling Phases Program Profiling Structure Power Phases Program Structure now here is how the rest goes

Current & Future Research POWER PHASE BEHAVIOR Power Phase Behavior Similarity Based on Power Vectors Identifying similar program regions Profiling Execution Flow Sampling process’ execution “PCsampler” LKM Program Structure Execution vs. Code space Power Phases  Exec. Phases NOT YET Power Modeling Phases Program Profiling Structure Power Phases now here is how the rest goes

Identifying Power Phases Current & Future Research Identifying Power Phases Most of the methodology is ready Complete Gzip case in WWC Extensibility to other benchmarks Generated similarity metrics for several Performing phase identification with thresholding Repeatibilty of the experiment Several other possible ideas such as: Thresholding + k-means clustering Two-pass thresholding PCA for dimension reduction (or SVD?) Manhattan = L1 norm Euclidian (L2) – not interesting Chebyschev (Linf) - ?? now here is how the rest goes

Program Execution Profile Current & Future Research Program Execution Profile Power Phase Behavior Similarity Based on Power Vectors Identifying similar program regions Profiling Execution Flow Sampling process’ execution “PCsampler” LKM Program Structure Execution vs. Code space Power Phases  Exec. Phases NOT YET Program Profiling Power Modeling Phases Program Profiling Structure Program Structure now here is how the rest goes

Program Execution Profile Current & Future Research Program Execution Profile Sample program flow simultaneously with power Our LKM implementation: “PCsampler” Not Finished… Generate code space similarity in parallel with power space similarity Relative comparisons of methods for: Complexity Accuracy Applicability, etc. now here is how the rest goes

Current & Future Research CURRENT STATE Sample PC  Binding to functions Reacquire PID Those SPECs, Runspec: always in fixed address at ELF_program interpreter Benches: change pid between datasets Verify PC with objdump So we can make sure it is the PC we’re sampling now here is how the rest goes

Initial Data: PC?? Trace For gzip-source Current & Future Research Initial Data: PC?? Trace For gzip-source Correspond to: <send_bits> <bit_reverse> <lm_init> <longest_match> <fill_window> functions now here is how the rest goes Back

DiscussionFrom here on Generality of the technique All Spec benchmarks show distinct phase behavior: Repeatability of the experiment Need to be able to arrive at similar phase behavior in order to characterize an application Correlation between vector components Inherent redundancy in power vectors Could be removed with PCA Alternative norms for similarity Applicability of selected execution points Questions & Rebuttals Some questions we need to answer for our method first 2 are important last 2 are not so much

Similarity Analysis for Other Applications? SPECs show similar applicable behavior Not always phase-like, i.e. twolf has more like a power gradient Results for other benchmarks: <NOT READY> Gcc & Twolf: # of groups w.r.t. thresholds Errors plots for reconstructed & selected vectors Apply to other applications: Desktop applications Will follow the bursty behavior, maybe determine action signatures?? Saving, computation, streaming, etc. Ghostscript might be interesting Correlation between phases vs. locations of images Questions & Rebuttals now here is how the rest goes

Dependence of Results to the Applied Power Model? The generic technique requires only sufficiently detailed power breakdowns that add up to total power Doesn’t matter how you acquire the ‘power vectors’ otherwise If you use some other characterization data other than power vectors Can still perform phase analysis, but cannot provide a direct estimation for reconstructed power behavior Maybe use some kind of mapping? Log measured power data as well as characterization metrics? Would still be unable to predict component-wise Questions & Rebuttals now here is how the rest goes

Possibility for Other Processors? Most recent processors are keen on power management There will be enough power variability to exploit for power phase analysis Porting the power estimation to other architectures Requires significant effort to Define power related metrics Implement counter reader and power estimation user and kernel SW Porting to same architecture, different implementation More straightforward Reevaluate max/idle/gated power estimates Experiences with other architectures: Castle project for Pentium Pro (P6) Few watts of variation Low dimensionality IBM Power3 II Very low measured power variation Questions & Rebuttals now here is how the rest goes

Other Statistical Techniques? Alternative measures of distance: Different norms Canberra Distance Squared Chi-squared distance, etc. Other similarity metrics: Pearson’s correlation coefficient Cosine similarity Questions & Rebuttals now here is how the rest goes

SPEClite vs. SimPoints vs. Power Vectors Approaches of 3 methods: SPEClite: Performance counts  (runtime)  (Perf. Oriented)  (k-means partitioning for vectors normalized to 0 mean and unit variance)  (PCA to create reduced vectors)  (selects vectors closest to centroids) SimPoints: Basic block accesses  (Simulation)  (Perf. Oriented)  (k-means partitioning for normalized basic block vectors)  (selects vectors closest to centroids –except for ‘Early’ Simpoints) Power Vectors: Component power consumptions  (Runtime)  (Power Oriented)  (threshold based partitioning for ‘normalized + absolute’ vectors)  (Selects earliest vector of each group) Questions & Rebuttals now here is how the rest goes

Why Power Vectors w.r.t. Others? Provides a direct interpretation for power consumption Could be used to identify specific power behavior for dynamic power/thermal management Power phases might be not a perfect translation of performance phases <CURRENT WORK INVESTIGATES> I.e. same basic block accesses during different architectural states Generated at runtime Easy repeatability, etc. Thresholding provides upper bound estimate for the power approximation with selected execution points Questions & Rebuttals 2) Memory references? FWDing, speculation? Or like blocked process still holding bus power

FOLLOWING SLIDES I MODIFIED FROM THE ORIGINAL, BUT STILL KEEP ‘EM Modified Stuff FOLLOWING SLIDES I MODIFIED FROM THE ORIGINAL, BUT STILL KEEP ‘EM

Our Power Phase Analysis Goal: Identify phases in program power behavior Determine execution points that correspond to these phases Define small set of power signatures that represent overall power behavior Our Approach: Collect samples of estimated power values for processor sub-units <Power Vectors> at application runtime Define a similarity metric regarding these power vectors Process the outcome of this similarity metric to group sampled execution points (/Power Vectors) into phases Determine execution points and representative signature vectors for each phase group Quantify the closeness of our approximation based on these vectors to original power behavior Basically, what we’ll talk about today

Motivation Characterizing power behavior: Utilizing power vectors: Future power-aware architectures and applications Dynamic power/thermal management Multiconfigurable hardware Thread scheduling, DVS, DFS Recurring phase prediction Architecture research Representative –reduced– simulation points Utilizing power vectors: Direct relation to actual processor power consumption Acquired at runtime  Similarity relations generated quickly Easy repeatability for different datasets/compilations Identify (recurring) phases over large scales of execution Identify program phases with no knowledge of application (i.e. no basic block profile, PC sampling, code space info, etc.)

Power Vector Similarity Metric How to quantify the ‘power behavior dissimilarity’ between two execution points? Consider solely total power difference  Conceals significant phase information Consider manhattan distance between the corresponding 2 vectors  Vectors with small magnitudes are inherently closer Consider manhattan distance between the corresponding 2 vectors normalized  Indifferent to magnitude of power consumption Consider a combination of (2) & (3)  Restricts us to both absolute and ratio-wise similarities Construct a “similarity matrix” to represent similarity among all pairs of execution points Each entry in the similarity matrix: Consider solely total power difference  Conceals significant phase information Consider manhattan distance between the corresponding 2 vectors  Vectors with small magnitudes are inherently closer Consider manhattan distance between the corresponding 2 vectors normalized  Indifferent to magnitude of power consumption Consider a combination of (2) & (3)  Restricts us to both absolute and ratio-wise similarities

Gcc & Gzip Matrix Plots Gzip Elaboration: Much regular power behavior Spurious similarities such as 100-150s and 200-280 are distinguished by the similarity analysis 0.44 Gcc Elaboration: Very variant power Almost identical power behavior at 30, 50, 180s. Although 88s, 110s, 140s, 210s and 230s show similar total power; 88, 210 and 230 share higher similarity. 44 44.44 88 132 88.44 176 220 132.44 264 Timeline [s] Timeline [s] 176.44 308 352 396 220.44 440 264.44

Similarity for Simplicity? So, we can identify similar power phases: I.e. informally: if similarity matrix(r,c) is DARK  Execution points r & c have similar power behavior 2 Questions: 1) How do we group the execution points (power vectors) based on their similarity? 2) Could we represent power behavior with reasonable accuracy, with a small number of ‘signature’ vectors? Our answer to Q.1: “Thresholding Algorithm”: Define a threshold of similarity < % of max dissimilarity> Start from first execution point (0,0) and identify ones in the fwd execution path that lie within threshold for both normalized and absolute metrics Tag the corresponding execution points (j,j) as the same group Find next untagged execution point (r,r,) and do the same along fwd path Rule: A tagged execution point cannot add new elements to its group! We demonstrate the outcome of thresholding with Grouping Matrices

Gzip Grouping Matrices # of Groups: 9 # of Groups: 4 # of Groups: 3 # of Groups: 1 # of Groups: 15 Original Similarity Matrix # of Groups: 33 # of Groups: 909 # of Groups: 254 # of Groups: 70 Also shows for 10%, 100-150 & 200-280s regions are discriminated Gzip has 974 power vectors Cluster vectors based on similarity using “thresholding” Max Gzip power dissimilarity: 47.35W

Conclusion Presented a power oriented methodology to identify program phases that uses power vectors generated during program runtime Provided a similarity metric to quantify power behavior similarity of different execution samples Demonstrated our representative sampling technique to characterize program power behavior Representative vectors for program power signatures Execution points for representative simulation Can be useful for power & characterization research: Power Phase identification/prediction Reduced power simulation Dynamic power/thermal management Some questions we need to answer for our method first 2 are important last 2 are not so much