Presentation on theme: "CR18: Advanced Compilers L08: Memory & Power Tomofumi Yuki."— Presentation transcript:

1 CR18: Advanced Compilers L08: Memory & Power Tomofumi Yuki

2 Memory Expansion
- Recall Array Dataflow Analysis: start from loops, get value-based dependences; corresponds to Alpha = no notion of memory
- It is sometimes called Full Array Expansion: explicit dependences with single assignment, full parallelism exposed

3 Memory vs Parallelism
- More parallelism requires more memory; obvious example: scalar accumulation (see the sketch below)
- One approach: ignore the problem by using memory-based dependences
- Alternatively, we can try to find a memory allocation afterwards
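
A minimal sketch of what expansion does to the scalar-accumulation example; the names N, a, s_x are our stand-ins, not from the slides. Expanding the scalar makes every iteration write a distinct cell (single assignment), exposing the dependence structure at the cost of O(N) memory:

    // Hypothetical illustration in C.
    double sum_expanded(int N, const double a[]) {
        // Before expansion (memory-based dependence on one scalar):
        //   s = s + a[i];   // every iteration reads and writes the same cell
        // After full array expansion: single assignment, one cell per iteration.
        double s_x[N + 1];            // O(N) memory instead of O(1)
        s_x[0] = 0.0;
        for (int i = 0; i < N; i++)
            s_x[i + 1] = s_x[i] + a[i];   // value-based dependence is explicit
        return s_x[N];
    }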

4 Memory Allocation
- Given a schedule:
  - Memory Reuse Analysis [1996]
  - Lefebvre-Feautrier [1998]
  - Quilleré-Rajopadhye [2000]
  - Lattice-Based [2005]
- For a set of schedules:
  - Universal Occupancy Vectors [1998]
  - Affine Universal Occupancy Vectors [2001]
  - Quasi-Universal Occupancy Vectors [2013]

5 Occupancy Vectors
- Main Concept: a vector (in the iteration space) pointing to another iteration that can safely overwrite the value produced at the current one
- Universal OV: an OV that is legal for any schedule
- The affine and quasi- variants restrict the universe of schedules to a smaller subset

6 Universal Occupancy Vectors
- Only for uniform dependences: all iterations have the same dependence pattern, and the domain is large enough (no thin strips)
- Key Idea: Transitivity. Some iteration z can overwrite z' if z depends, possibly transitively, on all uses of z'

7 UOV Example
- Find a UOV for the following
[figure: 2-D iteration space with axes i and j]

8 UOV Example
- Find a UOV for the following: is [1,1] a valid UOV? How does it translate to a memory mapping? (see the sketch below)
[figure: 2-D iteration space with axes i and j]
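
A minimal sketch of the mapping, assuming the running example has the uniform dependences (1,0) and (0,1); the bounds N, M and the stencil foo are our stand-ins. With occupancy vector v = [1,1], iterations z and z + v may share a memory cell, so one cell per diagonal i - j suffices: N + M - 1 cells instead of N * M.

    double foo(double up, double left);   // stand-in for the computation

    void stencil_uov(int N, int M, double A[N + M - 1]) {
        // cell index for iteration (i,j): i - j + (M - 1)
        // boundary cells (i = 0 or j = 0) assumed initialized by the caller
        for (int i = 1; i < N; i++)
            for (int j = 1; j < M; j++)
                A[i - j + M - 1] = foo(A[(i - 1) - j + M - 1],  // value of (i-1, j)
                                       A[i - (j - 1) + M - 1]); // value of (i, j-1)
    }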

9 UOV Example
- Find a UOV for the following: how about [1,0]?
[figure: 2-D iteration space with axes i and j]

10 UOV Example
- Find a UOV for the following
[figure: 2-D iteration space with axes i and j]

11 UOV Example
- Alternative Formulation: as the intersection of transitive closures
[figure: 2-D iteration space with axes i and j]

12 Affine UOV Example
- Restrict to affine schedules, but allow affine dependences
[figure: 2-D iteration space with axes i and j]

13 Relevance of UOVs
- A UOV allocates a (d-1)-dimensional array for a d-dimensional iteration space
- Does this sound like a problem?
- What can you say about programs with only uniform dependences?
- How does this relate to tiling?

14 Memory Allocation/Contraction
- We are given an affine schedule θ per statement, possibly multi-dimensional
- Problem: find affine pseudo-projections (an affine function + modulo factors) per statement, usually minimizing memory usage

15 Pseudo Projection
- Assume lexicographic order as the schedule: what is a valid OV?
[figure: 2-D iteration space with axes i and j]

16 Pseudo Projection
- Assume lexicographic order as the schedule: what is a valid OV?
- [0,2], which translates to:

    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++)
            A[i % 2][j] = foo(A[(i - 1) % 2][j], A[i % 2][j - 1]);

[figure: 2-D iteration space with axes i and j]

17 Allocation vs Contraction
- Most programs have many more statements than arrays
- Memory allocation techniques: map each statement to its own array, then try to merge arrays afterwards
- Array contraction techniques: keep the original statement-to-array mapping
- Little difference in the theory behind them

18 Liveness of Values
- The central analysis in memory allocation; known as liveness analysis in register allocation (the same idea goes by different names)
- Given a value computed at S(z) and used by T(z'): we cannot overwrite the value of S(z), written at θ(S,z), until θ(T,z'), for all T, z'

19 Computing the Liveness
- How do we compute the liveness?
- Example schedules: θ(i,j) = i and θ(i,j) = i+j
[figure: 2-D iteration space with axes i and j]
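
A worked instance, assuming (our assumption, as in the running example) the value at (i,j) is read at (i+1,j) and (i,j+1): under θ(i,j) = i, the value is written at time i and last read at time θ(i+1,j) = i+1, so it is live over [i, i+1], a reuse distance of 1 time step. Under θ(i,j) = i+j, the live range is [i+j, i+j+1], the same distance, but each time step is now an anti-diagonal wavefront rather than a whole row.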

20 Lefebvre-Feautrier
- How to find the allocation? Example schedule: θ(i,j) = i
  1. Start with a scalar
  2. Expand in a dimension
  3. Use the max reuse distance as the modulo factor
[figure: 2-D iteration space with axes i and j]

21 Lefebvre-Feautrier
- Alternative description:
  1. Start with the full array
  2. Project in a dimension
  3. Compute the modulo factor
[figure: 2-D iteration space with axes i and j]

22 Quilleré-Rajopadhye
- Based on non-canonic projections
- Main Result: optimality
- For a d-D space, if you find x independent projections, what can you say about memory usage?

23 Lattice-Based Allocation
- A different formulation, using lattices
- Consider some basis of an integer lattice
[figure: 2-D iteration space with axes i and j]

24-25 Lattice-Based Allocation (animation steps of the same figure)

26 Lattice-Based Allocation
- Lattices ≈ Occupancy Vectors
- Conflict Set: values that cannot be mapped to the same memory location
- Find the smallest lattice that is large enough to intersect the conflict set only at its base; enumerate the space using the HNF (Hermite Normal Form)

28 Energy-Aware Compilation
- The Power Wall

29 Power Density
- Led to multi-core
- Saving energy is important: a barrier for exa-scale computing, battery lifetime of laptops
- Compiler optimization has focused on speed; is there anything compilers can do for energy? (speed is still important)

30 Starting Hypothesis
- Energy is power consumed over time: E = PT
  - P: power consumption, E: energy consumption, T: execution time
- Faster execution time = lower energy consumption
- Hypothesis: optimizing for speed also optimizes energy
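
A made-up numeric illustration (the numbers are ours, not from the slides): at P = 50W and T = 10s, E = 500J. An optimization that brings T to 8s at P = 55W gives E = 440J, faster and lower energy; but if the faster version draws 70W, E = 560J, faster yet higher energy. The hypothesis holds only when power does not grow faster than time shrinks.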

31 Single Processor Case
- Two main categories:
  - Purely program transformations: efficient use of the data cache; energy-aware compilation framework
  - Dynamic Voltage and Frequency Scaling (DVFS): profile based; loop transformation + DVFS

32 Efficient Use of the Data Cache [D'Alberto et al., 2001]
- HW with configurable cache line size (CLS)
- Trade-off: a larger CLS gives better spatial locality but higher interference
- Main Contribution: a model to maximize the hit ratio; configurable CLS leads to an energy trade-off
- In GP processors, data locality optimization ≈ energy optimization of the cache

33 Energy-Aware Compilation [Kadayif et al., 2002]
- A compiler framework with energy in mind, based on predicting power consumption from high-level source code
- Energy-aware tiling: the optimal tiling strategy for speed != for energy; key: tiling adds instructions
- Main weakness: the improvement is relatively small (~10%), and energy is traded against speed

34 Results by Kadayif et al.
- Increase in energy/execution cycles when optimized for the other
- The energy-delay product would not change much

             fir    conv   lms    real   biquad complex mxm    vpenta
    Energy   4.1%   7.7%   6.8%   3.9%   2.0%   8.8%    5.9%   7.3%
    Cycle    5.9%   8.7%   7.2%   2.9%   2.3%   7.6%    9.2%   6.8%

35 HW for Further Optimization
- Dynamic Voltage and Frequency Scaling (DVFS)
- Power consumption model for CMOS: P_dynamic ∝ α · V² · f
  - V: supply voltage, f: frequency, α: activity rate
- Voltage is the obvious target: high frequency requires high voltage; quadratic energy savings with reduced frequency
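
Why the savings are quadratic (our derivation, under the standard assumption that supply voltage scales roughly linearly with frequency):

    P_dynamic ∝ α · V² · f,  V ∝ f  =>  P_dynamic ∝ f³
    T ∝ 1/f                          =>  E_dynamic = P · T ∝ f²

Halving the frequency therefore cuts dynamic energy to about a quarter, at twice the execution time.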

36 DVFS: Main Idea
- Identify non-compute-intensive stages: frequency/voltage can be reduced without influencing speed when the processor is under-utilized
- DVFS states are coarse grained: ~10 different frequency/voltage configurations
- State transitions are not free: 100s of cycles, extra energy consumed

37 DVFS: Single Processor
- Profile based [Hsu and Kremer 2003, Hsu and Feng 2005]
  - Profile to identify opportunities; compile-time vs. run-time; limited by available opportunities
- Loop transformation [Ghodrat and Givargis 2009]
  - First optimize for speed, then convert the speedup into energy savings; transformation to expose opportunities

38 DVFS: Single Processor, Task-Based Programs [Jimborean et al. 2014]
- Main Idea: Decoupled Access/Execute
- A compiler transformation splits the program into tasks: one does the memory Accesses to fetch data, another does the Execute to compute (see the sketch below)
- Apply DVFS: low frequency for Access, high frequency for Execute
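
A minimal sketch of the access/execute split, not the paper's actual transformation; the chunk size, names, and the use of __builtin_prefetch (a GCC/Clang builtin) are our assumptions:

    #include <stddef.h>

    double heavy_compute(double x);   // stand-in for the real kernel

    // Process data in chunks: an Access phase that only prefetches (bound by
    // memory latency, so it tolerates a low core frequency) and an Execute
    // phase that only computes (wants a high frequency).
    void access_execute(const double *in, double *out, size_t n) {
        const size_t CHUNK = 4096;
        for (size_t base = 0; base < n; base += CHUNK) {
            size_t end = base + CHUNK < n ? base + CHUNK : n;

            // Access task: pull the chunk into cache, one prefetch per
            // cache line (8 doubles = 64 B). DVFS: low frequency here.
            for (size_t i = base; i < end; i += 8)
                __builtin_prefetch(&in[i]);

            // Execute task: compute on now-cached data. DVFS: high frequency.
            for (size_t i = base; i < end; i++)
                out[i] = heavy_compute(in[i]);
        }
    }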

39 Single Processor: Summary
- Purely software-based optimization: no significant gains over optimizing for speed; the hypothesis holds in this case
- DVFS-based approaches: HW for energy savings exposed to software; identify when the processor is not fully utilized; HW support breaks the hypothesis

40 Across Processors
- Parallelization is necessary to utilize modern architectures
- How does parallelism affect energy?
  - Amdahl's Law for energy
  - Opportunities in parallel programs

41 Static Power
- A new term in the power model: some power is consumed even when idle
  - P = P_static + P_dynamic, with P_static ∝ V · I (I: leakage current)
- DVFS has less effect
- Static power is reaching 50% of the total power

42 Amdahl's Law for Energy [Cho and Melhem 2008]
- A simple model of energy and parallelism; processors have DVFS
- Simple, but more complicated than the original
- Speed-up vs. energy trade-off analysis
- [equation with sequential dynamic, parallel dynamic, and static terms]
  - s: sequential fraction, p: parallel fraction, N: number of processors, λ: static power, y: power consumption as a function of frequency
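
The slide's equation is not reproduced in this transcript. A sketch consistent with the legend (our reconstruction, not necessarily the paper's exact form), with the sequential part on one processor and the parallel part on N processors, all at frequency f:

    T(f) = s/f + p/(N·f)
    E(f) = y(f)·s/f           (seq dynamic)
         + N·y(f)·p/(N·f)     (parallel dynamic)
         + λ·N·T(f)           (static)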

43 Illustrating Example from the Paper
[figure: energy vs. frequency]

44 When Static Power is 50%
[figure: energy vs. frequency]

45 Static Power Dominates
- Static power is significant, and it increases as N increases; excessive processors are bad
- With current technology (high static power and increasing core counts), running as fast as possible is a good way to save energy

46 Generalizing a Bit Further
- Analysis based on a high-level energy model, with emphasis on the power breakdown: find when "race-to-sleep" is the best
- Survey the power breakdown of recent machines
- Goal: confirm that sophisticated use of DVFS by compilers is not likely to help much (e.g., analysis/transformation to find/expose a "sweet spot" for trading speed for energy)

47 Power Breakdown
- Dynamic (P_d): consumed when bits flip; quadratic savings as voltage scales
- Static (P_s): leaked while current is flowing; linear savings as voltage scales
- Constant (P_c): everything else (e.g., memory, motherboard, disk, network card, power supply, cooling, ...); little or no effect from voltage scaling

48 Influence on Execution Time
- Voltage and frequency are linearly related; the slope is less than 1 (i.e., scaling voltage by half drops frequency by less than half)
- Simplifying assumptions: frequency changes directly influence execution time (scale frequency by x, time becomes 1/x); fully flexible (continuous) scaling, versus the small set of discrete states in practice

49 The P_d : P_s : P_c Ratio is the Key
- Case 1, Dynamic dominates: slowing down, power drops faster than time grows; Energy decreases: slower the better
- Case 2, Static dominates: the power drop cancels the time growth; Energy is flat: no harm, but no gain
- Case 3, Constant dominates: power barely drops while time grows; Energy increases: faster the better
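
A back-of-the-envelope derivation (ours, using the previous slide's assumptions: V ∝ f and time ∝ 1/f) that reproduces the three cases. Let k ≤ 1 be the frequency scaling factor:

    P(k) = P_d·k³ + P_s·k + P_c        T(k) = T/k
    E(k) = P(k)·T(k) = T·(P_d·k² + P_s + P_c/k)

The P_d term shrinks quadratically as k decreases (slower the better), the P_s term cancels against the longer time (no harm, no gain), and the P_c term grows as 1/k (faster the better).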

50 When Do We Have Case 3?
- Static power is now more than dynamic power; power gating doesn't help while computing
- Assume P_d = P_s: 50% of CPU power is due to leakage, which roughly matches 45nm technology; further shrinks mean even more leakage
- The borderline is P_d = P_s = P_c; we have Case 3 when P_c is larger than P_d = P_s

51 Extensions to the Model
- Impact on execution time: it may not be directly proportional to frequency, which shifts the borderline in favor of DVFS (a larger P_s and/or P_c is required for Case 3)
- Parallelism: no influence on the result; CPU power is even less significant than in the 1-core case, since the power budget of a chip is shared (multi-core) and network cost is added (distributed)

52 Do We Have Case 3?
- Survey of machines and the significance of P_c, based on published power budgets (TDP) and published power measurements, not on detailed/individual measurements
- Conservative assumptions: use an upper bound for CPU power, use a lower bound for constant powers, and assume high PSU efficiency

53 P_c in Current Machines
- Sources of constant power:
  - Stand-by memory (1W/1GB): memory cannot go idle while the CPU is working
  - Power supply unit (10-20% loss): transforming AC to DC
  - Motherboard (6W)
  - Cooling fan (10-15W): fully active when the CPU is working
- Desktop processor TDP ranges from 40-90W, up to 130W for large core counts (8 or 16)

54 Server and Desktop Machines
- Methodology: compute a lower bound of P_c; does it exceed 33% of total system power? Then Case 3 holds even if all the rest were consumed by the processor
- System load: desktop uses compute-intensive benchmarks; server uses server workloads (not as compute-intensive)

55 Desktop and Server Machines
[table: surveyed desktop and server machines and their P_c lower bounds]

56 Cray Supercomputers
- Methodology: let P_d + P_s be the sum of processor TDPs; let P_c be the sum of PSU loss (5%), cooling (10%), and memory (1W/1GB)
- Check if P_c exceeds P_d = P_s
- Two cases for memory configuration (min/max)

57-59 Cray Supercomputers
[figures: survey results for the Cray machines]

60 DVFS for Memory
- Still in the research stage (since ~2010): the same principle applied to memory
- There is a quadratic component in power w.r.t. voltage: roughly 25% quadratic, 75% linear
- The model can be adapted: P_d becomes P_q (dynamic to quadratic), P_s becomes P_l (static to linear); the same story, but with P_q : P_l : P_c

61 Influence on "Race-to-Sleep"
- Methodology: move memory power from P_c to P_q and P_l (25% to P_q, 75% to P_l)
- P_c becomes 15% of total power for Server/Cray, so "race-to-sleep" may not be the best anymore; it remains around 30% for desktop
- Vary the P_q : P_l ratio to find when "race-to-sleep" is the winner again; leakage is expected to keep increasing

62 When "Race-to-Sleep" is Optimal
- When the derivative of energy w.r.t. scaling is > 0
[figure: dE/dF plotted against the linearly scaling fraction P_l / (P_q + P_l)]

63 Summary and Conclusion
- Diminishing returns of DVFS; the main reason is leakage power
- Confirmed by a high-level energy model: "race-to-sleep" seems to be the way to go, and memory DVFS won't change the big picture
- Compilers can continue to focus on speed: no significant gain in energy efficiency by sacrificing speed

64 Balancing Computation and I/O
- DVFS can improve energy efficiency when speed is not sacrificed: bring the program to a compute-I/O balanced state
  - If it's memory-bound, slow down the CPU; if it's compute-bound, slow down the memory
- Still maximizing hardware utilization, but by lowering the hardware capability
- Current hardware (e.g., Intel Turbo Boost) and/or the OS already do this for the processor

66 The Punch Line Method
- How to punch your audience, i.e., how to attract your audience
- Make your talk more effective; learned from Michelle Strout (Colorado State University); applicable to any talk
[figure: audience rating (poor to excellent) for a Normal Talk vs. a Punch Line Talk]

67 The Punch Line
- The key cool idea in your paper, the key insight
- It is not the key contribution ("X% better than Y", "does well on all benchmarks")
- Examples: "... because of HW prefetching", "... further improves locality after reaching compute-bound"

68 Typical Conference Audience
- Many things to do: check emails, browse websites, finish their own slides
- Attention level (made-up numbers): ~3 minutes: 90%, ~5 minutes: 60%, 5+ minutes: 30%, conclusion: 70%
- Punch while attention is still high, and push these numbers up!

69 Typical (Boring) Talk
1. Introduction
2. Motivation
3. Background
4. Approach
5. Results
6. Discussion
7. Conclusion

70 Punch Line Talk
- Two talks in one:
  - 5 minute talk: introduction/motivation, key idea ... the punch
  - X-5 minute talk: add some background, elaborate on the approach ... the shortest path to the punch

71 Pitfalls of Beamer
- Beamer != bad slides, but it is an easy path to them
- Checklist for good slides: no full sentences, LARGE font size, few equations, many figures, don't mirror the paper's structure
- Beamer is not the best tool to encourage these

