Published by Audra Hutchinson. Modified over 9 years ago.
1
Computer Science
Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs
Min Yeol Lim
Computer Science Department
Sep. 8, 2006
2
Growing energy demand
Energy efficiency is a big concern
–Increased power density of microprocessors
–Cooling cost for heat dissipation
–Power and performance tradeoff
Dynamic voltage and frequency scaling (DVFS)
–Supported by newer microprocessors
–Cubic drop in power consumption: power ∝ frequency × voltage²
–CPU is the major power consumer: 35~50% of total power
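As a rough illustration of the cubic drop, the sketch below evaluates power ∝ frequency × voltage² under the assumption (not stated on the slide) that voltage scales linearly with frequency. The function name and the linear-voltage model are hypothetical; real processors use a discrete voltage per p-state.

```python
def relative_power(freq_mhz, top_freq_mhz):
    """Dynamic power at freq_mhz relative to power at top_freq_mhz.

    Assumes power is proportional to frequency * voltage**2 and that
    voltage scales linearly with frequency (a simplifying assumption).
    """
    scale = freq_mhz / top_freq_mhz      # frequency ratio
    voltage_scale = scale                # assumed: voltage tracks frequency
    return scale * voltage_scale ** 2    # f * V^2, hence cubic in scale


if __name__ == "__main__":
    for f in (2000, 1800, 1600, 1400, 1200, 1000, 800):
        print(f, "MHz ->", round(relative_power(f, 2000), 3))
```

Under this model, halving the frequency would cut dynamic power to one eighth, which is why even a modest slowdown can save a large share of CPU power.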
3
Power-performance tradeoff
Cost vs. Benefit
–Power vs. performance
–Increasing execution time vs. decreasing power usage
–CPU scaling is meaningful only if benefit > cost
[Figure: power vs. time rectangles for E = P1 * T1 and E = P2 * T2, with the differing areas labeled "Benefit" and "Cost"]
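The benefit > cost test can be sketched as a comparison of two energy areas. This is an interpretation of the figure, not the author's code: it assumes the benefit is the energy saved during the original run time T1 and the cost is the energy spent during the extra time T2 - T1. The function name is hypothetical.

```python
def scaling_tradeoff(p1, t1, p2, t2):
    """Compare running at (power p1, time t1) with (power p2, time t2).

    Assumes p2 <= p1 (reduced power) and t2 >= t1 (longer run).
    Returns (benefit, cost, worthwhile) where
      benefit = energy saved over the original time t1,
      cost    = energy spent in the extra time t2 - t1.
    """
    benefit = (p1 - p2) * t1
    cost = p2 * (t2 - t1)
    return benefit, cost, benefit > cost


if __name__ == "__main__":
    # Illustrative numbers only (watts, seconds), not measurements.
    print(scaling_tradeoff(100, 10, 60, 12))
    print(scaling_tradeoff(100, 10, 90, 12))
```

With these definitions, benefit > cost is algebraically the same as P2 * T2 < P1 * T1, so the check reduces to asking whether total energy went down.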
4
Power-performance tradeoff (cont'd)
Cost > Benefit
–NPB EP benchmark
  CPU-bound application
  CPU is on the critical path
Benefit > Cost
–NPB CG benchmark
  Memory-bound application
  CPU is NOT on the critical path
[Figure: plots at CPU frequencies 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, and 0.8 GHz]
5
Motivation 1
Cost/benefit is code specific
–Applications have different code regions
–Most MPI communication is not CPU-critical
P-state transition in each code region
–High voltage and frequency in CPU-intensive regions
–Low voltage and frequency in MPI communication regions
6
Time and energy performance of MPI calls
[Figures: MPI_Send and MPI_Alltoall]
7
Motivation 2
Most MPI calls are too short
–A p-state change per call incurs scaling overhead
–A p-state transition takes up to 700 microseconds
Make regions from adjacent calls
–Intervals between MPI calls are small
–P-state transition occurs once per region
[Figures: fraction of calls vs. call length (ms); fraction of intervals vs. MPI call interval (ms)]
8
Reducible regions
[Figure: timeline of execution alternating between user code and the MPI library; MPI calls A through J are grouped into regions R1, R2, and R3]
9
Reducible regions (cont'd)
Thresholds in time
–close-enough (τ): time distance between adjacent calls
–long-enough (λ): region execution time
[Figure: timeline of user code and MPI library calls A through J, annotated with δ < τ, δ > λ, and δ < τ]
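The two thresholds can be illustrated with a small offline sketch that groups calls from a recorded trace. This is not the adaptive, on-the-fly method the slides describe; it only shows how τ and λ would partition a known timeline. The function name, the (start, end) tuple format, and the sample trace are assumptions.

```python
def find_reducible_regions(calls, tau, lam):
    """Group MPI calls into reducible regions.

    calls: list of (start, end) times, sorted by start, same unit as tau/lam.
    A call joins the current region if the gap since the previous call
    is below tau ("close enough"); a region is kept only if its total
    duration exceeds lam ("long enough").
    """
    regions = []
    if not calls:
        return regions
    cur_start, cur_end = calls[0]
    for start, end in calls[1:]:
        if start - cur_end < tau:            # close enough: extend region
            cur_end = end
        else:                                # gap too large: close region
            if cur_end - cur_start > lam:    # keep only long-enough regions
                regions.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_end - cur_start > lam:
        regions.append((cur_start, cur_end))
    return regions


if __name__ == "__main__":
    # Made-up trace in milliseconds.
    trace = [(0, 4), (5, 12), (40, 41), (100, 108), (110, 125)]
    print(find_reducible_regions(trace, tau=10, lam=10))
```

In this made-up trace the isolated 1 ms call is dropped because it is not long enough, while the two clusters of adjacent calls each form one region.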
10
How to learn regions
Region-finding algorithms
–by-call
  Reduce only in MPI code: τ=0, λ=0
  Effective only if a single MPI call is long enough
–simple
  Adaptive 1-bit prediction that looks up the call's last behavior
  2 flags: begin and end
–composite
  Save patterns of MPI calls in each region
  Memorize the begin/end MPI calls and the number of calls
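The idea of 1-bit prediction from last behavior can be sketched as a lookup table keyed by call identifier. This is a generic last-value predictor and only covers the begin flag; the class and method names are hypothetical and the actual bookkeeping in the simple algorithm may differ.

```python
class LastBehaviorPredictor:
    """Predict that a call begins a region iff it did so last time."""

    def __init__(self):
        self.begin_flag = {}   # call id -> did it begin a region last time?

    def predict_begin(self, call_id):
        # A call never seen before is predicted not to begin a region.
        return self.begin_flag.get(call_id, False)

    def update(self, call_id, began_region):
        # Remember only the most recent outcome (one bit per call).
        self.begin_flag[call_id] = began_region


if __name__ == "__main__":
    p = LastBehaviorPredictor()
    print(p.predict_begin("A"))   # first appearance
    p.update("A", True)
    print(p.predict_begin("A"))   # after one observation
```

Because an unseen call is predicted not to begin a region, the first appearance of a region is always missed, and a call whose behavior changes is mispredicted once before the flag catches up.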
11
P-state transition errors
False-positive (FP)
–P-state is reduced in a region where the top p-state must be used
–e.g. regions terminated earlier than expected
False-negative (FN)
–Top p-state is used in a reducible region
–e.g. regions in first appearance
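The two error types follow directly from comparing what the region needed with what the system did. A minimal sketch, with a hypothetical function name and string labels:

```python
def classify_transition(region_is_reducible, pstate_was_reduced):
    """Label one decision as 'FP', 'FN', or 'OK'.

    FP: p-state was reduced where the top p-state was required.
    FN: top p-state was kept inside a reducible region.
    """
    if pstate_was_reduced and not region_is_reducible:
        return "FP"
    if region_is_reducible and not pstate_was_reduced:
        return "FN"
    return "OK"


if __name__ == "__main__":
    print(classify_transition(region_is_reducible=False, pstate_was_reduced=True))
    print(classify_transition(region_is_reducible=True, pstate_was_reduced=False))
```

An FP costs execution time because CPU-critical code runs slowly, while an FN only wastes an energy-saving opportunity.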
12
P-state transition errors (cont'd)
[Figure: program execution alternating between user code and the MPI library for the call sequence A A A B B B, with top/reduced p-state timelines for the optimal transition, Simple, and Composite; FN errors are marked]
13
P-state transition errors (cont'd)
[Figure: program execution alternating between user code and the MPI library for the call sequence A A A A A A, with top/reduced p-state timelines for the optimal transition, Composite, and Simple; FN and FP errors are marked]
14
Selecting proper p-state
automatic algorithm
–Use composite algorithm to find regions
–Use hardware performance counters
  Evaluation of CPU dependency in reducible regions
  A metric of CPU load: micro-operations/microsecond (OPS)
–Specify p-state mapping table

OPS            Frequency
> 2000         2000 MHz
1000 ~ 2000    1800 MHz
400 ~ 1000     1600 MHz
200 ~ 400      1400 MHz
100 ~ 200      1200 MHz
< 100          800 MHz
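The mapping table translates directly into a lookup. The sketch below is one possible encoding; the slide does not say which range owns a boundary value such as 1000, so treating each lower bound as inclusive is an assumption, as is the function name.

```python
def select_frequency_mhz(ops):
    """Map a measured OPS value (micro-operations per microsecond)
    to a CPU frequency in MHz, following the slide's table.

    Assumption: each range includes its lower bound, so 1000 maps
    to 1800 MHz and exactly 2000 also maps to 1800 MHz.
    """
    if ops > 2000:
        return 2000
    for lower_bound, mhz in ((1000, 1800), (400, 1600), (200, 1400), (100, 1200)):
        if ops >= lower_bound:
            return mhz
    return 800   # OPS below 100: lowest frequency


if __name__ == "__main__":
    for ops in (2500, 1500, 700, 300, 150, 50):
        print(ops, "->", select_frequency_mhz(ops), "MHz")
```

A lower OPS value means the CPU is doing less work per unit time inside the region, so a lower frequency costs less performance there.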
15
Implementation
Use PMPI
–MPI profiling interface
–Intercept pre and post hooks of any MPI call transparently
MPI call unique identifier
–Use the hash value of all program counters in call history
–Insert assembly code in C
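The call identifier can be illustrated by hashing a list of return addresses. In the described system these would come from walking the call stack with inline assembly in C; here they are passed in as plain integers, and the choice of SHA-1 and the function name are assumptions made for the sketch.

```python
import hashlib


def call_id(program_counters):
    """Hash a sequence of program counters into one identifier.

    program_counters: iterable of non-negative ints below 2**64,
    e.g. the return addresses on the call stack at an MPI call.
    Two calls from the same call path get the same identifier.
    """
    digest = hashlib.sha1()
    for pc in program_counters:
        digest.update(pc.to_bytes(8, "little"))   # fixed-width encoding
    return digest.hexdigest()


if __name__ == "__main__":
    # Made-up addresses for illustration.
    print(call_id([0x400A10, 0x400B20]))
    print(call_id([0x400A10, 0x400C30]))
```

Hashing the whole call path, rather than only the MPI function name, distinguishes the same MPI routine invoked from different places in the program.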
16
Results
System environment
–8 or 9 nodes with AMD Athlon-64 system
–7 p-states are supported: 2000~800 MHz
Benchmarks
–NPB MPI benchmark suite
  C class
  8 applications
–ASCI Purple benchmark suite
  Aztec
Thresholds (τ, λ) set to 10 ms
17
Benchmark analysis
–Used composite for region information
18
Taxonomy
–Profile does not have FN or FP

Region finding    Reduced p-state: Single    Reduced p-state: Multiple
Naive             By-call
Adaptive          Simple
Adaptive          Composite                  Automatic
Static            Profile
19
Overall Energy Delay Product (EDP)
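For reference, EDP is conventionally energy multiplied by execution time, so lower is better and a result is often reported relative to a baseline run. A minimal sketch with made-up numbers (not results from this work) and hypothetical function names:

```python
def energy_delay_product(energy_joules, delay_seconds):
    """EDP = energy * delay; penalizes both energy use and slowdown."""
    return energy_joules * delay_seconds


def relative_edp(energy, delay, base_energy, base_delay):
    """EDP of a run divided by the EDP of a baseline run."""
    return energy_delay_product(energy, delay) / energy_delay_product(base_energy, base_delay)


if __name__ == "__main__":
    # Illustrative only: 20% less energy for a 10% longer run.
    print(relative_edp(800, 11, 1000, 10))
```

A relative EDP below 1.0 means the energy saving outweighed the slowdown under this metric.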
20
Comparison of p-state transition errors
[Figures: breakdown of execution time for Simple and Composite]
21
τ evaluation
[Figure: SP benchmark]
22
τ evaluation (cont'd)
[Figures: MG, CG, BT, and LU benchmarks]
23
Conclusion
Contributions
–Design and implement an adaptive p-state transition system in MPI communication phases
  Identify reducible regions on the fly
  Determine proper p-state dynamically
–Provide transparency to users
Future work
–Evaluate the performance with other applications
–Experiments on the OPT cluster
25
State transition diagram
Simple
[Figure: state machine with states OUT and IN; transition labels: "close enough", not "close enough", begin == 1, end == 1, else]
26
State transition diagram (cont'd)
Composite
[Figure: state machine with states OUT, IN, and REC; transition labels: "close enough", not "close enough", pattern mismatch, end of region, operation begins reducible region, else]
27
Performance
28
Benchmark analysis
–Region information from composite with τ = 10 ms