Optimum System Balance for Systems of Finite Price John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2004 Pittsburgh, PA November 10, 2004
Overview The HPC Challenge Benchmark was announced last year at SuperComputing’2003 The HPC Challenge Benchmark consists of –LINPACK (HPL) –STREAM –PTRANS (transposing the array used by HPL) –RandomAccess (random read/modify/write) –FFT –and some low-level MPI latency & BW measurements No single figure of merit is defined
Overview (continued) Q: What is a “balanced” system? My answer: “A balanced system is one for which the primary applications are limited in performance by the most expensive component(s) of the system.”
The Two Questions We need to understand both performance and cost in the context of low-level component metrics Performance –What performance model? –Use a harmonically weighted, time-based model Cost –What cost model? –Simple linear additive cost model
Performance Model Composite Figures of Merit for Performance must be based on “time” rather than “rate” –i.e., weighted harmonic means of rates Why? –Combining “rates” in any other way fails to have a “Law of Diminishing Returns” Time = Work/Rate Repeat for each component: T i = W i /R i
A Simple Composite Model Assume the time to solution is composed of a compute time proportional to peak GFLOPS plus a memory transfer time proportional to sustained memory bandwidth Assume “x Bytes/FLOP” to get: Target SPECfp_rate2000 as the workload
Does Peak GFLOPS predict SPECfp_rate2000?
Does Sustained Memory Bandwidth predict SPECfp_rate2000?
Optimized Model Results Results rounded to nearby round values: –Bytes/FLOP for large caches === 0.16 –Bytes/FLOP for small caches === 0.80 –Size of asymptotically large cache === ~12 MB –Coefficient of best fit === ~6.4 –The units of the coefficient are SPECfp_rate2000 / Effective GFLOPS
Does this Revised Metric predict SPECfp_rate2000?
Cost Model Assume simple linear additive model –FLOPS cost some amount –Sustained BW costs a different amount –Define: = R mem / R cpu = W mem / W cpu = ($/BW) / ($/FLOPS) System characteristics Application characteristics Technology characteristics
How to Optimize? For a given application (W mem /W cpu ), what is the optimum system balance ? Many people seem to believe that the system should be “balanced” to the application: – optimal = i.e., – R mem /R cpu = W mem /W cpu This does not optimize price/performance
The Correct Optimization This is actually an easy optimization problem Minimize cost/performance –Same as minimizing cost * time Optimum cost/performance occurs at – = sqrt( / ) Definitely not intuitive!
Example: High BW, expensive BW = 3 relatively high BW = 3 relatively expensive BW Optimum price/performance is at =1, not =3 Improvement in price/performance is ~30%
High BW, very expensive memory = 3 relatively high BW = 10 very expensive BW Optimum price/performance is at =0.58, not =3 Improvement in price/performance is ~50%
Low-BW, expensive BW = 0.1 low BW application = 3 moderately expensive BW Optimum price/performance is at =0.18, not =0.1 Improvement in price/performance is ~5% More BW helps here even though it is expensive, because the application does not need much.
Medium BW, expensive BW = 1 modest BW = 3 moderately expensive BW Optimum price/performance is at =0.58, not =1 Improvement in price/performance is ~10%
Summary Balance is important to cost/performance You must understand performance You must understand cost Optimum cost-performance is not intuitive!